Chapter 3 - Jupyter Notebooks and Appyters
Authors: Ido Diamant
Maintainers: Ido Diamant
Version: 0.1
License: CC-BY-NC-SA 4.0
In this chapter, we’ll go over how to use Jupyter notebooks and Appyters, two very useful tools for processing data with Python.
Jupyter Notebooks#
Introduction to Jupyter Notebooks#
What is Jupyter?#
Fig. 3 https://jupyter.org/#
Jupyter notebooks are an interactive computing tool for organizing and executing Python code. They let you combine Markdown text and Python code in a single document, so that explanations sit alongside the executable code they describe.
Notebooks consist of building blocks called cells. A cell contains either text or code, and its type determines how its contents are rendered.
Jupyter notebooks are run by an engine called a kernel, which can be configured to change the notebook’s behavior. Most notebooks run on the default kernel, which is called ipykernel.
Why Use Jupyter?#
Jupyter notebooks allow users to create reproducible and descriptive data processing workflows. Using the available Markdown functionality, notebooks can organize code according to a table of contents or outline, describing the purpose and methodology of each code cell.
Notebooks have a user-friendly interface and robust documentation that supports users of all experience levels. They are widely used in data science. Being familiar with notebooks, and with basic data processing practices in them, can help you understand the workflows currently used in data science.
There are also many tools that use or extend Jupyter notebooks to supplement their functionality. With these tools, projects can be worked on collaboratively in a single file, and multiple notebooks can be linked to break up a large project’s scope.
Setting Up Jupyter#
As mentioned in Chapter 2, we can use Python’s built-in package manager, pip, to install external modules. Running
pip install jupyter
in the terminal will install Jupyter and all its requirements.
We can then launch Jupyter by executing
jupyter notebook
in the terminal. This will automatically open the Jupyter Notebook interface in a browser window.
The default landing page shows the files in the directory where you ran the command. You can use the page to navigate to the location where you want to work, and then create a new notebook or open an existing one.
Anatomy of a Notebook#
As mentioned previously, notebooks are made up of text and code cells. Text cells interpret certain characters as Markdown syntax, which we can use to create titles, headings, lists, and many other components. These cells appear as plain text until they are rendered, at which point the Markdown formatting takes effect.
Code cells are executable and contain Python code. Each cell can be run individually, but all cells share the same memory in the notebook, so an object created in one cell can be accessed and modified in another once the first cell has been run. Each code cell also has a corresponding output section, which shows anything produced by print or display calls. In addition, if the last line of a code cell is an expression rather than a variable assignment, its value is also displayed in the cell’s output.
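The behavior of code cells can be sketched as follows. The comments mark where each cell would begin in a notebook; the variable names `readings` and `total` are made up for this illustration.

```python
# Cell 1: define a variable. An assignment on its own produces no output.
readings = [3, 5, 8]

# Cell 2: this cell can use `readings` because all cells share the kernel's memory.
total = sum(readings)
print("Total:", total)  # printed text appears in this cell's output section

# Cell 3: the last line is a bare expression, not an assignment,
# so a notebook would display its value below the cell.
total / len(readings)
```

If the same lines are run as a plain Python script instead of in a notebook, only the `print` call produces output; displaying the final expression is a notebook feature.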
Processing Data with Jupyter Notebooks#
Handling Data with Python#
Python has some data structures built in, which are a great start to working with data. However, when we want to start processing large datasets with metadata and other information, these native data structures can become restrictive. To solve this, we’ll turn to some open-source libraries that implement new data structures in Python.
NumPy#
The first of these libraries is NumPy. NumPy implements arrays, which are similar to Python’s native lists. However, they differ from lists in a few ways:
Lists can store multiple data types, while arrays only store one. This can help us ensure that inputs to our workflow are limited to expected data types and can help avoid errors.
Mathematical operations can be applied to an entire array at once. To apply the same operation to every item in a list, we need to iterate through it.
NumPy also adds many numerical operations, which are very useful if we want to apply any statistical methods in our data analysis.
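A minimal sketch of these differences, assuming NumPy is installed (for example with pip install numpy). The values and variable names are chosen only for illustration.

```python
import numpy as np

values_list = [1.0, 2.5, 4.0]
values_array = np.array(values_list)

# An array has a single data type shared by all of its elements.
print(values_array.dtype)

# With a list, we iterate to apply an operation to every item.
doubled_list = [value * 2 for value in values_list]

# With an array, the operation is applied to every element at once.
doubled_array = values_array * 2

# NumPy also provides numerical and statistical operations.
mean_value = values_array.mean()
std_value = values_array.std()
print(doubled_array, mean_value, std_value)
```

Note that `values_list * 2` would not double each item; for a list, it repeats the list twice.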
Pandas#
Visualizing Data with Jupyter Notebooks#
The Jaccard Index#
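The Jaccard index is a measure of the similarity between two sets. It is defined as the number of elements that appear in both sets (their intersection) divided by the number of elements that appear in at least one of the sets (their union). It ranges from 0, for sets with no elements in common, to 1, for identical sets.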
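Written as a formula, J(A, B) = |A ∩ B| / |A ∪ B|. The definition can be sketched with Python's built-in sets as shown here. The function name `jaccard_index` and the example sets are made up for this illustration, and returning 0.0 for two empty sets is a convention chosen here, not part of the definition.

```python
def jaccard_index(set_a, set_b):
    """Return the size of the intersection divided by the size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        # Both sets are empty; returning 0.0 is an assumption of this sketch.
        return 0.0
    return len(set_a & set_b) / len(union)


first = {"A", "B", "C", "D"}
second = {"C", "D", "E"}

# Two shared elements (C, D) out of five distinct elements overall.
print(jaccard_index(first, second))  # 0.4
```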
Using the gene set library DataFrames we set up in the previous section, we can calculate the similarity between each pair of gene sets. By comparing the genes that are differentially expressed in specific cell lines with the genes that are differentially expressed in response to small molecules, we can generate a similarity matrix between the attributes of our two datasets.
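A rough sketch of such a similarity matrix is shown here. It does not use the chapter's actual DataFrames: the two dictionaries hold small, made-up gene sets with placeholder names, and the matrix is stored as a nested dictionary to keep the example free of extra dependencies.

```python
def jaccard_index(set_a, set_b):
    """Intersection size divided by union size (0.0 if both sets are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical toy data: placeholder names, not real gene sets.
cell_line_sets = {
    "cell_line_1": {"GENE1", "GENE2", "GENE3"},
    "cell_line_2": {"GENE3", "GENE4"},
}
small_molecule_sets = {
    "molecule_1": {"GENE1", "GENE2"},
    "molecule_2": {"GENE4", "GENE5"},
}

# Rows are cell lines, columns are small molecules.
similarity = {
    row_name: {
        col_name: jaccard_index(row_genes, col_genes)
        for col_name, col_genes in small_molecule_sets.items()
    }
    for row_name, row_genes in cell_line_sets.items()
}

for row_name, scores in similarity.items():
    print(row_name, scores)
```

If pandas is available, a nested dictionary like this can be turned into a DataFrame with `pandas.DataFrame.from_dict(similarity, orient="index")`, which keeps the cell lines as rows.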
Appyters#
Introduction to Appyters#
Appyters are a tool developed by the Ma’ayan Lab to allow users to easily convert existing notebooks into interactive web applications.