Using SQL in JupyterLab

Learn how to connect to a SQL database and graph your data from a Jupyter notebook.

This guide assumes you already have JupyterLab installed and are familiar with the interface. If not, check out the following article on installing everything you need for your data science toolkit with the Anaconda distribution:

Introduction

Relational databases

Relational databases store data in tables with fields (columns) and records (rows), and we can query data & make calculations based on these records. The concepts of relational database querying are core to understanding how to do similar things in Python.

An essential skill of any data scientist or analyst is the ability to interface with these databases, to write data to and read data from them, and finally to analyse that data.

It is, therefore, good practice to do all of this in one environment, for example a Jupyter notebook, so that the entire workflow - your SQL queries and the Python code used in the analysis - becomes more efficient and reproducible.

IPython SQL magic

Project Jupyter's IPython kernel provides magic commands that are designed to solve some of the common problems in standard data analysis - these commands are prefixed by the % character for line magics, which operate on a single line of input, and by a double %% prefix for cell magics, which operate on multiple lines of input.
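As a quick illustration using the built-in %time magics (nothing here is specific to SQL), a line magic applies to a single statement:

%time sum(range(1_000_000))

while a cell magic must be the first line of its cell and applies to everything below it:

%%time
total = 0
for i in range(1_000_000):
    total += i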

The IPython SQL magic extension allows us to write SQL queries in code cells and to read the results directly into pandas DataFrames. pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive, and it is arguably the most powerful and flexible open-source data analysis and manipulation tool available.

So, let's get started!

Installing the SQL module in the notebook

Run the following from the command line to install the SQL module:

conda install -c conda-forge ipython-sql

OK, let's launch JupyterLab.

jupyter-lab

To load the SQL module - the ipython-sql extension - into the notebook, run the following in the first code cell:

%load_ext sql

Database management systems

Now we can connect to any database supported by SQLAlchemy, a Python SQL toolkit. There are a number of different database management systems for working with relational data. We're going to use SQLite today, but the same concepts apply to other database systems too, such as MySQL and PostgreSQL.

SQLite is serverless

Unlike hosted alternatives such as MySQL, SQLite is a self-contained, file-based database - a "zero-configuration" database engine. Most SQL database engines are implemented as a separate server process: clients communicate with the server, sending requests and receiving back results.

SQLite, on the other hand, has no intermediary server process - the database consists of a single file on disk. The main advantage of being serverless is that there is no separate server process to install, set up, configure, or manage.
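To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module (the file name demo.sqlite is just an example) - creating a database really is just creating a file on disk:

import sqlite3

# Opening a connection creates the database file if it doesn't already exist
conn = sqlite3.connect("demo.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS people (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()
conn.close()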

There will be future tutorials on working with alternative, hosted database management systems. For now though, let's get started.

To establish a connection, enter the following command in a code cell:

%sql sqlite://

which should generate the output 'Connected: @None'. With no file path specified, this creates a temporary in-memory database.

Connecting to an existing database

Let's use the magic function to connect to an existing database - it's assumed that you already have your own database set up that you wish to connect to. For this tutorial we have a simple database downloaded to our local machine, which consists of some Gapminder data. This dataset is composed of various national economic and social indicators, like GDP per capita and life expectancy, for the period 1952 to 2007.

To connect to our database we need to specify its path, which in this case is pretty simple because it's sitting in the same directory as our notebook.

%sql sqlite:///gapminder.sqlite
-----------------------------------------------------------
'Connected: @/gapminder.sqlite'

The above statement opened the database named gapminder.sqlite.

Some basic queries

One of the tables in the database is called surveys, and we can display its entire contents using the * wildcard.

%sql SELECT * FROM surveys
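
Selecting a whole table can produce a lot of output; to preview just the first few records, add a LIMIT clause:

%sql SELECT * FROM surveys LIMIT 5;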

Displaying only results for Ireland:

%sql SELECT * FROM surveys WHERE country = 'Ireland';

Note that we have capitalised the SQL keywords - SQL is case-insensitive, but capitalisation helps readability.

We can also do calculations with the values in a query. For example, if we wanted to look at the population in thousands of people, we could divide all populations by 1000:

%sql SELECT country, year, pop/1000.0 FROM surveys;

When we run the query, the expression pop/1000.0 is evaluated for each row and appended to that row in a new column.
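
The computed column can be given a readable name with an AS alias:

%sql SELECT country, year, pop/1000.0 AS pop_thousands FROM surveys;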

For longer queries that span multiple lines, we can switch to the %%sql cell magic:

%%sql
SELECT *
FROM surveys
WHERE (year >= 2000) AND (country IN ('Ireland', 'Germany', 'Spain'));

Above, we've selected all records for the years 2000 and later for 3 European countries of interest.

The magic of pandas

All of the above commands print the results of our queries to the cell output. There are also many more operations we can run, such as aggregation, sorting, joining, and using aliases. However, pandas is MUCH more powerful and intuitive for data manipulation, as well as for visualisation and analysis. And it just so happens that we can assign the result of a SQL query to a variable and convert it into a pandas DataFrame.
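For instance, here is an aggregation combining GROUP BY, an alias, and ORDER BY (a quick sketch; the column names follow the Gapminder table used above):

%%sql
SELECT continent, AVG(lifeExp) AS avg_life_exp
FROM surveys
WHERE year = 2007
GROUP BY continent
ORDER BY avg_life_exp DESC;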

Let's convert the entire table:

result = %sql SELECT * FROM surveys;
df = result.DataFrame()
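
As a quick sanity check - and a taste of what pandas makes easy (column names as in the Gapminder table above):

df.head()                                  # inspect the first five rows
df.groupby('continent')['lifeExp'].mean()  # average life expectancy per continent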

Visualising our data

At Kyso, we work a lot with Plotly. Plotly Express is a new high-level Python visualisation library: it’s a wrapper for Plotly.py that exposes a simple syntax for complex charts. We can import it with:

import plotly_express as px

Let's say we want to graph life expectancy against GDP per capita for all countries, grouping them by continent. We can do this with a few simple lines of code:

df_02 = df[df['year'] == 2002]
fig = px.scatter(df_02, x="gdpPercap", y="lifeExp", size="pop", color="continent",
                 hover_name="country", log_x=True, size_max=60)
fig.show()

Figure: Economic and Social Inequality in 2002

What if we want to learn how these economic indicators have evolved over time? With plotly express we can create a nice animation for our 55 years of data:

px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=45, range_x=[100, 100000], range_y=[25, 90])

Figure: Evolution of Economic and Social Inequality

Conclusion

And there you have it - the ability to interface with SQL from the Jupyter ecosystem and analyse your data all in one place, which takes data analysis to the next level.

Note that you can check out this guide on GitHub here, with an example notebook and more SQL queries and Python code to practice with.

A side note on the SQL extension for Jupyterlab

Extensions are one of the most useful features of JupyterLab and can enhance your entire data science workflow. The reason we used IPython magic in the tutorial above is that the JupyterLab extension still seems to be in its infancy.

It is incompatible with the latest releases of JupyterLab (as of September 2019, that is), and there is not yet a way to write multiple queries at once, to save queries, or to write the results of these queries to a notebook in the form of a pandas DataFrame.

However, jupyterlab-sql is still a powerful extension that adds a SQL user interface to JupyterLab, where we can:

  • Explore SQL tables with point & click.

  • Modify and read these tables with custom queries.

Read more about it at the official GitHub repository.

There is no conda release just yet. To install jupyterlab-sql, run the following commands in the given order:

pip install jupyterlab_sql
jupyter serverextension enable jupyterlab_sql --py --sys-prefix
jupyter lab build

Restart any running Jupyter servers. Note that the extension only works with Python 3.5 and above.

Getting Started

After you launch JupyterLab, a new SQL icon will appear in the launcher:

Click File >> New Launcher

This means everything is installed correctly, and you are ready to connect your instance to a database.

Connection

For connecting to the database, you need a valid database URL; jupyterlab-sql follows the SQLAlchemy guide on URLs. For example:

postgres://host:port/database
postgres://username:password@host:port/database
mysql://host/database
sqlite://
sqlite:///database

Click on the SQL icon in the launcher and type in your database's URL. Press Enter to connect. As soon as the database is connected, we can view all the tables in it, and we can also write custom SQL queries to get the desired data from those tables.