This guide assumes you already have Jupyterlab installed and are familiar with the interface. If not, check out the following article on installing everything you need for your data science toolkit with the Anaconda distribution:
Relational databases store data in tables with fields (columns) and records (rows), and we can query data & make calculations based on these records. The concepts of relational database querying are core to understanding how to do similar things in Python.
An essential skill for any data scientist or analyst is the ability to interface with these databases - to read data from and write data to them - and then to analyse that data.
It is therefore good practice to do all of this in a single environment, for example a Jupyter Notebook, so that the entire workflow - your SQL queries and the Python code used in the analysis - becomes more efficient and reproducible.
Project Jupyter provides magic commands designed to solve common problems in standard data analysis. These commands are prefixed by the % character for line magics, which operate on a single line of input, and by a double %% prefix for cell magics, which operate on multiple lines of input.
The IPython SQL magic extension allows us to write SQL queries in code cells and to read the results directly into pandas DataFrames. Pandas provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data easy and intuitive, and is arguably the most powerful and flexible open-source data analysis and manipulation tool available.
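To get a feel for what a DataFrame is before we start querying, the sketch below builds a tiny one by hand (the life-expectancy values here are illustrative, not the real Gapminder figures) and filters it the way a SQL WHERE clause would:

```python
import pandas as pd

# A DataFrame is a table of labelled columns, much like a SQL table.
df = pd.DataFrame({
    "country": ["Ireland", "Germany", "Spain"],
    "year": [2007, 2007, 2007],
    "lifeExp": [78.9, 79.4, 80.9],
})

# Boolean indexing plays the role of a SQL WHERE clause.
irish = df[df["country"] == "Ireland"]
print(irish)
```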
So, let's get started!
Run the following from the Command Line to install the SQL module, enable the extension and to build Jupyterlab with our newly-installed extension.
conda install -c conda-forge ipython-sql
jupyter serverextension enable jupyterlab_sql --py --sys-prefix
jupyter lab build
Ok, let's launch Jupyterlab.
To load the ipython-sql extension into the notebook, run the following in the first code cell:
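Per the ipython-sql documentation, the extension is loaded with IPython's standard extension-loading magic:

```
%load_ext sql
```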
Now we can connect to any database which is supported by SQLAlchemy, a Python SQL toolkit. There are a number of different database management systems for working with relational data. We're going to use SQLite today, but the same concepts can be applied to other database systems too, such as MySQL and PostgreSQL.
SQLite is serverless
Unlike hosted alternatives such as MySQL, SQLite is a self-contained, file-based database - a "zero-configuration" database engine. Most SQL database engines are implemented as a separate server process: programs communicate with the server, sending requests and receiving results back.
For SQLite, on the other hand, there is no intermediary server process - the database consists of a single file on the disk. The main advantage of being serverless is that there is no separate server process to install, setup, configure, or manage.
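You can see the zero-configuration claim in action with Python's built-in sqlite3 module - there is no server to start, just a connection to a file. This sketch uses an in-memory database so it is self-contained, and a minimal surveys table as a stand-in for the Gapminder data:

```python
import sqlite3

# Connecting to a file path creates the database file on disk if it
# does not already exist; ":memory:" keeps this demo self-contained.
conn = sqlite3.connect(":memory:")

conn.execute("CREATE TABLE surveys (country TEXT, year INTEGER, pop INTEGER)")
conn.execute("INSERT INTO surveys VALUES ('Ireland', 2007, 4109086)")
conn.commit()

rows = conn.execute("SELECT country, pop FROM surveys").fetchall()
print(rows)  # [('Ireland', 4109086)]
conn.close()
```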
There will be future tutorials on working with alternative, hosted database management systems. For now though, let's get started.
To establish a connection, we pass a database URL to the %sql magic in a code cell. It is assumed that you already have your own database set up that you wish to connect to. For this tutorial we have a simple database downloaded to our local machine, consisting of some Gapminder data: various national economic indicators, such as GDP per capita and life expectancy, for the period 1952 to 2007.
To connect to our database we need to specify its path, which in this case is pretty simple because it's sitting in the same directory as our notebook.
%sql sqlite:///gapminder.sqlite
'Connected: @/gapminder.sqlite'
The above statement opened the database named gapminder.sqlite. One of the tables in the database is called surveys, and we can display its entire contents by using a wildcard:
%sql SELECT * FROM surveys
Displaying only results for Ireland:
%sql SELECT * FROM surveys WHERE country = 'Ireland';
Note that we have capitalised the SQL keywords - SQL is case insensitive, but it helps for readability.
We can also do calculations with the values in a query. For example, if we wanted to look at the population in thousands of people, we could divide all populations by 1000:
%sql SELECT country, year, pop/1000.0 FROM surveys;
When we run the query, the expression pop/1000.0 is evaluated for each row and appended to that row in a new column.
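This per-row behaviour is easy to verify with the standard-library sqlite3 module; the two rows below are a hypothetical stand-in for the surveys table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE surveys (country TEXT, year INTEGER, pop INTEGER)")
conn.executemany(
    "INSERT INTO surveys VALUES (?, ?, ?)",
    [("Ireland", 2002, 3879155), ("Ireland", 2007, 4109086)],
)

# pop/1000.0 is computed per row and returned as an extra column;
# the stored table itself is left untouched.
result = conn.execute("SELECT country, year, pop/1000.0 FROM surveys").fetchall()
for row in result:
    print(row)
conn.close()
```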
%%sql
SELECT *
FROM surveys
WHERE (year >= 2000) AND (country IN ('Ireland', 'Germany', 'Spain'));
Above, we've selected all records for the years 2000 and later for 3 European countries of interest.
All of the above commands print the results of our queries to code output. There are also many more operations we can run, such as aggregation, sorting, joining, and using aliases. However, pandas is MUCH more powerful & intuitive for data manipulation, as well as visualisation for analysis. And it just so happens that we can assign the result of a SQL query to a variable and convert it into a pandas DataFrame.
Let's convert the entire table:
result = %sql SELECT * FROM surveys;
df = result.DataFrame()
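If you'd rather skip the magic entirely, pandas can run the query itself via read_sql. Here's a sketch against a throwaway in-memory SQLite database (the table and values are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE surveys (country TEXT, year INTEGER, lifeExp REAL)")
conn.executemany(
    "INSERT INTO surveys VALUES (?, ?, ?)",
    [("Ireland", 2007, 78.9), ("Spain", 2007, 80.9)],
)

# read_sql executes the query and returns the result as a DataFrame.
df_direct = pd.read_sql("SELECT * FROM surveys WHERE year >= 2000", conn)
print(df_direct.shape)  # (2, 3)
conn.close()
```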
At Kyso, we work a lot with Plotly. Plotly Express is a new high-level Python visualisation library: it’s a wrapper for Plotly.py that exposes a simple syntax for complex charts. We can import it with:
import plotly_express as px
Let's say we want to graph life expectancy against GDP per capita for all countries, grouping them by continent. We can do this with a few simple lines of code:
df_02 = df[df['year'] == 2002]
fig = px.scatter(df_02, x="gdpPercap", y="lifeExp", size="pop", color="continent",
                 hover_name="country", log_x=True, size_max=60)
fig.show()
What if we want to learn how these economic indicators have evolved over time? With plotly express we can create a nice animation for our 55 years of data:
px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=45, range_x=[100, 100000], range_y=[25, 90])
And there you have it: the ability to interface with SQL from the Jupyter ecosystem and analyse your data all in one place, taking data analysis to the next level.
Note that you can check out this guide on Github here, with an example notebook and more SQL queries and python code to practice with.
Extensions are one of the most useful features of Jupyterlab and can enhance your entire data science workflow. The reason we used IPython magic in the tutorial above is that the Jupyterlab extension still seems to be in its infancy: as of September 2019 it is incompatible with the latest releases of Jupyterlab, and there is not yet a way to write multiple queries at once, to save queries, or to write the results of these queries to a notebook as a pandas DataFrame.
That said, jupyterlab-sql is still a powerful extension that provides a SQL user interface within Jupyterlab, where we can:
Explore SQL tables with point & click.
Modify and read these tables with custom queries.
Read more about it at the official Github repository.
There is no conda release just yet. To install jupyterlab-sql, run the following commands in the given order:
pip install jupyterlab_sql
jupyter serverextension enable jupyterlab_sql --py --sys-prefix
jupyter lab build
Restart any running Jupyter servers. Note that the extension only works with Python 3.5 and above.
After you launch Jupyterlab, a new SQL icon will appear in the launcher:
This means everything is installed correctly, and you are ready to connect your instance to a database.
To connect to a database from jupyterlab-sql, you need a valid database URL. We follow the SQLAlchemy convention for database URLs, which take the form dialect://username:password@host:port/database - for example, sqlite:///gapminder.sqlite or postgresql://scott:tiger@localhost/mydatabase.
Click on the SQL icon in the launcher, type in your database's URL, and press Enter to connect. As soon as the database is connected, we can view all the tables in it and write custom SQL queries to get the desired data from them.