Python and SQL in 5 Minutes

5 minute read

Recently I played around with a combination of SQL and Python. I had been quite interested in learning SQL before and had been working through Stanford’s excellent collection of self-paced database courses. Since I normally work in Python, I was looking for an way of combining SQL and Python. And I found a dream team! While using SQL in combination with Python can sometimes seem a bit cumbersome, combining SQL with the Pandas package is easier than anything. In fact, I found that one can achieve a reasonable lot with only eight commands.

Here is a summary of the main functionality.

Step 0 - Preparing the Data

Let’s assume that we start with three csv-files that include information about movie ratings: movies.csv, reviewers.csv and ratings.csv (this is in example database from Stanford’s SQL course.

First, we have to import the necessary modules. Here, I will be working with SQLite3.

import pandas as pd
import sqlite3

Next, we use the pandas read_csv command to read in the csv-files:

movies = pd.read_csv('movies.csv')
reviewers = pd.read_csv('reviewers.csv')
ratings = pd.read_csv('ratings.csv')

We can inspect the dataframes by:

movies.head(2)
mID title year director
0 101 Gone with the Wind 1939 Victor Fleming
1 102 Star Wars 1977 George Lucas

Step 1 - Creating the Database

First, we need to create a connection:

conn = sqlite3.connect("rating.db")

We can use the connection to directly save the dataframes as tables of our database.

df.to_sql() has the option “index” which is by default set to True. This saves the index column of the dataframe in the table. Since I am not interested in saving the df index, I set index=False:

reviewers.to_sql("reviewers", conn, if_exists="replace", index=False)
ratings.to_sql("ratings", conn, if_exists="replace", index=False)
movies.to_sql("movies", conn, if_exists="replace", index=False)

Step 3: Querying

In order to query our database, we use the pd.read_sql_query command, which directly returns a dataframe which can then be analyzed with the usual dataframe options:

pd.read_sql_query("""select * from reviewers""", conn).head(2)
rID name
0 201 Sarah Martinez
1 202 Daniel Lewis

In other databases, the same could be achieved with “pd.read_sql_table(“reviewers”, conn)”, but Sqlite’s DBAPI is not supported here.

Note that sqlite3 does not natively support date as datatype. When querying, we can tell it to treat columns as datetype by setting the “parse_dates” option. Note the difference between columns ratingDate and date:

pd.read_sql_query("""select rID, mID, ratingDate, ratingDate as date from ratings where stars=4""", conn, parse_dates=["ratingDate"])
rID mID ratingDate date
0 201 101 2011-01-27 2011-01-27
1 202 106 NaT None
2 203 108 2011-01-12 2011-01-12
3 205 108 NaT None

Step 4: Changing and Updating Tables

Changing and updating of databases happens in three steps:

  1. Define a cursor: cur = conn.cursor()
  2. Execute command: cur.execute(SQL-command)
  3. (Commit the change to the db: conn.commit())

1. Define a cursor

In order to update a database, first we have to define a cursor on the connection:

cur = conn.cursor()

2. Execute a command

a) Adding a column:

For example, we can add a column “NRatings” to the table reviewers (that would be a horrible choice from a design point, but let’s go for it)

cur.execute("alter table reviewers add column NRatings integer;")
<sqlite3.Cursor at 0xae8f4520>

If we check reviewers, we see that this took effect immediately:

pd.read_sql_query("""select * from reviewers""", conn).head(2)
rID name NRatings
0 201 Sarah Martinez None
1 202 Daniel Lewis None

b) Updating entries:

We now populate the created column.

cur.execute("""update reviewers set NRatings = (select count(*) from ratings where ratings.rID = reviewers.rID)""")
<sqlite3.Cursor at 0xae8f4520>

If we check, we see the change:

pd.read_sql_query("""select * from reviewers""", conn).head(2)
rID name NRatings
0 201 Sarah Martinez 2
1 202 Daniel Lewis 1

BUT: The change is not in the database yet. We can see this by creating a second connection to the same database and use the second connection for querying:

conn2 = sqlite3.connect("rating.db")
pd.read_sql_query("""select * from reviewers""", conn2).head(2)
rID name NRatings
0 201 Sarah Martinez None
1 202 Daniel Lewis None

In order for the update to take effect, we still have to commit it. We do this by typing:

conn.commit()

And querying now shows that the update took effect:

pd.read_sql_query("""select * from reviewers""", conn2).head(2)
rID name NRatings
0 201 Sarah Martinez 2
1 202 Daniel Lewis 1

This might look a bit strange at first glance, but has the reason to prevent inconsistent databases if the something goes wrong in the process. Imagine the case that money is transfered from one bank account to another. If something goes wrong after the money is subtracted from the first account but before arriving at the second you end up with inconsistent tables. To prevent this you first do both changes then commit them together.

3. Word of Caution!

Here is a word of caution taken from the python documention on sqlite3 because a) security is important and b) it gives me a good reason to link to the xkcd comic below, which made me laugh so much :D

Usually your SQL operations will need to use values from Python variables. You shouldn’t assemble your query using Python’s string operations because doing so is insecure; it makes your program vulnerable to an SQL injection attack (see this fabulous XKCD for a humorous example of what can go wrong). Instead, use the DB-API’s parameter substitution. Put ? as a placeholder wherever you want to use a value, and then provide a tuple of values as the second argument to the cursor’s execute() method.

# Never do this -- insecure!
# title = 'Star Wars'
# cur.execute("""update movies set director='Darth Vader' where title = '%s'""" % title)

# Do this instead
t = ('Star Wars',)
cur.execute("""update movies set director='Darth Vader' where title = ?""", t)
<sqlite3.Cursor at 0xae8f4520>

Or for multiple values:

t = ('Darth Vader', 'Star Wars',)
cur.execute("""update movies set director=? where title = ?""", t)
<sqlite3.Cursor at 0xae8f4520>

4. Closing the Connection

Finally, we can close the connection to the database by:

conn.close()

# And we have to close the second one, too:
conn2.close()

And that’s it folks! This should give you a quick start into combining Pandas/Python with SQL(ite).

I also found this tutorial quite helpful, where you can also find more follow-up links.

Enjoy!

Leave a Comment