Python and SQL in 5 Minutes

5 minute read

Recently I played around with a combination of SQL and Python. I had been quite interested in learning SQL before and had been working through Stanford’s excellent collection of self-paced database courses. Since I normally work in Python, I was looking for an way of combining SQL and Python. And I found a dream team! While using SQL in combination with Python can sometimes seem a bit cumbersome, combining SQL with the Pandas package is easier than anything. In fact, I found that one can achieve a reasonable lot with only eight commands.

Here is a summary of the main functionality.

Step 0 - Preparing the Data

Let’s assume that we start with three csv-files that include information about movie ratings: movies.csv, reviewers.csv and ratings.csv (this is in example database from Stanford’s SQL course.

First, we have to import the necessary modules. Here, I will be working with SQLite3.

import pandas as pd
import sqlite3

Next, we use the pandas read_csv command to read in the csv-files:

movies = pd.read_csv('movies.csv')
reviewers = pd.read_csv('reviewers.csv')
ratings = pd.read_csv('ratings.csv')

We can inspect the dataframes by:

movies.head(2)

	mID	title	year	director
0	101	Gone with the Wind	1939	Victor Fleming
1	102	Star Wars	1977	George Lucas

Step 1 - Creating the Database

First, we need to create a connection:

conn = sqlite3.connect("rating.db")

We can use the connection to directly save the dataframes as tables of our database.

df.to_sql() has the option “index” which is by default set to True. This saves the index column of the dataframe in the table. Since I am not interested in saving the df index, I set index=False:

reviewers.to_sql("reviewers", conn, if_exists="replace", index=False)
ratings.to_sql("ratings", conn, if_exists="replace", index=False)
movies.to_sql("movies", conn, if_exists="replace", index=False)

Step 3: Querying

In order to query our database, we use the pd.read_sql_query command, which directly returns a dataframe which can then be analyzed with the usual dataframe options:

pd.read_sql_query("""select * from reviewers""", conn).head(2)

	rID	name
0	201	Sarah Martinez
1	202	Daniel Lewis

In other databases, the same could be achieved with “pd.read_sql_table(“reviewers”, conn)”, but Sqlite’s DBAPI is not supported here.

Note that sqlite3 does not natively support date as datatype. When querying, we can tell it to treat columns as datetype by setting the “parse_dates” option. Note the difference between columns ratingDate and date:

pd.read_sql_query("""select rID, mID, ratingDate, ratingDate as date from ratings where stars=4""", conn, parse_dates=["ratingDate"])

	rID	mID	ratingDate	date
0	201	101	2011-01-27	2011-01-27
1	202	106	NaT	None
2	203	108	2011-01-12	2011-01-12
3	205	108	NaT	None

Step 4: Changing and Updating Tables

Changing and updating of databases happens in three steps:

Define a cursor: cur = conn.cursor()
Execute command: cur.execute(SQL-command)
(Commit the change to the db: conn.commit())

1. Define a cursor

In order to update a database, first we have to define a cursor on the connection:

cur = conn.cursor()

2. Execute a command

a) Adding a column:

For example, we can add a column “NRatings” to the table reviewers (that would be a horrible choice from a design point, but let’s go for it)

cur.execute("alter table reviewers add column NRatings integer;")

<sqlite3.Cursor at 0xae8f4520>

If we check reviewers, we see that this took effect immediately:

pd.read_sql_query("""select * from reviewers""", conn).head(2)

	rID	name	NRatings
0	201	Sarah Martinez	None
1	202	Daniel Lewis	None

b) Updating entries:

We now populate the created column.

cur.execute("""update reviewers set NRatings = (select count(*) from ratings where ratings.rID = reviewers.rID)""")

<sqlite3.Cursor at 0xae8f4520>

If we check, we see the change:

pd.read_sql_query("""select * from reviewers""", conn).head(2)

	rID	name	NRatings
0	201	Sarah Martinez	2
1	202	Daniel Lewis	1

BUT: The change is not in the database yet. We can see this by creating a second connection to the same database and use the second connection for querying:

conn2 = sqlite3.connect("rating.db")

pd.read_sql_query("""select * from reviewers""", conn2).head(2)

	rID	name	NRatings
0	201	Sarah Martinez	None
1	202	Daniel Lewis	None

In order for the update to take effect, we still have to commit it. We do this by typing:

conn.commit()

And querying now shows that the update took effect:

pd.read_sql_query("""select * from reviewers""", conn2).head(2)

	rID	name	NRatings
0	201	Sarah Martinez	2
1	202	Daniel Lewis	1

This might look a bit strange at first glance, but has the reason to prevent inconsistent databases if the something goes wrong in the process. Imagine the case that money is transfered from one bank account to another. If something goes wrong after the money is subtracted from the first account but before arriving at the second you end up with inconsistent tables. To prevent this you first do both changes then commit them together.

3. Word of Caution!

Here is a word of caution taken from the python documention on sqlite3 because a) security is important and b) it gives me a good reason to link to the xkcd comic below, which made me laugh so much :D

Usually your SQL operations will need to use values from Python variables. You shouldn’t assemble your query using Python’s string operations because doing so is insecure; it makes your program vulnerable to an SQL injection attack (see this fabulous XKCD for a humorous example of what can go wrong). Instead, use the DB-API’s parameter substitution. Put ? as a placeholder wherever you want to use a value, and then provide a tuple of values as the second argument to the cursor’s execute() method.

# Never do this -- insecure!
# title = 'Star Wars'
# cur.execute("""update movies set director='Darth Vader' where title = '%s'""" % title)

# Do this instead
t = ('Star Wars',)
cur.execute("""update movies set director='Darth Vader' where title = ?""", t)

<sqlite3.Cursor at 0xae8f4520>

Or for multiple values:

t = ('Darth Vader', 'Star Wars',)
cur.execute("""update movies set director=? where title = ?""", t)

<sqlite3.Cursor at 0xae8f4520>

4. Closing the Connection

Finally, we can close the connection to the database by:

conn.close()

# And we have to close the second one, too:
conn2.close()

And that’s it folks! This should give you a quick start into combining Pandas/Python with SQL(ite).

I also found this tutorial quite helpful, where you can also find more follow-up links.

Enjoy!

Share on

Twitter Facebook Google+ LinkedIn

Jessica Walkenhorst

Python and SQL in 5 Minutes

Step 0 - Preparing the Data

Step 1 - Creating the Database

Step 3: Querying

Step 4: Changing and Updating Tables

1. Define a cursor

2. Execute a command

a) Adding a column:

b) Updating entries:

3. Word of Caution!

4. Closing the Connection

Share on

Leave a Comment

You May Also Enjoy

Building and Deploying a Meetup Organizer App for London’s PyData Meetup

Why You Shouldn’t Deploy Streamlit Applications from Your Main GitHub Account and What You Should Do Instead

How to use fastai, Gradio and Hugging Face to Save an Ugly Duckling from a Youth of Misery

We Are Beating Child Mortality - How to Create Interactive and Animated Visualizations in Python