Python and SQL in 5 Minutes
Recently I played around with a combination of SQL and Python. I had been quite interested in learning SQL before and had been working through Stanford’s excellent collection of self-paced database courses. Since I normally work in Python, I was looking for an way of combining SQL and Python. And I found a dream team! While using SQL in combination with Python can sometimes seem a bit cumbersome, combining SQL with the Pandas package is easier than anything. In fact, I found that one can achieve a reasonable lot with only eight commands.
Here is a summary of the main functionality.
Step 0 - Preparing the Data
Let’s assume that we start with three csv-files that include information about movie ratings: movies.csv, reviewers.csv and ratings.csv (this is in example database from Stanford’s SQL course.
First, we have to import the necessary modules. Here, I will be working with SQLite3.
import pandas as pd
import sqlite3
Next, we use the pandas read_csv command to read in the csv-files:
movies = pd.read_csv('movies.csv')
reviewers = pd.read_csv('reviewers.csv')
ratings = pd.read_csv('ratings.csv')
We can inspect the dataframes by:
movies.head(2)
mID | title | year | director | |
---|---|---|---|---|
0 | 101 | Gone with the Wind | 1939 | Victor Fleming |
1 | 102 | Star Wars | 1977 | George Lucas |
Step 1 - Creating the Database
First, we need to create a connection:
conn = sqlite3.connect("rating.db")
We can use the connection to directly save the dataframes as tables of our database.
df.to_sql() has the option “index” which is by default set to True. This saves the index column of the dataframe in the table. Since I am not interested in saving the df index, I set index=False:
reviewers.to_sql("reviewers", conn, if_exists="replace", index=False)
ratings.to_sql("ratings", conn, if_exists="replace", index=False)
movies.to_sql("movies", conn, if_exists="replace", index=False)
Step 3: Querying
In order to query our database, we use the pd.read_sql_query command, which directly returns a dataframe which can then be analyzed with the usual dataframe options:
pd.read_sql_query("""select * from reviewers""", conn).head(2)
rID | name | |
---|---|---|
0 | 201 | Sarah Martinez |
1 | 202 | Daniel Lewis |
In other databases, the same could be achieved with “pd.read_sql_table(“reviewers”, conn)”, but Sqlite’s DBAPI is not supported here.
Note that sqlite3 does not natively support date as datatype. When querying, we can tell it to treat columns as datetype by setting the “parse_dates” option. Note the difference between columns ratingDate and date:
pd.read_sql_query("""select rID, mID, ratingDate, ratingDate as date from ratings where stars=4""", conn, parse_dates=["ratingDate"])
rID | mID | ratingDate | date | |
---|---|---|---|---|
0 | 201 | 101 | 2011-01-27 | 2011-01-27 |
1 | 202 | 106 | NaT | None |
2 | 203 | 108 | 2011-01-12 | 2011-01-12 |
3 | 205 | 108 | NaT | None |
Step 4: Changing and Updating Tables
Changing and updating of databases happens in three steps:
- Define a cursor: cur = conn.cursor()
- Execute command: cur.execute(SQL-command)
- (Commit the change to the db: conn.commit())
1. Define a cursor
In order to update a database, first we have to define a cursor on the connection:
cur = conn.cursor()
2. Execute a command
a) Adding a column:
For example, we can add a column “NRatings” to the table reviewers (that would be a horrible choice from a design point, but let’s go for it)
cur.execute("alter table reviewers add column NRatings integer;")
<sqlite3.Cursor at 0xae8f4520>
If we check reviewers, we see that this took effect immediately:
pd.read_sql_query("""select * from reviewers""", conn).head(2)
rID | name | NRatings | |
---|---|---|---|
0 | 201 | Sarah Martinez | None |
1 | 202 | Daniel Lewis | None |
b) Updating entries:
We now populate the created column.
cur.execute("""update reviewers set NRatings = (select count(*) from ratings where ratings.rID = reviewers.rID)""")
<sqlite3.Cursor at 0xae8f4520>
If we check, we see the change:
pd.read_sql_query("""select * from reviewers""", conn).head(2)
rID | name | NRatings | |
---|---|---|---|
0 | 201 | Sarah Martinez | 2 |
1 | 202 | Daniel Lewis | 1 |
BUT: The change is not in the database yet. We can see this by creating a second connection to the same database and use the second connection for querying:
conn2 = sqlite3.connect("rating.db")
pd.read_sql_query("""select * from reviewers""", conn2).head(2)
rID | name | NRatings | |
---|---|---|---|
0 | 201 | Sarah Martinez | None |
1 | 202 | Daniel Lewis | None |
In order for the update to take effect, we still have to commit it. We do this by typing:
conn.commit()
And querying now shows that the update took effect:
pd.read_sql_query("""select * from reviewers""", conn2).head(2)
rID | name | NRatings | |
---|---|---|---|
0 | 201 | Sarah Martinez | 2 |
1 | 202 | Daniel Lewis | 1 |
This might look a bit strange at first glance, but has the reason to prevent inconsistent databases if the something goes wrong in the process. Imagine the case that money is transfered from one bank account to another. If something goes wrong after the money is subtracted from the first account but before arriving at the second you end up with inconsistent tables. To prevent this you first do both changes then commit them together.
3. Word of Caution!
Here is a word of caution taken from the python documention on sqlite3 because a) security is important and b) it gives me a good reason to link to the xkcd comic below, which made me laugh so much :D
Usually your SQL operations will need to use values from Python variables. You shouldn’t assemble your query using Python’s string operations because doing so is insecure; it makes your program vulnerable to an SQL injection attack (see this fabulous XKCD for a humorous example of what can go wrong). Instead, use the DB-API’s parameter substitution. Put ? as a placeholder wherever you want to use a value, and then provide a tuple of values as the second argument to the cursor’s execute() method.
# Never do this -- insecure!
# title = 'Star Wars'
# cur.execute("""update movies set director='Darth Vader' where title = '%s'""" % title)
# Do this instead
t = ('Star Wars',)
cur.execute("""update movies set director='Darth Vader' where title = ?""", t)
<sqlite3.Cursor at 0xae8f4520>
Or for multiple values:
t = ('Darth Vader', 'Star Wars',)
cur.execute("""update movies set director=? where title = ?""", t)
<sqlite3.Cursor at 0xae8f4520>
4. Closing the Connection
Finally, we can close the connection to the database by:
conn.close()
# And we have to close the second one, too:
conn2.close()
And that’s it folks! This should give you a quick start into combining Pandas/Python with SQL(ite).
I also found this tutorial quite helpful, where you can also find more follow-up links.
Enjoy!
Leave a Comment