TDM 20100: Project 12 — 2023
Motivation: In the previous projects, you’ve gained experience writing all types of queries, touching on the majority of the main concepts. One critical concept that we haven’t yet done is creating your own database. While typically database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and many of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets.
Context: We will (mostly) be using the sqlite3 command line tool to interact with the database.
Scope: sql, sqlite, unix
Dataset(s)
The project will use the following datasets:
-
/anvil/projects/tdm/data/restaurant/orders.csv
-
/anvil/projects/tdm/data/lahman/lahman.db
To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. For example
%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db
For every following cell where you want to run a SQL query, prepend %%sql
to the top of the cell — just like we do for R or bash cells.
To prepare for this project, create a new Jupyter Notebook called firstname-lastname-project12.ipynb
. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the sqlite3
command line tool from the terminal window.
Okay, once completed, the first step is schema creation. First, it is important to note. The goal of this project is to put the data in /anvil/projects/tdm/data/restaurant/orders.csv
into a sqlite database
With that in mind, run the following (in your terminal) to get a sample of the data.
head /anvil/projects/tdm/data/restaurant/orders.csv
Review the output data. An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types — just more "set in stone" than the very easily changed data.frame.
Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the sqlite_schema
table will contain the type, name, tbl_name, root page, and sql for the database object.
For sqlite, the "database object" could refer to a table, index, view, or trigger. |
This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is here.
For our purposes, when I refer to "schema", what I really mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the exact commands to build the database from scratch without the data itself. For example, take a look at our lahman.db
database by running the following in your terminal.
sqlite3 /anvil/projects/tdm/data/lahman/lahman.db
This will open the command line interface (CLI) for sqlite3. It will look similar to:
sqlite>
Type .schema
to see the "schema" for the database.
Any command you run in the sqlite CLI that starts with a dot ( |
After running .schema
, you should see a variety of legitimate SQL commands that will create the structure of your database without the data itself. This is an extremely useful self-documenting tool that is particularly useful.
So, now let’s study the sample of our orders.csv
dataset to create a markdown list of key:value pairs for each column in the dataset. Each key should be the title of the column, and each value should be the type of data that is stored in that column.
Questions
Question 1 (2 pts)
-
Create a markdown list of key:value pairs for each column in the
orders.csv
dataset. Each key should be the title of the column, and each value should be the type of data that is stored in that column.
For example, your solution might be given like this:
-
akeed_order_id: INTEGER
-
customer_id: TEXT
-
etc., etc.
where the value is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" here.
We just showed akeed_order_id and customer_id to give examples about how the first two variables in the data set should be classified.
As a side note: Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are approximate numeric data types and exact numeric data types. What you are most familiar with is the approximate numeric data types. In R or Python for example, try running the following:
(3 - 2.9) <= 0.1
FALSE
(3 - 2.9) <= 0.1
False
Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is critical that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for our example, store text as TEXT, numbers without decimal places as INTEGER, and numbers with decimal places as REAL — our example dataset doesn’t have a need for NUMERIC.
Question 2 (2 pts)
-
Create a database named "orders.db" and a table named "orders" by following the instructions below
Let’s put together our See here for some good examples. Realize that the Once you’ve written your Congratulations! To finish things off, please paste the |
Question 3 (2 pts)
The next step in the project is to add the data! After all, it is a data base. You may get how to insert data into table from here
-
Please populate the data from
orders.csv
into yourorders
table -
Connect to "orders.db" and run a query to get the first 5 rows from "orders" table.
You could programmatically generate a
|
To connect to the database:
|
To select data from the table:
|
Question 4 (2 pts)
Woohoo! You’ve successfully created a database and populated it with data from a dataset — pretty cool! Connect to your database from inside a terminal.
sqlite3 $HOME/orders.db
Now, run the following dot command in order to time our queries: .timer on
. This will print out the time it takes to run each query. For example, try the following:
SELECT * FROM orders LIMIT 5;
Cool! Time the following query.
SELECT * FROM orders ORDER BY created_at LIMIT 10;
Run Time: real 0.021 user 0.000261 sys 0.004553
Running time is often critical, particularly during large-scale database searches. Let’s explore some techniques to enhance performance through the use of indexing in tables. You may get more information about index here: www.sqlitetutorial.net/sqlite-index/
-
Create an index for column "created_at".
Project 12 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and output for the assignment
-
firstname-lastname-project12.ipynb
-
-
Sql file 'orders.db' (this file should be approximately 22 MB)
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |