STAT 19000: Project 3 — Spring 2021
Motivation: A dictionary (referred to as a dict
) is one of the most useful data structures in Python. You can think about them as a data structure containing key: value pairs. Under the hood, a dict
is essentially a data structure called a hash table. Hash tables are a data structure with a useful set of properties. The time needed for searching, inserting, or removing a piece of data has a constant average lookup time, meaning that no matter how big your hash table grows to be, inserting, searching, or deleting a piece of data will usually take about the same amount of time. (The worst case time increases linearly.) Dictionaries (dict
) are used a lot, so it is worthwhile to understand them. Although not used quite as often, another important data type called a set
, is also worthwhile learning about.
Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the get
method, the other is to use square brackets and strings. Test out the following to understand the differences between the two:
my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]}
# If "person" is indeed a key, they will function the same way
my_dict["person"]
my_dict.get("person")
# If the key does not exist, like below, they will not
# function the same way.
my_dict.get("height") # Returns None when key doesn't exist
print(my_dict.get("height")) # By printing, we can see None in this case
my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist
Context: In our third project, we introduce some basic data types, and we demonstrate some familiar control flow concepts, all while digging right into a dataset. Throughout the course, we will slowly introduce concepts from pandas
, and popular plotting packages.
Scope: dicts, sets, if/else statements, opening files, tuples, lists
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/craigslist/vehicles.csv
Questions
Question 1
In project 2 we learned how to read in data using pandas
. Read in the (/class/datamine/data/craigslist/vehicles.csv
) dataset into a DataFrame called myDF
using pandas
. In R we can get a sneak peek at the data by doing something like:
head(myDF) # where myDF is a data.frame
There is a very similar (and aptly named method) in pandas
that allows us to do the exact same thing with a pandas
DataFrame. Get the head
of myDF
, and take a moment to consider how much time it would take to get this information if we didn’t have this nice head
method.
-
Python code used to solve the problem.
-
The
head
of the dataset.
Question 2
Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the get
method, the other is to use square brackets and strings. Test out the following to understand the differences between the two:
my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]}
# If "person" is indeed a key, they will function the same way
my_dict["person"]
my_dict.get("person")
# If the key does not exist, like below, they will not
# function the same way.
my_dict.get("height") # Returns None when key doesn't exist
print(my_dict.get("height")) # By printing, we can see None in this case
my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist
Look at the dataset. Create a dict called my_dict
that contains key:value pairs where the keys are years, and the values are a single int representing the number of vehicles from that year on craigslist. Use the year
column, a loop, and a dict to accomplish this. Print the dictionary. You can use the following code to extract the year
column as a list. In the next project we will learn how to loop over pandas
DataFrames.
If you get a
Here are some of the results you should get:
|
There is a special kind of |
-
Python code used to solve the problem.
-
my_dict
printed.
Question 3
After completing question (2) you can easily access the number of vehicles from a given year. For example, to get the number of vehicles on craigslist from 1912, just run:
my_dict[1912]
# or
my_dict.get(1912)
A dict
stores its data in key:value pairs. Identify a "key" from my_dict
, as well as the associated "value". As you can imagine, having data in this format can be very beneficial. One benefit is the ability to easily create a graphic using matplotlib
. Use matplotlib
to create a bar graph with the year on the x-axis, and the number of vehicles from that year on the y-axis.
If when you end up seeing something like |
To use
|
The |
-
Python code used to solve the problem.
-
The resulting plot.
-
A sentence giving an example of a "key" and associated "value" from
my_dict
(e.g., a sentence explaining the 1912 example above).
Question 4
In the hint in question (2), we used a set
to quickly get a list of unique years in a list. Some other common uses of a set
are when you want to get a list of values that are in one list but not another, or get a list of values that are present in both lists. Examine the following code. You’ll notice that we are looping over many values. Replace the code for each of the three examples below with code that uses no loops whatsoever.
listA = [1, 2, 3, 4, 5, 6, 12, 12]
listB = [2, 1, 7, 7, 7, 2, 8, 9, 10, 11, 12, 13]
# 1. values in list A but not list B
# values in list A but not list B
onlyA = []
for valA in listA:
if valA not in listB and valA not in onlyA:
onlyA.append(valA)
print(onlyA) # [3, 4, 5, 6]
# 2. values in listB but not list A
onlyB = []
for valB in listB:
if valB not in listA and valB not in onlyB:
onlyB.append(valB)
print(onlyB) # [7, 8, 9, 10, 11, 13]
# 3. values in both lists
# values in both lists
in_both_lists = []
for valA in listA:
if valA in listB and valA not in in_both_lists:
in_both_lists.append(valA)
print(in_both_lists) # [1,2,12]
You should use a |
In addition to being easier to read, using a |
A set is a group of values that are unordered, unchangeable, and no duplicate values are allowed. While they aren’t used a lot, they can be useful for a few common tasks like: removing duplicate values efficiently, efficiently finding values in one group of values that are not in another group of values, etc. |
-
Python code used to solve the problem.
-
The output from running the code.
Question 5
The value of a dictionary does not have to be a single value (like we’ve shown so far). It can be anything. Observe that there is latitude and longitude data for each row in our DataFrame (lat
and long
, respectively). Wouldn’t it be useful to be able to quickly "get" pairs of latitude and longitude data for a given state?
First, run the following code to get a list of tuples where the first value is the state
, the second value is the lat
, and the third value is the long
.
states_list = list(myDF.loc[:, ["state", "lat", "long"]].dropna().to_records(index=False))
states_list[0:3] # [('az', 34.4554, -114.269), ('or', 46.1837, -123.824), ('sc', 34.9352, -81.9654)]
# to get the first tuple
states_list[0] # ('az', 34.4554, -114.269)
# to get the first value in the first tuple
states_list[0][0] # az
# to get the second tuple
states_list[1] # ('or', 46.1837, -123.824)
# to get the first value in the second tuple
states_list[1][0] # or
If you have an issue where you cannot append values to a specific key, make sure to first initialize the specific key to an empty list so the append method is available to use. |
Now, organize the latitude and longitude data in a dictionary called geoDict
such that each state from the state
column is a key, and the respective value is a list of tuples, where the first value in each tuple is the latitude (lat
) and the second value is the longitude (long
). For example, the first 2 (lat,long) pairs in Indiana ("in"
) are:
geoDict.get("in")[0:2] # [(39.0295, -86.8675), (38.8585, -86.4806)]
len(geoDict.get("in")) # 5687
Now that you can easily access latitude and longitude pairs for a given state, run the following code to plot the points for Texas (the state
value is "tx"
). Include the the graphic produced below in your solution, but feel free to experiment with other states.
You do NOT need to include this portion of Question 5 in your Markdown
|
-
Python code used to solve the problem.
-
Graphic file (
q5.jpg
) produced for the given state.