STAT 19000: Project 4 — Spring 2021
Motivation: We’ve now been introduced to a variety of core Python data structures. Along the way we’ve touched on a bit of pandas and matplotlib, and have utilized some control flow features like for loops and if statements. We will continue to touch on pandas and matplotlib, but in this project we will take a deeper dive and learn more about control flow, all while digging into the data!
Context: We just finished a project where we were able to see the power of dictionaries and sets. In this project we will take a step back and make sure we are able to really grasp control flow (if/else statements, loops, etc.) in Python.
Scope: python, dicts, lists, if/else statements, for loops
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/craigslist/vehicles.csv
Questions
Question 1
Unlike in R, where traditional loops are rare and iteration is typically accomplished via one of the apply functions, in Python loops are extremely common and important to understand. In Python, any iterable can be looped over. Some common iterables are: tuples, lists, dicts, sets, pandas Series, and pandas DataFrames. In the previous project we had some examples of looping over lists; now let’s learn how to loop over pandas Series and DataFrames!
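For example, here is a minimal sketch (using a tiny made-up Series and DataFrame, not our dataset) of what iterating over pandas objects looks like:
import pandas as pd

s = pd.Series([10, 20, 30])
for value in s:                   # iterating a Series yields its values
    print(value)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
for row in df.itertuples():       # itertuples yields one named tuple per row
    print(row.a, row.b)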
Load up our dataset /class/datamine/data/craigslist/vehicles.csv into a DataFrame called myDF. In Project 3, we organized the latitude and longitude data in a dictionary called geoDict such that each state from the state column is a key, and the respective value is a list of tuples, where the first value in each tuple is the latitude (lat) and the second value is the longitude (long). Repeat this question, but do not use lists; instead, use pandas to accomplish this.
The data frame has 435,849 rows, so it will take a long time to accomplish this with a row-by-row loop.
Note: there is a new feature to reset your RStudio session if you make a big mistake or if your session is very slow.
- Python code used to solve the problem.
- Output from running your code.
Question 2
Wow! The solution to question (1) was slow. In general, you’ll want to avoid looping over large DataFrames. Here is a pretty good explanation of why, as well as a good system for what to try when computing something. In this case, we could have used indexing to get the latitude and longitude values for each state, and would have had no need to build this dict.
The method we learned in Project 3::Question 5 is faster and easier! Just in case you did not solve Project 3::Question 5, here is a fast way to build geoDict:
import pandas as pd

myDF = pd.read_csv("/class/datamine/data/craigslist/vehicles.csv")

# keep the three columns we need, drop rows with missing values,
# and convert to a list of (state, lat, long) records
states_list = list(myDF.loc[:, ["state", "lat", "long"]].dropna().to_records(index=False))

# first pass: create an empty list for every state
geoDict = {}
for mytriple in states_list:
    geoDict[mytriple[0]] = []
# second pass: append each (lat, long) pair to its state's list
for mytriple in states_list:
    geoDict[mytriple[0]].append((mytriple[1], mytriple[2]))
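As an aside, the two loops above can be collapsed into a single pass with dict.setdefault. This is just an equivalent sketch, not a required change:
geoDict = {}
for state, lat, long in states_list:
    # setdefault creates the empty list the first time a state appears
    geoDict.setdefault(state, []).append((lat, long))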
Now we will practice iterating over a dictionary, a list, and a tuple, all at once! Loop through geoDict and use f-strings to print the state abbreviation, the first latitude and longitude pair, and every 5000th latitude and longitude pair for each state. Round values to the hundredths place. For example, if the state were "pu", and it had 12000 latitude and longitude pairs, we would print the following:
pu: Lat: 41.41, Long: 41.41 Lat: 22.21, Long: 21.21 Lat: 11.11, Long: 10.22
In the above example, Lat: 41.41, Long: 41.41 would be the 0th pair, Lat: 22.21, Long: 21.21 would be the 5000th pair, and Lat: 11.11, Long: 10.22 would be the 10000th pair. Make sure to use f-strings to round the latitude and longitude values to two decimal places.
There are several ways to solve this question. You can use whatever method is easiest for you, but please be sure (as always) to add comments to explain your method of solution.
Using an if statement and the modulo operator could be useful.
Whenever we have a loop within another loop, the "inner" loop is called a "nested" loop, as it is "nested" inside of the other.
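As a hedged sketch of the pieces you might combine (enumerate, the modulo operator, and f-string rounding), using made-up coordinates rather than our data:
pairs = [(41.4142, 41.4142), (22.2149, 21.2101)]   # hypothetical (lat, long) tuples
for i, (lat, long) in enumerate(pairs):
    if i % 5000 == 0:   # True for the 0th, 5000th, 10000th, ... pair
        print(f"Lat: {lat:.2f}, Long: {long:.2f}")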
- Python code used to solve the problem.
- Output from running your code.
Question 3
We are curious about how the year of the car (year) affects the price (price). In R, we could easily get the median price by year using tapply:
tapply(myDF$price, myDF$year, median, na.rm=T)
Using pandas, we would do this:
res = myDF.groupby(['year'], dropna=True).median()
These are very convenient functions that do a lot of work for you. If we were to take a look at the median price of cars by year, it would look like:
import matplotlib.pyplot as plt
res = myDF.groupby(['year'], dropna=True).median()["price"]
plt.bar(res.index, res.values)
Using the content of the variable my_list provided in the code below, calculate the median car price per year without using the median function and without using a sort function. Use only dictionaries, for loops, and if statements. Replicate the plot generated by running the code above (you can use the plot to make sure your result looks right).
my_list = list(myDF.loc[:, ["year", "price",]].dropna().to_records(index=False))
If you do not want to write your own median function to find the median, then it is OK to just use a built-in one (for example, statistics.median from the standard library).
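For reference, statistics.median from the standard library behaves like this:
import statistics

statistics.median([7, 1, 5])     # returns 5, the middle value after ordering
statistics.median([1, 2, 3, 4])  # returns 2.5, the average of the two middle values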
- Python code used to solve the problem.
- Output from running your code.
- The barplot.
Question 4
Now calculate the mean price by year (still not using pandas code), and create a barplot with the price on the y-axis and year on the x-axis. Whoa! Something is odd here. Explain what is happening. Modify your code to use an if statement to "weed out" the likely erroneous value. Re-plot your values.
It is also OK to use a built-in function like sum to help compute the mean.
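A minimal sketch of computing a mean this way, using made-up prices that include one implausible outlier (to mimic the kind of issue you should see):
prices = [5000, 7500, 1234567890]   # hypothetical prices; the last is clearly erroneous
print(sum(prices) / len(prices))    # the outlier drags the mean way up

# an if statement can "weed out" implausible values
cleaned = []
for p in prices:
    if p < 1000000:                 # hypothetical cutoff; choose one that fits the data
        cleaned.append(p)
print(sum(cleaned) / len(cleaned))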
- Python code used to solve the problem.
- Output from running your code.
- The barplot.
Question 5
List comprehensions are a neat feature of Python that allow for a more concise syntax for smaller loops. While at first they may seem difficult and more confusing, eventually they grow on you. For example, say you wanted to capitalize every state in a list full of states:
my_states = myDF['state'].to_list()
my_states = [state.upper() for state in my_states]
Or, maybe you wanted to find the average price of cars in "excellent" condition (without pandas):
my_list = list(myDF.loc[:, ["condition", "price",]].dropna().to_records(index=False))
my_list = [price for (condition, price) in my_list if condition == "excellent"]
sum(my_list)/len(my_list)
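As one more hedged illustration (with tiny made-up records, not our dataset), the if clause of a comprehension can also test membership in several values at once:
records = [("in", 5000), ("mi", 7000), ("ca", 9000)]   # hypothetical (state, price) pairs
prices = [price for (state, price) in records if state in ("in", "mi")]
# prices is now [5000, 7000]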
Do the following using list comprehensions, and the provided code:
my_list = list(myDF.loc[:, ["state", "price",]].dropna().to_records(index=False))
- Calculate the average price of vehicles from Indiana (in).
- Calculate the average price of vehicles from Indiana (in), Michigan (mi), and Illinois (il) combined.
my_list = list(myDF.loc[:, ["manufacturer", "year", "price",]].dropna().to_records(index=False))
- Calculate the average price of a "honda" (manufacturer) that is 2010 or newer (year).
- Python code used to solve the problem.
- Output from running your code.
Question 6
Let’s use a package called spacy to try and parse phone numbers out of the description column. First, simply loop through and print the text and the label of each entity. What is the label of the majority of the phone numbers you can see?
import spacy
# get list of descriptions
my_list = list(myDF.loc[:, ["description",]].dropna().to_records(index=False))
my_list = [m[0] for m in my_list]
# load the pre-built spacy model
nlp = spacy.load("en_core_web_lg")
# apply the model to a description
doc = nlp(my_list[0])
# print the text and label of each "entity"
for entity in doc.ents:
    print(entity.text, entity.label_)
Use an if statement to filter out all entities that do not have the label you see. Loop through again and see what our printed data looks like. There is still a lot of data there that we don’t want to capture, right? Phone numbers in the US usually have 7 (5555555), 8 (555-5555), 10 (5555555555), 11 (15555555555), 12 (555-555-5555), or 14 (1-555-555-5555) characters. In addition to your first "filter", add another "filter" that keeps only text whose length is one of those values; see the sketch below.
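A minimal sketch of those two filters combined, where TARGET_LABEL is a hypothetical placeholder for whichever label you identified above:
TARGET_LABEL = "..."                      # hypothetical: replace with the label you observed
valid_lengths = {7, 8, 10, 11, 12, 14}    # common US phone number text lengths
for entity in doc.ents:
    if entity.label_ == TARGET_LABEL and len(entity.text) in valid_lengths:
        print(entity.text, entity.label_)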
That is starting to look better, but there are still some erroneous values. Come up with another "filter", and loop through our data again. Explain what your filter does and make sure that it does a better job on the first 10 documents than when we don’t use your filter.
If you get an error when trying to knit that talks about "unicode" characters, this is caused by trying to print special characters (non-ascii). An easy fix is just to remove all non-ascii text. You can do this with the encode method, as shown below.
Instead of:
for entity in doc.ents:
    print(entity.text, entity.label_)
Do:
for entity in doc.ents:
    print(entity.text.encode('ascii', errors='ignore'), entity.label_)
It can be fun to utilize machine learning and natural language processing, but that doesn’t mean it is always the best solution! We could get rid of all of our filters and use regular expressions with much better results! We will demonstrate this in our solution.
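For the curious, here is a rough regular-expression sketch (one possible pattern among many, not the official solution, and it will still produce some false positives):
import re

# rough pattern covering the 7/8/10/11/12/14-character US formats listed above
phone_pattern = re.compile(r"(?:1-?)?(?:\d{3}-?)?\d{3}-?\d{4}")
for description in my_list[:10]:   # my_list is the list of descriptions from earlier
    print(phone_pattern.findall(description))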
- Python code used to solve the problem.
- Output from running your code.
- 1-2 sentences explaining what your filter does.