STAT 19000: Project 11 — Spring 2021
Motivation: We’ve had a pretty intense series of projects recently, and, although you may not have digested everything fully, you may be surprised at how far you’ve come! What better way to realize this than to take a look at some familiar questions that you’ve solved in the past in R, and solve them in Python instead? You will (a) have the solutions in R to compare and contrast with what you come up with in Python, and (b) be able to fill in any gaps you find along the way.
Context: We’ve just finished a two-project series where we built a beer recommendation system using Python. In this project, we are going to take a (hopefully restful) step back and tackle some familiar data wrangling tasks, but in Python instead of R.
Scope: python, r
Questions
Question 1
The fars dataset contains a series of folders labeled by year. Each year folder contains (at least) the files ACCIDENT.CSV, PERSON.CSV, and VEHICLE.CSV. If you take a peek at any ACCIDENT.CSV file in any year, you’ll notice that the column YEAR only contains the last two digits of the year. Add a new YEAR column that contains the full year. Use the pd.concat function to create a DataFrame called accidents that combines the ACCIDENT.CSV files from the years 1975 through 1981 (inclusive) into one big dataset. After (or before) creating that accidents DataFrame, change the values in the YEAR column from two digits to four digits (i.e., paste a 19 onto each year value).
One way to append strings to every value in a column is to first convert the column to
|
- Python code used to solve the problem.
- head of the accidents dataframe.
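As a rough illustration of the two steps in Question 1 (combining per-year frames with pd.concat, then turning a two-digit YEAR into a four-digit one), here is a minimal sketch. It uses two tiny made-up DataFrames in place of the real ACCIDENT.CSV files, which you would normally load with pd.read_csv; the values and the `frames` list are placeholders for illustration only, and going through `str` is just one possible way to paste on the 19.

```python
import pandas as pd

# Toy stand-ins for two years of ACCIDENT.CSV (made-up values, not real data).
# With the real dataset, each frame would come from pd.read_csv on that year's file.
frames = [
    pd.DataFrame({"YEAR": [75, 75], "PERSONS": [2, 3]}),
    pd.DataFrame({"YEAR": [76], "PERSONS": [1]}),
]

# Stack the per-year frames into one DataFrame with a fresh 0..n-1 index.
accidents = pd.concat(frames, ignore_index=True)

# Convert YEAR to str, paste "19" in front, then convert back to int.
accidents["YEAR"] = ("19" + accidents["YEAR"].astype(str)).astype(int)

print(accidents.head())
```

Because every year in 1975 through 1981 starts with 19, simply adding 1900 to the numeric column would give the same result; the string route is shown because it matches the "paste" wording above.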
Question 2
Using the new accidents data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus?
Look at the variables |
- Python code used to solve the problem.
- Output from running your code.
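The kind of filtering Question 2 asks for can be sketched with a boolean mask. This is only a sketch on made-up data: DRUNK_DR is the drunk-driver count column named elsewhere in this project, while SCH_BUS is an assumed name for a 0/1 school bus flag, so check the column names and codings in your own accidents DataFrame before relying on it.

```python
import pandas as pd

# Made-up rows for illustration only.
accidents = pd.DataFrame({
    "DRUNK_DR": [0, 1, 2, 0, 1],
    "SCH_BUS":  [0, 1, 0, 1, 1],  # ASSUMED column name and coding: 1 = school bus involved
})

# True where at least one drunk driver AND a school bus were involved.
mask = (accidents["DRUNK_DR"] >= 1) & (accidents["SCH_BUS"] == 1)

# Summing a boolean Series counts the True values.
n_drunk_bus = int(mask.sum())
print(n_drunk_bus)
```

The parentheses around each comparison matter, because `&` binds more tightly than `>=` and `==` in Python.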
Question 3
Again using the accidents data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?
Does the |
- Python code used to solve the problem.
- Output from running your code.
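One possible way to get per-year counts for Question 3 is to filter first and then group by YEAR. The sketch below uses made-up rows; SCH_BUS is again an assumed name for a 0/1 school bus flag, and YEAR is assumed to already hold four-digit years.

```python
import pandas as pd

# Made-up rows for illustration only.
accidents = pd.DataFrame({
    "YEAR":     [1975, 1975, 1976, 1976, 1976, 1977],
    "DRUNK_DR": [1, 0, 2, 1, 1, 1],
    "SCH_BUS":  [1, 1, 1, 1, 0, 0],  # ASSUMED column name and coding
})

# Keep only accidents with at least one drunk driver and a school bus.
subset = accidents[(accidents["DRUNK_DR"] >= 1) & (accidents["SCH_BUS"] == 1)]

# Number of matching accidents in each year.
per_year = subset.groupby("YEAR").size()

# Index label (the year) with the largest count.
busiest_year = per_year.idxmax()

print(per_year)
print(busiest_year)
```

Note that a year with no matching accidents does not appear in `per_year` at all (1977 here), so a missing year means a count of zero rather than missing data.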
Question 4
Again using the accidents data frame: Calculate the mean number of motorists involved in an accident (column PERSONS) with i drunk drivers (column DRUNK_DR), where i takes the values from 0 through 6.
- Python code used to solve the problem.
- Output from running your code.
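A grouped mean like the one in Question 4 might be sketched as follows, using made-up values for PERSONS and DRUNK_DR. The row with 7 drunk drivers is included only to show how values outside 0 through 6 can be excluded.

```python
import pandas as pd

# Made-up rows for illustration only.
accidents = pd.DataFrame({
    "DRUNK_DR": [0, 0, 1, 1, 2, 7],
    "PERSONS":  [1, 3, 2, 4, 5, 9],
})

# Keep only accidents with 0 through 6 drunk drivers (between is inclusive on both ends).
in_range = accidents[accidents["DRUNK_DR"].between(0, 6)]

# Mean number of persons for each drunk-driver count.
mean_persons = in_range.groupby("DRUNK_DR")["PERSONS"].mean()
print(mean_persons)
```

This plays roughly the same role as `tapply` in R: one grouping column, one value column, one summary function.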
Question 5
Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.
You’ll want to pay special attention to the |
- Python code used to solve the problem.
- Output from running your code.
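The binning in Question 5 can be sketched by mapping each crash to a label and then aggregating. Everything below runs on made-up data: HOUR (hour of the crash, 0 through 23) and FATALS (fatalities per crash) are assumed column names, `day_portion` is a hypothetical helper, and sending every value outside 0 through 23 to "other" is an assumption about how unknown hours are coded.

```python
import pandas as pd

# Made-up rows for illustration only; HOUR and FATALS are ASSUMED column names.
accidents = pd.DataFrame({
    "HOUR":   [1, 5, 7, 13, 19, 99],  # 99 stands in for an unknown hour
    "FATALS": [1, 2, 1, 3, 1, 2],
})

def day_portion(hour):
    """Hypothetical helper: label an hour with its portion of the day."""
    if 0 <= hour < 6:
        return "midnight to 6AM"
    if 6 <= hour < 12:
        return "6AM to 12 noon"
    if 12 <= hour < 18:
        return "12 noon to 6PM"
    if 18 <= hour < 24:
        return "6PM to midnight"
    return "other"  # anything outside 0..23

accidents["PORTION"] = accidents["HOUR"].map(day_portion)

# Total fatalities and mean fatalities per crash in each portion of the day.
summary = accidents.groupby("PORTION")["FATALS"].agg(["sum", "mean"])
print(summary)
```

pd.cut is another option for the binning step; a plain function is used here so the handling of out-of-range hours stays explicit.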