STAT 19000: Project 6 — Fall 2021
Motivation: tapply
is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. If you have any familiarity with SQL, it tapply
is very similar to working with the GROUP BY
clause — you first group your data using some rule, and then perform some operation for each newly created group.
Context: The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called tapply
from the "apply suite" of functions in R.
Scope: r, tapply
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/amazon/tracks.csv
-
/depot/datamine/data/amazon/tracks.db
Questions
Question 1
Load the tracks.csv
file into an R data.frame called tracks
. Immediately after loading the file, run the following.
str(tracks)
What happens?
The C in CSV is not true for this dataset! You’ll need to take advantage of the |
Once you’ve successfully read in the data, re-run the following.
str(tracks)
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Great! tapply
is a very cool, very powerful function in R.
First, let’s say that we wanted to see what the average duration
(a column in the tracks
data.frame) of songs were by each year
(a column in the tracks
data.frame). If you think about how you would approach solving this problem, there are a lot of components to keep track of!
-
You don’t know ahead of time how many different years are in the dataset.
-
You have to associate each sum of
duration
with a specificyear
. -
Etc.
Its a lot of work!
In R, there is a really great library that allows us to run queries on an sqlite database and put the result directly into a dataframe. This would be the SQL and R solution to this problem.
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db")
myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;")
head(myDF)
Use tapply
to solve the same problem! Are your results the same? Print the first 5 results to make sure they are the same.
|
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Plot the results of question (2) with any appropriate plot that will highlight the duration of music by year, sequentially. What patterns do you see, if any?
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Ha! Thats not so bad! What are the artist_name
of the artists with the highest median duration
of songs? Sort the results of the tapply
function in descending order and print the first 5 results.
This may take a few minutes to run — this function is doing a lot and there are a lot of artists in this dataset! |
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Explore the dataset and come up with a question you want to answer. Make sure tapply
would be useful with your investigation, and use tapply
to calculate something interesting for the dataset. Create one or more graphics as you are working on your question. Write 1-2 sentences reviewing your findings. It could be anything, and your findings do not need to be "good" or "bad", they can be boring (much like a lot of research findings)!
-
Question you want to answer.
-
Code used to solve this problem.
-
Output (including graphic(s)) from running the code.
-
1-2 sentences reviewing your findings.
Question 6 (optional, 0 pts)
Use the following SQL and R code and take a crack at solving a problem (any problem) you want to do with R and SQL. You can use the following code to help. Create a cool graphic with the results!
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db")
myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;")
myDF
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |