STAT 19000: Project 7 — Fall 2021
Motivation: A couple of bread-and-butter functions that are part of base R are subset and merge. The subset function provides a more natural way to filter and select data from a data.frame. The merge function brings the principles that SQL uses for combining data to R.
Context: We’ve been getting comfortable working with data within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data!
Scope: r, subset, merge, tapply
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/goodreads/csv/*.csv
Questions
Question 1
Read goodreads_books.csv into a data.frame called books. Let’s say Dr. Ward is working on a book and new content. He is looking for advice and wants some insight from us.
A friend told him that he should pick a month in the summer to publish his book.
Based on our books dataset, is there any evidence that books published in certain months receive higher-than-average ratings? Which month would you suggest for Dr. Ward to publish his new book?
Use columns |
To read the data in faster and more efficiently, try the following:
|
Relevant topics: tapply, mean
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences comparing the publication months based on average rating.
- 1-2 sentences with your suggestion to Dr. Ward and reasoning.
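As a hedged illustration of the tapply and mean pattern listed under the relevant topics, the sketch below computes a per-group mean on a small made-up data.frame and compares it with the overall mean. The column names month and rating are hypothetical placeholders, not the dataset's actual columns, and every number is invented for the example.

```r
# Toy data standing in for the books data.frame.
# `month` and `rating` are hypothetical column names (assumptions).
toy <- data.frame(
  month  = c(1, 1, 6, 6, 7, 7),
  rating = c(3.0, 4.0, 4.0, 5.0, 2.0, 3.0)
)

# Mean rating within each month: tapply(values, groups, function).
month_means <- tapply(toy$rating, toy$month, mean)

# Overall mean across all rows, used as the baseline for comparison.
overall <- mean(toy$rating)

# Names of the months whose mean rating is strictly above the baseline.
above <- names(month_means)[month_means > overall]

print(month_means)
print(overall)
print(above)
```

The same call shape applies to any numeric column grouped by any categorical column; only the two vectors passed to tapply change.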
Question 2
Create a new column called book_size_cat that is a categorical variable based on the number of pages a book has.
book_size_cat should have 3 levels: small, medium, large.
Run the code below to get different summaries and visualizations of the number of pages that the books in our dataset have.
summary(books$num_pages)
hist(books$num_pages)
hist(books$num_pages[books$num_pages <= 1000])
boxplot(books$num_pages[books$num_pages < 4000])
Pick the values that separate these levels. Write 1-2 sentences explaining why you picked those values.
You can make other visualizations to help you decide. Have fun; there is no right or wrong answer. What would you consider a small, medium, and large book?
Relevant topics: cut
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences explaining the values you picked to create your categorical data and why.
- The results of running table(books$book_size_cat).
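A minimal sketch of how cut turns a numeric vector into a three-level factor is shown here on a made-up vector of page counts. The breakpoints at 150 and 400 pages are arbitrary assumptions chosen only for illustration; they are not a suggested answer.

```r
# Invented page counts; not taken from the dataset.
pages <- c(50, 120, 300, 450, 900)

# Hypothetical breakpoints (assumptions): 150 and 400 pages.
# With include.lowest = TRUE the intervals are [0,150], (150,400], (400,Inf].
size <- cut(
  pages,
  breaks = c(0, 150, 400, Inf),
  labels = c("small", "medium", "large"),
  include.lowest = TRUE
)

# Count how many values fall into each level.
print(table(size))
```

Without include.lowest = TRUE, a value equal to the lowest break (here 0) would fall outside every interval and become NA, so it is worth checking table(size, useNA = "ifany") after cutting real data.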
Question 3
Dr. Ward is a firm believer in constructive feedback, and would like people to provide feedback for his book.
What recommendation would you make to Dr. Ward when it comes to book size?
Use the column |
Association is not causation, and there are many factors that lead to people providing reviews. Your recommendation can be based on anecdotal evidence, no worries.
Relevant topics: tapply, mean
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences with your recommendation and reasoning.
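One detail that may matter when averaging by category is missing values. The sketch below, on invented data with hypothetical column names size and reviews, shows that extra arguments such as na.rm = TRUE can be passed through tapply to the summary function.

```r
# Toy data; `size` and `reviews` are hypothetical names (assumptions).
toy <- data.frame(
  size    = factor(c("small", "small", "medium", "medium", "large", "large"),
                   levels = c("small", "medium", "large")),
  reviews = c(10, NA, 20, 40, 5, 15)
)

# Without na.rm, any group containing an NA gets an NA mean.
avg_raw <- tapply(toy$reviews, toy$size, mean)

# Arguments after the function are forwarded to it, so NAs are dropped here.
avg <- tapply(toy$reviews, toy$size, mean, na.rm = TRUE)

print(avg_raw)
print(avg)
```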
Question 4
Sometimes (oftentimes), looking at a single summary of our data may not provide the full picture.
Make a side-by-side boxplot of text_reviews_count by book_size_cat.
Does your answer to question (3) change based on your plot?
Take a look at the first example when you run |
You can make three boxplots if you prefer, but make sure that they all have the same y-axis limits so that they can be compared.
Relevant topics: boxplot
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences with your recommendation and reasoning.
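The formula interface of boxplot is one way to draw side-by-side boxes, sketched here on invented data with hypothetical column names size and reviews. The toy values include one extreme count so that the group mean and the group median tell different stories.

```r
# Toy data; names and values are assumptions made for illustration.
toy <- data.frame(
  size    = factor(rep(c("small", "medium", "large"), each = 3),
                   levels = c("small", "medium", "large")),
  reviews = c(1, 3, 5, 10, 20, 30, 2, 4, 600)
)

# `y ~ group` draws one box per level of the grouping factor on a shared y-axis.
# boxplot() also returns the plotted statistics invisibly.
bp <- boxplot(reviews ~ size, data = toy,
              xlab = "size", ylab = "reviews",
              main = "Toy review counts by size")

# Row 3 of bp$stats holds the median of each group.
print(bp$names)
print(bp$stats[3, ])

# Group means for comparison; the large group is pulled up by one value.
print(tapply(toy$reviews, toy$size, mean))
```

Because all boxes share one y-axis, a single extreme value compresses the other boxes, which is exactly the kind of detail a lone mean hides.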
Question 5
Repeat question (4), but this time use the subset function to reduce your data to books with a text_reviews_count less than 200. How does this change your plot? Is it a little easier to read?
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences with your recommendation and reasoning.
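A short sketch of filtering with subset before plotting follows, again on invented data. The column names size and reviews are hypothetical, while the threshold of 200 mirrors the one in the question.

```r
# Toy data; names and values are assumptions made for illustration.
toy <- data.frame(
  size    = factor(rep(c("small", "medium", "large"), each = 3),
                   levels = c("small", "medium", "large")),
  reviews = c(1, 3, 500, 10, 20, 30, 2, 4, 600)
)

# subset() evaluates the condition inside the data.frame, so columns are
# referred to by bare name. Rows where the condition is NA are dropped too.
trimmed <- subset(toy, reviews < 200)

print(nrow(toy))
print(nrow(trimmed))

# The same formula call as before, now on the reduced data.
boxplot(reviews ~ size, data = trimmed, xlab = "size", ylab = "reviews")
```

The expression subset(toy, reviews < 200) keeps the same rows as toy[which(toy$reviews < 200), ], but reads closer to the sentence describing the filter.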
Question 6
Read goodreads_book_authors.csv into a new data.frame called authors.
Use the merge function to combine the books data.frame with the authors data.frame. Call your new data.frame books_authors.
Now, use the subset function to get a subset of your data for your favorite authors. Include at least 5 authors that appear in the dataset.
Redo question (4) using this new subset of data. Does your recommendation change at all?
Make sure you pay close attention to the resulting |
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences with your recommendation and reasoning.
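The combination of merge and subset can be sketched on two tiny made-up tables. Everything here is an assumption for illustration: the table names, the key column author_id, the name column, and the author names are hypothetical and may not match the real files.

```r
# Two toy tables sharing a hypothetical key column `author_id`.
books_toy <- data.frame(
  title     = c("A", "B", "C", "D"),
  author_id = c(1, 2, 2, 3)
)
authors_toy <- data.frame(
  author_id = c(1, 2, 4),
  name      = c("Ann", "Bo", "Cy")
)

# By default merge() keeps only rows whose key appears in both tables,
# like an SQL inner join. Book "D" and author "Cy" have no match.
merged <- merge(books_toy, authors_toy, by = "author_id")

# all.x = TRUE keeps every row of the first table, filling gaps with NA,
# like an SQL left join.
merged_left <- merge(books_toy, authors_toy, by = "author_id", all.x = TRUE)

# %in% tests membership, which is handy for a list of favorite authors.
favorites <- subset(merged, name %in% c("Bo"))

print(merged)
print(merged_left)
print(favorites)
```

When both tables contain a non-key column with the same name, merge keeps both and appends the suffixes .x and .y, so checking names(merged) after merging real data is a useful habit.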
Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.