TDM 10100: Project 9 — 2023
Benford’s Law
Motivation: Benford’s law has many applications, including its well-known use in fraud detection. It also helps detect anomalies in naturally occurring datasets.
Scope: R and functions
Dataset(s)
The questions below use the following dataset(s):
- /anvil/projects/tdm/data/restaurant/orders.csv
Both txt and csv files store information in plain text. In a csv file the fields are separated by commas; in a txt file they can be separated by commas, semicolons, or tabs.
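To illustrate the note on separators, here is a minimal sketch that reads the same small table from comma-separated and semicolon-separated text. The inline strings and the column names id and total are made up for illustration and are not taken from orders.csv.

```r
# Hypothetical inline data; real files would be read by path instead of text =
csv_text <- "id,total\n1,12.5\n2,7.25"
txt_text <- "id;total\n1;12.5\n2;7.25"

# read.csv assumes a comma separator and a header row
from_csv <- read.csv(text = csv_text)

# read.table needs the separator and the header flag spelled out
from_txt <- read.table(text = txt_text, sep = ";", header = TRUE)

str(from_csv)
str(from_txt)
```

Both calls should produce a data frame with the same two columns; only the separator argument differs.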
Questions
Benford’s law (also known as the first-digit law) states that in many naturally occurring collections of numbers the leading digit is more likely to be small.
It is a probability distribution that gives the likelihood of each digit from 1 to 9 occurring as the first digit of a number in a dataset.
Put another way, Benford’s law describes the relative frequency distribution of the leading digits in a dataset: smaller leading digits occur more frequently.
A probability distribution defines the probability of each possible outcome of an event. The event can be simple, like a coin toss, or complex, like the outcome of a drug treatment.
Remember that the sum of all the probabilities in a distribution is always 1 (or 100%).
The law is stated in terms of the significand S(x) of a number, that is, the number written in a standard format: drop the sign and move the decimal point so that exactly one non-zero digit sits to its left. For example, 9087 and -.9087 both have S(x) = 9.087.
Benford’s law can also be extended to the second, third, and succeeding digits, and to the probability of certain combinations of digits.
Typically the law does not apply to datasets with a restricted range (a fixed minimum and maximum), or to datasets whose numbers are assigned rather than naturally occurring (e.g., social security numbers or phone numbers). Larger datasets, and data that range over multiple orders of magnitude, work well with Benford’s law.
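The significand example can be checked with a short sketch. The helper name significand is hypothetical (it is not part of base R), and the sketch assumes non-zero numeric input, since the significand of 0 is undefined.

```r
# Hypothetical helper: drop the sign, then shift the decimal point
# so that exactly one non-zero digit sits to its left
significand <- function(x) {
  x <- abs(x)
  x / 10^floor(log10(x))
}

significand(c(9087, -0.9087))  # both values are 9.087, up to floating-point rounding
```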
Benford’s law is given by the equation below.
$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$
$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)
For example, the probability of the first digit being 1 is
$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} \approx 0.301$
The following is a function implementing Benford’s law:
benfords_law <- function(d) log10(1 + 1/d)
To show benfords_law in a line plot:
digits <- 1:9
bf_val <- benfords_law(digits)
# type = "b" draws both the points and the connecting lines
plot(digits, bf_val, type = "b", xlab = "digits", ylab = "probabilities", main = "Benford's Law Line Plot")
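As a quick sanity check, the sketch below confirms two properties stated earlier: the nine probabilities sum to 1, and the natural-log form of the equation agrees with the log10 form used in the function. The variable names probs and probs_ln are chosen here for illustration.

```r
benfords_law <- function(d) log10(1 + 1/d)

probs <- benfords_law(1:9)

# Natural-log form from the equation: ln((d + 1) / d) / ln(10)
probs_ln <- log((1:9 + 1) / (1:9)) / log(10)

round(probs, 3)             # 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
all.equal(probs, probs_ln)  # TRUE
all.equal(sum(probs), 1)    # TRUE
```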
Question 1 (1 pt)
- Create a plot (a bar plot, line plot, scatter plot, or any other type of plot is OK) to show the Benford’s law probabilities for the digits 1 to 9.
Question 2 (1 pt)
- Create a function called first_digit that takes an argument named number and extracts the first non-zero digit from the number.
Question 3 (2 pts)
- Read in the restaurant orders data /anvil/projects/tdm/data/restaurant/orders.csv into a dataset named myDF.
- Create a vector fd_grand_total by using sapply with your function first_digit from Question 2 on the grand_total column in your myDF data frame.
Question 4 (2 pts)
- Calculate the actual distribution of digits in fd_grand_total.
- Plot the actual distribution (again, a bar plot, line plot, dot plot, or anything else is OK). Does it look like it follows Benford’s law? Explain briefly.
Question 5 (2 pts)
- Create a function that returns a new data frame orders_by_dates from myDF. The function should compare the delivery_date column against two arguments, start_date and end_date, and include a record in the new data frame if its delivery_date falls between them.
- Run the function for a certain period, and display some orders with the head function.
as.Date is useful for converting values to dates so that they can be compared.
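A minimal sketch of date comparison with as.Date follows. The date strings are made up, and the yyyy-mm-dd format is an assumption; the delivery_date column in orders.csv may use a different format, in which case the format argument would need to change.

```r
# Hypothetical date strings, assumed to be in yyyy-mm-dd format
dates <- c("2019-06-01", "2019-07-15", "2019-09-30")
d <- as.Date(dates, format = "%Y-%m-%d")

start <- as.Date("2019-06-15")
end <- as.Date("2019-08-31")

# Once converted, dates compare like numbers
in_range <- d >= start & d <= end
in_range  # FALSE  TRUE FALSE
```

A logical vector like in_range can then be used to select rows of a data frame.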
Project 09 Assignment Checklist
- Jupyter Lab notebook with your code, comments and output for the assignment
  - firstname-lastname-project09.ipynb
- R code and comments for the assignment
  - firstname-lastname-project09.R
- Submit files through Gradescope
Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.