STAT 29000: Project 6 — Fall 2021
The anatomy of a bash script
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, bash scripts, UNIX utilities
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/election/*
Questions
Question 1
Originally, this project was a bit more involved than intended. For this reason, I have provided you the solution to the question below the last "note" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on. |
We now have a grip on a variety of useful tools, that are often used together using pipes and redirections. As you start to see, "one-liners" can start to become a bit unwieldy. In these cases, wrapping everything into a bash script can be a good solution.
Imagine for a minute, that you have a single file that is continually appended to by another system. Let’s say this file is /depot/datamine/data/election/itcont1990.txt
. Every so often, your manager asks you to generate a summary of the data in this file. Every time you do this, you have to dig through old notes to remember how you did this previously. Instead of constantly doing this manual process, you decide to write a script to handle this for you!
Write a bash script to generate a summary of the data in /depot/datamine/data/election/itcont1990.txt
. The summary should include the following information, in the following format.
120 RECORDS READ ---------------- File: /depot/datamine/data/election/itcont1990.txt Largest donor: Most common donor state: NY Total donations in USD by state: - NY: 100000 - CA: 50000 ... ----------------
For this question, assume that the data file will always be in the same location. |
#!/bin/bash
FILE=/depot/datamine/data/election/itcont1990.txt
RECORDS_READ=`wc -l $FILE | awk '{print $1}'`
awk -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{
print RECORDS_READ" RECORDS READ\n----------------";
}{
donor_total_by_name[$8] += $15;
most_common_donor_by_state[$10]++;
donor_total_by_state[$10] += $15;
}END{
PROCINFO["sorted_in"] = "@val_num_desc";
print "File: "FILENAME;
ct=0;
for (i in donor_total_by_name) {
if (ct < 1) {
print "Largest donor: " i;
ct++;
}
};
ct=0;
for (i in most_common_donor_by_state) {
if (ct < 1) {
print "Most common donor state: " i;
ct++;
}
}
print "Total donations in USD by state:";
for (i in donor_total_by_state) {
if (i != "STATE" && i != "") {
print "\t- " i ": " donor_total_by_state[i];
}
}
print "----------------";
}' "$FILE"
In order to run this script, you will need to paste the contents into a new file called firstname-lastname-q1.sh
in your $HOME
directory. In a new bash cell, run it as follows.
%%bash
chmod +x $HOME/firstname-lastname-q1.sh
$HOME/firstname-lastname-q1.sh
That chmod
command is necessary to ensure that you can execute the script.
Create the script and run the script in a bash cell.
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Your manager loves your script, but wants you to modify it so it works with any file formatted the same way. A new system is being installed that saves new data into new files rather than appending to the same file.
Modify the script from question (1) to accept an argument that specifies the file to process.
Start by copying the cold script from question (1) into a new file called firstname-lastname-q2.sh
.
%%bash
cp $HOME/firstname-lastname-q1.sh $HOME/firstname-lastname-q2.sh
Then, test the updated script out on /depot/datamine/data/election/itcont2000.txt
.
%%bash
$HOME/firstname-lastname-q2.sh /depot/datamine/data/election/itcont2000.txt
You can edit your scripts directly within Jupyter Lab by right clicking the files and opening in the editor. |
The only difference between the two scripts are the new script you will be able to replace the $FILE argument to the |
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Modify your script once again to accept n arguments, each a path to another file to generate a summary for.
Start by copying the cold script from question (2) into a new file called firstname-lastname-q3.sh
.
%%bash
cp $HOME/firstname-lastname-q2.sh $HOME/firstname-lastname-q3.sh
You should be able to run the script as follows.
%%bash
$HOME/firstname-lastname-q3.sh /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
155 RECORDS READ ---------------- File: /depot/datamine/data/election/itcont2000.txt Largest donor: Most common donor state: NY Total donations in USD by state: - NY: 100000 - CA: 50000 ... ---------------- 120 RECORDS READ ---------------- File: /depot/datamine/data/election/itcont1990.txt Largest donor: Most common donor state: NY Total donations in USD by state: - NY: 100000 - CA: 50000 ... ----------------
Again, the modification that will need to be made here aren’t so bad at all! If you just wrap the entirety of question (2)'s solution in a for loop where you loop through each argument, you’ll just need to make sure you change the $FILE argument to the |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Originally, this project was a bit more involved than intended. For this reason, I have provided you the solution to the question below the last "tip" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on, and run the example we provide. |
You are particularly interested in donors from your alma mater, Purdue University. Modify your script from question (3) yet again. This time, add a flag, that, when present, will include the name and amount for each donor where the word "purdue" (case insensitive) is present in the EMPLOYER
column.
%%bash
$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
155 RECORDS READ ---------------- File: /depot/datamine/data/election/itcont2000.txt Largest donor: ASARO, SALVATORE Most common donor state: NY Purdue donors: - John Smith: 500 - Alice Bob: 1000 Total donations in USD by state: - NY: 100000 - CA: 50000 ... ---------------- 120 RECORDS READ ---------------- File: /depot/datamine/data/election/itcont1990.txt Largest donor: ASARO, SALVATORE Most common donor state: NY Purdue donors: - John Smith: 500 - Alice Bob: 1000 Total donations in USD by state: - NY: 100000 - CA: 50000 ... ----------------
This stackoverflow response has an excellent template using |
You may want to comment out or delete the part of the template that limits your non-flag arguments to one. |
#!/bin/bash
# More safety, by turning some bugs into errors.
# Without `errexit` you don’t need ! and can replace
# PIPESTATUS with a simple $?, but I don’t do that.
set -o errexit -o pipefail -o noclobber -o nounset
# -allow a command to fail with !’s side effect on errexit
# -use return value from ${PIPESTATUS[0]}, because ! hosed $?
! getopt --test > /dev/null
if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
echo 'I’m sorry, `getopt --test` failed in this environment.'
exit 1
fi
OPTIONS=p
LONGOPTS=purdue
# -regarding ! and PIPESTATUS see above
# -temporarily store output to be able to check for errors
# -activate quoting/enhanced mode (e.g. by writing out “--options”)
# -pass arguments only via -- "$@" to separate them correctly
! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
# e.g. return value is 1
# then getopt has complained about wrong arguments to stdout
exit 2
fi
# read getopt’s output this way to handle the quoting right:
eval set -- "$PARSED"
p=n
# now enjoy the options in order and nicely split until we see --
while true; do
case "$1" in
-p|--purdue)
p=y
shift
;;
--)
shift
break
;;
*)
echo "Programming error"
exit 3
;;
esac
done
# handle non-option arguments
# if [[ $# -ne 1 ]]; then
# echo "$0: A single input file is required."
# exit 4
# fi
for file in "$@"
do
RECORDS_READ=`wc -l $file | awk '{print $1}'`
awk -v PFLAG="$p" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{
print RECORDS_READ" RECORDS READ\n----------------";
}{
if ($8 != "") {
donor_total_by_name[$8] += $15;
}
most_common_donor_by_state[$10]++;
donor_total_by_state[$10] += $15;
# see if "purdue" appears in line
if (PFLAG == "y") {
has_purdue = match(tolower($0), /purdue/)
if (has_purdue != 0) {
purdue_total_by_name[$8] += $15;
}
}
}END{
PROCINFO["sorted_in"] = "@val_num_desc";
print "File: "FILENAME;
ct=0;
for (i in donor_total_by_name) {
if (ct < 1) {
print "Largest donor: " i;
ct++;
}
};
ct=0;
for (i in most_common_donor_by_state) {
if (ct < 1) {
print "Most common donor state: " i;
ct++;
}
}
if (PFLAG == "y") {
print "Purdue donors:";
for (i in purdue_total_by_name) {
print "\t- " i ": " purdue_total_by_name[i];
}
}
print "Total donations in USD by state:";
for (i in donor_total_by_state) {
if (i != "STATE" && i != "") {
print "\t- " i ": " donor_total_by_state[i];
}
}
print "----------------\n";
}' $file
done
Please copy and paste this code into a new script called firstname-lastname-q4.sh
and run it.
%%bash
$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Originally, this project was a bit more involved than intended. Instead of writing this script from scratch, I would like you to fill in the parts of the script with the text FIXME, and then test out the script with the commands provided. |
Your manager liked that new feature, however, she thinks the tool would be better suited to search the EMPLOYER
column for a specific string, and then handle this generically, rather than just handling the specific case of Purdue.
Modify your script from question (4). Accept one and only one flag -e
or --employer
. This flag should take a string as an argument, and then search the EMPLOYER
column for that string. Then, the script will print out the results. Only include the top 5 donors from an employer. The following is an example if we chose to search for "ford".
$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
155 RECORDS READ ---------------- File: /depot/datamine/data/election/itcont1990.txt Largest donor: ASARO, SALVATORE Most common donor state: NY ford donors: - John Smith: 500 - Alice Bob: 1000 Total donations in USD by state: - NY: 100000 - CA: 50000 ... ---------------- 120 RECORDS READ ---------------- File: /depot/datamine/data/election/itcont2000.txt Largest donor: ASARO, SALVATORE Most common donor state: NY ford donors: - John Smith: 500 - Alice Bob: 1000 Total donations in USD by state: - NY: 100000 - CA: 50000 ... ----------------
#!/bin/bash
# More safety, by turning some bugs into errors.
# Without `errexit` you don’t need ! and can replace
# PIPESTATUS with a simple $?, but I don’t do that.
set -o errexit -o pipefail -o noclobber -o nounset
# -allow a command to fail with !’s side effect on errexit
# -use return value from ${PIPESTATUS[0]}, because ! hosed $?
! getopt --test > /dev/null
if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
echo 'I’m sorry, `getopt --test` failed in this environment.'
exit 1
fi
OPTIONS=e:
LONGOPTS=employer:
# -regarding ! and PIPESTATUS see above
# -temporarily store output to be able to check for errors
# -activate quoting/enhanced mode (e.g. by writing out “--options”)
# -pass arguments only via -- "$@" to separate them correctly
! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
# e.g. return value is 1
# then getopt has complained about wrong arguments to stdout
exit 2
fi
# read getopt’s output this way to handle the quoting right:
eval set -- "$PARSED"
e=-
# now enjoy the options in order and nicely split until we see --
while true; do
case "$1" in
-e|--employer)
e="$2"
shift 2
;;
--)
shift
break
;;
*)
echo "Programming error"
exit 3
;;
esac
done
# handle non-option arguments
# if [[ $# -ne 1 ]]; then
# echo "$0: A single input file is required."
# exit 4
# fi
for file in "$@"
do
RECORDS_READ=`wc -l $file | awk '{print $1}'`
awk -v EFLAG="$FIXME" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{ (1)
print RECORDS_READ" RECORDS READ\n----------------";
}
{
if ($8 != "") {
donor_total_by_name[$8] += $15;
}
most_common_donor_by_state[$10]++;
donor_total_by_state[$10] += $15;
# see if search string appears in line
if (EFLAG != "") {
has_string = match(tolower($12), EFLAG)
if (has_string != 0) {
employer_total_by_name[$8] += $15;
}
}
}END{
PROCINFO["sorted_in"] = "@val_num_desc";
print "File: "FILENAME;
ct=0;
for (i in donor_total_by_name) {
if (ct < 1) {
print "Largest donor: " i;
ct++;
}
};
ct=0;
for (i in most_common_donor_by_state) {
if (ct < 1) {
print "Most common donor state: " i;
ct++;
}
}
ct=0;
if (EFLAG != "") {
print EFLAG" donors:";
for (i in FIXME) { (2)
if (ct < 5) {
print "\t- " i ": " FIXME[i]; (3)
FIXME; (4)
}
}
}
print "Total donations in USD by state:";
for (i in donor_total_by_state) {
if (i != "STATE" && i != "") {
print "\t- " i ": " donor_total_by_state[i];
}
}
print "----------------\n";
}' $file
done
1 | We should put "$something" here — check out how we handle this is question (4) and look at the changes it question (5) to help isolate what goes here. |
2 | What are we looping through here? All you need to do is change it to the only remaining awk array we haven’t looped through in the rest of the code. |
3 | Now we want to access the value of the array — it would make sense if it were the same array as the previous FIXME, right?! |
4 | Without this code, we will print ALL of the donors — not just the first 5. |
Then test it out!
%%bash
$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |