class: title-slide, middle, center # BUS 320 # Topic 4 # Reading and writing data ## Elizabeth Stanny --- layout: true <div class="my-footer"><span>http://bus320.estanny.com</span></div> --- # Learning objectives for the course - Ask the right questions -- - Extract, transform and load relevant data (ETL process) - Extract (`tidyverse/dplyr/tidyr`) - Transform (`tidyverse/dplyr/tidyr`) - Load Excel Workbooks (`tidyverse/readxl`) -- - Apply appropriate data analytic techniques - Descriptive statistics - `skimr` - `tidyverse/dplyr` - `DataExplorer` -- - Interpret and share the results - `rmarkdown` files end with .Rmd --- # Importing and exporting data into R https://readr.tidyverse.org/reference/index.html ### File extension indicates file type .pull-left[ Rectangular well formatted data - Comma-separated data `,` (.csv) - Pipe-separated data `|` (.psv) - Tab-separated data `\t` (.tsv) - Excel (.xls and .xlsx) ] -- .pull-right[ ### Other * Portable document format (.pdf) * Word (.doc, .docx) * Webpages (.html) * Text (.txt) (not separated with comma, pipe, etc) ] --- # Ways to get data into `R` -- Use [`readr`](https://readr.tidyverse.org/reference/read_delim.html) * Data already on your computer -- - need to know where it is ("directory/file.csv") -- * Data on the web with download link --- # Data frames (rectangles) use package -- .pull-left[ - Make sure you are in the project for this class - Open up a new Rmarkdown file ] -- .pull-right[ .large[.content-box-yellow[Load and install]] ```r library(pacman) p_load(tidyverse, janitor, readxl) ``` ] --- ## Preview data we will exporting and importing ```r corp_tax <- read_excel("corp_tax.xlsx") ``` -- - Check your environment pane -- - Export data into 3 formats -- ```r corp_tax %>% write_csv("corp_tax.csv") # comma-separated values corp_tax %>% write_tsv("corp_tax.tsv") # tab separated corp_tax %>% write_delim("corp_tax.psv", delim = "|") # pipe-separated ``` -- - Go to your files pane * Do you see 3 new files? -- * Check your environment? Has anything changed? --- ### Import the files into R -- ```r corp_tax_csv <- read_csv("corp_tax.csv") # comma-separated values corp_tax_tsv <- read_tsv("corp_tax.tsv") #tab separated corp_tax_psv <- read_delim("corp_tax.psv", delim = "|") # pipe separated ``` -- Do you see these data in your **Environment**? --- ### Check that all the data match -- ```r all.equal(corp_tax, corp_tax_csv, check.attributes = FALSE) all.equal(corp_tax_csv, corp_tax_tsv, check.attributes = FALSE) all.equal(corp_tax_csv, corp_tax_psv, check.attributes = FALSE) ``` -- Do they? --- ## Import by file url - Data for class from Good Jobs First Violation Tracker database >Violation Tracker is the first wide-ranging database on corporate misconduct. It covers banking, consumer protection, false claims, environmental, wage & hour, health, safety, employment discrimination, price-fixing, bribery and other cases resolved by more than 40 federal regulatory agencies and all parts of the Justice Department since 2000 -- plus state AG and local DA cases and selected class action lawsuits. In all: 397,000 civil and criminal cases with total penalties of $604 billion. - [Discover Which Corporations are the Biggest Regulatory Violators and Lawbreakers Throughout the United States](https://www.goodjobsfirst.org/violation-tracker) <!-- See article: [California Leads the Country in Local Prosecution of Business Misconduct](https://www.goodjobsfirst.org/blog/california-leads-country-local-prosecution-business-misconduct) https://violationtracker.goodjobsfirst.org/prog.php?agency_sum_st=CA-MULTI --> --- #### [Violation Tracker 100 Most Penalized Parent Companies](https://violationtracker.goodjobsfirst.org/parent-totals) <video controls="controls" width="800" height="600" name="Video Name" src="video/url_save.mov"></video> --- ### Copy the link -- ### Import the data and assign it to the variable `viol_in` -- - use interactive feature in RStudio ```r viol_in <- read_csv("https://violationtracker.goodjobsfirst.org/prog.php?&detail=csv_top_parent", col_types = cols(`Penalty Total Dollars` = col_number())) ``` -- - Do you see `viol_in` in the environment pane? --- # viol_in ```r viol_in ``` ``` ## # A tibble: 100 x 4 ## Rank `Parent Name` `Penalty Total Dollars` `Number of Records` ## <dbl> <chr> <dbl> <dbl> ## 1 1 Bank of America 82695676034 210 ## 2 2 JPMorgan Chase 34517828993 151 ## 3 3 BP 29155941533 236 ## 4 4 Citigroup 25045655764 120 ## 5 5 Volkswagen 23764268628 51 ## 6 6 Wells Fargo 21139073231 172 ## 7 7 Deutsche Bank 17995700778 55 ## 8 8 UBS 16782800910 82 ## 9 9 Royal Bank of Scotland 13455904000 27 ## 10 10 Goldman Sachs 13140380987 40 ## # … with 90 more rows ``` --- ### Do we need all the variables (columns)? -- ```r viol_in %>% select(-Rank) ``` ``` ## # A tibble: 100 x 3 ## `Parent Name` `Penalty Total Dollars` `Number of Records` ## <chr> <dbl> <dbl> ## 1 Bank of America 82695676034 210 ## 2 JPMorgan Chase 34517828993 151 ## 3 BP 29155941533 236 ## 4 Citigroup 25045655764 120 ## 5 Volkswagen 23764268628 51 ## 6 Wells Fargo 21139073231 172 ## 7 Deutsche Bank 17995700778 55 ## 8 UBS 16782800910 82 ## 9 Royal Bank of Scotland 13455904000 27 ## 10 Goldman Sachs 13140380987 40 ## # … with 90 more rows ``` --- ### Are the variable names easy to work with? -- ```r viol_clean <- viol_in %>% select(-Rank) %>% * clean_names() ``` --- ### Which company Had the highest total penalties? ```r viol_clean %>% slice_max(penalty_total_dollars) ``` Had the highest number of penalties? ```r viol_clean %>% slice_max(number_of_records) ``` --- ### What is the average total penalty? median total penalty? -- ```r viol_clean %>% summarize(mean_tot_penalty = mean(penalty_total_dollars), median_tot_penalty = median(penalty_total_dollars)) ``` --- ### Was Facebook one of the top 100? -- ```r viol_clean %>% filter(parent_name == "Facebook") ``` --- ### Write `viol_clean` to Comma-separated data file (.csv) ```r viol_clean %>% write_csv("viol_clean.csv") ``` -- Pipe-separated data file (.psv) ```r viol_clean %>% write_delim("viol_clean.psv", delim = "|") ``` -- Tab-separated data file (.tsv) ```r viol_clean %>% write_tsv("viol_clean.tsv") ``` -- ### Read these formats into Excel --- # Writing to Excel (.xlsx) - `rio` - `openxlsx` - `writexl` --- # Clean up <!-- Corporate subsidy tracker top 100 parent corporations ```r "https://subsidytracker.goodjobsfirst.org/prog.php?&detail=toplist_parent_csv" rio::import("https://www.trackyourcompany.org/sp-500-list.html") ``` -->