Reading and writing data

class: title-slide, middle, center

# BUS 320

# Topic 4

# Reading and writing data

## Elizabeth Stanny

---

layout: true

<div class="my-footer"><span>http://bus320.estanny.com</span></div>

---

# Learning objectives for the course

- Ask the right questions

- Extract, transform and load relevant data (ETL process)

- Extract (`tidyverse/dplyr/tidyr`)
  
  - Transform (`tidyverse/dplyr/tidyr`)
  
  - Load Excel Workbooks (`tidyverse/readxl`)

- Apply appropriate data analytic techniques
  
  - Descriptive statistics
      - `skimr`
      - `tidyverse/dplyr`
      - `DataExplorer`

- Interpret and share the results
   - `rmarkdown`  files end with .Rmd

---

# Importing and exporting data into R

https://readr.tidyverse.org/reference/index.html

### File extension indicates file type

.pull-left[

Rectangular well formatted data

- Comma-separated data `,` (.csv)
  
- Pipe-separated data `|` (.psv)

- Tab-separated data `\t` (.tsv)

- Excel (.xls and .xlsx) 
  
  
]

.pull-right[
### Other

* Portable document format (.pdf)

* Word (.doc, .docx)

* Webpages (.html)

* Text (.txt) (not separated with comma, pipe, etc)

]

---

# Ways to get data into `R`

Use [`readr`](https://readr.tidyverse.org/reference/read_delim.html)

&nbsp;

* Data already on your computer

- need to know where it is ("directory/file.csv")

&nbsp;
&nbsp;

* Data on the web with download link

---

# Data frames (rectangles) use package

.pull-left[

- Make sure you are in the project for this class

- Open up a new Rmarkdown file

]

.pull-right[

.large[.content-box-yellow[Load and install]]

```r
library(pacman)

p_load(tidyverse, janitor, readxl)
```
]

---

## Preview data we will exporting and importing

```r
corp_tax   <- read_excel("corp_tax.xlsx")
```

- Check your environment pane

- Export data into 3 formats

```r
corp_tax  %>% write_csv("corp_tax.csv") # comma-separated values

corp_tax  %>% write_tsv("corp_tax.tsv")  # tab separated

corp_tax  %>% write_delim("corp_tax.psv", delim = "|") # pipe-separated 
```
&nbsp;

- Go to your files pane

* Do you see 3 new files?

* Check your environment? Has anything changed?

---

### Import the files into R

```r
corp_tax_csv  <- read_csv("corp_tax.csv") # comma-separated values

corp_tax_tsv  <- read_tsv("corp_tax.tsv") #tab separated

corp_tax_psv  <- read_delim("corp_tax.psv", delim = "|") # pipe separated
```
--

Do you see these data in your **Environment**?

---

### Check that all the data match

```r
all.equal(corp_tax, corp_tax_csv, check.attributes = FALSE)

all.equal(corp_tax_csv, corp_tax_tsv, check.attributes = FALSE)

all.equal(corp_tax_csv, corp_tax_psv, check.attributes = FALSE)
```

Do they?

---

## Import by file url

- Data for class from Good Jobs First Violation Tracker database

>Violation Tracker is the first wide-ranging database on corporate misconduct. It covers banking, consumer protection, false claims, environmental, wage & hour, health, safety, employment discrimination, price-fixing, bribery and other cases resolved by more than 40 federal regulatory agencies and all parts of the Justice Department since 2000 -- plus state AG and local DA cases and selected class action lawsuits. In all: 397,000 civil and criminal cases with total penalties of $604 billion.

- [Discover Which Corporations are the Biggest Regulatory Violators and Lawbreakers Throughout the United States](https://www.goodjobsfirst.org/violation-tracker)

<!-- See article: [California Leads the Country in Local Prosecution of Business Misconduct](https://www.goodjobsfirst.org/blog/california-leads-country-local-prosecution-business-misconduct)

https://violationtracker.goodjobsfirst.org/prog.php?agency_sum_st=CA-MULTI
 -->

---

#### [Violation Tracker 100 Most Penalized Parent Companies](https://violationtracker.goodjobsfirst.org/parent-totals)

---

### Copy the link

### Import the data and assign it to the variable `viol_in`

--
- use interactive feature in RStudio

```r
viol_in  <-  read_csv("https://violationtracker.goodjobsfirst.org/prog.php?&detail=csv_top_parent",
col_types = cols(`Penalty Total Dollars` = col_number()))
```

&nbsp;

- Do you see `viol_in` in the environment pane?

&nbsp;

---

# viol_in

```r
viol_in 
```

```
## # A tibble: 100 x 4
##     Rank `Parent Name`          `Penalty Total Dollars` `Number of Records`
##    <dbl> <chr>                                    <dbl>               <dbl>
##  1     1 Bank of America                    82695676034                 210
##  2     2 JPMorgan Chase                     34517828993                 151
##  3     3 BP                                 29155941533                 236
##  4     4 Citigroup                          25045655764                 120
##  5     5 Volkswagen                         23764268628                  51
##  6     6 Wells Fargo                        21139073231                 172
##  7     7 Deutsche Bank                      17995700778                  55
##  8     8 UBS                                16782800910                  82
##  9     9 Royal Bank of Scotland             13455904000                  27
## 10    10 Goldman Sachs                      13140380987                  40
## # … with 90 more rows
```

---

### Do we need all the variables (columns)?

```r
viol_in  %>% 
  select(-Rank)
```

```
## # A tibble: 100 x 3
##    `Parent Name`          `Penalty Total Dollars` `Number of Records`
##    <chr>                                    <dbl>               <dbl>
##  1 Bank of America                    82695676034                 210
##  2 JPMorgan Chase                     34517828993                 151
##  3 BP                                 29155941533                 236
##  4 Citigroup                          25045655764                 120
##  5 Volkswagen                         23764268628                  51
##  6 Wells Fargo                        21139073231                 172
##  7 Deutsche Bank                      17995700778                  55
##  8 UBS                                16782800910                  82
##  9 Royal Bank of Scotland             13455904000                  27
## 10 Goldman Sachs                      13140380987                  40
## # … with 90 more rows
```

---

### Are the variable names easy to work with?

```r
viol_clean  <- viol_in  %>% 
  select(-Rank)  %>% 
* clean_names()
```

---

### Which company

Had the highest total penalties?

```r
viol_clean  %>% 
  slice_max(penalty_total_dollars)
```

Had the highest number of penalties?

```r
viol_clean  %>% 
  slice_max(number_of_records)
```

---

### What is the average total penalty? median total penalty?

```r
viol_clean  %>% 
  summarize(mean_tot_penalty = mean(penalty_total_dollars),
      median_tot_penalty = median(penalty_total_dollars))
```

---

### Was Facebook one of the top 100?

```r
viol_clean  %>% 
  filter(parent_name == "Facebook")
```

---

### Write `viol_clean` to

Comma-separated data file (.csv)

```r
viol_clean  %>% write_csv("viol_clean.csv")
```

Pipe-separated data file (.psv)

```r
viol_clean  %>% write_delim("viol_clean.psv", delim = "|")
```
--

Tab-separated data file (.tsv)

```r
viol_clean  %>% write_tsv("viol_clean.tsv")
```
--

### Read these formats into Excel

---

# Writing to Excel (.xlsx)

- `rio`
- `openxlsx`
- `writexl`

---

# Clean up

<!-- 
Corporate subsidy tracker top 100 parent corporations

```r
"https://subsidytracker.goodjobsfirst.org/prog.php?&detail=toplist_parent_csv"

rio::import("https://www.trackyourcompany.org/sp-500-list.html")
```
 -->