class: title-slide, middle, center # BUS 320 # Topic 3 # Data manipulation ## Elizabeth Stanny --- layout: true <div class="my-footer"><span>http://bus320.estanny.com</span></div> --- # Learning objectives for the course - Ask the right questions -- - Extract, transform and load relevant data (ETL process) - Extract (`tidyverse/dplyr/tidyr`) - Transform (`tidyverse//dplyr/tidyr`) - Load Excel Workbooks (`tidyverse/readxl`) -- - Apply appropriate data analytic techniques - Descriptive statistics - `skimr` - `tidyverse/dplyr` - `DataExplorer` -- - Interpret and share the results - `rmarkdown` files end with .Rmd --- # What is the [tidyverse](https://www.tidyverse.org/packages/) -- #### Data transformation with `dplyr` 1. expects **tidy** data - each variable in its own column - each observation in its own row 1. works with pipes **%>%** - x %>% f becomes f(x, y) - **%>%** take something as input for next command - "then" --- <!-- # Glimpse to find data `type` - `chr` refers to “character”, which is computer terminology for text data. In most forms, text data, such as the carrier or origin of a flight, are categorical variables -`int` and `dbl` refer to "integer" and "double", which are computer coding terminology for quantitative/numerical - Which variables in `corp_tax` are numerical (int or dbl) - Which variables in `corp_tax` are categorical (chr) --> # Data transformation with `dplyr` 1. first argument is a data frame 1. subsequent arguments describe what to do with the data frame 1. result is a new data frame --- # Key functions for action in `dplyr`. `function()` | Action --------------|-------------------------------------------------------- `mutate()` | creates new **variables (columns)** `select()` | picks **variables (columns)** based on their names `filter()` | picks **rows** based on their values `summarize()` | reduces multiple values down to a single summary `arrange()` | changes the order of the **rows** based on their values `group_by()` | create subsets of data to apply functions to --- .pull-left[ ### Comparison operators (filter) - `==` equality - `>` greater than - `<` less than" - `>=` corresponds to "greater than or equal to" - `<=` corresponds to "less than or equal to" - `!=` not equal to - `between` numeric variable in a specified range - `near` compare 2 numeric vectors. Set tolerance ] -- .pull-right[ ### Combine criteria using operators that make comparisons: - `|` or - `&` and `,` ] --- # Load the packages and data ```r library(pacman) p_load(tidyverse, readxl) corp_tax <- read_excel("corp_tax.xlsx") ``` --- # corp_tax data frame is tidy .pull-left[ - Each row is a company (observation) ] .pull-right[ - Variables (columns) contain information on a company - company (name) - profit - tax - tax_rate - industry ] --- # slice sample of 10 values from corp_tax ```r corp_tax %>% slice_sample(n = 10) %>% kable(format = "html") ``` <table> <thead> <tr> <th style="text-align:left;"> company </th> <th style="text-align:right;"> profit </th> <th style="text-align:right;"> tax </th> <th style="text-align:right;"> tax_rate </th> <th style="text-align:left;"> industry </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Eli Lilly </td> <td style="text-align:right;"> 597.612 </td> <td style="text-align:right;"> -54.3000 </td> <td style="text-align:right;"> -0.0908616 </td> <td style="text-align:left;"> Pharmaceuticals & medical products </td> </tr> <tr> <td style="text-align:left;"> Delta Air Lines </td> <td style="text-align:right;"> 5073.000 </td> <td style="text-align:right;"> -187.0000 </td> <td style="text-align:right;"> -0.0368618 </td> <td style="text-align:left;"> Transportation </td> </tr> <tr> <td style="text-align:left;"> Anthem </td> <td style="text-align:right;"> 4990.000 </td> <td style="text-align:right;"> 1128.0000 </td> <td style="text-align:right;"> 0.2260521 </td> <td style="text-align:left;"> Health care </td> </tr> <tr> <td style="text-align:left;"> Textron </td> <td style="text-align:right;"> 548.000 </td> <td style="text-align:right;"> 3.0000 </td> <td style="text-align:right;"> 0.0054745 </td> <td style="text-align:left;"> Aerospace & defense </td> </tr> <tr> <td style="text-align:left;"> Performance Food Group </td> <td style="text-align:right;"> 207.300 </td> <td style="text-align:right;"> 28.9000 </td> <td style="text-align:right;"> 0.1394115 </td> <td style="text-align:left;"> Retail & wholesale trade </td> </tr> <tr> <td style="text-align:left;"> Morgan Stanley </td> <td style="text-align:right;"> 7597.000 </td> <td style="text-align:right;"> 686.0000 </td> <td style="text-align:right;"> 0.0902988 </td> <td style="text-align:left;"> Financial </td> </tr> <tr> <td style="text-align:left;"> Clorox </td> <td style="text-align:right;"> 888.000 </td> <td style="text-align:right;"> 159.0000 </td> <td style="text-align:right;"> 0.1790541 </td> <td style="text-align:left;"> Household & personal products </td> </tr> <tr> <td style="text-align:left;"> Delek US Holdings </td> <td style="text-align:right;"> 478.538 </td> <td style="text-align:right;"> 121.7380 </td> <td style="text-align:right;"> 0.2543956 </td> <td style="text-align:left;"> Oil, gas & pipelines </td> </tr> <tr> <td style="text-align:left;"> Fortive </td> <td style="text-align:right;"> 679.900 </td> <td style="text-align:right;"> 47.2216 </td> <td style="text-align:right;"> 0.0694537 </td> <td style="text-align:left;"> Industrial machinery </td> </tr> <tr> <td style="text-align:left;"> Travelers Cos. </td> <td style="text-align:right;"> 3031.000 </td> <td style="text-align:right;"> 424.0000 </td> <td style="text-align:right;"> 0.1398878 </td> <td style="text-align:left;"> Financial </td> </tr> </tbody> </table> --- # Questions - Did Amazon.com pay US corporate taxes? - What was its US tax rate? -- ```r corp_tax %>% filter(company == "Amazon.com") %>% kable(format = "html") ``` <table> <thead> <tr> <th style="text-align:left;"> company </th> <th style="text-align:right;"> profit </th> <th style="text-align:right;"> tax </th> <th style="text-align:right;"> tax_rate </th> <th style="text-align:left;"> industry </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Amazon.com </td> <td style="text-align:right;"> 10835 </td> <td style="text-align:right;"> -129 </td> <td style="text-align:right;"> -0.0119059 </td> <td style="text-align:left;"> Retail & wholesale trade </td> </tr> </tbody> </table> --- # Questions - Did Facebook pay US corporate taxes? - What was its US tax rate? ```r corp_tax %>% filter(company == ???) ``` --- # Question Did companies in the industry "Utilities, gas and electric" pay taxes ```r corp_tax %>% filter(??? == ???) ``` --- # Question Which companies **not** in the in the industry "Utilities, gas and electric" AND had a negative tax amount ```r corp_tax %>% filter(??? != ??? & tax < 0) corp_tax %>% filter(??? != ??? , tax < 0) ``` --- # Question Which companies had a tax rate greater than 21% ```r corp_tax %>% filter(??? > ???) ``` --- # Question Which companies had lowest US profit? ```r corp_tax %>% arrange(profit) corp_tax %>% slice_min(profit, n = 3) ``` --- # Question Which companies had highest US profit? ```r corp_tax %>% slice_max(profit, n = 3) ``` --- # Question By industry which companies had the highest and lowest tax rate? ```r corp_tax %>% group_by(industry) %>% slice_max(tax_rate, n = 1) corp_tax %>% group_by(industry) %>% slice_min(tax_rate, n = 1) ``` --- # Question - Find companies with no or negative tax and arrange by industry and tax_rate ```r corp_tax %>% filter(tax <= 0) %>% arrange(industry, tax_rate) ``` ``` ## # A tibble: 91 x 5 ## company profit tax tax_rate industry ## <chr> <dbl> <dbl> <dbl> <chr> ## 1 Rockwell Collins 722. -40 -0.0554 Aerospace & defense ## 2 DowDuPont 217 -119 -0.548 Chemicals ## 3 Celanese 480 -142. -0.295 Chemicals ## 4 Air Products & Chemica… 671. -17 -0.0253 Chemicals ## 5 International Business… 500 -342 -0.684 Computers, office equip, so… ## 6 Activision Blizzard 447 -243 -0.544 Computers, office equip, so… ## 7 Pitney Bowes 125. -26.3 -0.210 Computers, office equip, so… ## 8 DXC Technology 522. -6 -0.0115 Computers, office equip, so… ## 9 Sanmina-SCI 16.2 -0.122 -0.00754 Computers, office equip, so… ## 10 Salesforce.com 800 0 0 Computers, office equip, so… ## # … with 81 more rows ``` --- .pull-left[ # Extract observations Find companies - **not** in the industry "Utilities, gas and electric" and - with negative tax amount - *then* arrange by industry and within industry by tax rate ```r corp_tax %>% filter(??? != ???, tax < 0) %>% arrange(???, ???) ``` ] -- .pull-right[ #### Select only company name ```r corp_tax %>% filter(industry != "Utilities, gas and electric", tax < 0) %>% arrange(industry, tax_rate) %>% select(company) ``` ``` ## # A tibble: 70 x 1 ## company ## <chr> ## 1 Rockwell Collins ## 2 DowDuPont ## 3 Celanese ## 4 Air Products & Chemicals ## 5 International Business Machines ## 6 Activision Blizzard ## 7 Pitney Bowes ## 8 DXC Technology ## 9 Sanmina-SCI ## 10 AECOM Technology ## # … with 60 more rows ``` ] --- # Calculate tax rate ```r corp_tax %>% select(company, profit, tax, tax_rate) %>% mutate(tax_rate_check = tax / profit) %>% filter(!near(tax_rate_check, tax_rate)) ``` ``` ## # A tibble: 1 x 5 ## company profit tax tax_rate tax_rate_check ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Liberty Media 1022 14 0.014 0.0137 ``` ```r corp_tax %>% select(company, profit, tax, tax_rate) %>% mutate(tax_rate_check = tax / profit) %>% filter(!near(tax_rate_check, tax_rate, tol = .001)) ``` ``` ## # A tibble: 0 x 5 ## # … with 5 variables: company <chr>, profit <dbl>, tax <dbl>, tax_rate <dbl>, ## # tax_rate_check <dbl> ``` --- ### Use 'summarize' to calculate summary statistics * `mean()`: the average * `min()`: the minimum value (p0) * `quantile( ,1)`: the minimum value (p25) * `median()`: the middle value (p50) * `quantile( ,3)`: the minimum value (p75) * `max()`: the minimum value (p100) * `sd()`: the standard deviation (measure of spread) * `sum()`: the total amount when adding multiple numbers * `n()`: a count of the number of rows/observations in each group --- # Replicate [Figure 5](https://itep.org/corporate-tax-avoidance-in-the-first-year-of-the-trump-tax-law) ```r corp_tax %>% summarize( profit_sum = sum(profit), tax_sum = sum(tax)) %>% kable(format = "html") ``` <table> <thead> <tr> <th style="text-align:right;"> profit_sum </th> <th style="text-align:right;"> tax_sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 765688 </td> <td style="text-align:right;"> 86844.71 </td> </tr> </tbody> </table> --- # What do we need to do? ```r tab <- corp_tax %>% *group_by(industry) %>% summarize( profit_sum = sum(profit), tax_sum = sum(tax)) kable(head(tab), format = "html") ``` <table> <thead> <tr> <th style="text-align:left;"> industry </th> <th style="text-align:right;"> profit_sum </th> <th style="text-align:right;"> tax_sum </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aerospace & defense </td> <td style="text-align:right;"> 23360.363 </td> <td style="text-align:right;"> 2874.800 </td> </tr> <tr> <td style="text-align:left;"> Chemicals </td> <td style="text-align:right;"> 6281.402 </td> <td style="text-align:right;"> 274.767 </td> </tr> <tr> <td style="text-align:left;"> Computers, office equip, software, data </td> <td style="text-align:right;"> 52314.470 </td> <td style="text-align:right;"> 10655.678 </td> </tr> <tr> <td style="text-align:left;"> Engineering & construction </td> <td style="text-align:right;"> 4749.116 </td> <td style="text-align:right;"> 379.873 </td> </tr> <tr> <td style="text-align:left;"> Financial </td> <td style="text-align:right;"> 196269.695 </td> <td style="text-align:right;"> 20069.580 </td> </tr> <tr> <td style="text-align:left;"> Financial Data Services </td> <td style="text-align:right;"> 19769.280 </td> <td style="text-align:right;"> 2688.200 </td> </tr> </tbody> </table> --- # FIGURE 5 Effective Corporate Tax Rates for 379 Corporations, by industry, 2018 ```r fig5 <- corp_tax %>% group_by(industry) %>% summarize( profit_sum = sum(profit), tax_sum = sum(tax)) %>% * mutate(Rate = tax_sum / profit_sum) %>% * arrange(Rate) %>% * rename(Profit = profit_sum, Tax = tax_sum) kable(head(fig5), format = "html") ``` <table> <thead> <tr> <th style="text-align:left;"> industry </th> <th style="text-align:right;"> Profit </th> <th style="text-align:right;"> Tax </th> <th style="text-align:right;"> Rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Industrial machinery </td> <td style="text-align:right;"> 11263.541 </td> <td style="text-align:right;"> -71.19397 </td> <td style="text-align:right;"> -0.0063207 </td> </tr> <tr> <td style="text-align:left;"> Utilities, gas and electric </td> <td style="text-align:right;"> 39894.059 </td> <td style="text-align:right;"> -212.38508 </td> <td style="text-align:right;"> -0.0053237 </td> </tr> <tr> <td style="text-align:left;"> Motor vehicles and parts </td> <td style="text-align:right;"> 9264.907 </td> <td style="text-align:right;"> 134.71636 </td> <td style="text-align:right;"> 0.0145405 </td> </tr> <tr> <td style="text-align:left;"> Oil, gas & pipelines </td> <td style="text-align:right;"> 44643.612 </td> <td style="text-align:right;"> 1628.17897 </td> <td style="text-align:right;"> 0.0364706 </td> </tr> <tr> <td style="text-align:left;"> Chemicals </td> <td style="text-align:right;"> 6281.402 </td> <td style="text-align:right;"> 274.76700 </td> <td style="text-align:right;"> 0.0437429 </td> </tr> <tr> <td style="text-align:left;"> Transportation </td> <td style="text-align:right;"> 30665.010 </td> <td style="text-align:right;"> 2445.63353 </td> <td style="text-align:right;"> 0.0797532 </td> </tr> </tbody> </table> --- ### FIGURE 5 Effective Corporate Tax Rates for 379 Corporations, by industry, 2018 ```r kable(head(fig5), format = "html", digits = c(0, 0, 0, 3)) ``` <table> <thead> <tr> <th style="text-align:left;"> industry </th> <th style="text-align:right;"> Profit </th> <th style="text-align:right;"> Tax </th> <th style="text-align:right;"> Rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Industrial machinery </td> <td style="text-align:right;"> 11264 </td> <td style="text-align:right;"> -71 </td> <td style="text-align:right;"> -0.006 </td> </tr> <tr> <td style="text-align:left;"> Utilities, gas and electric </td> <td style="text-align:right;"> 39894 </td> <td style="text-align:right;"> -212 </td> <td style="text-align:right;"> -0.005 </td> </tr> <tr> <td style="text-align:left;"> Motor vehicles and parts </td> <td style="text-align:right;"> 9265 </td> <td style="text-align:right;"> 135 </td> <td style="text-align:right;"> 0.015 </td> </tr> <tr> <td style="text-align:left;"> Oil, gas & pipelines </td> <td style="text-align:right;"> 44644 </td> <td style="text-align:right;"> 1628 </td> <td style="text-align:right;"> 0.036 </td> </tr> <tr> <td style="text-align:left;"> Chemicals </td> <td style="text-align:right;"> 6281 </td> <td style="text-align:right;"> 275 </td> <td style="text-align:right;"> 0.044 </td> </tr> <tr> <td style="text-align:left;"> Transportation </td> <td style="text-align:right;"> 30665 </td> <td style="text-align:right;"> 2446 </td> <td style="text-align:right;"> 0.080 </td> </tr> </tbody> </table> <!-- ### Other packages for descriptive statistics / exploratory data analysis - [skimr](https://docs.ropensci.org/skimr/index.html) - [DataExplorer](http://boxuancui.github.io/DataExplorer/) # Definitions + **Console**: where you can type commands + **Running code**: tell to perform an act by giving it commands in the console + **Objects**: where values are saved in R + **Data types**: integers, doubles/numerics, logicals, and characters .footnote[.font80[Source: Modern Dive https://moderndive.com/1-getting-started.html]] # Definitions (more) + **Vectors**: a series of values. Created using `c()` * **Factors**: *categorical data* commonly represented in `R` as factors * **Data frames**: spreadsheets contain data - rows are **observations** - columns are **variables** describing the observations * **Functions** (**commands**): perform tasks - take in inputs called **arguments** can specify if don't defaults are used - return outputs .footnote[.font80[Source: Modern Dive https://moderndive.com/1-getting-started.html]] -->