class: title-slide, middle, center # BUS 320 # Topic 5 # Data manipulation ## Elizabeth Stanny --- layout: true <div class="my-footer"><span>http://bus320.estanny.com</span></div> --- # Learning objectives for the course - Ask the right questions -- - Extract, transform and load relevant data (ETL process) - Extract (`tidyverse/dplyr/tidyr`) - Transform (`tidyverse/dplyr/tidyr`) - Load Excel Workbooks (`tidyverse/readxl`) -- - Apply appropriate data analytic techniques - Descriptive statistics - `tidyverse/dplyr` -- - Interpret and share the results - `rmarkdown` files end with .Rmd --- # Load the packages and data ```r library(pacman) p_load(tidyverse) drug_cos <- read_csv("drug_cos.csv") ``` --- ### Examine data ```r drug_cos ``` ``` ## # A tibble: 104 x 9 ## ticker name location ebitdamargin grossmargin netmargin ros roe year ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ZTS Zoetis… New Jers… 0.149 0.61 0.058 0.101 0.069 2011 ## 2 ZTS Zoetis… New Jers… 0.217 0.64 0.101 0.171 0.113 2012 ## 3 ZTS Zoetis… New Jers… 0.222 0.634 0.111 0.176 0.612 2013 ## 4 ZTS Zoetis… New Jers… 0.238 0.641 0.122 0.195 0.465 2014 ## 5 ZTS Zoetis… New Jers… 0.182 0.635 0.071 0.14 0.285 2015 ## 6 ZTS Zoetis… New Jers… 0.335 0.659 0.168 0.286 0.587 2016 ## 7 ZTS Zoetis… New Jers… 0.366 0.666 0.163 0.321 0.488 2017 ## 8 ZTS Zoetis… New Jers… 0.379 0.672 0.245 0.326 0.694 2018 ## 9 PRGO PERRIG… Ireland 0.216 0.343 0.123 0.178 0.248 2011 ## 10 PRGO PERRIG… Ireland 0.226 0.345 0.127 0.183 0.236 2012 ## # … with 94 more rows ``` --- # drug_cos variables .left-column[ - `ticker` - `name` - `location` - `year` ] .right-column[ Profitability ratios - [`netmargin`](https://www.investopedia.com/terms/n/net_margin.asp): Net income / Revenue. What percent of revenue left after all expenses - [`grossmargin`](https://www.investopedia.com/terms/g/grossmargin.asp): Gross Profit / Revenue. Gross Profit = Revenue - Cost of Revenue - [`ros`]((https://www.investopedia.com/terms/r/ros.asp): Return on Sales (or operating profit margin) measures operational efficiency. Earnings Before Income and Taxes (EBIT) / Revenue - [`ebitdamargin`](https://www.investopedia.com/terms/e/ebitda-margin.asp): measures operating profit as a percentage of revenue. EBITDA / Revenue. EBITDA is Earnings Before Interest Taxes, Depreciation and Amortization (cash income) Financial performance ratio - [`roe`](https://www.investopedia.com/terms/r/returnonequity.asp): Return on Equity measures financial performance. Net income / Shareholder's Equity ] --- # Questions this data could answer? --- ## Recall dplyr verbs: `function()` | Action --------------|-------------------------------------------------------- `filter()` | extract **observations** based on their values `select()` | selects a subset of **variables** `arrange()` | orders **observations** based on their values `mutate()` | creates new **variables** `group_by()` | create subsets of data to apply functions to `summarize()` | create summary statistics --- # Filter ### Extract rows by comparisons - `==`, `<=`, `>=`, `!=` - `%in%` is object part of set - `grepl("pattern", variable)` -- ### Combine criteria using operators that make comparisons: - `|` or - `&` and --- # Filter ## Extract rows in range of years ```r drug_cos %>% count(year) #observations per year drug_cos %>% count(ticker) #observations per ticker # these 4 commands give the same result drug_cos %>% filter(year %in% 2012:2017) drug_cos %>% filter(year %in% c(2012:2017)) drug_cos %>% filter(year %in% seq(2012, 2017, 1)) drug_cos %>% filter(between( year, 2012, 2017)) ``` --- # Filter ## extract rows in non-consecutive years ```r drug_cos %>% filter(year %in% c(2013, 2018)) ``` ``` ## # A tibble: 26 x 9 ## ticker name location ebitdamargin grossmargin netmargin ros roe year ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ZTS Zoetis… New Jers… 0.222 0.634 0.111 0.176 0.612 2013 ## 2 ZTS Zoetis… New Jers… 0.379 0.672 0.245 0.326 0.694 2018 ## 3 PRGO PERRIG… Ireland 0.236 0.362 0.125 0.19 0.205 2013 ## 4 PRGO PERRIG… Ireland 0.178 0.387 0.028 0.088 0.022 2018 ## 5 PFE Pfizer… New York… 0.634 0.814 0.427 0.51 0.279 2013 ## 6 PFE Pfizer… New York… 0.34 0.79 0.208 0.221 0.162 2018 ## 7 MYL Mylan … United K… 0.228 0.44 0.09 0.153 0.209 2013 ## 8 MYL Mylan … United K… 0.258 0.35 0.031 0.074 0.028 2018 ## 9 MRK Merck … New Jers… 0.282 0.615 0.1 0.123 0.089 2013 ## 10 MRK Merck … New Jers… 0.313 0.681 0.147 0.206 0.199 2018 ## # … with 16 more rows ``` --- # Filter ## Extract years every other year from 2012 to 2018 ```r drug_cos %>% filter(seq(2012, 2018, by = 2)) ``` <!-- gasoline %>% filter(year %in% seq(1960, 1978, by = 2)) --> --- # Filter ## Extract tickers "PFE" and "MYL" ```r drug_cos %>% filter(ticker %in% c("PFE", "MYL")) ``` --- # Select columns (variables) ```r drug_cos %>% select(ticker, name, ros) ``` ``` ## # A tibble: 104 x 3 ## ticker name ros ## <chr> <chr> <dbl> ## 1 ZTS Zoetis Inc 0.101 ## 2 ZTS Zoetis Inc 0.171 ## 3 ZTS Zoetis Inc 0.176 ## 4 ZTS Zoetis Inc 0.195 ## 5 ZTS Zoetis Inc 0.14 ## 6 ZTS Zoetis Inc 0.286 ## 7 ZTS Zoetis Inc 0.321 ## 8 ZTS Zoetis Inc 0.326 ## 9 PRGO PERRIGO Co plc 0.178 ## 10 PRGO PERRIGO Co plc 0.183 ## # … with 94 more rows ``` --- # Don't `select` columns ```r a <- drug_cos %>% select(-ticker, -name, -ros) b <- drug_cos %>% select(-c(ticker, name, ros)) setequal(a, b) ``` ``` ## [1] TRUE ``` --- # Can rename and reorder columns with `select` ```r drug_cos %>% select(date = year, ticker) ``` --- # Your turn - select ticker, year, location, netmargin, roe, year - change name of location to headquarter - order columns year, ticker, headquarter, netmargin, roe -- ```r drug_cos %>% select(year, ticker, headquarter =location, netmargin, roe ) ``` --- ## `select` ranges of columns ### by name ```r drug_cos %>% select(ebitdamargin:netmargin) ``` -- ### by position ```r drug_cos %>% select(4:6) ``` --- # `select` helper functions - starts_with("abc") matches columns start with "abc" - ends_with("abc") matches columns end with "abc" - contains("abc") matches columns contain "abc" -- ```r drug_cos %>% select(ticker, contains("locat")) ``` -- ```r drug_cos %>% select(ticker, starts_with("r")) ``` ```r drug_cos %>% select(year, ends_with("margin")) ``` --- ### `group_by` ticker ```r drug_cos %>% group_by(ticker) ``` -- ### `group_by` year ```r drug_cos %>% group_by(year) ``` --- ## Summarize - median roe by year ```r drug_cos %>% group_by(year) %>% summarize( median_roe = median(roe)) ``` ``` ## # A tibble: 8 x 2 ## year median_roe ## <dbl> <dbl> ## 1 2011 0.161 ## 2 2012 0.184 ## 3 2013 0.233 ## 4 2014 0.217 ## 5 2015 0.162 ## 6 2016 0.229 ## 7 2017 0.066 ## 8 2018 0.244 ``` --- ## Summarize - median roe by ticker ```r drug_cos %>% group_by(ticker) %>% summarize(median_roe = median(roe)) ``` ``` ## # A tibble: 13 x 2 ## ticker median_roe ## <chr> <dbl> ## 1 ABBV 0.811 ## 2 AGN -0.017 ## 3 AMGN 0.233 ## 4 BIIB 0.264 ## 5 BMY 0.154 ## 6 GILD 0.382 ## 7 JNJ 0.207 ## 8 LLY 0.224 ## 9 MRK 0.105 ## 10 MYL 0.118 ## 11 PFE 0.141 ## 12 PRGO 0.0275 ## 13 ZTS 0.477 ``` --- # Mutate add median by year to dataframe Each year how does company's roe differ from median for industry by year? ```r diff_roe_year <- drug_cos %>% group_by(year) %>% * mutate(median_roe = median(roe), * diff_roe = roe - median_roe) %>% select(ticker, roe, median_roe, diff_roe) diff_roe_year %>% head(5) %>% kable(format = "html") ``` <table> <thead> <tr> <th style="text-align:right;"> year </th> <th style="text-align:left;"> ticker </th> <th style="text-align:right;"> roe </th> <th style="text-align:right;"> median_roe </th> <th style="text-align:right;"> diff_roe </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2011 </td> <td style="text-align:left;"> ZTS </td> <td style="text-align:right;"> 0.069 </td> <td style="text-align:right;"> 0.161 </td> <td style="text-align:right;"> -0.092 </td> </tr> <tr> <td style="text-align:right;"> 2012 </td> <td style="text-align:left;"> ZTS </td> <td style="text-align:right;"> 0.113 </td> <td style="text-align:right;"> 0.184 </td> <td style="text-align:right;"> -0.071 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:left;"> ZTS </td> <td style="text-align:right;"> 0.612 </td> <td style="text-align:right;"> 0.233 </td> <td style="text-align:right;"> 0.379 </td> </tr> <tr> <td style="text-align:right;"> 2014 </td> <td style="text-align:left;"> ZTS </td> <td style="text-align:right;"> 0.465 </td> <td style="text-align:right;"> 0.217 </td> <td style="text-align:right;"> 0.248 </td> </tr> <tr> <td style="text-align:right;"> 2015 </td> <td style="text-align:left;"> ZTS </td> <td style="text-align:right;"> 0.285 </td> <td style="text-align:right;"> 0.162 </td> <td style="text-align:right;"> 0.123 </td> </tr> </tbody> </table>