class: title-slide, middle, center # BUS 320 # Topic 6 # Joining data ## Elizabeth Stanny --- layout: true <div class="my-footer"><span>http://bus320.estanny.com</span></div> --- # Learning objectives for the course - Ask the right questions - Extract, transform and load relevant data (ETL process) - Extract (`tidyverse`) - Transform (`tidyverse`) - Load (`tidyverse`) - Apply appropriate data analytic techniques - Descriptive analysis (`tidyverse`) - Exploratory analysis (`tidyverse`, `inspectdf`) - Interpret and share the results (`rmarkdown`) --- # Load the packages and data ```r library(pacman) p_load(tidyverse, DataExplorer) drug_cos <- read_csv("drug_cos.csv") health_cos <- read_csv("health_cos.csv") ``` --- ## dplyr actions | `function()` | Action | | :-------------- | :-------------------------------------------------------- | | `filter()` | extract **observations** based on their values | | `arrange()` | orders **observations** based on their values | | `select()` | Selects a subset of **variables** from the data frame | | `mutate()` | creates new **variables** | | `group_by()` | create subsets of data to apply functions to | | `summarize()` | create summary statistics | --- ## More dplyr actions | `function()` | Action | | :-------------- | :-------------------------------------------------------- | | `bind_cols(x, y)` | bind x and y (need same number of rows) | | `bind_rows(x, y)` | bind x and y (need same number of columns) | | `inner_join(x, y)` | **rows** are in x and y; **columns** all | | `full_join(x, y)` | **rows** all; **columns** all | | `left_join(x, y)` | **rows** in x that are in y; **columns** all | | `right_join(x, y)` | **rows** in y that are in x; **columns** all | | `semi_join(x, y)` | **rows** in x that are in y; **columns** x | | `anti_join(x, y)` | **rows** in x that are not in y; **columns** x | --- # Combine Tables -- .pull-left[ ```r x <- tribble( ~ticker, ~name, ~revenue, "TPR", "Tapestry", 5880, "ALL", "Allstate", 39815, "FFIV", "F5 Networks Inc", 361 ) # x ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r y <- tribble( ~ticker, ~name, ~cor, "TPR", "Tapestry", 2031, "ALL", "Allstate", 25466, "DLTR", "Dollar Tree", 15876 ) # y ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] --- class: inverse, center, middle # Combining variables ## bind together ### no conditions on observations --- ### `bind_cols` - combine variables same length - make sure same number of rows .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% bind_cols(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker...1 </th> <th style="text-align:left;"> name...2 </th> <th style="text-align:right;"> revenue </th> <th style="text-align:left;"> ticker...4 </th> <th style="text-align:left;"> name...5 </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- ### Would we want to do this? --- ### `bind_rows` - combine cases - same number of columns .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% bind_rows(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] ### Why do we have 4 columns instead of 3? --- class: inverse, center, middle # Mutating joins to ## combine variables --- ### `inner_join` - all rows from `x` with matching rows in `y` - all columns `x` and `y` .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% inner_join(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> <td style="text-align:right;"> 25466 </td> </tr> </tbody> </table> ] --- ### `full_join` - all rows from `x` and all rows in `y` - all columns `x` and `y` .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% full_join(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] --- ### `left_join` - all rows from `x` - all columns `x` and `y` .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% left_join(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> <td style="text-align:right;"> NA </td> </tr> </tbody> </table> ] --- ### `right_join` - all rows from `y` - all columns `x` and `y` .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% right_join(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] --- class: inverse, center, middle # Filtering joins (keep cases from x) --- ### `semi-join` - all rows from `x` that are in `y` - columns `x` -- .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% semi_join(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> </tbody> </table> ] --- ### `anti-join` - all rows in `x` that are **not** `y` - columns `x` -- .pull-left[ `x` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 5880 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 39815 </td> </tr> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> `y` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> cor </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TPR </td> <td style="text-align:left;"> Tapestry </td> <td style="text-align:right;"> 2031 </td> </tr> <tr> <td style="text-align:left;"> ALL </td> <td style="text-align:left;"> Allstate </td> <td style="text-align:right;"> 25466 </td> </tr> <tr> <td style="text-align:left;"> DLTR </td> <td style="text-align:left;"> Dollar Tree </td> <td style="text-align:right;"> 15876 </td> </tr> </tbody> </table> ] -- .pull-right[ ```r x %>% anti_join(y) ``` <table> <thead> <tr> <th style="text-align:left;"> ticker </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> revenue </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> FFIV </td> <td style="text-align:left;"> F5 Networks Inc </td> <td style="text-align:right;"> 361 </td> </tr> </tbody> </table> ] --- class: inverse, middle, center # Joining `drug_cos` and `health_cos` dataframes --- ### Examine drug_cos ```r drug_cos ``` ``` ## # A tibble: 104 x 9 ## ticker name location ebitdamargin grossmargin netmargin ros roe year ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ZTS Zoetis… New Jers… 0.149 0.61 0.058 0.101 0.069 2011 ## 2 ZTS Zoetis… New Jers… 0.217 0.64 0.101 0.171 0.113 2012 ## 3 ZTS Zoetis… New Jers… 0.222 0.634 0.111 0.176 0.612 2013 ## 4 ZTS Zoetis… New Jers… 0.238 0.641 0.122 0.195 0.465 2014 ## 5 ZTS Zoetis… New Jers… 0.182 0.635 0.071 0.14 0.285 2015 ## 6 ZTS Zoetis… New Jers… 0.335 0.659 0.168 0.286 0.587 2016 ## 7 ZTS Zoetis… New Jers… 0.366 0.666 0.163 0.321 0.488 2017 ## 8 ZTS Zoetis… New Jers… 0.379 0.672 0.245 0.326 0.694 2018 ## 9 PRGO PERRIG… Ireland 0.216 0.343 0.123 0.178 0.248 2011 ## 10 PRGO PERRIG… Ireland 0.226 0.345 0.127 0.183 0.236 2012 ## # … with 94 more rows ``` --- ### drug_co variables .pull-left[ - ticker - name - location - year ] -- .pull-right[ - profitability ratios - [ros](https://www.investopedia.com/terms/r/ros.asp): return on sales (or operating profit margin) **E**arnings **B**efore **I**nterest **T**axes (EBIT) / revenue - [ebitdamargin](https://www.investopedia.com/terms/e/ebitda-margin.asp): another measure of operating profit. **E**arnings **B**efore **I**nterest **T**axes, **D**epreciation and **A**mortization (EBITDA) / revenue - [grossmargin](https://www.investopedia.com/terms/g/grossmargin.asp): grossprofit / revenue. grossprofit = revenue - cor (cost of revenue). measures efficiency - [netmargin](https://www.investopedia.com/terms/n/net_margin.asp): netincome / revenue - [roe](https://www.investopedia.com/terms/r/returnonequity.asp) measure of financial performance. netincome / equity ] --- ### Examine health_cos ```r health_cos ``` ``` ## # A tibble: 464 x 11 ## ticker name revenue gp rnd netincome assets liabilities marketcap ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ZTS Zoet… 4.23e9 2.58e9 4.27e8 2.45e8 5.71e 9 1975000000 NA ## 2 ZTS Zoet… 4.34e9 2.77e9 4.09e8 4.36e8 6.26e 9 2221000000 NA ## 3 ZTS Zoet… 4.56e9 2.89e9 3.99e8 5.04e8 6.56e 9 5596000000 1.63e10 ## 4 ZTS Zoet… 4.78e9 3.07e9 3.96e8 5.83e8 6.59e 9 5251000000 2.16e10 ## 5 ZTS Zoet… 4.76e9 3.03e9 3.64e8 3.39e8 7.91e 9 6822000000 2.39e10 ## 6 ZTS Zoet… 4.89e9 3.22e9 3.76e8 8.21e8 7.65e 9 6150000000 2.64e10 ## 7 ZTS Zoet… 5.31e9 3.53e9 3.82e8 8.64e8 8.59e 9 6800000000 3.51e10 ## 8 ZTS Zoet… 5.82e9 3.91e9 4.32e8 1.43e9 1.08e10 8592000000 4.11e10 ## 9 ZBH Zimm… 4.45e9 3.33e9 2.38e8 7.61e8 8.52e 9 3000500000 9.57e 9 ## 10 ZBH Zimm… 4.47e9 3.35e9 2.26e8 7.55e8 9.01e 9 3146100000 1.16e10 ## # … with 454 more rows, and 2 more variables: year <dbl>, industry <chr> ``` --- ### health_cos variables .pull-left[ - ticker - name - year - industry - marketcap: price per share x number of shares ] .pull-right[ - income statement amounts - [revenue](https://www.investopedia.com/terms/n/netsales.asp) - [grossprofit](https://www.investopedia.com/terms/g/grossprofit.asp) = revenue - cost of revenue - rnd: [research and development expense](https://www.investopedia.com/terms/r/research-and-development-expenses.asp) - netincome: net income - balance sheet amounts - assets - liabilities ] --- ### Which variables are the same in both data sets -- ```r names_drug <- drug_cos %>% names() names_health <- health_cos %>% names() intersect(names_drug, names_health) ``` ``` ## [1] "ticker" "name" "year" ``` --- ### Select subset of variables to work with .pull-left[ - For `drug_cos` select (in this order): `ticker`, `year`, `grossmargin` - Extract observations for 2018 - Assign output to `drug_subset` ```r drug_subset <- drug_cos %>% ???(ticker, ???, ???) %>% ???(year == ???) ``` ] .pull-right[ - For `health_cos` select (in this order): `ticker`, `year`, `revenue`, `gp`, `industry` - Extract observations for 2018 - Assign output to `health_subset` ```r ??? <- health_cos %>% ???(ticker, ???, ???, ???, ???) %>% ???(year == ???) ``` ] --- .pull-left[ ```r drug_subset <- drug_cos %>% select(ticker, year, grossmargin) %>% filter(year == 2018) ``` ] -- .pull-right[ ```r health_subset <- health_cos %>% select(ticker, year, revenue, gp, industry) %>% filter(year == 2018) ``` ] -- ```r drug_subset %>% left_join(health_subset) ``` --- # Exploratory data analysis .pull-left[ - Using [inspectdf](https://alastairrushworth.github.io/inspectdf/index.html) ```r p_load(inspectdf) drug_cos <- drug_cos %>% mutate(year = factor(year)) health_cos <- health_cos %>% mutate(year = factor(year)) drug_cos %>% inspect_types() %>% show_plot() health_cos %>% inspect_types() %>% show_plot() ``` ] .pull-right[ ```r # missing values? drug_cos %>% inspect_na() %>% show_plot() health_cos %>% inspect_na() %>% show_plot() ## Distributions of continuous variables? drug_cos %>% inspect_num() %>% show_plot() ## Distributions of categorical variables? drug_cos %>% inspect_cat() %>% show_plot() # correlation (association) between numerical variables drug_cos %>% inspect_cor() %>% show_plot() drug_cos %>% group_by(year) %>% inspect_cor() %>% show_plot() drug_cos %>% group_by(ticker) %>% inspect_cor() %>% show_plot() ``` ]