class: center, middle

# David Neuzerling
## 2018-08-17

<img src="hex/hexwall.jpg" style="width: 45%; height: 45%;"></img>

---
class: inverse, center, middle

<img src="hex/r2d3.jpg" style="width: 45%; height: 45%;"></img>
<img src="hex/recipes.jpg" style="width: 45%; height: 45%;"></img>

Hex logos are the cutest part of R

---
class: inverse, center, middle

# R vs. Python

![Fighting corgis](https://i.imgur.com/NzppaUe.gif)

---
class: inverse, center, middle

# Refresh-R: Piping data

<img src="https://i.giphy.com/Y8qTuvMUjSh7a.gif" style="width: 60%; height: 60%;"></img>

---
class: left

## Refresh-R: Ceci **est** pas une pipe

`%>%` is a *pipe*. It puts the object on the left into the *first* argument of the function on the right:

* `4 %>% sqrt` is the same as `sqrt(4)`
* `data %>% head` is the same as `head(data)`

<img src="hex/magrittr.png" style="width: 15%; height: 15%;"></img>

---
class: left

## Refresh-R: Ceci **est** pas une pipe

`%>%` is a *pipe*. It puts the object on the left into the *first* argument of the function on the right:

* `4 %>% sqrt` is the same as `sqrt(4)`
* `data %>% head` is the same as `head(data)`

Or, if we wanted to calculate the average BMI of all droids in the Star Wars movies:

```r
starwars %>%
  filter(species == "Droid") %>%
  mutate(BMI = mass / (height / 100)^2) %>%
  summarise(avg_BMI = mean(BMI, na.rm = TRUE))
```

is the same as

```r
summarise(
  mutate(
    filter(starwars, species == "Droid"),
    BMI = mass / (height / 100)^2
  ),
  avg_BMI = mean(BMI, na.rm = TRUE)
)
```

It's 32.7.

---
class: inverse, center, middle

# Refresh-R: Visualisation

<img src="https://i.giphy.com/media/3og0IExSrnfW2kUaaI/source.gif" style="width: 70%; height: 70%;"></img>

---
class: center, middle

## Refresh-R: visualisation

---
class: left

## Refresh-R: Visualisation

```r
library(tidyverse)
library(bomrang)
weather <- get_historical(stationid = "090015", type = "max")
```

<img src="hex/tidyverse.png" style="width: 15%; height: 15%;"></img><img src="hex/bomrang.png" style="width: 15%; height: 15%;"></img>

---
class: left

## Refresh-R: Visualisation

```r
library(tidyverse)
library(bomrang)
weather <- get_historical(stationid = "090015", type = "max")
```

```r
weather %>% sample_n(6)
```

<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Product_code </th> <th style="text-align:right;"> Station_number </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> Month </th> <th style="text-align:right;"> Day </th> <th style="text-align:right;"> Max_temperature </th> <th style="text-align:right;"> Accum_days_max </th> <th style="text-align:left;"> Quality </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> 5427 </td> <td style="text-align:left;"> IDCJAC0010 </td> <td style="text-align:right;"> 90015 </td> <td style="text-align:right;"> 1878 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 18.9 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Y </td> </tr>
  <tr> <td style="text-align:left;"> 22746 </td> <td style="text-align:left;"> IDCJAC0010 </td> <td style="text-align:right;"> 90015 </td> <td style="text-align:right;"> 1926 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 17.3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Y </td> </tr>
  <tr> <td style="text-align:left;"> 52658 </td> <td style="text-align:left;"> IDCJAC0010 </td> <td style="text-align:right;"> 90015 </td> <td style="text-align:right;"> 2008 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 29.5 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Y </td> </tr>
  <tr> <td style="text-align:left;"> 36199 </td> <td style="text-align:left;"> IDCJAC0010 </td> <td style="text-align:right;"> 90015 </td> <td style="text-align:right;"> 1963 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 19.4 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Y </td> </tr>
  <tr> <td style="text-align:left;"> 26884 </td> <td style="text-align:left;"> IDCJAC0010 </td> <td style="text-align:right;"> 90015 </td> <td style="text-align:right;"> 1937 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 16.7 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Y </td> </tr>
  <tr> <td style="text-align:left;"> 54953 </td> <td style="text-align:left;"> IDCJAC0010 </td> <td style="text-align:right;"> 90015 </td> <td style="text-align:right;"> 2014 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 13.7 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Y </td> </tr>
 </tbody>
</table>

---
class: left

## Refresh-R: Visualisation

```r
weather %>%
  filter(!is.na(Max_temperature))
```

---
class: left

## Refresh-R: Visualisation

```r
weather %>%
  filter(!is.na(Max_temperature)) %>%
  mutate(
    Season = case_when(
      Month %in% c(12, 1, 2) ~ "Summer",
      Month %in% c(3, 4, 5) ~ "Autumn",
      Month %in% c(6, 7, 8) ~ "Winter",
      Month %in% c(9, 10, 11) ~ "Spring"
    ),
    Season = factor(Season, levels = c("Summer", "Autumn", "Winter", "Spring")) # ordering
  )
```

---
class: left

## Refresh-R: Visualisation

```r
weather %>%
  filter(!is.na(Max_temperature)) %>%
  mutate(
    Season = case_when(
      Month %in% c(12, 1, 2) ~ "Summer",
      Month %in% c(3, 4, 5) ~ "Autumn",
      Month %in% c(6, 7, 8) ~ "Winter",
      Month %in% c(9, 10, 11) ~ "Spring"
    ),
    Season = factor(Season, levels = c("Summer", "Autumn", "Winter", "Spring")) # ordering
  ) %>%
  group_by(Year, Season) %>%
  summarise(Average_max_temp = mean(Max_temperature)) -> weather_summarised
```

--

<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> Year </th> <th style="text-align:left;"> Season </th> <th style="text-align:right;"> Average_max_temp </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:right;"> 1979 </td> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 13.25000 </td> </tr>
  <tr> <td style="text-align:right;"> 1972 </td> <td style="text-align:left;"> Autumn </td> <td style="text-align:right;"> 17.70978 </td> </tr>
  <tr> <td style="text-align:right;"> 1949 </td> <td style="text-align:left;"> Winter </td> <td style="text-align:right;"> 12.65761 </td> </tr>
  <tr> <td style="text-align:right;"> 1899 </td> <td style="text-align:left;"> Spring </td> <td style="text-align:right;"> 17.22593 </td> </tr>
 </tbody>
</table>

---
class: left

## Refresh-R: Visualisation

.pull-left[
```r
weather_summarised %>%
  ggplot(
    aes(x = Year, y = Average_max_temp)
  )
```
]

.pull-right[
<img src="user2018_full_files/figure-html/weather_partial_plot_1-1.png" width="500" height="500" />
]

---
class: left

## Refresh-R: Visualisation

.pull-left[
```r
weather_summarised %>%
  ggplot(
    aes(x = Year, y = Average_max_temp)
  ) +
  facet_grid(Season ~ .)
```
]

.pull-right[
<img src="user2018_full_files/figure-html/weather_partial_plot_2-1.png" width="500" height="500" />
]

---
class: left

## Refresh-R: Visualisation

.pull-left[
```r
weather_summarised %>%
  ggplot(
    aes(x = Year, y = Average_max_temp)
  ) +
  facet_grid(Season ~ .) +
  geom_point()
```
]

.pull-right[
<img src="user2018_full_files/figure-html/weather_partial_plot_3-1.png" width="500" height="500" />
]

---
class: left

## Refresh-R: Visualisation

.pull-left[
```r
weather_summarised %>%
  ggplot(
    aes(x = Year, y = Average_max_temp)
  ) +
  facet_grid(Season ~ .) +
  geom_point() +
  geom_smooth()
```

We can pass this to plotly with the `ggplotly` function.
]

.pull-right[
<img src="user2018_full_files/figure-html/weather_partial_plot_4-1.png" width="500" height="500" />
]

---
class: inverse, center, middle

# Refresh-R: Modelling

<img src="https://i.imgur.com/muNWT5A.gif" style="width: 40%; height: 40%;"></img>

---
class: left

## Refresh-R: Modelling

```r
diamonds %>% nrow
```

```
## [1] 53940
```

```r
diamonds %>% head
```

<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> carat </th> <th style="text-align:left;"> cut </th> <th style="text-align:left;"> color </th> <th style="text-align:left;"> clarity </th> <th style="text-align:right;"> depth </th> <th style="text-align:right;"> table </th> <th style="text-align:right;"> price </th> <th style="text-align:right;"> x </th> <th style="text-align:right;"> y </th> <th style="text-align:right;"> z </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:right;"> 0.23 </td> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> E </td> <td style="text-align:left;"> SI2 </td> <td style="text-align:right;"> 61.5 </td> <td style="text-align:right;"> 55 </td> <td style="text-align:right;"> 326 </td> <td style="text-align:right;"> 3.95 </td> <td style="text-align:right;"> 3.98 </td> <td style="text-align:right;"> 2.43 </td> </tr>
  <tr> <td style="text-align:right;"> 0.21 </td> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> E </td> <td style="text-align:left;"> SI1 </td> <td style="text-align:right;"> 59.8 </td> <td style="text-align:right;"> 61 </td> <td style="text-align:right;"> 326 </td> <td style="text-align:right;"> 3.89 </td> <td style="text-align:right;"> 3.84 </td> <td style="text-align:right;"> 2.31 </td> </tr>
  <tr> <td style="text-align:right;"> 0.23 </td> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> E </td> <td style="text-align:left;"> VS1 </td> <td style="text-align:right;"> 56.9 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 327 </td> <td style="text-align:right;"> 4.05 </td> <td style="text-align:right;"> 4.07 </td> <td style="text-align:right;"> 2.31 </td> </tr>
  <tr> <td style="text-align:right;"> 0.29 </td> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> I </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 62.4 </td> <td style="text-align:right;"> 58 </td> <td style="text-align:right;"> 334 </td> <td style="text-align:right;"> 4.20 </td> <td style="text-align:right;"> 4.23 </td> <td style="text-align:right;"> 2.63 </td> </tr>
  <tr> <td style="text-align:right;"> 0.31 </td> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> J </td> <td style="text-align:left;"> SI2 </td> <td style="text-align:right;"> 63.3 </td> <td style="text-align:right;"> 58 </td> <td style="text-align:right;"> 335 </td> <td style="text-align:right;"> 4.34 </td> <td style="text-align:right;"> 4.35 </td> <td style="text-align:right;"> 2.75 </td> </tr>
  <tr> <td style="text-align:right;"> 0.24 </td> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> J </td> <td style="text-align:left;"> VVS2 </td> <td style="text-align:right;"> 62.8 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 336 </td> <td style="text-align:right;"> 3.94 </td> <td style="text-align:right;"> 3.96 </td> <td style="text-align:right;"> 2.48 </td> </tr>
 </tbody>
</table>

---
class: left

## Refresh-R: Modelling

The `glance` function from the `broom` package makes model results tidy.

```r
library(broom)
```

```r
diamonds %>%
  lm(price ~ color + carat + cut + clarity + depth + table, data = .) %>%
  glance
```

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:right;"> r.squared </th> <th style="text-align:right;"> adj.r.squared </th> <th style="text-align:right;"> sigma </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> logLik </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> BIC </th> <th style="text-align:right;"> deviance </th> <th style="text-align:right;"> df.residual </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:right;"> 0.9160543 </td> <td style="text-align:right;"> 0.9160232 </td> <td style="text-align:right;"> 1156.09 </td> <td style="text-align:right;"> 29419.46 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> -456955 </td> <td style="text-align:right;"> 913954 </td> <td style="text-align:right;"> 914149.7 </td> <td style="text-align:right;"> 72065109042 </td> <td style="text-align:right;"> 53919 </td> </tr>
 </tbody>
</table>

<img src="hex/broom.png" style="width: 15%; height: 15%;"></img>

---
class: left

## Refresh-R: Modelling

Suppose we wanted to make a different model for every `color`. `glance` gives one row per model.

```r
diamonds %>%
  group_by(color) %>%
  do(glance(lm(price ~ carat + cut + clarity + depth + table, data = .)))
```

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> color </th> <th style="text-align:right;"> r.squared </th> <th style="text-align:right;"> adj.r.squared </th> <th style="text-align:right;"> sigma </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> logLik </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> BIC </th> <th style="text-align:right;"> deviance </th> <th style="text-align:right;"> df.residual </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 0.8909225 </td> <td style="text-align:right;"> 0.8906966 </td> <td style="text-align:right;"> 1109.725 </td> <td style="text-align:right;"> 3943.879 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> -57111.20 </td> <td style="text-align:right;"> 114254.40 </td> <td style="text-align:right;"> 114363.54 </td> <td style="text-align:right;"> 8324862365 </td> <td style="text-align:right;"> 6760 </td> </tr>
  <tr> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 0.9039023 </td> <td style="text-align:right;"> 0.9037647 </td> <td style="text-align:right;"> 1037.419 </td> <td style="text-align:right;"> 6572.156 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> -81929.01 </td> <td style="text-align:right;"> 163890.02 </td> <td style="text-align:right;"> 164005.06 </td> <td style="text-align:right;"> 10527754413 </td> <td style="text-align:right;"> 9782 </td> </tr>
  <tr> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 0.9112486 </td> <td style="text-align:right;"> 0.9111182 </td> <td style="text-align:right;"> 1128.422 </td> <td style="text-align:right;"> 6986.982 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> -80598.67 </td> <td style="text-align:right;"> 161229.34 </td> <td style="text-align:right;"> 161343.96 </td> <td style="text-align:right;"> 12131071813 </td> <td style="text-align:right;"> 9527 </td> </tr>
  <tr> <td style="text-align:left;"> G </td> <td style="text-align:right;"> 0.9273524 </td> <td style="text-align:right;"> 0.9272622 </td> <td style="text-align:right;"> 1092.580 </td> <td style="text-align:right;"> 10282.267 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> -95017.34 </td> <td style="text-align:right;"> 190066.68 </td> <td style="text-align:right;"> 190183.99 </td> <td style="text-align:right;"> 13461715023 </td> <td style="text-align:right;"> 11277 </td> </tr>
  <tr> <td style="text-align:left;"> H </td> <td style="text-align:right;"> 0.9315567 </td> <td style="text-align:right;"> 0.9314411 </td> <td style="text-align:right;"> 1103.892 </td> <td style="text-align:right;"> 8058.471 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> -69958.15 </td> <td style="text-align:right;"> 139948.29 </td> <td style="text-align:right;"> 140060.68 </td> <td style="text-align:right;"> 10100793783 </td> <td style="text-align:right;"> 8289 </td> </tr>
  <tr> <td style="text-align:left;"> I </td> <td style="text-align:right;"> 0.9363188 </td> <td style="text-align:right;"> 0.9361540 </td> <td style="text-align:right;"> 1193.242 </td> <td style="text-align:right;"> 5678.599 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> -46097.75 </td> <td style="text-align:right;"> 92227.50 </td> <td style="text-align:right;"> 92333.07 </td> <td style="text-align:right;"> 7698632142 </td> <td style="text-align:right;"> 5407 </td> </tr>
  <tr> <td style="text-align:left;"> J </td> <td style="text-align:right;"> 0.9401123 </td> <td style="text-align:right;"> 0.9398121 </td> <td style="text-align:right;"> 1088.831 </td> <td style="text-align:right;"> 3131.733 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> -23612.81 </td> <td style="text-align:right;"> 47257.62 </td> <td style="text-align:right;"> 47352.66 </td> <td style="text-align:right;"> 3311246600 </td> <td style="text-align:right;"> 2793 </td> </tr>
 </tbody>
</table>

---
class: left

## Refresh-R: Modelling

We can use the `augment` function from the `broom` package to get *observation-level* statistics for our linear models.

```r
diamonds %>%
  group_by(color) %>%
  do(augment(lm(price ~ carat + cut + clarity + depth + table, data = .))) %>%
  head
```

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> color </th> <th style="text-align:right;"> price </th> <th style="text-align:right;"> carat </th> <th style="text-align:left;"> cut </th> <th style="text-align:left;"> clarity </th> <th style="text-align:right;"> depth </th> <th style="text-align:right;"> table </th> <th style="text-align:right;"> .fitted </th> <th style="text-align:right;"> .se.fit </th> <th style="text-align:right;"> .resid </th> <th style="text-align:right;"> .hat </th> <th style="text-align:right;"> .sigma </th> <th style="text-align:right;"> .cooksd </th> <th style="text-align:right;"> .std.resid </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 357 </td> <td style="text-align:right;"> 0.23 </td> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 60.5 </td> <td style="text-align:right;"> 61 </td> <td style="text-align:right;"> -738.0208 </td> <td style="text-align:right;"> 48.04253 </td> <td style="text-align:right;"> 1095.0208 </td> <td style="text-align:right;"> 0.0018742 </td> <td style="text-align:right;"> 1109.727 </td> <td style="text-align:right;"> 0.0001221 </td> <td style="text-align:right;"> 0.9876762 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 402 </td> <td style="text-align:right;"> 0.23 </td> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> VS1 </td> <td style="text-align:right;"> 61.9 </td> <td style="text-align:right;"> 58 </td> <td style="text-align:right;"> -348.5558 </td> <td style="text-align:right;"> 51.35991 </td> <td style="text-align:right;"> 750.5558 </td> <td style="text-align:right;"> 0.0021420 </td> <td style="text-align:right;"> 1109.769 </td> <td style="text-align:right;"> 0.0000656 </td> <td style="text-align:right;"> 0.6770698 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 403 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 60.8 </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> -369.9372 </td> <td style="text-align:right;"> 42.32684 </td> <td style="text-align:right;"> 772.9372 </td> <td style="text-align:right;"> 0.0014548 </td> <td style="text-align:right;"> 1109.767 </td> <td style="text-align:right;"> 0.0000472 </td> <td style="text-align:right;"> 0.6970199 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 403 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 65.2 </td> <td style="text-align:right;"> 56 </td> <td style="text-align:right;"> -287.9032 </td> <td style="text-align:right;"> 60.17899 </td> <td style="text-align:right;"> 690.9032 </td> <td style="text-align:right;"> 0.0029408 </td> <td style="text-align:right;"> 1109.775 </td> <td style="text-align:right;"> 0.0000764 </td> <td style="text-align:right;"> 0.6235073 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 403 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> VS1 </td> <td style="text-align:right;"> 58.4 </td> <td style="text-align:right;"> 63 </td> <td style="text-align:right;"> -557.2384 </td> <td style="text-align:right;"> 76.61276 </td> <td style="text-align:right;"> 960.2384 </td> <td style="text-align:right;"> 0.0047662 </td> <td style="text-align:right;"> 1109.745 </td> <td style="text-align:right;"> 0.0002402 </td> <td style="text-align:right;"> 0.8673639 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 404 </td> <td style="text-align:right;"> 0.22 </td> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 59.3 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> -732.0338 </td> <td style="text-align:right;"> 49.90248 </td> <td style="text-align:right;"> 1136.0338 </td> <td style="text-align:right;"> 0.0020222 </td> <td style="text-align:right;"> 1109.720 </td> <td style="text-align:right;"> 0.0001419 </td> <td style="text-align:right;"> 1.0247445 </td> </tr>
 </tbody>
</table>

According to these models, some diamonds have... negative prices?

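---
class: left

## Refresh-R: Modelling

A linear model on the raw price scale has nothing stopping its fitted values from dropping below zero. One common remedy is to model `log(price)` and back-transform with `exp()`, which can only return positive prices. The sketch below is an assumption for illustration, not necessarily the transformation used elsewhere in this talk; `diamonds_d`, `fit_raw` and `fit_log` are hypothetical names.

```r
library(dplyr)
library(ggplot2) # provides the diamonds data

# Assumption: reuse the colour "D" diamonds shown in the augment() output
diamonds_d <- diamonds %>% filter(color == "D")

# Model on the raw price scale, with the same predictors as the slides
fit_raw <- lm(price ~ carat + cut + clarity + depth + table, data = diamonds_d)

# Same predictors, but with log(price) as the response
fit_log <- lm(log(price) ~ carat + cut + clarity + depth + table, data = diamonds_d)

# Count fitted prices at or below zero under each model;
# exp() maps the log-scale fit back to dollars
sum(fitted(fit_raw) <= 0)
sum(exp(fitted(fit_log)) <= 0)
```

Whether the log-scale model actually fits better is a separate question; a residual plot is one way to check.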
---
class: left

## Refresh-R: Modelling

```r
diamonds %>%
  group_by(color) %>%
  do(augment(lm(price ~ carat + cut + clarity + depth + table, data = .))) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_hline(yintercept = 0) +
  geom_point() +
  facet_wrap(color ~ ., ncol = 2)
```

---
class: center

## Refresh-R: Modelling

<img src="user2018_full_files/figure-html/diamonds_ggplot_display-1.png" width="1000" height="500" />

---
class: center

## Refresh-R: Modelling

.pull-left[
<img src="user2018_full_files/figure-html/diamond_unmodified-1.png" width="500" height="500" />
]

.pull-right[
<img src="user2018_full_files/figure-html/diamond_modified-1.png" width="500" height="500" />
]

---
class: inverse, center, middle

# Modifying data

<img src="https://imgs.xkcd.com/comics/machine_learning.png" style="width: 35%; height: 35%;">

https://xkcd.com/1838/

---
class: left

## Modifying data

* Centering
* Scaling
* Filters
* PCA
* Encoding/decoding
* Missing value imputation
* Feature engineering

```r
library(recipes)
```

<img src="hex/recipes.png" style="width: 18%; height: 18%;"></img>

---

## `ames` housing data

.pull-left[
```r
library(leaflet)
leaflet(ames) %>%
  addTiles() %>%
  addCircleMarkers(radius = 1)
```
]

.pull-right[
]

---

## Infrequently occurring levels

.pull-left[
```r
ames %>%
  ggplot(aes(x = Neighborhood)) +
  geom_bar(fill = "#6d1e3b") +
  coord_flip()
```
]

.pull-right[
<img src="user2018_full_files/figure-html/neighbourhood_plot-1.png" width="500" height="500" />
]

---

## Non-normally distributed data

.pull-left[
```r
ames %>%
  ggplot(aes(x = Lot_Area)) +
  geom_density() +
  xlim(0, 30000)
```
]

.pull-right[
<img src="user2018_full_files/figure-html/lot_area_plot-1.png" width="500" height="500" />
]

---

## Preparing and baking a recipe

```r
library(rsample)
# `prop` is spelled out in full rather than abbreviated to `p`
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
```

<img src="hex/rsample.png" style="width: 15%; height: 15%;"></img>

---

## Preparing and baking a recipe

```r
library(rsample)
# `prop` is spelled out in full rather than abbreviated to `p`
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
```

```r
recipe(
  Sale_Price ~ Longitude + Latitude + Neighborhood + Lot_Area,
  data = ames_train
) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_log(Sale_Price, base = 10) %>%
  step_YeoJohnson(Lot_Area) %>%
  prep(training = ames_train) -> ames_recipe
```

--

```r
ames_train_baked <- ames_recipe %>% bake(ames_train)
ames_test_baked <- ames_recipe %>% bake(ames_test)
```

---

## Linear models

.pull-left[
<img src="user2018_full_files/figure-html/ames_unmodified_lm-1.png" width="500" height="500" />
]

.pull-right[
<img src="user2018_full_files/figure-html/ames_modified_lm-1.png" width="500" height="500" />
]

---
class: inverse, middle, center

# Missing values

<img src="missing_value_imputation.jpg" style="width: 30%; height: 30%;"></img>

Mean value imputation

---

## Ozone data

.pull-left[
```r
library(naniar)
ozone <- read_csv("ozoneNA.csv") %>%
  select(-X1, -WindDirection)
vis_miss(ozone)
```

<img src="hex/naniar.png" style="width: 30%; height: 30%;"></img>
]

.pull-right[
<img src="user2018_full_files/figure-html/naniar_plot-1.png" width="500" height="500" />
]

---
class: middle, center

## Visualising missing data

<img src="user2018_full_files/figure-html/feed_visdat-1.png" width="1000" height="500" />

---
class: inverse, center, middle

# Time series

<img src="https://media.giphy.com/media/QYumYBnNm3Qw8/giphy.gif" style="width: 70%; height: 70%;"></img>

---
class: left

## `fable`

```r
library(tsibble)
library(fable)
```

```r
cafe <- fpp2::auscafe %>% as_tsibble
cafe %>% head
```

.pull-left[
<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> index </th> <th style="text-align:right;"> value </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> 1982 Apr </td> <td style="text-align:right;"> 0.3424 </td> </tr>
  <tr> <td style="text-align:left;"> 1982 May </td> <td style="text-align:right;"> 0.3421 </td> </tr>
  <tr> <td style="text-align:left;"> 1982 Jun </td> <td style="text-align:right;"> 0.3287 </td> </tr>
  <tr> <td style="text-align:left;"> 1982 Jul </td> <td style="text-align:right;"> 0.3385 </td> </tr>
  <tr> <td style="text-align:left;"> 1982 Aug </td> <td style="text-align:right;"> 0.3315 </td> </tr>
  <tr> <td style="text-align:left;"> 1982 Sep </td> <td style="text-align:right;"> 0.3419 </td> </tr>
 </tbody>
</table>
]

.pull-right[
<img src="hex/tsibble.png" style="width: 30%; height: 30%;"></img><img src="hex/fable.png" style="width: 30%; height: 30%;"></img>
]

---
class: left

## `fable`

```r
# Example by Rob Hyndman
cafe %>%
  ARIMA(log(value) ~ pdq(2, 1, 1) + PDQ(2, 1, 2)) %>%
  forecast() %>%
  autoplot()
```

<img src="user2018_full_files/figure-html/fable_plot-1.png" width="1000" height="400" />

---
class: inverse, center, middle

<img src="hex/shiny.png" style="width: 50%; height: 50%;"></img>