library(tidyverse)
library(sqldf)
library(visdat)
library(broom)
library(lmtest)
library(forecast)
library(hts)
Load the feed data and change the column names to lower-case with underscores.
feed_raw <- "feed.csv" %>%
read_csv %>%
rename_all(.funs = function(x) {gsub(" ", "_", x) %>% tolower})
## Parsed with column specification:
## cols(
## .default = col_integer(),
## `Area Abbreviation` = col_character(),
## Area = col_character(),
## Item = col_character(),
## Element = col_character(),
## Unit = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
## See spec(...) for full column specifications.
At this point we’ve loaded the raw csv data with the read_csv
function. read_csv
does not convert strings to factors, but we do have a few factors here. The year variables are all integers and begin with Y, while the longitude and latitude variables are numerics. The rest we convert to factors.
feed_raw <- feed_raw %>% mutate_at(
vars(-starts_with("y"), -starts_with("l")),
funs(as.factor)
)
The initial wide format isn’t very convenient for modelling, but it is appropriate for visualising the missing data. Some missing data, but it seems that a row is either complete, or is missing an entire string of years. It actually looks as though collection starts in different years for each item, but once collection starts it continues uninterrupted. What happened in 1993?
```r
feed_raw %>% vis_dat(warn_large_data = FALSE) +
theme(axis.text = element_text(size = 8))
```
<img src="EDA_files/figure-html/unnamed-chunk-2-1.png" width="100%" />
Let’s clean this up. We gather all of those year columns into a simple Key/Value long format. We then take the four right-most characters of each year, effectively dropping the “Y” in front of each year.
feed <- feed_raw %>%
gather(key = year, value = production, starts_with("y")) %>%
mutate(year = year %>% substr(2, 5) %>% as.numeric)
Change in food and feed production over the years.
feed %>%
group_by(year, element) %>%
summarise(production = production %>% sum(na.rm = TRUE)) %>%
ggplot(aes(x = year, y = production, group = element, colour = element)) +
geom_line() +
scale_x_continuous(
breaks = seq(min(feed$year), max(feed$year), 4)
)
We use an SQL join to introduce a change column for each (Area, Item, Element) tuple, so that we can track how much production has increased or decreased since last year. I can’t think of a clean way to do this with the usual R merge
function, and SQL seems so well-suited for the task.
feed <- sqldf(
"
SELECT CURRENTYEAR.*
,LASTYEAR.production AS production_last_year
,CURRENTYEAR.production - LASTYEAR.production as production_change
FROM feed AS CURRENTYEAR
LEFT JOIN feed AS LASTYEAR
ON CURRENTYEAR.area = LASTYEAR.area
AND CURRENTYEAR.item = LASTYEAR.item
AND CURRENTYEAR.element = LASTYEAR.element
AND CURRENTYEAR.year = LASTYEAR.year + 1
"
)
Heat map showing change in production for each category.
feed %>%
group_by(year, item) %>%
filter(!is.na(production_last_year)) %>%
summarise(
production = production %>% sum(na.rm = TRUE),
production_last_year = production_last_year %>% sum(na.rm = TRUE)
) %>%
mutate(percent_production_change =
(production - production_last_year) / production_last_year) %>%
ggplot(aes(x = year, y = item, fill = percent_production_change)) +
geom_raster() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
limits = c(-0.5, 0.5))
There’s a mixup with items and elements:
feed %>%
distinct(item, item_code) %>%
count(item) %>%
filter(n > 1) %>%
merge(feed, by = "item") %>%
group_by(item, item_code) %>%
summarise(production = sum(production, na.rm = TRUE))
## # A tibble: 4 x 3
## # Groups: item [?]
## item item_code production
## <fct> <fct> <int>
## 1 Eggs 2744 3394126
## 2 Eggs 2949 3394126
## 3 Milk - Excluding Butter 2848 44763627
## 4 Milk - Excluding Butter 2948 44763627
However, there is no duplication of item_code. Moreover, it seems as though the production figures for 2949 and 2744 are identical, as are 2848 and 2948. As such, we can safely dedupe the data by excluding an arbitrary item code in each pair.
feed <- feed %>% filter(!(item_code %in% c(2948, 2949)))
Creating quick linear models for every element/item combination. We know that data collection doesn’t start in the same year for each item, but once it does start there are no more missing observations. By dropping the missing observations, we’re training our models only on the years after data collection has started.
feed %>%
filter(!is.na(production)) %>%
group_by(year, element, item) %>%
summarise(production = sum(production)) %>%
group_by(element, item) %>%
do(glance(lm(production ~ year, data = .)))
## # A tibble: 203 x 13
## # Groups: element, item [203]
## element item r.squared adj.r.squared sigma statistic p.value df
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Feed Animal… 0.711 0.705 2.73e2 126. 2.33e-15 2
## 2 Feed Apples… 0.465 0.454 1.37e2 44.2 1.93e- 8 2
## 3 Feed Aquati… 0.507 0.497 7.06e1 52.4 2.28e- 9 2
## 4 Feed Aquati… 0.507 0.497 7.06e1 52.4 2.28e- 9 2
## 5 Feed Bananas 0.520 0.511 2.35e2 55.4 1.10e- 9 2
## 6 Feed Barley… 0.732 0.727 1.01e4 139. 3.33e-16 2
## 7 Feed Beans 0.785 0.781 3.85e2 187. 1.13e-18 2
## 8 Feed Bovine… 0.664 0.647 1.28e1 39.5 3.92e- 6 2
## 9 Feed Butter… 0.00709 -0.0124 9.32e0 0.364 5.49e- 1 2
## 10 Feed Cassav… 0.890 0.887 6.32e3 411. 4.60e-26 2
## # ... with 193 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
## # BIC <dbl>, deviance <dbl>, df.residual <int>
There are some encouraging results in here, but unfortunately we have some autocorrelation issues. That is to say, the residuals correlate with one another (with a lag of 1). We know this because of the Durbin-Watson test:
feed %>%
filter(!is.na(production)) %>%
group_by(year) %>%
summarise(production = sum(production)) %>%
lmtest::dwtest(production ~ year, data = .)
##
## Durbin-Watson test
##
## data: production ~ year
## DW = 0.112, p-value < 0.00000000000000022
## alternative hypothesis: true autocorrelation is greater than 0
We can see this autocorrelation in the residuals plot:
feed %>%
filter(!is.na(production)) %>%
group_by(year) %>%
summarise(production = sum(production)) %>%
lm(production ~ year, data = .) %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
We can define a hierarchical time series, in which every feed/food and item combination is forecast as an independent time series, which are aggregated up to the top level.
element_item_ts <- feed %>%
mutate(element_item = paste0(element, item_code)) %>%
group_by(year, element_item) %>%
summarise(production = sum(production, na.rm = TRUE)) %>%
spread(key = element_item, value = production) %>%
ungroup %>%
ts(
data = as.matrix(select(., -year)),
start = min(.[["year"]]),
end = max(.[["year"]]),
frequency = 1
) %>%
gts(characters = list(c(4), c(4)), gnames = c("element", "item_code"))
Forecast the hierarchical time series bottom-up. The bottom graph contains over 200 time series, so isn’t a very useful data visualisation.
forecast(element_item_ts, h = 10, fmethod = "arima", method = "bu") %>%
plot(include = 20)