Exploration of feed data

Setup

library(tidyverse)
library(sqldf)
library(visdat)
library(broom)
library(lmtest) 
library(forecast)
library(hts)

Load the feed data and change the column names to lower-case with underscores.

feed_raw <- "feed.csv" %>% 
    read_csv %>% 
    rename_all(.funs = function(x) {gsub(" ", "_", x) %>% tolower})

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   `Area Abbreviation` = col_character(),
##   Area = col_character(),
##   Item = col_character(),
##   Element = col_character(),
##   Unit = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

## See spec(...) for full column specifications.

At this point we’ve loaded the raw csv data with the read_csv function. read_csv does not convert strings to factors, but we do have a few factors here. The year variables are all integers and begin with Y, while the longitude and latitude variables are numerics. The rest we convert to factors.

feed_raw <- feed_raw %>% mutate_at(
vars(-starts_with("y"), -starts_with("l")), 
funs(as.factor)
)

The initial wide format isn’t very convenient for modelling, but it is appropriate for visualising the missing data. Some missing data, but it seems that a row is either complete, or is missing an entire string of years. It actually looks as though collection starts in different years for each item, but once collection starts it continues uninterrupted. What happened in 1993?

```r
feed_raw %>% vis_dat(warn_large_data = FALSE) +
theme(axis.text = element_text(size = 8)) 
```

<img src="EDA_files/figure-html/unnamed-chunk-2-1.png" width="100%" />

Let’s clean this up. We gather all of those year columns into a simple Key/Value long format. We then take the four right-most characters of each year, effectively dropping the “Y” in front of each year.

feed <- feed_raw %>% 
gather(key = year, value = production, starts_with("y")) %>% 
mutate(year = year %>% substr(2, 5) %>% as.numeric)

Change in food and feed production over the years.

feed %>%
group_by(year, element) %>% 
summarise(production = production %>% sum(na.rm = TRUE)) %>%
ggplot(aes(x = year, y = production, group = element, colour = element)) +
geom_line() + 
scale_x_continuous(
breaks = seq(min(feed$year), max(feed$year), 4)
)

We use an SQL join to introduce a change column for each (Area, Item, Element) tuple, so that we can track how much production has increased or decreased since last year. I can’t think of a clean way to do this with the usual R merge function, and SQL seems so well-suited for the task.

feed <- sqldf(
    "
    SELECT       CURRENTYEAR.*
    ,LASTYEAR.production AS production_last_year
    ,CURRENTYEAR.production - LASTYEAR.production as production_change
    FROM        feed AS CURRENTYEAR
    LEFT JOIN   feed AS LASTYEAR
    ON      CURRENTYEAR.area = LASTYEAR.area
    AND     CURRENTYEAR.item = LASTYEAR.item
    AND     CURRENTYEAR.element = LASTYEAR.element
    AND     CURRENTYEAR.year = LASTYEAR.year + 1
    "
)

Heat map showing change in production for each category.

feed %>% 
    group_by(year, item) %>%
    filter(!is.na(production_last_year)) %>% 
    summarise(
        production = production %>% sum(na.rm = TRUE),
        production_last_year = production_last_year %>% sum(na.rm = TRUE)
    ) %>% 
    mutate(percent_production_change = 
               (production - production_last_year) / production_last_year) %>% 
    ggplot(aes(x = year, y = item, fill = percent_production_change)) +
    geom_raster() +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                         limits = c(-0.5, 0.5))

Data quality

There’s a mixup with items and elements:

feed %>% 
distinct(item, item_code) %>% 
count(item) %>% 
filter(n > 1) %>% 
merge(feed, by = "item") %>% 
group_by(item, item_code) %>% 
summarise(production = sum(production, na.rm = TRUE))

## # A tibble: 4 x 3
## # Groups:   item [?]
##   item                    item_code production
##   <fct>                   <fct>          <int>
## 1 Eggs                    2744         3394126
## 2 Eggs                    2949         3394126
## 3 Milk - Excluding Butter 2848        44763627
## 4 Milk - Excluding Butter 2948        44763627

However, there is no duplication of item_code. Moreover, it seems as though the production figures for 2949 and 2744 are identical, as are 2848 and 2948. As such, we can safely dedupe the data by excluding an arbitrary item code in each pair.

feed <- feed %>% filter(!(item_code %in% c(2948, 2949)))

Some quick models

Creating quick linear models for every element/item combination. We know that data collection doesn’t start in the same year for each item, but once it does start there are no more missing observations. By dropping the missing observations, we’re training our models only on the years after data collection has started.

feed %>% 
filter(!is.na(production)) %>% 
group_by(year, element, item) %>% 
summarise(production = sum(production)) %>% 
group_by(element, item) %>% 
do(glance(lm(production ~ year, data = .)))

## # A tibble: 203 x 13
## # Groups:   element, item [203]
##    element item    r.squared adj.r.squared  sigma statistic  p.value    df
##    <fct>   <fct>       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <int>
##  1 Feed    Animal…   0.711          0.705  2.73e2   126.    2.33e-15     2
##  2 Feed    Apples…   0.465          0.454  1.37e2    44.2   1.93e- 8     2
##  3 Feed    Aquati…   0.507          0.497  7.06e1    52.4   2.28e- 9     2
##  4 Feed    Aquati…   0.507          0.497  7.06e1    52.4   2.28e- 9     2
##  5 Feed    Bananas   0.520          0.511  2.35e2    55.4   1.10e- 9     2
##  6 Feed    Barley…   0.732          0.727  1.01e4   139.    3.33e-16     2
##  7 Feed    Beans     0.785          0.781  3.85e2   187.    1.13e-18     2
##  8 Feed    Bovine…   0.664          0.647  1.28e1    39.5   3.92e- 6     2
##  9 Feed    Butter…   0.00709       -0.0124 9.32e0     0.364 5.49e- 1     2
## 10 Feed    Cassav…   0.890          0.887  6.32e3   411.    4.60e-26     2
## # ... with 193 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
## #   BIC <dbl>, deviance <dbl>, df.residual <int>

There are some encouraging results in here, but unfortunately we have some autocorrelation issues. That is to say, the residuals correlate with one another (with a lag of 1). We know this because of the Durbin-Watson test:

feed %>% 
filter(!is.na(production)) %>% 
group_by(year) %>% 
summarise(production = sum(production)) %>% 
lmtest::dwtest(production ~ year, data = .)

## 
##  Durbin-Watson test
## 
## data:  production ~ year
## DW = 0.112, p-value < 0.00000000000000022
## alternative hypothesis: true autocorrelation is greater than 0

We can see this autocorrelation in the residuals plot:

feed %>% 
filter(!is.na(production)) %>%
group_by(year) %>% 
summarise(production = sum(production)) %>% 
lm(production ~ year, data = .) %>% 
ggplot(aes(x = .fitted, y = .resid)) + 
geom_point() + 
geom_hline(yintercept = 0)

We can define a hierarchical time series, in which every feed/food and item combination is forecast as an independent time series, which are aggregated up to the top level.

element_item_ts <- feed %>% 
mutate(element_item = paste0(element, item_code)) %>% 
group_by(year, element_item) %>% 
summarise(production = sum(production, na.rm = TRUE)) %>% 
spread(key = element_item, value = production) %>%
ungroup %>% 
ts( 
data = as.matrix(select(., -year)),
start = min(.[["year"]]), 
end = max(.[["year"]]), 
frequency = 1 
) %>% 
gts(characters = list(c(4), c(4)), gnames = c("element", "item_code"))

Forecast the hierarchical time series bottom-up. The bottom graph contains over 200 time series, so isn’t a very useful data visualisation.

forecast(element_item_ts, h = 10, fmethod = "arima", method = "bu") %>% 
    plot(include = 20)

Possible directions

Different time series forecast methods
Cross-validation of forecasting
Exogenous variables
Different types of categorisations, eg. climate based on latitude and longitude

Exploration of feed data

mdneuzerling

June 6, 2018

Setup

Data quality

Some quick models

Possible directions