The desired outcome is guided by the following comment, provided by the person who uploaded this data to Kaggle:
…we are trying to determine which products we should continue to sell, and which products to remove from our inventory.
The poster discusses their motivations for uploading the data set:
We suspect that data science applied to the set…can help us generate a value (i.e., probability score) for each product, that can be used as the main determinant evaluating the inventory. Each row in the file represents one product.
Paraphrasing, the goal is to answer the following question: what is the value of an item, identified by an SKU, being in inventory? This ultimately relies on three components: the probability that the item sells, the profit made on a sale, and the cost of keeping the item in stock.
The ideal formula would be something like: \[value = P(sale) * profit\_of\_sale - cost\_of\_warehousing.\] The decision of which items to no longer stock can then be made by continuing to warehouse only those with a positive value or, alternatively, by dropping all items in a bottom portion of the stock, ranked by value.
The cost and profit components are a mystery. We don’t have access to the cost of warehousing, so we assume that it is equal for every item, and ignore it in the value calculations. We don’t have access to the profit involved in a sale, but we can substitute it with one of the three price variables. This assumes a similar profit margin for each item, but it is the best we can do with so many unknowns.
Because the historical data is taken over a six-month period, we are forced to use this interval in our calculations. That is, the \(P(sale)\) in the above formula will be calculated as the probability of a sale over a six-month period.
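To make the formula concrete, here is a quick sketch with entirely made-up numbers; the real margins and warehousing costs are unknown, so only the structure of the calculation matters.
# A sketch of the value calculation with hypothetical figures. p_sale is the
# probability of at least one sale over the six-month period, profit is proxied
# by a price variable, and the warehousing cost is assumed constant.
library(tibble)
library(dplyr)
tibble(
  sku = c("A", "B", "C"),
  p_sale = c(0.30, 0.05, 0.15),
  profit = c(20, 50, 5),
  warehousing_cost = 2
) %>%
  mutate(value = p_sale * profit - warehousing_cost) %>%
  arrange(desc(value))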
We could model the number of sales as a discrete variable or a continuous variable, requiring a categorisation or regression model, respectively. Both options should be explored, with the decision based on what will most accurately define the solution above.
Several open questions remain about the data:

- What is the strength_factor of an item?
- What is the marketing_type, and what is its impact on sales?
- What do strength_factor, release_number and, consequently, new_release_flag actually represent?
- How do the three price variables, price_reg, low_user_price and low_net_price, differ?
- How is the item_count variable, presumably a stock level, defined? There’s almost certainly some sort of business process behind this decision, even if an informal one. It’s possible that previous sales prompt greater stock numbers, and so we run the risk of modelling sales figures on a proxy of sales figures.

In the setup stage, we’re going to load some essential packages, along with the data, which is saved as sales.csv. We’re also going to rename the variables to match the tidyverse style, with lower-case letters and words separated by underscores. Finally, we convert the flag columns to factors.
library(tidyverse)
library(purrr)
library(kableExtra) # for modifying table width
library(randomForest)
library(e1071) # for svm models
sales <- "sales.csv" %>%
read.csv %>% # for some reason, readr's read_csv throws many errors
as_tibble %>%
rename(
order = Order,
file_type = File_Type,
SKU_number = SKU_number,
sold_flag = SoldFlag,
sold_count = SoldCount,
marketing_type = MarketingType,
release_number = ReleaseNumber,
new_release_flag = New_Release_Flag,
strength_factor = StrengthFactor,
price_reg = PriceReg,
release_year = ReleaseYear,
item_count = ItemCount,
low_user_price = LowUserPrice,
low_net_price = LowNetPrice
) %>%
mutate_at(
vars(contains("flag")),
as.factor
)
sales %>%
count(file_type) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
file_type | n |
---|---|
Active | 122921 |
Historical | 75996 |
I’m a big fan of the visdat
package for looking at data for the first time. It gives variable types along with a visualisation of missing data.
sales %>% sample_n(10000) %>% arrange(order) %>% visdat::vis_dat()
The missing data is entirely within the sold_flag
and sold_count
variables. We know the data is split into an historical observational period as well as active inventory, so it makes sense that we would only have sales data for historical observations. We confirm this with the naniar
package:
sales %>%
select(file_type, sold_count, sold_flag) %>%
group_by(file_type) %>%
naniar::miss_var_summary() %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
file_type | variable | n_miss | pct_miss | n_miss_cumsum |
---|---|---|---|---|
Historical | sold_count | 0 | 0 | 0 |
Historical | sold_flag | 0 | 0 | 0 |
Active | sold_count | 122921 | 100 | 122921 |
Active | sold_flag | 122921 | 100 | 245842 |
Given that we can only model on historical data, we now split the sales data into two data sets: historical
and active
. We remove the sold_flag
and sold_count
variables from the active tibble, since they serve no purpose there.
historical <- sales %>%
filter(file_type == "Historical")
active <- sales %>%
select(-one_of("sold_flag", "sold_count")) %>%
filter(file_type == "Active")
This has the benefit of making the data unique along SKU in each data set. For example, there are 75996 distinct SKUs in the historical
data set, which has 75996 rows.
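We can confirm this directly by comparing row counts with distinct SKU counts:
# Each data set should contain exactly one row per SKU_number
historical %>% summarise(rows = n(), distinct_skus = n_distinct(SKU_number))
active %>% summarise(rows = n(), distinct_skus = n_distinct(SKU_number))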
The poster of the data mentioned that sales are rare:
It is important to note that we have MANY products in our inventory, and very few of them tend to sell (only about 10% sell each year) and many of the products only have a single sale in the course of a year.
Let’s verify this claim with a histogram:
historical %>%
filter(sold_count <= 5) %>%
mutate(sold_count = sold_count %>% as.factor) %>% # makes graph look better
ggplot(aes(x = sold_count)) +
geom_bar(na.rm = TRUE, fill = "#00BFC4")
As expected, most of the items did not sell in the observation period, with only 17% of items experiencing a sale. This data imbalance will cause issues in modelling, as we will later see.
Many issues with data quality could be resolved or mitigated with a data dictionary. We can make some assumptions, such as that item_count
refers to inventory levels. Other variables, like release_number
and the related new_release_flag
, are a mystery to us.
Four items have a sold_count greater than the corresponding item_count. There are multiple possible explanations for this:

- the item_count may record the inventory at the end of the recording period, after sales had depleted the stock, or
- there may be some relationship between the item_count variable and the release_number variable.

To complicate things even further, 83 products have an item_count of 0.
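These zero-inventory products are easy to isolate if we want to inspect them (here counted within the historical set):
# Products recorded with no inventory at all
historical %>% filter(item_count == 0) %>% nrow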
historical %>%
filter(sold_count > item_count) %>%
select(SKU_number, sold_count, item_count, release_number) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
SKU_number | sold_count | item_count | release_number |
---|---|---|---|
2277016 | 2 | 0 | 5 |
613864 | 69 | 44 | 0 |
1475488 | 12 | 8 | 2 |
809190 | 12 | 11 | 5 |
Moreover, the strength_factor
of each item appears to vary between the active
stock and the historical
data sets. While we would expect the price variables to be time-dependent, it seems strange that the strength_factor
—presumably an immutable property of the item—is not fixed.
To find these mismatches we use the sqldf
package to join the historical
and active
tibbles, finding those that match in SKU_number
but not strength_factor
. While not as fast as dplyr
’s join verbs, it allows us to join on predicates rather than just columns.
strength_factor_mismatches <- sqldf::sqldf(
"
SELECT ACTIVE.SKU_number
,HISTORICAL.strength_factor AS Historicalstrength_factor
,ACTIVE.strength_factor AS Activestrength_factor
FROM historical AS HISTORICAL
INNER JOIN active AS ACTIVE
ON HISTORICAL.SKU_number = ACTIVE.SKU_number
AND HISTORICAL.strength_factor <> ACTIVE.strength_factor
"
)
strength_factor_mismatches %>% nrow
## [1] 65556
The release_year variable is particularly strange. According to Kaggle, this data was uploaded in December of 2016, yet there are products in the active inventory with a release_year of 2017, and some even with a release_year of 2018.
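These future-dated products can be pulled out directly:
# Active products with a release_year later than the December 2016 upload date
active %>%
  filter(release_year > 2016) %>%
  count(release_year)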
We will now investigate each of the variables and their impact on the target variables. We will focus predominantly on the sold_flag
for now to simplify our analysis.
We’re going to consider release_year
as a numerical variable. We could consider it as a factor, but that would make the modelling steps much more difficult, since there are 85 different years.
While we won’t explore this here, it might be easier to interpret the results of any modelling if we were to transform the variable to an age, so that we track how many years since each product was released. This changes the centre of the variable (a translation transformation), so shouldn’t have any impact on model results, but makes for a more practical interpretation.
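A sketch of that transformation, taking the 2016 upload year as the reference point (an assumption; any fixed reference year would do):
# Translate release_year into an age, using 2016 (the upload year) as the
# assumed reference point
historical %>%
  mutate(product_age = 2016 - release_year) %>%
  select(release_year, product_age) %>%
  head()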
numeric_cols <- c(
"release_year", "price_reg", "low_net_price", "low_user_price",
"item_count", "strength_factor", "release_number"
)
The ggpairs
function from the GGally
package replicates the plotmatrix
function that once existed in ggplot2
. It displays three pieces of information: correlation coefficients in the top-right triangle, scatterplots in the bottom-left triangle, and histograms along the diagonal.
historical %>% select(numeric_cols) %>% GGally::ggpairs()
There is almost no correlation between any of the numerical columns in this data. Multicollinearity can impact the interpretability of some models such as logistic regressions, since it is difficult to separate the effects of correlated variables.
Curiously, none of the three price metrics correlate either. We will explore these in more detail later.
We may wish to see if products released in certain years, perhaps later years, are more likely to sell. However, we need to be cautious about simply plotting sales against year, since we need to take into account how many items were released in a particular year as well. Below we plot the number of sales by release_year
alongside the number of items available from each release_year
:
x_labels <- seq(
min(historical$release_year),
max(historical$release_year),
by = 5
)
historical %>%
ggplot(aes(x = release_year, y = sold_count)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
scale_x_continuous(breaks = x_labels, labels = x_labels)
historical %>%
ggplot(aes(x = release_year)) +
geom_bar(fill = "#00BFC4") +
scale_x_continuous(breaks = x_labels, labels = x_labels)
To account for this, we instead consider the percentage of items sold each year. We’ll also limit the graph to show only the data with a release_year
in or after 1990, since there is little sales data before this year.
historical %>%
filter(release_year >= 1990) %>%
group_by(release_year) %>%
summarise(percent_sold = sum(sold_count) / sum(item_count)) %>%
filter(!is.nan(percent_sold)) %>% # Remove division by zero errors
ggplot(aes(x = release_year, y = percent_sold)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
scale_x_continuous(breaks = x_labels, labels = x_labels)
It appears as though products released between 2010 and 2015, inclusive, are more likely to sell.
I’m very curious about this item_count
variable. How are inventory levels currently decided by the company? Is it possible that if an item sells, more inventory is ordered in, and so we are potentially modelling sold_flag
on a proxy of sold_flag
? Is the item_count
the inventory at the beginning or end of the observation period? Did the company restock throughout those six months?
historical %>%
ggplot(aes(x = sold_flag, y = item_count, fill = sold_flag)) +
geom_boxplot(na.rm = TRUE) +
ylim(0, 50)
These box plots certainly show a positive relationship between item_count
and sold_flag
, but it’s impossible to determine the direction of the causative relationship, if there is one. That is, a higher stock level could cause more sales, a sale could trigger higher stock levels, or neither of these could be true.
It’s impossible to say precisely how we should interpret the three price variables, but we can speculate; for example, an item may be more likely to sell at its low_user_price than at its price_reg.

Let’s investigate the relationship between the three price variables and sales. For convenience, we will include the marketing_type
variable in this investigation. We change some of the labels for the factor variables to make the graph easier to read, but these changes won’t be preserved.
historical %>%
mutate(marketing_type = paste0("marketing type ", marketing_type)) %>%
gather(key = price_metric, value = price, contains("price")) %>%
ggplot(aes(x = sold_flag, y = price, fill = sold_flag)) +
geom_boxplot(na.rm = TRUE) +
facet_grid(marketing_type ~ price_metric) +
ylim(0, 200)
It appears as though the low_user_price
is related to sales, with a higher price being unexpectedly associated with a higher chance of sale. This effect seems slightly higher for marketing type “D”. There is a similar effect for price_reg
, although there is almost no difference between marketing types. There appears to be a negligible relationship between low_net_price
and sales.
Strangely, some of the prices are set to 0:
historical %>%
gather(key = price_metric, value = price, contains("price")) %>%
filter(price == 0) %>%
count(price_metric) %>%
rename(items_with_zero_price = n) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
price_metric | items_with_zero_price |
---|---|
low_net_price | 2256 |
low_user_price | 9101 |
price_reg | 1395 |
Either the respective prices for these items are missing, or the items were legitimately sourced or sold for $0. Prices in the active inventory are a little different; only 376 have a $0 low_user_price. As such, when calculating expected value we will use the low_user_price, under the assumption that the prices are legitimate.
Is the zero price of an item related to its chances of selling? We’ll plot, for each price metric, the proportion of items that have had at least one sale, split by whether or not the price is 0.
historical %>%
gather(key = price_metric, value = price, contains("price")) %>%
mutate(has_price = as.factor(price > 0)) %>%
ggplot(aes(x = has_price, fill = sold_flag)) +
geom_bar(position = "fill") +
facet_grid(. ~ price_metric) +
ylab("proportion")
Sure enough, an item being free is related to its chance of selling, but not in the way we might think; a 0 low_user_price
or low_net_price
is related to a higher chance of sale. The effect is not seen for price_reg
.
The release_number
attribute doesn’t appear to influence sales. It’s difficult to even say what this means. It’s related to the new_release_flag
, which is set to 1 if and only if the release_number
is greater than 1.
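A quick cross-tabulation is enough to check that claimed definition:
# Check that new_release_flag is 1 exactly when release_number is greater than 1
historical %>%
  count(new_release_flag, release_number_gt_1 = release_number > 1)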
historical %>%
ggplot(aes(x = sold_flag, y = release_number, fill = sold_flag)) +
geom_boxplot(na.rm = TRUE) +
ylim(0, 10)
There does appear to be a relationship between strength_factor
and sales, with weaker strength_factors
being associated with sales.
historical %>%
ggplot(aes(x = sold_flag, y = strength_factor, fill = sold_flag)) +
geom_boxplot(na.rm = TRUE) +
ylim(0, 3000000)
To make the graphs look better, we temporarily consider release_number and item_count as factors (categorical variables). Let’s investigate how item_count and release_number affect the binary sold_flag; we return to the sold_count later, in the regression models.
historical %>%
filter(release_number <= 20) %>%
mutate(release_number = release_number %>% as.factor) %>%
ggplot(aes(x = release_number, fill = sold_flag)) +
geom_bar(position = "fill", na.rm = TRUE)
historical %>%
filter(item_count <= 20) %>%
mutate(item_count = item_count %>% as.factor) %>%
ggplot(aes(x = item_count, fill = sold_flag)) +
geom_bar(position = "fill", na.rm = TRUE)
A proper modelling approach would optimise accuracy scores by considering multiple models, and within each model adjustments of parameters. We would also consider the effects of feature engineering—transformations of existing variables and creations of new variables.
In the interests of time, we will focus our attention on only a handful of models, in a somewhat arbitrary manner. We’re also going to gloss over the issue of dimensionality (e.g. reducing the number of variables used in modelling) because the number of variables is initially small relative to the sample size. Moreover, random forests (which we will use initially) are forgiving of superfluous variables.
I regret the emphasis on simple accuracy rates when I should be considering sensitivity and specificity, preferably with ROC curves.
Our target variables will be either sold_flag
for categorisation or sold_count
for regression. We will model on all variables except:

- order, which is an arbitrary index,
- SKU_number, which we take to be an arbitrary product identifier,
- new_release_flag, which is defined by the value of release_number, and
- file_type, which is redundant in the historical data set.

The imbalanced nature of the data will cause issues in modelling. Of the historical
data, 17.1% has a sold_flag
of 1. If we assume that nothing ever sells, then we will be correct 82.9% of the time. This “model” is very accurate, and entirely useless.
One of the benefits of a bootstrapping model such as a random forest is that we can control the parameters of the bootstrap samples. We can stratify the sample such that 50% of the items will be those with a sale, and the other 50% will be without a sale. This is how we define balanced_sample_size
.
balanced_sample_size <- historical %>%
filter(sold_flag == 1) %>%
nrow
rf_flag <- historical %>%
randomForest(
formula = sold_flag ~ marketing_type + release_number +
release_year + price_reg + low_net_price +
low_user_price + item_count + strength_factor, data = .,
sampsize = c(balanced_sample_size, balanced_sample_size),
strata = .$sold_flag,
ntree = 1000 # More trees means longer training time, but no risk of overfitting
)
rf_flag
##
## Call:
## randomForest(formula = sold_flag ~ marketing_type + release_number + release_year + price_reg + low_net_price + low_user_price + item_count + strength_factor, data = ., sampsize = c(balanced_sample_size, balanced_sample_size), strata = .$sold_flag, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.92%
## Confusion matrix:
## 0 1 class.error
## 0 50404 12596 0.1999365
## 1 5582 7414 0.4295168
Another benefit of a random forest is that we don’t need to cross-validate or separate the data into train and test sets, since the items outside of each bootstrap sample can be used as a sort of cross-validation against each tree. This is the out-of-bag or OOB error. This model has an OOB error of 23.92%, which is a good start.
Another way to deal with imbalanced data is to over-/under-sample. The DMwR
package contains the SMOTE
algorithm, which stands for “Synthetic Minority Over-sampling Technique”. This algorithm will over-sample the products that have sold and under-sample those that haven’t sold, so as to mitigate the impact of the imbalanced data set. The “synthetic” part of the name refers to the fact that the algorithm generates new items that were not in the original data set, using the nearest neighbours of existing data points.
Because we will be training our model on a data set that includes synthetic values, the OOB error will be testing each decision tree against data that isn’t in the original data set. As such, we will have to apply the SMOTE
algorithm to a training data set, and test against a separate test or validation data set.
When we split into train and test, we have to keep the imbalanced data in mind. That is, we want some sales in our test set. We set aside 80% of the data with a sale and 80% of the data without a sale into our train set, and the rest goes into the test set.
historical_train <-
historical %>% filter(sold_flag == 0) %>% sample_frac(0.8) %>%
rbind(
historical %>% filter(sold_flag == 1) %>% sample_frac(0.8)
)
historical_test <- historical %>%
anti_join(historical_train, by = "SKU_number")
We now apply the SMOTE
algorithm to the training set.
historical_SMOTE <- historical_train %>%
as.data.frame %>% # DMwR package doesn't work with tibbles
DMwR::SMOTE(
sold_flag ~ marketing_type + release_number +
release_year + price_reg + low_net_price +
low_user_price + item_count + strength_factor,
data = .
) %>%
as_tibble
The default under- and over-sampling settings seem to produce a roughly balanced data set (4:3). We can now train another random forest on this balanced training set, and measure its accuracy against the test set.
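Before doing so, a quick check of the class balance; the exact counts will vary with the random seed.
# Class balance of the SMOTE-balanced training set
historical_SMOTE %>% count(sold_flag)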
rf_smote_flag <- historical_SMOTE %>%
randomForest(
formula = sold_flag ~ marketing_type + release_number +
release_year + price_reg + low_net_price +
low_user_price + item_count + strength_factor,
data = .,
ntree = 1000
)
rf_smote_flag_preds <- predict(rf_smote_flag, historical_test, "response")
rf_smote_flag_cm <- rf_smote_flag_preds %>%
caret::confusionMatrix(historical_test$sold_flag)
rf_smote_flag_cm %>% print
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11045 1493
## 1 1555 1106
##
## Accuracy : 0.7995
## 95% CI : (0.793, 0.8058)
## No Information Rate : 0.829
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.2993
## Mcnemar's Test P-Value : 0.2692
##
## Sensitivity : 0.8766
## Specificity : 0.4255
## Pos Pred Value : 0.8809
## Neg Pred Value : 0.4156
## Prevalence : 0.8290
## Detection Rate : 0.7267
## Detection Prevalence : 0.8249
## Balanced Accuracy : 0.6511
##
## 'Positive' Class : 0
##
The accuracy is roughly 80%, which is an improvement over the random forest with stratified sampling. The summary statistics are interesting, because we’re told that the accuracy is less than the no information rate (NIR). This means that if we simply assumed that no item sold, we would obtain a higher accuracy, as we mentioned earlier. Of course, this model wouldn’t be useful at all. By sacrificing specificity for sensitivity (treating a sale as the positive outcome, the reverse of caret’s labelling above) we obtain a less “accurate” model, but a more useful one.
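As mentioned earlier, ROC curves would give a fuller picture of this trade-off than a single accuracy figure. A minimal sketch, assuming the pROC package is available (it isn’t used elsewhere in this analysis):
# Predicted probabilities of a sale for the test set
rf_smote_flag_probs <- predict(rf_smote_flag, historical_test, type = "prob")[, 2]
# ROC curve and area under the curve for the SMOTE-balanced random forest
rf_smote_flag_roc <- pROC::roc(historical_test$sold_flag, rf_smote_flag_probs)
plot(rf_smote_flag_roc)
pROC::auc(rf_smote_flag_roc)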
Logistic regressions are simple but generally require a bit more massaging than random forests. We have to worry about transformations, binning and capping. We will briefly investigate one as a curiosity.
logistic_flag <- historical_SMOTE %>% glm(
formula = sold_flag ~ marketing_type + release_number +
release_year + price_reg + low_net_price +
low_user_price + item_count + strength_factor,
data = .,
family = binomial(link = "logit")
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
logistic_flag_cm <- logistic_flag %>%
predict(historical_test, type = "response") %>%
{ifelse(. > 0.5, 1, 0)} %>% # braces let us control where the %>% puts the argument
as.factor %>%
caret::confusionMatrix(historical_test$sold_flag)
logistic_flag_cm %>% print
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10312 1252
## 1 2288 1347
##
## Accuracy : 0.7671
## 95% CI : (0.7603, 0.7738)
## No Information Rate : 0.829
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2907
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.8184
## Specificity : 0.5183
## Pos Pred Value : 0.8917
## Neg Pred Value : 0.3706
## Prevalence : 0.8290
## Detection Rate : 0.6785
## Detection Prevalence : 0.7608
## Balanced Accuracy : 0.6683
##
## 'Positive' Class : 0
##
The accuracy of the model is around 77%. This isn’t as accurate as the random forests, but given how little effort we put into preparing the data it’s still very impressive.
We will now investigate a regression model in which we attempt to predict not just whether a sale will occur, but how many sales will occur. That is, we will use sold_count
as the target variable, instead of sold_flag
. This opens the possibility of a non-integral value of sold items, which makes sense when we consider that we will be calculating expected value.
We’ll begin with a simple linear regression. Once again we will apply the SMOTE
algorithm to emphasise data with a sale, sacrificing specificity for sensitivity.
lm_SMOTE_count <- historical_SMOTE %>%
lm(
sold_count ~ marketing_type + release_number +
release_year + price_reg + low_net_price +
low_user_price + item_count + strength_factor,
data = .
)
lm_SMOTE_count_preds <- predict(lm_SMOTE_count, historical_test)
lm_SMOTE_count %>% summary
##
## Call:
## lm(formula = sold_count ~ marketing_type + release_number + release_year +
## price_reg + low_net_price + low_user_price + item_count +
## strength_factor, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.642 -0.579 -0.192 0.313 68.316
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -19.987208114423 1.992163237187 -10.033
## marketing_typeS -0.491880477053 0.010707232495 -45.939
## release_number 0.033346792914 0.001456064656 22.902
## release_year 0.010070715673 0.000992891954 10.143
## price_reg -0.001130300697 0.000074289504 -15.215
## low_net_price 0.000158827064 0.000066669742 2.382
## low_user_price 0.000463493169 0.000060597153 7.649
## item_count 0.015133000323 0.000127500749 118.690
## strength_factor -0.000000025624 0.000000004205 -6.093
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## marketing_typeS < 0.0000000000000002 ***
## release_number < 0.0000000000000002 ***
## release_year < 0.0000000000000002 ***
## price_reg < 0.0000000000000002 ***
## low_net_price 0.0172 *
## low_user_price 0.0000000000000205 ***
## item_count < 0.0000000000000002 ***
## strength_factor 0.0000000011130737 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.355 on 72770 degrees of freedom
## Multiple R-squared: 0.2675, Adjusted R-squared: 0.2674
## F-statistic: 3322 on 8 and 72770 DF, p-value: < 0.00000000000000022
This linear model has an adjusted \(R^2\) of 0.2674038, which is not particularly encouraging. When compared to the test set, the linear model has a mean squared error of 1.8030016, corresponding to a root mean squared error of roughly 1.34. That is, the prediction is out on average by about one sale.
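For reference, the quoted mean squared error is just the average squared difference between the predictions and the actual sold_count values in the test set:
# Mean squared error of the linear model against the test set
mean((lm_SMOTE_count_preds - historical_test$sold_count)^2)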
While I don’t have a thorough understanding of support vector machines, I believe that certain loss functions require normalisation beforehand. I’ve also heard that SVMs train faster on normalised data. In any case, it can’t do any harm, so we normalise the numeric columns before we fit the model. We do this using the recipes
package, which allows us to prepare the normalisation steps based on the (SMOTE
) training set, and then apply the results to the test set.
library(recipes)
columns_to_normalise <- c(
"price_reg", "low_net_price", "low_user_price", "strength_factor",
"release_year" # I'm uncertain about normalising years
)
historical_recipe <- recipe(
sold_count ~ marketing_type + release_number + release_year +
price_reg + low_net_price + low_user_price + item_count + strength_factor,
data = historical_SMOTE
) %>%
step_center(columns_to_normalise) %>%
step_scale(columns_to_normalise) %>%
prep(training = historical_SMOTE) # prepare the centering and scaling on the SMOTE training set
historical_SMOTE_baked <- bake(historical_recipe, newdata = historical_SMOTE)
historical_test_baked <- bake(historical_recipe, newdata = historical_test)
We can now feed the normalised data into a support vector machine model.
svm_count <- historical_SMOTE_baked %>% svm(
sold_count ~ marketing_type + release_number + release_year +
price_reg + low_net_price + low_user_price + item_count + strength_factor,
data = .
)
svm_preds <- predict(svm_count, historical_test_baked)
svm_count %>% summary
##
## Call:
## svm(formula = sold_count ~ marketing_type + release_number +
## release_year + price_reg + low_net_price + low_user_price +
## item_count + strength_factor, data = .)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1111111
## epsilon: 0.1
##
##
## Number of Support Vectors: 44581
The mean squared error for this model, when compared to the actuals of the test set, is 1.6355135, corresponding to a root mean squared error of roughly 1.28. Once again, we are off on average by around one sale.
If regression is proving too difficult, can we attempt to model the sold_count
variable as a categorical variable? Let’s take the simplest multinomial model: a trinomial model. We will cap all sold_count
values higher than 2 at 2, and attempt to predict the three outcomes.
We’ll be capping the synthetically balanced SMOTE
data. This data was balanced along the sold_flag
binary variable. We don’t need to worry about the SMOTE algorithm introducing non-zero sold_count
values when the sold_flag
is 0; because the algorithm uses a nearest-neighbours approach when balancing the data along sold_flag
, every synthetic sold_count
will be 0 when sold_flag
is 0.
historical_SMOTE %>% filter(sold_flag == 0 & sold_count != 0) %>% nrow
## [1] 0
The SMOTE algorithm will generate new (synthetic) data points with non-integral values of sold_count
. When generating the new category, we will round the non-integral synthetic entries.
sold_cap <- 2
historical_SMOTE_capped <- historical_SMOTE %>%
mutate(
sold_count_capped = case_when(
sold_count <= sold_cap ~ round(sold_count),
TRUE ~ sold_cap
) %>% as.factor
)
In order to test our model, we have to cap the sold_count
in the test set as well. Strangely, we have to use as.numeric
when capping the sold_count
, even though sold_count
is an integer. See dplyr issue 2365.
historical_test_capped <- historical_test %>%
mutate(
sold_count_capped = case_when(
sold_count <= sold_cap ~ as.numeric(sold_count),
TRUE ~ sold_cap
) %>% as.factor
)
We can now train our trinomial model and compare it to the capped test set:
rf_smote_count_capped <- historical_SMOTE_capped %>%
randomForest(
formula = sold_count_capped ~ marketing_type + release_number +
release_year + price_reg + low_net_price +
low_user_price + item_count + strength_factor,
data = .,
ntree = 1000
)
rf_smote_count_capped_preds <- rf_smote_count_capped %>%
predict(historical_test_capped, "response")
rf_smote_count_capped_cm <- rf_smote_count_capped_preds %>%
caret::confusionMatrix(historical_test_capped$sold_count_capped)
rf_smote_count_capped_cm %>% print
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 11773 1285 631
## 1 472 144 118
## 2 355 197 224
##
## Overall Statistics
##
## Accuracy : 0.7988
## 95% CI : (0.7923, 0.8052)
## No Information Rate : 0.829
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1785
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9344 0.088561 0.23022
## Specificity 0.2628 0.956531 0.96120
## Pos Pred Value 0.8600 0.196185 0.28866
## Neg Pred Value 0.4523 0.897546 0.94807
## Prevalence 0.8290 0.106981 0.06402
## Detection Rate 0.7746 0.009474 0.01474
## Detection Prevalence 0.9007 0.048293 0.05106
## Balanced Accuracy 0.5986 0.522546 0.59571
The accuracy of the model is around 80%. This might seem encouraging, but what happens if we reduce this to a simple zero/non-zero classification model?
rf_smote_count_capped_preds_reduced <-
replace(rf_smote_count_capped_preds, rf_smote_count_capped_preds != 0, 1) %>%
droplevels
rf_smote_count_capped_reduced_cm <- rf_smote_count_capped_preds_reduced %>%
caret::confusionMatrix(historical_test$sold_flag)
The accuracy in this reduced binomial classification model is 82%. This might seem like an improvement, but from the confusion matrix we see a rise in false negatives.
It seems strange that we would see an improvement in accuracy by increasing the number of possible outcomes, and then reducing those outcomes after the model is trained. To investigate further, let’s compare this trinomial model with the binomial random forest we trained earlier using the SMOTE
-balanced data:
binomial_trinomial_comparison <- tibble(
actual = historical_test$sold_flag,
binomial = rf_smote_flag_preds,
reduced_trinomial = rf_smote_count_capped_preds_reduced
)
binomial_trinomial_comparison %>%
count(actual, binomial, reduced_trinomial) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
actual | binomial | reduced_trinomial | n |
---|---|---|---|
0 | 0 | 0 | 11038 |
0 | 0 | 1 | 7 |
0 | 1 | 0 | 735 |
0 | 1 | 1 | 820 |
1 | 0 | 0 | 1491 |
1 | 0 | 1 | 2 |
1 | 1 | 0 | 425 |
1 | 1 | 1 | 681 |
Now we can see where this increase in accuracy has come from. The reduced trinomial model sees a decrease in false positives and an increase in false negatives. The decrease in false positives is larger, so we see the accuracy increase, but because of the imbalanced data a false negative is more damaging than a false positive.
One possible explanation is that the SMOTE
-balanced data is almost balanced: Roughly 43% of the entries have a sold_flag
of 1. When we introduce an additional value for the variable (a sold_count
of 2 or greater) we split this portion up into two smaller components. The data becomes greatly imbalanced again, and we encounter the same issue we had before we balanced the data: the best way to improve model accuracy is to assume, when in doubt, that a given item will not sell. This improves model accuracy at the cost of model usefulness.
We’ve had the most luck with the categorisation models, so we will focus on trying to predict sold_flag
, rather than sold_count
. While not ideal, the sold_flag
should still serve to estimate the value of an item. We must remember that even a single sale is a rare event, so the difference between a sale and non-sale is much more important than the difference between 1 and 2 sales.
We will use the random forest we trained on data balanced with the SMOTE
algorithm, since this offered a better predictive accuracy than the random forest trained using stratified bootstrap sampling.
Now that we’ve settled on a model, we can plot how this random forest believes that the numeric variables impact the sold_flag
. The visualisation below is adapted from this StackExchange answer by Zachary Deane-Mayer.
These graphs are based on the set on which the random forest was trained. This allows us to see how the model was trained, before we introduce unseen data to it.
preds_on_history <- predict(rf_smote_flag, historical_SMOTE, "vote")[,2]
plotData <- lapply(numeric_cols, function(x) {
out <- tibble(
var = x,
value = historical_SMOTE[[x]],
sold_flag = preds_on_history
)
out$value <- out$value-min(out$value) #Normalize to [0,1]
out$value <- out$value/max(out$value)
out
})
plotData <- do.call(rbind, plotData)
qplot(value, sold_flag, data = plotData, facets = ~ var, geom='smooth',
span = 0.5)
We can see from the shaded areas the effect that the outliers have. For example, it is difficult to determine the impact that release_year
has on the outcome for early years, because of a lack of data.
rf_smote_flag %>%
varImpPlot(main = "Variable effect according to random forest using SMOTE")
Let’s take a moment to consider the differences between the historical
and active
data sets, to anticipate the predictions that our model may make. We will explore the four most significant variables, as identified by the random forest model.
sales %>%
ggplot(aes(x = file_type, y = strength_factor, fill = file_type)) +
geom_boxplot(na.rm = TRUE) +
ylim(0, 120000)
We noted earlier that, for items shared across the historical
and active
inventories, the strength_factor
of individual items changes between the two inventories. On the whole, however, there doesn’t appear to be much of a difference.
sales %>%
ggplot(aes(x = file_type, y = low_user_price, fill = file_type)) +
geom_boxplot(na.rm = TRUE) +
ylim(0, 200)
We can plot the effect of strength factor, as predicted by the model, along with the medians from both the historical
and active
data sets:
predicted_effect <- function(
var,
historical_text_position = 0.3,
active_text_position = 0.3,
xmax = 200
) {
historical_median <- historical[[var]] %>% median
active_median <- active[[var]] %>% median
tibble(
value = historical_SMOTE[[var]],
sold_flag = preds_on_history
) %>%
qplot(
value,
sold_flag,
data = .,
geom = 'smooth',
na.rm = TRUE
) +
xlim(0, xmax) +
xlab(var) +
ylab("predicted probability of sale") +
geom_vline(xintercept = historical_median, colour = "#00BFC4") +
geom_text(
aes(x = historical_median, label = "\nhistorical median",
y = historical_text_position),
colour="#00BFC4", angle = 90, text = element_text(size = 11)
) +
geom_vline(xintercept = active_median, colour = "#F8766D") +
geom_text(
aes(x = active_median, label = "\nactive median",
y = active_text_position),
colour="#F8766D", angle = 90, text = element_text(size = 11)
)
}
predicted_effect(
var = "strength_factor",
historical_text_position = 0.45,
active_text_position = 0.45,
xmax = 3000000
)
We can potentially expect a slightly higher probability of sale due to the lower strength_factor
in the active
data set, but the predicted effect is unstable below around 800000.
The low_user_price
seems to have dropped from the observation period to the current inventory, from a median of 44.03 to a median of 8.68. We saw earlier that, unexpectedly, a higher low_user_price is associated with a greater chance of sale. The random forest model predicts the following effect:
predicted_effect(
var = "low_user_price",
historical_text_position = 0.45,
active_text_position = 0.45,
xmax = 200
)
It’s difficult to interpret this due to the erratic behaviour of the sale probability below $50 (or whatever the unit of measurement for low_user_price is). We can possibly expect a slightly smaller probability of sale in the active
inventory due to the decrease in median price.
sales %>%
ggplot(aes(x = file_type, y = item_count, fill = file_type)) +
geom_boxplot(na.rm = TRUE) +
ylim(0, 50)
There is a slight decrease in item_count
in the active
inventory, from a median of 34 to a median of 30. We should be hesitant about drawing any conclusions from this because we don’t know what business logic is used to decide item_count
, and there is the possibility of feedback between sales and inventory. Nevertheless, the random forest model will predict the following:
predicted_effect(
var = "item_count",
historical_text_position = 0.33,
active_text_position = 0.33,
xmax = 100
)
That is to say, the reduction in item count is likely to lead to a reduced chance of sale when the random forest model is applied to the active
inventory. In the range shown, the effect is quite linear.
sales %>%
ggplot(aes(x = file_type, y = release_number, fill = file_type)) +
geom_boxplot(na.rm = TRUE) +
ylim(0, 10)
Once again we see a difference with the active
inventory having on average a smaller release_number
than the historical
inventory. The release_number
has dropped from a median of 3 to a median of 2. The box plots we constructed earlier, however, did not demonstrate a large difference in release_number
between the items that sold and the items that didn’t.
predicted_effect(
var = "release_number",
historical_text_position = 0.33,
active_text_position = 0.33,
xmax = 10
)
We may expect a decrease in sale probability due to the difference in release_number
. However, the behaviour of the probability as the release_number
changes is erratic, which may be why this pattern was not detected by a simple box plot.
Let’s now apply our random forest to the active
data set. We will include Boolean predictions for the sold_flag
, as well as probabilities. The latter will be used for determining expected value.
active$sold_flag <- rf_smote_flag %>% predict(active, "response")
active$sold_flag_prob <- (rf_smote_flag %>% predict(active, "prob"))[,2]
Let’s check how many items are now predicted to sell over the next six months, compared to the historical
data set.
historical %>%
count(sold_flag) %>%
mutate(proportion = {100 * n / nrow(historical)} %>% round %>% paste0("%")) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
sold_flag | n | proportion |
---|---|---|
0 | 63000 | 83% |
1 | 12996 | 17% |
active %>%
count(sold_flag) %>%
mutate(proportion = {100 * n / nrow(active)} %>% round %>% paste0("%")) %>%
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
sold_flag | n | proportion |
---|---|---|
0 | 112539 | 92% |
1 | 10382 | 8% |
Items in the active
set are less likely to sell than those in the historical
set. Given the differences between the two sets, this isn’t surprising. It could also be a consequence of the slight data imbalance that remained after the application of the SMOTE
algorithm, which will encourage the random forest model to favour predictions of no sale.
Earlier we opted (on weak justification) to use the low_user_price
to determine expected value.
active <- active %>%
mutate(expected_value = sold_flag_prob * low_user_price)
We can now rank the items by expected_value
.
active %>%
select(SKU_number, low_user_price, sold_flag_prob, expected_value) %>%
arrange(-expected_value) %>%
{rbind(head(.), "...", tail(.))} %>% # top 6 and bottom 6
knitr::kable(format = "html") %>%
kable_styling(full_width = F)
SKU_number | low_user_price | sold_flag_prob | expected_value |
---|---|---|---|
3619190 | 741.16 | 0.249 | 184.54884 |
603085 | 503.99 | 0.226 | 113.90174 |
532035 | 262.53 | 0.405 | 106.32465 |
1829990 | 188.55 | 0.492 | 92.7666 |
612294 | 353.99 | 0.262 | 92.74538 |
656512 | 199.07 | 0.444 | 88.38708 |
… | … | … | … |
858486 | 0 | 0.086 | 0 |
2270637 | 0 | 0.19 | 0 |
3230816 | 0 | 0.075 | 0 |
3230814 | 0 | 0.055 | 0 |
3230815 | 0 | 0.086 | 0 |
2287680 | 0 | 0.177 | 0 |
The items at the top of this ranking are those that have an acceptable chance of selling (compared to the low average) and a reasonable low_user_price
. Sure enough, the items at the bottom of the ranking are those with $0 low_user_price
.
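To close the loop on the original question, this ranking can feed the keep/drop decision described at the start. A sketch, using an arbitrary cut-off of the bottom 20% purely for illustration:
# Rank items by expected value and flag a bottom portion for removal.
# The 20% cut-off is an illustrative assumption, not a recommendation.
active_ranked <- active %>%
  mutate(value_rank = percent_rank(expected_value))
items_to_keep <- active_ranked %>% filter(value_rank >= 0.2)
items_to_drop <- active_ranked %>% filter(value_rank < 0.2)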