This package exists to explore the concept of creating a machine learning model as an R package, similar to the established concept of an analysis as an R package. The idea here is that, using vignettes, we can train the model by installing the package. The functions in the package then allow the user to score new data with the trained model. To demonstrate this I’ve created an extremely simple sentiment analysis model based on review data from the UCI Machine Learning Repository.
I thought this might work because of a few things:
map_to_dtm
function in this package.roxygen2
for documenting package functions, and testthat
for unit tests. I especially like the ease with which you can test within RStudio.However, I have my doubts:
devtools::build_vignettes
before running devtools::install
. There’s something here with namespaces whereby the data objects suddenly become “unexported” after tests are attempted. I don’t know why!<package_root>/vignettes
when this vignette is knitted, so I can move up one level to obtain the root directory of the package. This should be okay if we’re following the standard package structure, but I’ve been hurt too many times by relative file paths to feel comfortable about this.plumber
package. I don’t know if we can integrate the expected plumber.R
file in the package, except for sticking it in the inst
directory and then finding it with system.file
.No matter what, I think these sorts of projects have to be shared, even if I don’t think that this is a major success!
A quick shout out for the excellent book on R packages by Hadley Wickham. It’s well worth keeping bookmarked.
library(dplyr)
library(ggplot2)
library(text2vec)
library(tidytext)
library(randomForest)
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
package_root <- dirname(getwd()) # dirname moves up a directory
devtools::load_all(package_root)
I haven’t kept the data in this git repository, opting instead to download it if it doesn’t already exist. It’s a fairly small data set though (3000 rows).
download_data
is a package function that downloads and unzips the source data into the inst/extdata
directory (creating it if necessary). On package compilation, everything in the inst
folder is moved up to the root directory of the package, and so we can find the extdata
directory in the finished product.
extdata <- file.path(package_root, "inst", "extdata")
if (!dir.exists(file.path(extdata, "sentiment labelled sentences"))) {
data_source_url <- paste0("https://archive.ics.uci.edu",
"/ml/machine-learning-databases/00331",
"/sentiment%20labelled%20sentences.zip")
download_data(data_source_url, extdata)
}
Data is loaded in with another custom function, read_review_file
. This is just readr::read_tsv
with some special options to cover the pecularities of the raw data. All of these custom functions are documented and stored in the R
directory. Once the package is installed, function manuals can be called in the usual way (eg. ?read_review_file
).
This is a simple analysis, so let’s just stick to discrete categories for sentiment: “good” and “bad”. I don’t care too much about how the model performs, as long as it functions.
data_files <- c("amazon_cells_labelled.txt",
"imdb_labelled.txt",
"yelp_labelled.txt") %>%
file.path(extdata, "sentiment labelled sentences", .)
reviews <- data_files %>%
purrr::map(read_review_file) %>%
purrr::reduce(rbind) %>%
mutate(
sentiment = ifelse(sentiment == 1, "good", "bad")
)
reviews %>% head
## # A tibble: 6 x 2
## review sentiment
## <chr> <chr>
## 1 So there is no way for me to plug it in here in the US unless … bad
## 2 Good case, Excellent value. good
## 3 Great for the jawbone. good
## 4 Tied to charger for conversations lasting more than 45 minutes… bad
## 5 The mic is great. good
## 6 I have to jiggle the plug to get it to line up right to get de… bad
We check for missing data using the naniar
package:
## # A tibble: 2 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 review 0 0
## 2 sentiment 0 0
Let’s take a look at which words are the most frequent. First we create a data frame such that each row is an occurrence of a word. Note that we remove stop words — these are words like “the” that are common and usually provide little semantic content to the text.
words <- reviews %>%
tidytext::unnest_tokens(
word,
review
) %>%
anti_join(
tidytext::stop_words,
by = "word"
)
words %>% head
## # A tibble: 6 x 2
## sentiment word
## <chr> <chr>
## 1 bad plug
## 2 bad converter
## 3 good excellent
## 4 good jawbone
## 5 bad tied
## 6 bad charger
Now we’ll plot the mosst frequently occurring words, keeping a note of which words are “good” and which words are “bad”.
words %>%
count(word, sentiment, sort = TRUE) %>%
head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
# scale_fill_manual(
# values = wine_plot_colours
# ) +
xlab(NULL) +
theme(text = element_text(size = 16)) +
coord_flip() +
ggtitle("Frequency of words")
There are no surprises here! “Bad” is universally bad and “love” is universally good. It’s comforting to see. We’ll note this and use these words in our unit tests.
I’m not sure what purpose word clouds serve, but they seem almost mandatory.
We need to apply some preprocessing to our text before we can feed it into a model. The first round of preprocessing is simply ignoring case, punctuation and numbers:
## function(x) {
## x %>%
## tolower %>%
## tm::removeNumbers() %>%
## tm::removePunctuation()
## }
## <environment: namespace:ModelAsAPackage>
I’m actually not sure that we should be removing numbers here. We’re dealing with reviews, after all, and a review like “10/10” certainly tells us something about sentiment. But that’s beyond the scope of this package.
The next round of processing involves tokenising our words. This is a process of stripping words down to their base. Another custom function, stem_tokeniser
plays this role, by calling on the Porter stemming algorithm:
## [[1]]
## [1] "inform" "inform" "inform" "inform"
Now we’ll define our vocabulary. The vocabulary is the domain of the problem — the words that will go into the model as features. Dimensionality is an issue here, we’’ll prune our vocabulary to include only words that occur a minimum number of times. The vocabulary is subject to domain shift — if an incoming piece of text contains a word that isn’t in the vocabulary, it will be ignored by the model.
We’re going to insist that every word in the vocabulary appears in at least 25 of the reviews.
## Number of docs: 3000
## 1149 stopwords: a, a's, able, about, above, according ...
## ngram_min = 1; ngram_max = 1
## Vocabulary:
## term term_count doc_count
## 1: fit 25 25
## 2: buy 25 25
## 3: cast 25 25
## 4: happi 25 25
## 5: impress 25 25
## 6: perform 25 25
## 7: expect 26 26
## 8: plot 27 26
## 9: eat 27 27
## 10: day 27 27
## 11: quit 27 27
## 12: worth 27 27
## 13: mani 28 27
## 14: actor 28 27
## 15: restaur 28 27
## 16: star 28 26
## 17: wont 28 28
## 18: doesnt 28 28
## 19: comfort 29 28
## 20: piec 29 29
## 21: enjoy 31 29
## 22: lot 31 30
## 23: money 31 31
## 24: experi 31 31
## 25: call 32 31
## 26: perfect 32 32
## 27: stori 32 32
## 28: wait 32 26
## 29: real 32 31
## 30: poor 33 31
## 31: peopl 33 32
## 32: definit 33 33
## 33: play 34 33
## 34: terribl 34 34
## 35: scene 34 32
## 36: everi 35 32
## 37: littl 35 35
## 38: everyth 35 35
## 39: feel 36 36
## 40: doe 37 36
## 41: minut 37 33
## 42: amaz 37 37
## 43: worst 39 38
## 44: friend 40 40
## 45: pretti 41 40
## 46: ear 41 36
## 47: didnt 42 41
## 48: tri 42 41
## 49: act 46 46
## 50: batteri 47 46
## 51: wast 47 47
## 52: watch 53 52
## 53: ani 53 52
## 54: nice 54 54
## 55: price 54 54
## 56: sound 54 51
## 57: headset 55 53
## 58: im 56 56
## 59: excel 56 54
## 60: becaus 57 57
## 61: charact 58 54
## 62: ive 60 56
## 63: product 61 61
## 64: disappoint 61 61
## 65: recommend 64 64
## 66: qualiti 66 65
## 67: onli 78 77
## 68: dont 85 82
## 69: love 93 92
## 70: bad 101 90
## 71: realli 103 100
## 72: servic 107 106
## 73: food 125 120
## 74: time 136 134
## 75: phone 174 165
## 76: film 184 172
## 77: movi 208 191
## 78: veri 243 226
## term term_count doc_count
Finally, we’ll create a vectoriser using the text2vec
package. This allows us to map new text onto the vocabulary we’ve just created.
Actually, much of what you see here uses the text2vec
package. I’m fond of this package because it’s designed with the idea that you may need to score new data that comes in after you’ve trained your model, so always need to be able to process new text!
One quick note, though: the itoken
function in this package creates an iterator of tokens. It can be called like this:
tokens <- text2vec::itoken(
unprocessed_text,
preprocessor = text_preprocessor,
tokenizer = stem_tokeniser,
progressbar = FALSE
)
However, I had great trouble with this method, with words not being properly tokenised before making it into the vectoriser. So I’ve done this instead:
processed_text <- stem_tokeniser(text_preprocessor(unprocessed_text))
text2vec::itoken(
processed_text,
progressbar = FALSE
)
I would have thought that the two pieces of code were equivalent, but my unit tests fail with the first example. I’m putting this here as an unknown!
Creating a document term matrix
The input for our model algorithm is a document term matrix. This is a matrix in which every row represents one of our 3000 reviews, and every column uses one of the 78 terms in our vocabulary. We use the map_to_dtm
function which allows us to map raw text onto a new dtm.
## function(x,
## vectoriser = ModelAsAPackage::vectoriser,
## tfidf = ModelAsAPackage::tfidf) {
## processed_text <- stem_tokeniser(text_preprocessor(x))
##
## # For some reason, the preprocessor and stem_tokeniser don't take effect if
## # I put them in the itoken function as values to the relevant arguments.
## # Please let me know if you understand why this is!
## tokens <- text2vec::itoken(
## processed_text,
## progressbar = FALSE
## )
##
## # If the input contains no terms corresponding the to vocabulary used to
## # generate the vectoriser, then a warning will occur for an empty dtm.
## # Since this is a plausible scenario, we suppress the warning.
## suppressWarnings(
## dtm <- text2vec::create_dtm(tokens, vectoriser)
## )
##
## if (!is.null(tfidf)) {
## dtm <- tfidf$transform(dtm)
## }
##
## return(dtm)
## }
## <environment: namespace:ModelAsAPackage>
You’ll notice that tfidf
argument. This stands for term frequency inverse document frequency. Informally, we want to weight every word as more important if it occurs often, and less important if it occurs in many documents. This is exactly what tfidf
does. Let’s start with an unweighted matrix so we can see the effect.
Then, as if I hadn’t wasted enough of my life there, they poured salt in the wound by drawing out the time it took to bring the check.
## im excel becaus charact ive product
## 0 0 0 0 0 0
## disappoint recommend qualiti onli dont love
## 0 0 0 0 0 0
## bad realli servic food time phone
## 0 0 0 0 1 0
## film movi veri
## 0 0 0
Now we fit our tfidf
. Be careful here: fit_transform
is a method which says “use this data to define the tfidf
, and then transform the input by that tfidf
. This is distinct from transform
which says”use a tfidf
that’s already been fitted to transform this data". This terminology is more familiar to Python users than it is to R users, and I occasionally see it tripping people up (especially on Kaggle). The rule of thumb is that you use fit_transform
for your training data and transform
for your test data, or any new data that you encounter.
Let’s take the same example before, but now with a weighted document term matrix:
## im excel becaus charact ive product
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## disappoint recommend qualiti onli dont love
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## bad realli servic food time phone
## 0.000000 0.000000 0.000000 0.000000 1.550546 0.000000
## film movi veri
## 0.000000 0.000000 0.000000
With all of the effort we put into preprocessing, the model training step is relatively straightforward! Document term matrices are stored as a special class of sparse matrix, because there are computational techniques to efficiently store and use matrices in which the vast majority of entries are 0. However, this format isn’t accepted by the randomForest
algorithm. Fortunately, with only 3000 rows and 78 columns, we don’t have to worry too much about computational efficiency.
review_rf <- randomForest::randomForest(
x = as.matrix(dtm_tf_idf),
y = factor(reviews$sentiment),
ntree = 500
)
While I don’t want to invest much time in the model itself, we can at least look at how it performs, and which terms it considers the most important:
##
## Call:
## randomForest(x = as.matrix(dtm_tf_idf), y = factor(reviews$sentiment), ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 33.7%
## Confusion matrix:
## bad good class.error
## bad 1167 333 0.222
## good 678 822 0.452
Now we recall that we’re actually creating a package. We want all of the work that we’ve done so far to be included in the final result. Fortunately, our custom functions are in the R
directory, so they’ll persist when the package is compiled. We need three other objects (all relating to our trained model) to be available as well: our random forest review_rf
, our vectoriser
, and the tfidf
we used to weight our training data (and will reuse for new data). These are all (sparsely) documented with their own entries in the R
directory.