Plotting trees from Random Forest models with ggraph

2017-03-16 00:00:00 +0000

Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models.

I am very much a visual person, so I try to plot as much of my results as possible because it helps me get a better feel for what is going on with my data.

A nice aspect of using tree-based machine learning, like Random Forest models, is that they are more easily interpreted than e.g. neural networks, because they are based on decision trees. So, when I am using such models, I like to plot the final decision trees (if they aren’t too large) to get a sense of which decisions underlie my predictions.

There are a few very convenient ways to plot the outcome if you are using the randomForest package directly, but I like to have as much control as possible over the layout, colors, labels, etc. And because I didn’t find a solution I liked for caret models, I wrote the following little function (below you will find information about how I built the model):

As input, it takes part of the output of model_rf <- caret::train(... "rf" ...): the final model model_rf$finalModel, whose forest component holds the individual trees. From these trees, you can specify which one to plot by its index.

library(dplyr)
library(ggraph)
library(igraph)

tree_func <- function(final_model, 
                      tree_num) {
  
  # get tree by index
  tree <- randomForest::getTree(final_model, 
                                k = tree_num, 
                                labelVar = TRUE) %>%
    tibble::rownames_to_column() %>%
    # set split points of leaf nodes to NA, so the 0s won't get plotted
    mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))
  
  # prepare data frame for graph
  graph_frame <- data.frame(from = rep(tree$rowname, 2),
                            to = c(tree$`left daughter`, tree$`right daughter`))
  
  # convert to graph and delete the "0" node (the non-existent daughter of leaf nodes) that we don't want to plot
  graph <- graph_from_data_frame(graph_frame) %>%
    delete_vertices("0")
  
  # set node labels
  V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
  V(graph)$leaf_label <- as.character(tree$prediction)
  V(graph)$split <- as.character(round(tree$`split point`, digits = 2))
  
  # plot
  plot <- ggraph(graph, 'dendrogram') + 
    theme_bw() +
    geom_edge_link() +
    geom_node_point() +
    geom_node_text(aes(label = node_label), na.rm = TRUE, repel = TRUE) +
    geom_node_label(aes(label = split), vjust = 2.5, na.rm = TRUE, fill = "white") +
    geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE, 
                    repel = TRUE, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_blank(),
          plot.background = element_rect(fill = "white"),
          panel.border = element_blank(),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          plot.title = element_text(size = 18))
  
  print(plot)
}


We can now plot, for example, the tree with the smallest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == min(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)


Or we can plot the tree with the largest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == max(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)
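
A note on ties: which() returns every tree index that matches the minimum or maximum node count. If you only want the first such tree, which.min() and which.max() are convenient shortcuts:

tree_num <- which.min(model_rf$finalModel$forest$ndbigtree)

tree_func(final_model = model_rf$finalModel, tree_num)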



Preparing the data and modeling

The data set I am using in these example analyses, is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The data set distinguishes two predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis

bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt", header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", 
                       "clump_thickness", 
                       "uniformity_of_cell_size", 
                       "uniformity_of_cell_shape", 
                       "marginal_adhesion", 
                       "single_epithelial_cell_size", 
                       "bare_nuclei", 
                       "bland_chromatin", 
                       "normal_nucleoli", 
                       "mitosis", 
                       "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

# separate into training and test data
library(caret)

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

# run model
set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.



sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] igraph_1.0.1       ggraph_1.0.0       ggplot2_2.2.1.9000
## [4] dplyr_0.5.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9         nloptr_1.0.4        plyr_1.8.4         
##  [4] viridis_0.3.4       iterators_1.0.8     tools_3.3.3        
##  [7] digest_0.6.12       lme4_1.1-12         evaluate_0.10      
## [10] tibble_1.2          gtable_0.2.0        nlme_3.1-131       
## [13] lattice_0.20-34     mgcv_1.8-17         Matrix_1.2-8       
## [16] foreach_1.4.3       DBI_0.6             ggrepel_0.6.5      
## [19] yaml_2.1.14         parallel_3.3.3      SparseM_1.76       
## [22] gridExtra_2.2.1     stringr_1.2.0       knitr_1.15.1       
## [25] MatrixModels_0.4-1  stats4_3.3.3        rprojroot_1.2      
## [28] grid_3.3.3          caret_6.0-73        nnet_7.3-12        
## [31] R6_2.2.0            rmarkdown_1.3       minqa_1.2.4        
## [34] udunits2_0.13       tweenr_0.1.5        deldir_0.1-12      
## [37] reshape2_1.4.2      car_2.1-4           magrittr_1.5       
## [40] units_0.4-2         backports_1.0.5     scales_0.4.1       
## [43] codetools_0.2-15    ModelMetrics_1.1.0  htmltools_0.3.5    
## [46] MASS_7.3-45         splines_3.3.3       randomForest_4.6-12
## [49] assertthat_0.1      pbkrtest_0.4-6      ggforce_0.1.1      
## [52] colorspace_1.3-2    labeling_0.3        quantreg_5.29      
## [55] stringi_1.1.2       lazyeval_0.2.0      munsell_0.4.3

Hyper-parameter Tuning with Grid Search for Deep Learning

2017-03-07 00:00:00 +0000

Last week I showed how to build a deep neural network with h2o and rsparkling. As we could see there, it is not trivial to optimize the hyper-parameters for modeling. Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy.

Keep in mind though, that hyper-parameter tuning can only improve the model so much without overfitting. If you can’t achieve sufficient accuracy, the input features might simply not be adequate for the predictions you are trying to model. It might be necessary to go back to the original features and try e.g. feature engineering methods.


Preparing Spark instance and plotting theme

Check out last week’s post for details on how I prepared the data.

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")
library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)
arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2])
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3])

splits <- h2o.splitFrame(arrhythmia_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))


We can use the h2o.grid() function to perform a Random Grid Search (RGS). We could also test all possible combinations of parameters with Cartesian Grid or exhaustive search, but RGS is much faster when we have a large number of possible combinations and usually finds sufficiently accurate models.
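
For reference, an exhaustive search only requires a different strategy in the search criteria. A minimal sketch (with a grid as large as the one below, a full Cartesian search would take far too long):

# exhaustive alternative to the random search used below
search_criteria_cartesian <- list(strategy = "Cartesian")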

For RGS, we first define a set of hyper-parameters and search criteria to fine-tune our models. Because there are many hyper-parameters, each with a range of possible values, we want to find an (ideally) optimal combination to maximize our model’s accuracy. We can also specify how long we want the grid search to run. Based on the results of each model tested in the grid, we can choose the one with the highest accuracy or best performance for the question at hand.

Activation Functions

  • Rectifier: the default activation function. It is the fastest and most versatile option, but it can lead to instability and tends to be less accurate than the alternatives.
  • Tanh: the hyperbolic tangent, a scaled and shifted variant of the sigmoid activation function. It takes values from -1 to 1, centered around 0, and needs more computational power than e.g. the Rectifier.
  • Maxout: an activation function that outputs the maximum of its inputs. It is computationally quite demanding but can produce high-accuracy models.

  • ...WithDropout: With a dropout variant, a random subset of the network is trained and the weights of all sub-networks are averaged. It works together with the parameter hidden_dropout_ratios, which controls the fraction of neurons that is randomly dropped in each hidden layer. Hidden dropout ratios are useful for preventing overfitting on learned features (see the sketch below).
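
As a minimal sketch of how the dropout variants and hidden_dropout_ratios go together (the parameter names are h2o’s, but the values here are made up for illustration):

hyper_params_dropout <- list(activation = "RectifierWithDropout",
                             hidden = list(c(100, 100, 100)),
                             # one dropout ratio per hidden layer (h2o's default is 0.5)
                             hidden_dropout_ratios = list(c(0.1, 0.2, 0.3)))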

Hidden layers

  • the hidden parameter is the most important hyper-parameter to set for deep neural networks, as it specifies how many hidden layers and how many nodes per hidden layer the model should have

L1 and L2 penalties

  • L1: drives most weights to zero, so that only strong weights survive
  • L2: prevents any single weight from getting too big (a sketch of both penalty terms follows below)
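
As a sketch of what these penalties add to the training loss (w standing for all network weights, l1 and l2 for the respective regularization strengths):

# illustration only: the regularization term that is added to the loss
penalty <- function(w, l1, l2) {
  l1 * sum(abs(w)) + l2 * sum(w^2)
}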

Rho and Epsilon

  • rho: the adaptive learning rate time-decay factor; similar to momentum, it relates to the memory of prior weight updates
  • epsilon: the adaptive learning rate smoothing factor; it helps prevent getting stuck in local optima

hyper_params <- list(
                     activation = c("Rectifier", "Maxout", "Tanh", "RectifierWithDropout", "MaxoutWithDropout", "TanhWithDropout"), 
                     hidden = list(c(5, 5, 5, 5, 5), c(10, 10, 10, 10), c(50, 50, 50), c(100, 100, 100)),
                     epochs = c(50, 100, 200),
                     l1 = c(0, 0.00001, 0.0001), 
                     l2 = c(0, 0.00001, 0.0001),
                     rate = c(0.01, 0.005, 0.001),
                     rate_annealing = c(1e-8, 1e-7, 1e-6),
                     rho = c(0.9, 0.95, 0.99, 0.999),
                     epsilon = c(1e-10, 1e-8, 1e-6, 1e-4),
                     momentum_start = c(0, 0.5),
                     momentum_stable = c(0.99, 0.5, 0),
                     input_dropout_ratio = c(0, 0.1, 0.2),
                     max_w2 = c(10, 100, 1000, 3.4028235e+38)
                     )


Early stopping criteria

  • stopping_metric: metric that we want to use as stopping criterion
  • stopping_tolerance and stopping_rounds: training stops when the stopping metric does not improve by at least the stopping tolerance proportion (e.g. 0.05 or 5%) for the number of consecutive rounds defined by stopping rounds.

search_criteria <- list(strategy = "RandomDiscrete", 
                        max_models = 100,
                        max_runtime_secs = 900,
                        stopping_tolerance = 0.001,
                        stopping_rounds = 15,
                        seed = 42)
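
If we wanted to stop on a specific metric rather than the default, we could extend these criteria, e.g. with the misclassification rate (a hypothetical addition that was not used in the run below):

search_criteria_metric <- c(search_criteria,
                            list(stopping_metric = "misclassification"))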

Now, we can train models with hyper-parameter combinations sampled from our grid according to the search criteria specified above.

dl_grid <- h2o.grid(algorithm = "deeplearning", 
                    x = features,
                    y = response,
                    weights_column = weights,
                    grid_id = "dl_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,                           
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
                    )

We now want to extract the best model from the grid model list. What makes a model the best depends on the question you want to address with it: in some cases, the model with the highest AUC is the most suitable, in others the one with the lowest mean squared error, etc. See last week’s post again for a more detailed discussion of performance metrics.

For demonstration purposes, I am choosing the best models from a range of possible quality criteria. We first use the h2o.getGrid() function to sort all models by the quality metric we choose (depending on the metric, you want it ordered by descending or ascending values). We can then get the model that’s the first in the list to work with further. This model’s hyper-parameters can be found with best_model@allparameters. You can now work with your best model as with any regular model in h2o (for an example see last week’s post).

# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err")

for (sort_by_1 in sort_options_1) {
  
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_1, decreasing = FALSE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  assign(paste0("best_model_", sort_by_1), best_model)
  
}


# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_2, decreasing = TRUE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  assign(paste0("best_model_", sort_by_2), best_model)
  
}
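
Once extracted, each of these best models can be inspected like any other h2o model. For example, for the model that ranked best by AUC (created in the loop above):

# hyper-parameters of the best model by AUC
best_model_auc@allparameters

# and its performance on the validation set
h2o.performance(best_model_auc, valid = TRUE)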

Let’s plot the mean per class error for each best model:

library(tibble)

sort_options <- c("mean_per_class_error", "mse", "err", "auc", "precision", "accuracy", "recall", "specificity")

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  errors <- h2o.mean_per_class_error(best_model, train = TRUE, valid = TRUE, xval = TRUE)
 
  errors_df <- data.frame(model_id = best_model@model_id, sort = sort_by, errors) %>%
    rownames_to_column(var = "rowname")
  
  if (sort_by == "mean_per_class_error") {
    
    errors_df_comb <- errors_df
    
  } else {
    
    errors_df_comb <- rbind(errors_df_comb, errors_df)
    
  }
}

order <- subset(errors_df_comb, rowname == "xval") %>%
  arrange(errors)
  
errors_df_comb %>%
  mutate(sort = factor(sort, levels = order$sort)) %>%
  ggplot(aes(x = sort, y = errors, fill = model_id)) +
    facet_grid(rowname ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1), "cm")) +
    labs(x = "")


Model performance

The ultimate performance test for our model will be its prediction accuracy on the test set it hasn’t seen before. Here, I will compare the AUC and mean squared error for each best model from before. You could of course look at any other quality metric that is most appropriate for your model.

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  mse_auc_test <- data.frame(model_id = best_model@model_id,
                             sort = sort_by, 
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))
  
  if (sort_by == "mean_per_class_error") {
    
    mse_auc_test_comb <- mse_auc_test
    
  } else {
    
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
    
  }
}

library(tidyr)

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = sort, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")

I will then create a data frame with each best model’s predictions for all test samples. As in last week’s post, I want to compare the default predictions with stringent predictions.

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)),
                                    sort = rep(sort_by, nrow(test)),
                                    class = as.vector(test$class), 
                                    actual = as.vector(test$diagnosis), 
                                    as.data.frame(h2o.predict(object = best_model, newdata = test)))
  
  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")
  
  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia", 
                                                  ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", 
                                         ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))
  
  if (sort_by == "mean_per_class_error") {
    
    finalRf_predictions_comb <- finalRf_predictions
    
  } else {
    
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
    
  }
}

To get a better overview, I am going to plot the predictions (default and stringent):

finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

With predictions made by different models, we can see where each model performs best. This obviously corresponds with the quality metric we chose to define the best model. Stringent prediction thresholds reduced the number of false predictions but of course also the number of predictions made at all.

We can now decide which model is most relevant. Say we want this model to give recommendations for further validating a diagnosis of arrhythmia; then we might want to detect as many arrhythmia cases as possible, while accepting that a few healthy subjects get flagged as well. In that case, clinicians could use the model to flag patients with likely arrhythmia for further testing. Here, this means we want to minimize the number of arrhythmia cases that are classified wrongly, so we would prefer the mse, auc and specificity model (which is in fact the same model, chosen by all three quality metrics). The worst model for this purpose would be the recall model, which also did not make any predictions that passed the stringency threshold. The recall model would have been best if we wanted to be confident that healthy people get flagged correctly. However, in a real-life setting, you can imagine that it would be much worse if a sick patient got sent home because he was wrongly diagnosed as healthy than if a healthy person got submitted to further testing for possible arrhythmia.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.6.1         tibble_1.2          ggrepel_0.6.5      
## [4] ggplot2_2.2.1.9000  sparklyr_0.5.3-9002 dplyr_0.5.0        
## [7] h2o_3.10.3.6        rsparkling_0.1.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        RColorBrewer_1.1-2 plyr_1.8.4        
##  [4] bitops_1.0-6       base64enc_0.1-3    tools_3.3.2       
##  [7] digest_0.6.12      jsonlite_1.2       evaluate_0.10     
## [10] gtable_0.2.0       shiny_1.0.0        DBI_0.5-1         
## [13] rstudioapi_0.6     yaml_2.1.14        parallel_3.3.2    
## [16] withr_1.0.2        httr_1.2.1         stringr_1.2.0     
## [19] knitr_1.15.1       rappdirs_0.3.1     rprojroot_1.2     
## [22] grid_3.3.2         R6_2.2.0           rmarkdown_1.3     
## [25] reshape2_1.4.2     magrittr_1.5       backports_1.0.5   
## [28] scales_0.4.1       htmltools_0.3.5    assertthat_0.1    
## [31] mime_0.5           xtable_1.8-2       colorspace_1.3-2  
## [34] httpuv_1.3.3       labeling_0.3       config_0.2        
## [37] stringi_1.1.2      RCurl_1.95-4.8     lazyeval_0.2.0    
## [40] munsell_0.4.3

Building deep neural nets with h2o and rsparkling that predict arrhythmia of the heart

2017-02-27 00:00:00 +0000

Last week, I introduced how to run machine learning applications on Spark from within R, using the sparklyr package. This week, I am showing how to build feed-forward deep neural networks or multilayer perceptrons. The models in this example are built to classify ECG data as coming either from healthy hearts or from someone suffering from arrhythmia. I will show how to prepare a dataset for modeling, how to set weights and other modeling parameters, and finally how to evaluate model performance with the h2o package via rsparkling.


Deep learning with neural networks

Deep learning with neural networks is arguably one of the most rapidly growing applications of machine learning and AI today. Neural networks allow building complex models that consist of multiple hidden layers within artificial networks and are able to find non-linear patterns in unstructured data. Deep neural networks are usually feed-forward, which means that each layer feeds its output to subsequent layers, but recurrent or feed-back neural networks can also be built. Feed-forward neural networks are also called multilayer perceptrons (MLPs).
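
To make the feed-forward idea concrete, here is a toy forward pass through a single hidden layer in base R (purely illustrative and unrelated to the h2o models below):

set.seed(1)
x  <- matrix(rnorm(4), nrow = 1)     # one sample with 4 input features
W1 <- matrix(rnorm(4 * 3), nrow = 4) # weights connecting 4 inputs to 3 hidden nodes
b1 <- rnorm(3)                       # one bias per hidden node
h1 <- pmax(x %*% W1 + b1, 0)         # rectifier activation: max(0, xW + b)
h1                                   # this output would be fed to the next layer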


H2O and Sparkling Water

The R package h2o provides a convenient interface to H2O, an open-source machine learning and deep learning platform. H2O can be integrated with Apache Spark (Sparkling Water) and therefore allows implementing complex or big models in a fast and scalable manner. H2O provides distributed implementations of a wide range of common machine learning algorithms for classification, regression and deep learning.

Sparkling Water can be accessed from R with the rsparkling extension package to sparklyr and h2o. Check the documentation for rsparkling to find out which H2O, Sparkling Water and Spark versions are compatible.


Preparing the R session

First, we need to load the packages and connect to the Spark instance (for demonstration purposes, I am using a local instance).

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")

I am also preparing my custom plotting theme.

library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}


Arrhythmia data

The data I am using to demonstrate the building of neural nets is the arrhythmia dataset from UC Irvine’s machine learning database. It contains 279 features from ECG heart rhythm diagnostics and one output column. I am not going to rename the feature columns because there are too many of them and the descriptions are too complex. Also, we don’t need to know specifically which features we are looking at to build the models. For a description of each feature, see https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names. The output column defines 16 classes: class 1 samples are from healthy ECGs, the remaining classes belong to different types of arrhythmia, with class 16 collecting all remaining arrhythmia cases that didn’t fit into distinct classes.

arrhythmia <- read.table("arrhythmia.data.txt", sep = ",")
arrhythmia[arrhythmia == "?"] <- NA

# making sure that all feature columns are numeric
arrhythmia[-280] <- lapply(arrhythmia[-280], as.character)
arrhythmia[-280] <- lapply(arrhythmia[-280], as.numeric)

# renaming output column and converting to factor
colnames(arrhythmia)[280] <- "class"
arrhythmia$class <- as.factor(arrhythmia$class)

As usual, I want to get acquainted with the data and explore its properties before building any model. So, I am first going to look at the distribution of classes and of healthy versus arrhythmia samples.

p1 <- ggplot(arrhythmia, aes(x = class)) +
  geom_bar(fill = "navy", alpha = 0.7) +
  my_theme()

Because I am interested in distinguishing healthy from arrhythmia ECGs, I am converting the output to binary format by combining all arrhythmia cases into one class.

arrhythmia$diagnosis <- ifelse(arrhythmia$class == 1, "healthy", "arrhythmia")
arrhythmia$diagnosis <- as.factor(arrhythmia$diagnosis)
p2 <- ggplot(arrhythmia, aes(x = diagnosis)) +
  geom_bar(fill = "navy", alpha = 0.7) +
  my_theme()
library(gridExtra)
library(grid)

grid.arrange(p1, p2, ncol = 2)

With binary classification, we have almost the same number of healthy and arrhythmia cases in our dataset.

I am also interested in how much the normal and arrhythmia cases cluster in a Principal Component Analysis (PCA). I am first preparing the PCA plotting function and then running it on the feature data.

library(pcaGoPromoter)

pca_func <- function(pcaOutput2, group_name){
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
          data.frame(groups = as.character(t),
                     ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                           centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                           level = 0.95),
                     stringsAsFactors = FALSE)))
        
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.5) + 
      labs(color = paste(group_name),
           fill = paste(group_name),
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance")) +
      my_theme()
    
    return(plot)
}

pcaOutput <- pca(t(arrhythmia[-c(280, 281)]), printDropped = FALSE, scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$scores)

pcaOutput2$groups <- arrhythmia$class
p1 <- pca_func(pcaOutput2, group_name = "class")

pcaOutput2$groups <- arrhythmia$diagnosis
p2 <- pca_func(pcaOutput2, group_name = "diagnosis")

grid.arrange(p1, p2, ncol = 2)

The PCA shows that there is a big overlap between healthy and arrhythmia samples, i.e. there do not seem to be major global differences across all features. The class that is most distinct from all others seems to be class 9. I want to give the arrhythmia cases that are very different from the rest a stronger weight in the neural network, so I define a weight column in which every sample outside the central PCA cluster gets a “2”; these samples will in effect be used twice in the model.

weights <- ifelse(pcaOutput2$PC1 < -5 & abs(pcaOutput2$PC2) > 10, 2, 1)

I also want to know how much variance there is within each feature.

library(matrixStats)

colvars <- data.frame(feature = colnames(arrhythmia[-c(280, 281)]),
                      variance = colVars(as.matrix(arrhythmia[-c(280, 281)])))

subset(colvars, variance > 50) %>%
  mutate(feature = factor(feature, levels = colnames(arrhythmia[-c(280, 281)]))) %>%
  ggplot(aes(x = feature, y = variance)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Features with low variance are less likely to contribute strongly to a differentiation between healthy and arrhythmia cases, so I am going to remove them. I am also adding the weights column:

arrhythmia_subset <- cbind(weights, arrhythmia[, c(281, 280, which(colvars$variance > 50))])


Working with rsparkling and h2o

Now that I have my final data frame for modeling, I copy it to the Spark instance. For working with h2o functions, the data needs to be converted from a Spark DataFrame to an H2O Frame. This is done with the as_h2o_frame() function.

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)

We can now access all functions from the h2o package that are built to work on H2O Frames. One such useful function is h2o.describe(). It is similar to base R’s summary() function but outputs many more descriptive measures for our data. To get a good overview of these measures, I am going to plot them.

library(tidyr) # for gathering
h2o.describe(arrhythmia_hf[, -1]) %>% # excluding the weights column
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", 
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% # separating them into facets makes them easier to see
  mutate(Label = factor(Label, levels = colnames(arrhythmia_hf[, -1]))) %>%
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
    geom_point(size = 4, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    facet_grid(group ~ ., scales = "free") +
    labs(x = "Feature",
         y = "Value",
         color = "")

I am also interested in the correlation between features and the output. We can use the h2o.cor() function to calculate the correlation matrix. It is again much easier to understand the data when we visualize it, so I am going to create another plot.

library(reshape2) # for melting

arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2]) # diagnosis is now a character column and we need to convert it back to a factor
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3]) # same for class

cor <- h2o.cor(arrhythmia_hf[, -c(1, 3)])
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) + 
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", 
         y = "")


Training, test and validation data

Now we can use the h2o.splitFrame() function to split the data into training, validation and test data. Here, I am using 70% for training and 15% each for validation and testing. We could also just split the data into two sets, training and test data, but when we have sufficient samples, it is a good idea to evaluate model performance on an independent test set on top of training with a validation set. Because we can easily overfit a model, we want to get an idea about how generalizable it is - and this we can only assess by looking at how well it works on previously unknown data.

I am also defining response, feature and weight column names now.

splits <- h2o.splitFrame(arrhythmia_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))
summary(train$diagnosis, exact_quantiles = TRUE)
##  diagnosis      
##  healthy   :163 
##  arrhythmia:155
summary(valid$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :43 
##  arrhythmia:25
summary(test$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :39 
##  arrhythmia:27

If we had more categorical features, we could use the h2o.interaction() function to define interaction terms, but since we only have numeric features here, we don’t need this.
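
For completeness, this is roughly what such a call would look like if we had a second factor column (purely hypothetical here, since "some_factor" does not exist in this data):

# hypothetical sketch - not run on this data
interactions <- h2o.interaction(arrhythmia_hf,
                                factors = c("class", "some_factor"),
                                pairwise = TRUE,
                                max_factors = 100,
                                min_occurrence = 1)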

We can also run a PCA on the training data, using the h2o.prcomp() function to calculate the singular value decomposition of the Gram matrix with the power method.

pca <- h2o.prcomp(training_frame = train,
           x = features,
           validation_frame = valid,
           transform = "NORMALIZE",
           k = 3,
           seed = 42)

pca
## Model Details:
## ==============
## 
## H2ODimReductionModel: pca
## Model ID:  PCA_model_R_1488113493291_1 
## Importance of components: 
##                             pc1      pc2      pc3
## Standard deviation     0.598791 0.516364 0.424850
## Proportion of Variance 0.162284 0.120680 0.081695
## Cumulative Proportion  0.162284 0.282965 0.364660
## 
## 
## H2ODimReductionMetrics: pca
## 
## No model metrics available for PCA
## H2ODimReductionMetrics: pca
## 
## No model metrics available for PCA
eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features

ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel() +
  my_theme()


Modeling

Now, we can build a deep neural network model. We can specify quite a few parameters, like

  • Cross-validation: Cross validation can tell us the training and validation errors for each model. The final model will be overwritten with the best model, if we don’t specify otherwise.

  • Adaptive learning rate: For deep learning with h2o, we by default use stochastic gradient descent optimization with an adaptive learning rate. The two corresponding parameters rho and epsilon help us find global (or near enough) optima.

  • Activation function: The activation function defines the node output relative to a given set of inputs. We want our activation function to be non-linear and continuously differentiable.

  • Hidden nodes: Defines the number of hidden layers and the number of nodes per layer.

  • Epochs: Increasing the number of epochs (one full training cycle on all training samples) can increase model performance, but we also run the risk of overfitting. To determine the optimal number of epochs, we need to use early stopping.

  • Early stopping: By default, early stopping is enabled. This means that training will be stopped when we reach a certain validation error to prevent overfitting.

Of course, you need quite a bit of experience and intuition to hit on a good combination of parameters. That’s why it usually makes sense to do a grid search for hyper-parameter tuning. Here, I want to focus on building and evaluating deep learning models, though. I will cover grid search in next week’s post.

dl_model <- h2o.deeplearning(x = features,
                             y = response,
                             weights_column = weights,
                             model_id = "dl_model",
                             training_frame = train,
                             validation_frame = valid,
                             nfolds = 15,                                   # 15-fold cross-validation
                             keep_cross_validation_fold_assignment = TRUE,
                             fold_assignment = "Stratified",
                             activation = "RectifierWithDropout",
                             score_each_iteration = TRUE,
                             hidden = c(200, 200, 200, 200, 200),           # 5 hidden layers, each of 200 neurons
                             epochs = 100,
                             variable_importances = TRUE,
                             export_weights_and_biases = TRUE,
                             seed = 42)

Because training can take a while, depending on how many samples, features, nodes and hidden layers you are training on, it is a good idea to save your model.

h2o.saveModel(dl_model, path = "dl_model")

We can then re-load the model again any time to check the model quality and make predictions on new data.

dl_model <- h2o.loadModel("dl_model/dl_model")


Model performance

We now want to know how our model performed on the validation data. The summary() function will give us a detailed overview of our model. I am not showing the output here because it is quite extensive.

summary(dl_model)

One performance metric we are usually interested in is the mean per class error for training and validation data.

h2o.mean_per_class_error(dl_model, train = TRUE, valid = TRUE, xval = TRUE)
##     train     valid      xval 
## 0.0304878 0.2232558 0.2257781

The confusion matrix tells us how many cases of each class were predicted correctly and where the predictions went wrong. Here, we see the errors in predictions on the validation data:

h2o.confusionMatrix(dl_model, valid = TRUE)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00425904880659062:
##            arrhythmia healthy    Error    Rate
## arrhythmia         15      10 0.400000  =10/25
## healthy             2      41 0.046512   =2/43
## Totals             17      51 0.176471  =12/68

We can also plot the classification error over all epochs or samples.

plot(dl_model,
     timestep = "epochs",
     metric = "classification_error")

plot(dl_model,
     timestep = "samples",
     metric = "classification_error")

Next to the classification error, we are usually interested in the logistic loss (negative log-likelihood or log loss). It is the negative logarithm of the likelihood the model assigns to the correct class, summed over all samples in the training or validation data. Simply put, the lower the loss, the better the model (if we ignore potential overfitting).

plot(dl_model,
     timestep = "epochs",
     metric = "logloss")
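
To unpack the definition, the binary log loss can be computed by hand from predicted probabilities and 0/1 labels (a base R sketch, not h2o code):

# binary log loss for 0/1 labels y and predicted probabilities p
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps) # clip so log() never sees 0 or 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

logloss(y = c(1, 0, 1), p = c(0.9, 0.2, 0.6))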

We can also plot the root mean squared error (RMSE), the square root of the mean squared error (MSE). The MSE is the average of the squared prediction errors and thus reflects the estimator’s variance and bias. The closer to zero, the better a model.

plot(dl_model,
     timestep = "epochs",
     metric = "rmse")

Next, we want to know the area under the curve (AUC). AUC is an important metric for measuring binary classification model performance. It gives the area under the ROC curve, i.e. the integral of the true positive rate plotted against the false positive rate. The closer to 1, the better a model.

h2o.auc(dl_model, train = TRUE)
## [1] 0.9864582
h2o.auc(dl_model, valid = TRUE)
## [1] 0.852093
h2o.auc(dl_model, xval = TRUE)
## [1] 0.8577735

The weights connecting two adjacent layers and the per-neuron biases that we told the model to export can be accessed with:

w <- h2o.weights(dl_model, matrix_id = 1)
b <- h2o.biases(dl_model, vector_id = 1)

Variable importance can be extracted as well (but keep in mind that variable importance in deep neural networks is difficult to assess and should be considered only a rough estimate).

h2o.varimp(dl_model)
## Variable Importances: 
##   variable relative_importance scaled_importance percentage
## 1     V169            1.000000          1.000000   0.013925
## 2     V239            0.987290          0.987290   0.013748
## 3     V103            0.913953          0.913953   0.012727
## 4      V15            0.907422          0.907422   0.012636
## 5      V91            0.904267          0.904267   0.012592
## 
## ---
##    variable relative_importance scaled_importance percentage
## 85      V88            0.717914          0.717914   0.009997
## 86     V269            0.715800          0.715800   0.009968
## 87     V137            0.712923          0.712923   0.009928
## 88     V168            0.711402          0.711402   0.009906
## 89      V33            0.707356          0.707356   0.009850
## 90     V219            0.696149          0.696149   0.009694
#h2o.varimp_plot(dl_model)


Test data

Now that we have a good idea about model performance on validation data, we want to know how it performed on unseen test data. A good model should find an optimal balance between accuracy on training and test data. A model that has 0% error on the training data but 40% error on the test data is in effect useless. It overfit on the training data and is thus not able to generalize to unknown data.

perf <- h2o.performance(dl_model, test)
perf
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.2154912
## RMSE:  0.4642103
## LogLoss:  1.378809
## Mean Per-Class Error:  0.2250712
## AUC:  0.7796771
## Gini:  0.5593542
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##            arrhythmia healthy    Error      Rate
## arrhythmia         19       8 0.296296    =8/27
## healthy             6      33 0.153846    =6/39
## Totals             25      41 0.212121   =14/66
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.132564 0.825000  40
## 2                       max f2  0.000002 0.894495  61
## 3                 max f0point5  0.132564 0.812808  40
## 4                 max accuracy  0.132564 0.787879  40
## 5                max precision  0.982938 1.000000   0
## 6                   max recall  0.000002 1.000000  61
## 7              max specificity  0.982938 1.000000   0
## 8             max absolute_mcc  0.132564 0.557317  40
## 9   max min_per_class_accuracy  0.837616 0.743590  34
## 10 max mean_per_class_accuracy  0.132564 0.774929  40
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Plotting the test performance object gives us the ROC curve, which shows approximately how good the predictions are.

plot(perf)

We also want to know the log loss, MSE and AUC values, as well as other model metrics for the test data:

h2o.logloss(perf)
## [1] 1.378809
h2o.mse(perf)
## [1] 0.2154912
h2o.auc(perf)
## [1] 0.7796771
head(h2o.metric(perf))
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  0.982938 0.050000 0.031847 0.116279 0.424242  1.000000 0.025641
## 2  0.982811 0.097561 0.063291 0.212766 0.439394  1.000000 0.051282
## 3  0.982460 0.095238 0.062893 0.196078 0.424242  0.666667 0.051282
## 4  0.982256 0.139535 0.093750 0.272727 0.439394  0.750000 0.076923
## 5  0.982215 0.181818 0.124224 0.338983 0.454545  0.800000 0.102564
## 6  0.981170 0.177778 0.123457 0.317460 0.439394  0.666667 0.102564
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.103203               0.025641                0.512821
## 2    1.000000     0.147087               0.051282                0.525641
## 3    0.962963     0.033624               0.051282                0.507123
## 4    0.962963     0.082188               0.076923                0.519943
## 5    0.962963     0.121754               0.102564                0.532764
## 6    0.925926     0.048725               0.102564                0.514245
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1  27  38   0   1 1.000000 0.974359 0.000000 0.025641   0
## 2  27  37   0   2 1.000000 0.948718 0.000000 0.051282   1
## 3  26  37   1   2 0.962963 0.948718 0.037037 0.051282   2
## 4  26  36   1   3 0.962963 0.923077 0.037037 0.076923   3
## 5  26  35   1   4 0.962963 0.897436 0.037037 0.102564   4
## 6  25  35   2   4 0.925926 0.897436 0.074074 0.102564   5

The confusion matrix alone can be seen with the h2o.confusionMatrix() function, but it is also part of the performance summary.

h2o.confusionMatrix(dl_model, test)

The final predictions with probabilities can be extracted with the h2o.predict() function. Beware, though, that the number of correct and wrong classifications can differ slightly from the confusion matrix above. Here, I combine the predictions with the actual test diagnoses and classes into a data frame. For plotting, I also want a column that tells me whether the predictions were correct. By default, a prediction probability above 0.5 gets scored as a prediction for the respective category. I often find it makes sense to be more stringent, though, and set a higher threshold. Therefore, I am creating another column with stringent predictions, where I only count predictions made with more than 80% probability. Everything below that threshold gets scored as “uncertain”. For these stringent predictions, I am also creating a column that tells me whether they were accurate.

finalRf_predictions <- data.frame(class = as.vector(test$class), actual = as.vector(test$diagnosis), as.data.frame(h2o.predict(object = dl_model, newdata = test)))

finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia", 
                                                ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", 
                                       ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))
finalRf_predictions %>%
  group_by(actual, predict) %>%
  summarise(n = n())
## Source: local data frame [4 x 3]
## Groups: actual [?]
## 
##       actual    predict     n
##       <fctr>     <fctr> <int>
## 1 arrhythmia arrhythmia    16
## 2 arrhythmia    healthy    11
## 3    healthy arrhythmia     6
## 4    healthy    healthy    33
finalRf_predictions %>%
  group_by(actual, predict_stringent) %>%
  summarise(n = n())
## Source: local data frame [6 x 3]
## Groups: actual [?]
## 
##       actual predict_stringent     n
##       <fctr>             <chr> <int>
## 1 arrhythmia        arrhythmia    19
## 2 arrhythmia           healthy     6
## 3 arrhythmia         uncertain     2
## 4    healthy        arrhythmia     8
## 5    healthy           healthy    29
## 6    healthy         uncertain     2

To get a better overview, I am going to plot the predictions (default and stringent):

p1 <- finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

p2 <- finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

grid.arrange(p1, p2, ncol = 2)

Being more stringent with the prediction threshold slightly reduced the number of errors but not by much.

I also want to know whether there are certain classes of arrhythmia that are especially prone to being misclassified:

p1 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Default predictions",
         x = "predicted to be")

p2 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict_stringent, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Stringent predictions",
         x = "predicted to be")

grid.arrange(p1, p2, ncol = 2)

There are no obvious biases towards some classes but with the small number of samples for most classes, this is difficult to assess.


Final conclusions: How useful is the model?

Most samples were classified correctly, but the total error was not particularly good. Moreover, when evaluating the usefulness of a specific model, we need to keep in mind what we want to achieve with it and which questions we want to answer. If we wanted to deploy this model in a clinical setting, it should assist with diagnosing patients. So, we need to think about what the consequences of wrong classifications would be. Would it be better to optimize for high sensitivity, in this example as many arrhythmia cases as possible get detected - with the drawback that we probably also diagnose a few healthy people? Or do we want to maximize precision, meaning that we could be confident that a patient who got predicted to have arrhythmia does indeed have it, while accepting that a few arrhythmia cases would remain undiagnosed? When we consider stringent predictions, this model correctly classified 19 out of 27 arrhythmia cases, but 6 were misdiagnosed. This would mean that some patients who were actually sick, wouldn’t have gotten the correct treatment (if decided solely based on this model). For real-life application, this is obviously not sufficient!

Next week, I’ll be trying to improve the model by doing a grid search for hyper-parameter tuning.

So, stay tuned… (sorry, couldn’t resist ;-))


If you are interested in more machine learning posts, check out the category listing for machine_learning.



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.2       tidyr_0.6.1          matrixStats_0.51.0  
##  [4] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
##  [7] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [10] ellipse_0.3-8        gridExtra_2.2.1      ggrepel_0.6.5       
## [13] ggplot2_2.2.1        sparklyr_0.5.2       dplyr_0.5.0         
## [16] h2o_3.10.3.6         rsparkling_0.1.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9          RColorBrewer_1.1-2   plyr_1.8.4          
##  [4] zlibbioc_1.20.0      bitops_1.0-6         base64enc_0.1-3     
##  [7] tools_3.3.2          digest_0.6.12        memoise_1.0.0       
## [10] RSQLite_1.1-2        jsonlite_1.2         evaluate_0.10       
## [13] tibble_1.2           gtable_0.2.0         DBI_0.5-1           
## [16] yaml_2.1.14          withr_1.0.2          httr_1.2.1          
## [19] stringr_1.2.0        knitr_1.15.1         rappdirs_0.3.1      
## [22] rprojroot_1.2        Biobase_2.34.0       R6_2.2.0            
## [25] AnnotationDbi_1.36.0 rmarkdown_1.3        magrittr_1.5        
## [28] backports_1.0.5      scales_0.4.1         htmltools_0.3.5     
## [31] assertthat_0.1       colorspace_1.3-2     labeling_0.3        
## [34] config_0.2           stringi_1.1.2        RCurl_1.95-4.8      
## [37] lazyeval_0.2.0       munsell_0.4.3

Predicting food preferences with sparklyr (machine learning)

2017-02-19 00:00:00 +0000

This week I want to show how to run machine learning applications on a Spark cluster. I am using the sparklyr package, which provides a handy interface to access Apache Spark functionalities via R.

The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines.


Apache Spark

Apache Spark™ can be used to perform large-scale data analysis workflows by taking advantage of its parallel cluster-computing setup. Because machine learning processes are iterative, running models on parallel clusters can vastly increase the speed of training. Obviously, Spark’s power really comes to bear when it is dispatched to external clusters, but for demonstration purposes, I am running the demo on a local Spark instance.

MLlib

Spark’s distributed machine learning library MLlib sits on top of the Spark core framework. It implements many popular machine learning algorithms, plus many helper functions for data preprocessing. With sparklyr you can easily access MLlib: you can work with a number of machine learning algorithms and with functions for manipulating features and Spark DataFrames. Additionally, you can perform SQL queries. sparklyr also provides a dplyr backend, making it especially convenient for handling data.
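
As a small first taste of the MLlib interface, here is a sketch that assumes the Spark connection sc created below; ml_linear_regression() is one of sparklyr’s MLlib wrappers and iris is base R’s example data:

# copy a local data frame to Spark and fit an MLlib model on it
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
fit <- ml_linear_regression(iris_tbl,
                            response = "Petal_Length",
                            features = c("Sepal_Length", "Sepal_Width"))
summary(fit)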

If you don’t have Spark installed locally, run:

library(sparklyr)
spark_install(version = "2.0.0")

Now we can connect to a local Spark instance:

library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
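To illustrate the dplyr and SQL access mentioned above, here is a quick sketch with a toy table (the iris copy and the table name are my own examples, not part of this analysis):

library(dplyr)
library(DBI)

# copy a toy dataset to Spark (sparklyr converts the dots
# in the iris column names to underscores)
iris_tbl <- copy_to(sc, iris, "iris_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL behind the scenes
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length))

# ... or we can query the same table with raw SQL
dbGetQuery(sc, "SELECT Species, COUNT(*) AS n FROM iris_spark GROUP BY Species")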


Preparations

Before I start with the analysis, I am preparing my custom ggplot2 theme and loading the packages tidyr (for gathering data for plotting), dplyr (for data manipulation) and ggrepel (for non-overlapping text labels in plots).

library(tidyr)
library(ggplot2)
library(ggrepel)
library(dplyr)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "black"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}


The Data

Of course, the power of Spark lies in speeding up operations on large datasets. But because that’s not very handy for demonstration, I am here working with a small dataset: the raw data behind The FiveThirtyEight International Food Association’s 2014 World Cup.

This dataset is part of the fivethirtyeight package and provides scores for how each person rated their preference of the dishes from several countries. The following categories could be chosen:

  • 5: I love this country’s traditional cuisine. I think it’s one of the best in the world.
  • 4: I like this country’s traditional cuisine. I think it’s considerably above average.
  • 3: I’m OK with this country’s traditional cuisine. I think it’s about average.
  • 2: I dislike this country’s traditional cuisine. I think it’s considerably below average.
  • 1: I hate this country’s traditional cuisine. I think it’s one of the worst in the world.
  • N/A: I’m unfamiliar with this country’s traditional cuisine.

Because I think that whether or not someone knows a country’s cuisine is in itself information, I recoded the NAs to 0.

library(fivethirtyeight)

food_world_cup[food_world_cup == "N/A"] <- NA
food_world_cup[, 9:48][is.na(food_world_cup[, 9:48])] <- 0
food_world_cup$gender <- as.factor(food_world_cup$gender)
food_world_cup$location <- as.factor(food_world_cup$location)

The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines, general knowledge and interest in different cuisines, age, gender, income, education level and/ or location.

Before I do any machine learning, however, I want to get to know the data. First, I am calculating the percentages for each preference category and plot them with a pie chart that is facetted by country.

# calculating percentages per category and country
percentages <- food_world_cup %>%
  select(algeria:vietnam) %>%
  gather(x, y) %>%
  group_by(x, y) %>%
  summarise(n = n()) %>%
  mutate(Percent = round(n / sum(n) * 100, digits = 2))

# rename countries & plot
percentages %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = "", y = Percent, fill = y)) + 
    geom_bar(width = 1, stat = "identity") + 
    theme_minimal() +
    coord_polar("y", start = 0) +
    facet_wrap(~ x_2, ncol = 8) +
    scale_fill_brewer(palette = "Set3") +
    labs(fill = "")

As we can already see from the distributions, there are big differences between countries. Some countries’ cuisines are very well known and liked (e.g. Italy, Mexico and the United States), while others are only known to a handful of people (e.g. Algeria, Croatia, Ghana, etc.).


Imputing missing values

In the demographic information, there are a few missing values, which I want to impute:

library(mice)

dataset_impute <- mice(food_world_cup[, -c(1, 2)],  print = FALSE)
food_world_cup <- cbind(food_world_cup[, 2, drop = FALSE], mice::complete(dataset_impute, 1))


Transforming preference values

Strictly speaking, the preference data is categorical. But because using the values as factor levels would make the models much more complex and calculation-intensive, I am converting the factors to numbers and transforming them by dividing by the mean of the non-zero values for each country.
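For a single country, the transformation works like this (toy numbers):

prefs <- c(0, 0, 3, 5)
prefs / mean(prefs[prefs > 0])  # 0.00 0.00 0.75 1.25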

food_world_cup[8:47] <- lapply(food_world_cup[8:47], as.numeric)

countries <- paste(colnames(food_world_cup)[-c(1:7)])

for (response in countries) {
  food_world_cup[paste(response, "trans", sep = "_")] <- food_world_cup[response] / mean(food_world_cup[food_world_cup[response] > 0, response])
}

As we can see by plotting the distribution of the transformed values, they are far from normally distributed. Moreover, we still see big differences between well known and less well known countries - this means that the data is unbalanced.

food_world_cup %>%
  gather(x, y, algeria_trans:vietnam_trans) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(y)) +
    geom_density(fill = "navy", alpha = 0.7) +
    my_theme() + 
    facet_wrap(~ x_2, ncol = 8) +
    labs(x = "transformed preference")


Most liked cuisines and gender biases

Then, I wanted to know which countries were the most liked and whether we see a gender bias in the preference for some countries’ cuisines. The first plot shows the sum of preference values for each country, separated by male and female answers.

food_world_cup_gather <- food_world_cup %>%
  collect %>%
  gather(country, value, algeria:vietnam)
                                 
food_world_cup_gather$value <- as.numeric(food_world_cup_gather$value)
food_world_cup_gather$country <- as.factor(food_world_cup_gather$country)
food_world_cup_gather <- food_world_cup_gather %>%
  mutate(x_2 = gsub("_", " ", country)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))

order <- aggregate(food_world_cup_gather$value, by = list(food_world_cup_gather$x_2), FUN = sum)

food_world_cup_gather %>%
  mutate(x_2 = factor(x_2, levels = order$Group.1[order(order$x, decreasing = TRUE)])) %>%
  ggplot(aes(x = x_2, y = value, fill = gender)) +
    geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "Gender",
         x = "",
         y = "sum of preferences")

Cuisines from Italy, the United States and Mexico were the most liked, while those from Ghana, Cameroon and the Ivory Coast were the least liked (that they were also less well known plays into it, of course). No country shows an obvious gender bias, though.

To dig a bit deeper into gender differences, I am calculating the differences between mean preference of males and females for each country. Here, I am using the country list that I defined earlier when transforming the values.

Because I want to pass the country list to dplyr’s functions, I need to use standard evaluation (SE). By default, dplyr uses non-standard evaluation (NSE), i.e. variable names are given without quotation marks. This makes it possible to easily convert between R and SQL code. But each dplyr function also has a version for use with SE: these have an underscore (“_”) appended to the function name, here e.g. mutate_each_().
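As a minimal illustration of the difference (using a toy data frame of my own):

library(dplyr)

df <- data.frame(a = 1:3, b = 4:6)

# NSE: columns are referred to by bare names (here: all of them)
summarise_each(df, funs(mean))

# SE: column names can be passed as a character vector
cols <- c("a", "b")
summarise_each_(df, funs(mean), cols)

The same mechanism is used with mutate_each_() and summarise_each_() in the calculation below.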

food_world_cup %>%
  collect %>%
  mutate_each_(funs(as.numeric), countries) %>%
  group_by(gender) %>%
  summarise_each_(funs(mean), countries) %>%
  summarise_each_(funs(diff), countries) %>%
  gather(x, y) %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = y)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "",
         y = "difference\nbetween gender")

All gender differences are very close to zero, so it seems that men and women generally have similar food preferences.


Spark

Now that I’m somewhat familiar with the data, I copy it to the Spark instance:

food_world_cup <- copy_to(sc, food_world_cup)


Principal Component Analysis (PCA)

One of Spark’s machine learning functions is ml_pca() for principal component analysis. I am using it to get an idea about where the countries fall on a 2-dimensional plane spanned by the first two principal components.

pca <- food_world_cup %>%
  mutate_each_(funs(as.numeric), countries) %>%
  ml_pca(features = paste(colnames(food_world_cup)[-c(1:47)]))

library(tibble)
as.data.frame(pca$components) %>%
  rownames_to_column(var = "labels") %>%
  mutate(x_2 = gsub("_trans", "", labels)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = PC1, y = PC2, color = x_2, label = x_2)) + 
    geom_point(size = 2, alpha = 0.6) +
    geom_text_repel() +
    labs(x = paste0("PC1: ", round(pca$explained.variance[1], digits = 2) * 100, "% variance"),
         y = paste0("PC2: ", round(pca$explained.variance[2], digits = 2) * 100, "% variance")) +
    my_theme() + 
    guides(fill = FALSE, color = FALSE)

The least well known countries cluster on the top right, while the most liked are on the bottom right. PC2 seems to be mainly driven by how many 0s there are in the data.


Preparing the data for machine learning

First, I need to convert the factor strings of the non-country features into indexes. To do so, I am using the ft_string_indexer() function. For other feature transformation functions, check out sparklyr’s website.

food_world_cup <- tbl(sc, "food_world_cup") %>%
  ft_string_indexer(input_col = "interest", output_col = "interest_idx") %>%
  ft_string_indexer(input_col = "gender", output_col = "gender_idx") %>%
  ft_string_indexer(input_col = "age", output_col = "age_idx") %>%
  ft_string_indexer(input_col = "household_income", output_col = "household_income_idx") %>%
  ft_string_indexer(input_col = "education", output_col = "education_idx") %>%
  ft_string_indexer(input_col = "location", output_col = "location_idx") %>%
  ft_string_indexer(input_col = "knowledge", output_col = "knowledge_idx")


Now I can divide the data into training and test sets using the sdf_partition() function.

partitions <- food_world_cup %>%
  sdf_partition(training = 0.75, test = 0.25, seed = 753)


Modeling

Now, I run the Random Forest algorithm to predict each country’s preference based on all other countries’ preferences and the demographic information. Check back with sparklyr’s website to find out about other machine learning algorithms you can use.

For each country (as response variable), I am defining the features as all other countries’ transformed values and the indexed factor variables.

Then, I am filtering out all data rows where the response variable was 0. When I left the 0s in the data, the countries with many 0s introduced a very strong bias to the prediction. Here, I needed to use standard evaluation again, this time with lazyeval’s interp() function and a formula.
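To show what interp() does here, a small standalone sketch (with “spain” as an example response):

library(lazyeval)

response <- "spain"
# builds the formula ~spain > 0 from the string "spain"
interp(~ var > 0, var = as.name(response))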

I am using the ml_random_forest() function to run classification models. Initially, I run a model on all features, then extract the 10 features with the highest importance and re-run the model on this subset of features. This improved the prediction accuracy compared to the all-feature models.

The sdf_predict() function is used to predict the classes of the test set. To obtain the quality metrics F1, weighted precision and weighted recall, I need to copy the prediction table to the Spark instance and run the ml_classification_eval() function.

Finally, I am combining the output tables from all countries’ models for comparison.

library(lazyeval)

for (response in countries) {

  features <- colnames(partitions$training)[-grep(response, colnames(partitions$training))]
  features <- features[grep("_trans|_idx", features)]

  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
  
  feature_imp <- ml_tree_feature_importance(sc, fit)
  
  features <- as.character(feature_imp[1:10, 2])
  
  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
  
  partitions$test <- partitions$test %>%
    filter_(interp(~ var > 0, var = as.name(response)))
  
  pred <- sdf_predict(fit, partitions$test) %>%
    collect
  
  pred_2 <- as.data.frame(table(pred[[response]], pred$prediction))
  pred_2$response <- response
  
  pred_sc <- select(pred, -rawPrediction, -probability)
  pred_sc <- copy_to(sc, pred_sc, overwrite = TRUE)
  
  feature_imp$response <- response
  
  f1 <- ml_classification_eval(pred_sc, response, "prediction", metric = "f1")
  wP <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedPrecision")
  wR <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedRecall")
  
  ml_eval <- data.frame(response = response,
                        f1 = f1,
                        weightedPrecision = wP,
                        weightedRecall = wR)
  
  if (response == "algeria") {
    feature_imp_df <- feature_imp
    ml_eval_df <- ml_eval
    pred_df <- pred_2
  } else {
    feature_imp_df <- rbind(feature_imp_df, feature_imp)
    ml_eval_df <- rbind(ml_eval_df, ml_eval)
    pred_df <- rbind(pred_df, pred_2)
  }
}


Model evaluation

Now, I can compare the prediction accuracies for each country’s model by plotting the F1, weighted precision and weighted recall values.

  • Precision

In classification models, precision gives the proportion of positive classifications that are correct, i.e. the “true positives” (e.g. how many of the samples that were classified as “5” were actually a “5”). If we have a high precision, this means that most of the samples our model assigned to a class do belong to that class.

  • Recall

Recall, or sensitivity, gives the proportion of all true “5” samples that were correctly classified. If we have a high recall, this means that our model detected most of the samples of each class.

  • F1 score

The F1 score is the harmonic mean of precision and recall.

Let’s say we had 10 “5s” in the test data. The model classified 8 samples as “5”, 6 of which were correct. In this example, we have 2 false positives and 4 false negatives. This means we would have a precision of 6/8 and a recall of 6/10.
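Translating this small example into code (the counts are the ones from the example above):

tp <- 6  # correctly classified "5s"
fp <- 2  # samples wrongly classified as "5"
fn <- 4  # true "5s" that were missed

precision <- tp / (tp + fp)  # 0.75
recall <- tp / (tp + fn)     # 0.6
2 * precision * recall / (precision + recall)  # F1: ~0.67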

results <- ml_eval_df %>%
  mutate(x_2 = gsub("_", " ", response)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))
  
order <- results$x_2[order(results$f1, decreasing = TRUE)]

gather(results, x, y, f1:weightedRecall) %>%
  mutate(x_2 = factor(x_2, levels = order)) %>%
  ggplot(aes(x = x_2, y = y, fill = x)) +
    geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "", color = "", x = "", y = "value")

The plot above shows that the model predicting the preference for Spanish dishes had the highest accuracy, while the models for Italy, India and Ireland had the lowest accuracy.


To see which features were used in the more and less successful models, I am plotting the values of the 10 final features for classifying Spanish, Greek and Italian food preference categories 1 - 5 in the original dataset.

feats <- feature_imp_df %>%
  filter(response == "spain") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("spain", as.character(feats$feature))) %>%
  gather(x, y, -spain) %>%
  filter(spain > 0) %>%
  group_by(spain, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = spain, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Spain")

feats <- feature_imp_df %>%
  filter(response == "greece") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("greece", as.character(feats$feature))) %>%
  gather(x, y, -greece) %>%
  filter(greece > 0) %>%
  group_by(greece, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = greece, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Greece")

feats <- feature_imp_df %>%
  filter(response == "italy") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("italy", as.character(feats$feature))) %>%
  gather(x, y, -italy) %>%
  filter(italy > 0) %>%
  group_by(italy, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = italy, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Italy")


Conclusions

This dataset was a bit difficult because the preferences were highly unbalanced, i.e. the distributions of preferences were very different between countries. Some countries’ cuisines were largely unknown, while others’ were well known and generally popular. If I kept the unknown class (0) in the prediction models, countries with many 0s jumped up in prediction accuracy. But this was mainly due to the fact that the chance of correctly classifying a sample as 0 was disproportionately high (leading to high precision). Removing the 0s while working on raw preference values showed the same effect as before, but now countries with a majority of 5s (e.g. Italy) were overly “accurate”. Transforming the data and removing the 0s led to a more evenly distributed accuracy between well known and less well known countries. But by removing them, the number of samples used for modeling varied strongly between countries, introducing another type of bias.

It was not possible to avoid bias entirely with this dataset. But I wanted to show it as an example nonetheless. Usually, machine learning examples show datasets where the models worked very well, leaving the reader in awe of the powers of machine learning. And while there are certainly powerful and impressive prediction models, real-life data is not always as simple. As can be seen with this dataset, it’s not always straightforward to build a good classification model, even though you’d intuitively assume that it should be easy to predict food preferences based on which other cuisines someone likes…

The model could potentially be improved by feature engineering, modeling the preferences as factors and/ or trying different algorithms, but I’ll leave this for another time.


Next week, I’ll be looking into how to use the h2o package with Spark using rsparkling.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tibble_1.2            fivethirtyeight_0.1.0 dplyr_0.5.0          
## [4] ggrepel_0.6.5         ggplot2_2.2.1         tidyr_0.6.1          
## [7] sparklyr_0.5.1       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        knitr_1.15.1       magrittr_1.5      
##  [4] rappdirs_0.3.1     munsell_0.4.3      colorspace_1.3-2  
##  [7] R6_2.2.0           plyr_1.8.4         stringr_1.1.0     
## [10] httr_1.2.1         tools_3.3.2        parallel_3.3.2    
## [13] grid_3.3.2         gtable_0.2.0       config_0.2        
## [16] DBI_0.5-1          withr_1.0.2        htmltools_0.3.5   
## [19] lazyeval_0.2.0     yaml_2.1.14        assertthat_0.1    
## [22] rprojroot_1.2      digest_0.6.12      RColorBrewer_1.1-2
## [25] base64enc_0.1-3    evaluate_0.10      rmarkdown_1.3     
## [28] labeling_0.3       stringi_1.1.2      scales_0.4.1      
## [31] backports_1.0.5    jsonlite_1.2

Conditional ggplot2 geoms in functions (QTL plots)

2017-02-12 00:00:00 +0000

When running an analysis, I usually combine functions from multiple packages. Most of these packages come with their own plotting functions. And while they are certainly convenient for getting a quick glance at the data or the output, they all have their own style. If I want to prepare a report, proposal or paper, though, I want all my plots to come from a single mold so that they give a consistent feel to the story I want to tell with my data.

I have grown very fond of ggplot2 for producing plots, because of its tidy grammar, flexibility and easy customization. Here, I want to show an example of my own ggplot2 function to produce QTL plots for Karl Broman’s qtl package.

Specifically, I want to show how to incorporate conditional geoms when using ggplot2 in a function call.
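The basic pattern looks like this (a minimal sketch with toy data and a made-up show_points flag):

library(ggplot2)

df <- data.frame(x = 1:10, y = rnorm(10))
show_points <- TRUE

# each { } block evaluates either to a geom or to NULL;
# adding NULL to a ggplot object leaves the plot unchanged
ggplot(df, aes(x, y)) + {
  if (show_points) geom_point()
} + geom_line()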

I will show how my function can be used with the output from qtl’s scanone() function by following the examples from the package tutorial. The tutorial does an excellent job describing the workflow, so I am not going to explain it here.

library(qtl)


For defining the function and preparing the data for plotting, I am working with ggplot2, ggrepel (for adding text labels that don’t overlap), tidyr and dplyr. This is my function code:

library(ggplot2)
library(ggrepel)
library(tidyr)
library(dplyr)

qtl_plot <- function(input,              # data frame input from scanone
                     mult.pheno = FALSE, # multiple phenotypes?
                     model = "normal",   # model used in scanone
                     chrs = NA,          # chromosomes to display
                     lod = NA,           # LOD threshold
                     rug = FALSE,        # plot marker positions as rug?
                     ncol = NA,          # number of columns for facetting
                     labels = NA         # optional dataframe to plot QTL labels
                     ) {
                     
                  # if we have multiple phenotypes, gather the pheno columns;
                  # for a single-phenotype 2-part model, gather the three LOD columns
                  if (mult.pheno) {
                    input <- gather(input, group, lod, grep("pheno", colnames(input)))
                  } else if (model == "2part") {
                    input <- gather(input, method, lod, lod.p.mu:lod.mu)
                  }
    
                  # if not all chromosomes should be displayed, subset input
                  if (!is.na(chrs)[1]) {
                    input <- input[as.character(input$chr) %in% chrs, ]
                  }

                  # if there is no column named "lod", take the lod-like column instead
                  if (!any(colnames(input) == "lod")) {
                    input$lod <- input[, grep("lod", colnames(input))]
                  }

                  # if no number of columns for facetting is defined, plot all in one row
                  if (is.na(ncol)) {
                    ncol <- length(unique(input$chr))
                  }

                  # if labels are set and there is no name column, set from rownames
                  if (!is.na(labels)[1]) {
                    if (is.null(labels$name)) {
                      labels$name <- rownames(labels)
                    }
                  }

                  # plot input data frame position and LOD score
                  plot <- ggplot(input, aes(x = pos, y = lod)) + {

                    # if LOD threshold is given, plot as horizontal line
                    if (!is.na(lod)[1] & length(lod) == 1) geom_hline(yintercept = lod, linetype = "dashed")
                  } + {

                    if (!is.na(lod)[1] & length(lod) > 1) geom_hline(data = lod, aes(yintercept = lod, linetype = group))
                  } + {

                    # plot rug on bottom, if TRUE
                    if (rug) geom_rug(size = 0.1, sides = "b")
                  } + {

                    # if input has column method but not group, plot line and color by method
                    if (!is.null(input$method) & is.null(input$group)) geom_line(aes(color = method), size = 2, alpha = 0.6)
                  } + {

                    # if input has column group but not method, plot line and color by group
                    if (!is.null(input$group) & is.null(input$method)) geom_line(aes(color = group), size = 2, alpha = 0.6)
                  } + {

                    # if input has columns method and group, plot line and color by method & linetype by group
                    if (!is.null(input$group) & !is.null(input$method)) geom_line(aes(color = method, linetype = group), size = 2, alpha = 0.6)
                  } + {

                    # set linetype, if input has columns method and group
                    if (!is.null(input$group) & !is.null(input$method)) scale_linetype_manual(values = c("solid", "twodash", "dotted"))
                  } + {

                    # if input has neither columns method nor group, plot black line
                    if (is.null(input$group) & is.null(input$method)) geom_line(size = 2, alpha = 0.6)
                  } + {

                    # if QTL positions are given in labels df, plot as point...
                    if (!is.na(labels)[1]) geom_point(data = labels, aes(x = pos, y = lod))
                  } + {

                    # ... and plot name as text with ggrepel to avoid overlapping
                    if (!is.na(labels)[1]) geom_text_repel(data = labels, aes(x = pos, y = lod, label = name), nudge_y = 0.5)
                  } + 
                    # facet by chromosome
                    facet_wrap(~ chr, ncol = ncol, scales = "free_x") +
                    # minimal plotting theme
                    theme_minimal() +
                    # increase strip title size
                    theme(strip.text = element_text(face = "bold", size = 12)) +
                    # use RcolorBrewer palette
                    scale_color_brewer(palette = "Set1") +
                    # Change plot labels
                    labs(x = "Chromosome",
                         y = "LOD",
                         color = "",
                         linetype = "")

                  print(plot)
}


The first example uses the hyper data set and builds a simple QTL model with three modeling functions: the EM algorithm, Haley-Knott regression and multiple imputation. The genome wide LOD threshold is calculated with permutation. Feeding this LOD threshold into the summary output gives us the markers with a significant phenotype association (i.e. the QTL).

data(hyper)

hyper <- est.rf(hyper)

hyper <- calc.genoprob(hyper, step = 1, error.prob = 0.01)
out.em <- scanone(hyper, model = "normal")
out.hk <- scanone(hyper, method = "hk", model = "normal")

hyper <- sim.geno(hyper, step = 2, n.draws = 16, error.prob = 0.01)
out.imp <- scanone(hyper, method = "imp", model = "normal")

operm.hk <- scanone(hyper, pheno.col = 1, model = "normal", method = "hk", n.perm = 1000, verbose = FALSE)

lod_threshold <- summary(operm.hk, alpha = 0.05)
labels_df <- as.data.frame(summary(out.hk, perms = operm.hk, alpha = 0.05, pvalues = TRUE))

The most basic version of my plotting function takes as input a data frame from the scanone() function: it needs to have columns with chromosome information, position and LOD score. Without defining anything else, the function will return a line plot of the LOD scores for each marker position on each chromosome in the input data.

qtl_plot(input = out.hk)

If we want to compare the output from e.g. different QTL models, we can follow the tidyverse principle by combining the output data frames in long format and adding a column with the respective method names. If the input has a method column, the function will color the lines accordingly. We can also specify which chromosomes to display (here, 1 and 4) and give a LOD threshold to plot as a horizontal dashed line. If we want to see each marker position, we can specify rug = TRUE to plot them underneath the line plots.

And we can also plot the markers with a significant phenotype association (the QTL) by adding a data frame with the labeling information to the function call. This data frame needs to contain the chromosome, position and LOD score information. If there is no name column, it will take the row names (which usually give the marker name) as labels.

qtl_plot(input = rbind(data.frame(out.em, method = "EM algorithm"), 
                       data.frame(out.hk, method = "Haley-Knott regression"), 
                       data.frame(out.imp, method = "Multiple imputation")), 
         chrs = c(1, 4), 
         lod = lod_threshold[1], 
         rug = TRUE, 
         labels = labels_df)


The second example I want to show is from the listeria data. This time, a slightly more complex 2-part model is built.

data(listeria)

listeria$pheno$logSurv <- log(listeria$pheno[,1])
listeria <- est.rf(listeria)

newmap <- est.map(listeria, error.prob = 0.01)
listeria <- replace.map(listeria, newmap)

listeria <- calc.errorlod(listeria, error.prob = 0.01)

listeria <- calc.genoprob(listeria, step = 2)
out.2p <- scanone(listeria, pheno.col = 3, model = "2part", upper = TRUE)

operm.2p <- scanone(listeria, model = "2part", pheno.col = 3, upper = TRUE, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.2p, alpha = 0.05)

labels_df <- as.data.frame(summary(out.2p, format = "allpeaks", perms = operm.2p, alpha = 0.05, pvalues = TRUE))

The output from a 2-part model looks different than that from the other model functions: instead of only one LOD score column, we now get three columns named lod.p.mu, lod.p and lod.mu (check back with the tutorial if you want to know what they mean). The plotting function is built so that if you set the model parameter to “2part”, each of the three LOD columns will be plotted in a different color.

If we also want to plot the QTL labels, we need to prepare the label data frame before feeding it into the plotting function:

p.mu <- labels_df[, 1:4]
colnames(p.mu)[3] <- "lod"
p.mu$name <- "p.mu"

p <- labels_df[, c(1, 5:7)]
colnames(p)[3] <- "lod"
p$name <- "p"

mu <- labels_df[, c(1, 8:10)]
colnames(mu)[3] <- "lod"
mu$name <- "mu"

labels_df <- rbind(p.mu, p, mu)

If the chromosomes are not in the right order, we can reorder the factor labels:

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
out.2p$chr <- factor(out.2p$chr, levels = f)
labels_df$chr <- factor(labels_df$chr, levels = f)

The parameter ncol lets you define how many facet columns you want. Here, I want two rows and three columns to display the six chromosomes I chose for plotting:

qtl_plot(input = out.2p, 
         model = "2part",
         chrs = c(1, 3, 5:6, 13, 15),
         lod = lod_threshold[1], 
         rug = TRUE, 
         ncol = 3,
         labels = labels_df)


Binary models are again treated the same way as e.g. normal models:

y <- listeria$pheno$logSurv
my <- max(y, na.rm = TRUE)
z <- as.numeric(y == my)
y[y == my] <- NA
listeria$pheno$logSurv2 <- y
listeria$pheno$binary <- z

out.p <- scanone(listeria, pheno.col = 5, model = "binary")
out.np1 <- scanone(listeria, model = "np", ties.random = TRUE)
out.np2 <- scanone(listeria, model = "np", ties.random = FALSE)
qtl_plot(input = rbind(data.frame(out.p, method = "binary"),
                       data.frame(out.np1, method = "np1"),
                       data.frame(out.np2, method = "np2")))


The fake.bc data is used to show how to run QTL models with multiple phenotypes simultaneously and how to add additive or interactive covariates.

data(fake.bc)

fake.bc <- calc.genoprob(fake.bc, step = 2.5)
out.nocovar <- scanone(fake.bc, pheno.col = 1:2)

If the parameter mult.pheno is set to TRUE, the plotting function gathers the phenotype columns and plots each phenotype in a different color.

qtl_plot(input = out.nocovar, mult.pheno = TRUE)

If we specify both mult.pheno = TRUE and model = “2part”, the function plots each LOD score for each phenotype in a different color. If you wanted to plot only the lod.p column for example, you could subset the input data frame before feeding it into the plotting function.

out.nocovar.2p <- scanone(fake.bc, pheno.col = 1:2, model = "2part", upper = TRUE)

qtl_plot(input = out.nocovar.2p,
         chrs = c(2, 5),
         mult.pheno = TRUE, 
         model = "2part")


If we run a multiple-phenotype model and want to determine the QTL positions, we get a separate LOD threshold for each of the phenotypes and, accordingly, separate sets of significantly associated markers.

sex <- fake.bc$pheno$sex
out.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex)

out.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex)

out.sexint <- out.icovar - out.acovar

operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.icovar, alpha = 0.05)

labels_df <- as.data.frame(summary(out.icovar, perms = operm.icovar, alpha = 0.05, pvalues = TRUE))

This means that, if we want to plot QTL labels for both phenotypes, we first need to prepare the label data frame:

pheno1 <- labels_df[, 1:4]
colnames(pheno1)[3] <- "lod"
pheno1$name <- paste("pheno1", rownames(pheno1), sep = "\n")

pheno2 <- labels_df[, c(1, 2, 5, 6)]
colnames(pheno2)[3] <- "lod"
pheno2$name <- paste("pheno2", rownames(pheno1), sep = "\n")

labels_df <- rbind(pheno1, pheno2)

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
labels_df$chr <- factor(labels_df$chr, levels = f)

If different methods and groups are detected in the input data frame, the methods will be drawn with different colors, while the groups will have different line types. If we want to plot two LOD thresholds for the two phenotypes, we can give a data frame to the function that has a group column that corresponds to the phenotypes of the input data frame. The LOD threshold will now be plotted in the same line type as the phenotype LOD score plot.

qtl_plot(input = rbind(data.frame(out.acovar, method = "acovar"),
                       data.frame(out.icovar, method = "icovar"),
                       data.frame(out.sexint, method = "sexint")), 
         chrs = c(2, 5, 17),
         mult.pheno = TRUE,
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1:2]),
         labels = labels_df)


For QTL models with covariates, the same principles apply:

seed <- ceiling(runif(1, 0, 10^8))
set.seed(seed)
operm.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

set.seed(seed)
operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

operm.sexint <- operm.icovar - operm.acovar

lod_threshold <- summary(operm.sexint, alpha = c(0.05, 0.20))
labels <- as.data.frame(summary(out.sexint, perms = operm.sexint, alpha = 0.1, format = "allpeaks", pvalues = TRUE))
pheno1 <- data.frame(labels[, 1:4])
colnames(pheno1)[3] <- "lod"
pheno1$name <- "p1"

pheno2 <- data.frame(labels[, c(1, 5:7)])
colnames(pheno2)[3] <- "lod"
pheno2$name <- "p2"

labels_df <- rbind(pheno1, pheno2)
qtl_plot(input = out.sexint, 
         mult.pheno = TRUE,
         labels = labels_df,
         chrs = c(2, 7, 17), 
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1, ]),
         ncol = 3)

hyper <- sim.geno(hyper, step = 2.5, n.draws = 16, err = 0.01)
g <- pull.geno(fill.geno(hyper))[, "D4Mit164"]
out1.c4i <- scanone(hyper, method = "imp", addcovar = g, intcovar = g)
qtl_plot(input = out1.c4i)


The same principles also apply to model fitting:

qc <- c(1, 1, 4, 6, 15)
qp <- c(43.3, 78.3, 30.0, 62.5, 18.0)
qtl <- makeqtl(hyper, chr = qc, pos = qp)
myformula <- y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5
out.fq <- fitqtl(hyper, qtl = qtl, formula = myformula, drop = FALSE, get.ests = TRUE)

revqtl <- refineqtl(hyper, qtl = qtl, formula = myformula, verbose = FALSE)
out1.c4r <- addqtl(hyper, qtl = revqtl, formula = y ~ Q3)

out2.c4r <- addpair(hyper, qtl = revqtl, formula = y ~ Q3, chr = c(1, 6, 7, 15), verbose = FALSE)
out.iw4 <- addqtl(hyper, qtl = revqtl, formula = y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5 + Q6 + Q5:Q6)
out.1more <- addqtl(hyper, qtl = revqtl, formula = myformula)

qtl2 <- addtoqtl(hyper, revqtl, 7, 53.6)
qtl3 <- dropfromqtl(qtl2, index = 2)
qtl4 <- replaceqtl(hyper, qtl3, index = 1, chr = 1, pos = 50)
qtl5 <- reorderqtl(qtl4, c(1:3, 5, 4))

stepout.i <- stepwiseqtl(hyper, max.qtl = 6, verbose = FALSE)

Here, we also need to prepare the label data frame before the plotting function can handle it:

label_df <- data.frame(name = stepout.i$name, chr = stepout.i$chr, pos = stepout.i$pos)

qtl1 <- attr(stepout.i, "lodprofile")[[1]]
qtl4 <- attr(stepout.i, "lodprofile")[[2]]
qtl6 <- attr(stepout.i, "lodprofile")[[3]]
qtl15 <- attr(stepout.i, "lodprofile")[[4]]

input_df <- rbind(qtl1, qtl4, qtl6, qtl15)
input_df$marker <- rownames(input_df)
input_df$name <- paste(input_df$chr, round(input_df$pos, digits = 1), sep = "@")

label_df$pos <- as.numeric(gsub("(^[0-9]+)(.0$)", "\\1", label_df$pos))
label_df$name <- paste(label_df$chr, round(label_df$pos, digits = 1), sep = "@")

label_df <- left_join(label_df, input_df[, 3:5, drop = FALSE], by = "name")
label_df$name <- paste(label_df$name, label_df$marker, sep = "\n")
qtl_plot(input = input_df,
         labels = label_df)


These examples show only the simple QTL models. The qtl package also gives examples for building scantwo() models, but those cannot be plotted in the same way as scanone() output and thus won’t be covered here.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.5.0   tidyr_0.6.0   ggrepel_0.6.5 ggplot2_2.2.1 qtl_1.40-8   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8        knitr_1.15.1       magrittr_1.5      
##  [4] munsell_0.4.3      colorspace_1.3-2   R6_2.2.0          
##  [7] stringr_1.1.0      plyr_1.8.4         tools_3.3.2       
## [10] parallel_3.3.2     grid_3.3.2         gtable_0.2.0      
## [13] DBI_0.5-1          htmltools_0.3.5    yaml_2.1.14       
## [16] lazyeval_0.2.0     rprojroot_1.1      digest_0.6.11     
## [19] assertthat_0.1     tibble_1.2         RColorBrewer_1.1-2
## [22] codetools_0.2-15   evaluate_0.10      rmarkdown_1.3     
## [25] labeling_0.3       stringi_1.1.2      scales_0.4.1      
## [28] backports_1.0.4

Scratching the Surface of Gender Biases

2017-02-06 00:00:00 +0000

Today, I want to share my analysis of the World Gender Statistics dataset.

Last week I already introduced my Shiny app, where you can explore 160 measurements for 164 countries over 56 years. This week I’ve included a statistical analysis of these countries and measurements and put some finishing touches on the app.

Just in case anybody needed proof that gender bias is not only a problem of “other countries”, my analysis showed very nicely that the Western world isn’t (yet, I hope) gender neutral in regards to many statistics!

You can also run the app from GitHub with the shiny package; the repository also contains the app’s source code.

library(shiny)
runGitHub("ShirinG/WGS_app") 


The data

The data was downloaded from The World Bank’s Open Data project via Kaggle. The main data table shows

  • Country.Name: the name of the country
  • Country.Code: the country’s code
  • Indicator.Name: the name of the variable that this row represents
  • Indicator.Code: a unique id for the variable
  • 1960 - 2016: one column EACH for the value of the variable in each year it was available

Unfortunately, the dataset doesn’t include a column that indicates which female/ male statistics belong together, and there is no consistent naming scheme either. Therefore, I chose to focus on those statistics where the counterparts could be easily extracted: indicator codes that differed only in containing “.FE” or “.MA”. I then split the subsetted dataframe into female and male parts and produced a third dataset with the ratios between male and female values. To calculate the ratios, I added 0.001 to each data point, so that zero values would not produce infinite or undefined ratios.
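A quick toy calculation shows the effect of the offset:

# without the offset, a zero denominator would give Inf or NaN
(2 + 0.001) / (0 + 0.001)  # large but finite ratio
(0 + 0.001) / (0 + 0.001)  # 1, i.e. "no difference"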

The finished datasets were saved as “.RData” files, so that I could easily load them into the Shiny app.

dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

dataset_fem <- dataset_subs[grep("female", dataset_subs$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Name <- gsub(",", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

dataset_male <- dataset_subs[-grep("female", dataset_subs$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Name <- gsub(",", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".MA", "", dataset_male$Indicator.Code)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")
length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  code <- unique(dataset_fem$Indicator.Name)[n]
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    if (i == 1) {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- cbind(male[i, c(1:4)], diff)
    } else {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
    }
  }
  
  if (n == 1) {
    diff_table_bind <- diff_table
  } else {
    diff_table_bind <- rbind(diff_table_bind, diff_table)
  }
}

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")


The world map

The map was downloaded from the Natural Earth Data website. The country borders were reduced by 200 meters with ArcGIS Pro, so that clicking anywhere within a country on the map returns that country’s border as the nearest point. ArcGIS Pro was also used to convert the map to a Mercator projection. The changed shapefiles can be downloaded from my Github repository.

library(rgdal)
library(plyr)
library(scales)

wmap_countries <- readOGR(dsn = "shapefiles/changed_borders", layer = "ne_110m_admin_0_countries_smaller_wm")
wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")
wmap_countries_df_final$adm0_a3 <- as.character(wmap_countries_df_final$adm0_a3)
save(wmap_countries_df_final, file = "wmap_countries_smaller_df_final.RData")

diff_table_bind <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]
save(diff_table_bind, file = "diff_table_bind.RData")

countries <- as.character(unique(diff_table_bind$Country.Name))
save(countries, file = "countries.RData")
library(dplyr)
library(tidyr)

library(ggplot2)

map_theme <- list(theme(panel.grid.minor = element_blank(),
                        panel.grid.major = element_blank(),
                        panel.background = element_blank(),
                        plot.background = element_rect(fill = "white"),
                        panel.border = element_blank(),
                        axis.line = element_blank(),
                        axis.text.x = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks = element_blank(),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size = 18)))

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

colfunc <- colorRampPalette(c("yellow", "red"))


Which countries are most biased?

To explore differences between countries and statistics, I prepared a dataset that contains the first and last non-NA value in each time series of the male/ female ratio data (per statistic and country), in order to test whether the change over time was statistically significant. As for the world map plots, I am using the log2 of the ratios, so that ratios bigger and smaller than 1 can be more easily compared.
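The advantage of the log2 scale is that it is symmetric around 0, e.g.:

log2(2)    #  1 -> value twice as high for males
log2(0.5)  # -1 -> value twice as high for females
log2(1)    #  0 -> no bias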

You can explore which countries have the strongest bias and the biggest change for each statistic in the app under the tab “Latest ratios”.

year_table <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]

# last non-NA value
last_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[length(na.omit(x))]
  )
last_val_df <- data.frame(unlist(last_val))
last_val_df$year_of_val <- gsub("([0-9]+)(.X)([0-9]+)", "\\3", rownames(last_val_df))
rownames(last_val_df) <- gsub("([0-9]+)(.X)([0-9]+)", "\\1", rownames(last_val_df))

# first non-NA value
first_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[1]
  )
first_val_df <- data.frame(na.omit(first_val))

year_table_last_val <- cbind(year_table[rownames(last_val_df), 1:3], last_val_df, first_val_df)

year_table_last_val$difference <- (year_table_last_val$unlist.last_val. + 0.001) / (year_table_last_val$na.omit.first_val. + 0.001)
save(year_table_last_val, file = "year_table_last_val.RData")

To identify the countries with the strongest biases towards either men or women, I calculated the absolute log2 ratio of male/ female values and subsetted the 10 countries with the highest absolute bias for each statistic. I then counted for how many statistics each country appeared in this top 10.

for (i in 1:length(measures)) {
  
  m <- measures[i]
  
  subs <- subset(year_table_last_val, Indicator.Name == m)[, c(1, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  subs$measure <- m
    
  if (i == 1) {
    subs_comb <- subs[1:10, ]
    
  } else {
    subs_comb <- rbind(subs_comb, subs[1:10, ])
  }
}
subs_comb_ordered <- arrange(as.data.frame(table(subs_comb$Country.Name)), desc(Freq))

subs_comb$Country.Name <- factor(subs_comb$Country.Name, levels = subs_comb_ordered$Var1)

ggplot(subs_comb, aes(x = Country.Name)) +
  geom_bar(fill = "navyblue", alpha = 0.7) +
  coord_flip() +
  my_theme() +
  labs(title = "Most biased countries",
       subtitle = "For how many statistics were the countries in top 10 with highest absolute bias",
       x = "")

This plot shows that there are quite a few European/ Western countries with high biases in some statistics. For example:

arrange(subs_comb[which(subs_comb$Country.Name %in% c("Germany", "Netherlands", "United States", "United Kingdom")), - 2], 
        Country.Name, desc(abs_ratio))
##      Country.Name abs_ratio
## 1     Netherlands 1.4678833
## 2     Netherlands 1.4674207
## 3     Netherlands 1.3615747
## 4     Netherlands 1.1248210
## 5     Netherlands 0.9993447
## 6     Netherlands 0.6601651
## 7     Netherlands 0.5512176
## 8     Netherlands 0.1847042
## 9         Germany 8.2336197
## 10        Germany 2.0428320
## 11        Germany 1.1641113
## 12        Germany 1.0957438
## 13        Germany 0.9825083
## 14        Germany 0.7356144
## 15        Germany 0.6274894
## 16 United Kingdom 1.0932072
## 17 United Kingdom 0.8659095
## 18 United Kingdom 0.4347145
## 19  United States 1.3212071
##                                                                                    measure
## 1        Educational attainment completed Doctoral or equivalent population 25+ years  (%)
## 2  Educational attainment competed Doctoral or equivalent population 25+  (%) (cumulative)
## 3                              Borrowed from a store by buying on credit  (% age 15+) [w1]
## 4                           Prevalence of stunting height for age  (% of children under 5)
## 5                        Prevalence of underweight weight for age  (% of children under 5)
## 6                                                       Loan in the past year  (% age 15+)
## 7                      Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 8                                                     Progression to secondary school  (%)
## 9                  Prevalence of severe wasting weight for height  (% of children under 5)
## 10                                          Part time employment  (% of total  employment)
## 11                                         Time-related underemployment  (% of employment)
## 12              Educational attainment completed lower secondary population 25+ years  (%)
## 13         Educational attainment completed short-cycle tertiary population 25+ years  (%)
## 14                                       Borrowed from family or friends  (% age 15+) [ts]
## 15                     Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 16                               Borrowed for health or medical purposes  (% age 15+) [w2]
## 17       Average number of hours spent on unpaid domestic work (housework and child care) 
## 18                                   Borrowed any money in the past year  (% age 15+) [w2]
## 19                          Prevalence of stunting height for age  (% of children under 5)

So, let’s take a closer look at the countries: I am calculating the sum of absolute male/ female ratios (again log2) for each country and showing it on the map:

for (i in 1:length(countries)) {
  
  c <- countries[i]
  
  subs <- subset(year_table_last_val, Country.Name == c)[, c(2, 3, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  if (i == 1) {
    subs_country <- data.frame(country = c, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio))
    
  } else {
    subs_country <- rbind(subs_country, data.frame(country = c, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio)))
  }
}
left_join(subset(wmap_countries_df_final, !continent == "Antarctica"), subs_country, by = c("adm0_a3" = "Country.Code")) %>%
  ggplot(aes(long, lat, group = group, fill = sum_abs_ratio)) +
        coord_equal() +
        map_theme +
        geom_polygon() +
        geom_path(color = "white", size = 0.5) +
        labs(title = "General gender bias per country",
             subtitle = "Sum of absolute gender biases for all statistics",
             fill = "sum of absolute\nlog2 of male / female") +
        scale_fill_gradient2(low = "blue", mid = "blue", high = "red")


Statistical Analysis

You can explore the results for the individual statistics in the app under “Analysis - Plots” and “Analysis - Tests”.


The variables

The world data contains information on each country. I used the following variables to test for statistical significance:

  1. economic status,
  2. income group,
  3. estimated population size,
  4. estimated gross domestic product (GDP) and
  5. continent.

Because not all data was normally distributed, I explored non-parametric tests as well as an ANOVA. The plots for each statistic can be explored in my app, so here I am only showing the results for which measures were statistically significant for the different variables.


Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank test can be used to test whether observations from repeated measures differ significantly from each other. Here, I am using it to test whether the most recent recorded value differs significantly from the first recorded value. Because I am testing the same data multiple times, I am correcting the p-values and only considering them significant at a False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).
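The FDR correction itself is a one-liner with p.adjust() (toy p-values for illustration):

p_values <- c(0.001, 0.02, 0.04, 0.2)
p.adjust(p_values, method = "fdr")  # Benjamini-Hochberg adjusted p-values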

country.inf <- subset(wmap_countries_df_final[, c(17, 25, 26, 31, 32, 42, 43, 47, 48, 62)], !continent == "Antarctica")
country.inf <- country.inf[!duplicated(country.inf), ]
for (i in 1:length(measures)) {
  m <- measures[i]
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))
  table <- wilcox.test(log2(stats_test$na.omit.first_val.), log2(stats_test$unlist.last_val.), paired = TRUE) 
  
  table <- data.frame(Indicator.Name = paste(m),
                      diff = (median(stats_test$unlist.last_val.) + 0.0001) / (median(stats_test$na.omit.first_val.) + 0.0001),
                      V = table$statistic,
                      p.value = table$p.value,
                      alternative = table$alternative)

  if (i == 1) {
    table_wilcox <- table
  } else {
    table_wilcox <- rbind(table_wilcox, table)
  }
}

table_wilcox$p.adj <- p.adjust(table_wilcox$p.value, method = "fdr")
table_wilcox$significant <- as.factor(ifelse(table_wilcox$p.adj < 0.1, "significant", "non-significant"))

sig <- subset(table_wilcox, significant == "significant")
left_join(subset(year_table_last_val, Indicator.Name %in% sig$Indicator.Name), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  gather(x, y, unlist.last_val., na.omit.first_val.) %>%
  ggplot(aes(x = Indicator.Name, y = log2(y), color = x, fill = x)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  coord_flip() +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference over time",
    subtitle = "Wilcoxon Signed-Rank Test",
    x = "",
    y = "log2(male/ female)")


Kruskal-Wallis Test

The Kruskal-Wallis test is a simple non-parametric statistical test for comparing more than two groups. I used it here to test for each variable of interest whether it has a statistically significant effect on the most recent recorded value. Because I am testing the same data multiple times, I am again correcting the p-values and only consider them significant at a False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).

for (i in 1:length(measures)) {
  m <- measures[i]
  
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))

  # extract the numeric rank from the economy and income group labels
  stats_test$economy_rank <- as.numeric(as.character(gsub("(^[1-7])(.*)", "\\1", stats_test$economy)))
  stats_test$income_grp_rank <- as.numeric(as.character(gsub("(^[1-5])(.*)", "\\1", stats_test$income_grp)))

  # Kruskal-Wallis test of the most recent value against each grouping variable
  k_economy_rank <- kruskal.test(log2(unlist.last_val.) ~ economy_rank, data = stats_test)
  k_income_rank <- kruskal.test(log2(unlist.last_val.) ~ income_grp_rank, data = stats_test)
  k_pop <- kruskal.test(log2(unlist.last_val.) ~ pop_est, data = stats_test)
  k_gdp <- kruskal.test(log2(unlist.last_val.) ~ gdp_md_est, data = stats_test)
  k_continent <- kruskal.test(log2(unlist.last_val.) ~ continent, data = stats_test)

  kruskal_last_val <- data.frame(
    group = c("economy (rank)", "income_grp (rank)", "pop_est", "gdp_md_est", "continent"),
    p.val = c(k_economy_rank$p.value, k_income_rank$p.value, k_pop$p.value, k_gdp$p.value, k_continent$p.value))

  kruskal_last_val$p.adj <- p.adjust(kruskal_last_val$p.val, method = "fdr")

  # collect the adjusted p-values for all measures in one wide dataframe
  if (i == 1) {
    kruskal_last_val_df <- kruskal_last_val[, -2]
    colnames(kruskal_last_val_df)[2] <- paste(m)
  } else {
    pre <- kruskal_last_val[, -2]
    colnames(pre)[2] <- paste(m)
    
    kruskal_last_val_df <- left_join(kruskal_last_val_df, pre, by = "group")
  }
}
eco <- t(kruskal_last_val_df[1, -1])
eco_sig <- eco[which(eco < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(eco_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = economy, fill = economy)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in economy",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

inc <- t(kruskal_last_val_df[2, -1])
inc_sig <- inc[which(inc < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(inc_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = income_grp, fill = income_grp)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in income group",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

cont <- t(kruskal_last_val_df[5, -1])
cont_sig <- cont[which(cont < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(cont_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = continent, fill = continent)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in continent",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

For population and GDP, there were no statistically significant measures.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 tidyr_0.6.1   dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9      digest_0.6.12    rprojroot_1.2    assertthat_0.1  
##  [5] plyr_1.8.4       grid_3.3.2       R6_2.2.0         gtable_0.2.0    
##  [9] DBI_0.5-1        backports_1.0.5  magrittr_1.5     scales_0.4.1    
## [13] evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0   rmarkdown_1.3   
## [17] labeling_0.3     tools_3.3.2      stringr_1.1.0    munsell_0.4.3   
## [21] yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5  knitr_1.15.1    
## [25] tibble_1.2

New features in World Gender Statistics app

2017-01-30 00:00:00 +0000

In my last post, I built a Shiny app to explore World Gender Statistics.

To make it a bit nicer and more convenient, I added a few more features:

  • The drop-down menu for Years is now reactive, i.e. it only shows options with data (all NA years are removed; see the sketch after this list)
  • You can click on any country on the map to get information about which country it is, its population size, income group and region
  • Below the world map, you can look at timelines for male vs female values for each country
  • The map is now in Mercator projection
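
For illustration, here is a minimal sketch of how such a reactive drop-down can be wired up in Shiny; the objects measures and diff_table_bind come from the data preparation in the next post below, everything else is simplified and not the actual app code:

library(shiny)

ui <- fluidPage(
  selectInput("measure", "Statistic", choices = measures),
  uiOutput("year_selector")
)

server <- function(input, output) {
  # re-render the year drop-down whenever the selected statistic changes
  output$year_selector <- renderUI({
    subs <- subset(diff_table_bind, Indicator.Name == input$measure)
    year_cols <- grep("^X[0-9]{4}$", colnames(subs), value = TRUE)
    # keep only years in which at least one country has a value
    years_with_data <- year_cols[colSums(!is.na(subs[, year_cols])) > 0]
    selectInput("year", "Year", choices = gsub("X", "", years_with_data))
  })
}

shinyApp(ui = ui, server = server)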

Exploring World Gender Statistics with Shiny

2017-01-29 00:00:00 +0000

This week I explored the World Gender Statistics dataset. You can look at 160 measurements over 56 years with my Shiny app here.


I prepared the data as follows:

Data.csv

  • Country.Name: the name of the country
  • Country.Code: the country’s code
  • Indicator.Name: the name of the variable that this row represents
  • Indicator.Code: a unique id for the variable
  • 1960 - 2016: one column EACH for the value of the variable in each year it was available
dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

# female rows: strip the gender part from indicator name and code
dataset_fem <- dataset_subs[grep("female", dataset_subs$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

# male rows: everything that is not "female"
dataset_male <- dataset_subs[-grep("female", dataset_subs$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".MA", "", dataset_male$Indicator.Code)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")
length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

# for each indicator, compute the male/female ratio per country and year;
# the rows of fem and male align because both are arranged by Country.Code
for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  
  code <- unique(dataset_fem$Indicator.Name)[n]
  
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    
    if (i == 1) {
      
      # columns 5 to 61 hold the yearly values
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- cbind(male[i, c(1:4)], diff)
      
    } else {
      
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
      
    }
  }
  
  if (n == 1) {
    
    diff_table_bind <- diff_table
    
  } else {
    
    diff_table_bind <- rbind(diff_table_bind, diff_table)
    
  }
  
}
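
As an aside, the same ratio table could also be built without the nested loops. A vectorized sketch, assuming each country/indicator combination occurs exactly once in both dataframes (the row order then follows dataset_fem rather than the loop output):

# align the male rows to the female rows by country and indicator
key_fem <- paste(dataset_fem$Country.Code, dataset_fem$Indicator.Name)
key_male <- paste(dataset_male$Country.Code, dataset_male$Indicator.Name)
dataset_male_aligned <- dataset_male[match(key_fem, key_male), ]

# divide all year columns (5 to 61) in one step
diff_table_vec <- cbind(dataset_male_aligned[, 1:4],
                        dataset_male_aligned[, 5:61] / dataset_fem[, 5:61])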

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
# replace NaN values (e.g. from 0/0 divisions) with NA
diff_table_bind[diff_table_bind == "NaN"] <- NA

save(diff_table_bind, file = "diff_table_bind.RData")
measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")

Map

The map has been downloaded from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/

library(rgdal)
library(ggplot2)
library(plyr)
library(dplyr)
library(scales)

wmap_countries <- readOGR(dsn="shapefiles", layer="ne_110m_admin_0_countries")

# fortify() converts the spatial object into a ggplot2-compatible dataframe;
# the country attributes are then re-attached via the polygon id
wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")

wmap_countries_df_final$gu_a3 <- as.character(wmap_countries_df_final$gu_a3)

save(wmap_countries_df_final, file = "wmap_countries_df_final.RData")


## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.4 magrittr_1.5    rprojroot_1.1   tools_3.3.2    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.8     stringi_1.1.2  
##  [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.1.0   digest_0.6.11  
## [13] evaluate_0.10

R vs Python - a One-on-One Comparison

2017-01-22 00:00:00 +0000

I’m an avid R user and rarely use anything else for data analysis and visualisations. But while R is my go-to, in some cases, Python might actually be a better alternative.

That’s why I wanted to see how R and Python fare in a one-on-one comparison of an analysis that’s representative of what I would typically work with.

Conclusions

All in all, the Python code could easily be translated into R and was comparable in length and simplicity between the two languages. While Python’s syntax is inherently cleaner/tidier, we can use packages that implement piping in R and achieve similar results (even though Python’s dot-separated syntax is still much easier to type than magrittr’s piping operator). For plotting and visualisation I still think that R’s ggplot2 is top of the line in syntax, customizability and output (admittedly, I don’t know matplotlib as well as ggplot2). In terms of functionality, I couldn’t find major differences between the two languages and I would say they both have their merits. For me, R comes more naturally as that is what I’m more fluent in, but I can see why Python holds an appeal too, and I think I’ll make more of an effort to use both languages in my future projects.



R and Python

R and Python are both open-source languages used in a wide range of data analysis fields. Their main difference is that R has traditionally been geared towards statistical analysis, while Python is more generalist. Both comprise a large collection of packages for specific tasks and have a growing community that offers support and tutorials online.

For a nice overview of the languages’ respective strengths and weaknesses, check out DataCamp’s infographic.


Comparative analysis of genome data

To directly compare R and Python, I am following Zhuyi Xue’s “A Comprehensive Introduction To Your Genome With the SciPy Stack” (with some minor tweaks here and there). He gives a nice introduction to the data, so I will not repeat it here but focus on the comparison between the code lines.

For R, I am working with RStudio, for Python with Anaconda and Spyder.

Python:

For this analysis, we need the SciPy stack with pandas for data wrangling and matplotlib for visualisation. Anaconda already comes with all the packages we need. Zhuyi Xue first imports pandas as “pd”, so that we can call pandas functions by prefixing them with “pd.”.

import pandas as pd

R:

While the code could be replicated with base R, I prefer dplyr for data wrangling and ggplot2 for visualisation.

library(dplyr)
library(ggplot2)

Reading in data

Reading in data is straightforward in both R and Python, and the code we need to read in the file is comparable between the two. Zhuyi Xue specified “compression = ‘gzip’”, but this would not have been necessary as the default is to infer it from the file suffix.

One big difference in the general syntax can be seen here too: Boolean true/false values are written in all caps in R (TRUE/FALSE), while Python uses first-letter capitalisation (True/False). The same principle applies to null values (NULL in R, None in Python).

Python:

df = pd.read_csv('Homo_sapiens.GRCh38.85.gff3.gz', 
                         compression = 'gzip',
                         sep = '\t', 
                         comment = '#', 
                         low_memory = False,
                         header = None, 
                         names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
df.head()

R:

df <- read.csv("~/Documents/Github/Homo_sapiens.GRCh38.85.gff3.gz", 
               header = FALSE, 
               sep = "\t", 
               col.names = c('seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'), 
               comment.char = "#")
head(df)
##   seqid source              type start       end   score strand phase
## 1     1 GRCh38        chromosome     1 248956422       .      .     .
## 2     1      . biological_region 10469     11240 1.3e+03      .     .
## 3     1      . biological_region 10650     10657   0.999      +     .
## 4     1      . biological_region 10655     10657   0.999      -     .
## 5     1      . biological_region 10678     10687   0.999      +     .
## 6     1      . biological_region 10681     10688   0.999      -     .
##                                           attributes
## 1 ID=chromosome:1;Alias=CM000663.2,chr1,NC_000001.11
## 2           external_name=oe %3D 0.79;logic_name=cpg
## 3                                 logic_name=eponine
## 4                                 logic_name=eponine
## 5                                 logic_name=eponine
## 6                                 logic_name=eponine

Examining data

  • listing unique strings

The first thing we want to know from the data is how many unique entries there are in the “seqid” column.

Here, we can already see the main difference in syntax between R and Python:

Python concatenates the object name (“df”) with the column name and the functions that we want to run on this column in a sequential manner, separated by a dot. Base R uses nested functions, where each function is called with “function_name()” and we specify columns with “object_name$column_name”.

However, both R and Python can also call columns in a dataframe with “[ ]”, with the difference that Python by default subsets columns (df[“seqid”]), while R always needs index specifications for rows and columns, separated by “,”: e.g. df[, “seqid”] would subset every row and only the column named “seqid”.

The sequential calling of functions is indeed very handy: it makes the code easier to read and understand than lots of interwoven functions and brackets. And while this isn’t how base R works, dplyr uses the magrittr principle of chaining functions with the pipe symbol “%>%”. Even though this symbol isn’t as easily typed, its functionality is often superior to base R, especially if you need to run many functions on a dataframe. With just one or two functions, however, I usually keep to base R as it is shorter.

Python:

df.seqid.unique() # alternatively: df['seqid'].unique()

R:

unique(df$seqid)
##   [1] 1          10         11         12         13         14        
##   [7] 15         16         17         18         19         2         
##  [13] 20         21         22         3          4          5         
##  [19] 6          7          8          9          GL000008.2 GL000009.2
##  [25] GL000194.1 GL000195.1 GL000205.2 GL000208.1 GL000213.1 GL000214.1
##  [31] GL000216.2 GL000218.1 GL000219.1 GL000220.1 GL000221.1 GL000224.1
##  [37] GL000225.1 GL000226.1 KI270302.1 KI270303.1 KI270304.1 KI270305.1
##  [43] KI270310.1 KI270311.1 KI270312.1 KI270315.1 KI270316.1 KI270317.1
##  [49] KI270320.1 KI270322.1 KI270329.1 KI270330.1 KI270333.1 KI270334.1
##  [55] KI270335.1 KI270336.1 KI270337.1 KI270338.1 KI270340.1 KI270362.1
##  [61] KI270363.1 KI270364.1 KI270366.1 KI270371.1 KI270372.1 KI270373.1
##  [67] KI270374.1 KI270375.1 KI270376.1 KI270378.1 KI270379.1 KI270381.1
##  [73] KI270382.1 KI270383.1 KI270384.1 KI270385.1 KI270386.1 KI270387.1
##  [79] KI270388.1 KI270389.1 KI270390.1 KI270391.1 KI270392.1 KI270393.1
##  [85] KI270394.1 KI270395.1 KI270396.1 KI270411.1 KI270412.1 KI270414.1
##  [91] KI270417.1 KI270418.1 KI270419.1 KI270420.1 KI270422.1 KI270423.1
##  [97] KI270424.1 KI270425.1 KI270429.1 KI270435.1 KI270438.1 KI270442.1
## [103] KI270448.1 KI270465.1 KI270466.1 KI270467.1 KI270468.1 KI270507.1
## [109] KI270508.1 KI270509.1 KI270510.1 KI270511.1 KI270512.1 KI270515.1
## [115] KI270516.1 KI270517.1 KI270518.1 KI270519.1 KI270521.1 KI270522.1
## [121] KI270528.1 KI270529.1 KI270530.1 KI270538.1 KI270539.1 KI270544.1
## [127] KI270548.1 KI270579.1 KI270580.1 KI270581.1 KI270582.1 KI270583.1
## [133] KI270584.1 KI270587.1 KI270588.1 KI270589.1 KI270590.1 KI270591.1
## [139] KI270593.1 KI270706.1 KI270707.1 KI270708.1 KI270709.1 KI270710.1
## [145] KI270711.1 KI270712.1 KI270713.1 KI270714.1 KI270715.1 KI270716.1
## [151] KI270717.1 KI270718.1 KI270719.1 KI270720.1 KI270721.1 KI270722.1
## [157] KI270723.1 KI270724.1 KI270725.1 KI270726.1 KI270727.1 KI270728.1
## [163] KI270729.1 KI270730.1 KI270731.1 KI270732.1 KI270733.1 KI270734.1
## [169] KI270735.1 KI270736.1 KI270737.1 KI270738.1 KI270739.1 KI270740.1
## [175] KI270741.1 KI270742.1 KI270743.1 KI270744.1 KI270745.1 KI270746.1
## [181] KI270747.1 KI270748.1 KI270749.1 KI270750.1 KI270751.1 KI270752.1
## [187] KI270753.1 KI270754.1 KI270755.1 KI270756.1 KI270757.1 MT        
## [193] X          Y         
## 194 Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 ... Y
# with dplyr:
# df %>% select(seqid) %>% unique


  • how many unique seqids are there?

To get the number of unique entries in the “seqid” column, we simply need to append “.shape” to the above Python code. In R, we can either wrap the above R code with the “length()” function or use dplyr and piping instead. If we use the latter, we need to use the “nrow()” function because base R returns a vector, while dplyr returns a dataframe. Here, we can see that with two functions, using dplyr is still a bit more code but it already looks much tidier.

Python:

df.seqid.unique().shape

R:

length(unique(df$seqid))
## [1] 194
# with dplyr:
# df %>% select(seqid) %>% unique %>% nrow


  • counting occurrences

To count the frequencies of each unique entry in the “source” column, we use the “value_counts()” function in Python and the “table()” function in R. These two functions differ in how they sort the output table: value_counts() sorts by decreasing frequency, while R alphabetically sorts the variables. To order the data as in Python, we need to add the “sort()” function to our R code.

Python:

df.source.value_counts()

R:

# table(df$source) is ordered alphabetically by default; to order by decreasing counts, like in Python:
sort(table(df$source), decreasing = TRUE)
## 
##         havana ensembl_havana        ensembl              .        mirbase 
##        1441093         745065         228212         182510           4701 
##         GRCh38          insdc 
##            194             74
# dplyr:
# df %>% select(source) %>% table %>% sort(decreasing = TRUE)


How Much of the Genome Is Incomplete?

  • subsetting a dataframe

We are now subsetting our original dataframe and assigning it a new object name with “ = ” (Python) or “ <- ” (R).

To subset the dataframe to keep only rows which say “GRCh38” in the “source” column, there are several ways to do this in R: the approach directly comparable to Python would be square bracket indexing. However, there are two more elegant solutions: base R’s “subset()” or dplyr’s “filter()” function. With this short example, though, there is no big difference between the three.

Python’s “shape” gives us the same information as R’s “dim()” function: how many rows and columns our dataframe has.

To preview a random subset of 10 rows from our dataframe, we use Python’s “sample()” and dplyr’s “sample_n()” function.

Python:

gdf = df[df.source == 'GRCh38']

gdf.shape
gdf.sample(10)

R:

gdf <- df[df$source == "GRCh38", ]

# alternatively:
# gdf <- subset(df, source == "GRCh38")

# dplyr:
# gdf <- df %>% filter(source == "GRCh38")

# get number of rows and columns
dim(gdf)
## [1] 194   9
# randomly sample 10 rows for observation
sample_n(gdf, 10)
##              seqid source        type start       end score strand phase
## 2511484 KI270375.1 GRCh38 supercontig     1      2378     .      .     .
## 2511545 KI270466.1 GRCh38 supercontig     1      1233     .      .     .
## 2511473 KI270337.1 GRCh38 supercontig     1      1121     .      .     .
## 2511504 KI270411.1 GRCh38 supercontig     1      2646     .      .     .
## 2511487 KI270379.1 GRCh38 supercontig     1      1045     .      .     .
## 483371          12 GRCh38  chromosome     1 133275309     .      .     .
## 2511494 KI270387.1 GRCh38 supercontig     1      1537     .      .     .
## 674768          14 GRCh38  chromosome     1 107043718     .      .     .
## 2511493 KI270386.1 GRCh38 supercontig     1      1788     .      .     .
## 2511505 KI270412.1 GRCh38 supercontig     1      1179     .      .     .
##                                                           attributes
## 2511484 ID=supercontig:KI270375.1;Alias=chrUn_KI270375v1,NT_187493.1
## 2511545 ID=supercontig:KI270466.1;Alias=chrUn_KI270466v1,NT_187421.1
## 2511473 ID=supercontig:KI270337.1;Alias=chrUn_KI270337v1,NT_187466.1
## 2511504 ID=supercontig:KI270411.1;Alias=chrUn_KI270411v1,NT_187409.1
## 2511487 ID=supercontig:KI270379.1;Alias=chrUn_KI270379v1,NT_187472.1
## 483371          ID=chromosome:12;Alias=CM000674.2,chr12,NC_000012.12
## 2511494 ID=supercontig:KI270387.1;Alias=chrUn_KI270387v1,NT_187475.1
## 674768           ID=chromosome:14;Alias=CM000676.2,chr14,NC_000014.9
## 2511493 ID=supercontig:KI270386.1;Alias=chrUn_KI270386v1,NT_187480.1
## 2511505 ID=supercontig:KI270412.1;Alias=chrUn_KI270412v1,NT_187408.1


  • creating new columns and performing calculations

Now we want to create a new column called “length”. It should contain the gene lengths, i.e. the distance in base pairs between the start and end points on the chromosomes (“start” and “end” columns). In R we don’t need to copy the dataframe first, but the rest of the code is very similar: we define a new column name for the dataframe and assign its value by subtracting the values of the “start” from the values of the “end” column (+ 1). Here, we can see again that Python uses a dot to specify columns, while R uses the dollar sign.

The sum of all lengths is calculated with the “sum()” function in both languages.

Python:

gdf = gdf.copy()
gdf['length'] = gdf.end - gdf.start + 1

gdf.length.sum()

R:

gdf$length <- gdf$end - gdf$start + 1

sum(gdf$length)
## [1] 3096629726

Next, we want to calculate the proportion of the genome that is not on main chromosome assemblies. For that, we first define a character string with the main chromosomes: 1 to 22, X, Y and MT (the mitochondrial chromosome). Defining this string is a bit easier in R.

We will use this string to calculate the sum of lengths of the subsetted dataframe and divide it by the sum of lengths of the whole dataframe. For subsetting, we use the “isin()” function of Python, which corresponds to R’s “%in%”.

Python:

chrs = [str(_) for _ in range(1, 23)] + ['X', 'Y', 'MT']
gdf[-gdf.seqid.isin(chrs)].length.sum() / gdf.length.sum()

# or
gdf[(gdf['type'] == 'supercontig')].length.sum() / gdf.length.sum()

R:

chrs <- c(1:22, "X", "Y", "MT")
sum(subset(gdf, !seqid %in% chrs)$length) / sum(gdf$length)
## [1] 0.003702192


How Many Genes Are There?

Here, we are again using the same functions as above for subsetting the dataframe, asking for its dimensions, printing 10 random lines and asking for the frequencies of each unique item in the “type” column. For the latter I am using base R over dplyr, because it’s faster to type.

Python:

edf = df[df.source.isin(['ensembl', 'havana', 'ensembl_havana'])]
edf.shape

edf.sample(10)
edf.type.value_counts()

R:

edf <- subset(df, source %in% c("ensembl", "havana", "ensembl_havana"))
dim(edf)
## [1] 2414370       9
sample_n(edf, 10)
##         seqid         source            type     start       end score
## 2548897     X        ensembl three_prime_UTR  67723842  67730618     .
## 790670     15 ensembl_havana             CDS  42352030  42352139     .
## 1298369    19         havana             CDS  38942683  38942751     .
## 189728      1 ensembl_havana             CDS 186304108 186304257     .
## 1080599    17         havana            exon  44401337  44401453     .
## 1562968    20 ensembl_havana            exon   2964272   2964350     .
## 1965100     5        ensembl      transcript    144087    144197     .
## 2581554     X         havana            exon 147927670 147928807     .
## 785399     15         havana            exon  40611478  40611511     .
## 1417830     2 ensembl_havana             CDS  71520178  71520208     .
##         strand phase
## 2548897      +     .
## 790670       +     2
## 1298369      -     0
## 189728       +     2
## 1080599      -     .
## 1562968      +     .
## 1965100      -     .
## 2581554      +     .
## 785399       +     .
## 1417830      +     0
##                                                                                                                                                                         attributes
## 2548897                                                                                                                                          Parent=transcript:ENST00000612452
## 790670                                                                                         ID=CDS:ENSP00000326227;Parent=transcript:ENST00000318010;protein_id=ENSP00000326227
## 1298369                                                                                        ID=CDS:ENSP00000472465;Parent=transcript:ENST00000599996;protein_id=ENSP00000472465
## 189728                                                                                         ID=CDS:ENSP00000356453;Parent=transcript:ENST00000367483;protein_id=ENSP00000356453
## 1080599                         Parent=transcript:ENST00000585614;Name=ENSE00002917300;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00002917300;rank=7;version=1
## 1562968                        Parent=transcript:ENST00000216877;Name=ENSE00003603687;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;exon_id=ENSE00003603687;rank=4;version=1
## 1965100 ID=transcript:ENST00000362670;Parent=gene:ENSG00000199540;Name=Y_RNA.43-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000362670;transcript_support_level=NA;version=1
## 2581554                       Parent=transcript:ENST00000620828;Name=ENSE00003724969;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003724969;rank=1;version=1
## 785399                          Parent=transcript:ENST00000399668;Name=ENSE00003659959;constitutive=0;ensembl_end_phase=2;ensembl_phase=1;exon_id=ENSE00003659959;rank=7;version=1
## 1417830                                                                                        ID=CDS:ENSP00000398305;Parent=transcript:ENST00000429174;protein_id=ENSP00000398305
sort(table(edf$type), decreasing = TRUE)
## 
##                          exon                           CDS 
##                       1180596                        704604 
##                five_prime_UTR               three_prime_UTR 
##                        142387                        133938 
##                    transcript                          gene 
##                         96375                         42470 
##          processed_transcript aberrant_processed_transcript 
##                         28228                         26944 
##        NMD_transcript_variant                       lincRNA 
##                         13761                         13247 
##          processed_pseudogene                  lincRNA_gene 
##                         10722                          7533 
##                    pseudogene                           RNA 
##                          3049                          2221 
##                         snRNA                    snRNA_gene 
##                          1909                          1909 
##                        snoRNA                   snoRNA_gene 
##                           956                           944 
##        pseudogenic_transcript                          rRNA 
##                           737                           549 
##                     rRNA_gene                         miRNA 
##                           549                           302 
##                V_gene_segment                J_gene_segment 
##                           216                           158 
##               VD_gene_segment                C_gene_segment 
##                            37                            29 
##             biological_region                    chromosome 
##                             0                             0 
##                    miRNA_gene                       mt_gene 
##                             0                             0 
##                   supercontig 
##                             0

Now we want to subset the dataframe to rows with the attribute “gene” in the “type” column, look at 10 random lines from the “attributes” column and get the dataframe dimensions.

Python:

ndf = edf[edf.type == 'gene']
ndf = ndf.copy()
ndf.sample(10).attributes.values
ndf.shape

R:

ndf <- subset(edf, type == "gene")
sample_n(ndf, 10)$attributes
##  [1] ID=gene:ENSG00000170949;Name=ZNF160;biotype=protein_coding;description=zinc finger protein 160 [Source:HGNC Symbol%3BAcc:HGNC:12948];gene_id=ENSG00000170949;havana_gene=OTTHUMG00000182854;havana_version=3;logic_name=ensembl_havana_gene;version=17                  
##  [2] ID=gene:ENSG00000215088;Name=RPS5P3;biotype=processed_pseudogene;description=ribosomal protein S5 pseudogene 3 [Source:HGNC Symbol%3BAcc:HGNC:10427];gene_id=ENSG00000215088;havana_gene=OTTHUMG00000065532;havana_version=1;logic_name=havana;version=3                
##  [3] ID=gene:ENSG00000249077;Name=RP11-478C1.8;biotype=processed_pseudogene;gene_id=ENSG00000249077;havana_gene=OTTHUMG00000160290;havana_version=1;logic_name=havana;version=1                                                                                              
##  [4] ID=gene:ENSG00000110619;Name=CARS;biotype=protein_coding;description=cysteinyl-tRNA synthetase [Source:HGNC Symbol%3BAcc:HGNC:1493];gene_id=ENSG00000110619;havana_gene=OTTHUMG00000010927;havana_version=9;logic_name=ensembl_havana_gene;version=16                   
##  [5] ID=gene:ENSG00000168010;Name=ATG16L2;biotype=protein_coding;description=autophagy related 16 like 2 [Source:HGNC Symbol%3BAcc:HGNC:25464];gene_id=ENSG00000168010;havana_gene=OTTHUMG00000167961;havana_version=2;logic_name=ensembl_havana_gene;version=10             
##  [6] ID=gene:ENSG00000055957;Name=ITIH1;biotype=protein_coding;description=inter-alpha-trypsin inhibitor heavy chain 1 [Source:HGNC Symbol%3BAcc:HGNC:6166];gene_id=ENSG00000055957;havana_gene=OTTHUMG00000150312;havana_version=9;logic_name=ensembl_havana_gene;version=10
##  [7] ID=gene:ENSG00000258781;Name=RP11-496I2.4;biotype=unprocessed_pseudogene;gene_id=ENSG00000258781;havana_gene=OTTHUMG00000170502;havana_version=2;logic_name=havana;version=2                                                                                            
##  [8] ID=gene:ENSG00000157965;Name=SSX8;biotype=unprocessed_pseudogene;description=SSX family member 8 [Source:HGNC Symbol%3BAcc:HGNC:19654];gene_id=ENSG00000157965;havana_gene=OTTHUMG00000021573;havana_version=4;logic_name=havana;version=11                             
##  [9] ID=gene:ENSG00000273489;Name=RP11-180C16.1;biotype=antisense;gene_id=ENSG00000273489;havana_gene=OTTHUMG00000186336;havana_version=1;logic_name=havana;version=1                                                                                                        
## [10] ID=gene:ENSG00000279532;Name=CTB-96E2.6;biotype=TEC;gene_id=ENSG00000279532;havana_gene=OTTHUMG00000179414;havana_version=1;logic_name=havana;version=1                                                                                                                 
## 1623077 Levels: external_name=Ala;logic_name=trnascan ...
dim(ndf)
## [1] 42470     9


  • extracting gene information from attributes field

I don’t know if there is an easier way in Python, but in R we don’t need to build big helper functions around the extraction. We can simply use “gsub()” and define the regular expression for what we want to extract. This makes the R code much shorter and easier to understand! We then drop the original “attributes” column.

To have a look at the dataframe, we use “head()” this time.

Python:

import re

RE_GENE_NAME = re.compile(r'Name=(?P<gene_name>.+?);')
def extract_gene_name(attributes_str):
    res = RE_GENE_NAME.search(attributes_str)
    return res.group('gene_name')


ndf['gene_name'] = ndf.attributes.apply(extract_gene_name)

RE_GENE_ID = re.compile(r'gene_id=(?P<gene_id>ENSG.+?);')
def extract_gene_id(attributes_str):
    res = RE_GENE_ID.search(attributes_str)
    return res.group('gene_id')


ndf['gene_id'] = ndf.attributes.apply(extract_gene_id)


RE_DESC = re.compile('description=(?P<desc>.+?);')
def extract_description(attributes_str):
    res = RE_DESC.search(attributes_str)
    if res is None:
        return ''
    else:
        return res.group('desc')


ndf['desc'] = ndf.attributes.apply(extract_description)

ndf.drop('attributes', axis=1, inplace=True)
ndf.head()

R:

ndf$gene_name <- gsub("(.*Name=)(.*?)(;biotype.*)", "\\2", ndf$attributes)
ndf$gene_id <- gsub("(ID=gene:)(.*?)(;Name.*)", "\\2", ndf$attributes)
ndf$desc <- gsub("(.*description=)(.*?)(;.*)", "\\2", ndf$attributes)

# some genes don't have a description
ndf$desc <- ifelse(grepl("^ID=.*", ndf$desc), "", ndf$desc)

ndf <- subset(ndf, select = -attributes)
head(ndf)
##     seqid         source type  start    end score strand phase gene_name
## 17      1         havana gene  11869  14409     .      +     .   DDX11L1
## 29      1         havana gene  14404  29570     .      -     .    WASH7P
## 72      1         havana gene  52473  53312     .      +     .    OR4G4P
## 75      1         havana gene  62948  63887     .      +     .   OR4G11P
## 78      1 ensembl_havana gene  69091  70008     .      +     .     OR4F5
## 109     1         havana gene 131025 134836     .      +     .    CICP27
##             gene_id
## 17  ENSG00000223972
## 29  ENSG00000227232
## 72  ENSG00000268020
## 75  ENSG00000240361
## 78  ENSG00000186092
## 109 ENSG00000233750
##                                                                                                   desc
## 17                                 DEAD/H-box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:HGNC:37102]
## 29                       WAS protein family homolog 7 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:38034]
## 72   olfactory receptor family 4 subfamily G member 4 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:14822]
## 75  olfactory receptor family 4 subfamily G member 11 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:31276]
## 78              olfactory receptor family 4 subfamily F member 5 [Source:HGNC Symbol%3BAcc:HGNC:14825]
## 109              capicua transcriptional repressor pseudogene 27 [Source:HGNC Symbol%3BAcc:HGNC:48835]

Next, we want to know how many unique gene names and gene IDs there are. As above, we use the “unique()” function together with “shape” (Python) or “length()” (R).

The count table for gene names can again be obtained with R’s “table()” function, even though Zhuyi Xue uses a slightly different approach: he first groups the “gene_name” column, then counts and sorts.

Finally, we can calculate the proportion of genes that appear more than once and we can have a closer look at the SCARNA20 gene.

Python:

ndf.shape
ndf.gene_id.unique().shape
ndf.gene_name.unique().shape

count_df = ndf.groupby('gene_name').count().ix[:, 0].sort_values().ix[::-1]
count_df.head(10)

count_df[count_df > 1].shape
count_df.shape
count_df[count_df > 1].shape[0] / count_df.shape[0]

ndf[ndf.gene_name == 'SCARNA20']

R:

dim(ndf)
## [1] 42470    11
length(unique(ndf$gene_id))
## [1] 42470
length(unique(ndf$gene_name))
## [1] 42387
count_df <- sort(table(ndf$gene_name), decreasing = TRUE)
head(count_df, n = 10)
## 
##        SCARNA20        SCARNA16        SCARNA17        SCARNA11 
##               7               6               5               4 
##        SCARNA15        SCARNA21 Clostridiales-1         SCARNA4 
##               4               4               3               3 
##        ACTR3BP2           AGBL1 
##               2               2
length(count_df[count_df > 1])
## [1] 63
length(count_df)
## [1] 42387
length(count_df[count_df > 1]) / length(count_df)
## [1] 0.001486305
ndf[ndf$gene_name == "SCARNA20", ]
##         seqid  source type     start       end score strand phase
## 179400      1 ensembl gene 171768070 171768175     .      +     .
## 201038      1 ensembl gene 204727991 204728106     .      +     .
## 349204     11 ensembl gene   8555016   8555146     .      +     .
## 718521     14 ensembl gene  63479272  63479413     .      +     .
## 837234     15 ensembl gene  75121536  75121666     .      -     .
## 1039875    17 ensembl gene  28018770  28018907     .      +     .
## 1108216    17 ensembl gene  60231516  60231646     .      -     .
##         gene_name         gene_id
## 179400   SCARNA20 ENSG00000253060
## 201038   SCARNA20 ENSG00000251861
## 349204   SCARNA20 ENSG00000252778
## 718521   SCARNA20 ENSG00000252800
## 837234   SCARNA20 ENSG00000252722
## 1039875  SCARNA20 ENSG00000251818
## 1108216  SCARNA20 ENSG00000252577
##                                                                           desc
## 179400            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 201038            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 349204            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 718521            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 837234            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1039875           Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1108216 small Cajal body-specific RNA 20 [Source:HGNC Symbol%3BAcc:HGNC:32578]
# dplyr:
# ndf %>% filter(gene_name == "SCARNA20")


How Long Is a Typical Gene?

To calculate gene lengths we use the same code as before. R’s summary() is not exactly the same as Python’s describe() but it’s close enough.

Python:

ndf['length'] = ndf.end - ndf.start + 1
ndf.length.describe()

R:

ndf$length <- ndf$end - ndf$start + 1
summary(ndf$length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       8     884    5170   35830   30550 2305000

Now we produce the first plot, showing a histogram of gene lengths. In base R you can’t easily plot a histogram with a logarithmic y-axis (at least not without manually tweaking the hist() output, which isn’t recommended anyway because counts of 0 would become -Inf). But we can do it easily in ggplot2 with “scale_y_log10()”. The code we need for ggplot2 is a bit longer than with matplotlib. We could of course customize our plot further but for now, let’s keep it simple.

Python:

import matplotlib.pyplot as plt

ndf.length.plot(kind='hist', bins=50, logy=True)
plt.show()

R:

ndf %>% ggplot(aes(x = length)) + 
  geom_histogram(bins = 50, fill = "blue") + 
  scale_y_log10()

Now, we subset the dataframe to keep only rows where the “length” column contains values bigger than 2 million and order it by descending gene length. To see the shortest genes, we order the original dataframe and view the first 6 rows. Instead of “sort()”, we use dplyr’s “arrange()” function this time (I didn’t use it before because it can only be applied to dataframes).

Python:

ndf[ndf.length > 2e6].sort_values('length').ix[::-1]
ndf.sort_values('length').head()

R:

ndf %>% filter(length > 2e6) %>% arrange(desc(length))
##   seqid         source type     start       end score strand phase
## 1     7 ensembl_havana gene 146116002 148420998     .      +     .
## 2     9 ensembl_havana gene   8314246  10612723     .      -     .
## 3     X ensembl_havana gene  31097677  33339441     .      -     .
## 4    11 ensembl_havana gene  83455012  85627922     .      -     .
## 5     8 ensembl_havana gene   2935353   4994972     .      -     .
## 6    20 ensembl_havana gene  13995369  16053197     .      +     .
##   gene_name         gene_id
## 1   CNTNAP2 ENSG00000174469
## 2     PTPRD ENSG00000153707
## 3       DMD ENSG00000198947
## 4      DLG2 ENSG00000150672
## 5     CSMD1 ENSG00000183117
## 6   MACROD2 ENSG00000172264
##                                                                                   desc
## 1            contactin associated protein-like 2 [Source:HGNC Symbol%3BAcc:HGNC:13830]
## 2 protein tyrosine phosphatase%2C receptor type D [Source:HGNC Symbol%3BAcc:HGNC:9668]
## 3                                      dystrophin [Source:HGNC Symbol%3BAcc:HGNC:2928]
## 4            discs large MAGUK scaffold protein 2 [Source:HGNC Symbol%3BAcc:HGNC:2901]
## 5               CUB and Sushi multiple domains 1 [Source:HGNC Symbol%3BAcc:HGNC:14026]
## 6                      MACRO domain containing 2 [Source:HGNC Symbol%3BAcc:HGNC:16126]
##    length
## 1 2304997
## 2 2298478
## 3 2241765
## 4 2172911
## 5 2059620
## 6 2057829
head(arrange(ndf, length))
##   seqid source type     start       end score strand phase    gene_name
## 1    14 havana gene  22438547  22438554     .      +     .        TRDD1
## 2    14 havana gene  22439007  22439015     .      +     .        TRDD2
## 3     7 havana gene 142786213 142786224     .      +     .        TRBD1
## 4    14 havana gene  22449113  22449125     .      +     .        TRDD3
## 5     4 havana gene  10238213  10238235     .      -     .   AC006499.9
## 6     3 havana gene 179452395 179452419     .      -     . RP11-145M9.2
##           gene_id
## 1 ENSG00000223997
## 2 ENSG00000237235
## 3 ENSG00000282431
## 4 ENSG00000228985
## 5 ENSG00000271544
## 6 ENSG00000239255
##                                                                      desc
## 1 T cell receptor delta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12254]
## 2 T cell receptor delta diversity 2 [Source:HGNC Symbol%3BAcc:HGNC:12255]
## 3  T cell receptor beta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12158]
## 4 T cell receptor delta diversity 3 [Source:HGNC Symbol%3BAcc:HGNC:12256]
## 5                                                                        
## 6                                                                        
##   length
## 1      8
## 2      9
## 3     12
## 4     13
## 5     23
## 6     25


Gene Distribution Among Chromosomes

The number of genes per chromosome are counted with the “subset()”, “table()” and “sort()” functions as described earlier.

Python:

ndf = ndf[ndf.seqid.isin(chrs)]
chr_gene_counts = ndf.groupby('seqid').count().ix[:, 0].sort_values().ix[::-1]
chr_gene_counts

R:

ndf$seqid <- as.character(ndf$seqid) # convert from factor; otherwise subsetting would keep all (empty) factor levels
ndf <- subset(ndf, seqid %in% chrs)
chr_gene_counts <- sort(table(ndf$seqid), decreasing = TRUE)
chr_gene_counts
## 
##    1    2   11   19   17    3    6   12    7    5   16    X    4    9    8 
## 3902 2806 2561 2412 2280 2204 2154 2140 2106 2002 1881 1852 1751 1659 1628 
##   10   15   14   22   20   13   18   21    Y 
## 1600 1476 1449  996  965  872  766  541  436

To see all genes that are on the mitochondrial chromosome, we subset the first dataframe by two conditions. In both R and Python this is done with the ampersand symbol, but in R we don’t need brackets around the individual conditions.

Python:

df[(df.type == 'gene') & (df.seqid == 'MT')]

R:

subset(df, type == "gene" & seqid == "MT")
##         seqid source type start   end score strand phase
## 2514004    MT  insdc gene   648  1601     .      +     .
## 2514010    MT  insdc gene  1671  3229     .      +     .
## 2514017    MT  insdc gene  3307  4262     .      +     .
## 2514030    MT  insdc gene  4470  5511     .      +     .
## 2514049    MT  insdc gene  5904  7445     .      +     .
## 2514059    MT  insdc gene  7586  8269     .      +     .
## 2514066    MT  insdc gene  8366  8572     .      +     .
## 2514070    MT  insdc gene  8527  9207     .      +     .
## 2514074    MT  insdc gene  9207  9990     .      +     .
## 2514081    MT  insdc gene 10059 10404     .      +     .
## 2514088    MT  insdc gene 10470 10766     .      +     .
## 2514092    MT  insdc gene 10760 12137     .      +     .
## 2514105    MT  insdc gene 12337 14148     .      +     .
## 2514109    MT  insdc gene 14149 14673     .      -     .
## 2514116    MT  insdc gene 14747 15887     .      +     .
##                                                                                                                                                                                                                                                 attributes
## 2514004                                               ID=gene:ENSG00000211459;Name=MT-RNR1;biotype=Mt_rRNA;description=mitochondrially encoded 12S RNA [Source:HGNC Symbol%3BAcc:HGNC:7470];gene_id=ENSG00000211459;logic_name=mt_genbank_import;version=2
## 2514010                                               ID=gene:ENSG00000210082;Name=MT-RNR2;biotype=Mt_rRNA;description=mitochondrially encoded 16S RNA [Source:HGNC Symbol%3BAcc:HGNC:7471];gene_id=ENSG00000210082;logic_name=mt_genbank_import;version=2
## 2514017   ID=gene:ENSG00000198888;Name=MT-ND1;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1 [Source:HGNC Symbol%3BAcc:HGNC:7455];gene_id=ENSG00000198888;logic_name=mt_genbank_import;version=2
## 2514030   ID=gene:ENSG00000198763;Name=MT-ND2;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 2 [Source:HGNC Symbol%3BAcc:HGNC:7456];gene_id=ENSG00000198763;logic_name=mt_genbank_import;version=3
## 2514049                          ID=gene:ENSG00000198804;Name=MT-CO1;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase I [Source:HGNC Symbol%3BAcc:HGNC:7419];gene_id=ENSG00000198804;logic_name=mt_genbank_import;version=2
## 2514059                         ID=gene:ENSG00000198712;Name=MT-CO2;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase II [Source:HGNC Symbol%3BAcc:HGNC:7421];gene_id=ENSG00000198712;logic_name=mt_genbank_import;version=1
## 2514066                                 ID=gene:ENSG00000228253;Name=MT-ATP8;biotype=protein_coding;description=mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol%3BAcc:HGNC:7415];gene_id=ENSG00000228253;logic_name=mt_genbank_import;version=1
## 2514070                                 ID=gene:ENSG00000198899;Name=MT-ATP6;biotype=protein_coding;description=mitochondrially encoded ATP synthase 6 [Source:HGNC Symbol%3BAcc:HGNC:7414];gene_id=ENSG00000198899;logic_name=mt_genbank_import;version=2
## 2514074                        ID=gene:ENSG00000198938;Name=MT-CO3;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase III [Source:HGNC Symbol%3BAcc:HGNC:7422];gene_id=ENSG00000198938;logic_name=mt_genbank_import;version=2
## 2514081   ID=gene:ENSG00000198840;Name=MT-ND3;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 3 [Source:HGNC Symbol%3BAcc:HGNC:7458];gene_id=ENSG00000198840;logic_name=mt_genbank_import;version=2
## 2514088 ID=gene:ENSG00000212907;Name=MT-ND4L;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4L [Source:HGNC Symbol%3BAcc:HGNC:7460];gene_id=ENSG00000212907;logic_name=mt_genbank_import;version=2
## 2514092   ID=gene:ENSG00000198886;Name=MT-ND4;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4 [Source:HGNC Symbol%3BAcc:HGNC:7459];gene_id=ENSG00000198886;logic_name=mt_genbank_import;version=2
## 2514105   ID=gene:ENSG00000198786;Name=MT-ND5;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 5 [Source:HGNC Symbol%3BAcc:HGNC:7461];gene_id=ENSG00000198786;logic_name=mt_genbank_import;version=2
## 2514109   ID=gene:ENSG00000198695;Name=MT-ND6;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 6 [Source:HGNC Symbol%3BAcc:HGNC:7462];gene_id=ENSG00000198695;logic_name=mt_genbank_import;version=2
## 2514116                                    ID=gene:ENSG00000198727;Name=MT-CYB;biotype=protein_coding;description=mitochondrially encoded cytochrome b [Source:HGNC Symbol%3BAcc:HGNC:7427];gene_id=ENSG00000198727;logic_name=mt_genbank_import;version=2

We can get the chromosome lengths from the dataframe as well. We again subset to only the main chromosomes, then drop unwanted columns and order by length.

Python:

gdf = gdf[gdf.seqid.isin(chrs)]
gdf.drop(['start', 'end', 'score', 'strand', 'phase' ,'attributes'], axis=1, inplace=True)
gdf.sort_values('length').ix[::-1]

R:

gdf$seqid <- as.character(gdf$seqid) # convert from factor; otherwise subsetting would keep all (empty) factor levels
gdf <- subset(gdf, as.character(seqid) %in% chrs) %>%
  select(-(start:attributes))
arrange(gdf, desc(length))
##    seqid source       type    length
## 1      1 GRCh38 chromosome 248956422
## 2      2 GRCh38 chromosome 242193529
## 3      3 GRCh38 chromosome 198295559
## 4      4 GRCh38 chromosome 190214555
## 5      5 GRCh38 chromosome 181538259
## 6      6 GRCh38 chromosome 170805979
## 7      7 GRCh38 chromosome 159345973
## 8      X GRCh38 chromosome 156040895
## 9      8 GRCh38 chromosome 145138636
## 10     9 GRCh38 chromosome 138394717
## 11    11 GRCh38 chromosome 135086622
## 12    10 GRCh38 chromosome 133797422
## 13    12 GRCh38 chromosome 133275309
## 14    13 GRCh38 chromosome 114364328
## 15    14 GRCh38 chromosome 107043718
## 16    15 GRCh38 chromosome 101991189
## 17    16 GRCh38 chromosome  90338345
## 18    17 GRCh38 chromosome  83257441
## 19    18 GRCh38 chromosome  80373285
## 20    20 GRCh38 chromosome  64444167
## 21    19 GRCh38 chromosome  58617616
## 22     Y GRCh38 chromosome  54106423
## 23    22 GRCh38 chromosome  50818468
## 24    21 GRCh38 chromosome  46709983
## 25    MT GRCh38 chromosome     16569

Now, we merge the dataframe with the number of genes per chromosome with the dataframe of chromosome lengths. Because R’s “table()” function produces a table object rather than a dataframe, we need to convert it first and define the column names. Then, we use the “merge()” function and point to the name of the column we want to merge by.

Python:

cdf = chr_gene_counts.to_frame(name='gene_count').reset_index()
cdf.head(2)

merged = gdf.merge(cdf, on='seqid')

R:

cdf <- as.data.frame(chr_gene_counts)
colnames(cdf) <- c("seqid", "gene_count")
head(cdf, n = 2)
##   seqid gene_count
## 1     1       3902
## 2     2       2806
merged <- merge(gdf, cdf, by = "seqid")
merged
##    seqid source       type    length gene_count
## 1      1 GRCh38 chromosome 248956422       3902
## 2     10 GRCh38 chromosome 133797422       1600
## 3     11 GRCh38 chromosome 135086622       2561
## 4     12 GRCh38 chromosome 133275309       2140
## 5     13 GRCh38 chromosome 114364328        872
## 6     14 GRCh38 chromosome 107043718       1449
## 7     15 GRCh38 chromosome 101991189       1476
## 8     16 GRCh38 chromosome  90338345       1881
## 9     17 GRCh38 chromosome  83257441       2280
## 10    18 GRCh38 chromosome  80373285        766
## 11    19 GRCh38 chromosome  58617616       2412
## 12     2 GRCh38 chromosome 242193529       2806
## 13    20 GRCh38 chromosome  64444167        965
## 14    21 GRCh38 chromosome  46709983        541
## 15    22 GRCh38 chromosome  50818468        996
## 16     3 GRCh38 chromosome 198295559       2204
## 17     4 GRCh38 chromosome 190214555       1751
## 18     5 GRCh38 chromosome 181538259       2002
## 19     6 GRCh38 chromosome 170805979       2154
## 20     7 GRCh38 chromosome 159345973       2106
## 21     8 GRCh38 chromosome 145138636       1628
## 22     9 GRCh38 chromosome 138394717       1659
## 23     X GRCh38 chromosome 156040895       1852
## 24     Y GRCh38 chromosome  54106423        436

To calculate the correlation between length and gene count, we subset the merged dataframe to those two columns and use the “corr()” (Python) or “cor()” (R) function.

Python:

merged[['length', 'gene_count']].corr()

R:

cor(merged[, c("length", "gene_count")])
##               length gene_count
## length     1.0000000  0.7282208
## gene_count 0.7282208  1.0000000

And now we produce the final plot: a line plot of chromosome length by number of genes per chromosome. For Python we again use matplotlib, and for R the ggplot2 package. Because Zhuyi Xue creates a new dataframe and tweaks the plot somewhat, our ggplot2 code is simpler and tidier here.

Python:

ax = merged[['length', 'gene_count']].sort_values('length').plot(x='length', y='gene_count', style='o-')
# add some margin to both ends of x axis
xlim = ax.get_xlim()
margin = xlim[0] * 0.1
ax.set_xlim([xlim[0] - margin, xlim[1] + margin])
# Label each point on the graph
for (s, x, y) in merged[['seqid', 'length', 'gene_count']].sort_values('length').values:
    ax.text(x, y - 100, str(s))

R:

merged[, c("seqid", "length", "gene_count")] %>%
  arrange(desc(length)) %>%
  ggplot(aes(x = length, y = gene_count, label = seqid)) +
  geom_point(color = "blue") +
  geom_line(color = "blue") +
  geom_text()



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8      codetools_0.2-15 digest_0.6.11    rprojroot_1.1   
##  [5] assertthat_0.1   plyr_1.8.4       grid_3.3.2       R6_2.2.0        
##  [9] gtable_0.2.0     DBI_0.5-1        backports_1.0.4  magrittr_1.5    
## [13] scales_0.4.1     evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0  
## [17] rmarkdown_1.3    labeling_0.3     tools_3.3.2      stringr_1.1.0   
## [21] munsell_0.4.3    yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5 
## [25] knitr_1.15.1     tibble_1.2

Feature Selection in Machine Learning (Breast Cancer Datasets)

2017-01-15 00:00:00 +0000

Machine learning uses so-called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those that are most relevant for the response variable we want to predict. Using as few features as possible will also reduce the complexity of our models, which means they need less time and computing power to run and are easier to understand.

There are several ways to identify how much each feature contributes to the model and to restrict the number of selected features. Here, I am going to examine the effect of feature selection via

  • Correlation,
  • Recursive Feature Elimination (RFE) and
  • Genetic Algorithm (GA)

on Random Forest models.

Additionally, I want to know how different data properties affect the influence of these feature selection methods on the outcome. For that I am using three breast cancer datasets, one of which has few features; the other two are larger but differ in how well the outcome clusters in PCA.

Based on my comparisons of the correlation method, RFE and GA, I would conclude that for Random Forest models

  • removing highly correlated features isn’t a generally suitable method,
  • GA produced the best models in this example but is impractical for everyday use-cases with many features because it takes a lot of time to run with sufficient generations and individuals and
  • data that doesn’t allow a good classification to begin with (because the features are not very distinct between classes) doesn’t necessarily benefit from feature selection.

My conclusions are of course not to be generalized to any ol’ data you are working with: there are many more feature selection methods, and I am only looking at a limited number of datasets and only at their influence on Random Forest models. But even this small example shows how different features and parameters can influence your predictions. With machine learning, there is no “one size fits all”! It is always worthwhile to take a good hard look at your data and get acquainted with its quirks and properties before you even think about models and algorithms. And once you’ve got a feel for your data, investing the time and effort to compare different feature selection methods (or engineered features), model parameters and - finally - different machine learning algorithms can make a big difference!



Breast Cancer Wisconsin (Diagnostic) Dataset

The data I am going to use to explore feature selection methods is the Breast Cancer Wisconsin (Diagnostic) Dataset:

W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.

The data was downloaded from the UC Irvine Machine Learning Repository. The features in these datasets characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses.

Included are three datasets. The first dataset is small with only 9 features; the other two datasets have 30 and 33 features and vary in how strongly the two predictor classes cluster in PCA. I want to explore the effect of different feature selection methods on datasets with these different properties.

But first, I want to get to know the data I am working with.


Breast cancer dataset 1

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The phenotypes for characterisation are:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nucleoli
  • Mitosis
  • Classes, i.e. diagnosis

Missing values are imputed with the mice package.

bc_data <- read.table("breast-cancer-wisconsin.data.txt", header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size", 
                       "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")
bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))
## [1] 16
# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)
##    benign malignant 
##       458       241
str(bc_data)
## 'data.frame':    699 obs. of  10 variables:
##  $ classes                    : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
##  $ clump_thickness            : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ uniformity_of_cell_size    : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ uniformity_of_cell_shape   : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ marginal_adhesion          : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ single_epithelial_cell_size: num  2 7 2 3 2 7 2 2 2 2 ...
##  $ bare_nuclei                : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ bland_chromatin            : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ normal_nucleoli            : num  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitosis                    : num  1 1 1 1 1 1 1 1 5 1 ...


Breast cancer dataset 2

The second dataset looks again at the predictor classes:

  • M: malignant or
  • B: benign breast mass.

The first two columns give:

  • Sample ID
  • Classes, i.e. diagnosis

For each cell nucleus, the following ten characteristics were measured:

  • Radius (mean of all distances from the center to points on the perimeter)
  • Texture (standard deviation of gray-scale values)
  • Perimeter
  • Area
  • Smoothness (local variation in radius lengths)
  • Compactness (perimeter^2 / area - 1.0)
  • Concavity (severity of concave portions of the contour)
  • Concave points (number of concave portions of the contour)
  • Symmetry
  • Fractal dimension (“coastline approximation” - 1)

For each characteristic three measures are given:

  • Mean
  • Standard error
  • Largest/ “worst”

bc_data_2 <- read.table("wdbc.data.txt", header = FALSE, sep = ",")

phenotypes <- rep(c("radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension"), 3)
types <- rep(c("mean", "se", "largest_worst"), each = 10)

colnames(bc_data_2) <- c("ID", "diagnosis", paste(phenotypes, types, sep = "_"))

# how many NAs are in the data
length(which(is.na(bc_data_2)))
## [1] 0
# how many benign and malignant cases are there?
summary(bc_data_2$diagnosis)
##   B   M 
## 357 212
str(bc_data_2)
## 'data.frame':    569 obs. of  32 variables:
##  $ ID                             : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis                      : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean                    : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean                   : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean                 : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean                      : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean                : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean                 : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave_points_mean            : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se                      : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se                     : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se                   : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                        : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se                  : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se                    : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_largest_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_largest_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_largest_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...


Breast cancer dataset 3

The third dataset looks at the predictor classes:

  • R: recurring or
  • N: nonrecurring breast cancer.

The first two columns give:

  • Sample ID
  • Classes, i.e. outcome

For each cell nucleus, the same ten characteristics and measures were given as in dataset 2, plus:

  • Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
  • Tumor size - diameter of the excised tumor in centimeters
  • Lymph node status - number of positive axillary lymph nodes observed at time of surgery

Missing values are imputed with the mice package.

bc_data_3 <- read.table("wpbc.data.txt", header = FALSE, sep = ",")
colnames(bc_data_3) <- c("ID", "outcome", "time", paste(phenotypes, types, sep = "_"), "tumor_size", "lymph_node_status")

bc_data_3[bc_data_3 == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data_3)))
## [1] 4
# impute missing data
library(mice)

bc_data_3[,3:35] <- apply(bc_data_3[,3:35], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data_3[,3:35],  print = FALSE)
bc_data_3 <- cbind(bc_data_3[, 2, drop = FALSE], mice::complete(dataset_impute, 1))

# how many recurring and non-recurring cases are there?
summary(bc_data_3$outcome)
##   N   R 
## 151  47
str(bc_data_3)
## 'data.frame':    198 obs. of  34 variables:
##  $ outcome                        : Factor w/ 2 levels "N","R": 1 1 1 1 2 2 1 2 1 1 ...
##  $ time                           : num  31 61 116 123 27 77 60 77 119 76 ...
##  $ radius_mean                    : num  18 18 21.4 11.4 20.3 ...
##  $ texture_mean                   : num  27.6 10.4 17.4 20.4 14.3 ...
##  $ perimeter_mean                 : num  117.5 122.8 137.5 77.6 135.1 ...
##  $ area_mean                      : num  1013 1001 1373 386 1297 ...
##  $ smoothness_mean                : num  0.0949 0.1184 0.0884 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.104 0.278 0.119 0.284 0.133 ...
##  $ concavity_mean                 : num  0.109 0.3 0.126 0.241 0.198 ...
##  $ concave_points_mean            : num  0.0706 0.1471 0.0818 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.186 0.242 0.233 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0633 0.0787 0.0601 0.0974 0.0588 ...
##  $ radius_se                      : num  0.625 1.095 0.585 0.496 0.757 ...
##  $ texture_se                     : num  1.89 0.905 0.611 1.156 0.781 ...
##  $ perimeter_se                   : num  3.97 8.59 3.93 3.44 5.44 ...
##  $ area_se                        : num  71.5 153.4 82.2 27.2 94.4 ...
##  $ smoothness_se                  : num  0.00443 0.0064 0.00617 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.0142 0.049 0.0345 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0323 0.0537 0.033 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.00985 0.01587 0.01805 0.01867 0.01885 ...
##  $ symmetry_se                    : num  0.0169 0.03 0.0309 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00349 0.00619 0.00504 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  21.6 25.4 24.9 14.9 22.5 ...
##  $ texture_largest_worst          : num  37.1 17.3 21 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  139.7 184.6 159.1 98.9 152.2 ...
##  $ area_largest_worst             : num  1436 2019 1949 568 1575 ...
##  $ smoothness_largest_worst       : num  0.119 0.162 0.119 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.193 0.666 0.345 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.314 0.712 0.341 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.117 0.265 0.203 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.268 0.46 0.433 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.0811 0.1189 0.0907 0.173 0.0768 ...
##  $ tumor_size                     : num  5 3 2.5 2 3.5 2.5 1.5 4 2 6 ...
##  $ lymph_node_status              : num  5 2 0 0 0 0 0 10 1 20 ...


Principal Component Analysis (PCA)

To get an idea about the dimensionality and variance of the datasets, I am first looking at PCA plots for samples and features. The first two principal components (PCs) are the axes that explain the largest share of the variation in the data.

After defining my custom ggplot2 theme, I am creating a function that performs the PCA (using the pcaGoPromoter package), calculates ellipses around the data points (with the ellipse package) and produces the plot with ggplot2. Some of the features in datasets 2 and 3 are not very distinct and overlap in the PCA plots; therefore, I am also plotting hierarchical clustering dendrograms.

# plotting theme

library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "navy", color = "navy", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())
# function for PCA plotting
library(pcaGoPromoter)
library(ellipse)

pca_func <- function(data, groups, title, print_ellipse = TRUE) {
  
  # perform pca and extract scores
  pcaOutput <- pca(data, printDropped = FALSE, scale = TRUE, center = TRUE)
  pcaOutput2 <- as.data.frame(pcaOutput$scores)
  
  # define groups for plotting
  pcaOutput2$groups <- groups
  
  # when plotting samples calculate ellipses for plotting (when plotting features, there are no replicates)
  if (print_ellipse) {
    
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
      data.frame(groups = as.character(t),
                 ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                       centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                       level = 0.95),
                 stringsAsFactors = FALSE)))
    
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.6) + 
      scale_color_brewer(palette = "Set1") +
      labs(title = title,
           color = "",
           fill = "",
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
    
  } else {
    
    # if there are fewer than 10 groups (e.g. the predictor classes) I want to have colors from RColorBrewer
    if (length(unique(pcaOutput2$groups)) <= 10) {
      
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        scale_color_brewer(palette = "Set1") +
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    } else {
      
      # otherwise use the default rainbow colors
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    }
  }
  
  return(plot)
  
}

library(gridExtra)
library(grid)


  • Dataset 1
p1 <- pca_func(data = t(bc_data[, 2:10]), groups = as.character(bc_data$classes), title = "Breast cancer dataset 1: Samples")
p2 <- pca_func(data = bc_data[, 2:10], groups = as.character(colnames(bc_data[, 2:10])), title = "Breast cancer dataset 1: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2)

h_1 <- hclust(dist(t(bc_data[, 2:10]), method = "euclidean"), method = "complete")
plot(h_1)

library(tidyr)
bc_data_gather <- bc_data %>%
  gather(measure, value, clump_thickness:mitosis)

ggplot(data = bc_data_gather, aes(x = value, fill = classes, color = classes)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


  • Dataset 2
p1 <- pca_func(data = t(bc_data_2[, 3:32]), groups = as.character(bc_data_2$diagnosis), title = "Breast cancer dataset 2: Samples")
p2 <- pca_func(data = bc_data_2[, 3:32], groups = as.character(colnames(bc_data_2[, 3:32])), title = "Breast cancer dataset 2: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_2 <- hclust(dist(t(bc_data_2[, 3:32]), method = "euclidean"), method = "complete")
plot(h_2)

bc_data_2_gather <- bc_data_2[, -1] %>%
  gather(measure, value, radius_mean:fractal_dimension_largest_worst)

ggplot(data = bc_data_2_gather, aes(x = value, fill = diagnosis, color = diagnosis)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


  • Dataset 3
p1 <- pca_func(data = t(bc_data_3[, 2:34]), groups = as.character(bc_data_3$outcome), title = "Breast cancer dataset 3: Samples")
p2 <- pca_func(data = bc_data_3[, 2:34], groups = as.character(colnames(bc_data_3[, 2:34])), title = "Breast cancer dataset 3: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_3 <- hclust(dist(t(bc_data_3[,2:34]), method = "euclidean"), method = "complete")
plot(h_3)

Datasets 1 and 2 show a nice separation of benign and malignant masses, so models based on these features will likely be able to predict the classes quite well for most samples. The classes in dataset 3 don’t cluster into distinct groups, so I assume that prediction will not be as accurate for these features.

The features of datasets 2 and 3 don’t cluster very distinctly; many features seem to show similar patterns across samples. Selecting an appropriate subset of features will probably have different effects on the three datasets.

bc_data_3_gather <- bc_data_3 %>%
  gather(measure, value, time:lymph_node_status)

ggplot(data = bc_data_3_gather, aes(x = value, fill = outcome, color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


Feature importance

To get an idea about the features’ respective importance, I’m running Random Forest models with 10 x 10 repeated cross-validation using the caret package. If I wanted to use feature importance to select features for modeling, I would need to perform it on the training data instead of on the complete dataset. But here, I only want to use it to get acquainted with my data. I am again defining a function that estimates the feature importance and produces a plot.

library(caret)
library(doParallel) # parallel processing
registerDoParallel()

# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

feature_imp <- function(model, title) {
  
  # estimate variable importance
  importance <- varImp(model, scale = TRUE)
  
  # prepare dataframes for plotting
  importance_df_1 <- importance$importance
  importance_df_1$group <- rownames(importance_df_1)
  
  importance_df_2 <- importance_df_1
  importance_df_2$Overall <- 0
  
  importance_df <- rbind(importance_df_1, importance_df_2)
  
  plot <- ggplot() +
    geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) +
    geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) +
    theme(legend.position = "none") +
    labs(
      x = "Importance",
      y = "",
      title = title,
      subtitle = "Scaled feature importance",
      caption = "\nDetermined with Random Forest and
      repeated cross validation (10 repeats, 10 times)"
    )
  
  return(plot)
  
}
# train the model
set.seed(27)
imp_1 <- train(classes ~ ., data = bc_data, method = "rf", preProcess = c("scale", "center"), trControl = control)
p1 <- feature_imp(imp_1, title = "Breast cancer dataset 1")
set.seed(27)
imp_2 <- train(diagnosis ~ ., data = bc_data_2[, -1], method = "rf", preProcess = c("scale", "center"), trControl = control)
p2 <- feature_imp(imp_2, title = "Breast cancer dataset 2")
set.seed(27)
imp_3 <- train(outcome ~ ., data = bc_data_3, method = "rf", preProcess = c("scale", "center"), trControl = control)
p3 <- feature_imp(imp_3, title = "Breast cancer dataset 3")
grid.arrange(p1, p2, p3, ncol = 3, widths = c(0.3, 0.35, 0.35))


Feature Selection

Now that I have a general idea about the data, I will run three feature selection methods on all three datasets and compare how they affect the prediction accuracy of a Random Forest model.

Creating train and test data

Before doing anything else with the data, we need to split the datasets into train and test data. Performing feature selection on the whole dataset would lead to prediction bias because information from the test samples would leak into the feature choice; we therefore need to run the whole modeling process on the training data alone!


  • Dataset 1
set.seed(27)
bc_data_index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
bc_data_train <- bc_data[bc_data_index, ]
bc_data_test  <- bc_data[-bc_data_index, ]
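
Since createDataPartition samples within each class, the class proportions of the training partition should mirror the full data. A quick sanity check (my own addition, not part of the original workflow):

# class proportions in the full data vs. the training partition
prop.table(table(bc_data$classes))
prop.table(table(bc_data_train$classes))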


  • Dataset 2
set.seed(27)
bc_data_2_index <- createDataPartition(bc_data_2$diagnosis, p = 0.7, list = FALSE)
bc_data_2_train <- bc_data_2[bc_data_2_index, ]
bc_data_2_test  <- bc_data_2[-bc_data_2_index, ]


  • Dataset 3
set.seed(27)
bc_data_3_index <- createDataPartition(bc_data_3$outcome, p = 0.7, list = FALSE)
bc_data_3_train <- bc_data_3[bc_data_3_index, ]
bc_data_3_test  <- bc_data_3[-bc_data_3_index, ]


Correlation

Often we have features that are highly correlated and thus provide redundant information. By eliminating highly correlated features we can avoid a predictive bias for the information contained in these features. This also shows us that when we want to make statements about the biological/medical importance of specific features, we need to keep in mind that just because they are suitable for predicting an outcome, they are not necessarily causal - they could simply be correlated with causal factors.

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean absolute correlation of each correlated pair.


  • Dataset 1
library(corrplot)

# calculate correlation matrix
corMatMy <- cor(bc_data_train[, -1])
corrplot(corMatMy, order = "hclust")

# apply correlation filter at 0.70
highlyCor <- colnames(bc_data_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 2  and column  3 with corr  0.913 
##   Means:  0.716 vs 0.601 so flagging column 2 
## Compare row 3  and column  6 with corr  0.741 
##   Means:  0.677 vs 0.579 so flagging column 3 
## Compare row 6  and column  7 with corr  0.706 
##   Means:  0.602 vs 0.545 so flagging column 6 
## All correlations <= 0.7
# which variables are flagged for removal?
highlyCor
## [1] "uniformity_of_cell_size"  "uniformity_of_cell_shape"
## [3] "bare_nuclei"
# then we remove these variables
bc_data_cor <- bc_data_train[, which(!colnames(bc_data_train) %in% highlyCor)]

Correlation between features in dataset 1 is generally high, and 3 of the 9 features were flagged for removal (see output above).


  • Dataset 2
corMatMy <- cor(bc_data_2_train[, 3:32])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_2_train[, 3:32])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 7  and column  8 with corr  0.92 
##   Means:  0.579 vs 0.393 so flagging column 7 
## Compare row 8  and column  6 with corr  0.84 
##   Means:  0.548 vs 0.38 so flagging column 8 
## Compare row 6  and column  28 with corr  0.827 
##   Means:  0.536 vs 0.368 so flagging column 6 
## Compare row 28  and column  27 with corr  0.855 
##   Means:  0.506 vs 0.357 so flagging column 28 
## Compare row 27  and column  26 with corr  0.894 
##   Means:  0.46 vs 0.346 so flagging column 27 
## Compare row 23  and column  21 with corr  0.993 
##   Means:  0.454 vs 0.336 so flagging column 23 
## Compare row 21  and column  24 with corr  0.983 
##   Means:  0.419 vs 0.327 so flagging column 21 
## Compare row 26  and column  30 with corr  0.817 
##   Means:  0.402 vs 0.323 so flagging column 26 
## Compare row 24  and column  3 with corr  0.943 
##   Means:  0.383 vs 0.312 so flagging column 24 
## Compare row 3  and column  1 with corr  0.998 
##   Means:  0.347 vs 0.306 so flagging column 3 
## Compare row 1  and column  4 with corr  0.986 
##   Means:  0.302 vs 0.304 so flagging column 4 
## Compare row 1  and column  14 with corr  0.726 
##   Means:  0.264 vs 0.304 so flagging column 14 
## Compare row 13  and column  11 with corr  0.973 
##   Means:  0.32 vs 0.304 so flagging column 13 
## Compare row 18  and column  16 with corr  0.757 
##   Means:  0.388 vs 0.295 so flagging column 18 
## Compare row 16  and column  17 with corr  0.796 
##   Means:  0.404 vs 0.288 so flagging column 16 
## Compare row 9  and column  29 with corr  0.711 
##   Means:  0.343 vs 0.274 so flagging column 9 
## Compare row 17  and column  20 with corr  0.745 
##   Means:  0.306 vs 0.261 so flagging column 17 
## Compare row 5  and column  25 with corr  0.809 
##   Means:  0.311 vs 0.255 so flagging column 5 
## Compare row 30  and column  10 with corr  0.753 
##   Means:  0.288 vs 0.241 so flagging column 30 
## Compare row 22  and column  2 with corr  0.913 
##   Means:  0.243 vs 0.242 so flagging column 22 
## All correlations <= 0.7
highlyCor
##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "concave_points_largest_worst"   
##  [5] "concavity_largest_worst"         "perimeter_largest_worst"        
##  [7] "radius_largest_worst"            "compactness_largest_worst"      
##  [9] "area_largest_worst"              "perimeter_mean"                 
## [11] "perimeter_se"                    "area_mean"                      
## [13] "concave_points_se"               "compactness_se"                 
## [15] "area_se"                         "symmetry_mean"                  
## [17] "concavity_se"                    "smoothness_mean"                
## [19] "fractal_dimension_largest_worst" "texture_largest_worst"
bc_data_2_cor <- bc_data_2_train[, which(!colnames(bc_data_2_train) %in% highlyCor)]

Here, we have more variation between the 30 features: some are highly correlated, while others seem to be very distinct. 20 are flagged for removal (see output above).


  • Dataset 3
corMatMy <- cor(bc_data_3_train[, -1])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_3_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 8  and column  9 with corr  0.898 
##   Means:  0.425 vs 0.286 so flagging column 8 
## Compare row 9  and column  7 with corr  0.714 
##   Means:  0.39 vs 0.277 so flagging column 9 
## Compare row 7  and column  29 with corr  0.753 
##   Means:  0.364 vs 0.271 so flagging column 7 
## Compare row 4  and column  2 with corr  0.996 
##   Means:  0.348 vs 0.264 so flagging column 4 
## Compare row 2  and column  5 with corr  0.993 
##   Means:  0.33 vs 0.259 so flagging column 2 
## Compare row 5  and column  24 with corr  0.921 
##   Means:  0.304 vs 0.255 so flagging column 5 
## Compare row 24  and column  22 with corr  0.985 
##   Means:  0.272 vs 0.252 so flagging column 24 
## Compare row 11  and column  31 with corr  0.83 
##   Means:  0.341 vs 0.247 so flagging column 11 
## Compare row 22  and column  15 with corr  0.773 
##   Means:  0.24 vs 0.242 so flagging column 15 
## Compare row 22  and column  25 with corr  0.989 
##   Means:  0.217 vs 0.242 so flagging column 25 
## Compare row 14  and column  12 with corr  0.975 
##   Means:  0.256 vs 0.243 so flagging column 14 
## Compare row 31  and column  28 with corr  0.71 
##   Means:  0.328 vs 0.238 so flagging column 31 
## Compare row 18  and column  17 with corr  0.812 
##   Means:  0.331 vs 0.229 so flagging column 18 
## Compare row 28  and column  27 with corr  0.84 
##   Means:  0.286 vs 0.219 so flagging column 28 
## Compare row 17  and column  21 with corr  0.839 
##   Means:  0.285 vs 0.212 so flagging column 17 
## Compare row 10  and column  30 with corr  0.766 
##   Means:  0.277 vs 0.204 so flagging column 10 
## Compare row 6  and column  26 with corr  0.754 
##   Means:  0.235 vs 0.198 so flagging column 6 
## Compare row 3  and column  23 with corr  0.858 
##   Means:  0.164 vs 0.195 so flagging column 23 
## All correlations <= 0.7
highlyCor
##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "perimeter_mean"                 
##  [5] "radius_mean"                     "area_mean"                      
##  [7] "perimeter_largest_worst"         "fractal_dimension_mean"         
##  [9] "perimeter_se"                    "area_se"                        
## [11] "area_largest_worst"              "fractal_dimension_largest_worst"
## [13] "concavity_se"                    "concavity_largest_worst"        
## [15] "compactness_se"                  "symmetry_mean"                  
## [17] "smoothness_mean"                 "texture_largest_worst"
bc_data_3_cor <- bc_data_3_train[, which(!colnames(bc_data_3_train) %in% highlyCor)]

The features in dataset 3 look similar: some are highly correlated, others are very different. 18 are flagged for removal (see output above).


Recursive Feature Elimination (RFE)

Another way to choose features is Recursive Feature Elimination. RFE uses a Random Forest algorithm to rank the features by importance and then tests feature subsets of decreasing size, rating each with a cross-validated accuracy score. The subset with the highest score is usually preferred.


  • Dataset 1
# ensure the results are repeatable
set.seed(7)
# define the control using a random forest selection function with cross validation
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# run the RFE algorithm
results_1 <- rfe(x = bc_data_train[, -1], y = bc_data_train$classes, sizes = c(1:9), rfeControl = control)

# chosen features
predictors(results_1)
## [1] "bare_nuclei"                 "uniformity_of_cell_size"    
## [3] "clump_thickness"             "uniformity_of_cell_shape"   
## [5] "bland_chromatin"             "marginal_adhesion"          
## [7] "mitosis"                     "normal_nucleoli"            
## [9] "single_epithelial_cell_size"
# subset the chosen features
bc_data_rfe <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% predictors(results_1)))]
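
To see how the cross-validated accuracy develops with the number of retained features, the rfe object can also be plotted directly (a small usage note, not part of the original workflow):

plot(results_1, type = c("g", "o"))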


  • Dataset 2
set.seed(7)
results_2 <- rfe(x = bc_data_2_train[, -c(1, 2)], y = as.factor(bc_data_2_train$diagnosis), sizes = c(1:30), rfeControl = control)

predictors(results_2)
##  [1] "perimeter_largest_worst"         "area_largest_worst"             
##  [3] "radius_largest_worst"            "concave_points_largest_worst"   
##  [5] "concave_points_mean"             "texture_largest_worst"          
##  [7] "texture_mean"                    "area_se"                        
##  [9] "concavity_largest_worst"         "concavity_mean"                 
## [11] "radius_se"                       "perimeter_mean"                 
## [13] "radius_mean"                     "area_mean"                      
## [15] "perimeter_se"                    "smoothness_largest_worst"       
## [17] "compactness_largest_worst"       "symmetry_largest_worst"         
## [19] "compactness_mean"                "smoothness_mean"                
## [21] "fractal_dimension_largest_worst" "concavity_se"                   
## [23] "symmetry_mean"
bc_data_2_rfe <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% predictors(results_2)))]


  • Dataset 3
set.seed(7)
results_3 <- rfe(x = bc_data_3_train[,-1], y = as.factor(bc_data_3_train$outcome), sizes = c(1:33), rfeControl = control)

predictors(results_3)
bc_data_3_rfe <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% predictors(results_3)))]


Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.
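
To make the mapping from GA terminology to feature selection concrete, here is a minimal toy sketch (my own illustration, not how caret’s gafs works internally): an “individual” is a logical mask over the features, and its fitness would in practice be a resampled model accuracy.

# toy GA step for feature selection; all names here are illustrative
set.seed(27)
n_features <- 9   # e.g. the features of dataset 1
pop_size <- 5

# initial population: one random feature mask per individual (columns)
population <- replicate(pop_size, sample(c(TRUE, FALSE), n_features, replace = TRUE))

# stand-in fitness; in practice e.g. cross-validated accuracy of a
# Random Forest trained on the masked feature set
fitness <- function(mask) if (!any(mask)) 0 else runif(1)

# one generation: score all individuals, pick the two fittest as parents,
# recombine their masks and apply random mutations to get an offspring
scores  <- apply(population, 2, fitness)
parents <- population[, order(scores, decreasing = TRUE)[1:2]]
child   <- ifelse(sample(1:2, n_features, replace = TRUE) == 1, parents[, 1], parents[, 2])
mutated <- runif(n_features) < 0.1
child[mutated] <- !child[mutated]
child   # feature mask of one offspring for the next generation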

For demonstration purposes, I am using only 10 generations with a population size of 5. More iterations with larger populations would of course be preferable, but this takes quite a while to run!

library(dplyr)

ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
                       method = "cv",    # 10 fold cross validation
                       genParallel = TRUE, # Use parallel programming
                       allowParallel = TRUE)


  • Dataset 1
lev <- c("malignant", "benign")     # Set the levels

set.seed(27)
model_1 <- gafs(x = bc_data_train[, -1], y = bc_data_train$classes,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_1) # Plot mean fitness (AUC) by generation

model_1$ga$final
## [1] "clump_thickness"             "uniformity_of_cell_shape"   
## [3] "marginal_adhesion"           "single_epithelial_cell_size"
## [5] "bare_nuclei"                 "bland_chromatin"            
## [7] "normal_nucleoli"             "mitosis"
bc_data_ga <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% model_1$ga$final))]


  • Dataset 2
lev <- c("M", "B")

set.seed(27)
model_2 <- gafs(x = bc_data_2_train[, -c(1, 2)], y = bc_data_2_train$diagnosis,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_2)

model_2$ga$final
##  [1] "radius_mean"                  "texture_mean"                
##  [3] "perimeter_mean"               "area_mean"                   
##  [5] "smoothness_mean"              "compactness_mean"            
##  [7] "concavity_mean"               "symmetry_mean"               
##  [9] "fractal_dimension_mean"       "texture_se"                  
## [11] "perimeter_se"                 "area_se"                     
## [13] "smoothness_se"                "compactness_se"              
## [15] "concavity_se"                 "concave_points_se"           
## [17] "symmetry_se"                  "radius_largest_worst"        
## [19] "texture_largest_worst"        "smoothness_largest_worst"    
## [21] "compactness_largest_worst"    "concave_points_largest_worst"
bc_data_2_ga <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% model_2$ga$final))]


  • Dataset 3
lev <- c("R", "N")

set.seed(27)
model_3 <- gafs(x = bc_data_3_train[, -1], y = bc_data_3_train$outcome,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_3)

model_3$ga$final
##  [1] "time"                            "perimeter_mean"                 
##  [3] "radius_se"                       "texture_se"                     
##  [5] "compactness_se"                  "concavity_se"                   
##  [7] "symmetry_se"                     "fractal_dimension_se"           
##  [9] "radius_largest_worst"            "texture_largest_worst"          
## [11] "perimeter_largest_worst"         "smoothness_largest_worst"       
## [13] "compactness_largest_worst"       "concavity_largest_worst"        
## [15] "concave_points_largest_worst"    "symmetry_largest_worst"         
## [17] "fractal_dimension_largest_worst" "tumor_size"                     
## [19] "lymph_node_status"
bc_data_3_ga <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% model_3$ga$final))]


Model comparison

Now I can compare Random Forest models built with the different feature subsets. (I chose Random Forests because I find it works reasonably well on a large variety of data and because it’s the algorithm I’ve used most often and thus know most about…)

For a more detailed description of building machine learning models see here and here.


All features

  • Dataset 1
set.seed(27)
model_bc_data_all <- train(classes ~ .,
                           data = bc_data_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_1 <- confusionMatrix(predict(model_bc_data_all, bc_data_test[, -1]), bc_data_test$classes)
cm_all_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         5
##   malignant      3        67
##                                          
##                Accuracy : 0.9617         
##                  95% CI : (0.926, 0.9833)
##     No Information Rate : 0.6555         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9147         
##  Mcnemar's Test P-Value : 0.7237         
##                                          
##             Sensitivity : 0.9781         
##             Specificity : 0.9306         
##          Pos Pred Value : 0.9640         
##          Neg Pred Value : 0.9571         
##              Prevalence : 0.6555         
##          Detection Rate : 0.6411         
##    Detection Prevalence : 0.6651         
##       Balanced Accuracy : 0.9543         
##                                          
##        'Positive' Class : benign         
## 
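
For the comparisons below it is handy that individual statistics can be pulled from the confusionMatrix object directly (a small usage note; cm_all_1 is the object created above):

cm_all_1$overall["Accuracy"]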


  • Dataset 2
set.seed(27)
model_bc_data_2_all <- train(diagnosis ~ .,
                           data = bc_data_2_train[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_2 <- confusionMatrix(predict(model_bc_data_2_all, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_all_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 


  • Dataset 3
set.seed(27)
model_bc_data_3_all <- train(outcome ~ .,
                           data = bc_data_3_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_3 <- confusionMatrix(predict(model_bc_data_3_all, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_all_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 


Selected features

Dataset 1

library(gplots)

venn_list <- list(cor = colnames(bc_data_cor)[-1],
                  rfe = colnames(bc_data_rfe)[-1],
                  ga = colnames(bc_data_ga)[-1])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   0   0   0  1
## 010   1   0   1  0
## 011   2   0   1  1
## 100   0   1   0  0
## 101   0   1   0  1
## 110   0   1   1  0
## 111   6   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "clump_thickness"             "marginal_adhesion"          
## [3] "single_epithelial_cell_size" "bland_chromatin"            
## [5] "normal_nucleoli"             "mitosis"                    
## 
## attr(,"intersections")$rfe
## [1] "uniformity_of_cell_size"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "uniformity_of_cell_shape" "bare_nuclei"             
## 
## attr(,"class")
## [1] "venn"

6 of the 9 features were chosen by all three methods; the biggest overlap is between RFE and GA, which share 8 features. RFE retained all 9 features and GA 8 features for modeling, compared to only 6 based on the correlation method.
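
Side note: instead of reading the printed attributes, the intersection lists can be extracted from the venn object directly (a minimal usage sketch based on the attribute names in the output above):

# e.g. the features chosen by all three methods
attr(venn, "intersections")$`cor:rfe:ga`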


  • Correlation
set.seed(27)
model_bc_data_cor <- train(classes ~ .,
                 data = bc_data_cor,
                 method = "rf",
                 preProcess = c("scale", "center"),
                 trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_1 <- confusionMatrix(predict(model_bc_data_cor, bc_data_test[, -1]), bc_data_test$classes)
cm_cor_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       131         6
##   malignant      6        66
##                                         
##                Accuracy : 0.9426        
##                  95% CI : (0.9019, 0.97)
##     No Information Rate : 0.6555        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.8729        
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.9562        
##             Specificity : 0.9167        
##          Pos Pred Value : 0.9562        
##          Neg Pred Value : 0.9167        
##              Prevalence : 0.6555        
##          Detection Rate : 0.6268        
##    Detection Prevalence : 0.6555        
##       Balanced Accuracy : 0.9364        
##                                         
##        'Positive' Class : benign        
## 


  • RFE
set.seed(27)
model_bc_data_rfe <- train(classes ~ .,
                           data = bc_data_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_1 <- confusionMatrix(predict(model_bc_data_rfe, bc_data_test[, -1]), bc_data_test$classes)
cm_rfe_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         4
##   malignant      3        68
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9322, 0.9864)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9256          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9710          
##          Neg Pred Value : 0.9577          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6603          
##       Balanced Accuracy : 0.9613          
##                                           
##        'Positive' Class : benign          
## 


  • GA
set.seed(27)
model_bc_data_ga <- train(classes ~ .,
                           data = bc_data_ga,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_1 <- confusionMatrix(predict(model_bc_data_ga, bc_data_test[, -1]), bc_data_test$classes)
cm_ga_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         3
##   malignant      3        69
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9386, 0.9894)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9364          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9583          
##          Pos Pred Value : 0.9781          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6555          
##       Balanced Accuracy : 0.9682          
##                                           
##        'Positive' Class : benign          
## 


Dataset 2

venn_list <- list(cor = colnames(bc_data_2_cor)[-c(1, 2)],
                  rfe = colnames(bc_data_2_rfe)[-c(1, 2)],
                  ga = colnames(bc_data_2_ga)[-c(1, 2)])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   2   0   0  1
## 010   5   0   1  0
## 011  13   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   2   1   1  0
## 111   2   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "texture_mean"             "smoothness_largest_worst"
## 
## attr(,"intersections")$ga
## [1] "compactness_se"    "concave_points_se"
## 
## attr(,"intersections")$cor
## [1] "radius_mean"          "fractal_dimension_se"
## 
## attr(,"intersections")$rfe
## [1] "concave_points_mean"             "perimeter_largest_worst"        
## [3] "area_largest_worst"              "concavity_largest_worst"        
## [5] "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:ga`
## [1] "fractal_dimension_mean" "texture_se"            
## [3] "smoothness_se"          "symmetry_se"           
## 
## attr(,"intersections")$`rfe:ga`
##  [1] "perimeter_mean"               "area_mean"                   
##  [3] "smoothness_mean"              "compactness_mean"            
##  [5] "concavity_mean"               "symmetry_mean"               
##  [7] "perimeter_se"                 "area_se"                     
##  [9] "concavity_se"                 "radius_largest_worst"        
## [11] "texture_largest_worst"        "compactness_largest_worst"   
## [13] "concave_points_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "radius_se"              "symmetry_largest_worst"
## 
## attr(,"class")
## [1] "venn"

For dataset 2 we see much bigger variation in the chosen features between the three selection methods: only 2 features were chosen by all three, and the biggest overlap is again seen between RFE and GA, followed by correlation and GA. But this time we also see quite a few features that were retained by only one of the three feature selection methods.


  • Correlation
set.seed(27)
model_bc_data_2_cor <- train(diagnosis ~ .,
                           data = bc_data_2_cor[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_2 <- confusionMatrix(predict(model_bc_data_2_cor, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_cor_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   6
##          M   1  57
##                                          
##                Accuracy : 0.9588         
##                  95% CI : (0.917, 0.9833)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9103         
##  Mcnemar's Test P-Value : 0.1306         
##                                          
##             Sensitivity : 0.9907         
##             Specificity : 0.9048         
##          Pos Pred Value : 0.9464         
##          Neg Pred Value : 0.9828         
##              Prevalence : 0.6294         
##          Detection Rate : 0.6235         
##    Detection Prevalence : 0.6588         
##       Balanced Accuracy : 0.9477         
##                                          
##        'Positive' Class : B              
## 


  • RFE
set.seed(27)
model_bc_data_2_rfe <- train(diagnosis ~ .,
                           data = bc_data_2_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_2 <- confusionMatrix(predict(model_bc_data_2_rfe, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_rfe_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 


  • GA
set.seed(27)
model_bc_data_2_ga <- train(diagnosis ~ .,
                          data = bc_data_2_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_2 <- confusionMatrix(predict(model_bc_data_2_ga, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_ga_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   4
##          M   1  59
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9327, 0.9904)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9363          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9365          
##          Pos Pred Value : 0.9636          
##          Neg Pred Value : 0.9833          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6471          
##       Balanced Accuracy : 0.9636          
##                                           
##        'Positive' Class : B               
## 


Dataset 3

venn_list <- list(cor = colnames(bc_data_3_cor)[-1],
                  rfe = colnames(bc_data_3_rfe)[-1],
                  ga = colnames(bc_data_3_ga)[-1])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   5   0   0  1
## 010   6   0   1  0
## 011   2   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   1   1   1  0
## 111   8   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "time"                         "radius_se"                   
## [3] "symmetry_se"                  "smoothness_largest_worst"    
## [5] "concave_points_largest_worst" "symmetry_largest_worst"      
## [7] "tumor_size"                   "lymph_node_status"           
## 
## attr(,"intersections")$ga
## [1] "perimeter_mean"          "compactness_se"         
## [3] "texture_largest_worst"   "perimeter_largest_worst"
## [5] "concavity_largest_worst"
## 
## attr(,"intersections")$cor
## [1] "texture_mean"  "smoothness_se"
## 
## attr(,"intersections")$rfe
## [1] "compactness_mean"       "concavity_mean"        
## [3] "concave_points_mean"    "symmetry_mean"         
## [5] "fractal_dimension_mean" "perimeter_se"          
## 
## attr(,"intersections")$`cor:ga`
## [1] "texture_se"                "fractal_dimension_se"     
## [3] "radius_largest_worst"      "compactness_largest_worst"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "concavity_se"                    "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "concave_points_se"
## 
## attr(,"class")
## [1] "venn"

With the third dataset, there is still some variation among the selected features, but it is not as extreme as with dataset 2. This time, the biggest pairwise overlap is between correlation and GA.
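
The shared features can also be extracted programmatically instead of read off the printed output: gplots stores them in the "intersections" attribute of the venn object. A minimal sketch, using the venn object from above:

# e.g. the features selected by both the correlation filter and the GA
attr(venn, "intersections")[["cor:ga"]]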


  • Correlation
set.seed(27)
model_bc_data_3_cor <- train(outcome ~ .,
                           data = bc_data_3_cor,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_3 <- confusionMatrix(predict(model_bc_data_3_cor, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_cor_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 41  7
##          R  4  7
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.6909, 0.9031)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.2256          
##                                           
##                   Kappa : 0.4439          
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9111          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8542          
##          Neg Pred Value : 0.6364          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6949          
##    Detection Prevalence : 0.8136          
##       Balanced Accuracy : 0.7056          
##                                           
##        'Positive' Class : N               
## 


  • RFE
set.seed(27)
model_bc_data_3_rfe <- train(outcome ~ .,
                           data = bc_data_3_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_3 <- confusionMatrix(predict(model_bc_data_3_rfe, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_rfe_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 40  7
##          R  5  7
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.6717, 0.8902)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.3311          
##                                           
##                   Kappa : 0.409           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8511          
##          Neg Pred Value : 0.5833          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6780          
##    Detection Prevalence : 0.7966          
##       Balanced Accuracy : 0.6944          
##                                           
##        'Positive' Class : N               
## 


  • GA
set.seed(27)
model_bc_data_3_ga <- train(outcome ~ .,
                          data = bc_data_3_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_3 <- confusionMatrix(predict(model_bc_data_3_ga, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_ga_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 


Overall model parameters

To compare the feature selection methods’ influence on the models built from the three datasets, I am going to plot model accuracy, Kappa, precision, sensitivity, and specificity.

# collect the overall performance measures of all 12 models
overall <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$overall,
                      cm_cor_1$overall,
                      cm_rfe_1$overall,
                      cm_ga_1$overall,
                      cm_all_2$overall,
                      cm_cor_2$overall,
                      cm_rfe_2$overall,
                      cm_ga_2$overall,
                      cm_all_3$overall,
                      cm_cor_3$overall,
                      cm_rfe_3$overall,
                      cm_ga_3$overall))

library(tidyr)

# gather Accuracy and Kappa into long format for plotting
overall_gather <- overall[, 1:4] %>%
  gather(measure, value, Accuracy:Kappa)

# collect the class-specific performance measures of the same models
byClass <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$byClass,
                      cm_cor_1$byClass,
                      cm_rfe_1$byClass,
                      cm_ga_1$byClass,
                      cm_all_2$byClass,
                      cm_cor_2$byClass,
                      cm_rfe_2$byClass,
                      cm_ga_2$byClass,
                      cm_all_3$byClass,
                      cm_cor_3$byClass,
                      cm_rfe_3$byClass,
                      cm_ga_3$byClass))

# keep dataset, model, Sensitivity, Specificity and Precision, then gather
byClass_gather <- byClass[, c(1:4, 7)] %>%
  gather(measure, value, Sensitivity:Precision)

# combine both sets of measures and fix the model order for plotting
overall_byClass_gather <- rbind(overall_gather, byClass_gather)
overall_byClass_gather <- within(overall_byClass_gather, model <- factor(model, levels = c("all", "cor", "rfe", "ga")))

ggplot(overall_byClass_gather, aes(x = model, y = value, color = measure, shape = measure, group = measure)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_path(alpha = 0.7) +
  scale_colour_brewer(palette = "Set1") +
  facet_grid(dataset ~ ., scales = "free_y") +
  labs(
    x = "Feature Selection method",
    y = "Value",
    color = "",
    shape = "",
    title = "Comparison of feature selection methods",
    subtitle = "in three breast cancer datasets",
    caption = "\nBreast Cancer Wisconsin (Diagnostic) Data Sets: 1, 2 & 3
    Street et al., 1993;
    all: no feature selection
    cor: features with correlation > 0.7 removed
    rfe: Recursive Feature Elimination
    ga: Genetic Algorithm"
  )

Conclusions

As expected from the PCA of the sample classes, which showed that the two classes (recurrence vs. non-recurrence) did not cluster well, the Random Forest models for dataset 3 had much lower prediction accuracy than the models for datasets 1 and 2.

The correlation method removes features regardless of their importance; in dataset 1, for example, the features with the highest importance were among those flagged as highly correlated. Accordingly, the correlation-filtered models tended to perform worst, although in dataset 3 the differences between methods were small (there, RFE actually gave the lowest test accuracy). RFE and GA tend to retain features with high importance, but importance alone is not a good indicator of whether several features will work well in combination when predicting an outcome.
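
Statements like these are easy to verify by pulling the test set accuracies straight out of the confusion matrix objects; a minimal sketch for dataset 3 (the same pattern works for the cm_*_1 and cm_*_2 objects):

# test set accuracy per feature selection method on dataset 3
sapply(list(cor = cm_cor_3, rfe = cm_rfe_3, ga = cm_ga_3),
       function(cm) cm$overall["Accuracy"])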

Dataset 1 was small, with only 9 features; here, removing highly correlated features was the least successful selection method. RFE and GA both improved the predictions compared to no feature selection, with GA performing best. Dataset 2, with its 30 original features, produced the best models after GA selection. And in dataset 3, which had lower overall accuracy, the choice of feature selection method did not have a strong influence.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] C
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_3.0.1         dplyr_0.5.0          plyr_1.8.4          
##  [4] corrplot_0.77        randomForest_4.6-12  doParallel_1.0.10   
##  [7] iterators_1.0.8      foreach_1.4.3        caret_6.0-73        
## [10] lattice_0.20-34      tidyr_0.6.0          gridExtra_2.2.1     
## [13] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
## [16] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [19] ellipse_0.3-8        ggplot2_2.2.1        mice_2.25           
## [22] Rcpp_0.12.8         
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0         class_7.3-14         assertthat_0.1      
##  [4] rprojroot_1.1        digest_0.6.11        R6_2.2.0            
##  [7] backports_1.0.4      MatrixModels_0.4-1   RSQLite_1.1-1       
## [10] evaluate_0.10        e1071_1.6-7          zlibbioc_1.20.0     
## [13] lazyeval_0.2.0       gdata_2.17.0         minqa_1.2.4         
## [16] SparseM_1.74         car_2.1-4            nloptr_1.0.4        
## [19] rpart_4.1-10         Matrix_1.2-7.1       rmarkdown_1.3       
## [22] labeling_0.3         splines_3.3.2        lme4_1.1-12         
## [25] stringr_1.1.0        munsell_0.4.3        compiler_3.3.2      
## [28] mgcv_1.8-16          htmltools_0.3.5      nnet_7.3-12         
## [31] tibble_1.2           codetools_0.2-15     bitops_1.0-6        
## [34] MASS_7.3-45          ModelMetrics_1.1.0   nlme_3.1-128        
## [37] gtable_0.2.0         DBI_0.5-1            magrittr_1.5        
## [40] scales_0.4.1         KernSmooth_2.23-15   stringi_1.1.2       
## [43] reshape2_1.4.2       RColorBrewer_1.1-2   tools_3.3.2         
## [46] Biobase_2.34.0       pbkrtest_0.4-6       survival_2.40-1     
## [49] yaml_2.1.14          AnnotationDbi_1.36.0 colorspace_1.3-2    
## [52] caTools_1.17.1       memoise_1.0.0        knitr_1.15.1        
## [55] quantreg_5.29