Data on tour: Plotting 3D maps and location tracks

2017-04-09 00:00:00 +0000

Recently, I was on Gran Canaria for a vacation. So, what better way to keep up the holiday spirit a while longer than to visualize in R all the places we went!?

I am combining location data collected by

  1. our car GPS,
  2. Google location data from my phone and
  3. the hiking tracks we followed.

Obviously, the data itself is not public this time, but the principles for plotting can be applied to any .gpx files that contain latitude, longitude and elevation data.


Gran Canaria

Gran Canaria is one of the Spanish Canary Islands off the coast of Morocco. Like its sister islands, it is of volcanic origin.


Car GPS tracks

The GPS tracks in .gpx format were downloaded from the device (Garmin) the same way as I had already done with the tracks from our US/Canada roadtrip last year. Garmin’s .gpx tracks are not in a standard format that could be read with readGPX(), so I had to use htmlTreeParse() from the XML library.

library(XML)

# list of files in directory
myfiles <- list.files(path = "Takeout_2017_March", full.names = TRUE)

# all gpx files in that directory
myfiles <- myfiles[grep(".gpx", myfiles)]

for (i in 1:length(myfiles)) {

    # parse document
    pfile <- htmlTreeParse(myfiles[i], useInternalNodes = T)
  
    # Get all elevations, times and coordinates via the respective xpath
    elevations <- as.numeric(as.character(xpathSApply(pfile, path = "//trkpt/ele", xmlValue)))
    times <- xpathSApply(pfile, path = "//trkpt/time", xmlValue)
    coords <- xpathSApply(pfile, path = "//trkpt", xmlAttrs)
    
    # Extract latitude and longitude from the coordinates
    lats <- as.numeric(as.character(coords["lat",]))
    lons <- as.numeric(as.character(coords["lon",]))
    
    # Put everything in a dataframe
    geodf <- data.frame(lat = lats, lon = lons, ele = elevations, time = times)

    if (i == 1) {
      geodata <- geodf
      } else {
      geodata <- rbind(geodata, geodf)
    }
}

geodata_all <- geodata

# Transforming the time column
geodata_all$time <- as.character(strptime(geodata_all$time, format = "%Y-%m-%dT%H:%M:%SZ"))

# ordering by date
library(dplyr)
geodata_all <- arrange(geodata_all, time)

# keep only data points from Gran Canaria
geodata_gc <- filter(geodata_all, lat < 28.5 & lat > 27.5) %>%
  filter(lon > -16 & lon < -15)


Hiking tracks

The hiking tracks we followed came mostly from a German hiking guide-book, the Rother Wanderführer, 7th edition from 2016. They were in standard .gpx format and could be read with readGPX().

Only one of our hikes did not come from this book, but from Wikiloc. It could be treated the same way as the other hiking tracks, though, so I combined all hiking tracks.

library(plotKML)

# get file names
myfiles <- list.files(path = "hiking", full.names = TRUE)
myfiles <- myfiles[grep(".gpx", myfiles)]

for (i in 1:length(myfiles)) {

  # read in gpx files
  t <- readGPX(myfiles[i])
  
  # extract latitude, longitude and elevation
  geodf <- data.frame(lon = t$tracks[[1]][[1]]$lon,
                      lat = t$tracks[[1]][[1]]$lat,
                      ele = t$tracks[[1]][[1]]$ele,
                      tour = names(t$tracks[[1]]))

    # combine into data frame
    if (i == 1) {
      geodata <- geodf
      } else {
      geodata <- rbind(geodata, geodf)
    }
  rm(t) # clean up the parsed object before the next iteration
}

geodata_tracks <- geodata


Google location history

The Google location data from my phone were prepared the same way as in my post about plotting your Google location history.

library(jsonlite)
system.time(x <- fromJSON("Standortverlauf.json"))
# extracting the locations dataframe
loc = x$locations

# converting time column from posix milliseconds into a readable time scale
loc$time = as.POSIXct(as.numeric(x$locations$timestampMs)/1000, origin = "1970-01-01")

# converting longitude and latitude from E7 to GPS coordinates
loc$lat = loc$latitudeE7 / 1e7
loc$lon = loc$longitudeE7 / 1e7

# keep only data from Gran Canaria
loc_gc <- filter(loc, lat < 28.5 & lat > 27.5) %>%
  filter(lon > -16 & lon < -15)


All three datasets contain latitude/longitude coordinate pairs and the elevation of each observation, so I can combine these three attributes from the three location data sources.

geodata_tracks$ele <- as.numeric(as.character(geodata_tracks$ele))
data_combined <- data.frame(lat = c(geodata_gc$lat, loc_gc$lat, geodata_tracks$lat),
                            lon = c(geodata_gc$lon, loc_gc$lon, geodata_tracks$lon),
                            ele = c(geodata_gc$ele, loc_gc$altitude, geodata_tracks$ele),
                            track = c(rep("GPS", nrow(geodata_gc)), rep("Google", nrow(loc_gc)), rep("Hiking", nrow(geodata_tracks))))
data_combined <- data_combined[!duplicated(data_combined), ]


Elevation profile

I downloaded the elevation profile of Gran Canaria with the maptools and raster packages.

library(maptools)
library(raster)
srtm <- getData("SRTM", lon = -15.59972, lat = 27.965)

# crop to Gran Canaria & Tenerife
e1 <- extent(min(data_combined$lon) - 1.2, # xmin
            max(data_combined$lon) + 0.1, # xmax
            min(data_combined$lat) - 0.1, # ymin
            max(data_combined$lat) + 0.5) # ymax

srtm_ct <- crop(srtm, e1)

# plot slope and aspect with hill shades
slope <- terrain(srtm_ct, opt = "slope")
aspect <- terrain(srtm_ct, opt = "aspect")
hill <- hillShade(slope, aspect, angle = 45, direction = 45, normalize = TRUE)
plot(hill, col = grey(0:100/100), legend = FALSE)
plot(srtm_ct, col = rainbow(25, alpha = 0.35), add = TRUE)

Its sister island to the west, Tenerife, is a bit larger and, as can be seen in the image above, its highest point, the Teide, is about 1000 meters higher than Gran Canaria’s highest point.


# crop to Gran Canaria
e2 <- extent(min(data_combined$lon) - 0.2, # xmin
            max(data_combined$lon) + 0.1, # xmax
            min(data_combined$lat) - 0.1, # ymin
            max(data_combined$lat) + 0.1) # ymax

srtm_c <- crop(srtm, e2)

slope <- terrain(srtm_c, opt = "slope")
aspect <- terrain(srtm_c, opt = "aspect")
hill <- hillShade(slope, aspect, angle = 45, direction = 45, normalize = TRUE)
plot(hill, col = grey(0:100/100), legend = FALSE)
plot(srtm_c, col = rainbow(25, alpha = 0.35), add = TRUE)

Looking only at Gran Canaria’s slope and aspect profile, the plots nicely show its highest point (almost 2000 m) in the center of the island and that it slopes down towards sea level at the coast.


ggmap

I centered the map at the midpoint of Gran Canaria, which is given by http://www.travelmath.com/island/Gran+Canaria as 27° 57’ 54” N / 15° 35’ 59” W. Converted to decimal with http://www.rapidtables.com/convert/number/degrees-minutes-seconds-to-degrees.htm, that is 27.965 / -15.59972 (the longitude is negative because it lies west of the prime meridian).
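The conversion from degrees/minutes/seconds to decimal degrees is simple arithmetic, so here is a quick sanity check in R (the dms2dec() helper is just for illustration, not part of the original workflow):

# convert degrees, minutes and seconds to decimal degrees
dms2dec <- function(d, m, s) d + m / 60 + s / 3600

dms2dec(27, 57, 54)    # latitude:   27.965
-dms2dec(15, 35, 59)   # longitude: -15.59972 (negative because it is west of Greenwich)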

library(ggplot2)
library(ggmap)

map_theme <- list(theme(legend.position = "top",
                        panel.grid.minor = element_blank(),
                        panel.grid.major = element_blank(),
                        panel.background = element_blank(),
                        plot.background = element_rect(fill = "white"),
                        panel.border = element_blank(),
                        axis.line = element_blank(),
                        axis.text.x = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks = element_blank(),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size = 18)))
map <- get_map(c(lon = -15.59972, lat = 27.965), zoom = 10, maptype = "terrain-background", source = "google")

ggmap(map) + 
  geom_point(data = data_combined, aes(x = lon, y = lat, color = track), alpha = 0.3) +
  scale_color_brewer(palette = "Set1") +
  map_theme

The above plot shows the different location data sources. Our car GPS recorded the most data points along our routes (blue), while Google location data (red) is only recorded occasionally (usually, when I turn on my phone, so it primarily shows places we walked around in). The hiking tracks in green show the 7 hikes we’ve followed.


map <- get_map(c(lon = -15.59972, lat = 27.965), zoom = 10)

ggmap(map) + 
  geom_point(data = na.omit(data_combined), aes(x = lon, y = lat, color = ele)) +
  scale_color_gradient2(low = "lightblue", mid = "blue", high = "red", name = "Elevation") +
  map_theme

This plot shows the elevation of the location points we visited.


3D plots

Because the 2D plot doesn’t show the elevation very clearly, I wanted to plot this in 3D.

library(scatterplot3d)

# http://gis.stackexchange.com/questions/142156/r-how-to-get-latitudes-and-longitudes-from-a-rasterlayer
r.pts <- rasterToPoints(srtm_c, spatial = TRUE)
geo.prj <- proj4string(r.pts)
r.pts <- spTransform(r.pts, CRS(geo.prj)) 
bg_matrix <- data.frame(lon = coordinates(r.pts)[,1],
                         lat = coordinates(r.pts)[,2])

ex_bg <- extract(srtm_c, bg_matrix, cellnumbers = TRUE, df = TRUE)
bg_matrix$ele <- ex_bg$srtm_33_07

ex_points <- extract(srtm_c, cbind(data_combined$lon, data_combined$lat), cellnumbers = TRUE, df = TRUE)

s3d <- scatterplot3d(x = bg_matrix[, 1], 
                     y = bg_matrix[, 2], 
                     z = bg_matrix[, 3], 
                     type = "p", 
                     pch = 16, 
                     highlight.3d = TRUE, 
                     cex.symbols = 0.5, 
                     box = FALSE,
                     grid = FALSE,
                     xlab = "Longitude",
                     ylab = "Latitude",
                     zlab = "Elevation")
s3d$points3d(x = data_combined$lon, y = data_combined$lat, z = ex_points$srtm_33_07, col = "blue", pch = 16, cex = 0.1)

The first 3D plot shows a scatterplot of our location points on the elevation profile of Gran Canaria. It was created with scatterplot3d.


But I also wanted to be able to see the 3D plot from different angles, so I used rgl to make interactive 3D plots.

library(rgdal)
library(rasterVis)
library(rgl)
library(htmlwidgets)
options(rgl.printRglwidget = TRUE)

open3d()
plot3D(srtm_c,  maxpixels = 7e4)


Click to open the interactive 3D image and use the mouse to zoom and turn the plot:

3dplot1

This interactive 3D plot shows the elevation profile of the whole island and you can clearly see the various ravines that have been shaped over the years.

Now, I would have liked to plot our location points directly onto this plot, but I couldn’t figure out how to do this (if someone knows, please let me know). Instead, I plotted only our location points in 3D.


options(rgl.printRglwidget = TRUE)
open3d()
plot3d(data_combined$lon, data_combined$lat, ex_points$srtm_33_07, 
       xlab = "", ylab = "", zlab = "",
       col = "blue", size = 5, alpha = 0.1,
       lit = TRUE,
       box = FALSE, axes = FALSE)


Click to open the interactive 3D image and use the mouse to zoom and turn the plot:

3dplot2

Here as well, I would have liked to have the elevation profile of the island behind it, but the code that worked well within RStudio (see below) did not render the background points in my HTML output. I assume that it has to do with having too many points, so if somebody knows a way to reduce the resolution, please let me know!

plot3d(bg_matrix$lon, bg_matrix$lat, bg_matrix$ele, 
       xlab = "", ylab = "", zlab = "",
       col = "grey", size = 5, alpha = 0.1,
       box = FALSE, axes = FALSE, useNULL = TRUE)
plot3d(data_combined$lon, data_combined$lat, ex_points$srtm_33_07, 
       col = "blue", add = TRUE, size = 10, alpha = 0.5, useNULL = TRUE) 


sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scatterplot3d_0.3-38 htmlwidgets_0.8      rgl_0.98.1          
##  [4] rasterVis_0.41       latticeExtra_0.6-28  RColorBrewer_1.1-2  
##  [7] lattice_0.20-34      rgdal_1.2-5          raster_2.5-8        
## [10] maptools_0.9-2       sp_1.2-4            
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9       knitr_1.15.1      magrittr_1.5     
##  [4] xtable_1.8-2      viridisLite_0.2.0 R6_2.2.0         
##  [7] stringr_1.2.0     tools_3.3.3       parallel_3.3.3   
## [10] grid_3.3.3        htmltools_0.3.5   yaml_2.1.14      
## [13] rprojroot_1.2     digest_0.6.12     shiny_1.0.0      
## [16] mime_0.5          evaluate_0.10     rmarkdown_1.3    
## [19] stringi_1.1.2     backports_1.0.5   jsonlite_1.2     
## [22] httpuv_1.3.3      foreign_0.8-67    hexbin_1.27.1    
## [25] zoo_1.7-14

Dealing with unbalanced data in machine learning

2017-04-02 00:00:00 +0000

In my last post, where I shared the code that I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advised to consider over- or under-sampling when you have unbalanced data sets. Because my focus in this webinar was on evaluating model performance, I did not want to add an additional layer of complexity and therefore did not further discuss how to specifically deal with unbalanced data.

But because I had gotten a few questions regarding this, I thought it would be worthwhile to explain over- and under-sampling techniques in more detail and show how you can very easily implement them with caret.

library(caret)


Unbalanced data

In this context, unbalanced data refers to classification problems where we have unequal instances for different classes. Having unbalanced data is actually very common in general, but it is especially prevalent when working with disease data, where we usually have more healthy control samples than disease cases. Even more extreme unbalance is seen with fraud detection, where e.g. most credit card transactions are legitimate and only very few are fraudulent. In the example I used for my webinar, a breast cancer dataset, we had about twice as many benign as malignant samples.

summary(bc_data$classes)
##    benign malignant 
##       458       241


Why is unbalanced data a problem in machine learning?

Most machine learning classification algorithms are sensitive to unbalance in the predictor classes. Let’s consider an even more extreme example than our breast cancer dataset: assume we had 10 malignant vs 90 benign samples. A machine learning model that has been trained and tested on such a dataset could now predict “benign” for all samples and still achieve very high accuracy. An unbalanced dataset will bias the prediction model towards the more common class!
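A minimal sketch of this effect, using the hypothetical 10/90 split from above (the vectors are made up for illustration):

# toy example: 90 benign vs 10 malignant samples
actual    <- factor(c(rep("benign", 90), rep("malignant", 10)))

# a "model" that always predicts the majority class
predicted <- factor(rep("benign", 100), levels = levels(actual))

# accuracy looks great (0.9) even though not a single malignant case was detected
mean(predicted == actual)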


How to balance data for modeling

The basic theoretical concepts behind over- and under-sampling are very simple:

  • With under-sampling, we randomly select a subset of samples from the class with more instances to match the number of samples in the class with fewer instances. In our example, we would randomly pick 241 out of the 458 benign cases. The main disadvantage of under-sampling is that we lose potentially relevant information from the left-out samples.

  • With oversampling, we randomly duplicate samples from the class with fewer instances, or we generate additional instances based on the data that we have, so as to match the number of samples in each class. While we avoid losing information with this approach, we also run the risk of overfitting our model, as we are more likely to get the same samples in the training and in the test data, i.e. the test data is no longer independent of the training data. This would lead to an overestimation of our model’s performance and generalizability.

In reality though, we should not simply perform over- or under-sampling on our training data and then run the model. We need to account for cross-validation and perform over- or under-sampling on each fold independently to get an honest estimate of model performance!
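Just to illustrate what the resampling operation itself does (caret takes care of applying it within each cross-validation fold below), caret also ships the standalone helpers downSample() and upSample(); a minimal sketch on the breast cancer dataset from the webinar example:

# down-sampling: benign cases are dropped at random until both classes have 241 samples
down_df <- downSample(x = bc_data[, -which(names(bc_data) == "classes")],
                      y = bc_data$classes, yname = "classes")
table(down_df$classes)

# up-sampling: malignant cases are duplicated at random until both classes have 458 samples
up_df <- upSample(x = bc_data[, -which(names(bc_data) == "classes")],
                  y = bc_data$classes, yname = "classes")
table(up_df$classes)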


Modeling the original unbalanced data

Here is the same model I used in my webinar example: I randomly divide the data into training and test sets (stratified by class) and perform Random Forest modeling with 10 x 10 repeated cross-validation. Final model performance is then measured on the test set.

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]
set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  verboseIter = FALSE))
final <- data.frame(actual = test_data$classes,
                    predict(model_rf, newdata = test_data, type = "prob"))
final$predict <- ifelse(final$benign > 0.5, "benign", "malignant")
cm_original <- confusionMatrix(final$predict, test_data$classes)


Under-sampling

Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down for under- (also called down-) sampling. The rest stays the same as with our original model.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "down")

set.seed(42)
model_rf_under <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = ctrl)
final_under <- data.frame(actual = test_data$classes,
                    predict(model_rf_under, newdata = test_data, type = "prob"))
final_under$predict <- ifelse(final_under$benign > 0.5, "benign", "malignant")
cm_under <- confusionMatrix(final_under$predict, test_data$classes)


Oversampling

For over- (also called up-) sampling we simply specify sampling = "up".

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "up")

set.seed(42)
model_rf_over <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = ctrl)
final_over <- data.frame(actual = test_data$classes,
                          predict(model_rf_over, newdata = test_data, type = "prob"))
final_over$predict <- ifelse(final_over$benign > 0.5, "benign", "malignant")
cm_over <- confusionMatrix(final_over$predict, test_data$classes)


ROSE

Besides over- and under-sampling, there are hybrid methods that combine under-sampling with the generation of additional data. Two of the most popular are ROSE and SMOTE.

From Nicola Lunardon, Giovanna Menardi and Nicola Torelli’s “ROSE: A Package for Binary Imbalanced Learning” (R Journal, 2014, Vol. 6 Issue 1, p. 79): “The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. Artificial balanced samples are generated according to a smoothed bootstrap approach and allow for aiding both the phases of estimation and accuracy evaluation of a binary classifier in the presence of a rare class. Functions that implement more traditional remedies for the class imbalance and different metrics to evaluate accuracy are also provided. These are estimated by holdout, bootstrap, or cross-validation methods.”
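For reference, the ROSE() function can also be called directly, outside of caret; a minimal sketch on the training data defined above (with its defaults, it generates a synthetic dataset that is roughly balanced):

library(ROSE)

# generate an artificially balanced training set via the smoothed bootstrap approach
rose_train <- ROSE(classes ~ ., data = train_data, seed = 42)$data
table(rose_train$classes)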

Within caret, you implement these methods the same way as before, this time choosing sampling = "rose".

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "rose")

set.seed(42)
model_rf_rose <- caret::train(classes ~ .,
                              data = train_data,
                              method = "rf",
                              preProcess = c("scale", "center"),
                              trControl = ctrl)
final_rose <- data.frame(actual = test_data$classes,
                         predict(model_rf_rose, newdata = test_data, type = "prob"))
final_rose$predict <- ifelse(final_rose$benign > 0.5, "benign", "malignant")
cm_rose <- confusionMatrix(final_rose$predict, test_data$classes)


SMOTE

… or by choosing sampling = "smote" in the trainControl settings.

From Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer’s “SMOTE: Synthetic Minority Over-sampling Technique” (Journal of Artificial Intelligence Research, 2002, Vol. 16, pp. 321–357): “This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples.”
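caret's sampling = "smote" option relies on the SMOTE() implementation from the DMwR package, which can also be used standalone; a minimal sketch (the perc.over and perc.under values are arbitrary illustration settings that give a roughly balanced result):

library(DMwR)

# synthetic minority over-sampling combined with random majority under-sampling
smote_train <- SMOTE(classes ~ ., data = train_data,
                     perc.over = 100, perc.under = 200)
table(smote_train$classes)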

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "smote")

set.seed(42)
model_rf_smote <- caret::train(classes ~ .,
                              data = train_data,
                              method = "rf",
                              preProcess = c("scale", "center"),
                              trControl = ctrl)
final_smote <- data.frame(actual = test_data$classes,
                         predict(model_rf_smote, newdata = test_data, type = "prob"))
final_smote$predict <- ifelse(final_smote$benign > 0.5, "benign", "malignant")
cm_smote <- confusionMatrix(final_smote$predict, test_data$classes)


Predictions

Now let’s compare the predictions of all these models:

models <- list(original = model_rf,
                       under = model_rf_under,
                       over = model_rf_over,
                       smote = model_rf_smote,
                       rose = model_rf_rose)

resampling <- resamples(models)
bwplot(resampling)

library(dplyr)
comparison <- data.frame(model = names(models),
                         Sensitivity = rep(NA, length(models)),
                         Specificity = rep(NA, length(models)),
                         Precision = rep(NA, length(models)),
                         Recall = rep(NA, length(models)),
                         F1 = rep(NA, length(models)))

for (name in names(models)) {
  model <- get(paste0("cm_", name))
  
  comparison[comparison$model == name, ] <- filter(comparison, model == name) %>%
    mutate(Sensitivity = model$byClass["Sensitivity"],
           Specificity = model$byClass["Specificity"],
           Precision = model$byClass["Precision"],
           Recall = model$byClass["Recall"],
           F1 = model$byClass["F1"])
}

library(tidyr)
comparison %>%
  gather(x, y, Sensitivity:F1) %>%
  ggplot(aes(x = x, y = y, color = model)) +
    geom_jitter(width = 0.2, alpha = 0.5, size = 3)

With this small dataset, we can already see how the different techniques can influence model performance. Sensitivity (or recall) describes the proportion of benign cases that have been predicted correctly, while specificity describes the proportion of malignant cases that have been predicted correctly. Precision describes the proportion of benign predictions that actually came from benign samples. F1 is the harmonic mean of precision and sensitivity/recall.
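As a reminder of how these metrics are computed from a 2x2 confusion matrix (with benign as the positive class; the counts below are made up for illustration):

TP <- 130   # benign cases predicted as benign
FN <- 7     # benign cases predicted as malignant
FP <- 4     # malignant cases predicted as benign
TN <- 68    # malignant cases predicted as malignant

sensitivity <- TP / (TP + FN)   # = recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
F1          <- 2 * precision * sensitivity / (precision + sensitivity)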

Here, all four methods improved specificity and precision compared to the original model. Under-sampling, over-sampling and ROSE additionally improved precision and the F1 score.


This post shows a simple example of how to correct for unbalance in datasets for machine learning. For more advanced instructions and potential caveats with these techniques, check out the excellent caret documentation.


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.



sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.6.1         dplyr_0.5.0         randomForest_4.6-12
## [4] caret_6.0-73        ggplot2_2.2.1       lattice_0.20-34    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        nloptr_1.0.4       plyr_1.8.4        
##  [4] class_7.3-14       iterators_1.0.8    tools_3.3.3       
##  [7] digest_0.6.12      lme4_1.1-12        evaluate_0.10     
## [10] tibble_1.2         gtable_0.2.0       nlme_3.1-131      
## [13] mgcv_1.8-17        Matrix_1.2-8       foreach_1.4.3     
## [16] DBI_0.5-1          yaml_2.1.14        parallel_3.3.3    
## [19] SparseM_1.74       e1071_1.6-8        stringr_1.2.0     
## [22] knitr_1.15.1       MatrixModels_0.4-1 stats4_3.3.3      
## [25] rprojroot_1.2      grid_3.3.3         nnet_7.3-12       
## [28] R6_2.2.0           rmarkdown_1.3      minqa_1.2.4       
## [31] reshape2_1.4.2     car_2.1-4          magrittr_1.5      
## [34] backports_1.0.5    scales_0.4.1       codetools_0.2-15  
## [37] ModelMetrics_1.1.0 htmltools_0.3.5    MASS_7.3-45       
## [40] splines_3.3.3      assertthat_0.1     pbkrtest_0.4-6    
## [43] colorspace_1.3-2   labeling_0.3       quantreg_5.29     
## [46] stringi_1.1.2      lazyeval_0.2.0     munsell_0.4.3

Building meaningful machine learning models for disease prediction

2017-03-31 00:00:00 +0000

Webinar for the ISDS R Group

This document presents the code I used to produce the example analysis and figures shown in my webinar on building meaningful machine learning models for disease prediction.

My webinar slides are available on Github


Description: Dr Shirin Glander will go over her work on building machine-learning models to predict the course of different diseases. She will go over building a model, evaluating its performance, and answering or addressing different disease-related questions using machine learning. Her talk will cover the theory of machine learning as it is applied using R.


Setup

All analyses are done in R using RStudio. For detailed session information including R version, operating system and package versions, see the sessionInfo() output at the end of this document.

All figures are produced with ggplot2.


The dataset

The dataset I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nucleoli
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt", 
                      header = FALSE, 
                      sep = ",")
colnames(bc_data) <- c("sample_code_number", 
                       "clump_thickness", 
                       "uniformity_of_cell_size", 
                       "uniformity_of_cell_shape", 
                       "marginal_adhesion", 
                       "single_epithelial_cell_size", 
                       "bare_nuclei", 
                       "bland_chromatin", 
                       "normal_nucleoli", 
                       "mitosis", 
                       "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))


Missing data

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))
## [1] 16
# how many samples would we lose if we removed them?
nrow(bc_data)
## [1] 699
nrow(bc_data[is.na(bc_data), ])
## [1] 16


Missing values are imputed with the mice package.

# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)


Data exploration

  • Response variable for classification
library(ggplot2)

ggplot(bc_data, aes(x = classes, fill = classes)) +
  geom_bar()

We can see that our data is unbalanced. For simplicity’s sake, I am not going to go into how to deal with this here. But I added a short post about dealing with unbalanced datasets using caret for you to check out.


  • Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) +
  geom_histogram(bins = 10)


  • Principal Component Analysis
library(pcaGoPromoter)
library(ellipse)

# perform pca and extract scores
pcaOutput <- pca(t(bc_data[, -1]), printDropped = FALSE, scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$scores)
  
# define groups for plotting
pcaOutput2$groups <- bc_data$classes
  
centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)

conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
  data.frame(groups = as.character(t),
             ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                   centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                   level = 0.95),
             stringsAsFactors = FALSE)))
    
ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
    geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
    geom_point(size = 2, alpha = 0.6) + 
    scale_color_brewer(palette = "Set1") +
    labs(color = "",
         fill = "",
         x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
         y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance")) 

  • Features
library(tidyr)

gather(bc_data, x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = classes, fill = classes)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)


Machine Learning packages for R

caret

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)


Training, validation and test data

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]
library(dplyr)

rbind(data.frame(group = "train", train_data),
      data.frame(group = "test", test_data)) %>%
  gather(x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = group, fill = group)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)


Regression

set.seed(42)
model_glm <- caret::train(clump_thickness ~ .,
                          data = train_data,
                          method = "glm",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))
model_glm
## Generalized Linear Model 
## 
## 490 samples
##   9 predictor
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 441, 441, 440, 442, 441, 440, ... 
## Resampling results:
## 
##   RMSE      Rsquared 
##   1.974296  0.5016141
## 
## 
predictions <- predict(model_glm, test_data)
# model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values
data.frame(residuals = resid(model_glm),
           predictors = model_glm$finalModel$linear.predictors) %>%
  ggplot(aes(x = predictors, y = residuals)) +
    geom_jitter() +
    geom_smooth(method = "lm")

# y == train_data$clump_thickness
data.frame(residuals = resid(model_glm),
           y = model_glm$finalModel$y) %>%
  ggplot(aes(x = y, y = residuals)) +
    geom_jitter() +
    geom_smooth(method = "lm")

data.frame(actual = test_data$clump_thickness,
           predicted = predictions) %>%
  ggplot(aes(x = actual, y = predicted)) +
    geom_jitter() +
    geom_smooth(method = "lm")


Classification

Decision trees

rpart

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(classes ~ .,
            data = train_data,
            method = "class",
            control = rpart.control(xval = 10, 
                                    minbucket = 2, 
                                    cp = 0), 
             parms = list(split = "information"))

rpart.plot(fit, extra = 100)


Random Forests

Random Forest predictions are based on the generation of multiple classification trees. They can be used for both classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))

When you specify savePredictions = TRUE, you can access the cross-validation results with model_rf$pred.
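For example, the held-out predictions from the individual folds and repeats can be inspected like this:

# held-out predictions for each cross-validation fold/repeat
head(model_rf$pred)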

model_rf$finalModel$confusion
##           benign malignant class.error
## benign       313         8  0.02492212
## malignant      4       165  0.02366864


  • Feature Importance
imp <- model_rf$finalModel$importance
imp[order(imp, decreasing = TRUE), ]
##     uniformity_of_cell_size    uniformity_of_cell_shape 
##                   54.416003                   41.553022 
##             bland_chromatin                 bare_nuclei 
##                   29.343027                   28.483842 
##             normal_nucleoli single_epithelial_cell_size 
##                   19.239635                   18.480155 
##             clump_thickness           marginal_adhesion 
##                   13.276702                   12.143355 
##                     mitosis 
##                    3.081635
# estimate variable importance
importance <- varImp(model_rf, scale = TRUE)
plot(importance)


  • predicting test data
confusionMatrix(predict(model_rf, test_data), test_data$classes)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       133         2
##   malignant      4        70
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9386, 0.9894)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9369          
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9708          
##             Specificity : 0.9722          
##          Pos Pred Value : 0.9852          
##          Neg Pred Value : 0.9459          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6364          
##    Detection Prevalence : 0.6459          
##       Balanced Accuracy : 0.9715          
##                                           
##        'Positive' Class : benign          
## 
results <- data.frame(actual = test_data$classes,
                      predict(model_rf, test_data, type = "prob"))

results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))

results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)


Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” Tianqi Chen, developer of xgboost

XGBoost is a tree ensemble model, which means that its predictions are the sum of the predictions of a set of classification and regression trees (CART). In that respect, XGBoost is similar to Random Forests, but it uses a different approach to model training. It can be used for both classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_xgb <- caret::train(classes ~ .,
                          data = train_data,
                          method = "xgbTree",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))


  • Feature Importance
importance <- varImp(model_xgb, scale = TRUE)
plot(importance)


  • predicting test data
confusionMatrix(predict(model_xgb, test_data), test_data$classes)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       132         2
##   malignant      5        70
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9322, 0.9864)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9266          
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9635          
##             Specificity : 0.9722          
##          Pos Pred Value : 0.9851          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6316          
##    Detection Prevalence : 0.6411          
##       Balanced Accuracy : 0.9679          
##                                           
##        'Positive' Class : benign          
## 
results <- data.frame(actual = test_data$classes,
                      predict(model_xgb, test_data, type = "prob"))

results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))

results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)


Feature Selection

Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the whole modeling process on the training data alone!

  • Correlation

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.

library(corrplot)

# calculate correlation matrix
corMatMy <- cor(train_data[, -1])
corrplot(corMatMy, order = "hclust")

#Apply correlation filter at 0.70,
highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 2  and column  3 with corr  0.899 
##   Means:  0.696 vs 0.575 so flagging column 2 
## Compare row 3  and column  7 with corr  0.736 
##   Means:  0.654 vs 0.55 so flagging column 3 
## All correlations <= 0.7
# which variables are flagged for removal?
highlyCor
## [1] "uniformity_of_cell_size"  "uniformity_of_cell_shape"
#then we remove these variables
train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]


  • Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rates each combination with an accuracy score. The combination with the highest score is usually chosen.

set.seed(7)
results_rfe <- rfe(x = train_data[, -1], 
                   y = train_data$classes, 
                   sizes = c(1:9), 
                   rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10))
# chosen features
predictors(results_rfe)
## [1] "bare_nuclei"                 "uniformity_of_cell_size"    
## [3] "clump_thickness"             "uniformity_of_cell_shape"   
## [5] "bland_chromatin"             "marginal_adhesion"          
## [7] "normal_nucleoli"             "single_epithelial_cell_size"
## [9] "mitosis"
train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]


  • Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

set.seed(27)
model_ga <- gafs(x = train_data[, -1], 
                 y = train_data$classes,
                 iters = 10, # generations of algorithm
                 popSize = 10, # population size for each generation
                 levels = c("malignant", "benign"),
                 gafsControl = gafsControl(functions = rfGA, # Assess fitness with RF
                                           method = "cv",    # 10 fold cross validation
                                           genParallel = TRUE, # Use parallel programming
                                           allowParallel = TRUE))
plot(model_ga) # Plot mean fitness (AUC) by generation

train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]


Grid search with caret

  • Automatic Grid
set.seed(42)
model_rf_tune_auto <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  search = "random"),
                         tuneLength = 15)
model_rf_tune_auto
## Random Forest 
## 
## 490 samples
##   9 predictor
##   2 classes: 'benign', 'malignant' 
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.9692153  0.9323624
##   2     0.9704277  0.9350498
##   5     0.9645085  0.9216721
##   6     0.9639087  0.9201998
##   7     0.9632842  0.9186919
##   8     0.9626719  0.9172257
##   9     0.9636801  0.9195036
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
plot(model_rf_tune_auto)


  • Manual Grid

  • mtry: Number of variables randomly sampled as candidates at each split.

set.seed(42)
grid <- expand.grid(mtry = c(1:10))

model_rf_tune_man <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  search = "random"),
                         tuneGrid = grid)
model_rf_tune_man
## Random Forest 
## 
## 490 samples
##   9 predictor
##   2 classes: 'benign', 'malignant' 
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.9696153  0.9332392
##    2    0.9706440  0.9354737
##    3    0.9696194  0.9330647
##    4    0.9661495  0.9253163
##    5    0.9649252  0.9225586
##    6    0.9653209  0.9233806
##    7    0.9634881  0.9192265
##    8    0.9624718  0.9169227
##    9    0.9641005  0.9203072
##   10    0.9628760  0.9176675
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
plot(model_rf_tune_man)


Grid search with h2o

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O provides distributed implementations of a wide range of common machine learning algorithms for classification, regression and deep learning.

library(h2o)
h2o.init(nthreads = -1)
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.out
##     C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 815 milliseconds 
##     H2O cluster version:        3.10.3.6 
##     H2O cluster version age:    1 month and 10 days  
##     H2O cluster name:           H2O_started_from_R_s_glan02_tvy462 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.54 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.3 (2017-03-06)
bc_data_hf <- as.h2o(bc_data)
h2o.describe(bc_data_hf) %>%
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", 
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% 
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
    geom_point(size = 4, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    facet_grid(group ~ ., scales = "free") +
    labs(x = "Feature",
         y = "Value",
         color = "")

library(reshape2) # for melting

bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1])

cor <- h2o.cor(bc_data_hf)
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) + 
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", 
         y = "")


Training, validation and test data

splits <- h2o.splitFrame(bc_data_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "classes"
features <- setdiff(colnames(train), response)
summary(train$classes, exact_quantiles = TRUE)
##  classes       
##  benign   :317 
##  malignant:174
summary(valid$classes, exact_quantiles = TRUE)
##  classes      
##  benign   :71 
##  malignant:35
summary(test$classes, exact_quantiles = TRUE)
##  classes      
##  benign   :70 
##  malignant:32
pca <- h2o.prcomp(training_frame = train,
           x = features,
           validation_frame = valid,
           transform = "NORMALIZE",
           impute_missing = TRUE,
           k = 3,
           seed = 42)
eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features

library(ggrepel)
ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel()


Classification

Random Forest
hyper_params <- list(
                     ntrees = c(25, 50, 75, 100),
                     max_depth = c(10, 20, 30),
                     min_rows = c(1, 3, 5)
                     )

search_criteria <- list(
                        strategy = "RandomDiscrete", 
                        max_models = 50,
                        max_runtime_secs = 360,
                        stopping_rounds = 5,          
                        stopping_metric = "AUC",      
                        stopping_tolerance = 0.0005,
                        seed = 42
                        )
rf_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest, 
                                                # alternatively h2o.gbm 
                                                # for Gradient boosting trees
                    x = features,
                    y = response,
                    grid_id = "rf_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,                           
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
                    )
# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss")

for (sort_by_1 in sort_options_1) {
  
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  h2o.saveModel(best_model, path="models", force = TRUE)
  
}


# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  h2o.saveModel(best_model, path = "models", force = TRUE)
  
}
files <- list.files(path = "models")
rf_models <- files[grep("rf_grid_model", files)]

for (model_id in rf_models) {
  
  path <- paste0("U:\\Github_blog\\Webinar\\Webinar_ML_for_disease\\models\\", model_id)
  best_model <- h2o.loadModel(path)
  mse_auc_test <- data.frame(model_id = model_id, 
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))
  
  if (model_id == rf_models[[1]]) {
    
    mse_auc_test_comb <- mse_auc_test
    
  } else {
    
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
    
  }
}

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = model_id, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")

for (model_id in rf_models) {
  
  best_model <- h2o.getModel(model_id)
  
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, 
                                                   nrow(test)),
                                    actual = as.vector(test$classes), 
                                    as.data.frame(h2o.predict(object = best_model, 
                                                              newdata = test)))
  
  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == 
                                           finalRf_predictions$predict, 
                                         "yes", "no")
  
  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, 
                                                  "benign", 
                                                  ifelse(finalRf_predictions$malignant 
                                                         > 0.8, "malignant", "uncertain"))
  
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == 
                                                     finalRf_predictions$predict_stringent, "yes", 
                                         ifelse(finalRf_predictions$predict_stringent == 
                                                  "uncertain", "na", "no"))
  
  if (model_id == rf_models[[1]]) {
    
    finalRf_predictions_comb <- finalRf_predictions
    
  } else {
    
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
    
  }
}
finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap(~ model_id, ncol = 3) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap(~ model_id, ncol = 3) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

rf_model <- h2o.loadModel("models/rf_grid_model_6")
h2o.varimp_plot(rf_model)

#h2o.varimp(rf_model)
h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE)
##       train       valid        xval 
## 0.024674571 0.007042254 0.023097284
h2o.confusionMatrix(rf_model, valid = TRUE)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.293125896751881:
##           benign malignant    Error    Rate
## benign        70         1 0.014085   =1/71
## malignant      0        35 0.000000   =0/35
## Totals        70        36 0.009434  =1/106
plot(rf_model,
     timestep = "number_of_trees",
     metric = "classification_error")

plot(rf_model,
     timestep = "number_of_trees",
     metric = "logloss")

plot(rf_model,
     timestep = "number_of_trees",
     metric = "AUC")

plot(rf_model,
     timestep = "number_of_trees",
     metric = "rmse")

h2o.auc(rf_model, train = TRUE)
## [1] 0.989521
h2o.auc(rf_model, valid = TRUE)
## [1] 0.9995976
h2o.auc(rf_model, xval = TRUE)
## [1] 0.9890496
perf <- h2o.performance(rf_model, test)
perf
## H2OBinomialMetrics: drf
## 
## MSE:  0.03673598
## RMSE:  0.1916663
## LogLoss:  0.1158835
## Mean Per-Class Error:  0.0625
## AUC:  0.990625
## Gini:  0.98125
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           benign malignant    Error    Rate
## benign        70         0 0.000000   =0/70
## malignant      4        28 0.125000   =4/32
## Totals        74        28 0.039216  =4/102
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.735027 0.933333  25
## 2                       max f2  0.294222 0.952381  37
## 3                 max f0point5  0.735027 0.972222  25
## 4                 max accuracy  0.735027 0.960784  25
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.294222 1.000000  37
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.735027 0.909782  25
## 9   max min_per_class_accuracy  0.424524 0.937500  31
## 10 max mean_per_class_accuracy  0.294222 0.942857  37
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
plot(perf)

h2o.logloss(perf)
## [1] 0.1158835
h2o.mse(perf)
## [1] 0.03673598
h2o.auc(perf)
## [1] 0.990625
head(h2o.metric(perf))
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  1.000000 0.171429 0.114504 0.340909 0.715686  1.000000 0.093750
## 2  0.998333 0.222222 0.151515 0.416667 0.725490  1.000000 0.125000
## 3  0.998000 0.270270 0.187970 0.480769 0.735294  1.000000 0.156250
## 4  0.997222 0.315789 0.223881 0.535714 0.745098  1.000000 0.187500
## 5  0.996210 0.358974 0.259259 0.583333 0.754902  1.000000 0.218750
## 6  0.994048 0.400000 0.294118 0.625000 0.764706  1.000000 0.250000
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.257464               0.093750                0.546875
## 2    1.000000     0.298807               0.125000                0.562500
## 3    1.000000     0.335794               0.156250                0.578125
## 4    1.000000     0.369755               0.187500                0.593750
## 5    1.000000     0.401478               0.218750                0.609375
## 6    1.000000     0.431474               0.250000                0.625000
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1  70  29   0   3 1.000000 0.906250 0.000000 0.093750   0
## 2  70  28   0   4 1.000000 0.875000 0.000000 0.125000   1
## 3  70  27   0   5 1.000000 0.843750 0.000000 0.156250   2
## 4  70  26   0   6 1.000000 0.812500 0.000000 0.187500   3
## 5  70  25   0   7 1.000000 0.781250 0.000000 0.218750   4
## 6  70  24   0   8 1.000000 0.750000 0.000000 0.250000   5
finalRf_predictions <- data.frame(actual = as.vector(test$classes), 
                                  as.data.frame(h2o.predict(object = rf_model, 
                                                            newdata = test)))
finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == 
                                         finalRf_predictions$predict, "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", 
                                                ifelse(finalRf_predictions$malignant 
                                                       > 0.8, "malignant", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == 
                                                   finalRf_predictions$predict_stringent, "yes", 
                                       ifelse(finalRf_predictions$predict_stringent == 
                                                "uncertain", "na", "no"))

finalRf_predictions %>%
  group_by(actual, predict) %>%
  dplyr::summarise(n = n())
## Source: local data frame [3 x 3]
## Groups: actual [?]
## 
##      actual   predict     n
##      <fctr>    <fctr> <int>
## 1    benign    benign    62
## 2    benign malignant     8
## 3 malignant malignant    32
finalRf_predictions %>%
  group_by(actual, predict_stringent) %>%
  dplyr::summarise(n = n())
## Source: local data frame [4 x 3]
## Groups: actual [?]
## 
##      actual predict_stringent     n
##      <fctr>             <chr> <int>
## 1    benign            benign    61
## 2    benign         uncertain     9
## 3 malignant         malignant    26
## 4 malignant         uncertain     6
finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

# keep the actual class and the benign/malignant prediction probabilities
df <- finalRf_predictions[, c(1, 3, 4)]

thresholds <- seq(from = 0, to = 1, by = 0.1)

prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA)

# for each threshold, calculate the proportion of benign and malignant cases
# that were classified correctly
for (threshold in thresholds) {
  pred <- ifelse(df$benign > threshold, "benign", "malignant")
  pred_t <- ifelse(pred == df$actual, TRUE, FALSE)
  
  group <- data.frame(df, "pred" = pred_t) %>%
    group_by(actual, pred) %>%
    dplyr::summarise(n = n())
  
  group_b <- filter(group, actual == "benign")
  
  prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n)
  prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b
  
  group_m <- filter(group, actual == "malignant")
  
  prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n)
  prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m
}

prop_table %>%
  gather(x, y, prop_true_b:prop_true_m) %>%
  ggplot(aes(x = threshold, y = y, color = x)) +
    geom_point() +
    geom_line() +
    scale_color_brewer(palette = "Set1") +
    labs(y = "proportion of true predictions",
         color = "b: benign cases\nm: malignant cases")

h2o.shutdown()

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.



sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggrepel_0.6.5        reshape2_1.4.2       h2o_3.10.3.6        
##  [4] corrplot_0.77        plyr_1.8.4           xgboost_0.6-4       
##  [7] randomForest_4.6-12  dplyr_0.5.0          caret_6.0-73        
## [10] lattice_0.20-35      doParallel_1.0.10    iterators_1.0.8     
## [13] foreach_1.4.3        tidyr_0.6.1          pcaGoPromoter_1.18.0
## [16] Biostrings_2.42.1    XVector_0.14.0       IRanges_2.8.1       
## [19] S4Vectors_0.12.1     BiocGenerics_0.20.0  ellipse_0.3-8       
## [22] ggplot2_2.2.1.9000  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10         class_7.3-14         assertthat_0.1      
##  [4] rprojroot_1.2        digest_0.6.12        R6_2.2.0            
##  [7] backports_1.0.5      MatrixModels_0.4-1   RSQLite_1.1-2       
## [10] evaluate_0.10        e1071_1.6-8          zlibbioc_1.20.0     
## [13] lazyeval_0.2.0       minqa_1.2.4          data.table_1.10.4   
## [16] SparseM_1.76         car_2.1-4            nloptr_1.0.4        
## [19] Matrix_1.2-8         rmarkdown_1.4        labeling_0.3        
## [22] splines_3.3.3        lme4_1.1-12          stringr_1.2.0       
## [25] RCurl_1.95-4.8       munsell_0.4.3        mgcv_1.8-17         
## [28] htmltools_0.3.5      nnet_7.3-12          tibble_1.2          
## [31] codetools_0.2-15     MASS_7.3-45          bitops_1.0-6        
## [34] ModelMetrics_1.1.0   grid_3.3.3           nlme_3.1-131        
## [37] jsonlite_1.3         gtable_0.2.0         DBI_0.6             
## [40] magrittr_1.5         scales_0.4.1         stringi_1.1.3       
## [43] RColorBrewer_1.1-2   tools_3.3.3          Biobase_2.34.0      
## [46] pbkrtest_0.4-7       yaml_2.1.14          AnnotationDbi_1.36.2
## [49] colorspace_1.3-2     memoise_1.0.0        knitr_1.15.1        
## [52] quantreg_5.29

Plotting trees from Random Forest models with ggraph

2017-03-16 00:00:00 +0000

Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models.

I am very much a visual person, so I try to plot as much of my results as possible because it helps me get a better feel for what is going on with my data.

A nice aspect of using tree-based machine learning, like Random Forest models, is that they are more easily interpreted than e.g. neural networks because they are based on decision trees. So, when I am using such models, I like to plot the final decision trees (if they aren’t too large) to get a sense of which decisions underlie my predictions.

There are a few very convenient ways to plot the outcome if you are using the randomForest package, but I like to have as much control as possible over the layout, colors, labels, etc. And because I didn’t find a solution I liked for caret models, I developed the following little function (below you may find information about how I built the model):

As input, it takes part of the output of model_rf <- caret::train(... "rf" ...), which contains the trees of the final model: model_rf$finalModel$forest. From this forest, you specify which tree to plot by its index.

library(dplyr)
library(ggraph)
library(igraph)

tree_func <- function(final_model, 
                      tree_num) {
  
  # get tree by index
  tree <- randomForest::getTree(final_model, 
                                k = tree_num, 
                                labelVar = TRUE) %>%
    tibble::rownames_to_column() %>%
    # set split points of leaf nodes to NA so the 0s won't get plotted
    mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))
  
  # prepare data frame for graph
  graph_frame <- data.frame(from = rep(tree$rowname, 2),
                            to = c(tree$`left daughter`, tree$`right daughter`))
  
  # convert to graph and delete the dummy node "0" (the "daughter" of all terminal nodes)
  graph <- graph_from_data_frame(graph_frame) %>%
    delete_vertices("0")
  
  # set node labels
  V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
  V(graph)$leaf_label <- as.character(tree$prediction)
  V(graph)$split <- as.character(round(tree$`split point`, digits = 2))
  
  # plot
  plot <- ggraph(graph, 'dendrogram') + 
    theme_bw() +
    geom_edge_link() +
    geom_node_point() +
    geom_node_text(aes(label = node_label), na.rm = TRUE, repel = TRUE) +
    geom_node_label(aes(label = split), vjust = 2.5, na.rm = TRUE, fill = "white") +
    geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE, 
					repel = TRUE, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_blank(),
          plot.background = element_rect(fill = "white"),
          panel.border = element_blank(),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          plot.title = element_text(size = 18))
  
  print(plot)
}


We can now plot, for example, the tree with the smallest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == min(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)


Or we can plot the tree with the largest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == max(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)



Preparing the data and modeling

The data set I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first data set looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nucleoli
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt", header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", 
                       "clump_thickness", 
                       "uniformity_of_cell_size", 
                       "uniformity_of_cell_shape", 
                       "marginal_adhesion", 
                       "single_epithelial_cell_size", 
                       "bare_nuclei", 
                       "bland_chromatin", 
                       "normal_nucleoli", 
                       "mitosis", 
                       "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

# separate into training and test data
library(caret)

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

# run model
set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.



sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] igraph_1.0.1       ggraph_1.0.0       ggplot2_2.2.1.9000
## [4] dplyr_0.5.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9         nloptr_1.0.4        plyr_1.8.4         
##  [4] viridis_0.3.4       iterators_1.0.8     tools_3.3.3        
##  [7] digest_0.6.12       lme4_1.1-12         evaluate_0.10      
## [10] tibble_1.2          gtable_0.2.0        nlme_3.1-131       
## [13] lattice_0.20-34     mgcv_1.8-17         Matrix_1.2-8       
## [16] foreach_1.4.3       DBI_0.6             ggrepel_0.6.5      
## [19] yaml_2.1.14         parallel_3.3.3      SparseM_1.76       
## [22] gridExtra_2.2.1     stringr_1.2.0       knitr_1.15.1       
## [25] MatrixModels_0.4-1  stats4_3.3.3        rprojroot_1.2      
## [28] grid_3.3.3          caret_6.0-73        nnet_7.3-12        
## [31] R6_2.2.0            rmarkdown_1.3       minqa_1.2.4        
## [34] udunits2_0.13       tweenr_0.1.5        deldir_0.1-12      
## [37] reshape2_1.4.2      car_2.1-4           magrittr_1.5       
## [40] units_0.4-2         backports_1.0.5     scales_0.4.1       
## [43] codetools_0.2-15    ModelMetrics_1.1.0  htmltools_0.3.5    
## [46] MASS_7.3-45         splines_3.3.3       randomForest_4.6-12
## [49] assertthat_0.1      pbkrtest_0.4-6      ggforce_0.1.1      
## [52] colorspace_1.3-2    labeling_0.3        quantreg_5.29      
## [55] stringi_1.1.2       lazyeval_0.2.0      munsell_0.4.3

Hyper-parameter Tuning with Grid Search for Deep Learning

2017-03-07 00:00:00 +0000

Last week I showed how to build a deep neural network with h2o and rsparkling. As we could see there, it is not trivial to optimize the hyper-parameters for modeling. Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy.

Keep in mind though, that hyper-parameter tuning can only improve the model so much without overfitting. If you can’t achieve sufficient accuracy, the input features might simply not be adequate for the predictions you are trying to model. It might be necessary to go back to the original features and try e.g. feature engineering methods.


Preparing Spark instance and plotting theme

Check out last week’s post for details on how I prepared the data.

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")
library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}
arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)
arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2])
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3])

splits <- h2o.splitFrame(arrhythmia_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))


We can use the h2o.grid() function to perform a Random Grid Search (RGS). We could also test all possible combinations of parameters with Cartesian Grid or exhaustive search, but RGS is much faster when we have a large number of possible combinations and usually finds sufficiently accurate models.

For RGS, we first define a set of hyper-parameters and search criteria to fine-tune our models. Because there are many hyper-parameters, each with a range of possible values, we want to find an (ideally) optimal combination to maximize our model’s accuracy. We can also specify how long we want the grid search to run. Based on the results of each model tested in the grid, we can choose the one with the highest accuracy or the best performance for the question at hand.
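For comparison, an exhaustive Cartesian search would only differ in the strategy of the search criteria. A minimal sketch (the random search criteria I actually use are defined further below):

# hedged sketch: a Cartesian strategy tests every single combination in the grid
search_criteria_cartesian <- list(strategy = "Cartesian")

With the large hyper-parameter grid defined below, testing every combination would take far too long, which is why I use the RandomDiscrete strategy with a runtime limit instead.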

Activation Functions

  • Rectifier: is the default activation function. It is the fastest and most versatile option, but it can lead to instability and tends to be lower in accuracy.
  • Tanh: The hyperbolic tangent is a scaled and shifted variant of the sigmoid activation function. It can take on values from -1 to 1 and centers around 0. Tanh needs more computational power than e.g. the Rectifier function.
  • Maxout: is an activation function that is the max of the inputs. It is computationally quite demanding but can produce high accuracy models.

  • ...WithDropout: When we choose an activation function with dropout, a random subset of the network is trained and the weights of all sub-networks are averaged. It works together with the parameter hidden_dropout_ratios, which controls the fraction of neurons that is randomly dropped in each hidden layer. Hidden dropout ratios are useful for preventing overfitting on learned features (a sketch that includes this parameter follows the hyper-parameter grid below).

Hidden layers

  • are the most important hyper-parameter to set for deep neural networks, as they specify how many hidden layers and how many nodes per hidden layer the model should learn

L1 and L2 penalties

  • L1: lets only strong weights survive
  • L2: prevents any single weight from getting too big.

Rho and Epsilon

  • rho: adaptive learning rate time decay factor; it controls how strongly prior weight updates are remembered
  • epsilon: adaptive learning rate smoothing factor; it helps to avoid getting stuck in local optima
hyper_params <- list(
                     activation = c("Rectifier", "Maxout", "Tanh", "RectifierWithDropout", "MaxoutWithDropout", "TanhWithDropout"), 
                     hidden = list(c(5, 5, 5, 5, 5), c(10, 10, 10, 10), c(50, 50, 50), c(100, 100, 100)),
                     epochs = c(50, 100, 200),
                     l1 = c(0, 0.00001, 0.0001), 
                     l2 = c(0, 0.00001, 0.0001),
                     rate = c(0, 0.01, 0.005, 0.001),
                     rate_annealing = c(1e-8, 1e-7, 1e-6),
                     rho = c(0.9, 0.95, 0.99, 0.999),
                     epsilon = c(1e-10, 1e-8, 1e-6, 1e-4),
                     momentum_start = c(0, 0.5),
                     momentum_stable = c(0.99, 0.5, 0),
                     input_dropout_ratio = c(0, 0.1, 0.2),
                     max_w2 = c(10, 100, 1000, 3.4028235e+38)
                     )
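Since the grid above includes the ...WithDropout activation functions, hidden_dropout_ratios could in principle be tuned as well; it is not part of my grid here. A hedged sketch of how such a grid could look, assuming a fixed three-layer architecture so that every dropout vector has one entry per hidden layer:

# hedged sketch: a separate grid for dropout tuning
# (hidden_dropout_ratios needs one ratio per hidden layer)
hyper_params_dropout <- list(
  activation = c("RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"),
  hidden = list(c(50, 50, 50)),
  hidden_dropout_ratios = list(c(0.1, 0.1, 0.1),
                               c(0.3, 0.3, 0.3),
                               c(0.5, 0.5, 0.5)),
  input_dropout_ratio = c(0, 0.1, 0.2)
)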


Early stopping criteria

  • stopping_metric: metric that we want to use as stopping criterion
  • stopping_tolerance and stopping_rounds: training stops when the stopping metric does not improve by the stopping tolerance proportion anymore (e.g. by 0.05 or 5%) for the number of consecutive rounds defined by stopping rounds.
search_criteria <- list(strategy = "RandomDiscrete", 
                        max_models = 100,
                        max_runtime_secs = 900,
                        stopping_tolerance = 0.001,
                        stopping_rounds = 15,
                        seed = 42)

Now, we can train the model with combinations of hyper-parameters from our specified stopping criteria and hyper-parameter grid.

dl_grid <- h2o.grid(algorithm = "deeplearning", 
                    x = features,
                    y = response,
                    weights_column = weights,
                    grid_id = "dl_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,                           
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
                    )

We now want to extract the best model from the grid model list. What makes a model the best depends on the question you want to address with it: in some cases, the model with the highest AUC is the most suitable, in others the one with the lowest mean squared error, etc. See last week’s post again for a more detailed discussion of performance metrics.

For demonstration purposes, I am choosing the best models from a range of possible quality criteria. We first use the h2o.getGrid() function to sort all models by the quality metric we choose (depending on the metric, you want it ordered by descending or ascending values). We can then take the first model in the list and work with it further. This model’s hyper-parameters can be found with best_model@allparameters. You can now work with your best model as with any regular model in h2o (for an example see last week’s post).

# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err")

for (sort_by_1 in sort_options_1) {
  
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_1, decreasing = FALSE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  assign(paste0("best_model_", sort_by_1), best_model)
  
}


# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_2, decreasing = TRUE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  assign(paste0("best_model_", sort_by_2), best_model)
  
}
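As mentioned above, the hyper-parameters of any of these models can be inspected via the allparameters slot. A quick sketch, here for the model that was chosen by AUC:

# look at a few of the hyper-parameters of the model that was best by AUC
params <- best_model_auc@allparameters
params$activation
params$hidden
params$epochs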

Let’s plot the mean per class error for each best model:

library(tibble)

sort_options <- c("mean_per_class_error", "mse", "err", "auc", "precision", "accuracy", "recall", "specificity")

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  errors <- h2o.mean_per_class_error(best_model, train = TRUE, valid = TRUE, xval = TRUE)
 
  errors_df <- data.frame(model_id = best_model@model_id, sort = sort_by, errors) %>%
    rownames_to_column(var = "rowname")
  
  if (sort_by == "mean_per_class_error") {
    
    errors_df_comb <- errors_df
    
  } else {
    
    errors_df_comb <- rbind(errors_df_comb, errors_df)
    
  }
}
order <- subset(errors_df_comb, rowname == "xval") %>%
  arrange(errors)
  
errors_df_comb %>%
  mutate(sort = factor(sort, levels = order$sort)) %>%
  ggplot(aes(x = sort, y = errors, fill = model_id)) +
    facet_grid(rowname ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1), "cm")) +
    labs(x = "")


Model performance

The ultimate performance test for our model will be its prediction accuracy on the test set it hasn’t seen before. Here, I will compare the AUC and mean squared error for each best model from before. You could of course look at any other quality metric that is most appropriate for your model.

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  mse_auc_test <- data.frame(model_id = best_model@model_id,
                             sort = sort_by, 
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))
  
  if (sort_by == "mean_per_class_error") {
    
    mse_auc_test_comb <- mse_auc_test
    
  } else {
    
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
    
  }
}
library(tidyr)

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = sort, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")

I will then create a dataframe with the predictions from all best models for each test sample. As in last week’s post, I want to compare the default predictions with stringent predictions.

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)),
                                    sort = rep(sort_by, nrow(test)),
                                    class = as.vector(test$class), 
                                    actual = as.vector(test$diagnosis), 
                                    as.data.frame(h2o.predict(object = best_model, newdata = test)))
  
  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")
  
  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia", 
                                                  ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", 
                                         ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))
  
  if (sort_by == "mean_per_class_error") {
    
    finalRf_predictions_comb <- finalRf_predictions
    
  } else {
    
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
    
  }
}

To get a better overview, I am going to plot the predictions (default and stringent):

finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

With predictions made by different models, we can see where each model performs best. This obviously corresponds to the quality metric we chose to define the best model. Stringent prediction thresholds reduced the number of false predictions, but of course also reduced the total number of predictions made.

We can now decide which model is most relevant. Let’s say we want this model to give recommendations for further validating a diagnosis of arrhythmia. In that case, we would want to detect as many arrhythmia cases as possible, while accepting that some healthy subjects will be flagged erroneously; patients with likely arrhythmia could then be sent for further testing. Here, this means that we want to minimize the number of arrhythmia cases that are predicted incorrectly, so we would prefer the mse, auc and specificity models (which are in fact the same model, chosen by all three quality metrics). The worst model for this purpose would be the recall model, which also did not make any predictions that passed the stringency threshold. The recall model would have been best if we wanted to be confident that healthy people get flagged correctly. However, in a real-life setting, you can imagine that it would be much worse if a sick patient got sent home because he was wrongly diagnosed as healthy than if a healthy person got submitted to further testing for possible arrhythmia.
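To put a number on this reasoning, we could, for example, calculate for each best model the proportion of actual arrhythmia cases that were predicted as arrhythmia. A quick sketch on the default predictions (not part of the comparison above):

# proportion of actual arrhythmia cases that each best model detected
finalRf_predictions_comb %>%
  filter(actual == "arrhythmia") %>%
  group_by(sort) %>%
  dplyr::summarise(prop_detected = mean(predict == "arrhythmia"))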


If you are interested in more machine learning posts, check out the category listing for machine_learning.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.6.1         tibble_1.2          ggrepel_0.6.5      
## [4] ggplot2_2.2.1.9000  sparklyr_0.5.3-9002 dplyr_0.5.0        
## [7] h2o_3.10.3.6        rsparkling_0.1.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        RColorBrewer_1.1-2 plyr_1.8.4        
##  [4] bitops_1.0-6       base64enc_0.1-3    tools_3.3.2       
##  [7] digest_0.6.12      jsonlite_1.2       evaluate_0.10     
## [10] gtable_0.2.0       shiny_1.0.0        DBI_0.5-1         
## [13] rstudioapi_0.6     yaml_2.1.14        parallel_3.3.2    
## [16] withr_1.0.2        httr_1.2.1         stringr_1.2.0     
## [19] knitr_1.15.1       rappdirs_0.3.1     rprojroot_1.2     
## [22] grid_3.3.2         R6_2.2.0           rmarkdown_1.3     
## [25] reshape2_1.4.2     magrittr_1.5       backports_1.0.5   
## [28] scales_0.4.1       htmltools_0.3.5    assertthat_0.1    
## [31] mime_0.5           xtable_1.8-2       colorspace_1.3-2  
## [34] httpuv_1.3.3       labeling_0.3       config_0.2        
## [37] stringi_1.1.2      RCurl_1.95-4.8     lazyeval_0.2.0    
## [40] munsell_0.4.3

Building deep neural nets with h2o and rsparkling that predict arrhythmia of the heart

2017-02-27 00:00:00 +0000

Last week, I introduced how to run machine learning applications on Spark from within R, using the sparklyr package. This week, I am showing how to build feed-forward deep neural networks or multilayer perceptrons. The models in this example are built to classify ECG data as coming either from healthy hearts or from someone suffering from arrhythmia. I will show how to prepare a dataset for modeling, how to set weights and other modeling parameters and, finally, how to evaluate model performance with the h2o package via rsparkling.


Deep learning with neural networks

Deep learning with neural networks is arguably one of the most rapidly growing applications of machine learning and AI today. They allow building complex models that consist of multiple hidden layers within artificial neural networks and are able to find non-linear patterns in unstructured data. Deep neural networks are usually feed-forward, which means that each layer feeds its output to subsequent layers, but recurrent or feed-back neural networks can also be built. Feed-forward neural networks are also called multilayer perceptrons (MLPs).


H2O and Sparkling Water

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O can be integrated with Apache Spark (Sparkling Water) and therefore allows the implementation of complex or big models in a fast and scalable manner. H2O offers distributed implementations of a wide range of common machine learning algorithms for classification, regression and deep learning.

Sparkling Water can be accessed from R with the rsparkling extension package to sparklyr and h2o. Check the documentation for rsparkling to find out which H2O, Sparkling Water and Spark versions are compatible.


Preparing the R session

First, we need to load the packages and connect to the Spark instance (for demonstration purposes, I am using a local instance).

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")

I am also preparing my custom plotting theme.

library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}


Arrhythmia data

The data I am using to demonstrate the building of neural nets is the arrhythmia dataset from UC Irvine’s machine learning database. It contains 279 features from ECG heart rhythm diagnostics and one output column. I am not going to rename the feature columns because they are too many and the descriptions are too complex. Also, we don’t need to know specifically which features we are looking at for building the models. For a description of each feature, see https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names. The output column defines 16 classes: class 1 samples are from healthy ECGs, the remaining classes belong to different types of arrhythmia, with class 16 being all remaining arrhythmia cases that didn’t fit into distinct classes.

arrhythmia <- read.table("arrhythmia.data.txt", sep = ",")
arrhythmia[arrhythmia == "?"] <- NA

# making sure, that all feature columns are numeric
arrhythmia[-280] <- lapply(arrhythmia[-280], as.character)
arrhythmia[-280] <- lapply(arrhythmia[-280], as.numeric)

#  renaming output column and converting to factor
colnames(arrhythmia)[280] <- "class"
arrhythmia$class <- as.factor(arrhythmia$class)

As usual, I want to get acquainted with the data and explore its properties before building any model. So, I am first going to look at the distribution of classes and of healthy and arrhythmia samples.

p1 <- ggplot(arrhythmia, aes(x = class)) +
  geom_bar(fill = "navy", alpha = 0.7) +
  my_theme()

Because I am interested in distinguishing healthy from arrhythmia ECGs, I am converting the output to binary format by combining all arrhythmia cases into one class.

arrhythmia$diagnosis <- ifelse(arrhythmia$class == 1, "healthy", "arrhythmia")
arrhythmia$diagnosis <- as.factor(arrhythmia$diagnosis)
p2 <- ggplot(arrhythmia, aes(x = diagnosis)) +
  geom_bar(fill = "navy", alpha = 0.7) +
  my_theme()
library(gridExtra)
library(grid)

grid.arrange(p1, p2, ncol = 2)

With binary classification, we have almost the same numbers of healthy and arrhythmia cases in our dataset.

I am also interested in how much the normal and arrhythmia cases cluster in a Principal Component Analysis (PCA). I am first preparing the PCA plotting function and then running it on the feature data.

library(pcaGoPromoter)

pca_func <- function(pcaOutput2, group_name){
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
          data.frame(groups = as.character(t),
                     ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                           centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                           level = 0.95),
                     stringsAsFactors = FALSE)))
        
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.5) + 
      labs(color = paste(group_name),
           fill = paste(group_name),
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance")) +
      my_theme()
    
    return(plot)
}
pcaOutput <- pca(t(arrhythmia[-c(280, 281)]), printDropped = FALSE, scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$scores)

pcaOutput2$groups <- arrhythmia$class
p1 <- pca_func(pcaOutput2, group_name = "class")

pcaOutput2$groups <- arrhythmia$diagnosis
p2 <- pca_func(pcaOutput2, group_name = "diagnosis")

grid.arrange(p1, p2, ncol = 2)

The PCA shows that there is a big overlap between healthy and arrhythmia samples, i.e. there do not seem to be major global differences across all features. The class that is most distinct from all others seems to be class 9. I want to give the arrhythmia cases that are very different from the rest a stronger weight in the neural network, so I define a weight column in which every sample outside the central PCA cluster gets a “2”; these samples will in effect be used twice in the model.

weights <- ifelse(pcaOutput2$PC1 < -5 & abs(pcaOutput2$PC2) > 10, 2, 1)

I also want to know what the variance is within features.

library(matrixStats)

colvars <- data.frame(feature = colnames(arrhythmia[-c(280, 281)]),
                      variance = colVars(as.matrix(arrhythmia[-c(280, 281)])))

subset(colvars, variance > 50) %>%
  mutate(feature = factor(feature, levels = colnames(arrhythmia[-c(280, 281)]))) %>%
  ggplot(aes(x = feature, y = variance)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Features with low variance are less likely to contribute strongly to a differentiation between healthy and arrhythmia cases, so I am going to remove them. I am also adding the weights column:

arrhythmia_subset <- cbind(weights, arrhythmia[, c(281, 280, which(colvars$variance > 50))])


Working with rsparkling and h2o

Now that I have my final data frame for modeling, I copy it to the Spark instance. For working with h2o functions, the data needs to be converted from a Spark DataFrame to an H2O Frame. This is done with the as_h2o_frame() function.

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)

We can now access all functions from the h2o package that are built to work on H2O Frames. A useful such function is h2o.describe(). It is similar to base R’s summary() function but outputs many more descriptive measures for our data. To get a good overview about these measures, I am going to plot them.

library(tidyr) # for gathering
h2o.describe(arrhythmia_hf[, -1]) %>% # excluding the weights column
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", 
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% # separating them into facets makes them easier to see
  mutate(Label = factor(Label, levels = colnames(arrhythmia_hf[, -1]))) %>%
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
    geom_point(size = 4, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    facet_grid(group ~ ., scales = "free") +
    labs(x = "Feature",
         y = "Value",
         color = "")

I am also interested in the correlation between features and the output. We can use the h2o.cor() function to calculate the correlation matrix. It is again much easier to understand the data when we visualize it, so I am going to create another plot.

library(reshape2) # for melting

arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2]) # diagnosis is now a character column and we need to convert it back to a factor
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3]) # same for class

cor <- h2o.cor(arrhythmia_hf[, -c(1, 3)])
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) + 
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", 
         y = "")


Training, test and validation data

Now we can use the h2o.splitFrame() function to split the data into training, validation and test data. Here, I am using 70% for training and 15% each for validation and testing. We could also just split the data into two sets, training and test, but when we have sufficient samples it is a good idea to evaluate model performance on an independent test set in addition to the validation set used during training. Because we can easily overfit a model, we want to get an idea of how generalizable it is - this we can only assess by looking at how well it works on previously unseen data.

I am also defining response, feature and weight column names now.

splits <- h2o.splitFrame(arrhythmia_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))
summary(train$diagnosis, exact_quantiles = TRUE)
##  diagnosis      
##  healthy   :163 
##  arrhythmia:155
summary(valid$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :43 
##  arrhythmia:25
summary(test$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :39 
##  arrhythmia:27

If we had more categorical features, we could use the h2o.interaction() function to define interaction terms, but since we only have numeric features here, we don’t need this.
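Just to illustrate what that would look like (not run here, since our features are numeric), a hedged sketch with two hypothetical factor columns fac1 and fac2:

# hedged sketch with hypothetical factor columns "fac1" and "fac2":
# pairwise interaction terms between all specified factor columns
interaction_terms <- h2o.interaction(arrhythmia_hf,
                                     factors = c("fac1", "fac2"),
                                     pairwise = TRUE,
                                     max_factors = 100,
                                     min_occurrence = 1)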

We can also run a PCA on the training data, using the h2o.prcomp() function to calculate the singular value decomposition of the Gram matrix with the power method.

pca <- h2o.prcomp(training_frame = train,
           x = features,
           validation_frame = valid,
           transform = "NORMALIZE",
           k = 3,
           seed = 42)

pca
## Model Details:
## ==============
## 
## H2ODimReductionModel: pca
## Model ID:  PCA_model_R_1488113493291_1 
## Importance of components: 
##                             pc1      pc2      pc3
## Standard deviation     0.598791 0.516364 0.424850
## Proportion of Variance 0.162284 0.120680 0.081695
## Cumulative Proportion  0.162284 0.282965 0.364660
## 
## 
## H2ODimReductionMetrics: pca
## 
## No model metrics available for PCA
## H2ODimReductionMetrics: pca
## 
## No model metrics available for PCA
eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features

ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel() +
  my_theme()


Modeling

Now, we can build a deep neural network model. We can specify quite a few parameters, like

  • Cross-validation: Cross validation can tell us the training and validation errors for each model. The final model will be overwritten with the best model, if we don’t specify otherwise.

  • Adaptive learning rate: For deep learning with h2o, we by default use stochastic gradient descent optimization with an adaptive learning rate. The two corresponding parameters rho and epsilon help us find global (or near enough) optima.

  • Activation function: The activation function defines the node output relative to a given set of inputs. We want our activation function to be non-linear and continuously differentiable.

  • Hidden nodes: Defines the number of hidden layers and the number of nodes per layer.

  • Epochs: Increasing the number of epochs (one full training cycle on all training samples) can increase model performance, but we also run the risk of overfitting. To determine the optimal number of epochs, we need to use early stopping.

  • Early stopping: By default, early stopping is enabled. This means that training will be stopped when we reach a certain validation error to prevent overfitting.

Of course, you need quite a bit of experience and intuition to hit on a good combination of parameters. That’s why it usually makes sense to do a grid search for hyper-parameter tuning. Here, I want to focus on building and evaluating deep learning models, though. I will cover grid search in next week’s post.

dl_model <- h2o.deeplearning(x = features,
                             y = response,
                             weights_column = weights,
                             model_id = "dl_model",
                             training_frame = train,
                             validation_frame = valid,
                             nfolds = 15,                                   # 15-fold cross validation
                             keep_cross_validation_fold_assignment = TRUE,
                             fold_assignment = "Stratified",
                             activation = "RectifierWithDropout",
                             score_each_iteration = TRUE,
                             hidden = c(200, 200, 200, 200, 200),           # 5 hidden layers, each of 200 neurons
                             epochs = 100,
                             variable_importances = TRUE,
                             export_weights_and_biases = TRUE,
                             seed = 42)

Because training can take a while, depending on how many samples, features, nodes and hidden layers you are training on, it is a good idea to save your model.

h2o.saveModel(dl_model, path = "dl_model")

We can then re-load the model again any time to check the model quality and make predictions on new data.

dl_model <- h2o.loadModel("dl_model/dl_model")


Model performance

We now want to know how our model performed on the validation data. The summary() function will give us a detailed overview of our model. I am not showing the output here, because it is quite extensive.

summary(dl_model)

One performance metric we are usually interested in is the mean per class error for training and validation data.

h2o.mean_per_class_error(dl_model, train = TRUE, valid = TRUE, xval = TRUE)
##           train          valid            xval 
## 0.0304878 0.2232558 0.2257781

The confusion matrix tells us how many cases of each class were predicted correctly and where the model made mistakes. Here, we see the errors in predictions on the validation data:

h2o.confusionMatrix(dl_model, valid = TRUE)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00425904880659062:
##            arrhythmia healthy    Error    Rate
## arrhythmia         15      10 0.400000  =10/25
## healthy             2      41 0.046512   =2/43
## Totals             17      51 0.176471  =12/68

We can also plot the classification error over all epochs or samples.

plot(dl_model,
     timestep = "epochs",
     metric = "classification_error")

plot(dl_model,
     timestep = "samples",
     metric = "classification_error")

Besides the classification error, we are usually interested in the logistic loss (negative log-likelihood or log loss). It is the negative logarithm of the likelihood that the model assigns to the true label of each sample, summed over the training or validation data. Simply put, the lower the loss, the better the model (if we ignore potential overfitting).
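As a small illustration of this definition (with made-up toy numbers, not from our model), the binary log loss can be computed by hand from predicted probabilities:

# binary log loss computed by hand on toy values
actual <- c(1, 0, 1, 1)             # true labels
predicted <- c(0.9, 0.2, 0.6, 0.4)  # predicted probabilities for label "1"
-mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))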

plot(dl_model,
     timestep = "epochs",
     metric = "logloss")

We can also plot the mean squared error (MSE). The MSE is the average of the squared prediction errors and reflects the estimator’s variance and bias; the plot below shows its square root, the RMSE. The closer to zero, the better a model.

plot(dl_model,
     timestep = "epochs",
     metric = "rmse")

Next, we want to know the area under the curve (AUC). AUC is an important metric for measuring binary classification model performance. It is the area under the ROC curve, i.e. the integral of the true positive rate over the false positive rate. The closer to 1, the better a model.

h2o.auc(dl_model, train = TRUE)
## [1] 0.9864582
h2o.auc(dl_model, valid = TRUE)
## [1] 0.852093
h2o.auc(dl_model, xval = TRUE)
## [1] 0.8577735

The weights connecting two adjacent layers and the per-neuron biases, which we told the model to export, can be accessed with:

w <- h2o.weights(dl_model, matrix_id = 1)
b <- h2o.biases(dl_model, vector_id = 1)

Variable importance can be extracted as well (but keep in mind that variable importance in deep neural networks is difficult to assess and should be considered only a rough estimate).

h2o.varimp(dl_model)
## Variable Importances: 
##   variable relative_importance scaled_importance percentage
## 1     V169            1.000000          1.000000   0.013925
## 2     V239            0.987290          0.987290   0.013748
## 3     V103            0.913953          0.913953   0.012727
## 4      V15            0.907422          0.907422   0.012636
## 5      V91            0.904267          0.904267   0.012592
## 
## ---
##    variable relative_importance scaled_importance percentage
## 85      V88            0.717914          0.717914   0.009997
## 86     V269            0.715800          0.715800   0.009968
## 87     V137            0.712923          0.712923   0.009928
## 88     V168            0.711402          0.711402   0.009906
## 89      V33            0.707356          0.707356   0.009850
## 90     V219            0.696149          0.696149   0.009694
#h2o.varimp_plot(dl_model)


Test data

Now that we have a good idea about model performance on validation data, we want to know how it performed on unseen test data. A good model should find an optimal balance between accuracy on training and test data. A model that has 0% error on the training data but 40% error on the test data is in effect useless. It overfit on the training data and is thus not able to generalize to unknown data.

perf <- h2o.performance(dl_model, test)
perf
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.2154912
## RMSE:  0.4642103
## LogLoss:  1.378809
## Mean Per-Class Error:  0.2250712
## AUC:  0.7796771
## Gini:  0.5593542
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##            arrhythmia healthy    Error    Rate
## arrhythmia         19       8 0.296296   =8/27
## healthy             6      33 0.153846   =6/39
## Totals             25      41 0.212121  =14/66
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.132564 0.825000  40
## 2                       max f2  0.000002 0.894495  61
## 3                 max f0point5  0.132564 0.812808  40
## 4                 max accuracy  0.132564 0.787879  40
## 5                max precision  0.982938 1.000000   0
## 6                   max recall  0.000002 1.000000  61
## 7              max specificity  0.982938 1.000000   0
## 8             max absolute_mcc  0.132564 0.557317  40
## 9   max min_per_class_accuracy  0.837616 0.743590  34
## 10 max mean_per_class_accuracy  0.132564 0.774929  40
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Plotting the ROC curve of the test performance shows us approximately how good the predictions are.

plot(perf)

We also want to know the log loss, MSE and AUC values, as well as other model metrics for the test data:

h2o.logloss(perf)
## [1] 1.378809
h2o.mse(perf)
## [1] 0.2154912
h2o.auc(perf)
## [1] 0.7796771
head(h2o.metric(perf))
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  0.982938 0.050000 0.031847 0.116279 0.424242  1.000000 0.025641
## 2  0.982811 0.097561 0.063291 0.212766 0.439394  1.000000 0.051282
## 3  0.982460 0.095238 0.062893 0.196078 0.424242  0.666667 0.051282
## 4  0.982256 0.139535 0.093750 0.272727 0.439394  0.750000 0.076923
## 5  0.982215 0.181818 0.124224 0.338983 0.454545  0.800000 0.102564
## 6  0.981170 0.177778 0.123457 0.317460 0.439394  0.666667 0.102564
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.103203               0.025641                0.512821
## 2    1.000000     0.147087               0.051282                0.525641
## 3    0.962963     0.033624               0.051282                0.507123
## 4    0.962963     0.082188               0.076923                0.519943
## 5    0.962963     0.121754               0.102564                0.532764
## 6    0.925926     0.048725               0.102564                0.514245
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1  27  38   0   1 1.000000 0.974359 0.000000 0.025641   0
## 2  27  37   0   2 1.000000 0.948718 0.000000 0.051282   1
## 3  26  37   1   2 0.962963 0.948718 0.037037 0.051282   2
## 4  26  36   1   3 0.962963 0.923077 0.037037 0.076923   3
## 5  26  35   1   4 0.962963 0.897436 0.037037 0.102564   4
## 6  25  35   2   4 0.925926 0.897436 0.074074 0.102564   5

The confusion matrix alone can be obtained with the h2o.confusionMatrix() function, but it is also part of the performance summary.

h2o.confusionMatrix(dl_model, test)

The final predictions with probabilities can be extracted with the h2o.predict() function. Beware, though, that the number of correct and wrong classifications can differ slightly from the confusion matrix above. Here, I combine the predictions with the actual test diagnoses and classes into a data frame. For plotting, I also want a column that tells me whether the predictions were correct. By default, a prediction probability above 0.5 gets scored as a prediction for the respective category. I often find it makes sense to be more stringent and set a higher threshold, though. Therefore, I am creating another column of stringent predictions, where I only count predictions that were made with more than 80% probability; everything below that gets scored as "uncertain". For these stringent predictions, I am also creating a column that tells me whether they were accurate.

finalRf_predictions <- data.frame(class = as.vector(test$class),
                                  actual = as.vector(test$diagnosis),
                                  as.data.frame(h2o.predict(object = dl_model, newdata = test)))

finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia", 
                                                ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", 
                                       ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))
finalRf_predictions %>%
  group_by(actual, predict) %>%
  summarise(n = n())
## Source: local data frame [4 x 3]
## Groups: actual [?]
## 
##       actual    predict     n
##       <fctr>     <fctr> <int>
## 1 arrhythmia arrhythmia    16
## 2 arrhythmia    healthy    11
## 3    healthy arrhythmia     6
## 4    healthy    healthy    33
finalRf_predictions %>%
  group_by(actual, predict_stringent) %>%
  summarise(n = n())
## Source: local data frame [6 x 3]
## Groups: actual [?]
## 
##       actual predict_stringent     n
##       <fctr>             <chr> <int>
## 1 arrhythmia        arrhythmia    19
## 2 arrhythmia           healthy     6
## 3 arrhythmia         uncertain     2
## 4    healthy        arrhythmia     8
## 5    healthy           healthy    29
## 6    healthy         uncertain     2

To get a better overview, I am going to plot the predictions (default and stringent):

p1 <- finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

p2 <- finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

grid.arrange(p1, p2, ncol = 2)

Being more stringent with the prediction threshold slightly reduced the number of errors, but not by much.

I also want to know whether there are certain classes of arrhythmia that are especially prone to being misclassified:

p1 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Default predictions",
         x = "predicted to be")

p2 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict_stringent, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Stringent predictions",
         x = "predicted to be")

grid.arrange(p1, p2, ncol = 2)

There are no obvious biases towards specific classes, but with the small number of samples for most classes this is difficult to assess.


Final conclusions: How useful is the model?

Most samples were classified correctly, but the total error was not particularly good. Moreover, when evaluating the usefulness of a specific model, we need to keep in mind what we want to achieve with it and which questions we want to answer. If we wanted to deploy this model in a clinical setting, it should assist with diagnosing patients. So, we need to think about what the consequences of wrong classifications would be. Would it be better to optimize for high sensitivity, i.e. to detect as many arrhythmia cases as possible, with the drawback that we would probably also flag a few healthy people? Or do we want to maximize precision, meaning that we could be confident that a patient predicted to have arrhythmia does indeed have it, while accepting that a few arrhythmia cases would remain undiagnosed? When we consider the stringent predictions, this model correctly classified 19 out of 27 arrhythmia cases, but 6 were misdiagnosed as healthy. This means that some patients who were actually sick wouldn't have gotten the correct treatment (if the decision were based solely on this model). For real-life application, this is obviously not sufficient!
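
To put rough numbers on this trade-off, here is a quick back-of-the-envelope calculation based on the stringent prediction counts from the table above (treating the "uncertain" cases as not classified):

tp <- 19  # arrhythmia cases predicted as arrhythmia
fn <- 6   # arrhythmia cases predicted as healthy
fp <- 8   # healthy cases predicted as arrhythmia

tp / (tp + fn)  # sensitivity/recall: 0.76
tp / (tp + fp)  # precision: ~0.70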

Next week, I’ll be trying to improve the model by doing a grid search for hyper-parameter tuning.

So, stay tuned… (sorry, couldn’t resist ;-))


If you are interested in more machine learning posts, check out the category listing for machine_learning.



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.2       tidyr_0.6.1          matrixStats_0.51.0  
##  [4] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
##  [7] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [10] ellipse_0.3-8        gridExtra_2.2.1      ggrepel_0.6.5       
## [13] ggplot2_2.2.1        sparklyr_0.5.2       dplyr_0.5.0         
## [16] h2o_3.10.3.6         rsparkling_0.1.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9          RColorBrewer_1.1-2   plyr_1.8.4          
##  [4] zlibbioc_1.20.0      bitops_1.0-6         base64enc_0.1-3     
##  [7] tools_3.3.2          digest_0.6.12        memoise_1.0.0       
## [10] RSQLite_1.1-2        jsonlite_1.2         evaluate_0.10       
## [13] tibble_1.2           gtable_0.2.0         DBI_0.5-1           
## [16] yaml_2.1.14          withr_1.0.2          httr_1.2.1          
## [19] stringr_1.2.0        knitr_1.15.1         rappdirs_0.3.1      
## [22] rprojroot_1.2        Biobase_2.34.0       R6_2.2.0            
## [25] AnnotationDbi_1.36.0 rmarkdown_1.3        magrittr_1.5        
## [28] backports_1.0.5      scales_0.4.1         htmltools_0.3.5     
## [31] assertthat_0.1       colorspace_1.3-2     labeling_0.3        
## [34] config_0.2           stringi_1.1.2        RCurl_1.95-4.8      
## [37] lazyeval_0.2.0       munsell_0.4.3

Predicting food preferences with sparklyr (machine learning)

2017-02-19 00:00:00 +0000

This week I want to show how to run machine learning applications on a Spark cluster. I am using the sparklyr package, which provides a handy interface to access Apache Spark functionalities via R.

The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines.


Apache Spark

Apache Spark™ can be used to perform large-scale data analysis workflows by taking advantage of its parallel cluster-computing setup. Because machine learning processes are iterative, running models on parallel clusters can vastly increase training speed. Obviously, Spark's power really comes into play when dispatching jobs to external clusters, but for demonstration purposes I am running the demo on a local Spark instance.

MLlib

Spark's distributed machine learning library MLlib sits on top of the Spark core framework. It implements many popular machine learning algorithms, plus many helper functions for data preprocessing. With sparklyr, you can easily access MLlib: it exposes a number of machine learning algorithms as well as functions for manipulating features and Spark DataFrames. Additionally, you can run SQL queries. sparklyr also provides a dplyr backend, which makes it especially convenient for handling data.

If you don’t have Spark installed locally, run:

library(sparklyr)
spark_install(version = "2.0.0")

Now we can connect to a local Spark instance:

library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
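
As a minimal sketch of the dplyr and SQL integration mentioned above (the iris data and the "iris_spark" table name are just examples and are not used in the rest of this analysis):

library(dplyr)
library(DBI)

# copy a small example data set to Spark; dots in column names become underscores
iris_tbl <- copy_to(sc, iris, "iris_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL behind the scenes
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length))

# the same table can also be queried with raw SQL via the DBI interface
dbGetQuery(sc, "SELECT Species, COUNT(*) AS n FROM iris_spark GROUP BY Species")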


Preparations

Before I start with the analysis, I am preparing my custom ggplot2 theme and loading the packages tidyr (for gathering data for plotting), dplyr (for data manipulation) and ggrepel (for non-overlapping text labels in plots).

library(tidyr)
library(ggplot2)
library(ggrepel)
library(dplyr)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "black"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}


The Data

Of course, the power of Spark lies in speeding up operations on large datasets. But because that's not very handy for a demonstration, here I am working with a small dataset: the raw data behind The FiveThirtyEight International Food Association's 2014 World Cup.

This dataset is part of the fivethirtyeight package and provides scores for how each person rated their preference of the dishes from several countries. The following categories could be chosen:

  • 5: I love this country’s traditional cuisine. I think it’s one of the best in the world.
  • 4: I like this country’s traditional cuisine. I think it’s considerably above average.
  • 3: I’m OK with this country’s traditional cuisine. I think it’s about average.
  • 2: I dislike this country’s traditional cuisine. I think it’s considerably below average.
  • 1: I hate this country’s traditional cuisine. I think it’s one of the worst in the world.
  • N/A: I’m unfamiliar with this country’s traditional cuisine.

Because I think that not knowing a country’s cuisine is in itself informative, I recoded the NAs to 0.

library(fivethirtyeight)

food_world_cup[food_world_cup == "N/A"] <- NA
food_world_cup[, 9:48][is.na(food_world_cup[, 9:48])] <- 0
food_world_cup$gender <- as.factor(food_world_cup$gender)
food_world_cup$location <- as.factor(food_world_cup$location)

The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines, general knowledge and interest in different cuisines, age, gender, income, education level and/ or location.

Before I do any machine learning, however, I want to get to know the data. First, I am calculating the percentages for each preference category and plotting them in a pie chart that is facetted by country.

# calculating percentages per category and country
percentages <- food_world_cup %>%
  select(algeria:vietnam) %>%
  gather(x, y) %>%
  group_by(x, y) %>%
  summarise(n = n()) %>%
  mutate(Percent = round(n / sum(n) * 100, digits = 2))

# rename countries & plot
percentages %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = "", y = Percent, fill = y)) + 
    geom_bar(width = 1, stat = "identity") + 
    theme_minimal() +
    coord_polar("y", start = 0) +
    facet_wrap(~ x_2, ncol = 8) +
    scale_fill_brewer(palette = "Set3") +
    labs(fill = "")

As we can already see from the distributions, there are big differences between countries. Some countries’ cuisines are very well known and liked (e.g. Italy, Mexico and the United States), while others are only known to a handful of people (e.g. Algeria, Croatia, Ghana, etc.).


Imputing missing values

In the demographic information, there are a few missing values. These, I want to impute:

library(mice)

dataset_impute <- mice(food_world_cup[, -c(1, 2)],  print = FALSE)
food_world_cup <- cbind(food_world_cup[, 2, drop = FALSE], mice::complete(dataset_impute, 1))


Transforming preference values

Strictly speaking, the preference data is categorical. But because using the values as factor levels would make the models much more complex and calculation-intensive, I am converting the factors to numbers and transforming them by dividing each value by the mean of the non-zero values for its country.

food_world_cup[8:47] <- lapply(food_world_cup[8:47], as.numeric)

countries <- paste(colnames(food_world_cup)[-c(1:7)])

for (response in countries) {
  food_world_cup[paste(response, "trans", sep = "_")] <- food_world_cup[response] / mean(food_world_cup[food_world_cup[response] > 0, response])
}

As we can see by plotting the distribution of transformed values, they are far from being normal distributions. Moreover, we still see big differences between well known and not well known countries - this means that the data is unbalanced.

food_world_cup %>%
  gather(x, y, algeria_trans:vietnam_trans) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(y)) +
    geom_density(fill = "navy", alpha = 0.7) +
    my_theme() + 
    facet_wrap(~ x_2, ncol = 8) +
    labs(x = "transformed preference")


Most liked cuisines and gender biases

Then, I wanted to know which countries' cuisines were the most liked and whether we see a gender bias in the preference for some countries' cuisines. The first plot shows the sum of preference values for each country, separated by male and female answers.

food_world_cup_gather <- food_world_cup %>%
  collect %>%
  gather(country, value, algeria:vietnam)
                                 
food_world_cup_gather$value <- as.numeric(food_world_cup_gather$value)
food_world_cup_gather$country <- as.factor(food_world_cup_gather$country)
food_world_cup_gather <- food_world_cup_gather %>%
  mutate(x_2 = gsub("_", " ", country)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))

order <- aggregate(food_world_cup_gather$value, by = list(food_world_cup_gather$x_2), FUN = sum)

food_world_cup_gather %>%
  mutate(x_2 = factor(x_2, levels = order$Group.1[order(order$x, decreasing = TRUE)])) %>%
  ggplot(aes(x = x_2, y = value, fill = gender)) +
    geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "Gender",
         x = "",
         y = "sum of preferences")

Cuisines from Italy, the United States and Mexico were the most liked, while those from Ghana, Cameroon and the Ivory Coast were the least liked (the fact that they were also less well known plays into it, of course). No country shows an obvious gender bias, though.

To dig a bit deeper into gender differences, I am calculating the differences between mean preference of males and females for each country. Here, I am using the country list that I defined earlier when transforming the values.

Because I want to pass the country list to dplyr's functions programmatically, I need to use standard evaluation (SE). By default, dplyr uses non-standard evaluation (NSE), i.e. variable names are given without quotation marks, which makes it possible to easily convert between R and SQL code. But each dplyr function also has a version for use with SE: they each have an underscore ("_") appended to the function name, here e.g. mutate_each_().

food_world_cup %>%
  collect %>%
  mutate_each_(funs(as.numeric), countries) %>%
  group_by(gender) %>%
  summarise_each_(funs(mean), countries) %>%
  summarise_each_(funs(diff), countries) %>%
  gather(x, y) %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = y)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "",
         y = "difference\nbetween gender")

All gender differences are very close to zero, so it seems that men and women generally have similar food preferences.


Spark

Now that I'm somewhat familiar with the data, I copy it to the Spark instance:

food_world_cup <- copy_to(sc, food_world_cup)


Principal Component Analysis (PCA)

One of Spark's machine learning functions is ml_pca() for principal component analysis. I am using it to get an idea about where the countries fall on a 2-dimensional plane spanned by the first two principal components.

pca <- food_world_cup %>%
  mutate_each_(funs(as.numeric), countries) %>%
  ml_pca(features = paste(colnames(food_world_cup)[-c(1:47)]))

library(tibble)
as.data.frame(pca$components) %>%
  rownames_to_column(var = "labels") %>%
  mutate(x_2 = gsub("_trans", "", labels)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = PC1, y = PC2, color = x_2, label = x_2)) + 
    geom_point(size = 2, alpha = 0.6) +
    geom_text_repel() +
    labs(x = paste0("PC1: ", round(pca$explained.variance[1], digits = 2) * 100, "% variance"),
         y = paste0("PC2: ", round(pca$explained.variance[2], digits = 2) * 100, "% variance")) +
    my_theme() + 
    guides(fill = FALSE, color = FALSE)

The least well known countries cluster on the top right, while the most liked are on the bottom right. PC2 seems to be mainly driven by how many 0s there are in the data.


Preparing the data for machine learning

First, I need to convert the factor strings of the non-country features into indexes. To do so, I am using the ft_string_indexer() function. For other feature transformation functions, check out sparklyr’s website.

food_world_cup <- tbl(sc, "food_world_cup") %>%
  ft_string_indexer(input_col = "interest", output_col = "interest_idx") %>%
  ft_string_indexer(input_col = "gender", output_col = "gender_idx") %>%
  ft_string_indexer(input_col = "age", output_col = "age_idx") %>%
  ft_string_indexer(input_col = "household_income", output_col = "household_income_idx") %>%
  ft_string_indexer(input_col = "education", output_col = "education_idx") %>%
  ft_string_indexer(input_col = "location", output_col = "location_idx") %>%
  ft_string_indexer(input_col = "knowledge", output_col = "knowledge_idx")


Now I can divide the data into training and test sets using the sdf_partition() function.

partitions <- food_world_cup %>%
  sdf_partition(training = 0.75, test = 0.25, seed = 753)


Modeling

Now, I run the Random Forest algorithm to predict each country’s preference based on all other countries’ preferences and the demographic information. Check back with sparklyr’s website to find out about other machine learning algorithms you can use.

For each country (as response variable), I am defining the features as all other countries’ transformed values and the indexed factor variables.

Then, I am filtering out all data rows where the response variable was 0. When I left the 0s in the data, the countries with many 0s introduced a very strong bias to the prediction. Here, I needed to use standard evaluation again, this time with lazyeval’s interp() function and a formula.

I am using the ml_random_forest() function to run classification models. Initially, I run a model on all features, then extract the 10 features with the highest importance and re-run the model on this subset of features. This improved the prediction accuracy compared to models using all features.

The sdf_predict() function is used to predict the classes of the test set. To obtain the quality metrics F1, weighted precision and weighted recall, I need to copy the prediction table to the Spark instance and run the ml_classification_eval() function.

Finally, I am combining the output tables from all countries' models to compare them.

library(lazyeval)

for (response in countries) {

  features <- colnames(partitions$training)[-grep(response, colnames(partitions$training))]
  features <- features[grep("_trans|_idx", features)]

  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
  
  feature_imp <- ml_tree_feature_importance(sc, fit)
  
  features <- as.character(feature_imp[1:10, 2])
  
  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
  
  partitions$test <- partitions$test %>%
    filter_(interp(~ var > 0, var = as.name(response)))
  
  pred <- sdf_predict(fit, partitions$test) %>%
    collect
  
  pred_2 <- as.data.frame(table(pred[[response]], pred$prediction))
  pred_2$response <- response
  
  pred_sc <- select(pred, -rawPrediction, -probability)
  pred_sc <- copy_to(sc, pred_sc, overwrite = TRUE)
  
  feature_imp$response <- response
  
  f1 <- ml_classification_eval(pred_sc, response, "prediction", metric = "f1")
  wP <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedPrecision")
  wR <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedRecall")
  
  ml_eval <- data.frame(response = response,
                        f1 = f1,
                        weightedPrecision = wP,
                        weightedRecall = wR)
  
  if (response == "algeria") {
    feature_imp_df <- feature_imp
    ml_eval_df <- ml_eval
    pred_df <- pred_2
  } else {
    feature_imp_df <- rbind(feature_imp_df, feature_imp)
    ml_eval_df <- rbind(ml_eval_df, ml_eval)
    pred_df <- rbind(pred_df, pred_2)
  }
}


Model evaluation

Now, I can compare the prediction accuracies for each country’s model by plotting the F1, weighted precision and weighted recall values.

  • Precision

In classification models, precision gives the proportion of predicted positives that are true positives (i.e. how many of the samples that were classified as "5" actually were a "5"). If we have a high precision, most of the samples that our model assigns to a class do belong to that class.

  • Recall

Recall, or sensitivity, gives the proportion of all true "5" samples that were correctly classified. If we have a high recall, our model correctly detects most of the true instances of a class.

  • F1 score

The F1 score gives the harmonic mean of precision and recall.

Let’s say we had 10 “5s” in the training data. The prediction table classified 8 samples as “5s”, 6 of which were correctly classified. In this example, we have 2 false positives and 4 false negatives. This means we would have a precision of 6/8 and a recall of 6/10.
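
Here is the same example calculated in R:

tp <- 6        # correctly classified "5s"
fp <- 8 - tp   # classified as "5" but wrong: 2 false positives
fn <- 10 - tp  # true "5s" that were missed: 4 false negatives

precision <- tp / (tp + fp)  # 6/8 = 0.75
recall    <- tp / (tp + fn)  # 6/10 = 0.6
2 * precision * recall / (precision + recall)  # F1: harmonic mean
## [1] 0.6666667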

results <- ml_eval_df %>%
  mutate(x_2 = gsub("_", " ", response)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))
  
order <- results$x_2[order(results$f1, decreasing = TRUE)]

gather(results, x, y, f1:weightedRecall) %>%
  mutate(x_2 = factor(x_2, levels = order)) %>%
  ggplot(aes(x = x_2, y = y, fill = x)) +
    geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "", color = "", x = "", y = "value")

The plot above shows that the model predicting the preference for Spanish dishes had the highest accuracy, while the models for Italy, India and Ireland had the lowest accuracies.


To see what features had been used in more and less successful models, I am plotting the values of the 10 final features for classifying Spanish, Greek and Italian food preference categories 1 to 5 in the original dataset.

feats <- feature_imp_df %>%
  filter(response == "spain") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("spain", as.character(feats$feature))) %>%
  gather(x, y, -spain) %>%
  filter(spain > 0) %>%
  group_by(spain, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = spain, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Spain")

feats <- feature_imp_df %>%
  filter(response == "greece") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("greece", as.character(feats$feature))) %>%
  gather(x, y, -greece) %>%
  filter(greece > 0) %>%
  group_by(greece, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = greece, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Greece")

feats <- feature_imp_df %>%
  filter(response == "italy") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("italy", as.character(feats$feature))) %>%
  gather(x, y, -italy) %>%
  filter(italy > 0) %>%
  group_by(italy, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = italy, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Italy")


Conclusions

This dataset was a bit difficult because the preferences were highly unbalanced, i.e. the distributions of preferences were very different between countries. Some countries' cuisines were largely unknown, while others were well known and generally popular. If I kept the unknown class (0) in the prediction models, countries with many 0s jumped up in prediction accuracy, but this was mainly because the chance of correctly classifying a sample as 0 was disproportionately high (leading to high precision). Removing the 0s while working on raw preference values showed the same effect as before, only now countries with a majority of 5s (e.g. Italy) were overly "accurate". Transforming the data and removing the 0s led to a more evenly distributed accuracy between well known and less well known countries. But by removing them, the number of samples used for modeling varied strongly between countries, introducing another type of bias.

It was not possible to avoid bias entirely with this dataset, but I wanted to show it as an example nonetheless. Usually, machine learning examples show datasets where the models worked very well, leaving the reader in awe of the powers of machine learning. And while there are certainly powerful and impressive prediction models, real-life data is not always that simple. As this dataset shows, it's not always straightforward to build a good classification model, even though you'd intuitively assume that it should be easy to predict food preferences based on which other cuisines someone likes…

The model could potentially be improved by engineering additional features, by modeling the preferences as factors and/or by using different algorithms, but I'll leave this for another time.


Next week, I’ll be looking into how to use the h2o package with Spark using rsparkling.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tibble_1.2            fivethirtyeight_0.1.0 dplyr_0.5.0          
## [4] ggrepel_0.6.5         ggplot2_2.2.1         tidyr_0.6.1          
## [7] sparklyr_0.5.1       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        knitr_1.15.1       magrittr_1.5      
##  [4] rappdirs_0.3.1     munsell_0.4.3      colorspace_1.3-2  
##  [7] R6_2.2.0           plyr_1.8.4         stringr_1.1.0     
## [10] httr_1.2.1         tools_3.3.2        parallel_3.3.2    
## [13] grid_3.3.2         gtable_0.2.0       config_0.2        
## [16] DBI_0.5-1          withr_1.0.2        htmltools_0.3.5   
## [19] lazyeval_0.2.0     yaml_2.1.14        assertthat_0.1    
## [22] rprojroot_1.2      digest_0.6.12      RColorBrewer_1.1-2
## [25] base64enc_0.1-3    evaluate_0.10      rmarkdown_1.3     
## [28] labeling_0.3       stringi_1.1.2      scales_0.4.1      
## [31] backports_1.0.5    jsonlite_1.2

Conditional ggplot2 geoms in functions (QTL plots)

2017-02-12 00:00:00 +0000

When running an analysis, I am usually combining functions from multiple packages. Most of these packages come with their own plotting functions. And while those are certainly convenient in that they allow me to get a quick glance at the data or the output, they all have their own style. If I want to prepare a report, proposal or paper, though, I want all my plots to have a consistent style so that they support the story I want to tell with my data.

I have grown very fond of ggplot2 for producing plots, because of its tidy grammar, flexibility and easy customization. Here, I want to show an example of my own ggplot2 function to produce QTL plots for Karl Broman’s qtl package.

Specifically, I want to show how to incorporate conditional geoms when using ggplot2 in a function call.

I will show how my function can be used with the output from qtl’s scanone() function by following the examples from the package tutorial. The tutorial does an excellent job describing the workflow, so I am not going to explain it here.

library(qtl)


For defining the function and preparing the data for plotting, I am working with ggplot2, ggrepel (for adding text labels that don’t overlap), tidyr and dplyr. This is my function code:

library(ggplot2)
library(ggrepel)
library(tidyr)
library(dplyr)

qtl_plot <- function(input,              # data frame input from scanone
                     mult.pheno = FALSE, # multiple phenotypes?
                     model = "normal",   # model used in scanone
                     chrs = NA,          # chromosomes to display
                     lod = NA,           # LOD threshold
                     rug = FALSE,        # plot marker positions as rug?
                     ncol = NA,          # number of columns for facetting
                     labels = NA         # optional dataframe to plot QTL labels
                     ) {
                     
                  # if we have multiple phenotypes and/or a 2part model, gather input
                  if (mult.pheno & model == "2part") {
                    input <- gather(input, group, lod, grep("pheno", colnames(input)))
                  } else if (mult.pheno) {
                    input <- gather(input, group, lod, grep("pheno", colnames(input)))
                  } else if (model == "2part") {
                    input <- gather(input, method, lod, lod.p.mu:lod.mu)
                  }
    
                  # if not all chromosomes should be displayed, subset input
                  if (!is.na(chrs)[1]) {
                    input <- input[as.character(input$chr) %in% chrs, ]
                  }

                  # if there is more than one LOD column, gather input
                  if (!any(colnames(input) == "lod")) {
                    input$lod <- input[, grep("lod", colnames(input))]
                  }

                  # if no number of columns for facetting is defined, plot all in one row
                  if (is.na(ncol)) {
                    ncol <- length(unique(input$chr))
                  }

                  # if labels are set and there is no name column, set from rownames
                  if (!is.na(labels)[1]) {
                    if (is.null(labels$name)) {
                      labels$name <- rownames(labels)
                    }
                  }

                  # plot input data frame position and LOD score
                  plot <- ggplot(input, aes(x = pos, y = lod)) + {

                    # if LOD threshold is given, plot as horizontal line
                    if (!is.na(lod)[1] & length(lod) == 1) geom_hline(yintercept = lod, linetype = "dashed")
                  } + {

                    if (!is.na(lod)[1] & length(lod) > 1) geom_hline(data = lod, aes(yintercept = lod, linetype = group))
                  } + {

                    # plot rug on bottom, if TRUE
                    if (rug) geom_rug(size = 0.1, sides = "b")
                  } + {

                    # if input has column method but not group, plot line and color by method
                    if (!is.null(input$method) & is.null(input$group)) geom_line(aes(color = method), size = 2, alpha = 0.6)
                  } + {

                    # if input has column group but not method, plot line and color by group
                    if (!is.null(input$group) & is.null(input$method)) geom_line(aes(color = group), size = 2, alpha = 0.6)
                  } + {

                    # if input has columns method and group, plot line and color by method & linetype by group
                    if (!is.null(input$group) & !is.null(input$method)) geom_line(aes(color = method, linetype = group), size = 2, alpha = 0.6)
                  } + {

                    # set linetype, if input has columns method and group
                    if (!is.null(input$group) & !is.null(input$method)) scale_linetype_manual(values = c("solid", "twodash", "dotted"))
                  } + {

                    # if input has neither columns method nor group, plot black line
                    if (is.null(input$group) & is.null(input$method)) geom_line(size = 2, alpha = 0.6)
                  } + {

                    # if QTL positions are given in labels df, plot as point...
                    if (!is.na(labels)[1]) geom_point(data = labels, aes(x = pos, y = lod))
                  } + {

                    # ... and plot name as text with ggrepel to avoid overlapping
                    if (!is.na(labels)[1]) geom_text_repel(data = labels, aes(x = pos, y = lod, label = name), nudge_y = 0.5)
                  } + 
                    # facet by chromosome
                    facet_wrap(~ chr, ncol = ncol, scales = "free_x") +
                    # minimal plotting theme
                    theme_minimal() +
                    # increase strip title size
                    theme(strip.text = element_text(face = "bold", size = 12)) +
                    # use RcolorBrewer palette
                    scale_color_brewer(palette = "Set1") +
                    # Change plot labels
                    labs(x = "Chromosome",
                         y = "LOD",
                         color = "",
                         linetype = "")

                  print(plot)
}


The first example uses the hyper data set and builds a simple QTL model with three mapping methods: the EM algorithm, Haley-Knott regression and multiple imputation. The genome-wide LOD threshold is calculated with permutations. Feeding this LOD threshold into the summary output gives us the markers with a significant phenotype association (i.e. the QTL).

data(hyper)

hyper <- est.rf(hyper)

hyper <- calc.genoprob(hyper, step = 1, error.prob = 0.01)
out.em <- scanone(hyper, model = "normal")
out.hk <- scanone(hyper, method = "hk", model = "normal")

hyper <- sim.geno(hyper, step = 2, n.draws = 16, error.prob = 0.01)
out.imp <- scanone(hyper, method = "imp", model = "normal")

operm.hk <- scanone(hyper, pheno.col = 1, model = "normal", method = "hk", n.perm = 1000, verbose = FALSE)

lod_threshold <- summary(operm.hk, alpha = 0.05)
labels_df <- as.data.frame(summary(out.hk, perms = operm.hk, alpha = 0.05, pvalues = TRUE))

The most basic version of my plotting function takes as input a data frame from the scanone() function: it needs to have columns with chromosome, position and LOD score. Without defining anything else, the function will return a line plot of the LOD scores for each marker position on each chromosome in the input data.

qtl_plot(input = out.hk)

If we want to compare the output from e.g. different QTL models, we can follow the tidyverse principle by combining the output data frames in long format and adding a column with the respective method names. If the input has a method column, the function will color the lines accordingly. We can also specify which chromosomes to display (here, 1 and 4) and give a LOD threshold to plot as a horizontal dashed line. If we want to see each marker position, we can specify rug = TRUE to plot them underneath the line plots.

And we can also plot the markers with a significant phenotype association (the QTL) by adding a data frame with the labeling information to the function call. This data frame needs to contain the chromosome, position and LOD score information. If there is no name column, it will take the row names (which usually give the marker name) as labels.

qtl_plot(input = rbind(data.frame(out.em, method = "EM algorithm"), 
                       data.frame(out.hk, method = "Haley-Knott regression"), 
                       data.frame(out.imp, method = "Multiple imputation")), 
         chrs = c(1, 4), 
         lod = lod_threshold[1], 
         rug = TRUE, 
         labels = labels_df)


The second example I want to show is from the listeria data. This time, a slightly more complex 2-part model is built.

data(listeria)

listeria$pheno$logSurv <- log(listeria$pheno[,1])
listeria <- est.rf(listeria)

newmap <- est.map(listeria, error.prob = 0.01)
listeria <- replace.map(listeria, newmap)

listeria <- calc.errorlod(listeria, error.prob = 0.01)

listeria <- calc.genoprob(listeria, step = 2)
out.2p <- scanone(listeria, pheno.col = 3, model = "2part", upper = TRUE)

operm.2p <- scanone(listeria, model = "2part", pheno.col = 3, upper = TRUE, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.2p, alpha = 0.05)

labels_df <- as.data.frame(summary(out.2p, format = "allpeaks", perms = operm.2p, alpha = 0.05, pvalues = TRUE))

The output from a 2-part model looks different than that of the other model functions: instead of only one LOD score column, we now get three columns named lod.p.mu, lod.p and lod.mu (check back with the tutorial if you want to know what they mean). The plotting function is built so that if you define the model parameter as “2part”, each of the three LOD columns will be plotted in a different color.

If we also want to plot the QTL labels, we need to prepare the label data frame before feeding it into the plotting function:

p.mu <- labels_df[, 1:4]
colnames(p.mu)[3] <- "lod"
p.mu$name <- "p.mu"

p <- labels_df[, c(1, 5:7)]
colnames(p)[3] <- "lod"
p$name <- "p"

mu <- labels_df[, c(1, 8:10)]
colnames(mu)[3] <- "lod"
mu$name <- "mu"

labels_df <- rbind(p.mu, p, mu)

If the chromosomes are not in the right order, we can reorder the factor labels:

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
out.2p$chr <- factor(out.2p$chr, levels = f)
labels_df$chr <- factor(labels_df$chr, levels = f)

The parameter ncol lets you define how many facet columns you want. Here, I want two rows and three columns to display the six chromosomes I chose for plotting:

qtl_plot(input = out.2p, 
         model = "2part",
         chrs = c(1, 3, 5:6, 13, 15),
         lod = lod_threshold[1], 
         rug = TRUE, 
         ncol = 3,
         labels = labels_df)


Binary models are again treated the same way as e.g. normal models:

y <- listeria$pheno$logSurv
my <- max(y, na.rm = TRUE)
z <- as.numeric(y == my)
y[y == my] <- NA
listeria$pheno$logSurv2 <- y
listeria$pheno$binary <- z

out.p <- scanone(listeria, pheno.col = 5, model = "binary")
out.np1 <- scanone(listeria, model = "np", ties.random = TRUE)
out.np2 <- scanone(listeria, model = "np", ties.random = FALSE)
qtl_plot(input = rbind(data.frame(out.p, method = "binary"),
                       data.frame(out.np1, method = "np1"),
                       data.frame(out.np2, method = "np2")))


The fake.bc data is used to show how to run QTL models with multiple phenotypes simultaneously and how to add additive or interactive covariates.

data(fake.bc)

fake.bc <- calc.genoprob(fake.bc, step = 2.5)
out.nocovar <- scanone(fake.bc, pheno.col = 1:2)

If the parameter mult.pheno is set to TRUE, the plotting function gathers the phenotype columns and plots each phenotype in a different color.

qtl_plot(input = out.nocovar, mult.pheno = TRUE)

If we specify both mult.pheno = TRUE and model = “2part”, the function plots each LOD score for each phenotype in a different color. If you wanted to plot only the lod.p column for example, you could subset the input data frame before feeding it into the plotting function.

out.nocovar.2p <- scanone(fake.bc, pheno.col = 1:2, model = "2part", upper = TRUE)

qtl_plot(input = out.nocovar.2p,
         chrs = c(2, 5),
         mult.pheno = TRUE, 
         model = "2part")
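
As mentioned above, plotting only the lod.p scores would just be a matter of subsetting the input first. A minimal sketch, assuming the LOD columns follow a lod.p.<phenotype> naming scheme (check colnames(out.nocovar.2p) to be sure):

# the exact column names are an assumption here - check them first
colnames(out.nocovar.2p)

# keep chromosome, position and only the lod.p columns
lod_p_cols <- grep("lod\\.p\\.pheno", colnames(out.nocovar.2p), value = TRUE)
out_p_only <- out.nocovar.2p[, c("chr", "pos", lod_p_cols)]

qtl_plot(input = out_p_only, mult.pheno = TRUE)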


If we run a model with multiple phenotypes and want to determine the QTL positions, we get a different LOD threshold for each phenotype and, correspondingly, a different set of significantly associated markers.

sex <- fake.bc$pheno$sex
out.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex)

out.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex)

out.sexint <- out.icovar - out.acovar

operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.icovar, alpha = 0.05)

labels_df <- as.data.frame(summary(out.icovar, perms = operm.icovar, alpha = 0.05, pvalues = TRUE))

This means that, if we want to plot QTL labels for both phenotypes, we first need to prepare the label data frame:

pheno1 <- labels_df[, 1:4]
colnames(pheno1)[3] <- "lod"
pheno1$name <- paste("pheno1", rownames(pheno1), sep = "\n")

pheno2 <- labels_df[, c(1, 2, 5, 6)]
colnames(pheno2)[3] <- "lod"
pheno2$name <- paste("pheno2", rownames(pheno1), sep = "\n")

labels_df <- rbind(pheno1, pheno2)

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
labels_df$chr <- factor(labels_df$chr, levels = f)

If different methods and groups are detected in the input data frame, the methods will be drawn with different colors, while the groups will have different line types. If we want to plot two LOD thresholds for the two phenotypes, we can give the function a data frame with a group column that corresponds to the phenotypes of the input data frame. The LOD thresholds will then be plotted with the same line types as the corresponding phenotype LOD curves.

qtl_plot(input = rbind(data.frame(out.acovar, method = "acovar"),
                       data.frame(out.icovar, method = "icovar"),
                       data.frame(out.sexint, method = "sexint")), 
         chrs = c(2, 5, 17),
         mult.pheno = TRUE,
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1:2]),
         labels = labels_df)


For QTL models with covariates, the same principles apply:

seed <- ceiling(runif(1, 0, 10^8))
set.seed(seed)
operm.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

set.seed(seed)
operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

operm.sexint <- operm.icovar - operm.acovar

lod_threshold <- summary(operm.sexint, alpha = c(0.05, 0.20))
labels <- as.data.frame(summary(out.sexint, perms = operm.sexint, alpha = 0.1, format = "allpeaks", pvalues = TRUE))
pheno1 <- data.frame(labels[, 1:4])
colnames(pheno1)[3] <- "lod"
pheno1$name <- "p1"

pheno2 <- data.frame(labels[, c(1, 5:7)])
colnames(pheno2)[3] <- "lod"
pheno2$name <- "p2"

labels_df <- rbind(pheno1, pheno2)
qtl_plot(input = out.sexint, 
         mult.pheno = TRUE,
         labels = labels_df,
         chrs = c(2, 7, 17), 
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1, ]),
         ncol = 3)

hyper <- sim.geno(hyper, step = 2.5, n.draws = 16, err = 0.01)
g <- pull.geno(fill.geno(hyper))[, "D4Mit164"]
out1.c4i <- scanone(hyper, method = "imp", addcovar = g, intcovar = g)
qtl_plot(input = out1.c4i)


The same principles also apply to model fitting:

qc <- c(1, 1, 4, 6, 15)
qp <- c(43.3, 78.3, 30.0, 62.5, 18.0)
qtl <- makeqtl(hyper, chr = qc, pos = qp)
myformula <- y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5
out.fq <- fitqtl(hyper, qtl = qtl, formula = myformula, drop = FALSE, get.ests = TRUE)

revqtl <- refineqtl(hyper, qtl = qtl, formula = myformula, verbose = FALSE)
out1.c4r <- addqtl(hyper, qtl = revqtl, formula = y ~ Q3)

out2.c4r <- addpair(hyper, qtl = revqtl, formula = y ~ Q3, chr = c(1, 6, 7, 15), verbose = FALSE)
out.iw4 <- addqtl(hyper, qtl = revqtl, formula = y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5 + Q6 + Q5:Q6)
out.1more <- addqtl(hyper, qtl = revqtl, formula = myformula)

qtl2 <- addtoqtl(hyper, revqtl, 7, 53.6)
qtl3 <- dropfromqtl(qtl2, index = 2)
qtl4 <- replaceqtl(hyper, qtl3, index = 1, chr = 1, pos = 50)
qtl5 <- reorderqtl(qtl4, c(1:3, 5, 4))

stepout.i <- stepwiseqtl(hyper, max.qtl = 6, verbose = FALSE)

Here, we also need to prepare the label data frame before the plotting function can handle it:

label_df <- data.frame(name = stepout.i$name, chr = stepout.i$chr, pos = stepout.i$pos)

qtl1 <- attr(stepout.i, "lodprofile")[[1]]
qtl4 <- attr(stepout.i, "lodprofile")[[2]]
qtl6 <- attr(stepout.i, "lodprofile")[[3]]
qtl15 <- attr(stepout.i, "lodprofile")[[4]]

input_df <- rbind(qtl1, qtl4, qtl6, qtl15)
input_df$marker <- rownames(input_df)
input_df$name <- paste(input_df$chr, round(input_df$pos, digits = 1), sep = "@")

label_df$pos <- as.numeric(gsub("(^[0-9]+)(.0$)", "\\1", label_df$pos))
label_df$name <- paste(label_df$chr, round(label_df$pos, digits = 1), sep = "@")

label_df <- left_join(label_df, input_df[, 3:5, drop = FALSE], by = "name")
label_df$name <- paste(label_df$name, label_df$marker, sep = "\n")
qtl_plot(input = input_df,
         labels = label_df)


These examples show only the simple QTL models. The qtl package also gives examples for building scantwo() models, but those cannot be plotted in the same way as scanone() output and thus won't be covered here.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.5.0   tidyr_0.6.0   ggrepel_0.6.5 ggplot2_2.2.1 qtl_1.40-8   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8        knitr_1.15.1       magrittr_1.5      
##  [4] munsell_0.4.3      colorspace_1.3-2   R6_2.2.0          
##  [7] stringr_1.1.0      plyr_1.8.4         tools_3.3.2       
## [10] parallel_3.3.2     grid_3.3.2         gtable_0.2.0      
## [13] DBI_0.5-1          htmltools_0.3.5    yaml_2.1.14       
## [16] lazyeval_0.2.0     rprojroot_1.1      digest_0.6.11     
## [19] assertthat_0.1     tibble_1.2         RColorBrewer_1.1-2
## [22] codetools_0.2-15   evaluate_0.10      rmarkdown_1.3     
## [25] labeling_0.3       stringi_1.1.2      scales_0.4.1      
## [28] backports_1.0.4

Scratching the Surface of Gender Biases

2017-02-06 00:00:00 +0000

Today, I want to share my analysis of the World Gender Statistics dataset.

Last week I already introduced my Shiny app, where you can explore 160 measurements for 164 countries over 56 years. This week I’ve included a statistical analysis of these countries and measurements and put some finishing touches on the app.

Just in case anybody needed proof that gender bias is not only a problem of “other countries”, my analysis shows very nicely that the Western world isn’t (yet, I hope) gender neutral with regard to many of these statistics!

You can also run the app from GitHub with the shiny package; the repository also contains the source code for the app.

library(shiny)
runGitHub("ShirinG/WGS_app") 


The data

The data was downloaded from The World Bank’s Open Data project via Kaggle. The main data table shows

  • Country.Name: the name of the country
  • Country.Code: the country’s code
  • Indicator.Name: the name of the variable that this row represents
  • Indicator.Code: a unique id for the variable
  • 1960 - 2016: one column EACH for the value of the variable in each year it was available

Unfortunately, the dataset doesn’t include a column that indicates which female/male statistics belong together, and there is no consistent naming scheme either. Therefore, I chose to focus on those statistics where the counterparts could be easily extracted: indicator codes that differed only in containing “.FE” or “.MA”. I then split the subsetted dataframe into female and male tables and produced a third dataset with the ratios between male and female values. To calculate the ratios, I added 0.001 to each data point, so that zero values would not produce infinite or undefined ratios.
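
A tiny illustration of why the 0.001 offset matters (the values here are made up):

male   <- c(2, 0, 5)
female <- c(4, 0, 0)

male / female                      # 0.5  NaN  Inf
(male + 0.001) / (female + 0.001)  # finite ratios throughout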

The finished datasets were saved as “R.Data” files, so that I could easily load them into the Shiny app.

dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

dataset_fem <- dataset[grep("female", dataset$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Name <- gsub(",", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

dataset_male <- dataset[-grep("female", dataset$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Name <- gsub(",", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".MA", "", dataset_male$Indicator.Code) # strip the male marker (the female marker ".FE" was stripped above)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")
length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  code <- unique(dataset_fem$Indicator.Name)[n]
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    if (i == 1) {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- cbind(male[i, c(1:4)], diff)
    } else {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
    }
  }
  
  if (n == 1) {
    diff_table_bind <- diff_table
  } else {
    diff_table_bind <- rbind(diff_table_bind, diff_table)
  }
}

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")


The world map

The map has been downloaded from the Natural Earth Data website. The country borders were reduced by 200 meters with ArcGIS Pro, so that clicking within any country on the map would show the corresponding country’s border as the nearest point. ArcGIS Pro was also used to convert the map to Mercator projection. The changed shapefiles can be downloaded from my Github repository.

library(rgdal)
library(plyr)
library(scales)

wmap_countries <- readOGR(dsn = "shapefiles/changed_borders", layer = "ne_110m_admin_0_countries_smaller_wm")
wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")
wmap_countries_df_final$adm0_a3 <- as.character(wmap_countries_df_final$adm0_a3)
save(wmap_countries_df_final, file = "wmap_countries_smaller_df_final.RData")

diff_table_bind <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]
save(diff_table_bind, file = "diff_table_bind.RData")

countries <- as.character(unique(diff_table_bind$Country.Name))
save(countries, file = "countries.RData")
library(dplyr)
library(tidyr)

library(ggplot2)

map_theme <- list(theme(panel.grid.minor = element_blank(),
                        panel.grid.major = element_blank(),
                        panel.background = element_blank(),
                        plot.background = element_rect(fill = "white"),
                        panel.border = element_blank(),
                        axis.line = element_blank(),
                        axis.text.x = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks = element_blank(),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size = 18)))

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

colfunc <- colorRampPalette(c("yellow", "red"))


Which countries are most biased?

To explore differences between countries and statistics, I prepared a dataset that contains the first and last non-NA value of each time series of the male/ female ratios (per statistic and country), which I later use to test whether the change over time is statistically significant. As for the world map plots, I am working with the log2 of the ratios, so that ratios bigger and smaller than 1 become directly comparable.
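As a small illustration of why the log2 scale helps: a ratio of 2 (twice the male value) and its reciprocal 0.5 (twice the female value) are equally far from parity, and only on the log scale do they become symmetric around 0:

# reciprocal ratios are symmetric around 0 (parity) on the log2 scale
log2(c(0.5, 1, 2))
## [1] -1  0  1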

You can explore which countries have the strongest bias and the biggest change for each statistic in the app under the tab “Latest ratios”.

year_table <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]

# last non-NA value
last_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[length(na.omit(x))]
  )
last_val_df <- data.frame(unlist(last_val))
last_val_df$year_of_val <- gsub("([0-9]+)(.X)([0-9]+)", "\\3", rownames(last_val_df))
rownames(last_val_df) <- gsub("([0-9]+)(.X)([0-9]+)", "\\1", rownames(last_val_df))

# first non-NA value
first_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[1]
  )
first_val_df <- data.frame(na.omit(first_val))

# combine country info with the last and first recorded value per time series
year_table_last_val <- cbind(year_table[rownames(last_val_df), 1:3], last_val_df, first_val_df)

# overall change over time: last value relative to first value (0.001 avoids division by zero)
year_table_last_val$difference <- (year_table_last_val$unlist.last_val. + 0.001) / (year_table_last_val$na.omit.first_val. + 0.001)
save(year_table_last_val, file = "year_table_last_val.RData")

To identify the countries with the strongest bias towards either men or women, I calculated the absolute log2 ratio of the male/ female values for each statistic and kept the 10 countries with the highest absolute bias. I then counted for how many statistics each country appeared in such a top 10.
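For reference, the same selection could also be written without the explicit loop below, e.g. with dplyr’s group_by() and top_n() (just a sketch; note that top_n() keeps ties, so a statistic can occasionally contribute a few more than 10 rows):

library(dplyr)

# sketch: per-statistic top 10 countries by absolute log2 ratio
subs_comb_alt <- year_table_last_val %>%
  mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
  group_by(Indicator.Name) %>%
  top_n(10, abs_ratio) %>%
  ungroup() %>%
  select(Country.Name, abs_ratio, measure = Indicator.Name)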

# for each statistic, keep the 10 countries with the highest absolute log2 ratio
for (i in 1:length(measures)) {
  
  m <- measures[i]
  
  subs <- subset(year_table_last_val, Indicator.Name == m)[, c(1, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  subs$measure <- m
    
  if (i == 1) {
    subs_comb <- subs[1:10, ]
    
  } else {
    subs_comb <- rbind(subs_comb, subs[1:10, ])
  }
}

# order the countries by how often they appear in such a top 10
subs_comb_ordered <- arrange(as.data.frame(table(subs_comb$Country.Name)), desc(Freq))

subs_comb$Country.Name <- factor(subs_comb$Country.Name, levels = subs_comb_ordered$Var1)

ggplot(subs_comb, aes(x = Country.Name)) +
  geom_bar(fill = "navyblue", alpha = 0.7) +
  coord_flip() +
  my_theme() +
  labs(title = "Most biased countries",
       subtitle = "For how many statistics were the countries in top 10 with highest absolute bias",
       x = "")

This plot shows that there are quite a few European/ Western countries with high biases in some statistics. For example:

arrange(subs_comb[which(subs_comb$Country.Name %in% c("Germany", "Netherlands", "United States", "United Kingdom")), - 2], 
        Country.Name, desc(abs_ratio))
##      Country.Name abs_ratio
## 1     Netherlands 1.4678833
## 2     Netherlands 1.4674207
## 3     Netherlands 1.3615747
## 4     Netherlands 1.1248210
## 5     Netherlands 0.9993447
## 6     Netherlands 0.6601651
## 7     Netherlands 0.5512176
## 8     Netherlands 0.1847042
## 9         Germany 8.2336197
## 10        Germany 2.0428320
## 11        Germany 1.1641113
## 12        Germany 1.0957438
## 13        Germany 0.9825083
## 14        Germany 0.7356144
## 15        Germany 0.6274894
## 16 United Kingdom 1.0932072
## 17 United Kingdom 0.8659095
## 18 United Kingdom 0.4347145
## 19  United States 1.3212071
##                                                                                    measure
## 1        Educational attainment completed Doctoral or equivalent population 25+ years  (%)
## 2  Educational attainment competed Doctoral or equivalent population 25+  (%) (cumulative)
## 3                              Borrowed from a store by buying on credit  (% age 15+) [w1]
## 4                           Prevalence of stunting height for age  (% of children under 5)
## 5                        Prevalence of underweight weight for age  (% of children under 5)
## 6                                                       Loan in the past year  (% age 15+)
## 7                      Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 8                                                     Progression to secondary school  (%)
## 9                  Prevalence of severe wasting weight for height  (% of children under 5)
## 10                                          Part time employment  (% of total  employment)
## 11                                         Time-related underemployment  (% of employment)
## 12              Educational attainment completed lower secondary population 25+ years  (%)
## 13         Educational attainment completed short-cycle tertiary population 25+ years  (%)
## 14                                       Borrowed from family or friends  (% age 15+) [ts]
## 15                     Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 16                               Borrowed for health or medical purposes  (% age 15+) [w2]
## 17       Average number of hours spent on unpaid domestic work (housework and child care) 
## 18                                   Borrowed any money in the past year  (% age 15+) [w2]
## 19                          Prevalence of stunting height for age  (% of children under 5)

So, let’s take a closer look at the countries: I am calculating the sum of the absolute male/ female log2 ratios over all statistics for each country and showing it on the map.
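The loop below collects these sums per country; the same aggregation could also be written more compactly with dplyr (a sketch that collapses directly to one row per country):

library(dplyr)

# sketch: per-country sum of the absolute log2 ratios over all statistics
subs_country_alt <- year_table_last_val %>%
  mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
  group_by(Country.Name, Country.Code) %>%
  summarise(sum_abs_ratio = sum(abs_ratio))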

# sum the absolute log2 ratios over all statistics for each country
for (i in 1:length(countries)) {
  
  country_name <- countries[i]
  
  subs <- subset(year_table_last_val, Country.Name == country_name)[, c(2, 3, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  if (i == 1) {
    subs_country <- data.frame(country = country_name, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio))
    
  } else {
    subs_country <- rbind(subs_country, data.frame(country = country_name, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio)))
  }
}
left_join(subset(wmap_countries_df_final, !continent == "Antarctica"), subs_country, by = c("adm0_a3" = "Country.Code")) %>%
  ggplot(aes(long, lat, group = group, fill = sum_abs_ratio)) +
        coord_equal() +
        map_theme +
        geom_polygon() +
        geom_path(color = "white", size = 0.5) +
        labs(title = "General gender bias per country",
             subtitle = "Sum of absolute gender biases for all statistics",
             fill = "sum of absolute\nlog2 of male / female") +
        scale_fill_gradient2(low = "blue", mid = "blue", high = "red")


Statistical Analysis

You can explore the results for the individual statistics in the app under “Analysis - Plots” and “Analysis - Tests”.


The variables

The world map data also contains information about each country. I used the following variables to test for statistically significant effects (a quick look at these columns follows the list):

  1. economic status,
  2. income group,
  3. estimated population size,
  4. estimated gross domestic product (GDP) and
  5. continent.
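These variables are columns of the Natural Earth attribute table that was joined to the map data above; a quick peek, just for orientation:

# country-level variables as they appear in the fortified map data
str(unique(wmap_countries_df_final[, c("adm0_a3", "economy", "income_grp", "pop_est", "gdp_md_est", "continent")]))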

Because not all of the data was normally distributed, I explored non-parametric tests as well as an ANOVA. The plots for each statistic can be explored in my app, so here I am only showing which measures were statistically significant for the different variables.
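For illustration, a quick normality check of the log2 ratios for one statistic could look like this (a sketch; measures[1] is just an arbitrary example and not necessarily what the app shows):

# sketch: Shapiro-Wilk test and histogram of the log2 ratios for one statistic
x <- subset(year_table_last_val, Indicator.Name == measures[1])$unlist.last_val.
shapiro.test(log2(x))
hist(log2(x), breaks = 30, main = measures[1], xlab = "log2(male/ female)")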


Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank test is a non-parametric test for paired (repeated) measurements. Here, I am using it to test whether the most recent recorded value differs significantly from the first recorded value. Because I am running many of these tests, I correct the p-values for multiple testing and only consider a result significant at a False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).
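As a toy illustration of what the FDR (Benjamini-Hochberg) adjustment does to a handful of raw p-values (made-up numbers, not from this analysis):

# the i-th smallest p-value is multiplied by (number of tests / i),
# with the results then made non-decreasing
p.adjust(c(0.01, 0.04, 0.2), method = "fdr")
## [1] 0.03 0.06 0.20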

# country-level metadata (economy, income group, population, GDP, continent, adm0_a3, ...)
country.inf <- subset(wmap_countries_df_final[, c(17, 25, 26, 31, 32, 42, 43, 47, 48, 62)], !continent == "Antarctica")
country.inf <- country.inf[!duplicated(country.inf), ]

# paired Wilcoxon test of first vs. last recorded value for each statistic
for (i in 1:length(measures)) {
  m <- measures[i]
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))
  test <- wilcox.test(log2(stats_test$na.omit.first_val.), log2(stats_test$unlist.last_val.), paired = TRUE) 
  
  table <- data.frame(Indicator.Name = paste(m),
                      diff = (median(stats_test$unlist.last_val.) + 0.0001) / (median(stats_test$na.omit.first_val.) + 0.0001),
                      V = test$statistic,
                      p.value = test$p.value,
                      alternative = test$alternative)

  if (i == 1) {
    table_wilcox <- table
  } else {
    table_wilcox <- rbind(table_wilcox, table)
  }
}

# FDR correction across all tested statistics
table_wilcox$p.adj <- p.adjust(table_wilcox$p.value, method = "fdr")
table_wilcox$significant <- as.factor(ifelse(table_wilcox$p.adj < 0.1, "significant", "non-significant"))

# plot only the measures with a significant change over time
sig <- subset(table_wilcox, significant == "significant")
left_join(subset(year_table_last_val, Indicator.Name %in% sig$Indicator.Name), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  gather(x, y, unlist.last_val., na.omit.first_val.) %>%
  ggplot(aes(x = Indicator.Name, y = log2(y), color = x, fill = x)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  coord_flip() +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference over time",
    subtitle = "Wilcoxon Signed-Rank Test",
    x = "",
    y = "log2(male/ female)")


Kruskal-Wallis Test

The Kruskal-Wallis test is a simple non-parametric alternative to a one-way ANOVA. I used it to test, for each variable of interest, whether it has a statistically significant effect on the most recent recorded value. As above, I correct the p-values for multiple testing and only consider a result significant at a False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).
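As a reminder of the call pattern, kruskal.test() compares a numeric response across the levels of a grouping variable; a toy example on a built-in dataset (not part of this analysis):

# toy example: does plant weight differ between the three treatment groups?
kruskal.test(weight ~ group, data = PlantGrowth)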

for (i in 1:length(measures)) {
  m <- measures[i]
  
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))

  # the economy and income group labels start with a rank (e.g. "1. Developed region: G7"),
  # which is extracted here as a numeric grouping variable
  stats_test$economy_rank <- as.numeric(as.character(gsub("(^[1-7])(.*)", "\\1", stats_test$economy)))
  stats_test$income_grp_rank <- as.numeric(as.character(gsub("(^[1-5])(.*)", "\\1", stats_test$income_grp)))

  # Kruskal-Wallis test of the most recent value against each grouping variable
  k_economy_rank <- kruskal.test(log2(unlist.last_val.) ~ economy_rank, data = stats_test)
  k_income_rank <- kruskal.test(log2(unlist.last_val.) ~ income_grp_rank, data = stats_test)
  k_pop <- kruskal.test(log2(unlist.last_val.) ~ pop_est, data = stats_test)
  k_gdp <- kruskal.test(log2(unlist.last_val.) ~ gdp_md_est, data = stats_test)
  k_continent <- kruskal.test(log2(unlist.last_val.) ~ continent, data = stats_test)

  kruskal_last_val <- data.frame(
    group = c("economy (rank)", "income_grp (rank)", "pop_est", "gdp_md_est", "continent"),
    p.val = c(k_economy_rank$p.value, k_income_rank$p.value, k_pop$p.value, k_gdp$p.value, k_continent$p.value))

  # FDR correction within each statistic
  kruskal_last_val$p.adj <- p.adjust(kruskal_last_val$p.val, method = "fdr")

  # collect the adjusted p-values for all statistics in one wide table
  if (i == 1) {
    kruskal_last_val_df <- kruskal_last_val[, -2]
    colnames(kruskal_last_val_df)[2] <- paste(m)
  } else {
    pre <- kruskal_last_val[, -2]
    colnames(pre)[2] <- paste(m)
    
    kruskal_last_val_df <- left_join(kruskal_last_val_df, pre, by = "group")
  }
}
# measures with a significant effect of economy (row 1 of the adjusted p-value table)
eco <- t(kruskal_last_val_df[1, -1])
eco_sig <- eco[which(eco < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(eco_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = economy, fill = economy)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in economy",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

# measures with a significant effect of income group (row 2)
inc <- t(kruskal_last_val_df[2, -1])
inc_sig <- inc[which(inc < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(inc_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = income_grp, fill = income_grp)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in income group",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

# measures with a significant effect of continent (row 5)
cont <- t(kruskal_last_val_df[5, -1])
cont_sig <- cont[which(cont < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(cont_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = continent, fill = continent)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in continent",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

For population and GDP, there were no statistically significant measures.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 tidyr_0.6.1   dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9      digest_0.6.12    rprojroot_1.2    assertthat_0.1  
##  [5] plyr_1.8.4       grid_3.3.2       R6_2.2.0         gtable_0.2.0    
##  [9] DBI_0.5-1        backports_1.0.5  magrittr_1.5     scales_0.4.1    
## [13] evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0   rmarkdown_1.3   
## [17] labeling_0.3     tools_3.3.2      stringr_1.1.0    munsell_0.4.3   
## [21] yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5  knitr_1.15.1    
## [25] tibble_1.2

New features in World Gender Statistics app

2017-01-30 00:00:00 +0000

In my last post, I built a shiny app to explore World Gender Statistics.

To make it a bit nicer and more convenient, I added a few more features:

  • The drop-down menu for Years is now reactive, i.e. it only shows options with data (all NA years are removed)
  • You can click on any country on the map to get information about which country it is, its population size, income group and region
  • Below the world map, you can look at timelines for male vs female values for each country
  • The map is now in Mercator projection