Plotting trees from Random Forest models with ggraph

2017-03-16 00:00:00 +0000

Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models.

I am very much a visual person, so I try to plot as much of my results as possible because it helps me get a better feel for what is going on with my data.

A nice aspect of using tree-based machine learning, like Random Forest models, is that they are more easily interpreted than e.g. neural networks, because they are based on decision trees. So, when I am using such models, I like to plot the final decision trees (if they aren’t too large) to get a sense of which decisions underlie my predictions.

There are a few very convenient ways to plot the outcome if you are using the randomForest package directly, but I like to have as much control as possible over the layout, colors, labels, etc. And because I didn’t find a solution I liked for caret models, I wrote the following little function (below you will find information about how I built the model):

As input, it takes part of the output of model_rf <- caret::train(... "rf" ...): the final model model_rf$finalModel, whose forest component holds the individual trees. From these trees, you can specify which one to plot by its index.

library(dplyr)
library(ggraph)
library(igraph)

tree_func <- function(final_model, 
                      tree_num) {
  
  # get tree by index
  tree <- randomForest::getTree(final_model, 
                                k = tree_num, 
                                labelVar = TRUE) %>%
    tibble::rownames_to_column() %>%
    # set split points of leaf nodes to NA, so the 0s won't get plotted
    mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))
  
  # prepare data frame for graph
  graph_frame <- data.frame(from = rep(tree$rowname, 2),
                            to = c(tree$`left daughter`, tree$`right daughter`))
  
  # convert to graph and delete the "0" node (the non-existent daughter of leaf nodes) that we don't want to plot
  graph <- graph_from_data_frame(graph_frame) %>%
    delete_vertices("0")
  
  # set node labels
  V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
  V(graph)$leaf_label <- as.character(tree$prediction)
  V(graph)$split <- as.character(round(tree$`split point`, digits = 2))
  
  # plot
  plot <- ggraph(graph, 'dendrogram') + 
    theme_bw() +
    geom_edge_link() +
    geom_node_point() +
    geom_node_text(aes(label = node_label), na.rm = TRUE, repel = TRUE) +
    geom_node_label(aes(label = split), vjust = 2.5, na.rm = TRUE, fill = "white") +
    geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE, 
                    repel = TRUE, colour = "white", fontface = "bold", show.legend = FALSE) +
    theme(panel.grid.minor = element_blank(),
          panel.grid.major = element_blank(),
          panel.background = element_blank(),
          plot.background = element_rect(fill = "white"),
          panel.border = element_blank(),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          plot.title = element_text(size = 18))
  
  print(plot)
}


We can now plot, for example, the tree with the smallest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == min(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)


Or we can plot the tree with the largest number of nodes:

tree_num <- which(model_rf$finalModel$forest$ndbigtree == max(model_rf$finalModel$forest$ndbigtree))

tree_func(final_model = model_rf$finalModel, tree_num)
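
A note on ties: which() returns every tree index that matches the minimum or maximum node count. If you only want the first such tree, which.min() and which.max() are convenient shortcuts:

tree_num <- which.min(model_rf$finalModel$forest$ndbigtree)

tree_func(final_model = model_rf$finalModel, tree_num)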



Preparing the data and modeling

The data set I am using in these example analyses, is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The data set distinguishes two predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis

bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt", header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", 
                       "clump_thickness", 
                       "uniformity_of_cell_size", 
                       "uniformity_of_cell_shape", 
                       "marginal_adhesion", 
                       "single_epithelial_cell_size", 
                       "bare_nuclei", 
                       "bland_chromatin", 
                       "normal_nucleoli", 
                       "mitosis", 
                       "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

# separate into training and test data
library(caret)

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

# run model
set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.



sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] igraph_1.0.1       ggraph_1.0.0       ggplot2_2.2.1.9000
## [4] dplyr_0.5.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9         nloptr_1.0.4        plyr_1.8.4         
##  [4] viridis_0.3.4       iterators_1.0.8     tools_3.3.3        
##  [7] digest_0.6.12       lme4_1.1-12         evaluate_0.10      
## [10] tibble_1.2          gtable_0.2.0        nlme_3.1-131       
## [13] lattice_0.20-34     mgcv_1.8-17         Matrix_1.2-8       
## [16] foreach_1.4.3       DBI_0.6             ggrepel_0.6.5      
## [19] yaml_2.1.14         parallel_3.3.3      SparseM_1.76       
## [22] gridExtra_2.2.1     stringr_1.2.0       knitr_1.15.1       
## [25] MatrixModels_0.4-1  stats4_3.3.3        rprojroot_1.2      
## [28] grid_3.3.3          caret_6.0-73        nnet_7.3-12        
## [31] R6_2.2.0            rmarkdown_1.3       minqa_1.2.4        
## [34] udunits2_0.13       tweenr_0.1.5        deldir_0.1-12      
## [37] reshape2_1.4.2      car_2.1-4           magrittr_1.5       
## [40] units_0.4-2         backports_1.0.5     scales_0.4.1       
## [43] codetools_0.2-15    ModelMetrics_1.1.0  htmltools_0.3.5    
## [46] MASS_7.3-45         splines_3.3.3       randomForest_4.6-12
## [49] assertthat_0.1      pbkrtest_0.4-6      ggforce_0.1.1      
## [52] colorspace_1.3-2    labeling_0.3        quantreg_5.29      
## [55] stringi_1.1.2       lazyeval_0.2.0      munsell_0.4.3

Hyper-parameter Tuning with Grid Search for Deep Learning

2017-03-07 00:00:00 +0000

Last week I showed how to build a deep neural network with h2o and rsparkling. As we could see there, it is not trivial to optimize the hyper-parameters for modeling. Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy.

Keep in mind though, that hyper-parameter tuning can only improve the model so much without overfitting. If you can’t achieve sufficient accuracy, the input features might simply not be adequate for the predictions you are trying to model. It might be necessary to go back to the original features and try e.g. feature engineering methods.


Preparing Spark instance and plotting theme

Check out last week’s post for details on how I prepared the data.

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")
library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)
arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2])
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3])

splits <- h2o.splitFrame(arrhythmia_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))


We can use the h2o.grid() function to perform a Random Grid Search (RGS). We could also test all possible combinations of parameters with Cartesian Grid or exhaustive search, but RGS is much faster when we have a large number of possible combinations and usually finds sufficiently accurate models.
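
For reference, an exhaustive search only requires a different strategy in the search criteria. A minimal sketch (with a grid as large as the one below, a full Cartesian search would take far too long):

# exhaustive alternative to the random search used below
search_criteria_cartesian <- list(strategy = "Cartesian")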

For RGS, we first define a set of hyper-parameters and search criteria to fine-tune our models. Because there are many hyper-parameters, each with a range of possible values, we want to find an (ideally) optimal combination to maximize our model’s accuracy. We can also specify how long we want the grid search to run. Based on the results of each model tested in the grid, we can choose the one with the highest accuracy or best performance for the question at hand.

Activation Functions

  • Rectifier: the default activation function. It is the fastest and most versatile option, but it can lead to instability and tends to be less accurate than the alternatives.
  • Tanh: the hyperbolic tangent, a scaled and shifted variant of the sigmoid activation function. It takes values from -1 to 1, centered around 0, and needs more computational power than e.g. the Rectifier.
  • Maxout: an activation function that outputs the maximum of its inputs. It is computationally quite demanding but can produce high-accuracy models.

  • ...WithDropout: With a dropout variant, a random subset of the network is trained and the weights of all sub-networks are averaged. It works together with the parameter hidden_dropout_ratios, which controls the fraction of neurons that is randomly dropped in each hidden layer. Hidden dropout ratios are useful for preventing overfitting on learned features (see the sketch below).
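
As a minimal sketch of how the dropout variants and hidden_dropout_ratios go together (the parameter names are h2o’s, but the values here are made up for illustration):

hyper_params_dropout <- list(activation = "RectifierWithDropout",
                             hidden = list(c(100, 100, 100)),
                             # one dropout ratio per hidden layer (h2o's default is 0.5)
                             hidden_dropout_ratios = list(c(0.1, 0.2, 0.3)))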

Hidden layers

  • the hidden parameter is the most important hyper-parameter to set for deep neural networks, as it specifies how many hidden layers and how many nodes per hidden layer the model should have

L1 and L2 penalties

  • L1: drives most weights to zero, so that only strong weights survive
  • L2: prevents any single weight from getting too big (a sketch of both penalty terms follows below)
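
As a sketch of what these penalties add to the training loss (w standing for all network weights, l1 and l2 for the respective regularization strengths):

# illustration only: the regularization term that is added to the loss
penalty <- function(w, l1, l2) {
  l1 * sum(abs(w)) + l2 * sum(w^2)
}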

Rho and Epsilon

  • rho: the adaptive learning rate time-decay factor; similar to momentum, it relates to the memory of prior weight updates
  • epsilon: the adaptive learning rate smoothing factor; it helps prevent getting stuck in local optima

hyper_params <- list(
                     activation = c("Rectifier", "Maxout", "Tanh", "RectifierWithDropout", "MaxoutWithDropout", "TanhWithDropout"), 
                     hidden = list(c(5, 5, 5, 5, 5), c(10, 10, 10, 10), c(50, 50, 50), c(100, 100, 100)),
                     epochs = c(50, 100, 200),
                     l1 = c(0, 0.00001, 0.0001), 
                     l2 = c(0, 0.00001, 0.0001),
                     rate = c(0.01, 0.005, 0.001),
                     rate_annealing = c(1e-8, 1e-7, 1e-6),
                     rho = c(0.9, 0.95, 0.99, 0.999),
                     epsilon = c(1e-10, 1e-8, 1e-6, 1e-4),
                     momentum_start = c(0, 0.5),
                     momentum_stable = c(0.99, 0.5, 0),
                     input_dropout_ratio = c(0, 0.1, 0.2),
                     max_w2 = c(10, 100, 1000, 3.4028235e+38)
                     )


Early stopping criteria

  • stopping_metric: metric that we want to use as stopping criterion
  • stopping_tolerance and stopping_rounds: training stops when the stopping metric does not improve by at least the stopping tolerance proportion (e.g. 0.05 or 5%) for the number of consecutive rounds defined by stopping rounds.

search_criteria <- list(strategy = "RandomDiscrete", 
                        max_models = 100,
                        max_runtime_secs = 900,
                        stopping_tolerance = 0.001,
                        stopping_rounds = 15,
                        seed = 42)
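
If we wanted to stop on a specific metric rather than the default, we could extend these criteria, e.g. with the misclassification rate (a hypothetical addition that was not used in the run below):

search_criteria_metric <- c(search_criteria,
                            list(stopping_metric = "misclassification"))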

Now, we can train models with hyper-parameter combinations sampled from our grid according to the search criteria specified above.

dl_grid <- h2o.grid(algorithm = "deeplearning", 
                    x = features,
                    y = response,
                    weights_column = weights,
                    grid_id = "dl_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,                           
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
                    )

We now want to extract the best model from the grid model list. What makes a model the best depends on the question you want to address with it: in some cases, the model with the highest AUC is the most suitable, in others the one with the lowest mean squared error, etc. See last week’s post again for a more detailed discussion of performance metrics.

For demonstration purposes, I am choosing the best models from a range of possible quality criteria. We first use the h2o.getGrid() function to sort all models by the quality metric we choose (depending on the metric, you want it ordered by descending or ascending values). We can then get the model that’s the first in the list to work with further. This model’s hyper-parameters can be found with best_model@allparameters. You can now work with your best model as with any regular model in h2o (for an example see last week’s post).

# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err")

for (sort_by_1 in sort_options_1) {
  
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_1, decreasing = FALSE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  assign(paste0("best_model_", sort_by_1), best_model)
  
}


# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  
  grid <- h2o.getGrid("dl_grid", sort_by = sort_by_2, decreasing = TRUE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  assign(paste0("best_model_", sort_by_2), best_model)
  
}
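
Once extracted, each of these best models can be inspected like any other h2o model. For example, for the model that ranked best by AUC (created in the loop above):

# hyper-parameters of the best model by AUC
best_model_auc@allparameters

# and its performance on the validation set
h2o.performance(best_model_auc, valid = TRUE)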

Let’s plot the mean per class error for each best model:

library(tibble)

sort_options <- c("mean_per_class_error", "mse", "err", "auc", "precision", "accuracy", "recall", "specificity")

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  errors <- h2o.mean_per_class_error(best_model, train = TRUE, valid = TRUE, xval = TRUE)
 
  errors_df <- data.frame(model_id = best_model@model_id, sort = sort_by, errors) %>%
    rownames_to_column(var = "rowname")
  
  if (sort_by == "mean_per_class_error") {
    
    errors_df_comb <- errors_df
    
  } else {
    
    errors_df_comb <- rbind(errors_df_comb, errors_df)
    
  }
}

order <- subset(errors_df_comb, rowname == "xval") %>%
  arrange(errors)
  
errors_df_comb %>%
  mutate(sort = factor(sort, levels = order$sort)) %>%
  ggplot(aes(x = sort, y = errors, fill = model_id)) +
    facet_grid(rowname ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1), "cm")) +
    labs(x = "")


Model performance

The ultimate performance test for our model will be its prediction accuracy on the test set it hasn’t seen before. Here, I will compare the AUC and mean squared error for each best model from before. You could of course look at any other quality metric that is most appropriate for your model.

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  mse_auc_test <- data.frame(model_id = best_model@model_id,
                             sort = sort_by, 
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))
  
  if (sort_by == "mean_per_class_error") {
    
    mse_auc_test_comb <- mse_auc_test
    
  } else {
    
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
    
  }
}

library(tidyr)

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = sort, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")

I will then create a data frame with each best model’s predictions for all test samples. As in last week’s post, I want to compare the default predictions with stringent predictions.

for (sort_by in sort_options) {
  
  best_model <- get(paste0("best_model_", sort_by))
  
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)),
                                    sort = rep(sort_by, nrow(test)),
                                    class = as.vector(test$class), 
                                    actual = as.vector(test$diagnosis), 
                                    as.data.frame(h2o.predict(object = best_model, newdata = test)))
  
  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")
  
  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia", 
                                                  ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", 
                                         ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))
  
  if (sort_by == "mean_per_class_error") {
    
    finalRf_predictions_comb <- finalRf_predictions
    
  } else {
    
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
    
  }
}

To get a better overview, I am going to plot the predictions (default and stringent):

finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    facet_wrap(~ sort, ncol = 4) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

With predictions made by different models, we can see where each model performs best. This obviously corresponds with the quality metric we chose to define the best model. Stringent prediction thresholds reduced the number of false predictions but of course also the number of predictions made at all.

We can now decide which model is most relevant. Say we want this model to give recommendations for further validating a diagnosis of arrhythmia; then we might want to detect as many arrhythmia cases as possible, while accepting that a few healthy subjects get flagged as well. In that case, clinicians could use the model to flag patients with likely arrhythmia for further testing. Here, this means we want to minimize the number of arrhythmia cases that are classified wrongly, so we would prefer the mse, auc and specificity model (which is in fact the same model, chosen by all three quality metrics). The worst model for this purpose would be the recall model, which also did not make any predictions that passed the stringency threshold. The recall model would have been best if we wanted to be confident that healthy people get flagged correctly. However, in a real-life setting, you can imagine that it would be much worse if a sick patient got sent home because he was wrongly diagnosed as healthy than if a healthy person got submitted to further testing for possible arrhythmia.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.6.1         tibble_1.2          ggrepel_0.6.5      
## [4] ggplot2_2.2.1.9000  sparklyr_0.5.3-9002 dplyr_0.5.0        
## [7] h2o_3.10.3.6        rsparkling_0.1.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        RColorBrewer_1.1-2 plyr_1.8.4        
##  [4] bitops_1.0-6       base64enc_0.1-3    tools_3.3.2       
##  [7] digest_0.6.12      jsonlite_1.2       evaluate_0.10     
## [10] gtable_0.2.0       shiny_1.0.0        DBI_0.5-1         
## [13] rstudioapi_0.6     yaml_2.1.14        parallel_3.3.2    
## [16] withr_1.0.2        httr_1.2.1         stringr_1.2.0     
## [19] knitr_1.15.1       rappdirs_0.3.1     rprojroot_1.2     
## [22] grid_3.3.2         R6_2.2.0           rmarkdown_1.3     
## [25] reshape2_1.4.2     magrittr_1.5       backports_1.0.5   
## [28] scales_0.4.1       htmltools_0.3.5    assertthat_0.1    
## [31] mime_0.5           xtable_1.8-2       colorspace_1.3-2  
## [34] httpuv_1.3.3       labeling_0.3       config_0.2        
## [37] stringi_1.1.2      RCurl_1.95-4.8     lazyeval_0.2.0    
## [40] munsell_0.4.3

Building deep neural nets with h2o and rsparkling that predict arrhythmia of the heart

2017-02-27 00:00:00 +0000

Last week, I introduced how to run machine learning applications on Spark from within R, using the sparklyr package. This week, I am showing how to build feed-forward deep neural networks or multilayer perceptrons. The models in this example are built to classify ECG data as coming either from healthy hearts or from someone suffering from arrhythmia. I will show how to prepare a dataset for modeling, how to set weights and other modeling parameters, and finally how to evaluate model performance with the h2o package via rsparkling.


Deep learning with neural networks

Deep learning with neural networks is arguably one of the most rapidly growing applications of machine learning and AI today. Neural networks allow building complex models that consist of multiple hidden layers within artificial networks and are able to find non-linear patterns in unstructured data. Deep neural networks are usually feed-forward, which means that each layer feeds its output to subsequent layers, but recurrent or feed-back neural networks can also be built. Feed-forward neural networks are also called multilayer perceptrons (MLPs).
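
To make the feed-forward idea concrete, here is a toy forward pass through a single hidden layer in base R (purely illustrative and unrelated to the h2o models below):

set.seed(1)
x  <- matrix(rnorm(4), nrow = 1)     # one sample with 4 input features
W1 <- matrix(rnorm(4 * 3), nrow = 4) # weights connecting 4 inputs to 3 hidden nodes
b1 <- rnorm(3)                       # one bias per hidden node
h1 <- pmax(x %*% W1 + b1, 0)         # rectifier activation: max(0, xW + b)
h1                                   # this output would be fed to the next layer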


H2O and Sparkling Water

The R package h2o provides a convenient interface to H2O, an open-source machine learning and deep learning platform. H2O can be integrated with Apache Spark (Sparkling Water) and therefore allows implementing complex or big models in a fast and scalable manner. H2O provides distributed implementations of a wide range of common machine learning algorithms for classification, regression and deep learning.

Sparkling Water can be accessed from R with the rsparkling extension package to sparklyr and h2o. Check the documentation for rsparkling to find out which H2O, Sparkling Water and Spark versions are compatible.


Preparing the R session

First, we need to load the packages and connect to the Spark instance (for demonstration purposes, I am using a local instance).

library(rsparkling)
options(rsparkling.sparklingwater.version = "2.0.3")

library(h2o)
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")

I am also preparing my custom plotting theme.

library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}


Arrhythmia data

The data I am using to demonstrate the building of neural nets is the arrhythmia dataset from UC Irvine’s machine learning database. It contains 279 features from ECG heart rhythm diagnostics and one output column. I am not going to rename the feature columns because there are too many of them and the descriptions are too complex. Also, we don’t need to know specifically which features we are looking at to build the models. For a description of each feature, see https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names. The output column defines 16 classes: class 1 samples are from healthy ECGs, the remaining classes belong to different types of arrhythmia, with class 16 collecting all remaining arrhythmia cases that didn’t fit into distinct classes.

arrhythmia <- read.table("arrhythmia.data.txt", sep = ",")
arrhythmia[arrhythmia == "?"] <- NA

# making sure that all feature columns are numeric
arrhythmia[-280] <- lapply(arrhythmia[-280], as.character)
arrhythmia[-280] <- lapply(arrhythmia[-280], as.numeric)

# renaming output column and converting to factor
colnames(arrhythmia)[280] <- "class"
arrhythmia$class <- as.factor(arrhythmia$class)

As usual, I want to get acquainted with the data and explore its properties before building any model. So, I am first going to look at the distribution of classes and of healthy versus arrhythmia samples.

p1 <- ggplot(arrhythmia, aes(x = class)) +
  geom_bar(fill = "navy", alpha = 0.7) +
  my_theme()

Because I am interested in distinguishing healthy from arrhythmia ECGs, I am converting the output to binary format by combining all arrhythmia cases into one class.

arrhythmia$diagnosis <- ifelse(arrhythmia$class == 1, "healthy", "arrhythmia")
arrhythmia$diagnosis <- as.factor(arrhythmia$diagnosis)
p2 <- ggplot(arrhythmia, aes(x = diagnosis)) +
  geom_bar(fill = "navy", alpha = 0.7) +
  my_theme()
library(gridExtra)
library(grid)

grid.arrange(p1, p2, ncol = 2)

With binary classification, we have almost the same number of healthy and arrhythmia cases in our dataset.

I am also interested in how much the normal and arrhythmia cases cluster in a Principal Component Analysis (PCA). I am first preparing the PCA plotting function and then running it on the feature data.

library(pcaGoPromoter)

pca_func <- function(pcaOutput2, group_name){
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
          data.frame(groups = as.character(t),
                     ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                           centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                           level = 0.95),
                     stringsAsFactors = FALSE)))
        
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.5) + 
      labs(color = paste(group_name),
           fill = paste(group_name),
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance")) +
      my_theme()
    
    return(plot)
}

pcaOutput <- pca(t(arrhythmia[-c(280, 281)]), printDropped = FALSE, scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$scores)

pcaOutput2$groups <- arrhythmia$class
p1 <- pca_func(pcaOutput2, group_name = "class")

pcaOutput2$groups <- arrhythmia$diagnosis
p2 <- pca_func(pcaOutput2, group_name = "diagnosis")

grid.arrange(p1, p2, ncol = 2)

The PCA shows that there is a big overlap between healthy and arrhythmia samples, i.e. there do not seem to be major global differences across all features. The class that is most distinct from all others seems to be class 9. I want to give the arrhythmia cases that are very different from the rest a stronger weight in the neural network, so I define a weight column in which every sample outside the central PCA cluster gets a “2”; these samples will in effect be used twice in the model.

weights <- ifelse(pcaOutput2$PC1 < -5 & abs(pcaOutput2$PC2) > 10, 2, 1)

I also want to know how much variance there is within each feature.

library(matrixStats)

colvars <- data.frame(feature = colnames(arrhythmia[-c(280, 281)]),
                      variance = colVars(as.matrix(arrhythmia[-c(280, 281)])))

subset(colvars, variance > 50) %>%
  mutate(feature = factor(feature, levels = colnames(arrhythmia[-c(280, 281)]))) %>%
  ggplot(aes(x = feature, y = variance)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Features with low variance are less likely to contribute strongly to a differentiation between healthy and arrhythmia cases, so I am going to remove them. I am also adding the weights column:

arrhythmia_subset <- cbind(weights, arrhythmia[, c(281, 280, which(colvars$variance > 50))])


Working with rsparkling and h2o

Now that I have my final data frame for modeling, I copy it to the Spark instance. For working with h2o functions, the data needs to be converted from a Spark DataFrame to an H2O Frame. This is done with the as_h2o_frame() function.

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)

We can now access all functions from the h2o package that are built to work on H2O Frames. One such useful function is h2o.describe(). It is similar to base R’s summary() function but outputs many more descriptive measures for our data. To get a good overview of these measures, I am going to plot them.

library(tidyr) # for gathering
h2o.describe(arrhythmia_hf[, -1]) %>% # excluding the weights column
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", 
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% # separating them into facets makes them easier to see
  mutate(Label = factor(Label, levels = colnames(arrhythmia_hf[, -1]))) %>%
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
    geom_point(size = 4, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    facet_grid(group ~ ., scales = "free") +
    labs(x = "Feature",
         y = "Value",
         color = "")

I am also interested in the correlation between features and the output. We can use the h2o.cor() function to calculate the correlation matrix. It is again much easier to understand the data when we visualize it, so I am going to create another plot.

library(reshape2) # for melting

arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2]) # diagnosis is now a character column and we need to convert it back to a factor
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3]) # same for class

cor <- h2o.cor(arrhythmia_hf[, -c(1, 3)])
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) + 
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", 
         y = "")


Training, test and validation data

Now we can use the h2o.splitFrame() function to split the data into training, validation and test data. Here, I am using 70% for training and 15% each for validation and testing. We could also just split the data into two sets, training and test data, but when we have sufficient samples, it is a good idea to evaluate model performance on an independent test set on top of training with a validation set. Because we can easily overfit a model, we want to get an idea about how generalizable it is - and this we can only assess by looking at how well it works on previously unknown data.

I am also defining response, feature and weight column names now.

splits <- h2o.splitFrame(arrhythmia_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))
summary(train$diagnosis, exact_quantiles = TRUE)
##  diagnosis      
##  healthy   :163 
##  arrhythmia:155
summary(valid$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :43 
##  arrhythmia:25
summary(test$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :39 
##  arrhythmia:27

If we had more categorical features, we could use the h2o.interaction() function to define interaction terms, but since we only have numeric features here, we don’t need this.
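
For completeness, this is roughly what such a call would look like if we had a second factor column (purely hypothetical here, since "some_factor" does not exist in this data):

# hypothetical sketch - not run on this data
interactions <- h2o.interaction(arrhythmia_hf,
                                factors = c("class", "some_factor"),
                                pairwise = TRUE,
                                max_factors = 100,
                                min_occurrence = 1)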

We can also run a PCA on the training data, using the h2o.prcomp() function to calculate the singular value decomposition of the Gram matrix with the power method.

pca <- h2o.prcomp(training_frame = train,
           x = features,
           validation_frame = valid,
           transform = "NORMALIZE",
           k = 3,
           seed = 42)

pca
## Model Details:
## ==============
## 
## H2ODimReductionModel: pca
## Model ID:  PCA_model_R_1488113493291_1 
## Importance of components: 
##                             pc1      pc2      pc3
## Standard deviation     0.598791 0.516364 0.424850
## Proportion of Variance 0.162284 0.120680 0.081695
## Cumulative Proportion  0.162284 0.282965 0.364660
## 
## 
## H2ODimReductionMetrics: pca
## 
## No model metrics available for PCA
## H2ODimReductionMetrics: pca
## 
## No model metrics available for PCA
eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features

ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel() +
  my_theme()


Modeling

Now, we can build a deep neural network model. We can specify quite a few parameters, like

  • Cross-validation: Cross validation can tell us the training and validation errors for each model. The final model will be overwritten with the best model, if we don’t specify otherwise.

  • Adaptive learning rate: For deep learning with h2o, we by default use stochastic gradient descent optimization with an adaptive learning rate. The two corresponding parameters rho and epsilon help us find global (or near enough) optima.

  • Activation function: The activation function defines the node output relative to a given set of inputs. We want our activation function to be non-linear and continuously differentiable.

  • Hidden nodes: Defines the number of hidden layers and the number of nodes per layer.

  • Epochs: Increasing the number of epochs (one full training cycle on all training samples) can increase model performance, but we also run the risk of overfitting. To determine the optimal number of epochs, we need to use early stopping.

  • Early stopping: By default, early stopping is enabled. This means that training will be stopped when we reach a certain validation error to prevent overfitting.

Of course, you need quite a bit of experience and intuition to hit on a good combination of parameters. That’s why it usually makes sense to do a grid search for hyper-parameter tuning. Here, I want to focus on building and evaluating deep learning models, though. I will cover grid search in next week’s post.

dl_model <- h2o.deeplearning(x = features,
                             y = response,
                             weights_column = weights,
                             model_id = "dl_model",
                             training_frame = train,
                             validation_frame = valid,
                             nfolds = 15,                                   # 15-fold cross-validation
                             keep_cross_validation_fold_assignment = TRUE,
                             fold_assignment = "Stratified",
                             activation = "RectifierWithDropout",
                             score_each_iteration = TRUE,
                             hidden = c(200, 200, 200, 200, 200),           # 5 hidden layers, each of 200 neurons
                             epochs = 100,
                             variable_importances = TRUE,
                             export_weights_and_biases = TRUE,
                             seed = 42)

Because training can take a while, depending on how many samples, features, nodes and hidden layers you are training on, it is a good idea to save your model.

h2o.saveModel(dl_model, path = "dl_model")

We can then re-load the model again any time to check the model quality and make predictions on new data.

dl_model <- h2o.loadModel("dl_model/dl_model")


Model performance

We now want to know how our model performed on the validation data. The summary() function will give us a detailed overview of our model. I am not showing the output here because it is quite extensive.

summary(dl_model)

One performance metric we are usually interested in is the mean per class error for training and validation data.

h2o.mean_per_class_error(dl_model, train = TRUE, valid = TRUE, xval = TRUE)
##     train     valid      xval 
## 0.0304878 0.2232558 0.2257781

The confusion matrix tells us how many cases of each class were predicted correctly and where the predictions went wrong. Here, we see the errors in predictions on the validation data:

h2o.confusionMatrix(dl_model, valid = TRUE)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00425904880659062:
##            arrhythmia healthy    Error    Rate
## arrhythmia         15      10 0.400000  =10/25
## healthy             2      41 0.046512   =2/43
## Totals             17      51 0.176471  =12/68

We can also plot the classification error over all epochs or samples.

plot(dl_model,
     timestep = "epochs",
     metric = "classification_error")

plot(dl_model,
     timestep = "samples",
     metric = "classification_error")

Next to the classification error, we are usually interested in the logistic loss (negative log-likelihood or log loss). It is the negative logarithm of the likelihood the model assigns to the correct class, summed over all samples in the training or validation data. Simply put, the lower the loss, the better the model (if we ignore potential overfitting).

plot(dl_model,
     timestep = "epochs",
     metric = "logloss")
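
To unpack the definition, the binary log loss can be computed by hand from predicted probabilities and 0/1 labels (a base R sketch, not h2o code):

# binary log loss for 0/1 labels y and predicted probabilities p
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps) # clip so log() never sees 0 or 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

logloss(y = c(1, 0, 1), p = c(0.9, 0.2, 0.6))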

We can also plot the root mean squared error (RMSE), the square root of the mean squared error (MSE). The MSE is the average of the squared prediction errors and thus reflects the estimator’s variance and bias. The closer to zero, the better a model.

plot(dl_model,
     timestep = "epochs",
     metric = "rmse")

Next, we want to know the area under the curve (AUC). AUC is an important metric for measuring binary classification model performance. It gives the area under the ROC curve, i.e. the integral of the true positive rate plotted against the false positive rate. The closer to 1, the better a model.

h2o.auc(dl_model, train = TRUE)
## [1] 0.9864582
h2o.auc(dl_model, valid = TRUE)
## [1] 0.852093
h2o.auc(dl_model, xval = TRUE)
## [1] 0.8577735

The weights connecting two adjacent layers and the per-neuron biases that we told the model to export can be accessed with:

w <- h2o.weights(dl_model, matrix_id = 1)
b <- h2o.biases(dl_model, vector_id = 1)

Variable importance can be extracted as well (but keep in mind that variable importance in deep neural networks is difficult to assess and should be considered only a rough estimate).

h2o.varimp(dl_model)
## Variable Importances: 
##   variable relative_importance scaled_importance percentage
## 1     V169            1.000000          1.000000   0.013925
## 2     V239            0.987290          0.987290   0.013748
## 3     V103            0.913953          0.913953   0.012727
## 4      V15            0.907422          0.907422   0.012636
## 5      V91            0.904267          0.904267   0.012592
## 
## ---
##    variable relative_importance scaled_importance percentage
## 85      V88            0.717914          0.717914   0.009997
## 86     V269            0.715800          0.715800   0.009968
## 87     V137            0.712923          0.712923   0.009928
## 88     V168            0.711402          0.711402   0.009906
## 89      V33            0.707356          0.707356   0.009850
## 90     V219            0.696149          0.696149   0.009694
#h2o.varimp_plot(dl_model)


Test data

Now that we have a good idea about model performance on validation data, we want to know how it performed on unseen test data. A good model should find an optimal balance between accuracy on training and test data. A model that has 0% error on the training data but 40% error on the test data is in effect useless. It overfit on the training data and is thus not able to generalize to unknown data.

perf <- h2o.performance(dl_model, test)
perf
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.2154912
## RMSE:  0.4642103
## LogLoss:  1.378809
## Mean Per-Class Error:  0.2250712
## AUC:  0.7796771
## Gini:  0.5593542
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##            arrhythmia healthy    Error      Rate
## arrhythmia         19       8 0.296296    =8/27
## healthy             6      33 0.153846    =6/39
## Totals             25      41 0.212121   =14/66
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.132564 0.825000  40
## 2                       max f2  0.000002 0.894495  61
## 3                 max f0point5  0.132564 0.812808  40
## 4                 max accuracy  0.132564 0.787879  40
## 5                max precision  0.982938 1.000000   0
## 6                   max recall  0.000002 1.000000  61
## 7              max specificity  0.982938 1.000000   0
## 8             max absolute_mcc  0.132564 0.557317  40
## 9   max min_per_class_accuracy  0.837616 0.743590  34
## 10 max mean_per_class_accuracy  0.132564 0.774929  40
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Plotting the test performance object gives us the ROC curve, which shows approximately how good the predictions are.

plot(perf)

We also want to know the log loss, MSE and AUC values, as well as other model metrics for the test data:

h2o.logloss(perf)
## [1] 1.378809
h2o.mse(perf)
## [1] 0.2154912
h2o.auc(perf)
## [1] 0.7796771
head(h2o.metric(perf))
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  0.982938 0.050000 0.031847 0.116279 0.424242  1.000000 0.025641
## 2  0.982811 0.097561 0.063291 0.212766 0.439394  1.000000 0.051282
## 3  0.982460 0.095238 0.062893 0.196078 0.424242  0.666667 0.051282
## 4  0.982256 0.139535 0.093750 0.272727 0.439394  0.750000 0.076923
## 5  0.982215 0.181818 0.124224 0.338983 0.454545  0.800000 0.102564
## 6  0.981170 0.177778 0.123457 0.317460 0.439394  0.666667 0.102564
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.103203               0.025641                0.512821
## 2    1.000000     0.147087               0.051282                0.525641
## 3    0.962963     0.033624               0.051282                0.507123
## 4    0.962963     0.082188               0.076923                0.519943
## 5    0.962963     0.121754               0.102564                0.532764
## 6    0.925926     0.048725               0.102564                0.514245
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1  27  38   0   1 1.000000 0.974359 0.000000 0.025641   0
## 2  27  37   0   2 1.000000 0.948718 0.000000 0.051282   1
## 3  26  37   1   2 0.962963 0.948718 0.037037 0.051282   2
## 4  26  36   1   3 0.962963 0.923077 0.037037 0.076923   3
## 5  26  35   1   4 0.962963 0.897436 0.037037 0.102564   4
## 6  25  35   2   4 0.925926 0.897436 0.074074 0.102564   5

The confusion matrix alone can be seen with the h2o.confusionMatrix() function, but it is also part of the performance summary.

h2o.confusionMatrix(dl_model, test)

The final predictions with probabilities can be extracted with the h2o.predict() function. Beware, though, that the number of correct and wrong classifications can differ slightly from the confusion matrix above. Here, I combine the predictions with the actual test diagnoses and classes into a data frame. For plotting, I also want a column that tells me whether the predictions were correct. By default, a prediction probability above 0.5 gets scored as a prediction for the respective category. I often find it makes sense to be more stringent, though, and set a higher threshold. Therefore, I am creating another column with stringent predictions, where I only count predictions made with more than 80% probability. Everything below that threshold gets scored as “uncertain”. For these stringent predictions, I am also creating a column that tells me whether they were accurate.

finalRf_predictions <- data.frame(class = as.vector(test$class), actual = as.vector(test$diagnosis), as.data.frame(h2o.predict(object = dl_model, newdata = test)))

finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia", 
                                                ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", 
                                       ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))
finalRf_predictions %>%
  group_by(actual, predict) %>%
  summarise(n = n())
## Source: local data frame [4 x 3]
## Groups: actual [?]
## 
##       actual    predict     n
##       <fctr>     <fctr> <int>
## 1 arrhythmia arrhythmia    16
## 2 arrhythmia    healthy    11
## 3    healthy arrhythmia     6
## 4    healthy    healthy    33
finalRf_predictions %>%
  group_by(actual, predict_stringent) %>%
  summarise(n = n())
## Source: local data frame [6 x 3]
## Groups: actual [?]
## 
##       actual predict_stringent     n
##       <fctr>             <chr> <int>
## 1 arrhythmia        arrhythmia    19
## 2 arrhythmia           healthy     6
## 3 arrhythmia         uncertain     2
## 4    healthy        arrhythmia     8
## 5    healthy           healthy    29
## 6    healthy         uncertain     2

To get a better overview, I am going to plot the predictions (default and stringent):

p1 <- finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

p2 <- finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

grid.arrange(p1, p2, ncol = 2)

Being more stringent with the prediction threshold slightly reduced the number of errors but not by much.

I also want to know whether there are certain classes of arrhythmia that are especially prone to being misclassified:

p1 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Default predictions",
         x = "predicted to be")

p2 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict_stringent, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Stringent predictions",
         x = "predicted to be")

grid.arrange(p1, p2, ncol = 2)

There are no obvious biases towards some classes but with the small number of samples for most classes, this is difficult to assess.


Final conclusions: How useful is the model?

Most samples were classified correctly, but the total error was not particularly good. Moreover, when evaluating the usefulness of a specific model, we need to keep in mind what we want to achieve with it and which questions we want to answer. If we wanted to deploy this model in a clinical setting, it should assist with diagnosing patients. So, we need to think about what the consequences of wrong classifications would be. Would it be better to optimize for high sensitivity, in this example as many arrhythmia cases as possible get detected - with the drawback that we probably also diagnose a few healthy people? Or do we want to maximize precision, meaning that we could be confident that a patient who got predicted to have arrhythmia does indeed have it, while accepting that a few arrhythmia cases would remain undiagnosed? When we consider stringent predictions, this model correctly classified 19 out of 27 arrhythmia cases, but 6 were misdiagnosed. This would mean that some patients who were actually sick, wouldn’t have gotten the correct treatment (if decided solely based on this model). For real-life application, this is obviously not sufficient!

Next week, I’ll be trying to improve the model by doing a grid search for hyper-parameter tuning.

So, stay tuned… (sorry, couldn’t resist ;-))


If you are interested in more machine learning posts, check out the category listing for machine_learning.



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.2       tidyr_0.6.1          matrixStats_0.51.0  
##  [4] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
##  [7] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [10] ellipse_0.3-8        gridExtra_2.2.1      ggrepel_0.6.5       
## [13] ggplot2_2.2.1        sparklyr_0.5.2       dplyr_0.5.0         
## [16] h2o_3.10.3.6         rsparkling_0.1.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9          RColorBrewer_1.1-2   plyr_1.8.4          
##  [4] zlibbioc_1.20.0      bitops_1.0-6         base64enc_0.1-3     
##  [7] tools_3.3.2          digest_0.6.12        memoise_1.0.0       
## [10] RSQLite_1.1-2        jsonlite_1.2         evaluate_0.10       
## [13] tibble_1.2           gtable_0.2.0         DBI_0.5-1           
## [16] yaml_2.1.14          withr_1.0.2          httr_1.2.1          
## [19] stringr_1.2.0        knitr_1.15.1         rappdirs_0.3.1      
## [22] rprojroot_1.2        Biobase_2.34.0       R6_2.2.0            
## [25] AnnotationDbi_1.36.0 rmarkdown_1.3        magrittr_1.5        
## [28] backports_1.0.5      scales_0.4.1         htmltools_0.3.5     
## [31] assertthat_0.1       colorspace_1.3-2     labeling_0.3        
## [34] config_0.2           stringi_1.1.2        RCurl_1.95-4.8      
## [37] lazyeval_0.2.0       munsell_0.4.3

Predicting food preferences with sparklyr (machine learning)

2017-02-19 00:00:00 +0000

This week I want to show how to run machine learning applications on a Spark cluster. I am using the sparklyr package, which provides a handy interface to access Apache Spark functionalities via R.

The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines.


Apache Spark

Apache Spark™ can be used to perform large-scale data analysis workflows by taking advantage of its parallel cluster-computing setup. Because machine learning processes are iterative, running models on parallel clusters can vastly increase the speed of training. Obviously, Spark’s power really comes to bear when it is dispatched to external clusters, but for demonstration purposes, I am running the demo on a local Spark instance.

MLlib

Spark’s distributed machine learning library MLlib sits on top of the Spark core framework. It implements many popular machine learning algorithms, plus many helper functions for data preprocessing. With sparklyr you can easily access MLlib: you can work with a number of machine learning algorithms and with functions for manipulating features and Spark DataFrames. Additionally, you can perform SQL queries. sparklyr also provides a dplyr backend, making it especially convenient for handling data.
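
As a small first taste of the MLlib interface, here is a sketch that assumes the Spark connection sc created below; ml_linear_regression() is one of sparklyr’s MLlib wrappers and iris is base R’s example data:

# copy a local data frame to Spark and fit an MLlib model on it
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
fit <- ml_linear_regression(iris_tbl,
                            response = "Petal_Length",
                            features = c("Sepal_Length", "Sepal_Width"))
summary(fit)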

If you don’t have Spark installed locally, run:

library(sparklyr)
spark_install(version = "2.0.0")

Now we can connect to a local Spark instance:

library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
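To illustrate the dplyr and SQL access mentioned above, here is a quick sketch with a toy table (the iris copy and the table name are my own examples, not part of this analysis):

library(dplyr)
library(DBI)

# copy a toy dataset to Spark (sparklyr converts the dots
# in the iris column names to underscores)
iris_tbl <- copy_to(sc, iris, "iris_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL behind the scenes
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length))

# ... or we can query the same table with raw SQL
dbGetQuery(sc, "SELECT Species, COUNT(*) AS n FROM iris_spark GROUP BY Species")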


Preparations

Before I start with the analysis, I am preparing my custom ggplot2 theme and loading the packages tidyr (for gathering data for plotting), dplyr (for data manipulation) and ggrepel (for non-overlapping text labels in plots).

library(tidyr)
library(ggplot2)
library(ggrepel)
library(dplyr)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "black"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}


The Data

Of course, the power of Spark lies in speeding up operations on large datasets. But because that’s not very handy for demonstration, I am here working with a small dataset: the raw data behind The FiveThirtyEight International Food Association’s 2014 World Cup.

This dataset is part of the fivethirtyeight package and provides scores for how each person rated their preference of the dishes from several countries. The following categories could be chosen:

  • 5: I love this country’s traditional cuisine. I think it’s one of the best in the world.
  • 4: I like this country’s traditional cuisine. I think it’s considerably above average.
  • 3: I’m OK with this country’s traditional cuisine. I think it’s about average.
  • 2: I dislike this country’s traditional cuisine. I think it’s considerably below average.
  • 1: I hate this country’s traditional cuisine. I think it’s one of the worst in the world.
  • N/A: I’m unfamiliar with this country’s traditional cuisine.

Because I think that whether or not someone knows a country’s cuisine is in itself information, I recoded the NAs to 0.

library(fivethirtyeight)

food_world_cup[food_world_cup == "N/A"] <- NA
food_world_cup[, 9:48][is.na(food_world_cup[, 9:48])] <- 0
food_world_cup$gender <- as.factor(food_world_cup$gender)
food_world_cup$location <- as.factor(food_world_cup$location)

The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines, general knowledge and interest in different cuisines, age, gender, income, education level and/ or location.

Before I do any machine learning, however, I want to get to know the data. First, I am calculating the percentages for each preference category and plot them with a pie chart that is facetted by country.

# calculating percentages per category and country
percentages <- food_world_cup %>%
  select(algeria:vietnam) %>%
  gather(x, y) %>%
  group_by(x, y) %>%
  summarise(n = n()) %>%
  mutate(Percent = round(n / sum(n) * 100, digits = 2))

# rename countries & plot
percentages %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = "", y = Percent, fill = y)) + 
    geom_bar(width = 1, stat = "identity") + 
    theme_minimal() +
    coord_polar("y", start = 0) +
    facet_wrap(~ x_2, ncol = 8) +
    scale_fill_brewer(palette = "Set3") +
    labs(fill = "")

As we can already see from the distributions, there are big differences between countries. Some countries’ cuisines are very well known and liked (e.g. Italy, Mexico and the United States), while others are only known to a handful of people (e.g. Algeria, Croatia, Ghana, etc.).


Imputing missing values

In the demographic information, there are a few missing values, which I want to impute:

library(mice)

dataset_impute <- mice(food_world_cup[, -c(1, 2)],  print = FALSE)
food_world_cup <- cbind(food_world_cup[, 2, drop = FALSE], mice::complete(dataset_impute, 1))


Transforming preference values

Strictly speaking, the preference data is categorical. But because using the values as factor levels would make the models much more complex and calculation-intensive, I am converting the factors to numbers and transforming them by dividing by the mean of the non-zero values for each country.
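For a single country, the transformation works like this (toy numbers):

prefs <- c(0, 0, 3, 5)
prefs / mean(prefs[prefs > 0])  # 0.00 0.00 0.75 1.25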

food_world_cup[8:47] <- lapply(food_world_cup[8:47], as.numeric)

countries <- paste(colnames(food_world_cup)[-c(1:7)])

for (response in countries) {
  food_world_cup[paste(response, "trans", sep = "_")] <- food_world_cup[response] / mean(food_world_cup[food_world_cup[response] > 0, response])
}

As we can see by plotting the distribution of the transformed values, they are far from normally distributed. Moreover, we still see big differences between well known and less well known countries - this means that the data is unbalanced.

food_world_cup %>%
  gather(x, y, algeria_trans:vietnam_trans) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(y)) +
    geom_density(fill = "navy", alpha = 0.7) +
    my_theme() + 
    facet_wrap(~ x_2, ncol = 8) +
    labs(x = "transformed preference")


Most liked cuisines and gender biases

Then, I wanted to know which countries were the most liked and whether we see a gender bias in the preference for some countries’ cuisines. The first plot shows the sum of preference values for each country, separated by male and female answers.

food_world_cup_gather <- food_world_cup %>%
  collect %>%
  gather(country, value, algeria:vietnam)
                                 
food_world_cup_gather$value <- as.numeric(food_world_cup_gather$value)
food_world_cup_gather$country <- as.factor(food_world_cup_gather$country)
food_world_cup_gather <- food_world_cup_gather %>%
  mutate(x_2 = gsub("_", " ", country)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))

order <- aggregate(food_world_cup_gather$value, by = list(food_world_cup_gather$x_2), FUN = sum)

food_world_cup_gather %>%
  mutate(x_2 = factor(x_2, levels = order$Group.1[order(order$x, decreasing = TRUE)])) %>%
  ggplot(aes(x = x_2, y = value, fill = gender)) +
    geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "Gender",
         x = "",
         y = "sum of preferences")

Cuisines from Italy, the United States and Mexico were the most liked, while those from Ghana, Cameroon and the Ivory Coast were the least liked (that they were also less well known plays into it, of course). No country shows an obvious gender bias, though.

To dig a bit deeper into gender differences, I am calculating the differences between mean preference of males and females for each country. Here, I am using the country list that I defined earlier when transforming the values.

Because I want to pass the country list to dplyr’s functions, I need to use standard evaluation (SE). By default, dplyr uses non-standard evaluation (NSE), i.e. variable names are given without quotation marks. This makes it possible to easily convert between R and SQL code. But each dplyr function also has a version for use with SE: these have an underscore (“_”) appended to the function name, here e.g. mutate_each_().
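As a minimal illustration of the difference (using a toy data frame of my own):

library(dplyr)

df <- data.frame(a = 1:3, b = 4:6)

# NSE: columns are referred to by bare names (here: all of them)
summarise_each(df, funs(mean))

# SE: column names can be passed as a character vector
cols <- c("a", "b")
summarise_each_(df, funs(mean), cols)

The same mechanism is used with mutate_each_() and summarise_each_() in the calculation below.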

food_world_cup %>%
  collect %>%
  mutate_each_(funs(as.numeric), countries) %>%
  group_by(gender) %>%
  summarise_each_(funs(mean), countries) %>%
  summarise_each_(funs(diff), countries) %>%
  gather(x, y) %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = y)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "",
         y = "difference\nbetween gender")

All gender differences are very close to zero, so it seems that men and women generally have similar food preferences.


Spark

Now that I’m somewhat familiar with the data, I copy it to the Spark instance:

food_world_cup <- copy_to(sc, food_world_cup)


Principal Component Analysis (PCA)

One of Spark’s machine learning functions is ml_pca() for principal component analysis. I am using it to get an idea about where the countries fall on a 2-dimensional plane spanned by the first two principal components.

pca <- food_world_cup %>%
  mutate_each_(funs(as.numeric), countries) %>%
  ml_pca(features = paste(colnames(food_world_cup)[-c(1:47)]))

library(tibble)
as.data.frame(pca$components) %>%
  rownames_to_column(var = "labels") %>%
  mutate(x_2 = gsub("_trans", "", labels)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = PC1, y = PC2, color = x_2, label = x_2)) + 
    geom_point(size = 2, alpha = 0.6) +
    geom_text_repel() +
    labs(x = paste0("PC1: ", round(pca$explained.variance[1], digits = 2) * 100, "% variance"),
         y = paste0("PC2: ", round(pca$explained.variance[2], digits = 2) * 100, "% variance")) +
    my_theme() + 
    guides(fill = FALSE, color = FALSE)

The least well known countries cluster on the top right, while the most liked are on the bottom right. PC2 seems to be mainly driven by how many 0s there are in the data.


Preparing the data for machine learning

First, I need to convert the factor strings of the non-country features into indexes. To do so, I am using the ft_string_indexer() function. For other feature transformation functions, check out sparklyr’s website.

food_world_cup <- tbl(sc, "food_world_cup") %>%
  ft_string_indexer(input_col = "interest", output_col = "interest_idx") %>%
  ft_string_indexer(input_col = "gender", output_col = "gender_idx") %>%
  ft_string_indexer(input_col = "age", output_col = "age_idx") %>%
  ft_string_indexer(input_col = "household_income", output_col = "household_income_idx") %>%
  ft_string_indexer(input_col = "education", output_col = "education_idx") %>%
  ft_string_indexer(input_col = "location", output_col = "location_idx") %>%
  ft_string_indexer(input_col = "knowledge", output_col = "knowledge_idx")


Now I can divide the data into training and test sets using the sdf_partition() function.

partitions <- food_world_cup %>%
  sdf_partition(training = 0.75, test = 0.25, seed = 753)


Modeling

Now, I run the Random Forest algorithm to predict each country’s preference based on all other countries’ preferences and the demographic information. Check back with sparklyr’s website to find out about other machine learning algorithms you can use.

For each country (as response variable), I am defining the features as all other countries’ transformed values and the indexed factor variables.

Then, I am filtering out all data rows where the response variable was 0. When I left the 0s in the data, the countries with many 0s introduced a very strong bias to the prediction. Here, I needed to use standard evaluation again, this time with lazyeval’s interp() function and a formula.
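To show what interp() does here, a small standalone sketch (with “spain” as an example response):

library(lazyeval)

response <- "spain"
# builds the formula ~spain > 0 from the string "spain"
interp(~ var > 0, var = as.name(response))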

I am using the ml_random_forest() function to run classification models. Initially, I run a model on all features, then extract the 10 features with the highest importance and re-run the model on this subset of features. This improved the prediction accuracy compared to the all-feature models.

The sdf_predict() function is used to predict the classes of the test set. To obtain the quality metrics F1, weighted precision and weighted recall, I need to copy the prediction table to the Spark instance and run the ml_classification_eval() function.

Finally, I am combining the output tables from all countries’ models for comparison.

library(lazyeval)

for (response in countries) {

  features <- colnames(partitions$training)[-grep(response, colnames(partitions$training))]
  features <- features[grep("_trans|_idx", features)]

  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
  
  feature_imp <- ml_tree_feature_importance(sc, fit)
  
  features <- as.character(feature_imp[1:10, 2])
  
  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
  
  partitions$test <- partitions$test %>%
    filter_(interp(~ var > 0, var = as.name(response)))
  
  pred <- sdf_predict(fit, partitions$test) %>%
    collect
  
  pred_2 <- as.data.frame(table(pred[[response]], pred$prediction))
  pred_2$response <- response
  
  pred_sc <- select(pred, -rawPrediction, -probability)
  pred_sc <- copy_to(sc, pred_sc, overwrite = TRUE)
  
  feature_imp$response <- response
  
  f1 <- ml_classification_eval(pred_sc, response, "prediction", metric = "f1")
  wP <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedPrecision")
  wR <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedRecall")
  
  ml_eval <- data.frame(response = response,
                        f1 = f1,
                        weightedPrecision = wP,
                        weightedRecall = wR)
  
  if (response == "algeria") {
    feature_imp_df <- feature_imp
    ml_eval_df <- ml_eval
    pred_df <- pred_2
  } else {
    feature_imp_df <- rbind(feature_imp_df, feature_imp)
    ml_eval_df <- rbind(ml_eval_df, ml_eval)
    pred_df <- rbind(pred_df, pred_2)
  }
}


Model evaluation

Now, I can compare the prediction accuracies for each country’s model by plotting the F1, weighted precision and weighted recall values.

  • Precision

In classification models, precision gives the proportion of positive classifications that are correct, i.e. the “true positives” (e.g. how many of the samples that were classified as “5” were actually a “5”). If we have a high precision, this means that most of the samples our model assigned to a class do belong to that class.

  • Recall

Recall, or sensitivity, gives the proportion of all true “5” samples that were correctly classified. If we have a high recall, this means that our model detected most of the samples of each class.

  • F1 score

The F1 score is the harmonic mean of precision and recall.

Let’s say we had 10 “5s” in the test data. The model classified 8 samples as “5”, 6 of which were correct. In this example, we have 2 false positives and 4 false negatives. This means we would have a precision of 6/8 and a recall of 6/10.
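Translating this small example into code (the counts are the ones from the example above):

tp <- 6  # correctly classified "5s"
fp <- 2  # samples wrongly classified as "5"
fn <- 4  # true "5s" that were missed

precision <- tp / (tp + fp)  # 0.75
recall <- tp / (tp + fn)     # 0.6
2 * precision * recall / (precision + recall)  # F1: ~0.67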

results <- ml_eval_df %>%
  mutate(x_2 = gsub("_", " ", response)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))
  
order <- results$x_2[order(results$f1, decreasing = TRUE)]

gather(results, x, y, f1:weightedRecall) %>%
  mutate(x_2 = factor(x_2, levels = order)) %>%
  ggplot(aes(x = x_2, y = y, fill = x)) +
    geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "", color = "", x = "", y = "value")

The plot above shows that the model predicting the preference for Spanish dishes had the highest accuracy, while the models for Italy, India and Ireland had the lowest accuracy.


To see which features were used in the more and less successful models, I am plotting the values of the 10 final features for classifying Spanish, Greek and Italian food preference categories 1 - 5 in the original dataset.

feats <- feature_imp_df %>%
  filter(response == "spain") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("spain", as.character(feats$feature))) %>%
  gather(x, y, -spain) %>%
  filter(spain > 0) %>%
  group_by(spain, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = spain, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Spain")

feats <- feature_imp_df %>%
  filter(response == "greece") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("greece", as.character(feats$feature))) %>%
  gather(x, y, -greece) %>%
  filter(greece > 0) %>%
  group_by(greece, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = greece, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Greece")

feats <- feature_imp_df %>%
  filter(response == "italy") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("italy", as.character(feats$feature))) %>%
  gather(x, y, -italy) %>%
  filter(italy > 0) %>%
  group_by(italy, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = italy, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Italy")


Conclusions

This dataset was a bit difficult because the preferences were highly unbalanced, i.e. the distributions of preferences were very different between countries. Some countries’ cuisines were largely unknown, while others’ were well known and generally popular. If I kept the unknown class (0) in the prediction models, countries with many 0s jumped up in prediction accuracy. But this was mainly due to the fact that the chance of correctly classifying a sample as 0 was disproportionately high (leading to high precision). Removing the 0s while working on raw preference values showed the same effect as before, but now countries with a majority of 5s (e.g. Italy) were overly “accurate”. Transforming the data and removing the 0s led to a more evenly distributed accuracy between well known and less well known countries. But by removing them, the number of samples used for modeling varied strongly between countries, introducing another type of bias.

It was not possible to avoid bias entirely with this dataset. But I wanted to show it as an example nonetheless. Usually, machine learning examples show datasets where the models worked very well, leaving the reader in awe of the powers of machine learning. And while there are certainly powerful and impressive prediction models, real-life data is not always as simple. As can be seen with this dataset, it’s not always straightforward to build a good classification model, even though you’d intuitively assume that it should be easy to predict food preferences based on which other cuisines someone likes…

The model could potentially be improved by feature engineering, modeling the preferences as factors and/ or trying different algorithms, but I’ll leave this for another time.


Next week, I’ll be looking into how to use the h2o package with Spark using rsparkling.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tibble_1.2            fivethirtyeight_0.1.0 dplyr_0.5.0          
## [4] ggrepel_0.6.5         ggplot2_2.2.1         tidyr_0.6.1          
## [7] sparklyr_0.5.1       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        knitr_1.15.1       magrittr_1.5      
##  [4] rappdirs_0.3.1     munsell_0.4.3      colorspace_1.3-2  
##  [7] R6_2.2.0           plyr_1.8.4         stringr_1.1.0     
## [10] httr_1.2.1         tools_3.3.2        parallel_3.3.2    
## [13] grid_3.3.2         gtable_0.2.0       config_0.2        
## [16] DBI_0.5-1          withr_1.0.2        htmltools_0.3.5   
## [19] lazyeval_0.2.0     yaml_2.1.14        assertthat_0.1    
## [22] rprojroot_1.2      digest_0.6.12      RColorBrewer_1.1-2
## [25] base64enc_0.1-3    evaluate_0.10      rmarkdown_1.3     
## [28] labeling_0.3       stringi_1.1.2      scales_0.4.1      
## [31] backports_1.0.5    jsonlite_1.2

Conditional ggplot2 geoms in functions (QTL plots)

2017-02-12 00:00:00 +0000

When running an analysis, I usually combine functions from multiple packages. Most of these packages come with their own plotting functions. And while they are certainly convenient for getting a quick glance at the data or the output, they all have their own style. If I want to prepare a report, proposal or paper, though, I want all my plots to come from a single mold so that they give a consistent feel to the story I want to tell with my data.

I have grown very fond of ggplot2 for producing plots, because of its tidy grammar, flexibility and easy customization. Here, I want to show an example of my own ggplot2 function to produce QTL plots for Karl Broman’s qtl package.

Specifically, I want to show how to incorporate conditional geoms when using ggplot2 in a function call.
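The basic pattern looks like this (a minimal sketch with toy data and a made-up show_points flag):

library(ggplot2)

df <- data.frame(x = 1:10, y = rnorm(10))
show_points <- TRUE

# each { } block evaluates either to a geom or to NULL;
# adding NULL to a ggplot object leaves the plot unchanged
ggplot(df, aes(x, y)) + {
  if (show_points) geom_point()
} + geom_line()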

I will show how my function can be used with the output from qtl’s scanone() function by following the examples from the package tutorial. The tutorial does an excellent job describing the workflow, so I am not going to explain it here.

library(qtl)


For defining the function and preparing the data for plotting, I am working with ggplot2, ggrepel (for adding text labels that don’t overlap), tidyr and dplyr. This is my function code:

library(ggplot2)
library(ggrepel)
library(tidyr)
library(dplyr)

qtl_plot <- function(input,              # data frame input from scanone
                     mult.pheno = FALSE, # multiple phenotypes?
                     model = "normal",   # model used in scanone
                     chrs = NA,          # chromosomes to display
                     lod = NA,           # LOD threshold
                     rug = FALSE,        # plot marker positions as rug?
                     ncol = NA,          # number of columns for facetting
                     labels = NA         # optional dataframe to plot QTL labels
                     ) {
                     
                  # if we have multiple phenotypes, gather the pheno columns;
                  # for a single-phenotype 2-part model, gather the three LOD columns
                  if (mult.pheno) {
                    input <- gather(input, group, lod, grep("pheno", colnames(input)))
                  } else if (model == "2part") {
                    input <- gather(input, method, lod, lod.p.mu:lod.mu)
                  }
    
                  # if not all chromosomes should be displayed, subset input
                  if (!is.na(chrs)[1]) {
                    input <- input[as.character(input$chr) %in% chrs, ]
                  }

                  # if there is no column named "lod", take the lod-like column instead
                  if (!any(colnames(input) == "lod")) {
                    input$lod <- input[, grep("lod", colnames(input))]
                  }

                  # if no number of columns for facetting is defined, plot all in one row
                  if (is.na(ncol)) {
                    ncol <- length(unique(input$chr))
                  }

                  # if labels are set and there is no name column, set from rownames
                  if (!is.na(labels)[1]) {
                    if (is.null(labels$name)) {
                      labels$name <- rownames(labels)
                    }
                  }

                  # plot input data frame position and LOD score
                  plot <- ggplot(input, aes(x = pos, y = lod)) + {

                    # if LOD threshold is given, plot as horizontal line
                    if (!is.na(lod)[1] & length(lod) == 1) geom_hline(yintercept = lod, linetype = "dashed")
                  } + {

                    if (!is.na(lod)[1] & length(lod) > 1) geom_hline(data = lod, aes(yintercept = lod, linetype = group))
                  } + {

                    # plot rug on bottom, if TRUE
                    if (rug) geom_rug(size = 0.1, sides = "b")
                  } + {

                    # if input has column method but not group, plot line and color by method
                    if (!is.null(input$method) & is.null(input$group)) geom_line(aes(color = method), size = 2, alpha = 0.6)
                  } + {

                    # if input has column group but not method, plot line and color by group
                    if (!is.null(input$group) & is.null(input$method)) geom_line(aes(color = group), size = 2, alpha = 0.6)
                  } + {

                    # if input has columns method and group, plot line and color by method & linetype by group
                    if (!is.null(input$group) & !is.null(input$method)) geom_line(aes(color = method, linetype = group), size = 2, alpha = 0.6)
                  } + {

                    # set linetype, if input has columns method and group
                    if (!is.null(input$group) & !is.null(input$method)) scale_linetype_manual(values = c("solid", "twodash", "dotted"))
                  } + {

                    # if input has neither columns method nor group, plot black line
                    if (is.null(input$group) & is.null(input$method)) geom_line(size = 2, alpha = 0.6)
                  } + {

                    # if QTL positions are given in labels df, plot as point...
                    if (!is.na(labels)[1]) geom_point(data = labels, aes(x = pos, y = lod))
                  } + {

                    # ... and plot name as text with ggrepel to avoid overlapping
                    if (!is.na(labels)[1]) geom_text_repel(data = labels, aes(x = pos, y = lod, label = name), nudge_y = 0.5)
                  } + 
                    # facet by chromosome
                    facet_wrap(~ chr, ncol = ncol, scales = "free_x") +
                    # minimal plotting theme
                    theme_minimal() +
                    # increase strip title size
                    theme(strip.text = element_text(face = "bold", size = 12)) +
                    # use RcolorBrewer palette
                    scale_color_brewer(palette = "Set1") +
                    # Change plot labels
                    labs(x = "Chromosome",
                         y = "LOD",
                         color = "",
                         linetype = "")

                  print(plot)
}


The first example uses the hyper data set and builds a simple QTL model with three modeling functions: the EM algorithm, Haley-Knott regression and multiple imputation. The genome wide LOD threshold is calculated with permutation. Feeding this LOD threshold into the summary output gives us the markers with a significant phenotype association (i.e. the QTL).

data(hyper)

hyper <- est.rf(hyper)

hyper <- calc.genoprob(hyper, step = 1, error.prob = 0.01)
out.em <- scanone(hyper, model = "normal")
out.hk <- scanone(hyper, method = "hk", model = "normal")

hyper <- sim.geno(hyper, step = 2, n.draws = 16, error.prob = 0.01)
out.imp <- scanone(hyper, method = "imp", model = "normal")

operm.hk <- scanone(hyper, pheno.col = 1, model = "normal", method = "hk", n.perm = 1000, verbose = FALSE)

lod_threshold <- summary(operm.hk, alpha = 0.05)
labels_df <- as.data.frame(summary(out.hk, perms = operm.hk, alpha = 0.05, pvalues = TRUE))

The most basic version of my plotting function takes as input a data frame from the scanone() function: it needs to have columns with chromosome information, position and LOD score. Without defining anything else, the function will return a line plot of the LOD scores for each marker position on each chromosome in the input data.

qtl_plot(input = out.hk)

If we want to compare the output from e.g. different QTL models, we can follow the tidyverse principle by combining the output data frames in long format and adding a column with the respective method names. If the input has a method column, the function will color the lines accordingly. We can also specify which chromosomes to display (here, 1 and 4) and give a LOD threshold to plot as a horizontal dashed line. If we want to see each marker position, we can specify rug = TRUE to plot them underneath the line plots.

And we can also plot the markers with a significant phenotype association (the QTL) by adding a data frame with the labeling information to the function call. This data frame needs to contain the chromosome, position and LOD score information. If there is no name column, it will take the row names (which usually give the marker name) as labels.

qtl_plot(input = rbind(data.frame(out.em, method = "EM algorithm"), 
                       data.frame(out.hk, method = "Haley-Knott regression"), 
                       data.frame(out.imp, method = "Multiple imputation")), 
         chrs = c(1, 4), 
         lod = lod_threshold[1], 
         rug = TRUE, 
         labels = labels_df)


The second example I want to show is from the listeria data. This time, a slightly more complex 2-part model is built.

data(listeria)

listeria$pheno$logSurv <- log(listeria$pheno[,1])
listeria <- est.rf(listeria)

newmap <- est.map(listeria, error.prob = 0.01)
listeria <- replace.map(listeria, newmap)

listeria <- calc.errorlod(listeria, error.prob = 0.01)

listeria <- calc.genoprob(listeria, step = 2)
out.2p <- scanone(listeria, pheno.col = 3, model = "2part", upper = TRUE)

operm.2p <- scanone(listeria, model = "2part", pheno.col = 3, upper = TRUE, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.2p, alpha = 0.05)

labels_df <- as.data.frame(summary(out.2p, format = "allpeaks", perms = operm.2p, alpha = 0.05, pvalues = TRUE))

The output from a 2-part model looks different than that from the other model functions: instead of only one LOD score column, we now get three columns named lod.p.mu, lod.p and lod.mu (check back with the tutorial if you want to know what they mean). The plotting function is built so that if you set the model parameter to “2part”, each of the three LOD columns will be plotted in a different color.

If we also want to plot the QTL labels, we need to prepare the label data frame before feeding it into the plotting function:

p.mu <- labels_df[, 1:4]
colnames(p.mu)[3] <- "lod"
p.mu$name <- "p.mu"

p <- labels_df[, c(1, 5:7)]
colnames(p)[3] <- "lod"
p$name <- "p"

mu <- labels_df[, c(1, 8:10)]
colnames(mu)[3] <- "lod"
mu$name <- "mu"

labels_df <- rbind(p.mu, p, mu)

If the chromosomes are not in the right order, we can reorder the factor labels:

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
out.2p$chr <- factor(out.2p$chr, levels = f)
labels_df$chr <- factor(labels_df$chr, levels = f)

The parameter ncol lets you define how many facet columns you want. Here, I want two rows and three columns to display the six chromosomes I chose for plotting:

qtl_plot(input = out.2p, 
         model = "2part",
         chrs = c(1, 3, 5:6, 13, 15),
         lod = lod_threshold[1], 
         rug = TRUE, 
         ncol = 3,
         labels = labels_df)


Binary models are again treated the same way as e.g. normal models:

y <- listeria$pheno$logSurv
my <- max(y, na.rm = TRUE)
z <- as.numeric(y == my)
y[y == my] <- NA
listeria$pheno$logSurv2 <- y
listeria$pheno$binary <- z

out.p <- scanone(listeria, pheno.col = 5, model = "binary")
out.np1 <- scanone(listeria, model = "np", ties.random = TRUE)
out.np2 <- scanone(listeria, model = "np", ties.random = FALSE)
qtl_plot(input = rbind(data.frame(out.p, method = "binary"),
                       data.frame(out.np1, method = "np1"),
                       data.frame(out.np2, method = "np2")))


The fake.bc data is used to show how to run QTL models with multiple phenotypes simultaneously and how to add additive or interactive covariates.

data(fake.bc)

fake.bc <- calc.genoprob(fake.bc, step = 2.5)
out.nocovar <- scanone(fake.bc, pheno.col = 1:2)

If the parameter mult.pheno is set to TRUE, the plotting function gathers the phenotype columns and plots each phenotype in a different color.

qtl_plot(input = out.nocovar, mult.pheno = TRUE)

If we specify both mult.pheno = TRUE and model = “2part”, the function plots each LOD score for each phenotype in a different color. If you wanted to plot only the lod.p column for example, you could subset the input data frame before feeding it into the plotting function.

out.nocovar.2p <- scanone(fake.bc, pheno.col = 1:2, model = "2part", upper = TRUE)

qtl_plot(input = out.nocovar.2p,
         chrs = c(2, 5),
         mult.pheno = TRUE, 
         model = "2part")


If we run a multiple-phenotype model and want to determine the QTL positions, we get a separate LOD threshold for each of the phenotypes and, accordingly, separate sets of significantly associated markers.

sex <- fake.bc$pheno$sex
out.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex)

out.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex)

out.sexint <- out.icovar - out.acovar

operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.icovar, alpha = 0.05)

labels_df <- as.data.frame(summary(out.icovar, perms = operm.icovar, alpha = 0.05, pvalues = TRUE))

This means that, if we want to plot QTL labels for both phenotypes, we first need to prepare the label data frame:

pheno1 <- labels_df[, 1:4]
colnames(pheno1)[3] <- "lod"
pheno1$name <- paste("pheno1", rownames(pheno1), sep = "\n")

pheno2 <- labels_df[, c(1, 2, 5, 6)]
colnames(pheno2)[3] <- "lod"
pheno2$name <- paste("pheno2", rownames(pheno1), sep = "\n")

labels_df <- rbind(pheno1, pheno2)

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
labels_df$chr <- factor(labels_df$chr, levels = f)

If different methods and groups are detected in the input data frame, the methods will be drawn with different colors, while the groups will have different line types. If we want to plot two LOD thresholds for the two phenotypes, we can give a data frame to the function that has a group column that corresponds to the phenotypes of the input data frame. The LOD threshold will now be plotted in the same line type as the phenotype LOD score plot.

qtl_plot(input = rbind(data.frame(out.acovar, method = "acovar"),
                       data.frame(out.icovar, method = "icovar"),
                       data.frame(out.sexint, method = "sexint")), 
         chrs = c(2, 5, 17),
         mult.pheno = TRUE,
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1:2]),
         labels = labels_df)


For QTL models with covariates, the same principles apply:

seed <- ceiling(runif(1, 0, 10^8))
set.seed(seed)
operm.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

set.seed(seed)
operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

operm.sexint <- operm.icovar - operm.acovar

lod_threshold <- summary(operm.sexint, alpha = c(0.05, 0.20))
labels <- as.data.frame(summary(out.sexint, perms = operm.sexint, alpha = 0.1, format = "allpeaks", pvalues = TRUE))
pheno1 <- data.frame(labels[, 1:4])
colnames(pheno1)[3] <- "lod"
pheno1$name <- "p1"

pheno2 <- data.frame(labels[, c(1, 5:7)])
colnames(pheno2)[3] <- "lod"
pheno2$name <- "p2"

labels_df <- rbind(pheno1, pheno2)
qtl_plot(input = out.sexint, 
         mult.pheno = TRUE,
         labels = labels_df,
         chrs = c(2, 7, 17), 
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1, ]),
         ncol = 3)

hyper <- sim.geno(hyper, step = 2.5, n.draws = 16, err = 0.01)
g <- pull.geno(fill.geno(hyper))[, "D4Mit164"]
out1.c4i <- scanone(hyper, method = "imp", addcovar = g, intcovar = g)
qtl_plot(input = out1.c4i)


The same principles also apply to model fitting:

qc <- c(1, 1, 4, 6, 15)
qp <- c(43.3, 78.3, 30.0, 62.5, 18.0)
qtl <- makeqtl(hyper, chr = qc, pos = qp)
myformula <- y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5
out.fq <- fitqtl(hyper, qtl = qtl, formula = myformula, drop = FALSE, get.ests = TRUE)

revqtl <- refineqtl(hyper, qtl = qtl, formula = myformula, verbose = FALSE)
out1.c4r <- addqtl(hyper, qtl = revqtl, formula = y ~ Q3)

out2.c4r <- addpair(hyper, qtl = revqtl, formula = y ~ Q3, chr = c(1, 6, 7, 15), verbose = FALSE)
out.iw4 <- addqtl(hyper, qtl = revqtl, formula = y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5 + Q6 + Q5:Q6)
out.1more <- addqtl(hyper, qtl = revqtl, formula = myformula)

qtl2 <- addtoqtl(hyper, revqtl, 7, 53.6)
qtl3 <- dropfromqtl(qtl2, index = 2)
qtl4 <- replaceqtl(hyper, qtl3, index = 1, chr = 1, pos = 50)
qtl5 <- reorderqtl(qtl4, c(1:3, 5, 4))

stepout.i <- stepwiseqtl(hyper, max.qtl = 6, verbose = FALSE)

Here, we also need to prepare the label data frame before the plotting function can handle it:

label_df <- data.frame(name = stepout.i$name, chr = stepout.i$chr, pos = stepout.i$pos)

qtl1 <- attr(stepout.i, "lodprofile")[[1]]
qtl4 <- attr(stepout.i, "lodprofile")[[2]]
qtl6 <- attr(stepout.i, "lodprofile")[[3]]
qtl15 <- attr(stepout.i, "lodprofile")[[4]]

input_df <- rbind(qtl1, qtl4, qtl6, qtl15)
input_df$marker <- rownames(input_df)
input_df$name <- paste(input_df$chr, round(input_df$pos, digits = 1), sep = "@")

label_df$pos <- as.numeric(gsub("(^[0-9]+)(.0$)", "\\1", label_df$pos))
label_df$name <- paste(label_df$chr, round(label_df$pos, digits = 1), sep = "@")

label_df <- left_join(label_df, input_df[, 3:5, drop = FALSE], by = "name")
label_df$name <- paste(label_df$name, label_df$marker, sep = "\n")
qtl_plot(input = input_df,
         labels = label_df)


These examples show only the simple QTL models. The qtl package also gives examples for building scantwo() models, but those cannot be plotted in the same way as scanone() output and thus won’t be covered here.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.5.0   tidyr_0.6.0   ggrepel_0.6.5 ggplot2_2.2.1 qtl_1.40-8   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8        knitr_1.15.1       magrittr_1.5      
##  [4] munsell_0.4.3      colorspace_1.3-2   R6_2.2.0          
##  [7] stringr_1.1.0      plyr_1.8.4         tools_3.3.2       
## [10] parallel_3.3.2     grid_3.3.2         gtable_0.2.0      
## [13] DBI_0.5-1          htmltools_0.3.5    yaml_2.1.14       
## [16] lazyeval_0.2.0     rprojroot_1.1      digest_0.6.11     
## [19] assertthat_0.1     tibble_1.2         RColorBrewer_1.1-2
## [22] codetools_0.2-15   evaluate_0.10      rmarkdown_1.3     
## [25] labeling_0.3       stringi_1.1.2      scales_0.4.1      
## [28] backports_1.0.4

Scratching the Surface of Gender Biases

2017-02-06 00:00:00 +0000

Today, I want to share my analysis of the World Gender Statistics dataset.

Last week I already introduced my Shiny app, where you can explore 160 measurements for 164 countries over 56 years. This week I’ve included a statistical analysis of these countries and measurements and put some finishing touches on the app.

Just in case anybody needed proof that gender bias is not only a problem of “other countries”, my analysis showed very nicely that the Western world isn’t (yet, I hope) gender neutral in regards to many statistics!

You can also run the app from GitHub with the shiny package; the repository also contains the app’s source code.

library(shiny)
runGitHub("ShirinG/WGS_app") 


The data

The data was downloaded from The World Bank’s Open Data project via Kaggle. The main data table shows

  • Country.Name: the name of the country
  • Country.Code: the country’s code
  • Indicator.Name: the name of the variable that this row represents
  • Indicator.Code: a unique id for the variable
  • 1960 - 2016: one column EACH for the value of the variable in each year it was available

Unfortunately, the dataset doesn’t include a column that indicates which female/ male statistics belong together, and there is no consistent naming scheme either. Therefore, I chose to focus on those statistics where the counterparts could be easily extracted: indicator codes that differed only in containing “.FE” or “.MA”. I then split the subsetted dataframe into female and male parts and produced a third dataset with the ratios between male and female values. To calculate the ratios, I added 0.001 to each data point, so that zero values would not produce infinite or undefined ratios.
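A quick toy calculation shows the effect of the offset:

# without the offset, a zero denominator would give Inf or NaN
(2 + 0.001) / (0 + 0.001)  # large but finite ratio
(0 + 0.001) / (0 + 0.001)  # 1, i.e. "no difference"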

The finished datasets were saved as “.RData” files, so that I could easily load them into the Shiny app.

dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

dataset_fem <- dataset_subs[grep("female", dataset_subs$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Name <- gsub(",", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

dataset_male <- dataset_subs[-grep("female", dataset_subs$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Name <- gsub(",", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".MA", "", dataset_male$Indicator.Code)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")
length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  code <- unique(dataset_fem$Indicator.Name)[n]
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    if (i == 1) {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- cbind(male[i, c(1:4)], diff)
    } else {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
    }
  }
  
  if (n == 1) {
    diff_table_bind <- diff_table
  } else {
    diff_table_bind <- rbind(diff_table_bind, diff_table)
  }
}

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")


The world map

The map was downloaded from the Natural Earth Data website. The country borders were reduced by 200 meters with ArcGIS Pro, so that clicking anywhere within a country on the map returns that country’s border as the nearest point. ArcGIS Pro was also used to convert the map to a Mercator projection. The changed shapefiles can be downloaded from my Github repository.

library(rgdal)
library(plyr)
library(scales)

wmap_countries <- readOGR(dsn = "shapefiles/changed_borders", layer = "ne_110m_admin_0_countries_smaller_wm")
wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")
wmap_countries_df_final$adm0_a3 <- as.character(wmap_countries_df_final$adm0_a3)
save(wmap_countries_df_final, file = "wmap_countries_smaller_df_final.RData")

diff_table_bind <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]
save(diff_table_bind, file = "diff_table_bind.RData")

countries <- as.character(unique(diff_table_bind$Country.Name))
save(countries, file = "countries.RData")
library(dplyr)
library(tidyr)

library(ggplot2)

map_theme <- list(theme(panel.grid.minor = element_blank(),
                        panel.grid.major = element_blank(),
                        panel.background = element_blank(),
                        plot.background = element_rect(fill = "white"),
                        panel.border = element_blank(),
                        axis.line = element_blank(),
                        axis.text.x = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks = element_blank(),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size = 18)))

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

colfunc <- colorRampPalette(c("yellow", "red"))


Which countries are most biased?

To explore differences between countries and statistics, I prepared a dataset that contains the first and last non-NA value in each time series of the male/ female ratio data (per statistic and country), in order to test whether the change over time was statistically significant. As for the world map plots, I am using the log2 of the ratios, so that ratios bigger and smaller than 1 can be more easily compared.
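The advantage of the log2 scale is that it is symmetric around 0, e.g.:

log2(2)    #  1 -> value twice as high for males
log2(0.5)  # -1 -> value twice as high for females
log2(1)    #  0 -> no bias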

You can explore which countries have the strongest bias and the biggest change for each statistic in the app under the tab “Latest ratios”.

year_table <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]

# last non-NA value
last_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[length(na.omit(x))]
  )
last_val_df <- data.frame(unlist(last_val))
last_val_df$year_of_val <- gsub("([0-9]+)(.X)([0-9]+)", "\\3", rownames(last_val_df))
rownames(last_val_df) <- gsub("([0-9]+)(.X)([0-9]+)", "\\1", rownames(last_val_df))

# first non-NA value
first_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[1]
  )
first_val_df <- data.frame(na.omit(first_val))

year_table_last_val <- cbind(year_table[rownames(last_val_df), 1:3], last_val_df, first_val_df)

year_table_last_val$difference <- (year_table_last_val$unlist.last_val. + 0.001) / (year_table_last_val$na.omit.first_val. + 0.001)
save(year_table_last_val, file = "year_table_last_val.RData")

To identify the countries with the strongest biases towards either men or women, I calculated the absolute log2 ratio of male/ female values and subsetted the 10 countries with the highest absolute bias for each statistic. I then counted for how many statistics each country appeared in this top 10.

for (i in 1:length(measures)) {
  
  m <- measures[i]
  
  subs <- subset(year_table_last_val, Indicator.Name == m)[, c(1, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  subs$measure <- m
    
  if (i == 1) {
    subs_comb <- subs[1:10, ]
    
  } else {
    subs_comb <- rbind(subs_comb, subs[1:10, ])
  }
}
subs_comb_ordered <- arrange(as.data.frame(table(subs_comb$Country.Name)), desc(Freq))

subs_comb$Country.Name <- factor(subs_comb$Country.Name, levels = subs_comb_ordered$Var1)

ggplot(subs_comb, aes(x = Country.Name)) +
  geom_bar(fill = "navyblue", alpha = 0.7) +
  coord_flip() +
  my_theme() +
  labs(title = "Most biased countries",
       subtitle = "For how many statistics were the countries in top 10 with highest absolute bias",
       x = "")

This plot shows that there are quite a few European/ Western countries with high biases in some statistics. For example:

arrange(subs_comb[which(subs_comb$Country.Name %in% c("Germany", "Netherlands", "United States", "United Kingdom")), - 2], 
        Country.Name, desc(abs_ratio))
##      Country.Name abs_ratio
## 1     Netherlands 1.4678833
## 2     Netherlands 1.4674207
## 3     Netherlands 1.3615747
## 4     Netherlands 1.1248210
## 5     Netherlands 0.9993447
## 6     Netherlands 0.6601651
## 7     Netherlands 0.5512176
## 8     Netherlands 0.1847042
## 9         Germany 8.2336197
## 10        Germany 2.0428320
## 11        Germany 1.1641113
## 12        Germany 1.0957438
## 13        Germany 0.9825083
## 14        Germany 0.7356144
## 15        Germany 0.6274894
## 16 United Kingdom 1.0932072
## 17 United Kingdom 0.8659095
## 18 United Kingdom 0.4347145
## 19  United States 1.3212071
##                                                                                    measure
## 1        Educational attainment completed Doctoral or equivalent population 25+ years  (%)
## 2  Educational attainment competed Doctoral or equivalent population 25+  (%) (cumulative)
## 3                              Borrowed from a store by buying on credit  (% age 15+) [w1]
## 4                           Prevalence of stunting height for age  (% of children under 5)
## 5                        Prevalence of underweight weight for age  (% of children under 5)
## 6                                                       Loan in the past year  (% age 15+)
## 7                      Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 8                                                     Progression to secondary school  (%)
## 9                  Prevalence of severe wasting weight for height  (% of children under 5)
## 10                                          Part time employment  (% of total  employment)
## 11                                         Time-related underemployment  (% of employment)
## 12              Educational attainment completed lower secondary population 25+ years  (%)
## 13         Educational attainment completed short-cycle tertiary population 25+ years  (%)
## 14                                       Borrowed from family or friends  (% age 15+) [ts]
## 15                     Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 16                               Borrowed for health or medical purposes  (% age 15+) [w2]
## 17       Average number of hours spent on unpaid domestic work (housework and child care) 
## 18                                   Borrowed any money in the past year  (% age 15+) [w2]
## 19                          Prevalence of stunting height for age  (% of children under 5)

So, let’s take a closer look at the countries: I am calculating the sum of absolute male/ female ratios (again log2) for each country and showing it on the map:

for (i in 1:length(countries)) {
  
  c <- countries[i]
  
  subs <- subset(year_table_last_val, Country.Name == c)[, c(2, 3, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  if (i == 1) {
    subs_country <- data.frame(country = c, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio))
    
  } else {
    subs_country <- rbind(subs_country, data.frame(country = c, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio)))
  }
}
left_join(subset(wmap_countries_df_final, !continent == "Antarctica"), subs_country, by = c("adm0_a3" = "Country.Code")) %>%
  ggplot(aes(long, lat, group = group, fill = sum_abs_ratio)) +
        coord_equal() +
        map_theme +
        geom_polygon() +
        geom_path(color = "white", size = 0.5) +
        labs(title = "General gender bias per country",
             subtitle = "Sum of absolute gender biases for all statistics",
             fill = "sum of absolute\nlog2 of male / female") +
        scale_fill_gradient2(low = "blue", mid = "blue", high = "red")


Statistical Analysis

You can explore the results for the individual statistics in the app under “Analysis - Plots” and “Analysis - Tests”.


The variables

The world data contains information on each country. I used the following variables to test for statistical significance:

  1. economic status,
  2. income group,
  3. estimated population size,
  4. estimated gross domestic product (GDP) and
  5. continent.

Because not all data was normally distributed, I explored non-parametric tests as well as an ANOVA. The plots for each statistic can be explored in my app, so here I am only showing the results for which measures were statistically significant for the different variables.


Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank test can be used to test whether observations from repeated measures differ significantly from each other. Here, I am using it to test whether the most recent recorded value differs significantly from the first recorded value. Because I am testing the same data multiple times, I am correcting the p-values and only considering them significant at a False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).
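The FDR correction itself is a one-liner with p.adjust() (toy p-values for illustration):

p_values <- c(0.001, 0.02, 0.04, 0.2)
p.adjust(p_values, method = "fdr")  # Benjamini-Hochberg adjusted p-values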

country.inf <- subset(wmap_countries_df_final[, c(17, 25, 26, 31, 32, 42, 43, 47, 48, 62)], !continent == "Antarctica")
country.inf <- country.inf[!duplicated(country.inf), ]
for (i in 1:length(measures)) {
  m <- measures[i]
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))
  table <- wilcox.test(log2(stats_test$na.omit.first_val.), log2(stats_test$unlist.last_val.), paired = TRUE) 
  
  table <- data.frame(Indicator.Name = paste(m),
                      diff = (median(stats_test$unlist.last_val.) + 0.0001) / (median(stats_test$na.omit.first_val.) + 0.0001),
                      V = table$statistic,
                      p.value = table$p.value,
                      alternative = table$alternative)

  if (i == 1) {
    table_wilcox <- table
  } else {
    table_wilcox <- rbind(table_wilcox, table)
  }
}

table_wilcox$p.adj <- p.adjust(table_wilcox$p.value, method = "fdr")
table_wilcox$significant <- as.factor(ifelse(table_wilcox$p.adj < 0.1, "significant", "non-significant"))

sig <- subset(table_wilcox, significant == "significant")
left_join(subset(year_table_last_val, Indicator.Name %in% sig$Indicator.Name), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  gather(x, y, unlist.last_val., na.omit.first_val.) %>%
  ggplot(aes(x = Indicator.Name, y = log2(y), color = x, fill = x)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  coord_flip() +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference over time",
    subtitle = "Wilcoxon Signed-Rank Test",
    x = "",
    y = "log2(male/ female)")


Kruskal-Wallis Test

The Kruskal-Wallis test is a simple non-parametric statistical test for comparing more than two groups. I used it here to test for each variable of interest whether it has a statistically significant effect on the most recent recorded value. Because I am testing the same data multiple times, I am again correcting the p-values and only consider them significant at a False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).

for (i in 1:length(measures)) {
  m <- measures[i]
  
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))

  # extract the numeric rank from the economy and income group labels
  stats_test$economy_rank <- as.numeric(as.character(gsub("(^[1-7])(.*)", "\\1", stats_test$economy)))
  stats_test$income_grp_rank <- as.numeric(as.character(gsub("(^[1-5])(.*)", "\\1", stats_test$income_grp)))

  # Kruskal-Wallis test of the most recent value against each grouping variable
  k_economy_rank <- kruskal.test(log2(unlist.last_val.) ~ economy_rank, data = stats_test)
  k_income_rank <- kruskal.test(log2(unlist.last_val.) ~ income_grp_rank, data = stats_test)
  k_pop <- kruskal.test(log2(unlist.last_val.) ~ pop_est, data = stats_test)
  k_gdp <- kruskal.test(log2(unlist.last_val.) ~ gdp_md_est, data = stats_test)
  k_continent <- kruskal.test(log2(unlist.last_val.) ~ continent, data = stats_test)

  kruskal_last_val <- data.frame(
    group = c("economy (rank)", "income_grp (rank)", "pop_est", "gdp_md_est", "continent"),
    p.val = c(k_economy_rank$p.value, k_income_rank$p.value, k_pop$p.value, k_gdp$p.value, k_continent$p.value))

  kruskal_last_val$p.adj <- p.adjust(kruskal_last_val$p.val, method = "fdr")

  # collect the adjusted p-values for all measures in one wide dataframe
  if (i == 1) {
    kruskal_last_val_df <- kruskal_last_val[, -2]
    colnames(kruskal_last_val_df)[2] <- paste(m)
  } else {
    pre <- kruskal_last_val[, -2]
    colnames(pre)[2] <- paste(m)
    
    kruskal_last_val_df <- left_join(kruskal_last_val_df, pre, by = "group")
  }
}
eco <- t(kruskal_last_val_df[1, -1])
eco_sig <- eco[which(eco < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(eco_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = economy, fill = economy)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in economy",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

inc <- t(kruskal_last_val_df[2, -1])
inc_sig <- inc[which(inc < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(inc_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = income_grp, fill = income_grp)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in income group",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

cont <- t(kruskal_last_val_df[5, -1])
cont_sig <- cont[which(cont < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(cont_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = continent, fill = continent)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in continent",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

For population and GDP, there were no statistically significant measures.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 tidyr_0.6.1   dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9      digest_0.6.12    rprojroot_1.2    assertthat_0.1  
##  [5] plyr_1.8.4       grid_3.3.2       R6_2.2.0         gtable_0.2.0    
##  [9] DBI_0.5-1        backports_1.0.5  magrittr_1.5     scales_0.4.1    
## [13] evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0   rmarkdown_1.3   
## [17] labeling_0.3     tools_3.3.2      stringr_1.1.0    munsell_0.4.3   
## [21] yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5  knitr_1.15.1    
## [25] tibble_1.2

New features in World Gender Statistics app

2017-01-30 00:00:00 +0000

In my last post, I built a Shiny app to explore World Gender Statistics.

To make it a bit nicer and more convenient, I added a few more features:

  • The drop-down menu for Years is now reactive, i.e. it only shows options with data (all NA years are removed; see the sketch after this list)
  • You can click on any country on the map to get information about which country it is, its population size, income group and region
  • Below the world map, you can look at timelines for male vs female values for each country
  • The map is now in Mercator projection
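
For illustration, here is a minimal sketch of how such a reactive drop-down can be wired up in Shiny; the objects measures and diff_table_bind come from the data preparation in the next post below, everything else is simplified and not the actual app code:

library(shiny)

ui <- fluidPage(
  selectInput("measure", "Statistic", choices = measures),
  uiOutput("year_selector")
)

server <- function(input, output) {
  # re-render the year drop-down whenever the selected statistic changes
  output$year_selector <- renderUI({
    subs <- subset(diff_table_bind, Indicator.Name == input$measure)
    year_cols <- grep("^X[0-9]{4}$", colnames(subs), value = TRUE)
    # keep only years in which at least one country has a value
    years_with_data <- year_cols[colSums(!is.na(subs[, year_cols])) > 0]
    selectInput("year", "Year", choices = gsub("X", "", years_with_data))
  })
}

shinyApp(ui = ui, server = server)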

Exploring World Gender Statistics with Shiny

2017-01-29 00:00:00 +0000

This week I explored the World Gender Statistics dataset. You can look at 160 measurements over 56 years with my Shiny app here.


I prepared the data as follows:

Data.csv

  • Country.Name: the name of the country
  • Country.Code: the country’s code
  • Indicator.Name: the name of the variable that this row represents
  • Indicator.Code: a unique id for the variable
  • 1960 - 2016: one column EACH for the value of the variable in each year it was available
dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

# female rows: strip the gender part from indicator name and code
dataset_fem <- dataset_subs[grep("female", dataset_subs$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

# male rows: everything that is not "female"
dataset_male <- dataset_subs[-grep("female", dataset_subs$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".MA", "", dataset_male$Indicator.Code)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")
length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

# for each indicator, compute the male/female ratio per country and year;
# the rows of fem and male align because both are arranged by Country.Code
for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  
  code <- unique(dataset_fem$Indicator.Name)[n]
  
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    
    if (i == 1) {
      
      # columns 5 to 61 hold the yearly values
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- cbind(male[i, c(1:4)], diff)
      
    } else {
      
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
      
    }
  }
  
  if (n == 1) {
    
    diff_table_bind <- diff_table
    
  } else {
    
    diff_table_bind <- rbind(diff_table_bind, diff_table)
    
  }
  
}
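
As an aside, the same ratio table could also be built without the nested loops. A vectorized sketch, assuming each country/indicator combination occurs exactly once in both dataframes (the row order then follows dataset_fem rather than the loop output):

# align the male rows to the female rows by country and indicator
key_fem <- paste(dataset_fem$Country.Code, dataset_fem$Indicator.Name)
key_male <- paste(dataset_male$Country.Code, dataset_male$Indicator.Name)
dataset_male_aligned <- dataset_male[match(key_fem, key_male), ]

# divide all year columns (5 to 61) in one step
diff_table_vec <- cbind(dataset_male_aligned[, 1:4],
                        dataset_male_aligned[, 5:61] / dataset_fem[, 5:61])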

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
# replace NaN values (e.g. from 0/0 divisions) with NA
diff_table_bind[diff_table_bind == "NaN"] <- NA

save(diff_table_bind, file = "diff_table_bind.RData")
measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")

Map

The map has been downloaded from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/

library(rgdal)
library(ggplot2)
library(plyr)
library(dplyr)
library(scales)

wmap_countries <- readOGR(dsn="shapefiles", layer="ne_110m_admin_0_countries")

# fortify() converts the spatial object into a ggplot2-compatible dataframe;
# the country attributes are then re-attached via the polygon id
wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")

wmap_countries_df_final$gu_a3 <- as.character(wmap_countries_df_final$gu_a3)

save(wmap_countries_df_final, file = "wmap_countries_df_final.RData")


## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.4 magrittr_1.5    rprojroot_1.1   tools_3.3.2    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.8     stringi_1.1.2  
##  [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.1.0   digest_0.6.11  
## [13] evaluate_0.10

R vs Python - a One-on-One Comparison

2017-01-22 00:00:00 +0000

I’m an avid R user and rarely use anything else for data analysis and visualisations. But while R is my go-to, in some cases, Python might actually be a better alternative.

That’s why I wanted to see how R and Python fare in a one-on-one comparison of an analysis that’s representative of what I would typically work with.

Conclusions

All in all, the Python code could easily be translated into R and was comparable in length and simplicity between the two languages. While Python’s syntax is inherently cleaner/tidier, we can use packages that implement piping in R and achieve similar results (even though Python’s dot-separated syntax is still much easier to type than magrittr’s piping operator). For plotting and visualisation I still think that R’s ggplot2 is top of the line in syntax, customizability and output (admittedly, I don’t know matplotlib as well as ggplot2). In terms of functionality, I couldn’t find major differences between the two languages and I would say they both have their merits. For me, R comes more naturally as that is what I’m more fluent in, but I can see why Python holds an appeal too, and I think I’ll make more of an effort to use both languages in my future projects.



R and Python

R and Python are both open-source languages used in a wide range of data analysis fields. Their main difference is that R has traditionally been geared towards statistical analysis, while Python is more generalist. Both comprise a large collection of packages for specific tasks and have a growing community that offers support and tutorials online.

For a nice overview of the languages’ respective strengths and weaknesses, check out DataCamp’s infographic.


Comparative analysis of genome data

To directly compare R and Python, I am following Zhuyi Xue’s “A Comprehensive Introduction To Your Genome With the SciPy Stack” (with some minor tweaks here and there). He gives a nice introduction to the data, so I will not repeat it here but focus on the comparison between the code lines.

For R, I am working with RStudio, for Python with Anaconda and Spyder.

Python:

For this analysis, we need the SciPy stack with pandas for data wrangling and matplotlib for visualisation. Anaconda already comes with all the packages we need. Zhuyi Xue first imports pandas as “pd”, so that we can call pandas functions by prefixing them with “pd.”.

import pandas as pd

R:

While the code could be replicated with base R, I prefer dplyr for data wrangling and ggplot2 for visualisation.

library(dplyr)
library(ggplot2)

Reading in data

Reading in data is straightforward in both R and Python, and the code we need to read in the file is comparable between the two. Zhuyi Xue specified “compression = ‘gzip’”, but this would not have been necessary as the default is to infer it from the file suffix.

One big difference in the general syntax can be seen here too: Boolean true/false values are written in all caps in R (TRUE/FALSE), while Python uses first-letter capitalisation (True/False). The same principle applies to null values (NULL in R, None in Python).

Python:

df = pd.read_csv('Homo_sapiens.GRCh38.85.gff3.gz', 
                         compression = 'gzip',
                         sep = '\t', 
                         comment = '#', 
                         low_memory = False,
                         header = None, 
                         names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
df.head()

R:

df <- read.csv("~/Documents/Github/Homo_sapiens.GRCh38.85.gff3.gz", 
               header = FALSE, 
               sep = "\t", 
               col.names = c('seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'), 
               comment.char = "#")
head(df)
##   seqid source              type start       end   score strand phase
## 1     1 GRCh38        chromosome     1 248956422       .      .     .
## 2     1      . biological_region 10469     11240 1.3e+03      .     .
## 3     1      . biological_region 10650     10657   0.999      +     .
## 4     1      . biological_region 10655     10657   0.999      -     .
## 5     1      . biological_region 10678     10687   0.999      +     .
## 6     1      . biological_region 10681     10688   0.999      -     .
##                                           attributes
## 1 ID=chromosome:1;Alias=CM000663.2,chr1,NC_000001.11
## 2           external_name=oe %3D 0.79;logic_name=cpg
## 3                                 logic_name=eponine
## 4                                 logic_name=eponine
## 5                                 logic_name=eponine
## 6                                 logic_name=eponine

Examining data

  • listing unique strings

The first thing we want to know from the data is how many unique entries there are in the “seqid” column.

Here, we can already see the main difference in syntax between R and Python:

Python concatenates the object name (“df”) with the column name and the functions that we want to run on this column in a sequential manner, separated by a dot. Base R uses nested functions, where each function is called with “function_name()” and we specify columns with “object_name$column_name”.

However, both R and Python can also call columns in a dataframe with “[ ]”, with the difference that Python by default subsets columns (df[“seqid”]), while R always needs index specifications for rows and columns, separated by “,”: e.g. df[, “seqid”] would subset every row and only the column named “seqid”.

The sequential calling of functions is indeed very handy: it makes the code easier to read and understand than lots of interwoven functions and brackets. And while this isn’t how base R works, dplyr uses the magrittr principle of chaining functions with the pipe symbol “%>%”. Even though this symbol isn’t as easily typed, its functionality is often superior to base R, especially if you need to run many functions on a dataframe. With just one or two functions, however, I usually keep to base R as it is shorter.

Python:

df.seqid.unique() # alternatively: df['seqid'].unique()

R:

unique(df$seqid)
##   [1] 1          10         11         12         13         14        
##   [7] 15         16         17         18         19         2         
##  [13] 20         21         22         3          4          5         
##  [19] 6          7          8          9          GL000008.2 GL000009.2
##  [25] GL000194.1 GL000195.1 GL000205.2 GL000208.1 GL000213.1 GL000214.1
##  [31] GL000216.2 GL000218.1 GL000219.1 GL000220.1 GL000221.1 GL000224.1
##  [37] GL000225.1 GL000226.1 KI270302.1 KI270303.1 KI270304.1 KI270305.1
##  [43] KI270310.1 KI270311.1 KI270312.1 KI270315.1 KI270316.1 KI270317.1
##  [49] KI270320.1 KI270322.1 KI270329.1 KI270330.1 KI270333.1 KI270334.1
##  [55] KI270335.1 KI270336.1 KI270337.1 KI270338.1 KI270340.1 KI270362.1
##  [61] KI270363.1 KI270364.1 KI270366.1 KI270371.1 KI270372.1 KI270373.1
##  [67] KI270374.1 KI270375.1 KI270376.1 KI270378.1 KI270379.1 KI270381.1
##  [73] KI270382.1 KI270383.1 KI270384.1 KI270385.1 KI270386.1 KI270387.1
##  [79] KI270388.1 KI270389.1 KI270390.1 KI270391.1 KI270392.1 KI270393.1
##  [85] KI270394.1 KI270395.1 KI270396.1 KI270411.1 KI270412.1 KI270414.1
##  [91] KI270417.1 KI270418.1 KI270419.1 KI270420.1 KI270422.1 KI270423.1
##  [97] KI270424.1 KI270425.1 KI270429.1 KI270435.1 KI270438.1 KI270442.1
## [103] KI270448.1 KI270465.1 KI270466.1 KI270467.1 KI270468.1 KI270507.1
## [109] KI270508.1 KI270509.1 KI270510.1 KI270511.1 KI270512.1 KI270515.1
## [115] KI270516.1 KI270517.1 KI270518.1 KI270519.1 KI270521.1 KI270522.1
## [121] KI270528.1 KI270529.1 KI270530.1 KI270538.1 KI270539.1 KI270544.1
## [127] KI270548.1 KI270579.1 KI270580.1 KI270581.1 KI270582.1 KI270583.1
## [133] KI270584.1 KI270587.1 KI270588.1 KI270589.1 KI270590.1 KI270591.1
## [139] KI270593.1 KI270706.1 KI270707.1 KI270708.1 KI270709.1 KI270710.1
## [145] KI270711.1 KI270712.1 KI270713.1 KI270714.1 KI270715.1 KI270716.1
## [151] KI270717.1 KI270718.1 KI270719.1 KI270720.1 KI270721.1 KI270722.1
## [157] KI270723.1 KI270724.1 KI270725.1 KI270726.1 KI270727.1 KI270728.1
## [163] KI270729.1 KI270730.1 KI270731.1 KI270732.1 KI270733.1 KI270734.1
## [169] KI270735.1 KI270736.1 KI270737.1 KI270738.1 KI270739.1 KI270740.1
## [175] KI270741.1 KI270742.1 KI270743.1 KI270744.1 KI270745.1 KI270746.1
## [181] KI270747.1 KI270748.1 KI270749.1 KI270750.1 KI270751.1 KI270752.1
## [187] KI270753.1 KI270754.1 KI270755.1 KI270756.1 KI270757.1 MT        
## [193] X          Y         
## 194 Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 ... Y
# with dplyr:
# df %>% select(seqid) %>% unique


  • how many unique seqids are there?

To get the number of unique entries in the “seqid” column, we simply need to append “.shape” to the above Python code. In R, we can either wrap the above R code with the “length()” function or use dplyr and piping instead. If we use the latter, we need to use the “nrow()” function because base R returns a vector, while dplyr returns a dataframe. Here, we can see that with two functions, using dplyr is still a bit more code but it already looks much tidier.

Python:

df.seqid.unique().shape

R:

length(unique(df$seqid))
## [1] 194
# with dplyr:
# df %>% select(seqid) %>% unique %>% nrow


  • counting occurrences

To count the frequencies of each unique entry in the “source” column, we use the “value_counts()” function in Python and the “table()” function in R. These two functions differ in how they sort the output table: value_counts() sorts by decreasing frequency, while R alphabetically sorts the variables. To order the data as in Python, we need to add the “sort()” function to our R code.

Python:

df.source.value_counts()

R:

# table(df$source) is ordered alphabetically by default; to order by decreasing counts, like in Python:
sort(table(df$source), decreasing = TRUE)
## 
##         havana ensembl_havana        ensembl              .        mirbase 
##        1441093         745065         228212         182510           4701 
##         GRCh38          insdc 
##            194             74
# dplyr:
# df %>% select(source) %>% table %>% sort(decreasing = TRUE)


How Much of the Genome Is Incomplete?

  • subsetting a dataframe

We are now subsetting our original dataframe and assigning it a new object name with “ = ” (Python) or “ <- ” (R).

To subset the dataframe to keep only rows which say “GRCh38” in the “source” column, there are several ways to do this in R: the approach directly comparable to Python would be square bracket indexing. However, there are two more elegant solutions: base R’s “subset()” or dplyr’s “filter()” function. With this short example, though, there is no big difference between the three.

Python’s “shape” gives us the same information as R’s “dim()” function: how many rows and columns our dataframe has.

To preview a random subset of 10 rows from our dataframe, we use Python’s “sample()” and dplyr’s “sample_n()” function.

Python:

gdf = df[df.source == 'GRCh38']

gdf.shape
gdf.sample(10)

R:

gdf <- df[df$source == "GRCh38", ]

# alternatively:
# gdf <- subset(df, source == "GRCh38")

# dplyr:
# gdf <- df %>% filter(source == "GRCh38")

# get number of rows and columns
dim(gdf)
## [1] 194   9
# randomly sample 10 rows for observation
sample_n(gdf, 10)
##              seqid source        type start       end score strand phase
## 2511484 KI270375.1 GRCh38 supercontig     1      2378     .      .     .
## 2511545 KI270466.1 GRCh38 supercontig     1      1233     .      .     .
## 2511473 KI270337.1 GRCh38 supercontig     1      1121     .      .     .
## 2511504 KI270411.1 GRCh38 supercontig     1      2646     .      .     .
## 2511487 KI270379.1 GRCh38 supercontig     1      1045     .      .     .
## 483371          12 GRCh38  chromosome     1 133275309     .      .     .
## 2511494 KI270387.1 GRCh38 supercontig     1      1537     .      .     .
## 674768          14 GRCh38  chromosome     1 107043718     .      .     .
## 2511493 KI270386.1 GRCh38 supercontig     1      1788     .      .     .
## 2511505 KI270412.1 GRCh38 supercontig     1      1179     .      .     .
##                                                           attributes
## 2511484 ID=supercontig:KI270375.1;Alias=chrUn_KI270375v1,NT_187493.1
## 2511545 ID=supercontig:KI270466.1;Alias=chrUn_KI270466v1,NT_187421.1
## 2511473 ID=supercontig:KI270337.1;Alias=chrUn_KI270337v1,NT_187466.1
## 2511504 ID=supercontig:KI270411.1;Alias=chrUn_KI270411v1,NT_187409.1
## 2511487 ID=supercontig:KI270379.1;Alias=chrUn_KI270379v1,NT_187472.1
## 483371          ID=chromosome:12;Alias=CM000674.2,chr12,NC_000012.12
## 2511494 ID=supercontig:KI270387.1;Alias=chrUn_KI270387v1,NT_187475.1
## 674768           ID=chromosome:14;Alias=CM000676.2,chr14,NC_000014.9
## 2511493 ID=supercontig:KI270386.1;Alias=chrUn_KI270386v1,NT_187480.1
## 2511505 ID=supercontig:KI270412.1;Alias=chrUn_KI270412v1,NT_187408.1


  • creating new columns and performing calculations

Now we want to create a new column called “length”. It should contain the gene lengths, i.e. the distance in base pairs between the start and end points on the chromosomes (“start” and “end” columns). In R we don’t need to copy the dataframe first, but the rest of the code is very similar: we define a new column name for the dataframe and assign its value by subtracting the values of the “start” from the values of the “end” column (+ 1). Here, we can see again that Python uses a dot to specify columns, while R uses the dollar sign.

The sum of all lengths is calculated with the “sum()” function in both languages.

Python:

gdf = gdf.copy()
gdf['length'] = gdf.end - gdf.start + 1

gdf.length.sum()

R:

gdf$length <- gdf$end - gdf$start + 1

sum(gdf$length)
## [1] 3096629726

Next, we want to calculate the proportion of the genome that is not on main chromosome assemblies. For that, we first define a character string with the main chromosomes: 1 to 22, X, Y and MT (the mitochondrial chromosome). Defining this string is a bit easier in R.

We will use this string to calculate the sum of lengths of the subsetted dataframe and divide it by the sum of lengths of the whole dataframe. For subsetting, we use the “isin()” function of Python, which corresponds to R’s “%in%”.

Python:

chrs = [str(_) for _ in range(1, 23)] + ['X', 'Y', 'MT']
gdf[-gdf.seqid.isin(chrs)].length.sum() / gdf.length.sum()

# or
gdf[(gdf['type'] == 'supercontig')].length.sum() / gdf.length.sum()

R:

chrs <- c(1:22, "X", "Y", "MT")
sum(subset(gdf, !seqid %in% chrs)$length) / sum(gdf$length)
## [1] 0.003702192


How Many Genes Are There?

Here, we are again using the same functions as above for subsetting the dataframe, asking for its dimensions, printing 10 random lines and asking for the frequencies of each unique item in the “type” column. For the latter I am using base R over dplyr, because it’s faster to type.

Python:

edf = df[df.source.isin(['ensembl', 'havana', 'ensembl_havana'])]
edf.shape

edf.sample(10)
edf.type.value_counts()

R:

edf <- subset(df, source %in% c("ensembl", "havana", "ensembl_havana"))
dim(edf)
## [1] 2414370       9
sample_n(edf, 10)
##         seqid         source            type     start       end score
## 2548897     X        ensembl three_prime_UTR  67723842  67730618     .
## 790670     15 ensembl_havana             CDS  42352030  42352139     .
## 1298369    19         havana             CDS  38942683  38942751     .
## 189728      1 ensembl_havana             CDS 186304108 186304257     .
## 1080599    17         havana            exon  44401337  44401453     .
## 1562968    20 ensembl_havana            exon   2964272   2964350     .
## 1965100     5        ensembl      transcript    144087    144197     .
## 2581554     X         havana            exon 147927670 147928807     .
## 785399     15         havana            exon  40611478  40611511     .
## 1417830     2 ensembl_havana             CDS  71520178  71520208     .
##         strand phase
## 2548897      +     .
## 790670       +     2
## 1298369      -     0
## 189728       +     2
## 1080599      -     .
## 1562968      +     .
## 1965100      -     .
## 2581554      +     .
## 785399       +     .
## 1417830      +     0
##                                                                                                                                                                         attributes
## 2548897                                                                                                                                          Parent=transcript:ENST00000612452
## 790670                                                                                         ID=CDS:ENSP00000326227;Parent=transcript:ENST00000318010;protein_id=ENSP00000326227
## 1298369                                                                                        ID=CDS:ENSP00000472465;Parent=transcript:ENST00000599996;protein_id=ENSP00000472465
## 189728                                                                                         ID=CDS:ENSP00000356453;Parent=transcript:ENST00000367483;protein_id=ENSP00000356453
## 1080599                         Parent=transcript:ENST00000585614;Name=ENSE00002917300;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00002917300;rank=7;version=1
## 1562968                        Parent=transcript:ENST00000216877;Name=ENSE00003603687;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;exon_id=ENSE00003603687;rank=4;version=1
## 1965100 ID=transcript:ENST00000362670;Parent=gene:ENSG00000199540;Name=Y_RNA.43-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000362670;transcript_support_level=NA;version=1
## 2581554                       Parent=transcript:ENST00000620828;Name=ENSE00003724969;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003724969;rank=1;version=1
## 785399                          Parent=transcript:ENST00000399668;Name=ENSE00003659959;constitutive=0;ensembl_end_phase=2;ensembl_phase=1;exon_id=ENSE00003659959;rank=7;version=1
## 1417830                                                                                        ID=CDS:ENSP00000398305;Parent=transcript:ENST00000429174;protein_id=ENSP00000398305
sort(table(edf$type), decreasing = TRUE)
## 
##                          exon                           CDS 
##                       1180596                        704604 
##                five_prime_UTR               three_prime_UTR 
##                        142387                        133938 
##                    transcript                          gene 
##                         96375                         42470 
##          processed_transcript aberrant_processed_transcript 
##                         28228                         26944 
##        NMD_transcript_variant                       lincRNA 
##                         13761                         13247 
##          processed_pseudogene                  lincRNA_gene 
##                         10722                          7533 
##                    pseudogene                           RNA 
##                          3049                          2221 
##                         snRNA                    snRNA_gene 
##                          1909                          1909 
##                        snoRNA                   snoRNA_gene 
##                           956                           944 
##        pseudogenic_transcript                          rRNA 
##                           737                           549 
##                     rRNA_gene                         miRNA 
##                           549                           302 
##                V_gene_segment                J_gene_segment 
##                           216                           158 
##               VD_gene_segment                C_gene_segment 
##                            37                            29 
##             biological_region                    chromosome 
##                             0                             0 
##                    miRNA_gene                       mt_gene 
##                             0                             0 
##                   supercontig 
##                             0

Now we want to subset the dataframe to rows with the attribute “gene” in the “type” column, look at 10 random lines from the “attributes” column and get the dataframe dimensions.

Python:

ndf = edf[edf.type == 'gene']
ndf = ndf.copy()
ndf.sample(10).attributes.values
ndf.shape

R:

ndf <- subset(edf, type == "gene")
sample_n(ndf, 10)$attributes
##  [1] ID=gene:ENSG00000170949;Name=ZNF160;biotype=protein_coding;description=zinc finger protein 160 [Source:HGNC Symbol%3BAcc:HGNC:12948];gene_id=ENSG00000170949;havana_gene=OTTHUMG00000182854;havana_version=3;logic_name=ensembl_havana_gene;version=17                  
##  [2] ID=gene:ENSG00000215088;Name=RPS5P3;biotype=processed_pseudogene;description=ribosomal protein S5 pseudogene 3 [Source:HGNC Symbol%3BAcc:HGNC:10427];gene_id=ENSG00000215088;havana_gene=OTTHUMG00000065532;havana_version=1;logic_name=havana;version=3                
##  [3] ID=gene:ENSG00000249077;Name=RP11-478C1.8;biotype=processed_pseudogene;gene_id=ENSG00000249077;havana_gene=OTTHUMG00000160290;havana_version=1;logic_name=havana;version=1                                                                                              
##  [4] ID=gene:ENSG00000110619;Name=CARS;biotype=protein_coding;description=cysteinyl-tRNA synthetase [Source:HGNC Symbol%3BAcc:HGNC:1493];gene_id=ENSG00000110619;havana_gene=OTTHUMG00000010927;havana_version=9;logic_name=ensembl_havana_gene;version=16                   
##  [5] ID=gene:ENSG00000168010;Name=ATG16L2;biotype=protein_coding;description=autophagy related 16 like 2 [Source:HGNC Symbol%3BAcc:HGNC:25464];gene_id=ENSG00000168010;havana_gene=OTTHUMG00000167961;havana_version=2;logic_name=ensembl_havana_gene;version=10             
##  [6] ID=gene:ENSG00000055957;Name=ITIH1;biotype=protein_coding;description=inter-alpha-trypsin inhibitor heavy chain 1 [Source:HGNC Symbol%3BAcc:HGNC:6166];gene_id=ENSG00000055957;havana_gene=OTTHUMG00000150312;havana_version=9;logic_name=ensembl_havana_gene;version=10
##  [7] ID=gene:ENSG00000258781;Name=RP11-496I2.4;biotype=unprocessed_pseudogene;gene_id=ENSG00000258781;havana_gene=OTTHUMG00000170502;havana_version=2;logic_name=havana;version=2                                                                                            
##  [8] ID=gene:ENSG00000157965;Name=SSX8;biotype=unprocessed_pseudogene;description=SSX family member 8 [Source:HGNC Symbol%3BAcc:HGNC:19654];gene_id=ENSG00000157965;havana_gene=OTTHUMG00000021573;havana_version=4;logic_name=havana;version=11                             
##  [9] ID=gene:ENSG00000273489;Name=RP11-180C16.1;biotype=antisense;gene_id=ENSG00000273489;havana_gene=OTTHUMG00000186336;havana_version=1;logic_name=havana;version=1                                                                                                        
## [10] ID=gene:ENSG00000279532;Name=CTB-96E2.6;biotype=TEC;gene_id=ENSG00000279532;havana_gene=OTTHUMG00000179414;havana_version=1;logic_name=havana;version=1                                                                                                                 
## 1623077 Levels: external_name=Ala;logic_name=trnascan ...
dim(ndf)
## [1] 42470     9


  • extracting gene information from attributes field

I don’t know if there is an easier way in Python, but in R we don’t need to build big helper functions around the extraction. We can simply use “gsub()” and define the regular expression for what we want to extract. This makes the R code much shorter and easier to understand! We then drop the original “attributes” column.

To have a look at the dataframe, we use “head()” this time.

Python:

import re

RE_GENE_NAME = re.compile(r'Name=(?P<gene_name>.+?);')
def extract_gene_name(attributes_str):
    res = RE_GENE_NAME.search(attributes_str)
    return res.group('gene_name')


ndf['gene_name'] = ndf.attributes.apply(extract_gene_name)

RE_GENE_ID = re.compile(r'gene_id=(?P<gene_id>ENSG.+?);')
def extract_gene_id(attributes_str):
    res = RE_GENE_ID.search(attributes_str)
    return res.group('gene_id')


ndf['gene_id'] = ndf.attributes.apply(extract_gene_id)


RE_DESC = re.compile('description=(?P<desc>.+?);')
def extract_description(attributes_str):
    res = RE_DESC.search(attributes_str)
    if res is None:
        return ''
    else:
        return res.group('desc')


ndf['desc'] = ndf.attributes.apply(extract_description)

ndf.drop('attributes', axis=1, inplace=True)
ndf.head()

R:

ndf$gene_name <- gsub("(.*Name=)(.*?)(;biotype.*)", "\\2", ndf$attributes)
ndf$gene_id <- gsub("(ID=gene:)(.*?)(;Name.*)", "\\2", ndf$attributes)
ndf$desc <- gsub("(.*description=)(.*?)(;.*)", "\\2", ndf$attributes)

# some genes don't have a description
ndf$desc <- ifelse(grepl("^ID=.*", ndf$desc), "", ndf$desc)

ndf <- subset(ndf, select = -attributes)
head(ndf)
##     seqid         source type  start    end score strand phase gene_name
## 17      1         havana gene  11869  14409     .      +     .   DDX11L1
## 29      1         havana gene  14404  29570     .      -     .    WASH7P
## 72      1         havana gene  52473  53312     .      +     .    OR4G4P
## 75      1         havana gene  62948  63887     .      +     .   OR4G11P
## 78      1 ensembl_havana gene  69091  70008     .      +     .     OR4F5
## 109     1         havana gene 131025 134836     .      +     .    CICP27
##             gene_id
## 17  ENSG00000223972
## 29  ENSG00000227232
## 72  ENSG00000268020
## 75  ENSG00000240361
## 78  ENSG00000186092
## 109 ENSG00000233750
##                                                                                                   desc
## 17                                 DEAD/H-box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:HGNC:37102]
## 29                       WAS protein family homolog 7 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:38034]
## 72   olfactory receptor family 4 subfamily G member 4 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:14822]
## 75  olfactory receptor family 4 subfamily G member 11 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:31276]
## 78              olfactory receptor family 4 subfamily F member 5 [Source:HGNC Symbol%3BAcc:HGNC:14825]
## 109              capicua transcriptional repressor pseudogene 27 [Source:HGNC Symbol%3BAcc:HGNC:48835]

Next, we want to know how many unique gene names and gene IDs there are. As above, we use the “unique()” function together with “shape” (Python) or “length()” (R).

The count table for gene names can again be obtained with R’s “table()” function, even though Zhuyi Xue uses a slightly different approach: he first groups the “gene_name” column, then counts and sorts.

Finally, we can calculate the proportion of genes that appear more than once and we can have a closer look at the SCARNA20 gene.

Python:

ndf.shape
ndf.gene_id.unique().shape
ndf.gene_name.unique().shape

count_df = ndf.groupby('gene_name').count().ix[:, 0].sort_values().ix[::-1]
count_df.head(10)

count_df[count_df > 1].shape
count_df.shape
count_df[count_df > 1].shape[0] / count_df.shape[0]

ndf[ndf.gene_name == 'SCARNA20']

R:

dim(ndf)
## [1] 42470    11
length(unique(ndf$gene_id))
## [1] 42470
length(unique(ndf$gene_name))
## [1] 42387
count_df <- sort(table(ndf$gene_name), decreasing = TRUE)
head(count_df, n = 10)
## 
##        SCARNA20        SCARNA16        SCARNA17        SCARNA11 
##               7               6               5               4 
##        SCARNA15        SCARNA21 Clostridiales-1         SCARNA4 
##               4               4               3               3 
##        ACTR3BP2           AGBL1 
##               2               2
length(count_df[count_df > 1])
## [1] 63
length(count_df)
## [1] 42387
length(count_df[count_df > 1]) / length(count_df)
## [1] 0.001486305
ndf[ndf$gene_name == "SCARNA20", ]
##         seqid  source type     start       end score strand phase
## 179400      1 ensembl gene 171768070 171768175     .      +     .
## 201038      1 ensembl gene 204727991 204728106     .      +     .
## 349204     11 ensembl gene   8555016   8555146     .      +     .
## 718521     14 ensembl gene  63479272  63479413     .      +     .
## 837234     15 ensembl gene  75121536  75121666     .      -     .
## 1039875    17 ensembl gene  28018770  28018907     .      +     .
## 1108216    17 ensembl gene  60231516  60231646     .      -     .
##         gene_name         gene_id
## 179400   SCARNA20 ENSG00000253060
## 201038   SCARNA20 ENSG00000251861
## 349204   SCARNA20 ENSG00000252778
## 718521   SCARNA20 ENSG00000252800
## 837234   SCARNA20 ENSG00000252722
## 1039875  SCARNA20 ENSG00000251818
## 1108216  SCARNA20 ENSG00000252577
##                                                                           desc
## 179400            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 201038            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 349204            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 718521            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 837234            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1039875           Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1108216 small Cajal body-specific RNA 20 [Source:HGNC Symbol%3BAcc:HGNC:32578]
# dplyr:
# ndf %>% filter(gene_name == "SCARNA20")


How Long Is a Typical Gene?

To calculate gene lengths we use the same code as before. R’s summary() is not exactly the same as Python’s describe() but it’s close enough.

Python:

ndf['length'] = ndf.end - ndf.start + 1
ndf.length.describe()

R:

ndf$length <- ndf$end - ndf$start + 1
summary(ndf$length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       8     884    5170   35830   30550 2305000

Now we produce the first plot, showing a histogram of gene lengths. In base R you can’t easily plot a histogram with a logarithmic y-axis (at least not without manually tweaking the hist() output, which isn’t recommended anyway because counts of 0 would become -Inf). But we can do it easily in ggplot2 with “scale_y_log10()”. The code we need for ggplot2 is a bit longer than with matplotlib. We could of course customize our plot further but for now, let’s keep it simple.

Python:

import matplotlib.pyplot as plt

ndf.length.plot(kind='hist', bins=50, logy=True)
plt.show()

R:

ndf %>% ggplot(aes(x = length)) + 
  geom_histogram(bins = 50, fill = "blue") + 
  scale_y_log10()

Now, we subset the dataframe to keep only rows where the “length” column contains values bigger than 2 million and order it by descending gene length. To see the shortest genes, we order the original dataframe and view the first 6 rows. Instead of “sort()”, we use dplyr’s “arrange()” function this time (I didn’t use it before because it can only be applied to dataframes).

Python:

ndf[ndf.length > 2e6].sort_values('length').ix[::-1]
ndf.sort_values('length').head()

R:

ndf %>% filter(length > 2e6) %>% arrange(desc(length))
##   seqid         source type     start       end score strand phase
## 1     7 ensembl_havana gene 146116002 148420998     .      +     .
## 2     9 ensembl_havana gene   8314246  10612723     .      -     .
## 3     X ensembl_havana gene  31097677  33339441     .      -     .
## 4    11 ensembl_havana gene  83455012  85627922     .      -     .
## 5     8 ensembl_havana gene   2935353   4994972     .      -     .
## 6    20 ensembl_havana gene  13995369  16053197     .      +     .
##   gene_name         gene_id
## 1   CNTNAP2 ENSG00000174469
## 2     PTPRD ENSG00000153707
## 3       DMD ENSG00000198947
## 4      DLG2 ENSG00000150672
## 5     CSMD1 ENSG00000183117
## 6   MACROD2 ENSG00000172264
##                                                                                   desc
## 1            contactin associated protein-like 2 [Source:HGNC Symbol%3BAcc:HGNC:13830]
## 2 protein tyrosine phosphatase%2C receptor type D [Source:HGNC Symbol%3BAcc:HGNC:9668]
## 3                                      dystrophin [Source:HGNC Symbol%3BAcc:HGNC:2928]
## 4            discs large MAGUK scaffold protein 2 [Source:HGNC Symbol%3BAcc:HGNC:2901]
## 5               CUB and Sushi multiple domains 1 [Source:HGNC Symbol%3BAcc:HGNC:14026]
## 6                      MACRO domain containing 2 [Source:HGNC Symbol%3BAcc:HGNC:16126]
##    length
## 1 2304997
## 2 2298478
## 3 2241765
## 4 2172911
## 5 2059620
## 6 2057829
head(arrange(ndf, length))
##   seqid source type     start       end score strand phase    gene_name
## 1    14 havana gene  22438547  22438554     .      +     .        TRDD1
## 2    14 havana gene  22439007  22439015     .      +     .        TRDD2
## 3     7 havana gene 142786213 142786224     .      +     .        TRBD1
## 4    14 havana gene  22449113  22449125     .      +     .        TRDD3
## 5     4 havana gene  10238213  10238235     .      -     .   AC006499.9
## 6     3 havana gene 179452395 179452419     .      -     . RP11-145M9.2
##           gene_id
## 1 ENSG00000223997
## 2 ENSG00000237235
## 3 ENSG00000282431
## 4 ENSG00000228985
## 5 ENSG00000271544
## 6 ENSG00000239255
##                                                                      desc
## 1 T cell receptor delta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12254]
## 2 T cell receptor delta diversity 2 [Source:HGNC Symbol%3BAcc:HGNC:12255]
## 3  T cell receptor beta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12158]
## 4 T cell receptor delta diversity 3 [Source:HGNC Symbol%3BAcc:HGNC:12256]
## 5                                                                        
## 6                                                                        
##   length
## 1      8
## 2      9
## 3     12
## 4     13
## 5     23
## 6     25


Gene Distribution Among Chromosomes

The number of genes per chromosome are counted with the “subset()”, “table()” and “sort()” functions as described earlier.

Python:

ndf = ndf[ndf.seqid.isin(chrs)]
chr_gene_counts = ndf.groupby('seqid').count().ix[:, 0].sort_values().ix[::-1]
chr_gene_counts

R:

ndf$seqid <- as.character(ndf$seqid) # convert from factor; otherwise subsetting would keep all (empty) factor levels
ndf <- subset(ndf, seqid %in% chrs)
chr_gene_counts <- sort(table(ndf$seqid), decreasing = TRUE)
chr_gene_counts
## 
##    1    2   11   19   17    3    6   12    7    5   16    X    4    9    8 
## 3902 2806 2561 2412 2280 2204 2154 2140 2106 2002 1881 1852 1751 1659 1628 
##   10   15   14   22   20   13   18   21    Y 
## 1600 1476 1449  996  965  872  766  541  436

To see all genes that are on the mitochondrial chromosome, we subset the first dataframe by two conditions. In both R and Python this is done with the ampersand symbol, but in R we don’t need brackets around the individual conditions.

Python:

df[(df.type == 'gene') & (df.seqid == 'MT')]

R:

subset(df, type == "gene" & seqid == "MT")
##         seqid source type start   end score strand phase
## 2514004    MT  insdc gene   648  1601     .      +     .
## 2514010    MT  insdc gene  1671  3229     .      +     .
## 2514017    MT  insdc gene  3307  4262     .      +     .
## 2514030    MT  insdc gene  4470  5511     .      +     .
## 2514049    MT  insdc gene  5904  7445     .      +     .
## 2514059    MT  insdc gene  7586  8269     .      +     .
## 2514066    MT  insdc gene  8366  8572     .      +     .
## 2514070    MT  insdc gene  8527  9207     .      +     .
## 2514074    MT  insdc gene  9207  9990     .      +     .
## 2514081    MT  insdc gene 10059 10404     .      +     .
## 2514088    MT  insdc gene 10470 10766     .      +     .
## 2514092    MT  insdc gene 10760 12137     .      +     .
## 2514105    MT  insdc gene 12337 14148     .      +     .
## 2514109    MT  insdc gene 14149 14673     .      -     .
## 2514116    MT  insdc gene 14747 15887     .      +     .
##                                                                                                                                                                                                                                                 attributes
## 2514004                                               ID=gene:ENSG00000211459;Name=MT-RNR1;biotype=Mt_rRNA;description=mitochondrially encoded 12S RNA [Source:HGNC Symbol%3BAcc:HGNC:7470];gene_id=ENSG00000211459;logic_name=mt_genbank_import;version=2
## 2514010                                               ID=gene:ENSG00000210082;Name=MT-RNR2;biotype=Mt_rRNA;description=mitochondrially encoded 16S RNA [Source:HGNC Symbol%3BAcc:HGNC:7471];gene_id=ENSG00000210082;logic_name=mt_genbank_import;version=2
## 2514017   ID=gene:ENSG00000198888;Name=MT-ND1;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1 [Source:HGNC Symbol%3BAcc:HGNC:7455];gene_id=ENSG00000198888;logic_name=mt_genbank_import;version=2
## 2514030   ID=gene:ENSG00000198763;Name=MT-ND2;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 2 [Source:HGNC Symbol%3BAcc:HGNC:7456];gene_id=ENSG00000198763;logic_name=mt_genbank_import;version=3
## 2514049                          ID=gene:ENSG00000198804;Name=MT-CO1;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase I [Source:HGNC Symbol%3BAcc:HGNC:7419];gene_id=ENSG00000198804;logic_name=mt_genbank_import;version=2
## 2514059                         ID=gene:ENSG00000198712;Name=MT-CO2;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase II [Source:HGNC Symbol%3BAcc:HGNC:7421];gene_id=ENSG00000198712;logic_name=mt_genbank_import;version=1
## 2514066                                 ID=gene:ENSG00000228253;Name=MT-ATP8;biotype=protein_coding;description=mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol%3BAcc:HGNC:7415];gene_id=ENSG00000228253;logic_name=mt_genbank_import;version=1
## 2514070                                 ID=gene:ENSG00000198899;Name=MT-ATP6;biotype=protein_coding;description=mitochondrially encoded ATP synthase 6 [Source:HGNC Symbol%3BAcc:HGNC:7414];gene_id=ENSG00000198899;logic_name=mt_genbank_import;version=2
## 2514074                        ID=gene:ENSG00000198938;Name=MT-CO3;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase III [Source:HGNC Symbol%3BAcc:HGNC:7422];gene_id=ENSG00000198938;logic_name=mt_genbank_import;version=2
## 2514081   ID=gene:ENSG00000198840;Name=MT-ND3;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 3 [Source:HGNC Symbol%3BAcc:HGNC:7458];gene_id=ENSG00000198840;logic_name=mt_genbank_import;version=2
## 2514088 ID=gene:ENSG00000212907;Name=MT-ND4L;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4L [Source:HGNC Symbol%3BAcc:HGNC:7460];gene_id=ENSG00000212907;logic_name=mt_genbank_import;version=2
## 2514092   ID=gene:ENSG00000198886;Name=MT-ND4;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4 [Source:HGNC Symbol%3BAcc:HGNC:7459];gene_id=ENSG00000198886;logic_name=mt_genbank_import;version=2
## 2514105   ID=gene:ENSG00000198786;Name=MT-ND5;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 5 [Source:HGNC Symbol%3BAcc:HGNC:7461];gene_id=ENSG00000198786;logic_name=mt_genbank_import;version=2
## 2514109   ID=gene:ENSG00000198695;Name=MT-ND6;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 6 [Source:HGNC Symbol%3BAcc:HGNC:7462];gene_id=ENSG00000198695;logic_name=mt_genbank_import;version=2
## 2514116                                    ID=gene:ENSG00000198727;Name=MT-CYB;biotype=protein_coding;description=mitochondrially encoded cytochrome b [Source:HGNC Symbol%3BAcc:HGNC:7427];gene_id=ENSG00000198727;logic_name=mt_genbank_import;version=2

We can get the chromosome lengths from the dataframe as well. We again subset to only the main chromosomes, then drop unwanted columns and order by length.

Python:

gdf = gdf[gdf.seqid.isin(chrs)]
gdf.drop(['start', 'end', 'score', 'strand', 'phase' ,'attributes'], axis=1, inplace=True)
gdf.sort_values('length').ix[::-1]

R:

gdf$seqid <- as.character(gdf$seqid) # convert from factor; otherwise subsetting would keep all (empty) factor levels
gdf <- subset(gdf, as.character(seqid) %in% chrs) %>%
  select(-(start:attributes))
arrange(gdf, desc(length))
##    seqid source       type    length
## 1      1 GRCh38 chromosome 248956422
## 2      2 GRCh38 chromosome 242193529
## 3      3 GRCh38 chromosome 198295559
## 4      4 GRCh38 chromosome 190214555
## 5      5 GRCh38 chromosome 181538259
## 6      6 GRCh38 chromosome 170805979
## 7      7 GRCh38 chromosome 159345973
## 8      X GRCh38 chromosome 156040895
## 9      8 GRCh38 chromosome 145138636
## 10     9 GRCh38 chromosome 138394717
## 11    11 GRCh38 chromosome 135086622
## 12    10 GRCh38 chromosome 133797422
## 13    12 GRCh38 chromosome 133275309
## 14    13 GRCh38 chromosome 114364328
## 15    14 GRCh38 chromosome 107043718
## 16    15 GRCh38 chromosome 101991189
## 17    16 GRCh38 chromosome  90338345
## 18    17 GRCh38 chromosome  83257441
## 19    18 GRCh38 chromosome  80373285
## 20    20 GRCh38 chromosome  64444167
## 21    19 GRCh38 chromosome  58617616
## 22     Y GRCh38 chromosome  54106423
## 23    22 GRCh38 chromosome  50818468
## 24    21 GRCh38 chromosome  46709983
## 25    MT GRCh38 chromosome     16569

Now, we merge the dataframe with the number of genes per chromosome with the dataframe of chromosome lengths. Because R’s “table()” function produces a table object rather than a dataframe, we need to convert it first and define the column names. Then, we use the “merge()” function and point to the name of the column we want to merge by.

Python:

cdf = chr_gene_counts.to_frame(name='gene_count').reset_index()
cdf.head(2)

merged = gdf.merge(cdf, on='seqid')

R:

cdf <- as.data.frame(chr_gene_counts)
colnames(cdf) <- c("seqid", "gene_count")
head(cdf, n = 2)
##   seqid gene_count
## 1     1       3902
## 2     2       2806
merged <- merge(gdf, cdf, by = "seqid")
merged
##    seqid source       type    length gene_count
## 1      1 GRCh38 chromosome 248956422       3902
## 2     10 GRCh38 chromosome 133797422       1600
## 3     11 GRCh38 chromosome 135086622       2561
## 4     12 GRCh38 chromosome 133275309       2140
## 5     13 GRCh38 chromosome 114364328        872
## 6     14 GRCh38 chromosome 107043718       1449
## 7     15 GRCh38 chromosome 101991189       1476
## 8     16 GRCh38 chromosome  90338345       1881
## 9     17 GRCh38 chromosome  83257441       2280
## 10    18 GRCh38 chromosome  80373285        766
## 11    19 GRCh38 chromosome  58617616       2412
## 12     2 GRCh38 chromosome 242193529       2806
## 13    20 GRCh38 chromosome  64444167        965
## 14    21 GRCh38 chromosome  46709983        541
## 15    22 GRCh38 chromosome  50818468        996
## 16     3 GRCh38 chromosome 198295559       2204
## 17     4 GRCh38 chromosome 190214555       1751
## 18     5 GRCh38 chromosome 181538259       2002
## 19     6 GRCh38 chromosome 170805979       2154
## 20     7 GRCh38 chromosome 159345973       2106
## 21     8 GRCh38 chromosome 145138636       1628
## 22     9 GRCh38 chromosome 138394717       1659
## 23     X GRCh38 chromosome 156040895       1852
## 24     Y GRCh38 chromosome  54106423        436

To calculate the correlation between length and gene count, we subset the merged dataframe to those two columns and use the “corr()” (Python) or “cor()” (R) function.

Python:

merged[['length', 'gene_count']].corr()

R:

cor(merged[, c("length", "gene_count")])
##               length gene_count
## length     1.0000000  0.7282208
## gene_count 0.7282208  1.0000000

And now we produce the final plot: a line plot of chromosome length by number of genes per chromosome. For Python we again use matplotlib, and for R the ggplot2 package. Because Zhuyi Xue creates a new dataframe and tweaks the plot somewhat, our ggplot2 code is simpler and tidier here.

Python:

ax = merged[['length', 'gene_count']].sort_values('length').plot(x='length', y='gene_count', style='o-')
# add some margin to both ends of x axis
xlim = ax.get_xlim()
margin = xlim[0] * 0.1
ax.set_xlim([xlim[0] - margin, xlim[1] + margin])
# Label each point on the graph
for (s, x, y) in merged[['seqid', 'length', 'gene_count']].sort_values('length').values:
    ax.text(x, y - 100, str(s))

R:

merged[, c("seqid", "length", "gene_count")] %>%
  arrange(desc(length)) %>%
  ggplot(aes(x = length, y = gene_count, label = seqid)) +
  geom_point(color = "blue") +
  geom_line(color = "blue") +
  geom_text()



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8      codetools_0.2-15 digest_0.6.11    rprojroot_1.1   
##  [5] assertthat_0.1   plyr_1.8.4       grid_3.3.2       R6_2.2.0        
##  [9] gtable_0.2.0     DBI_0.5-1        backports_1.0.4  magrittr_1.5    
## [13] scales_0.4.1     evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0  
## [17] rmarkdown_1.3    labeling_0.3     tools_3.3.2      stringr_1.1.0   
## [21] munsell_0.4.3    yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5 
## [25] knitr_1.15.1     tibble_1.2

Feature Selection in Machine Learning (Breast Cancer Datasets)

2017-01-15 00:00:00 +0000

Machine learning uses so-called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those that are most relevant for the response variable we want to predict. Using as few features as possible will also reduce the complexity of our models, which means they need less time and computing power to run and are easier to understand.

There are several ways to identify how much each feature contributes to the model and to restrict the number of selected features. Here, I am going to examine the effect of feature selection via

  • Correlation,
  • Recursive Feature Elimination (RFE) and
  • Genetic Algorithm (GA)

on Random Forest models.

Additionally, I want to know how different data properties affect the influence of these feature selection methods on the outcome. For that I am using three breast cancer datasets, one of which has few features; the other two are larger but differ in how well the outcome clusters in PCA.

Based on my comparisons of the correlation method, RFE and GA, I would conclude that for Random Forest models

  • removing highly correlated features isn’t a generally suitable method,
  • GA produced the best models in this example but is impractical for everyday use-cases with many features because it takes a lot of time to run with sufficient generations and individuals and
  • data that doesn’t allow a good classification to begin with (because the features are not very distinct between classes) doesn’t necessarily benefit from feature selection.

My conclusions are of course not to be generalized to any ol’ data you are working with: there are many more feature selection methods, and I am only looking at a limited number of datasets and only at their influence on Random Forest models. But even this small example shows how different features and parameters can influence your predictions. With machine learning, there is no “one size fits all”! It is always worthwhile to take a good hard look at your data and get acquainted with its quirks and properties before you even think about models and algorithms. And once you’ve got a feel for your data, investing the time and effort to compare different feature selection methods (or engineered features), model parameters and - finally - different machine learning algorithms can make a big difference!



Breast Cancer Wisconsin (Diagnostic) Dataset

The data I am going to use to explore feature selection methods is the Breast Cancer Wisconsin (Diagnostic) Dataset:

W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.

The data was downloaded from the UC Irvine Machine Learning Repository. The features in these datasets characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses.

Included are three datasets. The first dataset is small with only 9 features; the other two datasets have 30 and 33 features and vary in how strongly the two predictor classes cluster in PCA. I want to explore the effect of different feature selection methods on datasets with these different properties.

But first, I want to get to know the data I am working with.


Breast cancer dataset 1

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The phenotypes for characterisation are:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nucleoli
  • Mitosis
  • Classes, i.e. diagnosis

Missing values are imputed with the mice package.

bc_data <- read.table("breast-cancer-wisconsin.data.txt", header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size", 
                       "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")
bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))
## [1] 16
# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)
##    benign malignant 
##       458       241
str(bc_data)
## 'data.frame':    699 obs. of  10 variables:
##  $ classes                    : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
##  $ clump_thickness            : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ uniformity_of_cell_size    : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ uniformity_of_cell_shape   : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ marginal_adhesion          : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ single_epithelial_cell_size: num  2 7 2 3 2 7 2 2 2 2 ...
##  $ bare_nuclei                : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ bland_chromatin            : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ normal_nucleoli            : num  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitosis                    : num  1 1 1 1 1 1 1 1 5 1 ...


Breast cancer dataset 2

The second dataset looks again at the predictor classes:

  • M: malignant or
  • B: benign breast mass.

The first two columns give:

  • Sample ID
  • Classes, i.e. diagnosis

For each cell nucleus, the following ten characteristics were measured:

  • Radius (mean of all distances from the center to points on the perimeter)
  • Texture (standard deviation of gray-scale values)
  • Perimeter
  • Area
  • Smoothness (local variation in radius lengths)
  • Compactness (perimeter^2 / area - 1.0)
  • Concavity (severity of concave portions of the contour)
  • Concave points (number of concave portions of the contour)
  • Symmetry
  • Fractal dimension (“coastline approximation” - 1)

For each characteristic three measures are given:

  • Mean
  • Standard error
  • Largest/ “worst”

bc_data_2 <- read.table("wdbc.data.txt", header = FALSE, sep = ",")

phenotypes <- rep(c("radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension"), 3)
types <- rep(c("mean", "se", "largest_worst"), each = 10)

colnames(bc_data_2) <- c("ID", "diagnosis", paste(phenotypes, types, sep = "_"))

# how many NAs are in the data
length(which(is.na(bc_data_2)))
## [1] 0
# how many benign and malignant cases are there?
summary(bc_data_2$diagnosis)
##   B   M 
## 357 212
str(bc_data_2)
## 'data.frame':    569 obs. of  32 variables:
##  $ ID                             : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis                      : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean                    : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean                   : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean                 : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean                      : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean                : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean                 : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave_points_mean            : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se                      : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se                     : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se                   : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                        : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se                  : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se                    : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_largest_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_largest_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_largest_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...


Breast cancer dataset 3

The third dataset looks at the predictor classes:

  • R: recurring or
  • N: nonrecurring breast cancer.

The first two columns give:

  • Sample ID
  • Classes, i.e. outcome

For each cell nucleus, the same ten characteristics and measures were given as in dataset 2, plus:

  • Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
  • Tumor size - diameter of the excised tumor in centimeters
  • Lymph node status - number of positive axillary lymph nodes observed at time of surgery

Missing values are imputed with the mice package.

bc_data_3 <- read.table("wpbc.data.txt", header = FALSE, sep = ",")
colnames(bc_data_3) <- c("ID", "outcome", "time", paste(phenotypes, types, sep = "_"), "tumor_size", "lymph_node_status")

bc_data_3[bc_data_3 == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data_3)))
## [1] 4
# impute missing data
library(mice)

bc_data_3[,3:35] <- apply(bc_data_3[,3:35], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data_3[,3:35],  print = FALSE)
bc_data_3 <- cbind(bc_data_3[, 2, drop = FALSE], mice::complete(dataset_impute, 1))

# how many recurring and non-recurring cases are there?
summary(bc_data_3$outcome)
##   N   R 
## 151  47
str(bc_data_3)
## 'data.frame':    198 obs. of  34 variables:
##  $ outcome                        : Factor w/ 2 levels "N","R": 1 1 1 1 2 2 1 2 1 1 ...
##  $ time                           : num  31 61 116 123 27 77 60 77 119 76 ...
##  $ radius_mean                    : num  18 18 21.4 11.4 20.3 ...
##  $ texture_mean                   : num  27.6 10.4 17.4 20.4 14.3 ...
##  $ perimeter_mean                 : num  117.5 122.8 137.5 77.6 135.1 ...
##  $ area_mean                      : num  1013 1001 1373 386 1297 ...
##  $ smoothness_mean                : num  0.0949 0.1184 0.0884 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.104 0.278 0.119 0.284 0.133 ...
##  $ concavity_mean                 : num  0.109 0.3 0.126 0.241 0.198 ...
##  $ concave_points_mean            : num  0.0706 0.1471 0.0818 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.186 0.242 0.233 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0633 0.0787 0.0601 0.0974 0.0588 ...
##  $ radius_se                      : num  0.625 1.095 0.585 0.496 0.757 ...
##  $ texture_se                     : num  1.89 0.905 0.611 1.156 0.781 ...
##  $ perimeter_se                   : num  3.97 8.59 3.93 3.44 5.44 ...
##  $ area_se                        : num  71.5 153.4 82.2 27.2 94.4 ...
##  $ smoothness_se                  : num  0.00443 0.0064 0.00617 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.0142 0.049 0.0345 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0323 0.0537 0.033 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.00985 0.01587 0.01805 0.01867 0.01885 ...
##  $ symmetry_se                    : num  0.0169 0.03 0.0309 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00349 0.00619 0.00504 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  21.6 25.4 24.9 14.9 22.5 ...
##  $ texture_largest_worst          : num  37.1 17.3 21 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  139.7 184.6 159.1 98.9 152.2 ...
##  $ area_largest_worst             : num  1436 2019 1949 568 1575 ...
##  $ smoothness_largest_worst       : num  0.119 0.162 0.119 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.193 0.666 0.345 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.314 0.712 0.341 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.117 0.265 0.203 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.268 0.46 0.433 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.0811 0.1189 0.0907 0.173 0.0768 ...
##  $ tumor_size                     : num  5 3 2.5 2 3.5 2.5 1.5 4 2 6 ...
##  $ lymph_node_status              : num  5 2 0 0 0 0 0 10 1 20 ...


Principal Component Analysis (PCA)

To get an idea about the dimensionality and variance of the datasets, I am first looking at PCA plots for samples and features. The first two principal components (PCs) are the axes that explain the largest share of the variation in the data.

After defining my custom ggplot2 theme, I am creating a function that performs the PCA (using the pcaGoPromoter package), calculates ellipses around the data points (with the ellipse package) and produces the plot with ggplot2. Some of the features in datasets 2 and 3 are not very distinct and overlap in the PCA plots; therefore, I am also plotting hierarchical clustering dendrograms.

# plotting theme

library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "navy", color = "navy", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())
# function for PCA plotting
library(pcaGoPromoter)
library(ellipse)

pca_func <- function(data, groups, title, print_ellipse = TRUE) {
  
  # perform pca and extract scores
  pcaOutput <- pca(data, printDropped = FALSE, scale = TRUE, center = TRUE)
  pcaOutput2 <- as.data.frame(pcaOutput$scores)
  
  # define groups for plotting
  pcaOutput2$groups <- groups
  
  # when plotting samples calculate ellipses for plotting (when plotting features, there are no replicates)
  if (print_ellipse) {
    
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
      data.frame(groups = as.character(t),
                 ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                       centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                       level = 0.95),
                 stringsAsFactors = FALSE)))
    
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.6) + 
      scale_color_brewer(palette = "Set1") +
      labs(title = title,
           color = "",
           fill = "",
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
    
  } else {
    
    # if there are fewer than 10 groups (e.g. the predictor classes) I want to have colors from RColorBrewer
    if (length(unique(pcaOutput2$groups)) <= 10) {
      
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        scale_color_brewer(palette = "Set1") +
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    } else {
      
      # otherwise use the default rainbow colors
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    }
  }
  
  return(plot)
  
}

library(gridExtra)
library(grid)


  • Dataset 1
p1 <- pca_func(data = t(bc_data[, 2:10]), groups = as.character(bc_data$classes), title = "Breast cancer dataset 1: Samples")
p2 <- pca_func(data = bc_data[, 2:10], groups = as.character(colnames(bc_data[, 2:10])), title = "Breast cancer dataset 1: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2)

h_1 <- hclust(dist(t(bc_data[, 2:10]), method = "euclidean"), method = "complete")
plot(h_1)

library(tidyr)
bc_data_gather <- bc_data %>%
  gather(measure, value, clump_thickness:mitosis)

ggplot(data = bc_data_gather, aes(x = value, fill = classes, color = classes)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


  • Dataset 2
p1 <- pca_func(data = t(bc_data_2[, 3:32]), groups = as.character(bc_data_2$diagnosis), title = "Breast cancer dataset 2: Samples")
p2 <- pca_func(data = bc_data_2[, 3:32], groups = as.character(colnames(bc_data_2[, 3:32])), title = "Breast cancer dataset 2: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_2 <- hclust(dist(t(bc_data_2[, 3:32]), method = "euclidean"), method = "complete")
plot(h_2)

bc_data_2_gather <- bc_data_2[, -1] %>%
  gather(measure, value, radius_mean:fractal_dimension_largest_worst)

ggplot(data = bc_data_2_gather, aes(x = value, fill = diagnosis, color = diagnosis)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


  • Dataset 3
p1 <- pca_func(data = t(bc_data_3[, 2:34]), groups = as.character(bc_data_3$outcome), title = "Breast cancer dataset 3: Samples")
p2 <- pca_func(data = bc_data_3[, 2:34], groups = as.character(colnames(bc_data_3[, 2:34])), title = "Breast cancer dataset 3: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_3 <- hclust(dist(t(bc_data_3[,2:34]), method = "euclidean"), method = "complete")
plot(h_3)

Datasets 1 and 2 show a nice separation of benign and malignant masses, so models based on these features will likely be able to predict the classes quite well for most samples. The classes in dataset 3 don’t cluster into distinct groups, so I assume that prediction will not be as accurate for these features.

The features of datasets 2 and 3 don’t cluster very distinctly; many features seem to show similar patterns across samples. Selecting an appropriate subset of features will probably have different effects on the three datasets.

bc_data_3_gather <- bc_data_3 %>%
  gather(measure, value, time:lymph_node_status)

ggplot(data = bc_data_3_gather, aes(x = value, fill = outcome, color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


Feature importance

To get an idea about the features’ respective importance, I’m running Random Forest models with 10 x 10 repeated cross-validation using the caret package. If I wanted to use feature importance to select features for modeling, I would need to perform it on the training data instead of on the complete dataset. But here, I only want to use it to get acquainted with my data. I am again defining a function that estimates the feature importance and produces a plot.

library(caret)
library(doParallel) # parallel processing
registerDoParallel()

# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

feature_imp <- function(model, title) {
  
  # estimate variable importance
  importance <- varImp(model, scale = TRUE)
  
  # prepare dataframes for plotting
  importance_df_1 <- importance$importance
  importance_df_1$group <- rownames(importance_df_1)
  
  importance_df_2 <- importance_df_1
  importance_df_2$Overall <- 0
  
  importance_df <- rbind(importance_df_1, importance_df_2)
  
  plot <- ggplot() +
    geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) +
    geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) +
    theme(legend.position = "none") +
    labs(
      x = "Importance",
      y = "",
      title = title,
      subtitle = "Scaled feature importance",
      caption = "\nDetermined with Random Forest and
      repeated cross validation (10 repeats, 10 times)"
    )
  
  return(plot)
  
}
# train the model
set.seed(27)
imp_1 <- train(classes ~ ., data = bc_data, method = "rf", preProcess = c("scale", "center"), trControl = control)
p1 <- feature_imp(imp_1, title = "Breast cancer dataset 1")
set.seed(27)
imp_2 <- train(diagnosis ~ ., data = bc_data_2[, -1], method = "rf", preProcess = c("scale", "center"), trControl = control)
p2 <- feature_imp(imp_2, title = "Breast cancer dataset 2")
set.seed(27)
imp_3 <- train(outcome ~ ., data = bc_data_3, method = "rf", preProcess = c("scale", "center"), trControl = control)
p3 <- feature_imp(imp_3, title = "Breast cancer dataset 3")
grid.arrange(p1, p2, p3, ncol = 3, widths = c(0.3, 0.35, 0.35))


Feature Selection

Now that I have a general idea about the data, I will run three feature selection methods on all three datasets and compare how they affect the prediction accuracy of a Random Forest model.

Creating train and test data

Before doing anything else with the data, we need to split the datasets into train and test data. Performing feature selection on the whole dataset would lead to prediction bias because information from the test samples would leak into the feature choice; we therefore need to run the whole modeling process on the training data alone!


  • Dataset 1
set.seed(27)
bc_data_index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
bc_data_train <- bc_data[bc_data_index, ]
bc_data_test  <- bc_data[-bc_data_index, ]
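
Since createDataPartition samples within each class, the class proportions of the training partition should mirror the full data. A quick sanity check (my own addition, not part of the original workflow):

# class proportions in the full data vs. the training partition
prop.table(table(bc_data$classes))
prop.table(table(bc_data_train$classes))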


  • Dataset 2
set.seed(27)
bc_data_2_index <- createDataPartition(bc_data_2$diagnosis, p = 0.7, list = FALSE)
bc_data_2_train <- bc_data_2[bc_data_2_index, ]
bc_data_2_test  <- bc_data_2[-bc_data_2_index, ]


  • Dataset 3
set.seed(27)
bc_data_3_index <- createDataPartition(bc_data_3$outcome, p = 0.7, list = FALSE)
bc_data_3_train <- bc_data_3[bc_data_3_index, ]
bc_data_3_test  <- bc_data_3[-bc_data_3_index, ]


Correlation

Often we have features that are highly correlated and thus provide redundant information. By eliminating highly correlated features we can avoid a predictive bias for the information contained in these features. This also shows us that when we want to make statements about the biological/medical importance of specific features, we need to keep in mind that just because they are suitable for predicting an outcome, they are not necessarily causal - they could simply be correlated with causal factors.

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean absolute correlation of each correlated pair.


  • Dataset 1
library(corrplot)

# calculate correlation matrix
corMatMy <- cor(bc_data_train[, -1])
corrplot(corMatMy, order = "hclust")

# apply correlation filter at 0.70
highlyCor <- colnames(bc_data_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 2  and column  3 with corr  0.913 
##   Means:  0.716 vs 0.601 so flagging column 2 
## Compare row 3  and column  6 with corr  0.741 
##   Means:  0.677 vs 0.579 so flagging column 3 
## Compare row 6  and column  7 with corr  0.706 
##   Means:  0.602 vs 0.545 so flagging column 6 
## All correlations <= 0.7
# which variables are flagged for removal?
highlyCor
## [1] "uniformity_of_cell_size"  "uniformity_of_cell_shape"
## [3] "bare_nuclei"
# then we remove these variables
bc_data_cor <- bc_data_train[, which(!colnames(bc_data_train) %in% highlyCor)]

Correlation between features in dataset 1 is generally high, and 3 of the 9 features were flagged for removal (see output above).


  • Dataset 2
corMatMy <- cor(bc_data_2_train[, 3:32])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_2_train[, 3:32])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 7  and column  8 with corr  0.92 
##   Means:  0.579 vs 0.393 so flagging column 7 
## Compare row 8  and column  6 with corr  0.84 
##   Means:  0.548 vs 0.38 so flagging column 8 
## Compare row 6  and column  28 with corr  0.827 
##   Means:  0.536 vs 0.368 so flagging column 6 
## Compare row 28  and column  27 with corr  0.855 
##   Means:  0.506 vs 0.357 so flagging column 28 
## Compare row 27  and column  26 with corr  0.894 
##   Means:  0.46 vs 0.346 so flagging column 27 
## Compare row 23  and column  21 with corr  0.993 
##   Means:  0.454 vs 0.336 so flagging column 23 
## Compare row 21  and column  24 with corr  0.983 
##   Means:  0.419 vs 0.327 so flagging column 21 
## Compare row 26  and column  30 with corr  0.817 
##   Means:  0.402 vs 0.323 so flagging column 26 
## Compare row 24  and column  3 with corr  0.943 
##   Means:  0.383 vs 0.312 so flagging column 24 
## Compare row 3  and column  1 with corr  0.998 
##   Means:  0.347 vs 0.306 so flagging column 3 
## Compare row 1  and column  4 with corr  0.986 
##   Means:  0.302 vs 0.304 so flagging column 4 
## Compare row 1  and column  14 with corr  0.726 
##   Means:  0.264 vs 0.304 so flagging column 14 
## Compare row 13  and column  11 with corr  0.973 
##   Means:  0.32 vs 0.304 so flagging column 13 
## Compare row 18  and column  16 with corr  0.757 
##   Means:  0.388 vs 0.295 so flagging column 18 
## Compare row 16  and column  17 with corr  0.796 
##   Means:  0.404 vs 0.288 so flagging column 16 
## Compare row 9  and column  29 with corr  0.711 
##   Means:  0.343 vs 0.274 so flagging column 9 
## Compare row 17  and column  20 with corr  0.745 
##   Means:  0.306 vs 0.261 so flagging column 17 
## Compare row 5  and column  25 with corr  0.809 
##   Means:  0.311 vs 0.255 so flagging column 5 
## Compare row 30  and column  10 with corr  0.753 
##   Means:  0.288 vs 0.241 so flagging column 30 
## Compare row 22  and column  2 with corr  0.913 
##   Means:  0.243 vs 0.242 so flagging column 22 
## All correlations <= 0.7
highlyCor
##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "concave_points_largest_worst"   
##  [5] "concavity_largest_worst"         "perimeter_largest_worst"        
##  [7] "radius_largest_worst"            "compactness_largest_worst"      
##  [9] "area_largest_worst"              "perimeter_mean"                 
## [11] "perimeter_se"                    "area_mean"                      
## [13] "concave_points_se"               "compactness_se"                 
## [15] "area_se"                         "symmetry_mean"                  
## [17] "concavity_se"                    "smoothness_mean"                
## [19] "fractal_dimension_largest_worst" "texture_largest_worst"
bc_data_2_cor <- bc_data_2_train[, which(!colnames(bc_data_2_train) %in% highlyCor)]

Here, we have more variation between the 30 features: some are highly correlated, while others seem to be very distinct. 20 are flagged for removal (see output above).


  • Dataset 3
corMatMy <- cor(bc_data_3_train[, -1])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_3_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 8  and column  9 with corr  0.898 
##   Means:  0.425 vs 0.286 so flagging column 8 
## Compare row 9  and column  7 with corr  0.714 
##   Means:  0.39 vs 0.277 so flagging column 9 
## Compare row 7  and column  29 with corr  0.753 
##   Means:  0.364 vs 0.271 so flagging column 7 
## Compare row 4  and column  2 with corr  0.996 
##   Means:  0.348 vs 0.264 so flagging column 4 
## Compare row 2  and column  5 with corr  0.993 
##   Means:  0.33 vs 0.259 so flagging column 2 
## Compare row 5  and column  24 with corr  0.921 
##   Means:  0.304 vs 0.255 so flagging column 5 
## Compare row 24  and column  22 with corr  0.985 
##   Means:  0.272 vs 0.252 so flagging column 24 
## Compare row 11  and column  31 with corr  0.83 
##   Means:  0.341 vs 0.247 so flagging column 11 
## Compare row 22  and column  15 with corr  0.773 
##   Means:  0.24 vs 0.242 so flagging column 15 
## Compare row 22  and column  25 with corr  0.989 
##   Means:  0.217 vs 0.242 so flagging column 25 
## Compare row 14  and column  12 with corr  0.975 
##   Means:  0.256 vs 0.243 so flagging column 14 
## Compare row 31  and column  28 with corr  0.71 
##   Means:  0.328 vs 0.238 so flagging column 31 
## Compare row 18  and column  17 with corr  0.812 
##   Means:  0.331 vs 0.229 so flagging column 18 
## Compare row 28  and column  27 with corr  0.84 
##   Means:  0.286 vs 0.219 so flagging column 28 
## Compare row 17  and column  21 with corr  0.839 
##   Means:  0.285 vs 0.212 so flagging column 17 
## Compare row 10  and column  30 with corr  0.766 
##   Means:  0.277 vs 0.204 so flagging column 10 
## Compare row 6  and column  26 with corr  0.754 
##   Means:  0.235 vs 0.198 so flagging column 6 
## Compare row 3  and column  23 with corr  0.858 
##   Means:  0.164 vs 0.195 so flagging column 23 
## All correlations <= 0.7
highlyCor
##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "perimeter_mean"                 
##  [5] "radius_mean"                     "area_mean"                      
##  [7] "perimeter_largest_worst"         "fractal_dimension_mean"         
##  [9] "perimeter_se"                    "area_se"                        
## [11] "area_largest_worst"              "fractal_dimension_largest_worst"
## [13] "concavity_se"                    "concavity_largest_worst"        
## [15] "compactness_se"                  "symmetry_mean"                  
## [17] "smoothness_mean"                 "texture_largest_worst"
bc_data_3_cor <- bc_data_3_train[, which(!colnames(bc_data_3_train) %in% highlyCor)]

The features in dataset 3 look similar: some are highly correlated, others are very different. 18 are flagged for removal (see output above).


Recursive Feature Elimination (RFE)

Another way to choose features is Recursive Feature Elimination. RFE uses a Random Forest algorithm to rank the features by importance and then tests feature subsets of decreasing size, rating each with a cross-validated accuracy score. The subset with the highest score is usually preferred.


  • Dataset 1
# ensure the results are repeatable
set.seed(7)
# define the control using a random forest selection function with cross validation
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# run the RFE algorithm
results_1 <- rfe(x = bc_data_train[, -1], y = bc_data_train$classes, sizes = c(1:9), rfeControl = control)

# chosen features
predictors(results_1)
## [1] "bare_nuclei"                 "uniformity_of_cell_size"    
## [3] "clump_thickness"             "uniformity_of_cell_shape"   
## [5] "bland_chromatin"             "marginal_adhesion"          
## [7] "mitosis"                     "normal_nucleoli"            
## [9] "single_epithelial_cell_size"
# subset the chosen features
bc_data_rfe <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% predictors(results_1)))]
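
To see how the cross-validated accuracy develops with the number of retained features, the rfe object can also be plotted directly (a small usage note, not part of the original workflow):

plot(results_1, type = c("g", "o"))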


  • Dataset 2
set.seed(7)
results_2 <- rfe(x = bc_data_2_train[, -c(1, 2)], y = as.factor(bc_data_2_train$diagnosis), sizes = c(1:30), rfeControl = control)

predictors(results_2)
##  [1] "perimeter_largest_worst"         "area_largest_worst"             
##  [3] "radius_largest_worst"            "concave_points_largest_worst"   
##  [5] "concave_points_mean"             "texture_largest_worst"          
##  [7] "texture_mean"                    "area_se"                        
##  [9] "concavity_largest_worst"         "concavity_mean"                 
## [11] "radius_se"                       "perimeter_mean"                 
## [13] "radius_mean"                     "area_mean"                      
## [15] "perimeter_se"                    "smoothness_largest_worst"       
## [17] "compactness_largest_worst"       "symmetry_largest_worst"         
## [19] "compactness_mean"                "smoothness_mean"                
## [21] "fractal_dimension_largest_worst" "concavity_se"                   
## [23] "symmetry_mean"
bc_data_2_rfe <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% predictors(results_2)))]


  • Dataset 3
set.seed(7)
results_3 <- rfe(x = bc_data_3_train[,-1], y = as.factor(bc_data_3_train$outcome), sizes = c(1:33), rfeControl = control)

predictors(results_3)
bc_data_3_rfe <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% predictors(results_3)))]


Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.
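
To make the mapping from GA terminology to feature selection concrete, here is a minimal toy sketch (my own illustration, not how caret’s gafs works internally): an “individual” is a logical mask over the features, and its fitness would in practice be a resampled model accuracy.

# toy GA step for feature selection; all names here are illustrative
set.seed(27)
n_features <- 9   # e.g. the features of dataset 1
pop_size <- 5

# initial population: one random feature mask per individual (columns)
population <- replicate(pop_size, sample(c(TRUE, FALSE), n_features, replace = TRUE))

# stand-in fitness; in practice e.g. cross-validated accuracy of a
# Random Forest trained on the masked feature set
fitness <- function(mask) if (!any(mask)) 0 else runif(1)

# one generation: score all individuals, pick the two fittest as parents,
# recombine their masks and apply random mutations to get an offspring
scores  <- apply(population, 2, fitness)
parents <- population[, order(scores, decreasing = TRUE)[1:2]]
child   <- ifelse(sample(1:2, n_features, replace = TRUE) == 1, parents[, 1], parents[, 2])
mutated <- runif(n_features) < 0.1
child[mutated] <- !child[mutated]
child   # feature mask of one offspring for the next generation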

For demonstration purposes, I am using only 10 generations with a population size of 5. More iterations with larger populations would of course be preferable, but this takes quite a while to run!

library(dplyr)

ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
                       method = "cv",    # 10 fold cross validation
                       genParallel = TRUE, # Use parallel programming
                       allowParallel = TRUE)


  • Dataset 1
lev <- c("malignant", "benign")     # Set the levels

set.seed(27)
model_1 <- gafs(x = bc_data_train[, -1], y = bc_data_train$classes,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_1) # Plot mean fitness (AUC) by generation

model_1$ga$final
## [1] "clump_thickness"             "uniformity_of_cell_shape"   
## [3] "marginal_adhesion"           "single_epithelial_cell_size"
## [5] "bare_nuclei"                 "bland_chromatin"            
## [7] "normal_nucleoli"             "mitosis"
bc_data_ga <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% model_1$ga$final))]


  • Dataset 2
lev <- c("M", "B")

set.seed(27)
model_2 <- gafs(x = bc_data_2_train[, -c(1, 2)], y = bc_data_2_train$diagnosis,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_2)

model_2$ga$final
##  [1] "radius_mean"                  "texture_mean"                
##  [3] "perimeter_mean"               "area_mean"                   
##  [5] "smoothness_mean"              "compactness_mean"            
##  [7] "concavity_mean"               "symmetry_mean"               
##  [9] "fractal_dimension_mean"       "texture_se"                  
## [11] "perimeter_se"                 "area_se"                     
## [13] "smoothness_se"                "compactness_se"              
## [15] "concavity_se"                 "concave_points_se"           
## [17] "symmetry_se"                  "radius_largest_worst"        
## [19] "texture_largest_worst"        "smoothness_largest_worst"    
## [21] "compactness_largest_worst"    "concave_points_largest_worst"
bc_data_2_ga <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% model_2$ga$final))]


  • Dataset 3
lev <- c("R", "N")

set.seed(27)
model_3 <- gafs(x = bc_data_3_train[, -1], y = bc_data_3_train$outcome,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_3)

model_3$ga$final
##  [1] "time"                            "perimeter_mean"                 
##  [3] "radius_se"                       "texture_se"                     
##  [5] "compactness_se"                  "concavity_se"                   
##  [7] "symmetry_se"                     "fractal_dimension_se"           
##  [9] "radius_largest_worst"            "texture_largest_worst"          
## [11] "perimeter_largest_worst"         "smoothness_largest_worst"       
## [13] "compactness_largest_worst"       "concavity_largest_worst"        
## [15] "concave_points_largest_worst"    "symmetry_largest_worst"         
## [17] "fractal_dimension_largest_worst" "tumor_size"                     
## [19] "lymph_node_status"
bc_data_3_ga <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% model_3$ga$final))]


Model comparison

Now I can compare Random Forest models built with the different feature subsets. (I chose Random Forests because I find it works reasonably well on a large variety of data and because it’s the algorithm I’ve used most often and thus know most about…)

For a more detailed description of building machine learning models see here and here.


All features

  • Dataset 1
set.seed(27)
model_bc_data_all <- train(classes ~ .,
                           data = bc_data_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_1 <- confusionMatrix(predict(model_bc_data_all, bc_data_test[, -1]), bc_data_test$classes)
cm_all_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         5
##   malignant      3        67
##                                          
##                Accuracy : 0.9617         
##                  95% CI : (0.926, 0.9833)
##     No Information Rate : 0.6555         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9147         
##  Mcnemar's Test P-Value : 0.7237         
##                                          
##             Sensitivity : 0.9781         
##             Specificity : 0.9306         
##          Pos Pred Value : 0.9640         
##          Neg Pred Value : 0.9571         
##              Prevalence : 0.6555         
##          Detection Rate : 0.6411         
##    Detection Prevalence : 0.6651         
##       Balanced Accuracy : 0.9543         
##                                          
##        'Positive' Class : benign         
## 
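
For the comparisons below it is handy that individual statistics can be pulled from the confusionMatrix object directly (a small usage note; cm_all_1 is the object created above):

cm_all_1$overall["Accuracy"]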


  • Dataset 2
set.seed(27)
model_bc_data_2_all <- train(diagnosis ~ .,
                           data = bc_data_2_train[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_2 <- confusionMatrix(predict(model_bc_data_2_all, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_all_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 


  • Dataset 3
set.seed(27)
model_bc_data_3_all <- train(outcome ~ .,
                           data = bc_data_3_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_3 <- confusionMatrix(predict(model_bc_data_3_all, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_all_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 


Selected features

Dataset 1

library(gplots)

venn_list <- list(cor = colnames(bc_data_cor)[-1],
                  rfe = colnames(bc_data_rfe)[-1],
                  ga = colnames(bc_data_ga)[-1])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   0   0   0  1
## 010   1   0   1  0
## 011   2   0   1  1
## 100   0   1   0  0
## 101   0   1   0  1
## 110   0   1   1  0
## 111   6   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "clump_thickness"             "marginal_adhesion"          
## [3] "single_epithelial_cell_size" "bland_chromatin"            
## [5] "normal_nucleoli"             "mitosis"                    
## 
## attr(,"intersections")$rfe
## [1] "uniformity_of_cell_size"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "uniformity_of_cell_shape" "bare_nuclei"             
## 
## attr(,"class")
## [1] "venn"

6 of the 9 features were chosen by all three methods; the biggest overlap is between RFE and GA, which share 8 features. RFE retained all 9 features and GA 8 features for modeling, compared to only 6 based on the correlation method.
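
Side note: instead of reading the printed attributes, the intersection lists can be extracted from the venn object directly (a minimal usage sketch based on the attribute names in the output above):

# e.g. the features chosen by all three methods
attr(venn, "intersections")$`cor:rfe:ga`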


  • Correlation
set.seed(27)
model_bc_data_cor <- train(classes ~ .,
                 data = bc_data_cor,
                 method = "rf",
                 preProcess = c("scale", "center"),
                 trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_1 <- confusionMatrix(predict(model_bc_data_cor, bc_data_test[, -1]), bc_data_test$classes)
cm_cor_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       131         6
##   malignant      6        66
##                                         
##                Accuracy : 0.9426        
##                  95% CI : (0.9019, 0.97)
##     No Information Rate : 0.6555        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.8729        
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.9562        
##             Specificity : 0.9167        
##          Pos Pred Value : 0.9562        
##          Neg Pred Value : 0.9167        
##              Prevalence : 0.6555        
##          Detection Rate : 0.6268        
##    Detection Prevalence : 0.6555        
##       Balanced Accuracy : 0.9364        
##                                         
##        'Positive' Class : benign        
## 


  • RFE
set.seed(27)
model_bc_data_rfe <- train(classes ~ .,
                           data = bc_data_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_1 <- confusionMatrix(predict(model_bc_data_rfe, bc_data_test[, -1]), bc_data_test$classes)
cm_rfe_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         4
##   malignant      3        68
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9322, 0.9864)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9256          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9710          
##          Neg Pred Value : 0.9577          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6603          
##       Balanced Accuracy : 0.9613          
##                                           
##        'Positive' Class : benign          
## 


  • GA
set.seed(27)
model_bc_data_ga <- train(classes ~ .,
                           data = bc_data_ga,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_1 <- confusionMatrix(predict(model_bc_data_ga, bc_data_test[, -1]), bc_data_test$classes)
cm_ga_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         3
##   malignant      3        69
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9386, 0.9894)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9364          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9583          
##          Pos Pred Value : 0.9781          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6555          
##       Balanced Accuracy : 0.9682          
##                                           
##        'Positive' Class : benign          
## 


Dataset 2

venn_list <- list(cor = colnames(bc_data_2_cor)[-c(1, 2)],
                  rfe = colnames(bc_data_2_rfe)[-c(1, 2)],
                  ga = colnames(bc_data_2_ga)[-c(1, 2)])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   2   0   0  1
## 010   5   0   1  0
## 011  13   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   2   1   1  0
## 111   2   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "texture_mean"             "smoothness_largest_worst"
## 
## attr(,"intersections")$ga
## [1] "compactness_se"    "concave_points_se"
## 
## attr(,"intersections")$cor
## [1] "radius_mean"          "fractal_dimension_se"
## 
## attr(,"intersections")$rfe
## [1] "concave_points_mean"             "perimeter_largest_worst"        
## [3] "area_largest_worst"              "concavity_largest_worst"        
## [5] "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:ga`
## [1] "fractal_dimension_mean" "texture_se"            
## [3] "smoothness_se"          "symmetry_se"           
## 
## attr(,"intersections")$`rfe:ga`
##  [1] "perimeter_mean"               "area_mean"                   
##  [3] "smoothness_mean"              "compactness_mean"            
##  [5] "concavity_mean"               "symmetry_mean"               
##  [7] "perimeter_se"                 "area_se"                     
##  [9] "concavity_se"                 "radius_largest_worst"        
## [11] "texture_largest_worst"        "compactness_largest_worst"   
## [13] "concave_points_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "radius_se"              "symmetry_largest_worst"
## 
## attr(,"class")
## [1] "venn"

For dataset 2 we see much bigger variation in the chosen features between the three selection methods: only 2 features were chosen by all three, and the biggest overlap is again seen between RFE and GA, followed by correlation and GA. But this time we also see quite a few features that were retained by only one of the three feature selection methods.


  • Correlation
set.seed(27)
model_bc_data_2_cor <- train(diagnosis ~ .,
                           data = bc_data_2_cor[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_2 <- confusionMatrix(predict(model_bc_data_2_cor, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_cor_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   6
##          M   1  57
##                                          
##                Accuracy : 0.9588         
##                  95% CI : (0.917, 0.9833)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9103         
##  Mcnemar's Test P-Value : 0.1306         
##                                          
##             Sensitivity : 0.9907         
##             Specificity : 0.9048         
##          Pos Pred Value : 0.9464         
##          Neg Pred Value : 0.9828         
##              Prevalence : 0.6294         
##          Detection Rate : 0.6235         
##    Detection Prevalence : 0.6588         
##       Balanced Accuracy : 0.9477         
##                                          
##        'Positive' Class : B              
## 


  • RFE
set.seed(27)
model_bc_data_2_rfe <- train(diagnosis ~ .,
                           data = bc_data_2_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_2 <- confusionMatrix(predict(model_bc_data_2_rfe, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_rfe_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 


  • GA
set.seed(27)
model_bc_data_2_ga <- train(diagnosis ~ .,
                          data = bc_data_2_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_2 <- confusionMatrix(predict(model_bc_data_2_ga, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_ga_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   4
##          M   1  59
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9327, 0.9904)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9363          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9365          
##          Pos Pred Value : 0.9636          
##          Neg Pred Value : 0.9833          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6471          
##       Balanced Accuracy : 0.9636          
##                                           
##        'Positive' Class : B               
## 


Dataset 3

venn_list <- list(cor = colnames(bc_data_3_cor)[-1],
                  rfe = colnames(bc_data_3_rfe)[-1],
                  ga = colnames(bc_data_3_ga)[-1])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   5   0   0  1
## 010   6   0   1  0
## 011   2   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   1   1   1  0
## 111   8   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "time"                         "radius_se"                   
## [3] "symmetry_se"                  "smoothness_largest_worst"    
## [5] "concave_points_largest_worst" "symmetry_largest_worst"      
## [7] "tumor_size"                   "lymph_node_status"           
## 
## attr(,"intersections")$ga
## [1] "perimeter_mean"          "compactness_se"         
## [3] "texture_largest_worst"   "perimeter_largest_worst"
## [5] "concavity_largest_worst"
## 
## attr(,"intersections")$cor
## [1] "texture_mean"  "smoothness_se"
## 
## attr(,"intersections")$rfe
## [1] "compactness_mean"       "concavity_mean"        
## [3] "concave_points_mean"    "symmetry_mean"         
## [5] "fractal_dimension_mean" "perimeter_se"          
## 
## attr(,"intersections")$`cor:ga`
## [1] "texture_se"                "fractal_dimension_se"     
## [3] "radius_largest_worst"      "compactness_largest_worst"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "concavity_se"                    "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "concave_points_se"
## 
## attr(,"class")
## [1] "venn"

With the third dataset, there is still some variation among the selected features, but it is not as extreme as with dataset 2. This time, the biggest pairwise overlap is between correlation and GA.
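
The shared features can also be extracted programmatically instead of read off the printed output: gplots stores them in the "intersections" attribute of the venn object. A minimal sketch, using the venn object from above:

# e.g. the features selected by both the correlation filter and the GA
attr(venn, "intersections")[["cor:ga"]]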


  • Correlation
set.seed(27)
model_bc_data_3_cor <- train(outcome ~ .,
                           data = bc_data_3_cor,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_3 <- confusionMatrix(predict(model_bc_data_3_cor, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_cor_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 41  7
##          R  4  7
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.6909, 0.9031)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.2256          
##                                           
##                   Kappa : 0.4439          
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9111          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8542          
##          Neg Pred Value : 0.6364          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6949          
##    Detection Prevalence : 0.8136          
##       Balanced Accuracy : 0.7056          
##                                           
##        'Positive' Class : N               
## 


  • RFE
set.seed(27)
model_bc_data_3_rfe <- train(outcome ~ .,
                           data = bc_data_3_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_3 <- confusionMatrix(predict(model_bc_data_3_rfe, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_rfe_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 40  7
##          R  5  7
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.6717, 0.8902)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.3311          
##                                           
##                   Kappa : 0.409           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8511          
##          Neg Pred Value : 0.5833          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6780          
##    Detection Prevalence : 0.7966          
##       Balanced Accuracy : 0.6944          
##                                           
##        'Positive' Class : N               
## 


  • GA
set.seed(27)
model_bc_data_3_ga <- train(outcome ~ .,
                          data = bc_data_3_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_3 <- confusionMatrix(predict(model_bc_data_3_ga, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_ga_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 


Overall model parameters

To compare the feature selection methods’ influence on the models built from the three datasets, I am going to plot model accuracy, Kappa, precision, sensitivity, and specificity.

# collect the overall performance measures of all 12 models
overall <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$overall,
                      cm_cor_1$overall,
                      cm_rfe_1$overall,
                      cm_ga_1$overall,
                      cm_all_2$overall,
                      cm_cor_2$overall,
                      cm_rfe_2$overall,
                      cm_ga_2$overall,
                      cm_all_3$overall,
                      cm_cor_3$overall,
                      cm_rfe_3$overall,
                      cm_ga_3$overall))

library(tidyr)

# gather Accuracy and Kappa into long format for plotting
overall_gather <- overall[, 1:4] %>%
  gather(measure, value, Accuracy:Kappa)

# collect the class-specific performance measures of the same models
byClass <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$byClass,
                      cm_cor_1$byClass,
                      cm_rfe_1$byClass,
                      cm_ga_1$byClass,
                      cm_all_2$byClass,
                      cm_cor_2$byClass,
                      cm_rfe_2$byClass,
                      cm_ga_2$byClass,
                      cm_all_3$byClass,
                      cm_cor_3$byClass,
                      cm_rfe_3$byClass,
                      cm_ga_3$byClass))

# keep dataset, model, Sensitivity, Specificity and Precision, then gather
byClass_gather <- byClass[, c(1:4, 7)] %>%
  gather(measure, value, Sensitivity:Precision)

# combine both sets of measures and fix the model order for plotting
overall_byClass_gather <- rbind(overall_gather, byClass_gather)
overall_byClass_gather <- within(overall_byClass_gather, model <- factor(model, levels = c("all", "cor", "rfe", "ga")))

ggplot(overall_byClass_gather, aes(x = model, y = value, color = measure, shape = measure, group = measure)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_path(alpha = 0.7) +
  scale_colour_brewer(palette = "Set1") +
  facet_grid(dataset ~ ., scales = "free_y") +
  labs(
    x = "Feature Selection method",
    y = "Value",
    color = "",
    shape = "",
    title = "Comparison of feature selection methods",
    subtitle = "in three breast cancer datasets",
    caption = "\nBreast Cancer Wisconsin (Diagnostic) Data Sets: 1, 2 & 3
    Street et al., 1993;
    all: no feature selection
    cor: features with correlation > 0.7 removed
    rfe: Recursive Feature Elimination
    ga: Genetic Algorithm"
  )

Conclusions

As expected from the PCA of the sample classes, which showed that the two classes (recurrence vs. non-recurrence) did not cluster well, the Random Forest models for dataset 3 had much lower prediction accuracy than the models for datasets 1 and 2.

The correlation method removes features regardless of their importance; in dataset 1, for example, the features with the highest importance were among those flagged as highly correlated. Accordingly, the correlation-filtered models tended to perform worst, although in dataset 3 the differences between methods were small (there, RFE actually gave the lowest test accuracy). RFE and GA tend to retain features with high importance, but importance alone is not a good indicator of whether several features will work well in combination when predicting an outcome.
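
Statements like these are easy to verify by pulling the test set accuracies straight out of the confusion matrix objects; a minimal sketch for dataset 3 (the same pattern works for the cm_*_1 and cm_*_2 objects):

# test set accuracy per feature selection method on dataset 3
sapply(list(cor = cm_cor_3, rfe = cm_rfe_3, ga = cm_ga_3),
       function(cm) cm$overall["Accuracy"])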

Dataset 1 was small, with only 9 features; here, removing highly correlated features was the least successful selection method. RFE and GA both improved the predictions compared to no feature selection, with GA performing best. Dataset 2, with its 30 original features, produced the best models after GA selection. And in dataset 3, which had lower overall accuracy, the choice of feature selection method did not have a strong influence.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] C
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_3.0.1         dplyr_0.5.0          plyr_1.8.4          
##  [4] corrplot_0.77        randomForest_4.6-12  doParallel_1.0.10   
##  [7] iterators_1.0.8      foreach_1.4.3        caret_6.0-73        
## [10] lattice_0.20-34      tidyr_0.6.0          gridExtra_2.2.1     
## [13] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
## [16] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [19] ellipse_0.3-8        ggplot2_2.2.1        mice_2.25           
## [22] Rcpp_0.12.8         
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0         class_7.3-14         assertthat_0.1      
##  [4] rprojroot_1.1        digest_0.6.11        R6_2.2.0            
##  [7] backports_1.0.4      MatrixModels_0.4-1   RSQLite_1.1-1       
## [10] evaluate_0.10        e1071_1.6-7          zlibbioc_1.20.0     
## [13] lazyeval_0.2.0       gdata_2.17.0         minqa_1.2.4         
## [16] SparseM_1.74         car_2.1-4            nloptr_1.0.4        
## [19] rpart_4.1-10         Matrix_1.2-7.1       rmarkdown_1.3       
## [22] labeling_0.3         splines_3.3.2        lme4_1.1-12         
## [25] stringr_1.1.0        munsell_0.4.3        compiler_3.3.2      
## [28] mgcv_1.8-16          htmltools_0.3.5      nnet_7.3-12         
## [31] tibble_1.2           codetools_0.2-15     bitops_1.0-6        
## [34] MASS_7.3-45          ModelMetrics_1.1.0   nlme_3.1-128        
## [37] gtable_0.2.0         DBI_0.5-1            magrittr_1.5        
## [40] scales_0.4.1         KernSmooth_2.23-15   stringi_1.1.2       
## [43] reshape2_1.4.2       RColorBrewer_1.1-2   tools_3.3.2         
## [46] Biobase_2.34.0       pbkrtest_0.4-6       survival_2.40-1     
## [49] yaml_2.1.14          AnnotationDbi_1.36.0 colorspace_1.3-2    
## [52] caTools_1.17.1       memoise_1.0.0        knitr_1.15.1        
## [55] quantreg_5.29