Conditional ggplot2 geoms in functions (QTL plots)

2017-02-12 00:00:00 +0000

When running an analysis, I am usually combining functions from multiple packages. Most of these packages come with their own plotting functions. And while they are certainly convenient in that they allow me to get a quick glance at the data or the output, they all have their own style. If I want to prepare a report, proposal or a paper though, I want all my plots to come from a single cast so that they give a consistent feel to the story I want to tell with my data.

I have grown very fond of ggplot2 for producing plots, because of it’s tidy grammar, flexibility and easy customization. Here, I want to show an example of my own ggplot2 function to produce QTL plots for Karl Broman’s qtl package.

Specifically, I want to show how to incorporate conditional geoms when using ggplot2 in a function call.

I will show how my function can be used with the output from qtl’s scanone() function by following the examples from the package tutorial. The tutorial does an excellent job describing the workflow, so I am not going to explain it here.

library(qtl)


For defining the function and preparing the data for plotting, I am working with ggplot2, ggrepel (for adding text labels that don’t overlap), tidyr and dplyr. This is my function code:

library(ggplot2)
library(ggrepel)
library(tidyr)
library(dplyr)

qtl_plot <- function(input,              # data frame input from scanone
                     mult.pheno = FALSE, # multiple phenotypes?
                     model = "normal",   # model used in scanone
                     chrs = NA,          # chromosomes to display
                     lod = NA,           # LOD threshold
                     rug = FALSE,        # plot marker positions as rug?
                     ncol = NA,          # number of columns for facetting
                     labels = NA         # optional dataframe to plot QTL labels
                     ) {
                     
                  # if we have multiple phenotypes and/or a 2part model, gather input
                  if (mult.pheno & model == "2part") {
                    input <- gather(input, group, lod, grep("pheno", colnames(input)))
                  } else if (mult.pheno) {
                    input <- gather(input, group, lod, grep("pheno", colnames(input)))
                  } else if (model == "2part") {
                    input <- gather(input, method, lod, lod.p.mu:lod.mu)
                  }
    
                  # if not all chromosomes should be displayed, subset input
                  if (!is.na(chrs)[1]) {
                    input <- input[as.character(input$chr) %in% chrs, ]
                  }

                  # if there is more than one LOD column, gather input
                  if (!any(colnames(input) == "lod")) {
                    input$lod <- input[, grep("lod", colnames(input))]
                  }

                  # if no number of columns for facetting is defined, plot all in one row
                  if (is.na(ncol)) {
                    ncol <- length(unique(input$chr))
                  }

                  # if labels are set and there is no name column, set from rownames
                  if (!is.na(labels)[1]) {
                    if (is.null(labels$name)) {
                      labels$name <- rownames(labels)
                    }
                  }

                  # plot input data frame position and LOD score
                  plot <- ggplot(input, aes(x = pos, y = lod)) + {

                    # if LOD threshold is given, plot as horizontal line
                    if (!is.na(lod)[1] & length(lod) == 1) geom_hline(yintercept = lod, linetype = "dashed")
                  } + {

                    if (!is.na(lod)[1] & length(lod) > 1) geom_hline(data = lod, aes(yintercept = lod, linetype = group))
                  } + {

                    # plot rug on bottom, if TRUE
                    if (rug) geom_rug(size = 0.1, sides = "b")
                  } + {

                    # if input has column method but not group, plot line and color by method
                    if (!is.null(input$method) & is.null(input$group)) geom_line(aes(color = method), size = 2, alpha = 0.6)
                  } + {

                    # if input has column group but not method, plot line and color by group
                    if (!is.null(input$group) & is.null(input$method)) geom_line(aes(color = group), size = 2, alpha = 0.6)
                  } + {

                    # if input has columns method and group, plot line and color by method & linetype by group
                    if (!is.null(input$group) & !is.null(input$method)) geom_line(aes(color = method, linetype = group), size = 2, alpha = 0.6)
                  } + {

                    # set linetype, if input has columns method and group
                    if (!is.null(input$group) & !is.null(input$method)) scale_linetype_manual(values = c("solid", "twodash", "dotted"))
                  } + {

                    # if input has neither columns method nor group, plot black line
                    if (is.null(input$group) & is.null(input$method)) geom_line(size = 2, alpha = 0.6)
                  } + {

                    # if QTL positions are given in labels df, plot as point...
                    if (!is.na(labels)[1]) geom_point(data = labels, aes(x = pos, y = lod))
                  } + {

                    # ... and plot name as text with ggrepel to avoid overlapping
                    if (!is.na(labels)[1]) geom_text_repel(data = labels, aes(x = pos, y = lod, label = name), nudge_y = 0.5)
                  } + 
                    # facet by chromosome
                    facet_wrap(~ chr, ncol = ncol, scales = "free_x") +
                    # minimal plotting theme
                    theme_minimal() +
                    # increase strip title size
                    theme(strip.text = element_text(face = "bold", size = 12)) +
                    # use RcolorBrewer palette
                    scale_color_brewer(palette = "Set1") +
                    # Change plot labels
                    labs(x = "Chromosome",
                         y = "LOD",
                         color = "",
                         linetype = "")

                  print(plot)
}


The first example uses the hyper data set and builds a simple QTL model with three modeling functions: the EM algorithm, Haley-Knott regression and multiple imputation. The genome wide LOD threshold is calculated with permutation. Feeding this LOD threshold into the summary output gives us the markers with a significant phenotype association (i.e. the QTL).

data(hyper)

hyper <- est.rf(hyper)

hyper <- calc.genoprob(hyper, step = 1, error.prob = 0.01)
out.em <- scanone(hyper, model = "normal")
out.hk <- scanone(hyper, method = "hk", model = "normal")

hyper <- sim.geno(hyper, step = 2, n.draws = 16, error.prob = 0.01)
out.imp <- scanone(hyper, method = "imp", model = "normal")

operm.hk <- scanone(hyper, pheno.col = 1, model = "normal", method = "hk", n.perm = 1000, verbose = FALSE)

lod_threshold <- summary(operm.hk, alpha = 0.05)
labels_df <- as.data.frame(summary(out.hk, perms = operm.hk, alpha = 0.05, pvalues = TRUE))

The most basic version of my plotting function takes as input a data frame from the scanone() function: It needs to have a column with chromosome information, position and LOD score. Without defining anything else, the function will return a line plot of the LOD scores for each marker position on each chromosome in the input data.

qtl_plot(input = out.hk)

If we want to compare the output from e.g. different QTL models, we can follow the tidyverse principle by combining the output data frames in long format and adding a column with the respective method names. If the input has a method column, it will color the lines accordingly. We can also specify, which chromosomes to display (here, 1 and 4) and give a LOD threshold to plot as a horizontal dashed line. If we want to see each marker position, we can specify rug = TRUE to plot them underneath the line plots.

And we can also plot the markers with a significant phenotype association (the QTL) by adding a data frame with the labeling information to the function call. This data frame needs to contain the chromosome, position and LOD score information. If there is no name column, it will take the row names (which usually give the marker name) as labels.

qtl_plot(input = rbind(data.frame(out.em, method = "EM algorithm"), 
                       data.frame(out.hk, method = "Haley-Knott regression"), 
                       data.frame(out.imp, method = "Multiple imputation")), 
         chrs = c(1, 4), 
         lod = lod_threshold[1], 
         rug = TRUE, 
         labels = labels_df)


The second example I want to show is from the listeria data. This time, a slightly more complex 2-part model is built.

data(listeria)

listeria$pheno$logSurv <- log(listeria$pheno[,1])
listeria <- est.rf(listeria)

newmap <- est.map(listeria, error.prob = 0.01)
listeria <- replace.map(listeria, newmap)

listeria <- calc.errorlod(listeria, error.prob = 0.01)

listeria <- calc.genoprob(listeria, step = 2)
out.2p <- scanone(listeria, pheno.col = 3, model = "2part", upper = TRUE)

operm.2p <- scanone(listeria, model = "2part", pheno.col = 3, upper = TRUE, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.2p, alpha = 0.05)

labels_df <- as.data.frame(summary(out.2p, format = "allpeaks", perms = operm.2p, alpha = 0.05, pvalues = TRUE))

The output from a 2-part model looks different than from the other model functions: Instead of only one LOD score column, we now get three columns named lod.pu.mu, lod.p and lod.mu (check back with the tutorial if you want to know what they mean). The plotting function is built so that if you define the model parameter as “2part”, each of the three LOD columns will be plotted in a different color.

If we also want to plot the QTL labels, we need to prepare the label data frame before feeding it into the plotting function:

p.mu <- labels_df[, 1:4]
colnames(p.mu)[3] <- "lod"
p.mu$name <- "p.mu"

p <- labels_df[, c(1, 5:7)]
colnames(p)[3] <- "lod"
p$name <- "p"

mu <- labels_df[, c(1, 8:10)]
colnames(mu)[3] <- "lod"
mu$name <- "mu"

labels_df <- rbind(p.mu, p, mu)

If the chromosomes are not in the right order, we can reorder the factor labels:

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
out.2p$chr <- factor(out.2p$chr, levels = f)
labels_df$chr <- factor(labels_df$chr, levels = f)

The parameter ncol let’s you define how many facet columns you want to have. Here, I want two rows and three columns to display the six chromosomes I chose for plotting:

qtl_plot(input = out.2p, 
         model = "2part",
         chrs = c(1, 3, 5:6, 13, 15),
         lod = lod_threshold[1], 
         rug = TRUE, 
         ncol = 3,
         labels = labels_df)


Binary models are again treated the same way as e.g. normal models:

y <- listeria$pheno$logSurv
my <- max(y, na.rm = TRUE)
z <- as.numeric(y == my)
y[y == my] <- NA
listeria$pheno$logSurv2 <- y
listeria$pheno$binary <- z

out.p <- scanone(listeria, pheno.col = 5, model = "binary")
out.np1 <- scanone(listeria, model = "np", ties.random = TRUE)
out.np2 <- scanone(listeria, model = "np", ties.random = FALSE)
qtl_plot(input = rbind(data.frame(out.np1, method = "out.2p"),
                       data.frame(out.np1, method = "np1"),
                       data.frame(out.np2, method = "np2")))


The fake.bc data is used to show how to run QTL models with multiple phenotypes simultaneously and how to add additive or interactive covariates.

data(fake.bc)

fake.bc <- calc.genoprob(fake.bc, step = 2.5)
out.nocovar <- scanone(fake.bc, pheno.col = 1:2)

If the parameter mult.pheno is set to TRUE, the plotting function gathers the phenotype columns and plots each phenotype in a different color.

qtl_plot(input = out.nocovar, mult.pheno = TRUE)

If we specify both mult.pheno = TRUE and model = “2part”, the function plots each LOD score for each phenotype in a different color. If you wanted to plot only the lod.p column for example, you could subset the input data frame before feeding it into the plotting function.

out.nocovar.2p <- scanone(fake.bc, pheno.col = 1:2, model = "2part", upper = TRUE)

qtl_plot(input = out.nocovar.2p,
         chrs = c(2, 5),
         mult.pheno = TRUE, 
         model = "2part")


If we run a multiple phenotype model and want to determine the QTL positions, we get a different LOD threshold for each of the phenotypes and respectively, for significantly associated markers as well.

sex <- fake.bc$pheno$sex
out.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex)

out.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex)

out.sexint <- out.icovar - out.acovar

operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, n.perm = 25, verbose = FALSE)
lod_threshold <- summary(operm.icovar, alpha = 0.05)

labels_df <- as.data.frame(summary(out.icovar, perms = operm.icovar, alpha = 0.05, pvalues = TRUE))

This means that, if we want to plot QTL labels for both phenotypes, we first need to prepare the label data frame:

pheno1 <- labels_df[, 1:4]
colnames(pheno1)[3] <- "lod"
pheno1$name <- paste("pheno1", rownames(pheno1), sep = "\n")

pheno2 <- labels_df[, c(1, 2, 5, 6)]
colnames(pheno2)[3] <- "lod"
pheno2$name <- paste("pheno2", rownames(pheno1), sep = "\n")

labels_df <- rbind(pheno1, pheno2)

f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "X")
labels_df$chr <- factor(labels_df$chr, levels = f)

If different methods and groups are detected in the input data frame, the methods will be drawn with different colors, while the groups will have different line types. If we want to plot two LOD thresholds for the two phenotypes, we can give a data frame to the function that has a group column that corresponds to the phenotypes of the input data frame. The LOD threshold will now be plotted in the same line type as the phenotype LOD score plot.

qtl_plot(input = rbind(data.frame(out.acovar, method = "acovar"),
                       data.frame(out.icovar, method = "icovar"),
                       data.frame(out.sexint, method = "sexint")), 
         chrs = c(2, 5, 17),
         mult.pheno = TRUE,
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1:2]),
         labels = labels_df)


For QTL models with covariates, the same principles apply:

seed <- ceiling(runif(1, 0, 10^8))
set.seed(seed)
operm.acovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

set.seed(seed)
operm.icovar <- scanone(fake.bc, pheno.col = 1:2, addcovar = sex, intcovar = sex, method = "hk", n.perm = 100, verbose = FALSE)

operm.sexint <- operm.icovar - operm.acovar

lod_threshold <- summary(operm.sexint, alpha = c(0.05, 0.20))
labels <- as.data.frame(summary(out.sexint, perms = operm.sexint, alpha = 0.1, format = "allpeaks", pvalues = TRUE))
pheno1 <- data.frame(labels[, 1:4])
colnames(pheno1)[3] <- "lod"
pheno1$name <- "p1"

pheno2 <- data.frame(labels[, c(1, 5:7)])
colnames(pheno2)[3] <- "lod"
pheno2$name <- "p2"

labels_df <- rbind(pheno1, pheno2)
qtl_plot(input = out.sexint, 
         mult.pheno = TRUE,
         labels = labels_df,
         chrs = c(2, 7, 17), 
         lod = data.frame(group = c("pheno1", "pheno2"),
                 lod = lod_threshold[1, ]),
         ncol = 3)

hyper <- sim.geno(hyper, step = 2.5, n.draws = 16, err = 0.01)
g <- pull.geno(fill.geno(hyper))[, "D4Mit164"]
out1.c4i <- scanone(hyper, method = "imp", addcovar = g, intcovar = g)
qtl_plot(input = out1.c4i)


And as well for model fitting:

qc <- c(1, 1, 4, 6, 15)
qp <- c(43.3, 78.3, 30.0, 62.5, 18.0)
qtl <- makeqtl(hyper, chr = qc, pos = qp)
myformula <- y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5
out.fq <- fitqtl(hyper, qtl = qtl, formula = myformula, drop = FALSE, get.ests = TRUE)

revqtl <- refineqtl(hyper, qtl = qtl, formula = myformula, verbose = FALSE)
out1.c4r <- addqtl(hyper, qtl = revqtl, formula = y ~ Q3)

out2.c4r <- addpair(hyper, qtl = revqtl, formula = y ~ Q3, chr = c(1, 6, 7, 15), verbose = FALSE)
out.iw4 <- addqtl(hyper, qtl = revqtl, formula = y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q4:Q5 + Q6 + Q5:Q6)
out.1more <- addqtl(hyper, qtl = revqtl, formula = myformula)

qtl2 <- addtoqtl(hyper, revqtl, 7, 53.6)
qtl3 <- dropfromqtl(qtl2, index = 2)
qtl4 <- replaceqtl(hyper, qtl3, index = 1, chr = 1, pos = 50)
qtl5 <- reorderqtl(qtl4, c(1:3, 5, 4))

stepout.i <- stepwiseqtl(hyper, max.qtl = 6, verbose = FALSE)

Here, we also need to prepare the label data frame before the plotting function can handle it:

label_df <- data.frame(name = stepout.i$name, chr = stepout.i$chr, pos = stepout.i$pos)

qtl1 <- attr(stepout.i, "lodprofile")[[1]]
qtl4 <- attr(stepout.i, "lodprofile")[[2]]
qtl6 <- attr(stepout.i, "lodprofile")[[3]]
qtl15 <- attr(stepout.i, "lodprofile")[[4]]

input_df <- rbind(qtl1, qtl4, qtl6, qtl15)
input_df$marker <- rownames(input_df)
input_df$name <- paste(input_df$chr, round(input_df$pos, digits = 1), sep = "@")

label_df$pos <- as.numeric(gsub("(^[0-9]+)(.0$)", "\\1", label_df$pos))
label_df$name <- paste(label_df$chr, round(label_df$pos, digits = 1), sep = "@")

label_df <- left_join(label_df, input_df[, 3:5, drop = FALSE], by = "name")
label_df$name <- paste(label_df$name, label_df$marker, sep = "\n")
qtl_plot(input = input_df,
         labels = label_df)


These examples show only the simple QTL models. The qtl package also gives examples for building scantwo() models. But they can not be plotted in the same way as a scanone() output and thus, won’t be covered here.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.5.0   tidyr_0.6.0   ggrepel_0.6.5 ggplot2_2.2.1 qtl_1.40-8   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8        knitr_1.15.1       magrittr_1.5      
##  [4] munsell_0.4.3      colorspace_1.3-2   R6_2.2.0          
##  [7] stringr_1.1.0      plyr_1.8.4         tools_3.3.2       
## [10] parallel_3.3.2     grid_3.3.2         gtable_0.2.0      
## [13] DBI_0.5-1          htmltools_0.3.5    yaml_2.1.14       
## [16] lazyeval_0.2.0     rprojroot_1.1      digest_0.6.11     
## [19] assertthat_0.1     tibble_1.2         RColorBrewer_1.1-2
## [22] codetools_0.2-15   evaluate_0.10      rmarkdown_1.3     
## [25] labeling_0.3       stringi_1.1.2      scales_0.4.1      
## [28] backports_1.0.4

Scratching the Surface of Gender Biases

2017-02-06 00:00:00 +0000

Today, I want to share my analysis of the World Gender Statistics dataset.

Last week I already introduced my Shiny app, where you can explore 160 measurements for 164 countries over 56 years. This week I’ve included a statistical analysis of these countries and measurements and put some finishing touches on the app.

Just in case anybody needed proof that gender bias is not only a problem of “other countries”, my analysis showed very nicely, that the Western world isn’t (yet, I hope) gender neutral in regards to many statistics!

You can also load the app via Github with the shiny package. There, you can also find the source code for the app.

library(shiny)
runGitHub("ShirinG/WGS_app") 


The data

The data was downloaded from The World Bank’s Open Data project via Kaggle. The main data table shows

  • Country.Name: the name of the country
  • Country.Code: the country’s code
  • Indicator.Name: the name of the variable that this row represents
  • Indicator.Code: a unique id for the variable
  • 1960 - 2016: one column EACH for the value of the variable in each year it was available

Unfortunately, the dataset doesn’t include a column that indicated which two female/ male statistics belong together and there is no consistent naming scheme either. Therefore, I chose to focus on those statistics, where the counterparts could be easily extracted: indicator codes that differed only in containing “.FE” or “.MA”. I then split the subsetted dataframe into female and male and produced a third dataset with the ratios between male and female values. To calculate the ratios, I added 0.001 to each data point, so that zero values would not produce NAs.

The finished datasets were saved as “R.Data” files, so that I could easily load them into the Shiny app.

dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

dataset_fem <- dataset[grep("female", dataset$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Name <- gsub(",", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

dataset_male <- dataset[-grep("female", dataset$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Name <- gsub(",", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".FE", "", dataset_male$Indicator.Code)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")
length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  code <- unique(dataset_fem$Indicator.Name)[n]
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    if (i == 1) {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- cbind(male[i, c(1:4)], diff)
    } else {
      diff <- (male[i, 5:61] + 0.001) / (fem[i, 5:61] + 0.001)
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
    }
  }
  
  if (n == 1) {
    diff_table_bind <- diff_table
  } else {
    diff_table_bind <- rbind(diff_table_bind, diff_table)
  }
}

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")


The world map

The map has been downloaded from the Natural Earth Data website. The country borders were reduced by 200 meters with ArcGIS Pro, so that clicking within any country on the map would show the corresponding country’s border as the nearest point. ArcGIS Pro was also used to convert the map to Mercator projection. The changed shapefiles can be downloaded from my Github repository.

library(rgdal)
library(plyr)
library(scales)

wmap_countries <- readOGR(dsn = "shapefiles/changed_borders", layer = "ne_110m_admin_0_countries_smaller_wm")
wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")
wmap_countries_df_final$adm0_a3 <- as.character(wmap_countries_df_final$adm0_a3)
save(wmap_countries_df_final, file = "wmap_countries_smaller_df_final.RData")

diff_table_bind <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]
save(diff_table_bind, file = "diff_table_bind.RData")

countries <- as.character(unique(diff_table_bind$Country.Name))
save(countries, file = "countries.RData")
library(dplyr)
library(tidyr)

library(ggplot2)

map_theme <- list(theme(panel.grid.minor = element_blank(),
                        panel.grid.major = element_blank(),
                        panel.background = element_blank(),
                        plot.background = element_rect(fill = "white"),
                        panel.border = element_blank(),
                        axis.line = element_blank(),
                        axis.text.x = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks = element_blank(),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size = 18)))

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

colfunc <- colorRampPalette(c("yellow", "red"))


Which countries are most biased?

To explore differences between countries and statistics, I prepared a dataset that contains the first and last non-NA value in each time series of the male/ female ratio data (per statistic and country) to calculate whether the change over time was statistically significant. As for the world map plots, I am using the log2 of the ratios, so that ratios bigger and smaller than 1 can be more easily compared.

You can explore which countries have the strongest bias and the biggest change for each statistic in the app under the tab “Latest ratios”.

year_table <- diff_table_bind[which(diff_table_bind$Country.Code %in% wmap_countries_df_final$adm0_a3), ]

# last non-NA value
last_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[length(na.omit(x))]
  )
last_val_df <- data.frame(unlist(last_val))
last_val_df$year_of_val <- gsub("([0-9]+)(.X)([0-9]+)", "\\3", rownames(last_val_df))
rownames(last_val_df) <- gsub("([0-9]+)(.X)([0-9]+)", "\\1", rownames(last_val_df))

# first non-NA value
first_val <- apply(year_table[, grep("^X[0-9]+$", colnames(year_table))], 1, function(x) 
  na.omit(x)[1]
  )
first_val_df <- data.frame(na.omit(first_val))

year_table_last_val <- cbind(year_table[rownames(last_val_df), 1:3], last_val_df, first_val_df)

year_table_last_val$difference <- (year_table_last_val$unlist.last_val. + 0.001) / (year_table_last_val$na.omit.first_val. + 0.001)
save(year_table_last_val, file = "year_table_last_val.RData")

To identify countries with strongest biases towards either men or women, I calculated the absolute log2 ratio of male/ female values and subsetted the top 10 countries with highest absolute bias. I then counted for how many statistics each country was in the top 10.

for (i in 1:length(measures)) {
  
  m <- measures[i]
  
  subs <- subset(year_table_last_val, Indicator.Name == m)[, c(1, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  subs$measure <- m
    
  if (i == 1) {
    subs_comb <- subs[1:10, ]
    
  } else {
    subs_comb <- rbind(subs_comb, subs[1:10, ])
  }
}
subs_comb_ordered <- arrange(as.data.frame(table(subs_comb$Country.Name)), desc(Freq))

subs_comb$Country.Name <- factor(subs_comb$Country.Name, levels = subs_comb_ordered$Var1)

ggplot(subs_comb, aes(x = Country.Name)) +
  geom_bar(fill = "navyblue", alpha = 0.7) +
  coord_flip() +
  my_theme() +
  labs(title = "Most biased countries",
       subtitle = "For how many statistics were the countries in top 10 with highest absolute bias",
       x = "")

This plot shows that there are quite a few European/ Western countries with high biases in some statistics. For example:

arrange(subs_comb[which(subs_comb$Country.Name %in% c("Germany", "Netherlands", "United States", "United Kingdom")), - 2], 
        Country.Name, desc(abs_ratio))
##      Country.Name abs_ratio
## 1     Netherlands 1.4678833
## 2     Netherlands 1.4674207
## 3     Netherlands 1.3615747
## 4     Netherlands 1.1248210
## 5     Netherlands 0.9993447
## 6     Netherlands 0.6601651
## 7     Netherlands 0.5512176
## 8     Netherlands 0.1847042
## 9         Germany 8.2336197
## 10        Germany 2.0428320
## 11        Germany 1.1641113
## 12        Germany 1.0957438
## 13        Germany 0.9825083
## 14        Germany 0.7356144
## 15        Germany 0.6274894
## 16 United Kingdom 1.0932072
## 17 United Kingdom 0.8659095
## 18 United Kingdom 0.4347145
## 19  United States 1.3212071
##                                                                                    measure
## 1        Educational attainment completed Doctoral or equivalent population 25+ years  (%)
## 2  Educational attainment competed Doctoral or equivalent population 25+  (%) (cumulative)
## 3                              Borrowed from a store by buying on credit  (% age 15+) [w1]
## 4                           Prevalence of stunting height for age  (% of children under 5)
## 5                        Prevalence of underweight weight for age  (% of children under 5)
## 6                                                       Loan in the past year  (% age 15+)
## 7                      Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 8                                                     Progression to secondary school  (%)
## 9                  Prevalence of severe wasting weight for height  (% of children under 5)
## 10                                          Part time employment  (% of total  employment)
## 11                                         Time-related underemployment  (% of employment)
## 12              Educational attainment completed lower secondary population 25+ years  (%)
## 13         Educational attainment completed short-cycle tertiary population 25+ years  (%)
## 14                                       Borrowed from family or friends  (% age 15+) [ts]
## 15                     Coming up with emergency funds: somewhat possible  (% age 15+) [w2]
## 16                               Borrowed for health or medical purposes  (% age 15+) [w2]
## 17       Average number of hours spent on unpaid domestic work (housework and child care) 
## 18                                   Borrowed any money in the past year  (% age 15+) [w2]
## 19                          Prevalence of stunting height for age  (% of children under 5)

So, let’s take a closer look at the countries: I am calculating the sums of absolute male/ female ratios (again log2) for each country and show it on the map:

for (i in 1:length(countries)) {
  
  c <- countries[i]
  
  subs <- subset(year_table_last_val, Country.Name == c)[, c(2, 3, 4)] %>%
    mutate(abs_ratio = abs(log2(unlist.last_val.))) %>%
    arrange(desc(abs_ratio))

  if (i == 1) {
    subs_country <- data.frame(country = c, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio))
    
  } else {
    subs_country <- rbind(subs_country, data.frame(country = c, Country.Code = subs$Country.Code, sum_abs_ratio = sum(subs$abs_ratio)))
  }
}
left_join(subset(wmap_countries_df_final, !continent == "Antarctica"), subs_country, by = c("adm0_a3" = "Country.Code")) %>%
  ggplot(aes(long, lat, group = group, fill = sum_abs_ratio)) +
        coord_equal() +
        map_theme +
        geom_polygon() +
        geom_path(color = "white", size = 0.5) +
        labs(title = "General gender bias per country",
             subtitle = "Sum of absolute gender biases for all statistics",
             fill = "sum of absolute\nlog2 of male / female") +
        scale_fill_gradient2(low = "blue", mid = "blue", high = "red")


Statistical Analysis

You can explore the results for the individual statistics in the app under “Analysis - Plots” and “Analysis - Tests”.


The variables

The world data contains information on each country. I used the following variables to test for statistical significance:

  1. economic status,
  2. income group,
  3. estimated population size,
  4. estimated gross domestic product (GDP) and
  5. continent.

Because not all data was normally distributed I explored non-parametric models, as well as an ANOVA. The plots for each statistic can be explored in my app, so here, I am only showing the results for which measures were statistically significant for the different variables.


Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank test can be used to test whether observations from repeated measures differ statistically from each other. Here, I am using it to test whether the most recent recorded value is significantly different from the first recorded value. Because I am testing the same data multiple times, I am correcting the p-values and only consider them significant with False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).

country.inf <- subset(wmap_countries_df_final[, c(17, 25, 26, 31, 32, 42, 43, 47, 48, 62)], !continent == "Antarctica")
country.inf <- country.inf[!duplicated(country.inf), ]
for (i in 1:length(measures)) {
  m <- measures[i]
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))
  table <- wilcox.test(log2(stats_test$na.omit.first_val.), log2(stats_test$unlist.last_val.), paired = TRUE) 
  
  table <- data.frame(Indicator.Name = paste(m),
                      diff = (median(stats_test$unlist.last_val.) + 0.0001) / (median(stats_test$na.omit.first_val.) + 0.0001),
                      V = table$statistic,
                      p.value = table$p.value,
                      alternative = table$alternative)

  if (i == 1) {
    table_wilcox <- table
  } else {
    table_wilcox <- rbind(table_wilcox, table)
  }
}

table_wilcox$p.adj <- p.adjust(table_wilcox$p.value, method = "fdr")
table_wilcox$significant <- as.factor(ifelse(table_wilcox$p.adj < 0.1, "significant", "non-significant"))

sig <- subset(table_wilcox, significant == "significant")
left_join(subset(year_table_last_val, Indicator.Name %in% sig$Indicator.Name), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  gather(x, y, unlist.last_val., na.omit.first_val.) %>%
  ggplot(aes(x = Indicator.Name, y = log2(y), color = x, fill = x)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  coord_flip() +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference over time",
    subtitle = "Wilcoxon Signed-Rank Test",
    x = "",
    y = "log2(male/ female)")


Kruskal-Wallis Test

The Kruskal-Wallis test is a simple non-parametric statistical test. I used it here to test for each variable of interest whether it has a statistically significant effect on the most recent recorded value. Because I am testing the same data multiple times, I am correcting the p-values and only consider them significant with False Discovery Rate (FDR) below 10% (adjusted p-value < 0.1).

for (i in 1:length(measures)) {
  m <- measures[i]
  
  stats_test <- left_join(subset(year_table_last_val, Indicator.Name == m), country.inf, by = c("Country.Code" = "adm0_a3"))

    stats_test$economy_rank <- as.numeric(as.character(gsub("(^[1-7])(.*)", "\\1", stats_test$economy)))
    stats_test$income_grp_rank <- as.numeric(as.character(gsub("(^[1-5])(.*)", "\\1", stats_test$income_grp)))

    k_economy_rank <- kruskal.test(log2(unlist.last_val.) ~ economy_rank, data = stats_test)
    k_income_rank <- kruskal.test(log2(unlist.last_val.) ~ income_grp_rank, data = stats_test)
    k_pop <- kruskal.test(log2(unlist.last_val.) ~ pop_est, data = stats_test)
    k_gdp <- kruskal.test(log2(unlist.last_val.) ~ gdp_md_est, data = stats_test)
    k_continent <- kruskal.test(log2(unlist.last_val.) ~ continent, data = stats_test)

    kruskal_last_val <- data.frame(
      group = c("economy (rank)", "income_grp (rank)", "pop_est", "gdp_md_est", "continent"),
      p.val = c(k_economy_rank$p.value, k_income_rank$p.value, k_pop$p.value, k_gdp$p.value, k_continent$p.value))

    kruskal_last_val$p.adj <- p.adjust(kruskal_last_val$p.val, method = "fdr")

    if (i == 1) {
      kruskal_last_val_df <- kruskal_last_val[, -2]
      colnames(kruskal_last_val_df)[2] <- paste(m)
    } else {
      pre <- kruskal_last_val[, -2]
      colnames(pre)[2] <- paste(m)
      
      kruskal_last_val_df <- left_join(kruskal_last_val_df, pre, by = "group")
    }
}
eco <- t(kruskal_last_val_df[1, -1])
eco_sig <- eco[which(eco < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(eco_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = economy, fill = economy)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in economy",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

inc <- t(kruskal_last_val_df[2, -1])
inc_sig <- inc[which(inc < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(inc_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = income_grp, fill = income_grp)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in income group",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

cont <- t(kruskal_last_val_df[5, -1])
cont_sig <- cont[which(cont < 0.1), ]
  
left_join(subset(year_table_last_val, Indicator.Name %in% names(cont_sig)), country.inf, by = c("Country.Code" = "adm0_a3")) %>%
  ggplot(aes(y = log2(unlist.last_val.), x = Indicator.Name, color = continent, fill = continent)) +
  coord_flip() +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  my_theme() +
  scale_y_continuous(limits = c(-3, 3)) +
  labs(
    title = "Measures with significant difference in continent",
    subtitle = "Kruskal-Wallis Test",
    x = "",
    y = "log2(male/ female)")

For population and GDP, there were no statistically significant measures.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 tidyr_0.6.1   dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9      digest_0.6.12    rprojroot_1.2    assertthat_0.1  
##  [5] plyr_1.8.4       grid_3.3.2       R6_2.2.0         gtable_0.2.0    
##  [9] DBI_0.5-1        backports_1.0.5  magrittr_1.5     scales_0.4.1    
## [13] evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0   rmarkdown_1.3   
## [17] labeling_0.3     tools_3.3.2      stringr_1.1.0    munsell_0.4.3   
## [21] yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5  knitr_1.15.1    
## [25] tibble_1.2

New features in World Gender Statistics app

2017-01-30 00:00:00 +0000

In my last post, I built a shiny app to explore World Gender Statistics.

To make it a bit nicer and more convenient, I added a few more features:

  • The drop-down menu for Years is now reactive, i.e. it only shows options with data (all NA years are removed)
  • You can click on any country on the map to get information about which country it is, its population size, income group and region
  • Below the world map, you can look at timelines for male vs female values for each country
  • The map is now in Mercator projection

Exploring World Gender Statistics with Shiny

2017-01-29 00:00:00 +0000

This week I explored the World Gender Statistics dataset. You can look at 160 measurements over 56 years with my Shiny app here.


I prepared the data as follows:

Data.csv

  • Country.Name: the name of the country
  • Country.Code: the country’s code
  • Indicator.Name: the name of the variable that this row represents
  • Indicator.Code: a unique id for the variable
  • 1960 - 2016: one column EACH for the value of the variable in each year it was available
dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

dataset_fem <- dataset[grep("female", dataset$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

dataset_male <- dataset[-grep("female", dataset$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".FE", "", dataset_male$Indicator.Code)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")
length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  
  code <- unique(dataset_fem$Indicator.Name)[n]
  
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    
    if (i == 1) {
      
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- cbind(male[i, c(1:4)], diff)
      
    } else {
      
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
      
    }
  }
  
  if (n == 1) {
    
    diff_table_bind <- diff_table
    
  } else {
    
    diff_table_bind <- rbind(diff_table_bind, diff_table)
    
  }
  
}

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
diff_table_bind[diff_table_bind == "NaN"] <- NA

save(diff_table_bind, file = "diff_table_bind.RData")
measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")

Map

The map has been downloaded from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/

library(rgdal)
library(ggplot2)
library(plyr)
library(dplyr)
library(scales)

wmap_countries <- readOGR(dsn="shapefiles", layer="ne_110m_admin_0_countries")

wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")

wmap_countries_df_final$gu_a3 <- as.character(wmap_countries_df_final$gu_a3)

save(wmap_countries_df_final, file = "wmap_countries_df_final.RData")


## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.4 magrittr_1.5    rprojroot_1.1   tools_3.3.2    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.8     stringi_1.1.2  
##  [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.1.0   digest_0.6.11  
## [13] evaluate_0.10

R vs Python - a One-on-One Comparison

2017-01-22 00:00:00 +0000

I’m an avid R user and rarely use anything else for data analysis and visualisations. But while R is my go-to, in some cases, Python might actually be a better alternative.

That’s why I wanted to see how R and Python fare in a one-on-one comparison of an analysis that’s representative of what I would typically work with.

Conclusions

All in all, the Python code could easily be translated into R and was comparable in length and simplicity between the two languages. While Python’s syntax is inherently cleaner/ tidier, we can use packages that implement piping in R and achieve similar results (even though Python’s dot-separated syntax is still much easier to type than using the piping operator of magrittr). For plotting and visualisation I still think that R’s ggplot2 is top of the line in both syntax, customizability and outcome (admittedly, I don’t know matplotlib as well as ggplot)! In terms of functionality, I couldn’t find major differences between the two languages and I would say they both have their merits. For me, R comes more natural as that is what I’m more fluent in, but I can see why Python holds an appeal too and I think I’ll make more of an effort to use both languages in my future projects.



R and Python

R and Python are both open-source languages used in a wide range of data analysis fields. Their main difference is that R has traditionally been geared towards statistical analysis, while Python is more generalist. Both comprise a large collection of packages for specific tasks and have a growing community that offers support and tutorials online.

For a nice overview of the languages respective strengths and weaknesses, check out Datacamp’s nice infographic.


Comparative analysis of genome data

To directly compare R and Python, I am following Zhuyi Xue’s “A Comprehensive Introduction To Your Genome With the SciPy Stack” (with some minor tweaks here and there). He gives a nice introduction to the data, so I will not repeat it here but focus on the comparison between the code lines.

For R, I am working with RStudio, for Python with Anaconda and Spyder.

Python:

For this analysis, we need the SciPy stack with pandas for data wrangling and matplotlib for visualisation. Anaconda already comes with all these packages that we need. Zhuyi Xue first imports pandas as “pd”, so that we can call pandas function by prefixing them with “pd.”.

import pandas as pd

R:

While the code could be replicated with base R, I prefer dplyr for data wrangling and ggplot2 for visualisation.

library(dplyr)
library(ggplot2)

Reading in data

Reading in data is straight forward in both R and Python. The code we need to read in the file is comparable between R and Python. Zhuyi Xue specified “compression = ‘gzip’”, but this would not have been necessary as the default is to infer it from the file suffix.

One big difference in the general syntax we can see here too: Boolean true/false values are written in all caps in R (TRUE/FALSE), while Python uses first letter capitalisation (True/False). The same prinicple applies to “none”.

Python:

df = pd.read_csv('Homo_sapiens.GRCh38.85.gff3.gz', 
                         compression = 'gzip',
                         sep = '\t', 
                         comment = '#', 
                         low_memory = False,
                         header = None, 
                         names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
df.head()

R:

df <- read.csv("~/Documents/Github/Homo_sapiens.GRCh38.85.gff3.gz", 
               header = FALSE, 
               sep = "\t", 
               col.names = c('seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'), 
               comment.char = "#")
head(df)
##   seqid source              type start       end   score strand phase
## 1     1 GRCh38        chromosome     1 248956422       .      .     .
## 2     1      . biological_region 10469     11240 1.3e+03      .     .
## 3     1      . biological_region 10650     10657   0.999      +     .
## 4     1      . biological_region 10655     10657   0.999      -     .
## 5     1      . biological_region 10678     10687   0.999      +     .
## 6     1      . biological_region 10681     10688   0.999      -     .
##                                           attributes
## 1 ID=chromosome:1;Alias=CM000663.2,chr1,NC_000001.11
## 2           external_name=oe %3D 0.79;logic_name=cpg
## 3                                 logic_name=eponine
## 4                                 logic_name=eponine
## 5                                 logic_name=eponine
## 6                                 logic_name=eponine

Examining data

  • listing unique strings

The first thing we want to know from the data is how many unique entries there are in the “seqid” column.

Here, we can already see the main difference in syntax between R and Python:

Python concatenates the object name (“df) with the column name and the functions that we want to run on this column in a sequential manner, separated by a dot. Base R uses nested functions, where each function is called with “function_name()” and we specify columns with “object_name$column_name”.

However, both R and Python can also call columns in a dataframe with “[ ]” with the difference that Python per default subsets data columns df[“seqid”], while R always needs index specifications for rows and columns, separated by “,”: e.g. df[, “seqid”] would subset every row and only the column named “seqid”.

The sequential calling of functions is indeed very handy, it makes the code easier to read and understand than lots of interwoven functions and brackets. But while this isn’t the concept of base R, dplyr uses the magrittr principle of chaining functions with the pipe symbol “%>%”. Even though this symbol isn’t as easily typed, it’s functionality is often superior to base R, especially if you need to run many functions on a dataframe. However, with just or two functions, I usually keep to base R as it is shorter.

Python:

df.seqid.unique() # alternatively: df['seqid'].unique()

R:

unique(df$seqid)
##   [1] 1          10         11         12         13         14        
##   [7] 15         16         17         18         19         2         
##  [13] 20         21         22         3          4          5         
##  [19] 6          7          8          9          GL000008.2 GL000009.2
##  [25] GL000194.1 GL000195.1 GL000205.2 GL000208.1 GL000213.1 GL000214.1
##  [31] GL000216.2 GL000218.1 GL000219.1 GL000220.1 GL000221.1 GL000224.1
##  [37] GL000225.1 GL000226.1 KI270302.1 KI270303.1 KI270304.1 KI270305.1
##  [43] KI270310.1 KI270311.1 KI270312.1 KI270315.1 KI270316.1 KI270317.1
##  [49] KI270320.1 KI270322.1 KI270329.1 KI270330.1 KI270333.1 KI270334.1
##  [55] KI270335.1 KI270336.1 KI270337.1 KI270338.1 KI270340.1 KI270362.1
##  [61] KI270363.1 KI270364.1 KI270366.1 KI270371.1 KI270372.1 KI270373.1
##  [67] KI270374.1 KI270375.1 KI270376.1 KI270378.1 KI270379.1 KI270381.1
##  [73] KI270382.1 KI270383.1 KI270384.1 KI270385.1 KI270386.1 KI270387.1
##  [79] KI270388.1 KI270389.1 KI270390.1 KI270391.1 KI270392.1 KI270393.1
##  [85] KI270394.1 KI270395.1 KI270396.1 KI270411.1 KI270412.1 KI270414.1
##  [91] KI270417.1 KI270418.1 KI270419.1 KI270420.1 KI270422.1 KI270423.1
##  [97] KI270424.1 KI270425.1 KI270429.1 KI270435.1 KI270438.1 KI270442.1
## [103] KI270448.1 KI270465.1 KI270466.1 KI270467.1 KI270468.1 KI270507.1
## [109] KI270508.1 KI270509.1 KI270510.1 KI270511.1 KI270512.1 KI270515.1
## [115] KI270516.1 KI270517.1 KI270518.1 KI270519.1 KI270521.1 KI270522.1
## [121] KI270528.1 KI270529.1 KI270530.1 KI270538.1 KI270539.1 KI270544.1
## [127] KI270548.1 KI270579.1 KI270580.1 KI270581.1 KI270582.1 KI270583.1
## [133] KI270584.1 KI270587.1 KI270588.1 KI270589.1 KI270590.1 KI270591.1
## [139] KI270593.1 KI270706.1 KI270707.1 KI270708.1 KI270709.1 KI270710.1
## [145] KI270711.1 KI270712.1 KI270713.1 KI270714.1 KI270715.1 KI270716.1
## [151] KI270717.1 KI270718.1 KI270719.1 KI270720.1 KI270721.1 KI270722.1
## [157] KI270723.1 KI270724.1 KI270725.1 KI270726.1 KI270727.1 KI270728.1
## [163] KI270729.1 KI270730.1 KI270731.1 KI270732.1 KI270733.1 KI270734.1
## [169] KI270735.1 KI270736.1 KI270737.1 KI270738.1 KI270739.1 KI270740.1
## [175] KI270741.1 KI270742.1 KI270743.1 KI270744.1 KI270745.1 KI270746.1
## [181] KI270747.1 KI270748.1 KI270749.1 KI270750.1 KI270751.1 KI270752.1
## [187] KI270753.1 KI270754.1 KI270755.1 KI270756.1 KI270757.1 MT        
## [193] X          Y         
## 194 Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 ... Y
# with dplyr:
# df %>% select(seqid) %>% unique


  • how many unique seqids are there?

To get the number of unique entries in the “seqid” column, we simply need to append “.shape” to the above Python code. In R, we can either wrap the above R code with the “length()” function or use dplyr and piping instead. If we use the latter, we need to use the “nrow()” function because base R returns a vector, while dplyr returns a dataframe. Here, we can see that with two functions, using dplyr is still a bit more code but it already looks much tidier.

Python:

df.seqid.unique().shape

R:

length(unique(df$seqid))
## [1] 194
# with dplyr:
# df %>% select(seqid) %>% unique %>% nrow


  • counting occurrences

To count the frequencies of each unique entry in the “source” column, we use the “value_counts()” function in Python and the “table()” function in R. These two functions differ in how they sort the output table: value_counts() sorts by decreasing frequency, while R alphabetically sorts the variables. To order the data as in Python, we need to add the “sort()” function to our R code.

Python:

df.source.value_counts()

R:

# table(df$source) is per default ordered alphabetically, if we want it ordered by decreasing counts, like with Python:
sort(table(df$source), decreasing = TRUE)
## 
##         havana ensembl_havana        ensembl              .        mirbase 
##        1441093         745065         228212         182510           4701 
##         GRCh38          insdc 
##            194             74
# dplyr:
# df %>% select(source) %>% table %>% sort(decreasing = TRUE)


How Much of the Genome Is Incomplete?

  • subsetting a dataframe

We are now subsetting our original dataframe and assign it a new object name with “ = “ or “ <- “.

To subset the dataframe to keep only rows which say “GRCh38” in the “source” column, ther are several ways to do this in R: the way that would be directly comparable to how it was done in Python would be to also use the square bracket indexing. However, there are two solutions which are more elegant: 1) base R’s “subset()” or dplyr’s “filter()” function. But with this short example, there is no big difference between the three.

Python’s “shape” gives us the same information as R’s “dim()” function: how many rows and columns our dataframe has.

To preview a random subset of 10 rows from our dataframe, we use Python’s “sample()” and dplyr’s “sample_n()” function.

Python:

gdf = df[df.source == 'GRCh38']

gdf.shape
gdf.sample(10)

R:

gdf <- df[df$source == "GRCh38", ]

# alternatively:
# gdf <- subset(df, source == "GRCh38")

# dplyr:
# gdf <- df %>% filter(source == "GRCh38")

# get number of rows and columns
dim(gdf)
## [1] 194   9
# randomly sample 10 rows for observation
sample_n(gdf, 10)
##              seqid source        type start       end score strand phase
## 2511484 KI270375.1 GRCh38 supercontig     1      2378     .      .     .
## 2511545 KI270466.1 GRCh38 supercontig     1      1233     .      .     .
## 2511473 KI270337.1 GRCh38 supercontig     1      1121     .      .     .
## 2511504 KI270411.1 GRCh38 supercontig     1      2646     .      .     .
## 2511487 KI270379.1 GRCh38 supercontig     1      1045     .      .     .
## 483371          12 GRCh38  chromosome     1 133275309     .      .     .
## 2511494 KI270387.1 GRCh38 supercontig     1      1537     .      .     .
## 674768          14 GRCh38  chromosome     1 107043718     .      .     .
## 2511493 KI270386.1 GRCh38 supercontig     1      1788     .      .     .
## 2511505 KI270412.1 GRCh38 supercontig     1      1179     .      .     .
##                                                           attributes
## 2511484 ID=supercontig:KI270375.1;Alias=chrUn_KI270375v1,NT_187493.1
## 2511545 ID=supercontig:KI270466.1;Alias=chrUn_KI270466v1,NT_187421.1
## 2511473 ID=supercontig:KI270337.1;Alias=chrUn_KI270337v1,NT_187466.1
## 2511504 ID=supercontig:KI270411.1;Alias=chrUn_KI270411v1,NT_187409.1
## 2511487 ID=supercontig:KI270379.1;Alias=chrUn_KI270379v1,NT_187472.1
## 483371          ID=chromosome:12;Alias=CM000674.2,chr12,NC_000012.12
## 2511494 ID=supercontig:KI270387.1;Alias=chrUn_KI270387v1,NT_187475.1
## 674768           ID=chromosome:14;Alias=CM000676.2,chr14,NC_000014.9
## 2511493 ID=supercontig:KI270386.1;Alias=chrUn_KI270386v1,NT_187480.1
## 2511505 ID=supercontig:KI270412.1;Alias=chrUn_KI270412v1,NT_187408.1


  • creating new columns and performing calculations

Now we want to create a new column called “length”. It should contain the gene lengths, i.e. the distance in base pairs between the start and end point on the chromosomes (“start” and “end” columns). In R we don’t need to copy the dataframe first but the rest of the code is very similar: we define a new column name for the dataframe and assign its value by subtracting the values of the “start” from the values of the “end” column (+1). Here, we can see again that in Python we use a dot to define columns, while R uses the Dollar sign.

The sum of all lengths is calculated with the “sum()” function in both languages.

Python:

gdf = gdf.copy()
gdf['length'] = gdf.end - gdf.start + 1

gdf.length.sum()

R:

gdf$length <- gdf$end - gdf$start + 1

sum(gdf$length)
## [1] 3096629726

Next, we want to calculating the proportion of the genome that is not on main chromosome assemblies. For that, we first define a character string with the main chromosomes: 1 to 23, X, Y and MT (mitochondrial chromosome). Defining this string is a bit easier in R.

We will use this string to calculate the sum of lengths of the subsetted dataframe and divide it by the sum of lengths of the whole dataframe. For subsetting, we use the “isin()” function of Python, which corresponds to R’s “%in%”.

Python:

chrs = [str(_) for _ in range(1, 23)] + ['X', 'Y', 'MT']
gdf[-gdf.seqid.isin(chrs)].length.sum() / gdf.length.sum()

# or
gdf[(gdf['type'] == 'supercontig')].length.sum() / gdf.length.sum()

R:

chrs <- c(1:23, "X", "Y", "MT")
sum(subset(gdf, !seqid %in% chrs)$length) / sum(gdf$length)
## [1] 0.003702192


How Many Genes Are There?

Here, we are again using the same functions as above for subsetting the dataframe, asking for its dimensions, printing 10 random lines and asking for the frequencies of each unique item in the “type” column. For the latter I am using base R over dplyr, because it’s faster to type.

Python:

edf = df[df.source.isin(['ensembl', 'havana', 'ensembl_havana'])]
edf.shape

edf.sample(10)
edf.type.value_counts()

R:

edf <- subset(df, source %in% c("ensembl", "havana", "ensembl_havana"))
dim(edf)
## [1] 2414370       9
sample_n(edf, 10)
##         seqid         source            type     start       end score
## 2548897     X        ensembl three_prime_UTR  67723842  67730618     .
## 790670     15 ensembl_havana             CDS  42352030  42352139     .
## 1298369    19         havana             CDS  38942683  38942751     .
## 189728      1 ensembl_havana             CDS 186304108 186304257     .
## 1080599    17         havana            exon  44401337  44401453     .
## 1562968    20 ensembl_havana            exon   2964272   2964350     .
## 1965100     5        ensembl      transcript    144087    144197     .
## 2581554     X         havana            exon 147927670 147928807     .
## 785399     15         havana            exon  40611478  40611511     .
## 1417830     2 ensembl_havana             CDS  71520178  71520208     .
##         strand phase
## 2548897      +     .
## 790670       +     2
## 1298369      -     0
## 189728       +     2
## 1080599      -     .
## 1562968      +     .
## 1965100      -     .
## 2581554      +     .
## 785399       +     .
## 1417830      +     0
##                                                                                                                                                                         attributes
## 2548897                                                                                                                                          Parent=transcript:ENST00000612452
## 790670                                                                                         ID=CDS:ENSP00000326227;Parent=transcript:ENST00000318010;protein_id=ENSP00000326227
## 1298369                                                                                        ID=CDS:ENSP00000472465;Parent=transcript:ENST00000599996;protein_id=ENSP00000472465
## 189728                                                                                         ID=CDS:ENSP00000356453;Parent=transcript:ENST00000367483;protein_id=ENSP00000356453
## 1080599                         Parent=transcript:ENST00000585614;Name=ENSE00002917300;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00002917300;rank=7;version=1
## 1562968                        Parent=transcript:ENST00000216877;Name=ENSE00003603687;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;exon_id=ENSE00003603687;rank=4;version=1
## 1965100 ID=transcript:ENST00000362670;Parent=gene:ENSG00000199540;Name=Y_RNA.43-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000362670;transcript_support_level=NA;version=1
## 2581554                       Parent=transcript:ENST00000620828;Name=ENSE00003724969;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003724969;rank=1;version=1
## 785399                          Parent=transcript:ENST00000399668;Name=ENSE00003659959;constitutive=0;ensembl_end_phase=2;ensembl_phase=1;exon_id=ENSE00003659959;rank=7;version=1
## 1417830                                                                                        ID=CDS:ENSP00000398305;Parent=transcript:ENST00000429174;protein_id=ENSP00000398305
sort(table(edf$type), decreasing = TRUE)
## 
##                          exon                           CDS 
##                       1180596                        704604 
##                five_prime_UTR               three_prime_UTR 
##                        142387                        133938 
##                    transcript                          gene 
##                         96375                         42470 
##          processed_transcript aberrant_processed_transcript 
##                         28228                         26944 
##        NMD_transcript_variant                       lincRNA 
##                         13761                         13247 
##          processed_pseudogene                  lincRNA_gene 
##                         10722                          7533 
##                    pseudogene                           RNA 
##                          3049                          2221 
##                         snRNA                    snRNA_gene 
##                          1909                          1909 
##                        snoRNA                   snoRNA_gene 
##                           956                           944 
##        pseudogenic_transcript                          rRNA 
##                           737                           549 
##                     rRNA_gene                         miRNA 
##                           549                           302 
##                V_gene_segment                J_gene_segment 
##                           216                           158 
##               VD_gene_segment                C_gene_segment 
##                            37                            29 
##             biological_region                    chromosome 
##                             0                             0 
##                    miRNA_gene                       mt_gene 
##                             0                             0 
##                   supercontig 
##                             0

Now we want to subset the dataframe to rows with the attribute “gene” in the “type” column, look at 10 random lines from the “attributes” column and get the dataframe dimensions.

Python:

ndf = edf[edf.type == 'gene']
ndf = ndf.copy()
ndf.sample(10).attributes.values
ndf.shape

R:

ndf <- subset(edf, type == "gene")
sample_n(ndf, 10)$attributes
##  [1] ID=gene:ENSG00000170949;Name=ZNF160;biotype=protein_coding;description=zinc finger protein 160 [Source:HGNC Symbol%3BAcc:HGNC:12948];gene_id=ENSG00000170949;havana_gene=OTTHUMG00000182854;havana_version=3;logic_name=ensembl_havana_gene;version=17                  
##  [2] ID=gene:ENSG00000215088;Name=RPS5P3;biotype=processed_pseudogene;description=ribosomal protein S5 pseudogene 3 [Source:HGNC Symbol%3BAcc:HGNC:10427];gene_id=ENSG00000215088;havana_gene=OTTHUMG00000065532;havana_version=1;logic_name=havana;version=3                
##  [3] ID=gene:ENSG00000249077;Name=RP11-478C1.8;biotype=processed_pseudogene;gene_id=ENSG00000249077;havana_gene=OTTHUMG00000160290;havana_version=1;logic_name=havana;version=1                                                                                              
##  [4] ID=gene:ENSG00000110619;Name=CARS;biotype=protein_coding;description=cysteinyl-tRNA synthetase [Source:HGNC Symbol%3BAcc:HGNC:1493];gene_id=ENSG00000110619;havana_gene=OTTHUMG00000010927;havana_version=9;logic_name=ensembl_havana_gene;version=16                   
##  [5] ID=gene:ENSG00000168010;Name=ATG16L2;biotype=protein_coding;description=autophagy related 16 like 2 [Source:HGNC Symbol%3BAcc:HGNC:25464];gene_id=ENSG00000168010;havana_gene=OTTHUMG00000167961;havana_version=2;logic_name=ensembl_havana_gene;version=10             
##  [6] ID=gene:ENSG00000055957;Name=ITIH1;biotype=protein_coding;description=inter-alpha-trypsin inhibitor heavy chain 1 [Source:HGNC Symbol%3BAcc:HGNC:6166];gene_id=ENSG00000055957;havana_gene=OTTHUMG00000150312;havana_version=9;logic_name=ensembl_havana_gene;version=10
##  [7] ID=gene:ENSG00000258781;Name=RP11-496I2.4;biotype=unprocessed_pseudogene;gene_id=ENSG00000258781;havana_gene=OTTHUMG00000170502;havana_version=2;logic_name=havana;version=2                                                                                            
##  [8] ID=gene:ENSG00000157965;Name=SSX8;biotype=unprocessed_pseudogene;description=SSX family member 8 [Source:HGNC Symbol%3BAcc:HGNC:19654];gene_id=ENSG00000157965;havana_gene=OTTHUMG00000021573;havana_version=4;logic_name=havana;version=11                             
##  [9] ID=gene:ENSG00000273489;Name=RP11-180C16.1;biotype=antisense;gene_id=ENSG00000273489;havana_gene=OTTHUMG00000186336;havana_version=1;logic_name=havana;version=1                                                                                                        
## [10] ID=gene:ENSG00000279532;Name=CTB-96E2.6;biotype=TEC;gene_id=ENSG00000279532;havana_gene=OTTHUMG00000179414;havana_version=1;logic_name=havana;version=1                                                                                                                 
## 1623077 Levels: external_name=Ala;logic_name=trnascan ...
dim(ndf)
## [1] 42470     9


  • extracting gene information from attributes field

I don’t know if there is an easier way in Python but in R we don’t need to create big helper functions around it. We can simply use “gsub()” with defining the regular expression for what we want to extract. This makes the R code much shorter and easier to understand! We then drop the original “attributes” column.

To have a look at the dataframe, we use “head()” this time.

Python:

import re

RE_GENE_NAME = re.compile(r'Name=(?P<gene_name>.+?);')
def extract_gene_name(attributes_str):
    res = RE_GENE_NAME.search(attributes_str)
    return res.group('gene_name')


ndf['gene_name'] = ndf.attributes.apply(extract_gene_name)

RE_GENE_ID = re.compile(r'gene_id=(?P<gene_id>ENSG.+?);')
def extract_gene_id(attributes_str):
    res = RE_GENE_ID.search(attributes_str)
    return res.group('gene_id')


ndf['gene_id'] = ndf.attributes.apply(extract_gene_id)


RE_DESC = re.compile('description=(?P<desc>.+?);')
def extract_description(attributes_str):
    res = RE_DESC.search(attributes_str)
    if res is None:
        return ''
    else:
        return res.group('desc')


ndf['desc'] = ndf.attributes.apply(extract_description)

ndf.drop('attributes', axis=1, inplace=True)
ndf.head()

R:

ndf$gene_name <- gsub("(.*Name=)(.*?)(;biotype.*)", "\\2", ndf$attributes)
ndf$gene_id <- gsub("(ID=gene:)(.*?)(;Name.*)", "\\2", ndf$attributes)
ndf$desc <- gsub("(.*description=)(.*?)(;.*)", "\\2", ndf$attributes)

# some genes don't have a description
ndf$desc <- ifelse(grepl("^ID=.*", ndf$desc), "", ndf$desc)

ndf <- subset(ndf, select = -attributes)
head(ndf)
##     seqid         source type  start    end score strand phase gene_name
## 17      1         havana gene  11869  14409     .      +     .   DDX11L1
## 29      1         havana gene  14404  29570     .      -     .    WASH7P
## 72      1         havana gene  52473  53312     .      +     .    OR4G4P
## 75      1         havana gene  62948  63887     .      +     .   OR4G11P
## 78      1 ensembl_havana gene  69091  70008     .      +     .     OR4F5
## 109     1         havana gene 131025 134836     .      +     .    CICP27
##             gene_id
## 17  ENSG00000223972
## 29  ENSG00000227232
## 72  ENSG00000268020
## 75  ENSG00000240361
## 78  ENSG00000186092
## 109 ENSG00000233750
##                                                                                                   desc
## 17                                 DEAD/H-box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:HGNC:37102]
## 29                       WAS protein family homolog 7 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:38034]
## 72   olfactory receptor family 4 subfamily G member 4 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:14822]
## 75  olfactory receptor family 4 subfamily G member 11 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:31276]
## 78              olfactory receptor family 4 subfamily F member 5 [Source:HGNC Symbol%3BAcc:HGNC:14825]
## 109              capicua transcriptional repressor pseudogene 27 [Source:HGNC Symbol%3BAcc:HGNC:48835]

Next, we want to know how many unique gene names and gene IDs there are. As above we use the “unique()”, “shape()” (Python) and “length()” (R) functions.

The count table for gene names can again be obtained with R’s “table()” function, even though Zhuyi Xue uses a slightly different approach: he first groups the “gene_name” column, then counts and sorts.

Finally, we can calculate the proportion of genes that have more than appear more than once and we can have a closer look at the SCARNA20 gene.

Python:

ndf.shape
ndf.gene_id.unique().shape
ndf.gene_name.unique().shape

count_df = ndf.groupby('gene_name').count().ix[:, 0].sort_values().ix[::-1]
count_df.head(10)

count_df[count_df > 1].shape
count_df.shape
count_df[count_df > 1].shape[0] / count_df.shape[0]

ndf[ndf.gene_name == 'SCARNA20']

R:

dim(ndf)
## [1] 42470    11
length(unique(ndf$gene_id))
## [1] 42470
length(unique(ndf$gene_name))
## [1] 42387
count_df <- sort(table(ndf$gene_name), decreasing = TRUE)
head(count_df, n = 10)
## 
##        SCARNA20        SCARNA16        SCARNA17        SCARNA11 
##               7               6               5               4 
##        SCARNA15        SCARNA21 Clostridiales-1         SCARNA4 
##               4               4               3               3 
##        ACTR3BP2           AGBL1 
##               2               2
length(count_df[count_df > 1])
## [1] 63
length(count_df)
## [1] 42387
length(count_df[count_df > 1]) / length(count_df)
## [1] 0.001486305
ndf[ndf$gene_name == "SCARNA20", ]
##         seqid  source type     start       end score strand phase
## 179400      1 ensembl gene 171768070 171768175     .      +     .
## 201038      1 ensembl gene 204727991 204728106     .      +     .
## 349204     11 ensembl gene   8555016   8555146     .      +     .
## 718521     14 ensembl gene  63479272  63479413     .      +     .
## 837234     15 ensembl gene  75121536  75121666     .      -     .
## 1039875    17 ensembl gene  28018770  28018907     .      +     .
## 1108216    17 ensembl gene  60231516  60231646     .      -     .
##         gene_name         gene_id
## 179400   SCARNA20 ENSG00000253060
## 201038   SCARNA20 ENSG00000251861
## 349204   SCARNA20 ENSG00000252778
## 718521   SCARNA20 ENSG00000252800
## 837234   SCARNA20 ENSG00000252722
## 1039875  SCARNA20 ENSG00000251818
## 1108216  SCARNA20 ENSG00000252577
##                                                                           desc
## 179400            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 201038            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 349204            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 718521            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 837234            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1039875           Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1108216 small Cajal body-specific RNA 20 [Source:HGNC Symbol%3BAcc:HGNC:32578]
# dplyr:
# ndf %>% filter(gene_name == "SCARNA20")


How Long Is a Typical Gene?

To calculate gene lengths we use the same code as before. R’s summary() is not exactly the same as Python’s describe() but it’s close enough.

Python:

ndf['length'] = ndf.end - ndf.start + 1
ndf.length.describe()

R:

ndf$length <- ndf$end - ndf$start + 1
summary(ndf$length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       8     884    5170   35830   30550 2305000

Now we produce the first plot, showing a histogram of gene length. In base R you can’t really plot a histogram with logarithmic y-axis scales (at least not without manually tweaking the hist() output but it isn’t recommended anyway because 0 will become -Inf). But we can do it easily with ggplot2 with “scale_y_log10()”. The code we need for ggplot2 is a bit longer than with matplotlib. We could of course further customize our plot but for now, let’s keep it simple.

Python:

import matplotlib as plt

ndf.length.plot(kind='hist', bins=50, logy=True)
plt.show()

R:

ndf %>% ggplot(aes(x = length)) + 
  geom_histogram(bins = 50, fill = "blue") + 
  scale_y_log10()

Now, we subset the dataframe to keep only rows where the “length” column contains values bigger than 2 million and order it by descending gene length. To see the shortest genes, we order the original dataframe and view the first 6 rows. Instead of “sort()”, we use dplyr’s “arrange()” function this time (I didn’t use it before because it can only be applied to dataframes).

Python:

ndf[ndf.length > 2e6].sort_values('length').ix[::-1]
ndf.sort_values('length').head()

R:

ndf %>% filter(length > 2e6) %>% arrange(desc(length))
##   seqid         source type     start       end score strand phase
## 1     7 ensembl_havana gene 146116002 148420998     .      +     .
## 2     9 ensembl_havana gene   8314246  10612723     .      -     .
## 3     X ensembl_havana gene  31097677  33339441     .      -     .
## 4    11 ensembl_havana gene  83455012  85627922     .      -     .
## 5     8 ensembl_havana gene   2935353   4994972     .      -     .
## 6    20 ensembl_havana gene  13995369  16053197     .      +     .
##   gene_name         gene_id
## 1   CNTNAP2 ENSG00000174469
## 2     PTPRD ENSG00000153707
## 3       DMD ENSG00000198947
## 4      DLG2 ENSG00000150672
## 5     CSMD1 ENSG00000183117
## 6   MACROD2 ENSG00000172264
##                                                                                   desc
## 1            contactin associated protein-like 2 [Source:HGNC Symbol%3BAcc:HGNC:13830]
## 2 protein tyrosine phosphatase%2C receptor type D [Source:HGNC Symbol%3BAcc:HGNC:9668]
## 3                                      dystrophin [Source:HGNC Symbol%3BAcc:HGNC:2928]
## 4            discs large MAGUK scaffold protein 2 [Source:HGNC Symbol%3BAcc:HGNC:2901]
## 5               CUB and Sushi multiple domains 1 [Source:HGNC Symbol%3BAcc:HGNC:14026]
## 6                      MACRO domain containing 2 [Source:HGNC Symbol%3BAcc:HGNC:16126]
##    length
## 1 2304997
## 2 2298478
## 3 2241765
## 4 2172911
## 5 2059620
## 6 2057829
head(arrange(ndf, length))
##   seqid source type     start       end score strand phase    gene_name
## 1    14 havana gene  22438547  22438554     .      +     .        TRDD1
## 2    14 havana gene  22439007  22439015     .      +     .        TRDD2
## 3     7 havana gene 142786213 142786224     .      +     .        TRBD1
## 4    14 havana gene  22449113  22449125     .      +     .        TRDD3
## 5     4 havana gene  10238213  10238235     .      -     .   AC006499.9
## 6     3 havana gene 179452395 179452419     .      -     . RP11-145M9.2
##           gene_id
## 1 ENSG00000223997
## 2 ENSG00000237235
## 3 ENSG00000282431
## 4 ENSG00000228985
## 5 ENSG00000271544
## 6 ENSG00000239255
##                                                                      desc
## 1 T cell receptor delta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12254]
## 2 T cell receptor delta diversity 2 [Source:HGNC Symbol%3BAcc:HGNC:12255]
## 3  T cell receptor beta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12158]
## 4 T cell receptor delta diversity 3 [Source:HGNC Symbol%3BAcc:HGNC:12256]
## 5                                                                        
## 6                                                                        
##   length
## 1      8
## 2      9
## 3     12
## 4     13
## 5     23
## 6     25


Gene Distribution Among Chromosomes

The number of genes per chromosome are counted with the “subset()”, “table()” and “sort()” functions as described earlier.

Python:

ndf = ndf[ndf.seqid.isin(chrs)]
chr_gene_counts = ndf.groupby('seqid').count().ix[:, 0].sort_values().ix[::-1]
chr_gene_counts

R:

ndf$seqid <- as.character(ndf$seqid) # as factors it will subset the dataframe but keep the factor levels
ndf <- subset(ndf, seqid %in% chrs)
chr_gene_counts <- sort(table(ndf$seqid), decreasing = TRUE)
chr_gene_counts
## 
##    1    2   11   19   17    3    6   12    7    5   16    X    4    9    8 
## 3902 2806 2561 2412 2280 2204 2154 2140 2106 2002 1881 1852 1751 1659 1628 
##   10   15   14   22   20   13   18   21    Y 
## 1600 1476 1449  996  965  872  766  541  436

To see all genes that are on mitochondrial chromosome, we subset the first dataframe by two conditions. In both, R and Python this is done with the ampersand symbol but in R we don’t need brackets around the individual conditions.

Python:

df[(df.type == 'gene') & (df.seqid == 'MT')]

R:

subset(df, type == "gene" & seqid == "MT")
##         seqid source type start   end score strand phase
## 2514004    MT  insdc gene   648  1601     .      +     .
## 2514010    MT  insdc gene  1671  3229     .      +     .
## 2514017    MT  insdc gene  3307  4262     .      +     .
## 2514030    MT  insdc gene  4470  5511     .      +     .
## 2514049    MT  insdc gene  5904  7445     .      +     .
## 2514059    MT  insdc gene  7586  8269     .      +     .
## 2514066    MT  insdc gene  8366  8572     .      +     .
## 2514070    MT  insdc gene  8527  9207     .      +     .
## 2514074    MT  insdc gene  9207  9990     .      +     .
## 2514081    MT  insdc gene 10059 10404     .      +     .
## 2514088    MT  insdc gene 10470 10766     .      +     .
## 2514092    MT  insdc gene 10760 12137     .      +     .
## 2514105    MT  insdc gene 12337 14148     .      +     .
## 2514109    MT  insdc gene 14149 14673     .      -     .
## 2514116    MT  insdc gene 14747 15887     .      +     .
##                                                                                                                                                                                                                                                 attributes
## 2514004                                               ID=gene:ENSG00000211459;Name=MT-RNR1;biotype=Mt_rRNA;description=mitochondrially encoded 12S RNA [Source:HGNC Symbol%3BAcc:HGNC:7470];gene_id=ENSG00000211459;logic_name=mt_genbank_import;version=2
## 2514010                                               ID=gene:ENSG00000210082;Name=MT-RNR2;biotype=Mt_rRNA;description=mitochondrially encoded 16S RNA [Source:HGNC Symbol%3BAcc:HGNC:7471];gene_id=ENSG00000210082;logic_name=mt_genbank_import;version=2
## 2514017   ID=gene:ENSG00000198888;Name=MT-ND1;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1 [Source:HGNC Symbol%3BAcc:HGNC:7455];gene_id=ENSG00000198888;logic_name=mt_genbank_import;version=2
## 2514030   ID=gene:ENSG00000198763;Name=MT-ND2;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 2 [Source:HGNC Symbol%3BAcc:HGNC:7456];gene_id=ENSG00000198763;logic_name=mt_genbank_import;version=3
## 2514049                          ID=gene:ENSG00000198804;Name=MT-CO1;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase I [Source:HGNC Symbol%3BAcc:HGNC:7419];gene_id=ENSG00000198804;logic_name=mt_genbank_import;version=2
## 2514059                         ID=gene:ENSG00000198712;Name=MT-CO2;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase II [Source:HGNC Symbol%3BAcc:HGNC:7421];gene_id=ENSG00000198712;logic_name=mt_genbank_import;version=1
## 2514066                                 ID=gene:ENSG00000228253;Name=MT-ATP8;biotype=protein_coding;description=mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol%3BAcc:HGNC:7415];gene_id=ENSG00000228253;logic_name=mt_genbank_import;version=1
## 2514070                                 ID=gene:ENSG00000198899;Name=MT-ATP6;biotype=protein_coding;description=mitochondrially encoded ATP synthase 6 [Source:HGNC Symbol%3BAcc:HGNC:7414];gene_id=ENSG00000198899;logic_name=mt_genbank_import;version=2
## 2514074                        ID=gene:ENSG00000198938;Name=MT-CO3;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase III [Source:HGNC Symbol%3BAcc:HGNC:7422];gene_id=ENSG00000198938;logic_name=mt_genbank_import;version=2
## 2514081   ID=gene:ENSG00000198840;Name=MT-ND3;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 3 [Source:HGNC Symbol%3BAcc:HGNC:7458];gene_id=ENSG00000198840;logic_name=mt_genbank_import;version=2
## 2514088 ID=gene:ENSG00000212907;Name=MT-ND4L;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4L [Source:HGNC Symbol%3BAcc:HGNC:7460];gene_id=ENSG00000212907;logic_name=mt_genbank_import;version=2
## 2514092   ID=gene:ENSG00000198886;Name=MT-ND4;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4 [Source:HGNC Symbol%3BAcc:HGNC:7459];gene_id=ENSG00000198886;logic_name=mt_genbank_import;version=2
## 2514105   ID=gene:ENSG00000198786;Name=MT-ND5;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 5 [Source:HGNC Symbol%3BAcc:HGNC:7461];gene_id=ENSG00000198786;logic_name=mt_genbank_import;version=2
## 2514109   ID=gene:ENSG00000198695;Name=MT-ND6;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 6 [Source:HGNC Symbol%3BAcc:HGNC:7462];gene_id=ENSG00000198695;logic_name=mt_genbank_import;version=2
## 2514116                                    ID=gene:ENSG00000198727;Name=MT-CYB;biotype=protein_coding;description=mitochondrially encoded cytochrome b [Source:HGNC Symbol%3BAcc:HGNC:7427];gene_id=ENSG00000198727;logic_name=mt_genbank_import;version=2

We can get the chromosome lengths from the dataframe as well. We again subset to only the main chromosomes, then drop unwanted columns and order by length.

Python:

gdf = gdf[gdf.seqid.isin(chrs)]
gdf.drop(['start', 'end', 'score', 'strand', 'phase' ,'attributes'], axis=1, inplace=True)
gdf.sort_values('length').ix[::-1]

R:

gdf$seqid <- as.character(gdf$seqid) # as factors it will subset the dataframe but keep the factor levels
gdf <- subset(gdf, as.character(seqid) %in% chrs) %>%
  select(-(start:attributes))
arrange(gdf, desc(length))
##    seqid source       type    length
## 1      1 GRCh38 chromosome 248956422
## 2      2 GRCh38 chromosome 242193529
## 3      3 GRCh38 chromosome 198295559
## 4      4 GRCh38 chromosome 190214555
## 5      5 GRCh38 chromosome 181538259
## 6      6 GRCh38 chromosome 170805979
## 7      7 GRCh38 chromosome 159345973
## 8      X GRCh38 chromosome 156040895
## 9      8 GRCh38 chromosome 145138636
## 10     9 GRCh38 chromosome 138394717
## 11    11 GRCh38 chromosome 135086622
## 12    10 GRCh38 chromosome 133797422
## 13    12 GRCh38 chromosome 133275309
## 14    13 GRCh38 chromosome 114364328
## 15    14 GRCh38 chromosome 107043718
## 16    15 GRCh38 chromosome 101991189
## 17    16 GRCh38 chromosome  90338345
## 18    17 GRCh38 chromosome  83257441
## 19    18 GRCh38 chromosome  80373285
## 20    20 GRCh38 chromosome  64444167
## 21    19 GRCh38 chromosome  58617616
## 22     Y GRCh38 chromosome  54106423
## 23    22 GRCh38 chromosome  50818468
## 24    21 GRCh38 chromosome  46709983
## 25    MT GRCh38 chromosome     16569

Now, we merge the dataframe with the number of genes per chromosome with the dataframe of chromosome lengths. Because R’s “table()” function produces a vector, we need to convert it to a dataframe first and define the column names. Then, we use the “merge()” function and point to the name of the column we want to merge by.

Python:

cdf = chr_gene_counts.to_frame(name='gene_count').reset_index()
cdf.head(2)

merged = gdf.merge(cdf, on='seqid')

R:

cdf <- as.data.frame(chr_gene_counts)
colnames(cdf) <- c("seqid", "gene_count")
head(cdf, n = 2)
##   seqid gene_count
## 1     1       3902
## 2     2       2806
merged <- merge(gdf, cdf, by = "seqid")
merged
##    seqid source       type    length gene_count
## 1      1 GRCh38 chromosome 248956422       3902
## 2     10 GRCh38 chromosome 133797422       1600
## 3     11 GRCh38 chromosome 135086622       2561
## 4     12 GRCh38 chromosome 133275309       2140
## 5     13 GRCh38 chromosome 114364328        872
## 6     14 GRCh38 chromosome 107043718       1449
## 7     15 GRCh38 chromosome 101991189       1476
## 8     16 GRCh38 chromosome  90338345       1881
## 9     17 GRCh38 chromosome  83257441       2280
## 10    18 GRCh38 chromosome  80373285        766
## 11    19 GRCh38 chromosome  58617616       2412
## 12     2 GRCh38 chromosome 242193529       2806
## 13    20 GRCh38 chromosome  64444167        965
## 14    21 GRCh38 chromosome  46709983        541
## 15    22 GRCh38 chromosome  50818468        996
## 16     3 GRCh38 chromosome 198295559       2204
## 17     4 GRCh38 chromosome 190214555       1751
## 18     5 GRCh38 chromosome 181538259       2002
## 19     6 GRCh38 chromosome 170805979       2154
## 20     7 GRCh38 chromosome 159345973       2106
## 21     8 GRCh38 chromosome 145138636       1628
## 22     9 GRCh38 chromosome 138394717       1659
## 23     X GRCh38 chromosome 156040895       1852
## 24     Y GRCh38 chromosome  54106423        436

To calculate the correlation between length and gene count, we subset the merged dataframe to those two columns and use the “corr()” (Python) or “cor()” (R) function.

Python:

merged[['length', 'gene_count']].corr()

R:

cor(merged[, c("length", "gene_count")])
##               length gene_count
## length     1.0000000  0.7282208
## gene_count 0.7282208  1.0000000

And now we produce the final plot: a line plot of chromosome length by number of genes per chromosome. For Python, we again use the matplotlib and for R the ggplot2 packages. Because Zhuyi Xue creates a new dataframe and tweaks the plot somewhat, our ggplot2 code is simpler and tidier here.

Python:

ax = merged[['length', 'gene_count']].sort_values('length').plot(x='length', y='gene_count', style='o-')
# add some margin to both ends of x axis
xlim = ax.get_xlim()
margin = xlim[0] * 0.1
ax.set_xlim([xlim[0] - margin, xlim[1] + margin])
# Label each point on the graph
for (s, x, y) in merged[['seqid', 'length', 'gene_count']].sort_values('length').values:
    ax.text(x, y - 100, str(s))

R:

merged[, c("seqid", "length", "gene_count")] %>%
  arrange(desc(length)) %>%
  ggplot(aes(x = length, y = gene_count, label = seqid)) +
  geom_point(color = "blue") +
  geom_line(color = "blue") +
  geom_text()



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8      codetools_0.2-15 digest_0.6.11    rprojroot_1.1   
##  [5] assertthat_0.1   plyr_1.8.4       grid_3.3.2       R6_2.2.0        
##  [9] gtable_0.2.0     DBI_0.5-1        backports_1.0.4  magrittr_1.5    
## [13] scales_0.4.1     evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0  
## [17] rmarkdown_1.3    labeling_0.3     tools_3.3.2      stringr_1.1.0   
## [21] munsell_0.4.3    yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5 
## [25] knitr_1.15.1     tibble_1.2

Feature Selection in Machine Learning (Breast Cancer Datasets)

2017-01-15 00:00:00 +0000

Machine learning uses so called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those, that are most relevant for the response variable we want to predict. Using as few features as possible will also reduce the complexity of our models, which means it needs less time and computer power to run and is easier to understand.

There are several ways to identify how much each feature contributes to the model and to restrict the number of selected features. Here, I am going to examine the effect of feature selection via

  • Correlation,
  • Recursive Feature Elimination (RFE) and
  • Genetic Algorithm (GA)

on Random Forest models.

Additionally, I want to know how different data properties affect the influence of these feature selection methods on the outcome. For that I am using three breast cancer datasets, one of which has few features; the other two are larger but differ in how well the outcome clusters in PCA.

Based on my comparisons of the correlation method, RFE and GA, I would conclude that for Random Forest models

  • removing highly correlated features isn’t a generally suitable method,
  • GA produced the best models in this example but is impractical for everyday use-cases with many features because it takes a lot of time to run with sufficient generations and individuals and
  • data that doesn’t allow a good classification to begin with (because the features are not very distinct between classes) don’t necessarily benefit from feature selection.

My conclusions are of course not to be generalized to any ol’ data you are working with: There are many more feature selection methods and I am only looking at a limited number of datasets and only at their influence on Random Forest models. But even this small example shows how different features and parameters can influence your predictions. With machine learning, there is no “one size fits all”! It is always worthwhile to take a good hard look at your data, get acquainted with its quirks and properties before you even think about models and algorithms. And once you’ve got a feel for your data, investing the time and effort to compare different feature selection methods (or engineered features), model parameters and - finally - different machine learning algorithms can make a big difference!



Breast Cancer Wisconsin (Diagnostic) Dataset

The data I am going to use to explore feature selection methods is the Breast Cancer Wisconsin (Diagnostic) Dataset:

W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.

The data was downloaded from the UC Irvine Machine Learning Repository. The features in these datasets characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses.

Included are three datasets. The first dataset is small with only 9 features, the other two datasets have 30 and 33 features and vary in how strongly the two predictor classes cluster in PCA. I want to explore the effect of different feature selection methods on datasets with these different properties.

But first, I want to get to know the data I am working with.


Breast cancer dataset 1

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The phenotypes for characterisation are:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis

Missing values are imputed with the mice package.

bc_data <- read.table("breast-cancer-wisconsin.data.txt", header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size", 
                       "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")
bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))
## [1] 16
# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)
##    benign malignant 
##       458       241
str(bc_data)
## 'data.frame':    699 obs. of  10 variables:
##  $ classes                    : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
##  $ clump_thickness            : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ uniformity_of_cell_size    : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ uniformity_of_cell_shape   : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ marginal_adhesion          : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ single_epithelial_cell_size: num  2 7 2 3 2 7 2 2 2 2 ...
##  $ bare_nuclei                : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ bland_chromatin            : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ normal_nucleoli            : num  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitosis                    : num  1 1 1 1 1 1 1 1 5 1 ...


Breast cancer dataset 2

The second dataset looks again at the predictor classes:

  • M: malignant or
  • B: benign breast mass.

The first two columns give:

  • Sample ID
  • Classes, i.e. diagnosis

For each cell nucleus, the following ten characteristics were measured:

  • Radius (mean of all distances from the center to points on the perimeter)
  • Texture (standard deviation of gray-scale values)
  • Perimeter
  • Area
  • Smoothness (local variation in radius lengths)
  • Compactness (perimeter^2 / area - 1.0)
  • Concavity (severity of concave portions of the contour)
  • Concave points (number of concave portions of the contour)
  • Symmetry
  • Fractal dimension (“coastline approximation” - 1)

For each characteristic three measures are given:

  • Mean
  • Standard error
  • Largest/ “worst”
bc_data_2 <- read.table("wdbc.data.txt", header = FALSE, sep = ",")

phenotypes <- rep(c("radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension"), 3)
types <- rep(c("mean", "se", "largest_worst"), each = 10)

colnames(bc_data_2) <- c("ID", "diagnosis", paste(phenotypes, types, sep = "_"))

# how many NAs are in the data
length(which(is.na(bc_data_2)))
## [1] 0
# how many benign and malignant cases are there?
summary(bc_data_2$diagnosis)
##   B   M 
## 357 212
str(bc_data_2)
## 'data.frame':    569 obs. of  32 variables:
##  $ ID                             : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis                      : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean                    : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean                   : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean                 : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean                      : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean                : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean                 : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave_points_mean            : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se                      : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se                     : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se                   : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                        : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se                  : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se                    : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_largest_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_largest_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_largest_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...


Breast cancer dataset 3

The third dataset looks at the predictor classes:

  • R: recurring or
  • N: nonrecurring breast cancer.

The first two columns give:

  • Sample ID
  • Classes, i.e. outcome

For each cell nucleus, the same ten characteristics and measures were given as in dataset 2, plus:

  • Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
  • Tumor size - diameter of the excised tumor in centimeters
  • Lymph node status - number of positive axillary lymph nodes observed at time of surgery

Missing values are imputed with the mice package.

bc_data_3 <- read.table("wpbc.data.txt", header = FALSE, sep = ",")
colnames(bc_data_3) <- c("ID", "outcome", "time", paste(phenotypes, types, sep = "_"), "tumor_size", "lymph_node_status")

bc_data_3[bc_data_3 == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data_3)))
## [1] 4
# impute missing data
library(mice)

bc_data_3[,3:35] <- apply(bc_data_3[,3:35], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data_3[,3:35],  print = FALSE)
bc_data_3 <- cbind(bc_data_3[, 2, drop = FALSE], mice::complete(dataset_impute, 1))

# how many recurring and non-recurring cases are there?
summary(bc_data_3$outcome)
##   N   R 
## 151  47
str(bc_data_3)
## 'data.frame':    198 obs. of  34 variables:
##  $ outcome                        : Factor w/ 2 levels "N","R": 1 1 1 1 2 2 1 2 1 1 ...
##  $ time                           : num  31 61 116 123 27 77 60 77 119 76 ...
##  $ radius_mean                    : num  18 18 21.4 11.4 20.3 ...
##  $ texture_mean                   : num  27.6 10.4 17.4 20.4 14.3 ...
##  $ perimeter_mean                 : num  117.5 122.8 137.5 77.6 135.1 ...
##  $ area_mean                      : num  1013 1001 1373 386 1297 ...
##  $ smoothness_mean                : num  0.0949 0.1184 0.0884 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.104 0.278 0.119 0.284 0.133 ...
##  $ concavity_mean                 : num  0.109 0.3 0.126 0.241 0.198 ...
##  $ concave_points_mean            : num  0.0706 0.1471 0.0818 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.186 0.242 0.233 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0633 0.0787 0.0601 0.0974 0.0588 ...
##  $ radius_se                      : num  0.625 1.095 0.585 0.496 0.757 ...
##  $ texture_se                     : num  1.89 0.905 0.611 1.156 0.781 ...
##  $ perimeter_se                   : num  3.97 8.59 3.93 3.44 5.44 ...
##  $ area_se                        : num  71.5 153.4 82.2 27.2 94.4 ...
##  $ smoothness_se                  : num  0.00443 0.0064 0.00617 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.0142 0.049 0.0345 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0323 0.0537 0.033 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.00985 0.01587 0.01805 0.01867 0.01885 ...
##  $ symmetry_se                    : num  0.0169 0.03 0.0309 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00349 0.00619 0.00504 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  21.6 25.4 24.9 14.9 22.5 ...
##  $ texture_largest_worst          : num  37.1 17.3 21 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  139.7 184.6 159.1 98.9 152.2 ...
##  $ area_largest_worst             : num  1436 2019 1949 568 1575 ...
##  $ smoothness_largest_worst       : num  0.119 0.162 0.119 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.193 0.666 0.345 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.314 0.712 0.341 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.117 0.265 0.203 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.268 0.46 0.433 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.0811 0.1189 0.0907 0.173 0.0768 ...
##  $ tumor_size                     : num  5 3 2.5 2 3.5 2.5 1.5 4 2 6 ...
##  $ lymph_node_status              : num  5 2 0 0 0 0 0 10 1 20 ...


Principal Component Analysis (PCA)

To get an idea about the dimensionality and variance of the datasets, I am first looking at PCA plots for samples and features. The first two principal components (PCs) show the two components that explain the majority of variation in the data.

After defining my custom ggplot2 theme, I am creating a function that performs the PCA (using the pcaGoPromoter package), calculates ellipses of the data points (with the ellipse package) and produces the plot with ggplot2. Some of the features in datasets 2 and 3 are not very distinct and overlap in the PCA plots, therefore I am also plotting hierarchical clustering dendrograms.

# plotting theme

library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "navy", color = "navy", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())
# function for PCA plotting
library(pcaGoPromoter)
library(ellipse)

pca_func <- function(data, groups, title, print_ellipse = TRUE) {
  
  # perform pca and extract scores
  pcaOutput <- pca(data, printDropped = FALSE, scale = TRUE, center = TRUE)
  pcaOutput2 <- as.data.frame(pcaOutput$scores)
  
  # define groups for plotting
  pcaOutput2$groups <- groups
  
  # when plotting samples calculate ellipses for plotting (when plotting features, there are no replicates)
  if (print_ellipse) {
    
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
      data.frame(groups = as.character(t),
                 ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                       centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                       level = 0.95),
                 stringsAsFactors = FALSE)))
    
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.6) + 
      scale_color_brewer(palette = "Set1") +
      labs(title = title,
           color = "",
           fill = "",
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
    
  } else {
    
    # if there are fewer than 10 groups (e.g. the predictor classes) I want to have colors from RColorBrewer
    if (length(unique(pcaOutput2$groups)) <= 10) {
      
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        scale_color_brewer(palette = "Set1") +
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    } else {
      
      # otherwise use the default rainbow colors
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    }
  }
  
  return(plot)
  
}

library(gridExtra)
library(grid)


  • Dataset 1
p1 <- pca_func(data = t(bc_data[, 2:10]), groups = as.character(bc_data$classes), title = "Breast cancer dataset 1: Samples")
p2 <- pca_func(data = bc_data[, 2:10], groups = as.character(colnames(bc_data[, 2:10])), title = "Breast cancer dataset 1: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2)

h_1 <- hclust(dist(t(bc_data[, 2:10]), method = "euclidean"), method = "complete")
plot(h_1)

library(tidyr)
bc_data_gather <- bc_data %>%
  gather(measure, value, clump_thickness:mitosis)

ggplot(data = bc_data_gather, aes(x = value, fill = classes, color = classes)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


  • Dataset 2
p1 <- pca_func(data = t(bc_data_2[, 3:32]), groups = as.character(bc_data_2$diagnosis), title = "Breast cancer dataset 2: Samples")
p2 <- pca_func(data = bc_data_2[, 3:32], groups = as.character(colnames(bc_data_2[, 3:32])), title = "Breast cancer dataset 2: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_2 <- hclust(dist(t(bc_data_2[, 3:32]), method = "euclidean"), method = "complete")
plot(h_2)

bc_data_2_gather <- bc_data_2[, -1] %>%
  gather(measure, value, radius_mean:fractal_dimension_largest_worst)

ggplot(data = bc_data_2_gather, aes(x = value, fill = diagnosis, color = diagnosis)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


  • Dataset 3
p1 <- pca_func(data = t(bc_data_3[, 2:34]), groups = as.character(bc_data_3$outcome), title = "Breast cancer dataset 3: Samples")
p2 <- pca_func(data = bc_data_3[, 2:34], groups = as.character(colnames(bc_data_3[, 2:34])), title = "Breast cancer dataset 3: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_3 <- hclust(dist(t(bc_data_3[,2:34]), method = "euclidean"), method = "complete")
plot(h_3)

Datasets 1 and 2 show a nice separation of benign and malignant masses, models based on these features will likely be able to predict the classes quite well for most samples. The classes in dataset 3 don’t cluster into distinct groups, I assume that prediction will not be as accurate for these features.

The features of datasets 2 and 3 don’t cluster very distinctly, many features seem to show similar patterns across samples. Selecting an approriate subset of features will probably have different effects on the three different datasets.

bc_data_3_gather <- bc_data_3 %>%
  gather(measure, value, time:lymph_node_status)

ggplot(data = bc_data_3_gather, aes(x = value, fill = outcome, color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)


Feature importance

To get an idea about the feature’s respective importances, I’m running Random Forest models with 10 x 10 cross validation using the caret package. If I wanted to use feature importance to select features for modeling, I would need to perform it on the training data instead of on the complete dataset. But here, I only want to use it to get acquainted with my data. I am again defining a function that estimates the feature importance and produces a plot.

library(caret)
library(doParallel) # parallel processing
registerDoParallel()

# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

feature_imp <- function(model, title) {
  
  # estimate variable importance
  importance <- varImp(model, scale = TRUE)
  
  # prepare dataframes for plotting
  importance_df_1 <- importance$importance
  importance_df_1$group <- rownames(importance_df_1)
  
  importance_df_2 <- importance_df_1
  importance_df_2$Overall <- 0
  
  importance_df <- rbind(importance_df_1, importance_df_2)
  
  plot <- ggplot() +
    geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) +
    geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) +
    theme(legend.position = "none") +
    labs(
      x = "Importance",
      y = "",
      title = title,
      subtitle = "Scaled feature importance",
      caption = "\nDetermined with Random Forest and
      repeated cross validation (10 repeats, 10 times)"
    )
  
  return(plot)
  
}
# train the model
set.seed(27)
imp_1 <- train(classes ~ ., data = bc_data, method = "rf", preProcess = c("scale", "center"), trControl = control)
p1 <- feature_imp(imp_1, title = "Breast cancer dataset 1")
set.seed(27)
imp_2 <- train(diagnosis ~ ., data = bc_data_2[, -1], method = "rf", preProcess = c("scale", "center"), trControl = control)
p2 <- feature_imp(imp_2, title = "Breast cancer dataset 2")
set.seed(27)
imp_3 <- train(outcome ~ ., data = bc_data_3, method = "rf", preProcess = c("scale", "center"), trControl = control)
p3 <- feature_imp(imp_3, title = "Breast cancer dataset 3")
grid.arrange(p1, p2, p3, ncol = 3, widths = c(0.3, 0.35, 0.35))


Feature Selection

Now that I have a general idea about the data, I will run three feature selection methods on all three datasets and compare how they effect the prediction accuracy of a Random Forest model.

Creating train and test data

Before doing anything else with the data, we need to subset the datasets into train and test data. Performing feature selection on the whole dataset would lead to prediction bias, we therefore need to run the whole modeling process on the training data alone!


  • Dataset 1
set.seed(27)
bc_data_index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
bc_data_train <- bc_data[bc_data_index, ]
bc_data_test  <- bc_data[-bc_data_index, ]


  • Dataset 2
set.seed(27)
bc_data_2_index <- createDataPartition(bc_data_2$diagnosis, p = 0.7, list = FALSE)
bc_data_2_train <- bc_data_2[bc_data_2_index, ]
bc_data_2_test  <- bc_data_2[-bc_data_2_index, ]


  • Dataset 3
set.seed(27)
bc_data_3_index <- createDataPartition(bc_data_3$outcome, p = 0.7, list = FALSE)
bc_data_3_train <- bc_data_3[bc_data_3_index, ]
bc_data_3_test  <- bc_data_3[-bc_data_3_index, ]


Correlation

Often we have features that are highly correlated and thus provide redundant information. By eliminating highly correlated features we can avoid a predictive bias for the information contained in these features. This also shows us, that when we want to make statements about the biological/ medical importance of specific features, we need to keep in mind that just because they are suitable to predicting an outcome they are not necessarily causal - they could simply be correlated with causal factors.

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.


  • Dataset 1
library(corrplot)

# calculate correlation matrix
corMatMy <- cor(bc_data_train[, -1])
corrplot(corMatMy, order = "hclust")

#Apply correlation filter at 0.70,
highlyCor <- colnames(bc_data_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 2  and column  3 with corr  0.913 
##   Means:  0.716 vs 0.601 so flagging column 2 
## Compare row 3  and column  6 with corr  0.741 
##   Means:  0.677 vs 0.579 so flagging column 3 
## Compare row 6  and column  7 with corr  0.706 
##   Means:  0.602 vs 0.545 so flagging column 6 
## All correlations <= 0.7
# which variables are flagged for removal?
highlyCor
## [1] "uniformity_of_cell_size"  "uniformity_of_cell_shape"
## [3] "bare_nuclei"
#then we remove these variables
bc_data_cor <- bc_data_train[, which(!colnames(bc_data_train) %in% highlyCor)]

Correlation between features in dataset 1 is generally high and 4 out of 10 feature were flagged for removal.


  • Dataset 2
corMatMy <- cor(bc_data_2_train[, 3:32])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_2_train[, 3:32])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 7  and column  8 with corr  0.92 
##   Means:  0.579 vs 0.393 so flagging column 7 
## Compare row 8  and column  6 with corr  0.84 
##   Means:  0.548 vs 0.38 so flagging column 8 
## Compare row 6  and column  28 with corr  0.827 
##   Means:  0.536 vs 0.368 so flagging column 6 
## Compare row 28  and column  27 with corr  0.855 
##   Means:  0.506 vs 0.357 so flagging column 28 
## Compare row 27  and column  26 with corr  0.894 
##   Means:  0.46 vs 0.346 so flagging column 27 
## Compare row 23  and column  21 with corr  0.993 
##   Means:  0.454 vs 0.336 so flagging column 23 
## Compare row 21  and column  24 with corr  0.983 
##   Means:  0.419 vs 0.327 so flagging column 21 
## Compare row 26  and column  30 with corr  0.817 
##   Means:  0.402 vs 0.323 so flagging column 26 
## Compare row 24  and column  3 with corr  0.943 
##   Means:  0.383 vs 0.312 so flagging column 24 
## Compare row 3  and column  1 with corr  0.998 
##   Means:  0.347 vs 0.306 so flagging column 3 
## Compare row 1  and column  4 with corr  0.986 
##   Means:  0.302 vs 0.304 so flagging column 4 
## Compare row 1  and column  14 with corr  0.726 
##   Means:  0.264 vs 0.304 so flagging column 14 
## Compare row 13  and column  11 with corr  0.973 
##   Means:  0.32 vs 0.304 so flagging column 13 
## Compare row 18  and column  16 with corr  0.757 
##   Means:  0.388 vs 0.295 so flagging column 18 
## Compare row 16  and column  17 with corr  0.796 
##   Means:  0.404 vs 0.288 so flagging column 16 
## Compare row 9  and column  29 with corr  0.711 
##   Means:  0.343 vs 0.274 so flagging column 9 
## Compare row 17  and column  20 with corr  0.745 
##   Means:  0.306 vs 0.261 so flagging column 17 
## Compare row 5  and column  25 with corr  0.809 
##   Means:  0.311 vs 0.255 so flagging column 5 
## Compare row 30  and column  10 with corr  0.753 
##   Means:  0.288 vs 0.241 so flagging column 30 
## Compare row 22  and column  2 with corr  0.913 
##   Means:  0.243 vs 0.242 so flagging column 22 
## All correlations <= 0.7
highlyCor
##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "concave_points_largest_worst"   
##  [5] "concavity_largest_worst"         "perimeter_largest_worst"        
##  [7] "radius_largest_worst"            "compactness_largest_worst"      
##  [9] "area_largest_worst"              "perimeter_mean"                 
## [11] "perimeter_se"                    "area_mean"                      
## [13] "concave_points_se"               "compactness_se"                 
## [15] "area_se"                         "symmetry_mean"                  
## [17] "concavity_se"                    "smoothness_mean"                
## [19] "fractal_dimension_largest_worst" "texture_largest_worst"
bc_data_2_cor <- bc_data_2_train[, which(!colnames(bc_data_2_train) %in% highlyCor)]

Here, we have more variation between the 30 features: some are highly correlated, while others seem to be very distinct. 20 are flagged for removal (see output above).


  • Dataset 3
corMatMy <- cor(bc_data_3_train[, -1])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_3_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 8  and column  9 with corr  0.898 
##   Means:  0.425 vs 0.286 so flagging column 8 
## Compare row 9  and column  7 with corr  0.714 
##   Means:  0.39 vs 0.277 so flagging column 9 
## Compare row 7  and column  29 with corr  0.753 
##   Means:  0.364 vs 0.271 so flagging column 7 
## Compare row 4  and column  2 with corr  0.996 
##   Means:  0.348 vs 0.264 so flagging column 4 
## Compare row 2  and column  5 with corr  0.993 
##   Means:  0.33 vs 0.259 so flagging column 2 
## Compare row 5  and column  24 with corr  0.921 
##   Means:  0.304 vs 0.255 so flagging column 5 
## Compare row 24  and column  22 with corr  0.985 
##   Means:  0.272 vs 0.252 so flagging column 24 
## Compare row 11  and column  31 with corr  0.83 
##   Means:  0.341 vs 0.247 so flagging column 11 
## Compare row 22  and column  15 with corr  0.773 
##   Means:  0.24 vs 0.242 so flagging column 15 
## Compare row 22  and column  25 with corr  0.989 
##   Means:  0.217 vs 0.242 so flagging column 25 
## Compare row 14  and column  12 with corr  0.975 
##   Means:  0.256 vs 0.243 so flagging column 14 
## Compare row 31  and column  28 with corr  0.71 
##   Means:  0.328 vs 0.238 so flagging column 31 
## Compare row 18  and column  17 with corr  0.812 
##   Means:  0.331 vs 0.229 so flagging column 18 
## Compare row 28  and column  27 with corr  0.84 
##   Means:  0.286 vs 0.219 so flagging column 28 
## Compare row 17  and column  21 with corr  0.839 
##   Means:  0.285 vs 0.212 so flagging column 17 
## Compare row 10  and column  30 with corr  0.766 
##   Means:  0.277 vs 0.204 so flagging column 10 
## Compare row 6  and column  26 with corr  0.754 
##   Means:  0.235 vs 0.198 so flagging column 6 
## Compare row 3  and column  23 with corr  0.858 
##   Means:  0.164 vs 0.195 so flagging column 23 
## All correlations <= 0.7
highlyCor
##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "perimeter_mean"                 
##  [5] "radius_mean"                     "area_mean"                      
##  [7] "perimeter_largest_worst"         "fractal_dimension_mean"         
##  [9] "perimeter_se"                    "area_se"                        
## [11] "area_largest_worst"              "fractal_dimension_largest_worst"
## [13] "concavity_se"                    "concavity_largest_worst"        
## [15] "compactness_se"                  "symmetry_mean"                  
## [17] "smoothness_mean"                 "texture_largest_worst"
bc_data_3_cor <- bc_data_3_train[, which(!colnames(bc_data_3_train) %in% highlyCor)]

The features in dataset 3 look similar: some are highly correlated, others are very different. 18 are flagged for removal (see output above).


Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rate each with an accuracy score. The combination with the highest score is usually preferential.


  • Dataset 1
# ensure the results are repeatable
set.seed(7)
# define the control using a random forest selection function with cross validation
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# run the RFE algorithm
results_1 <- rfe(x = bc_data_train[, -1], y = bc_data_train$classes, sizes = c(1:9), rfeControl = control)

# chosen features
predictors(results_1)
## [1] "bare_nuclei"                 "uniformity_of_cell_size"    
## [3] "clump_thickness"             "uniformity_of_cell_shape"   
## [5] "bland_chromatin"             "marginal_adhesion"          
## [7] "mitosis"                     "normal_nucleoli"            
## [9] "single_epithelial_cell_size"
# subset the chosen features
bc_data_rfe <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% predictors(results_1)))]


  • Dataset 2
set.seed(7)
results_2 <- rfe(x = bc_data_2_train[, -c(1, 2)], y = as.factor(bc_data_2_train$diagnosis), sizes = c(1:30), rfeControl = control)

predictors(results_2)
##  [1] "perimeter_largest_worst"         "area_largest_worst"             
##  [3] "radius_largest_worst"            "concave_points_largest_worst"   
##  [5] "concave_points_mean"             "texture_largest_worst"          
##  [7] "texture_mean"                    "area_se"                        
##  [9] "concavity_largest_worst"         "concavity_mean"                 
## [11] "radius_se"                       "perimeter_mean"                 
## [13] "radius_mean"                     "area_mean"                      
## [15] "perimeter_se"                    "smoothness_largest_worst"       
## [17] "compactness_largest_worst"       "symmetry_largest_worst"         
## [19] "compactness_mean"                "smoothness_mean"                
## [21] "fractal_dimension_largest_worst" "concavity_se"                   
## [23] "symmetry_mean"
bc_data_2_rfe <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% predictors(results_2)))]


  • Dataset 3
set.seed(7)
results_3 <- rfe(x = bc_data_3_train[,-1], y = as.factor(bc_data_3_train$outcome), sizes = c(1:33), rfeControl = control)

predictors(results_2)
##  [1] "perimeter_largest_worst"         "area_largest_worst"             
##  [3] "radius_largest_worst"            "concave_points_largest_worst"   
##  [5] "concave_points_mean"             "texture_largest_worst"          
##  [7] "texture_mean"                    "area_se"                        
##  [9] "concavity_largest_worst"         "concavity_mean"                 
## [11] "radius_se"                       "perimeter_mean"                 
## [13] "radius_mean"                     "area_mean"                      
## [15] "perimeter_se"                    "smoothness_largest_worst"       
## [17] "compactness_largest_worst"       "symmetry_largest_worst"         
## [19] "compactness_mean"                "smoothness_mean"                
## [21] "fractal_dimension_largest_worst" "concavity_se"                   
## [23] "symmetry_mean"
bc_data_3_rfe <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% predictors(results_3)))]


Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

For demonstration purposes I am using only 10 generations consisting of 5 individuals. More iterations with larger populations would of course be preferable, but this takes quite long to run!

library(dplyr)

ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
                       method = "cv",    # 10 fold cross validation
                       genParallel = TRUE, # Use parallel programming
                       allowParallel = TRUE)


  • Dataset 1
lev <- c("malignant", "benign")     # Set the levels

set.seed(27)
model_1 <- gafs(x = bc_data_train[, -1], y = bc_data_train$classes,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_1) # Plot mean fitness (AUC) by generation

model_1$ga$final
## [1] "clump_thickness"             "uniformity_of_cell_shape"   
## [3] "marginal_adhesion"           "single_epithelial_cell_size"
## [5] "bare_nuclei"                 "bland_chromatin"            
## [7] "normal_nucleoli"             "mitosis"
bc_data_ga <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% model_1$ga$final))]


  • Dataset 2
lev <- c("M", "B")

set.seed(27)
model_2 <- gafs(x = bc_data_2_train[, -c(1, 2)], y = bc_data_2_train$diagnosis,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_2)

model_2$ga$final
##  [1] "radius_mean"                  "texture_mean"                
##  [3] "perimeter_mean"               "area_mean"                   
##  [5] "smoothness_mean"              "compactness_mean"            
##  [7] "concavity_mean"               "symmetry_mean"               
##  [9] "fractal_dimension_mean"       "texture_se"                  
## [11] "perimeter_se"                 "area_se"                     
## [13] "smoothness_se"                "compactness_se"              
## [15] "concavity_se"                 "concave_points_se"           
## [17] "symmetry_se"                  "radius_largest_worst"        
## [19] "texture_largest_worst"        "smoothness_largest_worst"    
## [21] "compactness_largest_worst"    "concave_points_largest_worst"
bc_data_2_ga <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% model_2$ga$final))]


  • Dataset 3
lev <- c("R", "N")

set.seed(27)
model_3 <- gafs(x = bc_data_3_train[, -1], y = bc_data_3_train$outcome,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
plot(model_3)

model_3$ga$final
##  [1] "time"                            "perimeter_mean"                 
##  [3] "radius_se"                       "texture_se"                     
##  [5] "compactness_se"                  "concavity_se"                   
##  [7] "symmetry_se"                     "fractal_dimension_se"           
##  [9] "radius_largest_worst"            "texture_largest_worst"          
## [11] "perimeter_largest_worst"         "smoothness_largest_worst"       
## [13] "compactness_largest_worst"       "concavity_largest_worst"        
## [15] "concave_points_largest_worst"    "symmetry_largest_worst"         
## [17] "fractal_dimension_largest_worst" "tumor_size"                     
## [19] "lymph_node_status"
bc_data_3_ga <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% model_3$ga$final))]


Model comparison

Now I can compare Random Forest models with the different feature subsets. (I chose Random Forests because I find it works reasonbly well on a large variety of data and because it’s the algorithm I’ve used most often and thus know most about…)

For a more detailed description of building machine learning models see here and here.


All features

  • Dataset 1
set.seed(27)
model_bc_data_all <- train(classes ~ .,
                           data = bc_data_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_1 <- confusionMatrix(predict(model_bc_data_all, bc_data_test[, -1]), bc_data_test$classes)
cm_all_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         5
##   malignant      3        67
##                                          
##                Accuracy : 0.9617         
##                  95% CI : (0.926, 0.9833)
##     No Information Rate : 0.6555         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9147         
##  Mcnemar's Test P-Value : 0.7237         
##                                          
##             Sensitivity : 0.9781         
##             Specificity : 0.9306         
##          Pos Pred Value : 0.9640         
##          Neg Pred Value : 0.9571         
##              Prevalence : 0.6555         
##          Detection Rate : 0.6411         
##    Detection Prevalence : 0.6651         
##       Balanced Accuracy : 0.9543         
##                                          
##        'Positive' Class : benign         
## 


  • Dataset 2
set.seed(27)
model_bc_data_2_all <- train(diagnosis ~ .,
                           data = bc_data_2_train[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_2 <- confusionMatrix(predict(model_bc_data_2_all, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_all_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 


  • Dataset 3
set.seed(27)
model_bc_data_3_all <- train(outcome ~ .,
                           data = bc_data_3_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_all_3 <- confusionMatrix(predict(model_bc_data_3_all, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_all_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 


Selected features

Dataset 1

library(gplots)

venn_list <- list(cor = colnames(bc_data_cor)[-1],
                  rfe = colnames(bc_data_rfe)[-1],
                  ga = colnames(bc_data_ga)[-1])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   0   0   0  1
## 010   1   0   1  0
## 011   2   0   1  1
## 100   0   1   0  0
## 101   0   1   0  1
## 110   0   1   1  0
## 111   6   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "clump_thickness"             "marginal_adhesion"          
## [3] "single_epithelial_cell_size" "bland_chromatin"            
## [5] "normal_nucleoli"             "mitosis"                    
## 
## attr(,"intersections")$rfe
## [1] "uniformity_of_cell_size"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "uniformity_of_cell_shape" "bare_nuclei"             
## 
## attr(,"class")
## [1] "venn"

4 out of 10 features were chosen by all three methods; the biggest overlap is seen between GA and RFE with 7 features. RFE and GA both retained 8 features for modeling, compared to only 5 based on the correlation method.


  • Correlation
set.seed(27)
model_bc_data_cor <- train(classes ~ .,
                 data = bc_data_cor,
                 method = "rf",
                 preProcess = c("scale", "center"),
                 trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_1 <- confusionMatrix(predict(model_bc_data_cor, bc_data_test[, -1]), bc_data_test$classes)
cm_cor_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       131         6
##   malignant      6        66
##                                         
##                Accuracy : 0.9426        
##                  95% CI : (0.9019, 0.97)
##     No Information Rate : 0.6555        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.8729        
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.9562        
##             Specificity : 0.9167        
##          Pos Pred Value : 0.9562        
##          Neg Pred Value : 0.9167        
##              Prevalence : 0.6555        
##          Detection Rate : 0.6268        
##    Detection Prevalence : 0.6555        
##       Balanced Accuracy : 0.9364        
##                                         
##        'Positive' Class : benign        
## 


  • RFE
set.seed(27)
model_bc_data_rfe <- train(classes ~ .,
                           data = bc_data_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_1 <- confusionMatrix(predict(model_bc_data_rfe, bc_data_test[, -1]), bc_data_test$classes)
cm_rfe_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         4
##   malignant      3        68
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9322, 0.9864)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9256          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9710          
##          Neg Pred Value : 0.9577          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6603          
##       Balanced Accuracy : 0.9613          
##                                           
##        'Positive' Class : benign          
## 


  • GA
set.seed(27)
model_bc_data_ga <- train(classes ~ .,
                           data = bc_data_ga,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_1 <- confusionMatrix(predict(model_bc_data_ga, bc_data_test[, -1]), bc_data_test$classes)
cm_ga_1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         3
##   malignant      3        69
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9386, 0.9894)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9364          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9583          
##          Pos Pred Value : 0.9781          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6555          
##       Balanced Accuracy : 0.9682          
##                                           
##        'Positive' Class : benign          
## 


Dataset 2

venn_list <- list(cor = colnames(bc_data_2_cor)[-c(1, 2)],
                  rfe = colnames(bc_data_2_rfe)[-c(1, 2)],
                  ga = colnames(bc_data_2_ga)[-c(1, 2)])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   2   0   0  1
## 010   5   0   1  0
## 011  13   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   2   1   1  0
## 111   2   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "texture_mean"             "smoothness_largest_worst"
## 
## attr(,"intersections")$ga
## [1] "compactness_se"    "concave_points_se"
## 
## attr(,"intersections")$cor
## [1] "radius_mean"          "fractal_dimension_se"
## 
## attr(,"intersections")$rfe
## [1] "concave_points_mean"             "perimeter_largest_worst"        
## [3] "area_largest_worst"              "concavity_largest_worst"        
## [5] "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:ga`
## [1] "fractal_dimension_mean" "texture_se"            
## [3] "smoothness_se"          "symmetry_se"           
## 
## attr(,"intersections")$`rfe:ga`
##  [1] "perimeter_mean"               "area_mean"                   
##  [3] "smoothness_mean"              "compactness_mean"            
##  [5] "concavity_mean"               "symmetry_mean"               
##  [7] "perimeter_se"                 "area_se"                     
##  [9] "concavity_se"                 "radius_largest_worst"        
## [11] "texture_largest_worst"        "compactness_largest_worst"   
## [13] "concave_points_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "radius_se"              "symmetry_largest_worst"
## 
## attr(,"class")
## [1] "venn"

For dataset 2 we see a much bigger variation in chosen features between the three selection methods: only 1 feature was chosen by all and the biggest overlap is again seen between RFE and GA, followed by correlation and GA. But this time we also see quite a few features that are uniquely retained by any of the three feature selection methods.


  • Correlation
set.seed(27)
model_bc_data_2_cor <- train(diagnosis ~ .,
                           data = bc_data_2_cor[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_2 <- confusionMatrix(predict(model_bc_data_2_cor, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_cor_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   6
##          M   1  57
##                                          
##                Accuracy : 0.9588         
##                  95% CI : (0.917, 0.9833)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9103         
##  Mcnemar's Test P-Value : 0.1306         
##                                          
##             Sensitivity : 0.9907         
##             Specificity : 0.9048         
##          Pos Pred Value : 0.9464         
##          Neg Pred Value : 0.9828         
##              Prevalence : 0.6294         
##          Detection Rate : 0.6235         
##    Detection Prevalence : 0.6588         
##       Balanced Accuracy : 0.9477         
##                                          
##        'Positive' Class : B              
## 


  • RFE
set.seed(27)
model_bc_data_2_rfe <- train(diagnosis ~ .,
                           data = bc_data_2_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_2 <- confusionMatrix(predict(model_bc_data_2_rfe, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_rfe_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 


  • GA
set.seed(27)
model_bc_data_2_ga <- train(diagnosis ~ .,
                          data = bc_data_2_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_2 <- confusionMatrix(predict(model_bc_data_2_ga, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_ga_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   4
##          M   1  59
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9327, 0.9904)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9363          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9365          
##          Pos Pred Value : 0.9636          
##          Neg Pred Value : 0.9833          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6471          
##       Balanced Accuracy : 0.9636          
##                                           
##        'Positive' Class : B               
## 


Dataset 3

venn_list <- list(cor = colnames(bc_data_3_cor)[-1],
                  rfe = colnames(bc_data_3_rfe)[-1],
                  ga = colnames(bc_data_3_ga)[-1])

venn <- venn(venn_list)

venn
##     num cor rfe ga
## 000   0   0   0  0
## 001   5   0   0  1
## 010   6   0   1  0
## 011   2   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   1   1   1  0
## 111   8   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "time"                         "radius_se"                   
## [3] "symmetry_se"                  "smoothness_largest_worst"    
## [5] "concave_points_largest_worst" "symmetry_largest_worst"      
## [7] "tumor_size"                   "lymph_node_status"           
## 
## attr(,"intersections")$ga
## [1] "perimeter_mean"          "compactness_se"         
## [3] "texture_largest_worst"   "perimeter_largest_worst"
## [5] "concavity_largest_worst"
## 
## attr(,"intersections")$cor
## [1] "texture_mean"  "smoothness_se"
## 
## attr(,"intersections")$rfe
## [1] "compactness_mean"       "concavity_mean"        
## [3] "concave_points_mean"    "symmetry_mean"         
## [5] "fractal_dimension_mean" "perimeter_se"          
## 
## attr(,"intersections")$`cor:ga`
## [1] "texture_se"                "fractal_dimension_se"     
## [3] "radius_largest_worst"      "compactness_largest_worst"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "concavity_se"                    "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "concave_points_se"
## 
## attr(,"class")
## [1] "venn"

With the third dataset there is still some variation among the selected features but not as extreme as with dataset 2. This time, we find the biggest overlap between correlation and GA.


  • Correlation
set.seed(27)
model_bc_data_3_cor <- train(outcome ~ .,
                           data = bc_data_3_cor,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_cor_3 <- confusionMatrix(predict(model_bc_data_3_cor, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_cor_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 41  7
##          R  4  7
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.6909, 0.9031)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.2256          
##                                           
##                   Kappa : 0.4439          
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9111          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8542          
##          Neg Pred Value : 0.6364          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6949          
##    Detection Prevalence : 0.8136          
##       Balanced Accuracy : 0.7056          
##                                           
##        'Positive' Class : N               
## 


  • RFE
set.seed(27)
model_bc_data_3_rfe <- train(outcome ~ .,
                           data = bc_data_3_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_rfe_3 <- confusionMatrix(predict(model_bc_data_3_rfe, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_rfe_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 40  7
##          R  5  7
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.6717, 0.8902)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.3311          
##                                           
##                   Kappa : 0.409           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8511          
##          Neg Pred Value : 0.5833          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6780          
##    Detection Prevalence : 0.7966          
##       Balanced Accuracy : 0.6944          
##                                           
##        'Positive' Class : N               
## 


  • GA
set.seed(27)
model_bc_data_3_ga <- train(outcome ~ .,
                          data = bc_data_3_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))
cm_ga_3 <- confusionMatrix(predict(model_bc_data_3_ga, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_ga_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 


Overall model parameters

To compare the feature selection methods’ influence on the models built with the three different datasets, I am going to plot model accuracy, Kappa, precision and sensitivity and specificity.

overall <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$overall,
                      cm_cor_1$overall,
                      cm_rfe_1$overall,
                      cm_ga_1$overall,
                      cm_all_2$overall,
                      cm_cor_2$overall,
                      cm_rfe_2$overall,
                      cm_ga_2$overall,
                      cm_all_3$overall,
                      cm_cor_3$overall,
                      cm_rfe_3$overall,
                      cm_ga_3$overall))

library(tidyr)
overall_gather <- overall[, 1:4] %>%
  gather(measure, value, Accuracy:Kappa)
byClass <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$byClass,
                      cm_cor_1$byClass,
                      cm_rfe_1$byClass,
                      cm_ga_1$byClass,
                      cm_all_2$byClass,
                      cm_cor_2$byClass,
                      cm_rfe_2$byClass,
                      cm_ga_2$byClass,
                      cm_all_3$byClass,
                      cm_cor_3$byClass,
                      cm_rfe_3$byClass,
                      cm_ga_3$byClass))

byClass_gather <- byClass[, c(1:4, 7)] %>%
  gather(measure, value, Sensitivity:Precision)
overall_byClass_gather <- rbind(overall_gather, byClass_gather)
overall_byClass_gather <- within(overall_byClass_gather, model <- factor(model, levels = c("all", "cor", "rfe", "ga")))

ggplot(overall_byClass_gather, aes(x = model, y = value, color = measure, shape = measure, group = measure)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_path(alpha = 0.7) +
  scale_colour_brewer(palette = "Set1") +
  facet_grid(dataset ~ ., scales = "free_y") +
  labs(
    x = "Feature Selection method",
    y = "Value",
    color = "",
    shape = "",
    title = "Comparison of feature selection methods",
    subtitle = "in three breast cancer datasets",
    caption = "\nBreast Cancer Wisconsin (Diagnostic) Data Sets: 1, 2 & 3
    Street et al., 1993;
    all: no feature selection
    cor: features with correlation > 0.7 removed
    rfe: Recursive Feature Elimination
    ga: Genetic Algorithm"
  )

Conclusions

As expected from PCA of sample classes, which showed that the two classes recur/ non-recur did not cluster well, Random Forest models of dataset 3 had much lower prediction accuracy than models of datasets 1 and 2.

The correlation method operates regardless of feature importance. E.g. in dataset 1, the features with the highest importance were also flagged as highly correlated. Correlation models performed worst in all three datasets. RFE and GA tend to include features with high importance, but feature importance alone is not a good indicator for whether several features will work well in combination when predicting an outcome.

Dataset 1 was small with only 9 features; here, removing highly correlated features was the least successful selection method. RFE and GA both improved the predictions compared to no feature selection, while GA performed best. Dataset 2 with its 30 original features produced the best models with the GA. And in dataset 3, which had lower overall accuracy, different feature selection methods did not have a strong influence.


If you are interested in more machine learning posts, check out the category listing for machine_learning.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] C
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_3.0.1         dplyr_0.5.0          plyr_1.8.4          
##  [4] corrplot_0.77        randomForest_4.6-12  doParallel_1.0.10   
##  [7] iterators_1.0.8      foreach_1.4.3        caret_6.0-73        
## [10] lattice_0.20-34      tidyr_0.6.0          gridExtra_2.2.1     
## [13] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
## [16] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [19] ellipse_0.3-8        ggplot2_2.2.1        mice_2.25           
## [22] Rcpp_0.12.8         
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0         class_7.3-14         assertthat_0.1      
##  [4] rprojroot_1.1        digest_0.6.11        R6_2.2.0            
##  [7] backports_1.0.4      MatrixModels_0.4-1   RSQLite_1.1-1       
## [10] evaluate_0.10        e1071_1.6-7          zlibbioc_1.20.0     
## [13] lazyeval_0.2.0       gdata_2.17.0         minqa_1.2.4         
## [16] SparseM_1.74         car_2.1-4            nloptr_1.0.4        
## [19] rpart_4.1-10         Matrix_1.2-7.1       rmarkdown_1.3       
## [22] labeling_0.3         splines_3.3.2        lme4_1.1-12         
## [25] stringr_1.1.0        munsell_0.4.3        compiler_3.3.2      
## [28] mgcv_1.8-16          htmltools_0.3.5      nnet_7.3-12         
## [31] tibble_1.2           codetools_0.2-15     bitops_1.0-6        
## [34] MASS_7.3-45          ModelMetrics_1.1.0   nlme_3.1-128        
## [37] gtable_0.2.0         DBI_0.5-1            magrittr_1.5        
## [40] scales_0.4.1         KernSmooth_2.23-15   stringi_1.1.2       
## [43] reshape2_1.4.2       RColorBrewer_1.1-2   tools_3.3.2         
## [46] Biobase_2.34.0       pbkrtest_0.4-6       survival_2.40-1     
## [49] yaml_2.1.14          AnnotationDbi_1.36.0 colorspace_1.3-2    
## [52] caTools_1.17.1       memoise_1.0.0        knitr_1.15.1        
## [55] quantreg_5.29

Gene homology Part 3 - Visualizing Gene Ontology of Conserved Genes

2017-01-05 00:00:00 +0000

Which genes have homologs in many species?

In Part 1 and Part 2 I have already explored gene homology between humans and other species. But there I have only considered how many genes where shared between the species.

In this post I want to have a closer look at genes with homologs in many species. These genes are likely evolutionarily convserved in their biological function. To find out what biological functions these genes have, I am using gene ontology (GO) enrichment analysis.

Because GO categories have a hierarchical parent-children structure, comparing enriched GO terms between groups of genes in a meaningful way isn’t trivial. Simply calculating the overlap between categories will miss the bigger picture when, e.g. two groups are enriched for closely related but not for exactly the same GO categories. Here, I will explore a few options of visualizing GO enrichment with the purpose of comparing enrichments between groups (here: species).

A network of the partial hierarchical structure of enriched GO terms of biological processes encompasses the information of relatedness between GO terms. Custom annotations show GO terms that were common, as well as terms that were unique to species.

Highly conserved biological functions include core regulatory processes that are essential for the cell cycle, energy production and metabolism, nucleoside biogenesis, protein metabolism, etc.



I’m starting with the same table as from this post, called homologs_table_combined.

head(homologs_table_combined)
##   celegans_gene_ensembl cfamiliaris_gene_ensembl
## 1                    NA                       NA
## 2                    NA                   477699
## 3                    NA                   477023
## 4                173979                   490207
## 5                179795                   607852
## 6                177055                       NA
##   dmelanogaster_gene_ensembl drerio_gene_ensembl ggallus_gene_ensembl
## 1                         NA                  NA                   NA
## 2                      44071                  NA                   NA
## 3                         NA              431754               770094
## 4                      38864              406283                   NA
## 5                      36760              794259               395373
## 6                    3772179              541489               421909
##   hsapiens_gene_ensembl mmusculus_gene_ensembl ptroglodytes_gene_ensembl
## 1                    NA                     NA                        NA
## 2                     2                 232345                    465372
## 3                    30                 235674                    460268
## 4                    34                  11364                    469356
## 5                    47                 104112                    454672
## 6                    52                  11431                    458990
##   rnorvegicus_gene_ensembl scerevisiae_gene_ensembl sscrofa_gene_ensembl
## 1                    24152                       NA                   NA
## 2                    24153                       NA               403166
## 3                    24157                   854646            100515577
## 4                    24158                       NA               397104
## 5                    24159                   854310                   NA
## 6                    24161                   856187            100737301

Each row in this table denotes a gene with its Entrez ID and corresponding Entrez IDs for homologs in each of the other 10 species I explored. If a gene doesn’t have homolog the table says “NA”. However, some genes have duplicate entries for a species if there are multiple homologs in different species.

By counting the number of NAs per row, we can identify genes with homologs in all species (sum of NAs = 0) and genes which are specific (sum of NAs = 10).

homologs_na <- rowSums(is.na(homologs_table_combined))

Before I delve deeper into the biology behind these genes, I want to examine the distribution of the NA-counts. To do so, I am plotting a histogram.

But first I’m preparing my custom ggplot2 theme:

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "bottom",
    legend.justification = "top", 
    legend.box = "horizontal",
    legend.box.background = element_rect(colour = "grey50"),
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())
ggplot() + 
  aes(homologs_na) + 
  geom_histogram(binwidth = 1, color = "royalblue", fill = "royalblue", alpha = 0.8) +
  labs(
    x = "Number of NAs per row of gene homology table",
    y = "Count",
    title = "How many homologs do genes have?",
    subtitle = "Showing all bins from 0 to 10",
    caption = "\nEach row of the gene homology table list a gene and all its homologs in 10 other species.
    If a gene doesn't have a homolog, its value in the table is NA. Thus, rows with many NAs refer to a
    gene that is specific to one species, while rows with no NAs show genes with homologs in all species -
    such genes are said to be highly conserved."
  )


Clearly, most genes are specific to a species, they have NAs in all but one column. The rest of the histogram is a bit hard to differentiate with the peak at 10, so let’s look at the same data again with these genes:


ggplot() + 
  aes(subset(homologs_na, homologs_na  < 10)) + 
  geom_histogram(binwidth = 1, color = "royalblue", fill = "royalblue", alpha = 0.8) +
  labs(
    x = "Number of NAs per row of gene homology table",
    y = "Count",
    title = "How many homologs do genes have?",
    subtitle = "Showing bins from 0 to 9",
    caption = "\nEach row of the gene homology table list a gene and all its homologs in 10 other species.
    If a gene doesn't have a homolog, its value in the table is NA. Thus, rows with many NAs refer to a
    gene that is specific to one species, while rows with no NAs show genes with homologs in all species -
    such genes are said to be highly conserved."
  )



Now we can see that most genes have homologs in 9 species (2 NAs). But there are still quite a few genes with homologs in all species.


Which genes are highly conserved and what biological functions do they have?

There are 3461 rows in the original table with no NAs.

genes_homologs_all <- homologs_table_combined[which(rowSums(is.na(homologs_table_combined)) == 0), ]
nrow(genes_homologs_all)
## [1] 3461

Looking at all of these genes by hand wouldn’t be feasible. So, to find out what biological functions these genes have, I am using gene ontology (GO-term) enrichment analysis as implemented in clusterProfiler.

  • Biological processes (BP): Collections of molecular events and functions contributing to a biological process
  • Cellular component (CC): (sub-) cellular structures and locations, macromolecular complexes
  • Molecular functions (MF): Molecular functions like catalytic or binding activity
library(clusterProfiler)
library(DOSE)

for (i in 1:nrow(datasets)) {
  species <- datasets$dataset[i]
  genes <- as.character(unique(genes_homologs_all[, species]))
  universe <- as.character(unique(na.omit(homologs_table_combined[, species])))
  
  cat("\nSpecies", datasets$description[i], "has", length(universe), "unique Entrez IDs, of which", length(genes), "have homologs in all species.\n")
  
  try(go_enrich_BP <- enrichGO(gene = genes, 
                        keytype = "ENTREZID",
                        OrgDb = get(datasets$orgDb[i]),
                        ont = "BP",
                        qvalueCutoff = 0.05,
                        universe = universe,
                        readable = TRUE))
  
  try(go_enrich_MF <- enrichGO(gene = genes, 
                        keytype = "ENTREZID",
                        OrgDb = get(datasets$orgDb[i]),
                        ont = "MF",
                        qvalueCutoff = 0.05,
                        universe = universe,
                        readable = TRUE))
  
  try(go_enrich_CC <- enrichGO(gene = genes, 
                        keytype = "ENTREZID",
                        OrgDb = get(datasets$orgDb[i]),
                        ont = "CC",
                        qvalueCutoff = 0.05,
                        universe = universe,
                        readable = TRUE))
  
  try(assign(paste("go_enrich_BP", species, sep = "_"), go_enrich_BP))
  try(assign(paste("go_enrich_MF", species, sep = "_"), go_enrich_MF))
  try(assign(paste("go_enrich_CC", species, sep = "_"), go_enrich_CC))
}
## 
## Species Rat has 47989 unique Entrez IDs, of which 1796 have homologs in all species.
## 
## Species Yeast has 6350 unique Entrez IDs, of which 1363 have homologs in all species.
## 
## Error in .testForValidCols(x, cols) : 
##   Invalid columns: SYMBOL. Please use the columns method to see a listing of valid arguments.
## Error in .testForValidCols(x, cols) : 
##   Invalid columns: SYMBOL. Please use the columns method to see a listing of valid arguments.
## Error in .testForValidCols(x, cols) : 
##   Invalid columns: SYMBOL. Please use the columns method to see a listing of valid arguments.
## 
## Species C. elegans has 46727 unique Entrez IDs, of which 1758 have homologs in all species.
## 
## Species Chimpanzee has 40540 unique Entrez IDs, of which 1704 have homologs in all species.
## 
## Species Pig has 54622 unique Entrez IDs, of which 1708 have homologs in all species.
## 
## Species Human has 60136 unique Entrez IDs, of which 1757 have homologs in all species.
## 
## Species Chicken has 34668 unique Entrez IDs, of which 1684 have homologs in all species.
## 
## Species Zebrafish has 47900 unique Entrez IDs, of which 1910 have homologs in all species.
## 
## Species Fruitfly has 25038 unique Entrez IDs, of which 1805 have homologs in all species.
## 
## Species Mouse has 68603 unique Entrez IDs, of which 1761 have homologs in all species.
## 
## Species Dog has 31376 unique Entrez IDs, of which 1716 have homologs in all species.

I’m not sure why yeast throws an error but since all other species worked I will ignore yeast genes for now…


Visualizing enriched GO categories

A heatmap of gene enrichment ratios shows enriched GO categories and their enrichment strength and allows us to capture comparable information for all species.

To generate this heatmap, I am using the output tables from clusterProfiler, which tell us the enrichment p-value (before and after correction for multiple testing, aka False Discovery Rate or FDR), the proportion of genes belonging to each GO category in the background versus all background genes and in our gene list compared to all genes in our list. To keep the graph reasonably small, I’m subsetting only the top 10 enriched genes with lowest adjusted p-values for biological processes (BP), molecular functions (MF) and cellular components (CC).

# removing yeast
datasets_2 <- datasets[-grep("scerevisiae_gene_ensembl", datasets$dataset), ]

library(dplyr)

cutoff = 10

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  
  go_enrich_BP_df <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_"))) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "BP")
  
  plot_BP <- go_enrich_BP_df[1:cutoff, ]
  plot_BP$Description <- factor(plot_BP$Description, levels = plot_BP$Description[order(plot_BP$GR)])
    
  go_enrich_MF_df <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_"))) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "MF")
  
  plot_MF <- go_enrich_MF_df[1:cutoff, ]
  plot_MF$Description <- factor(plot_MF$Description, levels = plot_MF$Description[order(plot_MF$GR)])

  go_enrich_CC_df <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_"))) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "CC")
  
  plot_CC <- go_enrich_CC_df[1:cutoff, ]
  plot_CC$Description <- factor(plot_CC$Description, levels = plot_CC$Description[order(plot_CC$GR)])
  
  plot <- rbind(plot_BP, plot_MF, plot_CC)
  plot$Species <- paste(datasets_2$description[i])

  if (i == 1) {
    plot_df <- plot
  } else {
    plot_df <- rbind(plot_df, plot)
  }
}

length(unique(plot_df$ID))
## [1] 123
ggplot(plot_df, aes(x = Species, y = Description, fill = GR)) + 
  geom_tile(width = 1, height = 1) +
  scale_fill_gradient2(low = "white", mid = "blue", high = "red", space = "Lab", name = "Gene Ratio") +
  facet_grid(Category ~ ., scales = "free_y") +
  scale_x_discrete(position = "top") +
  labs(
      title = paste("Top", cutoff, "enriched GO terms"),
      x = "",
      y = ""
    )



Word clouds

Another way to capture information about GO terms is to use the GO description texts as basis for wordclouds. I’m using the general method described here. I’m also included a list of words to remove, like process, activity, etc. These words occur often and on their own don’t tell us much about the GO term.

library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

wordcloud_function <- function(data = data,
                               removewords = c("process", "activity", "positive", "negative", "response", "regulation"),
                               min.freq = 4,
                               max.words=Inf,
                               random.order=TRUE){
  input <- Corpus(VectorSource(data))
  
  input <- tm_map(input, content_transformer(tolower))
  input <- tm_map(input, content_transformer(removePunctuation))
  input <- tm_map(input, removeNumbers)
  input <- tm_map(input, stripWhitespace)
  
  toSpace <- content_transformer(function(x , pattern ) gsub(pattern, " ", x))
  input <- tm_map(input, toSpace, "/")
  input <- tm_map(input, toSpace, "@")
  input <- tm_map(input, toSpace, "\\|")
  
  input <- tm_map(input, function(x) removeWords(x, stopwords("english")))
  
  # specify your stopwords as a character vector
  input <- tm_map(input, removeWords, removewords)
  
  tdm <- TermDocumentMatrix(input)
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing = TRUE)
  d <- data.frame(word = names(v),freq = v)
  
  set.seed(1234)
  wordcloud(words = d$word, freq = d$freq, min.freq = min.freq, scale = c(8,.2),
            max.words = max.words, random.order = random.order, rot.per = 0.15,
            colors = brewer.pal(8, "Dark2"))
}
layout(matrix(c(1:80), nrow = 4, byrow = FALSE), heights = c(0.1, 1))

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  df_BP <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_")))
  df_MF <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_")))
  df_CC <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_")))
  
  par(mar = rep(0, 4))
  plot.new()
  text(x = 0.5, y = 0.5, paste0(datasets_2$description[i]), cex = 2)
  wordcloud_function(data = df_BP$Description)
  wordcloud_function(data = df_MF$Description)
  wordcloud_function(data = df_CC$Description)
}


Which GO terms are enriched in all species?

I also want to know which specific GO categories are enriched in all species. Even though we might miss related terms, this will give us a first idea about the biological functions that are highly conserved between species.

I’m creating a list of enriched GO categories of all species and ask for common elements using the list.common function of the rlist package.

# create empty list to populate with a loop
go_list_BP <- lapply(datasets_2$dataset, function(x) NULL)
names(go_list_BP) <- paste(datasets_2$dataset)

for (species in datasets_2$dataset) {
  df_BP <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_")))
  go_list_BP[[species]] <- df_BP$ID
}

# Now I know which GO IDs are common but I want to know the description as well.
# Because they are in all go_enrich tables, I'm chosen one to subset

library(rlist)
common_gos_BP <- as.data.frame(go_enrich_BP_cfamiliaris_gene_ensembl) %>%
  filter(ID %in% list.common(go_list_BP)) %>%
  select(ID, Description) %>%
  arrange(Description)


go_list_MF <- lapply(datasets_2$dataset, function(x) NULL)
names(go_list_MF) <- paste(datasets_2$dataset)

for (species in datasets_2$dataset) {
  df_MF <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_")))
  go_list_MF[[species]] <- df_MF$ID
}

library(rlist)
common_gos_MF <- as.data.frame(go_enrich_MF_cfamiliaris_gene_ensembl) %>%
  filter(ID %in% list.common(go_list_MF)) %>%
  select(ID, Description) %>%
  arrange(Description)


go_list_CC <- lapply(datasets_2$dataset, function(x) NULL)
names(go_list_CC) <- paste(datasets_2$dataset)

for (species in datasets_2$dataset) {
  df_CC <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_")))
  go_list_CC[[species]] <- df_CC$ID
}

library(rlist)
common_gos_CC <- as.data.frame(go_enrich_CC_cfamiliaris_gene_ensembl) %>%
  filter(ID %in% list.common(go_list_CC)) %>%
  select(ID, Description) %>%
  arrange(Description)

To visualize these common GO categories, I’m going back to the clusterProfiler output tables. This time I’m not using a heatmap, but a point plot to show enrichment statistics for all species.

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  
  go_enrich_BP_df <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_"))) %>%
    filter(ID %in% list.common(go_list_BP)) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "BP")
  
  go_enrich_BP_df$Description <- factor(go_enrich_BP_df$Description, levels = go_enrich_BP_df$Description[order(go_enrich_BP_df$GR)])
    
  go_enrich_MF_df <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_"))) %>%
    filter(ID %in% list.common(go_list_MF)) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "MF")
  
  go_enrich_MF_df$Description <- factor(go_enrich_MF_df$Description, levels = go_enrich_MF_df$Description[order(go_enrich_MF_df$GR)])

  go_enrich_CC_df <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_"))) %>%
    filter(ID %in% list.common(go_list_CC)) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "CC")
  
  go_enrich_CC_df$Description <- factor(go_enrich_CC_df$Description, levels = go_enrich_CC_df$Description[order(go_enrich_CC_df$GR)])
  
  plot <- rbind(go_enrich_BP_df, go_enrich_MF_df, go_enrich_CC_df)
  plot$Species <- paste(datasets_2$description[i])

  if (i == 1) {
    plot_df <- plot
  } else {
    plot_df <- rbind(plot_df, plot)
  }
}

ggplot(plot_df, aes(x = GR, y = Description, size = Count, color = p.adjust)) +
    geom_point() +
    labs(
      title = "Common enriched GO terms",
      x = "Gene Ratio",
      y = ""
    ) +
    facet_grid(Category ~ Species, scales = "free_y")


Gene Ratio is shown on the x-axis, point size shows the gene counts and point color shows the adjusted p-value.


GO hierachical structure

The package RamiGO can be used to plot all parents of the 10 common GO categories:

library(RamiGO)
library(RColorBrewer)
set3 <- brewer.pal(10, "Set3")
goIDs <- list.common(go_list_BP)
getAmigoTree(goIDs, set3, filename = "common_gos_BP", picType = "png", modeType = "amigo", saveResult = TRUE)

set3 <- brewer.pal(8, "Set3")
goIDs <- list.common(go_list_MF)
getAmigoTree(goIDs, set3, filename = "common_gos_MF", picType = "png", modeType = "amigo", saveResult = TRUE)

goIDs <- list.common(go_list_CC)
getAmigoTree(goIDs, set3, filename = "common_gos_CC", picType = "png", modeType = "amigo", saveResult = TRUE)
  • Common GOs BP

  • Common GOs MF

  • Common GOs CC


GO network

To visualize clusters of species-specific and common enriched GO categories within their hierachical structure, I’m creating a network of the top 15 enriched biological process categories of each species with two levels of parent terms.

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  df_BP <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_")))
  df_BP <- df_BP[1:15, 1, drop = FALSE]
  df_BP$species <- datasets_2$description[i]
  
  if (i == 1) {
    df_BP_all <- df_BP
  } else {
    df_BP_all <- rbind(df_BP_all, df_BP)
  }
}

length(unique(df_BP_all$ID))
## [1] 69

Of these top GO categories most are not unique to individual species.

In the network I want to show GO categories that are species-specific and those that are common among all species. To create the annotation data with this information I’m creating a dataframe with the names of the species along with their specific GO term and combine this with the common GO terms from above.

library(tidyr)
df_BP_all$value <- 1
df_BP_all <- spread(df_BP_all, species, value)

go_sums <- data.frame(ID = df_BP_all$ID,
                      sum = rowSums(!is.na(df_BP_all[, -1])))

# GO in all species
go_anno <- data.frame(ID = go_sums$ID)
go_anno$group <- ifelse(go_sums$sum == 10, "common_all", NA)

x <- subset(go_sums, sum == 1)
xx <- subset(df_BP_all, ID %in% x$ID)

for (i in 2:11) {
  xxx <- xx[, i]
  xxx[which(!is.na(xxx))] <- colnames(xx[, i, drop = FALSE])
  unique <- as.data.frame(cbind(xx[, 1], xxx)) %>%
    subset(!is.na(xxx))
  
  if (i == 2) {
    unique_all <- unique
  } else {
    unique_all <- rbind(unique_all, unique)
  }
}

# Common GOs
unique_common_all <- rbind(unique_all, data.frame(V1 = list.common(go_list_BP), xxx = "common"))

To identify the parent GO terms, I’m using the GO.db package. I’m adding two levels of parental structure for each GO term.

library(GO.db)

# GO parents list
ontology <- as.list(GOBPPARENTS)

goChilds <- unique(df_BP_all$ID)
goChilds <- goChilds[goChilds %in% names(ontology)]

adjList <- matrix(nrow = 0, ncol = 3)
colnames(adjList) <- c("parent", "rel", "child")

# Find parents
goChilds <- goChilds[which(goChilds != "all")]
        
goFamilyList <- lapply(ontology[goChilds], function(x) data.frame(parent = x, rel = names(x)))
adjList <- rbind(adjList, as.matrix(cbind(do.call(rbind, goFamilyList), child = rep(names(goFamilyList), sapply(goFamilyList, nrow)))))

# Next parents
goParents <- unique(as.character(adjList[which(!adjList[,"parent"] %in% adjList[,"child"]), "parent"]))    
goFamilyList_2 <- lapply(ontology[goParents], function(x) data.frame(parent = x, rel = names(x)))
adjList <- rbind(adjList, as.matrix(cbind(do.call(rbind, goFamilyList_2), child = rep(names(goFamilyList_2), sapply(goFamilyList_2, nrow)))))

adjList <- adjList[which(adjList[,1] != "all"), , drop = FALSE]
adjList <- adjList[, c("child","parent", "rel"), drop = FALSE]

The graph is then prodcued with igraph as a directed network. Colors show species-specific, common and remaining enriched GO categories. Parents are shown in light blue.

library(igraph)

goGraph <- graph.edgelist(adjList[, c("parent", "child"), drop = FALSE], directed = TRUE)
goGraph <- set.edge.attribute(goGraph, "type", value = adjList[,"rel"])

goTerms <- sapply(V(goGraph)$name, function(x)
        {
            term <- GOTERM[[x]]@Term
            
                term <- gsub("( process)$", " pr.", term)
                term <- gsub(" development ", " dev. ", term)
                term <- gsub("( development)$", " dev.", term, perl = TRUE)
                term <- gsub("Development ", "Dev. ", term)
            
            # split in two lines
            spaceLoc <- gregexpr(" ", term, fixed = TRUE)[[1]]
            if ((spaceLoc[1] != -1) && (length(spaceLoc) > 1))
            {
                spaceLoc <- spaceLoc[ceiling(length(spaceLoc)/2)]
                term <- paste(substr(term, 1, spaceLoc - 1), substr(term, spaceLoc + 1, nchar(term)), sep = "\n")
            }

            term
})

goTerms_df <- as.data.frame(goTerms)

goTerms_df$color <- ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "C. elegans"), 1], "green",
                           ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Chicken"), 1], "yellowgreen",
                                  ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Chimpanzee"), 1], "violetred",
                                         ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Dog"), 1], "turquoise",
                                                ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Fruitfly"), 1], "violet", 
                                                       ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Human"), 1], "thistle", 
                                                              ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Mouse"), 1], "tan1", 
                                                                     ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Pig"), 1], "steelblue", 
                                                                            ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Rat"), 1], "orangered", 
                                                                                   ifelse(rownames(goTerms_df) %in% 
                                                                                   unique_common_all[which(unique_common_all$xxx == "Zebrafish"), 1], "olivedrab", 
                                                                                          ifelse(rownames(goTerms_df) %in% 
                                                                                          unique_common_all[which(unique_common_all$xxx == "common"), 1], "cyan", 
                                                                                          ifelse(rownames(goTerms_df) %in% goChilds, "red", "skyblue2"))))))))))))

V(goGraph)$color <- adjustcolor(goTerms_df$color, alpha.f = 0.8)
V(goGraph)$label <- goTerms

Ecolor <- ifelse(E(goGraph)$type == "is_a", "slategrey",
                           ifelse(E(goGraph)$type == "part_of", "springgreen4",
                           ifelse(E(goGraph)$type == "regulates", "tomato",
                           ifelse(E(goGraph)$type == "negatively_regulates", "orange", "greenyellow"))))

E(goGraph)$color <- Ecolor
plot(goGraph,
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     vertex.size = 2,
     margin = 0)

labels <- c("C. elegans", "Chicken", "Chimpanzee", "Dog", "Fruitfly", "Human", "Mouse", "Pig", "Rat", "Zebrafish", "common", "other enriched GO", "parent")
colors <- c("green", "yellowgreen", "violetred", "turquoise", "violet", "thistle", "tan1", "steelblue", "orangered", "olivedrab", "cyan", "red", "skyblue2")
legend("topleft", labels, pch = 19,
       col = colors, pt.cex = 2, cex = 2, bty = "n", ncol = 2)

For the output network graph, see summary at the beginning of this post.



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] igraph_1.0.1          GO.db_3.4.0           tidyr_0.6.0          
##  [4] RamiGO_1.20.0         gsubfn_0.6-6          proto_1.0.0          
##  [7] rlist_0.4.6.1         wordcloud_2.5         RColorBrewer_1.1-2   
## [10] SnowballC_0.5.1       tm_0.6-2              NLP_0.1-9            
## [13] dplyr_0.5.0           clusterProfiler_3.2.9 DOSE_3.0.9           
## [16] ggplot2_2.2.1         org.Cf.eg.db_3.4.0    org.Mm.eg.db_3.4.0   
## [19] org.Dm.eg.db_3.4.0    org.Dr.eg.db_3.4.0    org.Gg.eg.db_3.4.0   
## [22] org.Hs.eg.db_3.4.0    org.Ss.eg.db_3.4.0    org.Pt.eg.db_3.4.0   
## [25] org.Ce.eg.db_3.4.0    org.Sc.sgd.db_3.4.0   org.Rn.eg.db_3.4.0   
## [28] AnnotationDbi_1.36.0  IRanges_2.8.1         S4Vectors_0.12.1     
## [31] Biobase_2.34.0        BiocGenerics_0.20.0   biomaRt_2.30.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8        png_0.1-7          assertthat_0.1    
##  [4] rprojroot_1.1      digest_0.6.11      slam_0.1-40       
##  [7] R6_2.2.0           plyr_1.8.4         backports_1.0.4   
## [10] RSQLite_1.1-1      evaluate_0.10      lazyeval_0.2.0    
## [13] data.table_1.10.0  qvalue_2.6.0       rmarkdown_1.3     
## [16] labeling_0.3       splines_3.3.2      BiocParallel_1.8.1
## [19] RCytoscape_1.24.1  stringr_1.1.0      RCurl_1.95-4.8    
## [22] munsell_0.4.3      fgsea_1.0.2        htmltools_0.3.5   
## [25] tcltk_3.3.2        tibble_1.2         gridExtra_2.2.1   
## [28] codetools_0.2-15   XML_3.98-1.5       bitops_1.0-6      
## [31] grid_3.3.2         gtable_0.2.0       DBI_0.5-1         
## [34] magrittr_1.5       scales_0.4.1       graph_1.52.0      
## [37] stringi_1.1.2      GOSemSim_2.0.3     reshape2_1.4.2    
## [40] DO.db_2.9          fastmatch_1.0-4    tools_3.3.2       
## [43] yaml_2.1.14        colorspace_1.3-2   memoise_1.0.0     
## [46] knitr_1.15.1       XMLRPC_0.3-0

How to map your Google location history with R

2016-12-30 00:00:00 +0000

It’s no secret that Google Big Brothers most of us. But at least they allow us to access quite a lot of the data they have collected on us. Among this is the Google location history.

If you want to see a few ways how to quickly and easily visualize your location history with R, stay tuned…

The Google location history can be downloaded from your Google account under https://takeout.google.com/settings/takeout. Make sure you only tick “location history” for download, otherwise it will take super long to get all your Google data.

The data Google provides you for download is a .json file and can be loaded with the jsonlite package. Loading this file into R might take a few minutes because it can be quite big, depending on how many location points Google had saved about you.

library(jsonlite)
system.time(x <- fromJSON("Standortverlauf.json"))
##    user  system elapsed 
##  112.52    1.12  132.59


The date and time column is in the POSIX milliseconds format, so I converted it to human readable POSIX.

Similarly, longitude and latitude are saved in E7 format and were converted to GPS coordinates.

# extracting the locations dataframe
loc = x$locations

# converting time column from posix milliseconds into a readable time scale
loc$time = as.POSIXct(as.numeric(x$locations$timestampMs)/1000, origin = "1970-01-01")

# converting longitude and latitude from E7 to GPS coordinates
loc$lat = loc$latitudeE7 / 1e7
loc$lon = loc$longitudeE7 / 1e7


This is how the data looks like now:

head(loc)
##     timestampMs latitudeE7 longitudeE7 accuracy                 activitys
## 1 1482393378938  519601402    76004708       29                      NULL
## 2 1482393333953  519601402    76004708       29                      NULL
## 3 1482393033893  519603616    76002628       20 1482393165600, still, 100
## 4 1482392814435  519603684    76001572       20 1482392817678, still, 100
## 5 1482392734911  519603684    76001572       20                      NULL
## 6 1482392433788  519603684    76001572       20                      NULL
##   velocity heading altitude                time      lat      lon
## 1       NA      NA       NA 2016-12-22 08:56:18 51.96014 7.600471
## 2       NA      NA       NA 2016-12-22 08:55:33 51.96014 7.600471
## 3       NA      NA       NA 2016-12-22 08:50:33 51.96036 7.600263
## 4       NA      NA       NA 2016-12-22 08:46:54 51.96037 7.600157
## 5       NA      NA       NA 2016-12-22 08:45:34 51.96037 7.600157
## 6       NA      NA       NA 2016-12-22 08:40:33 51.96037 7.600157

We have the original and converted time, latitude and longitude columns, plus accuracy, activities, velocity, heading and altitude. Accuracy gives the error distance around the point in metres. Activities are saved as a list of data frames and will be explored further down. Velocity, heading and altitude were not recorded for earlier data points.


Data stats

Before I get to actually plotting maps, I want to explore a few basic statistics of the data.


How many data points did Google record over what period of time?

# how many rows are in the data frame?
nrow(loc)
## [1] 600897
min(loc$time)
## [1] "2013-09-06 19:33:41 CEST"
max(loc$time)
## [1] "2016-12-22 08:56:18 CET"


And how are they distributed over days, months and years?

# calculate the number of data points per day, month and year
library(lubridate)
library(zoo)

loc$date <- as.Date(loc$time, '%Y/%m/%d')
loc$year <- year(loc$date)
loc$month_year <- as.yearmon(loc$date)

points_p_day <- data.frame(table(loc$date), group = "day")
points_p_month <- data.frame(table(loc$month_year), group = "month")
points_p_year <- data.frame(table(loc$year), group = "year")

How many days were recorded?

nrow(points_p_day)
## [1] 1057

How many months?

nrow(points_p_month)
## [1] 39

And how many years?

nrow(points_p_year)
## [1] 4
# set up plotting theme
library(ggplot2)
library(ggmap)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_grey(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "navy"),
    legend.position = "right",
    legend.background = element_blank(),
    panel.margin = unit(.5, "lines"),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}
points <- rbind(points_p_day[, -1], points_p_month[, -1], points_p_year[, -1])

ggplot(points, aes(x = group, y = Freq)) + 
  geom_point(position = position_jitter(width = 0.2), alpha = 0.3) + 
  geom_boxplot(aes(color = group), size = 1, outlier.colour = NA) + 
  facet_grid(group ~ ., scales = "free") + my_theme() +
  theme(
    legend.position = "none",
    strip.placement = "outside",
    strip.background = element_blank(),
    strip.text = element_blank(),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)
  ) +
  labs(
    x = "",
    y = "Number of data points",
    title = "How many data points did Google collect about me?",
    subtitle = "Number of data points per day, month and year",
    caption = "\nGoogle collected between 0 and 1500 data points per day
    (median ~500), between 0 and 40,000 per month (median ~15,000) and 
    between 80,000 and 220,000 per year (median ~140,000)."
  )


How accurate is the data?

Accuracy is given in meters, i.e. the smaller the better.

accuracy <- data.frame(accuracy = loc$accuracy, group = ifelse(loc$accuracy < 800, "high", ifelse(loc$accuracy < 5000, "middle", "low")))

accuracy$group <- factor(accuracy$group, levels = c("high", "middle", "low"))

ggplot(accuracy, aes(x = accuracy, fill = group)) + 
  geom_histogram() + 
  facet_grid(group ~ ., scales="free") + 
  my_theme() +
  theme(
    legend.position = "none",
    strip.placement = "outside",
    strip.background = element_blank(),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)
  ) +
  labs(
    x = "Accuracy in metres",
    y = "Count",
    title = "How accurate is the location data?",
    subtitle = "Histogram of accuracy of location points",
    caption = "\nMost data points are pretty accurate, 
but there are still many data points with a high inaccuracy.
    These were probably from areas with bad satellite reception."
  )


Plotting data points on maps

Finally, we are actually going to plot some maps!

The first map is a simple point plot of all locations recorded around Germany.

You first specify the map area and the zoom factor (the furthest away is 1); the bigger the zoom, the closer to the center of the specified location. Location can be given as longitude/latitude pair or as city or country name.

On this map we can plot different types of plot with the regular ggplot2 syntax. For example, a point plot.

germany <- get_map(location = 'Germany', zoom = 5)

ggmap(germany) + geom_point(data = loc, aes(x = lon, y = lat), alpha = 0.5, color = "red") + 
  theme(legend.position = "right") + 
  labs(
    x = "Longitude", 
    y = "Latitude", 
    title = "Location history data points in Europe",
    caption = "\nA simple point plot shows recorded positions.")


The second map shows a 2D bin plot of accuracy measured for all data points recorded in my home town Münster.

munster <- get_map(location = 'Munster', zoom = 12)

options(stringsAsFactors = T)
ggmap(munster) + 
  stat_summary_2d(geom = "tile", bins = 100, data = loc, aes(x = lon, y = lat, z = accuracy), alpha = 0.5) + 
  scale_fill_gradient(low = "blue", high = "red", guide = guide_legend(title = "Accuracy")) +
  labs(
    x = "Longitude", 
    y = "Latitude", 
    title = "Location history data points around Münster",
    subtitle = "Color scale shows accuracy (low: blue, high: red)",
    caption = "\nThis bin plot shows recorded positions 
    and their accuracy in and around Münster")


We can also plot the velocity of each data point:

loc_2 <- loc[which(!is.na(loc$velocity)), ]

munster <- get_map(location = 'Munster', zoom = 10)

ggmap(munster) + geom_point(data = loc_2, aes(x = lon, y = lat, color = velocity), alpha = 0.3) + 
  theme(legend.position = "right") + 
  labs(x = "Longitude", y = "Latitude", 
       title = "Location history data points in Münster",
       subtitle = "Color scale shows velocity measured for location",
       caption = "\nA point plot where points are colored according 
       to velocity nicely reflects that I moved generally 
       slower in the city center than on the autobahn") +
  scale_colour_gradient(low = "blue", high = "red", guide = guide_legend(title = "Velocity"))


What distance did I travel?

To obtain the distance I moved, I am calculating the distance between data points. Because it takes a long time to calculate, I am only looking at data from last year.

loc3 <- with(loc, subset(loc, loc$time > as.POSIXct('2016-01-01 0:00:01')))
loc3 <- with(loc, subset(loc3, loc$time < as.POSIXct('2016-12-22 23:59:59')))

# Shifting vectors for latitude and longitude to include end position
shift.vec <- function(vec, shift){
  if (length(vec) <= abs(shift)){
    rep(NA ,length(vec))
  } else {
    if (shift >= 0) {
      c(rep(NA, shift), vec[1:(length(vec) - shift)]) }
    else {
      c(vec[(abs(shift) + 1):length(vec)], rep(NA, abs(shift)))
    }
  }
}

loc3$lat.p1 <- shift.vec(loc3$lat, -1)
loc3$lon.p1 <- shift.vec(loc3$lon, -1)

# Calculating distances between points (in metres) with the function pointDistance from the 'raster' package.
library(raster)
loc3$dist.to.prev <- apply(loc3, 1, FUN = function(row) {
  pointDistance(c(as.numeric(as.character(row["lat.p1"])),
                  as.numeric(as.character(row["lon.p1"]))),
                c(as.numeric(as.character(row["lat"])), as.numeric(as.character(row["lon"]))),
                lonlat = T) # Parameter 'lonlat' has to be TRUE!
})
# distance in km
round(sum(as.numeric(as.character(loc3$dist.to.prev)), na.rm = TRUE)*0.001, digits = 2)
## [1] 54466.08
distance_p_month <- aggregate(loc3$dist.to.prev, by = list(month_year = as.factor(loc3$month_year)), FUN = sum)
distance_p_month$x <- distance_p_month$x*0.001
ggplot(distance_p_month[-1, ], aes(x = month_year, y = x,  fill = month_year)) + 
  geom_bar(stat = "identity")  + 
  guides(fill = FALSE) +
  my_theme() +
  labs(
    x = "",
    y = "Distance in km",
    title = "Distance traveled per month in 2016",
    caption = "This barplot shows the sum of distances between recorded 
    positions for 2016. In September we went to the US and Canada."
  )


Activities

Google also guesses my activity based on distance travelled per time.

Here again, it would take too long to look at activity from all data points.

activities <- loc3$activitys

list.condition <- sapply(activities, function(x) !is.null(x[[1]]))
activities  <- activities[list.condition]

df <- do.call("rbind", activities)
main_activity <- sapply(df$activities, function(x) x[[1]][1][[1]][1])

activities_2 <- data.frame(main_activity = main_activity, 
                           time = as.POSIXct(as.numeric(df$timestampMs)/1000, origin = "1970-01-01"))

head(activities_2)
##   main_activity                time
## 1         still 2016-12-22 08:52:45
## 2         still 2016-12-22 08:46:57
## 3         still 2016-12-22 08:33:24
## 4         still 2016-12-22 08:21:31
## 5         still 2016-12-22 08:15:32
## 6         still 2016-12-22 08:10:25
ggplot(activities_2, aes(x = main_activity, group = main_activity, fill = main_activity)) + 
  geom_bar()  + 
  guides(fill = FALSE) +
  my_theme() +
  labs(
    x = "",
    y = "Count",
    title = "Main activities in 2016",
    caption = "Associated activity for recorded positions in 2016. 
    Because Google records activity probabilities for each position, 
    only the activity with highest likelihood were chosen for each position."
  )



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] raster_2.5-8    sp_1.2-3        ggmap_2.6.2     ggplot2_2.2.0  
## [5] zoo_1.7-13      lubridate_1.6.0 jsonlite_1.1   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8       plyr_1.8.4        bitops_1.0-6     
##  [4] tools_3.3.2       digest_0.6.10     evaluate_0.10    
##  [7] tibble_1.2        gtable_0.2.0      lattice_0.20-34  
## [10] png_0.1-7         DBI_0.5-1         mapproj_1.2-4    
## [13] yaml_2.1.14       proto_1.0.0       stringr_1.1.0    
## [16] dplyr_0.5.0       knitr_1.15.1      RgoogleMaps_1.4.1
## [19] maps_3.1.1        rprojroot_1.1     grid_3.3.2       
## [22] R6_2.2.0          jpeg_0.1-8        rmarkdown_1.2    
## [25] reshape2_1.4.2    magrittr_1.5      codetools_0.2-15 
## [28] backports_1.0.4   scales_0.4.1      htmltools_0.3.5  
## [31] assertthat_0.1    colorspace_1.3-2  geosphere_1.5-5  
## [34] labeling_0.3      stringi_1.1.2     lazyeval_0.2.0   
## [37] munsell_0.4.3     rjson_0.2.15

Animating Plots of Beer Ingredients and Sin Taxes over Time

2016-12-22 00:00:00 +0000

With the upcoming holidays, I thought it fitting to finally explore the ttbbeer package. It contains data on beer ingredients used in US breweries from 2006 to 2015 and on the (sin) tax rates for beer, champagne, distilled spirits, wine and various tobacco items since 1862.

I will be exploring these datasets and create animated plots with the animation package and the gganimate extension for ggplot2.


Merry R-Mas and Cheers! :-)

(For full code see below…)



Beer materials since 2006

This dataset includes the monthly aggregates of ingredients as reported by the Beer Monthly Statistical Releases, the Brewer’s Report of Operations and the Quarterly Brewer’s Report of Operations from the US Alcohol and Tobacco Tax and Trade Bureau (TTB).

The following nine ingredients are listed:

  • Malt and malt products
  • Corn and corn products
  • Rice and rice products
  • Barley and barley products
  • Wheat and wheat products
  • Sugar and syrups
  • Dry hops
  • Hops extracts
  • Other ingredients, like flavors, etc.

All ingredient amounts are given in pounds.

I’m not only interested in the total amounts of ingredients that were used but also in the change compared to the previous month. This, I’m calculating as the percent of change compared to the previous entry, starting with 0 in January 2006.

The first thing I wanted to know was how many ingredients were used in total each month. This probably gives a rough estimate of how much beer was produced that month.


The seasonal variation in ingredient use shows that more beer is produced in late spring/ early summer than in winter. Given that it takes about one to four months from production to the finished product, it fits my intuition that people drink more beer when it’s warm outside rather than in the cold of winter.

Now lets look at the ingredients separately. This plot shows the monthly sum of pounds for each ingredient:


We can see the following trends:

  • The seasonal trend in overall beer production is reflected in the individual ingredients
  • Barley and wheat use has been steadily increasing since 2006
  • Corn and corn products has spiked between 2008 and 2011 and slightly dropped since
  • Dry hops has two very prominent spikes in 2014 and 2015, we will look at these in more detail in the following figures
  • The use of hops extracts, malt and rice has slighty decreased since 2006
  • Sugar and syrup use has dropped of between 2006 and 2010 and has since been relatively stable, with the exception of two months in 2014 and 2015
  • Other additives, like flavors, have increased sharply since 2009; in the last two years their use shows extremely high variability between summer (many added) and winter (fewer added); this might be due to the popularity of craft beers and light summer variants.

We can see that barley use spikes every third month, except for a few outlier months. Dy hops is interesting as there are two very prominent spikes in December 2014 and in December 2015. In the same months we see a drop for corn, hops extracts, malt, other ingredients, sugar and syrups and wheat.

The following plot shows the percent of change compared to the previous month for the monthly sums of ingredients:


Here we can again see the seasonality of ingredient use, while we can also see that the overall trend shows no significant change from zero.

The two outlier months, December 2014 and 2015 are quite conspicuous in these plots as well. This might be due to specific trends or there were shortages of supply for fresh hops and malt during these months (even though we don’t see a drop in overall ingredient amounts).

The first animated plot shows the percentages of ingredients that were used per month. Each frame shows one year.


The percentage plots show that there are no major differences in the proportions of ingredients, again with the exception of December 2014 and 2015.



Sin taxes since 1862

Tax rates for beer, champagne, distilled spirits, wine (separated into three alcohol percentage classes: up to 14%, 14 - 21% and 21 24%) and various tobacco items are included in separate datasets. The tax rates are given in US Dollar per

  • Barrel of beer (31 gallons)
  • Wine gallon
  • Proof barrel of distilled spirits
  • 1000 units of small and large cigarettes and cigars
  • 1 lb. of pipe, chewing and roll-your-own tobacco and snuff

Before I can visualize the data, I need to to do some tidying and preprocessing:

  • I’m adding a column with the tax group
  • to keep it consistent, I’m converting all tobacco item names to upper case letters
  • to dates in the tobacco set were probably wrong, because they messed up the plots, so I corrected them

For each tax group I also want to know the percent of change in tax rate compared to the previous time point.

The first plot shows a time line of tax rate changes since 1862. Major historical events are marked.


The strongest increase in tax rates can be seen for champagne in the 1950s and for wine (14%) following the end of the Cold War. At the end of the civil war, there is a sharp drop in the tax rates for distilled spirits. Spirit tax increased little between 1874 and the beginning of prohibition. During prohibition we can also see an increase of tax rates for champagne and wine (21-24%). While tax rates remained stable during the Cold War, they increased again in 1991 for all drinks, except for spirits.

The same plot as above but with absolute tax rates is visualized in the animated plot below:



The plot below catches the same information in a static plot:


All tax rates have increased strongly since 1862. Beer and wine show a strong jump in 1991 and at the beginning of prohibition. Between the end of prohibition and World War II, beer and wine taxes had dropped again, albeit slightly. Since the 1940s, all alcohol taxes have risen steadily. For tobacoo there is only sporadic information.



R code

library(ttbbeer)
data("beermaterials")
library(dplyr)

# order by year from least to most recent
beermaterials_change <- arrange(beermaterials, Year)

# calculate percent change for each ingredient column
for (i in 3:ncol(beermaterials_change)){
  
  value <- beermaterials_change[[colnames(beermaterials_change)[i]]]
  beermaterials_change[, i] <- c(0, 
                       100 * round((value[-1] - value[1:(length(value)-1)]) /
                                     value[1:(length(value)-1)], 2))
  
}

head(beermaterials_change)
## # A tibble: 6 × 11
##      Month  Year Malt_and_malt_products `Corn_and _corn_products`
##      <chr> <int>                  <dbl>                     <dbl>
## 1  January  2006                      0                         0
## 2 February  2006                     -8                        -1
## 3    March  2006                     11                         7
## 4    April  2006                      2                        -6
## 5      May  2006                      5                        10
## 6     June  2006                     -2                       -15
## # ... with 7 more variables: Rice_and_rice_products <dbl>,
## #   Barley_and_barley_products <dbl>, Wheat_and_wheat_products <dbl>,
## #   Sugar_and_syrups <dbl>, Hops_dry <dbl>, Hops_extracts <dbl>,
## #   Other <dbl>
#calcluate sum of ingredients per month
beermaterials$sum <- rowSums(beermaterials[, -c(1:2)])

For plotting I’m converting the tables from wide to long format:

library(tidyr)

beermaterials_gather <- beermaterials %>%
  gather(Ingredient, Amount, Malt_and_malt_products:Other)
beermaterials_gather$Date <- paste("01", beermaterials_gather$Month, beermaterials_gather$Year, sep = "-") %>%
  as.Date(format = "%d-%B-%Y")
beermaterials_gather$Ingredient <- gsub("_", " ", beermaterials_gather$Ingredient)
beermaterials_change_gather <- beermaterials_change %>%
  gather(Ingredient, Amount, Malt_and_malt_products:Other)
beermaterials_change_gather$Date <- paste("01", beermaterials_change_gather$Month, beermaterials_change_gather$Year, sep = "-") %>%
  as.Date(format = "%d-%B-%Y")
beermaterials_change_gather$Ingredient <- gsub("_", " ", beermaterials_change_gather$Ingredient)

And I’m preparing my custom ggplot2 theme:

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "bottom",
    legend.justification = "top", 
    legend.box = "horizontal",
    legend.box.background = element_rect(colour = "grey50"),
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())
ggplot(beermaterials_gather, aes(x = Date, y = sum)) +
  geom_point(size = 1.5, alpha = 0.6, color = "navy") +
  geom_smooth(alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_line(color = "navy") +
  guides(color = FALSE) +
  labs(
    x = "",
    y = "Amount in pounds",
    title = "Sum of Beer Ingredients by Month and Year",
    subtitle = "Sum of monthly aggregates of all ingredients from 2006 to 2015"
  ) +
  scale_x_date(date_breaks = "6 month", date_labels = "%B %Y")
ggplot(beermaterials_gather, aes(x = Date, y = Amount, color = Ingredient)) +
  geom_point(size = 1.5, alpha = 0.6) +
  geom_smooth(alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_line() +
  facet_wrap(~Ingredient, ncol = 3, scales = "free_y") +
  guides(color = FALSE) +
  labs(
    x = "",
    y = "Amount in pounds",
    title = "Beer Ingredients by Month and Year",
    subtitle = "Monthly aggregates of ingredients from 2006 to 2015"
  ) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
ggplot(beermaterials_gather, aes(x = Date, y = Amount, color = Ingredient)) +
  geom_point(size = 1.5, alpha = 0.6) +
  guides(color = FALSE) +
  geom_line() +
  labs(
    x = "",
    y = "Amount in pounds",
    title = "Beer Ingredients by Month and Year",
    subtitle = "from January 2006 to December 2015"
  ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%B %Y") +
  facet_wrap(~Ingredient, ncol = 1, scales = "free_y")

To explore the seasonal changes in more detail, I’m widening the resolution of the x-axis.

ggplot(beermaterials_change_gather, aes(x = Date, y = Amount, color = Ingredient)) +
  geom_smooth(alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_point(size = 1.5, alpha = 0.6) +
  geom_line() +
  facet_wrap(~Ingredient, ncol = 3, scales = "free_y") +
  guides(color = FALSE) +
  labs(
    x = "",
    y = "Percent change",
    title = "Beer Ingredients by Month and Year",
    subtitle = "Percent change of monthly aggregates of ingredients compared to previous month from 2006 to 2015"
  ) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")
beermaterials_percent <- cbind(beermaterials[, 1:2], prop.table(as.matrix(beermaterials[, -c(1:2)]), margin = 1) * 100)

beermaterials_percent_gather <- beermaterials_percent %>%
  gather(Ingredient, Amount, Malt_and_malt_products:Other)
beermaterials_percent_gather$Ingredient <- gsub("_", " ", beermaterials_percent_gather$Ingredient)

f = unique(beermaterials_percent_gather$Month)
beermaterials_percent_gather <- within(beermaterials_percent_gather, Month <- factor(Month, levels = f))
library(animation)
ani.options(ani.width=800, ani.height=500)

saveGIF({
  for (year in rev(unique(beermaterials_percent_gather$Year))){
  
  pie <- ggplot(subset(beermaterials_percent_gather, Year == paste(year)), aes(x = "", y = Amount, fill = Ingredient, frame = Year)) + 
  geom_bar(width = 1, stat = "identity") + theme_minimal() +
  coord_polar("y", start = 0) +
  labs(
    title = "Percentage of Beer Ingredients by Year and Month",
    subtitle = paste(year)) +
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.ticks = element_blank(),
  plot.title = element_text(size = 14, face = "bold"),
  legend.title = element_blank(),
  legend.position = "bottom",
  legend.text = element_text(size = 8)
  ) + 
  facet_wrap( ~ Month, ncol = 6) +
    guides(fill = guide_legend(ncol = 9, byrow = FALSE))

  print(pie)
  }
  
}, movie.name = "beer_ingredients.gif")
ggplot(beermaterials_percent_gather, aes(x = Month, y = Amount, color = Ingredient, fill = Ingredient)) + 
  geom_bar(width = 0.8, stat = "identity", alpha = 0.7) + 
  my_theme() +
  guides(color = FALSE) +
  facet_wrap( ~ Year, ncol = 5) +
  labs(
    x = "",
    y = "Percent",
    title = "Percentage of Beer Ingredients by Year and Month",
    subtitle = "from 2006 to 2015",
    fill = ""
  )

To see the information from the animated figure at one glance, the next plots shows the percentages of ingredients per month and year as a barplot:


beertax$tax <- "BEER"
champagnetax$tax <- "CHAMPAGNE"
spirittax$tax <- "SPIRIT"
tobaccotax$ITEM <- toupper(tobaccotax$ITEM)
tobaccotax$tax <- paste("TOBACCO", tobaccotax$ITEM, sep = " ")
tobaccotax <- tobaccotax[, -1]
tobaccotax[18, 2] <- "1977-01-31"
tobaccotax[18, 1] <- "1977-01-01"
winetax14$tax <- "WINE 14%"
winetax1421$tax <- "WINE 14-21%"
winetax2124$tax <- "WINE 21-24%"
taxes_list <- c("beertax", "champagnetax", "spirittax", "tobaccotax", "winetax14", "winetax1421", "winetax2124")

for (tax in taxes_list){
  
  tax_df <- get(tax)
  tax_df <- tax_df[match(sort(tax_df$FROM), tax_df$FROM), ]
  
  if (tax != "tobaccotax"){
    tax_df$change <- c(0, 
                       100 * round((tax_df$RATE[-1] - tax_df$RATE[1:(length(tax_df$RATE)-1)]) /
                                     tax_df$RATE[1:(length(tax_df$RATE)-1)], 2))

  } else {
    tax_df$change <- "NA"
  }
  
  if (tax == "beertax"){
    tax_df_2 <- tax_df
  } else {
    tax_df_2 <- rbind(tax_df_2, tax_df)
  }
}

tax_2 <- tax_df_2
tax_2$ID <- paste(tax_2$tax, rownames(tax_2), sep = "_")
head(tax_2)
##         FROM         TO RATE  tax change     ID
## 4 1862-09-01 1863-03-03  1.0 BEER      0 BEER_4
## 5 1863-03-03 1864-03-31  0.6 BEER    -40 BEER_5
## 6 1864-04-01 1898-06-13  1.0 BEER     67 BEER_6
## 7 1898-06-14 1901-06-30  2.0 BEER    100 BEER_7
## 8 1901-07-01 1902-06-30  1.6 BEER    -20 BEER_8
## 9 1902-07-01 1914-10-22  1.0 BEER    -38 BEER_9

Now I can gather the dataframe for plotting.

library(tidyr)
tax_gather <- tax_2 %>%
  gather(dat_column, Date, FROM:TO)
tax_gather$group <- gsub(" .*", "", tax_gather$tax)
d <- data.frame(xmin = as.Date(c("1861-04-12", "1914-07-28", "1920-01-16", "1939-11-01", "1947-03-12")), 
                ymin = -Inf, 
                xmax = as.Date(c("1865-04-09", "1918-11-11", "1933-12-05", "1945-05-08", "1991-12-26")), 
                ymax = Inf, 
                description = c("Civial War", "WW1", "Prohibition", "WW2", "Cold War"),
                pos_y = as.numeric(c(rep(c("1700", "1500"), 2), "1700")))
d$pos_x <- as.Date(NA)

for (i in 1:nrow(d)){
  d[i, "pos_x"] <- as.Date(as.character(median(c(d$xmin[i], d$xmax[i]))))
}

tax_gather$group2 <- ifelse(tax_gather$group == "TOBACCO", "TOBACCO", "ALCOHOL")

ggplot() +
  geom_rect(data = d, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), alpha=0.2, fill = "navy") +
  geom_label(data = d, aes(pos_x, pos_y, label = description), check_overlap = TRUE) +
  geom_point(data = subset(tax_gather, group2 == "ALCOHOL"), aes(x = Date, y = as.numeric(change), color = tax), size = 1.5, alpha = 0.6) +
  geom_line(data = subset(tax_gather, group2 == "ALCOHOL"), aes(x = Date, y = as.numeric(change), color = tax), size = 1) +
  facet_wrap(~group2, ncol = 1, scales = "free_y") +
  labs(
    x = "",
    y = "Tax rate change in %",
    color = "",
    title = "Alcohol tax rate changes during history",
    subtitle = "Change in percent compared to previous timepoint from 1862 to 2016"
  ) +
  guides(color = guide_legend(ncol = 6, byrow = FALSE)) +
  scale_x_date(date_breaks = "10 year", date_labels = "%Y")
d$pos_y <- as.numeric(c(rep(c("20", "18"), 2), "20"))

library(gganimate)

p <- ggplot() +
  geom_rect(data = d, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), alpha=0.2, fill = "navy") +
  geom_label(data = d, aes(pos_x, pos_y, label = description), check_overlap = TRUE) +
  geom_point(data = tax_gather, aes(x = Date, y = RATE, color = tax, frame = tax), size = 2, alpha = 0.6) +
  geom_line(data = tax_gather, aes(x = Date, y = RATE, color = tax, frame = tax), size = 1.5) +
  labs(
    x = "",
    y = "Tax rate",
    color = "",
    title = "Tax rates of ",
    subtitle = "from 1862 to 2016"
  ) +
  guides(color = guide_legend(ncol = 6, byrow = FALSE)) +
  scale_x_date(date_breaks = "10 year", date_labels = "%Y")

ani.options(ani.width=1000, ani.height=600)
gganimate(p, filename = "taxes_by_tax.mp4", cumulative = TRUE)
gganimate(p, filename = "taxes_by_tax.gif", cumulative = TRUE)
ggplot() +
  geom_rect(data = d, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), alpha=0.2, fill = "navy") +
  geom_smooth(data = tax_gather, aes(x = Date, y = RATE, color = tax), alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_point(data = tax_gather, aes(x = Date, y = RATE, color = tax), size = 1.5, alpha = 0.6) +
  geom_line(data = tax_gather, aes(x = Date, y = RATE, color = tax), size = 1) +
  facet_wrap(~group, ncol = 2, scales = "free_y") +
  labs(
    x = "",
    y = "Tax rate",
    color = "",
    title = "Alcohol and tobacco tax rates",
    subtitle = "from 1862 to 2016"
  ) +
 guides(color = guide_legend(ncol = 6, byrow = FALSE)) +
 scale_x_date(date_breaks = "20 year", date_labels = "%Y")

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.0 tidyr_0.6.0   dplyr_0.5.0   ttbbeer_1.1.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8      knitr_1.15.1     magrittr_1.5     munsell_0.4.3   
##  [5] lattice_0.20-34  colorspace_1.3-2 R6_2.2.0         stringr_1.1.0   
##  [9] plyr_1.8.4       tools_3.3.2      grid_3.3.2       nlme_3.1-128    
## [13] gtable_0.2.0     mgcv_1.8-16      DBI_0.5-1        htmltools_0.3.5 
## [17] yaml_2.1.14      lazyeval_0.2.0   assertthat_0.1   rprojroot_1.1   
## [21] digest_0.6.10    tibble_1.2       Matrix_1.2-7.1   evaluate_0.10   
## [25] rmarkdown_1.2    labeling_0.3     stringi_1.1.2    scales_0.4.1    
## [29] backports_1.0.4

How to build a Shiny app for disease- & trait-associated locations of the human genome

2016-12-18 00:00:00 +0000

This app is based on the gwascat R package and its ebicat38 database and shows trait-associated SNP locations of the human genome. You can visualize and compare the genomic locations of up to 8 traits simultaneously.

The National Human Genome Research Institute (NHGRI) catalog of Genome-Wide Association Studies (GWAS) is a curated resource of single-nucleotide polymorphism (SNP)-trait associations. The database contains more than 100,000 SNPs and all SNP-trait associations with a p-value <1 × 10^−5.


You can access the app at

https://shiring.shinyapps.io/gwas_shiny_app/

Loading the data might take a few seconds. Patience you must have, my young padawan… ;-)


Alternatively, if you are using R, you can load the app via Github with shiny:

library(shiny)
runGitHub("ShirinG/GWAS_Shiny_App") 



If you want to know how I built this app, contine reading.


Data preparation

Initially, I wanted to load all data directly from R packages. And while this worked in principal, it made the app load super slowly.

So, I decided to prepare the data beforehand and save the datatables as tab-delimited text files and load them in the app.


Genome information

First, I’m preparing the genome information: for each chromosome I want to know its length:

library(AnnotationDbi)
library(org.Hs.eg.db)

library(EnsDb.Hsapiens.v79)
edb <- EnsDb.Hsapiens.v79

keys <- keys(edb, keytype="SEQNAME")
chromosome_length <- select(edb, keys = keys, columns = c("SEQLENGTH", "SEQNAME"), keytype = "SEQNAME")
chromosome_length <- chromosome_length[grep("^[0-9]+$|^X$|^Y$|^MT$", chromosome_length$SEQNAME), ]

write.table(chromosome_length, "chromosome_length.txt", row.names = FALSE, col.names = TRUE, sep = "\t")


GWAS SNP data

I am saving the GWAS data as a text file as well; this datatable will be used for plotting the SNP locations.

I am also saving a table with the alphabetically sorted traits as input for the drop-down menu.

library(gwascat)
data(ebicat38)

gwas38 <- as.data.frame(ebicat38)
gwas38$DISEASE.TRAIT <- gsub("&beta;", "beta", gwas38$DISEASE.TRAIT)

# PVALUE_MLOG: -log(p-value)
gwas38[is.infinite(gwas38$PVALUE_MLOG), "PVALUE_MLOG"] <- max(gwas38[is.finite(gwas38$PVALUE_MLOG), "PVALUE_MLOG"]) + 10
summary(gwas38[is.finite(gwas38$PVALUE_MLOG), "PVALUE_MLOG"])
summary(gwas38$PVALUE_MLOG)

# OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1. Appropriate unit and increase/decrease are included for beta coefficients.
summary(gwas38$OR.or.BETA)

write.table(gwas38, "gwas38.txt", row.names = FALSE, col.names = TRUE, sep = "\t")

gwas38_traits <- as.data.frame(table(gwas38$DISEASE.TRAIT))
colnames(gwas38_traits) <- c("Trait", "Frequency")

write.table(gwas38_traits, "gwas38_traits.txt", row.names = FALSE, col.names = TRUE, sep = "\t")


The Shiny App

I built my Shiny app with the traditional two-file system. This means that I have a “ui.R” file containing the layout and a “server.R” file, which contains the R code.


ui.R

From top to bottom I chose the following settings:

  • loading the table with all traits
  • adding a “choose below” option before the traits to have no trait chosen by default
  • “united” theme for layout
  • two slider bars, one for the p-value threshold and one for the odds ratio/ beta coefficient threshold
  • drop-down menus for all traits, maximal eight can be chosen to be plotted simultaneously
  • main panel with explanatory text, main plot and output tables
library(shiny)
library(shinythemes)

gwas38_traits <- read.table("gwas38_traits.txt", header = TRUE, sep = "\t")

diseases <- c("choose below", as.character(gwas38_traits$Trait))

shinyUI(fluidPage(theme = shinytheme("united"),
                  titlePanel("GWAS disease- & trait-associated SNP locations of the human genome"),

                  sidebarLayout(
                    sidebarPanel(

                      sliderInput("pvalmlog",
                                  "-log(p-value):",
                                  min = -13,
                                  max = 332,
                                  value = -13),

                      sliderInput("orbeta",
                                  "odds ratio/ beta-coefficient :",
                                  min = 0,
                                  max = 4426,
                                  value = 0),

                      selectInput("variable1", "First trait:",
                                  choices = diseases[-1]),

                      selectInput("variable2", "Second trait:",
                                  choices = diseases),

                      selectInput("variable3", "Third trait:",
                                  choices = diseases),

                      selectInput("variable4", "Fourth trait:",
                                  choices = diseases),

                      selectInput("variable5", "Fifth trait:",
                                  choices = diseases),

                      selectInput("variable6", "Sixth trait:",
                                  choices = diseases),

                      selectInput("variable7", "Seventh trait:",
                                  choices = diseases),

                      selectInput("variable8", "Eighth trait:",
                                  choices = diseases)
                    ),
                    mainPanel(
                      br(),
                      p("The National Human Genome Research Institute (NHGRI) catalog of Genome-Wide Association Studies (GWAS) is a curated resource of single-nucleotide polymorphisms (SNP)-trait associations. The database contains more than 100,000 SNPs and all SNP-trait associations with a p-value <1 × 10^−5."),
                      p("This app is based on the 'gwascat' R package and its 'ebicat38' database and shows trait-associated SNP locations of the human genome."),
                      p("For more info on how I built this app check out", a("my blog.", href = "https://shiring.github.io/")),
                      br(),
                      h4("How to use this app:"),
                      div("Out of 1320 available traits or diseases you can choose up to 8 on the left-side panel und see their chromosomal locations below. The traits are sorted alphabetically. You can also start typing in the drop-down panel and traits matching your query will be suggested.", style = "color:blue"),
                      br(),
                      div("With the two sliders on the left-side panel you can select SNPs above a p-value threshold (-log of association p-value) and/or above an odds ratio/ beta-coefficient threshold. The higher the -log of the p-value the more significant the association of the SNP with the trait. Beware that some SNPs have no odds ratio value and will be shown regardless of the threshold.", style = "color:blue"),
                      br(),
                      div("The table directly below the plot shows the number of SNPs for each selected trait (without subsetting when p-value or odds ratio are changed). The second table below the first shows detailed information for each SNP of the chosen traits. This table shows only SNPs which are plotted (it subsets according to p-value and odds ratio thresholds).", style = "color:blue"),
                      br(),
                      div("Loading might take a few seconds...", style = "color:red"),
                      br(),
                      plotOutput("plot"),
                      tableOutput('table'),
                      tableOutput('table2'),
                      br(),
                      p("GWAS catalog:", a("http://www.ebi.ac.uk/gwas/", href = "http://www.ebi.ac.uk/gwas/")),
                      p(a("Welter, Danielle et al. “The NHGRI GWAS Catalog, a Curated Resource of SNP-Trait Associations.” Nucleic Acids Research 42.Database issue (2014): D1001–D1006. PMC. Web. 17 Dec. 2016.", href = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965119/"))
                    )
                  )
))


server.R

This file contains the R code to produce the plot and output tables.

  • The chromosome data is loaded for the chromosome length barplot. In order to have the correct order, I am setting the chromosome factor order manually
  • renderPlot() contains the code for the plot
  • from the eight possible input traits, all unique traits are plotted (removing “choose below” first), unique traits were set here, so as not to plot the same trait on top of themselves if the same trait is chosen in two drop-down menus
  • the first output table shows the number of SNPs for the chosen traits
  • the second output table shows SNP information for all SNPs of the chosen trait(s) within the chosen thresholds
library(shiny)

chr_data <- read.table("chromosome_length.txt", header = TRUE, sep = "\t")

chr_data$SEQNAME <- as.factor(chr_data$SEQNAME)
f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y", "MT")
chr_data <- within(chr_data, SEQNAME <- factor(SEQNAME, levels = f))

library(ggplot2)

gwas38 <- read.table("gwas38.txt", header = TRUE, sep = "\t")
gwas38_traits <- read.table("gwas38_traits.txt", header = TRUE, sep = "\t")

# Define server logic required to plot variables
shinyServer(function(input, output) {

  # Generate a plot of the requested variables
  output$plot <- renderPlot({

    plot_snps <- function(trait){

      if (any(trait == "choose below")){
        trait <- trait[-which(trait == "choose below")]
      } else {
        trait <- unique(trait)
      }

        for (i in 1:length(unique(trait))){

          trait_pre <- unique(trait)[i]
          snps_data_pre <- gwas38[which(gwas38$DISEASE.TRAIT == paste(trait_pre)), ]
          snps_data_pre <- data.frame(Chr = snps_data_pre$seqnames,
                                      Start = snps_data_pre$CHR_POS,
                                      SNPid = snps_data_pre$SNPS,
                                      Trait = rep(paste(trait_pre), nrow(snps_data_pre)),
                                      PVALUE_MLOG = snps_data_pre$PVALUE_MLOG,
                                      OR.or.BETA = snps_data_pre$OR.or.BETA)

          snps_data_pre <- subset(snps_data_pre, PVALUE_MLOG > input$pvalmlog)
          snps_data_pre <- subset(snps_data_pre, OR.or.BETA > input$orbeta | is.na(OR.or.BETA))

          if (i == 1){

            snps_data <- snps_data_pre

          } else {

            snps_data <- rbind(snps_data, snps_data_pre)

          }
        }

      snps_data <- within(snps_data, Chr <- factor(Chr, levels = f))

      p <- ggplot(data = snps_data, aes(x = Chr, y = as.numeric(Start))) +
        geom_bar(data = chr_data, aes(x = SEQNAME, y = as.numeric(SEQLENGTH)), stat = "identity", fill = "grey90", color = "black") +
        theme(
          axis.text = element_text(size = 14),
          axis.title = element_text(size = 14),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_rect(fill = "white"),
          legend.position = "bottom"
        ) +
        labs(x = "Chromosome", y = "Position")

      p + geom_segment(data = snps_data, aes(x = as.numeric(as.character(Chr)) - 0.45, xend = as.numeric(as.character(Chr)) + 0.45,
                                           y = Start, yend = Start, colour = Trait), size = 2, alpha = 0.5) +
        scale_colour_brewer(palette = "Set1") +
        guides(colour = guide_legend(ncol = 3, byrow = FALSE))

      }

    plot_snps(trait = c(input$variable1, input$variable2, input$variable3, input$variable4, input$variable5, input$variable6, input$variable7, input$variable8, input$variable9))
  })

  output$table <- renderTable({
    table <- gwas38_traits[which(gwas38_traits$Trait %in% c(input$variable1, input$variable2, input$variable3, input$variable4, input$variable5, input$variable6, input$variable7, input$variable8)), ]

    })

  output$table2 <- renderTable({
    table <- gwas38[which(gwas38$DISEASE.TRAIT %in% c(input$variable1, input$variable2, input$variable3, input$variable4, input$variable5, input$variable6, input$variable7, input$variable8)), c(1:5, 8, 10, 11, 13, 14, 20, 27, 32, 33, 34, 36)]

    table <- subset(table, PVALUE_MLOG > input$pvalmlog)
    table <- subset(table, OR.or.BETA > input$orbeta | is.na(OR.or.BETA))

    table[order(table$seqnames), ]
    })

})


Deploying to shinyapps.io

Finally, I am deploying my finished app to shinyapps.io with the rsconnect package. You will need to register with shinyapps.io before you can host your Shiny app there and register rsconnect with the token you received.

library(rsconnect)
rsconnect::deployApp('~/Documents/Github/blog_posts_prep/gwas/shiny/GWAS_Shiny_App')

This will open your deployed app right aways. You can now share the link to your app with the world! :-)



If you are interested in human genomics…

… you might also like these posts:



## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rsconnect_0.6                          
##  [2] gwascat_2.6.0                          
##  [3] Homo.sapiens_1.3.1                     
##  [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [5] GO.db_3.4.0                            
##  [6] OrganismDbi_1.16.0                     
##  [7] EnsDb.Hsapiens.v79_2.1.0               
##  [8] ensembldb_1.6.2                        
##  [9] GenomicFeatures_1.26.0                 
## [10] GenomicRanges_1.26.1                   
## [11] GenomeInfoDb_1.10.1                    
## [12] org.Hs.eg.db_3.4.0                     
## [13] AnnotationDbi_1.36.0                   
## [14] IRanges_2.8.1                          
## [15] S4Vectors_0.12.1                       
## [16] Biobase_2.34.0                         
## [17] BiocGenerics_0.20.0                    
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.3-1              rprojroot_1.1                
##   [3] biovizBase_1.22.0             htmlTable_1.7                
##   [5] XVector_0.14.0                base64enc_0.1-3              
##   [7] dichromat_2.0-0               base64_2.0                   
##   [9] interactiveDisplayBase_1.12.0 codetools_0.2-15             
##  [11] splines_3.3.2                 snpStats_1.24.0              
##  [13] ggbio_1.22.0                  fail_1.3                     
##  [15] doParallel_1.0.10             knitr_1.15.1                 
##  [17] Formula_1.2-1                 gQTLBase_1.6.0               
##  [19] Rsamtools_1.26.1              cluster_2.0.5                
##  [21] graph_1.52.0                  shiny_0.14.2                 
##  [23] httr_1.2.1                    backports_1.0.4              
##  [25] assertthat_0.1                Matrix_1.2-7.1               
##  [27] lazyeval_0.2.0                limma_3.30.6                 
##  [29] acepack_1.4.1                 htmltools_0.3.5              
##  [31] tools_3.3.2                   gtable_0.2.0                 
##  [33] reshape2_1.4.2                dplyr_0.5.0                  
##  [35] fastmatch_1.0-4               Rcpp_0.12.8                  
##  [37] Biostrings_2.42.1             nlme_3.1-128                 
##  [39] rtracklayer_1.34.1            iterators_1.0.8              
##  [41] ffbase_0.12.3                 stringr_1.1.0                
##  [43] mime_0.5                      XML_3.98-1.5                 
##  [45] AnnotationHub_2.6.4           zlibbioc_1.20.0              
##  [47] scales_0.4.1                  BSgenome_1.42.0              
##  [49] VariantAnnotation_1.20.2      BiocInstaller_1.24.0         
##  [51] SummarizedExperiment_1.4.0    RBGL_1.50.0                  
##  [53] RColorBrewer_1.1-2            BBmisc_1.10                  
##  [55] yaml_2.1.14                   memoise_1.0.0                
##  [57] gridExtra_2.2.1               ggplot2_2.2.0                
##  [59] biomaRt_2.30.0                rpart_4.1-10                 
##  [61] reshape_0.8.6                 latticeExtra_0.6-28          
##  [63] stringi_1.1.2                 RSQLite_1.1-1                
##  [65] foreach_1.4.3                 checkmate_1.8.2              
##  [67] BiocParallel_1.8.1            BatchJobs_1.6                
##  [69] GenomicFiles_1.10.3           bitops_1.0-6                 
##  [71] matrixStats_0.51.0            evaluate_0.10                
##  [73] lattice_0.20-34               GenomicAlignments_1.10.0     
##  [75] bit_1.1-12                    GGally_1.3.0                 
##  [77] plyr_1.8.4                    magrittr_1.5                 
##  [79] sendmailR_1.2-1               R6_2.2.0                     
##  [81] Hmisc_4.0-1                   DBI_0.5-1                    
##  [83] foreign_0.8-67                mgcv_1.8-16                  
##  [85] survival_2.40-1               RCurl_1.95-4.8               
##  [87] nnet_7.3-12                   tibble_1.2                   
##  [89] rmarkdown_1.2                 grid_3.3.2                   
##  [91] data.table_1.10.0             gQTLstats_1.6.0              
##  [93] digest_0.6.10                 xtable_1.8-2                 
##  [95] ff_2.2-13                     httpuv_1.3.3                 
##  [97] brew_1.0-6                    openssl_0.9.5                
##  [99] munsell_0.4.3                 beeswarm_0.2.3               
## [101] Gviz_1.18.1