Exploring World Gender Statistics with Shiny

2017-01-29 00:00:00 +0000

This week I explored the World Gender Statistics dataset. You can look at 160 measurements over 56 years with my Shiny app here.

I prepared the data as follows:

Data.csv

Country.Name: the name of the country
Country.Code: the country’s code
Indicator.Name: the name of the variable that this row represents
Indicator.Code: a unique id for the variable
1960 - 2016: one column EACH for the value of the variable in each year it was available

dataset <- read.csv("Data.csv")
dataset_subs <- dataset[grep(".FE|.MA", dataset$Indicator.Code), ]
head(dataset_subs)

dataset_subs$Indicator.Name <- as.character(dataset_subs$Indicator.Name)

dataset_fem <- dataset[grep("female", dataset$Indicator.Name), ]
dataset_fem$Indicator.Name <- gsub("female", "", dataset_fem$Indicator.Name)
dataset_fem$Indicator.Code <- gsub(".FE", "", dataset_fem$Indicator.Code)
dataset_fem$gender <- "female"

dataset_male <- dataset[-grep("female", dataset$Indicator.Name), ]
dataset_male$Indicator.Name <- gsub("male", "", dataset_male$Indicator.Name)
dataset_male$Indicator.Code <- gsub(".FE", "", dataset_male$Indicator.Code)
dataset_male$gender <- "male"

dataset_fem <- dataset_fem[which(dataset_fem$Indicator.Name %in% dataset_male$Indicator.Name), ]
dataset_male <- dataset_male[which(dataset_male$Indicator.Name %in% dataset_fem$Indicator.Name), ]

dataset_fem <- dataset_fem[which(dataset_fem$Country.Code %in% dataset_male$Country.Code), ]
dataset_male <- dataset_male[which(dataset_male$Country.Code %in% dataset_fem$Country.Code), ]

library(dplyr)
dataset_fem <- arrange(dataset_fem, Country.Code)
dataset_male <- arrange(dataset_male, Country.Code)

dataset_fem$Country.Code <- as.character(dataset_fem$Country.Code)
dataset_male$Country.Code <- as.character(dataset_male$Country.Code)

save(dataset_fem, file = "dataset_fem.RData")
save(dataset_male, file = "dataset_male.RData")

length(unique(dataset_fem$Indicator.Name)) == length(unique(dataset_male$Indicator.Name))

for (n in 1:length(unique(dataset_fem$Indicator.Name))) {
  
  code <- unique(dataset_fem$Indicator.Name)[n]
  
  print(code)
                 
  fem <- dataset_fem[which(dataset_fem$Indicator.Name == code), ]
  male <- dataset_male[which(dataset_male$Indicator.Name == code), ]

  for (i in 1:nrow(fem)) {
    
    if (i == 1) {
      
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- cbind(male[i, c(1:4)], diff)
      
    } else {
      
      diff <- male[i, 5:61] / fem[i, 5:61]
      diff_table <- rbind(diff_table, 
                          cbind(male[i, c(1:4)], diff))
      
    }
  }
  
  if (n == 1) {
    
    diff_table_bind <- diff_table
    
  } else {
    
    diff_table_bind <- rbind(diff_table_bind, diff_table)
    
  }
  
}

diff_table_bind$Country.Code <- as.character(diff_table_bind$Country.Code)
diff_table_bind[diff_table_bind == "NaN"] <- NA

save(diff_table_bind, file = "diff_table_bind.RData")

measures <- unique(diff_table_bind$Indicator.Name)
save(measures, file = "measures.RData")

years <- gsub("X", "", colnames(diff_table_bind)[-c(1:4)])
years <- years[-length(years)]
save(years, file = "years.RData")

Map

The map has been downloaded from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/

library(rgdal)
library(ggplot2)
library(plyr)
library(dplyr)
library(scales)

wmap_countries <- readOGR(dsn="shapefiles", layer="ne_110m_admin_0_countries")

wmap_countries_df <- fortify(wmap_countries)
wmap_countries@data$id <- rownames(wmap_countries@data)
wmap_countries_df_final <- join(wmap_countries_df, wmap_countries@data, by = "id")

wmap_countries_df_final$gu_a3 <- as.character(wmap_countries_df_final$gu_a3)

save(wmap_countries_df_final, file = "wmap_countries_df_final.RData")

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.4 magrittr_1.5    rprojroot_1.1   tools_3.3.2    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.8     stringi_1.1.2  
##  [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.1.0   digest_0.6.11  
## [13] evaluate_0.10

R vs Python - a One-on-One Comparison

2017-01-22 00:00:00 +0000

I’m an avid R user and rarely use anything else for data analysis and visualisations. But while R is my go-to, in some cases, Python might actually be a better alternative.

That’s why I wanted to see how R and Python fare in a one-on-one comparison of an analysis that’s representative of what I would typically work with.

Conclusions

All in all, the Python code could easily be translated into R and was comparable in length and simplicity between the two languages. While Python’s syntax is inherently cleaner/ tidier, we can use packages that implement piping in R and achieve similar results (even though Python’s dot-separated syntax is still much easier to type than using the piping operator of magrittr). For plotting and visualisation I still think that R’s ggplot2 is top of the line in both syntax, customizability and outcome (admittedly, I don’t know matplotlib as well as ggplot)! In terms of functionality, I couldn’t find major differences between the two languages and I would say they both have their merits. For me, R comes more natural as that is what I’m more fluent in, but I can see why Python holds an appeal too and I think I’ll make more of an effort to use both languages in my future projects.

R and Python

R and Python are both open-source languages used in a wide range of data analysis fields. Their main difference is that R has traditionally been geared towards statistical analysis, while Python is more generalist. Both comprise a large collection of packages for specific tasks and have a growing community that offers support and tutorials online.

For a nice overview of the languages respective strengths and weaknesses, check out Datacamp’s nice infographic.

Comparative analysis of genome data

To directly compare R and Python, I am following Zhuyi Xue’s “A Comprehensive Introduction To Your Genome With the SciPy Stack” (with some minor tweaks here and there). He gives a nice introduction to the data, so I will not repeat it here but focus on the comparison between the code lines.

For R, I am working with RStudio, for Python with Anaconda and Spyder.

Python:

For this analysis, we need the SciPy stack with pandas for data wrangling and matplotlib for visualisation. Anaconda already comes with all these packages that we need. Zhuyi Xue first imports pandas as “pd”, so that we can call pandas function by prefixing them with “pd.”.

import pandas as pd

While the code could be replicated with base R, I prefer dplyr for data wrangling and ggplot2 for visualisation.

library(dplyr)
library(ggplot2)

Reading in data

Reading in data is straight forward in both R and Python. The code we need to read in the file is comparable between R and Python. Zhuyi Xue specified “compression = ‘gzip’”, but this would not have been necessary as the default is to infer it from the file suffix.

One big difference in the general syntax we can see here too: Boolean true/false values are written in all caps in R (TRUE/FALSE), while Python uses first letter capitalisation (True/False). The same prinicple applies to “none”.

Python:

df = pd.read_csv('Homo_sapiens.GRCh38.85.gff3.gz', 
                         compression = 'gzip',
                         sep = '\t', 
                         comment = '#', 
                         low_memory = False,
                         header = None, 
                         names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
df.head()

df <- read.csv("~/Documents/Github/Homo_sapiens.GRCh38.85.gff3.gz", 
               header = FALSE, 
               sep = "\t", 
               col.names = c('seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'), 
               comment.char = "#")
head(df)

##   seqid source              type start       end   score strand phase
## 1     1 GRCh38        chromosome     1 248956422       .      .     .
## 2     1      . biological_region 10469     11240 1.3e+03      .     .
## 3     1      . biological_region 10650     10657   0.999      +     .
## 4     1      . biological_region 10655     10657   0.999      -     .
## 5     1      . biological_region 10678     10687   0.999      +     .
## 6     1      . biological_region 10681     10688   0.999      -     .
##                                           attributes
## 1 ID=chromosome:1;Alias=CM000663.2,chr1,NC_000001.11
## 2           external_name=oe %3D 0.79;logic_name=cpg
## 3                                 logic_name=eponine
## 4                                 logic_name=eponine
## 5                                 logic_name=eponine
## 6                                 logic_name=eponine

Examining data

listing unique strings

The first thing we want to know from the data is how many unique entries there are in the “seqid” column.

Here, we can already see the main difference in syntax between R and Python:

Python concatenates the object name (“df) with the column name and the functions that we want to run on this column in a sequential manner, separated by a dot. Base R uses nested functions, where each function is called with “function_name()” and we specify columns with “object_name$column_name”.

However, both R and Python can also call columns in a dataframe with “[ ]” with the difference that Python per default subsets data columns df[“seqid”], while R always needs index specifications for rows and columns, separated by “,”: e.g. df[, “seqid”] would subset every row and only the column named “seqid”.

The sequential calling of functions is indeed very handy, it makes the code easier to read and understand than lots of interwoven functions and brackets. But while this isn’t the concept of base R, dplyr uses the magrittr principle of chaining functions with the pipe symbol “%>%”. Even though this symbol isn’t as easily typed, it’s functionality is often superior to base R, especially if you need to run many functions on a dataframe. However, with just or two functions, I usually keep to base R as it is shorter.

Python:

df.seqid.unique() # alternatively: df['seqid'].unique()

unique(df$seqid)

##   [1] 1          10         11         12         13         14        
##   [7] 15         16         17         18         19         2         
##  [13] 20         21         22         3          4          5         
##  [19] 6          7          8          9          GL000008.2 GL000009.2
##  [25] GL000194.1 GL000195.1 GL000205.2 GL000208.1 GL000213.1 GL000214.1
##  [31] GL000216.2 GL000218.1 GL000219.1 GL000220.1 GL000221.1 GL000224.1
##  [37] GL000225.1 GL000226.1 KI270302.1 KI270303.1 KI270304.1 KI270305.1
##  [43] KI270310.1 KI270311.1 KI270312.1 KI270315.1 KI270316.1 KI270317.1
##  [49] KI270320.1 KI270322.1 KI270329.1 KI270330.1 KI270333.1 KI270334.1
##  [55] KI270335.1 KI270336.1 KI270337.1 KI270338.1 KI270340.1 KI270362.1
##  [61] KI270363.1 KI270364.1 KI270366.1 KI270371.1 KI270372.1 KI270373.1
##  [67] KI270374.1 KI270375.1 KI270376.1 KI270378.1 KI270379.1 KI270381.1
##  [73] KI270382.1 KI270383.1 KI270384.1 KI270385.1 KI270386.1 KI270387.1
##  [79] KI270388.1 KI270389.1 KI270390.1 KI270391.1 KI270392.1 KI270393.1
##  [85] KI270394.1 KI270395.1 KI270396.1 KI270411.1 KI270412.1 KI270414.1
##  [91] KI270417.1 KI270418.1 KI270419.1 KI270420.1 KI270422.1 KI270423.1
##  [97] KI270424.1 KI270425.1 KI270429.1 KI270435.1 KI270438.1 KI270442.1
## [103] KI270448.1 KI270465.1 KI270466.1 KI270467.1 KI270468.1 KI270507.1
## [109] KI270508.1 KI270509.1 KI270510.1 KI270511.1 KI270512.1 KI270515.1
## [115] KI270516.1 KI270517.1 KI270518.1 KI270519.1 KI270521.1 KI270522.1
## [121] KI270528.1 KI270529.1 KI270530.1 KI270538.1 KI270539.1 KI270544.1
## [127] KI270548.1 KI270579.1 KI270580.1 KI270581.1 KI270582.1 KI270583.1
## [133] KI270584.1 KI270587.1 KI270588.1 KI270589.1 KI270590.1 KI270591.1
## [139] KI270593.1 KI270706.1 KI270707.1 KI270708.1 KI270709.1 KI270710.1
## [145] KI270711.1 KI270712.1 KI270713.1 KI270714.1 KI270715.1 KI270716.1
## [151] KI270717.1 KI270718.1 KI270719.1 KI270720.1 KI270721.1 KI270722.1
## [157] KI270723.1 KI270724.1 KI270725.1 KI270726.1 KI270727.1 KI270728.1
## [163] KI270729.1 KI270730.1 KI270731.1 KI270732.1 KI270733.1 KI270734.1
## [169] KI270735.1 KI270736.1 KI270737.1 KI270738.1 KI270739.1 KI270740.1
## [175] KI270741.1 KI270742.1 KI270743.1 KI270744.1 KI270745.1 KI270746.1
## [181] KI270747.1 KI270748.1 KI270749.1 KI270750.1 KI270751.1 KI270752.1
## [187] KI270753.1 KI270754.1 KI270755.1 KI270756.1 KI270757.1 MT        
## [193] X          Y         
## 194 Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 ... Y

# with dplyr:
# df %>% select(seqid) %>% unique

how many unique seqids are there?

To get the number of unique entries in the “seqid” column, we simply need to append “.shape” to the above Python code. In R, we can either wrap the above R code with the “length()” function or use dplyr and piping instead. If we use the latter, we need to use the “nrow()” function because base R returns a vector, while dplyr returns a dataframe. Here, we can see that with two functions, using dplyr is still a bit more code but it already looks much tidier.

Python:

df.seqid.unique().shape

length(unique(df$seqid))

## [1] 194

# with dplyr:
# df %>% select(seqid) %>% unique %>% nrow

counting occurrences

To count the frequencies of each unique entry in the “source” column, we use the “value_counts()” function in Python and the “table()” function in R. These two functions differ in how they sort the output table: value_counts() sorts by decreasing frequency, while R alphabetically sorts the variables. To order the data as in Python, we need to add the “sort()” function to our R code.

Python:

df.source.value_counts()

# table(df$source) is per default ordered alphabetically, if we want it ordered by decreasing counts, like with Python:
sort(table(df$source), decreasing = TRUE)

## 
##         havana ensembl_havana        ensembl              .        mirbase 
##        1441093         745065         228212         182510           4701 
##         GRCh38          insdc 
##            194             74

# dplyr:
# df %>% select(source) %>% table %>% sort(decreasing = TRUE)

How Much of the Genome Is Incomplete?

subsetting a dataframe

We are now subsetting our original dataframe and assign it a new object name with “ = “ or “ <- “.

To subset the dataframe to keep only rows which say “GRCh38” in the “source” column, ther are several ways to do this in R: the way that would be directly comparable to how it was done in Python would be to also use the square bracket indexing. However, there are two solutions which are more elegant: 1) base R’s “subset()” or dplyr’s “filter()” function. But with this short example, there is no big difference between the three.

Python’s “shape” gives us the same information as R’s “dim()” function: how many rows and columns our dataframe has.

To preview a random subset of 10 rows from our dataframe, we use Python’s “sample()” and dplyr’s “sample_n()” function.

Python:

gdf = df[df.source == 'GRCh38']

gdf.shape
gdf.sample(10)

gdf <- df[df$source == "GRCh38", ]

# alternatively:
# gdf <- subset(df, source == "GRCh38")

# dplyr:
# gdf <- df %>% filter(source == "GRCh38")

# get number of rows and columns
dim(gdf)

## [1] 194   9

# randomly sample 10 rows for observation
sample_n(gdf, 10)

##              seqid source        type start       end score strand phase
## 2511484 KI270375.1 GRCh38 supercontig     1      2378     .      .     .
## 2511545 KI270466.1 GRCh38 supercontig     1      1233     .      .     .
## 2511473 KI270337.1 GRCh38 supercontig     1      1121     .      .     .
## 2511504 KI270411.1 GRCh38 supercontig     1      2646     .      .     .
## 2511487 KI270379.1 GRCh38 supercontig     1      1045     .      .     .
## 483371          12 GRCh38  chromosome     1 133275309     .      .     .
## 2511494 KI270387.1 GRCh38 supercontig     1      1537     .      .     .
## 674768          14 GRCh38  chromosome     1 107043718     .      .     .
## 2511493 KI270386.1 GRCh38 supercontig     1      1788     .      .     .
## 2511505 KI270412.1 GRCh38 supercontig     1      1179     .      .     .
##                                                           attributes
## 2511484 ID=supercontig:KI270375.1;Alias=chrUn_KI270375v1,NT_187493.1
## 2511545 ID=supercontig:KI270466.1;Alias=chrUn_KI270466v1,NT_187421.1
## 2511473 ID=supercontig:KI270337.1;Alias=chrUn_KI270337v1,NT_187466.1
## 2511504 ID=supercontig:KI270411.1;Alias=chrUn_KI270411v1,NT_187409.1
## 2511487 ID=supercontig:KI270379.1;Alias=chrUn_KI270379v1,NT_187472.1
## 483371          ID=chromosome:12;Alias=CM000674.2,chr12,NC_000012.12
## 2511494 ID=supercontig:KI270387.1;Alias=chrUn_KI270387v1,NT_187475.1
## 674768           ID=chromosome:14;Alias=CM000676.2,chr14,NC_000014.9
## 2511493 ID=supercontig:KI270386.1;Alias=chrUn_KI270386v1,NT_187480.1
## 2511505 ID=supercontig:KI270412.1;Alias=chrUn_KI270412v1,NT_187408.1

creating new columns and performing calculations

Now we want to create a new column called “length”. It should contain the gene lengths, i.e. the distance in base pairs between the start and end point on the chromosomes (“start” and “end” columns). In R we don’t need to copy the dataframe first but the rest of the code is very similar: we define a new column name for the dataframe and assign its value by subtracting the values of the “start” from the values of the “end” column (+1). Here, we can see again that in Python we use a dot to define columns, while R uses the Dollar sign.

The sum of all lengths is calculated with the “sum()” function in both languages.

Python:

gdf = gdf.copy()
gdf['length'] = gdf.end - gdf.start + 1

gdf.length.sum()

gdf$length <- gdf$end - gdf$start + 1

sum(gdf$length)

## [1] 3096629726

Next, we want to calculating the proportion of the genome that is not on main chromosome assemblies. For that, we first define a character string with the main chromosomes: 1 to 23, X, Y and MT (mitochondrial chromosome). Defining this string is a bit easier in R.

We will use this string to calculate the sum of lengths of the subsetted dataframe and divide it by the sum of lengths of the whole dataframe. For subsetting, we use the “isin()” function of Python, which corresponds to R’s “%in%”.

Python:

chrs = [str(_) for _ in range(1, 23)] + ['X', 'Y', 'MT']
gdf[-gdf.seqid.isin(chrs)].length.sum() / gdf.length.sum()

# or
gdf[(gdf['type'] == 'supercontig')].length.sum() / gdf.length.sum()

chrs <- c(1:23, "X", "Y", "MT")
sum(subset(gdf, !seqid %in% chrs)$length) / sum(gdf$length)

## [1] 0.003702192

How Many Genes Are There?

Here, we are again using the same functions as above for subsetting the dataframe, asking for its dimensions, printing 10 random lines and asking for the frequencies of each unique item in the “type” column. For the latter I am using base R over dplyr, because it’s faster to type.

Python:

edf = df[df.source.isin(['ensembl', 'havana', 'ensembl_havana'])]
edf.shape

edf.sample(10)
edf.type.value_counts()

edf <- subset(df, source %in% c("ensembl", "havana", "ensembl_havana"))
dim(edf)

## [1] 2414370       9

sample_n(edf, 10)

##         seqid         source            type     start       end score
## 2548897     X        ensembl three_prime_UTR  67723842  67730618     .
## 790670     15 ensembl_havana             CDS  42352030  42352139     .
## 1298369    19         havana             CDS  38942683  38942751     .
## 189728      1 ensembl_havana             CDS 186304108 186304257     .
## 1080599    17         havana            exon  44401337  44401453     .
## 1562968    20 ensembl_havana            exon   2964272   2964350     .
## 1965100     5        ensembl      transcript    144087    144197     .
## 2581554     X         havana            exon 147927670 147928807     .
## 785399     15         havana            exon  40611478  40611511     .
## 1417830     2 ensembl_havana             CDS  71520178  71520208     .
##         strand phase
## 2548897      +     .
## 790670       +     2
## 1298369      -     0
## 189728       +     2
## 1080599      -     .
## 1562968      +     .
## 1965100      -     .
## 2581554      +     .
## 785399       +     .
## 1417830      +     0
##                                                                                                                                                                         attributes
## 2548897                                                                                                                                          Parent=transcript:ENST00000612452
## 790670                                                                                         ID=CDS:ENSP00000326227;Parent=transcript:ENST00000318010;protein_id=ENSP00000326227
## 1298369                                                                                        ID=CDS:ENSP00000472465;Parent=transcript:ENST00000599996;protein_id=ENSP00000472465
## 189728                                                                                         ID=CDS:ENSP00000356453;Parent=transcript:ENST00000367483;protein_id=ENSP00000356453
## 1080599                         Parent=transcript:ENST00000585614;Name=ENSE00002917300;constitutive=0;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00002917300;rank=7;version=1
## 1562968                        Parent=transcript:ENST00000216877;Name=ENSE00003603687;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;exon_id=ENSE00003603687;rank=4;version=1
## 1965100 ID=transcript:ENST00000362670;Parent=gene:ENSG00000199540;Name=Y_RNA.43-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000362670;transcript_support_level=NA;version=1
## 2581554                       Parent=transcript:ENST00000620828;Name=ENSE00003724969;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003724969;rank=1;version=1
## 785399                          Parent=transcript:ENST00000399668;Name=ENSE00003659959;constitutive=0;ensembl_end_phase=2;ensembl_phase=1;exon_id=ENSE00003659959;rank=7;version=1
## 1417830                                                                                        ID=CDS:ENSP00000398305;Parent=transcript:ENST00000429174;protein_id=ENSP00000398305

sort(table(edf$type), decreasing = TRUE)

## 
##                          exon                           CDS 
##                       1180596                        704604 
##                five_prime_UTR               three_prime_UTR 
##                        142387                        133938 
##                    transcript                          gene 
##                         96375                         42470 
##          processed_transcript aberrant_processed_transcript 
##                         28228                         26944 
##        NMD_transcript_variant                       lincRNA 
##                         13761                         13247 
##          processed_pseudogene                  lincRNA_gene 
##                         10722                          7533 
##                    pseudogene                           RNA 
##                          3049                          2221 
##                         snRNA                    snRNA_gene 
##                          1909                          1909 
##                        snoRNA                   snoRNA_gene 
##                           956                           944 
##        pseudogenic_transcript                          rRNA 
##                           737                           549 
##                     rRNA_gene                         miRNA 
##                           549                           302 
##                V_gene_segment                J_gene_segment 
##                           216                           158 
##               VD_gene_segment                C_gene_segment 
##                            37                            29 
##             biological_region                    chromosome 
##                             0                             0 
##                    miRNA_gene                       mt_gene 
##                             0                             0 
##                   supercontig 
##                             0

Now we want to subset the dataframe to rows with the attribute “gene” in the “type” column, look at 10 random lines from the “attributes” column and get the dataframe dimensions.

Python:

ndf = edf[edf.type == 'gene']
ndf = ndf.copy()
ndf.sample(10).attributes.values
ndf.shape

ndf <- subset(edf, type == "gene")
sample_n(ndf, 10)$attributes

##  [1] ID=gene:ENSG00000170949;Name=ZNF160;biotype=protein_coding;description=zinc finger protein 160 [Source:HGNC Symbol%3BAcc:HGNC:12948];gene_id=ENSG00000170949;havana_gene=OTTHUMG00000182854;havana_version=3;logic_name=ensembl_havana_gene;version=17                  
##  [2] ID=gene:ENSG00000215088;Name=RPS5P3;biotype=processed_pseudogene;description=ribosomal protein S5 pseudogene 3 [Source:HGNC Symbol%3BAcc:HGNC:10427];gene_id=ENSG00000215088;havana_gene=OTTHUMG00000065532;havana_version=1;logic_name=havana;version=3                
##  [3] ID=gene:ENSG00000249077;Name=RP11-478C1.8;biotype=processed_pseudogene;gene_id=ENSG00000249077;havana_gene=OTTHUMG00000160290;havana_version=1;logic_name=havana;version=1                                                                                              
##  [4] ID=gene:ENSG00000110619;Name=CARS;biotype=protein_coding;description=cysteinyl-tRNA synthetase [Source:HGNC Symbol%3BAcc:HGNC:1493];gene_id=ENSG00000110619;havana_gene=OTTHUMG00000010927;havana_version=9;logic_name=ensembl_havana_gene;version=16                   
##  [5] ID=gene:ENSG00000168010;Name=ATG16L2;biotype=protein_coding;description=autophagy related 16 like 2 [Source:HGNC Symbol%3BAcc:HGNC:25464];gene_id=ENSG00000168010;havana_gene=OTTHUMG00000167961;havana_version=2;logic_name=ensembl_havana_gene;version=10             
##  [6] ID=gene:ENSG00000055957;Name=ITIH1;biotype=protein_coding;description=inter-alpha-trypsin inhibitor heavy chain 1 [Source:HGNC Symbol%3BAcc:HGNC:6166];gene_id=ENSG00000055957;havana_gene=OTTHUMG00000150312;havana_version=9;logic_name=ensembl_havana_gene;version=10
##  [7] ID=gene:ENSG00000258781;Name=RP11-496I2.4;biotype=unprocessed_pseudogene;gene_id=ENSG00000258781;havana_gene=OTTHUMG00000170502;havana_version=2;logic_name=havana;version=2                                                                                            
##  [8] ID=gene:ENSG00000157965;Name=SSX8;biotype=unprocessed_pseudogene;description=SSX family member 8 [Source:HGNC Symbol%3BAcc:HGNC:19654];gene_id=ENSG00000157965;havana_gene=OTTHUMG00000021573;havana_version=4;logic_name=havana;version=11                             
##  [9] ID=gene:ENSG00000273489;Name=RP11-180C16.1;biotype=antisense;gene_id=ENSG00000273489;havana_gene=OTTHUMG00000186336;havana_version=1;logic_name=havana;version=1                                                                                                        
## [10] ID=gene:ENSG00000279532;Name=CTB-96E2.6;biotype=TEC;gene_id=ENSG00000279532;havana_gene=OTTHUMG00000179414;havana_version=1;logic_name=havana;version=1                                                                                                                 
## 1623077 Levels: external_name=Ala;logic_name=trnascan ...

dim(ndf)

## [1] 42470     9

extracting gene information from attributes field

I don’t know if there is an easier way in Python but in R we don’t need to create big helper functions around it. We can simply use “gsub()” with defining the regular expression for what we want to extract. This makes the R code much shorter and easier to understand! We then drop the original “attributes” column.

To have a look at the dataframe, we use “head()” this time.

Python:

import re

RE_GENE_NAME = re.compile(r'Name=(?P<gene_name>.+?);')
def extract_gene_name(attributes_str):
    res = RE_GENE_NAME.search(attributes_str)
    return res.group('gene_name')


ndf['gene_name'] = ndf.attributes.apply(extract_gene_name)

RE_GENE_ID = re.compile(r'gene_id=(?P<gene_id>ENSG.+?);')
def extract_gene_id(attributes_str):
    res = RE_GENE_ID.search(attributes_str)
    return res.group('gene_id')


ndf['gene_id'] = ndf.attributes.apply(extract_gene_id)


RE_DESC = re.compile('description=(?P<desc>.+?);')
def extract_description(attributes_str):
    res = RE_DESC.search(attributes_str)
    if res is None:
        return ''
    else:
        return res.group('desc')


ndf['desc'] = ndf.attributes.apply(extract_description)

ndf.drop('attributes', axis=1, inplace=True)
ndf.head()

ndf$gene_name <- gsub("(.*Name=)(.*?)(;biotype.*)", "\\2", ndf$attributes)
ndf$gene_id <- gsub("(ID=gene:)(.*?)(;Name.*)", "\\2", ndf$attributes)
ndf$desc <- gsub("(.*description=)(.*?)(;.*)", "\\2", ndf$attributes)

# some genes don't have a description
ndf$desc <- ifelse(grepl("^ID=.*", ndf$desc), "", ndf$desc)

ndf <- subset(ndf, select = -attributes)
head(ndf)

##     seqid         source type  start    end score strand phase gene_name
## 17      1         havana gene  11869  14409     .      +     .   DDX11L1
## 29      1         havana gene  14404  29570     .      -     .    WASH7P
## 72      1         havana gene  52473  53312     .      +     .    OR4G4P
## 75      1         havana gene  62948  63887     .      +     .   OR4G11P
## 78      1 ensembl_havana gene  69091  70008     .      +     .     OR4F5
## 109     1         havana gene 131025 134836     .      +     .    CICP27
##             gene_id
## 17  ENSG00000223972
## 29  ENSG00000227232
## 72  ENSG00000268020
## 75  ENSG00000240361
## 78  ENSG00000186092
## 109 ENSG00000233750
##                                                                                                   desc
## 17                                 DEAD/H-box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:HGNC:37102]
## 29                       WAS protein family homolog 7 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:38034]
## 72   olfactory receptor family 4 subfamily G member 4 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:14822]
## 75  olfactory receptor family 4 subfamily G member 11 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:31276]
## 78              olfactory receptor family 4 subfamily F member 5 [Source:HGNC Symbol%3BAcc:HGNC:14825]
## 109              capicua transcriptional repressor pseudogene 27 [Source:HGNC Symbol%3BAcc:HGNC:48835]

Next, we want to know how many unique gene names and gene IDs there are. As above we use the “unique()”, “shape()” (Python) and “length()” (R) functions.

The count table for gene names can again be obtained with R’s “table()” function, even though Zhuyi Xue uses a slightly different approach: he first groups the “gene_name” column, then counts and sorts.

Finally, we can calculate the proportion of genes that have more than appear more than once and we can have a closer look at the SCARNA20 gene.

Python:

ndf.shape
ndf.gene_id.unique().shape
ndf.gene_name.unique().shape

count_df = ndf.groupby('gene_name').count().ix[:, 0].sort_values().ix[::-1]
count_df.head(10)

count_df[count_df > 1].shape
count_df.shape
count_df[count_df > 1].shape[0] / count_df.shape[0]

ndf[ndf.gene_name == 'SCARNA20']

dim(ndf)

## [1] 42470    11

length(unique(ndf$gene_id))

## [1] 42470

length(unique(ndf$gene_name))

## [1] 42387

count_df <- sort(table(ndf$gene_name), decreasing = TRUE)
head(count_df, n = 10)

## 
##        SCARNA20        SCARNA16        SCARNA17        SCARNA11 
##               7               6               5               4 
##        SCARNA15        SCARNA21 Clostridiales-1         SCARNA4 
##               4               4               3               3 
##        ACTR3BP2           AGBL1 
##               2               2

length(count_df[count_df > 1])

## [1] 63

length(count_df)

## [1] 42387

length(count_df[count_df > 1]) / length(count_df)

## [1] 0.001486305

ndf[ndf$gene_name == "SCARNA20", ]

##         seqid  source type     start       end score strand phase
## 179400      1 ensembl gene 171768070 171768175     .      +     .
## 201038      1 ensembl gene 204727991 204728106     .      +     .
## 349204     11 ensembl gene   8555016   8555146     .      +     .
## 718521     14 ensembl gene  63479272  63479413     .      +     .
## 837234     15 ensembl gene  75121536  75121666     .      -     .
## 1039875    17 ensembl gene  28018770  28018907     .      +     .
## 1108216    17 ensembl gene  60231516  60231646     .      -     .
##         gene_name         gene_id
## 179400   SCARNA20 ENSG00000253060
## 201038   SCARNA20 ENSG00000251861
## 349204   SCARNA20 ENSG00000252778
## 718521   SCARNA20 ENSG00000252800
## 837234   SCARNA20 ENSG00000252722
## 1039875  SCARNA20 ENSG00000251818
## 1108216  SCARNA20 ENSG00000252577
##                                                                           desc
## 179400            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 201038            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 349204            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 718521            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 837234            Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1039875           Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
## 1108216 small Cajal body-specific RNA 20 [Source:HGNC Symbol%3BAcc:HGNC:32578]

# dplyr:
# ndf %>% filter(gene_name == "SCARNA20")

How Long Is a Typical Gene?

To calculate gene lengths we use the same code as before. R’s summary() is not exactly the same as Python’s describe() but it’s close enough.

Python:

ndf['length'] = ndf.end - ndf.start + 1
ndf.length.describe()

ndf$length <- ndf$end - ndf$start + 1
summary(ndf$length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       8     884    5170   35830   30550 2305000

Now we produce the first plot, showing a histogram of gene length. In base R you can’t really plot a histogram with logarithmic y-axis scales (at least not without manually tweaking the hist() output but it isn’t recommended anyway because 0 will become -Inf). But we can do it easily with ggplot2 with “scale_y_log10()”. The code we need for ggplot2 is a bit longer than with matplotlib. We could of course further customize our plot but for now, let’s keep it simple.

Python:

import matplotlib as plt

ndf.length.plot(kind='hist', bins=50, logy=True)
plt.show()

ndf %>% ggplot(aes(x = length)) + 
  geom_histogram(bins = 50, fill = "blue") + 
  scale_y_log10()

Now, we subset the dataframe to keep only rows where the “length” column contains values bigger than 2 million and order it by descending gene length. To see the shortest genes, we order the original dataframe and view the first 6 rows. Instead of “sort()”, we use dplyr’s “arrange()” function this time (I didn’t use it before because it can only be applied to dataframes).

Python:

ndf[ndf.length > 2e6].sort_values('length').ix[::-1]
ndf.sort_values('length').head()

ndf %>% filter(length > 2e6) %>% arrange(desc(length))

##   seqid         source type     start       end score strand phase
## 1     7 ensembl_havana gene 146116002 148420998     .      +     .
## 2     9 ensembl_havana gene   8314246  10612723     .      -     .
## 3     X ensembl_havana gene  31097677  33339441     .      -     .
## 4    11 ensembl_havana gene  83455012  85627922     .      -     .
## 5     8 ensembl_havana gene   2935353   4994972     .      -     .
## 6    20 ensembl_havana gene  13995369  16053197     .      +     .
##   gene_name         gene_id
## 1   CNTNAP2 ENSG00000174469
## 2     PTPRD ENSG00000153707
## 3       DMD ENSG00000198947
## 4      DLG2 ENSG00000150672
## 5     CSMD1 ENSG00000183117
## 6   MACROD2 ENSG00000172264
##                                                                                   desc
## 1            contactin associated protein-like 2 [Source:HGNC Symbol%3BAcc:HGNC:13830]
## 2 protein tyrosine phosphatase%2C receptor type D [Source:HGNC Symbol%3BAcc:HGNC:9668]
## 3                                      dystrophin [Source:HGNC Symbol%3BAcc:HGNC:2928]
## 4            discs large MAGUK scaffold protein 2 [Source:HGNC Symbol%3BAcc:HGNC:2901]
## 5               CUB and Sushi multiple domains 1 [Source:HGNC Symbol%3BAcc:HGNC:14026]
## 6                      MACRO domain containing 2 [Source:HGNC Symbol%3BAcc:HGNC:16126]
##    length
## 1 2304997
## 2 2298478
## 3 2241765
## 4 2172911
## 5 2059620
## 6 2057829

head(arrange(ndf, length))

##   seqid source type     start       end score strand phase    gene_name
## 1    14 havana gene  22438547  22438554     .      +     .        TRDD1
## 2    14 havana gene  22439007  22439015     .      +     .        TRDD2
## 3     7 havana gene 142786213 142786224     .      +     .        TRBD1
## 4    14 havana gene  22449113  22449125     .      +     .        TRDD3
## 5     4 havana gene  10238213  10238235     .      -     .   AC006499.9
## 6     3 havana gene 179452395 179452419     .      -     . RP11-145M9.2
##           gene_id
## 1 ENSG00000223997
## 2 ENSG00000237235
## 3 ENSG00000282431
## 4 ENSG00000228985
## 5 ENSG00000271544
## 6 ENSG00000239255
##                                                                      desc
## 1 T cell receptor delta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12254]
## 2 T cell receptor delta diversity 2 [Source:HGNC Symbol%3BAcc:HGNC:12255]
## 3  T cell receptor beta diversity 1 [Source:HGNC Symbol%3BAcc:HGNC:12158]
## 4 T cell receptor delta diversity 3 [Source:HGNC Symbol%3BAcc:HGNC:12256]
## 5                                                                        
## 6                                                                        
##   length
## 1      8
## 2      9
## 3     12
## 4     13
## 5     23
## 6     25

Gene Distribution Among Chromosomes

The number of genes per chromosome are counted with the “subset()”, “table()” and “sort()” functions as described earlier.

Python:

ndf = ndf[ndf.seqid.isin(chrs)]
chr_gene_counts = ndf.groupby('seqid').count().ix[:, 0].sort_values().ix[::-1]
chr_gene_counts

ndf$seqid <- as.character(ndf$seqid) # as factors it will subset the dataframe but keep the factor levels
ndf <- subset(ndf, seqid %in% chrs)
chr_gene_counts <- sort(table(ndf$seqid), decreasing = TRUE)
chr_gene_counts

## 
##    1    2   11   19   17    3    6   12    7    5   16    X    4    9    8 
## 3902 2806 2561 2412 2280 2204 2154 2140 2106 2002 1881 1852 1751 1659 1628 
##   10   15   14   22   20   13   18   21    Y 
## 1600 1476 1449  996  965  872  766  541  436

To see all genes that are on mitochondrial chromosome, we subset the first dataframe by two conditions. In both, R and Python this is done with the ampersand symbol but in R we don’t need brackets around the individual conditions.

Python:

df[(df.type == 'gene') & (df.seqid == 'MT')]

subset(df, type == "gene" & seqid == "MT")

##         seqid source type start   end score strand phase
## 2514004    MT  insdc gene   648  1601     .      +     .
## 2514010    MT  insdc gene  1671  3229     .      +     .
## 2514017    MT  insdc gene  3307  4262     .      +     .
## 2514030    MT  insdc gene  4470  5511     .      +     .
## 2514049    MT  insdc gene  5904  7445     .      +     .
## 2514059    MT  insdc gene  7586  8269     .      +     .
## 2514066    MT  insdc gene  8366  8572     .      +     .
## 2514070    MT  insdc gene  8527  9207     .      +     .
## 2514074    MT  insdc gene  9207  9990     .      +     .
## 2514081    MT  insdc gene 10059 10404     .      +     .
## 2514088    MT  insdc gene 10470 10766     .      +     .
## 2514092    MT  insdc gene 10760 12137     .      +     .
## 2514105    MT  insdc gene 12337 14148     .      +     .
## 2514109    MT  insdc gene 14149 14673     .      -     .
## 2514116    MT  insdc gene 14747 15887     .      +     .
##                                                                                                                                                                                                                                                 attributes
## 2514004                                               ID=gene:ENSG00000211459;Name=MT-RNR1;biotype=Mt_rRNA;description=mitochondrially encoded 12S RNA [Source:HGNC Symbol%3BAcc:HGNC:7470];gene_id=ENSG00000211459;logic_name=mt_genbank_import;version=2
## 2514010                                               ID=gene:ENSG00000210082;Name=MT-RNR2;biotype=Mt_rRNA;description=mitochondrially encoded 16S RNA [Source:HGNC Symbol%3BAcc:HGNC:7471];gene_id=ENSG00000210082;logic_name=mt_genbank_import;version=2
## 2514017   ID=gene:ENSG00000198888;Name=MT-ND1;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1 [Source:HGNC Symbol%3BAcc:HGNC:7455];gene_id=ENSG00000198888;logic_name=mt_genbank_import;version=2
## 2514030   ID=gene:ENSG00000198763;Name=MT-ND2;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 2 [Source:HGNC Symbol%3BAcc:HGNC:7456];gene_id=ENSG00000198763;logic_name=mt_genbank_import;version=3
## 2514049                          ID=gene:ENSG00000198804;Name=MT-CO1;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase I [Source:HGNC Symbol%3BAcc:HGNC:7419];gene_id=ENSG00000198804;logic_name=mt_genbank_import;version=2
## 2514059                         ID=gene:ENSG00000198712;Name=MT-CO2;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase II [Source:HGNC Symbol%3BAcc:HGNC:7421];gene_id=ENSG00000198712;logic_name=mt_genbank_import;version=1
## 2514066                                 ID=gene:ENSG00000228253;Name=MT-ATP8;biotype=protein_coding;description=mitochondrially encoded ATP synthase 8 [Source:HGNC Symbol%3BAcc:HGNC:7415];gene_id=ENSG00000228253;logic_name=mt_genbank_import;version=1
## 2514070                                 ID=gene:ENSG00000198899;Name=MT-ATP6;biotype=protein_coding;description=mitochondrially encoded ATP synthase 6 [Source:HGNC Symbol%3BAcc:HGNC:7414];gene_id=ENSG00000198899;logic_name=mt_genbank_import;version=2
## 2514074                        ID=gene:ENSG00000198938;Name=MT-CO3;biotype=protein_coding;description=mitochondrially encoded cytochrome c oxidase III [Source:HGNC Symbol%3BAcc:HGNC:7422];gene_id=ENSG00000198938;logic_name=mt_genbank_import;version=2
## 2514081   ID=gene:ENSG00000198840;Name=MT-ND3;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 3 [Source:HGNC Symbol%3BAcc:HGNC:7458];gene_id=ENSG00000198840;logic_name=mt_genbank_import;version=2
## 2514088 ID=gene:ENSG00000212907;Name=MT-ND4L;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4L [Source:HGNC Symbol%3BAcc:HGNC:7460];gene_id=ENSG00000212907;logic_name=mt_genbank_import;version=2
## 2514092   ID=gene:ENSG00000198886;Name=MT-ND4;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 4 [Source:HGNC Symbol%3BAcc:HGNC:7459];gene_id=ENSG00000198886;logic_name=mt_genbank_import;version=2
## 2514105   ID=gene:ENSG00000198786;Name=MT-ND5;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 5 [Source:HGNC Symbol%3BAcc:HGNC:7461];gene_id=ENSG00000198786;logic_name=mt_genbank_import;version=2
## 2514109   ID=gene:ENSG00000198695;Name=MT-ND6;biotype=protein_coding;description=mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 6 [Source:HGNC Symbol%3BAcc:HGNC:7462];gene_id=ENSG00000198695;logic_name=mt_genbank_import;version=2
## 2514116                                    ID=gene:ENSG00000198727;Name=MT-CYB;biotype=protein_coding;description=mitochondrially encoded cytochrome b [Source:HGNC Symbol%3BAcc:HGNC:7427];gene_id=ENSG00000198727;logic_name=mt_genbank_import;version=2

We can get the chromosome lengths from the dataframe as well. We again subset to only the main chromosomes, then drop unwanted columns and order by length.

Python:

gdf = gdf[gdf.seqid.isin(chrs)]
gdf.drop(['start', 'end', 'score', 'strand', 'phase' ,'attributes'], axis=1, inplace=True)
gdf.sort_values('length').ix[::-1]

gdf$seqid <- as.character(gdf$seqid) # as factors it will subset the dataframe but keep the factor levels
gdf <- subset(gdf, as.character(seqid) %in% chrs) %>%
  select(-(start:attributes))
arrange(gdf, desc(length))

##    seqid source       type    length
## 1      1 GRCh38 chromosome 248956422
## 2      2 GRCh38 chromosome 242193529
## 3      3 GRCh38 chromosome 198295559
## 4      4 GRCh38 chromosome 190214555
## 5      5 GRCh38 chromosome 181538259
## 6      6 GRCh38 chromosome 170805979
## 7      7 GRCh38 chromosome 159345973
## 8      X GRCh38 chromosome 156040895
## 9      8 GRCh38 chromosome 145138636
## 10     9 GRCh38 chromosome 138394717
## 11    11 GRCh38 chromosome 135086622
## 12    10 GRCh38 chromosome 133797422
## 13    12 GRCh38 chromosome 133275309
## 14    13 GRCh38 chromosome 114364328
## 15    14 GRCh38 chromosome 107043718
## 16    15 GRCh38 chromosome 101991189
## 17    16 GRCh38 chromosome  90338345
## 18    17 GRCh38 chromosome  83257441
## 19    18 GRCh38 chromosome  80373285
## 20    20 GRCh38 chromosome  64444167
## 21    19 GRCh38 chromosome  58617616
## 22     Y GRCh38 chromosome  54106423
## 23    22 GRCh38 chromosome  50818468
## 24    21 GRCh38 chromosome  46709983
## 25    MT GRCh38 chromosome     16569

Now, we merge the dataframe with the number of genes per chromosome with the dataframe of chromosome lengths. Because R’s “table()” function produces a vector, we need to convert it to a dataframe first and define the column names. Then, we use the “merge()” function and point to the name of the column we want to merge by.

Python:

cdf = chr_gene_counts.to_frame(name='gene_count').reset_index()
cdf.head(2)

merged = gdf.merge(cdf, on='seqid')

cdf <- as.data.frame(chr_gene_counts)
colnames(cdf) <- c("seqid", "gene_count")
head(cdf, n = 2)

##   seqid gene_count
## 1     1       3902
## 2     2       2806

merged <- merge(gdf, cdf, by = "seqid")
merged

##    seqid source       type    length gene_count
## 1      1 GRCh38 chromosome 248956422       3902
## 2     10 GRCh38 chromosome 133797422       1600
## 3     11 GRCh38 chromosome 135086622       2561
## 4     12 GRCh38 chromosome 133275309       2140
## 5     13 GRCh38 chromosome 114364328        872
## 6     14 GRCh38 chromosome 107043718       1449
## 7     15 GRCh38 chromosome 101991189       1476
## 8     16 GRCh38 chromosome  90338345       1881
## 9     17 GRCh38 chromosome  83257441       2280
## 10    18 GRCh38 chromosome  80373285        766
## 11    19 GRCh38 chromosome  58617616       2412
## 12     2 GRCh38 chromosome 242193529       2806
## 13    20 GRCh38 chromosome  64444167        965
## 14    21 GRCh38 chromosome  46709983        541
## 15    22 GRCh38 chromosome  50818468        996
## 16     3 GRCh38 chromosome 198295559       2204
## 17     4 GRCh38 chromosome 190214555       1751
## 18     5 GRCh38 chromosome 181538259       2002
## 19     6 GRCh38 chromosome 170805979       2154
## 20     7 GRCh38 chromosome 159345973       2106
## 21     8 GRCh38 chromosome 145138636       1628
## 22     9 GRCh38 chromosome 138394717       1659
## 23     X GRCh38 chromosome 156040895       1852
## 24     Y GRCh38 chromosome  54106423        436

To calculate the correlation between length and gene count, we subset the merged dataframe to those two columns and use the “corr()” (Python) or “cor()” (R) function.

Python:

merged[['length', 'gene_count']].corr()

cor(merged[, c("length", "gene_count")])

##               length gene_count
## length     1.0000000  0.7282208
## gene_count 0.7282208  1.0000000

And now we produce the final plot: a line plot of chromosome length by number of genes per chromosome. For Python, we again use the matplotlib and for R the ggplot2 packages. Because Zhuyi Xue creates a new dataframe and tweaks the plot somewhat, our ggplot2 code is simpler and tidier here.

Python:

ax = merged[['length', 'gene_count']].sort_values('length').plot(x='length', y='gene_count', style='o-')
# add some margin to both ends of x axis
xlim = ax.get_xlim()
margin = xlim[0] * 0.1
ax.set_xlim([xlim[0] - margin, xlim[1] + margin])
# Label each point on the graph
for (s, x, y) in merged[['seqid', 'length', 'gene_count']].sort_values('length').values:
    ax.text(x, y - 100, str(s))

merged[, c("seqid", "length", "gene_count")] %>%
  arrange(desc(length)) %>%
  ggplot(aes(x = length, y = gene_count, label = seqid)) +
  geom_point(color = "blue") +
  geom_line(color = "blue") +
  geom_text()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 dplyr_0.5.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8      codetools_0.2-15 digest_0.6.11    rprojroot_1.1   
##  [5] assertthat_0.1   plyr_1.8.4       grid_3.3.2       R6_2.2.0        
##  [9] gtable_0.2.0     DBI_0.5-1        backports_1.0.4  magrittr_1.5    
## [13] scales_0.4.1     evaluate_0.10    stringi_1.1.2    lazyeval_0.2.0  
## [17] rmarkdown_1.3    labeling_0.3     tools_3.3.2      stringr_1.1.0   
## [21] munsell_0.4.3    yaml_2.1.14      colorspace_1.3-2 htmltools_0.3.5 
## [25] knitr_1.15.1     tibble_1.2

Feature Selection in Machine Learning (Breast Cancer Datasets)

2017-01-15 00:00:00 +0000

Machine learning uses so called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those, that are most relevant for the response variable we want to predict. Using as few features as possible will also reduce the complexity of our models, which means it needs less time and computer power to run and is easier to understand.

There are several ways to identify how much each feature contributes to the model and to restrict the number of selected features. Here, I am going to examine the effect of feature selection via

Correlation,
Recursive Feature Elimination (RFE) and
Genetic Algorithm (GA)

on Random Forest models.

Additionally, I want to know how different data properties affect the influence of these feature selection methods on the outcome. For that I am using three breast cancer datasets, one of which has few features; the other two are larger but differ in how well the outcome clusters in PCA.

Based on my comparisons of the correlation method, RFE and GA, I would conclude that for Random Forest models

removing highly correlated features isn’t a generally suitable method,
GA produced the best models in this example but is impractical for everyday use-cases with many features because it takes a lot of time to run with sufficient generations and individuals and
data that doesn’t allow a good classification to begin with (because the features are not very distinct between classes) don’t necessarily benefit from feature selection.

My conclusions are of course not to be generalized to any ol’ data you are working with: There are many more feature selection methods and I am only looking at a limited number of datasets and only at their influence on Random Forest models. But even this small example shows how different features and parameters can influence your predictions. With machine learning, there is no “one size fits all”! It is always worthwhile to take a good hard look at your data, get acquainted with its quirks and properties before you even think about models and algorithms. And once you’ve got a feel for your data, investing the time and effort to compare different feature selection methods (or engineered features), model parameters and - finally - different machine learning algorithms can make a big difference!

Breast Cancer Wisconsin (Diagnostic) Dataset

The data I am going to use to explore feature selection methods is the Breast Cancer Wisconsin (Diagnostic) Dataset:

W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.

O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.

The data was downloaded from the UC Irvine Machine Learning Repository. The features in these datasets characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses.

Included are three datasets. The first dataset is small with only 9 features, the other two datasets have 30 and 33 features and vary in how strongly the two predictor classes cluster in PCA. I want to explore the effect of different feature selection methods on datasets with these different properties.

But first, I want to get to know the data I am working with.

Breast cancer dataset 1

The first dataset looks at the predictor classes:

malignant or
benign breast mass.

The phenotypes for characterisation are:

Sample ID (code number)
Clump thickness
Uniformity of cell size
Uniformity of cell shape
Marginal adhesion
Single epithelial cell size
Number of bare nuclei
Bland chromatin
Number of normal nuclei
Mitosis
Classes, i.e. diagnosis

Missing values are imputed with the mice package.

bc_data <- read.table("breast-cancer-wisconsin.data.txt", header = FALSE, sep = ",")
colnames(bc_data) <- c("sample_code_number", "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size", 
                       "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")
bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))

## [1] 16

# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)

##    benign malignant 
##       458       241

str(bc_data)

## 'data.frame':    699 obs. of  10 variables:
##  $ classes                    : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
##  $ clump_thickness            : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ uniformity_of_cell_size    : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ uniformity_of_cell_shape   : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ marginal_adhesion          : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ single_epithelial_cell_size: num  2 7 2 3 2 7 2 2 2 2 ...
##  $ bare_nuclei                : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ bland_chromatin            : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ normal_nucleoli            : num  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitosis                    : num  1 1 1 1 1 1 1 1 5 1 ...

Breast cancer dataset 2

The second dataset looks again at the predictor classes:

M: malignant or
B: benign breast mass.

The first two columns give:

Sample ID
Classes, i.e. diagnosis

For each cell nucleus, the following ten characteristics were measured:

Radius (mean of all distances from the center to points on the perimeter)
Texture (standard deviation of gray-scale values)
Perimeter
Area
Smoothness (local variation in radius lengths)
Compactness (perimeter^2 / area - 1.0)
Concavity (severity of concave portions of the contour)
Concave points (number of concave portions of the contour)
Symmetry
Fractal dimension (“coastline approximation” - 1)

For each characteristic three measures are given:

Mean
Standard error
Largest/ “worst”

bc_data_2 <- read.table("wdbc.data.txt", header = FALSE, sep = ",")

phenotypes <- rep(c("radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension"), 3)
types <- rep(c("mean", "se", "largest_worst"), each = 10)

colnames(bc_data_2) <- c("ID", "diagnosis", paste(phenotypes, types, sep = "_"))

# how many NAs are in the data
length(which(is.na(bc_data_2)))

## [1] 0

# how many benign and malignant cases are there?
summary(bc_data_2$diagnosis)

##   B   M 
## 357 212

str(bc_data_2)

## 'data.frame':    569 obs. of  32 variables:
##  $ ID                             : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis                      : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean                    : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean                   : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean                 : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean                      : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean                : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean                 : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave_points_mean            : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se                      : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se                     : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se                   : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                        : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se                  : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se                    : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_largest_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_largest_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_largest_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...

Breast cancer dataset 3

The third dataset looks at the predictor classes:

R: recurring or
N: nonrecurring breast cancer.

The first two columns give:

Sample ID
Classes, i.e. outcome

For each cell nucleus, the same ten characteristics and measures were given as in dataset 2, plus:

Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
Tumor size - diameter of the excised tumor in centimeters
Lymph node status - number of positive axillary lymph nodes observed at time of surgery

Missing values are imputed with the mice package.

bc_data_3 <- read.table("wpbc.data.txt", header = FALSE, sep = ",")
colnames(bc_data_3) <- c("ID", "outcome", "time", paste(phenotypes, types, sep = "_"), "tumor_size", "lymph_node_status")

bc_data_3[bc_data_3 == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data_3)))

## [1] 4

# impute missing data
library(mice)

bc_data_3[,3:35] <- apply(bc_data_3[,3:35], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data_3[,3:35],  print = FALSE)
bc_data_3 <- cbind(bc_data_3[, 2, drop = FALSE], mice::complete(dataset_impute, 1))

# how many recurring and non-recurring cases are there?
summary(bc_data_3$outcome)

##   N   R 
## 151  47

str(bc_data_3)

## 'data.frame':    198 obs. of  34 variables:
##  $ outcome                        : Factor w/ 2 levels "N","R": 1 1 1 1 2 2 1 2 1 1 ...
##  $ time                           : num  31 61 116 123 27 77 60 77 119 76 ...
##  $ radius_mean                    : num  18 18 21.4 11.4 20.3 ...
##  $ texture_mean                   : num  27.6 10.4 17.4 20.4 14.3 ...
##  $ perimeter_mean                 : num  117.5 122.8 137.5 77.6 135.1 ...
##  $ area_mean                      : num  1013 1001 1373 386 1297 ...
##  $ smoothness_mean                : num  0.0949 0.1184 0.0884 0.1425 0.1003 ...
##  $ compactness_mean               : num  0.104 0.278 0.119 0.284 0.133 ...
##  $ concavity_mean                 : num  0.109 0.3 0.126 0.241 0.198 ...
##  $ concave_points_mean            : num  0.0706 0.1471 0.0818 0.1052 0.1043 ...
##  $ symmetry_mean                  : num  0.186 0.242 0.233 0.26 0.181 ...
##  $ fractal_dimension_mean         : num  0.0633 0.0787 0.0601 0.0974 0.0588 ...
##  $ radius_se                      : num  0.625 1.095 0.585 0.496 0.757 ...
##  $ texture_se                     : num  1.89 0.905 0.611 1.156 0.781 ...
##  $ perimeter_se                   : num  3.97 8.59 3.93 3.44 5.44 ...
##  $ area_se                        : num  71.5 153.4 82.2 27.2 94.4 ...
##  $ smoothness_se                  : num  0.00443 0.0064 0.00617 0.00911 0.01149 ...
##  $ compactness_se                 : num  0.0142 0.049 0.0345 0.0746 0.0246 ...
##  $ concavity_se                   : num  0.0323 0.0537 0.033 0.0566 0.0569 ...
##  $ concave_points_se              : num  0.00985 0.01587 0.01805 0.01867 0.01885 ...
##  $ symmetry_se                    : num  0.0169 0.03 0.0309 0.0596 0.0176 ...
##  $ fractal_dimension_se           : num  0.00349 0.00619 0.00504 0.00921 0.00511 ...
##  $ radius_largest_worst           : num  21.6 25.4 24.9 14.9 22.5 ...
##  $ texture_largest_worst          : num  37.1 17.3 21 26.5 16.7 ...
##  $ perimeter_largest_worst        : num  139.7 184.6 159.1 98.9 152.2 ...
##  $ area_largest_worst             : num  1436 2019 1949 568 1575 ...
##  $ smoothness_largest_worst       : num  0.119 0.162 0.119 0.21 0.137 ...
##  $ compactness_largest_worst      : num  0.193 0.666 0.345 0.866 0.205 ...
##  $ concavity_largest_worst        : num  0.314 0.712 0.341 0.687 0.4 ...
##  $ concave_points_largest_worst   : num  0.117 0.265 0.203 0.258 0.163 ...
##  $ symmetry_largest_worst         : num  0.268 0.46 0.433 0.664 0.236 ...
##  $ fractal_dimension_largest_worst: num  0.0811 0.1189 0.0907 0.173 0.0768 ...
##  $ tumor_size                     : num  5 3 2.5 2 3.5 2.5 1.5 4 2 6 ...
##  $ lymph_node_status              : num  5 2 0 0 0 0 0 10 1 20 ...

Principal Component Analysis (PCA)

To get an idea about the dimensionality and variance of the datasets, I am first looking at PCA plots for samples and features. The first two principal components (PCs) show the two components that explain the majority of variation in the data.

After defining my custom ggplot2 theme, I am creating a function that performs the PCA (using the pcaGoPromoter package), calculates ellipses of the data points (with the ellipse package) and produces the plot with ggplot2. Some of the features in datasets 2 and 3 are not very distinct and overlap in the PCA plots, therefore I am also plotting hierarchical clustering dendrograms.

# plotting theme

library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "navy", color = "navy", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())

# function for PCA plotting
library(pcaGoPromoter)
library(ellipse)

pca_func <- function(data, groups, title, print_ellipse = TRUE) {
  
  # perform pca and extract scores
  pcaOutput <- pca(data, printDropped = FALSE, scale = TRUE, center = TRUE)
  pcaOutput2 <- as.data.frame(pcaOutput$scores)
  
  # define groups for plotting
  pcaOutput2$groups <- groups
  
  # when plotting samples calculate ellipses for plotting (when plotting features, there are no replicates)
  if (print_ellipse) {
    
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
      data.frame(groups = as.character(t),
                 ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                       centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                       level = 0.95),
                 stringsAsFactors = FALSE)))
    
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.6) + 
      scale_color_brewer(palette = "Set1") +
      labs(title = title,
           color = "",
           fill = "",
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
    
  } else {
    
    # if there are fewer than 10 groups (e.g. the predictor classes) I want to have colors from RColorBrewer
    if (length(unique(pcaOutput2$groups)) <= 10) {
      
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        scale_color_brewer(palette = "Set1") +
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    } else {
      
      # otherwise use the default rainbow colors
      plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
        geom_point(size = 2, alpha = 0.6) + 
        labs(title = title,
             color = "",
             fill = "",
             x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance"))
      
    }
  }
  
  return(plot)
  
}

library(gridExtra)
library(grid)

Dataset 1

p1 <- pca_func(data = t(bc_data[, 2:10]), groups = as.character(bc_data$classes), title = "Breast cancer dataset 1: Samples")
p2 <- pca_func(data = bc_data[, 2:10], groups = as.character(colnames(bc_data[, 2:10])), title = "Breast cancer dataset 1: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2)

h_1 <- hclust(dist(t(bc_data[, 2:10]), method = "euclidean"), method = "complete")
plot(h_1)

library(tidyr)
bc_data_gather <- bc_data %>%
  gather(measure, value, clump_thickness:mitosis)

ggplot(data = bc_data_gather, aes(x = value, fill = classes, color = classes)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)

Dataset 2

p1 <- pca_func(data = t(bc_data_2[, 3:32]), groups = as.character(bc_data_2$diagnosis), title = "Breast cancer dataset 2: Samples")
p2 <- pca_func(data = bc_data_2[, 3:32], groups = as.character(colnames(bc_data_2[, 3:32])), title = "Breast cancer dataset 2: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_2 <- hclust(dist(t(bc_data_2[, 3:32]), method = "euclidean"), method = "complete")
plot(h_2)

bc_data_2_gather <- bc_data_2[, -1] %>%
  gather(measure, value, radius_mean:fractal_dimension_largest_worst)

ggplot(data = bc_data_2_gather, aes(x = value, fill = diagnosis, color = diagnosis)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)

Dataset 3

p1 <- pca_func(data = t(bc_data_3[, 2:34]), groups = as.character(bc_data_3$outcome), title = "Breast cancer dataset 3: Samples")
p2 <- pca_func(data = bc_data_3[, 2:34], groups = as.character(colnames(bc_data_3[, 2:34])), title = "Breast cancer dataset 3: Features", print_ellipse = FALSE)
grid.arrange(p1, p2, ncol = 2, widths = c(0.4, 0.6))

h_3 <- hclust(dist(t(bc_data_3[,2:34]), method = "euclidean"), method = "complete")
plot(h_3)

Datasets 1 and 2 show a nice separation of benign and malignant masses, models based on these features will likely be able to predict the classes quite well for most samples. The classes in dataset 3 don’t cluster into distinct groups, I assume that prediction will not be as accurate for these features.

The features of datasets 2 and 3 don’t cluster very distinctly, many features seem to show similar patterns across samples. Selecting an approriate subset of features will probably have different effects on the three different datasets.

bc_data_3_gather <- bc_data_3 %>%
  gather(measure, value, time:lymph_node_status)

ggplot(data = bc_data_3_gather, aes(x = value, fill = outcome, color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free_y", ncol = 3)

Feature importance

To get an idea about the feature’s respective importances, I’m running Random Forest models with 10 x 10 cross validation using the caret package. If I wanted to use feature importance to select features for modeling, I would need to perform it on the training data instead of on the complete dataset. But here, I only want to use it to get acquainted with my data. I am again defining a function that estimates the feature importance and produces a plot.

library(caret)
library(doParallel) # parallel processing
registerDoParallel()

# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

feature_imp <- function(model, title) {
  
  # estimate variable importance
  importance <- varImp(model, scale = TRUE)
  
  # prepare dataframes for plotting
  importance_df_1 <- importance$importance
  importance_df_1$group <- rownames(importance_df_1)
  
  importance_df_2 <- importance_df_1
  importance_df_2$Overall <- 0
  
  importance_df <- rbind(importance_df_1, importance_df_2)
  
  plot <- ggplot() +
    geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) +
    geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) +
    theme(legend.position = "none") +
    labs(
      x = "Importance",
      y = "",
      title = title,
      subtitle = "Scaled feature importance",
      caption = "\nDetermined with Random Forest and
      repeated cross validation (10 repeats, 10 times)"
    )
  
  return(plot)
  
}

# train the model
set.seed(27)
imp_1 <- train(classes ~ ., data = bc_data, method = "rf", preProcess = c("scale", "center"), trControl = control)

p1 <- feature_imp(imp_1, title = "Breast cancer dataset 1")

set.seed(27)
imp_2 <- train(diagnosis ~ ., data = bc_data_2[, -1], method = "rf", preProcess = c("scale", "center"), trControl = control)

p2 <- feature_imp(imp_2, title = "Breast cancer dataset 2")

set.seed(27)
imp_3 <- train(outcome ~ ., data = bc_data_3, method = "rf", preProcess = c("scale", "center"), trControl = control)

p3 <- feature_imp(imp_3, title = "Breast cancer dataset 3")

grid.arrange(p1, p2, p3, ncol = 3, widths = c(0.3, 0.35, 0.35))

Feature Selection

Now that I have a general idea about the data, I will run three feature selection methods on all three datasets and compare how they effect the prediction accuracy of a Random Forest model.

Creating train and test data

Before doing anything else with the data, we need to subset the datasets into train and test data. Performing feature selection on the whole dataset would lead to prediction bias, we therefore need to run the whole modeling process on the training data alone!

Dataset 1

set.seed(27)
bc_data_index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
bc_data_train <- bc_data[bc_data_index, ]
bc_data_test  <- bc_data[-bc_data_index, ]

Dataset 2

set.seed(27)
bc_data_2_index <- createDataPartition(bc_data_2$diagnosis, p = 0.7, list = FALSE)
bc_data_2_train <- bc_data_2[bc_data_2_index, ]
bc_data_2_test  <- bc_data_2[-bc_data_2_index, ]

Dataset 3

set.seed(27)
bc_data_3_index <- createDataPartition(bc_data_3$outcome, p = 0.7, list = FALSE)
bc_data_3_train <- bc_data_3[bc_data_3_index, ]
bc_data_3_test  <- bc_data_3[-bc_data_3_index, ]

Correlation

Often we have features that are highly correlated and thus provide redundant information. By eliminating highly correlated features we can avoid a predictive bias for the information contained in these features. This also shows us, that when we want to make statements about the biological/ medical importance of specific features, we need to keep in mind that just because they are suitable to predicting an outcome they are not necessarily causal - they could simply be correlated with causal factors.

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.

Dataset 1

library(corrplot)

# calculate correlation matrix
corMatMy <- cor(bc_data_train[, -1])
corrplot(corMatMy, order = "hclust")

#Apply correlation filter at 0.70,
highlyCor <- colnames(bc_data_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]

## Compare row 2  and column  3 with corr  0.913 
##   Means:  0.716 vs 0.601 so flagging column 2 
## Compare row 3  and column  6 with corr  0.741 
##   Means:  0.677 vs 0.579 so flagging column 3 
## Compare row 6  and column  7 with corr  0.706 
##   Means:  0.602 vs 0.545 so flagging column 6 
## All correlations <= 0.7

# which variables are flagged for removal?
highlyCor

## [1] "uniformity_of_cell_size"  "uniformity_of_cell_shape"
## [3] "bare_nuclei"

#then we remove these variables
bc_data_cor <- bc_data_train[, which(!colnames(bc_data_train) %in% highlyCor)]

Correlation between features in dataset 1 is generally high and 4 out of 10 feature were flagged for removal.

Dataset 2

corMatMy <- cor(bc_data_2_train[, 3:32])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_2_train[, 3:32])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]

## Compare row 7  and column  8 with corr  0.92 
##   Means:  0.579 vs 0.393 so flagging column 7 
## Compare row 8  and column  6 with corr  0.84 
##   Means:  0.548 vs 0.38 so flagging column 8 
## Compare row 6  and column  28 with corr  0.827 
##   Means:  0.536 vs 0.368 so flagging column 6 
## Compare row 28  and column  27 with corr  0.855 
##   Means:  0.506 vs 0.357 so flagging column 28 
## Compare row 27  and column  26 with corr  0.894 
##   Means:  0.46 vs 0.346 so flagging column 27 
## Compare row 23  and column  21 with corr  0.993 
##   Means:  0.454 vs 0.336 so flagging column 23 
## Compare row 21  and column  24 with corr  0.983 
##   Means:  0.419 vs 0.327 so flagging column 21 
## Compare row 26  and column  30 with corr  0.817 
##   Means:  0.402 vs 0.323 so flagging column 26 
## Compare row 24  and column  3 with corr  0.943 
##   Means:  0.383 vs 0.312 so flagging column 24 
## Compare row 3  and column  1 with corr  0.998 
##   Means:  0.347 vs 0.306 so flagging column 3 
## Compare row 1  and column  4 with corr  0.986 
##   Means:  0.302 vs 0.304 so flagging column 4 
## Compare row 1  and column  14 with corr  0.726 
##   Means:  0.264 vs 0.304 so flagging column 14 
## Compare row 13  and column  11 with corr  0.973 
##   Means:  0.32 vs 0.304 so flagging column 13 
## Compare row 18  and column  16 with corr  0.757 
##   Means:  0.388 vs 0.295 so flagging column 18 
## Compare row 16  and column  17 with corr  0.796 
##   Means:  0.404 vs 0.288 so flagging column 16 
## Compare row 9  and column  29 with corr  0.711 
##   Means:  0.343 vs 0.274 so flagging column 9 
## Compare row 17  and column  20 with corr  0.745 
##   Means:  0.306 vs 0.261 so flagging column 17 
## Compare row 5  and column  25 with corr  0.809 
##   Means:  0.311 vs 0.255 so flagging column 5 
## Compare row 30  and column  10 with corr  0.753 
##   Means:  0.288 vs 0.241 so flagging column 30 
## Compare row 22  and column  2 with corr  0.913 
##   Means:  0.243 vs 0.242 so flagging column 22 
## All correlations <= 0.7

highlyCor

##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "concave_points_largest_worst"   
##  [5] "concavity_largest_worst"         "perimeter_largest_worst"        
##  [7] "radius_largest_worst"            "compactness_largest_worst"      
##  [9] "area_largest_worst"              "perimeter_mean"                 
## [11] "perimeter_se"                    "area_mean"                      
## [13] "concave_points_se"               "compactness_se"                 
## [15] "area_se"                         "symmetry_mean"                  
## [17] "concavity_se"                    "smoothness_mean"                
## [19] "fractal_dimension_largest_worst" "texture_largest_worst"

bc_data_2_cor <- bc_data_2_train[, which(!colnames(bc_data_2_train) %in% highlyCor)]

Here, we have more variation between the 30 features: some are highly correlated, while others seem to be very distinct. 20 are flagged for removal (see output above).

Dataset 3

corMatMy <- cor(bc_data_3_train[, -1])
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(bc_data_3_train[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]

## Compare row 8  and column  9 with corr  0.898 
##   Means:  0.425 vs 0.286 so flagging column 8 
## Compare row 9  and column  7 with corr  0.714 
##   Means:  0.39 vs 0.277 so flagging column 9 
## Compare row 7  and column  29 with corr  0.753 
##   Means:  0.364 vs 0.271 so flagging column 7 
## Compare row 4  and column  2 with corr  0.996 
##   Means:  0.348 vs 0.264 so flagging column 4 
## Compare row 2  and column  5 with corr  0.993 
##   Means:  0.33 vs 0.259 so flagging column 2 
## Compare row 5  and column  24 with corr  0.921 
##   Means:  0.304 vs 0.255 so flagging column 5 
## Compare row 24  and column  22 with corr  0.985 
##   Means:  0.272 vs 0.252 so flagging column 24 
## Compare row 11  and column  31 with corr  0.83 
##   Means:  0.341 vs 0.247 so flagging column 11 
## Compare row 22  and column  15 with corr  0.773 
##   Means:  0.24 vs 0.242 so flagging column 15 
## Compare row 22  and column  25 with corr  0.989 
##   Means:  0.217 vs 0.242 so flagging column 25 
## Compare row 14  and column  12 with corr  0.975 
##   Means:  0.256 vs 0.243 so flagging column 14 
## Compare row 31  and column  28 with corr  0.71 
##   Means:  0.328 vs 0.238 so flagging column 31 
## Compare row 18  and column  17 with corr  0.812 
##   Means:  0.331 vs 0.229 so flagging column 18 
## Compare row 28  and column  27 with corr  0.84 
##   Means:  0.286 vs 0.219 so flagging column 28 
## Compare row 17  and column  21 with corr  0.839 
##   Means:  0.285 vs 0.212 so flagging column 17 
## Compare row 10  and column  30 with corr  0.766 
##   Means:  0.277 vs 0.204 so flagging column 10 
## Compare row 6  and column  26 with corr  0.754 
##   Means:  0.235 vs 0.198 so flagging column 6 
## Compare row 3  and column  23 with corr  0.858 
##   Means:  0.164 vs 0.195 so flagging column 23 
## All correlations <= 0.7

highlyCor

##  [1] "concavity_mean"                  "concave_points_mean"            
##  [3] "compactness_mean"                "perimeter_mean"                 
##  [5] "radius_mean"                     "area_mean"                      
##  [7] "perimeter_largest_worst"         "fractal_dimension_mean"         
##  [9] "perimeter_se"                    "area_se"                        
## [11] "area_largest_worst"              "fractal_dimension_largest_worst"
## [13] "concavity_se"                    "concavity_largest_worst"        
## [15] "compactness_se"                  "symmetry_mean"                  
## [17] "smoothness_mean"                 "texture_largest_worst"

bc_data_3_cor <- bc_data_3_train[, which(!colnames(bc_data_3_train) %in% highlyCor)]

The features in dataset 3 look similar: some are highly correlated, others are very different. 18 are flagged for removal (see output above).

Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rate each with an accuracy score. The combination with the highest score is usually preferential.

Dataset 1

# ensure the results are repeatable
set.seed(7)
# define the control using a random forest selection function with cross validation
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# run the RFE algorithm
results_1 <- rfe(x = bc_data_train[, -1], y = bc_data_train$classes, sizes = c(1:9), rfeControl = control)

# chosen features
predictors(results_1)

## [1] "bare_nuclei"                 "uniformity_of_cell_size"    
## [3] "clump_thickness"             "uniformity_of_cell_shape"   
## [5] "bland_chromatin"             "marginal_adhesion"          
## [7] "mitosis"                     "normal_nucleoli"            
## [9] "single_epithelial_cell_size"

# subset the chosen features
bc_data_rfe <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% predictors(results_1)))]

Dataset 2

set.seed(7)
results_2 <- rfe(x = bc_data_2_train[, -c(1, 2)], y = as.factor(bc_data_2_train$diagnosis), sizes = c(1:30), rfeControl = control)

predictors(results_2)

##  [1] "perimeter_largest_worst"         "area_largest_worst"             
##  [3] "radius_largest_worst"            "concave_points_largest_worst"   
##  [5] "concave_points_mean"             "texture_largest_worst"          
##  [7] "texture_mean"                    "area_se"                        
##  [9] "concavity_largest_worst"         "concavity_mean"                 
## [11] "radius_se"                       "perimeter_mean"                 
## [13] "radius_mean"                     "area_mean"                      
## [15] "perimeter_se"                    "smoothness_largest_worst"       
## [17] "compactness_largest_worst"       "symmetry_largest_worst"         
## [19] "compactness_mean"                "smoothness_mean"                
## [21] "fractal_dimension_largest_worst" "concavity_se"                   
## [23] "symmetry_mean"

bc_data_2_rfe <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% predictors(results_2)))]

Dataset 3

set.seed(7)
results_3 <- rfe(x = bc_data_3_train[,-1], y = as.factor(bc_data_3_train$outcome), sizes = c(1:33), rfeControl = control)

predictors(results_2)

##  [1] "perimeter_largest_worst"         "area_largest_worst"             
##  [3] "radius_largest_worst"            "concave_points_largest_worst"   
##  [5] "concave_points_mean"             "texture_largest_worst"          
##  [7] "texture_mean"                    "area_se"                        
##  [9] "concavity_largest_worst"         "concavity_mean"                 
## [11] "radius_se"                       "perimeter_mean"                 
## [13] "radius_mean"                     "area_mean"                      
## [15] "perimeter_se"                    "smoothness_largest_worst"       
## [17] "compactness_largest_worst"       "symmetry_largest_worst"         
## [19] "compactness_mean"                "smoothness_mean"                
## [21] "fractal_dimension_largest_worst" "concavity_se"                   
## [23] "symmetry_mean"

bc_data_3_rfe <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% predictors(results_3)))]

Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

For demonstration purposes I am using only 10 generations consisting of 5 individuals. More iterations with larger populations would of course be preferable, but this takes quite long to run!

library(dplyr)

ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
                       method = "cv",    # 10 fold cross validation
                       genParallel = TRUE, # Use parallel programming
                       allowParallel = TRUE)

Dataset 1

lev <- c("malignant", "benign")     # Set the levels

set.seed(27)
model_1 <- gafs(x = bc_data_train[, -1], y = bc_data_train$classes,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)

plot(model_1) # Plot mean fitness (AUC) by generation

model_1$ga$final

## [1] "clump_thickness"             "uniformity_of_cell_shape"   
## [3] "marginal_adhesion"           "single_epithelial_cell_size"
## [5] "bare_nuclei"                 "bland_chromatin"            
## [7] "normal_nucleoli"             "mitosis"

bc_data_ga <- bc_data_train[, c(1, which(colnames(bc_data_train) %in% model_1$ga$final))]

Dataset 2

lev <- c("M", "B")

set.seed(27)
model_2 <- gafs(x = bc_data_2_train[, -c(1, 2)], y = bc_data_2_train$diagnosis,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)

plot(model_2)

model_2$ga$final

##  [1] "radius_mean"                  "texture_mean"                
##  [3] "perimeter_mean"               "area_mean"                   
##  [5] "smoothness_mean"              "compactness_mean"            
##  [7] "concavity_mean"               "symmetry_mean"               
##  [9] "fractal_dimension_mean"       "texture_se"                  
## [11] "perimeter_se"                 "area_se"                     
## [13] "smoothness_se"                "compactness_se"              
## [15] "concavity_se"                 "concave_points_se"           
## [17] "symmetry_se"                  "radius_largest_worst"        
## [19] "texture_largest_worst"        "smoothness_largest_worst"    
## [21] "compactness_largest_worst"    "concave_points_largest_worst"

bc_data_2_ga <- bc_data_2_train[, c(2, which(colnames(bc_data_2_train) %in% model_2$ga$final))]

Dataset 3

lev <- c("R", "N")

set.seed(27)
model_3 <- gafs(x = bc_data_3_train[, -1], y = bc_data_3_train$outcome,
                   iters = 10, # generations of algorithm
                   popSize = 5, # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)

plot(model_3)

model_3$ga$final

##  [1] "time"                            "perimeter_mean"                 
##  [3] "radius_se"                       "texture_se"                     
##  [5] "compactness_se"                  "concavity_se"                   
##  [7] "symmetry_se"                     "fractal_dimension_se"           
##  [9] "radius_largest_worst"            "texture_largest_worst"          
## [11] "perimeter_largest_worst"         "smoothness_largest_worst"       
## [13] "compactness_largest_worst"       "concavity_largest_worst"        
## [15] "concave_points_largest_worst"    "symmetry_largest_worst"         
## [17] "fractal_dimension_largest_worst" "tumor_size"                     
## [19] "lymph_node_status"

bc_data_3_ga <- bc_data_3_train[, c(1, which(colnames(bc_data_3_train) %in% model_3$ga$final))]

Model comparison

Now I can compare Random Forest models with the different feature subsets. (I chose Random Forests because I find it works reasonbly well on a large variety of data and because it’s the algorithm I’ve used most often and thus know most about…)

For a more detailed description of building machine learning models see here and here.

All features

Dataset 1

set.seed(27)
model_bc_data_all <- train(classes ~ .,
                           data = bc_data_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_all_1 <- confusionMatrix(predict(model_bc_data_all, bc_data_test[, -1]), bc_data_test$classes)
cm_all_1

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         5
##   malignant      3        67
##                                          
##                Accuracy : 0.9617         
##                  95% CI : (0.926, 0.9833)
##     No Information Rate : 0.6555         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9147         
##  Mcnemar's Test P-Value : 0.7237         
##                                          
##             Sensitivity : 0.9781         
##             Specificity : 0.9306         
##          Pos Pred Value : 0.9640         
##          Neg Pred Value : 0.9571         
##              Prevalence : 0.6555         
##          Detection Rate : 0.6411         
##    Detection Prevalence : 0.6651         
##       Balanced Accuracy : 0.9543         
##                                          
##        'Positive' Class : benign         
## 

Dataset 2

set.seed(27)
model_bc_data_2_all <- train(diagnosis ~ .,
                           data = bc_data_2_train[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_all_2 <- confusionMatrix(predict(model_bc_data_2_all, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_all_2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 

Dataset 3

set.seed(27)
model_bc_data_3_all <- train(outcome ~ .,
                           data = bc_data_3_train,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_all_3 <- confusionMatrix(predict(model_bc_data_3_all, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_all_3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 

Selected features

Dataset 1

library(gplots)

venn_list <- list(cor = colnames(bc_data_cor)[-1],
                  rfe = colnames(bc_data_rfe)[-1],
                  ga = colnames(bc_data_ga)[-1])

venn <- venn(venn_list)

venn

##     num cor rfe ga
## 000   0   0   0  0
## 001   0   0   0  1
## 010   1   0   1  0
## 011   2   0   1  1
## 100   0   1   0  0
## 101   0   1   0  1
## 110   0   1   1  0
## 111   6   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "clump_thickness"             "marginal_adhesion"          
## [3] "single_epithelial_cell_size" "bland_chromatin"            
## [5] "normal_nucleoli"             "mitosis"                    
## 
## attr(,"intersections")$rfe
## [1] "uniformity_of_cell_size"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "uniformity_of_cell_shape" "bare_nuclei"             
## 
## attr(,"class")
## [1] "venn"

4 out of 10 features were chosen by all three methods; the biggest overlap is seen between GA and RFE with 7 features. RFE and GA both retained 8 features for modeling, compared to only 5 based on the correlation method.

Correlation

set.seed(27)
model_bc_data_cor <- train(classes ~ .,
                 data = bc_data_cor,
                 method = "rf",
                 preProcess = c("scale", "center"),
                 trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_cor_1 <- confusionMatrix(predict(model_bc_data_cor, bc_data_test[, -1]), bc_data_test$classes)
cm_cor_1

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       131         6
##   malignant      6        66
##                                         
##                Accuracy : 0.9426        
##                  95% CI : (0.9019, 0.97)
##     No Information Rate : 0.6555        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.8729        
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.9562        
##             Specificity : 0.9167        
##          Pos Pred Value : 0.9562        
##          Neg Pred Value : 0.9167        
##              Prevalence : 0.6555        
##          Detection Rate : 0.6268        
##    Detection Prevalence : 0.6555        
##       Balanced Accuracy : 0.9364        
##                                         
##        'Positive' Class : benign        
## 

set.seed(27)
model_bc_data_rfe <- train(classes ~ .,
                           data = bc_data_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_rfe_1 <- confusionMatrix(predict(model_bc_data_rfe, bc_data_test[, -1]), bc_data_test$classes)
cm_rfe_1

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         4
##   malignant      3        68
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9322, 0.9864)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9256          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9710          
##          Neg Pred Value : 0.9577          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6603          
##       Balanced Accuracy : 0.9613          
##                                           
##        'Positive' Class : benign          
## 

set.seed(27)
model_bc_data_ga <- train(classes ~ .,
                           data = bc_data_ga,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_ga_1 <- confusionMatrix(predict(model_bc_data_ga, bc_data_test[, -1]), bc_data_test$classes)
cm_ga_1

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       134         3
##   malignant      3        69
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9386, 0.9894)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9364          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9583          
##          Pos Pred Value : 0.9781          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6411          
##    Detection Prevalence : 0.6555          
##       Balanced Accuracy : 0.9682          
##                                           
##        'Positive' Class : benign          
## 

Dataset 2

venn_list <- list(cor = colnames(bc_data_2_cor)[-c(1, 2)],
                  rfe = colnames(bc_data_2_rfe)[-c(1, 2)],
                  ga = colnames(bc_data_2_ga)[-c(1, 2)])

venn <- venn(venn_list)

venn

##     num cor rfe ga
## 000   0   0   0  0
## 001   2   0   0  1
## 010   5   0   1  0
## 011  13   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   2   1   1  0
## 111   2   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "texture_mean"             "smoothness_largest_worst"
## 
## attr(,"intersections")$ga
## [1] "compactness_se"    "concave_points_se"
## 
## attr(,"intersections")$cor
## [1] "radius_mean"          "fractal_dimension_se"
## 
## attr(,"intersections")$rfe
## [1] "concave_points_mean"             "perimeter_largest_worst"        
## [3] "area_largest_worst"              "concavity_largest_worst"        
## [5] "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:ga`
## [1] "fractal_dimension_mean" "texture_se"            
## [3] "smoothness_se"          "symmetry_se"           
## 
## attr(,"intersections")$`rfe:ga`
##  [1] "perimeter_mean"               "area_mean"                   
##  [3] "smoothness_mean"              "compactness_mean"            
##  [5] "concavity_mean"               "symmetry_mean"               
##  [7] "perimeter_se"                 "area_se"                     
##  [9] "concavity_se"                 "radius_largest_worst"        
## [11] "texture_largest_worst"        "compactness_largest_worst"   
## [13] "concave_points_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "radius_se"              "symmetry_largest_worst"
## 
## attr(,"class")
## [1] "venn"

For dataset 2 we see a much bigger variation in chosen features between the three selection methods: only 1 feature was chosen by all and the biggest overlap is again seen between RFE and GA, followed by correlation and GA. But this time we also see quite a few features that are uniquely retained by any of the three feature selection methods.

Correlation

set.seed(27)
model_bc_data_2_cor <- train(diagnosis ~ .,
                           data = bc_data_2_cor[, -1],
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_cor_2 <- confusionMatrix(predict(model_bc_data_2_cor, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_cor_2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   6
##          M   1  57
##                                          
##                Accuracy : 0.9588         
##                  95% CI : (0.917, 0.9833)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9103         
##  Mcnemar's Test P-Value : 0.1306         
##                                          
##             Sensitivity : 0.9907         
##             Specificity : 0.9048         
##          Pos Pred Value : 0.9464         
##          Neg Pred Value : 0.9828         
##              Prevalence : 0.6294         
##          Detection Rate : 0.6235         
##    Detection Prevalence : 0.6588         
##       Balanced Accuracy : 0.9477         
##                                          
##        'Positive' Class : B              
## 

set.seed(27)
model_bc_data_2_rfe <- train(diagnosis ~ .,
                           data = bc_data_2_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_rfe_2 <- confusionMatrix(predict(model_bc_data_2_rfe, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_rfe_2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 

set.seed(27)
model_bc_data_2_ga <- train(diagnosis ~ .,
                          data = bc_data_2_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_ga_2 <- confusionMatrix(predict(model_bc_data_2_ga, bc_data_2_test[, -c(1, 2)]), bc_data_2_test$diagnosis)
cm_ga_2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   4
##          M   1  59
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9327, 0.9904)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9363          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9365          
##          Pos Pred Value : 0.9636          
##          Neg Pred Value : 0.9833          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6471          
##       Balanced Accuracy : 0.9636          
##                                           
##        'Positive' Class : B               
## 

Dataset 3

venn_list <- list(cor = colnames(bc_data_3_cor)[-1],
                  rfe = colnames(bc_data_3_rfe)[-1],
                  ga = colnames(bc_data_3_ga)[-1])

venn <- venn(venn_list)

venn

##     num cor rfe ga
## 000   0   0   0  0
## 001   5   0   0  1
## 010   6   0   1  0
## 011   2   0   1  1
## 100   2   1   0  0
## 101   4   1   0  1
## 110   1   1   1  0
## 111   8   1   1  1
## attr(,"intersections")
## attr(,"intersections")$`cor:rfe:ga`
## [1] "time"                         "radius_se"                   
## [3] "symmetry_se"                  "smoothness_largest_worst"    
## [5] "concave_points_largest_worst" "symmetry_largest_worst"      
## [7] "tumor_size"                   "lymph_node_status"           
## 
## attr(,"intersections")$ga
## [1] "perimeter_mean"          "compactness_se"         
## [3] "texture_largest_worst"   "perimeter_largest_worst"
## [5] "concavity_largest_worst"
## 
## attr(,"intersections")$cor
## [1] "texture_mean"  "smoothness_se"
## 
## attr(,"intersections")$rfe
## [1] "compactness_mean"       "concavity_mean"        
## [3] "concave_points_mean"    "symmetry_mean"         
## [5] "fractal_dimension_mean" "perimeter_se"          
## 
## attr(,"intersections")$`cor:ga`
## [1] "texture_se"                "fractal_dimension_se"     
## [3] "radius_largest_worst"      "compactness_largest_worst"
## 
## attr(,"intersections")$`rfe:ga`
## [1] "concavity_se"                    "fractal_dimension_largest_worst"
## 
## attr(,"intersections")$`cor:rfe`
## [1] "concave_points_se"
## 
## attr(,"class")
## [1] "venn"

With the third dataset there is still some variation among the selected features but not as extreme as with dataset 2. This time, we find the biggest overlap between correlation and GA.

Correlation

set.seed(27)
model_bc_data_3_cor <- train(outcome ~ .,
                           data = bc_data_3_cor,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_cor_3 <- confusionMatrix(predict(model_bc_data_3_cor, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_cor_3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 41  7
##          R  4  7
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.6909, 0.9031)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.2256          
##                                           
##                   Kappa : 0.4439          
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9111          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8542          
##          Neg Pred Value : 0.6364          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6949          
##    Detection Prevalence : 0.8136          
##       Balanced Accuracy : 0.7056          
##                                           
##        'Positive' Class : N               
## 

set.seed(27)
model_bc_data_3_rfe <- train(outcome ~ .,
                           data = bc_data_3_rfe,
                           method = "rf",
                           preProcess = c("scale", "center"),
                           trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_rfe_3 <- confusionMatrix(predict(model_bc_data_3_rfe, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_rfe_3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 40  7
##          R  5  7
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.6717, 0.8902)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.3311          
##                                           
##                   Kappa : 0.409           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8511          
##          Neg Pred Value : 0.5833          
##              Prevalence : 0.7627          
##          Detection Rate : 0.6780          
##    Detection Prevalence : 0.7966          
##       Balanced Accuracy : 0.6944          
##                                           
##        'Positive' Class : N               
## 

set.seed(27)
model_bc_data_3_ga <- train(outcome ~ .,
                          data = bc_data_3_ga,
                          method = "rf",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10, verboseIter = FALSE))

cm_ga_3 <- confusionMatrix(predict(model_bc_data_3_ga, bc_data_3_test[, -1]), bc_data_3_test$outcome)
cm_ga_3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  R
##          N 42  7
##          R  3  7
##                                           
##                Accuracy : 0.8305          
##                  95% CI : (0.7103, 0.9156)
##     No Information Rate : 0.7627          
##     P-Value [Acc > NIR] : 0.1408          
##                                           
##                   Kappa : 0.4806          
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7627          
##          Detection Rate : 0.7119          
##    Detection Prevalence : 0.8305          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : N               
## 

Overall model parameters

To compare the feature selection methods’ influence on the models built with the three different datasets, I am going to plot model accuracy, Kappa, precision and sensitivity and specificity.

overall <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$overall,
                      cm_cor_1$overall,
                      cm_rfe_1$overall,
                      cm_ga_1$overall,
                      cm_all_2$overall,
                      cm_cor_2$overall,
                      cm_rfe_2$overall,
                      cm_ga_2$overall,
                      cm_all_3$overall,
                      cm_cor_3$overall,
                      cm_rfe_3$overall,
                      cm_ga_3$overall))

library(tidyr)
overall_gather <- overall[, 1:4] %>%
  gather(measure, value, Accuracy:Kappa)

byClass <- data.frame(dataset = rep(c("1", "2", "3"), each = 4),
                      model = rep(c("all", "cor", "rfe", "ga"), 3),
                      rbind(cm_all_1$byClass,
                      cm_cor_1$byClass,
                      cm_rfe_1$byClass,
                      cm_ga_1$byClass,
                      cm_all_2$byClass,
                      cm_cor_2$byClass,
                      cm_rfe_2$byClass,
                      cm_ga_2$byClass,
                      cm_all_3$byClass,
                      cm_cor_3$byClass,
                      cm_rfe_3$byClass,
                      cm_ga_3$byClass))

byClass_gather <- byClass[, c(1:4, 7)] %>%
  gather(measure, value, Sensitivity:Precision)

overall_byClass_gather <- rbind(overall_gather, byClass_gather)
overall_byClass_gather <- within(overall_byClass_gather, model <- factor(model, levels = c("all", "cor", "rfe", "ga")))

ggplot(overall_byClass_gather, aes(x = model, y = value, color = measure, shape = measure, group = measure)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_path(alpha = 0.7) +
  scale_colour_brewer(palette = "Set1") +
  facet_grid(dataset ~ ., scales = "free_y") +
  labs(
    x = "Feature Selection method",
    y = "Value",
    color = "",
    shape = "",
    title = "Comparison of feature selection methods",
    subtitle = "in three breast cancer datasets",
    caption = "\nBreast Cancer Wisconsin (Diagnostic) Data Sets: 1, 2 & 3
    Street et al., 1993;
    all: no feature selection
    cor: features with correlation > 0.7 removed
    rfe: Recursive Feature Elimination
    ga: Genetic Algorithm"
  )

Conclusions

As expected from PCA of sample classes, which showed that the two classes recur/ non-recur did not cluster well, Random Forest models of dataset 3 had much lower prediction accuracy than models of datasets 1 and 2.

The correlation method operates regardless of feature importance. E.g. in dataset 1, the features with the highest importance were also flagged as highly correlated. Correlation models performed worst in all three datasets. RFE and GA tend to include features with high importance, but feature importance alone is not a good indicator for whether several features will work well in combination when predicting an outcome.

Dataset 1 was small with only 9 features; here, removing highly correlated features was the least successful selection method. RFE and GA both improved the predictions compared to no feature selection, while GA performed best. Dataset 2 with its 30 original features produced the best models with the GA. And in dataset 3, which had lower overall accuracy, different feature selection methods did not have a strong influence.

If you are interested in more machine learning posts, check out the category listing for machine_learning.

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] C
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_3.0.1         dplyr_0.5.0          plyr_1.8.4          
##  [4] corrplot_0.77        randomForest_4.6-12  doParallel_1.0.10   
##  [7] iterators_1.0.8      foreach_1.4.3        caret_6.0-73        
## [10] lattice_0.20-34      tidyr_0.6.0          gridExtra_2.2.1     
## [13] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
## [16] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [19] ellipse_0.3-8        ggplot2_2.2.1        mice_2.25           
## [22] Rcpp_0.12.8         
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0         class_7.3-14         assertthat_0.1      
##  [4] rprojroot_1.1        digest_0.6.11        R6_2.2.0            
##  [7] backports_1.0.4      MatrixModels_0.4-1   RSQLite_1.1-1       
## [10] evaluate_0.10        e1071_1.6-7          zlibbioc_1.20.0     
## [13] lazyeval_0.2.0       gdata_2.17.0         minqa_1.2.4         
## [16] SparseM_1.74         car_2.1-4            nloptr_1.0.4        
## [19] rpart_4.1-10         Matrix_1.2-7.1       rmarkdown_1.3       
## [22] labeling_0.3         splines_3.3.2        lme4_1.1-12         
## [25] stringr_1.1.0        munsell_0.4.3        compiler_3.3.2      
## [28] mgcv_1.8-16          htmltools_0.3.5      nnet_7.3-12         
## [31] tibble_1.2           codetools_0.2-15     bitops_1.0-6        
## [34] MASS_7.3-45          ModelMetrics_1.1.0   nlme_3.1-128        
## [37] gtable_0.2.0         DBI_0.5-1            magrittr_1.5        
## [40] scales_0.4.1         KernSmooth_2.23-15   stringi_1.1.2       
## [43] reshape2_1.4.2       RColorBrewer_1.1-2   tools_3.3.2         
## [46] Biobase_2.34.0       pbkrtest_0.4-6       survival_2.40-1     
## [49] yaml_2.1.14          AnnotationDbi_1.36.0 colorspace_1.3-2    
## [52] caTools_1.17.1       memoise_1.0.0        knitr_1.15.1        
## [55] quantreg_5.29

Gene homology Part 3 - Visualizing Gene Ontology of Conserved Genes

2017-01-05 00:00:00 +0000

Which genes have homologs in many species?

In Part 1 and Part 2 I have already explored gene homology between humans and other species. But there I have only considered how many genes where shared between the species.

In this post I want to have a closer look at genes with homologs in many species. These genes are likely evolutionarily convserved in their biological function. To find out what biological functions these genes have, I am using gene ontology (GO) enrichment analysis.

Because GO categories have a hierarchical parent-children structure, comparing enriched GO terms between groups of genes in a meaningful way isn’t trivial. Simply calculating the overlap between categories will miss the bigger picture when, e.g. two groups are enriched for closely related but not for exactly the same GO categories. Here, I will explore a few options of visualizing GO enrichment with the purpose of comparing enrichments between groups (here: species).

A network of the partial hierarchical structure of enriched GO terms of biological processes encompasses the information of relatedness between GO terms. Custom annotations show GO terms that were common, as well as terms that were unique to species.

Highly conserved biological functions include core regulatory processes that are essential for the cell cycle, energy production and metabolism, nucleoside biogenesis, protein metabolism, etc.

I’m starting with the same table as from this post, called homologs_table_combined.

head(homologs_table_combined)

##   celegans_gene_ensembl cfamiliaris_gene_ensembl
## 1                    NA                       NA
## 2                    NA                   477699
## 3                    NA                   477023
## 4                173979                   490207
## 5                179795                   607852
## 6                177055                       NA
##   dmelanogaster_gene_ensembl drerio_gene_ensembl ggallus_gene_ensembl
## 1                         NA                  NA                   NA
## 2                      44071                  NA                   NA
## 3                         NA              431754               770094
## 4                      38864              406283                   NA
## 5                      36760              794259               395373
## 6                    3772179              541489               421909
##   hsapiens_gene_ensembl mmusculus_gene_ensembl ptroglodytes_gene_ensembl
## 1                    NA                     NA                        NA
## 2                     2                 232345                    465372
## 3                    30                 235674                    460268
## 4                    34                  11364                    469356
## 5                    47                 104112                    454672
## 6                    52                  11431                    458990
##   rnorvegicus_gene_ensembl scerevisiae_gene_ensembl sscrofa_gene_ensembl
## 1                    24152                       NA                   NA
## 2                    24153                       NA               403166
## 3                    24157                   854646            100515577
## 4                    24158                       NA               397104
## 5                    24159                   854310                   NA
## 6                    24161                   856187            100737301

Each row in this table denotes a gene with its Entrez ID and corresponding Entrez IDs for homologs in each of the other 10 species I explored. If a gene doesn’t have homolog the table says “NA”. However, some genes have duplicate entries for a species if there are multiple homologs in different species.

By counting the number of NAs per row, we can identify genes with homologs in all species (sum of NAs = 0) and genes which are specific (sum of NAs = 10).

homologs_na <- rowSums(is.na(homologs_table_combined))

Before I delve deeper into the biology behind these genes, I want to examine the distribution of the NA-counts. To do so, I am plotting a histogram.

But first I’m preparing my custom ggplot2 theme:

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "bottom",
    legend.justification = "top", 
    legend.box = "horizontal",
    legend.box.background = element_rect(colour = "grey50"),
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())

ggplot() + 
  aes(homologs_na) + 
  geom_histogram(binwidth = 1, color = "royalblue", fill = "royalblue", alpha = 0.8) +
  labs(
    x = "Number of NAs per row of gene homology table",
    y = "Count",
    title = "How many homologs do genes have?",
    subtitle = "Showing all bins from 0 to 10",
    caption = "\nEach row of the gene homology table list a gene and all its homologs in 10 other species.
    If a gene doesn't have a homolog, its value in the table is NA. Thus, rows with many NAs refer to a
    gene that is specific to one species, while rows with no NAs show genes with homologs in all species -
    such genes are said to be highly conserved."
  )

Clearly, most genes are specific to a species, they have NAs in all but one column. The rest of the histogram is a bit hard to differentiate with the peak at 10, so let’s look at the same data again with these genes:

ggplot() + 
  aes(subset(homologs_na, homologs_na  < 10)) + 
  geom_histogram(binwidth = 1, color = "royalblue", fill = "royalblue", alpha = 0.8) +
  labs(
    x = "Number of NAs per row of gene homology table",
    y = "Count",
    title = "How many homologs do genes have?",
    subtitle = "Showing bins from 0 to 9",
    caption = "\nEach row of the gene homology table list a gene and all its homologs in 10 other species.
    If a gene doesn't have a homolog, its value in the table is NA. Thus, rows with many NAs refer to a
    gene that is specific to one species, while rows with no NAs show genes with homologs in all species -
    such genes are said to be highly conserved."
  )

Now we can see that most genes have homologs in 9 species (2 NAs). But there are still quite a few genes with homologs in all species.

Which genes are highly conserved and what biological functions do they have?

There are 3461 rows in the original table with no NAs.

genes_homologs_all <- homologs_table_combined[which(rowSums(is.na(homologs_table_combined)) == 0), ]
nrow(genes_homologs_all)

## [1] 3461

Looking at all of these genes by hand wouldn’t be feasible. So, to find out what biological functions these genes have, I am using gene ontology (GO-term) enrichment analysis as implemented in clusterProfiler.

Biological processes (BP): Collections of molecular events and functions contributing to a biological process
Cellular component (CC): (sub-) cellular structures and locations, macromolecular complexes
Molecular functions (MF): Molecular functions like catalytic or binding activity

library(clusterProfiler)
library(DOSE)

for (i in 1:nrow(datasets)) {
  species <- datasets$dataset[i]
  genes <- as.character(unique(genes_homologs_all[, species]))
  universe <- as.character(unique(na.omit(homologs_table_combined[, species])))
  
  cat("\nSpecies", datasets$description[i], "has", length(universe), "unique Entrez IDs, of which", length(genes), "have homologs in all species.\n")
  
  try(go_enrich_BP <- enrichGO(gene = genes, 
                        keytype = "ENTREZID",
                        OrgDb = get(datasets$orgDb[i]),
                        ont = "BP",
                        qvalueCutoff = 0.05,
                        universe = universe,
                        readable = TRUE))
  
  try(go_enrich_MF <- enrichGO(gene = genes, 
                        keytype = "ENTREZID",
                        OrgDb = get(datasets$orgDb[i]),
                        ont = "MF",
                        qvalueCutoff = 0.05,
                        universe = universe,
                        readable = TRUE))
  
  try(go_enrich_CC <- enrichGO(gene = genes, 
                        keytype = "ENTREZID",
                        OrgDb = get(datasets$orgDb[i]),
                        ont = "CC",
                        qvalueCutoff = 0.05,
                        universe = universe,
                        readable = TRUE))
  
  try(assign(paste("go_enrich_BP", species, sep = "_"), go_enrich_BP))
  try(assign(paste("go_enrich_MF", species, sep = "_"), go_enrich_MF))
  try(assign(paste("go_enrich_CC", species, sep = "_"), go_enrich_CC))
}

## 
## Species Rat has 47989 unique Entrez IDs, of which 1796 have homologs in all species.
## 
## Species Yeast has 6350 unique Entrez IDs, of which 1363 have homologs in all species.
## 
## Error in .testForValidCols(x, cols) : 
##   Invalid columns: SYMBOL. Please use the columns method to see a listing of valid arguments.
## Error in .testForValidCols(x, cols) : 
##   Invalid columns: SYMBOL. Please use the columns method to see a listing of valid arguments.
## Error in .testForValidCols(x, cols) : 
##   Invalid columns: SYMBOL. Please use the columns method to see a listing of valid arguments.
## 
## Species C. elegans has 46727 unique Entrez IDs, of which 1758 have homologs in all species.
## 
## Species Chimpanzee has 40540 unique Entrez IDs, of which 1704 have homologs in all species.
## 
## Species Pig has 54622 unique Entrez IDs, of which 1708 have homologs in all species.
## 
## Species Human has 60136 unique Entrez IDs, of which 1757 have homologs in all species.
## 
## Species Chicken has 34668 unique Entrez IDs, of which 1684 have homologs in all species.
## 
## Species Zebrafish has 47900 unique Entrez IDs, of which 1910 have homologs in all species.
## 
## Species Fruitfly has 25038 unique Entrez IDs, of which 1805 have homologs in all species.
## 
## Species Mouse has 68603 unique Entrez IDs, of which 1761 have homologs in all species.
## 
## Species Dog has 31376 unique Entrez IDs, of which 1716 have homologs in all species.

I’m not sure why yeast throws an error but since all other species worked I will ignore yeast genes for now…

Visualizing enriched GO categories

A heatmap of gene enrichment ratios shows enriched GO categories and their enrichment strength and allows us to capture comparable information for all species.

To generate this heatmap, I am using the output tables from clusterProfiler, which tell us the enrichment p-value (before and after correction for multiple testing, aka False Discovery Rate or FDR), the proportion of genes belonging to each GO category in the background versus all background genes and in our gene list compared to all genes in our list. To keep the graph reasonably small, I’m subsetting only the top 10 enriched genes with lowest adjusted p-values for biological processes (BP), molecular functions (MF) and cellular components (CC).

# removing yeast
datasets_2 <- datasets[-grep("scerevisiae_gene_ensembl", datasets$dataset), ]

library(dplyr)

cutoff = 10

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  
  go_enrich_BP_df <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_"))) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "BP")
  
  plot_BP <- go_enrich_BP_df[1:cutoff, ]
  plot_BP$Description <- factor(plot_BP$Description, levels = plot_BP$Description[order(plot_BP$GR)])
    
  go_enrich_MF_df <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_"))) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "MF")
  
  plot_MF <- go_enrich_MF_df[1:cutoff, ]
  plot_MF$Description <- factor(plot_MF$Description, levels = plot_MF$Description[order(plot_MF$GR)])

  go_enrich_CC_df <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_"))) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "CC")
  
  plot_CC <- go_enrich_CC_df[1:cutoff, ]
  plot_CC$Description <- factor(plot_CC$Description, levels = plot_CC$Description[order(plot_CC$GR)])
  
  plot <- rbind(plot_BP, plot_MF, plot_CC)
  plot$Species <- paste(datasets_2$description[i])

  if (i == 1) {
    plot_df <- plot
  } else {
    plot_df <- rbind(plot_df, plot)
  }
}

length(unique(plot_df$ID))

## [1] 123

ggplot(plot_df, aes(x = Species, y = Description, fill = GR)) + 
  geom_tile(width = 1, height = 1) +
  scale_fill_gradient2(low = "white", mid = "blue", high = "red", space = "Lab", name = "Gene Ratio") +
  facet_grid(Category ~ ., scales = "free_y") +
  scale_x_discrete(position = "top") +
  labs(
      title = paste("Top", cutoff, "enriched GO terms"),
      x = "",
      y = ""
    )

Word clouds

Another way to capture information about GO terms is to use the GO description texts as basis for wordclouds. I’m using the general method described here. I’m also included a list of words to remove, like process, activity, etc. These words occur often and on their own don’t tell us much about the GO term.

library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

wordcloud_function <- function(data = data,
                               removewords = c("process", "activity", "positive", "negative", "response", "regulation"),
                               min.freq = 4,
                               max.words=Inf,
                               random.order=TRUE){
  input <- Corpus(VectorSource(data))
  
  input <- tm_map(input, content_transformer(tolower))
  input <- tm_map(input, content_transformer(removePunctuation))
  input <- tm_map(input, removeNumbers)
  input <- tm_map(input, stripWhitespace)
  
  toSpace <- content_transformer(function(x , pattern ) gsub(pattern, " ", x))
  input <- tm_map(input, toSpace, "/")
  input <- tm_map(input, toSpace, "@")
  input <- tm_map(input, toSpace, "\\|")
  
  input <- tm_map(input, function(x) removeWords(x, stopwords("english")))
  
  # specify your stopwords as a character vector
  input <- tm_map(input, removeWords, removewords)
  
  tdm <- TermDocumentMatrix(input)
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing = TRUE)
  d <- data.frame(word = names(v),freq = v)
  
  set.seed(1234)
  wordcloud(words = d$word, freq = d$freq, min.freq = min.freq, scale = c(8,.2),
            max.words = max.words, random.order = random.order, rot.per = 0.15,
            colors = brewer.pal(8, "Dark2"))
}

layout(matrix(c(1:80), nrow = 4, byrow = FALSE), heights = c(0.1, 1))

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  df_BP <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_")))
  df_MF <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_")))
  df_CC <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_")))
  
  par(mar = rep(0, 4))
  plot.new()
  text(x = 0.5, y = 0.5, paste0(datasets_2$description[i]), cex = 2)
  wordcloud_function(data = df_BP$Description)
  wordcloud_function(data = df_MF$Description)
  wordcloud_function(data = df_CC$Description)
}

Which GO terms are enriched in all species?

I also want to know which specific GO categories are enriched in all species. Even though we might miss related terms, this will give us a first idea about the biological functions that are highly conserved between species.

I’m creating a list of enriched GO categories of all species and ask for common elements using the list.common function of the rlist package.

# create empty list to populate with a loop
go_list_BP <- lapply(datasets_2$dataset, function(x) NULL)
names(go_list_BP) <- paste(datasets_2$dataset)

for (species in datasets_2$dataset) {
  df_BP <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_")))
  go_list_BP[[species]] <- df_BP$ID
}

# Now I know which GO IDs are common but I want to know the description as well.
# Because they are in all go_enrich tables, I'm chosen one to subset

library(rlist)
common_gos_BP <- as.data.frame(go_enrich_BP_cfamiliaris_gene_ensembl) %>%
  filter(ID %in% list.common(go_list_BP)) %>%
  select(ID, Description) %>%
  arrange(Description)


go_list_MF <- lapply(datasets_2$dataset, function(x) NULL)
names(go_list_MF) <- paste(datasets_2$dataset)

for (species in datasets_2$dataset) {
  df_MF <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_")))
  go_list_MF[[species]] <- df_MF$ID
}

library(rlist)
common_gos_MF <- as.data.frame(go_enrich_MF_cfamiliaris_gene_ensembl) %>%
  filter(ID %in% list.common(go_list_MF)) %>%
  select(ID, Description) %>%
  arrange(Description)


go_list_CC <- lapply(datasets_2$dataset, function(x) NULL)
names(go_list_CC) <- paste(datasets_2$dataset)

for (species in datasets_2$dataset) {
  df_CC <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_")))
  go_list_CC[[species]] <- df_CC$ID
}

library(rlist)
common_gos_CC <- as.data.frame(go_enrich_CC_cfamiliaris_gene_ensembl) %>%
  filter(ID %in% list.common(go_list_CC)) %>%
  select(ID, Description) %>%
  arrange(Description)

To visualize these common GO categories, I’m going back to the clusterProfiler output tables. This time I’m not using a heatmap, but a point plot to show enrichment statistics for all species.

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  
  go_enrich_BP_df <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_"))) %>%
    filter(ID %in% list.common(go_list_BP)) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "BP")
  
  go_enrich_BP_df$Description <- factor(go_enrich_BP_df$Description, levels = go_enrich_BP_df$Description[order(go_enrich_BP_df$GR)])
    
  go_enrich_MF_df <- as.data.frame(get(paste("go_enrich_MF", species, sep = "_"))) %>%
    filter(ID %in% list.common(go_list_MF)) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "MF")
  
  go_enrich_MF_df$Description <- factor(go_enrich_MF_df$Description, levels = go_enrich_MF_df$Description[order(go_enrich_MF_df$GR)])

  go_enrich_CC_df <- as.data.frame(get(paste("go_enrich_CC", species, sep = "_"))) %>%
    filter(ID %in% list.common(go_list_CC)) %>%
    mutate(GR = as.numeric(Count) / as.numeric(gsub(".*/", "", GeneRatio)),
           Category = "CC")
  
  go_enrich_CC_df$Description <- factor(go_enrich_CC_df$Description, levels = go_enrich_CC_df$Description[order(go_enrich_CC_df$GR)])
  
  plot <- rbind(go_enrich_BP_df, go_enrich_MF_df, go_enrich_CC_df)
  plot$Species <- paste(datasets_2$description[i])

  if (i == 1) {
    plot_df <- plot
  } else {
    plot_df <- rbind(plot_df, plot)
  }
}

ggplot(plot_df, aes(x = GR, y = Description, size = Count, color = p.adjust)) +
    geom_point() +
    labs(
      title = "Common enriched GO terms",
      x = "Gene Ratio",
      y = ""
    ) +
    facet_grid(Category ~ Species, scales = "free_y")

Gene Ratio is shown on the x-axis, point size shows the gene counts and point color shows the adjusted p-value.

GO hierachical structure

The package RamiGO can be used to plot all parents of the 10 common GO categories:

library(RamiGO)
library(RColorBrewer)

set3 <- brewer.pal(10, "Set3")
goIDs <- list.common(go_list_BP)
getAmigoTree(goIDs, set3, filename = "common_gos_BP", picType = "png", modeType = "amigo", saveResult = TRUE)

set3 <- brewer.pal(8, "Set3")
goIDs <- list.common(go_list_MF)
getAmigoTree(goIDs, set3, filename = "common_gos_MF", picType = "png", modeType = "amigo", saveResult = TRUE)

goIDs <- list.common(go_list_CC)
getAmigoTree(goIDs, set3, filename = "common_gos_CC", picType = "png", modeType = "amigo", saveResult = TRUE)

Common GOs BP

Common GOs MF

Common GOs CC

GO network

To visualize clusters of species-specific and common enriched GO categories within their hierachical structure, I’m creating a network of the top 15 enriched biological process categories of each species with two levels of parent terms.

for (i in 1:nrow(datasets_2)) {
  species <- datasets_2$dataset[i]
  df_BP <- as.data.frame(get(paste("go_enrich_BP", species, sep = "_")))
  df_BP <- df_BP[1:15, 1, drop = FALSE]
  df_BP$species <- datasets_2$description[i]
  
  if (i == 1) {
    df_BP_all <- df_BP
  } else {
    df_BP_all <- rbind(df_BP_all, df_BP)
  }
}

length(unique(df_BP_all$ID))

## [1] 69

Of these top GO categories most are not unique to individual species.

In the network I want to show GO categories that are species-specific and those that are common among all species. To create the annotation data with this information I’m creating a dataframe with the names of the species along with their specific GO term and combine this with the common GO terms from above.

library(tidyr)
df_BP_all$value <- 1
df_BP_all <- spread(df_BP_all, species, value)

go_sums <- data.frame(ID = df_BP_all$ID,
                      sum = rowSums(!is.na(df_BP_all[, -1])))

# GO in all species
go_anno <- data.frame(ID = go_sums$ID)
go_anno$group <- ifelse(go_sums$sum == 10, "common_all", NA)

x <- subset(go_sums, sum == 1)
xx <- subset(df_BP_all, ID %in% x$ID)

for (i in 2:11) {
  xxx <- xx[, i]
  xxx[which(!is.na(xxx))] <- colnames(xx[, i, drop = FALSE])
  unique <- as.data.frame(cbind(xx[, 1], xxx)) %>%
    subset(!is.na(xxx))
  
  if (i == 2) {
    unique_all <- unique
  } else {
    unique_all <- rbind(unique_all, unique)
  }
}

# Common GOs
unique_common_all <- rbind(unique_all, data.frame(V1 = list.common(go_list_BP), xxx = "common"))

To identify the parent GO terms, I’m using the GO.db package. I’m adding two levels of parental structure for each GO term.

library(GO.db)

# GO parents list
ontology <- as.list(GOBPPARENTS)

goChilds <- unique(df_BP_all$ID)
goChilds <- goChilds[goChilds %in% names(ontology)]

adjList <- matrix(nrow = 0, ncol = 3)
colnames(adjList) <- c("parent", "rel", "child")

# Find parents
goChilds <- goChilds[which(goChilds != "all")]
        
goFamilyList <- lapply(ontology[goChilds], function(x) data.frame(parent = x, rel = names(x)))
adjList <- rbind(adjList, as.matrix(cbind(do.call(rbind, goFamilyList), child = rep(names(goFamilyList), sapply(goFamilyList, nrow)))))

# Next parents
goParents <- unique(as.character(adjList[which(!adjList[,"parent"] %in% adjList[,"child"]), "parent"]))    
goFamilyList_2 <- lapply(ontology[goParents], function(x) data.frame(parent = x, rel = names(x)))
adjList <- rbind(adjList, as.matrix(cbind(do.call(rbind, goFamilyList_2), child = rep(names(goFamilyList_2), sapply(goFamilyList_2, nrow)))))

adjList <- adjList[which(adjList[,1] != "all"), , drop = FALSE]
adjList <- adjList[, c("child","parent", "rel"), drop = FALSE]

The graph is then prodcued with igraph as a directed network. Colors show species-specific, common and remaining enriched GO categories. Parents are shown in light blue.

library(igraph)

goGraph <- graph.edgelist(adjList[, c("parent", "child"), drop = FALSE], directed = TRUE)
goGraph <- set.edge.attribute(goGraph, "type", value = adjList[,"rel"])

goTerms <- sapply(V(goGraph)$name, function(x)
        {
            term <- GOTERM[[x]]@Term
            
                term <- gsub("( process)$", " pr.", term)
                term <- gsub(" development ", " dev. ", term)
                term <- gsub("( development)$", " dev.", term, perl = TRUE)
                term <- gsub("Development ", "Dev. ", term)
            
            # split in two lines
            spaceLoc <- gregexpr(" ", term, fixed = TRUE)[[1]]
            if ((spaceLoc[1] != -1) && (length(spaceLoc) > 1))
            {
                spaceLoc <- spaceLoc[ceiling(length(spaceLoc)/2)]
                term <- paste(substr(term, 1, spaceLoc - 1), substr(term, spaceLoc + 1, nchar(term)), sep = "\n")
            }

            term
})

goTerms_df <- as.data.frame(goTerms)

goTerms_df$color <- ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "C. elegans"), 1], "green",
                           ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Chicken"), 1], "yellowgreen",
                                  ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Chimpanzee"), 1], "violetred",
                                         ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Dog"), 1], "turquoise",
                                                ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Fruitfly"), 1], "violet", 
                                                       ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Human"), 1], "thistle", 
                                                              ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Mouse"), 1], "tan1", 
                                                                     ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Pig"), 1], "steelblue", 
                                                                            ifelse(rownames(goTerms_df) %in% unique_common_all[which(unique_common_all$xxx == "Rat"), 1], "orangered", 
                                                                                   ifelse(rownames(goTerms_df) %in% 
                                                                                   unique_common_all[which(unique_common_all$xxx == "Zebrafish"), 1], "olivedrab", 
                                                                                          ifelse(rownames(goTerms_df) %in% 
                                                                                          unique_common_all[which(unique_common_all$xxx == "common"), 1], "cyan", 
                                                                                          ifelse(rownames(goTerms_df) %in% goChilds, "red", "skyblue2"))))))))))))

V(goGraph)$color <- adjustcolor(goTerms_df$color, alpha.f = 0.8)
V(goGraph)$label <- goTerms

Ecolor <- ifelse(E(goGraph)$type == "is_a", "slategrey",
                           ifelse(E(goGraph)$type == "part_of", "springgreen4",
                           ifelse(E(goGraph)$type == "regulates", "tomato",
                           ifelse(E(goGraph)$type == "negatively_regulates", "orange", "greenyellow"))))

E(goGraph)$color <- Ecolor

plot(goGraph,
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     vertex.size = 2,
     margin = 0)

labels <- c("C. elegans", "Chicken", "Chimpanzee", "Dog", "Fruitfly", "Human", "Mouse", "Pig", "Rat", "Zebrafish", "common", "other enriched GO", "parent")
colors <- c("green", "yellowgreen", "violetred", "turquoise", "violet", "thistle", "tan1", "steelblue", "orangered", "olivedrab", "cyan", "red", "skyblue2")
legend("topleft", labels, pch = 19,
       col = colors, pt.cex = 2, cex = 2, bty = "n", ncol = 2)

For the output network graph, see summary at the beginning of this post.

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] igraph_1.0.1          GO.db_3.4.0           tidyr_0.6.0          
##  [4] RamiGO_1.20.0         gsubfn_0.6-6          proto_1.0.0          
##  [7] rlist_0.4.6.1         wordcloud_2.5         RColorBrewer_1.1-2   
## [10] SnowballC_0.5.1       tm_0.6-2              NLP_0.1-9            
## [13] dplyr_0.5.0           clusterProfiler_3.2.9 DOSE_3.0.9           
## [16] ggplot2_2.2.1         org.Cf.eg.db_3.4.0    org.Mm.eg.db_3.4.0   
## [19] org.Dm.eg.db_3.4.0    org.Dr.eg.db_3.4.0    org.Gg.eg.db_3.4.0   
## [22] org.Hs.eg.db_3.4.0    org.Ss.eg.db_3.4.0    org.Pt.eg.db_3.4.0   
## [25] org.Ce.eg.db_3.4.0    org.Sc.sgd.db_3.4.0   org.Rn.eg.db_3.4.0   
## [28] AnnotationDbi_1.36.0  IRanges_2.8.1         S4Vectors_0.12.1     
## [31] Biobase_2.34.0        BiocGenerics_0.20.0   biomaRt_2.30.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8        png_0.1-7          assertthat_0.1    
##  [4] rprojroot_1.1      digest_0.6.11      slam_0.1-40       
##  [7] R6_2.2.0           plyr_1.8.4         backports_1.0.4   
## [10] RSQLite_1.1-1      evaluate_0.10      lazyeval_0.2.0    
## [13] data.table_1.10.0  qvalue_2.6.0       rmarkdown_1.3     
## [16] labeling_0.3       splines_3.3.2      BiocParallel_1.8.1
## [19] RCytoscape_1.24.1  stringr_1.1.0      RCurl_1.95-4.8    
## [22] munsell_0.4.3      fgsea_1.0.2        htmltools_0.3.5   
## [25] tcltk_3.3.2        tibble_1.2         gridExtra_2.2.1   
## [28] codetools_0.2-15   XML_3.98-1.5       bitops_1.0-6      
## [31] grid_3.3.2         gtable_0.2.0       DBI_0.5-1         
## [34] magrittr_1.5       scales_0.4.1       graph_1.52.0      
## [37] stringi_1.1.2      GOSemSim_2.0.3     reshape2_1.4.2    
## [40] DO.db_2.9          fastmatch_1.0-4    tools_3.3.2       
## [43] yaml_2.1.14        colorspace_1.3-2   memoise_1.0.0     
## [46] knitr_1.15.1       XMLRPC_0.3-0

How to map your Google location history with R

2016-12-30 00:00:00 +0000

It’s no secret that Google Big Brothers most of us. But at least they allow us to access quite a lot of the data they have collected on us. Among this is the Google location history.

If you want to see a few ways how to quickly and easily visualize your location history with R, stay tuned…

The Google location history can be downloaded from your Google account under https://takeout.google.com/settings/takeout. Make sure you only tick “location history” for download, otherwise it will take super long to get all your Google data.

The data Google provides you for download is a .json file and can be loaded with the jsonlite package. Loading this file into R might take a few minutes because it can be quite big, depending on how many location points Google had saved about you.

library(jsonlite)
system.time(x <- fromJSON("Standortverlauf.json"))

##    user  system elapsed 
##  112.52    1.12  132.59

The date and time column is in the POSIX milliseconds format, so I converted it to human readable POSIX.

Similarly, longitude and latitude are saved in E7 format and were converted to GPS coordinates.

# extracting the locations dataframe
loc = x$locations

# converting time column from posix milliseconds into a readable time scale
loc$time = as.POSIXct(as.numeric(x$locations$timestampMs)/1000, origin = "1970-01-01")

# converting longitude and latitude from E7 to GPS coordinates
loc$lat = loc$latitudeE7 / 1e7
loc$lon = loc$longitudeE7 / 1e7

This is how the data looks like now:

head(loc)

##     timestampMs latitudeE7 longitudeE7 accuracy                 activitys
## 1 1482393378938  519601402    76004708       29                      NULL
## 2 1482393333953  519601402    76004708       29                      NULL
## 3 1482393033893  519603616    76002628       20 1482393165600, still, 100
## 4 1482392814435  519603684    76001572       20 1482392817678, still, 100
## 5 1482392734911  519603684    76001572       20                      NULL
## 6 1482392433788  519603684    76001572       20                      NULL
##   velocity heading altitude                time      lat      lon
## 1       NA      NA       NA 2016-12-22 08:56:18 51.96014 7.600471
## 2       NA      NA       NA 2016-12-22 08:55:33 51.96014 7.600471
## 3       NA      NA       NA 2016-12-22 08:50:33 51.96036 7.600263
## 4       NA      NA       NA 2016-12-22 08:46:54 51.96037 7.600157
## 5       NA      NA       NA 2016-12-22 08:45:34 51.96037 7.600157
## 6       NA      NA       NA 2016-12-22 08:40:33 51.96037 7.600157

We have the original and converted time, latitude and longitude columns, plus accuracy, activities, velocity, heading and altitude. Accuracy gives the error distance around the point in metres. Activities are saved as a list of data frames and will be explored further down. Velocity, heading and altitude were not recorded for earlier data points.

Data stats

Before I get to actually plotting maps, I want to explore a few basic statistics of the data.

How many data points did Google record over what period of time?

# how many rows are in the data frame?
nrow(loc)

## [1] 600897

min(loc$time)

## [1] "2013-09-06 19:33:41 CEST"

max(loc$time)

## [1] "2016-12-22 08:56:18 CET"

And how are they distributed over days, months and years?

# calculate the number of data points per day, month and year
library(lubridate)
library(zoo)

loc$date <- as.Date(loc$time, '%Y/%m/%d')
loc$year <- year(loc$date)
loc$month_year <- as.yearmon(loc$date)

points_p_day <- data.frame(table(loc$date), group = "day")
points_p_month <- data.frame(table(loc$month_year), group = "month")
points_p_year <- data.frame(table(loc$year), group = "year")

How many days were recorded?

nrow(points_p_day)

## [1] 1057

How many months?

nrow(points_p_month)

## [1] 39

And how many years?

nrow(points_p_year)

## [1] 4

# set up plotting theme
library(ggplot2)
library(ggmap)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_grey(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "navy"),
    legend.position = "right",
    legend.background = element_blank(),
    panel.margin = unit(.5, "lines"),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

points <- rbind(points_p_day[, -1], points_p_month[, -1], points_p_year[, -1])

ggplot(points, aes(x = group, y = Freq)) + 
  geom_point(position = position_jitter(width = 0.2), alpha = 0.3) + 
  geom_boxplot(aes(color = group), size = 1, outlier.colour = NA) + 
  facet_grid(group ~ ., scales = "free") + my_theme() +
  theme(
    legend.position = "none",
    strip.placement = "outside",
    strip.background = element_blank(),
    strip.text = element_blank(),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)
  ) +
  labs(
    x = "",
    y = "Number of data points",
    title = "How many data points did Google collect about me?",
    subtitle = "Number of data points per day, month and year",
    caption = "\nGoogle collected between 0 and 1500 data points per day
    (median ~500), between 0 and 40,000 per month (median ~15,000) and 
    between 80,000 and 220,000 per year (median ~140,000)."
  )

How accurate is the data?

Accuracy is given in meters, i.e. the smaller the better.

accuracy <- data.frame(accuracy = loc$accuracy, group = ifelse(loc$accuracy < 800, "high", ifelse(loc$accuracy < 5000, "middle", "low")))

accuracy$group <- factor(accuracy$group, levels = c("high", "middle", "low"))

ggplot(accuracy, aes(x = accuracy, fill = group)) + 
  geom_histogram() + 
  facet_grid(group ~ ., scales="free") + 
  my_theme() +
  theme(
    legend.position = "none",
    strip.placement = "outside",
    strip.background = element_blank(),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)
  ) +
  labs(
    x = "Accuracy in metres",
    y = "Count",
    title = "How accurate is the location data?",
    subtitle = "Histogram of accuracy of location points",
    caption = "\nMost data points are pretty accurate, 
but there are still many data points with a high inaccuracy.
    These were probably from areas with bad satellite reception."
  )

Plotting data points on maps

Finally, we are actually going to plot some maps!

The first map is a simple point plot of all locations recorded around Germany.

You first specify the map area and the zoom factor (the furthest away is 1); the bigger the zoom, the closer to the center of the specified location. Location can be given as longitude/latitude pair or as city or country name.

On this map we can plot different types of plot with the regular ggplot2 syntax. For example, a point plot.

germany <- get_map(location = 'Germany', zoom = 5)

ggmap(germany) + geom_point(data = loc, aes(x = lon, y = lat), alpha = 0.5, color = "red") + 
  theme(legend.position = "right") + 
  labs(
    x = "Longitude", 
    y = "Latitude", 
    title = "Location history data points in Europe",
    caption = "\nA simple point plot shows recorded positions.")

The second map shows a 2D bin plot of accuracy measured for all data points recorded in my home town Münster.

munster <- get_map(location = 'Munster', zoom = 12)

options(stringsAsFactors = T)
ggmap(munster) + 
  stat_summary_2d(geom = "tile", bins = 100, data = loc, aes(x = lon, y = lat, z = accuracy), alpha = 0.5) + 
  scale_fill_gradient(low = "blue", high = "red", guide = guide_legend(title = "Accuracy")) +
  labs(
    x = "Longitude", 
    y = "Latitude", 
    title = "Location history data points around Münster",
    subtitle = "Color scale shows accuracy (low: blue, high: red)",
    caption = "\nThis bin plot shows recorded positions 
    and their accuracy in and around Münster")

We can also plot the velocity of each data point:

loc_2 <- loc[which(!is.na(loc$velocity)), ]

munster <- get_map(location = 'Munster', zoom = 10)

ggmap(munster) + geom_point(data = loc_2, aes(x = lon, y = lat, color = velocity), alpha = 0.3) + 
  theme(legend.position = "right") + 
  labs(x = "Longitude", y = "Latitude", 
       title = "Location history data points in Münster",
       subtitle = "Color scale shows velocity measured for location",
       caption = "\nA point plot where points are colored according 
       to velocity nicely reflects that I moved generally 
       slower in the city center than on the autobahn") +
  scale_colour_gradient(low = "blue", high = "red", guide = guide_legend(title = "Velocity"))

What distance did I travel?

To obtain the distance I moved, I am calculating the distance between data points. Because it takes a long time to calculate, I am only looking at data from last year.

loc3 <- with(loc, subset(loc, loc$time > as.POSIXct('2016-01-01 0:00:01')))
loc3 <- with(loc, subset(loc3, loc$time < as.POSIXct('2016-12-22 23:59:59')))

# Shifting vectors for latitude and longitude to include end position
shift.vec <- function(vec, shift){
  if (length(vec) <= abs(shift)){
    rep(NA ,length(vec))
  } else {
    if (shift >= 0) {
      c(rep(NA, shift), vec[1:(length(vec) - shift)]) }
    else {
      c(vec[(abs(shift) + 1):length(vec)], rep(NA, abs(shift)))
    }
  }
}

loc3$lat.p1 <- shift.vec(loc3$lat, -1)
loc3$lon.p1 <- shift.vec(loc3$lon, -1)

# Calculating distances between points (in metres) with the function pointDistance from the 'raster' package.
library(raster)
loc3$dist.to.prev <- apply(loc3, 1, FUN = function(row) {
  pointDistance(c(as.numeric(as.character(row["lat.p1"])),
                  as.numeric(as.character(row["lon.p1"]))),
                c(as.numeric(as.character(row["lat"])), as.numeric(as.character(row["lon"]))),
                lonlat = T) # Parameter 'lonlat' has to be TRUE!
})

# distance in km
round(sum(as.numeric(as.character(loc3$dist.to.prev)), na.rm = TRUE)*0.001, digits = 2)

## [1] 54466.08

distance_p_month <- aggregate(loc3$dist.to.prev, by = list(month_year = as.factor(loc3$month_year)), FUN = sum)
distance_p_month$x <- distance_p_month$x*0.001

ggplot(distance_p_month[-1, ], aes(x = month_year, y = x,  fill = month_year)) + 
  geom_bar(stat = "identity")  + 
  guides(fill = FALSE) +
  my_theme() +
  labs(
    x = "",
    y = "Distance in km",
    title = "Distance traveled per month in 2016",
    caption = "This barplot shows the sum of distances between recorded 
    positions for 2016. In September we went to the US and Canada."
  )

Activities

Google also guesses my activity based on distance travelled per time.

Here again, it would take too long to look at activity from all data points.

activities <- loc3$activitys

list.condition <- sapply(activities, function(x) !is.null(x[[1]]))
activities  <- activities[list.condition]

df <- do.call("rbind", activities)
main_activity <- sapply(df$activities, function(x) x[[1]][1][[1]][1])

activities_2 <- data.frame(main_activity = main_activity, 
                           time = as.POSIXct(as.numeric(df$timestampMs)/1000, origin = "1970-01-01"))

head(activities_2)

##   main_activity                time
## 1         still 2016-12-22 08:52:45
## 2         still 2016-12-22 08:46:57
## 3         still 2016-12-22 08:33:24
## 4         still 2016-12-22 08:21:31
## 5         still 2016-12-22 08:15:32
## 6         still 2016-12-22 08:10:25

ggplot(activities_2, aes(x = main_activity, group = main_activity, fill = main_activity)) + 
  geom_bar()  + 
  guides(fill = FALSE) +
  my_theme() +
  labs(
    x = "",
    y = "Count",
    title = "Main activities in 2016",
    caption = "Associated activity for recorded positions in 2016. 
    Because Google records activity probabilities for each position, 
    only the activity with highest likelihood were chosen for each position."
  )

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] raster_2.5-8    sp_1.2-3        ggmap_2.6.2     ggplot2_2.2.0  
## [5] zoo_1.7-13      lubridate_1.6.0 jsonlite_1.1   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8       plyr_1.8.4        bitops_1.0-6     
##  [4] tools_3.3.2       digest_0.6.10     evaluate_0.10    
##  [7] tibble_1.2        gtable_0.2.0      lattice_0.20-34  
## [10] png_0.1-7         DBI_0.5-1         mapproj_1.2-4    
## [13] yaml_2.1.14       proto_1.0.0       stringr_1.1.0    
## [16] dplyr_0.5.0       knitr_1.15.1      RgoogleMaps_1.4.1
## [19] maps_3.1.1        rprojroot_1.1     grid_3.3.2       
## [22] R6_2.2.0          jpeg_0.1-8        rmarkdown_1.2    
## [25] reshape2_1.4.2    magrittr_1.5      codetools_0.2-15 
## [28] backports_1.0.4   scales_0.4.1      htmltools_0.3.5  
## [31] assertthat_0.1    colorspace_1.3-2  geosphere_1.5-5  
## [34] labeling_0.3      stringi_1.1.2     lazyeval_0.2.0   
## [37] munsell_0.4.3     rjson_0.2.15

Animating Plots of Beer Ingredients and Sin Taxes over Time

2016-12-22 00:00:00 +0000

With the upcoming holidays, I thought it fitting to finally explore the ttbbeer package. It contains data on beer ingredients used in US breweries from 2006 to 2015 and on the (sin) tax rates for beer, champagne, distilled spirits, wine and various tobacco items since 1862.

I will be exploring these datasets and create animated plots with the animation package and the gganimate extension for ggplot2.

Merry R-Mas and Cheers! :-)

(For full code see below…)

Beer materials since 2006

This dataset includes the monthly aggregates of ingredients as reported by the Beer Monthly Statistical Releases, the Brewer’s Report of Operations and the Quarterly Brewer’s Report of Operations from the US Alcohol and Tobacco Tax and Trade Bureau (TTB).

The following nine ingredients are listed:

Malt and malt products
Corn and corn products
Rice and rice products
Barley and barley products
Wheat and wheat products
Sugar and syrups
Dry hops
Hops extracts
Other ingredients, like flavors, etc.

All ingredient amounts are given in pounds.

I’m not only interested in the total amounts of ingredients that were used but also in the change compared to the previous month. This, I’m calculating as the percent of change compared to the previous entry, starting with 0 in January 2006.

The first thing I wanted to know was how many ingredients were used in total each month. This probably gives a rough estimate of how much beer was produced that month.

The seasonal variation in ingredient use shows that more beer is produced in late spring/ early summer than in winter. Given that it takes about one to four months from production to the finished product, it fits my intuition that people drink more beer when it’s warm outside rather than in the cold of winter.

Now lets look at the ingredients separately. This plot shows the monthly sum of pounds for each ingredient:

We can see the following trends:

The seasonal trend in overall beer production is reflected in the individual ingredients
Barley and wheat use has been steadily increasing since 2006
Corn and corn products has spiked between 2008 and 2011 and slightly dropped since
Dry hops has two very prominent spikes in 2014 and 2015, we will look at these in more detail in the following figures
The use of hops extracts, malt and rice has slighty decreased since 2006
Sugar and syrup use has dropped of between 2006 and 2010 and has since been relatively stable, with the exception of two months in 2014 and 2015
Other additives, like flavors, have increased sharply since 2009; in the last two years their use shows extremely high variability between summer (many added) and winter (fewer added); this might be due to the popularity of craft beers and light summer variants.

We can see that barley use spikes every third month, except for a few outlier months. Dy hops is interesting as there are two very prominent spikes in December 2014 and in December 2015. In the same months we see a drop for corn, hops extracts, malt, other ingredients, sugar and syrups and wheat.

The following plot shows the percent of change compared to the previous month for the monthly sums of ingredients:

Here we can again see the seasonality of ingredient use, while we can also see that the overall trend shows no significant change from zero.

The two outlier months, December 2014 and 2015 are quite conspicuous in these plots as well. This might be due to specific trends or there were shortages of supply for fresh hops and malt during these months (even though we don’t see a drop in overall ingredient amounts).

The first animated plot shows the percentages of ingredients that were used per month. Each frame shows one year.

The percentage plots show that there are no major differences in the proportions of ingredients, again with the exception of December 2014 and 2015.

Sin taxes since 1862

Tax rates for beer, champagne, distilled spirits, wine (separated into three alcohol percentage classes: up to 14%, 14 - 21% and 21 24%) and various tobacco items are included in separate datasets. The tax rates are given in US Dollar per

Barrel of beer (31 gallons)
Wine gallon
Proof barrel of distilled spirits
1000 units of small and large cigarettes and cigars
1 lb. of pipe, chewing and roll-your-own tobacco and snuff

Before I can visualize the data, I need to to do some tidying and preprocessing:

I’m adding a column with the tax group
to keep it consistent, I’m converting all tobacco item names to upper case letters
to dates in the tobacco set were probably wrong, because they messed up the plots, so I corrected them

For each tax group I also want to know the percent of change in tax rate compared to the previous time point.

The first plot shows a time line of tax rate changes since 1862. Major historical events are marked.

The strongest increase in tax rates can be seen for champagne in the 1950s and for wine (14%) following the end of the Cold War. At the end of the civil war, there is a sharp drop in the tax rates for distilled spirits. Spirit tax increased little between 1874 and the beginning of prohibition. During prohibition we can also see an increase of tax rates for champagne and wine (21-24%). While tax rates remained stable during the Cold War, they increased again in 1991 for all drinks, except for spirits.

The same plot as above but with absolute tax rates is visualized in the animated plot below:

The plot below catches the same information in a static plot:

All tax rates have increased strongly since 1862. Beer and wine show a strong jump in 1991 and at the beginning of prohibition. Between the end of prohibition and World War II, beer and wine taxes had dropped again, albeit slightly. Since the 1940s, all alcohol taxes have risen steadily. For tobacoo there is only sporadic information.

R code

library(ttbbeer)
data("beermaterials")

library(dplyr)

# order by year from least to most recent
beermaterials_change <- arrange(beermaterials, Year)

# calculate percent change for each ingredient column
for (i in 3:ncol(beermaterials_change)){
  
  value <- beermaterials_change[[colnames(beermaterials_change)[i]]]
  beermaterials_change[, i] <- c(0, 
                       100 * round((value[-1] - value[1:(length(value)-1)]) /
                                     value[1:(length(value)-1)], 2))
  
}

head(beermaterials_change)

## # A tibble: 6 × 11
##      Month  Year Malt_and_malt_products `Corn_and _corn_products`
##      <chr> <int>                  <dbl>                     <dbl>
## 1  January  2006                      0                         0
## 2 February  2006                     -8                        -1
## 3    March  2006                     11                         7
## 4    April  2006                      2                        -6
## 5      May  2006                      5                        10
## 6     June  2006                     -2                       -15
## # ... with 7 more variables: Rice_and_rice_products <dbl>,
## #   Barley_and_barley_products <dbl>, Wheat_and_wheat_products <dbl>,
## #   Sugar_and_syrups <dbl>, Hops_dry <dbl>, Hops_extracts <dbl>,
## #   Other <dbl>

#calcluate sum of ingredients per month
beermaterials$sum <- rowSums(beermaterials[, -c(1:2)])

For plotting I’m converting the tables from wide to long format:

library(tidyr)

beermaterials_gather <- beermaterials %>%
  gather(Ingredient, Amount, Malt_and_malt_products:Other)
beermaterials_gather$Date <- paste("01", beermaterials_gather$Month, beermaterials_gather$Year, sep = "-") %>%
  as.Date(format = "%d-%B-%Y")
beermaterials_gather$Ingredient <- gsub("_", " ", beermaterials_gather$Ingredient)

beermaterials_change_gather <- beermaterials_change %>%
  gather(Ingredient, Amount, Malt_and_malt_products:Other)
beermaterials_change_gather$Date <- paste("01", beermaterials_change_gather$Month, beermaterials_change_gather$Year, sep = "-") %>%
  as.Date(format = "%d-%B-%Y")
beermaterials_change_gather$Ingredient <- gsub("_", " ", beermaterials_change_gather$Ingredient)

And I’m preparing my custom ggplot2 theme:

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "royalblue", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "bottom",
    legend.justification = "top", 
    legend.box = "horizontal",
    legend.box.background = element_rect(colour = "grey50"),
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())

ggplot(beermaterials_gather, aes(x = Date, y = sum)) +
  geom_point(size = 1.5, alpha = 0.6, color = "navy") +
  geom_smooth(alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_line(color = "navy") +
  guides(color = FALSE) +
  labs(
    x = "",
    y = "Amount in pounds",
    title = "Sum of Beer Ingredients by Month and Year",
    subtitle = "Sum of monthly aggregates of all ingredients from 2006 to 2015"
  ) +
  scale_x_date(date_breaks = "6 month", date_labels = "%B %Y")

ggplot(beermaterials_gather, aes(x = Date, y = Amount, color = Ingredient)) +
  geom_point(size = 1.5, alpha = 0.6) +
  geom_smooth(alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_line() +
  facet_wrap(~Ingredient, ncol = 3, scales = "free_y") +
  guides(color = FALSE) +
  labs(
    x = "",
    y = "Amount in pounds",
    title = "Beer Ingredients by Month and Year",
    subtitle = "Monthly aggregates of ingredients from 2006 to 2015"
  ) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")

ggplot(beermaterials_gather, aes(x = Date, y = Amount, color = Ingredient)) +
  geom_point(size = 1.5, alpha = 0.6) +
  guides(color = FALSE) +
  geom_line() +
  labs(
    x = "",
    y = "Amount in pounds",
    title = "Beer Ingredients by Month and Year",
    subtitle = "from January 2006 to December 2015"
  ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%B %Y") +
  facet_wrap(~Ingredient, ncol = 1, scales = "free_y")

To explore the seasonal changes in more detail, I’m widening the resolution of the x-axis.

ggplot(beermaterials_change_gather, aes(x = Date, y = Amount, color = Ingredient)) +
  geom_smooth(alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_point(size = 1.5, alpha = 0.6) +
  geom_line() +
  facet_wrap(~Ingredient, ncol = 3, scales = "free_y") +
  guides(color = FALSE) +
  labs(
    x = "",
    y = "Percent change",
    title = "Beer Ingredients by Month and Year",
    subtitle = "Percent change of monthly aggregates of ingredients compared to previous month from 2006 to 2015"
  ) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")

beermaterials_percent <- cbind(beermaterials[, 1:2], prop.table(as.matrix(beermaterials[, -c(1:2)]), margin = 1) * 100)

beermaterials_percent_gather <- beermaterials_percent %>%
  gather(Ingredient, Amount, Malt_and_malt_products:Other)
beermaterials_percent_gather$Ingredient <- gsub("_", " ", beermaterials_percent_gather$Ingredient)

f = unique(beermaterials_percent_gather$Month)
beermaterials_percent_gather <- within(beermaterials_percent_gather, Month <- factor(Month, levels = f))

library(animation)
ani.options(ani.width=800, ani.height=500)

saveGIF({
  for (year in rev(unique(beermaterials_percent_gather$Year))){
  
  pie <- ggplot(subset(beermaterials_percent_gather, Year == paste(year)), aes(x = "", y = Amount, fill = Ingredient, frame = Year)) + 
  geom_bar(width = 1, stat = "identity") + theme_minimal() +
  coord_polar("y", start = 0) +
  labs(
    title = "Percentage of Beer Ingredients by Year and Month",
    subtitle = paste(year)) +
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.ticks = element_blank(),
  plot.title = element_text(size = 14, face = "bold"),
  legend.title = element_blank(),
  legend.position = "bottom",
  legend.text = element_text(size = 8)
  ) + 
  facet_wrap( ~ Month, ncol = 6) +
    guides(fill = guide_legend(ncol = 9, byrow = FALSE))

  print(pie)
  }
  
}, movie.name = "beer_ingredients.gif")

ggplot(beermaterials_percent_gather, aes(x = Month, y = Amount, color = Ingredient, fill = Ingredient)) + 
  geom_bar(width = 0.8, stat = "identity", alpha = 0.7) + 
  my_theme() +
  guides(color = FALSE) +
  facet_wrap( ~ Year, ncol = 5) +
  labs(
    x = "",
    y = "Percent",
    title = "Percentage of Beer Ingredients by Year and Month",
    subtitle = "from 2006 to 2015",
    fill = ""
  )

To see the information from the animated figure at one glance, the next plots shows the percentages of ingredients per month and year as a barplot:

beertax$tax <- "BEER"
champagnetax$tax <- "CHAMPAGNE"
spirittax$tax <- "SPIRIT"
tobaccotax$ITEM <- toupper(tobaccotax$ITEM)
tobaccotax$tax <- paste("TOBACCO", tobaccotax$ITEM, sep = " ")
tobaccotax <- tobaccotax[, -1]
tobaccotax[18, 2] <- "1977-01-31"
tobaccotax[18, 1] <- "1977-01-01"
winetax14$tax <- "WINE 14%"
winetax1421$tax <- "WINE 14-21%"
winetax2124$tax <- "WINE 21-24%"

taxes_list <- c("beertax", "champagnetax", "spirittax", "tobaccotax", "winetax14", "winetax1421", "winetax2124")

for (tax in taxes_list){
  
  tax_df <- get(tax)
  tax_df <- tax_df[match(sort(tax_df$FROM), tax_df$FROM), ]
  
  if (tax != "tobaccotax"){
    tax_df$change <- c(0, 
                       100 * round((tax_df$RATE[-1] - tax_df$RATE[1:(length(tax_df$RATE)-1)]) /
                                     tax_df$RATE[1:(length(tax_df$RATE)-1)], 2))

  } else {
    tax_df$change <- "NA"
  }
  
  if (tax == "beertax"){
    tax_df_2 <- tax_df
  } else {
    tax_df_2 <- rbind(tax_df_2, tax_df)
  }
}

tax_2 <- tax_df_2
tax_2$ID <- paste(tax_2$tax, rownames(tax_2), sep = "_")
head(tax_2)

##         FROM         TO RATE  tax change     ID
## 4 1862-09-01 1863-03-03  1.0 BEER      0 BEER_4
## 5 1863-03-03 1864-03-31  0.6 BEER    -40 BEER_5
## 6 1864-04-01 1898-06-13  1.0 BEER     67 BEER_6
## 7 1898-06-14 1901-06-30  2.0 BEER    100 BEER_7
## 8 1901-07-01 1902-06-30  1.6 BEER    -20 BEER_8
## 9 1902-07-01 1914-10-22  1.0 BEER    -38 BEER_9

Now I can gather the dataframe for plotting.

library(tidyr)
tax_gather <- tax_2 %>%
  gather(dat_column, Date, FROM:TO)
tax_gather$group <- gsub(" .*", "", tax_gather$tax)

d <- data.frame(xmin = as.Date(c("1861-04-12", "1914-07-28", "1920-01-16", "1939-11-01", "1947-03-12")), 
                ymin = -Inf, 
                xmax = as.Date(c("1865-04-09", "1918-11-11", "1933-12-05", "1945-05-08", "1991-12-26")), 
                ymax = Inf, 
                description = c("Civial War", "WW1", "Prohibition", "WW2", "Cold War"),
                pos_y = as.numeric(c(rep(c("1700", "1500"), 2), "1700")))
d$pos_x <- as.Date(NA)

for (i in 1:nrow(d)){
  d[i, "pos_x"] <- as.Date(as.character(median(c(d$xmin[i], d$xmax[i]))))
}

tax_gather$group2 <- ifelse(tax_gather$group == "TOBACCO", "TOBACCO", "ALCOHOL")

ggplot() +
  geom_rect(data = d, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), alpha=0.2, fill = "navy") +
  geom_label(data = d, aes(pos_x, pos_y, label = description), check_overlap = TRUE) +
  geom_point(data = subset(tax_gather, group2 == "ALCOHOL"), aes(x = Date, y = as.numeric(change), color = tax), size = 1.5, alpha = 0.6) +
  geom_line(data = subset(tax_gather, group2 == "ALCOHOL"), aes(x = Date, y = as.numeric(change), color = tax), size = 1) +
  facet_wrap(~group2, ncol = 1, scales = "free_y") +
  labs(
    x = "",
    y = "Tax rate change in %",
    color = "",
    title = "Alcohol tax rate changes during history",
    subtitle = "Change in percent compared to previous timepoint from 1862 to 2016"
  ) +
  guides(color = guide_legend(ncol = 6, byrow = FALSE)) +
  scale_x_date(date_breaks = "10 year", date_labels = "%Y")

d$pos_y <- as.numeric(c(rep(c("20", "18"), 2), "20"))

library(gganimate)

p <- ggplot() +
  geom_rect(data = d, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), alpha=0.2, fill = "navy") +
  geom_label(data = d, aes(pos_x, pos_y, label = description), check_overlap = TRUE) +
  geom_point(data = tax_gather, aes(x = Date, y = RATE, color = tax, frame = tax), size = 2, alpha = 0.6) +
  geom_line(data = tax_gather, aes(x = Date, y = RATE, color = tax, frame = tax), size = 1.5) +
  labs(
    x = "",
    y = "Tax rate",
    color = "",
    title = "Tax rates of ",
    subtitle = "from 1862 to 2016"
  ) +
  guides(color = guide_legend(ncol = 6, byrow = FALSE)) +
  scale_x_date(date_breaks = "10 year", date_labels = "%Y")

ani.options(ani.width=1000, ani.height=600)
gganimate(p, filename = "taxes_by_tax.mp4", cumulative = TRUE)
gganimate(p, filename = "taxes_by_tax.gif", cumulative = TRUE)

ggplot() +
  geom_rect(data = d, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), alpha=0.2, fill = "navy") +
  geom_smooth(data = tax_gather, aes(x = Date, y = RATE, color = tax), alpha = 0.3, color = "black", size = 0.5, linetype = "dashed") +
  geom_point(data = tax_gather, aes(x = Date, y = RATE, color = tax), size = 1.5, alpha = 0.6) +
  geom_line(data = tax_gather, aes(x = Date, y = RATE, color = tax), size = 1) +
  facet_wrap(~group, ncol = 2, scales = "free_y") +
  labs(
    x = "",
    y = "Tax rate",
    color = "",
    title = "Alcohol and tobacco tax rates",
    subtitle = "from 1862 to 2016"
  ) +
 guides(color = guide_legend(ncol = 6, byrow = FALSE)) +
 scale_x_date(date_breaks = "20 year", date_labels = "%Y")

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.0 tidyr_0.6.0   dplyr_0.5.0   ttbbeer_1.1.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8      knitr_1.15.1     magrittr_1.5     munsell_0.4.3   
##  [5] lattice_0.20-34  colorspace_1.3-2 R6_2.2.0         stringr_1.1.0   
##  [9] plyr_1.8.4       tools_3.3.2      grid_3.3.2       nlme_3.1-128    
## [13] gtable_0.2.0     mgcv_1.8-16      DBI_0.5-1        htmltools_0.3.5 
## [17] yaml_2.1.14      lazyeval_0.2.0   assertthat_0.1   rprojroot_1.1   
## [21] digest_0.6.10    tibble_1.2       Matrix_1.2-7.1   evaluate_0.10   
## [25] rmarkdown_1.2    labeling_0.3     stringi_1.1.2    scales_0.4.1    
## [29] backports_1.0.4

How to build a Shiny app for disease- & trait-associated locations of the human genome

2016-12-18 00:00:00 +0000

This app is based on the gwascat R package and its ebicat38 database and shows trait-associated SNP locations of the human genome. You can visualize and compare the genomic locations of up to 8 traits simultaneously.

The National Human Genome Research Institute (NHGRI) catalog of Genome-Wide Association Studies (GWAS) is a curated resource of single-nucleotide polymorphism (SNP)-trait associations. The database contains more than 100,000 SNPs and all SNP-trait associations with a p-value <1 × 10^−5.

You can access the app at

https://shiring.shinyapps.io/gwas_shiny_app/

Loading the data might take a few seconds. Patience you must have, my young padawan… ;-)

Alternatively, if you are using R, you can load the app via Github with shiny:

library(shiny)
runGitHub("ShirinG/GWAS_Shiny_App") 

If you want to know how I built this app, contine reading.

Data preparation

Initially, I wanted to load all data directly from R packages. And while this worked in principal, it made the app load super slowly.

So, I decided to prepare the data beforehand and save the datatables as tab-delimited text files and load them in the app.

Genome information

First, I’m preparing the genome information: for each chromosome I want to know its length:

library(AnnotationDbi)
library(org.Hs.eg.db)

library(EnsDb.Hsapiens.v79)

edb <- EnsDb.Hsapiens.v79

keys <- keys(edb, keytype="SEQNAME")
chromosome_length <- select(edb, keys = keys, columns = c("SEQLENGTH", "SEQNAME"), keytype = "SEQNAME")
chromosome_length <- chromosome_length[grep("^[0-9]+$|^X$|^Y$|^MT$", chromosome_length$SEQNAME), ]

write.table(chromosome_length, "chromosome_length.txt", row.names = FALSE, col.names = TRUE, sep = "\t")

GWAS SNP data

I am saving the GWAS data as a text file as well; this datatable will be used for plotting the SNP locations.

I am also saving a table with the alphabetically sorted traits as input for the drop-down menu.

library(gwascat)

data(ebicat38)

gwas38 <- as.data.frame(ebicat38)
gwas38$DISEASE.TRAIT <- gsub("&beta;", "beta", gwas38$DISEASE.TRAIT)

# PVALUE_MLOG: -log(p-value)
gwas38[is.infinite(gwas38$PVALUE_MLOG), "PVALUE_MLOG"] <- max(gwas38[is.finite(gwas38$PVALUE_MLOG), "PVALUE_MLOG"]) + 10
summary(gwas38[is.finite(gwas38$PVALUE_MLOG), "PVALUE_MLOG"])
summary(gwas38$PVALUE_MLOG)

# OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1. Appropriate unit and increase/decrease are included for beta coefficients.
summary(gwas38$OR.or.BETA)

write.table(gwas38, "gwas38.txt", row.names = FALSE, col.names = TRUE, sep = "\t")

gwas38_traits <- as.data.frame(table(gwas38$DISEASE.TRAIT))
colnames(gwas38_traits) <- c("Trait", "Frequency")

write.table(gwas38_traits, "gwas38_traits.txt", row.names = FALSE, col.names = TRUE, sep = "\t")

The Shiny App

I built my Shiny app with the traditional two-file system. This means that I have a “ui.R” file containing the layout and a “server.R” file, which contains the R code.

ui.R

From top to bottom I chose the following settings:

loading the table with all traits
adding a “choose below” option before the traits to have no trait chosen by default
“united” theme for layout
two slider bars, one for the p-value threshold and one for the odds ratio/ beta coefficient threshold
drop-down menus for all traits, maximal eight can be chosen to be plotted simultaneously
main panel with explanatory text, main plot and output tables

library(shiny)
library(shinythemes)

gwas38_traits <- read.table("gwas38_traits.txt", header = TRUE, sep = "\t")

diseases <- c("choose below", as.character(gwas38_traits$Trait))

shinyUI(fluidPage(theme = shinytheme("united"),
                  titlePanel("GWAS disease- & trait-associated SNP locations of the human genome"),

                  sidebarLayout(
                    sidebarPanel(

                      sliderInput("pvalmlog",
                                  "-log(p-value):",
                                  min = -13,
                                  max = 332,
                                  value = -13),

                      sliderInput("orbeta",
                                  "odds ratio/ beta-coefficient :",
                                  min = 0,
                                  max = 4426,
                                  value = 0),

                      selectInput("variable1", "First trait:",
                                  choices = diseases[-1]),

                      selectInput("variable2", "Second trait:",
                                  choices = diseases),

                      selectInput("variable3", "Third trait:",
                                  choices = diseases),

                      selectInput("variable4", "Fourth trait:",
                                  choices = diseases),

                      selectInput("variable5", "Fifth trait:",
                                  choices = diseases),

                      selectInput("variable6", "Sixth trait:",
                                  choices = diseases),

                      selectInput("variable7", "Seventh trait:",
                                  choices = diseases),

                      selectInput("variable8", "Eighth trait:",
                                  choices = diseases)
                    ),
                    mainPanel(
                      br(),
                      p("The National Human Genome Research Institute (NHGRI) catalog of Genome-Wide Association Studies (GWAS) is a curated resource of single-nucleotide polymorphisms (SNP)-trait associations. The database contains more than 100,000 SNPs and all SNP-trait associations with a p-value <1 × 10^−5."),
                      p("This app is based on the 'gwascat' R package and its 'ebicat38' database and shows trait-associated SNP locations of the human genome."),
                      p("For more info on how I built this app check out", a("my blog.", href = "https://shiring.github.io/")),
                      br(),
                      h4("How to use this app:"),
                      div("Out of 1320 available traits or diseases you can choose up to 8 on the left-side panel und see their chromosomal locations below. The traits are sorted alphabetically. You can also start typing in the drop-down panel and traits matching your query will be suggested.", style = "color:blue"),
                      br(),
                      div("With the two sliders on the left-side panel you can select SNPs above a p-value threshold (-log of association p-value) and/or above an odds ratio/ beta-coefficient threshold. The higher the -log of the p-value the more significant the association of the SNP with the trait. Beware that some SNPs have no odds ratio value and will be shown regardless of the threshold.", style = "color:blue"),
                      br(),
                      div("The table directly below the plot shows the number of SNPs for each selected trait (without subsetting when p-value or odds ratio are changed). The second table below the first shows detailed information for each SNP of the chosen traits. This table shows only SNPs which are plotted (it subsets according to p-value and odds ratio thresholds).", style = "color:blue"),
                      br(),
                      div("Loading might take a few seconds...", style = "color:red"),
                      br(),
                      plotOutput("plot"),
                      tableOutput('table'),
                      tableOutput('table2'),
                      br(),
                      p("GWAS catalog:", a("http://www.ebi.ac.uk/gwas/", href = "http://www.ebi.ac.uk/gwas/")),
                      p(a("Welter, Danielle et al. “The NHGRI GWAS Catalog, a Curated Resource of SNP-Trait Associations.” Nucleic Acids Research 42.Database issue (2014): D1001–D1006. PMC. Web. 17 Dec. 2016.", href = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965119/"))
                    )
                  )
))

server.R

This file contains the R code to produce the plot and output tables.

The chromosome data is loaded for the chromosome length barplot. In order to have the correct order, I am setting the chromosome factor order manually
renderPlot() contains the code for the plot
from the eight possible input traits, all unique traits are plotted (removing “choose below” first), unique traits were set here, so as not to plot the same trait on top of themselves if the same trait is chosen in two drop-down menus
the first output table shows the number of SNPs for the chosen traits
the second output table shows SNP information for all SNPs of the chosen trait(s) within the chosen thresholds

library(shiny)

chr_data <- read.table("chromosome_length.txt", header = TRUE, sep = "\t")

chr_data$SEQNAME <- as.factor(chr_data$SEQNAME)
f = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y", "MT")
chr_data <- within(chr_data, SEQNAME <- factor(SEQNAME, levels = f))

library(ggplot2)

gwas38 <- read.table("gwas38.txt", header = TRUE, sep = "\t")
gwas38_traits <- read.table("gwas38_traits.txt", header = TRUE, sep = "\t")

# Define server logic required to plot variables
shinyServer(function(input, output) {

  # Generate a plot of the requested variables
  output$plot <- renderPlot({

    plot_snps <- function(trait){

      if (any(trait == "choose below")){
        trait <- trait[-which(trait == "choose below")]
      } else {
        trait <- unique(trait)
      }

        for (i in 1:length(unique(trait))){

          trait_pre <- unique(trait)[i]
          snps_data_pre <- gwas38[which(gwas38$DISEASE.TRAIT == paste(trait_pre)), ]
          snps_data_pre <- data.frame(Chr = snps_data_pre$seqnames,
                                      Start = snps_data_pre$CHR_POS,
                                      SNPid = snps_data_pre$SNPS,
                                      Trait = rep(paste(trait_pre), nrow(snps_data_pre)),
                                      PVALUE_MLOG = snps_data_pre$PVALUE_MLOG,
                                      OR.or.BETA = snps_data_pre$OR.or.BETA)

          snps_data_pre <- subset(snps_data_pre, PVALUE_MLOG > input$pvalmlog)
          snps_data_pre <- subset(snps_data_pre, OR.or.BETA > input$orbeta | is.na(OR.or.BETA))

          if (i == 1){

            snps_data <- snps_data_pre

          } else {

            snps_data <- rbind(snps_data, snps_data_pre)

          }
        }

      snps_data <- within(snps_data, Chr <- factor(Chr, levels = f))

      p <- ggplot(data = snps_data, aes(x = Chr, y = as.numeric(Start))) +
        geom_bar(data = chr_data, aes(x = SEQNAME, y = as.numeric(SEQLENGTH)), stat = "identity", fill = "grey90", color = "black") +
        theme(
          axis.text = element_text(size = 14),
          axis.title = element_text(size = 14),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_rect(fill = "white"),
          legend.position = "bottom"
        ) +
        labs(x = "Chromosome", y = "Position")

      p + geom_segment(data = snps_data, aes(x = as.numeric(as.character(Chr)) - 0.45, xend = as.numeric(as.character(Chr)) + 0.45,
                                           y = Start, yend = Start, colour = Trait), size = 2, alpha = 0.5) +
        scale_colour_brewer(palette = "Set1") +
        guides(colour = guide_legend(ncol = 3, byrow = FALSE))

      }

    plot_snps(trait = c(input$variable1, input$variable2, input$variable3, input$variable4, input$variable5, input$variable6, input$variable7, input$variable8, input$variable9))
  })

  output$table <- renderTable({
    table <- gwas38_traits[which(gwas38_traits$Trait %in% c(input$variable1, input$variable2, input$variable3, input$variable4, input$variable5, input$variable6, input$variable7, input$variable8)), ]

    })

  output$table2 <- renderTable({
    table <- gwas38[which(gwas38$DISEASE.TRAIT %in% c(input$variable1, input$variable2, input$variable3, input$variable4, input$variable5, input$variable6, input$variable7, input$variable8)), c(1:5, 8, 10, 11, 13, 14, 20, 27, 32, 33, 34, 36)]

    table <- subset(table, PVALUE_MLOG > input$pvalmlog)
    table <- subset(table, OR.or.BETA > input$orbeta | is.na(OR.or.BETA))

    table[order(table$seqnames), ]
    })

})

Deploying to shinyapps.io

Finally, I am deploying my finished app to shinyapps.io with the rsconnect package. You will need to register with shinyapps.io before you can host your Shiny app there and register rsconnect with the token you received.

library(rsconnect)

rsconnect::deployApp('~/Documents/Github/blog_posts_prep/gwas/shiny/GWAS_Shiny_App')

This will open your deployed app right aways. You can now share the link to your app with the world! :-)

If you are interested in human genomics…

… you might also like these posts:

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rsconnect_0.6                          
##  [2] gwascat_2.6.0                          
##  [3] Homo.sapiens_1.3.1                     
##  [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [5] GO.db_3.4.0                            
##  [6] OrganismDbi_1.16.0                     
##  [7] EnsDb.Hsapiens.v79_2.1.0               
##  [8] ensembldb_1.6.2                        
##  [9] GenomicFeatures_1.26.0                 
## [10] GenomicRanges_1.26.1                   
## [11] GenomeInfoDb_1.10.1                    
## [12] org.Hs.eg.db_3.4.0                     
## [13] AnnotationDbi_1.36.0                   
## [14] IRanges_2.8.1                          
## [15] S4Vectors_0.12.1                       
## [16] Biobase_2.34.0                         
## [17] BiocGenerics_0.20.0                    
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.3-1              rprojroot_1.1                
##   [3] biovizBase_1.22.0             htmlTable_1.7                
##   [5] XVector_0.14.0                base64enc_0.1-3              
##   [7] dichromat_2.0-0               base64_2.0                   
##   [9] interactiveDisplayBase_1.12.0 codetools_0.2-15             
##  [11] splines_3.3.2                 snpStats_1.24.0              
##  [13] ggbio_1.22.0                  fail_1.3                     
##  [15] doParallel_1.0.10             knitr_1.15.1                 
##  [17] Formula_1.2-1                 gQTLBase_1.6.0               
##  [19] Rsamtools_1.26.1              cluster_2.0.5                
##  [21] graph_1.52.0                  shiny_0.14.2                 
##  [23] httr_1.2.1                    backports_1.0.4              
##  [25] assertthat_0.1                Matrix_1.2-7.1               
##  [27] lazyeval_0.2.0                limma_3.30.6                 
##  [29] acepack_1.4.1                 htmltools_0.3.5              
##  [31] tools_3.3.2                   gtable_0.2.0                 
##  [33] reshape2_1.4.2                dplyr_0.5.0                  
##  [35] fastmatch_1.0-4               Rcpp_0.12.8                  
##  [37] Biostrings_2.42.1             nlme_3.1-128                 
##  [39] rtracklayer_1.34.1            iterators_1.0.8              
##  [41] ffbase_0.12.3                 stringr_1.1.0                
##  [43] mime_0.5                      XML_3.98-1.5                 
##  [45] AnnotationHub_2.6.4           zlibbioc_1.20.0              
##  [47] scales_0.4.1                  BSgenome_1.42.0              
##  [49] VariantAnnotation_1.20.2      BiocInstaller_1.24.0         
##  [51] SummarizedExperiment_1.4.0    RBGL_1.50.0                  
##  [53] RColorBrewer_1.1-2            BBmisc_1.10                  
##  [55] yaml_2.1.14                   memoise_1.0.0                
##  [57] gridExtra_2.2.1               ggplot2_2.2.0                
##  [59] biomaRt_2.30.0                rpart_4.1-10                 
##  [61] reshape_0.8.6                 latticeExtra_0.6-28          
##  [63] stringi_1.1.2                 RSQLite_1.1-1                
##  [65] foreach_1.4.3                 checkmate_1.8.2              
##  [67] BiocParallel_1.8.1            BatchJobs_1.6                
##  [69] GenomicFiles_1.10.3           bitops_1.0-6                 
##  [71] matrixStats_0.51.0            evaluate_0.10                
##  [73] lattice_0.20-34               GenomicAlignments_1.10.0     
##  [75] bit_1.1-12                    GGally_1.3.0                 
##  [77] plyr_1.8.4                    magrittr_1.5                 
##  [79] sendmailR_1.2-1               R6_2.2.0                     
##  [81] Hmisc_4.0-1                   DBI_0.5-1                    
##  [83] foreign_0.8-67                mgcv_1.8-16                  
##  [85] survival_2.40-1               RCurl_1.95-4.8               
##  [87] nnet_7.3-12                   tibble_1.2                   
##  [89] rmarkdown_1.2                 grid_3.3.2                   
##  [91] data.table_1.10.0             gQTLstats_1.6.0              
##  [93] digest_0.6.10                 xtable_1.8-2                 
##  [95] ff_2.2-13                     httpuv_1.3.3                 
##  [97] brew_1.0-6                    openssl_0.9.5                
##  [99] munsell_0.4.3                 beeswarm_0.2.3               
## [101] Gviz_1.18.1

Gene homology Part 2 - creating directed networks with igraph

2016-12-14 00:00:00 +0000

In my last post I created a gene homology network for human genes. In this post I want to extend the network to include edges for other species.

First, I am loading biomaRt and extract a list of all available datasets.

library(biomaRt)
ensembl = useMart("ensembl")
datasets <- listDatasets(ensembl)

For this network I don’t want to include all species with available information in Biomart because it would get way too messy. So, I am only picking the 11 species below: Human, fruitfly, mouse, C. elegans, dog, zebrafish, chicken, chimp, rat, yeast and pig. For all of these species, I am going to use their respective org.db database to extract gene information.

In order to loop over all species’ org.dbs for installation and loading, I create a new column with the name of the library.

datasets$orgDb <- NA

datasets[grep("hsapiens", datasets$dataset), "orgDb"] <- "org.Hs.eg.db"
datasets[grep("dmel", datasets$dataset), "orgDb"] <- "org.Dm.eg.db"
datasets[grep("mmus", datasets$dataset), "orgDb"] <- "org.Mm.eg.db"
datasets[grep("celegans", datasets$dataset), "orgDb"] <- "org.Ce.eg.db"
datasets[grep("cfam", datasets$dataset), "orgDb"] <- "org.Cf.eg.db"
datasets[grep("drerio", datasets$dataset), "orgDb"] <- "org.Dr.eg.db"
datasets[grep("ggallus", datasets$dataset), "orgDb"] <- "org.Gg.eg.db"
datasets[grep("ptrog", datasets$dataset), "orgDb"] <- "org.Pt.eg.db"
datasets[grep("rnor", datasets$dataset), "orgDb"] <- "org.Rn.eg.db"
datasets[grep("scer", datasets$dataset), "orgDb"] <- "org.Sc.sgd.db"
datasets[grep("sscrofa", datasets$dataset), "orgDb"] <- "org.Ss.eg.db"

datasets <- datasets[!is.na(datasets$orgDb), ]
datasets

##                       dataset                              description
## 9    rnorvegicus_gene_ensembl                     Rat genes (Rnor_6.0)
## 13   scerevisiae_gene_ensembl Saccharomyces cerevisiae genes (R64-1-1)
## 14      celegans_gene_ensembl  Caenorhabditis elegans genes (WBcel235)
## 22  ptroglodytes_gene_ensembl            Chimpanzee genes (CHIMP2.1.4)
## 26       sscrofa_gene_ensembl                  Pig genes (Sscrofa10.2)
## 32      hsapiens_gene_ensembl                  Human genes (GRCh38.p7)
## 36       ggallus_gene_ensembl        Chicken genes (Gallus_gallus-5.0)
## 41        drerio_gene_ensembl                 Zebrafish genes (GRCz10)
## 53 dmelanogaster_gene_ensembl                   Fruitfly genes (BDGP6)
## 61     mmusculus_gene_ensembl                  Mouse genes (GRCm38.p5)
## 69   cfamiliaris_gene_ensembl                    Dog genes (CanFam3.1)
##              version         orgDb
## 9           Rnor_6.0  org.Rn.eg.db
## 13           R64-1-1 org.Sc.sgd.db
## 14          WBcel235  org.Ce.eg.db
## 22        CHIMP2.1.4  org.Pt.eg.db
## 26       Sscrofa10.2  org.Ss.eg.db
## 32         GRCh38.p7  org.Hs.eg.db
## 36 Gallus_gallus-5.0  org.Gg.eg.db
## 41            GRCz10  org.Dr.eg.db
## 53             BDGP6  org.Dm.eg.db
## 61         GRCm38.p5  org.Mm.eg.db
## 69         CanFam3.1  org.Cf.eg.db

Now I can load the Ensembl mart for each species.

for (i in 1:nrow(datasets)) {
  ensembl <- datasets[i, 1]
  assign(paste0(ensembl), useMart("ensembl", dataset = paste0(ensembl)))
}

specieslist <- datasets$dataset

And install all org.db libraries, if necessary.

library(AnnotationDbi)

load_orgDb <- function(orgDb){
  if(!orgDb %in% installed.packages()[,"Package"]){
    source("https://bioconductor.org/biocLite.R")
    biocLite(orgDb)
  }
}

sapply(datasets$orgDb, load_orgDb, simplify = TRUE, USE.NAMES = TRUE)

Once they are all installed, I can load the libraries.

lapply(datasets$orgDb, require, character.only = TRUE)

Because I want to exctract and compare information on homologus genes for all possible combinations of species, they need to have a common identifier. To find them, I first produce a list with all available keytypes and then ask for common elements using rlist’s list.common() function.

keytypes_list <- lapply(datasets$orgDb, function(x) NULL)
names(keytypes_list) <- paste(datasets$orgDb)

for (orgDb in datasets$orgDb){
  keytypes_list[[orgDb]] <- keytypes(get(orgDb))
}

library(rlist)
list.common(keytypes_list)

##  [1] "ENTREZID"    "ENZYME"      "EVIDENCE"    "EVIDENCEALL" "GENENAME"   
##  [6] "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PATH"       
## [11] "PMID"        "REFSEQ"      "UNIPROT"

Entrez IDs are available for all species, so these are the keys I am using as gene identifiers. I am first creating a table of homologous genes for each species with all other species separately by looping over the datasets table.

for (i in 1:nrow(datasets)){
  orgDbs <- datasets$orgDb[i]
  values <- keys(get(orgDbs), keytype = "ENTREZID")

  ds <- datasets$dataset[i]
  mart <- useMart("ensembl", dataset = paste(ds))
  print(mart)

  if (!is.na(listFilters(mart)$name[grep("^entrezgene$", listFilters(mart)$name)])){
    if (!is.na(listAttributes(mart)$name[grep("^entrezgene$", listAttributes(mart)$name)])){
    print("TRUE")
    for (species in specieslist) {
      print(species)
      if (species != ds){
        assign(paste("homologs", orgDbs, species, sep = "_"), getLDS(attributes = c("entrezgene"),
                                                                     filters = "entrezgene",
                                                                     values = values,
                                                                     mart = mart,
                                                                     attributesL = c("entrezgene"),
                                                                     martL = get(species)))
      }
    }
  }
  }
}

Now I sort and combine these tables into one big table and remove duplicate entries.

library(dplyr)

for (i in 1:nrow(datasets)){
  orgDbs <- datasets$orgDb[i]
  values <- data.frame(GeneID = keys(get(orgDbs), keytype = "ENTREZID"))
  values$GeneID <- as.character(values$GeneID)
  ds <- datasets$dataset[i]

  for (j in 1:length(specieslist)){
    species <- specieslist[j]

    if (j == 1){
      homologs_table <- values
    }

    if (species != ds){
      homologs_species <- get(paste("homologs", orgDbs, species, sep = "_"))
      homologs_species$EntrezGene.ID <- as.character(homologs_species$EntrezGene.ID)

      homologs <- left_join(values, homologs_species, by = c("GeneID" = "EntrezGene.ID"))
      homologs <- homologs[!duplicated(homologs$GeneID), ]
      colnames(homologs)[2] <- paste(species)

      homologs_table <- left_join(homologs_table, homologs, by = "GeneID")
  }
  }

  colnames(homologs_table)[1] <- paste(ds)

  assign(paste("homologs_table", ds, sep = "_"), homologs_table)
}


for (i in 1:nrow(datasets)){
  ds <- datasets$dataset[i]

  if (i == 1){
    homologs_table_combined <- get(paste("homologs_table", ds, sep = "_"))
    homologs_table_combined <- homologs_table_combined[, order(colnames(homologs_table_combined))]
  } else {
    homologs_table_species <- get(paste("homologs_table", ds, sep = "_"))
    homologs_table_species <- homologs_table_species[, order(colnames(homologs_table_species))]

    homologs_table_combined <- rbind(homologs_table_combined, homologs_table_species)
  }
}

homologs_table_combined <- homologs_table_combined[!duplicated(homologs_table_combined), ]

Now each row in the table shows one gene and its homologs in all of the 11 species. Some genes have multiple homologs in other species (e.g. gene 44071 of the fruitfly has two homologs in zebrafish).

head(homologs_table_combined)

##   celegans_gene_ensembl cfamiliaris_gene_ensembl
## 1                    NA                       NA
## 2                    NA                   477699
## 3                    NA                   477023
## 4                173979                   490207
## 5                179795                   607852
## 6                177055                       NA
##   dmelanogaster_gene_ensembl drerio_gene_ensembl ggallus_gene_ensembl
## 1                         NA                  NA                   NA
## 2                      44071                  NA                   NA
## 3                         NA              431754               770094
## 4                      38864              406283                   NA
## 5                      36760              794259               395373
## 6                    3772179              541489               421909
##   hsapiens_gene_ensembl mmusculus_gene_ensembl ptroglodytes_gene_ensembl
## 1                    NA                     NA                        NA
## 2                     2                 232345                    465372
## 3                    30                 235674                    460268
## 4                    34                  11364                    469356
## 5                    47                 104112                    454672
## 6                    52                  11431                    458990
##   rnorvegicus_gene_ensembl scerevisiae_gene_ensembl sscrofa_gene_ensembl
## 1                    24152                       NA                   NA
## 2                    24153                       NA               403166
## 3                    24157                   854646            100515577
## 4                    24158                       NA               397104
## 5                    24159                   854310                   NA
## 6                    24161                   856187            100737301

In order to create the cooccurrence matrix, I am converting gene names to “1” and NAs to “0” and multiply the matrix with its transposed self. This is now the basis for igraph’s graph_from_adjacency_matrix() function with which I’m creating a directed network. Before plotting the network, I am removing multiple entries and loops.

homologs_table_combined_matrix <- as.matrix(ifelse(is.na(homologs_table_combined), 0, 1))

co_occurrence <- t(as.matrix(homologs_table_combined_matrix)) %*% as.matrix(homologs_table_combined_matrix)

library(igraph)
g <- graph_from_adjacency_matrix(co_occurrence,
                                 weighted = TRUE,
                                 diag = FALSE,
                                 mode = "directed")

g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)

I am also preparing a node annotation table with colors for each species group and the count of unique gene identifiers for species.

datasets[, 2] <- gsub("(.*)( genes (.*))", "\\1", datasets[, 2])
datasets$description[grep("Saccharomyces", datasets$description)] <- "Yeast"
datasets$description[grep("elegans", datasets$description)] <- "C. elegans"

datasets$group <- ifelse(datasets$description == "Yeast", "fungus",
                         ifelse(datasets$description == "C. elegans", "roundworm",
                                ifelse(datasets$description == "Chicken", "bird",
                                       ifelse(datasets$dataset == "Zebrafish", "fish",
                                              ifelse(datasets$description == "Fruitfly", "insect", "mammal")))))

list <- as.character(unique(datasets$dataset))

library(RColorBrewer)
set3 <- brewer.pal(length(list), "Set3")

datasets$col <- NA
for (i in 1:length(list)){
  datasets[datasets$dataset == list[i], "col"] <- set3[i]
}

no_genes <- as.data.frame(apply(homologs_table_combined, 2, function(x) length(unique(x))))
no_genes$dataset <- rownames(no_genes)
colnames(no_genes)[1] <- "no_genes"

library(dplyr)
datasets <- left_join(datasets, no_genes, by = "dataset")

datasets <- datasets[order(datasets$dataset), ]

I also want to have the proportion of genes that have homologs in each of the respective other species as edge attributes. I am preparing an extra table for this of all possible combinations of nodes.

for (i in 1:ncol(homologs_table_combined)){
  input <- unique(na.omit(homologs_table_combined[, i]))
  homologs_table_combined_subset <- homologs_table_combined[!is.na(homologs_table_combined[, i]), ]

  for (j in 1:ncol(homologs_table_combined)){
    if (i != j){
      output <- unique(na.omit(homologs_table_combined_subset[, j]))
      
      if (i == 1 & j == 2){
        edge_table <- data.frame(source = colnames(homologs_table_combined)[i], 
                                 target = colnames(homologs_table_combined)[j], 
                                 weight = length(output)/length(input))
      } else {
        table <- data.frame(source = colnames(homologs_table_combined)[i], 
                            target = colnames(homologs_table_combined)[j], 
                            weight = length(output)/length(input))
        edge_table <- rbind(edge_table, table)
      }
    }
  }
}

list <- as.character(unique(edge_table$source))

edge_table$col <- NA
for (i in 1:length(list)){
  edge_table[edge_table$source == list[i], "col"] <- set3[i]
}

And finally I can produce the plot:

V(g)$color <- datasets$col
V(g)$label <- datasets$description
V(g)$size <- datasets$no_genes/2000
E(g)$arrow.size <- 3
E(g)$width <- edge_table$weight*25

plot(g,
     vertex.label.font = 1,
     vertex.shape = "sphere",
     vertex.label.cex = 3,
     vertex.label.color = "black",
     vertex.frame.color = NA,
     edge.curved = rep(0.1, ecount(g)),
     edge.color = edge_table$col,
     layout = layout_in_circle)

Node size shows the total number of genes of each species and edge width shows the proportion of these genes with homologs in the respective other species. Edge and node colors show the 11 species and their outgoing edges.

Weirdly, the mouse seems to have more gene entries in the org.db library than human. If anyone knows why that is, please let me know!

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_0.5.0          RColorBrewer_1.1-2   igraph_1.0.1        
##  [4] rlist_0.4.6.1        org.Cf.eg.db_3.4.0   org.Mm.eg.db_3.4.0  
##  [7] org.Dm.eg.db_3.4.0   org.Dr.eg.db_3.4.0   org.Gg.eg.db_3.4.0  
## [10] org.Hs.eg.db_3.4.0   org.Ss.eg.db_3.4.0   org.Pt.eg.db_3.4.0  
## [13] org.Ce.eg.db_3.4.0   org.Sc.sgd.db_3.4.0  org.Rn.eg.db_3.4.0  
## [16] AnnotationDbi_1.36.0 IRanges_2.8.1        S4Vectors_0.12.1    
## [19] Biobase_2.34.0       BiocGenerics_0.20.0  biomaRt_2.30.0      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8       bitops_1.0-6      tools_3.3.2      
##  [4] digest_0.6.10     tibble_1.2        RSQLite_1.1-1    
##  [7] evaluate_0.10     memoise_1.0.0     DBI_0.5-1        
## [10] yaml_2.1.14       stringr_1.1.0     knitr_1.15.1     
## [13] rprojroot_1.1     data.table_1.10.0 R6_2.2.0         
## [16] XML_3.98-1.5      rmarkdown_1.2     magrittr_1.5     
## [19] backports_1.0.4   htmltools_0.3.5   assertthat_0.1   
## [22] stringi_1.1.2     RCurl_1.95-4.8

Creating a network of human gene homology with R and D3

2016-12-11 00:00:00 +0000

Edited on 20 December 2016

This week I want to explore how many of our human genes have homologs in other species. I will use this question to show how to visualize dendrograms and (D3-) networks.

Here, I am looking at gene homologs for all genes of the human genome. In Part 2 I am creating a full network between a subset of the species from this post.

Accessing species information from Biomart

I am using the biomaRt package to access all datasets available in the Biomart Ensembl database. The listDatasets() function shows that currently there is data for 69 species (including humans). This dataframe gives us the name of each dataset in Biomart Ensembl, the respective common species name and version number.

library(biomaRt)
ensembl = useMart("ensembl")
datasets <- listDatasets(ensembl)

human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

I am then looping over each species to call useMart() and assign an object to each species’ Ensembl database.

for (i in 1:nrow(datasets)) {
  ensembl <- datasets[i, 1]
  assign(paste0(ensembl), useMart("ensembl", dataset = paste0(ensembl)))
}

specieslist <- datasets$dataset

Identifying human gene homologs

Protein coding genes

The majority of human genes are protein coding genes. This means that their DNA sequence will be translated into a protein with specific cellular functions. For an overview of all gene biotypes in the human genome, have a look at this previous post.

To identify all protein coding genes in the human genome, I am using AnnotationDbi and the EnsDb.Hsapiens.v79 database.

library(AnnotationDbi)
library(EnsDb.Hsapiens.v79)

# get all Ensembl gene IDs
human_EnsDb <- keys(EnsDb.Hsapiens.v79, keytype = "GENEID")

# get the biotype of each gene ID
human_gene_EnsDb <- ensembldb::select(EnsDb.Hsapiens.v79, keys = human_EnsDb, columns = "GENEBIOTYPE", keytype = "GENEID")

# and keep only protein coding genes
human_prot_coding_genes <- human_gene_EnsDb[which(human_gene_EnsDb$GENEBIOTYPE == "protein_coding"), ]

To get a dataframe of human genes and their homologs in other species, I am looping over all Biomart Ensembl databases using biomaRt’s getLDS() function. getLDS() links two datasets and is equivalent to homology mapping with Ensembl. I am also including the human database so that I can see how many genes from EnsDb.Hsapiens.v79 are found via Biomart.

for (species in specieslist) {
  print(species)
  assign(paste0("homologs_human_", species), getLDS(attributes = c("ensembl_gene_id", "chromosome_name"), 
       filters = "ensembl_gene_id", 
       values = human_prot_coding_genes$GENEID, 
       mart = human, 
       attributesL = c("ensembl_gene_id", "chromosome_name"), 
       martL = get(species)))
}

Based on EnsDb.Hsapiens.v79 protein coding genes, I now have 69 individual datasets with homologous genes found in other species. I now want to combine these 69 datasets into one cooccurrence matrix. To do this, I am first joining the homologous gene subsets back to the original dataframe with all protein coding genes. Then, I convert existing homologous gene names to 1 and NAs to 0. These, I then merge into one table. This final table now contains the information whether each gene has a homolog in each of the other species (“1”) or not (“0”).

library(dplyr)

for (i in 1:length(specieslist)){
  species <- specieslist[i]
  homologs_human <- left_join(human_prot_coding_genes, get(paste0("homologs_human_", species)), by = c("GENEID" = "Gene.ID"))
  homologs_human[, paste0(species)] <- ifelse(is.na(homologs_human$Gene.ID.1), 0, 1)
  homologs_human <- homologs_human[, c(1, 6)]
  homologs_human <- homologs_human[!duplicated(homologs_human$GENEID), ]
  
  if (i == 1){
    homologs_human_table <- homologs_human
  } else {
    homologs_human_table <- left_join(homologs_human_table, homologs_human, by = "GENEID")
  }
}

I only want to keep the 21,925 human genes found via Biomart (77 genes were not found).

homologs_human_table <- homologs_human_table[which(homologs_human_table[, grep("sapiens", colnames(homologs_human_table))] == 1), ]

The final steps to creating a cooccurrence matrix are

removing the column with Ensembl gene IDs (because we have the Homo sapiens column from Biomart),
multiplying with the transposed matrix and
setting all rows and columns which don’t show cooccurrences with humans to 0 (because I only looked at homology from the human perspective).

gene_matrix <- homologs_human_table[, -1]

co_occurrence <- t(as.matrix(gene_matrix)) %*% as.matrix(gene_matrix)
co_occurrence[-grep("sapiens", rownames(co_occurrence)), -grep("sapiens", colnames(co_occurrence))] <- 0

The first information I want to extract from this data is how many and what proportion of human genes had a homolog in each of the other species.

genes <- data.frame(organism = colnames(gene_matrix),
                    number_genes = colSums(gene_matrix),
                    proportion_human_genes = colSums(gene_matrix)/nrow(homologs_human_table))

I also want to know which species are similar regarding the gene homology they share with humans. This, I will visualize with a hierarchical clustering dendrogram using the dendextend and circlize packages.

Before I can produce plots however, I want to create annotation attributes for each species:

their common name (because most people won’t know off the top of their heads what species T. nigroviridis is) and
a grouping attribute, e.g. mammal, fish, bird, etc.

In order to have the correct order of species, I create the annotation dataframe based off the dendrogram.

Most of the common names I can extract from the list of Biomart Ensembl datasets but a few I have to change manually.

library(dendextend)
library(circlize)

# create a dendrogram
h <- hclust(dist(scale(as.matrix(t(gene_matrix))), method = "manhattan"))
dend <- as.dendrogram(h)

library(dplyr)
labels <- as.data.frame(dend %>% labels) %>%
  left_join(datasets, by = c("dend %>% labels" = "dataset")) %>%
  left_join(genes, by = c("dend %>% labels" = "organism"))

labels[, 2] <- gsub("(.*)( genes (.*))", "\\1", labels[, 2])
labels$group <- c(rep("mammal", 2), rep("fish", 8), "amphibia", rep("fish", 4), rep("bird", 5), rep("reptile", 2), rep("mammal", 41), "fungus", "lamprey", rep("seasquirt", 2), "nematode", "insect")

labels$description[grep("hedgehog", labels$description)] <- "Hedgehog Tenrec"
labels$description[grep("Saccharomyces", labels$description)] <- "Yeast"
labels$description[grep("savignyi", labels$description)] <- "C. savignyi"
labels$description[grep("intestinalis", labels$description)] <- "C. intestinalis"
labels$description[grep("elegans", labels$description)] <- "C. elegans"
labels$description[grep("turtle", labels$description)] <- "Turtle"
labels$description[grep("Vervet", labels$description)] <- "Vervet"

The first plot shows the proportion of human genes with a homolog in each of the other species.

library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_grey(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "navy"),
    legend.position = "bottom",
    legend.background = element_blank(),
    panel.margin = unit(.5, "lines"),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

labels_p <- labels[-1, ]

f = as.factor(labels_p[order(labels_p$proportion_human_genes, decreasing = TRUE), "description"])
labels_p <- within(labels_p, description <- factor(description, levels = f))

ggplot(labels_p, aes(x = description, y = proportion_human_genes, fill = group)) +
  geom_bar(stat = "identity") +
  my_theme() +
  labs(
    x = "",
    y = "Proportion of homology",
    fill = "",
    title = "Human gene homology",
    subtitle = "Proportion of human protein coding genes with homologs in 68 other species"
  )

As expected, the highest proportions of homologous genes are found between humans and most other mammals: between 81 and 70% of our genes have a homolog in most mammals. The lowest proportion of homologous genes are found in yeast, but even such a simply organism shares more than 20% of its genes with us. Think about that the next time you enjoy a delicious freshly baked bread or drink a nice cold glass of beer…

However, based on these Biomart Ensembl databases, more of our genes have a homolog in mice than in primates. I would have expected primates to come out on top, but what we see here might be confounded by the different amounts of information we have on species’ genomes and the accuracy of this information: there has been more research done on the mouse than e.g. on the alpaca. We therefore have more and more accurate information on the mouse genome and have identified more homologous genes than in other species.

For plotting the dendrograms, I am adding a column with color specifications for each group to the annotation table.

labels$col <- c(rep("aquamarine4", 2), rep("darkorange3", 8), "blue4", rep("darkorange3", 4), rep("darkred", 5), rep("darkmagenta", 2), rep("aquamarine4", 41), "darkslateblue", "deepskyblue1", rep("deeppink3", 2), "forestgreen", "brown3")

This dendrogram shows the hierarchical clustering of human gene homology with 68 other species. The species are grouped by color (see legend).

labels(dend) <- labels$description
dend %>% 
  set("labels_col", labels$col) %>% 
  set("labels_cex", 2) %>% 
  circlize_dendrogram(labels_track_height = NA, dend_track_height = 0.3) 

legend("topleft", unique(labels$group), pch = 19,
       col = unique(labels$col), pt.cex = 3, cex = 3, bty = "n", ncol = 1)

The most distant group from all other species include yeast, fruit fly, lamprey, C. elegans and the two seasquirt species, followed by tarsier, alpaca, sloth and shrew. The next group includes the fish, reptiles, amphibiens, birds and the platypus. Interestingly, the platypus the Xenopus are more similar to fish than to birds and reptiles.

And finally, we have the remaining mammals: here, the primates cluster nicely together.

In order to show the homology between human genes with the other species, we can also plot a network. The 2D-network is created from the cooccurrence matrix with node size and edge width representing the proportion of homology and colors depicting the species’ group.

# plot the network graph
library(igraph)
g <- graph_from_adjacency_matrix(co_occurrence,
                         weighted = TRUE,
                         diag = FALSE,
                         mode = "undirected")

g <- simplify(g, remove.multiple = F, remove.loops = T, edge.attr.comb = c(weight = "sum", type = "ignore"))

labels_2 <- labels[match(rownames(co_occurrence), labels[, 1]), ]

V(g)$color <- labels_2$col
V(g)$label <- labels_2$description
V(g)$size <- labels_2$proportion_human_genes*25
E(g)$arrow.size <- 0.2
E(g)$edge.color <- "gray80"
E(g)$width <- labels_2$proportion_human_genes*10

plot(g,
     vertex.label.font = 1,
     vertex.shape = "sphere",
     vertex.label.cex = 1,
     vertex.label.color = "white",
     vertex.frame.color = NA)

legend("topleft", unique(labels_2$group), pch = 19,
       col = unique(labels_2$col), pt.cex = 2, cex = 1.5, bty = "n", ncol = 1)

Another nice way to visualize networks is to create a D3 JavaScript network graph with the networkD3 package. It shows the same information as the 2D-network above but here you can interact with the nodes, which makes the species names easier to read.

library(networkD3)

net_d3 <- igraph_to_networkD3(g, group = labels_2$group)
net_d3$nodes <- merge(net_d3$nodes, genes, by.x = "name", by.y = "row.names")
net_d3$nodes$proportion_human_genes <- net_d3$nodes$proportion_human_genes
net_d3$nodes <- merge(net_d3$nodes, labels_2[, 1:2], by.x = "name", by.y = "dend %>% labels")
net_d3$nodes[, 1] <- net_d3$nodes$description

net_d3$nodes <- net_d3$nodes[match(labels_2[, 1], net_d3$nodes$organism), ]

# Create force directed network plot
forceNetwork(Links = net_d3$links, 
             Nodes = net_d3$nodes,
             Nodesize = "proportion_human_genes",
             radiusCalculation = JS(" Math.sqrt(d.nodesize)*15"),
             linkDistance = 100,
             zoom = TRUE, 
             opacity = 0.9,
             Source = 'source', 
             Target = 'target',
             linkWidth = networkD3::JS("function(d) { return d.value/5; }"),
             bounded = FALSE, 
             colourScale = JS("d3.scale.category10()"),
             NodeID = 'name', 
             Group = 'group', 
             fontSize = 10, 
             charge = -200,
             opacityNoHover = 0.7)

Click to open the interactive network and use the mouse to zoom and turn the plot:

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] networkD3_0.2.13         igraph_1.0.1            
##  [3] ggplot2_2.2.0            dplyr_0.5.0             
##  [5] circlize_0.3.9           dendextend_1.3.0        
##  [7] EnsDb.Hsapiens.v79_1.1.0 ensembldb_1.6.0         
##  [9] GenomicFeatures_1.26.0   GenomicRanges_1.26.0    
## [11] GenomeInfoDb_1.10.0      AnnotationDbi_1.36.0    
## [13] IRanges_2.8.0            S4Vectors_0.12.0        
## [15] Biobase_2.34.0           BiocGenerics_0.20.0     
## [17] biomaRt_2.30.0          
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.2.1                    jsonlite_1.1                 
##  [3] AnnotationHub_2.6.0           shiny_0.14.2                 
##  [5] assertthat_0.1                interactiveDisplayBase_1.12.0
##  [7] Rsamtools_1.26.0              yaml_2.1.14                  
##  [9] robustbase_0.92-6             RSQLite_1.0.0                
## [11] lattice_0.20-34               digest_0.6.10                
## [13] XVector_0.14.0                colorspace_1.3-1             
## [15] htmltools_0.3.5               httpuv_1.3.3                 
## [17] Matrix_1.2-7.1                plyr_1.8.4                   
## [19] XML_3.98-1.5                  zlibbioc_1.20.0              
## [21] xtable_1.8-2                  mvtnorm_1.0-5                
## [23] scales_0.4.1                  whisker_0.3-2                
## [25] BiocParallel_1.8.0            tibble_1.2                   
## [27] SummarizedExperiment_1.4.0    nnet_7.3-12                  
## [29] lazyeval_0.2.0                magrittr_1.5                 
## [31] mime_0.5                      mclust_5.2                   
## [33] evaluate_0.10                 MASS_7.3-45                  
## [35] class_7.3-14                  BiocInstaller_1.24.0         
## [37] tools_3.3.2                   GlobalOptions_0.0.10         
## [39] trimcluster_0.1-2             stringr_1.1.0                
## [41] kernlab_0.9-25                munsell_0.4.3                
## [43] cluster_2.0.5                 fpc_2.1-10                   
## [45] Biostrings_2.42.0             grid_3.3.2                   
## [47] RCurl_1.95-4.8                htmlwidgets_0.8              
## [49] bitops_1.0-6                  labeling_0.3                 
## [51] rmarkdown_1.1                 gtable_0.2.0                 
## [53] flexmix_2.3-13                DBI_0.5-1                    
## [55] R6_2.2.0                      GenomicAlignments_1.10.0     
## [57] knitr_1.15                    prabclus_2.2-6               
## [59] rtracklayer_1.34.0            shape_1.4.2                  
## [61] modeltools_0.2-21             stringi_1.1.2                
## [63] Rcpp_0.12.8                   DEoptimR_1.0-8               
## [65] diptest_0.75-7

How to set up your own R blog with Github pages and Jekyll Bootstrap

2016-12-04 00:00:00 +0000

This post is in reply to a request: How did I set up this R blog?

I have wanted to have my own R blog for a while before I actually went ahead and realised this page. I had seen all the cool things people do with R by following R-bloggers and reading their newsletter every day!

While I am using R every day at work, the thematic scope there is of course very bioinformatics-centric (with a little bit of Machine Learning and lots of visualization and statistics), I wanted to have an incentive to regularly explore other types of analyses and other types of data that I don’t normally work with. I have also very often benefited from other people’s published code in that it gave me ideas for my own work and I hope that sharing my own analyses will inspire others as much as I often am by what can be be done with data.

Moreover, since I’ve started my R blog, I’ve very much enjoyed interacting more with other R-bloggers and R users and getting valuable feedback. I definitely feel more part of the community now that I’m an active contributor!

In this post, I will describe how I go about publishing R code and its output to my blog. I will not go into detail on how to customize your page but only give a quick introduction to setting up a blog with Github pages and JekyllBoostrap. I will focus on how to write content with RMarkdown and RStudio.

Setting up Github pages

I decided to host my blog on Github pages because I was already used to working with Github (it’s where I store my exprAnalysis R package).

Setting up your own Github page is free and very easy! All you need is a Github account. On Github you need to create a new repository called “username.github.io”” (substitute username for your Github user name, in my case: ShirinG.github.io).

I set up the rest via the terminal interface:

go to the folder where your repository should live (in this example called “Test”)
clone the repository and change to its directory
pull from Github to make sure everything is synchronized and up to date between your computer and Github

cd Home/Documents/Test
git clone https://github.com/username/username.github.io
cd username.github.io
git pull origin master

JekyllBoostrap

Github pages also supports Jekyll, an easy to use static site generator perfect for blogging. I set up my blog with JekyllBoostrap, a full blog scaffold for Jekyll based blogs. Follow their Quick Start Guide for instructions on how to install it.

It also gives detailed instructions on how to set up comments, analytics and permalinks.

Customizing your page

Most of the heavy-lifting has already been done for you with JekyllBootstrap but of course there are many options for customizing and adapting your blog’s layout to your liking.

The main file for customizing your page is Jekyll’s “*_config.yml*” file (check back with JekyllBoostrap for details). Here you need to fill it all the relevant information about your page, like title, author, etc.

If you know a little bit about JavaScript, CSS and HTML, you can also directly modify the style sheets in the directory “*_includes*”.

You can change your page’s theme by following this tutorial.

In your page repository’s main folder you have a markdown called “index”. This is your main page and is already set up to show a list of all your blog post.

You can have more pages in the navigation bar if you want, like an “About me”, etc. You simply need to create a new markdown with the respective content in your page repository’s main folder.

You will also automatically get .xml files with your blog’s RSS and Atom feeds.

Preparing R blog posts with RMarkdown

I am generally working in RStudio, so this is where I’m preparing all my blog posts. I’m writing my code in RMarkdown and produce html output files until I deem a blog post finished.

This is how an example Rmarkdown document could look like.

Your Rmarkdown will need a header with title, author, date and output information:

---
title: "This is a test post for my R blog"
author: "Dr. Shirin Glander"
date: '`r Sys.Date()`'
output: html_document
---

Below the header you can write text, insert pictures, create tables and much more. Code chunks will go in between brackets.

```{r, echo=TRUE}
summary(cars)
```

As always, there is much more information on how to work with RMarkdown on the webs.

The final code chunk will look something like this in the output:

library(ggplot2)
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

Once I’m happy with my finished blog post, I’m changing the output format in the YAML header to create a markdown file (I’m using markdown_github) and knit again.

output: 
  md_document:
    variant: markdown_github

This will produce a markdown document (ending in .md) of your blog post. This markdown is now almost ready to be published but first you need to add a header in the markdown document to the top of your finished markdown for Jekyll to recognize. An example markdown header looks like this:

---
layout: post
title: "This is a test post for my R blog"
date: 2016-12-01
categories: rblogging
tags: test ggplot2
---

It needs to contain the layout definition for post, the title and date. Categories and tags are optional but I like using them to categorize the posts I have written and make them easier to find again.

Now this file is ready to go on your blog.

Publishing a blog post

I’m working in a different directory to produce my posts and test out things (in this example called “blog_prep”), so once I have my markdown file ready (in this example called “example_post.md”), I am going back to the terminal window and copy it to my blog repository’s folder “_post” and rename it so that it has the following format: Date-name.md (of course, you could also manually copy, paste and rename the file).

cp ../blog_prep/example_post.md _posts
mv _posts/example_post.md _posts/2016-11-20-example_post.md

If you also have figures or graphs in your blog post, there will be a folder generated along with your markdown with the name “example_post_files”. You need to copy this folder to your blog repository as well. If you gave your post a category, it will look for the files in a folder with the same name as your category with subfolders for the date. With the example above, which was categorized as “rblogging”, I will need to have/ make a folder with the same name and create subdirectories for the date on which my post is published:

mkdir -p rblogging/2016/12/01
cp ../blog_prep/example_post_files rblogging/2016/12/01

Testing your blog page locally

Before I push my updated blog to Github where it will go live, I’m testing the output locally with Jekyll. In order to be able to run Jekyll locally, you need to follow these excellent guidelines. Once you have Jekyll up and running, all you need to do is to be in your page repository and run jekyll serve.

bundle exec jekyll serve

This will build your page from the contents of your repository and you can preview whether everything looks the way you want it to.

If my new blog post looks okay and I don’t want to change anything any more, I can finally stage, commit and push the updated blog to Github.

git add *
git commit -m "name"
git push origin master

Alternatively, if you’d rather use RStudio for pushing changes to Github, you can also connect your blog repository to RStudio and stage, commit and push from there.

And voilà, your new R blog post is out there for the world to see!

Previous Page: 4 of 5 Next

« Prev 1 2 3 4 5 Next »