Creating a Gilmore Girls character network with R

2016-11-13 00:00:00 +0000

With the impending (and by many - including me - much awaited) Gilmore Girls Revival, I wanted to take a somewhat different look at our beloved characters from Stars Hollow.

I had recently read a few cool examples of how to create co-occurrence networks and wanted to combine this with an analysis similar to what David Robinson did for Love Actually.

Fortunately, there are people out there who have invested their time (and a lot of it, I imagine) in writing up transcripts for every Gilmore Girls episode. I chose www.crazy-internet-people.com’s list of transcripts.

Based on these transcripts, I calculated each main character’s number of lines per episode and, from there, the co-occurrence matrix with other characters. This told me which other major characters they appeared with in episodes, and how often. The network nicely illustrates that Lorelai has the most lines of all characters (node size reflects a character’s total number of lines across all episodes), followed by Rory. No surprise there, but interestingly, third place goes to Luke and fourth to Emily. Most interaction happens between Lorelai and Rory, of course (edge width reflects the number of co-occurrences in episodes), but Lorelai and Luke, Rory and Luke, Emily and Rory, and Lorelai and Sookie follow suit. What the network also shows is that Lorelai has major connections to most other characters - more so than Rory.

A cluster dendrogram shows us the character co-occurrence in a slightly different way: the further down in the dendrogram tree two nodes split, the more episodes these characters had in common. Again, it is no surprise that Lorelai and Rory have the most closely connected nodes, closely followed by Luke. We can also see quite nicely that the couples Sookie and Jackson, Lane and Zach, and Emily and Richard share a lot of episodes.


Of course, these numbers also reflect the total number of episodes a character was in, so there is an inherent bias: characters with brief appearances in many episodes end up more strongly connected to, e.g., Lorelai than characters with fewer but more substantial plot lines. It would have been interesting to calculate the co-occurrence per scene instead of per episode, but unfortunately this information was not given in the transcripts (if someone knows of transcripts that denote scene numbers, please contact me).
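
One way to reduce this bias would be to normalize the co-occurrence counts, for example with the Jaccard index. This is a minimal sketch (not used in this post), assuming a binary character-by-episode presence matrix; the binarized characters_2 matrix built further below would qualify:

# `presence`: binary character-by-episode matrix (1 = character appears in episode)
jaccard_co_occurrence <- function(presence) {
  shared <- presence %*% t(presence)             # episodes shared by each character pair
  totals <- rowSums(presence)                    # episodes per character
  union  <- outer(totals, totals, "+") - shared  # episodes in which either character appears
  shared / union                                 # values between 0 and 1
}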

I also wanted to see in which episodes these 20 characters appeared. Of course, Lorelai and Rory appeared in every episode, but for other characters there are clear gaps.



And finally, I wanted to know how many lines per episode each of them spoke: This boxplot shows the median number of lines per episode for each character (middle line of the boxes), as well as the lower and upper quartiles (outer edges of the boxes) and the outlier episodes (dots).



There will be a part 2 next week, where I will explore the Gilmore Girls a bit more. No spoilers, but among other things, I’ll be looking at their coffee consumption through the data lens…


For a detailed description, plus the R code for the plots, see further below or find the R Markdown on Github.

If you don’t care about the show and have not (unlike me) watched every episode at least twice, maybe you’ll be interested in using my R code to recreate a similar character network for other TV shows, movies or books (and if you do, please share them with me)!



Obtaining all episode transcripts

The transcript URLs from www.crazy-internet-people.com follow the scheme “http://www.crazy-internet-people.com/site/gilmoregirls/pages/s#/s#s/§.html” (#: number of the season, of which there are seven; §: running number of the episode, from 1 to 153). Following this scheme, I looped over all seasons and episodes and read the lines of each HTML page directly into R via its URL.
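
For example, season 1, episode 1 has running episode number 1 and translates to the following URL (a small illustration with sprintf(); the loop below builds the same URLs with paste0()):

url <- sprintf("http://www.crazy-internet-people.com/site/gilmoregirls/pages/s%d/s%ds/%d.html",
               1, 1, 1)   # season, season again, running episode number
url
## [1] "http://www.crazy-internet-people.com/site/gilmoregirls/pages/s1/s1s/1.html"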

Because the raw HTML looked a bit messy, I had to do some tidying of the text:

  • First, I grabbed only the lines that start with a character name (in all caps), which indicates who is speaking.
  • Then, I removed the remaining HTML tags and descriptors.
  • After this, only the transcript text remained, which I could transform into a data frame
  • and add season, episode number and running episode number to each line of text.
  • And finally, I combined all transcripts into one object.
for(i in 1:7){                                            # there are 7 seasons

  if(i == 1){                                             # all seasons except the first have 22 episodes (the first has 21)

    for(j in 1:21){

      cat("\nSeason", i, ", Episode", j, "\n")            # to see the progress I am printing the season and episode number

      thepage <- readLines(paste0("http://www.crazy-internet-people.com/site/gilmoregirls/pages/s", i, "/s", i, "s/", j, ".html"))

      thepage <- thepage[grep("^[[:upper:]]+:", thepage)]  # grabbing character lines only
      thepage <- gsub("\t", "", thepage)                  # removing HTML tags
      thepage <- gsub("<.*>", "", thepage)                # removing some more HTML tags

      thepage <- as.data.frame(thepage)
      thepage$season <- i                                 # add season number
      thepage$episode <- paste(i, j, sep = "_")           # add episode number
      thepage$episode_running_nr <- j                     # add running episode number

      if(i == 1 & j == 1){                                # combine all transcripts into one object
        transcripts <- thepage
      } else {
        transcripts <- rbind(transcripts, thepage)
      }
    }

    } else {                                              # repeat for seasons 2 to 7

      for(j in 1:22){

        cat("\nSeason", i, ", Episode", j, "\n")

        # to get the running episode number, I have to add the number of
        # episodes from previous seasons (21 in season 1, then 22 per season)
        n <- j + 21 + (i - 2) * 22

        # rinse and repeat
        
        thepage <- readLines(paste0("http://www.crazy-internet-people.com/site/gilmoregirls/pages/s", i, "/s", i, "s/", n, ".html"))

        thepage <- thepage[grep("^[[:upper:]]{2,}:", thepage)]
        thepage <- gsub("\t", "", thepage)
        thepage <- gsub("<.*>", "", thepage)

        thepage <- as.data.frame(thepage)
        thepage$season <- i
        thepage$episode <- paste(i, j, sep = "_")
        thepage$episode_running_nr <- n

        # season 1 already created the transcripts object, so we can always rbind here
        transcripts <- rbind(transcripts, thepage)
        }
  }
}


Some of the lines were empty, so I removed those.

transcripts$thepage <- as.character(transcripts$thepage)  # convert to character vector
transcripts <- transcripts[!transcripts$thepage == "", ]  # remove empty lines


This is what the data frame looked like at this point:

  • each row contains one line of dialogue with the character name in caps coming before their lines.
head(transcripts)
##                                          thepage season episode
## 1 LORELAI: Please, Luke. Please, please, please.      1     1_1
## 2 LUKE: How many cups have you had this morning?      1     1_1
## 3                                 LORELAI: None.      1     1_1
## 4                                  LUKE: Plus...      1     1_1
## 5            LORELAI: Five, but yours is better.      1     1_1
## 6                      LUKE: You have a problem.      1     1_1
##   episode_running_nr
## 1                  1
## 2                  1
## 3                  1
## 4                  1
## 5                  1
## 6                  1


To be able to count the characters, I separated the character names from their lines by splitting the first column at the first colon, using the tidyr package.

I also removed all leading and trailing whitespace from the character names, converted the character column to upper case, and replaced HTML entities (&#146; and &AMP;) with proper apostrophes and “AND”s. I also had to manually correct quite a few misspelled character names.

# separate first column after first colon
library(tidyr)
transcripts_2 <- separate(transcripts, "thepage", into = c("character", "dialogue"), sep = ":", extra = "merge", fill = "right")

# remove leading and trailing whitespace
transcripts_2$character <- gsub("^\\s+|\\s+$", "", transcripts_2$character)

# convert all character names to all upper case
transcripts_2$character <- toupper(transcripts_2$character)

# fix misspelled character names
transcripts_2$character <- gsub("ZACK", "ZACH", transcripts_2$character)
transcripts_2$character <- gsub("LORLEAI", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("LOREALI", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("LORELI", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("LORLAI", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("LORELA$", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("LORLELAI", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("^ORELAI", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("LOREAI", "LORELAI", transcripts_2$character)
transcripts_2$character <- gsub("^ORY", "RORY", transcripts_2$character)
transcripts_2$character <- gsub("LUK$", "LUKE", transcripts_2$character)
transcripts_2$character <- gsub("BABETE", "BABETTE", transcripts_2$character)
transcripts_2$character <- gsub("BABETTER", "BABETTE", transcripts_2$character)
transcripts_2$character <- gsub("BARBETTE", "BABETTE", transcripts_2$character)
transcripts_2$character <- gsub("BABETTE/MISS PATTY", "BABETTE AND MISS PATTY", transcripts_2$character)
transcripts_2$character <- gsub("JACKSON/SOOKIE", "JACKSON AND SOOKIE", transcripts_2$character)
transcripts_2$character <- gsub("LORELAI/SOOKIE", "LORELAI AND SOOKIE", transcripts_2$character)
transcripts_2$character <- gsub("LORELAI/RORY", "LORELAI AND RORY", transcripts_2$character)
transcripts_2$character <- gsub("TAYOR", "TAYLOR", transcripts_2$character)
transcripts_2$character <- gsub("TRISTIN", "TRISTAN", transcripts_2$character)
transcripts_2$character <- gsub("MICHE$", "MICHEL", transcripts_2$character)
transcripts_2$character <- gsub("MICHELL", "MICHEL", transcripts_2$character)
transcripts_2$character <- gsub("SOOKI$", "SOOKIE", transcripts_2$character)
transcripts_2$character <- gsub("SOOKEI", "SOOKIE", transcripts_2$character)
transcripts_2$character <- gsub("SOOKIES", "SOOKIE", transcripts_2$character)
transcripts_2$character <- gsub("Mrs.KIM", "MRS KIM", transcripts_2$character)
transcripts_2$character <- gsub("MRS.KIM", "MRS KIM", transcripts_2$character)
transcripts_2$character <- gsub("MRS KIM", "MRS KIM", transcripts_2$character)
transcripts_2$character <- gsub("RICHRAD", "RICHARD", transcripts_2$character)
transcripts_2$character <- gsub("RMILY", "EMILY", transcripts_2$character)
transcripts_2$character <- gsub("CHRISTOHPER", "CHRISTOPHER", transcripts_2$character)
transcripts_2$character <- gsub("CHRISTOPER", "CHRISTOPHER", transcripts_2$character)
transcripts_2$character <- gsub("CHRSTOPHER", "CHRISTOPHER", transcripts_2$character)
transcripts_2$character <- gsub("CHRIS$", "CHRISTOPHER", transcripts_2$character)
transcripts_2$character <- gsub("CHERRY", "SHERRY", transcripts_2$character)
transcripts_2$character <- gsub("LINDAY", "LINDSAY", transcripts_2$character)

# substitute &#146; with apostrophe
transcripts_2$character <- gsub("&#146;", "'", transcripts_2$character)

# some ANDs are written as &AMP; so they will be changed as well
transcripts_2$character <- gsub("&AMP;", "AND", transcripts_2$character)

# and finally, ANDs are converted to semicolons so combined speakers can be split later
transcripts_2$character <- gsub(" AND ", ";", transcripts_2$character)

#  and remove disclaimer lines
transcripts_2 <- transcripts_2[-which(transcripts_2$character == "DISCLAIMER"), ] 
head(transcripts_2)
##   character                                  dialogue season episode
## 1   LORELAI     Please, Luke. Please, please, please.      1     1_1
## 2      LUKE  How many cups have you had this morning?      1     1_1
## 3   LORELAI                                     None.      1     1_1
## 4      LUKE                                   Plus...      1     1_1
## 5   LORELAI                Five, but yours is better.      1     1_1
## 6      LUKE                       You have a problem.      1     1_1
##   episode_running_nr
## 1                  1
## 2                  1
## 3                  1
## 4                  1
## 5                  1
## 6                  1

This is what the data frame looked like after tidying.

nrow(transcripts_2)
## [1] 116954

In total, there are now 116,954 lines.


How many characters are there and how many lines do they have?

To find out how many characters there are across all 153 episodes of Gilmore Girls, I couldn’t simply count the unique names, because there are combined characters (e.g. Lorelai and Rory speaking together) and voice overs among them.

First, I duplicated all lines with two speakers so that they count for each character. Because I only want to count lines that can be attributed to a single named character, I then removed all character fields with generic or unspecific speakers, as well as voice overs.

Most of the characters, however, were still not recurring characters, so I filtered out all those characters that only occurred in one episode.

# separating all rows where multiple characters spoke into one line per character with duplicate line text
library(splitstackshape)
transcripts_2 <- cSplit(transcripts_2, splitCols = "character", sep = ";", direction = "long")

# manually removing generic or unspecific characters I don't want to keep
# (one combined regular expression replaces a long chain of individual subset() calls)
generic_characters <- c("VOICE", "^ALL", "AS A GROUP", "CROWD", "RADIO", "BIKERS",
                        "ANNOUNCER", "BOTH", "WOMAN", "BARTENDER", "OFFICER", "GIRL",
                        "BOY", "EVERYONE", "SUPERVISOR", "PHOTOGRAPHER", "RECEPTIONIST",
                        "CUSTOMER", "TV", "VET", "KID", "MOM", "DOCTOR", "BOUNCER",
                        "LYRICS", "SPEAKER", "TEACHER", "EMPLOYEE", "MAID", "CASHIER",
                        "MAN", "NURSE", "PLAYER", "WAITER", "WAITRESS", "DRIVER",
                        "GRANDMA", "LADIES", "OPERATOR", "TOURIST", "CHEF", "GUEST",
                        "HOSTESS", "JUDGE", "LADY", "SECRETARY", "PRIEST", "STUDENT",
                        "PROFESSOR", "ANSWERING MACHINE")

characters <- as.data.frame(table(transcripts_2$character, transcripts_2$episode))
characters <- subset(characters, !grepl(paste(generic_characters, collapse = "|"), Var1))

# find out which characters had the most lines and which to remove
library(reshape2)
characters_2 <- dcast(characters, Var1 ~ Var2)
rownames(characters_2) <- characters_2[, 1]
characters_2 <- characters_2[, -1]

characters_3 <- rowSums(characters_2)
major_characters <- characters_3[order(characters_3, decreasing = TRUE)]

characters_2 <- ifelse(characters_2 > 0, 1, 0)

characters_to_remove <- rownames(characters_2[which(rowSums(characters_2) <= 1), ])

characters <- characters[-which(characters$Var1 %in% characters_to_remove), ]
length(unique(as.character(characters$Var1)))
## [1] 153

This leaves us with these 153 characters:

unique(as.character(characters$Var1))
##   [1] "AK"               "ALEX"             "ANDREW"          
##   [4] "ANNA"             "APRIL"            "ASHER"           
##   [7] "AUDREY"           "AUNT JUN"         "BABETTE"         
##  [10] "BARBARA"          "BILL"             "BILLY"           
##  [13] "BOBBI"            "BOBBY"            "BOOTSY"          
##  [16] "BRAD"             "BRIAN"            "BRUCE"           
##  [19] "CAESAR"           "CAESER"           "CARL"            
##  [22] "CAROL"            "CARRIE"           "CHARLESTON"      
##  [25] "CHARLIE"          "CHRISTOPHER"      "CLARA"           
##  [28] "CLAUDE"           "COLIN"            "DAVE"            
##  [31] "DEAN"             "DENNIS"           "DEREK"           
##  [34] "DOUG"             "DOYLE"            "DR SCHULTZ"      
##  [37] "DRELLA"           "ED"               "EMCEE"           
##  [40] "EMILY"            "FINN"             "FLOYD"           
##  [43] "FRAN"             "FRANCIE"          "FRANCINE"        
##  [46] "FRED"             "GIGI"             "GIL"             
##  [49] "GLENN"            "GUY"              "GYPSY"           
##  [52] "HARRY"            "HEADMASTER"       "HENRY"           
##  [55] "HONOR"            "JACK"             "JACKSON"         
##  [58] "JACOB"            "JAMIE"            "JANET"           
##  [61] "JASON"            "JESS"             "JIMMY"           
##  [64] "JOE"              "JOEL"             "JOHN"            
##  [67] "JONI"             "JORDAN"           "JULIET"          
##  [70] "KAREN"            "KIRK"             "KYLE"            
##  [73] "KYON"             "LANCE"            "LANE"            
##  [76] "LAURA"            "LINDSAY"          "LIZ"             
##  [79] "LIZA"             "LOGAN"            "LORELAI"         
##  [82] "LOU"              "LOUISE"           "LUCY"            
##  [85] "LUKE"             "LULU"             "MADELINE"        
##  [88] "MAGGIE"           "MARCIA"           "MARILYN"         
##  [91] "MARSHALL"         "MARTY"            "MAUREEN"         
##  [94] "MAX"              "MEGAN"            "MIA"             
##  [97] "MICHEL"           "MISS PATTY"       "MITCHUM"         
## [100] "MOREY"            "MR. HUNTER"       "MRS KIM"         
## [103] "MRS. CASSINI"     "MRS. KIM"         "MRS. LISTER"     
## [106] "MRS. O'MALLEY"    "MUSIC"            "NANCY"           
## [109] "NATALIE"          "NICK"             "NICOLE"          
## [112] "NORA"             "OLIVIA"           "PARIS"           
## [115] "PATTY"            "PAUL ANKA"        "PHILLIP"         
## [118] "PRINCIPAL"        "RABBI"            "RACHEL"          
## [121] "RAJ"              "REVEREND"         "REVEREND SKINNER"
## [124] "RICHARD"          "ROB"              "ROBERT"          
## [127] "RORY"             "ROSEMARY"         "RUNE"            
## [130] "SANDRA"           "SARAH"            "SHANE"           
## [133] "SHEILA"           "SHERRY"           "SHIRA"           
## [136] "SIMON"            "SOOKIE"           "SOPHIE"          
## [139] "STRAUB"           "SUE"              "SUSAN"           
## [142] "T.J.'S BROTHER"   "TANNA"            "TAYLOR"          
## [145] "TJ"               "TOBIN"            "TOM"             
## [148] "TRISTAN"          "TRIX"             "TROUBADOUR"      
## [151] "VIVIAN"           "YOUNG CHUI"       "ZACH"


Character co-occurrence network

The main part was to calculate the co-occurrence matrix of characters per episode (i.e., which characters co-occurred in episodes, and how often). Because the network would get too big with too many characters, I restricted it to the 20 main characters with the most lines across all episodes.

# extracting only those characters with the most lines
transcripts_3 <- transcripts_2[which(transcripts_2$character %in% names(major_characters[1:20])), ]
transcripts_3$season_name <- paste("Season", transcripts_3$season, sep = " ")
# create lines per episode matrix for each of these characters
library(reshape2)
speaker_scene_matrix <- transcripts_3 %>%
  acast(character ~ episode, fun.aggregate = length)

The network was calculated from the co-occurrence matrix as a weighted network. Node colors reflect the character’s gender.

# calculate co-occurrence matrix
data_matrix <- as.matrix(t(speaker_scene_matrix))
total_occurrences <- colSums(t(speaker_scene_matrix))

co_occurrence <- t(data_matrix) %*% data_matrix

# plot the network graph
library(igraph)
g <- graph.adjacency(co_occurrence,
                         weighted = TRUE,
                         diag = FALSE,
                         mode = "upper")

g <- simplify(g, remove.multiple = F, remove.loops = T, edge.attr.comb = c(weight = "sum", type = "ignore"))

females <- c("EMILY", "LANE", "LORELAI", "MISS PATTY", "PARIS", "RORY", "SOOKIE") 
        
V(g)$gender <- ifelse(V(g)$name %in% females, "female", "male")

plot(g,
     vertex.label.family = "Helvetica",
     vertex.label.font = 1,
     vertex.shape = "sphere",
     vertex.size=total_occurrences/800,
     vertex.label.cex=0.8,
     vertex.color=c( "pink", "skyblue")[1+(V(g)$gender=="male")],
     vertex.label.color="black",
     vertex.frame.color = NA,
     edge.width = E(g)$weight/100000,
     edge.curved=.1,
     layout=layout_in_circle)


The cluster dendrogram shows us the character co-occurrence in a slightly different way:

norm <- speaker_scene_matrix / rowSums(speaker_scene_matrix)

h <- hclust(dist(norm, method = "manhattan"))

plot(h)


And finally, I wanted to plot in which episodes these 20 characters appeared:

library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_grey(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "black"),
    legend.position = "none",
    legend.background = element_blank(),
    panel.margin = unit(.5, "lines"),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

ggplot(transcripts_3, aes(x = episode_running_nr, y = character, color = season_name)) +
  geom_point(size = 2) +
  geom_path(aes(group = episode_running_nr)) +
  my_theme() +
  facet_wrap(~ season_name, ncol = 2, scales = "free_x") +
  scale_color_hue(l = 50) +
  labs(y = "",
       x = "Episode running number",
       title = "Gilmore Girls major character's occurence in episodes")


And I wanted to know how many lines per episode each of them spoke:

# setting factor levels
f = characters[order(characters$Freq, decreasing = TRUE), "Var1"]
characters <- within(characters, Var1 <- factor(Var1, levels = f))

ggplot(data = subset(characters, Var1 %in% names(major_characters[1:20])), aes(x = Var1, y = Freq)) +
  geom_boxplot(fill = "navy", alpha = 0.8) +
  labs(y = "Number of Lines",
       title = "How many lines do the main characters have per epsisode?", 
       x = "Characters") +
  my_theme()



sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.1.0         igraph_1.0.1          reshape2_1.4.2       
## [4] splitstackshape_1.4.2 data.table_1.9.6      tidyr_0.6.0          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7      knitr_1.14       magrittr_1.5     munsell_0.4.3   
##  [5] colorspace_1.2-7 stringr_1.1.0    plyr_1.8.4       tools_3.3.2     
##  [9] grid_3.3.2       gtable_0.2.0     htmltools_0.3.5  yaml_2.1.13     
## [13] assertthat_0.1   digest_0.6.10    tibble_1.2       formatR_1.4     
## [17] evaluate_0.10    rmarkdown_1.1    labeling_0.3     stringi_1.1.2   
## [21] scales_0.4.0     chron_2.3-47

Is 'Yeah' Josh and Chuck's favorite word?

2016-11-06 00:00:00 +0000

Text mining and sentiment analysis of a Stuff You Should Know Podcast


Stuff You Should Know (or SYSK) is one of the many great podcasts from How Stuff Works. The two SYSK hosts Josh and Chuck have taught me so many fascinating things over the years, and today I want to use one of their podcasts to learn a little bit about text analysis in R.

Initially, I wanted to explore all SYSK podcasts. Unfortunately, however, I could only find a transcript for the episode Earwax: Live With It, posted on March 19, 2015.


The complete R code can be found at the end of this post or as an R-Markdown on Github.


The podcast transcript

I copied the episode transcript from its web page and saved it as a tab delimited text file. The file can be downloaded from Github.


Separating Josh and Chuck

Of course, I wouldn’t actually want to separate Josh and Chuck. But for comparison’s sake in this analysis, I am creating two separate files for lines of dialogue spoken by either Josh or Chuck. I am also keeping the combination of both for background information.
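
In essence, this boils down to grep() and gsub() on the speaker prefixes; a minimal sketch (the full version, which loops over the names, is in the R code section at the end of this post):

raw <- readLines("sysk_earwax.transcript.txt")

# keep only lines spoken by Josh or Chuck and strip the speaker prefix
lines_Josh  <- gsub("^Josh: ", "", raw[grep("^Josh:", raw)])
lines_Chuck <- gsub("^Chuck: ", "", raw[grep("^Chuck:", raw)])

# the combination of both is kept as background information
lines_bg <- c(lines_Josh, lines_Chuck)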


How emotional are Josh and Chuck?

Sentiment analysis

The first thing I want to explore is a sentiment analysis of the lines spoken by Josh and Chuck. Sentiment analysis categorizes text data into positive and negative sentiments and gives information about the emotional state or attitude of the speaker or the contents of a text.

I am using the package syuzhet for sentiment analysis.


NRC sentiments

Saif Mohammad’s NRC Emotion Lexicon is a collection of words that were manually categorized based on their association with the emotions anger, anticipation, disgust, fear, joy, sadness, surprise, and trust, and with positive and negative sentiments.
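
In syuzhet, the NRC scoring is a single function call; a minimal sketch, assuming the lines_Josh vector from the sketch above:

library(syuzhet)

# one row per line of dialogue, one column per NRC category;
# each cell counts the words in that line associated with the category
nrc_Josh <- get_nrc_sentiment(lines_Josh)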


What sentiment did the podcast have?

Before I go ahead with the sentiment analysis, I want to get an idea of how the podcast’s words are associated with the NRC categories.

As can be seen from the words and their associated emotions/sentiments, sentiment analysis is not perfect. Most words make a lot of intuitive sense in their category (e.g. gross, fungus and spider in disgust), but a few I find really strange (like, why would waffle be associated with anger?). Still, the majority of categorizations make sense, so let’s go ahead with the sentiment analysis.


Does the podcast’s sentiment change over time?

Sentiment analysis of each line of dialogue produces a matrix with one column per sentiment/emotion and one row per line. Each cell counts how many of the words in a line are associated with the given category; if there is no association, the value is 0. The lines of dialogue are sorted according to the original input text, which in this case means they represent the order in which they were spoken in the podcast. Because a single plot with all categories would get too big, I split the data into positive and negative sentiments and emotions.

For analysing positive and negative sentiments, syuzhet implements four different methods (syuzhet, bing, afinn and nrc), each of which uses a slightly different scale. But all of them assign negative values to indicate negative sentiment and positive values to indicate positive sentiment.

All of the methods rely on a precomputed lexicon of word-sentiment score associations. The emotional or sentiment valence is then computed based on the scores of the words from each line of dialogue.
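
A sketch of how these four lexicons are applied with get_sentiment(), again assuming lines_Josh from above (the full code for both speakers is at the end):

# polarity scores per line of dialogue; negative values indicate negative sentiment
sentiment_Josh <- data.frame(line_number = 1:length(lines_Josh),
                             syuzhet = get_sentiment(lines_Josh, method = "syuzhet"),
                             bing    = get_sentiment(lines_Josh, method = "bing"),
                             afinn   = get_sentiment(lines_Josh, method = "afinn"),
                             nrc     = get_sentiment(lines_Josh, method = "nrc"))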

The upper two plots show on the x-axis the progression of dialogue over time with each point being a line of dialogue. The sentiment score on the y-axis shows the intensity of the sentiment or emotions in the respective line of dialogue, i.e. the more words in a line were associated with the given category, the higher the line’s sentiment score.

From a first glance at these emotions and sentiments, it seems to me that the podcast is more positive than negative, but we can get a better overview by scoring overall positive versus negative sentiment directly.

In the third graph we can see quite well that the trend goes towards positive scores, meaning the podcast is overall upbeat. While there are different peaks in both positive and negative directions in Chuck’s and Josh’s lines, there is no overall bias for one being more positive (or negative) than the other.

Finally, I am looking at the sentiment percentage values to get an idea about the percentage of positive versus negative scores along the podcast’s trajectory. Here, the podcast was divided into 20 bins and the mean sentiment valence calculated for each. This last plot shows a clear trend of increasing positivity towards the end of the podcast in Chuck’s lines. Josh on the other hand doesn’t change very much over the progression of the podcast. Interesting…
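
The binning is done with syuzhet’s get_percentage_values(); a minimal sketch:

# divide the trajectory into 20 bins and average the sentiment valence per bin
percent_vals_Josh <- get_percentage_values(get_sentiment(lines_Josh, method = "bing"), bins = 20)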


Quantitative text analysis

Building a corpus

In text analysis, a corpus refers to a collection of documents. Here, I am using the tm package to create my corpus from the character vectors of Josh’s and Chuck’s lines. SnowballC is used for word stemming.

Before I can analyse the text data meaningfully, however, I have to do some pre-processing (a condensed code sketch follows this list):

  1. Removing punctuation

    Here, I am removing all punctuation marks. Before I do that, I transform all hyphens to a space, because the text includes some words connected by hyphens, which would otherwise be merged into one word if I simply removed the hyphen with the removePunctuation function.

  2. Transforming to lower case

    R character string processing is case-sensitive, so everything will be converted to lower case.

  3. Stripping numbers

    Numbers are usually not very meaningful for text analysis (if we are not specifically interested in dates, for example), so they are removed as well.

  4. Removing stopwords

    Stopwords are collections of very common words which by themselves don’t tell us very much about the content of a text (e.g. and, but, the, however, etc.). The tm package includes a list of stopwords for the English language.

  5. Stripping whitespace

    I’m also removing superfluous whitespace around words.

  6. Stemming

    Finally, I’m stemming the words in the corpus. This means that words with a common root are shortened to this root. Even though stemming algorithms are not perfect, they allow us to compare conjugated words from the same origin.
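
This is a condensed sketch of the pipeline with tm (the full version, which additionally converts “said” to “say” and removes a few custom stopwords, is in the R code section):

library(tm)
library(SnowballC)

corpus <- Corpus(VectorSource(lines_Josh))

transform_to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, transform_to_space, "-")            # 1a. hyphens to spaces
corpus <- tm_map(corpus, removePunctuation)                  # 1b. remove punctuation
corpus <- tm_map(corpus, content_transformer(tolower))       # 2. lower case
corpus <- tm_map(corpus, removeNumbers)                      # 3. strip numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # 4. remove stopwords
corpus <- tm_map(corpus, stripWhitespace)                    # 5. strip whitespace
corpus <- tm_map(corpus, stemDocument)                       # 6. stemming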


Creating the Document Term Matrix

The document term matrix (DTM) lists the number of occurrences of each word per document in the corpus. Here, each document in the corpus represents one line of dialogue from the original transcript.

By restricting the DTM to words with a minimum number of letters and an occurrence in at least a minimum number of lines of dialogue (cutoff), we exclude less specific terms.
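
A sketch of such a restricted DTM, mirroring the settings used in the R code section (words with 3 to 100 letters that occur in at least 3 lines of dialogue; the code at the end uses the number of lines as the upper bound, which likewise keeps all terms):

dtm <- DocumentTermMatrix(corpus,
                          control = list(wordLengths = c(3, 100),
                                         bounds = list(global = c(3, Inf))))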


Is “Yeah” Josh and Chuck's favorite word?

Most frequent words

By summing up the occurrences of each word over all documents we get the word count frequencies.
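
In code, this is a single colSums() over the (dense) document term matrix:

# word count frequencies: sum each term's occurrences over all documents
freq <- colSums(as.matrix(dtm))
head(sort(freq, decreasing = TRUE))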

Counting only the stemmed words that passed the cutoffs I set for the DTM, Josh and Chuck spoke roughly the same number of words (Josh: 880, Chuck: 865) and had almost the same number of dialogue lines (Josh: 202, Chuck: 193). So, good job on neither one dominating the discussion. ;-)

The lefthand plot shows the number of words spoken per line of dialogue: the background barplot shows the mean number of words per line, while the boxplot shows the distribution of the individual data points (each underlying point is one line of dialogue and its corresponding word count). While the total number of words and of dialogue lines were basically the same, Chuck’s lines had a stronger deviation around the median, with a few very long lines. Josh, on the other hand, seems to have spoken lines of more consistent length.

The righthand plot shows the most common words and how often they were used overall (red) and by Josh and Chuck separately (green and blue). The most frequent words include (not surprisingly) “ear” and “earwax”, but funnily also “yeah”. To be honest, while listening to the podcast I never noticed yeah being said exceptionally often, but I guess the data doesn’t lie…

Wordclouds are another way to visualize the frequency of words. The frequency is indicated both by the size of the words (bigger words are more frequent than smaller words) and their color.
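
A minimal wordcloud sketch, assuming the named frequency vector freq from above:

library(wordcloud)
library(RColorBrewer)

set.seed(42)   # a fixed seed makes the layout reproducible
wordcloud(names(freq), freq, min.freq = 8, colors = rev(brewer.pal(8, "Spectral")))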


Word association

Word associations greater than 0.6 were plotted in a heatmap to find words that most often co-occurred in the same line of dialogue.

Among the most conspicuous associations were cotton and swab in Josh’s lines (this was probably a hyphenated word to begin with: cotton-swab) and secret and gland in Chuck’s lines (probably secretory gland).
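
The associations themselves come from tm’s findAssocs(), which reports terms whose occurrence correlates with a given term above a threshold, for example:

# terms whose counts correlate with "ear" at 0.4 or higher
findAssocs(dtm, "ear", 0.4)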


Hierarchical clustering

Hierarchical clustering can be used to classify words by sorting them into clusters according to similarity of occurrence (i.e. their frequency or count).

We already knew that the words “earwax”, “ear” and “yeah” were the most common, so they were clustered accordingly.
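
A simplified sketch of the clustering (the full version, which draws a combined dendrogram for Josh, Chuck and the background with ggdendro, is in the R code section):

# cluster words occurring more than 10 times by their frequency
frequent <- freq[freq > 10]
hc <- hclust(dist(frequent, method = "maximum"), method = "average")
plot(hc)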


Knowledge for the masses

Shorter words are more frequent than longer words

Most words have 4 or 5 letters; only a handful are longer than 7 letters. We don’t have words with fewer than 3 letters because those were excluded at the beginning when obtaining the DTM.
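
A quick sketch of this word length distribution, again assuming freq from above:

# number of letters per evaluated word
word_lengths <- nchar(names(freq))
hist(word_lengths, main = "Histogram of word length", xlab = "Letters")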

As we can see above, there is only a very small correlation between the length of words and how often they are used. As expected from what we intuitively know, the most common words tend to be shorter, while long words are used only occasionally because they are often more specific terms. And there is no real difference between Josh and Chuck when it comes to the length (and complexity?) of the words they use. In general, the words they use are rather on the short side, which makes sense, as a big part of what makes their podcast so great is that they convey information in a down-to-earth, understandable way.

The frequency plot of all the letters in the alphabet shows that vowels are more common than consonants.


The plot above shows how often each letter occurs at which position in a word, across all the words used by Josh and Chuck. For example, the letter j occurred only once, and at position one (so 100% of js are at this position). The other letters are more evenly distributed, but in line with the fact that few words were longer than 6 letters, there are fewer letters at positions 7 to 10.


Conclusion

This little excursion into text analysis offered an interestingly different look at a podcast one would normally “evaluate” intuitively while listening. This cold, hard look through the data lens can highlight aspects that would probably be overlooked otherwise; for example, I never noticed that Josh and Chuck used the word yeah that much!

It would be very interesting to broaden the analysis to more podcasts to see if the yeah-thing was just a fluke of this episode or whether it’s a recurring thing, maybe by trying speech-to-text conversion tools.



R code

# setting my custom theme of choice
library(ggplot2)
library(ggrepel)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_grey(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "navy"),
    legend.position = "bottom",
    legend.background = element_blank(),
    panel.margin = unit(.5, "lines"),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}
# reading lines of transcript
raw <- readLines("sysk_earwax.transcript.txt")
head(raw)
## [1] "Josh: Hey, and welcome to the podcast. I'm Josh Clark. There is Charles W. \"Chuck\" Bryant, there is Jeri. Yeah, it's Stuff You Should Know."
## [2] ""                                                                                                                                             
## [3] "Chuck: He just shrugged."                                                                                                                     
## [4] ""                                                                                                                                             
## [5] "Josh: Yeah, like \"eh, what are we going to do? That's what we are.\""                                                                        
## [6] ""
# extracting lines beginning with Josh:/ Chuck: from transcript by looping over the names
names <- c("Josh", "Chuck")

for (name in names){
  # grep lines from Josh or Chuck
  assign(paste("lines", name, sep = "_"), as.character(raw[grep(paste0("^", name, ":"), raw)]))
}
# and removing the beginning of each line that indicates who's speaking
lines_Josh <- gsub("^Josh: ", "", lines_Josh)
lines_Chuck <- gsub("^Chuck: ", "", lines_Chuck)

# Combining Josh's and Chuck's lines
lines_bg <- c(lines_Josh, lines_Chuck)

names <- c("Josh", "Chuck", "bg")
# get NRC sentiments for 
# a) each line of dialogue and
# b) each word spoken in the podcast
library(syuzhet)

for (name in names){
  assign(paste("get_nrc_sentiment", name, sep = "_"), data.frame(line_number = 1:length(get(paste("lines", name, sep = "_"))), 
                                                                 get_nrc_sentiment(get(paste("lines", name, sep = "_")))))
  
  # tokenize the lines into single words
  tokens <- get_tokens(get(paste("lines", name, sep = "_")))
  assign(paste("get_nrc_sentiments_tokens", name, sep = "_"), data.frame(word = tokens,
                                                                         get_nrc_sentiment(tokens)))
}
# gather word sentiments for plotting
library(tidyr)

get_nrc_sentiments_tokens_bg_gather <- get_nrc_sentiments_tokens_bg %>% 
  gather(word, sentiment, anger:positive)
colnames(get_nrc_sentiments_tokens_bg_gather)[2:3] <- c("sentiment", "value")
get_nrc_sentiments_tokens_bg_gather$name <- "bg"
get_nrc_sentiments_tokens_bg_gather <- get_nrc_sentiments_tokens_bg_gather[which(get_nrc_sentiments_tokens_bg_gather$value != 0), ]
get_nrc_sentiments_tokens_bg_gather <- get_nrc_sentiments_tokens_bg_gather[!duplicated(get_nrc_sentiments_tokens_bg_gather$word), ]
ggplot(data = get_nrc_sentiments_tokens_bg_gather, aes(x = value, y = value, color = word, label = word)) + 
  geom_line(size = 1) +
  facet_wrap(~ sentiment, ncol = 2) +
  my_theme() + 
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    legend.position = "none",
    panel.grid.major = element_blank()) +
  geom_text_repel(segment.color = "aliceblue", segment.alpha = 0) +
  labs(title = "Sentiment categories of words used in podcast")
get_nrc_sentiment_Josh_gather <- get_nrc_sentiment_Josh %>% 
  gather(line_number, sentiment, anger:positive)
colnames(get_nrc_sentiment_Josh_gather)[2:3] <- c("sentiment", "value")
get_nrc_sentiment_Josh_gather$name <- "Josh"

get_nrc_sentiment_Chuck_gather <- get_nrc_sentiment_Chuck %>% 
  gather(line_number, sentiment, anger:positive)
colnames(get_nrc_sentiment_Chuck_gather)[2:3] <- c("sentiment", "value")
get_nrc_sentiment_Chuck_gather$name <- "Chuck"

get_nrc_sentiment_gather <- rbind(get_nrc_sentiment_Josh_gather, get_nrc_sentiment_Chuck_gather)
# split into positive and negative emotions/ sentiments
get_nrc_sentiment_gather_pos <- get_nrc_sentiment_gather[which(get_nrc_sentiment_gather$sentiment %in% 
                                                                 c("anticipation", "joy", "positive", "surprise", "trust")), ]
get_nrc_sentiment_gather_neg <- get_nrc_sentiment_gather[which(get_nrc_sentiment_gather$sentiment %in% 
                                                                 c("anger", "disgust", "fear", "negative", "sadness")), ]

library(RColorBrewer)

p1 <- ggplot(data = get_nrc_sentiment_gather_pos, aes(x = line_number, y = value)) + 
  geom_line(size = 1, color = brewer.pal(3, "Set1")[3]) +
  facet_grid(name ~ sentiment) +
  my_theme() + 
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5),
    legend.position = "none") +
  labs(x = "Dialogue line number", y = "Sentiment valence",
       title = "Sentiments during podcast progression (positive sentiments)")

p2 <- ggplot(data = get_nrc_sentiment_gather_neg, aes(x = line_number, y = value)) + 
  geom_line(size = 1, color = brewer.pal(3, "Set1")[1]) +
  facet_grid(name ~ sentiment) +
  my_theme() + 
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5),
    legend.position = "none") +
  labs(x = "Dialogue line number", y = "Sentiment valence",
       title = "Sentiments during podcast progression (negative sentiments)")
# get sentiment scores
for (name in names){
  sentiment <- data.frame(line_number = 1:length(get(paste("lines", name, sep = "_"))), 
                              syuzhet = get_sentiment(get(paste("lines", name, sep = "_")), method = "syuzhet"),
                              bing = get_sentiment(get(paste("lines", name, sep = "_")), method = "bing"),
                              afinn = get_sentiment(get(paste("lines", name, sep = "_")), method = "afinn"),
                              nrc = get_sentiment(get(paste("lines", name, sep = "_")), method = "nrc"))
  assign(paste("sentiment", name, sep = "_"), sentiment)
}
# gather for plotting
sentiment_Josh_gather <- sentiment_Josh %>% 
  gather(line_number, analysis, syuzhet:nrc)
colnames(sentiment_Josh_gather)[2:3] <- c("algorithm", "value")
sentiment_Josh_gather$name <- "Josh"

sentiment_Chuck_gather <- sentiment_Chuck %>% 
  gather(line_number, analysis, syuzhet:nrc)
colnames(sentiment_Chuck_gather)[2:3] <- c("algorithm", "value")
sentiment_Chuck_gather$name <- "Chuck"

sentiment_gather <- rbind(sentiment_Josh_gather, sentiment_Chuck_gather)
p3 <- ggplot(data = sentiment_gather, aes(x = line_number, y = value, color = algorithm)) + 
  geom_hline(aes(yintercept=0), linetype="dashed", size = 1) +
  geom_line(size = 1) +
  facet_grid(name ~ algorithm) +
  my_theme() + 
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5),
    legend.position = "none") +
  labs(x = "Dialogue line number", y = "Sentiment valence",
       title = "Sentiment during podcast progression according to different lexicons")
# get sentiment percent values
for (name in names){
  sentiment_percent_vals <- data.frame(bin = 1:20,
                              syuzhet = get_percentage_values(get_sentiment(get(paste("lines", name, sep = "_")), method = "syuzhet"), bins = 20),
                              bing = get_percentage_values(get_sentiment(get(paste("lines", name, sep = "_")), method = "bing"), bins = 20),
                              afinn = get_percentage_values(get_sentiment(get(paste("lines", name, sep = "_")), method = "afinn"), bins = 20),
                              nrc = get_percentage_values(get_sentiment(get(paste("lines", name, sep = "_")), method = "nrc"), bins = 20))
  assign(paste("sentiment_percent_vals", name, sep = "_"), sentiment_percent_vals)
}
# gather for plotting
sentiment_percent_vals_Josh_gather <- sentiment_percent_vals_Josh %>% 
  gather(bin, analysis, syuzhet:nrc)
colnames(sentiment_percent_vals_Josh_gather)[2:3] <- c("algorithm", "value")
sentiment_percent_vals_Josh_gather$name <- "Josh"

sentiment_percent_vals_Chuck_gather <- sentiment_percent_vals_Chuck %>% 
  gather(bin, analysis, syuzhet:nrc)
colnames(sentiment_percent_vals_Chuck_gather)[2:3] <- c("algorithm", "value")
sentiment_percent_vals_Chuck_gather$name <- "Chuck"

sentiment_percent_vals_gather <- rbind(sentiment_percent_vals_Josh_gather, sentiment_percent_vals_Chuck_gather)
p4 <- ggplot(data = sentiment_percent_vals_gather, aes(x = bin, y = value, color = algorithm)) + 
  geom_hline(aes(yintercept=0), linetype="dashed", size = 1) +
  geom_line(size = 1) +
  facet_grid(name ~ algorithm) +
  my_theme() + 
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5),
    legend.position = "none") +
  labs(x = "Bin", y = "Sentiment percent values",
       title = "Sentiment percent values during podcast progression")

library(gridExtra)
library(grid)

grid.arrange(p1, p2, p3, p4, ncol = 1)
library(tm)
library(SnowballC)

transform_to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

for (name in names){
  corpus <- Corpus(VectorSource(get(paste("lines", name, sep = "_"))))

  # Removing punctuation
  corpus <- tm_map(corpus, transform_to_space, "-")
  corpus <- tm_map(corpus, removePunctuation)
  
  # Transforming to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
    
  # Stripping numbers
  corpus <- tm_map(corpus, removeNumbers)
    
  # Removing stopwords
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  
  # Stripping whitespace
  corpus <- tm_map(corpus, stripWhitespace)
    
  # Stemming
  corpus <- tm_map(corpus, stemDocument)

  # Converting said to say because this isn't included in stemDocument
  corpus <- tm_map(corpus, gsub, pattern = "said", replacement = "say")
  
  # There are a couple of words not included among the stopwords that I would like to remove
  myStopwords <- c("one", "two", "three", "just", "your", "that", "can")
  corpus <- tm_map(corpus, removeWords, myStopwords)
  
  # Assigning to object and treat as text document
  assign(paste("corpus", name, sep = "_"), tm_map(corpus, PlainTextDocument))
}

corpus_Josh
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 227
corpus_Chuck
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 226
corpus_bg
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 453
# create document term matrices, keeping words with at least 3 letters that occur
# in at least 3 lines; the upper bound keeps all terms, which makes the numbers
# comparable to the combined background later
for (name in names){
  assign(paste("dtm", name, sep = "_"), DocumentTermMatrix(get(paste("corpus", name, sep = "_")), 
         control = list(wordLengths = c(3, 100), bounds = list(global = c(3, length(get(paste("lines", name, sep = "_"))))))))
}

dtm_Josh
## <<DocumentTermMatrix (documents: 227, terms: 126)>>
## Non-/sparse entries: 801/27801
## Sparsity           : 97%
## Maximal term length: 10
## Weighting          : term frequency (tf)
dtm_Chuck
## <<DocumentTermMatrix (documents: 226, terms: 122)>>
## Non-/sparse entries: 779/26793
## Sparsity           : 97%
## Maximal term length: 10
## Weighting          : term frequency (tf)
dtm_bg
## <<DocumentTermMatrix (documents: 453, terms: 260)>>
## Non-/sparse entries: 1945/115835
## Sparsity           : 98%
## Maximal term length: 10
## Weighting          : term frequency (tf)
# get word frequencies from DTM
for (name in names){
  assign(paste("freq_dtm", name, sep = "_"), colSums(as.matrix(get(paste("dtm", name, sep = "_")))))
}
# find most frequent terms and strongest associations with "ear"
for (name in names){
  print(findFreqTerms(get(paste("dtm", name, sep = "_")), lowfreq=40))
  print(findAssocs(get(paste("dtm", name, sep = "_")), "ear", 0.4))
}
## [1] "ear"    "earwax" "yeah"  
## $ear
## canal candl 
##  0.44  0.44 
## 
## [1] "ear"  "yeah"
## $ear
## canal  want clean 
##  0.54  0.44  0.42 
## 
## [1] "dont"   "ear"    "earwax" "like"   "right"  "say"    "think"  "yeah"  
## $ear
## canal candl 
##  0.49  0.42
length(freq_dtm_Josh)
## [1] 126
length(freq_dtm_Chuck)
## [1] 122
length(freq_dtm_bg)
## [1] 260
# get number of words per line
for (name in names){
  words_per_doc <- data.frame(wordcount = rowSums(as.matrix(get(paste("dtm", name, sep = "_")))))
  words_per_doc$name <- rep(name, nrow(words_per_doc))
  
  assign(paste("words_per_doc", name, sep = "_"), words_per_doc)
}

words_per_doc <- rbind(words_per_doc_Josh, words_per_doc_Chuck)
words_per_doc <- words_per_doc[which(words_per_doc$wordcount > 0), ]

word_count <- data.frame(name = c(names[-3]), 
                         number_all_words = c(sum(words_per_doc[which(words_per_doc$name == "Josh"), 1]), 
                                       sum(words_per_doc[which(words_per_doc$name == "Chuck"), 1])),
                         number_lines = c(nrow(words_per_doc[which(words_per_doc$name == "Josh"), ]), 
                                       nrow(words_per_doc[which(words_per_doc$name == "Chuck"), ])))
word_count$words_per_line <- word_count$number_all_words/word_count$number_lines
word_count
##    name number_all_words number_lines words_per_line
## 1  Josh              880          202       4.356436
## 2 Chuck              865          193       4.481865
p1 <- ggplot(words_per_doc, aes(x = name, y = wordcount, fill = name)) +
  geom_bar(data = word_count, aes(x = name, y = words_per_line, fill = name), stat = "identity", alpha = .5) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Set1", name = "") +
  my_theme() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5),
        legend.position = "none") +
  labs(y = "Number of words",
       title = "Number of words spoken\n per line of dialogue")
# create merged data frame for plotting
library(dplyr)

freq_df_Josh <- data.frame(words = names(freq_dtm_Josh), freq = freq_dtm_Josh)
freq_df_Chuck <- data.frame(words = names(freq_dtm_Chuck), freq = freq_dtm_Chuck)

freq_df <- full_join(freq_df_Josh, freq_df_Chuck, by = "words")

freq_df_bg  <- data.frame(words = names(freq_dtm_bg), freq = freq_dtm_bg)
freq_df <- full_join(freq_df, freq_df_bg, by = "words")

freq_df[is.na(freq_df)] <- 0
colnames(freq_df)[2:4] <- names
colnames(freq_df)[4] <- "background"
# subsetting only those words which occur more than 15 times in either Josh's, Chuck's or their combined lines
freq_df_subs <- freq_df[apply(freq_df[, -1], 1, function(x) any(x > 15)), ]

# gather for plotting
freq_df_subs_gather <- freq_df_subs %>% 
  gather(words, count, Josh:background)
colnames(freq_df_subs_gather)[2] <- "name"
# setting factor levels
f = as.factor(freq_df_bg[order(freq_df_bg$freq, decreasing = TRUE), "words"])
freq_df_subs_gather <- within(freq_df_subs_gather, words <- factor(words, levels = f))

p2 <- ggplot(freq_df_subs_gather, aes(x = words, y = count, fill = name)) +
  geom_bar(data = subset(freq_df_subs_gather, name == "background"), alpha = .3, stat = "identity") + 
  geom_bar(data = subset(freq_df_subs_gather, name != "background"), stat = "identity", position = position_dodge(width = .8), width = .7) + 
  scale_fill_brewer(palette = "Set1", name = "") +
  my_theme() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        legend.position = c(.9, .85)) +
  labs(y = "Frequency of use (count)",
       title = "Frequency of the most common words in Josh and Chuck's podcast")
grid.arrange(p1, p2, widths = c(0.3, 0.7), ncol = 2)
library(wordcloud)

# setting the same seed makes the wordcloud layout reproducible
set.seed(42)

layout(matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE), heights = c(0.1, 1))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Josh:", cex = 2)
plot.new()
text(x=0.5, y=0.5, "Chuck:", cex = 2)
wordcloud(names(freq_dtm_Josh), freq_dtm_Josh, min.freq = 8, colors = rev(brewer.pal(8, "Spectral")))
wordcloud(names(freq_dtm_Chuck), freq_dtm_Chuck, min.freq = 8, colors = rev(brewer.pal(8, "Spectral")))
library(dplyr)

for (name in names){
  
  for (i in 1:length(names(get(paste("freq_dtm", name, sep = "_"))))){
      associations <- as.data.frame(findAssocs(get(paste("dtm", name, sep = "_")), names(get(paste("freq_dtm", name, sep = "_")))[i], 0))
      associations$word <- rownames(associations)
  
  if (i == 1){
    associations_table <- associations[, c(2, 1)]
  } else {
    associations_table <- full_join(associations_table, associations, by = "word")
  }
  
  if (i == length(names(get(paste("freq_dtm", name, sep = "_"))))){
    associations_table[is.na(associations_table)] <- 0
    assign(paste("associations_table", name, sep = "_"), associations_table)
  }
  }
}
library(reshape2)

cor = 0.6

for (name in names){
  top_words <- get(paste("associations_table", name, sep = "_"))
  top_words <- top_words[apply(top_words[, -1], 1, function(x) any(x > cor)), c(TRUE, apply(top_words[, -1], 2, function(x) any(x > cor)))]
  
  associations_table_melt <- melt(top_words)

  p <- ggplot(data = associations_table_melt, aes(x = word, y = variable, fill = value)) + 
    geom_tile(width=.9, height=.9) +
    scale_fill_gradient2(low = "white", high = "red",  
                         limit = c(0,1), space = "Lab", 
                         name="") +
    my_theme() + 
    theme(
      axis.title = element_blank(),
      legend.position = "right") +
    coord_fixed() +
    labs(title = paste0("Word pair associations > ", cor, " in ", name, "'s lines"))
  
  print(p)
}
library(ggdendro)

for (name in names){
  # extract the frequency column of words occurring more than 10 times
  # (the words are kept as rownames and become the dendrogram labels)
  m <- as.matrix(subset(get(paste("freq_df", name, sep = "_")), freq > 10)[, "freq", drop = FALSE])
  d <- dist(m, method = "maximum")
  hc <- hclust(d, method = "average")
  
  assign(paste("hc", name, sep = "_"), dendro_data(as.dendrogram(hc)))
}
ggplot() + 
  ylim(-5, 50) +
  geom_segment(data = segment(hc_bg), aes(x = x, y = y+40, xend = xend, yend = yend+40)) + 
  geom_segment(data = segment(hc_Josh), aes(x = x+5, y = y, xend = xend+5, yend = yend)) + 
  geom_segment(data = segment(hc_Chuck), aes(x = x+22, y = y, xend = xend+22, yend = yend)) + 
  geom_text(data = label(hc_bg), aes(x = x, y = y+37, label = label, color = label), angle = 45, lineheight = 0, vjust = 0, hjust = 0.6) +
  geom_text(data = label(hc_Josh), aes(x = x+5, y = y-3, label = label, color = label), angle = 45, lineheight = 0, vjust = 0, hjust = 0.6) +
  geom_text(data = label(hc_Chuck), aes(x = x+22, y = y-3, label = label, color = label), angle = 45, lineheight = 0, vjust = 0, hjust = 0.6) +
  ggplot2::annotate("text", x = 4, y = 48, label = "Background:") +
  ggplot2::annotate("text", x = 4.9, y = 26, label = "Josh:") +
  ggplot2::annotate("text", x = 21.5, y = 26, label = "Chuck:") +
  my_theme() + 
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    legend.position = "none",
    panel.grid.major = element_blank()) +
  labs(title = "Hierarchical clustering dendrograms of words with at least 10 occurences")
freq_df$characters <- nchar(as.character(freq_df$words))  # words is a factor, so convert before counting letters
  
p1 <- ggplot(data = freq_df, aes(x = characters)) +
  geom_histogram(binwidth = 1, fill = "navy") +
  geom_vline(xintercept = round(mean(freq_df$characters), digits = 0),
             color = "red", size = 2, alpha = .5) +
  labs(x = "Letters", y = "Words",
       title = "Histogram of\nword length") +
  my_theme() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5))
freq_df_gather <- freq_df %>% 
  gather(words, count, Josh:background)
colnames(freq_df_gather)[3] <- "name"

p2 <- ggplot(subset(freq_df_gather, name != "background"), aes(x = characters, y = count, label = words)) + 
  geom_point(color = "deepskyblue4", alpha = 0.5) + 
  my_theme() +
  geom_smooth(method = "lm", se = TRUE, size = 1, color = "black") +
  facet_grid(~ name) +
  theme(axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5)) +
  labs(x = "Frequency of use (count)", y = "Word length",
       title = "Correlation between word length and frequency\nof use in Josh's and Chuck's lines") +
  geom_text_repel(data = subset(freq_df_gather, name != "background" & count > 25))
library(stringr)
letters <- unlist(sapply(freq_df$words, function(x) str_split(x, "")))

library(qdap)
letters_tab <- dist_tab(letters)

# setting factor levels
f = as.factor(rev(letters_tab$interval))
letters_tab <- within(letters_tab, interval <- factor(interval, levels = f))

p3 <- ggplot(letters_tab, aes(x = interval, y = percent)) +
  geom_bar(stat = "identity", fill = "navy", alpha = 0.8) +
  coord_flip() +
  my_theme() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5)) +
  labs(x = "Letter", y = "Percentage",
       title = "Frequency of letters")
grid.arrange(p1, p2, p3, widths = c(0.2, 0.5, 0.3), ncol = 3)
letters_df <- as.data.frame(letters)
# unlist() appended each letter's position within its word to the row names
# (e.g. "word1", "word2"), so the position can be extracted with a regex
letters_df$position <- as.factor(unlist(regmatches(rownames(letters_df), gregexpr("[0-9]+", rownames(letters_df)))))

letters_df_table <- as.data.frame(table(letters_df))
total_number_letter <- as.data.frame(table(letters_df$letters))

letters_df_table <- left_join(letters_df_table, total_number_letter, by = c("letters" = "Var1"))
letters_df_table$percent <- letters_df_table$Freq.x*100/letters_df_table$Freq.y
# mark zero counts as Inf so they can be dropped via !is.infinite() when plotting
letters_df_table$percent[letters_df_table$percent == 0] <- Inf
letters_df_table$Freq.x[letters_df_table$Freq.x == 0] <- Inf

letters_df_table_gather <- letters_df_table[, -4] %>% 
  gather(letters, position, Freq.x:percent)
colnames(letters_df_table_gather)[3:4] <- c("measure", "value")

letters_df_table_gather$measure <- ifelse(letters_df_table_gather$measure == "Freq.x", "Count", "Percent")

# setting factor levels
f = as.factor(rev(1:10))
letters_df_table_gather <- within(letters_df_table_gather, position <- factor(position, levels = f))
ggplot(data = subset(letters_df_table_gather, !is.infinite(value)), aes(x = letters, y = position, fill = value)) + 
  geom_tile(width=.9, height=.9) +
  facet_grid(measure ~ .) +
  scale_fill_gradient(low = "moccasin", high = "red") +
  my_theme() + 
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0, hjust = 0.5),
    legend.position = "right") +
  coord_fixed() +
  labs(x = "", y = "Position",
       title = "How often does each letter occur\nat which position in a word?")
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] qdap_2.2.5             qdapTools_1.3.1        qdapRegex_0.6.0       
##  [4] qdapDictionaries_1.0.6 stringr_1.1.0          ggdendro_0.1-20       
##  [7] reshape2_1.4.2         wordcloud_2.5          dplyr_0.5.0           
## [10] SnowballC_0.5.1        tm_0.6-2               NLP_0.1-9             
## [13] gridExtra_2.2.1        RColorBrewer_1.1-2     tidyr_0.6.0           
## [16] syuzhet_1.0.0          ggrepel_0.6.3          ggplot2_2.1.0         
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0        venneuler_1.1-0     slam_0.1-37        
##  [4] rJava_0.9-8         reports_0.1.4       colorspace_1.2-7   
##  [7] htmltools_0.3.5     yaml_2.1.13         chron_2.3-47       
## [10] XML_3.98-1.4        DBI_0.5-1           plyr_1.8.4         
## [13] munsell_0.4.3       gtable_0.2.0        evaluate_0.10      
## [16] labeling_0.3        knitr_1.14          gender_0.5.1       
## [19] parallel_3.3.2      xlsxjars_0.6.1      Rcpp_0.12.7        
## [22] scales_0.4.0        formatR_1.4         gdata_2.17.0       
## [25] plotrix_3.6-3       xlsx_0.5.7          openNLPdata_1.5.3-2
## [28] digest_0.6.10       stringi_1.1.2       tools_3.3.2        
## [31] bitops_1.0-6        magrittr_1.5        lazyeval_0.2.0     
## [34] RCurl_1.95-4.8      tibble_1.2          MASS_7.3-45        
## [37] data.table_1.9.6    assertthat_0.1      rmarkdown_1.1      
## [40] openNLP_0.2-6       R6_2.2.0            igraph_1.0.1

Exploring the human genome (Part 2) - Transcripts

2016-11-01 00:00:00 +0000

How many transcripts and proteins do genes have?

In Exploring the human genome (Part 1) - Gene Annotations I examined Ensembl, Entrez and HGNC gene annotations with AnnotationDbi via three R packages: org.Hs.eg.db, EnsDb.Hsapiens.v79 and TxDb.Hsapiens.UCSC.hg38.knownGene.

Now, I want to know how many transcripts there are for genes in these databases.


What is a transcript?

While a gene is defined as a unit of DNA information which encodes the production of a protein, it is really more a concept than an actual physical unit. Human genes consist of exons and introns, which can often be transcribed in different combinations - a process called alternative splicing.

“While the concept of a gene has been helpful in defining the relationship of a portion of a genome to a phenotype, this traditional term may not be as useful as it once was. Currently, “gene” has come to refer principally to a genomic region producing a polyadenylated mRNA that encodes a protein. However, the recent emergence of a large collection of unannotated transcripts with apparently little protein coding capacity, collectively called transcripts of unknown function (TUFs), has begun to blur the physical boundaries and genomic organization of genic regions with noncoding transcripts often overlapping protein-coding genes on the same (sense) and opposite strand (antisense). Moreover, they are often located in intergenic regions, making the genic portions of the human genome an interleaved network of both annotated polyadenylated and nonpolyadenylated transcripts, including splice variants with novel 5′ ends extending hundreds of kilobase. This complex transcriptional organization and other recently observed features of genomes argue for the reconsideration of the term “gene” and suggests that transcripts may be used to define the operational unit of a genome.” Thomas Gingeras, Genome Res. 2007


org.Hs.eg.db

With org.Hs.eg.db I am using Entrez and Ensembl IDs to obtain Ensembl transcript IDs, which show splice variants of a gene, including protein-coding and non-coding transcripts.

library(AnnotationDbi)
library(org.Hs.eg.db)

ENTREZID_org <- keys(org.Hs.eg.db, keytype = "ENTREZID")
ENSEMBL_org <- keys(org.Hs.eg.db, keytype = "ENSEMBL")

# Summarize number of transcripts per gene Entrez ID
org_trans_entrez <- AnnotationDbi::select(org.Hs.eg.db, keys = ENTREZID_org, columns = c("ENSEMBLTRANS"), keytype = "ENTREZID")
org_transcript_num_table_entrez <- as.data.frame(table(org_trans_entrez$ENTREZID))
colnames(org_transcript_num_table_entrez) <- c("Entrez", "orgDb")

# Summarize number of transcripts per gene Ensembl ID
org_trans_ensembl <- AnnotationDbi::select(org.Hs.eg.db, keys = ENSEMBL_org, columns = c("ENSEMBLTRANS"), keytype = "ENSEMBL")
org_transcript_num_table_ensembl <- as.data.frame(table(org_trans_ensembl$ENSEMBL))
colnames(org_transcript_num_table_ensembl) <- c("Ensembl", "orgDb")

# how many NAs are in each column?
sapply(org_trans_entrez, function(x) sum(is.na(x)))
##     ENTREZID ENSEMBLTRANS 
##            0        52081
sapply(org_trans_ensembl, function(x) sum(is.na(x)))
##      ENSEMBL ENSEMBLTRANS 
##            0        17678
head(org_trans_entrez)
##   ENTREZID    ENSEMBLTRANS
## 1        1            <NA>
## 2        2            <NA>
## 3        3 ENST00000543404
## 4        3 ENST00000566278
## 5        3 ENST00000545343
## 6        3 ENST00000544183
head(org_trans_ensembl)
##           ENSEMBL    ENSEMBLTRANS
## 1 ENSG00000121410            <NA>
## 2 ENSG00000175899            <NA>
## 3 ENSG00000256069 ENST00000543404
## 4 ENSG00000256069 ENST00000566278
## 5 ENSG00000256069 ENST00000545343
## 6 ENSG00000256069 ENST00000544183

Strangely, some genes, like A1BG (Entrez ID 1), are listed with a gene ID but no Ensembl transcript ID. This is weird, especially since I happen to know that this particular gene has several transcripts. Also, because each gene must have at least one transcript, all NAs are counted as 1 transcript in the summary table.
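To make this counting explicit, here is a minimal toy example (hypothetical data, not taken from the databases) showing that table() counts the single NA row a gene gets as one entry:

# toy example: a gene without any transcript annotation still contributes
# one (NA) row, so table() reports a count of 1 for it
toy <- data.frame(ENTREZID = c("1", "3", "3"),
                  ENSEMBLTRANS = c(NA, "ENST00000543404", "ENST00000566278"))
as.data.frame(table(toy$ENTREZID))
##   Var1 Freq
## 1    1    1
## 2    3    2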

Let’s check other databases…


TxDb.Hsapiens.UCSC.hg38.knownGene

TxDb.Hsapiens.UCSC.hg38.knownGene only has Entrez IDs to identify genes and UCSC transcript IDs to identify transcripts.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

ENTREZID_TxDb <- keys(TxDb.Hsapiens.UCSC.hg38.knownGene, keytype = "GENEID")
TxDb_trans <- AnnotationDbi::select(TxDb.Hsapiens.UCSC.hg38.knownGene, keys = ENTREZID_TxDb, columns = c("TXID"), keytype = "GENEID")

# Summarize number of transcripts per gene Entrez ID
TxDb_transcript_num_table_entrez <- as.data.frame(table(TxDb_trans$GENEID))
colnames(TxDb_transcript_num_table_entrez) <- c("Entrez", "TxDb")

# how many NAs are in each column?
sapply(TxDb_trans, function(x) sum(is.na(x)))
## GENEID   TXID 
##      0      0
head(TxDb_trans)
##   GENEID   TXID
## 1      1 166436
## 2      1 166437
## 3      1 166438
## 4      1 166439
## 5      1 166440
## 6      1 166441

Here, we don't have any NAs in the data and, in contrast to org.Hs.eg.db, we find at least six transcripts for A1BG.


EnsDb.Hsapiens.v79

Like org.Hs.eg.db, EnsDb.Hsapiens.v79 maps both Entrez and Ensembl gene IDs to Ensembl transcript IDs.

library(EnsDb.Hsapiens.v79)

ENSEMBL_EnsDb <- keys(EnsDb.Hsapiens.v79, keytype = "GENEID")
ENTREZ_EnsDb <- keys(EnsDb.Hsapiens.v79, keytype = "ENTREZID")

EnsDb_trans_ensembl <- ensembldb::select(EnsDb.Hsapiens.v79, keys = ENSEMBL_EnsDb, columns = c("TXID"), keytype = "GENEID")
EnsDb_trans_entrez <- ensembldb::select(EnsDb.Hsapiens.v79, keys = ENTREZ_EnsDb, columns = c("TXID"), keytype = "ENTREZID")

# somehow there are empty fields in the Entrez ID column, replacing them with NA
EnsDb_trans_entrez[EnsDb_trans_entrez ==  ""] <- NA

# how many NAs are in each column?
sapply(EnsDb_trans_entrez, function(x) sum(is.na(x)))
## ENTREZID     TXID 
##    47828        0
# and removing NA rows
EnsDb_trans_entrez <- EnsDb_trans_entrez[!is.na(EnsDb_trans_entrez$ENTREZID), ]
# order by Entrez ID to compare with other databases
EnsDb_trans_entrez <- EnsDb_trans_entrez[order(EnsDb_trans_entrez$ENTREZID),]

head(EnsDb_trans_ensembl)
##            GENEID            TXID
## 1 ENSG00000000003 ENST00000373020
## 2 ENSG00000000003 ENST00000496771
## 3 ENSG00000000003 ENST00000494424
## 4 ENSG00000000003 ENST00000612152
## 5 ENSG00000000003 ENST00000614008
## 6 ENSG00000000005 ENST00000373031
head(EnsDb_trans_entrez)
##        ENTREZID            TXID
## 95915         1 ENST00000596924
## 95916         1 ENST00000263100
## 95917         1 ENST00000595014
## 95918         1 ENST00000598345
## 95919         1 ENST00000600966
## 135994       10 ENST00000286479

Here, we find five transcripts for A1BG.


# Summarize number of transcripts per gene Entrez ID
EnsDb_transcript_num_table_entrez <- as.data.frame(table(EnsDb_trans_entrez$ENTREZID))
colnames(EnsDb_transcript_num_table_entrez) <- c("Entrez", "EnsDb")

# Summarize number of transcripts per gene Ensembl ID
EnsDb_transcript_num_table_ensembl <- as.data.frame(table(EnsDb_trans_ensembl$GENEID))
colnames(EnsDb_transcript_num_table_ensembl) <- c("Ensembl", "EnsDb")

# In the Entrez column, there are some with multiple entries
# divide entries with multiple gene names into one row per gene/ entry
head(EnsDb_transcript_num_table_entrez[grep(";", EnsDb_transcript_num_table_entrez$Entrez), ])
##                                     Entrez EnsDb
## 13                     100033415;100033421     1
## 15                     100033417;100033419     2
## 18                     100033421;100033415     1
## 40                     100033446;100033449     1
## 42 100033448;100033803;100033817;100033810     2
## 43                     100033449;100033446     1
library(splitstackshape)
out <- as.data.frame(cSplit(EnsDb_transcript_num_table_entrez, splitCols = "Entrez", sep = ";", direction = "long"), stringsAsFactors = FALSE)
out$Entrez <- as.character(out$Entrez)

# remove duplicates and take the mean
library(plyr)
EnsDb_transcript_num_table_entrez <- ddply(out, "Entrez", numcolwise(mean))


Comparison of transcript numbers per database package

The majority of genes have only one transcript, and the number of genes decreases as the number of transcripts per gene increases; this can be seen in the plots below.
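As a quick sanity check of this claim (a sketch using the summary tables built above), we can compute the share of genes with exactly one transcript in each package:

# percentage of genes with exactly one transcript per summary table
sapply(list(orgDb = org_transcript_num_table_entrez$orgDb,
            TxDb = TxDb_transcript_num_table_entrez$TxDb,
            EnsDb = EnsDb_transcript_num_table_entrez$EnsDb),
       function(x) round(100 * mean(x == 1), 1))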

Entrez IDs

# merging datasets by Entrez ID
library(dplyr)
transcript_num_table_entrez <- full_join(org_transcript_num_table_entrez, TxDb_transcript_num_table_entrez, by = "Entrez")
transcript_num_table_entrez <- full_join(transcript_num_table_entrez, EnsDb_transcript_num_table_entrez, by = "Entrez")

# gather for plotting
library(tidyr)
transcript_num_table_gather_entrez <- transcript_num_table_entrez %>% 
  gather(DB, count, orgDb:EnsDb)

# How many counts are NA?
sapply(transcript_num_table_gather_entrez, function(x) sum(is.na(x)))
## Entrez     DB  count 
##      0      0  71066
# removing rows with NA counts
transcript_num_table_gather_entrez <- transcript_num_table_gather_entrez[!is.na(transcript_num_table_gather_entrez$count), ]

# because there are only a handful of genes with many transcripts, they can't be plotted together with genes with few transcripts
# separating them into high and low
transcript_num_table_gather_entrez$group <- ifelse(transcript_num_table_gather_entrez$count > 25, "high", "low")

# setting factor levels
f = c("low", "high")
transcript_num_table_gather_entrez <- within(transcript_num_table_gather_entrez, group <- factor(group, levels = f))
# setting my custom theme of choice
library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_grey(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(colour = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, colour = "navy"),
    legend.position = "bottom",
    panel.margin = unit(.05, "lines"),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}
p <- ggplot(data = transcript_num_table_gather_entrez, aes(as.numeric(count))) + 
  geom_histogram(fill = "deepskyblue4") +
  my_theme() +
  labs(title = "Histogram of number of transcripts per gene (Entrez ID)", 
       x = "Number of transcripts per gene", y = "Count") +
  facet_wrap(DB ~ group, scales = "free", ncol = 2)

# annotation labels: number of genes in each facet
ann_text_entrez <- data.frame(x = c(20, 300, 20, 100, 20, 100),
                       y = c(4000, 350, 40000, 15, 5000, 280),
                       group = rep(c("low", "high"), 3),
                       DB = rep(c("EnsDb", "orgDb", "TxDb"), each = 2),
                       labs = c(paste(length(which(transcript_num_table_gather_entrez$group ==  "low" & 
                                                     transcript_num_table_gather_entrez$DB ==  "EnsDb"))), 
                                paste(length(which(transcript_num_table_gather_entrez$group ==  "high" & 
                                                     transcript_num_table_gather_entrez$DB ==  "EnsDb"))),
                                paste(length(which(transcript_num_table_gather_entrez$group ==  "low" & 
                                                     transcript_num_table_gather_entrez$DB ==  "orgDb"))), 
                                paste(length(which(transcript_num_table_gather_entrez$group ==  "high" & 
                                                     transcript_num_table_gather_entrez$DB ==  "orgDb"))),
                                paste(length(which(transcript_num_table_gather_entrez$group ==  "low" & 
                                                     transcript_num_table_gather_entrez$DB ==  "TxDb"))), 
                                paste(length(which(transcript_num_table_gather_entrez$group ==  "high" & 
                                                     transcript_num_table_gather_entrez$DB ==  "TxDb")))))

p + geom_text(data = ann_text_entrez, aes(x, y, label = labs, group = NULL), size = 8)

This time, orgDb has the highest number of Entrez ID gene entries with corresponding transcript information, the majority of which have fewer than 25 transcripts (59984 genes); only 152 genes have more than 25 transcripts. EnsDb and TxDb have comparable numbers of gene entries, both for genes with few (23971 and 24669 genes) and with many transcripts (566 and 552 genes).
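The facet counts quoted above can be cross-checked directly from the gathered table (a one-line sketch):

# number of gene entries per database package and transcript-number group
with(transcript_num_table_gather_entrez, table(DB, group))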


Ensembl IDs

# merging datasets by Ensembl ID
transcript_num_table_ensembl <- full_join(org_transcript_num_table_ensembl, EnsDb_transcript_num_table_ensembl, by = "Ensembl")

# gather for plotting
transcript_num_table_gather_ensembl <- transcript_num_table_ensembl %>% 
  gather(DB, count, orgDb:EnsDb)

# How many counts are NA?
sapply(transcript_num_table_gather_ensembl, function(x) sum(is.na(x)))
## Ensembl      DB   count 
##       0       0   38659
# removing rows with NA counts
transcript_num_table_gather_ensembl <- transcript_num_table_gather_ensembl[!is.na(transcript_num_table_gather_ensembl$count), ]

# because there are only a handful of genes with many transcripts, they can't be plotted together with genes with few transcripts
# separating them into high and low
transcript_num_table_gather_ensembl$group <- ifelse(transcript_num_table_gather_ensembl$count > 25, "high", "low")

# setting factor levels
f = c("low", "high")
transcript_num_table_gather_ensembl <- within(transcript_num_table_gather_ensembl, group <- factor(group, levels = f))
p <- ggplot(data = transcript_num_table_gather_ensembl, aes(as.numeric(count))) + 
  geom_histogram(fill = "deepskyblue4") +
  my_theme() +
  labs(title = "Histogram of number of transcripts per gene (Ensembl ID)", 
       x = "Number of transcripts per gene", y = "Count") +
  facet_wrap(DB ~ group, scales = "free", ncol = 2)

# annotation labels: number of genes in each facet
ann_text_ensembl <- data.frame(x = c(10, 75, 10, 100),
                              y = c(30000, 100, 15000, 100),
                              group = rep(c("low", "high"), 2),
                              DB = rep(c("EnsDb", "orgDb"), each = 2),
                              labs = c(paste(length(which(transcript_num_table_gather_ensembl$group ==  "low" & 
                                                            transcript_num_table_gather_ensembl$DB ==  "EnsDb"))), 
                                       paste(length(which(transcript_num_table_gather_ensembl$group ==  "high" & 
                                                            transcript_num_table_gather_ensembl$DB ==  "EnsDb"))),
                                       paste(length(which(transcript_num_table_gather_ensembl$group ==  "low" & 
                                                            transcript_num_table_gather_ensembl$DB ==  "orgDb"))), 
                                       paste(length(which(transcript_num_table_gather_ensembl$group ==  "high" & 
                                                            transcript_num_table_gather_ensembl$DB ==  "orgDb")))))

p + geom_text(data = ann_text_ensembl, aes(x, y, label = labs, group = NULL), size = 8)

For Ensembl gene IDs, EnsDb has the higher number of entries with transcript information; most of these genes (65374) have fewer than 25 transcripts. orgDb has fewer Ensembl-transcript entries overall (26980 genes with low transcript numbers), but proportionally more of them have a high number of transcripts (1035 vs 400).
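The same one-line cross-check works here (a sketch):

# number of gene entries per database package and transcript-number group
with(transcript_num_table_gather_ensembl, table(DB, group))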


Comparison of transcript numbers per gene

Now that we know what the distribution of transcript numbers per gene is in different database packages, I want to directly compare the numbers of transcripts that are identified for each gene.

# for comparison replacing NAs with 0
transcript_num_table_ensembl_NAtozero <- transcript_num_table_ensembl
transcript_num_table_ensembl_NAtozero[is.na(transcript_num_table_ensembl_NAtozero)] <- 0

transcript_num_table_entrez_NAtozero <- transcript_num_table_entrez
transcript_num_table_entrez_NAtozero[is.na(transcript_num_table_entrez_NAtozero)] <- 0
library(ggrepel)

p1 <- ggplot(transcript_num_table_ensembl_NAtozero, aes(x = orgDb, y = EnsDb)) + 
  geom_abline(linetype="dashed") +
  geom_point(colour = "deepskyblue4", alpha = 0.3) + 
  my_theme() +
  geom_smooth(size = 1, color = "black") +            # Add a loess smoothed fit curve with confidence region
  labs(title = "Number of transcripts per Ensembl ID") +
  geom_text_repel(data = subset(transcript_num_table_ensembl_NAtozero, orgDb > 140 | EnsDb > 150), aes(label = Ensembl))

p2 <- ggplot(transcript_num_table_entrez_NAtozero, aes(x = orgDb, y = TxDb)) + 
  geom_abline(linetype="dashed") + 
  geom_point(colour = "deepskyblue4", alpha = 0.3) + 
  my_theme() +
  geom_smooth(size = 1, color = "black") +
  labs(title = "Number of transcripts per Entrez ID") +
  geom_text_repel(data = subset(transcript_num_table_entrez_NAtozero, orgDb > 100 | TxDb > 200), aes(label = Entrez))

p3 <- ggplot(transcript_num_table_entrez_NAtozero, aes(x = orgDb, y = EnsDb)) + 
  geom_abline(linetype="dashed") +
  geom_point(colour = "deepskyblue4", alpha = 0.3) + 
  my_theme() +
  geom_smooth(size = 1, color = "black") +
  labs(title = "Number of transcripts per Entrez ID") +
  geom_text_repel(data = subset(transcript_num_table_entrez_NAtozero, orgDb > 100 | EnsDb > 900), aes(label = Entrez))

p4 <- ggplot(transcript_num_table_entrez_NAtozero, aes(x = TxDb, y = EnsDb)) + 
  geom_abline(linetype="dashed") +
  geom_point(colour = "deepskyblue4", alpha = 0.3) + 
  my_theme() +
  geom_smooth(size = 1, color = "black") +
  labs(title = "Number of transcripts per Entrez ID") +
  geom_text_repel(data = subset(transcript_num_table_entrez_NAtozero, TxDb > 150 | EnsDb > 900), aes(label = Entrez))

library(gridExtra)
library(grid)

grid.arrange(p1, p2, p3, p4, ncol = 2)

The plots above show, for each gene with either Entrez or Ensembl entries, the number of transcripts reported by the different database packages, with fitted loess curves and confidence regions; the dashed lines mark a 1:1 ratio. The most extreme genes are labeled with their Ensembl/Entrez ID.

While some genes do have the same number of transcripts in different database packages, the majority do not. Entrez IDs seem to give a better correlation than Ensembl IDs, except for a few extreme outliers.
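To put a number on this visual impression, here is a quick sketch computing the pairwise Pearson correlations of the transcript counts per gene between packages (using the NA-to-zero tables from above):

# pairwise Pearson correlations between transcript counts per gene
cor(transcript_num_table_entrez_NAtozero[, c("orgDb", "TxDb", "EnsDb")])
cor(transcript_num_table_ensembl_NAtozero[, c("orgDb", "EnsDb")])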


Is there a correlation between number of transcripts and number of exons of a gene?

Because the number of exons a gene has exponentially increases the number of potential transcripts that could be produced via alternative splicing, it would make sense if the number of exons correlated with the number of transcripts of a gene. However, not all combinations of exons will produce a functioning protein, so there won't be as many transcripts as theoretically possible.
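To see why the theoretical growth is exponential: with n exons, every non-empty subset of exons could in principle form a transcript, giving 2^n - 1 combinations. A back-of-the-envelope sketch (which ignores exon order and splicing constraints):

# theoretical upper bound on the number of transcripts from n exons
n_exons <- c(1, 2, 5, 10, 20)
data.frame(exons = n_exons, max_possible_transcripts = 2^n_exons - 1)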

Exon IDs are only available in TxDb.Hsapiens.UCSC.hg38.knownGene and EnsDb.Hsapiens.v79.

TxDb_exons <- AnnotationDbi::select(TxDb.Hsapiens.UCSC.hg38.knownGene, keys = ENTREZID_TxDb, columns = c("EXONID"), keytype = "GENEID")

EnsDb_exons_ensembl <- ensembldb::select(EnsDb.Hsapiens.v79, keys = ENSEMBL_EnsDb, columns = c("EXONID"), keytype = "GENEID")
EnsDb_exons_entrez <- ensembldb::select(EnsDb.Hsapiens.v79, keys = ENTREZ_EnsDb, columns = c("EXONID"), keytype = "ENTREZID")

# summarize number of exons per gene
TxDb_exons_num_table_entrez <- as.data.frame(table(TxDb_exons$GENEID))

EnsDb_exons_num_table_ensembl <- as.data.frame(table(EnsDb_exons_ensembl$GENEID))
EnsDb_exons_num_table_entrez <- as.data.frame(table(EnsDb_exons_entrez$ENTREZID))

# there are over 98000 entries without corresponding Entrez ID, removing those
EnsDb_exons_num_table_entrez <- EnsDb_exons_num_table_entrez[-1, ]
EnsDb_trans_vs_exons_ensembl <- full_join(EnsDb_transcript_num_table_ensembl, EnsDb_exons_num_table_ensembl, by = c("Ensembl" = "Var1"))
colnames(EnsDb_trans_vs_exons_ensembl)[2:3] <- c("EnsDb_transcripts", "EnsDb_exons")

EnsDb_trans_vs_exons_entrez <- full_join(EnsDb_transcript_num_table_entrez, EnsDb_exons_num_table_entrez, by = c("Entrez" = "Var1"))
colnames(EnsDb_trans_vs_exons_entrez)[2:3] <- c("EnsDb_transcripts", "EnsDb_exons")

TxDb_trans_vs_exons_entrez <- full_join(TxDb_transcript_num_table_entrez, TxDb_exons_num_table_entrez, by = c("Entrez" = "Var1"))
colnames(TxDb_trans_vs_exons_entrez)[2:3] <- c("TxDb_transcripts", "TxDb_exons")
p1 <- ggplot(EnsDb_trans_vs_exons_ensembl, aes(x = EnsDb_transcripts, y = EnsDb_exons)) + 
  geom_abline(linetype="dashed") +
  geom_point(colour = "deepskyblue4", alpha = 0.3) + 
  my_theme() +
  geom_smooth(size = 1, color = "black") +
  labs(title = "Number of transcripts vs number of exons\nper Ensembl ID in EnsDb") +
  geom_text_repel(data = subset(EnsDb_trans_vs_exons_ensembl, EnsDb_transcripts > 150 | EnsDb_exons > 300), aes(label = Ensembl))

p2 <- ggplot(EnsDb_trans_vs_exons_entrez, aes(x = EnsDb_transcripts, y = EnsDb_exons)) + 
  geom_abline(linetype="dashed") +
  geom_point(colour = "deepskyblue4", alpha = 0.3) + 
  my_theme() +
  geom_smooth(size = 1, color = "black") +
  labs(title = "Number of transcripts vs number of exons\nper Entrez ID in EnsDb") +
  geom_text_repel(data = subset(EnsDb_trans_vs_exons_entrez, EnsDb_transcripts > 300 | EnsDb_exons > 1000), aes(label = Entrez))

p3 <- ggplot(TxDb_trans_vs_exons_entrez, aes(x = TxDb_transcripts, y = TxDb_exons)) + 
  geom_abline(linetype="dashed") +
  geom_point(colour = "deepskyblue4", alpha = 0.3) + 
  my_theme() +
  geom_smooth(size = 1, color = "black") +
  labs(title = "Number of transcripts vs number of exons\nper Entrez ID in TxDb") +
  geom_text_repel(data = subset(TxDb_trans_vs_exons_entrez, TxDb_transcripts > 200 | TxDb_exons > 400), aes(label = Entrez))

grid.arrange(p1, p2, p3, ncol = 1)

Indeed, there is a reasonably good correlation between the number of transcripts and the number of exons in genes. The scatterplots above show transcript vs exon number per gene in the three database packages, with fitted loess curves and confidence regions. As expected, the increase is not exponential as theoretically possible, but it is stronger than 1:1. For Entrez IDs, in both EnsDb and TxDb, genes with up to about 100 transcripts show a steeper correlation than genes with more than 100 transcripts. For Ensembl IDs the cutoff is around 40 transcripts.


What genes have many exons but few transcripts?

Interestingly, there are a couple of genes with many exons but only a few transcripts. To find out more about these genes, I will compare their proportions of gene biotypes.

# gene biotypes are only available in EnsDb
EnsDb_trans_vs_exons_ensembl$ratio <- EnsDb_trans_vs_exons_ensembl$EnsDb_exons/EnsDb_trans_vs_exons_ensembl$EnsDb_transcripts
EnsDb_trans_vs_exons_ensembl_genes <- as.character(EnsDb_trans_vs_exons_ensembl[order(EnsDb_trans_vs_exons_ensembl$ratio, decreasing = TRUE), "Ensembl"][1:100])

EnsDb_trans_vs_exons_entrez$ratio <- EnsDb_trans_vs_exons_entrez$EnsDb_exons/EnsDb_trans_vs_exons_entrez$EnsDb_transcripts
EnsDb_trans_vs_exons_entrez_genes <- as.character(EnsDb_trans_vs_exons_entrez[order(EnsDb_trans_vs_exons_entrez$ratio, decreasing = TRUE), "Entrez"][1:100])
# get all gene biotypes for all gene IDs
biotypes_exons_trans_EnsDb_ensembl_all <- ensembldb::select(EnsDb.Hsapiens.v79, keys = EnsDb_trans_vs_exons_ensembl_genes, 
                                          columns = "GENEBIOTYPE", keytype = "GENEID")

biotypes_exons_trans_EnsDb_entrez_all <- ensembldb::select(EnsDb.Hsapiens.v79, keys = EnsDb_trans_vs_exons_entrez_genes, 
                                          columns = "GENEBIOTYPE", keytype = "ENTREZID")
# get all gene biotypes for all gene IDs
biotypes_EnsDb_ensembl_all <- ensembldb::select(EnsDb.Hsapiens.v79, keys = keys(EnsDb.Hsapiens.v79, keytype = "GENEID"), 
                                          columns = "GENEBIOTYPE", keytype = "GENEID")

biotypes_EnsDb_entrez_all <- ensembldb::select(EnsDb.Hsapiens.v79, keys = keys(EnsDb.Hsapiens.v79, keytype = "ENTREZID"), 
                                          columns = "GENEBIOTYPE", keytype = "ENTREZID")
# how many NAs are in each column?
sapply(biotypes_exons_trans_EnsDb_ensembl_all, function(x) sum(is.na(x)))
##      GENEID GENEBIOTYPE 
##           0           0
sapply(biotypes_exons_trans_EnsDb_entrez_all, function(x) sum(is.na(x)))
##    ENTREZID GENEBIOTYPE 
##           0           0
# summary table of number of genebiotypes
biotypes_exons_trans_EnsDb_ensembl_all_num_table <- as.data.frame(table(biotypes_exons_trans_EnsDb_ensembl_all$GENEBIOTYPE))
biotypes_exons_trans_EnsDb_entrez_all_num_table <- as.data.frame(table(biotypes_exons_trans_EnsDb_entrez_all$GENEBIOTYPE))
# subset to keep only those biotypes that are among the top genes
biotypes_EnsDb_ensembl_all_subs <- biotypes_EnsDb_ensembl_all[which(biotypes_EnsDb_ensembl_all$GENEBIOTYPE %in% 
                                                                      unique(biotypes_exons_trans_EnsDb_ensembl_all$GENEBIOTYPE)), ]

biotypes_EnsDb_entrez_all_subs <- biotypes_EnsDb_entrez_all[which(biotypes_EnsDb_entrez_all$GENEBIOTYPE %in% 
                                                                      unique(biotypes_exons_trans_EnsDb_entrez_all$GENEBIOTYPE)), ]

# summary table of biotypes background (all genes)
biotypes_EnsDb_ensembl_all_subs_table <- as.data.frame(table(biotypes_EnsDb_ensembl_all_subs$GENEBIOTYPE))
biotypes_EnsDb_entrez_all_subs_table <- as.data.frame(table(biotypes_EnsDb_entrez_all_subs$GENEBIOTYPE))

# add background number of biotypes to number among top genes
biotypes_EnsDb_ensembl_top_num_table <- merge(biotypes_exons_trans_EnsDb_ensembl_all_num_table, biotypes_EnsDb_ensembl_all_subs_table, by = "Var1", all.x = TRUE)
biotypes_EnsDb_entrez_top_num_table <- merge(biotypes_exons_trans_EnsDb_entrez_all_num_table, biotypes_EnsDb_entrez_all_subs_table, by = "Var1", all.x = TRUE)
# merge
biotypes_top_num_table <- merge(biotypes_EnsDb_ensembl_top_num_table, biotypes_EnsDb_entrez_top_num_table, by = "Var1", all = TRUE)
colnames(biotypes_top_num_table) <- c("GENEBIOTYPE", "Ensembl_top", "Ensembl_bg", "Entrez_top", "Entrez_bg")

biotypes_top_num_table$Ensembl <- biotypes_top_num_table$Ensembl_top/biotypes_top_num_table$Ensembl_bg
biotypes_top_num_table$Entrez <- biotypes_top_num_table$Entrez_top/biotypes_top_num_table$Entrez_bg

# gather for plotting
biotypes_top_num_table_gather <- biotypes_top_num_table[, c(1, 6, 7)] %>% 
  gather(GENEBIOTYPE, count, Ensembl:Entrez)
colnames(biotypes_top_num_table_gather)[2] <- "Gene_ID"

# replace NA with 0
biotypes_top_num_table_gather[is.na(biotypes_top_num_table_gather)] <- 0
ggplot(biotypes_top_num_table_gather, aes(x = factor(GENEBIOTYPE), y = count, fill = Gene_ID)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  my_theme() + 
  theme(legend.position = "right") + coord_flip() +
  labs(title = "Gene biotypes of genes with many exons but few transcripts", x = "", y = "Ratio of biotype count vs background number",
       fill = "Gene ID") + 
  scale_x_discrete(breaks = biotypes_top_num_table_gather$GENEBIOTYPE,
                      labels = gsub("_", " ", biotypes_top_num_table_gather$GENEBIOTYPE)) +
  scale_fill_brewer(palette = "Set1")

Compared to the background distribution of biotypes, the most strongly over-represented group is LRG genes. As defined by Ensembl's glossary:

“An LRG (Locus Reference Genomic) is a fixed sequence, independent of the genome, specifically created for the diagnostic community to record DNA sequence variation on a fixed framework. Sequence variants in LSDBs (Locus Specific Databases) are reported using LRG sequences.”

Other than this, there are only protein coding, processed and pseudogenes among the genes with many exons and few transcripts. So, there is nothing exceptional about these genes' biotypes.


What types of genes have the most transcripts?

Is there a biological function associated with having many transcripts? In order to find out more about these genes, I am looking at

  1. their biotypes and
  2. over-represented gene ontology (GO) categories.
# mean number of transcripts over all database packages
mean_num_transcripts_ensembl <- data.frame(mean_num_transcripts = rowMeans(transcript_num_table_ensembl_NAtozero[, -1]),
               row.names = transcript_num_table_ensembl_NAtozero[, 1])

mean_num_transcripts_entrez <- data.frame(mean_num_transcripts = rowMeans(transcript_num_table_entrez_NAtozero[, -1]),
               row.names = transcript_num_table_entrez_NAtozero[, 1])

# standard error of mean
mean_num_transcripts_ensembl$sem <- apply(transcript_num_table_ensembl_NAtozero[, -1], 1, function(x) sd(x)/sqrt(length(x)))
mean_num_transcripts_entrez$sem <- apply(transcript_num_table_entrez_NAtozero[, -1], 1, function(x) sd(x)/sqrt(length(x)))

# ranking the mean
mean_num_transcripts_ensembl$rank <- rank(mean_num_transcripts_ensembl$mean_num_transcripts)
mean_num_transcripts_ensembl <- mean_num_transcripts_ensembl[order(mean_num_transcripts_ensembl$rank, decreasing = TRUE), ]

mean_num_transcripts_entrez$rank <- rank(mean_num_transcripts_entrez$mean_num_transcripts)
mean_num_transcripts_entrez <- mean_num_transcripts_entrez[order(mean_num_transcripts_entrez$rank, decreasing = TRUE), ]


I am picking the top 1000 genes with the highest mean number of transcripts.

# picking top genes
mean_num_transcripts_ensembl_top <- mean_num_transcripts_ensembl[1:1000, ]
mean_num_transcripts_entrez_top <- mean_num_transcripts_entrez[1:1000, ]

# distribution of mean number of transcripts
p1 <- ggplot(mean_num_transcripts_ensembl_top, aes(x=mean_num_transcripts)) + geom_density(fill="navy") + my_theme() +
  labs(title = "Mean number of transcripts\nper Ensembl ID for the top genes", x = "mean number of transcripts pr gene")
p2 <- ggplot(mean_num_transcripts_entrez_top, aes(x=mean_num_transcripts)) + geom_density(fill="navy") + my_theme() +
  labs(title = "Mean number of transcripts\nper Entrez ID for the top genes", x = "mean number of transcripts per gene")
grid.arrange(p1, p2, ncol = 1)

The density distribution of the mean number of transcripts per gene (averaged over database packages) for the top genes with the highest transcript numbers shows the stark difference between Ensembl and Entrez IDs.
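The summary statistics behind these density plots make the difference concrete:

# spread of the mean transcript numbers among the top genes
summary(mean_num_transcripts_ensembl_top$mean_num_transcripts)
summary(mean_num_transcripts_entrez_top$mean_num_transcripts)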


Gene biotypes of top genes with highest mean number of transcripts

biotypes_EnsDb_ensembl_top <- ensembldb::select(EnsDb.Hsapiens.v79, keys = rownames(mean_num_transcripts_ensembl_top), 
                                          columns = "GENEBIOTYPE", keytype = "GENEID")

biotypes_EnsDb_entrez_top <- ensembldb::select(EnsDb.Hsapiens.v79, keys = rownames(mean_num_transcripts_entrez_top), 
                                          columns = "GENEBIOTYPE", keytype = "ENTREZID")

# how many NAs are in each column?
sapply(biotypes_EnsDb_ensembl_top, function(x) sum(is.na(x)))
##      GENEID GENEBIOTYPE 
##           0           0
sapply(biotypes_EnsDb_entrez_top, function(x) sum(is.na(x)))
##    ENTREZID GENEBIOTYPE 
##           0           0
# summary table of number of genebiotypes
biotypes_EnsDb_ensembl_top_num_table <- as.data.frame(table(biotypes_EnsDb_ensembl_top$GENEBIOTYPE))
biotypes_EnsDb_entrez_top_num_table <- as.data.frame(table(biotypes_EnsDb_entrez_top$GENEBIOTYPE))
# subset to keep only those biotypes that are among the top genes
biotypes_EnsDb_ensembl_all_subs <- biotypes_EnsDb_ensembl_all[which(biotypes_EnsDb_ensembl_all$GENEBIOTYPE %in% 
                                                                      unique(biotypes_EnsDb_ensembl_top_num_table$Var1)), ]

biotypes_EnsDb_entrez_all_subs <- biotypes_EnsDb_entrez_all[which(biotypes_EnsDb_entrez_all$GENEBIOTYPE %in% 
                                                                      unique(biotypes_EnsDb_entrez_top_num_table$Var1)), ]

# summary table of biotypes background (all genes)
biotypes_EnsDb_ensembl_all_subs_table <- as.data.frame(table(biotypes_EnsDb_ensembl_all_subs$GENEBIOTYPE))
biotypes_EnsDb_entrez_all_subs_table <- as.data.frame(table(biotypes_EnsDb_entrez_all_subs$GENEBIOTYPE))

# add background number of biotypes to number among top genes
biotypes_EnsDb_ensembl_top_num_table <- merge(biotypes_EnsDb_ensembl_top_num_table, biotypes_EnsDb_ensembl_all_subs_table, by = "Var1", all.x = TRUE)
biotypes_EnsDb_entrez_top_num_table <- merge(biotypes_EnsDb_entrez_top_num_table, biotypes_EnsDb_entrez_all_subs_table, by = "Var1", all.x = TRUE)
# merge
biotypes_top_num_table <- merge(biotypes_EnsDb_ensembl_top_num_table, biotypes_EnsDb_entrez_top_num_table, by = "Var1", all = TRUE)
colnames(biotypes_top_num_table) <- c("GENEBIOTYPE", "Ensembl_top", "Ensembl_bg", "Entrez_top", "Entrez_bg")

biotypes_top_num_table$Ensembl <- biotypes_top_num_table$Ensembl_top/biotypes_top_num_table$Ensembl_bg
biotypes_top_num_table$Entrez <- biotypes_top_num_table$Entrez_top/biotypes_top_num_table$Entrez_bg

# gather for plotting
biotypes_top_num_table_gather <- biotypes_top_num_table[, c(1, 6, 7)] %>% 
  gather(GENEBIOTYPE, count, Ensembl:Entrez)
colnames(biotypes_top_num_table_gather)[2] <- "Gene_ID"

# replace NA with 0
biotypes_top_num_table_gather[is.na(biotypes_top_num_table_gather)] <- 0
ggplot(biotypes_top_num_table_gather, aes(x = factor(GENEBIOTYPE), y = count, fill = Gene_ID)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  my_theme() + 
  theme(legend.position = "right") + coord_flip() +
  labs(title = "Gene biotypes of top genes\nwith highest transcript numbers", x = "", y = "Ratio of biotype count vs background number",
       fill = "Gene ID") + 
  scale_x_discrete(breaks = biotypes_top_num_table_gather$GENEBIOTYPE,
                      labels = gsub("_", " ", biotypes_top_num_table_gather$GENEBIOTYPE)) +
  scale_fill_brewer(palette = "Set1")

Again, compared to background, we find a high proportion of LRG, protein coding, processed and pseudogenes. But here, we also see a few other biotypes represented among genes with most transcripts. Especially interesting is that we find lincRNAs and miRNAs among them, not something I would have expected…


Enrichment Analysis of top genes with highest mean number of transcripts

Functional profiles for the top 100 genes with the most transcripts were created with clusterProfiler's compareCluster function (sketches of the underlying groupGO and enrichGO calls follow below).


Grouping GO terms

With groupGO, we can see the distribution of GO categories among input genes.
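As a minimal sketch of such a call (not necessarily the exact one behind the plots, which were produced via compareCluster; here I take the top 100 Entrez IDs by mean transcript number and group them by level-3 molecular function GO categories):

library(clusterProfiler)

# top 100 Entrez IDs by mean number of transcripts (from the ranking above)
top_genes_entrez <- rownames(mean_num_transcripts_entrez_top)[1:100]

ggo <- groupGO(gene = top_genes_entrez,
               OrgDb = org.Hs.eg.db,
               ont = "MF",
               level = 3,
               readable = TRUE)
head(summary(ggo))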


GO category enrichment

Over-representation of GO categories is calculated with enrichGO.
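A corresponding sketch for the enrichment test; using all Entrez IDs from org.Hs.eg.db as the background universe is my assumption here:

# GO over-representation test with BH-adjusted p-values
ego <- enrichGO(gene = top_genes_entrez,
                universe = ENTREZID_org,  # background gene set (assumption)
                OrgDb = org.Hs.eg.db,
                ont = "MF",
                pAdjustMethod = "BH",
                readable = TRUE)
head(summary(ego))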

Especially interesting to me is that the GO enrichment analysis of genes with many transcripts finds these genes to be overproportionally involved in immune system functions like MHC class II and antigen binding. These functions are known for their high variability, which is necessary for efficient immune defenses against a wide variety of intruders (pathogens).

We also see an enrichment of genes involved in cell-cell-adhesion, specifically cadherin binding. These have also been known for their high variability via alternative splicing.



sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12 (Sierra)
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] clusterProfiler_3.2.0                  
##  [2] DOSE_3.0.0                             
##  [3] gridExtra_2.2.1                        
##  [4] ggrepel_0.5                            
##  [5] ggplot2_2.1.0                          
##  [6] tidyr_0.6.0                            
##  [7] dplyr_0.5.0                            
##  [8] plyr_1.8.4                             
##  [9] splitstackshape_1.4.2                  
## [10] data.table_1.9.6                       
## [11] EnsDb.Hsapiens.v79_1.1.0               
## [12] ensembldb_1.6.0                        
## [13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
## [14] GenomicFeatures_1.26.0                 
## [15] GenomicRanges_1.26.0                   
## [16] GenomeInfoDb_1.10.0                    
## [17] org.Hs.eg.db_3.4.0                     
## [18] AnnotationDbi_1.36.0                   
## [19] IRanges_2.8.0                          
## [20] S4Vectors_0.12.0                       
## [21] Biobase_2.34.0                         
## [22] BiocGenerics_0.20.0                    
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.2.1                    AnnotationHub_2.6.0          
##  [3] splines_3.3.1                 shiny_0.14.1                 
##  [5] assertthat_0.1                DO.db_2.9                    
##  [7] interactiveDisplayBase_1.12.0 Rsamtools_1.26.0             
##  [9] yaml_2.1.13                   RSQLite_1.0.0                
## [11] lattice_0.20-34               chron_2.3-47                 
## [13] digest_0.6.10                 RColorBrewer_1.1-2           
## [15] XVector_0.14.0                qvalue_2.6.0                 
## [17] colorspace_1.2-7              htmltools_0.3.5              
## [19] httpuv_1.3.3                  Matrix_1.2-7.1               
## [21] XML_3.98-1.4                  biomaRt_2.30.0               
## [23] zlibbioc_1.20.0               xtable_1.8-2                 
## [25] GO.db_3.4.0                   scales_0.4.0                 
## [27] BiocParallel_1.8.0            tibble_1.2                   
## [29] mgcv_1.8-15                   SummarizedExperiment_1.4.0   
## [31] lazyeval_0.2.0                magrittr_1.5                 
## [33] mime_0.5                      evaluate_0.10                
## [35] nlme_3.1-128                  BiocInstaller_1.24.0         
## [37] tools_3.3.1                   formatR_1.4                  
## [39] stringr_1.1.0                 munsell_0.4.3                
## [41] Biostrings_2.42.0             RCurl_1.95-4.8               
## [43] igraph_1.0.1                  bitops_1.0-6                 
## [45] labeling_0.3                  rmarkdown_1.1                
## [47] gtable_0.2.0                  DBI_0.5-1                    
## [49] reshape2_1.4.1                R6_2.2.0                     
## [51] GenomicAlignments_1.10.0      knitr_1.14                   
## [53] rtracklayer_1.34.0            fastmatch_1.0-4              
## [55] fgsea_1.0.0                   stringi_1.1.2                
## [57] GOSemSim_2.0.0                Rcpp_0.12.7

Exploring the human genome (Part 1) - Gene Annotations

2016-10-23 00:00:00 +0000

When working with any type of genome data, we often look for annotation information about genes, e.g. what’s the gene’s full name, what’s its abbreviated symbol, what ID it has in other databases, what functions have been described, how many and which transcripts exist, etc.

However, when looking for this information we (luckily) find a number of different databases and packages to access their information in R. But with this often comes confusion, because they all vary - sometimes only slightly, other times quite strongly - in how many genes they list and what they consider to be a “gene”.



What exactly is a gene?

The narrow, traditional definition of a gene is that it is a hereditary unit of information: a unit of DNA which encodes the production of a protein. The Human Genome Project has estimated that the human genome comprises 20000 to 25000 genes.

However, if we take the definition of gene more liberally, we could also consider chunks of DNA which code for other functional units, like miRNAs, pseudogenes or other regulatory elements.



AnnotationDbi

The R package AnnotationDbi provides connections to a large number of annotation resources for gene and genome information of many organisms. It can access feature data from a number of gene or genome centric packages.

Here, I want to explore some of these packages and compare the information they contain.



How many genes are there in different databases?

Because genes can have identifiers from many databases (but not necessarily from all and there is not always a one-to-one mapping between them), we compare the number of unique identifiers for Entrez and Ensembl, as well as for gene symbol.

“Entrez Gene is NCBI’s database for gene-specific information. It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.” Maglott et al., Nucleic Acids Research 2005

Ensembl is a genome browser but it also includes gene annotations.

Gene symbols represent the official gene name abbreviation from HGNC.



org.Hs.eg.db

The first package I looked at was org.Hs.eg.db, which contains genome wide annotations for the human genome.

library(AnnotationDbi)
library(org.Hs.eg.db)

# We can find out which keys and/or columns can be accessed by running
# keytypes(org.Hs.eg.db) or columns(org.Hs.eg.db)

ENTREZID_org <- keys(org.Hs.eg.db, keytype = "ENTREZID")
ENSEMBL_org <- keys(org.Hs.eg.db, keytype = "ENSEMBL")
HGNC_org <- keys(org.Hs.eg.db, keytype = "SYMBOL")


EnsDb.Hsapiens.v79

EnsDb.Hsapiens.v79 contains annotation information from Ensembl. Here, we also find genes based on Ensembl, Entrez and HGNC IDs.

library(EnsDb.Hsapiens.v79)

# keytypes(EnsDb.Hsapiens.v79)

ENTREZID_EnsDb <- keys(EnsDb.Hsapiens.v79, keytype = "ENTREZID")
ENSEMBL_EnsDb <- keys(EnsDb.Hsapiens.v79, keytype = "GENEID")
HGNC_EnsDb <- keys(EnsDb.Hsapiens.v79, keytype = "SYMBOL")


TxDb.Hsapiens.UCSC.hg38.knownGene

TxDb.Hsapiens.UCSC.hg38.knownGene provides annotation information from UCSC, specifically on transcripts in the form of TxDb objects. Here, we only find comparable information on Entrez IDs.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

# keytypes(TxDb.Hsapiens.UCSC.hg38.knownGene)

ENTREZID_TxDb <- keys(TxDb.Hsapiens.UCSC.hg38.knownGene, keytype = "GENEID")


Comparison of gene numbers from the three databases accessed via the three R packages

# setting my custom theme of choice
library(ggplot2)

my_theme <- function(base_size = 12, base_family = "sans"){
  theme_grey(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 14, angle = 90, hjust = 1),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(colour = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "cornsilk", color = "maroon", size = 2),
    strip.text = element_text(face = "bold", size = 15, colour = "maroon"),
    legend.position = "bottom"
  )
}
# preparing the gene table for plotting
gene_table <- data.frame(DB = c("Entrez_orgDb", "Ensembl_orgDb", "HGNC_orgDb", 
                                "Entrez_EnsDb", "Ensembl_EnsDb", "HGNC_EnsDb", "Entrez_TxDb"), 
                         no_genes = c(length(unique(ENTREZID_org)), length(unique(ENSEMBL_org)), 
                                      length(unique(HGNC_org)), length(unique(ENTREZID_EnsDb)),
                                      length(unique(ENSEMBL_EnsDb)), length(unique(HGNC_EnsDb)), 
                                      length(unique(ENTREZID_TxDb))))

p <- ggplot(gene_table, aes(x = factor(DB), y = no_genes, label = no_genes)) + 
  geom_bar(stat = "identity", fill = "maroon") +
  my_theme() +
  labs(title = "Number of genes in databases", y = "Count", x = "")

p + geom_text(size = 6, vjust = 1.5)

When we compare the numbers of genes we can obtain from the three databases Ensembl, Entrez and HGNC via three R packages (org.Hs.eg.db, EnsDb.Hsapiens.v79 and TxDb.Hsapiens.UCSC.hg38.knownGene), we see strong differences: there seem to be either around 25000 to 28000 genes or around 60000. The only database with consistent numbers is HGNC. The divergent numbers are probably due to how each database classifies a “gene”: everything around 25000 could be because only protein coding genes are considered. We will check whether that’s indeed the case in a minute.



Overlap between databases

First, I want to know how well the database information from different packages overlaps. Venn diagrams show us the number of genes overlapping between the three databases.

library(gplots)

venn(list(orgDb = unique(ENTREZID_org), EnsDb = unique(ENTREZID_EnsDb), TxDb = unique(ENTREZID_TxDb)), 
     show.plot = TRUE, small = 0.7, showSetLogicLabel = FALSE, simplify = TRUE)
mtext("Entrez database", side=3, cex = 1.2)

venn(list(orgDb = unique(ENSEMBL_org), EnsDb = unique(ENSEMBL_EnsDb)), 
     show.plot = TRUE, small = 0.7, showSetLogicLabel = FALSE, simplify = TRUE)
mtext("Ensembl database", side=3, cex = 1.2)

While there is a reasonably large overlap between packages, there are still many genes which are either in one or the other package.

For gene names the overlap is worst, but this might be due to changing gene names (one gene often has several symbols in use, with the preferred one changing as new knowledge about the gene’s function accumulates).

For Ensembl and Entrez it seems like one of the databases simply contains more gene entries than the other(s), while the majority of genes from the smaller database are included.
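A quick numeric companion to the Venn diagrams (a sketch using the key vectors from above):

# Entrez IDs shared by all three packages
length(Reduce(intersect, list(ENTREZID_org, ENTREZID_EnsDb, ENTREZID_TxDb)))
# Entrez IDs found only in org.Hs.eg.db
length(setdiff(ENTREZID_org, union(ENTREZID_EnsDb, ENTREZID_TxDb)))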



Gene biotypes

Gene biotype annotation tells us the general category of a gene; the biggest category is protein coding genes. Biotypes allow us to check whether the different databases and packages consider different categories as “genes”.

Gene biotype information is only available for genes in EnsDb.Hsapiens.v79! Thus, the classification of genes in the other packages is biased in that only genes also present in EnsDb could be considered.

# gene dataframe for org.Hs.eg.db
gene_dataframe_orgDb <- AnnotationDbi::select(org.Hs.eg.db, keys=ENTREZID_org, columns=c("ENSEMBL", "SYMBOL"), keytype="ENTREZID")
colnames(gene_dataframe_orgDb) <- c("Entrez", "Ensembl", "HGNC")
colnames(gene_dataframe_orgDb) <- paste(colnames(gene_dataframe_orgDb), "orgDb", sep = "_")
# additional column with HGNC added for merging
gene_dataframe_orgDb$HGNC <- gene_dataframe_orgDb$HGNC_orgDb

# gene dataframe for EnsDb.Hsapiens.v79
gene_dataframe_EnsDb <- ensembldb::select(EnsDb.Hsapiens.v79, keys=ENSEMBL_EnsDb, 
                                          columns=c("ENTREZID", "SYMBOL", "GENEBIOTYPE"), keytype="GENEID")
colnames(gene_dataframe_EnsDb) <- c("Ensembl", "Entrez", "HGNC", "GENEBIOTYPE")
colnames(gene_dataframe_EnsDb) <- paste(colnames(gene_dataframe_EnsDb), "EnsDb", sep = "_")
# additional column with HGNC added for merging & keeping one copy of it in the merged dataframe
gene_dataframe_EnsDb$HGNC <- gene_dataframe_EnsDb$HGNC_EnsDb

# merging the two dataframes by HGNC
library(dplyr)
gene_dataframe <- full_join(gene_dataframe_EnsDb, gene_dataframe_orgDb, by = "HGNC")
gene_dataframe <- gene_dataframe[, -5]

# making a dataframe with additional column for merging & keeping one copy in the merged dataframe
ENTREZID_TxDb_df <- data.frame(ENTREZID = as.character(ENTREZID_TxDb), Entrez_TxDb = ENTREZID_TxDb)
gene_dataframe <- left_join(gene_dataframe, ENTREZID_TxDb_df, by = c("Entrez_orgDb" = "ENTREZID")) 
# calculating percentages of gene biotypes
for (i in 1:length(colnames(gene_dataframe))){
  key = colnames(gene_dataframe)[i]
  
  # remove duplicate and NA rows
  gene_dataframe_subs <- gene_dataframe[!duplicated(gene_dataframe[, key]), c(key, "GENEBIOTYPE_EnsDb")]
  gene_dataframe_subs <- gene_dataframe_subs[!is.na(gene_dataframe_subs[, key]), ]
  
  cat("\nKey:", key, "has", nrow(gene_dataframe_subs), "unique rows without NAs.\nOf these,", 
      sum(is.na(gene_dataframe_subs$GENEBIOTYPE_EnsDb)), "don't have a gene biotype annotation\n")
  
  if (i == 1){
    genebiotypes_table <- as.data.frame(table(gene_dataframe_subs$GENEBIOTYPE_EnsDb))
    genebiotypes_table <- rbind(genebiotypes_table,
                                      data.frame(Var1 = "NA", 
                                                 Freq = sum(is.na(gene_dataframe_subs$GENEBIOTYPE_EnsDb))))
    genebiotypes_table$percent <- round(genebiotypes_table$Freq / 
                                                sum(genebiotypes_table$Freq) * 100, digits = 4)
  } else {
      genebiotypes_table_pre <- as.data.frame(table(gene_dataframe_subs$GENEBIOTYPE_EnsDb))
      genebiotypes_table_pre <- rbind(genebiotypes_table_pre,
                                      data.frame(Var1 = "NA", 
                                                 Freq = sum(is.na(gene_dataframe_subs$GENEBIOTYPE_EnsDb))))
      genebiotypes_table_pre$percent <- round(genebiotypes_table_pre$Freq / 
                                                sum(genebiotypes_table_pre$Freq) * 100, digits = 4)
      genebiotypes_table <- full_join(genebiotypes_table, genebiotypes_table_pre, by = "Var1")
  }
}
## 
## Key: Ensembl_EnsDb has 65774 unique rows without NAs.
## Of these, 0 don't have a gene biotype annotation
## 
## Key: Entrez_EnsDb has 24263 unique rows without NAs.
## Of these, 0 don't have a gene biotype annotation
## 
## Key: HGNC_EnsDb has 59074 unique rows without NAs.
## Of these, 0 don't have a gene biotype annotation
## 
## Key: Entrez_orgDb has 60136 unique rows without NAs.
## Of these, 25740 don't have a gene biotype annotation
## 
## Key: Ensembl_orgDb has 28015 unique rows without NAs.
## Of these, 2456 don't have a gene biotype annotation
## 
## Key: HGNC_orgDb has 60083 unique rows without NAs.
## Of these, 25691 don't have a gene biotype annotation
## 
## Key: Entrez_TxDb has 25217 unique rows without NAs.
## Of these, 2087 don't have a gene biotype annotation
# create unique colnames
colnames(genebiotypes_table) <- make.unique(c("Key", rep(colnames(gene_dataframe), each = 2)))
genebiotypes_table <- genebiotypes_table[, -c(8, 9)]

# order by percentage of Ensembl EnsDb
genebiotypes_table <- genebiotypes_table[order(genebiotypes_table$Ensembl_EnsDb.1, decreasing = TRUE), ]


How many protein coding genes are in each database and annotation package?

Ensembl IDs from EnsDb.Hsapiens.v79 have the most protein coding genes, which is consistent with also having the highest number of annotations in general. However, the number of protein coding genes in the other databases/ packages is only slightly lower and fits well with the predicted 20000 to 25000.

genebiotypes_table_protein_coding <- genebiotypes_table[
  which(genebiotypes_table$Key %in% c("protein_coding", "NA")), ]

genebiotypes_table_count <- genebiotypes_table_protein_coding[1, c(1, grep(".1", colnames(genebiotypes_table_protein_coding)) - 1)]

library(tidyr)
genebiotypes_table_gather <- genebiotypes_table_count %>% 
  gather(Key, Count, Ensembl_EnsDb:Entrez_TxDb)
colnames(genebiotypes_table_gather)[2] <- "DB"

# plot
bp <- ggplot(genebiotypes_table_gather[, -1], aes(x = factor(DB), y = Count, label = Count)) + 
  geom_bar(stat = "identity", fill = "maroon") + my_theme()
bp + labs(title = "Number of protein coding genes in databases", y = "Count", x = "") +
  geom_text(size = 6, vjust = 1.5)

When looking at the percentage of protein coding genes among all gene annotations, we can confirm that the database/ package combinations with lower gene numbers (Ensembl orgDb, Entrez EnsDb and Entrez TxDb) consist mainly of protein coding genes, while databases/ packages with around 60000 gene entries consider other functional units as well.

genebiotypes_table_percent <- genebiotypes_table_protein_coding[, c(1, grep(".1", colnames(genebiotypes_table_protein_coding)))]

library(tidyr)
genebiotypes_table_gather <- genebiotypes_table_percent %>% 
  gather(Key, Percent, Ensembl_EnsDb.1:Entrez_TxDb.1)
colnames(genebiotypes_table_gather)[2] <- "DB"
genebiotypes_table_gather$DB <- gsub(".1", "", genebiotypes_table_gather$DB)

# plot
bp <- ggplot(genebiotypes_table_gather, aes(x = "", y = Percent, fill = Key)) + 
  geom_bar(width = 1, stat = "identity") + theme_minimal()

pie <- bp + coord_polar("y", start = 0) +
  ggtitle("Percentage of protein coding genes in gene databases") +
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.ticks = element_blank(),
  plot.title = element_text(size = 14, face = "bold"),
  legend.title = element_blank(),
  legend.position = "bottom",
  legend.text = element_text(size = 8)
  )
pie + facet_wrap(~DB, ncol = 4)


Percentages of gene biotypes

What are these other biotypes that most databases/ packages consider as genes? The second largest group following protein coding genes are processed pseudogenes, then come lincRNAs and other small RNAs.

genebiotypes_table_percent <- genebiotypes_table[, c(1, grep(".1", colnames(genebiotypes_table)))]
genebiotypes_table_percent$Key <- factor(genebiotypes_table_percent$Key, 
                                         levels = paste0(genebiotypes_table_percent$Key))

# gather dataframe for plotting
library(tidyr)
genebiotypes_table_gather <- genebiotypes_table_percent %>% 
  gather(Key, Percent, Ensembl_EnsDb.1:Entrez_TxDb.1)
colnames(genebiotypes_table_gather)[2] <- "DB"
genebiotypes_table_gather$DB <- gsub(".1", "", genebiotypes_table_gather$DB)

# plot
bp <- ggplot(genebiotypes_table_gather, aes(x = "", y = Percent, fill = Key)) + 
  geom_bar(width = 1, stat = "identity") + theme_minimal()

pie <- bp + coord_polar("y", start = 0) +
  ggtitle("Percentage of gene biotypes in gene databases") +
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  axis.ticks = element_blank(),
  plot.title = element_text(size = 14, face = "bold"),
  legend.title = element_blank(),
  legend.position = "bottom",
  legend.text = element_text(size = 8)
  )
pie + facet_wrap(~DB, ncol = 4)



What does this mean for genetics and genomics research?

Overall, EnsDb.Hsapiens.v79 provides the largest number of gene annotations.

However, this comparison shows that, for identifying genes and annotating them correctly in your analyses, it can be critical to consider the strengths and weaknesses of the different databases and packages you use to access the information. I believe it is well worth the time and effort to find out what information a database can or cannot give you before using it on your data.


In part 2, I will look at transcripts.



sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12 (Sierra)
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] tidyr_0.6.0                            
##  [2] dplyr_0.5.0                            
##  [3] gplots_3.0.1                           
##  [4] ggplot2_2.1.0                          
##  [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
##  [6] EnsDb.Hsapiens.v79_1.1.0               
##  [7] ensembldb_1.6.0                        
##  [8] GenomicFeatures_1.26.0                 
##  [9] GenomicRanges_1.26.0                   
## [10] GenomeInfoDb_1.10.0                    
## [11] org.Hs.eg.db_3.4.0                     
## [12] AnnotationDbi_1.36.0                   
## [13] IRanges_2.8.0                          
## [14] S4Vectors_0.12.0                       
## [15] Biobase_2.34.0                         
## [16] BiocGenerics_0.20.0                    
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.4.0    gtools_3.5.0                 
##  [3] lattice_0.20-34               colorspace_1.2-7             
##  [5] htmltools_0.3.5               rtracklayer_1.34.0           
##  [7] yaml_2.1.13                   interactiveDisplayBase_1.12.0
##  [9] XML_3.98-1.4                  DBI_0.5-1                    
## [11] BiocParallel_1.8.0            plyr_1.8.4                   
## [13] stringr_1.1.0                 zlibbioc_1.20.0              
## [15] Biostrings_2.42.0             munsell_0.4.3                
## [17] gtable_0.2.0                  caTools_1.17.1               
## [19] evaluate_0.10                 labeling_0.3                 
## [21] knitr_1.14                    biomaRt_2.30.0               
## [23] httpuv_1.3.3                  BiocInstaller_1.24.0         
## [25] Rcpp_0.12.7                   KernSmooth_2.23-15           
## [27] xtable_1.8-2                  scales_0.4.0                 
## [29] formatR_1.4                   gdata_2.17.0                 
## [31] XVector_0.14.0                mime_0.5                     
## [33] Rsamtools_1.26.0              AnnotationHub_2.6.0          
## [35] digest_0.6.10                 stringi_1.1.2                
## [37] shiny_0.14.1                  grid_3.3.1                   
## [39] tools_3.3.1                   bitops_1.0-6                 
## [41] magrittr_1.5                  lazyeval_0.2.0               
## [43] RCurl_1.95-4.8                tibble_1.2                   
## [45] RSQLite_1.0.0                 Matrix_1.2-7.1               
## [47] assertthat_0.1                rmarkdown_1.1                
## [49] httr_1.2.1                    R6_2.2.0                     
## [51] GenomicAlignments_1.10.0

USA/ Canada Roadtrip 2016

2016-10-16 00:00:00 +0000

Mapping GPS data from our USA/ Canada Roadtrip

This September we went on a road trip to the US and Canada. Of course, we had our trusty GPS to guide us along the way - I downloaded its data afterwards and played around with it.

(If you want to see a few photos from the trip and don’t care about the rest, skip to the bottom…)


Loading the data

I used the XML package to load the gpx files following this blog post.

library(XML)

myfiles <- list.files(path = "../gpx/gpx_files", full.names = TRUE)

for (i in 1:length(myfiles)){

  # One of the files seems to be broken, but since the file is one with earlier data
  # not from the trip, I will simply skip it
  tryCatch({
    pfile <- htmlTreeParse(myfiles[i], useInternalNodes = T)

  }, error=function(e){cat("ERROR\n")})

  if (exists("pfile")){
    # Get all elevations, times and coordinates via the respective xpath
    elevations <- as.numeric(as.character(xpathSApply(pfile, path = "//trkpt/ele", xmlValue)))
    times <- xpathSApply(pfile, path = "//trkpt/time", xmlValue)
    coords <- xpathSApply(pfile, path = "//trkpt", xmlAttrs)
    # Extract latitude and longitude from the coordinates
    lats <- as.numeric(as.character(coords["lat",]))
    lons <- as.numeric(as.character(coords["lon",]))
    # Put everything in a dataframe and get rid of old variables
    geodf <- data.frame(lat = lats, lon = lons, ele = elevations, time = times)

    if (i == 1){
      geodata <- geodf
      } else {
      geodata <- rbind(geodata, geodf)
    }
  }
  if (exists("pfile")) rm(pfile) # avoid a warning when parsing failed
}
geodata_all <- geodata


I then did some cleaning of the date and time column and removed all the old tracks that were not part of our trip.

# Transforming the time column
geodata_all$time <- strptime(geodata_all$time, format = "%Y-%m-%dT%H:%M:%SZ")

# ordering by date
library(plyr)
geodata_all <- arrange(geodata_all, time)

# removing all tracks from Germany
geodata <- geodata_all[which(as.numeric(as.character(geodata_all$lon)) < 0),]

# adding a column with day for separating the plotting by day later on
geodata$day <- as.factor(format(as.Date(geodata$time,format="%Y-%m-%d"), "%d"))


We have 119225 data points with a range from 2016-09-09 19:23:53 to 2016-09-23 21:36:06.

summary(geodata)
##       lat             lon              ele        
##  Min.   :36.99   Min.   :-88.68   Min.   :-26.78  
##  1st Qu.:40.09   1st Qu.:-84.86   1st Qu.:181.83  
##  Median :41.90   Median :-82.99   Median :198.65  
##  Mean   :41.27   Mean   :-82.76   Mean   :216.58  
##  3rd Qu.:42.33   3rd Qu.:-80.44   3rd Qu.:255.85  
##  Max.   :43.65   Max.   :-78.85   Max.   :438.02  
##                                                   
##       time                          day       
##  Min.   :2016-09-09 19:23:53   19     :17203  
##  1st Qu.:2016-09-12 12:14:47   12     :14455  
##  Median :2016-09-15 21:52:37   11     :14020  
##  Mean   :2016-09-15 22:35:18   10     :12061  
##  3rd Qu.:2016-09-19 15:59:09   16     :10856  
##  Max.   :2016-09-23 21:36:06   13     :10787  
##                                (Other):39843
str(geodata)
## 'data.frame':    119225 obs. of  5 variables:
##  $ lat : num  42 42 42 42 42 ...
##  $ lon : num  -87.9 -87.9 -87.9 -87.9 -87.9 ...
##  $ ele : num  122 122 122 122 122 ...
##  $ time: POSIXlt, format: "2016-09-09 19:23:53" "2016-09-09 19:23:53" ...
##  $ day : Factor w/ 15 levels "09","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...


Getting additional data from Google Maps

Somehow the GPS didn’t record the last two days of our trip back to Chicago, so I added a few waypoints with the Google Maps Geocoding and Directions APIs (accessed via the googleway package).

library(googleway)

# on the 26th we had lunch at Cracker Barrel
cracker_barrel <- google_geocode(address = "2101 N Kenyon Rd, Urbana, IL 61802, USA",
                                 key = "your_key",
                                 language = "en",
                                 simplify = TRUE)

cracker_barrel_data <- data.frame(lat = cracker_barrel$results$geometry$location$lat,
                                  lon = cracker_barrel$results$geometry$location$lng,
                                  ele = "NA", time = as.POSIXlt("2016-09-26 19:00:00"))

# then we stayed the night in Aurora/ Naperville
super8 <- google_geocode(address = "4228 Longmeadow Dr, Aurora, IL 60504, USA",
                         key = "your_key",
                         language = "en",
                         simplify = TRUE)

super8_data <- data.frame(lat = super8$results$geometry$location$lat,
                          lon = super8$results$geometry$location$lng,
                          ele = "NA", time = as.POSIXlt("2016-09-26 23:30:00"))

# on the 27th we went back to the airport to fly home
ohare <- google_geocode(address = "10000 W O'Hare Ave, Chicago, IL 60666, USA",
                        key = "your_key",
                        language = "en",
                        simplify = TRUE)

ohare_data <- data.frame(lat = ohare$results$geometry$location$lat,
                         lon = ohare$results$geometry$location$lng,
                         ele = "NA", time = as.POSIXlt("2016-09-27 18:00:00"))


Using only these three waypoints won’t plot nicely on a map because it will draw straight lines between the points. So, I also want to get travel information from Google Maps between these points to get a few more waypoints along the actual route we drove.

## Route from Paducah to Cracker Barrel

route_pad_cracker <- google_directions(origin = c(geodata$lat[nrow(geodata)], geodata$lon[nrow(geodata)]),
                  destination = c(cracker_barrel_data$lat, cracker_barrel_data$lon),
                  mode = "driving",
                  key = "your_key",
                  language = "en",
                  simplify = TRUE)

route_pad_cracker_legs <- as.data.frame(route_pad_cracker$routes$legs)
route_pad_cracker_legs_steps <- as.data.frame(route_pad_cracker_legs$steps)

## Route from Cracker Barrel to Aurora

route_cracker_aurora <- google_directions(origin = c(cracker_barrel_data$lat, cracker_barrel_data$lon),
                                          destination = c(super8_data$lat, super8_data$lon),
                                          mode = "driving",
                                          key = "your_key",
                                          language = "en",
                                          simplify = TRUE)

route_cracker_aurora_legs <- as.data.frame(route_cracker_aurora$routes$legs)
route_cracker_aurora_legs_steps <- as.data.frame(route_cracker_aurora_legs$steps)

## Route from Aurora to O'Hare

route_aurora_ohare <- google_directions(origin = c(super8_data$lat, super8_data$lon),
                                        destination = c(ohare_data$lat, ohare_data$lon),
                                        mode = "driving",
                                        key = "your_key",
                                        language = "en",
                                        simplify = TRUE)

route_aurora_ohare_legs <- as.data.frame(route_aurora_ohare$routes$legs)
route_aurora_ohare_legs_steps <- as.data.frame(route_aurora_ohare_legs$steps)

# Combining the data sets
geodata_2 <- data.frame(lat = c(geodata$lat,
                                route_pad_cracker_legs$start_location$lat, route_pad_cracker_legs_steps$start_location$lat,
                                route_pad_cracker_legs$end_location$lat, cracker_barrel_data$lat,
                                route_cracker_aurora_legs$start_location$lat, route_cracker_aurora_legs_steps$start_location$lat,
                                route_cracker_aurora_legs$end_location$lat, super8_data$lat,
                                route_aurora_ohare_legs$start_location$lat, route_aurora_ohare_legs_steps$start_location$lat,
                                route_aurora_ohare_legs$end_location$lat, ohare_data$lat),
                      lon = c(geodata$lon, route_pad_cracker_legs$start_location$lng, route_pad_cracker_legs_steps$start_location$lng,
                              route_pad_cracker_legs$end_location$lng, cracker_barrel_data$lon,
                              route_cracker_aurora_legs$start_location$lng, route_cracker_aurora_legs_steps$start_location$lng,
                              route_cracker_aurora_legs$end_location$lng, super8_data$lon,
                              route_aurora_ohare_legs$start_location$lng, route_aurora_ohare_legs_steps$start_location$lng,
                              route_aurora_ohare_legs$end_location$lng, ohare_data$lon),
                      day = c(as.character(geodata$day), rep("26",
                                               length(c(route_pad_cracker_legs$start_location$lat,
                                                        route_pad_cracker_legs_steps$start_location$lat,
                                                        route_pad_cracker_legs$end_location$lat, cracker_barrel_data$lat,
                                                        route_cracker_aurora_legs$start_location$lat,
                                                        route_cracker_aurora_legs_steps$start_location$lat,
                                                        route_cracker_aurora_legs$end_location$lat, super8_data$lat))),
                              rep("27",
                                  length(c(route_aurora_ohare_legs$start_location$lat, route_aurora_ohare_legs_steps$start_location$lat,
                                           route_aurora_ohare_legs$end_location$lat, ohare_data$lat)))))


Plotting

I tried several packages for plotting but ended up with ggmap as the nicest looking solution.


Overview of the trip

We flew to Chicago, went east across Michigan to Ann Arbor, Detroit and Toronto, then south again to Pittsburgh, from where we followed the Ohio river and headed to Louisville. From there, we went to Paducah, KY and back north again to Chicago. Each color shows a different day.

library(ggmap)

bc_bbox <- make_bbox(lat = lat, lon = lon, data = geodata_2, f = .10)
map <- get_map(location = bc_bbox, maptype = "roadmap")

ggmap(map, darken = c(0.3, "white")) + geom_point(aes(x = lon, y = lat, col = factor(day), group = factor(day)), data = geodata_2) +
  geom_path(aes(x = lon, y = lat, col = factor(day)), data = geodata_2[which(geodata_2$day %in% c("26", "27")), ]) + 
  theme(legend.position="none") + labs(x="Longitude", y="Latitude", title="Roadtrip 2016")


Road trip with stops

First, I need to obtain the geocodes from the stops (again via Google Maps API).

chicago <- google_geocode(address = "Chicago",
                          key = "your_key",
                          language = "en",
                          simplify = TRUE)

stjospeh <- google_geocode(address = "St. Joseph, MI",
                          key = "your_key",
                          language = "en",
                          simplify = TRUE)

annarbor <- google_geocode(address = "Ann Arbor, MI",
                           key = "your_key",
                           language = "en",
                           simplify = TRUE)

detroit <- google_geocode(address = "Detroit, MI",
                           key = "your_key",
                           language = "en",
                           simplify = TRUE)

burlington <- google_geocode(address = "Burlington, ON",
                          key = "your_key",
                          language = "en",
                          simplify = TRUE)

ashtabula <- google_geocode(address = "Ashtabula",
                             key = "your_key",
                             language = "en",
                             simplify = TRUE)

pittsburgh <- google_geocode(address = "Pittsburgh",
                            key = "your_key",
                            language = "en",
                            simplify = TRUE)

marietta <- google_geocode(address = "Marietta, OH",
                             key = "your_key",
                             language = "en",
                             simplify = TRUE)

maysville <- google_geocode(address = "Maysville, KY",
                             key = "your_key",
                             language = "en",
                             simplify = TRUE)

louisville <- google_geocode(address = "Louisville, KY",
                             key = "your_key",
                             language = "en",
                             simplify = TRUE)

paducah <- google_geocode(address = "Paducah, KY",
                             key = "your_key",
                             language = "en",
                             simplify = TRUE)
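
As an aside, these repeated calls could be wrapped in a small helper to avoid the copy-and-paste (a sketch using the same "your_key" placeholder; I keep the individual objects above because the labelling code below refers to them by name):

# hypothetical helper around google_geocode with the shared arguments fixed
geocode_stop <- function(address) {
  google_geocode(address = address, key = "your_key", language = "en", simplify = TRUE)
}
# e.g. chicago <- geocode_stop("Chicago")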

ann_text <- data.frame(x = c(chicago$results$geometry$location$lng,
                             stjospeh$results$geometry$location$lng,
                             annarbor$results$geometry$location$lng,
                             detroit$results$geometry$location$lng,
                             burlington$results$geometry$location$lng,
                             ashtabula$results$geometry$location$lng,
                             pittsburgh$results$geometry$location$lng,
                             marietta$results$geometry$location$lng,
                             maysville$results$geometry$location$lng,
                             louisville$results$geometry$location$lng[1], #Maps found a second entry for the location
                             paducah$results$geometry$location$lng),
                       y = c(chicago$results$geometry$location$lat,
                             stjospeh$results$geometry$location$lat,
                             annarbor$results$geometry$location$lat,
                             detroit$results$geometry$location$lat+0.1, #adjusting Detroit label to avoid overlapping with Ann Arbor
                             burlington$results$geometry$location$lat,
                             ashtabula$results$geometry$location$lat,
                             pittsburgh$results$geometry$location$lat,
                             marietta$results$geometry$location$lat,
                             maysville$results$geometry$location$lat,
                             louisville$results$geometry$location$lat[1],
                             paducah$results$geometry$location$lat),
                       labs = c(paste0(chicago$results$formatted_address),
                                paste0(stjospeh$results$formatted_address),
                                paste0(annarbor$results$formatted_address),
                                paste0(detroit$results$formatted_address),
                                paste0(burlington$results$formatted_address),
                                paste0(ashtabula$results$formatted_address),
                                paste0(pittsburgh$results$formatted_address),
                                paste0(marietta$results$formatted_address),
                                paste0(maysville$results$formatted_address),
                                paste0(louisville$results$formatted_address[1]),
                                paste0(paducah$results$formatted_address)
                                ))
map <- get_map(location = bc_bbox, maptype = "watercolor")

roadtrip_map <- ggmap(map, darken = c(0.3, "white")) + geom_point(aes(x = lon, y = lat, col = factor(day), group = factor(day)), data = geodata_2, alpha = 0.3) +
  geom_path(aes(x = lon, y = lat, col = factor(day)), data = geodata_2[which(geodata_2$day %in% c("26", "27")), ], alpha = 0.5) +
  theme(legend.position="none") + labs(x="Longitude", y="Latitude", title="Roadtrip 2016")
roadtrip_map + geom_text(data = ann_text, aes(x, y, label = labs), size = 4)


Road trip statistics

How many kilometers did we drive?

# removing duplicate entries
geodata_2 <- geodata_2[!duplicated(geodata_2), ]

# Shifting vectors for latitude and longitude to include end position
shift.vec <- function (vec, shift){
  if(length(vec) <= abs(shift)){
    rep(NA ,length(vec))
  } else {
    if (shift >= 0) {
      c(rep(NA, shift), vec[1:(length(vec)-shift)]) }
    else {
      c(vec[(abs(shift)+1):length(vec)], rep(NA, abs(shift)))
    }
  }
}

geodata_2$lat.p1 <- shift.vec(geodata_2$lat, -1)
geodata_2$lon.p1 <- shift.vec(geodata_2$lon, -1)
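
A quick sanity check of shift.vec on a hypothetical input vector:

shift.vec(1:5, -1)
# returns c(2, 3, 4, 5, NA): each point is paired with its successor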

# Calculating distances between points (in metres) with the function pointDistance from the 'raster' package.

library(raster)
geodata_2$dist.to.prev <- apply(geodata_2, 1, FUN = function (row) {
  # pointDistance expects point coordinates in (x, y) = (lon, lat) order
  pointDistance(c(as.numeric(as.character(row["lon.p1"])),
                  as.numeric(as.character(row["lat.p1"]))),
                c(as.numeric(as.character(row["lon"])), as.numeric(as.character(row["lat"]))),
                lonlat = T) # Parameter 'lonlat' has to be TRUE to get metres on lon/lat coordinates!
})
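
The totals quoted below can then be obtained by summing these point-to-point distances (a sketch; the exact rounding of the quoted numbers is my assumption):

# total distance in km; the final point has no successor, so its NA is dropped
total_km <- sum(geodata_2$dist.to.prev, na.rm = TRUE) / 1000
# ~12 actual driving days (18 days minus the ~6 rest days)
round(total_km / 12, 2)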

We drove approximately 3236.95 km in 18 days. Not counting the ~6 days where we stayed in one place and didn’t drive much, that’s around 269.75 km/ day.



Identifying local time

The time stamps are in CEST - Central European Summer Time. We arrived in the Central Time Zone (CDT) but spent the majority of our trip in the Eastern Time Zone (EDT).


Converting to CDT and EDT

library(lubridate)

geodata$time_CDT <- with_tz(geodata$time, tz="America/Chicago")
geodata$time_EDT <- with_tz(geodata$time, tz="America/New_York")

# extracting day for EDT
geodata$day_EDT <- as.factor(format(as.Date(geodata$time_EDT,format="%Y-%m-%d"), "%d"))


Identifying the local time based on geocode

# the days where we crossed time zones were the 10th and 23rd

library(googleway)

geodata_timezone <- geodata
geodata_timezone$local_time <- "NA"

for (i in 1:nrow(geodata)){
  # print a progress message every 100 rows
  if (is.na(nchar(strsplit(as.character(i/100), "\\.")[[1]][2]))){
    print(i)
  }

  if (geodata[i, "day_EDT"] == "10"){
    data_row <- geodata[i, ]
    google_timezone_api <- google_timezone(location = c(data_row$lat, data_row$lon),
                                           timestamp = as.POSIXct(data_row$time_EDT), # because the majority of our trip was in EDT
                                           key = "AIzaSyBkqrk8mBqqvao-W-jKkL2VhHCGLWqZVCY",
                                           simplify = TRUE,
                                           language = "en")

    time_zone <- google_timezone_api$timeZoneId

    local_time <- as.POSIXlt(with_tz(data_row$time, tz=paste(time_zone)))

    geodata_timezone$local_time[i] <- as.character(local_time)

    rm(data_row, google_timezone_api, time_zone)

  } else {

    if (geodata[i, "day_EDT"] == "09"){
      geodata_timezone$local_time[i] <- as.character(geodata_timezone$time_CDT[i])
    } else {
      geodata_timezone$local_time[i] <- as.character(geodata_timezone$time_EDT[i])
    }
  }
}


Plotting elevation

theme_set(theme_bw()) # Change the theme to my preference

labels <- c("9" = "9th, Chicago", "10" = "10th, St. Joseph", "11" = "11th, Ann Arbor", "12" = "12th, Detroit", "13" = "13th, Burlington",
            "14" = "14th, Burlington", "15" = "15th, Toronto", "16" = "16th, Niagara & Ashtabula", "17" = "17th, Pittsburgh",
            "18" = "18th, Pittsburgh", "19" = "19th, Marietta", "20" = "20th, Maysville", "21" = "21st, Louisville",
            "22" = "22nd, Louisville", "23" = "23rd, Paducah")

ggplot(aes(x = as.POSIXlt(local_time), y = ele), data = geodata_timezone) +
  geom_point(size = 0.5) + facet_wrap( ~ day_EDT, ncol=3, scales = "free_x", labeller=labeller(day_EDT = labels)) +
  labs(x="Local Time", y="Elevation", title="Elevation per day and time")

day_df <- data.frame(row.names = unique(geodata_timezone$day_EDT),
                     group = unique(geodata_timezone$day_EDT),
                     xmin = rep(NA, length(unique(geodata_timezone$day_EDT))),
                     xmax = rep(NA, length(unique(geodata_timezone$day_EDT))),
                     ymin = rep(-Inf, length(unique(geodata_timezone$day_EDT))),
                     ymax = rep(Inf, length(unique(geodata_timezone$day_EDT))))

# for each day, find the first and last row index to position its background rectangle
for (day in unique(geodata_timezone$day_EDT)){
  if (!day == "23"){
    day_df[paste(day), 2] <- which(geodata_timezone$day_EDT == day)[1]
    day_df[paste(day), 3] <- which(geodata_timezone$day_EDT == as.numeric(as.character(day))+1)[1]-1
  } else {
    day_df[paste(day), 2] <- which(geodata_timezone$day_EDT == day)[1]
    day_df[paste(day), 3] <- nrow(geodata_timezone)
  }
}

ggplot(aes(x = seq_along(ele), y = ele), data = geodata_timezone) +
  geom_rect(data=day_df, inherit.aes=FALSE,
           aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax, fill = factor(group)), col = "grey", alpha=0.5) +
  #annotate("rect", xmin=0, xmax=0, ymin=-Inf, ymax=0, fill="red", alpha=0.5) +
  geom_path() +
  labs(x="Index", y="Elevation (line) and day (rectangles)", title="Elevation per day") +
  scale_y_continuous(limits = c(min(c(geodata_timezone$ele, as.numeric(as.character(geodata_timezone$day_EDT)))), 
                                max(c(geodata_timezone$ele, as.numeric(as.character(geodata_timezone$day_EDT)))))) +
  scale_fill_discrete(name="Days", labels = labels)



Extracting speed data from the gpx files

Looking at the gpx files, there is also speed and course data, saved in the extensions tag. However, not all data points seem to have a speed tag.

for (i in 1:length(myfiles)){

  tryCatch({
    singleString <- paste(readLines(myfiles[i]), collapse=" ")
    # where a track point's extensions block starts directly with a course tag,
    # insert a dummy speed tag of 0.00 so that the speed and course vectors align
    pfile2 <- gsub("</time><extensions><gpxtpx:TrackPointExtension><gpxtpx:course>",
                   "</time><extensions><gpxtpx:TrackPointExtension><gpxtpx:speed>0.00</gpxtpx:speed><gpxtpx:course>", singleString)

    pfile <- htmlTreeParse(pfile2, useInternalNodes = T)

    # Get all elevations, times and coordinates via the respective xpath
    times <- xpathSApply(pfile, path = "//trkpt/time", xmlValue)
    coords <- xpathSApply(pfile, path = "//trkpt", xmlAttrs)
    # Extract latitude and longitude from the coordinates
    lats <- as.numeric(as.character(coords["lat",]))
    lons <- as.numeric(as.character(coords["lon",]))

    speed <- xpathSApply(pfile, path = "//trkpt/extensions/*/speed", xmlValue)# in m/s
    
    # course gives the degrees where the car was headed at that point
    course <- xpathSApply(pfile, path = "//trkpt/extensions/*/course", xmlValue)
    # Put everything in a dataframe and get rid of old variables
    geodf <- data.frame(lat = lats, lon = lons, time = times, speed = speed, course = course)

    if (i == 1){
      geodata_new <- geodf
    } else {
      geodata_new <- rbind(geodata_new, geodf)
    }

    rm(pfile, pfile2, singleString)
  }, error=function(e){cat("ERROR\n")})

}
# Transforming the time column
geodata_new$time <- strptime(geodata_new$time, format = "%Y-%m-%dT%H:%M:%SZ")

# ordering by date
geodata_new <- arrange(geodata_new, time)

# removing all tracks from Germany
geodata_new_trip <- geodata_new[which(as.numeric(as.character(geodata_new$lon)) < 0),]

# adding a column with day for separating the plotting by day later on
geodata_new_trip$day <- as.factor(format(as.Date(geodata_new_trip$time,format="%Y-%m-%d"), "%d"))

head(geodata_new_trip)
##            lat      lon                time speed course day
## 35664 41.98135 -87.8818 2016-09-09 19:23:53  0.00   0.00  09
## 35665 41.98135 -87.8818 2016-09-09 19:23:53  0.00   0.00  09
## 35666 41.98135 -87.8818 2016-09-09 19:23:53  0.00   0.00  09
## 35667 41.98135 -87.8818 2016-09-09 19:23:53  0.00   0.00  09
## 35668 41.98135 -87.8818 2016-09-09 19:23:53  0.00   0.00  09
## 35669 41.98135 -87.8818 2016-09-09 19:23:53  0.00   0.00  09


Garmin shows speed as m/s, so I am converting it to km/h.

library(yarrr)

geodata_new_trip$speed_kmh <- as.numeric(as.character(geodata_new_trip$speed)) *3.6

speed_data <- geodata_new_trip[which(as.numeric(as.character(geodata_new_trip$speed_kmh)) > 0),]
speed_data <- speed_data[!duplicated(speed_data), ]

summary(as.numeric(as.character(speed_data$speed_kmh)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.932  24.700  44.460  49.210  69.190 128.500
pirateplot(formula = as.numeric(as.character(speed_kmh)) ~ as.factor(day),
           point.o = .1,
           data = speed_data,
           ylab = "Speed (km/h)",
           xlab = "Day in September",
           main = "Speed per day")



Photos

And finally some photos from our trip. Their locations are marked on the map below:

myphotos <- list.files(path = ".", full.names = TRUE)
myphotos <- myphotos[grep("jpg", myphotos)] 

library(exif)

photo_geotags <- data.frame(row.names = gsub("./", "", myphotos), lat = rep(NA, length(myphotos)), lon = NA)

for (file in gsub("./", "", myphotos)){
  photodata <- read_exif(file)
  lat <- photodata$latitude
  lon <- photodata$longitude
  photo_geotags[file, ]$lat <- lat
  photo_geotags[file, ]$lon <- lon
}

photo_geotags$text <- seq_len(nrow(photo_geotags))

library(ggmap)
library(ggrepel)

bc_bbox <- make_bbox(lat = lat, lon = lon, data = photo_geotags, f = .10)
map <- get_map(location = bc_bbox, maptype = "roadmap") 

ggmap(map, darken = c(0.3, "white")) + 
  geom_point(aes(x = lon, y = lat), data = photo_geotags) +
  theme(legend.position="none") + labs(x="Longitude", y="Latitude", title="Roadtrip 2016") + 
  geom_text_repel(aes(label = text), data = photo_geotags)

(Photos 1 to 35 follow here; their numbers correspond to the labels on the map above.)

DESeq2 Course Work

2016-09-29 00:00:00 +0000



The following workflow has been designed as teaching instructions for an introductory course to RNA-seq data analysis with DESeq2.

The course is designed for PhD students and will be given at the University of Münster from the 10th to the 21st of October 2016.

For questions or other comments, please contact me.



Go to exprAnalysis or this post for installation instructions.

For all functions, use the help pages to find out more about parameters and usage.

See Vignette for additional information.

library(exprAnalysis)


Input data

“As input, the DESeq2 package expects count data as obtained, e. g., from RNAseq or another high-throughput sequencing experiment, in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads can be assigned to gene i in sample j.” Love et al., DESeq2 vignette

data("countmatrix")


Count data analysis with DESeq2

See DESeq2 Vignette for details.

  • read in saved count matrix
  • define experimental design
  • convert to DESeq data set

Count matrix input

design <- gsub("(.*)(_[0-9])", "\\1", colnames(countmatrix))
ExpDesign <- data.frame(row.names=colnames(countmatrix), treatment = design)

data <- DESeq2::DESeqDataSetFromMatrix(countData = countmatrix, colData = ExpDesign, design = ~treatment)
## converting counts to integer mode
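
To see what the design extraction above does: assuming sample columns are named <treatment>_<replicate> (as in the example countmatrix, e.g. "Ctrl_1"), the regular expression strips the trailing replicate suffix:

gsub("(.*)(_[0-9])", "\\1", c("Ctrl_1", "Ctrl_2", "ActLPS_1"))
# returns "Ctrl" "Ctrl" "ActLPS"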


DESeq2

  • optional, but recommended: remove genes with zero counts over all samples
  • run DESeq
  • Extracting transformed values

“While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are no reads or nearly no reads, we reduce the memory size of the dds data object and we increase the speed of the transformation and testing functions within DESeq2.” Love et al., DESeq2 vignette

Note: the rlog transformation is provided for applications other than differential testing. For differential testing we recommend the DESeq function applied to raw counts, as described later in this workflow, which also takes into account the dependence of the variance of counts on the mean value during the dispersion estimation step.

For a quick first glance at the data, we can use pcaExplorer.

data <- data[rowSums(DESeq2::counts(data)) > 1, ]

data_DESeq <- DESeq2::DESeq(data)
## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing
expmatrix_DESeq <- DESeq2::rlog(data_DESeq, fitType="local")
expmatrix <- SummarizedExperiment::assay(expmatrix_DESeq)
library("pcaExplorer")
pcaExplorer(data_DESeq, expmatrix_DESeq)


Dispersion plot

DESeq2::plotDispEsts(data_DESeq, main="Dispersion Estimates")



Exploratory analysis of all genes

Variance vs mean gene expression across samples

Plots variance against mean gene expression across samples and calculates the correlation of a linear regression model.

var_vs_mean() uses the R package matrixStats.

var_vs_mean(countmatrix)

## 
##  Pearson's product-moment correlation
## 
## data:  log2(dispersion[, 1] + 1) and log2(dispersion[, 2] + 1)
## t = 281.2, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9399669 0.9443684
## sample estimates:
##       cor 
## 0.9422083
var_vs_mean(expmatrix)

## 
##  Pearson's product-moment correlation
## 
## data:  log2(dispersion[, 1] + 1) and log2(dispersion[, 2] + 1)
## t = 12.133, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1010951 0.1397267
## sample estimates:
##       cor 
## 0.1204565


Intersample variances

library(corrgram)

Ctrl_cor <- expmatrix[,grep("Ctrl", colnames(expmatrix))]

corrgram::corrgram(Ctrl_cor, order=TRUE, lower.panel=corrgram::panel.pie,
         upper.panel=corrgram::panel.pts, text.panel=corrgram::panel.txt,
         main="Correlogram of controls")

Repeat for other treatment groups



Principal Component Analysis

Uses functions from the R package pcaGoPromoter.

You can either just plot the principal components using:

groups <- as.factor(c(rep("Ctrl",4), rep("TolLPS",4), rep("TolS100A8",4), rep("ActLPS",4)))
pca_plot(expmatrix, groups)


Or you can plot the principal components and calculate TF and GO term enrichments of the genes with the highest and lowest loadings (defaults to the top 2.5%). With this function, the output files are directly saved as .pdf and .txt (by default to the working directory).

pca_plot_enrich(expmatrix, groups)


Heatmaps

heatmaps() uses the R package gplots.

Here, a heatmap of the 30 most highly expressed genes.

select <- order(rowMeans(expmatrix),decreasing=TRUE)[1:30]
heatmaps(expmatrix[select,], samplecols = rep(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3"), each=4))

Heatmap function from DESeq2, using pheatmap:

library(pheatmap)

sampleDists <- dist(t(expmatrix))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(expmatrix_DESeq$treatment)
colnames(sampleDistMatrix) <- NULL
colors <- grDevices::colorRampPalette( rev(RColorBrewer::brewer.pal(9, "Blues")) )(255)
pheatmap::pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

df <- data.frame(treatment = SummarizedExperiment::colData(data_DESeq)[,c("treatment")], row.names = rownames(SummarizedExperiment::colData(data_DESeq)))
pheatmap::pheatmap(expmatrix[select,], cluster_rows=TRUE, show_rownames=TRUE, cluster_cols=TRUE, annotation_col=df)



Hierarchical Clustering and outlier detection

Uses adjacency matrix function from the R package WGCNA and hierarchical clustering from the R package flashClust.

datTraits <- data.frame(Ctrl = c(rep(1, 4), rep(0,12)), TolPS = c(rep(0, 4), rep(1, 4),rep(0, 8)), 
                        TolS100A8 = c(rep(0, 8), rep(1, 4), rep(0, 4)), ActLPS = c(rep(0, 12),rep(1, 4)), 
                        Tol = c(rep(0, 4), rep(1, 8), rep(0, 4)), 
                        ExPhenotype = c(stats::rnorm(4, 10, 1),stats::rnorm(8, 25, 1),stats::rnorm(4, 50, 1)), 
                        row.names = colnames(expmatrix))

datExpr <- wgcna_sample_dendrogram(expmatrix, datTraits)
## 

##  Flagging genes and samples with too many missing values...
##   ..step 1
## All genes are okay!
## All samples are okay!
# Optional: Remove outlier samples and repeats: All genes flagged for removal are saved to the object "remove_genes"
#head(remove_genes)


Differential expression analysis using DESeq2

For raw read count data.

contrast DE groups:

  • + lfc = treatment > Ctrl, - lfc = treatment < Ctrl
  • p-value & p.adjust values of NA indicate outliers detected by Cook’s distance
  • NA only for p.adjust means the gene is filtered out by automatic independent filtering for having a low mean normalized count

Information about which variables and tests were used can be found by calling the function mcols on the results object.

library(DESeq2)
library(ggplot2)
library(ggrepel)

# find possible contrasts with
DESeq2::resultsNames(data_DESeq)
## [1] "Intercept"          "treatmentActLPS"    "treatmentCtrl"     
## [4] "treatmentTolLPS"    "treatmentTolS100A8"
res <- DESeq2::results(data_DESeq, contrast=list("treatmentActLPS", "treatmentCtrl"), cooksCutoff = 0.99, independentFiltering = TRUE, alpha = 0.05, pAdjustMethod = "BH")
summary(res)
## 
## out of 10000 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)     : 2608, 26% 
## LFC < 0 (down)   : 2407, 24% 
## outliers [1]     : 175, 1.8% 
## low counts [2]   : 0, 0% 
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
mcols(res)$description
## [1] "mean of normalized counts for all samples"               
## [2] "log2 fold change (MAP): treatmentActLPS vs treatmentCtrl"
## [3] "standard error: treatmentActLPS vs treatmentCtrl"        
## [4] "Wald statistic: treatmentActLPS vs treatmentCtrl"        
## [5] "Wald test p-value: treatmentActLPS vs treatmentCtrl"     
## [6] "BH adjusted p-values"
# order results table by the smallest adjusted p value:
res <- res[order(res$padj),]

results = as.data.frame(dplyr::mutate(as.data.frame(res), sig=ifelse(res$padj<0.05, "FDR<0.05", "Not Sig")), row.names=rownames(res))
head(results)
##               baseMean log2FoldChange      lfcSE     stat        pvalue
## CATSPER3     2469.0350       4.094351 0.08958149 45.70532  0.000000e+00
## GDA          5578.3392       4.200987 0.13525887 31.05886 8.660849e-212
## CCDC64B       544.6121       5.878983 0.19906885 29.53241 1.104831e-191
## GRB2         1751.2594       2.506512 0.09224234 27.17312 1.350120e-162
## LOC100507387  716.7431       9.537221 0.36844495 25.88506 9.810056e-148
## TMEM102      1346.0139       4.244962 0.16534946 25.67267 2.360945e-145
##                       padj      sig
## CATSPER3      0.000000e+00 FDR<0.05
## GDA          4.254642e-208 FDR<0.05
## CCDC64B      3.618323e-188 FDR<0.05
## GRB2         3.316233e-159 FDR<0.05
## LOC100507387 1.927676e-144 FDR<0.05
## TMEM102      3.866048e-142 FDR<0.05
# keep genes with at least a 1.5-fold change (|log2FC| > log2(1.5) ~ 0.585) and FDR < 0.05
DEgenes_DESeq <- results[which(abs(results$log2FoldChange) > log2(1.5) & results$padj < 0.05),]

p = ggplot2::ggplot(results, ggplot2::aes(log2FoldChange, -log10(pvalue))) +
  ggplot2::geom_point(ggplot2::aes(col = sig)) +
  ggplot2::scale_color_manual(values = c("red", "black")) +
  ggplot2::ggtitle("Volcano Plot of DESeq2 analysis")

p + ggrepel::geom_text_repel(data=results[1:10, ], ggplot2::aes(label=rownames(results[1:10, ])))
## Warning: Removed 175 rows containing missing values (geom_point).

# If there aren't too many DE genes:
#p + geom_text_repel(data = dplyr::filter(results, padj<0.05), aes(label = rownames(results[1:10, ])))


MA-plot

“These plots show the log2 fold changes from the treatment over the mean of normalized counts, i.e. the average of counts normalized by size factors. The left plot shows the “unshrunken” log2 fold changes, while the right plot, produced by the code above, shows the shrinkage of log2 fold changes resulting from the incorporation of zero-centered normal prior. The shrinkage is greater for the log2 fold change estimates from genes with low counts and high dispersion, as can be seen by the narrowing of spread of leftmost points in the right plot.” Love et al., DESeq2 vignette

DESeq2::plotMA(res, main="MA Plot", ylim=c(-2,2))


plotCounts

“It can also be useful to examine the counts of reads for a single gene across the groups. A simple function for making this plot is plotCounts, which normalizes counts by sequencing depth and adds a pseudocount of 1/2 to allow for log scale plotting. The counts are grouped by the variables in intgroup, where more than one variable can be specified.” Love et al., DESeq2 vignette

par(mfrow=c(1,3))

for (i in 1:3){
  gene <- rownames(res)[i]
  main = gene
  DESeq2::plotCounts(data_DESeq, gene=gene, intgroup="treatment", main = main)
}



Gene annotations

Can be used to add e.g. the ENTREZ ID, ENSEMBL ID, etc. to the gene names.

results_anno <- geneAnnotations(input=results, keys=row.names(results), column=c("ENTREZID", "ENSEMBL"), keytype="SYMBOL", organism = "human")
## 

## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
head(results_anno)
##               baseMean log2FoldChange      lfcSE     stat        pvalue
## CATSPER3     2469.0350       4.094351 0.08958149 45.70532  0.000000e+00
## GDA          5578.3392       4.200987 0.13525887 31.05886 8.660849e-212
## CCDC64B       544.6121       5.878983 0.19906885 29.53241 1.104831e-191
## GRB2         1751.2594       2.506512 0.09224234 27.17312 1.350120e-162
## LOC100507387  716.7431       9.537221 0.36844495 25.88506 9.810056e-148
## TMEM102      1346.0139       4.244962 0.16534946 25.67267 2.360945e-145
##                       padj      sig  ENTREZID         ENSEMBL
## CATSPER3      0.000000e+00 FDR<0.05    347732 ENSG00000152705
## GDA          4.254642e-208 FDR<0.05      9615 ENSG00000119125
## CCDC64B      3.618323e-188 FDR<0.05    146439 ENSG00000162069
## GRB2         3.316233e-159 FDR<0.05      2885 ENSG00000177885
## LOC100507387 1.927676e-144 FDR<0.05 100507387 ENSG00000182230
## TMEM102      3.866048e-142 FDR<0.05    284114 ENSG00000181284


Enrichment Analysis using clusterProfiler

See clusterProfiler instructions for details.

library(clusterProfiler)
## Loading required package: DOSE
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
OrgDb <- org.Hs.eg.db # can also be other organisms

geneList <- as.vector(results_anno$log2FoldChange)
names(geneList) <- results_anno$ENTREZID
gene <- na.omit(results_anno$ENTREZID)


# Group GO
ggo <- clusterProfiler::groupGO(gene     = gene,
                                OrgDb    = OrgDb,
                                ont      = "BP",
                                level    = 3,
                                readable = TRUE)
head(summary(ggo)[,-5])
##                    ID                                    Description Count
## GO:0003006 GO:0003006 developmental process involved in reproduction   261
## GO:0019953 GO:0019953                            sexual reproduction   303
## GO:0019954 GO:0019954                           asexual reproduction     0
## GO:0022414 GO:0022414                           reproductive process   547
## GO:0032504 GO:0032504            multicellular organism reproduction   317
## GO:0032505 GO:0032505       reproduction of a single-celled organism     0
##            GeneRatio
## GO:0003006  261/9735
## GO:0019953  303/9735
## GO:0019954    0/9735
## GO:0022414  547/9735
## GO:0032504  317/9735
## GO:0032505    0/9735
barplot(ggo, drop=TRUE, showCategory=12)

# GO over-representation test
ego <- clusterProfiler::enrichGO(gene          = gene,
                                 OrgDb         = OrgDb,
                                 ont           = "BP",
                                 pAdjustMethod = "BH",
                                 pvalueCutoff  = 0.05,
                                 qvalueCutoff  = 0.05, 
                                 readable      = TRUE)
head(summary(ego)[,-8])
##                    ID                                        Description
## GO:0072657 GO:0072657                   protein localization to membrane
## GO:0006893 GO:0006893                 Golgi to plasma membrane transport
## GO:0007051 GO:0007051                               spindle organization
## GO:0044772 GO:0044772                mitotic cell cycle phase transition
## GO:0060999 GO:0060999 positive regulation of dendritic spine development
## GO:0048193 GO:0048193                            Golgi vesicle transport
##            GeneRatio   BgRatio       pvalue   p.adjust     qvalue Count
## GO:0072657  260/7259 487/16655 6.475400e-06 0.02201028 0.02035249   260
## GO:0006893   36/7259  48/16655 9.648441e-06 0.02201028 0.02035249    36
## GO:0007051   81/7259 130/16655 1.246381e-05 0.02201028 0.02035249    81
## GO:0044772  253/7259 477/16655 1.597263e-05 0.02201028 0.02035249   253
## GO:0060999   27/7259  34/16655 2.158731e-05 0.02379785 0.02200542    27
## GO:0048193  179/7259 329/16655 4.371899e-05 0.03540965 0.03274263   179
barplot(ego, showCategory=25)

clusterProfiler::dotplot(ego, showCategory=25)

#clusterProfiler::plotGOgraph(ego)
## KEGG over-representation test
kk <- clusterProfiler::enrichKEGG(gene         = gene,
                 organism     = 'hsa',
                 pAdjustMethod = "BH",
                 pvalueCutoff = 0.05,
                 qvalueCutoff  = 0.05)
head(summary(kk)[,-8])
##                ID                  Description GeneRatio  BgRatio
## hsa04144 hsa04144                  Endocytosis  142/3071 260/7119
## hsa04210 hsa04210                    Apoptosis   80/3071 140/7119
## hsa05169 hsa05169 Epstein-Barr virus infection  111/3071 204/7119
## hsa05134 hsa05134                Legionellosis   36/3071  55/7119
##                pvalue   p.adjust     qvalue Count
## hsa04144 9.858484e-05 0.02858960 0.02345281   142
## hsa04210 5.292257e-04 0.04885909 0.04008042    80
## hsa05169 6.639562e-04 0.04885909 0.04008042   111
## hsa05134 6.739185e-04 0.04885909 0.04008042    36
cnetplot(kk, categorySize="geneNum", foldChange=geneList)

exprAnalysis package

2016-09-28 00:00:00 +0000

I created the R package exprAnalysis, designed to streamline my RNA-seq data analysis pipeline. Below you will find the vignette for installation and usage of the package.



This package combines functions from various packages used to analyze and visualize expression data from NGS or expression chips.

  • It supports normalized input (e.g. from Cufflinks or expression chip arrays) as well as raw count data from bam file input.
  • It supports mRNA, miRNA, protein or other expression data.

So far, it is implemented for human and mouse data only!

It uses functions from the following packages:

  • AnnotationDbi for annotating gene information
  • beadarray for importing Illumina expression chip files from GenomeStudio
  • clusterProfiler and DOSE for functional enrichment analysis
  • DESeq2 for differential expression analysis of raw count data
  • GenomicAlignments, GenomicFeatures, Rsamtools for reading bam files
  • pcaGoPromoter for principal component analysis
  • limma for differential expression analysis of normalised expression data
  • pathview for mapping KEGG pathways
  • gplots for heatmaps
  • sva for batch correction
  • WGCNA for coregulatory network determination


Prepare session

library(exprAnalysis)
# update all packages
source.bioc="https://bioconductor.org/biocLite.R"
source(source.bioc)
biocLite("BiocUpgrade")
biocLite(ask=FALSE) 

update.packages()

# make sure the workspace is in pristine condition
rm(list=ls(all=TRUE))

orig_par <- par(no.readonly=T)

options(stringsAsFactors = FALSE)


The package can be installed via Github:

Beware that the vignette is rather large and thus takes a minute to compile. You can also just use this page (it is identical to the vignette).

# install package from github
install.packages("devtools")
library(devtools)

# either the latest stable release that passed TRAVIS CI check
devtools::install_github("ShirinG/exprAnalysis", build_vignettes=TRUE, ref = "stable.version0.1.0")

# or the development version
devtools::install_github("ShirinG/exprAnalysis", build_vignettes=TRUE, ref = "master")

There might be problems with installation of some dependency packages (especially Bioconductor packages and WGCNA and its dependencies from CRAN). In order to install them manually:

list.of.packages_bioconductor <- c("arrayQualityMetrics", "beadarray", "pcaGoPromoter", "limma", "pathview", "sva", "GO.db", "impute")
list.of.packages_cran <- c("WGCNA", "roxygen2", "testthat", "gplots")

new.packages_bioconductor <- list.of.packages_bioconductor[!(list.of.packages_bioconductor %in% installed.packages()[,"Package"])]
new.packages_cran <- list.of.packages_cran[!(list.of.packages_cran %in% installed.packages()[,"Package"])]

# CRAN
if(length(new.packages_cran)>0) install.packages(new.packages_cran)

# Bioconductor
if(length(new.packages_bioconductor)>0) {
  source("https://bioconductor.org/biocLite.R")
  biocLite(new.packages_bioconductor)
}

Load the library:

library("exprAnalysis")

# To view the vignette if you built it with the package:

vignette("exprAnalysis", package="exprAnalysis")
vignette("CummeRbund", package="exprAnalysis")

browseVignettes("exprAnalysis")

enableWGCNAThreads()


Functions

See package help (?functionname, ?data) for detailed descriptions of functions and example datasets.



Input data

Takes an expression data matrix containing numeric values as expression measures (e.g. read count data, FPKM values, Illumina expression data).

  • Rownames should be gene or isoform identifiers (e.g. gene names)
  • Colnames should be sample IDs (sample names)

This package contains two example matrices of randomly generated expression data as raw read counts (called “countmatrix”) and as normalized read counts (called “expmatrix”).

data("countmatrix")
data("expmatrix")


Starting from FPKM data

If you have FPKM data (e.g. quantile normalized) from Cufflinks, treat the data like the expmatrix example.


Cuffdiff

See CummeRbund vignette included among the vignettes of this package CummeRbundVignette.html.



Starting from count data

If you want to do count data analysis, you can either produce a count matrix (e.g. with HTSeq) and proceed to DESeq2 analysis or you can produce the count table directly from the bam files and then proceed to DESeq2.

read_bam_to_countmatrix() returns a DESeq data set that can be used directly with the DESeq2 pipeline. Or you can manually load the count matrix that was saved by read_bam_to_countmatrix() and then go to DESeqDataSetFromMatrix().

In read_bam_to_countmatrix(), fragments are counted only once per gene, even if they overlap multiple exons of a gene, which may themselves be overlapping.

read_bam_to_countmatrix() uses the packages Rsamtools, GenomicAlignments, GenomicFeatures, Biobase and SummarizedExperiment.

# Locate alignment files
dir <- getwd()
filenames <- grep(".*sorted.bam$", list.files(dir), value = TRUE) # keep only the sorted bam files

# Create a sample table

samplename <- sub("_accepted_hits.sorted.bam", "", filenames)
design <- c("Treatment", "Control")
sampleid <- sub("Sample", "", samplename)

sampleTable <- data.frame(row.names = samplename, sampleid = sampleid, filenames = filenames, colData = design, stringsAsFactors = F)

#         sampleid                         filenames    colData
#Sample1         1  Sample1_accepted_hits.sorted.bam  Treatment
#Sample2         2  Sample2_accepted_hits.sorted.bam    Control

data <- read_bam_to_countmatrix(sampleTable, gtffile = "Homo_sapiens.GRCh38.83.gtf", projectfolder = getwd(), outPrefix="Test")

countmatrix <- SummarizedExperiment::assay(data)


Continue count data analysis with DESeq2

See DESeq2 Vignette for additional details.

If you do not directly proceed from read_bam_to_countmatrix():

  • read in saved count matrix
  • define experimental design
  • convert to DESeq data set

countmatrix <- read.table(file.path(projectfolder, "Countdata", outPrefix, paste0(outPrefix, "_Summarize_overlaps_count_matrix.txt")), header=T, sep = "\t")
design <- gsub("(.*)(_[0-9])", "\\1", colnames(countmatrix))
ExpDesign <- data.frame(row.names=colnames(countmatrix), treatment = design)

data <- DESeq2::DESeqDataSetFromMatrix(countData = countmatrix, colData=ExpDesign, design=~treatment)


DESeq2

  • optional, but recommended: remove genes with zero counts over all samples
  • run DESeq
  • Extracting transformed values

Note: the rlog transformation is provided for applications other than differential testing. For differential testing we recommend the DESeq function applied to raw counts, as described later in this workflow, which also takes into account the dependence of the variance of counts on the mean value during the dispersion estimation step.

data <- data[rowSums(DESeq2::counts(data)) > 1, ]

data_DESeq <- DESeq2::DESeq(data)

expmatrix_DESeq <- DESeq2::rlog(data_DESeq, fitType="local")
expmatrix <- SummarizedExperiment::assay(expmatrix_DESeq)


Dispersion plot

DESeq2::plotDispEsts(data_DESeq, main="Dispersion Estimates")


Starting from Affymetrix expression chips

If you have CEL files, start with the following code to produce the expression matrix and then treat it like the expmatrix example:

  • Read in the data and create an expression set, using RMA for example

Currently, the rma function implements RMA in the following manner:

  1. Probe specific correction of the PM probes using a model based on observed intensity being the sum of signal and noise
  2. Normalization of corrected PM probes using quantile normalization (Bolstad et al., 2003)
  3. Calculation of the expression measure using median polish

library(affy)

# Choose directory containing CEL files
dir <- getwd()

# check, that you have the correct directory
celfiles <- file.path(dir, list.files(dir, recursive = TRUE)[grep(".*CEL$", list.files(dir, recursive = TRUE), ignore.case = TRUE)])

affydata <- ReadAffy(filenames=celfiles)
affy::MAplot(affydata,plot.method="smoothScatter")
eset <- affy::rma(affydata)

# If affy doesn't recognize the chip type, try the oligo package:
library(oligo)
rawData <- read.celfiles(celfiles)
oligo::MAplot(rawData, plotFun=smoothScatter)
eset <- backgroundCorrect(rawData, method="rma")

quality_control_plots(eset, groupColumn=c())

# Optional: Remove bad quality probes/ genes and/ or samples

write.exprs(eset, file="eset.txt")

library(made4)
overview(eset)

# RNA degradation plots
deg <- AffyRNAdeg(affydata)
plotAffyRNAdeg(deg)


expmatrix <- exprs(eset)


Starting from Illumina expression chips

If you have Illumina intensity data, load it into GenomeStudio first:

  • Open New Project -> Gene Expression
  • Choose Assay Type -> Next
  • Choose Project Repo and name the project -> Next
  • On the left side, mark all folders containing data, select “All” and import them into the right window -> Next
  • Name Group Set, on the left side, mark all folders containing data, select “All” and import them into the right window (optionally, you can directly create groups here) -> Next
  • For now, don’t use normalisation and don’t subtract background
  • Choose appropriate Content Descriptor (e.g. HumanHT-12_V4_0_R1_15002873_B.bgx) -> Finish
  • Export SampleProbeProfile.txt: Choose columns AVG_Signal, Detection Pval, BEAD_STDERR and Avg_NBEADS
  • Export ControlProbeProfile.txt: Choose columns AVG_Signal and Detection Pval
  • You also need the Samplesheet.csv

read_Illumina() and normalise_eset() use the R package beadarray.

dataFile      <- file.path("SampleProbeProfile.txt")
qcFile        <- file.path("ControlProbeProfile.txt")
sampleSheet   <- file.path("samplesheet.csv")

# define Illumina Expression Array: "HumanHT-12 v?", "MouseWG-6 v?", "MouseRef-8 v?"
expressionchipType <- "HumanHT-12 v4"

eset <- read_Illumina(dataFile, qcFile, sampleSheet, expressionchipType, ProbeID = "PROBE_ID", skip = 0, controlID="ProbeID", qc.skip = 0, method_norm ="none", transform="log2")

quality_control_plots(eset)

# Optional: Remove bad quality probes/ genes and/ or samples

eset <- normalise_eset(eset)

expmatrix <- exprs(eset)


Batch correction

batch_removal() uses the R package sva.

pheno <- data.frame(sample = c(1:16),
                    treatment = sub("(.*)(_[0-9])", "\\1", colnames(expmatrix)),
                    batch = ifelse(grepl("Ctrl|ActLPS", colnames(expmatrix)), "1", "2"),
                    row.names = colnames(expmatrix))

expmatrix <- batch_removal(expmatrix, pheno)
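
batch_removal() is a package wrapper; as a rough sketch of what an sva-based correction typically looks like (assuming ComBat with a model matrix that protects the treatment effect — the actual wrapper internals may differ):

library(sva)

# minimal ComBat sketch (assumption: batch_removal() does something along these lines)
mod <- model.matrix(~ treatment, data = pheno)  # protect the biological signal of interest
expmatrix_combat <- sva::ComBat(dat = expmatrix, batch = pheno$batch, mod = mod)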


Exploratory analysis of all genes

Variance vs mean gene expression across samples

Plots the variance against the mean gene expression across samples and reports Pearson's correlation between the two (on log2-transformed values).

var_vs_mean() uses the R package matrixStats.

var_vs_mean(countmatrix)

## 
##  Pearson's product-moment correlation
## 
## data:  log2(dispersion[, 1] + 1) and log2(dispersion[, 2] + 1)
## t = 281.2, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9399669 0.9443684
## sample estimates:
##       cor 
## 0.9422083
# var_vs_mean(expmatrix)
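
For reference, a minimal sketch of the underlying calculation (an assumption about var_vs_mean()'s internals, consistent with the cor.test output above):

library(matrixStats)

# per-gene mean and variance across samples
dispersion <- cbind(rowMeans(countmatrix), matrixStats::rowVars(countmatrix))

# Pearson correlation on the log2 scale, as in the output above
stats::cor.test(log2(dispersion[, 1] + 1), log2(dispersion[, 2] + 1))

plot(log2(dispersion[, 1] + 1), log2(dispersion[, 2] + 1),
     xlab = "log2(mean + 1)", ylab = "log2(variance + 1)")
abline(stats::lm(log2(dispersion[, 2] + 1) ~ log2(dispersion[, 1] + 1)), col = "red")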


Intersample variances

library(corrgram)

Ctrl_cor <- expmatrix[,grep("Ctrl", colnames(expmatrix))]

corrgram::corrgram(Ctrl_cor, order=TRUE, lower.panel=corrgram::panel.pie,
         upper.panel=corrgram::panel.pts, text.panel=corrgram::panel.txt,
         main="Correlogram of controls")


Principal Component Analysis

Uses functions from the R package pcaGoPromoter.

To plot only the principal components, use:

groups <- as.factor(c(rep("Ctrl",4), rep("TolLPS",4), rep("TolS100A8",4), rep("ActLPS",4)))

pca_plot(expmatrix, groups)


Or you can plot the principal components and calculate TF and GO term enrichments of the genes with the highest and lowest loadings (defaults to the top 2.5%). With this function, the output files are saved directly as .pdf and .txt (by default to the working directory).

pca_plot_enrich(expmatrix, groups)
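
If you want to avoid the pcaGoPromoter dependency, a bare-bones sample PCA can be done with base R (a minimal sketch, not the pca_plot() implementation):

# PCA on samples (samples become rows after transposing)
pca <- stats::prcomp(t(expmatrix))
percent_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

plot(pca$x[, 1], pca$x[, 2], col = as.numeric(groups), pch = 16,
     xlab = paste0("PC1 (", percent_var[1], "%)"),
     ylab = paste0("PC2 (", percent_var[2], "%)"))
legend("topright", legend = levels(groups), col = seq_along(levels(groups)), pch = 16)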


Heatmaps

heatmaps() uses the R package gplots.

Here, we plot the 30 most highly expressed genes.

select <- order(rowMeans(expmatrix),decreasing=TRUE)[1:30]

heatmaps(expmatrix[select,], samplecols = rep(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3"), each=4))
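
For reference, a minimal gplots sketch of what heatmaps() presumably wraps (the wrapper internals are an assumption):

library(gplots)

gplots::heatmap.2(expmatrix[select, ],
                  ColSideColors = rep(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3"), each = 4),
                  scale = "row",   # z-score genes for display
                  trace = "none",
                  margins = c(8, 8),
                  col = grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(9, "RdBu")))(255))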

# Heatmap function from DESeq2, using pheatmap:

library(pheatmap)

sampleDists <- dist(t(expmatrix))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(expmatrix_DESeq$treatment)
colnames(sampleDistMatrix) <- NULL
colors <- grDevices::colorRampPalette( rev(RColorBrewer::brewer.pal(9, "Blues")) )(255)
pheatmap::pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

df <- data.frame(treatment = SummarizedExperiment::colData(data_DESeq)[,c("treatment")], row.names = rownames(SummarizedExperiment::colData(data_DESeq)))
pheatmap::pheatmap(expmatrix[select,], cluster_rows=TRUE, show_rownames=TRUE, cluster_cols=TRUE, annotation_col=df)
# Interactive heatmap

#devtools::install_github('talgalili/heatmaply')
library(heatmaply)

# sample correlation
hm <- heatmapr(cor(expmatrix[select,]), k_row = 4, k_col = 4)

# gene correlation
hm <- heatmapr(cor(t(expmatrix[select,])), k_row = NULL, k_col = NULL)

heatmaply(hm)

# Hover over the heatmap to see individual values and sample/gene IDs


WGCNA

Check the WGCNA page for a detailed description of the WGCNA package.

Hierarchical Clustering and outlier detection

Uses adjacency matrix function from the R package WGCNA and hierarchical clustering from the R package flashClust.

datTraits <- data.frame(Ctrl = c(rep(1, 4), rep(0, 12)),
                        TolLPS = c(rep(0, 4), rep(1, 4), rep(0, 8)),
                        TolS100A8 = c(rep(0, 8), rep(1, 4), rep(0, 4)),
                        ActLPS = c(rep(0, 12), rep(1, 4)),
                        Tol = c(rep(0, 4), rep(1, 8), rep(0, 4)),
                        ExPhenotype = c(stats::rnorm(4, 10, 1), stats::rnorm(8, 25, 1), stats::rnorm(4, 50, 1)),
                        row.names = colnames(expmatrix))

datExpr <- wgcna_sample_dendrogram(expmatrix, datTraits)
## 

##  Flagging genes and samples with too many missing values...
##   ..step 1
## All genes are okay!
## All samples are okay!
# Optional: remove outlier samples and repeat; all genes flagged for removal are saved to the object "remove_genes"
#head(remove_genes)


Choosing a Soft Threshold

Ideally, pick a soft-thresholding power with a scale-free topology fit R^2 > 0.9. In this example, this threshold was not reached, so I pick the highest power tested: 30.

# Choose a set of soft thresholding powers
powers=c(1:30)

# choose power based on SFT criterion
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5, networkType ="signed", corFnc= "bicor", corOptions=list(maxPOutliers=0.1))

# Plot the results:
tiff(filename= "SFT.tiff", width = 8000 , height = 4000, res=600, compression = "lzw")
par(mfrow=c(1,2))
# SFT index as a function of different powers
plot(sft$fitIndices[,1],-sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="SFT, signed R^2",type="n",main=paste("Scale independence"))
text(sft$fitIndices[,1],-sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,col="red")
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of different powers
plot(sft$fitIndices[,1],sft$fitIndices[,5],type="n",
     xlab="Soft Threshold (power)",ylab="Mean Connectivity",main=paste("Mean connectivity"))
text(sft$fitIndices[,1],sft$fitIndices[,5],labels=powers,col="red")
dev.off()


Automatic module detection via dynamic tree cutting

softpower = 30
mergingThresh = 0.25
net = blockwiseModules(datExpr, checkMissingData = TRUE, corType = "bicor", # or pearson
                       maxBlockSize = 10000, networkType = "signed", power = softpower,
                       minModuleSize = 30, maxPOutliers = 0.1, mergeCutHeight = mergingThresh,
                       numericLabels = TRUE, saveTOMs = TRUE, randomSeed = 12345,
                       pamRespectsDendro = FALSE, saveTOMFileBase = "TESTexprsTOM")
str(net)
moduleColors <- wgcna_plotDendroAndColors(datExpr, datTraits, net)


# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors, softPower = softpower)$eigengenes
# Reorder given (eigen-)vectors such that similar ones (as measured by correlation) are next to each other:
MEs = orderMEs(MEs0)
rownames(MEs) <- rownames(datExpr)
write.table(MEs, "Automatic_ModuleEigengenes_numberLabel.txt", row.names=T, quote=F, sep="\t")

# calculate the module membership values (aka module eigengene based connectivity, kME):
datKME <- WGCNA::signedKME(datExpr, MEs)  # equals geneModuleMembership
colnames(datKME) <- sub("kME", "MM.", colnames(datKME))

wgcna_modulememberships(datExpr, MEs, datKME, moduleColors)

wgcna_heatmaps(MEs, datExpr, datTraits)

Module trait correlation heatmap


Output file for gene ontology analysis

geneAnnotations() uses the R package AnnotationDbi.

# Annotate
genes = colnames(datExpr)
datExpr_anno <- geneAnnotations(input=genes, keys=genes, column=c("ENTREZID", "ENSEMBL"), keytype="SYMBOL", organism = "human")

# Correlations of genes and traits
traitsCor=cor(datExpr,datTraits,use="p")
colnames(traitsCor)=paste("cor",colnames(traitsCor),sep=".")

dataOutput=data.frame(datExpr_anno ,moduleColors, datKME, traitsCor)

write.table(dataOutput,"AllGenesResults.txt",row.names=F,sep="\t")


Module eigengene expression plots

data(MEs)
layout(matrix(c(1:length(colnames(MEs)),12,rep(13,3)), ncol=3, byrow=TRUE), heights=c(rep(3,4), 0.5))

for (color in gsub("ME", "", colnames(MEs))){
  module_eigengene_plot(groups, MEs, color)
}
plot.new()

par(mai=c(0,0,0,0))
plot.new()
legend("bottom", c("mean module eigengenes", "replicate module eigengene values"), xpd = TRUE, horiz = TRUE, inset = c(0,0), bty = "n", col = c("grey", "black"), pch = 16, cex = 2)



Differential expression analysis

DE analysis using the R package limma

For normalized expression data (FPKM from Cufflinks or intensities from expression chip arrays).

design <- as.matrix(data.frame(Ctrl = c(rep(1, 4), rep(0,12)), TolLPS = c(rep(0, 4), rep(1, 4),
              rep(0, 8)), TollMRP8 = c(rep(0, 8), rep(1, 4), rep(0, 4)), ActLPS = c(rep(0, 12),
              rep(1, 4)), row.names = colnames(expmatrix)))
groupcomparisons=c("TolLPS-Ctrl", "TollMRP8-Ctrl", "ActLPS-Ctrl")
DEgenes_all <- diff_limma_all(expmatrix, design, groupcomparisons)

comparison="TolLPS-Ctrl"
Allgenes_limma_pw <- diff_limma_pw_unfiltered(expmatrix, design, comparison)

DEgenes_pw <- Allgenes_limma_pw[which(abs(Allgenes_limma_pw$logFC) > log2(1.5) & Allgenes_limma_pw$adj.P.Val < 0.05),]
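
The diff_limma functions wrap the standard limma pipeline; a minimal sketch of a pairwise comparison (an assumption about the wrapper internals, using only documented limma calls):

library(limma)

fit <- limma::lmFit(expmatrix, design)
contrast.matrix <- limma::makeContrasts(contrasts = comparison, levels = design)
fit2 <- limma::eBayes(limma::contrasts.fit(fit, contrast.matrix))

# confint = TRUE adds the CI.L/CI.R columns seen in the example table below
allgenes <- limma::topTable(fit2, coef = 1, adjust.method = "BH", number = Inf, confint = TRUE)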

Example output files from diff_limma functions have been saved to the package data and can be accessed via:

data("DEgenes_pw")
data("Allgenes_limma_pw")


# Export tables to Latex
library(xtable)

italic <- function(x){
  paste0('{\\emph{ ', x, '}}')
}

print(xtable(DEgenes_pw[1:10,], 
       digits = 2,
       caption = c("\\tt Long caption", "Short caption"),
       label = "tab:testtable",
       auto = TRUE),
      caption.placement = "top",
      sanitize.rownames.function = italic,
      booktabs = TRUE,
      floating = TRUE, 
      latex.environments = "center",
      include.rownames = TRUE,
      scalebox = 0.8,
      tabular.environment = "tabularx", width = "\\textwidth")

# or to html
print(xtable(DEgenes_pw[1:10,], 
       digits = 2,
       caption = c("Long caption", "Short caption"),
       label = "tab:testtable",
       auto = TRUE), type = "html", caption.placement = "top")
Long caption

Gene          logFC   CI.L    CI.R   AveExpr       t  P.Value  adj.P.Val      B   ENTREZID  ENSEMBL
HCAR2         -1.60   -1.78   -1.41     5.17  -18.38     0.00       0.00  16.91     338442  ENSG00000182782
LOC100996342   2.54    2.18    2.90     6.89   15.03     0.00       0.00  14.33  100996342  ENSG00000235478
INPP4B        -1.54   -1.77   -1.31     4.80  -14.19     0.00       0.00  13.58       8821  ENSG00000109452
LOC100505549  -1.26   -1.45   -1.06     3.11  -13.91     0.00       0.00  13.32  100505549
FCGBP         -2.06   -2.38   -1.75     3.39  -13.87     0.00       0.00  13.28       8857  ENSG00000275395
ZSCAN31       -1.40   -1.63   -1.18     3.44  -13.33     0.00       0.00  12.76      64288  ENSG00000235109
MIR3665       -1.32   -1.53   -1.11     2.69  -13.27     0.00       0.00  12.70  100500861  ENSG00000266325
PTH2R         -1.62   -1.90   -1.35     3.69  -12.56     0.00       0.00  11.98       5746  ENSG00000144407
GCLC          -2.42   -2.83   -2.00     3.13  -12.27     0.00       0.00  11.66       2729  ENSG00000001084
SOX30          2.54    2.10    2.98     8.57   12.19     0.00       0.00  11.59      11063  ENSG00000039600


DE analysis using DESeq2

For raw read count data.

Notes on interpreting the results of a DE contrast:

  • positive lfc: treatment > Ctrl; negative lfc: treatment < Ctrl
  • NA for both the p-value and p.adjust indicates outliers detected by Cook’s distance
  • NA only for p.adjust means the gene was filtered by automatic independent filtering for having a low mean normalized count


library(DESeq2)
library(ggplot2)
library(ggrepel)

# find possible contrasts with
DESeq2::resultsNames(data_DESeq)

res <- DESeq2::results(data_DESeq, contrast=list("treatmentActLPS", "treatmentCtrl"), cooksCutoff = 0.99, independentFiltering = TRUE, alpha = 0.05, pAdjustMethod = "BH")

# order results table by the smallest adjusted p value:
res <- res[order(res$padj),]

results = as.data.frame(dplyr::mutate(as.data.frame(res), sig=ifelse(res$padj<0.05, "FDR<0.05", "Not Sig")), row.names=rownames(res))

DEgenes_DESeq <- results[which(abs(results$log2FoldChange) > log2(1.5) & results$padj < 0.05),]

p = ggplot2::ggplot(results, ggplot2::aes(log2FoldChange, -log10(pvalue))) +
  ggplot2::geom_point(ggplot2::aes(col=sig)) +
  ggplot2::scale_color_manual(values=c("red", "black")) +
  ggplot2::ggtitle("Volcano Plot of DESeq2 analysis")

p+ggrepel::geom_text_repel(data=results[1:10, ], ggplot2::aes(label=rownames(results[1:10, ])))

# If there aren't too many DE genes:
#p+geom_text_repel(data=dplyr::filter(results, padj<0.05), aes(label=rownames(results[1:10, ])))


MA-plot

DESeq2::plotMA(res, main="MA Plot", ylim=c(-2,2))


plotCounts normalizes counts by sequencing depth and adds a pseudocount of 1/2 to allow for log-scale plotting:

par(mfrow=c(1,3))

for (i in 1:3){
  gene <- rownames(res)[i]
  main = gene
  DESeq2::plotCounts(data_DESeq, gene=gene, intgroup="treatment", main = main)
}


Gene annotations

Can be used to add e.g. the ENTREZ ID, ENSEMBL ID, etc. to the gene names.

DEgenes_pw <- geneAnnotations(input=DEgenes_pw, keys=row.names(DEgenes_pw), column=c("ENTREZID", "ENSEMBL"), keytype="SYMBOL", organism = "human")
DEgenes_DESeq <- geneAnnotations(input=DEgenes_DESeq, keys=row.names(DEgenes_DESeq), column=c("ENTREZID", "ENSEMBL"), keytype="SYMBOL")
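
Under the hood, geneAnnotations() presumably relies on AnnotationDbi's mapping functions; a minimal sketch of the equivalent lookup (an assumption about the wrapper internals):

library(AnnotationDbi)
library(org.Hs.eg.db)

entrez <- AnnotationDbi::mapIds(org.Hs.eg.db, keys = row.names(DEgenes_pw),
                                column = "ENTREZID", keytype = "SYMBOL", multiVals = "first")
ensembl <- AnnotationDbi::mapIds(org.Hs.eg.db, keys = row.names(DEgenes_pw),
                                 column = "ENSEMBL", keytype = "SYMBOL", multiVals = "first")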


Enrichment Analysis using clusterProfiler

library(clusterProfiler)
library(DOSE)

geneList <- as.vector(Allgenes_limma_pw$logFC)
names(geneList) <- rownames(Allgenes_limma_pw)
gene <- na.omit(DEgenes_pw$ENTREZID)

OrgDb <- org.Hs.eg.db::org.Hs.eg.db # can also be org.Mm.eg.db::org.Mm.eg.db

# Group GO
ggo <- clusterProfiler::groupGO(gene     = gene,
                                OrgDb    = OrgDb,
                                ont      = "BP",
                                level    = 3,
                                readable = TRUE)
head(summary(ggo))
barplot(ggo, drop=TRUE, showCategory=12)

# GO over-representation test
ego <- clusterProfiler::enrichGO(gene          = gene,
                                 OrgDb         = OrgDb,
                                 ont           = "BP",
                                 pAdjustMethod = "BH",
                                 pvalueCutoff  = 0.05,
                                 qvalueCutoff  = 0.05, 
                                 readable      = TRUE)
head(summary(ego))
barplot(ego, showCategory=25)
clusterProfiler::dotplot(ego, showCategory=25)
#enrichMap(ego)
cnetplot(ego, categorySize="pvalue", foldChange=geneList)
clusterProfiler::plotGOgraph(ego)

# enrichGO tests the whole GO corpus, so the enriched results may contain very general terms. With the dropGO function, you can remove specific GO terms or GO levels from results obtained from both enrichGO and compareCluster.


## KEGG over-representation test
kk <- clusterProfiler::enrichKEGG(gene         = gene,
                 organism     = 'hsa',
                 pvalueCutoff = 0.05)
head(summary(kk))
barplot(kk, showCategory=8)
clusterProfiler::dotplot(kk)
cnetplot(kk, categorySize="geneNum", foldChange=geneList)


Pathview

pathview_func() uses the R package pathview.

For Toll-like-receptor signaling:

pathview_func(DEgenes_pw, logFCcolumn="logFC", pathway.id = "04620", out.suffix = "DE_TolLPS")
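
pathview_func() presumably passes a named log-fold-change vector (with Entrez IDs as names) to pathview; a minimal sketch (an assumption about the wrapper internals):

library(pathview)

gene.fc <- DEgenes_pw$logFC
names(gene.fc) <- DEgenes_pw$ENTREZID
pathview::pathview(gene.data = gene.fc, pathway.id = "04620",
                   species = "hsa", out.suffix = "DE_TolLPS")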


Venn diagrams and Biological theme comparison

library(gplots)

venn_list <- list(genelist1 = na.omit(rownames(DEgenes_pw)[1:50]), genelist2 = na.omit(rownames(DEgenes_pw)[30:80]), genelist3 = na.omit(rownames(DEgenes_pw)[48:100]))

gplots::venn(venn_list, show.plot=TRUE, small=0.7, showSetLogicLabel=FALSE, simplify = TRUE)

mtext("If you want a header, put it here", side=3, cex = 0.8)
# Biological theme comparison
compGO <- clusterProfiler::compareCluster(geneCluster   = venn_list,
                                          fun           = "enrichGO",
                                          ont           = "BP",
                                          OrgDb         = OrgDb,
                                          qvalueCutoff  = 0.001,
                                          pAdjustMethod = "BH",
                                          readable      = TRUE)


compKEGG <- clusterProfiler::compareCluster(geneCluster   = venn_list,
                                            fun           = "enrichKEGG",
                                            qvalueCutoff  = 0.05,
                                            pAdjustMethod = "BH")
compKEGG <- clusterProfiler::setReadable(compKEGG, OrgDb = OrgDb)


compMKEGG <- clusterProfiler::compareCluster(geneCluster   = venn_list,
                                             fun           = "enrichMKEGG",
                                             organism='hsa', minGSSize=1)

compDO <- clusterProfiler::compareCluster(geneCluster   = venn_list,
                                          fun           = "enrichDO",
                                          qvalueCutoff  = 0.05,
                                          pAdjustMethod = "BH",
                                          readable      = TRUE)

# For visualisation see Enrichment Analysis using clusterProfiler


Networks

TF networks

Organism can be human or mouse. For human, the longer published list of TFs is used; for mouse, the shorter list provided by Bonn (for which I don’t have any more info on where it comes from).

This function produces two output files:

  • The .expression file: a matrix of TF expression data.

This file has to be opened with Biolayout3D: set the Minimum Correlation and the Correlation metric (by default 0.7 and Pearson). Then choose a suitable correlation coefficient (the Graph Degree Distribution should be close to linear) and save the resulting network as a TGF file.

Open the TGF file in Cytoscape: Open network from file. Then go to Advanced Options: untick “Use first column names” and add " as a delimiter under “Other:”. Now set Column 2 as Source and Column 5 as Target.

  • The node annotation table in .txt format.

In Cytoscape, open the table from file and import the data as Node Table Columns. You can then customize the look of your network.

For example, go to Style and

  • go to Border Paint, Column: DE, Mapping type: Discrete and choose a color for 1 and 0

  • go to Fill Color, Column logFC, Mapping Type: Continuous and choose colors.

TF_networks(expmatrix, nodeAnno=Allgenes_limma_pw)

TF network Pearson 0.7
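
If you prefer to build the correlation network in R rather than Biolayout, the Pearson 0.7 cutoff can be sketched as follows (a minimal sketch, not the TF_networks() implementation; the TF list here is a hypothetical placeholder):

# hypothetical placeholder TF list - substitute the full published TF list
tf_genes <- intersect(rownames(expmatrix), c("STAT1", "IRF1", "NFKB1"))

cormat <- stats::cor(t(expmatrix[tf_genes, ]), method = "pearson")
cormat[lower.tri(cormat, diag = TRUE)] <- NA   # keep each gene pair only once

edges <- which(abs(cormat) > 0.7, arr.ind = TRUE)
edge_list <- data.frame(source = rownames(cormat)[edges[, 1]],
                        target = colnames(cormat)[edges[, 2]],
                        cor    = cormat[edges])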



BiNGO (Cytoscape)

  • Go to Cytoscape

  • Apps -> BiNGO

  • The BiNGO Settings panel pops up. Start by filling in a name for your cluster. This name will be used for creating the output file and the visualization of the results in Cytoscape.

  • Check Paste Genes from Text and paste gene names into field below

  • We want to assess overrepresentation of GO categories, and we want to visualize the results in Cytoscape. The corresponding boxes are checked accordingly by default. Then select a statistical test (the Hypergeometric Test is exact and equivalent to an exact Fisher test, the Binomial Test is less accurate but quicker) and a multiple testing correction (we recommend Benjamini & Hochberg’s FDR correction, the Bonferroni correction will be too conservative in most cases), and choose a significance level, e.g. 0.05. Since we only want to visualize those GO categories that are overrepresented after multiple testing correction, and their parents in the GO hierarchy, select the corresponding visualization option. We’re interested in assessing the overrepresentation of functional categories in our cluster with respect to the whole genome, which is why we choose the Complete Annotation as the reference set.

  • Select GO Biological Process from the ontology list, and Human from the organism list. We want to consider all evidence codes, so don’t fill in anything in the evidence code box.

  • Finally, select a directory to save the output file in (the file will be named test.bgo if you filled in test as a cluster name), and press Start BiNGO…

  • The program will inform you of its progress while parsing the annotations and calculating the tests, corrections and layout. Finally, a visualization of the overrepresented GO categories is created in Cytoscape. Uncolored nodes are not overrepresented, but they are the parents of overrepresented categories further down. Yellow nodes represent GO categories that are overrepresented at the chosen significance level. For more significant p-values, the node color gets increasingly more orange (see also the Color Legend panel). If you’d like another layout, e.g. hierarchical, select the corresponding option from the Cytoscape Visualization menu. Regardless of the layout you choose, you’ll probably have to tweak the nodes a little in order to avoid overlapping node labels. The list of significantly overrepresented functional categories is shown in the BiNGO output window (more information on the cluster and options you selected, and on which genes did not produce any annotation, is stored in the test.bgo file).



Other functions

miRNA target identification

# The current version of multiMiR is 1.0.1 and can be downloaded from
# http://multimir.ucdenver.edu/multiMiR_1.0.1.tar.gz

install.packages("XML")
install.packages("RCurl")
install.packages("/pathname/multiMiR_1.0.1.tar.gz", repos=NULL, type="source")

library(multiMiR)

miRNA_list <- as.character(c("hsa-miR-146b-3p", "hsa-miR-155-5p", "hsa-miR-4521"))

miRNA_allTargets = get.multimir(org="hsa", mirna=miRNA_list, target=NULL, table="all", summary=TRUE, predicted.cutoff.type="p", predicted.cutoff=NULL)
str(miRNA_allTargets)
    
write.table(miRNA_allTargets$predicted, "miRNAs_allTargets_predictedTargets.txt", row.names = F, col.names = T, sep="\t")
write.table(miRNA_allTargets$validated, "miRNAs_allTargets_validatedTargets.txt", row.names = F, col.names = T, sep="\t")
write.table(miRNA_allTargets$disease.drug, "miRNAs_allTargets_diseaseDrugTargets.txt", row.names = F, col.names = T, sep="\t")
write.table(miRNA_allTargets$summary, "miRNAs_allTargets_summaryTargets.txt", row.names = F, col.names = T, sep="\t")


Hypergeometric test for enrichment

Analogous to Fisher’s Exact Test.

# q = SampleSuccesses - 1 (gives P[X >= SampleSuccesses]), m = successes in the population,
# n = failures in the population, k = sample size
phyper(SampleSuccesses - 1, PopulationSuccesses, PopulationSize - PopulationSuccesses, SampleSize, lower.tail = FALSE)
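
A small worked example with made-up numbers: 100 DE genes drawn from a population of 20,000 genes, of which 200 belong to the GO term of interest, and 10 of the DE genes hit the term. Both calls below should return the same one-sided enrichment p-value:

phyper(10 - 1, 200, 20000 - 200, 100, lower.tail = FALSE)

# equivalent one-sided Fisher test; rows: in-term/not-in-term, columns: DE/not-DE
fisher.test(matrix(c(10, 90, 190, 19710), nrow = 2), alternative = "greater")$p.value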


Citations

Martin Morgan, Herve Pages, Valerie Obenchain and Nathaniel Hayden (2016). Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.24.0. http://bioconductor.org/packages/release/bioc/html/Rsamtools.html

Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. <doi:10.1371/journal.pcbi.1003118>

Martin Morgan, Valerie Obenchain, Michel Lang and Ryan Thompson (2016). BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.6.2. https://github.com/Bioconductor/BiocParallel

Martin Morgan, Valerie Obenchain, Jim Hester and Herve Pages (2016). SummarizedExperiment: SummarizedExperiment container. R package version 1.2.2.

Michael I Love, Wolfgang Huber and Simon Anders (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550.

Gautier, L., Cope, L., Bolstad, B. M., and Irizarry, R. A. 2004. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 3 (Feb. 2004), 307-315.

Carvalho B. S., and Irizarry, R. A. 2010. A Framework for Oligonucleotide Microarray Preprocessing. Bioinformatics.

Kauffmann, A., Gentleman, R., Huber, W. (2009) arrayQualityMetrics–a bioconductor package for quality assessment of microarray data. Bioinformatics, 25(3):415-6.

Dunning, M.J., Smith, M.L., Ritchie, M.E., Tavare, S. beadarray: R classes and methods for Illumina bead-based data, Bioinformatics. 2007, 23(16):2183-4.

Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Elana J. Fertig, Andrew E. Jaffe and John D. Storey (2016). sva: Surrogate Variable Analysis. R package version 3.20.0.

Henrik Bengtsson (2016). matrixStats: Functions that Apply to Rows and Columns of Matrices (and to Vectors). R package version 0.50.2. https://CRAN.R-project.org/package=matrixStats

Kevin Wright (2015). corrgram: Plot a Correlogram. R package version 1.8. https://CRAN.R-project.org/package=corrgram

Hansen M, Gerds TA, Nielsen OH, Seidelin JB, Troelsen JT, et al. (2012) pcaGoPromoter - An R Package for Biological and Regulatory Interpretation of Principal Components in Genome-Wide Gene Expression Data. PLoS ONE 7(2): e32394. <doi:10.1371/journal.pone.0032394>

Raivo Kolde (2015). pheatmap: Pretty Heatmaps. R package version 1.0.8. https://CRAN.R-project.org/package=pheatmap

Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz and Bill Venables (2016). gplots: Various R Programming Tools for Plotting Data. R package version 3.0.1. https://CRAN.R-project.org/package=gplots

Tal Galili (2016). heatmaply: Interactive Heat Maps Using ‘plotly’. R package version 0.3.2. https://CRAN.R-project.org/package=heatmaply

Langfelder P and Horvath S, WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008, 9:559 <doi:10.1186/1471-2105-9-559>

Peter Langfelder, Steve Horvath (2012). Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1-17. URL http://www.jstatsoft.org/v46/i11/.

Herve Pages, Marc Carlson, Seth Falcon and Nianhua Li (2016). AnnotationDbi: Annotation Database Interface. R package version 1.34.3.

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.

David B. Dahl (2016). xtable: Export Tables to LaTeX or HTML. R package version 1.8-2. https://CRAN.R-project.org/package=xtable

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.

Kamil Slowikowski (2016). ggrepel: Repulsive Text and Label Geoms for ‘ggplot2’. R package version 0.5. https://CRAN.R-project.org/package=ggrepel

Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012, 16(5):284-287

Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015 31(4):608-609

Marc Carlson (2016). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.3.0.

Marc Carlson (2016). org.Mm.eg.db: Genome wide annotation for Mouse. R package version 3.3.0.

Luo, W. and Brouwer C., Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics, 2013, 29(14): 1830-1831, doi: 10.1093/bioinformatics/btt285

Yuanbin Ru and Katerina Kechris (2014). multiMiR: Integration of multiple microRNA-target databases with their disease and drug associations. R package version 1.0.1.

Session Info

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] exprAnalysis_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] fastcluster_1.1.20    gtools_3.5.0          splines_3.3.1        
##  [4] lattice_0.20-34       colorspace_1.2-6      htmltools_0.3.5      
##  [7] stats4_3.3.1          yaml_2.1.13           chron_2.3-47         
## [10] survival_2.39-5       pcaGoPromoter_1.16.0  foreign_0.8-66       
## [13] DBI_0.5               BiocGenerics_0.18.0   RColorBrewer_1.1-2   
## [16] matrixStats_0.50.2    foreach_1.4.3         plyr_1.8.4           
## [19] stringr_1.1.0         zlibbioc_1.18.0       Biostrings_2.40.2    
## [22] munsell_0.4.3         gtable_0.2.0          caTools_1.17.1       
## [25] codetools_0.2-14      evaluate_0.9          latticeExtra_0.6-28  
## [28] Biobase_2.32.0        knitr_1.14            IRanges_2.6.1        
## [31] doParallel_1.0.10     parallel_3.3.1        AnnotationDbi_1.34.4 
## [34] preprocessCore_1.34.0 Rcpp_0.12.7           acepack_1.3-3.3      
## [37] KernSmooth_2.23-15    flashClust_1.01-2     scales_0.4.0         
## [40] formatR_1.4           gdata_2.17.0          org.Hs.eg.db_3.3.0   
## [43] S4Vectors_0.10.3      Hmisc_3.17-4          WGCNA_1.51           
## [46] XVector_0.12.1        gplots_3.0.1          gridExtra_2.2.1      
## [49] impute_1.46.0         ellipse_0.3-8         ggplot2_2.1.0        
## [52] digest_0.6.10         stringi_1.1.1         grid_3.3.1           
## [55] tools_3.3.1           bitops_1.0-6          magrittr_1.5         
## [58] RSQLite_1.0.0         dynamicTreeCut_1.63-1 Formula_1.2-1        
## [61] cluster_2.0.4         GO.db_3.3.0           Matrix_1.2-7.1       
## [64] data.table_1.9.6      rmarkdown_1.0         iterators_1.0.8      
## [67] rpart_4.1-10          nnet_7.3-12