New R Users group in Münster!

2017-05-20 00:00:00 +0000

This is to announce that Münster now has its very own R users group!


If you are from the area, come join us (or if you happen to know someone who is and who might be interested, please forward the info).

You can find us on meetup.com: https://www.meetup.com/Munster-R-Users-Group/ and we are also on the R User groups list.


Code for the logo, which of course has been created in R:

library(ggplot2)
library(ggmap)
library(ggthemes)

munster <- get_map(location = 'Munster', zoom = 12, maptype = "watercolor")
ggmap(munster) +
  scale_y_continuous(limits=c(51.94, 51.97)) +
  annotate("text", x=7.565, y=51.96, label= "MünsteR", size = 8) +
  annotate("text", x=7.685, y=51.945, label= "R Users Group", size = 5) +
  theme_map()

Network analysis of Game of Thrones family ties

2017-05-15 00:00:00 +0000

In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.


What is a network?

A network in this context is a graph of interconnected nodes/vertices. Nodes can, for example, be people in a social network or genes in a co-expression network. Nodes are connected via ties/edges.
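
As a minimal sketch of these terms (the node names and edges below are made up purely for illustration), such a graph could be built with the igraph package, which is also used later in this post:

library(igraph)

# each pair of names defines one tie/edge between two nodes/vertices
toy_graph <- graph_from_literal(Anna - Ben, Ben - Carla, Carla - Anna, Carla - David)

V(toy_graph)  # the nodes/vertices
E(toy_graph)  # the ties/edges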


What can network analysis tell us?

Network analysis can e.g. be used to explore relationships in social or professional networks. In such cases, we would typically ask questions like:

  • How many connections does each person have?
  • Who is the most connected (i.e. influential or “important”) person?
  • Are there clusters of tightly connected people?
  • Are there a few key players that connect clusters of people?
  • etc.

These answers can give us a lot of information about the patterns of how people interact.
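
As a hedged sketch of how these questions translate into code (using the made-up toy graph from above; igraph offers many alternatives for each step):

# how many connections does each person have?
degree(toy_graph)

# who is the most connected person?
which.max(degree(toy_graph))

# are there clusters of tightly connected people?
cluster_louvain(toy_graph)

# are there key players that connect clusters (high betweenness)?
betweenness(toy_graph)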


The Game of Thrones character network

The basis for this network is Kaggle’s Game of Thrones dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information in part by hand (based on A Wiki of Ice and Fire) and by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.

library(tidyverse)
library(igraph)
library(statnet)
load("union_edges.RData")
load("union_characters.RData")


I am using igraph to plot the initial network. To do so, I first create the graph from the edge- and node-table. An edge-table contains source and target nodes in the first two columns and optionally additional columns with edge attributes. Here, I have the type of interaction (mother, father or spouse), the color and line-type I want to assign to each edge.

Because the books and the TV series differ slightly, I have introduced edges that are only supported or hinted at by the TV series and are not part of the original narrative in the books. These edges are marked by being dotted instead of solid. An additional color for edges with unspecified parental origin is introduced as well. Originally, this color was used for interactions that were extracted from character names (i.e. characters whose names ended with “… son/daughter of …”) and could mean either mother or father. Now, it marks unclear parentage or cases where there is both a biological and a de facto father, as in the case of Jon Snow.

head(union_edges)
##               source            target   type   color   lty
## 1         Lysa Arryn      Robert Arryn mother #7570B3 solid
## 2       Jasper Arryn        Alys Arryn father #1B9E77 solid
## 3       Jasper Arryn         Jon Arryn father #1B9E77 solid
## 4          Jon Arryn      Robert Arryn father #1B9E77 solid
## 110 Cersei Lannister  Tommen Baratheon mother #7570B3 solid
## 210 Cersei Lannister Joffrey Baratheon mother #7570B3 solid

The node table contains one row for each character that is either a source or a target in the edge table. We can add any number and type of node attributes. Here, I chose the following columns from the original Kaggle dataset: gender/male (male = 1, female = 0), house (the house each character was born into) and popularity. house2 assigns a color only to the major houses. Shape represents the gender.

head(union_characters)
##            name male culture          house popularity      house2   color  shape
## 1    Alys Arryn    0    <NA>    House Arryn 0.08026756        <NA>    <NA> circle
## 2 Elys Waynwood    0    <NA> House Waynwood 0.07023411        <NA>    <NA> circle
## 3  Jasper Arryn    1    <NA>    House Arryn 0.04347826        <NA>    <NA> square
## 4   Jeyne Royce    0    <NA>    House Royce 0.00000000        <NA>    <NA> circle
## 5     Jon Arryn    1 Valemen    House Arryn 0.83612040        <NA>    <NA> square
## 6    Lysa Arryn    0    <NA>    House Tully 0.00000000 House Tully #F781BF circle

By default, we have a directed graph.

union_graph <- graph_from_data_frame(union_edges, directed = TRUE, vertices = union_characters)

For plotting the legend, I am summarizing the edge and node colors.

color_vertices <- union_characters %>%
  group_by(house, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))

colors_edges <- union_edges %>%
  group_by(type, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))

Now, we can plot the graph object (here with Fruchterman-Reingold layout):

layout <- layout_with_fr(union_graph)

Click on the image to get to the high resolution pdf:

plot(union_graph,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph)$name),
     vertex.shape = V(union_graph)$shape,
     vertex.color = V(union_graph)$color, 
     vertex.size = (V(union_graph)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph)$color,
     edge.lty = E(union_graph)$lty)
legend("topleft", legend = c(NA, "Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1,
       title = "") 
legend("topleft", legend = "", cex = 4, bty = "n", ncol = 1,
       title = "Game of Thrones Family Ties")


Node color shows the major houses, node size the character’s popularity and node shape their gender (square for male, circle for female). Edge color shows interaction type.

As we can see, even with only a subset of characters from the Game of Thrones world, the network is already quite big. You can click on the image to open the pdf and zoom into specific parts of the plot and read the node labels/character names.

What we can see right away is that there are only limited connections between houses and that the Greyjoys are the only house that has no ties to any of the others.


Network analysis

How do we find out who the most important characters are in this network?

We consider a character “important” if he or she has connections to many other characters. There are a few network properties that tell us more about this. For this, I am considering the network as undirected, treating parent/child relationships as mutual.

union_graph_undir <- as.undirected(union_graph, mode = "collapse")


Centrality

Centrality describes the number of edges that are incoming to or outgoing from a node. Networks with high centralization have a few nodes with many connections, while networks with low centralization have many nodes with similar numbers of edges.

“Centralization is a method for creating a graph level centralization measure from the centrality scores of the vertices.” centralize() help

For the whole network, we can calculate centrality by degree (centr_degree()), closeness (centr_clo()) or eigenvector centrality (centr_eigen()) of vertices.

centr_degree(union_graph_undir, mode = "total")$centralization
## [1] 0.04282795
centr_clo(union_graph_undir, mode = "total")$centralization
## [1] 0.01414082
centr_eigen(union_graph_undir, directed = FALSE)$centralization
## [1] 0.8787532


Node degree

Node degree or degree centrality describes how central a node is in the network (i.e. how many in- and outgoing edges it has or to how many other nodes it is directly connected via one edge).

“The degree of a vertex is its most basic structural property, the number of its adjacent edges.” From the help pages of degree()

We can calculate the number of out- or ingoing edges of each node, or - as I am doing here - the sum of both.

union_graph_undir_degree <- igraph::degree(union_graph_undir, mode = "total")

#standardized by number of nodes
union_graph_undir_degree_std <- union_graph_undir_degree / (vcount(union_graph_undir) - 1)
node_degree <- data.frame(degree = union_graph_undir_degree,
                          degree_std = union_graph_undir_degree_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_degree, by = c("name" = "rowname"))

node_degree %>%
  arrange(-degree) %>%
  .[1:10, ]
##            rowname degree degree_std
## 1  Quellon Greyjoy     12 0.05797101
## 2      Walder Frey     10 0.04830918
## 3   Oberyn Martell     10 0.04830918
## 4     Eddard Stark      9 0.04347826
## 5    Catelyn Stark      8 0.03864734
## 6       Emmon Frey      7 0.03381643
## 7  Genna Lannister      7 0.03381643
## 8     Merrett Frey      7 0.03381643
## 9    Balon Greyjoy      7 0.03381643
## 10 Jason Lannister      7 0.03381643

In this case, the node degree reflects how many offspring and spouses a character had. With three wives and several children, Quellon Greyjoy, the grandfather of Theon and Asha/Yara, comes out on top (of course, had I included all of Walder Frey’s offspring and wives, he would easily be on top, but the network would have become infinitely more confusing).


Closeness

The closeness of a node describes its distance to all other nodes. A node with high closeness is more central and can spread information to many other nodes.

closeness <- igraph::closeness(union_graph_undir, mode = "total")

#standardized by number of nodes
closeness_std <- closeness / (vcount(union_graph_undir) - 1)
node_closeness <- data.frame(closeness = closeness,
                          closeness_std = closeness_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_closeness, by = c("name" = "rowname"))

node_closeness %>%
  arrange(-closeness) %>%
  .[1:10, ]
##             rowname    closeness closeness_std
## 1       Sansa Stark 0.0002013288  9.726028e-07
## 2  Tyrion Lannister 0.0002012882  9.724070e-07
## 3   Tywin Lannister 0.0002011668  9.718201e-07
## 4  Joanna Lannister 0.0002005616  9.688965e-07
## 5      Eddard Stark 0.0002002804  9.675381e-07
## 6     Catelyn Stark 0.0001986492  9.596579e-07
## 7  Cersei Lannister 0.0001984915  9.588960e-07
## 8   Jaime Lannister 0.0001975894  9.545382e-07
## 9    Jeyne Marbrand 0.0001966568  9.500330e-07
## 10  Tytos Lannister 0.0001966568  9.500330e-07

The characters with highest closeness all surround central characters that connect various storylines and houses in Game of Thrones.


Betweenness centrality

Betweenness describes the number of shortest paths that pass through a node. Nodes with high betweenness centrality lie on the paths between many other nodes, i.e. they are people who are key connections or bridges between different groups of nodes. In a social network, these nodes would be very important because they are likely to pass on information to a wide reach of people.

The igraph function betweenness() calculates vertex betweenness, edge_betweenness() calculates edge betweenness:

“The vertex and edge betweenness are (roughly) defined by the number of geodesics (shortest paths) going through a vertex or an edge.” igraph help for estimate_betweenness()

betweenness <- igraph::betweenness(union_graph_undir, directed = FALSE)

# standardize by number of node pairs
betweenness_std <- betweenness / ((vcount(union_graph_undir) - 1) * (vcount(union_graph_undir) - 2) / 2)

node_betweenness <- data.frame(betweenness = betweenness,
                               betweenness_std = betweenness_std) %>%
  tibble::rownames_to_column() 

union_characters <- left_join(union_characters, node_betweenness, by = c("name" = "rowname"))

node_betweenness %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##              rowname betweenness betweenness_std
## 1       Eddard Stark    6926.864       0.3248846
## 2        Sansa Stark    6165.667       0.2891828
## 3   Tyrion Lannister    5617.482       0.2634718
## 4    Tywin Lannister    5070.395       0.2378123
## 5   Joanna Lannister    4737.524       0.2221999
## 6  Rhaegar Targaryen    4301.583       0.2017533
## 7    Margaery Tyrell    4016.417       0.1883784
## 8           Jon Snow    3558.884       0.1669192
## 9        Mace Tyrell    3392.500       0.1591154
## 10   Jason Lannister    3068.500       0.1439191
edge_betweenness <- igraph::edge_betweenness(union_graph_undir, directed = FALSE)

data.frame(edge = attr(E(union_graph_undir), "vnames"),
           betweenness = edge_betweenness) %>%
  tibble::rownames_to_column() %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##    rowname                              edge betweenness
## 1      160      Sansa Stark|Tyrion Lannister    5604.149
## 2      207          Sansa Stark|Eddard Stark    4709.852
## 3      212        Rhaegar Targaryen|Jon Snow    3560.083
## 4      296       Margaery Tyrell|Mace Tyrell    3465.000
## 5      213             Eddard Stark|Jon Snow    3163.048
## 6      131  Jason Lannister|Joanna Lannister    3089.500
## 7      159 Joanna Lannister|Tyrion Lannister    2983.591
## 8      171  Tyrion Lannister|Tywin Lannister    2647.224
## 9      192    Elia Martell|Rhaegar Targaryen    2580.000
## 10     300         Luthor Tyrell|Mace Tyrell    2565.000

This, we can now plot by feeding the node betweenness as vertex.size and edge betweenness as edge.width to our plot function:

plot(union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.color = V(union_graph_undir)$color, 
     vertex.size = betweenness * 0.001, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = edge_betweenness * 0.01,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph_undir)$color,
     edge.lty = E(union_graph_undir)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1)


Ned Stark is the character with the highest betweenness. This makes sense, as he and his children (specifically Sansa and her arranged marriage to Tyrion) connect to other houses and are the central points from which the story unfolds. However, we have to keep in mind that my choice of who is important enough to include in the network (e.g. the Stark ancestors) and who is not (e.g. the whole complicated mess that is the Targaryen and Frey family tree) makes this result somewhat biased.


Diameter

In contrast to the shortest path between two nodes, we can also calculate the longest of all shortest paths, the diameter:

diameter(union_graph_undir, directed = FALSE)
## [1] 21

In our network, the longest path connects 21 nodes.
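
For comparison, the shortest path between two specific characters could be extracted like this (a sketch; the two names are simply examples of nodes in this network):

# vertices along the shortest path between two characters
shortest_paths(union_graph_undir, from = "Eddard Stark", to = "Tyrion Lannister")$vpath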

“get_diameter returns a path with the actual diameter. If there are many shortest paths of the length of the diameter, then it returns the first one found.” diameter() help

This, we can also plot:

union_graph_undir_diameter <- union_graph_undir
node_diameter <- get.diameter(union_graph_undir_diameter,  directed = FALSE)

V(union_graph_undir_diameter)$color <- scales::alpha(V(union_graph_undir_diameter)$color, alpha = 0.5)
V(union_graph_undir_diameter)$size <- 2

V(union_graph_undir_diameter)[node_diameter]$color <- "red"
V(union_graph_undir_diameter)[node_diameter]$size <- 5

E(union_graph_undir_diameter)$color <- "grey"
E(union_graph_undir_diameter)$width <- 1

E(union_graph_undir_diameter, path = node_diameter)$color <- "red"
E(union_graph_undir_diameter, path = node_diameter)$width <- 5

plot(union_graph_undir_diameter,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir_diameter)$name),
     vertex.shape = V(union_graph_undir_diameter)$shape,
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.lty = E(union_graph_undir_diameter)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1)



Transitivity

“Transitivity measures the probability that the adjacent vertices of a vertex are connected. This is sometimes also called the clustering coefficient.” transitivity() help

We can calculate the transitivity or ratio of triangles to connected triples for the whole network:

transitivity(union_graph_undir, type = "global")
## [1] 0.2850679

Or for each node:

transitivity <- data.frame(name = V(union_graph_undir)$name,
      transitivity = transitivity(union_graph_undir, type = "local")) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, transitivity, by = "name")

transitivity %>%
  arrange(-transitivity) %>%
  .[1:10, ]
##                 name transitivity
## 1       Robert Arryn            1
## 2   Ormund Baratheon            1
## 3     Selyse Florent            1
## 4  Shireen Baratheon            1
## 5   Amarei Crakehall            1
## 6       Marissa Frey            1
## 7        Olyvar Frey            1
## 8        Perra Royce            1
## 9        Perwyn Frey            1
## 10         Tion Frey            1

Because ours is a family network, characters with a transitivity of one form triangles with their parents or offspring.


PageRank centrality

PageRank (originally used by Google to rank the importance of search results) is similar to eigenvector centrality. Eigenvector centrality scores nodes in a network according to the number of connections to high-degree nodes they have. It is therefore a measure of node importance. PageRank similarly considers nodes as more important if they have many incoming edges (or links).

page_rank <- page.rank(union_graph_undir, directed = FALSE)

page_rank_centrality <- data.frame(name = names(page_rank$vector),
      page_rank = page_rank$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, page_rank_centrality, by = "name")

page_rank_centrality %>%
  arrange(-page_rank) %>%
  .[1:10, ]
##                 name   page_rank
## 1     Oberyn Martell 0.018402407
## 2    Quellon Greyjoy 0.016128129
## 3        Walder Frey 0.012956029
## 4       Eddard Stark 0.011725019
## 5       Cregan Stark 0.010983561
## 6      Catelyn Stark 0.010555473
## 7       Lyarra Stark 0.009876629
## 8  Aegon V Targaryen 0.009688458
## 9      Balon Greyjoy 0.009647049
## 10         Jon Arryn 0.009623742

Oberyn Martell, Quellon Greyjoy and Walder Frey have the highest numbers of spouses, children and grandchildren and are therefore scored highest for PageRank.


Matrix representation of a network

Connections between nodes can also be represented as an adjacency matrix. We can convert our graph object to an adjacency matrix with igraph’s as_adjacency_matrix() function. Whenever there is an edge between two nodes, this field in the matrix will get assigned a 1, otherwise it is 0.

adjacency <- as.matrix(as_adjacency_matrix(union_graph_undir))
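
As a quick sanity check (a sketch, assuming the collapsed undirected graph contains no multi-edges or loops), the matrix should be symmetric and its row sums should match the node degrees calculated earlier:

# the adjacency matrix of an undirected graph is symmetric
isSymmetric(adjacency)

# row sums correspond to the node degrees
all(rowSums(adjacency) == igraph::degree(union_graph_undir, mode = "total"))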


Eigenvector centrality

We can now calculate the eigenvalues and eigenvectors of the adjacency matrix.

#degree diagonal matrix
degree_diag <- diag(1 / igraph::degree(union_graph_undir))

# PageRank matrix
pagerank <- adjacency %*% degree_diag

eigenvalues <- eigen(pagerank)

The eigenvector with the highest eigenvalue scores those vertices highly that have many edges or that are connected to vertices with many edges.

eigenvector <- data.frame(name = rownames(pagerank),
           eigenvector = as.numeric(eigenvalues$vectors[, which.max(eigenvalues$values)]))

union_characters <- left_join(union_characters, eigenvector, by = "name")

eigenvector %>%
  arrange(eigenvector) %>%
  .[1:10, ]
##                       name eigenvector
## 1          Quellon Greyjoy  -0.6625628
## 2            Balon Greyjoy  -0.3864950
## 3   Lady of House Sunderly  -0.3312814
## 4           Alannys Harlaw  -0.2760678
## 5  Lady of House Stonetree  -0.2208543
## 6      Asha (Yara) Greyjoy  -0.1656407
## 7            Robin Greyjoy  -0.1104271
## 8            Euron Greyjoy  -0.1104271
## 9          Urrigon Greyjoy  -0.1104271
## 10       Victarion Greyjoy  -0.1104271

Because of their highly connected family ties (i.e. there are only a handful of connections but they are almost all triangles), the Greyjoys received the highest (absolute) values in this eigenvector.

We can find the eigenvector centrality scores with:

eigen_centrality <- igraph::eigen_centrality(union_graph_undir, directed = FALSE)

eigen_centrality <- data.frame(name = names(eigen_centrality$vector),
           eigen_centrality = eigen_centrality$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, eigen_centrality, by = "name")

eigen_centrality %>%
  arrange(-eigen_centrality) %>%
  .[1:10, ]
##                name eigen_centrality
## 1   Tywin Lannister        1.0000000
## 2  Cersei Lannister        0.9168980
## 3  Joanna Lannister        0.8358122
## 4    Jeyne Marbrand        0.8190076
## 5   Tytos Lannister        0.8190076
## 6   Genna Lannister        0.7788376
## 7   Jaime Lannister        0.7642870
## 8  Robert Baratheon        0.7087042
## 9        Emmon Frey        0.6538709
## 10      Walder Frey        0.6516021

When we consider eigenvector centrality, Tywin and the core Lannister family score highest.


Who are the most important characters?

We can now compare all the node-level information to decide which characters are the most important in Game of Thrones. Such node level characteristics could also be used as input for machine learning algorithms.
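
As a sketch of what such an input table might look like (assuming the centrality columns that were joined to union_characters above; scaling is just one possible preprocessing choice):

# assemble a node-level feature table and scale the numeric columns
node_features <- union_characters %>%
  dplyr::select(name, house, degree, closeness, betweenness, transitivity, page_rank, eigen_centrality) %>%
  mutate_if(is.numeric, funs(as.numeric(scale(.))))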

Let’s look at all characters from the major houses:

union_characters %>%
  filter(!is.na(house2)) %>%
  dplyr::select(-contains("_std")) %>%
  gather(x, y, degree:eigen_centrality) %>%
  ggplot(aes(x = name, y = y, color = house2)) +
    geom_point(size = 3) +
    facet_grid(x ~ house2, scales = "free") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))


Taken together, we could say that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones.


Groups of nodes

We can also analyze dyads (pairs of two nodes), triads (groups of three nodes) and bigger cliques in our network. For dyads, we can use the function dyad_census() from igraph or dyad.census() from sna. Both are identical and calculate a Holland and Leinhardt dyad census with

  • mut: The number of pairs with mutual connections (in our case, spouses).
  • asym: The number of pairs with non-mutual connections (in the original network: mother-child and father-child relationships; but in the undirected network, there are none).
  • null: The number of pairs with no connection between them.
#igraph::dyad_census(union_graph_undir)
sna::dyad.census(adjacency)
##      Mut Asym  Null
## [1,] 326    0 21202

The same can be calculated for triads (see ?triad_census for details on what each output means).

#igraph::triad_census(union_graph_undir)
sna::triad.census(adjacency)
##          003 012   102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C 210 300
## [1,] 1412100   0 65261    0    0    0    0    0    0    0 790    0    0    0   0 105
triad.classify(adjacency, mode = "graph")
## [1] 2

We can also calculate the number of paths and cycles of any length we specify, here e.g. of length <= 5. For edges, we obtain the sum of counts for all paths or cycles up to the given maximum length. For vertices/nodes, we obtain the number of paths or cycles to which each node belongs.

node_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = TRUE, dyadic.tabulation = "sum")
edge_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = FALSE)
edge_kpath
## $path.count
##     1     2     3     4     5 
##   326  1105  2973  7183 17026

This, we could plot with (but here, it does not give much additional information):

gplot(node_kpath$paths.bydyad,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")
node_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = TRUE, cycle.comembership = "sum")
edge_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = FALSE)
edge_kcycle
## $cycle.count
##   2   3   4   5   6   7   8 
##   0 105 136  27  57  58  86
node_kcycle_reduced <- node_kcycle$cycle.comemb
node_kcycle_reduced <- node_kcycle_reduced[which(rowSums(node_kcycle_reduced) > 0), which(colSums(node_kcycle_reduced) > 0)]

gplot(node_kcycle_reduced,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")


“A (maximal) clique is a maximal set of mutually adjacent vertices.” clique.census() help

node_clique <- clique.census(adjacency, mode = "graph", tabulate.by.vertex = TRUE, clique.comembership = "sum")
edge_clique <- clique.census(adjacency, mode = "graph", tabulate.by.vertex = FALSE, clique.comembership = "sum")
edge_clique$clique.count
##   1   2   3 
##   0  74 105
node_clique_reduced <- node_clique$clique.comemb
node_clique_reduced <- node_clique_reduced[which(rowSums(node_clique_reduced) > 0), which(colSums(node_clique_reduced) > 0)]

gplot(node_clique_reduced,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")


The largest cliques in this network contain three nodes, i.e. the parent/parent/child relationships. Therefore, it does not really make sense to plot them all, but we could plot and highlight the first of the largest cliques with:

vcol <- rep("grey80", vcount(union_graph_undir))

# highlight first of largest cliques
vcol[unlist(largest_cliques(union_graph_undir)[[1]])] <- "red"

plot(union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.color = vcol, 
     vertex.size = 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = 2,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph_undir)$color,
     edge.lty = E(union_graph_undir)$lty)


Clustering

We can also look for groups within our network by clustering node groups according to their edge betweenness:

ceb <- cluster_edge_betweenness(union_graph_undir)
modularity(ceb)
## [1] 0.8359884
plot(ceb,
     union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8)


Or based on propagating labels:

clp <- cluster_label_prop(union_graph_undir)

plot(clp,
     union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8)



Network properties

We can also feed our adjacency matrix to other functions, like GenInd() from the NetIndices package. This function calculates a number of network properties, like the number of compartments (N), total system throughput (T..), total system throughflow (TST), number of internal links (Lint), total number of links (Ltot), link density (LD), connectance (C), average link weight (Tijbar), average compartment throughflow (TSTbar) and compartmentalization or degree of connectedness of subsystems in the network (Cbar).

library(NetIndices)
graph.properties <- GenInd(adjacency)
graph.properties
## $N
## [1] 208
## 
## $T..
## [1] 652
## 
## $TST
## [1] 652
## 
## $Lint
## [1] 652
## 
## $Ltot
## [1] 652
## 
## $LD
## [1] 3.134615
## 
## $C
## [1] 0.01514307
## 
## $Tijbar
## [1] 1
## 
## $TSTbar
## [1] 3.134615
## 
## $Cbar
## [1] 0.01086163


Alternatively, the network package provides additional functions to obtain network properties. Here, we can again feed in the adjacency matrix of our network and convert it to a network object.

library(network)
adj_network <- network(adjacency, directed = TRUE)
adj_network
##  Network attributes:
##   vertices = 208 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 652 
##     missing edges= 0 
##     non-missing edges= 652 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
## No edge attributes

From this network object, we can e.g. get the number of dyads and edges within a network and the network size.

network.dyadcount(adj_network)
## [1] 43056
network.edgecount(adj_network)
## [1] 652
network.size(adj_network)
## [1] 208

“equiv.clust uses a definition of approximate equivalence (equiv.fun) to form a hierarchical clustering of network positions. Where dat consists of multiple relations, all specified relations are considered jointly in forming the equivalence clustering.” equiv.clust() help

ec <- equiv.clust(adj_network, mode = "graph", cluster.method = "average", plabels = network.vertex.names(adj_network))
ec
## Position Clustering:
## 
##  Equivalence function: sedist 
##  Equivalence metric: hamming 
##  Cluster method: average 
##  Graph order: 208
ec$cluster$labels <- ec$plabels
plot(ec)


From the sna package, we can e.g. use functions that tell us the graph density and the dyadic reciprocity of the vertices or edges:

gden(adjacency)
## [1] 0.01514307
grecip(adjacency)
## Mut 
##   1
grecip(adjacency, measure = "edgewise")
## Mut 
##   1


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NetIndices_1.4.4     MASS_7.3-47          statnet_2016.9       sna_2.4              ergm.count_3.2.2     tergm_3.4.0          networkDynamic_0.9.0 ergm_3.7.1           network_1.13.0       statnet.common_3.3.0 igraph_1.0.1         dplyr_0.5.0          purrr_0.2.2          readr_1.1.0          tidyr_0.6.2          tibble_1.3.0         ggplot2_2.2.1        tidyverse_1.1.1     
## 
## loaded via a namespace (and not attached):
##  [1] lpSolve_5.6.13    reshape2_1.4.2    haven_1.0.0       lattice_0.20-35   colorspace_1.3-2  htmltools_0.3.6   yaml_2.1.14       foreign_0.8-68    DBI_0.6-1         modelr_0.1.0      readxl_1.0.0      trust_0.1-7       plyr_1.8.4        robustbase_0.92-7 stringr_1.2.0     munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0  rvest_0.3.2       coda_0.19-1       psych_1.7.5       evaluate_0.10     labeling_0.3      knitr_1.15.1      forcats_0.2.0     parallel_3.4.0    DEoptimR_1.0-8    broom_0.4.2       Rcpp_0.12.10      scales_0.4.1      backports_1.0.5   jsonlite_1.4      mnormt_1.5-5      hms_0.3           digest_0.6.12     stringi_1.1.5     grid_3.4.0        rprojroot_1.2     tools_3.4.0       magrittr_1.5      lazyeval_0.2.0    Matrix_1.2-10     xml2_1.1.1        lubridate_1.6.0   assertthat_0.2.0  rmarkdown_1.5     httr_1.2.1        R6_2.2.0          nlme_3.1-131      compiler_3.4.0

Update to autoencoders and anomaly detection with machine learning in fraud analytics

2017-05-02 00:00:00 +0000

This is a reply to Wojciech Indyk’s comment on yesterday’s post on autoencoders and anomaly detection with machine learning in fraud analytics:

“I think you can improve the detection of anomalies if you change the training set to the deep-autoencoder. As I understand the train_unsupervised contains both class 0 and class 1. If you put only class 0 as the input of the autoencoder, the network should learn the “normal” pattern. Then the MSE should be higher for 1 class in the test set (of course if anomaly==fraud).”

To test this, I follow the same workflow as in yesterday’s post but this time, I am moving all fraud instances from the first training set for unsupervised learning to the second training set for supervised learning. Now, the autoencoder learns its pattern solely from non-fraud cases.

library(tidyverse)
library(h2o)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         20 minutes 11 seconds 
##     H2O cluster version:        3.10.4.4 
##     H2O cluster version age:    17 days  
##     H2O cluster name:           H2O_started_from_R_Shirin_erp741 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.54 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)
# convert data to H2OFrame
creditcard_hf <- as.h2o(creditcard)
splits <- h2o.splitFrame(creditcard_hf, 
                         ratios = c(0.4, 0.4), 
                         seed = 42)

train_unsupervised  <- splits[[1]]
train_supervised  <- splits[[2]]
test <- splits[[3]]

# move class 1 instances to second training set...
train_supervised <- rbind(as.data.frame(train_supervised), as.data.frame(train_unsupervised[train_unsupervised$Class == "1", ])) %>%
  as.h2o()

# ... and remove from first training set
train_unsupervised <- train_unsupervised[train_unsupervised$Class == "0", ]

response <- "Class"
features <- setdiff(colnames(train_unsupervised), response)
model_nn <- h2o.deeplearning(x = features,
                             training_frame = train_unsupervised,
                             model_id = "model_nn",
                             autoencoder = TRUE,
                             reproducible = TRUE, #slow - turn off for real problems
                             ignore_const_cols = FALSE,
                             seed = 42,
                             hidden = c(10, 2, 10), 
                             epochs = 100,
                             activation = "Tanh")
anomaly <- h2o.anomaly(model_nn, test) %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  mutate(Class = as.vector(test[, 31]))

mean_mse <- anomaly %>%
  group_by(Class) %>%
  summarise(mean = mean(Reconstruction.MSE))
ggplot(anomaly, aes(x = as.numeric(rowname), y = Reconstruction.MSE, color = as.factor(Class))) +
  geom_point(alpha = 0.3) +
  geom_hline(data = mean_mse, aes(yintercept = mean, color = Class)) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "instance number",
       color = "Class")

Compared to the results from yesterday’s post, this model seems to have learned a pattern that distinguishes two major groups of cases. The mean reconstruction MSE was slightly higher for class 0 and definitely higher for class 1.

anomaly <- anomaly %>%
  mutate(outlier = ifelse(Reconstruction.MSE > 0.04, "outlier", "no_outlier"))

anomaly %>%
  group_by(Class, outlier) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: Class [2]
## 
##   Class    outlier     n         freq
##   <chr>      <chr> <int>        <dbl>
## 1     0 no_outlier 56608 0.9995762113
## 2     0    outlier    24 0.0004237887
## 3     1 no_outlier    60 0.6521739130
## 4     1    outlier    32 0.3478260870

Anomaly detection with a higher threshold based on the plot above did not improve the results compared to yesterday’s post.

With a lower threshold of 0.02 (not shown here), the test set performed much better for detecting fraud cases as outliers (65 vs 27, compared to 32 vs 60 in yesterday’s post). However, this model also categorized many more non-fraud cases as outliers (2803 vs 53829, compared to only 30 vs 56602).


Now, I am again using the autoencoder model as pre-training input for supervised learning.

model_nn_2 <- h2o.deeplearning(y = response,
                               x = features,
                               training_frame = train_supervised,
                               pretrained_autoencoder  = "model_nn",
                               reproducible = TRUE, #slow - turn off for real problems
                               balance_classes = TRUE,
                               ignore_const_cols = FALSE,
                               seed = 42,
                               hidden = c(10, 2, 10), 
                               epochs = 100,
                               activation = "Tanh")
pred <- as.data.frame(h2o.predict(object = model_nn_2, newdata = test)) %>%
  mutate(actual = as.vector(test[, 31]))
pred %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n       freq
##    <chr>  <fctr> <int>      <dbl>
## 1      0       0 56347 0.99496751
## 2      0       1   285 0.00503249
## 3      1       0     9 0.09782609
## 4      1       1    83 0.90217391

This model is now much better at identifying fraud cases than in yesterday’s post (90%, compared to 83% - even though we can’t directly compare the two models as they were trained on different training sets) but it is also slightly less accurate at predicting non-fraud cases (99.5%, compared to 99.8%).
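
To summarize this comparison in two numbers, we could also compute sensitivity and precision for the fraud class from the pred table above (a sketch using the column names from the code above):

pred %>%
  summarise(sensitivity = sum(predict == "1" & actual == "1") / sum(actual == "1"),
            precision   = sum(predict == "1" & actual == "1") / sum(predict == "1"))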


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] h2o_3.10.4.4    dplyr_0.5.0     purrr_0.2.2     readr_1.1.0    
## [5] tidyr_0.6.1     tibble_1.3.0    ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10       RColorBrewer_1.1-2 cellranger_1.1.0  
##  [4] compiler_3.4.0     plyr_1.8.4         bitops_1.0-6      
##  [7] forcats_0.2.0      tools_3.4.0        digest_0.6.12     
## [10] lubridate_1.6.0    jsonlite_1.4       evaluate_0.10     
## [13] nlme_3.1-131       gtable_0.2.0       lattice_0.20-35   
## [16] psych_1.7.3.21     DBI_0.6-1          yaml_2.1.14       
## [19] parallel_3.4.0     haven_1.0.0        xml2_1.1.1        
## [22] stringr_1.2.0      httr_1.2.1         knitr_1.15.1      
## [25] hms_0.3            rprojroot_1.2      grid_3.4.0        
## [28] R6_2.2.0           readxl_1.0.0       foreign_0.8-68    
## [31] rmarkdown_1.5      modelr_0.1.0       reshape2_1.4.2    
## [34] magrittr_1.5       backports_1.0.5    scales_0.4.1      
## [37] htmltools_0.3.6    rvest_0.3.2        assertthat_0.2.0  
## [40] mnormt_1.5-5       colorspace_1.3-2   labeling_0.3      
## [43] stringi_1.1.5      RCurl_1.95-4.8     lazyeval_0.2.0    
## [46] munsell_0.4.3      broom_0.4.2

Autoencoders and anomaly detection with machine learning in fraud analytics

2017-05-01 00:00:00 +0000

All my previous posts on machine learning have dealt with supervised learning. But we can also use machine learning for unsupervised learning. The latter is used, for example, for clustering and (non-linear) dimensionality reduction.

For this task, I am using Kaggle’s credit card fraud dataset from the following study:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

The dataset gives > 280,000 instances of credit card use and for each transaction, we know whether it was fraudulent or not.

Datasets like this need special treatment when performing machine learning because they are severely unbalanced: in this case, only 0.17% of all transactions are fraudulent.

While we could try to work with classifiers, like random forests or support vector machines, by applying over- or under-sampling techniques, we can alternatively try to find anomalies in the data (assuming we expect our fraud cases to be anomalies within the whole dataset).

When dealing with such a severe unbalance of response labels, we also need to be careful when measuring model performance. Because there are only a handful of fraudulent instances, a model that predicts everything as non-fraud will already achieve a > 99% accuracy. But despite its high accuracy, such a model won’t necessarily help us find fraudulent cases - the proverbial “needle-in-a-haystack” - that we actually want to find!
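
A quick back-of-the-envelope sketch illustrates this (the 0.17% figure is the fraud rate mentioned above; the “always predict non-fraud” model is a hypothetical baseline):

# accuracy of a baseline model that always predicts "non-fraud"
fraud_rate <- 0.0017
1 - fraud_rate
# ~0.9983 accuracy - but the recall (sensitivity) for the fraud class is 0,
# so accuracy alone is misleading on such unbalanced data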

Below, I will show how you can use autoencoders and anomaly detection, how you can use autoencoders to pre-train a classification model and how you can measure model performance on unbalanced data.


Exploring the data

library(tidyverse)
# download from https://www.kaggle.com/dalpozz/creditcardfraud
creditcard <- read.csv("creditcard.csv")

The dataset contains numerical input variables V1 to V28, which are the result of a PCA transformation of the original features (which could not be provided due to confidentiality issues).

The response variable Class tells us whether a transaction was fraudulent (value = 1) or not (value = 0).

creditcard %>%
  ggplot(aes(x = Class)) +
    geom_bar(color = "grey", fill = "lightgrey") +
    theme_bw()

There are two additional features, Time (time in seconds between each transaction and the first transaction) and Amount (how much money was transferred in this transaction).

Because Time only tells us the order in which transactions happened, it doesn’t actually tell us anything about the actual times (i.e. time of day) of the transactions. Therefore, I am normalizing them by day and binning them into four groups according to time of day.

summary(creditcard$Time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   54202   84692   94814  139320  172792
# how many seconds are 24 hours
# 1 hr = 60 mins = 60 x 60 s = 3600 s
3600 * 24
## [1] 86400
# separate transactions by day
creditcard$day <- ifelse(creditcard$Time > 3600 * 24, "day2", "day1")

# make transaction relative to day
creditcard$Time_day <- ifelse(creditcard$day == "day2", creditcard$Time - 86400, creditcard$Time)

summary(creditcard[creditcard$day == "day1", ]$Time_day)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   38432   54689   52948   70977   86400
summary(creditcard[creditcard$day == "day2", ]$Time_day)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   37843   53425   51705   68182   86392
# bin transactions according to time of day
creditcard$Time <- as.factor(ifelse(creditcard$Time_day <= 38138, "gr1", # mean 1st Qu.
                          ifelse(creditcard$Time_day <= 52327, "gr2", # mean mean
                                 ifelse(creditcard$Time_day <= 69580, "gr3", # mean 3rd Qu
                                        "gr4"))))
creditcard %>%
  ggplot(aes(x = day)) +
    geom_bar(color = "grey", fill = "lightgrey") +
    theme_bw()

We can see now that the transactions in this dataset have all been recorded on two consecutive days and there are roughly the same number of transactions on these two days.

Now, I remove the columns I used to create the Time bins.

creditcard <- select(creditcard, -Time_day, -day)

# convert class variable to factor
creditcard$Class <- factor(creditcard$Class)
creditcard %>%
  ggplot(aes(x = Time)) +
    geom_bar(color = "grey", fill = "lightgrey") +
    theme_bw() +
    facet_wrap( ~ Class, scales = "free", ncol = 2)

The distribution of transactions over the four Time bins shows that the majority of fraud cases happened in group 1 (although I can’t say when exactly, because the original Time feature did not tell us when the first transaction occurred).

I also want to look at the distribution of the amounts of money that were transferred:

summary(creditcard[creditcard$Class == "0", ]$Amount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.65    22.00    88.29    77.05 25691.16
summary(creditcard[creditcard$Class == "1", ]$Amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    9.25  122.21  105.89 2125.87
creditcard %>%
  ggplot(aes(x = Amount)) +
    geom_histogram(color = "grey", fill = "lightgrey", bins = 50) +
    theme_bw() +
    facet_wrap( ~ Class, scales = "free", ncol = 2)

Interestingly, fraudulent credit card transactions had a higher mean amount of money that was transferred, but the maximum amount was much lower compared to regular transactions.


Modeling

For modeling, I am using R’s H2O implementation with the h2o package. For more details and other examples, see my posts on my machine learning webinar, on building neural nets with h2o and on performing grid search for hyperparameter tuning.

library(h2o)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         7 minutes 6 seconds 
##     H2O cluster version:        3.10.4.4 
##     H2O cluster version age:    16 days  
##     H2O cluster name:           H2O_started_from_R_Shirin_nly512 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.60 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)
# convert data to H2OFrame
creditcard_hf <- as.h2o(creditcard)


Then, I am splitting the dataset into training and test sets. Because I want to check how a pre-trained model will perform, I am splitting my data into two separate training sets and one independent test set for final model comparison.

splits <- h2o.splitFrame(creditcard_hf, 
                         ratios = c(0.4, 0.4), 
                         seed = 42)

train_unsupervised  <- splits[[1]]
train_supervised  <- splits[[2]]
test <- splits[[3]]

response <- "Class"
features <- setdiff(colnames(train_unsupervised), response)


Autoencoders

First, I am training the unsupervised neural network model using deep learning autoencoders. With h2o, we can simply set autoencoder = TRUE.

Here, I am applying a technique called “bottleneck” training, where the hidden layer in the middle is very small. This means that my model will have to reduce the dimensionality of the input data (in this case, down to 2 nodes/dimensions).

The autoencoder model will then learn the patterns of the input data irrespective of given class labels. Here, it will learn which credit card transactions are similar and which transactions are outliers or anomalies. We need to keep in mind, though, that autoencoder models are sensitive to outliers in our data, which might throw off otherwise typical patterns.

model_nn <- h2o.deeplearning(x = features,
                             training_frame = train_unsupervised,
                             model_id = "model_nn",
                             autoencoder = TRUE,
                             reproducible = TRUE, #slow - turn off for real problems
                             ignore_const_cols = FALSE,
                             seed = 42,
                             hidden = c(10, 2, 10), 
                             epochs = 100,
                             activation = "Tanh")

Because training can take a while, I am saving the model:

h2o.saveModel(model_nn, path="model_nn", force = TRUE)
model_nn <- h2o.loadModel("model_nn")
model_nn
## Model Details:
## ==============
## 
## H2OAutoEncoderModel: deeplearning
## Model ID:  model_nn 
## Status of Neuron Layers: auto-encoder, gaussian distribution, Quadratic loss, 776 weights/biases, 16.0 KB, 2,622,851 training samples, mini-batch size 1
##   layer units  type dropout       l1       l2 mean_rate rate_rms momentum
## 1     1    34 Input  0.00 %                                              
## 2     2    10  Tanh  0.00 % 0.000000 0.000000  0.709865 0.320108 0.000000
## 3     3     2  Tanh  0.00 % 0.000000 0.000000  0.048458 0.109033 0.000000
## 4     4    10  Tanh  0.00 % 0.000000 0.000000  0.164717 0.192053 0.000000
## 5     5    34  Tanh         0.000000 0.000000  0.369681 0.425672 0.000000
##   mean_weight weight_rms mean_bias bias_rms
## 1                                          
## 2   -0.039307   0.691302  0.008052 0.965178
## 3   -0.097383   0.314106  0.226376 0.067917
## 4    0.227664   1.089589  0.112032 0.672444
## 5    0.011072   0.605586  0.091124 0.722602
## 
## 
## H2OAutoEncoderMetrics: deeplearning
## ** Reported on training data. **
## 
## Training Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.001472071
## RMSE: (Extract with `h2o.rmse`) 0.03836757
#Convert to autoencoded representation
test_autoenc <- h2o.predict(model_nn, test)


Dimensionality reduction with hidden layers

Because I had used a bottleneck model with two nodes in the hidden layer in the middle, we can use this dimensionality reduction to explore our feature space (similar to what we could do with a principal component analysis). We can extract these hidden features with the h2o.deepfeatures() function and plot them to show the reduced representation of the input data.

train_features <- h2o.deepfeatures(model_nn, train_unsupervised, layer = 2) %>%
  as.data.frame() %>%
  mutate(Class = as.vector(train_unsupervised[, 31]))
ggplot(train_features, aes(x = DF.L2.C1, y = DF.L2.C2, color = Class)) +
  geom_point(alpha = 0.1)

Here, we do not see a cluster of fraudulent transactions that is distinct from non-fraud instances, so dimensionality reduction with our autoencoder model alone is not sufficient to identify fraud in this dataset.

But we could use the reduced dimensionality representation of one of the hidden layers as features for model training. An example would be to use the 10 features from the first or third hidden layer:

# let's take the third hidden layer
train_features <- h2o.deepfeatures(model_nn, train_unsupervised, layer = 3) %>%
  as.data.frame() %>%
  mutate(Class = as.factor(as.vector(train_unsupervised[, 31]))) %>%
  as.h2o()
features_dim <- setdiff(colnames(train_features), response)
model_nn_dim <- h2o.deeplearning(y = response,
                               x = features_dim,
                               training_frame = train_features,
                               reproducible = TRUE, #slow - turn off for real problems
                               balance_classes = TRUE,
                               ignore_const_cols = FALSE,
                               seed = 42,
                               hidden = c(10, 2, 10), 
                               epochs = 100,
                               activation = "Tanh")
h2o.saveModel(model_nn_dim, path="model_nn_dim", force = TRUE)
model_nn_dim <- h2o.loadModel("model_nn_dim/DeepLearning_model_R_1493574057843_49")
model_nn_dim
## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1493574057843_49 
## Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 184 weights/biases, 7.1 KB, 4,098,024 training samples, mini-batch size 1
##   layer units    type dropout       l1       l2 mean_rate rate_rms
## 1     1    10   Input  0.00 %                                     
## 2     2    10    Tanh  0.00 % 0.000000 0.000000  0.299549 0.363303
## 3     3     2    Tanh  0.00 % 0.000000 0.000000  0.009907 0.012193
## 4     4    10    Tanh  0.00 % 0.000000 0.000000  0.248854 0.126722
## 5     5     2 Softmax         0.000000 0.000000  0.160990 0.078139
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.539949   4.601841 -0.060170 2.765655
## 3 0.000000   -0.384247   1.763649 -1.429832 0.570700
## 4 0.000000   -0.004865   0.774441 -0.320202 0.668747
## 5 0.000000    0.563951   1.365125  0.000000 0.397460
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9974 samples **
## 
## MSE:  0.4905394
## RMSE:  0.7003852
## LogLoss:  2.763673
## Mean Per-Class Error:  0.3856454
## AUC:  0.6656933
## Gini:  0.3313867
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1438 3594 0.714229  =3594/5032
## 1       282 4660 0.057062   =282/4942
## Totals 1720 8254 0.388610  =3876/9974
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.002173 0.706275 300
## 2                       max f2  0.001540 0.842820 335
## 3                 max f0point5  0.002531 0.619024 280
## 4                 max accuracy  0.002531 0.619711 280
## 5                max precision  0.030005 1.000000   0
## 6                   max recall  0.000496 1.000000 388
## 7              max specificity  0.030005 1.000000   0
## 8             max absolute_mcc  0.002173 0.302697 300
## 9   max min_per_class_accuracy  0.002590 0.554634 277
## 10 max mean_per_class_accuracy  0.002531 0.622276 280
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`


For measuring model performance on test data, we need to convert the test data to the same reduced dimensions as the training data:

test_dim <- h2o.deepfeatures(model_nn, test, layer = 3)
h2o.predict(model_nn_dim, test_dim) %>%
  as.data.frame() %>%
  mutate(actual = as.vector(test[, 31])) %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n       freq
##    <chr>  <fctr> <int>      <dbl>
## 1      0       0 16710 0.29506286
## 2      0       1 39922 0.70493714
## 3      1       0     7 0.07608696
## 4      1       1    85 0.92391304

Now, this actually looks quite good in terms of identifying fraud cases: 92% of fraud cases were identified! However, many non-fraud cases were also classified as fraud. For real-life application, this wouldn’t be a good model. Let’s try some other techniques…


Anomaly detection

We can also ask which instances were considered outliers or anomalies within our test data, using the h2o.anomaly() function. Based on the autoencoder model that was trained before, the input data will be reconstructed and for each instance, the mean squared error (MSE) between actual value and reconstruction is calculated.

I am also calculating the mean MSE for both class labels.

anomaly <- h2o.anomaly(model_nn, test) %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  mutate(Class = as.vector(test[, 31]))

mean_mse <- anomaly %>%
  group_by(Class) %>%
  summarise(mean = mean(Reconstruction.MSE))

This, we can now plot:

ggplot(anomaly, aes(x = as.numeric(rowname), y = Reconstruction.MSE, color = as.factor(Class))) +
  geom_point(alpha = 0.3) +
  geom_hline(data = mean_mse, aes(yintercept = mean, color = Class)) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "instance number",
       color = "Class")

As we can see in the plot, there is no perfect classification into fraud and non-fraud cases but the mean MSE is definitely higher for fraudulent transactions than for regular ones.

We can now identify outlier instances by applying an MSE threshold for what we consider outliers. We could e.g. say that we consider every instance with an MSE > 0.02 (chosen according to the plot above) to be an anomaly/outlier.

anomaly <- anomaly %>%
  mutate(outlier = ifelse(Reconstruction.MSE > 0.02, "outlier", "no_outlier"))

anomaly %>%
  group_by(Class, outlier) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: Class [2]
## 
##   Class    outlier     n         freq
##   <chr>      <chr> <int>        <dbl>
## 1     0 no_outlier 56602 0.9994702642
## 2     0    outlier    30 0.0005297358
## 3     1 no_outlier    60 0.6521739130
## 4     1    outlier    32 0.3478260870

As we can see, outlier detection is not sufficient to correctly classify fraudulent credit card transactions either (at least not with this dataset).


Pre-trained supervised model

We can now try using the autoencoder model as a pre-training input for a supervised model. Here, I am again using a neural network. This model will now use the weights from the autoencoder for model fitting.

model_nn_2 <- h2o.deeplearning(y = response,
                               x = features,
                               training_frame = train_supervised,
                               pretrained_autoencoder  = "model_nn",
                               reproducible = TRUE, #slow - turn off for real problems
                               balance_classes = TRUE,
                               ignore_const_cols = FALSE,
                               seed = 42,
                               hidden = c(10, 2, 10), 
                               epochs = 100,
                               activation = "Tanh")
h2o.saveModel(model_nn_2, path="model_nn_2", force = TRUE)
model_nn_2 <- h2o.loadModel("model_nn_2/DeepLearning_model_R_1493574057843_9")
model_nn_2
## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1493574057843_9 
## Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 424 weights/biases, 11.7 KB, 3,643,248 training samples, mini-batch size 1
##   layer units    type dropout       l1       l2 mean_rate rate_rms
## 1     1    34   Input  0.00 %                                     
## 2     2    10    Tanh  0.00 % 0.000000 0.000000  0.274110 0.291925
## 3     3     2    Tanh  0.00 % 0.000000 0.000000  0.004480 0.002651
## 4     4    10    Tanh  0.00 % 0.000000 0.000000  0.210337 0.326239
## 5     5     2 Softmax         0.000000 0.000000  0.137694 0.123830
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.006945   1.058583 -0.501068 1.576798
## 3 0.000000    0.021012   0.309286  0.360286 0.105522
## 4 0.000000    0.934361   1.759343  0.304729 0.532527
## 5 0.000000   -0.133149   2.641065  2.392999 3.137845
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9973 samples **
## 
## MSE:  0.01500211
## RMSE:  0.1224831
## LogLoss:  0.04408663
## Mean Per-Class Error:  0.00169424
## AUC:  0.9995075
## Gini:  0.9990149
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error      Rate
## 0      5000   17 0.003388  =17/5017
## 1         0 4956 0.000000   =0/4956
## Totals 5000 4973 0.001705  =17/9973
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.134320 0.998288  96
## 2                       max f2  0.134320 0.999314  96
## 3                 max f0point5  0.134320 0.997263  96
## 4                 max accuracy  0.134320 0.998295  96
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.134320 1.000000  96
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.134320 0.996597  96
## 9   max min_per_class_accuracy  0.134320 0.996612  96
## 10 max mean_per_class_accuracy  0.134320 0.998306  96
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
pred <- as.data.frame(h2o.predict(object = model_nn_2, newdata = test)) %>%
  mutate(actual = as.vector(test[, 31]))
pred %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n        freq
##    <chr>  <fctr> <int>       <dbl>
## 1      0       0 56512 0.997881057
## 2      0       1   120 0.002118943
## 3      1       0    16 0.173913043
## 4      1       1    76 0.826086957
pred %>%
  ggplot(aes(x = actual, fill = predict)) +
    geom_bar() +
    theme_bw() +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap( ~ actual, scales = "free", ncol = 2)

Now, this looks much better! We did miss 17% of the fraud cases, but we also did not misclassify too many of the non-fraud cases.

In real life, we would now spend some more time trying to improve the model, e.g. by performing a grid search for hyperparameter tuning, going back to the original features (which we did not have here), trying different engineered features and/or trying different algorithms. But here, I will leave it at that (a rough sketch of what such a grid search could look like follows below).
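For illustration only, here is a sketch of a hyperparameter grid search with h2o.grid(); the parameter values below are assumptions made for demonstration, not settings used in this analysis.

# hypothetical hyperparameter grid - values are illustrative assumptions
hyper_params <- list(
  hidden = list(c(10, 2, 10), c(32, 16, 32)),
  activation = c("Tanh", "Rectifier"),
  l1 = c(0, 1e-4)
)

grid <- h2o.grid(algorithm = "deeplearning",
                 x = features,
                 y = response,
                 training_frame = train_supervised,
                 hyper_params = hyper_params,
                 seed = 42)

# inspect the resulting models, sorted by a metric that suits the problem, e.g. logloss
h2o.getGrid(grid@grid_id, sort_by = "logloss", decreasing = FALSE)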


Measuring model performance on highly unbalanced data

Because of the severe bias towards non-fraud cases, we cannot use performance measures like accuracy or area under the curve (AUC), as they would give overly optimistic results based on the high percentage of correct classifications of the majority class.
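To put a number on this, using the test-set counts from the prediction table above: a trivial “model” that labels every transaction as non-fraud would already reach about 99.8% accuracy, without catching a single fraud case.

# accuracy of always predicting "non-fraud", based on the test-set counts above
(56512 + 120) / (56512 + 120 + 16 + 76)
## [1] 0.9983781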

An alternative to AUC is to use the precision-recall curve or the sensitivity (recall)-specificity curve. To calculate and plot these metrics, we can use the ROCR package. There are different ways to calculate the area under a curve (see the PRROC package for details), but I am going to use a simple function that, for every consecutive pair of x values (i.e. x1 - x0, x2 - x1, etc.), sums the area under the corresponding y values (the trapezoidal rule).

library(ROCR)

# http://stackoverflow.com/questions/24563061/computing-integral-of-a-line-plot-in-r
line_integral <- function(x, y) {
  dx <- diff(x)
  end <- length(y)
  my <- (y[1:(end - 1)] + y[2:end]) / 2
  sum(dx * my)
} 

prediction_obj <- prediction(pred$p1, pred$actual)
par(mfrow = c(1, 2))
par(mar = c(5.1,4.1,4.1,2.1))

# precision-recall curve
perf1 <- performance(prediction_obj, measure = "prec", x.measure = "rec") 

x <- perf1@x.values[[1]]
y <- perf1@y.values[[1]]
y[1] <- 0

plot(perf1, main = paste("Area Under the\nPrecision-Recall Curve:\n", round(abs(line_integral(x,y)), digits = 3)))

# sensitivity-specificity curve
perf2 <- performance(prediction_obj, measure = "sens", x.measure = "spec") 

x <- perf2@x.values[[1]]
y <- perf2@y.values[[1]]
y[1] <- 0

plot(perf2, main = paste("Area Under the\nSensitivity-Specificity Curve:\n", round(abs(line_integral(x,y)), digits = 3)))

Precision is the proportion of test cases predicted to be fraud that were indeed fraudulent (i.e. the true positive predictions), while recall or sensitivity is the proportion of fraud cases that were identified as fraud. And specificity is the proportion of non-fraud cases that are identified as non-fraud.
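To make these definitions concrete, we can compute them from the confusion counts of the final model shown above (TP = 76, FN = 16, FP = 120, TN = 56512):

tp <- 76; fn <- 16; fp <- 120; tn <- 56512

tp / (tp + fp) # precision
## [1] 0.3877551
tp / (tp + fn) # recall / sensitivity
## [1] 0.826087
tn / (tn + fp) # specificity
## [1] 0.9978811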

The precision-recall curve tells us the relationship between correct fraud predictions and the proportion of fraud cases that were detected (e.g. if all or most fraud cases were identified, we also have many non-fraud cases predicted as fraud, and vice versa). The sensitivity-specificity curve, in turn, tells us the relationship between correctly identified classes of both labels (e.g. if we have 100% correctly classified fraud cases, we will typically have very few correctly classified non-fraud cases, and vice versa).

We can also look at this a little bit differently, by manually going through different prediction thresholds and calculating how many cases were correctly classified in the two classes:

thresholds <- seq(from = 0, to = 1, by = 0.1)
pred_thresholds <- data.frame(actual = pred$actual)

for (threshold in thresholds) {
  
  prediction <- ifelse(pred$p1 > threshold, 1, 0)
  prediction_true <- ifelse(pred_thresholds$actual == prediction, TRUE, FALSE)
  pred_thresholds <- cbind(pred_thresholds, prediction_true)

}

colnames(pred_thresholds)[-1] <- thresholds
pred_thresholds %>%
  gather(x, y, 2:ncol(pred_thresholds)) %>%
  group_by(actual, x, y) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = as.numeric(x), y = n, color = actual)) +
    geom_vline(xintercept = 0.6, alpha = 0.5) +
    geom_line() +
    geom_point(alpha = 0.5) +
    theme_bw() +
    facet_wrap(actual ~ y, scales = "free", ncol = 2) +
    labs(x = "prediction threshold",
         y = "number of instances")

This plot tells us that we can increase the number of correctly classified non-fraud cases without losing correctly classified fraud cases when we increase the prediction threshold from the default 0.5 to 0.6:

pred %>%
  mutate(predict = ifelse(pred$p1 > 0.6, 1, 0)) %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n        freq
##    <chr>   <dbl> <int>       <dbl>
## 1      0       0 56558 0.998693318
## 2      0       1    74 0.001306682
## 3      1       0    16 0.173913043
## 4      1       1    76 0.826086957

Our final model now correctly identified 83% of fraud cases and almost 100% of non-fraud cases.


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ROCR_1.0-7      gplots_3.0.1    h2o_3.10.4.4    dplyr_0.5.0    
##  [5] purrr_0.2.2     readr_1.1.0     tidyr_0.6.1     tibble_1.3.0   
##  [9] ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0       reshape2_1.4.2     haven_1.0.0       
##  [4] lattice_0.20-35    colorspace_1.3-2   htmltools_0.3.6   
##  [7] yaml_2.1.14        foreign_0.8-68     DBI_0.6-1         
## [10] RColorBrewer_1.1-2 modelr_0.1.0       readxl_1.0.0      
## [13] plyr_1.8.4         stringr_1.2.0      munsell_0.4.3     
## [16] gtable_0.2.0       cellranger_1.1.0   rvest_0.3.2       
## [19] caTools_1.17.1     psych_1.7.3.21     evaluate_0.10     
## [22] labeling_0.3       knitr_1.15.1       forcats_0.2.0     
## [25] parallel_3.4.0     broom_0.4.2        Rcpp_0.12.10      
## [28] KernSmooth_2.23-15 scales_0.4.1       backports_1.0.5   
## [31] gdata_2.17.0       jsonlite_1.4       mnormt_1.5-5      
## [34] hms_0.3            digest_0.6.12      stringi_1.1.5     
## [37] grid_3.4.0         rprojroot_1.2      tools_3.4.0       
## [40] bitops_1.0-6       magrittr_1.5       lazyeval_0.2.0    
## [43] RCurl_1.95-4.8     xml2_1.1.1         lubridate_1.6.0   
## [46] assertthat_0.2.0   rmarkdown_1.5      httr_1.2.1        
## [49] R6_2.2.0           nlme_3.1-131       compiler_3.4.0

Does money buy happiness after all? Machine Learning with One Rule

2017-04-23 00:00:00 +0000

This week, I am exploring Holger K. von Jouanne-Diedrich’s OneR package for machine learning. I am running an example analysis on world happiness data and comparing the results with other machine learning models (decision trees, random forest, gradient boosting trees and neural nets).


Conclusions

All in all, based on this example, I would confirm that OneR does indeed produce sufficiently accurate models to set a good baseline. OneR was definitely faster than random forest, gradient boosting and neural nets. However, the latter were more complex models and included cross-validation.

If you prefer a model that is very simple and easy to understand, OneR is a very good way to go. You could also use it as a starting point for developing more complex models with improved accuracy.

When looking at feature importance across models, the feature OneR chose - Economy/GDP per capita - was confirmed by random forest, gradient boosting trees and neural networks as being the most important feature. This is in itself an interesting conclusion! Of course, this correlation does not tell us that there is a direct causal relationship between money and happiness, but we can say that a country’s economy is the best individual predictor for how happy people tend to be.



OneR

OneR has been developed for the purpose of creating machine learning models that are easy to interpret and understand, while still being as accurate as possible. It is based on the one rule classification algorithm from Holte (1993), which is basically a decision tree cut at the first level.

R.C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning. 11:63-91.

While the original algorithm has difficulties in handling missing values and numeric data, the package provides enhanced functionality to handle those cases better, e.g. by introducing a separate class for NA values and the optbin() function to find optimal splitting points for each feature. The main function of the package is OneR, which finds an optimal split for each feature and only uses the most important feature with the highest training accuracy for classification.
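To illustrate the basic idea, here is a conceptual sketch of the one-rule algorithm in base R (not the package’s implementation); it assumes a data frame df with categorical features and a class column whose name is given as response:

# conceptual sketch: for each feature, map every level to its majority class
# and keep the feature whose rule achieves the highest training accuracy
one_rule <- function(df, response) {
  accuracies <- sapply(setdiff(names(df), response), function(feat) {
    rule <- tapply(df[[response]], df[[feat]],
                   function(y) names(which.max(table(y))))
    mean(rule[as.character(df[[feat]])] == df[[response]])
  })
  names(which.max(accuracies))
}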


I installed the latest stable version of the OneR package from CRAN.

library(OneR)


The dataset

I am using the World Happiness Report 2016 from Kaggle.

library(tidyverse)

data_16 <- read.table("world-happiness/2016.csv", sep = ",", header = TRUE)
data_15 <- read.table("world-happiness/2015.csv", sep = ",", header = TRUE)

In the 2016 data, upper and lower confidence intervals (CIs) for the happiness score are given, while in the 2015 data we have standard errors. Because I want to combine data from the two years, I am using only columns that are in both datasets.

common_feats <- colnames(data_16)[which(colnames(data_16) %in% colnames(data_15))]

# features and response variable for modeling
feats <- setdiff(common_feats, c("Country", "Happiness.Rank", "Happiness.Score"))
response <- "Happiness.Score"

# combine data from 2015 and 2016
data_15_16 <- rbind(select(data_15, one_of(c(feats, response))),
              select(data_16, one_of(c(feats, response))))

The response variable happiness score is on a numeric scale. OneR could also perform regression, but here I want to compare classification tasks. For classifying happiness, I create three bins for low, medium and high values of the happiness score. To avoid having to deal with unbalanced data, I am using the bin() function from OneR with method = "content". For plotting the cut-points, I am extracting the numbers from the default level names.

data_15_16$Happiness.Score.l <- bin(data_15_16$Happiness.Score, nbins = 3, method = "content")

intervals <- paste(levels(data_15_16$Happiness.Score.l), collapse = " ")
intervals <- gsub("\\(|]", "", intervals)
intervals <- gsub(",", " ", intervals)
intervals <- as.numeric(unique(strsplit(intervals, " ")[[1]]))
data_15_16 %>%
  ggplot() +
    geom_density(aes(x = Happiness.Score), color = "blue", fill = "blue", alpha = 0.4) +
    geom_vline(xintercept = intervals[2]) +
    geom_vline(xintercept = intervals[3])

Now I am removing the original happiness score column from the data for modeling and renaming the factor levels of the response variable.

data_15_16 <- select(data_15_16, -Happiness.Score) %>%
  mutate(Happiness.Score.l = plyr::revalue(Happiness.Score.l, c("(2.83,4.79]" = "low", "(4.79,5.89]" = "medium", "(5.89,7.59]" = "high")))

Because there are only 9 features in this small dataset, I want to explore them all individually before modeling. First, I am plotting the only categorical variable: Region.

This plot shows that there are a few regions with very strong biases in happiness: people in Western Europe, Australia, New Zealand, North America, Latin America and the Caribbean tend to be in the high happiness group, while people in Sub-Saharan Africa and Southern Asia tend to be the least happy.

data_15_16 %>%
  ggplot(aes(x = Region, fill = Happiness.Score.l)) +
    geom_bar(position = "dodge", alpha = 0.7) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.margin = unit(c(0, 0, 0, 1.5), "cm")) +
    scale_fill_brewer(palette = "Set1")

The remaining quantitative variables show happiness biases to varying degrees: e.g. low health and life expectancy is strongly biased towards low happiness; economic factors, family and freedom show a bias in the same direction, albeit not as strongly.

data_15_16 %>%
  gather(x, y, Economy..GDP.per.Capita.:Dystopia.Residual) %>%
  ggplot(aes(x = y, fill = Happiness.Score.l)) +
    geom_histogram(alpha = 0.7) +
    facet_wrap(~ x, scales = "free", ncol = 4) +
    scale_fill_brewer(palette = "Set1")

While OneR could also handle categorical data, in this example, I only want to consider the quantitative features to show the differences between OneR and other machine learning algorithms.

data_15_16 <- select(data_15_16, -Region)


Modeling

The algorithms I will compare to OneR will be run via the caret package.

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)

I will also use caret’s createDataPartition() function to partition the data into training (70%) and test sets (30%).

set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data  <- data_15_16[-index, ]


OneR

OneR only accepts categorical features. Because we have numerical features, we need to convert them to factors by splitting them into appropriate bins. While the original OneR algorithm splits the values into ever smaller factors, this has been changed in this R implementation with the aim of preventing overfitting. We can either split the data into pre-defined numbers of buckets (by length, content or cluster) or we can use the optbin() function to obtain the optimal number of factors from pairwise logistic regression or information gain.

# default method length
data_1 <- bin(train_data, nbins = 5, method = "length")

# method content
data_2 <- bin(train_data, nbins = 5, method = "content")

# method cluster
data_3 <- bin(train_data, nbins = 3, method = "cluster")

# optimal bin number logistic regression
data_4 <- optbin(formula = Happiness.Score.l ~., data = train_data, method = "logreg")

# optimal bin number information gain
data_5 <- optbin(formula = Happiness.Score.l ~., data = train_data, method = "infogain")


This is what the data looks like following discretization:

  • Default method


  • 5 bins with method = "content"


  • 3 bins with method = "cluster"


  • optimal bin number according to logistic regression


  • optimal bin number according to information gain


Model building

Now I am running the OneR models. During model building, the chosen attribute/feature with the highest accuracy is printed, along with the decision rules and accuracies of the top 7 features. Unfortunately, this information is not saved in the model object; this would have been nice in order to compare the importance of features across models later on.

Here, all five models achieved the highest prediction accuracy with the feature Economy GDP per capita.

for (i in 1:5) {
  data <- get(paste0("data_", i))
  print(model <- OneR(formula = Happiness.Score.l ~., data = data, verbose = TRUE))
  assign(paste0("model_", i), model)
}
## 
##     Attribute                     Accuracy
## 1 * Economy..GDP.per.Capita.      63.96%  
## 2   Health..Life.Expectancy.      59.91%  
## 3   Family                        57.21%  
## 4   Dystopia.Residual             51.8%   
## 5   Freedom                       49.55%  
## 6   Trust..Government.Corruption. 45.5%   
## 7   Generosity                    41.89%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## 
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
## 
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.365] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.365,0.73]     then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.73,1.09]      then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.09,1.46]      then Happiness.Score.l = high
## If Economy..GDP.per.Capita. = (1.46,1.83]      then Happiness.Score.l = high
## 
## Accuracy:
## 142 of 222 instances classified correctly (63.96%)
## 
## 
##     Attribute                     Accuracy
## 1 * Economy..GDP.per.Capita.      64.41%  
## 2   Health..Life.Expectancy.      60.81%  
## 3   Family                        59.91%  
## 4   Trust..Government.Corruption. 55.41%  
## 5   Freedom                       53.15%  
## 5   Dystopia.Residual             53.15%  
## 7   Generosity                    41.44%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## 
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
## 
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.548] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.548,0.877]    then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.877,1.06]     then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.06,1.28]      then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.28,1.83]      then Happiness.Score.l = high
## 
## Accuracy:
## 143 of 222 instances classified correctly (64.41%)
## 
## 
##     Attribute                     Accuracy
## 1 * Economy..GDP.per.Capita.      63.51%  
## 2   Health..Life.Expectancy.      62.16%  
## 3   Family                        54.5%   
## 4   Freedom                       50.45%  
## 4   Dystopia.Residual             50.45%  
## 6   Trust..Government.Corruption. 43.24%  
## 7   Generosity                    36.49%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## 
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
## 
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.602] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.602,1.1]      then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.1,1.83]       then Happiness.Score.l = high
## 
## Accuracy:
## 141 of 222 instances classified correctly (63.51%)
## 
## 
##     Attribute                     Accuracy
## 1 * Economy..GDP.per.Capita.      63.96%  
## 2   Health..Life.Expectancy.      62.16%  
## 3   Family                        58.56%  
## 4   Freedom                       51.35%  
## 5   Dystopia.Residual             50.9%   
## 6   Trust..Government.Corruption. 46.4%   
## 7   Generosity                    40.09%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## 
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
## 
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.754] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.754,1.12]     then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.12,1.83]      then Happiness.Score.l = high
## 
## Accuracy:
## 142 of 222 instances classified correctly (63.96%)
## 
## 
##     Attribute                     Accuracy
## 1 * Economy..GDP.per.Capita.      67.12%  
## 2   Health..Life.Expectancy.      65.77%  
## 3   Family                        61.71%  
## 4   Trust..Government.Corruption. 56.31%  
## 5   Dystopia.Residual             55.41%  
## 6   Freedom                       50.9%   
## 7   Generosity                    43.69%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## 
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
## 
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.68] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.68,1.24]     then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.24,1.83]     then Happiness.Score.l = high
## 
## Accuracy:
## 149 of 222 instances classified correctly (67.12%)


Model evaluation

The function eval_model() prints confusion matrices for absolute and relative predictions, as well as accuracy, error and error rate reduction. For comparison with other models, it would have been convenient to be able to extract these performance metrics directly from the eval_model object, instead of getting only the confusion matrix and the counts of correct/total instances and having to re-calculate the performance metrics manually.
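As a workaround, a small helper function (not part of the OneR package, just a sketch) can recompute the test accuracy directly from the predictions, so it is available as a number for later comparisons:

# recompute the test accuracy for each OneR model from its predictions
model_accuracy <- function(model, data, response = "Happiness.Score.l") {
  mean(predict(model, data) == data[[response]])
}

sapply(paste0("model_", 1:5), function(m) model_accuracy(get(m), test_data))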

for (i in 1:5) {
  model <- get(paste0("model_", i))
  eval_model(predict(model, test_data), test_data$Happiness.Score.l)
}
## 
## Confusion matrix (absolute):
##           Actual
## Prediction high low medium Sum
##     high     23   0     11  34
##     low       1  26     10  37
##     medium    7   5     10  22
##     Sum      31  31     31  93
## 
## Confusion matrix (relative):
##           Actual
## Prediction high  low medium  Sum
##     high   0.25 0.00   0.12 0.37
##     low    0.01 0.28   0.11 0.40
##     medium 0.08 0.05   0.11 0.24
##     Sum    0.33 0.33   0.33 1.00
## 
## Accuracy:
## 0.6344 (59/93)
## 
## Error rate:
## 0.3656 (34/93)
## 
## Error rate reduction (vs. base rate):
## 0.4516 (p-value = 2.855e-09)
## 
## 
## Confusion matrix (absolute):
##           Actual
## Prediction high low medium Sum
##     high     19   0      1  20
##     low       3  28     14  45
##     medium    9   3     16  28
##     Sum      31  31     31  93
## 
## Confusion matrix (relative):
##           Actual
## Prediction high  low medium  Sum
##     high   0.20 0.00   0.01 0.22
##     low    0.03 0.30   0.15 0.48
##     medium 0.10 0.03   0.17 0.30
##     Sum    0.33 0.33   0.33 1.00
## 
## Accuracy:
## 0.6774 (63/93)
## 
## Error rate:
## 0.3226 (30/93)
## 
## Error rate reduction (vs. base rate):
## 0.5161 (p-value = 1.303e-11)
## 
## 
## Confusion matrix (absolute):
##           Actual
## Prediction high low medium Sum
##     high     23   0     11  34
##     low       0  25      7  32
##     medium    8   6     13  27
##     Sum      31  31     31  93
## 
## Confusion matrix (relative):
##           Actual
## Prediction high  low medium  Sum
##     high   0.25 0.00   0.12 0.37
##     low    0.00 0.27   0.08 0.34
##     medium 0.09 0.06   0.14 0.29
##     Sum    0.33 0.33   0.33 1.00
## 
## Accuracy:
## 0.6559 (61/93)
## 
## Error rate:
## 0.3441 (32/93)
## 
## Error rate reduction (vs. base rate):
## 0.4839 (p-value = 2.116e-10)
## 
## 
## Confusion matrix (absolute):
##           Actual
## Prediction high low medium Sum
##     high     23   0     11  34
##     low       2  26     11  39
##     medium    6   5      9  20
##     Sum      31  31     31  93
## 
## Confusion matrix (relative):
##           Actual
## Prediction high  low medium  Sum
##     high   0.25 0.00   0.12 0.37
##     low    0.02 0.28   0.12 0.42
##     medium 0.06 0.05   0.10 0.22
##     Sum    0.33 0.33   0.33 1.00
## 
## Accuracy:
## 0.6237 (58/93)
## 
## Error rate:
## 0.3763 (35/93)
## 
## Error rate reduction (vs. base rate):
## 0.4355 (p-value = 9.799e-09)
## 
## 
## Confusion matrix (absolute):
##           Actual
## Prediction high low medium Sum
##     high     21   0      3  24
##     low       0  26      8  34
##     medium   10   5     20  35
##     Sum      31  31     31  93
## 
## Confusion matrix (relative):
##           Actual
## Prediction high  low medium  Sum
##     high   0.23 0.00   0.03 0.26
##     low    0.00 0.28   0.09 0.37
##     medium 0.11 0.05   0.22 0.38
##     Sum    0.33 0.33   0.33 1.00
## 
## Accuracy:
## 0.7204 (67/93)
## 
## Error rate:
## 0.2796 (26/93)
## 
## Error rate reduction (vs. base rate):
## 0.5806 (p-value = 2.761e-14)

Because I want to calculate performance measures for the different classes separately and would like to have a more detailed look at the prediction probabilities I get from the models, I prefer to obtain predictions with type = "prob". While I am not looking at it here, this would also allow me to test different prediction thresholds (a small sketch of the idea follows below).
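As an illustration of how such a threshold could be applied (this is a sketch, not something used in the rest of this post): we could only accept the top class if its probability exceeds a chosen cut-off and otherwise fall back to the second most likely class.

probs <- predict(model_1, test_data, type = "prob")

# hypothetical helper: keep the top class only if it passes the threshold
apply_threshold <- function(probs, threshold = 0.7) {
  apply(probs, 1, function(p) {
    p <- sort(p, decreasing = TRUE)
    if (p[1] >= threshold) names(p)[1] else names(p)[2]
  })
}

table(predicted = apply_threshold(probs, 0.7), actual = test_data$Happiness.Score.l)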

for (i in 1:5) {
  model <- get(paste0("model_", i))
  pred <- data.frame(model = paste0("model_", i),
                     sample_id = 1:nrow(test_data),
                     predict(model, test_data, type = "prob"),
                     actual = test_data$Happiness.Score.l)
  pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
  pred$pred_prob <- NA
  
  for (j in 1:nrow(pred)) {
    pred[j, "pred_prob"] <- max(pred[j, 3:5])
  }
  
  if (i == 1) {
    pred_df <- pred
  } else {
    pred_df <- rbind(pred_df, pred)
  }
}


Comparing other algorithms

Decision trees

First, I am building a decision tree with the rpart package and the rpart() function, which we can then plot with rpart.plot().

Economy GDP per capita is only the second-highest node here; the best predictor in this tree is health and life expectancy.

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(Happiness.Score.l ~ .,
            data = train_data,
            method = "class",
            control = rpart.control(xval = 10), 
            parms = list(split = "information"))

rpart.plot(fit, extra = 100)

In order to compare the models, I am producing the same output table for predictions from this model and combine it with the table from the OneR models.

pred <- data.frame(model = "rpart",
                   sample_id = 1:nrow(test_data),
                   predict(fit, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
  pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
  pred$pred_prob <- NA
  
  for (j in 1:nrow(pred)) {
    pred[j, "pred_prob"] <- max(pred[j, 3:5])
  }
pred_df_final <- rbind(pred_df,
                       pred)


Random Forest

Next, I am training a Random Forest model. For more details on Random Forest, check out my post “Can we predict flu deaths with Machine Learning and R?”.

set.seed(42)
model_rf <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "rf",
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 5, 
                                                  verboseIter = FALSE))

The varImp() function from caret shows us which feature was of highest importance for the model and its predictions.

Here, we again find Economy GDP per capita on top.

varImp(model_rf)
## rf variable importance
## 
##                               Overall
## Economy..GDP.per.Capita.       100.00
## Dystopia.Residual               97.89
## Health..Life.Expectancy.        77.10
## Family                          47.17
## Trust..Government.Corruption.   29.89
## Freedom                         19.29
## Generosity                       0.00
pred <- data.frame(model = "rf",
                   sample_id = 1:nrow(test_data),
                   predict(model_rf, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
  pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
  pred$pred_prob <- NA
  
  for (j in 1:nrow(pred)) {
    pred[j, "pred_prob"] <- max(pred[j, 3:5])
  }
pred_df_final <- rbind(pred_df_final,
                       pred)


Extreme gradient boosting trees

Gradient boosting is another decision tree-based algorithm, explained in more detail in my post “Extreme Gradient Boosting and Preprocessing in Machine Learning”.

set.seed(42)
model_xgb <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "xgbTree",
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 5, 
                                                  verboseIter = FALSE))

As before, we again find Economy GDP per capita as the most important feature.

varImp(model_xgb)
## xgbTree variable importance
## 
##                          Overall
## Economy..GDP.per.Capita.  100.00
## Health..Life.Expectancy.   67.43
## Family                     46.59
## Freedom                     0.00
pred <- data.frame(model = "xgb",
                   sample_id = 1:nrow(test_data),
                   predict(model_xgb, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
  pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
  pred$pred_prob <- NA
  
  for (j in 1:nrow(pred)) {
    pred[j, "pred_prob"] <- max(pred[j, 3:5])
  }
pred_df_final <- rbind(pred_df_final,
                       pred)


Neural network

Finally, I am comparing a neural network model. Here as well, I have a post where I talk about what they are in more detail: “Building deep neural nets with h2o and rsparkling that predict arrhythmia of the heart”.

set.seed(42)
model_nn <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "mlp",
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 5, 
                                                  verboseIter = FALSE))

And Economy GDP per capita is again the most important feature!

varImp(model_nn)
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                                  low medium    high
## Economy..GDP.per.Capita.      100.00 66.548 100.000
## Health..Life.Expectancy.       95.22 63.632  95.222
## Family                         84.48 45.211  84.483
## Dystopia.Residual              71.23 43.450  71.232
## Freedom                        64.58 41.964  64.581
## Trust..Government.Corruption.  26.13 40.573  40.573
## Generosity                      0.00  3.462   3.462
pred <- data.frame(model = "nn",
                   sample_id = 1:nrow(test_data),
                   predict(model_nn, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
  pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
  pred$pred_prob <- NA
  
  for (j in 1:nrow(pred)) {
    pred[j, "pred_prob"] <- max(pred[j, 3:5])
  }
pred_df_final <- rbind(pred_df_final,
                       pred)


Model comparisons

Now to the final verdict: How do the different models compare?

The first plot below shows the prediction probabilities for the three happiness levels low, medium and high for each test data instance. For each instance, only the prediction probability of the predicted class (i.e. with the highest value) is shown. The upper row shows correct predictions, the lower row shows wrong predictions.

Sometimes, it is obvious from such a plot if a more stringent prediction threshold could improve things (when wrong predictions tend to be close to the threshold). With three classes to predict, this is obviously not as trivial as if we only had two but the same principle holds true: the smaller the prediction probability, the more uncertain it tends to be.

pred_df_final %>%
  ggplot(aes(x = actual, y = pred_prob, fill = prediction, color = prediction)) +
    geom_boxplot(alpha = 0.7) +
    facet_grid(correct ~ model) +
    scale_color_brewer(palette = "Set1") +
    scale_fill_brewer(palette = "Set1")

Probably the most straightforward performance measure is accuracy, i.e. the proportion of correct predictions out of the total number of instances to predict. The closer to 1, the better the accuracy.

Not surprisingly, the more complex models tend to be more accurate - albeit only slightly.

pred_df_final %>%
  group_by(model) %>%
  dplyr::summarise(correct = sum(correct == "correct")) %>%
  mutate(accuracy = correct / nrow(test_data)) %>%
  ggplot(aes(x = model, y = accuracy, fill = model)) +
    geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "Set1")

When we look at the three classes individually, it looks a bit more complicated, but most models achieved the highest accuracy for the class “high”.

pred_df_final %>%
  group_by(model, prediction) %>%
  dplyr::summarise(correct = sum(correct == "correct"),
            n = n()) %>%
  mutate(accuracy = correct / n) %>%
  ggplot(aes(x = model, y = accuracy, fill = prediction)) +
    geom_bar(stat = "identity", position = "dodge") +
    scale_fill_brewer(palette = "Set1")


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] RSNNS_0.4-9         Rcpp_0.12.10        plyr_1.8.4         
##  [4] xgboost_0.6-4       randomForest_4.6-12 rpart.plot_2.1.1   
##  [7] rpart_4.1-10        caret_6.0-73        lattice_0.20-35    
## [10] doParallel_1.0.10   iterators_1.0.8     foreach_1.4.3      
## [13] dplyr_0.5.0         purrr_0.2.2         readr_1.1.0        
## [16] tidyr_0.6.1         tibble_1.3.0        ggplot2_2.2.1      
## [19] tidyverse_1.1.1     OneR_2.1           
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.6.0    assertthat_0.2.0   rprojroot_1.2     
##  [4] digest_0.6.12      psych_1.7.3.21     R6_2.2.0          
##  [7] backports_1.0.5    MatrixModels_0.4-1 stats4_3.3.3      
## [10] evaluate_0.10      httr_1.2.1         lazyeval_0.2.0    
## [13] readxl_0.1.1       data.table_1.10.4  minqa_1.2.4       
## [16] SparseM_1.76       car_2.1-4          nloptr_1.0.4      
## [19] Matrix_1.2-8       rmarkdown_1.4      labeling_0.3      
## [22] splines_3.3.3      lme4_1.1-12        stringr_1.2.0     
## [25] foreign_0.8-67     munsell_0.4.3      broom_0.4.2       
## [28] modelr_0.1.0       mnormt_1.5-5       mgcv_1.8-17       
## [31] htmltools_0.3.5    nnet_7.3-12        codetools_0.2-15  
## [34] MASS_7.3-45        ModelMetrics_1.1.0 grid_3.3.3        
## [37] nlme_3.1-131       jsonlite_1.4       gtable_0.2.0      
## [40] DBI_0.6-1          magrittr_1.5       scales_0.4.1      
## [43] stringi_1.1.5      reshape2_1.4.2     xml2_1.1.1        
## [46] RColorBrewer_1.1-2 tools_3.3.3        forcats_0.2.0     
## [49] hms_0.3            pbkrtest_0.4-7     yaml_2.1.14       
## [52] colorspace_1.3-2   rvest_0.3.2        knitr_1.15.1      
## [55] haven_1.0.0        quantreg_5.29

Explaining complex machine learning models with LIME

2017-04-23 00:00:00 +0000

The classification decisions made by machine learning models are usually difficult - if not impossible - for our human brains to understand. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well - often with better results than those achieved by humans. But it also makes them inherently hard to explain, especially to non-data scientists.

Especially if we aim to develop machine learning models for medical diagnostics, high accuracy on test samples might not be enough to sell them to clinicians. Doctors and patients alike will be less inclined to trust a decision made by a model that they don’t understand.

Therefore, we would like to be able to explain in concrete terms why a model classified a case with a certain label, e.g. why one breast mass sample was classified as “malignant” and not as “benign”.

Local Interpretable Model-Agnostic Explanations (LIME) is an attempt to make these complex models at least partly understandable. The method has been published in

“Why Should I Trust You?” Explaining the Predictions of Any Classifier. By Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin from the University of Washington in Seattle

lime is able to explain all models for which we can obtain prediction probabilities (in R, that is every model that works with predict(type = "prob")). It makes use of the fact that linear models are easy to explain because they are based on linear relationships between features and class labels: The complex model function is approximated by locally fitting linear models to permutations of the original training set.

A linear model is then fit to these permutations, with weights chosen so that misclassifying permutations that are more similar to the original data point is penalized more heavily (in the resulting explanation, positive weights support a decision, negative weights contradict it). This gives an approximation of how much (and in which way) each feature contributed to a decision made by the model.
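The following is a rough conceptual sketch of that idea for a single instance with numeric features. It is not the lime package’s actual implementation; the prediction function predict_fun (assumed to return a class probability), the assumption that train_data holds only numeric features, and the Gaussian similarity kernel are all choices made purely for illustration.

lime_sketch <- function(instance, train_data, predict_fun, n_perm = 1000) {
  # 1. permute the instance by sampling each feature from the training data
  perms <- as.data.frame(lapply(train_data, function(col) sample(col, n_perm, replace = TRUE)))

  # 2. weight each permutation by its similarity to the original instance
  dists <- sqrt(rowSums(sweep(as.matrix(perms), 2, as.numeric(instance))^2))
  weights <- exp(-dists^2 / (2 * sd(dists)^2))

  # 3. fit a weighted linear model to the black-box predictions;
  #    its coefficients approximate each feature's local contribution
  perms$pred <- predict_fun(perms)
  lm(pred ~ ., data = perms, weights = weights)
}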

The code for lime was originally made available for Python, but the awesome Thomas Lin Pedersen has already created an implementation in R. It is not on CRAN (yet, I assume), but you can install it via Github:

devtools::install_github("thomasp85/lime")


The data I am using is the World Happiness Data from my last post. So, let’s train a neural network on this data to predict three classes of the happiness scores: low, medium and high.

load("data_15_16.RData")
# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)
set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data  <- data_15_16[-index, ]
set.seed(42)
model_mlp <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "mlp",
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 5, 
                                                  verboseIter = FALSE))


The explain function

The central function of lime is lime(). It creates the function that is used in the next step to explain the model’s predictions.

We can give a couple of options. Check the help ?lime for details, but the most important to think about are:

  • Should continuous features be binned? And if so, into how many bins?

Here, I am keeping the default bin_continuous = TRUE but specify 5 instead of 4 (the default) bins with n_bins = 5.

library(lime)

explain <- lime(train_data, model_mlp, bin_continuous = TRUE, n_bins = 5, n_permutations = 1000)


Now, let’s look at how the model is explained. Here, I am not going to look at all test cases but I’m randomly choosing three cases with correct predictions and three with wrong predictions.

pred <- data.frame(sample_id = 1:nrow(test_data),
                   predict(model_mlp, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
  pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")

Note that we need to give our test-set data table row names containing the sample names or IDs, so that they are displayed in the headers of our explanatory plots below.

library(tidyverse)
pred_cor <- filter(pred, correct == "correct")
pred_wrong <- filter(pred, correct == "wrong")

test_data_cor <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_cor$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)

test_data_wrong <- test_data %>%
  mutate(sample_id = 1:nrow(test_data)) %>%
  filter(sample_id %in% pred_wrong$sample_id) %>%
  sample_n(size = 3) %>%
  remove_rownames() %>%
  tibble::column_to_rownames(var = "sample_id") %>%
  select(-Happiness.Score.l)


The explain function from above can now be used with our test samples. Further options we can specify are:

  • How many features do we want to use in the explanatory function?

Let’s say we have a big training set with 100 features. Looking at all features and trying to understand them all could be more confusing than helpful. And very often, a handful of very important features will be enough to predict test samples with a reasonable accuracy (see also my last post on OneR). So, we can choose how many features we want to look at with the n_features option.

  • How do we want to choose these features?

Next, we specify how we want this subset of features to be found. The default, auto, uses forward selection if we chose n_features <= 6 and uses the features with highest weights otherwise. We can also directly choose feature_select = "forward_selection", feature_select = "highest_weights" or feature_select = "lasso_path". Again, check ?lime for details.

In our example dataset, we only have 7 features and I want to look at the top 5.

I also want to have explanation for all three class labels in the response variable (low, medium and high happiness), so I am choosing n_labels = 3.

explanation_cor <- explain(test_data_cor, n_labels = 3, n_features = 5)
explanation_wrong <- explain(test_data_wrong, n_labels = 3, n_features = 5)

It will return a tidy tibble object that we can plot with plot_features():

plot_features(explanation_cor, ncol = 3)

plot_features(explanation_wrong, ncol = 3)

The information in the output tibble is described in the help function ?lime and can be viewed with

tibble::glimpse(explanation_cor)


So, what does this tell us, now? Let’s look at case 22 (the first row of our plot for correctly predicted classes): This sample has been correctly predicted to come from the medium happiness group because it

  • has a dystopia value between 2.03 & 2.32,
  • a trust/government corruption score below 0.05,
  • a GDP/economy score between 1.06 and 1.23 and
  • a life expectancy score between 0.59 and 0.7.

From the explanation for the label “high” we can also see that this case has a family score bigger than 1.12, which is more representative of high happiness samples.

pred %>%
  filter(sample_id == 22)
##   sample_id        low   medium       high actual prediction correct
## 1        22 0.02906327 0.847562 0.07429938 medium     medium correct

The explanatory function named dystopia as the most strongly supporting feature for this prediction. Quoting the World Happiness Report: “Dystopia is an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables […]”

The explanatory plot tells us, for each feature and class label, in which range of values a representative data point would fall. If it does, this gets counted as support for this prediction; if it does not, it gets scored as contradictory. For case 22 and the feature dystopia, the data point 2.27 falls within the range for medium happiness (between 2.03 and 2.32) with a high weight.

When we look at where this case falls on the range of values for this feature, we can see that it is indeed very close to the median of the medium training cases and further away from the medians of the high and low training cases. The other supportive features show us the same trend.

train_data %>%
  gather(x, y, Economy..GDP.per.Capita.:Dystopia.Residual) %>%
  ggplot(aes(x = Happiness.Score.l, y = y)) +
    geom_boxplot(alpha = 0.8, color = "grey") + 
    geom_point(data = gather(test_data[22, ], x, y, Economy..GDP.per.Capita.:Dystopia.Residual), color = "red", size = 3) +
    facet_wrap(~ x, scales = "free", ncol = 4)

An overview of the top 5 explanatory features for case 22 is stored in:

as.data.frame(explanation_cor[1:9]) %>%
  filter(case == "22")
##    case  label label_prob   model_r2 model_intercept
## 1    22 medium 0.84756196 0.05004205       0.5033729
## 2    22 medium 0.84756196 0.05004205       0.5033729
## 3    22 medium 0.84756196 0.05004205       0.5033729
## 4    22 medium 0.84756196 0.05004205       0.5033729
## 5    22 medium 0.84756196 0.05004205       0.5033729
## 6    22   high 0.07429938 0.06265119       0.2293890
## 7    22   high 0.07429938 0.06265119       0.2293890
## 8    22   high 0.07429938 0.06265119       0.2293890
## 9    22   high 0.07429938 0.06265119       0.2293890
## 10   22   high 0.07429938 0.06265119       0.2293890
## 11   22    low 0.02906327 0.07469729       0.3528088
## 12   22    low 0.02906327 0.07469729       0.3528088
## 13   22    low 0.02906327 0.07469729       0.3528088
## 14   22    low 0.02906327 0.07469729       0.3528088
## 15   22    low 0.02906327 0.07469729       0.3528088
##                          feature feature_value feature_weight
## 1              Dystopia.Residual       2.27394     0.14690100
## 2  Trust..Government.Corruption.       0.03005     0.06308598
## 3       Economy..GDP.per.Capita.       1.13764     0.02944832
## 4       Health..Life.Expectancy.       0.66926     0.02477567
## 5                     Generosity       0.00199    -0.01326503
## 6                         Family       1.23617     0.13629781
## 7                     Generosity       0.00199    -0.07514534
## 8  Trust..Government.Corruption.       0.03005    -0.07574480
## 9              Dystopia.Residual       2.27394    -0.07687559
## 10      Economy..GDP.per.Capita.       1.13764     0.07167086
## 11                        Family       1.23617    -0.14932931
## 12      Economy..GDP.per.Capita.       1.13764    -0.12738346
## 13                    Generosity       0.00199     0.09730858
## 14             Dystopia.Residual       2.27394    -0.07464384
## 15 Trust..Government.Corruption.       0.03005     0.06220305
##                                       feature_desc
## 1         2.025072 < Dystopia.Residual <= 2.320632
## 2        Trust..Government.Corruption. <= 0.051198
## 3  1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 4  0.591822 < Health..Life.Expectancy. <= 0.701046
## 5                           Generosity <= 0.123528
## 6                                1.119156 < Family
## 7                           Generosity <= 0.123528
## 8        Trust..Government.Corruption. <= 0.051198
## 9         2.025072 < Dystopia.Residual <= 2.320632
## 10 1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 11                               1.119156 < Family
## 12 1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 13                          Generosity <= 0.123528
## 14        2.025072 < Dystopia.Residual <= 2.320632
## 15       Trust..Government.Corruption. <= 0.051198

In a similar way, we can explore why some predictions were wrong.


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_0.5.0       purrr_0.2.2       readr_1.1.0      
##  [4] tidyr_0.6.1       tibble_1.3.0      tidyverse_1.1.1  
##  [7] RSNNS_0.4-9       Rcpp_0.12.10      lime_0.1.0       
## [10] caret_6.0-73      ggplot2_2.2.1     lattice_0.20-35  
## [13] doParallel_1.0.10 iterators_1.0.8   foreach_1.4.3    
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.6.0    assertthat_0.2.0   glmnet_2.0-5      
##  [4] rprojroot_1.2      digest_0.6.12      psych_1.7.3.21    
##  [7] R6_2.2.0           plyr_1.8.4         backports_1.0.5   
## [10] MatrixModels_0.4-1 stats4_3.3.3       evaluate_0.10     
## [13] httr_1.2.1         hrbrthemes_0.1.0   lazyeval_0.2.0    
## [16] readxl_0.1.1       minqa_1.2.4        SparseM_1.76      
## [19] extrafontdb_1.0    car_2.1-4          nloptr_1.0.4      
## [22] Matrix_1.2-8       rmarkdown_1.4      labeling_0.3      
## [25] splines_3.3.3      lme4_1.1-12        extrafont_0.17    
## [28] stringr_1.2.0      foreign_0.8-67     munsell_0.4.3     
## [31] hunspell_2.3       broom_0.4.2        modelr_0.1.0      
## [34] mnormt_1.5-5       mgcv_1.8-17        htmltools_0.3.5   
## [37] nnet_7.3-12        codetools_0.2-15   MASS_7.3-45       
## [40] ModelMetrics_1.1.0 grid_3.3.3         nlme_3.1-131      
## [43] jsonlite_1.4       Rttf2pt1_1.3.4     gtable_0.2.0      
## [46] DBI_0.6-1          magrittr_1.5       scales_0.4.1      
## [49] stringi_1.1.5      reshape2_1.4.2     xml2_1.1.1        
## [52] tools_3.3.3        forcats_0.2.0      hms_0.3           
## [55] pbkrtest_0.4-7     yaml_2.1.14        colorspace_1.3-2  
## [58] rvest_0.3.2        knitr_1.15.1       haven_1.0.0       
## [61] quantreg_5.29

Happy EasteR: Plotting hare populations in Germany

2017-04-16 00:00:00 +0000

For Easter, I wanted to have a look at the number of hares in Germany. Wild hare populations have been rapidly declining over the last 10 years but during the last three years they have at least been stable.

This plot shows the 16 federal states of Germany. Their fill color shows the proportion of hares per square kilometer in 2015/2016. The black area of the pie charts shows the percentage of the total number of hares (not corrected by area) for each federal state in regard to the total number of hares in all of Germany during that time. The size of the hares in the background reflects the total number of hares in the eight federal states with the highest percentages.


Below, you can find the code I used to produce this plot.


Hare population numbers

The German hunters’ association (Deutscher Jagdverband) publishes population numbers for a variety of common species of wild game, including hares. The most recent numbers are for the year 2015/16.

I downloaded the pdf and extracted the table on page 1 with the tabulizer package.

library(tidyverse)
#devtools::install_github("ropensci/tabulizer")
library(tabulizer)

txt <- extract_tables("2015-16 Jahresstrecke Feldhase.pdf")

hares <- txt[[1]] %>%
  as.data.frame(stringsAsFactors = FALSE)

# first row contains column names, remove from df and set as colnames
hares_df <- data.frame(hares[-1 , ])
colnames(hares_df) <- gsub(" ", "", as.character(c("bundesland", unlist(hares[1, -1]))))

# remove the spaces in the numbers and make numeric
hares_df <- apply(hares_df, 2, function(x) gsub(" ", "", x))
hares_df[, -1] <- apply(hares_df[, -1], 2, function(x) as.numeric(x))

hare_final <- as.data.frame(hares_df, stringsAsFactors = FALSE)


The map

I downloaded the ESRI Shapefile of German federal states from the Bundesamt für Kartographie und Geodäsie, Frankfurt am Main, 2011 and read it in with the rgdal package.

library(maptools)

ger <- rgdal::readOGR(dsn = "shp", layer = "vg2500_bld")
## OGR data source with driver: ESRI Shapefile 
## Source: "shp", layer: "vg2500_bld"
## with 16 features
## It has 6 fields

I then convert the shapefile into a dataframe for plotting. This allows plotting polygons in the shape of each federal state with ggplot.

library(plyr)
ger_df <- fortify(ger)
ger@data$id <- rownames(ger@data)
ger_df_final <- join(ger_df, ger@data, by = "id")
ger_df_final$GEN <- as.character(ger_df_final$GEN)
ger_df_final$GEN <- gsub("\xfc", "ü", ger_df_final$GEN)
ger_df_final$GEN <- gsub("ü", "ue", ger_df_final$GEN)

To match the names of the federal states in the map dataframe with the hare population table, I assign a corresponding column to the latter.

hare_final2 <- filter(hare_final, bundesland != "gesamt")
hare_final2$GEN <- as.factor(c("Baden-Wuerttemberg", "Bayern", "Berlin", "Brandenburg", "Bremen", "Hamburg", "Hessen", "Mecklenburg-Vorpommern", "Niedersachsen", "Nordrhein-Westfalen", "Rheinland-Pfalz", "Saarland", "Sachsen", "Sachsen-Anhalt", "Schleswig-Holstein", "Thueringen"))
colnames(hare_final2)[2:12] <- paste0("Y_", gsub("/", "_", colnames(hare_final)[2:12]))

hare_final2[, 2:12] <- apply(hare_final2[, 2:12], 2, function(x) as.numeric(x))


Area

The hare population numbers are given as absolute numbers but I want them normalized by area. The German statistics portal (http://www.statistik-portal.de) provides this data on their website. I use the XML package to scrape it from there.

library(XML)
table = readHTMLTable("http://www.statistik-portal.de/Statistik-Portal/de_jb01_jahrtab1.asp", header = TRUE, which = 1, stringsAsFactors = FALSE)

table$V1 <- gsub("Ã¼", "ü", table$V1) # fix mis-encoded umlauts in the state names
colnames(table) <- c("GEN", "area", "pop_total", "pop_male", "pop_female", "inh_p_km2")

# numbers are in German format, so need to convert them to English decimal notation
table[, -1] <- apply(table[, -1], 2, function(x) as.numeric(gsub(",", ".", gsub("\\.", "", x))))

table <- table[-17, ]
table$GEN <- as.factor(table$GEN)

I then divide each population number by area (in km2).

hare_final2$GEN <- as.character(hare_final2$GEN)
table$GEN <- hare_final2$GEN

hare_final3 <- hare_final2 %>%
  left_join(table, by = "GEN")

hare_final3[, 2:12] <- apply(hare_final3[, 2:12], 2, function(x) x / hare_final3$area)

This final table I then join with the map dataframe.

map <- left_join(ger_df_final, hare_final3, by = "GEN")


The background map

The background map shows Germany’s federal states (as polygons) colored according to how many hares there were relative to the area in km2.

I define the mapping theme for ggplot, so that I have no axes, grids, etc. and a transparent background.

map_theme <- list(theme(panel.grid.minor = element_blank(),
                        panel.grid.major = element_blank(),
                        panel.background = element_rect(fill = "transparent", color = NA),
                        plot.background = element_rect(fill = "transparent", color = NA),
                        panel.border = element_blank(),
                        axis.line = element_blank(),
                        axis.text.x = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks = element_blank(),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size = 18)))

The background of each polygon is also filled with a hare image, with the hare’s size being proportional to the federal state’s percentage of the total number of hares in Germany in 2015/16.

To achieve this, I first downloaded an open-source image of a rabbit and read it in with the png package. I then want to plot this image with ggplot2’s annotation_custom() function, so I need to convert it to a raster grob.

library(png)
bunny <-  readPNG("rabbit-297212_1280.png") #https://pixabay.com

library(grid)
g <- rasterGrob(bunny, interpolate = TRUE)

I am plotting hares only for the 8 federal states with a percentage above 1%, because the hares for the other states would be too small to see on the plot.

For each of these 8 federal states, I am plotting the hare image in relative size by scaling the xmax and ymax values according to the percentage values for each state. The image canvas always has a size of 15 by 15, so I am centering the images at 7.5. I then save each image as a png with transparent background.

val <- round(filter(hare_final_pie, GEN == "Bayern")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Bayern.png", bg = "transparent")
val <- round(filter(hare_final_pie, GEN == "Niedersachsen")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Niedersachsen.png", bg = "transparent")
val <- round(filter(hare_final_pie, GEN == "Nordrhein-Westfalen")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Nordrhein-Westfalen.png", bg = "transparent")
val <- round(filter(hare_final_pie, GEN == "Schleswig-Holstein")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Schleswig-Holstein.png", bg = "transparent")
val <- round(filter(hare_final_pie, GEN == "Baden-Wuerttemberg")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Baden-Wuerttemberg.png", bg = "transparent")
val <- round(filter(hare_final_pie, GEN == "Hessen")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Hessen.png", bg = "transparent")
val <- round(filter(hare_final_pie, GEN == "Rheinland-Pfalz")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Rheinland-Pfalz.png", bg = "transparent")
val <- round(filter(hare_final_pie, GEN == "Brandenburg")$percent, digits = 0) * 0.5 / 2

df <- data.frame(xmin = 7.5 - val,
                 xmax = 7.5 + val,
                 ymin = 7.5 - val,
                 ymax = 7.5 + val)

qplot(0:15, 0:15, geom = "blank") +
  annotation_custom(g, xmin = df$xmin, xmax = df$xmax, ymin = df$ymin, ymax = df$ymax) +
  map_theme
ggsave("my_image_Brandenburg.png", bg = "transparent")


Next, I follow the instructions and code from a StackOverflow post: each image is now read in again and converted to a dataframe for plotting. Because I only want to fill within the polygon borders, I am restricting the points to those that fall within these shapes. I need to use the "group" column here, because this is the column that was used for plotting the map polygons.

#http://stackoverflow.com/questions/28206611/adding-custom-image-to-geom-polygon-fill-in-ggplot
# converting raster image to plottable data.frame
library(sp)

ggplot_rasterdf <- function(color_matrix, bottom = 0, top = 1, left = 0, right = 1) {
  require("dplyr")
  require("tidyr")

  if (dim(color_matrix)[3] > 3) hasalpha <- T else hasalpha <- F

  outMatrix <- matrix("#00000000", nrow = dim(color_matrix)[1], ncol = dim(color_matrix)[2])

  for (i in 1:dim(color_matrix)[1])
    for (j in 1:dim(color_matrix)[2]) 
      outMatrix[i, j] <- rgb(color_matrix[i,j,1], color_matrix[i,j,2], color_matrix[i,j,3], ifelse(hasalpha, color_matrix[i,j,4], 1))

  colnames(outMatrix) <- seq(1, ncol(outMatrix))
  rownames(outMatrix) <- seq(1, nrow(outMatrix))
  as.data.frame(outMatrix) %>% mutate(Y = nrow(outMatrix):1) %>% gather(X, color, -Y) %>% 
    mutate(X = left + as.integer(as.character(X))*(right - left)/ncol(outMatrix), Y = bottom + Y*(top - bottom)/nrow(outMatrix))
}
my_image <- readPNG("my_image_Bayern.png")

groups <- as.character(unique(filter(map, GEN == "Bayern")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

bayern <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                               map[map$group %in% groups,]$long, 
                                               map[map$group %in% groups,]$lat) %>% as.logical,]
my_image <- readPNG("my_image_Niedersachsen.png")

groups <- as.character(unique(filter(map, GEN == "Niedersachsen")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

niedersachsen <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                        map[map$group %in% groups,]$long, 
                                        map[map$group %in% groups,]$lat) %>% as.logical,]
my_image <- readPNG("my_image_Nordrhein-Westfalen.png")

groups <- as.character(unique(filter(map, GEN == "Nordrhein-Westfalen")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

nrw <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                               map[map$group %in% groups,]$long, 
                                               map[map$group %in% groups,]$lat) %>% as.logical,]
my_image <- readPNG("my_image_Schleswig-Holstein.png")

groups <- as.character(unique(filter(map, GEN == "Schleswig-Holstein")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

sh <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                               map[map$group %in% groups,]$long, 
                                               map[map$group %in% groups,]$lat) %>% as.logical,]
my_image <- readPNG("my_image_Baden-Wuerttemberg.png")

groups <- as.character(unique(filter(map, GEN == "Baden-Wuerttemberg")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

bw <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                               map[map$group %in% groups,]$long, 
                                               map[map$group %in% groups,]$lat) %>% as.logical,]
my_image <- readPNG("my_image_Hessen.png")

groups <- as.character(unique(filter(map, GEN == "Hessen")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

hessen <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                               map[map$group %in% groups,]$long, 
                                               map[map$group %in% groups,]$lat) %>% as.logical,]
my_image <- readPNG("my_image_Rheinland-Pfalz.png")

groups <- as.character(unique(filter(map, GEN == "Rheinland-Pfalz")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

rp <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                               map[map$group %in% groups,]$long, 
                                               map[map$group %in% groups,]$lat) %>% as.logical,]
my_image <- readPNG("my_image_Brandenburg.png")

groups <- as.character(unique(filter(map, GEN == "Brandenburg")$group))

my_image_dat <- ggplot_rasterdf(my_image, 
                                left = min(map[map$group %in% groups,]$long), 
                                right = max(map[map$group %in% groups,]$long),
                                bottom = min(map[map$group %in% groups,]$lat),
                                top = max(map[map$group %in% groups,]$lat) )

brandenburg <- my_image_dat[point.in.polygon(my_image_dat$X, my_image_dat$Y, 
                                               map[map$group %in% groups,]$long, 
                                               map[map$group %in% groups,]$lat) %>% as.logical,]
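
The read-and-mask step repeats in the same way, so it could be collected into a named list with a second loop (again just a sketch of an alternative, using the plot_states vector from above):

# hypothetical loop: mask each state's hare image to its polygon and store the points
state_points <- list()

for (state in plot_states) {
  my_image <- readPNG(paste0("my_image_", state, ".png"))
  groups <- as.character(unique(filter(map, GEN == state)$group))

  my_image_dat <- ggplot_rasterdf(my_image,
                                  left   = min(map[map$group %in% groups, ]$long),
                                  right  = max(map[map$group %in% groups, ]$long),
                                  bottom = min(map[map$group %in% groups, ]$lat),
                                  top    = max(map[map$group %in% groups, ]$lat))

  # keep only the pixels that fall inside the state's polygon(s)
  inside <- as.logical(point.in.polygon(my_image_dat$X, my_image_dat$Y,
                                        map[map$group %in% groups, ]$long,
                                        map[map$group %in% groups, ]$lat))
  state_points[[state]] <- my_image_dat[inside, ]
}

The corresponding geom_tile() layers could then be added to the background plot by looping over state_points as well.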


Now, I can plot the background map. I am already setting the x- and y-limits here, as well as an empty title, because I want to overlay the pie charts onto this plot and therefore need the same coordinates and margins in both plots. I am also placing the legend at the top right, so that the legend of the pie charts can go at the bottom right.

p1 <- ggplot() +
  map_theme +
  geom_polygon(data = map, aes(long, lat, group = group, fill = Y_2015_16)) + 
  geom_path(data = map, aes(long, lat, group = group), color = "white", size = 0.5) + 
  scale_fill_gradientn(colors = rev(terrain.colors(10)), limits = c(0, max(as.numeric(hare_final3$Y_2015_16)))) +
  coord_cartesian(ylim = c(47, 55.5), xlim = c(5, 15.5)) +
  theme(legend.justification = "top") +
  labs(
    title = " ",
    fill = "# / km2"
  ) +
  geom_tile(data = bayern, aes(x = X, y = Y), fill = bayern$color) +
  geom_tile(data = niedersachsen, aes(x = X, y = Y), fill = niedersachsen$color) +
  geom_tile(data = nrw, aes(x = X, y = Y), fill = nrw$color) +
  geom_tile(data = sh, aes(x = X, y = Y), fill = sh$color) +
  geom_tile(data = bw, aes(x = X, y = Y), fill = bw$color) +
  geom_tile(data = hessen, aes(x = X, y = Y), fill = hessen$color) +
  geom_tile(data = rp, aes(x = X, y = Y), fill = rp$color) +
  geom_tile(data = brandenburg, aes(x = X, y = Y), fill = brandenburg$color)


The pie plot

Over each federal state, I want to plot a pie chart that shows the percentage of the total hare population that falls on each federal state. I use the scatterpie package for that. The coordinates for each pie should be the center points of each federal state, which I determined with the following code from a StackOverflow post.

library(scatterpie)

# http://stackoverflow.com/questions/10368180/plotting-pie-graphs-on-map-in-ggplot
getLabelPoint <- function(county) {Polygon(county[c('long', 'lat')])@labpt}

centroids <- by(map, map$GEN, getLabelPoint)    
centroids <- do.call("rbind.data.frame", centroids) %>%
  tibble::rownames_to_column()
names(centroids) <- c("GEN", "long", "lat")

I then calculate the percentages and join them with the centroid coordinates. I am also adjusting the positions of the pie charts for some federal states so that they don’t overlap with other pie charts or with the background images.

hare_final_pie <- hare_final2 %>%
  select(one_of(c("GEN", "Y_2015_16"))) %>%
  mutate(percent = round(Y_2015_16 / sum(Y_2015_16) * 100, digits = 2),
         rest = 100 - percent) %>%
  left_join(centroids, by = "GEN")

hare_final_pie[hare_final_pie$GEN == "Brandenburg", "long"] <- 14
hare_final_pie[hare_final_pie$GEN == "Brandenburg", "lat"] <- 52

hare_final_pie[hare_final_pie$GEN == "Niedersachsen", "long"] <- 8
hare_final_pie[hare_final_pie$GEN == "Niedersachsen", "lat"] <- 52.8

hare_final_pie[hare_final_pie$GEN == "Schleswig-Holstein", "long"] <- 10.5
hare_final_pie[hare_final_pie$GEN == "Schleswig-Holstein", "lat"] <- 54

hare_final_pie[hare_final_pie$GEN == "Hamburg", "long"] <- 10.2

hare_final_pie[hare_final_pie$GEN == "Berlin", "long"] <- 13.6

hare_final_pie[hare_final_pie$GEN == "Nordrhein-Westfalen", "long"] <- 6.8

hare_final_pie[hare_final_pie$GEN == "Hessen", "long"] <- 9.2
hare_final_pie[hare_final_pie$GEN == "Hessen", "lat"] <- 50.9

hare_final_pie[hare_final_pie$GEN == "Rheinland-Pfalz", "long"] <- 7
hare_final_pie[hare_final_pie$GEN == "Rheinland-Pfalz", "lat"] <- 50

hare_final_pie[hare_final_pie$GEN == "Baden-Wuerttemberg", "long"] <- 9.5
hare_final_pie[hare_final_pie$GEN == "Baden-Wuerttemberg", "lat"] <- 49

hare_final_pie[hare_final_pie$GEN == "Bayern", "long"] <- 11
hare_final_pie[hare_final_pie$GEN == "Bayern", "lat"] <- 49.6

Now, I am plotting the second plot with only the pie charts. Theoretically, one could plot the pie charts directly on top of another ggplot, but here this doesn’t work because I am using conflicting fill attributes: a continuous scale for the polygons and a categorical scale for the pie charts. Therefore, I am overlaying the two plots instead.

hare_final_pie$radius <- 0.2

p2 <- ggplot() +
  geom_scatterpie(data = hare_final_pie, aes(x = long, y = lat, group = GEN, r = radius), cols = colnames(hare_final_pie)[3:4]) +
  map_theme +
  coord_cartesian(ylim = c(47, 55.5), xlim = c(5, 15.5)) +
  theme(legend.justification = "bottom") +
  scale_fill_manual(values = c("black", "transparent")) +
  labs(
    title = "German hare populations in 2015/16 by federal state",
    fill = ""
  )
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 1, widths = unit(1, "npc"))))
print(p1 + theme(legend.position = "right"), vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2 + theme(legend.position = "right"), vp = viewport(layout.pos.row = 1, layout.pos.col = 1))


sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scatterpie_0.0.7 png_0.1-7        plyr_1.8.4       maptools_0.9-2  
##  [5] sp_1.2-4         dplyr_0.5.0      purrr_0.2.2      readr_1.1.0     
##  [9] tidyr_0.6.1      tibble_1.3.0     ggplot2_2.2.1    tidyverse_1.1.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     forcats_0.2.0    tools_3.3.3      digest_0.6.12   
##  [5] jsonlite_1.4     lubridate_1.6.0  evaluate_0.10    nlme_3.1-131    
##  [9] gtable_0.2.0     lattice_0.20-35  psych_1.7.3.21   DBI_0.6-1       
## [13] rgdal_1.2-6      yaml_2.1.14      parallel_3.3.3   haven_1.0.0     
## [17] xml2_1.1.1       stringr_1.2.0    httr_1.2.1       knitr_1.15.1    
## [21] hms_0.3          rprojroot_1.2    R6_2.2.0         readxl_0.1.1    
## [25] foreign_0.8-67   rmarkdown_1.4    udunits2_0.13    tweenr_0.1.5    
## [29] modelr_0.1.0     reshape2_1.4.2   magrittr_1.5     units_0.4-3     
## [33] MASS_7.3-45      backports_1.0.5  scales_0.4.1     htmltools_0.3.5 
## [37] rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     ggforce_0.1.1   
## [41] colorspace_1.3-2 labeling_0.3     stringi_1.1.5    lazyeval_0.2.0  
## [45] munsell_0.4.3    broom_0.4.2

Data on tour: Plotting 3D maps and location tracks

2017-04-09 00:00:00 +0000

Recently, I was on Gran Canaria for a vacation. So, what better way to keep up the holiday spirit a while longer than to visualize all the places we went in R!?

I am combining location data collected by

  1. our car GPS,
  2. Google location data from my phone and
  3. the hiking tracks we followed.

Obviously, the data itself is not public this time, but the principles for plotting can be used with any .gpx files that contain latitude, longitude and elevation data.


Gran Canaria

Gran Canaria is one of the Spanish Canary Islands off the coast of Morocco. Like its sister islands, it is of volcanic origin.


Car GPS tracks

The GPS tracks in .gpx format were downloaded from the device (Garmin) the same way as I had already done with the tracks from our US/Canada roadtrip last year. Garmin’s .gpx tracks are not in a standard format that could be read with readGPX(), so I had to use htmlTreeParse() from the XML library.

library(XML)

# list of files in directory
myfiles <- list.files(path = "Takeout_2017_March", full.names = TRUE)

# all gpx files in that directory
myfiles <- myfiles[grep(".gpx", myfiles)]

for (i in 1:length(myfiles)) {

    # parse document
    pfile <- htmlTreeParse(myfiles[i], useInternalNodes = T)
  
    # Get all elevations, times and coordinates via the respective xpath
    elevations <- as.numeric(as.character(xpathSApply(pfile, path = "//trkpt/ele", xmlValue)))
    times <- xpathSApply(pfile, path = "//trkpt/time", xmlValue)
    coords <- xpathSApply(pfile, path = "//trkpt", xmlAttrs)
    
    # Extract latitude and longitude from the coordinates
    lats <- as.numeric(as.character(coords["lat",]))
    lons <- as.numeric(as.character(coords["lon",]))
    
    # Put everything in a dataframe
    geodf <- data.frame(lat = lats, lon = lons, ele = elevations, time = times)

    if (i == 1) {
      geodata <- geodf
      } else {
      geodata <- rbind(geodata, geodf)
    }
}

geodata_all <- geodata

# Transforming the time column
geodata_all$time <- as.character(strptime(geodata_all$time, format = "%Y-%m-%dT%H:%M:%SZ"))

# ordering by date
library(dplyr)
geodata_all <- arrange(geodata_all, time)

# keep only data points from Gran Canaria
geodata_gc <- filter(geodata_all, lat < 28.5 & lat > 27.5) %>%
  filter(lon > -16 & lon < -15)


Hiking tracks

The hiking tracks we followed came mostly from a German hiking guide-book, the Rother Wanderführer, 7th edition from 2016. They were in standard .gpx format and could be read with readGPX().

Only one of our hikes did not come from this book, but from Wikiloc. It could be treated the same way as the other hiking tracks, though, so I combined all hiking tracks.

library(plotKML)

# get file names
myfiles <- list.files(path = "hiking", full.names = TRUE)
myfiles <- myfiles[grep(".gpx", myfiles)]

for (i in 1:length(myfiles)) {

  # read in gpx files
  t <- readGPX(myfiles[i])
  
  # extract latitude, longitude and elevation
  geodf <- data.frame(lon = t$tracks[[1]][[1]]$lon,
                      lat = t$tracks[[1]][[1]]$lat,
                      ele = t$tracks[[1]][[1]]$ele,
                      tour = names(t$tracks[[1]]))

    # combine into data frame
    if (i == 1) {
      geodata <- geodf
      } else {
      geodata <- rbind(geodata, geodf)
    }
}

geodata_tracks <- geodata


Google location history

The Google location data from my phone were prepared the same way as in my post about plotting your Google location history.

library(jsonlite)
system.time(x <- fromJSON("Standortverlauf.json"))
# extracting the locations dataframe
loc = x$locations

# converting time column from posix milliseconds into a readable time scale
loc$time = as.POSIXct(as.numeric(x$locations$timestampMs)/1000, origin = "1970-01-01")

# converting longitude and latitude from E7 to GPS coordinates
loc$lat = loc$latitudeE7 / 1e7
loc$lon = loc$longitudeE7 / 1e7

# keep only data from Gran Canaria
loc_gc <- filter(loc, lat < 28.5 & lat > 27.5) %>%
  filter(lon > -16 & lon < -15)


All three datasets contain latitude/longitude coordinate pairs and the elevation of each observation, so I can combine these three attributes from the three location data sources.

geodata_tracks$ele <- as.numeric(as.character(geodata_tracks$ele))
data_combined <- data.frame(lat = c(geodata_gc$lat, loc_gc$lat, geodata_tracks$lat),
                            lon = c(geodata_gc$lon, loc_gc$lon, geodata_tracks$lon),
                            ele = c(geodata_gc$ele, loc_gc$altitude, geodata_tracks$ele),
                            track = c(rep("GPS", nrow(geodata_gc)), rep("Google", nrow(loc_gc)), rep("Hiking", nrow(geodata_tracks))))
data_combined <- data_combined[!duplicated(data_combined), ]


Elevation profile

I downloaded the elevation profile of Gran Canaria with the maptools and raster packages.

library(maptools)
library(raster)
srtm <- getData("SRTM", lon = -15.59972, lat = 27.965)

# crop to Gran Canaria & Tenerife
e1 <- extent(min(data_combined$lon) - 1.2, # xmin
            max(data_combined$lon) + 0.1, # xmax
            min(data_combined$lat) - 0.1, # ymin
            max(data_combined$lat) + 0.5) # ymax

srtm_ct <- crop(srtm, e1)

# plot slope and aspect with hill shades
slope <- terrain(srtm_ct, opt = "slope")
aspect <- terrain(srtm_ct, opt = "aspect")
hill <- hillShade(slope, aspect, angle = 45, direction = 45, normalize = TRUE)
plot(hill, col = grey(0:100/100), legend = FALSE)
plot(srtm_ct, col = rainbow(25, alpha = 0.35), add = TRUE)

Its sister island to the west, Tenerife, is a bit larger and, as can be seen in the image above, its highest point, the Teide, is about 1000 meters higher than Gran Canaria’s highest point.


# crop to Gran Canaria
e2 <- extent(min(data_combined$lon) - 0.2, # xmin
            max(data_combined$lon) + 0.1, # xmax
            min(data_combined$lat) - 0.1, # ymin
            max(data_combined$lat) + 0.1) # ymax

srtm_c <- crop(srtm, e2)

slope <- terrain(srtm_c, opt = "slope")
aspect <- terrain(srtm_c, opt = "aspect")
hill <- hillShade(slope, aspect, angle = 45, direction = 45, normalize = TRUE)
plot(hill, col = grey(0:100/100), legend = FALSE)
plot(srtm_c, col = rainbow(25, alpha = 0.35), add = TRUE)

Looking only at Gran Canaria’s slope and aspect profile, the plots nicely show its highest point (almost 2000 m) in the center of the island and that it slopes down towards sea level at the coast.


ggmap

I centered the map at the midpoint of Gran Canaria, which is given by http://www.travelmath.com/island/Gran+Canaria as 27° 57’ 54” N / 15° 35’ 59” W. Converted to decimal degrees (e.g. with http://www.rapidtables.com/convert/number/degrees-minutes-seconds-to-degrees.htm), that is latitude 27.965 and longitude -15.59972.
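
The degree/minute/second to decimal conversion can also be done directly in R; here is a small sketch (the helper function is mine, not from any package):

# decimal degrees = degrees + minutes/60 + seconds/3600 (negative for S/W)
dms_to_dec <- function(deg, min, sec, negative = FALSE) {
  dec <- deg + min / 60 + sec / 3600
  if (negative) -dec else dec
}

dms_to_dec(27, 57, 54)                   # latitude:   27.965
dms_to_dec(15, 35, 59, negative = TRUE)  # longitude: -15.59972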

library(ggplot2)
library(ggmap)

map_theme <- list(theme(legend.position = "top",
                        panel.grid.minor = element_blank(),
                        panel.grid.major = element_blank(),
                        panel.background = element_blank(),
                        plot.background = element_rect(fill = "white"),
                        panel.border = element_blank(),
                        axis.line = element_blank(),
                        axis.text.x = element_blank(),
                        axis.text.y = element_blank(),
                        axis.ticks = element_blank(),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size = 18)))
map <- get_map(c(lon = -15.59972, lat = 27.965), zoom = 10, maptype = "terrain-background", source = "google")

ggmap(map) + 
  geom_point(data = data_combined, aes(x = lon, y = lat, color = track), alpha = 0.3) +
  scale_color_brewer(palette = "Set1") +
  map_theme

The above plot shows the different location data sources. Our car GPS recorded the most data points along our routes (blue), while Google location data (red) is only recorded occasionally (usually, when I turn on my phone, so it primarily shows places we walked around in). The hiking tracks in green show the 7 hikes we’ve followed.


map <- get_map(c(lon = -15.59972, lat = 27.965), zoom = 10)

ggmap(map) + 
  geom_point(data = na.omit(data_combined), aes(x = lon, y = lat, color = ele)) +
  scale_color_gradient2(low = "lightblue", mid = "blue", high = "red", name = "Elevation") +
  map_theme

This plot shows the elevation of the locations we visited.


3D plots

Because the 2D plot doesn’t show the elevation very clearly, I wanted to plot this in 3D.

library(scatterplot3d)

# http://gis.stackexchange.com/questions/142156/r-how-to-get-latitudes-and-longitudes-from-a-rasterlayer
r.pts <- rasterToPoints(srtm_c, spatial = TRUE)
geo.prj <- proj4string(r.pts)
r.pts <- spTransform(r.pts, CRS(geo.prj)) 
bg_matrix <- data.frame(lon = coordinates(r.pts)[,1],
                         lat = coordinates(r.pts)[,2])

ex_bg <- extract(srtm_c, bg_matrix, cellnumbers = TRUE, df = TRUE)
bg_matrix$ele <- ex_bg$srtm_33_07

ex_points <- extract(srtm_c, cbind(data_combined$lon, data_combined$lat), cellnumbers = TRUE, df = TRUE)

s3d <- scatterplot3d(x = bg_matrix[, 1], 
                     y = bg_matrix[, 2], 
                     z = bg_matrix[, 3], 
                     type = "p", 
                     pch = 16, 
                     highlight.3d = TRUE, 
                     cex.symbols = 0.5, 
                     box = FALSE,
                     grid = FALSE,
                     xlab = "Longitude",
                     ylab = "Latitude",
                     zlab = "Elevation")
s3d$points3d(x = data_combined$lon, y = data_combined$lat, z = ex_points$srtm_33_07, col = "blue", pch = 16, cex = 0.1)

The first 3D plot shows a scatterplot of our location points on the elevation profile of Gran Canaria. It was created with scatterplot3d.


But I also wanted to be able to see the 3D plot from different angles, so I used rgl to make interactive 3D plots.

library(rgdal)
library(rasterVis)
library(rgl)
library(htmlwidgets)
options(rgl.printRglwidget = TRUE)

open3d()
plot3D(srtm_c,  maxpixels = 7e4)


Click to open the interactive 3D image and use the mouse to zoom and turn the plot:

3dplot1

This interactive 3D plot shows the elevation profile of the whole island and you can clearly see the various ravines that have been shaped over the years.

Now, I would have liked to plot our location points directly onto this plot, but I couldn’t figure out how to do this (if someone knows, please let me know). Instead, I plotted only our location points in 3D.


options(rgl.printRglwidget = TRUE)
open3d()
plot3d(data_combined$lon, data_combined$lat, ex_points$srtm_33_07, 
       xlab = "", ylab = "", zlab = "",
       col = "blue", size = 5, alpha = 0.1,
       lit = TRUE,
       box = FALSE, axes = FALSE)


Click to open the interactive 3D image and use the mouse to zoom and turn the plot:

3dplot2

Here as well, I would have liked to have the elevation profile of the island behind it, but the code that worked well within RStudio (see below) did not plot the background points in my html output. I assume that it has to do with having too many points, so if somebody knows a way to reduce the resolution, please let me know!

plot3d(bg_matrix$lon, bg_matrix$lat, bg_matrix$ele, 
       xlab = "", ylab = "", zlab = "",
       col = "grey", size = 5, alpha = 0.1,
       box = FALSE, axes = FALSE, useNULL = TRUE)
plot3d(data_combined$lon, data_combined$lat, ex_points$srtm_33_07, 
       col = "blue", add = TRUE, size = 10, alpha = 0.5, useNULL = TRUE) 


sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scatterplot3d_0.3-38 htmlwidgets_0.8      rgl_0.98.1          
##  [4] rasterVis_0.41       latticeExtra_0.6-28  RColorBrewer_1.1-2  
##  [7] lattice_0.20-34      rgdal_1.2-5          raster_2.5-8        
## [10] maptools_0.9-2       sp_1.2-4            
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9       knitr_1.15.1      magrittr_1.5     
##  [4] xtable_1.8-2      viridisLite_0.2.0 R6_2.2.0         
##  [7] stringr_1.2.0     tools_3.3.3       parallel_3.3.3   
## [10] grid_3.3.3        htmltools_0.3.5   yaml_2.1.14      
## [13] rprojroot_1.2     digest_0.6.12     shiny_1.0.0      
## [16] mime_0.5          evaluate_0.10     rmarkdown_1.3    
## [19] stringi_1.1.2     backports_1.0.5   jsonlite_1.2     
## [22] httpuv_1.3.3      foreign_0.8-67    hexbin_1.27.1    
## [25] zoo_1.7-14

Dealing with unbalanced data in machine learning

2017-04-02 00:00:00 +0000

In my last post, where I shared the code that I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advised to consider over- or under-sampling when you have unbalanced data sets. Because my focus in this webinar was on evaluating model performance, I did not want to add an additional layer of complexity and therefore did not further discuss how to specifically deal with unbalanced data.

But because I had gotten a few questions regarding this, I thought it would be worthwhile to explain over- and under-sampling techniques in more detail and show how you can very easily implement them with caret.

library(caret)


Unbalanced data

In this context, unbalanced data refers to classification problems where we have unequal numbers of instances for the different classes. Unbalanced data is actually very common in general, but it is especially prevalent when working with disease data, where we usually have more healthy control samples than disease cases. Even more extreme unbalance is seen in fraud detection, where e.g. most credit card transactions are legitimate and only very few are fraudulent. In the example I used for my webinar, a breast cancer dataset, we had about twice as many benign as malignant samples.

summary(bc_data$classes)
##    benign malignant 
##       458       241


Why is unbalanced data a problem in machine learning?

Most machine learning classification algorithms are sensitive to unbalance in the predictor classes. Let’s consider an even more extreme example than our breast cancer dataset: assume we had 10 malignant vs 90 benign samples. A machine learning model that has been trained and tested on such a dataset could now predict “benign” for all samples and still gain a very high accuracy. An unbalanced dataset will bias the prediction model towards the more common class!
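
To make this concrete, here is the tiny calculation for that 10 vs 90 example (purely for illustration):

# a "model" that always predicts the majority class
actual    <- factor(c(rep("malignant", 10), rep("benign", 90)))
predicted <- factor(rep("benign", 100), levels = levels(actual))

mean(predicted == actual)  # accuracy of 0.9, despite missing every malignant case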


How to balance data for modeling

The basic theoretical concepts behind over- and under-sampling are very simple:

  • With under-sampling, we randomly select a subset of samples from the class with more instances to match the number of samples coming from each class. In our example, we would randomly pick 241 out of the 458 benign cases. The main disadvantage of under-sampling is that we lose potentially relevant information from the left-out samples.

  • With over-sampling, we randomly duplicate samples from the class with fewer instances, or we generate additional instances based on the data that we have, so as to match the number of samples in each class. While we avoid losing information with this approach, we also run the risk of overfitting our model, as we are more likely to get the same samples in the training and in the test data, i.e. the test data is no longer independent from the training data. This would lead to an overestimation of our model’s performance and generalizability.

In reality though, we should not simply perform over- or under-sampling on our training data and then run the model. We need to account for cross-validation and perform over- or under-sampling on each fold independently to get an honest estimate of model performance!
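
For completeness: caret also offers downSample() and upSample() helper functions that balance a data set in one step. A minimal sketch of applying them manually (to the training set that is created in the next section, never to the test set) looks like this; the cross-validation-aware sampling option shown below is still the cleaner approach:

# manual balancing of the training data with caret's helper functions
down_train <- downSample(x = train_data[, -which(names(train_data) == "classes")],
                         y = train_data$classes, yname = "classes")
up_train   <- upSample(x = train_data[, -which(names(train_data) == "classes")],
                       y = train_data$classes, yname = "classes")

table(down_train$classes)
table(up_train$classes)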


Modeling the original unbalanced data

Here is the same model I used in my webinar example: I randomly divide the data into training and test sets (stratified by class) and perform Random Forest modeling with 10 x 10 repeated cross-validation. Final model performance is then measured on the test set.

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]
set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  verboseIter = FALSE))
final <- data.frame(actual = test_data$classes,
                    predict(model_rf, newdata = test_data, type = "prob"))
final$predict <- ifelse(final$benign > 0.5, "benign", "malignant")
cm_original <- confusionMatrix(final$predict, test_data$classes)


Under-sampling

Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down for under- (also called down-) sampling. The rest stays the same as with our original model.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "down")

set.seed(42)
model_rf_under <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = ctrl)
final_under <- data.frame(actual = test_data$classes,
                    predict(model_rf_under, newdata = test_data, type = "prob"))
final_under$predict <- ifelse(final_under$benign > 0.5, "benign", "malignant")
cm_under <- confusionMatrix(final_under$predict, test_data$classes)


Oversampling

For over- (also called up-) sampling we simply specify sampling = "up".

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "up")

set.seed(42)
model_rf_over <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = ctrl)
final_over <- data.frame(actual = test_data$classes,
                          predict(model_rf_over, newdata = test_data, type = "prob"))
final_over$predict <- ifelse(final_over$benign > 0.5, "benign", "malignant")
cm_over <- confusionMatrix(final_over$predict, test_data$classes)


ROSE

Besides over- and under-sampling, there are hybrid methods that combine under-sampling with the generation of additional data. Two of the most popular are ROSE and SMOTE.

From Nicola Lunardon, Giovanna Menardi and Nicola Torelli’s “ROSE: A Package for Binary Imbalanced Learning” (R Journal, 2014, Vol. 6 Issue 1, p. 79): “The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. Artificial balanced samples are generated according to a smoothed bootstrap approach and allow for aiding both the phases of estimation and accuracy evaluation of a binary classifier in the presence of a rare class. Functions that implement more traditional remedies for the class imbalance and different metrics to evaluate accuracy are also provided. These are estimated by holdout, bootstrap, or cross-validation methods.”

You implement them the same way as before, this time choosing sampling = "rose"

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "rose")

set.seed(42)
model_rf_rose <- caret::train(classes ~ .,
                              data = train_data,
                              method = "rf",
                              preProcess = c("scale", "center"),
                              trControl = ctrl)
final_rose <- data.frame(actual = test_data$classes,
                         predict(model_rf_rose, newdata = test_data, type = "prob"))
final_rose$predict <- ifelse(final_rose$benign > 0.5, "benign", "malignant")
cm_rose <- confusionMatrix(final_rose$predict, test_data$classes)


SMOTE

… or by choosing sampling = "smote" in the trainControl settings.

From Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer’s “SMOTE: Synthetic Minority Over-sampling Technique” (Journal of Artificial Intelligence Research, 2002, Vol. 16, pp. 321–357): “This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples.”

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     verboseIter = FALSE,
                     sampling = "smote")

set.seed(42)
model_rf_smote <- caret::train(classes ~ .,
                              data = train_data,
                              method = "rf",
                              preProcess = c("scale", "center"),
                              trControl = ctrl)
final_smote <- data.frame(actual = test_data$classes,
                         predict(model_rf_smote, newdata = test_data, type = "prob"))
final_smote$predict <- ifelse(final_smote$benign > 0.5, "benign", "malignant")
cm_smote <- confusionMatrix(final_smote$predict, test_data$classes)


Predictions

Now let’s compare the predictions of all these models:

models <- list(original = model_rf,
                       under = model_rf_under,
                       over = model_rf_over,
                       smote = model_rf_smote,
                       rose = model_rf_rose)

resampling <- resamples(models)
bwplot(resampling)

library(dplyr)
comparison <- data.frame(model = names(models),
                         Sensitivity = rep(NA, length(models)),
                         Specificity = rep(NA, length(models)),
                         Precision = rep(NA, length(models)),
                         Recall = rep(NA, length(models)),
                         F1 = rep(NA, length(models)))

for (name in names(models)) {
  model <- get(paste0("cm_", name))
  
  comparison[comparison$model == name, ] <- filter(comparison, model == name) %>%
    mutate(Sensitivity = model$byClass["Sensitivity"],
           Specificity = model$byClass["Specificity"],
           Precision = model$byClass["Precision"],
           Recall = model$byClass["Recall"],
           F1 = model$byClass["F1"])
}

library(tidyr)
comparison %>%
  gather(x, y, Sensitivity:F1) %>%
  ggplot(aes(x = x, y = y, color = model)) +
    geom_jitter(width = 0.2, alpha = 0.5, size = 3)

With this small dataset, we can already see how the different techniques can influence model performance. Sensitivity (or recall) describes the proportion of benign cases that have been predicted correctly, while specificity describes the proportion of malignant cases that have been predicted correctly. Precision describes the proportion of benign predictions that actually came from benign samples. F1 is the harmonic mean of precision and sensitivity/recall.
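
For reference, these metrics can be written out from the confusion matrix counts, with benign as the positive class (a small sketch based on cm_original from above; the variable names are mine):

# recover the counts from the original model's confusion matrix
tab <- cm_original$table            # rows = predictions, columns = reference
TP <- tab["benign", "benign"]       # benign correctly predicted
FN <- tab["malignant", "benign"]    # benign predicted as malignant
FP <- tab["benign", "malignant"]    # malignant predicted as benign
TN <- tab["malignant", "malignant"] # malignant correctly predicted

sensitivity <- TP / (TP + FN)   # = recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
F1          <- 2 * precision * sensitivity / (precision + sensitivity)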

Here, all four methods improved specificity compared to the original model; under-sampling, over-sampling and ROSE additionally improved precision and the F1 score.


This post shows a simple example of how to correct for unbalance in datasets for machine learning. For more advanced instructions and potential caveats with these techniques, check out the excellent caret documentation.


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.



sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.6.1         dplyr_0.5.0         randomForest_4.6-12
## [4] caret_6.0-73        ggplot2_2.2.1       lattice_0.20-34    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9        nloptr_1.0.4       plyr_1.8.4        
##  [4] class_7.3-14       iterators_1.0.8    tools_3.3.3       
##  [7] digest_0.6.12      lme4_1.1-12        evaluate_0.10     
## [10] tibble_1.2         gtable_0.2.0       nlme_3.1-131      
## [13] mgcv_1.8-17        Matrix_1.2-8       foreach_1.4.3     
## [16] DBI_0.5-1          yaml_2.1.14        parallel_3.3.3    
## [19] SparseM_1.74       e1071_1.6-8        stringr_1.2.0     
## [22] knitr_1.15.1       MatrixModels_0.4-1 stats4_3.3.3      
## [25] rprojroot_1.2      grid_3.3.3         nnet_7.3-12       
## [28] R6_2.2.0           rmarkdown_1.3      minqa_1.2.4       
## [31] reshape2_1.4.2     car_2.1-4          magrittr_1.5      
## [34] backports_1.0.5    scales_0.4.1       codetools_0.2-15  
## [37] ModelMetrics_1.1.0 htmltools_0.3.5    MASS_7.3-45       
## [40] splines_3.3.3      assertthat_0.1     pbkrtest_0.4-6    
## [43] colorspace_1.3-2   labeling_0.3       quantreg_5.29     
## [46] stringi_1.1.2      lazyeval_0.2.0     munsell_0.4.3

Building meaningful machine learning models for disease prediction

2017-03-31 00:00:00 +0000

Webinar for the ISDS R Group

This document presents the code I used to produce the example analysis and figures shown in my webinar on building meaningful machine learning models for disease prediction.

My webinar slides are available on Github


Description: Dr Shirin Glander will go over her work on building machine-learning models to predict the course of different diseases. She will cover building a model, evaluating its performance, and answering or addressing different disease-related questions using machine learning. Her talk will cover the theory of machine learning as it is applied using R.


Setup

All analyses are done in R using RStudio. For detailed session information including R version, operating system and package versions, see the sessionInfo() output at the end of this document.

All figures are produced with ggplot2.


The dataset

The dataset I am using in these example analyses is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nucleoli
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read.table("datasets/breast-cancer-wisconsin.data.txt", 
                      header = FALSE, 
                      sep = ",")
colnames(bc_data) <- c("sample_code_number", 
                       "clump_thickness", 
                       "uniformity_of_cell_size", 
                       "uniformity_of_cell_shape", 
                       "marginal_adhesion", 
                       "single_epithelial_cell_size", 
                       "bare_nuclei", 
                       "bland_chromatin", 
                       "normal_nucleoli", 
                       "mitosis", 
                       "classes")

bc_data$classes <- ifelse(bc_data$classes == "2", "benign",
                          ifelse(bc_data$classes == "4", "malignant", NA))


Missing data

bc_data[bc_data == "?"] <- NA

# how many NAs are in the data
length(which(is.na(bc_data)))
## [1] 16
# how many samples would we lose if we removed them?
nrow(bc_data)
## [1] 699
nrow(bc_data[is.na(bc_data), ])
## [1] 16


Missing values are imputed with the mice package.

# impute missing data
library(mice)

bc_data[,2:10] <- apply(bc_data[, 2:10], 2, function(x) as.numeric(as.character(x)))
dataset_impute <- mice(bc_data[, 2:10],  print = FALSE)
bc_data <- cbind(bc_data[, 11, drop = FALSE], mice::complete(dataset_impute, 1))

bc_data$classes <- as.factor(bc_data$classes)

# how many benign and malignant cases are there?
summary(bc_data$classes)


Data exploration

  • Response variable for classification
library(ggplot2)

ggplot(bc_data, aes(x = classes, fill = classes)) +
  geom_bar()

We can see that our data is unbalanced. For simplicity’s sake, I am not going to go into how to deal with this here. But I added a short post about dealing with unbalanced datasets using caret for you to check out.


  • Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) +
  geom_histogram(bins = 10)


  • Principal Component Analysis
library(pcaGoPromoter)
library(ellipse)

# perform pca and extract scores
pcaOutput <- pca(t(bc_data[, -1]), printDropped = FALSE, scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$scores)
  
# define groups for plotting
pcaOutput2$groups <- bc_data$classes
  
centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)

conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
  data.frame(groups = as.character(t),
             ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                   centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                   level = 0.95),
             stringsAsFactors = FALSE)))
    
ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
    geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
    geom_point(size = 2, alpha = 0.6) + 
    scale_color_brewer(palette = "Set1") +
    labs(color = "",
         fill = "",
         x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
         y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance")) 

  • Features
library(tidyr)

gather(bc_data, x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = classes, fill = classes)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)


Machine Learning packages for R

caret

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)


Training, validation and test data

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]
library(dplyr)

rbind(data.frame(group = "train", train_data),
      data.frame(group = "test", test_data)) %>%
  gather(x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = group, fill = group)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)


Regression

set.seed(42)
model_glm <- caret::train(clump_thickness ~ .,
                          data = train_data,
                          method = "glm",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))
model_glm
## Generalized Linear Model 
## 
## 490 samples
##   9 predictor
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 441, 441, 440, 442, 441, 440, ... 
## Resampling results:
## 
##   RMSE      Rsquared 
##   1.974296  0.5016141
## 
## 
predictions <- predict(model_glm, test_data)
# model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values
data.frame(residuals = resid(model_glm),
           predictors = model_glm$finalModel$linear.predictors) %>%
  ggplot(aes(x = predictors, y = residuals)) +
    geom_jitter() +
    geom_smooth(method = "lm")

# y == train_data$clump_thickness
data.frame(residuals = resid(model_glm),
           y = model_glm$finalModel$y) %>%
  ggplot(aes(x = y, y = residuals)) +
    geom_jitter() +
    geom_smooth(method = "lm")

data.frame(actual = test_data$clump_thickness,
           predicted = predictions) %>%
  ggplot(aes(x = actual, y = predicted)) +
    geom_jitter() +
    geom_smooth(method = "lm")


Classification

Decision trees

rpart

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(classes ~ .,
            data = train_data,
            method = "class",
            control = rpart.control(xval = 10, 
                                    minbucket = 2, 
                                    cp = 0), 
             parms = list(split = "information"))

rpart.plot(fit, extra = 100)


Random Forests

Random Forest predictions are based on the generation of multiple classification trees. They can be used for both classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))

When you specify savePredictions = TRUE, you can access the cross-validation results with model_rf$pred.
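
For example, a quick look at the held-out predictions (the exact columns depend on the tuning parameters):

# cross-validation predictions for each fold and repeat
head(model_rf$pred)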

model_rf$finalModel$confusion
##           benign malignant class.error
## benign       313         8  0.02492212
## malignant      4       165  0.02366864


  • Feature Importance
imp <- model_rf$finalModel$importance
imp[order(imp, decreasing = TRUE), ]
##     uniformity_of_cell_size    uniformity_of_cell_shape 
##                   54.416003                   41.553022 
##             bland_chromatin                 bare_nuclei 
##                   29.343027                   28.483842 
##             normal_nucleoli single_epithelial_cell_size 
##                   19.239635                   18.480155 
##             clump_thickness           marginal_adhesion 
##                   13.276702                   12.143355 
##                     mitosis 
##                    3.081635
# estimate variable importance
importance <- varImp(model_rf, scale = TRUE)
plot(importance)


  • predicting test data
confusionMatrix(predict(model_rf, test_data), test_data$classes)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       133         2
##   malignant      4        70
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9386, 0.9894)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9369          
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9708          
##             Specificity : 0.9722          
##          Pos Pred Value : 0.9852          
##          Neg Pred Value : 0.9459          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6364          
##    Detection Prevalence : 0.6459          
##       Balanced Accuracy : 0.9715          
##                                           
##        'Positive' Class : benign          
## 
results <- data.frame(actual = test_data$classes,
                      predict(model_rf, test_data, type = "prob"))

results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))

results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)


Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” Tianqi Chen, developer of xgboost

XGBoost is a tree ensemble model, which means that its predictions are the sum of the predictions of a set of classification and regression trees (CART). In that respect, XGBoost is similar to Random Forests, but it uses a different approach to model training. It can be used for both classification and regression tasks. Here, I show a classification task.
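
As a minimal sketch of the regularization mentioned in the quote above, here is a direct call to the xgboost package (this is not the caret workflow used below; it assumes that all features in train_data are numeric and that the class label column is called classes):

library(xgboost)

# sketch: a regularized boosted tree ensemble trained directly with xgboost
dtrain <- xgb.DMatrix(data = as.matrix(train_data[, colnames(train_data) != "classes"]),
                      label = as.numeric(train_data$classes == "malignant"))

xgb_sketch <- xgboost(data = dtrain,
                      objective = "binary:logistic",
                      nrounds = 50,
                      max_depth = 3,
                      eta = 0.3,   # learning rate / shrinkage
                      lambda = 1,  # L2 regularization of leaf weights
                      alpha = 0,   # L1 regularization of leaf weights
                      verbose = 0)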

set.seed(42)
model_xgb <- caret::train(classes ~ .,
                          data = train_data,
                          method = "xgbTree",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))


  • Feature Importance
importance <- varImp(model_xgb, scale = TRUE)
plot(importance)


  • predicting test data
confusionMatrix(predict(model_xgb, test_data), test_data$classes)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       132         2
##   malignant      5        70
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9322, 0.9864)
##     No Information Rate : 0.6555          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9266          
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9635          
##             Specificity : 0.9722          
##          Pos Pred Value : 0.9851          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.6555          
##          Detection Rate : 0.6316          
##    Detection Prevalence : 0.6411          
##       Balanced Accuracy : 0.9679          
##                                           
##        'Positive' Class : benign          
## 
results <- data.frame(actual = test_data$classes,
                      predict(model_xgb, test_data, type = "prob"))

results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))

results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)


Feature Selection

Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the whole modeling process on the training data alone!

  • Correlation

Correlations between all features are calculated and visualised with the corrplot package. I then remove all features with a correlation higher than 0.7, keeping from each correlated pair the feature with the lower mean absolute correlation.

library(corrplot)

# calculate correlation matrix
corMatMy <- cor(train_data[, -1])
corrplot(corMatMy, order = "hclust")

#Apply correlation filter at 0.70,
highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 2  and column  3 with corr  0.899 
##   Means:  0.696 vs 0.575 so flagging column 2 
## Compare row 3  and column  7 with corr  0.736 
##   Means:  0.654 vs 0.55 so flagging column 3 
## All correlations <= 0.7
# which variables are flagged for removal?
highlyCor
## [1] "uniformity_of_cell_size"  "uniformity_of_cell_shape"
#then we remove these variables
train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]


  • Recursive Feature Elimination (RFE)

Another way to choose features is Recursive Feature Elimination (RFE). Here, RFE uses a Random Forest to test feature subsets of different sizes and rate each with a cross-validated accuracy score. The subset with the highest score is usually preferable (a small sketch of how to inspect these scores follows after the code below).

set.seed(7)
results_rfe <- rfe(x = train_data[, -1], 
                   y = train_data$classes, 
                   sizes = c(1:9), 
                   rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10))
# chosen features
predictors(results_rfe)
## [1] "bare_nuclei"                 "uniformity_of_cell_size"    
## [3] "clump_thickness"             "uniformity_of_cell_shape"   
## [5] "bland_chromatin"             "marginal_adhesion"          
## [7] "normal_nucleoli"             "single_epithelial_cell_size"
## [9] "mitosis"
train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]
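
As a minimal sketch, the accuracy scores per subset size mentioned above can be inspected directly (the results element and the lattice-based plot method are part of caret's rfe() output):

# cross-validated accuracy for each tested number of features
results_rfe$results

# the same as a plot
plot(results_rfe, type = c("g", "o"))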


  • Genetic Algorithm (GA)

The Genetic Algorithm (GA) was developed based on evolutionary principles of natural selection: it aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on its genotype. The fittest individuals are then chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-)combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

set.seed(27)
model_ga <- gafs(x = train_data[, -1], 
                 y = train_data$classes,
                 iters = 10, # generations of algorithm
                 popSize = 10, # population size for each generation
                 levels = c("malignant", "benign"),
                 gafsControl = gafsControl(functions = rfGA, # Assess fitness with RF
                                           method = "cv",    # 10 fold cross validation
                                           genParallel = TRUE, # Use parallel programming
                                           allowParallel = TRUE))
plot(model_ga) # Plot mean fitness (AUC) by generation

train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]


Grid search with caret

  • Automatic Grid
set.seed(42)
model_rf_tune_auto <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  search = "random"),
                         tuneLength = 15)
model_rf_tune_auto
## Random Forest 
## 
## 490 samples
##   9 predictor
##   2 classes: 'benign', 'malignant' 
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.9692153  0.9323624
##   2     0.9704277  0.9350498
##   5     0.9645085  0.9216721
##   6     0.9639087  0.9201998
##   7     0.9632842  0.9186919
##   8     0.9626719  0.9172257
##   9     0.9636801  0.9195036
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
plot(model_rf_tune_auto)


  • Manual Grid

  • mtry: Number of variables randomly sampled as candidates at each split.

set.seed(42)
grid <- expand.grid(mtry = c(1:10))

model_rf_tune_man <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  search = "random"),
                         tuneGrid = grid)
model_rf_tune_man
## Random Forest 
## 
## 490 samples
##   9 predictor
##   2 classes: 'benign', 'malignant' 
## 
## Pre-processing: scaled (9), centered (9) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 442, 441, 441, 441, 441, 441, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.9696153  0.9332392
##    2    0.9706440  0.9354737
##    3    0.9696194  0.9330647
##    4    0.9661495  0.9253163
##    5    0.9649252  0.9225586
##    6    0.9653209  0.9233806
##    7    0.9634881  0.9192265
##    8    0.9624718  0.9169227
##    9    0.9641005  0.9203072
##   10    0.9628760  0.9176675
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
plot(model_rf_tune_man)


Grid search with h2o

The R package h2o provides a convenient interface to H2O, an open-source machine learning and deep learning platform. H2O offers distributed implementations of a wide range of common machine learning algorithms for classification, regression and deep learning.

library(h2o)
h2o.init(nthreads = -1)
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.out
##     C:\Users\s_glan02\AppData\Local\Temp\RtmpwDqf33/h2o_s_glan02_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 815 milliseconds 
##     H2O cluster version:        3.10.3.6 
##     H2O cluster version age:    1 month and 10 days  
##     H2O cluster name:           H2O_started_from_R_s_glan02_tvy462 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.54 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.3 (2017-03-06)
bc_data_hf <- as.h2o(bc_data)
h2o.describe(bc_data_hf) %>%
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", 
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% 
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
    geom_point(size = 4, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    facet_grid(group ~ ., scales = "free") +
    labs(x = "Feature",
         y = "Value",
         color = "")

library(reshape2) # for melting

bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1])

cor <- h2o.cor(bc_data_hf)
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) + 
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", 
         y = "")


Training, validation and test data

splits <- h2o.splitFrame(bc_data_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "classes"
features <- setdiff(colnames(train), response)
summary(train$classes, exact_quantiles = TRUE)
##  classes       
##  benign   :317 
##  malignant:174
summary(valid$classes, exact_quantiles = TRUE)
##  classes      
##  benign   :71 
##  malignant:35
summary(test$classes, exact_quantiles = TRUE)
##  classes      
##  benign   :70 
##  malignant:32
pca <- h2o.prcomp(training_frame = train,
           x = features,
           validation_frame = valid,
           transform = "NORMALIZE",
           impute_missing = TRUE,
           k = 3,
           seed = 42)
eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features

library(ggrepel)
ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel()


Classification

Random Forest
hyper_params <- list(
                     ntrees = c(25, 50, 75, 100),
                     max_depth = c(10, 20, 30),
                     min_rows = c(1, 3, 5)
                     )

search_criteria <- list(
                        strategy = "RandomDiscrete", 
                        max_models = 50,
                        max_runtime_secs = 360,
                        stopping_rounds = 5,          
                        stopping_metric = "AUC",      
                        stopping_tolerance = 0.0005,
                        seed = 42
                        )
rf_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest, 
                                                # alternatively h2o.gbm 
                                                # for Gradient boosting trees
                    x = features,
                    y = response,
                    grid_id = "rf_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,                           
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
                    )
# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss")

for (sort_by_1 in sort_options_1) {
  
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  h2o.saveModel(best_model, path="models", force = TRUE)
  
}


# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  h2o.saveModel(best_model, path = "models", force = TRUE)
  
}
files <- list.files(path = "models")
rf_models <- files[grep("rf_grid_model", files)]

for (model_id in rf_models) {
  
  path <- paste0("U:\\Github_blog\\Webinar\\Webinar_ML_for_disease\\models\\", model_id)
  best_model <- h2o.loadModel(path)
  mse_auc_test <- data.frame(model_id = model_id, 
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))
  
  if (model_id == rf_models[[1]]) {
    
    mse_auc_test_comb <- mse_auc_test
    
  } else {
    
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
    
  }
}

mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = model_id, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")

for (model_id in rf_models) {
  
  best_model <- h2o.getModel(model_id)
  
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, 
                                                   nrow(test)),
                                    actual = as.vector(test$classes), 
                                    as.data.frame(h2o.predict(object = best_model, 
                                                              newdata = test)))
  
  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == 
                                           finalRf_predictions$predict, 
                                         "yes", "no")
  
  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, 
                                                  "benign", 
                                                  ifelse(finalRf_predictions$malignant 
                                                         > 0.8, "malignant", "uncertain"))
  
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == 
                                                     finalRf_predictions$predict_stringent, "yes", 
                                         ifelse(finalRf_predictions$predict_stringent == 
                                                  "uncertain", "na", "no"))
  
  if (model_id == rf_models[[1]]) {
    
    finalRf_predictions_comb <- finalRf_predictions
    
  } else {
    
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
    
  }
}
finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap(~ model_id, ncol = 3) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap(~ model_id, ncol = 3) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

rf_model <- h2o.loadModel("models/rf_grid_model_6")
h2o.varimp_plot(rf_model)

#h2o.varimp(rf_model)
h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE)
##       train       valid        xval 
## 0.024674571 0.007042254 0.023097284
h2o.confusionMatrix(rf_model, valid = TRUE)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.293125896751881:
##           benign malignant    Error    Rate
## benign        70         1 0.014085   =1/71
## malignant      0        35 0.000000   =0/35
## Totals        70        36 0.009434  =1/106
plot(rf_model,
     timestep = "number_of_trees",
     metric = "classification_error")

plot(rf_model,
     timestep = "number_of_trees",
     metric = "logloss")

plot(rf_model,
     timestep = "number_of_trees",
     metric = "AUC")

plot(rf_model,
     timestep = "number_of_trees",
     metric = "rmse")

h2o.auc(rf_model, train = TRUE)
## [1] 0.989521
h2o.auc(rf_model, valid = TRUE)
## [1] 0.9995976
h2o.auc(rf_model, xval = TRUE)
## [1] 0.9890496
perf <- h2o.performance(rf_model, test)
perf
## H2OBinomialMetrics: drf
## 
## MSE:  0.03673598
## RMSE:  0.1916663
## LogLoss:  0.1158835
## Mean Per-Class Error:  0.0625
## AUC:  0.990625
## Gini:  0.98125
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           benign malignant    Error    Rate
## benign        70         0 0.000000   =0/70
## malignant      4        28 0.125000   =4/32
## Totals        74        28 0.039216  =4/102
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.735027 0.933333  25
## 2                       max f2  0.294222 0.952381  37
## 3                 max f0point5  0.735027 0.972222  25
## 4                 max accuracy  0.735027 0.960784  25
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.294222 1.000000  37
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.735027 0.909782  25
## 9   max min_per_class_accuracy  0.424524 0.937500  31
## 10 max mean_per_class_accuracy  0.294222 0.942857  37
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
plot(perf)

h2o.logloss(perf)
## [1] 0.1158835
h2o.mse(perf)
## [1] 0.03673598
h2o.auc(perf)
## [1] 0.990625
head(h2o.metric(perf))
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  1.000000 0.171429 0.114504 0.340909 0.715686  1.000000 0.093750
## 2  0.998333 0.222222 0.151515 0.416667 0.725490  1.000000 0.125000
## 3  0.998000 0.270270 0.187970 0.480769 0.735294  1.000000 0.156250
## 4  0.997222 0.315789 0.223881 0.535714 0.745098  1.000000 0.187500
## 5  0.996210 0.358974 0.259259 0.583333 0.754902  1.000000 0.218750
## 6  0.994048 0.400000 0.294118 0.625000 0.764706  1.000000 0.250000
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.257464               0.093750                0.546875
## 2    1.000000     0.298807               0.125000                0.562500
## 3    1.000000     0.335794               0.156250                0.578125
## 4    1.000000     0.369755               0.187500                0.593750
## 5    1.000000     0.401478               0.218750                0.609375
## 6    1.000000     0.431474               0.250000                0.625000
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1  70  29   0   3 1.000000 0.906250 0.000000 0.093750   0
## 2  70  28   0   4 1.000000 0.875000 0.000000 0.125000   1
## 3  70  27   0   5 1.000000 0.843750 0.000000 0.156250   2
## 4  70  26   0   6 1.000000 0.812500 0.000000 0.187500   3
## 5  70  25   0   7 1.000000 0.781250 0.000000 0.218750   4
## 6  70  24   0   8 1.000000 0.750000 0.000000 0.250000   5
finalRf_predictions <- data.frame(actual = as.vector(test$classes), 
                                  as.data.frame(h2o.predict(object = rf_model, 
                                                            newdata = test)))
finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == 
                                         finalRf_predictions$predict, "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", 
                                                ifelse(finalRf_predictions$malignant 
                                                       > 0.8, "malignant", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == 
                                                   finalRf_predictions$predict_stringent, "yes", 
                                       ifelse(finalRf_predictions$predict_stringent == 
                                                "uncertain", "na", "no"))

finalRf_predictions %>%
  group_by(actual, predict) %>%
  dplyr::summarise(n = n())
## Source: local data frame [3 x 3]
## Groups: actual [?]
## 
##      actual   predict     n
##      <fctr>    <fctr> <int>
## 1    benign    benign    62
## 2    benign malignant     8
## 3 malignant malignant    32
finalRf_predictions %>%
  group_by(actual, predict_stringent) %>%
  dplyr::summarise(n = n())
## Source: local data frame [4 x 3]
## Groups: actual [?]
## 
##      actual predict_stringent     n
##      <fctr>             <chr> <int>
## 1    benign            benign    61
## 2    benign         uncertain     9
## 3 malignant         malignant    26
## 4 malignant         uncertain     6
finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")

finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")

df <- finalRf_predictions[, c(1, 3, 4)]

thresholds <- seq(from = 0, to = 1, by = 0.1)

prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA)

for (threshold in thresholds) {
  pred <- ifelse(df$benign > threshold, "benign", "malignant")
  pred_t <- ifelse(pred == df$actual, TRUE, FALSE)
  
  group <- data.frame(df, "pred" = pred_t) %>%
  group_by(actual, pred) %>%
  dplyr::summarise(n = n())
  
  group_b <- filter(group, actual == "benign")
  
  prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n)
  prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b
  
  group_m <- filter(group, actual == "malignant")
  
  prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n)
  prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m
}

prop_table %>%
  gather(x, y, prop_true_b:prop_true_m) %>%
  ggplot(aes(x = threshold, y = y, color = x)) +
    geom_point() +
    geom_line() +
    scale_color_brewer(palette = "Set1") +
    labs(y = "proportion of true predictions",
         color = "b: benign cases\nm: malignant cases")

h2o.shutdown()

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.



sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggrepel_0.6.5        reshape2_1.4.2       h2o_3.10.3.6        
##  [4] corrplot_0.77        plyr_1.8.4           xgboost_0.6-4       
##  [7] randomForest_4.6-12  dplyr_0.5.0          caret_6.0-73        
## [10] lattice_0.20-35      doParallel_1.0.10    iterators_1.0.8     
## [13] foreach_1.4.3        tidyr_0.6.1          pcaGoPromoter_1.18.0
## [16] Biostrings_2.42.1    XVector_0.14.0       IRanges_2.8.1       
## [19] S4Vectors_0.12.1     BiocGenerics_0.20.0  ellipse_0.3-8       
## [22] ggplot2_2.2.1.9000  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10         class_7.3-14         assertthat_0.1      
##  [4] rprojroot_1.2        digest_0.6.12        R6_2.2.0            
##  [7] backports_1.0.5      MatrixModels_0.4-1   RSQLite_1.1-2       
## [10] evaluate_0.10        e1071_1.6-8          zlibbioc_1.20.0     
## [13] lazyeval_0.2.0       minqa_1.2.4          data.table_1.10.4   
## [16] SparseM_1.76         car_2.1-4            nloptr_1.0.4        
## [19] Matrix_1.2-8         rmarkdown_1.4        labeling_0.3        
## [22] splines_3.3.3        lme4_1.1-12          stringr_1.2.0       
## [25] RCurl_1.95-4.8       munsell_0.4.3        mgcv_1.8-17         
## [28] htmltools_0.3.5      nnet_7.3-12          tibble_1.2          
## [31] codetools_0.2-15     MASS_7.3-45          bitops_1.0-6        
## [34] ModelMetrics_1.1.0   grid_3.3.3           nlme_3.1-131        
## [37] jsonlite_1.3         gtable_0.2.0         DBI_0.6             
## [40] magrittr_1.5         scales_0.4.1         stringi_1.1.3       
## [43] RColorBrewer_1.1-2   tools_3.3.3          Biobase_2.34.0      
## [46] pbkrtest_0.4-7       yaml_2.1.14          AnnotationDbi_1.36.2
## [49] colorspace_1.3-2     memoise_1.0.0        knitr_1.15.1        
## [52] quantreg_5.29