Social Network Analysis and Topic Modeling of codecentric’s Twitter friends and followers

2017-07-28 00:00:00 +0000

I have written the following post about Social Network Analysis and Topic Modeling of codecentric’s Twitter friends and followers for codecentric’s blog:

Recently, Matthias Radtke has written a very nice blog post on Topic Modeling of the codecentric Blog Articles, where he is giving a comprehensive introduction to Topic Modeling. In this article I am showing a real-world example of how we can use Data Science to gain insights from text data and social network analysis.

I am using publicly available Twitter data to characterize codecentric’s friends and followers: identifying the most “influential” followers, using text analysis tools like sentiment analysis to characterize their interests from their user descriptions, performing Social Network Analysis on friends, followers and a subset of second-degree connections to identify key players who will be able to pass on information to a wide reach of other users, and combining this network analysis with topic modeling to identify meta-groups with similar interests.

Knowing the interests and social network positions of our followers allows us to identify key users who are likely to retweet posts that fall within their range of interests and who will reach a wide audience.

Continue reading at https://blog.codecentric.de/en/2017/07/combining-social-network-analysis-topic-modeling-characterize-codecentrics-twitter-friends-followers/

The entire analysis has been done in R 3.4.0 and you can find my code on Github.

How to do Optical Character Recognition (OCR) of non-English documents in R using Tesseract?

2017-07-17 00:00:00 +0000

One of the many great packages of rOpenSci has implemented the open source engine Tesseract.

Optical character recognition (OCR) is used to digitize written or typed documents, i.e. photos or scans of text documents are “translated” into a digital text on your computer.

While this might seem like a trivial task at first glance, that is only because it is so easy for our human brains: when reading text, we make use of a built-in word and sentence “autocomplete” that we learned from experience. For a computer, however, recognizing typed words correctly is really quite difficult, especially if the document is of low quality.

One of the best open-source engines today is Tesseract. You can run tesseract from the command-line or - with the help of rOpenSci’s tesseract package - run it conveniently from within R!

Tesseract uses language specific training data to optimize OCR based on learned context. Therefore, it is much better at recognizing words in coherent sentences than at recognizing single words or abbreviations (we can see this e.g. with address lines in documents).

The default language is English, and you can find numerous examples on how to run OCR with this default. But running tesseract with a different language turned out to need a few additional tweaks, which I want to present here.


Installing tesseract

Check out the package vignette for instructions on how to install the libtesseract C++ library and the tesseract R package on your computer.

library(tesseract)

Because I’m German and will therefore use a German scanned document as an example, I first needed to install the German training dataset. Even though I installed it together with tesseract (the command-line tool), I still had to install it again for use with R:
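A call along the following lines should do it (a sketch, assuming your version of the tesseract package provides the tesseract_download() helper; "deu" is the language code for German):

# download the German ("deu") training data for use from within R
tesseract_download("deu")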

If you happen to get an error saying that tesseract still didn’t find the correct language data where it expected it (as I did), note the path given in the error message. This tells you where tesseract looks for the training data, and you can simply download the training data manually and copy it to that path.

Here again, you need to make sure that you download training data for the correct version of tesseract. The link that is given in the package documentation turned out to point to a different version of tesseract than I was using. If this is the case, you will get a warning (Params model::Incomplete line) when running ocr().

As you can see, I had Tesseract version 3 installed:

system("tesseract --version")
tesseract 3.05.01
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.29 : libtiff 4.0.8 : zlib 1.2.8

So, I also needed to download the training data for version 3 (the default is for version 4), which you can find here, and copy it to tesseract’s language folder.
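A manual copy could look like this (a sketch; both paths are hypothetical and need to be adjusted to your download location and to the folder reported in the error message):

tessdata_dir <- "/usr/local/share/tessdata"           # hypothetical: use the folder from your error message
file.copy("~/Downloads/deu.traineddata",              # hypothetical location of the manually downloaded file
          file.path(tessdata_dir, "deu.traineddata"))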


Image processing

Image quality is essential for good OCR! Tesseract performs different image processing steps internally with the Leptonica library but it is still a good idea to improve the image manually before running tesseract.

Here, I am using two random images from the internet:

  1. a manual for a printer and

  2. a description of a game

Image nr. 1 is machine-generated and should therefore yield much better results than image nr. 2, which looks like it was written on a typewriter and is, moreover, skewed.

This, we can also do from within R, making use of another one of rOpenSci’s packages: magick.

However, it seems that magick doesn’t have a function that converts an image to black and white, so I am using EBImage first. Because magick wants a magick image object for input, I saved the black-and-white image first and then read it in again with image_read(). This isn’t exactly an elegant solution, so if someone knows of a better way to do this, please let me know!

library(EBImage) # install from Bioconductor!

# convert the first scan to grayscale and save the black-and-white version
color.image <- readImage("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan.png")
bw.image <- channel(color.image, "gray")
writeImage(bw.image, file = "/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_bw.png")

# same for the second scan
color.image <- readImage("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2.jpg")
bw.image <- channel(color.image, "gray")
writeImage(bw.image, file = "/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2_bw.jpg")

# read the black-and-white images back in as magick objects
library(magick)
image1 <- image_read("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_bw.png")
image2 <- image_read("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2_bw.jpg")

Now, we can follow Tesseract’s guidelines on how to improve image quality:

library(tidyverse)
image_bearb1 <- image1 %>%
    image_scale("x2000") %>%                        # rescale
    image_background("white", flatten = TRUE) %>%   # set background to white
    image_trim() %>%                                # Trim edges that are the background color from the image.
    image_noise() %>%                               # Reduce noise in image using a noise peak elimination filter
    image_enhance() %>%                             # Enhance image (minimize noise)
    image_normalize() %>%                           # Normalize image (increase contrast by normalizing the pixel values to span the full range of color values).
    image_contrast(sharpen = 1) %>%                 # increase contrast
    image_deskew(threshold = 40)                    # deskew image -> creates negative offset in some scans
image_bearb2 <- image2 %>%
    image_scale("x2000") %>%                        # rescale
    image_background("white", flatten = TRUE) %>%   # set background to white
    image_trim() %>%                                # Trim edges that are the background color from the image.
    image_noise() %>%                               # Reduce noise in image using a noise peak elimination filter
    image_enhance() %>%                             # Enhance image (minimize noise)
    image_normalize() %>%                           # Normalize image (increase contrast by normalizing the pixel values to span the full range of color values).
    image_contrast(sharpen = 1) %>%                 # increase contrast
    image_deskew(threshold = 40)                    # deskew image -> creates negative offset in some scans

If we want to inspect our images from within R, we can use magick’s image_browse() function.

image_browse(image_bearb2)


OCR

Tesseract’s central function is called ocr().

For some reason, I couldn’t resolve the error message I got when pointing the ocr() function directly at the processed images: Magick: TIFF: negative image positions unsupported. This error results from image_deskew(), but the potential solutions don’t seem to be implemented in the R package of ImageMagick, so I had to resort to a work-around: I saved the images first and then pointed ocr() at the saved files. This worked without a hitch!

image_write(image_bearb1, path = "/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_bearb.png", format = "png")
image_write(image_bearb2, path = "/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2_bearb.jpg", format = "jpg")

When we are not working with an English document, we can specify the language with the engine = tesseract(language = "deu") option. We can also set a whole range of parameters, but for now, I will only show how to use a whitelist: by specifying characters in a whitelist, tesseract will only look for these in the document. You can e.g. use this if you only expect numbers in your document. Here I am not being very restrictive, though…

whitelist <- "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@ß€!$%&/()=?+"

text1 <- ocr("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_bearb.png",
             engine = tesseract(language = "deu",
                                options = list(tessedit_char_whitelist = whitelist)))

text2 <- ocr("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2_bearb.jpg",
             engine = tesseract(language = "deu",
                                options = list(tessedit_char_whitelist = whitelist)))
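As an aside, if you really did only expect numbers, a much more restrictive engine could be set up like this (just a sketch, not used below):

numbers_only <- tesseract(language = "deu",
                          options = list(tessedit_char_whitelist = "0123456789"))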


Performance evaluation

Evaluating how well our model performed isn’t trivial either. While we as humans can very easily judge the accuracy of our digitized text just by looking at it, we need a standardized and automated way to evaluate model performance.

One way would be to manually transcribe your document and check it against the text that tesseract extracted. However, this is very labor-intensive and time-consuming, especially when you’re working with a large number of documents!

Another, slightly less arduous way would be to create a list of relevant words that we want to have recognized. This could be adapted to include any list of given words that are important for you to be recognized (if you wanted to be able to search for them among your documents, for example).
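Such a check could be as simple as the following sketch, which uses a hypothetical list of terms of interest:

relevant_words <- c("drucker", "papier", "scanner")   # hypothetical words we want to find
found <- sapply(relevant_words, grepl, x = tolower(text1), fixed = TRUE)
mean(found)                                           # proportion of relevant words found in the OCR output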

Or, as I am going to do here, we can go a much easier route and compare the words that tesseract recognized against a list of words from a dictionary, to get an idea of how many of them are actual words. This is of course not without problems: some words may not be in our dictionary, they may be borrowed from another language, or they might be names of people or places, addresses, abbreviations, etc. Also, because we don’t have a benchmark for our document, we will only get an estimate of how many words were correctly recognized. With this method, we can really only compare different tesseract models against each other.

Because we use different grammatical forms of words, we also want to stem both our recognized words and the dictionary before comparing them!

Here, I am using the German Open Thesaurus.

To separate the text and the dictionary into distinct words, I am using the tidytext package together with SnowballC for word stemming. Finally, I am going to remove duplicate entries.

library(tidytext)
library(SnowballC)

openthes <- data.frame(words = read_lines("/Users/shiringlander/Documents/Projekte/OCR_Provinzial/German_dict/OpenThesaurus-Textversion/openthesaurus.txt", skip = 18)) %>%
  mutate(words = as.character(words)) %>%
  unnest_tokens(word, words) %>%
  mutate(word = wordStem(word, language = "deu")) %>%
  distinct()

This leaves us with 77453 German reference words.

Now, we can do the same with our tesseract output.

text_1_df <- data.frame(text = read.delim(textConnection(text1),    # make text into dataframe
                                          header = FALSE, 
                                          sep = "\n", 
                                          strip.white = TRUE)) %>%
  mutate(text = as.character(V1)) %>%
  unnest_tokens(word, text) %>%                                      # separate words
  mutate(word = wordStem(word, language = "deu"))                    # use word stem

text_2_df <- data.frame(text = read.delim(textConnection(text2),    # make text into dataframe
                                          header = FALSE, 
                                          sep = "\n", 
                                          strip.white = TRUE)) %>%
  mutate(text = as.character(V1)) %>%
  unnest_tokens(word, text) %>%                                      # separate words
  mutate(word = wordStem(word, language = "deu"))                    # use word stem

To get an estimate about model performance, I am then counting how many of the words that were recognized in each document are also part of the German dictionary (and therefore likely to be real words) and plot the results.

res1 <- text_1_df %>%
    mutate(in_dict = ifelse(word %in% openthes$word, TRUE, FALSE)) %>%
    count(in_dict) %>%
    mutate(percent = n / nrow(text_1_df) * 100,
           image = "image 1")

res2 <- text_2_df %>%
    mutate(in_dict = ifelse(word %in% openthes$word, TRUE, FALSE)) %>%
    count(in_dict) %>%
    mutate(percent = n / nrow(text_2_df) * 100,
           image = "image 2")

Between 80 and 90% of recognized words seem to be actual words!

Of course, the less “standard” a document is, the more difficult it will be to extract text correctly but I’d say this isn’t bad for a first attempt!



sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2    SnowballC_0.5.1 tidytext_0.1.3  dplyr_0.7.1    
##  [5] purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3   
##  [9] ggplot2_2.2.1   tidyverse_1.1.1 magick_0.4      EBImage_4.18.0 
## [13] tesseract_1.4  
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1      reshape2_1.4.2      haven_1.1.0        
##  [4] lattice_0.20-35     colorspace_1.3-2    htmltools_0.3.6    
##  [7] yaml_2.1.14         rlang_0.1.1         foreign_0.8-69     
## [10] glue_1.1.1          RColorBrewer_1.1-2  BiocGenerics_0.22.0
## [13] modelr_0.1.0        readxl_1.0.0        jpeg_0.1-8         
## [16] bindr_0.1           plyr_1.8.4          stringr_1.2.0      
## [19] cellranger_1.1.0    munsell_0.4.3       gtable_0.2.0       
## [22] rvest_0.3.2         psych_1.7.5         evaluate_0.10.1    
## [25] labeling_0.3        knitr_1.16          forcats_0.2.0      
## [28] parallel_3.4.0      tokenizers_0.1.4    broom_0.4.2        
## [31] Rcpp_0.12.11        backports_1.1.0     scales_0.4.1       
## [34] jsonlite_1.5        abind_1.4-5         mnormt_1.5-5       
## [37] hms_0.3             png_0.1-7           digest_0.6.12      
## [40] stringi_1.1.5       tiff_0.1-5          grid_3.4.0         
## [43] rprojroot_1.2       tools_3.4.0         magrittr_1.5       
## [46] lazyeval_0.2.0      janeaustenr_0.1.5   pkgconfig_2.0.1    
## [49] Matrix_1.2-10       xml2_1.1.1          lubridate_1.6.0    
## [52] assertthat_0.2.0    rmarkdown_1.6       httr_1.2.1         
## [55] R6_2.2.2            fftwtools_0.9-8     nlme_3.1-131       
## [58] compiler_3.4.0

Characterizing Twitter followers with tidytext

2017-06-28 00:00:00 +0000

Lately, I have been more and more taken with tidy principles of data analysis. They are elegant and make analyses clearer and easier to comprehend. Following the tidyverse and ggraph, I have been quite intrigued by applying tidy principles to text analysis with Julia Silge and David Robinson’s tidytext.

In this post, I will explore tidytext with an analysis of my Twitter followers’ descriptions to try and learn more about the people who are interested in my tweets, which are mainly about Data Science and Machine Learning.

Resources I found useful for this analysis were http://www.rdatamining.com/docs/twitter-analysis-with-r and http://tidytextmining.com/tidytext.html


Retrieving Twitter data

I am using twitteR to retrieve data from Twitter (I have also tried rtweet but for some reason, my API key, secret and token (that worked with twitteR) resulted in a “failed to authorize” error with rtweet’s functions).

library(twitteR)

Once we have set up our Twitter REST API, we get the necessary information to authenticate our access.

consumerKey = "INSERT KEY HERE"
consumerSecret = "INSERT SECRET KEY HERE"
accessToken = "INSERT TOKEN HERE"
accessSecret = "INSERT SECRET TOKEN HERE"
options(httr_oauth_cache = TRUE)

setup_twitter_oauth(consumer_key = consumerKey, 
                    consumer_secret = consumerSecret, 
                    access_token = accessToken, 
                    access_secret = accessSecret)


Now, we can access information from Twitter, like timeline tweets, user timelines, mentions, tweets & retweets, followers, etc.

All the following datasets were retrieved on June 7th 2017, converted to a data frame for tidy analysis and saved for later use:

  • the last 3200 tweets on my timeline
my_name <- userTimeline("ShirinGlander", n = 3200, includeRts=TRUE)
my_name_df <- twListToDF(my_name)
save(my_name_df, file = "my_name.RData")
  • my last 3200 mentions and retweets
my_mentions <- mentions(n = 3200)
my_mentions_df <- twListToDF(my_mentions)
save(my_mentions_df, file = "my_mentions.RData")

my_retweets <- retweetsOfMe(n = 3200)
my_retweets_df <- twListToDF(my_retweets)
save(my_retweets_df, file = "my_retweets.RData")
  • the last 3200 tweets to me
tweetstome <- searchTwitter("@ShirinGlander", n = 3200)
tweetstome_df <- twListToDF(tweetstome)
save(tweetstome_df, file = "tweetstome.RData")
  • my friends and followers
user <- getUser("ShirinGlander")

friends <- user$getFriends() # who I follow
friends_df <- twListToDF(friends)
save(friends_df, file = "my_friends.RData")

followers <- user$getFollowers() # my followers
followers_df <- twListToDF(followers)
save(followers_df, file = "my_followers.RData")


Analyzing friends and followers

In this post, I will have a look at my friends and followers.

load("my_friends.RData")
load("my_followers.RData")

I am going to use packages from the tidyverse (tidyquant for plotting).

library(tidyverse)
library(tidyquant)

  • Number of friends (who I follow on Twitter): 225

  • Number of followers (who follows me on Twitter): 324

  • Number of friends who are also followers: 97 (how these counts can be computed is sketched below)
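A minimal sketch of how these counts can be derived from the two data frames (assuming the screenName column identifies each account):

nrow(friends_df)                                                     # number of friends
nrow(followers_df)                                                   # number of followers
length(intersect(friends_df$screenName, followers_df$screenName))    # friends who are also followers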


What languages do my followers speak?

One of the columns describing my followers is which language they have set for their Twitter account. Not surprisingly, English is by far the most predominant language of my followers, followed by German, Spanish and French.

followers_df %>%
  count(lang) %>%
  droplevels() %>%
  ggplot(aes(x = reorder(lang, desc(n)), y = n)) +
    geom_bar(stat = "identity", color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "language ISO 639-1 code",
         y = "number of followers")


Who are my most “influential” followers (i.e. followers with the biggest network)?

I also have information about the number of followers that each of my followers have (2nd degree followers). Most of my followers are followed by up to ~ 1000 people, while only a few have a very large network.

followers_df %>%
  ggplot(aes(x = log2(followersCount))) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq() +
    labs(x = "log2 of number of followers",
         y = "density")


How active are my followers (i.e. how often do they tweet)?

The followers data frame also tells me how many statuses (i.e. tweets) each of my followers has. To make the numbers comparable, I am normalizing them by the number of days that they have had their accounts to calculate the average number of tweets per day.

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  ggplot(aes(x = log2(statusesCount_pDay))) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq()


Who are my followers with the biggest network and who tweet the most?

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  select(screenName, followersCount, statusesCount_pDay) %>%
  arrange(desc(followersCount)) %>%
  top_n(10)
##         screenName followersCount statusesCount_pDay
## 1        dr_morton         150937           71.35193
## 2    Scientists4EU          66117           17.64389
## 3       dr_morton_          63467           46.57763
## 4   NewScienceWrld          60092           54.65874
## 5     RubenRabines          42286           25.99592
## 6  machinelearnbot          27427          204.67061
## 7  BecomingDataSci          16807           25.24069
## 8       joelgombin           6566           21.24094
## 9    renato_umeton           1998           19.58387
## 10 FranPatogenLoco            311           28.92593
followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  select(screenName, followersCount, statusesCount_pDay) %>%
  arrange(desc(statusesCount_pDay)) %>%
  top_n(10)
##         screenName followersCount statusesCount_pDay
## 1  machinelearnbot          27427          204.67061
## 2        dr_morton         150937           71.35193
## 3   NewScienceWrld          60092           54.65874
## 4       dr_morton_          63467           46.57763
## 5  FranPatogenLoco            311           28.92593
## 6     RubenRabines          42286           25.99592
## 7  BecomingDataSci          16807           25.24069
## 8       joelgombin           6566           21.24094
## 9    renato_umeton           1998           19.58387
## 10   Scientists4EU          66117           17.64389


Is there a correlation between number of followers and number of tweets?

Indeed, there seems to be a correlation: users with many followers also tend to tweet more often.

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  ggplot(aes(x = followersCount, y = statusesCount_pDay, color = days)) +
    geom_smooth(method = "lm") +
    geom_point() +
    scale_color_continuous(low = palette_light()[1], high = palette_light()[2]) +
    theme_tq()
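To put a number on this relationship, we could also compute a correlation coefficient directly (a sketch; I am using a rank-based correlation here because both variables are heavily skewed, and the resulting value is not shown):

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  summarise(correlation = cor(followersCount, statusesCount_pDay, method = "spearman"))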


Tidy text analysis

Next, I want to know more about my followers by analyzing their Twitter descriptions with the tidytext package.

library(tidytext)
library(SnowballC)

To prepare the data, I am going to unnest the words (or tokens) in the user descriptions, convert them to their word stems, and remove stop words and URLs.

data(stop_words)

tidy_descr <- followers_df %>%
  unnest_tokens(word, description) %>%
  mutate(word_stem = wordStem(word)) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl("\\.|http", word))


What are the most commonly used words in my followers’ descriptions?

The first question I want to ask is what words are most common in my followers’ descriptions.

Not surprisingly, the most common word is “data”. I do tweet mostly about data-related topics, so it makes sense that my followers are mostly like-minded. The rest is also related to data science, machine learning and R.

tidy_descr %>%
  count(word_stem, sort = TRUE) %>%
  filter(n > 20) %>%
  ggplot(aes(x = reorder(word_stem, n), y = n)) +
    geom_col(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "count of word stem in all followers' descriptions")


This, we can also show with a word cloud.

library(wordcloud)
library(tm)
tidy_descr %>%
  count(word_stem) %>%
  mutate(word_stem = removeNumbers(word_stem)) %>%
  with(wordcloud(word_stem, n, max.words = 100, colors = palette_light()))


Instead of looking for the most common words, we can also look for the most common ngrams: here, for the most common word pairs (bigrams) in my followers’ descriptions.

tidy_descr_ngrams <- followers_df %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  filter(!grepl("\\.|http", bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts <- tidy_descr_ngrams %>%
  count(word1, word2, sort = TRUE)
bigram_counts %>%
  filter(n > 10) %>%
  ggplot(aes(x = reorder(word1, -n), y = reorder(word2, -n), fill = n)) +
    geom_tile(alpha = 0.8, color = "white") +
    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
    coord_flip() +
    theme_tq() +
    theme(legend.position = "right") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "first word in pair",
         y = "second word in pair")


These, we can also show as a graph:

library(igraph)
library(ggraph)
bigram_graph <- bigram_counts %>%
  filter(n > 5) %>%
  graph_from_data_frame()

set.seed(1)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color =  palette_light()[1], size = 5, alpha = 0.8) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 0.5) +
  theme_void()


We can also use bigram analysis to identify negated meanings (this will become relevant for sentiment analysis later). So, let’s look at which words are preceded by “not” or “no”.

bigrams_separated <- followers_df %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  filter(!grepl("\\.|http", bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not" | word1 == "no") %>%
  filter(!word2 %in% stop_words$word)

not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE) %>%
  ungroup()
not_words %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * score, fill = n * score > 0)) +
    geom_col(show.legend = FALSE) +
    scale_fill_manual(values = palette_light()) +
    labs(x = "",
         y = "Sentiment score * number of occurrences",
         title = "Words preceded by \"not\"") +
    coord_flip() +
    theme_tq()


What’s the predominant sentiment in my followers’ descriptions?

For sentiment analysis, I will exclude the words with a negated meaning from nrc and switch their positive and negative meanings from bing (although in this case, there was only one negated word, “endorsement”, so it won’t make a real difference).

tidy_descr_sentiment <- tidy_descr %>%
  left_join(select(bigrams_separated, word1, word2), by = c("word" = "word2")) %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  rename(nrc = sentiment.x, bing = sentiment.y) %>%
  mutate(nrc = ifelse(!is.na(word1), NA, nrc),
         bing = ifelse(!is.na(word1) & bing == "positive", "negative", 
                       ifelse(!is.na(word1) & bing == "negative", "positive", bing)))
tidy_descr_sentiment %>%
  filter(nrc != "positive") %>%
  filter(nrc != "negative") %>%
  gather(x, y, nrc, bing) %>%
  count(x, y, sort = TRUE) %>%
  filter(n > 10) %>%
  ggplot(aes(x = reorder(y, n), y = n)) +
    facet_wrap(~ x, scales = "free") +
    geom_col(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "count of sentiment in followers' descriptions")


Are followers’ descriptions mostly positive or negative?

The majority of my followers have predominantly positive descriptions.

tidy_descr_sentiment %>%
  count(screenName, word, bing) %>%
  group_by(screenName, bing) %>%
  summarise(sum = sum(n)) %>%
  spread(bing, sum, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(x = sentiment)) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq()


What are the most common positive and negative words in followers’ descriptions?

library(reshape2)
tidy_descr_sentiment %>%
  count(word, bing, sort = TRUE) %>%
  acast(word ~ bing, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = palette_light()[1:2],
                   max.words = 100)


Topic modeling: are there groups of followers with specific interests?

Topic modeling can be used to categorize words into groups. Here, we can use it to see whether (some) of my followers can be grouped into subgroups according to their descriptions.

library(topicmodels)
dtm_words_count <- tidy_descr %>%
  mutate(word_stem = removeNumbers(word_stem)) %>%
  count(screenName, word_stem, sort = TRUE) %>%
  ungroup() %>%
  filter(word_stem != "") %>%
  cast_dtm(screenName, word_stem, n)

# set a seed so that the output of the model is predictable
dtm_lda <- LDA(dtm_words_count, k = 5, control = list(seed = 1234))

topics_beta <- tidy(dtm_lda, matrix = "beta")
p1 <- topics_beta %>%
  filter(grepl("[a-z]+", term)) %>% # some words are Chinese, etc. I don't want these because ggplot doesn't plot them correctly
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, color = factor(topic), fill = factor(topic))) +
    geom_col(show.legend = FALSE, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    facet_wrap(~ topic, ncol = 5) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "beta (~ occurrence in topics 1-5)",
         title = "The top 10 most characteristic words describe topic categories.")
user_topic <- tidy(dtm_lda, matrix = "gamma") %>%
  arrange(desc(gamma)) %>%
  group_by(document) %>%
  top_n(1, gamma)
p2 <- user_topic %>%
  group_by(topic) %>%
  top_n(10, gamma) %>%
  ggplot(aes(x = reorder(document, -gamma), y = gamma, color = factor(topic))) +
    facet_wrap(~ topic, scales = "free", ncol = 5) +
    geom_point(show.legend = FALSE, size = 4, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    coord_flip() +
    labs(x = "",
         y = "gamma\n(~ affiliation with topics 1-5)")
library(grid)
library(gridExtra)
grid.arrange(p1, p2, ncol = 1, heights = c(0.7, 0.3))

The upper of the two plots above shows the words that were most strongly grouped into the five topics. The lower plot shows my followers with the strongest affiliation with each of these five topics.

Because in my tweets I only cover a relatively narrow range of topics (i.e. related to data), my followers are not very diverse in terms of their descriptions and the five topics are not very distinct.

If you find yourself in any of the topics, let me know if you agree with the topic that was modeled for you!


For more text analysis, see my post about text mining and sentiment analysis of a Stuff You Should Know Podcast.



sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gridExtra_2.2.1               topicmodels_0.2-6            
##  [3] reshape2_1.4.2                ggraph_1.0.0                 
##  [5] igraph_1.0.1                  tm_0.7-1                     
##  [7] NLP_0.1-10                    wordcloud_2.5                
##  [9] RColorBrewer_1.1-2            SnowballC_0.5.1              
## [11] tidytext_0.1.3                bindrcpp_0.2                 
## [13] tidyquant_0.5.1               quantmod_0.4-10              
## [15] TTR_0.23-1                    PerformanceAnalytics_1.4.3541
## [17] xts_0.9-7                     zoo_1.8-0                    
## [19] lubridate_1.6.0               dplyr_0.7.0                  
## [21] purrr_0.2.2.2                 readr_1.1.1                  
## [23] tidyr_0.6.3                   tibble_1.3.3                 
## [25] ggplot2_2.2.1                 tidyverse_1.1.1              
## [27] twitteR_1.1.9                
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.4.0     httr_1.2.1        viridisLite_0.2.0
##  [4] bit64_0.9-7       jsonlite_1.5      modelr_0.1.0     
##  [7] assertthat_0.2.0  stats4_3.4.0      cellranger_1.1.0 
## [10] ggrepel_0.6.5     yaml_2.1.14       slam_0.1-40      
## [13] backports_1.1.0   lattice_0.20-35   glue_1.1.0       
## [16] digest_0.6.12     rvest_0.3.2       colorspace_1.3-2 
## [19] htmltools_0.3.6   Matrix_1.2-10     plyr_1.8.4       
## [22] psych_1.7.5       pkgconfig_2.0.1   broom_0.4.2      
## [25] haven_1.0.0       scales_0.4.1      tweenr_0.1.5     
## [28] ggforce_0.1.1     lazyeval_0.2.0    mnormt_1.5-5     
## [31] magrittr_1.5      readxl_1.0.0      evaluate_0.10    
## [34] tokenizers_0.1.4  janeaustenr_0.1.5 nlme_3.1-131     
## [37] MASS_7.3-47       forcats_0.2.0     xml2_1.1.1       
## [40] foreign_0.8-68    tools_3.4.0       hms_0.3          
## [43] stringr_1.2.0     munsell_0.4.3     compiler_3.4.0   
## [46] rlang_0.1.1       units_0.4-5       rjson_0.2.15     
## [49] labeling_0.3      rmarkdown_1.6     gtable_0.2.0     
## [52] DBI_0.7           R6_2.2.2          knitr_1.16       
## [55] udunits2_0.13     bit_1.1-12        bindr_0.1        
## [58] rprojroot_1.2     Quandl_2.8.0      modeltools_0.2-21
## [61] stringi_1.1.5     parallel_3.4.0    Rcpp_0.12.11

Data Science for Business - Time Series Forecasting Part 3: Forecasting with Facebook's Prophet

2017-06-13 00:00:00 +0000

In my last two posts (Part 1 and Part 2), I explored time series forecasting with the timekit package.

In this post, I want to compare how Facebook’s prophet performs on the same dataset.


Predicting future events/sales/etc. isn’t trivial for a number of reasons and different algorithms use different approaches to handle these problems. Time series data does not behave like a regular numeric vector, because months don’t have the same number of days, weekends and holidays differ between years, etc. Because of this, we often have to deal with multiple layers of seasonality (i.e. weekly, monthly, yearly, irregular holidays, etc.). Regularly missing days, like weekends, are easier to incorporate into time series models than irregularly missing days.


Timekit uses a time series signature for modeling, which we used as features to build our model of choice (e.g. a linear model). This model was then used for predicting future dates.

Prophet is Facebook’s time series forecasting algorithm that was just recently released as open source software with an implementation in R.

“Prophet is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers.”

(I am not going to discuss forecast and ARIMA or other models because they are quite well established with lots and lots of excellent tutorials out there.)


Training and Test data

I am using the same training and test intervals as in my last post using timekit.

Just as with timekit, prophet starts with a data frame that consists of a date column and the respective response variable for each date.

library(prophet)

library(tidyverse)
library(tidyquant)
retail_p_day <- retail_p_day %>%
  mutate(model = ifelse(day <= "2011-11-01", "train", "test"))

train <- filter(retail_p_day, model == "train") %>%
  select(day, sum_income) %>%
  rename(ds = day,
         y = sum_income)

test <- filter(retail_p_day, model == "test") %>%
  select(day, sum_income) %>%
  rename(ds = day)


Model building

In contrast to timekit, we do not “manually” augment the time series signature in prophet; we can directly feed our input data to the prophet() function (check the function help for details on optional parameters).

To make it comparable, I am feeding the same list of irregularly missing days to the prophet() function. As discussed in the last post, I chose not to use a list of holidays because the holidays in the observation period poorly matched the days that were actually missing.

off_days <- data.frame(ds = as.Date(c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27", "2010-12-28", 
                                      "2010-12-29", "2010-12-30", "2010-01-01", "2010-01-02", "2010-01-03",
                                      "2011-04-22", "2011-04-23", "2011-04-24", "2011-04-25", "2011-05-02", 
                                      "2011-05-30", "2011-08-29", "2011-04-29", "2011-04-30"))) %>%
  mutate(holiday = paste0("off_day_", seq_along(ds)))
prophet_model_test <- prophet(train, 
                              growth = "linear", # growth curve trend
                              n.changepoints = 100, # Prophet automatically detects changes in trends by selecting changepoints from the data
                              yearly.seasonality = FALSE, # yearly seasonal component using Fourier series
                              weekly.seasonality = TRUE, # weekly seasonal component using dummy variables
                              holidays = off_days) 
## Initial log joint probability = -8.3297
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance


Predicting test data

With our model, we can now predict on the test data and compare the predictions with the actual values.

forecast_test <- predict(prophet_model_test, test)


Just as with timekit, I want to have a look at the residuals. Compared to timekit, the residuals actually look almost identical…

forecast_test %>%
  mutate(resid = sum_income - yhat) %>%
  ggplot(aes(x = ds, y = resid)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()


… As does the comparison plot. So, here it seems that prophet built a model that is basically identical to the linear model I used with timekit.

forecast_test %>%
  gather(x, y, sum_income, yhat) %>%
  ggplot(aes(x = ds, y = y, color = x)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()


Predicting future sales

Now, let’s see whether the future predictions will be identical as well.

Just like with timekit, I am using a future time series of 300 days. Here, we see a slight difference in how we generate the future time series: with timekit I could use the entire index of observed dates, together with the list of missing days, while prophet uses the forecasting model that was generated for comparing the test data, i.e. it only considers the dates from the training set. We could build a new model with the entire dataset, but this would then be different from how I approached the modeling with timekit.

future <- make_future_dataframe(prophet_model_test, periods = 300)
forecast <- predict(prophet_model_test, future)
plot(prophet_model_test, forecast) +
    theme_tq()

Interestingly, prophet’s forecast is distinctly different from timekit’s, despite identical performance on test samples! While timekit predicted a drop at the beginning of the year (similar to the training period), prophet predicts a steady increase in the future. It looks like timekit put more weight on the overall pattern during the training period, while prophet seems to put more weight on the last months, which showed a rise in net income.



sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidyquant_0.5.1               quantmod_0.4-8               
##  [3] TTR_0.23-1                    PerformanceAnalytics_1.4.3541
##  [5] xts_0.9-7                     zoo_1.8-0                    
##  [7] lubridate_1.6.0               dplyr_0.5.0                  
##  [9] purrr_0.2.2.2                 readr_1.1.1                  
## [11] tidyr_0.6.3                   tibble_1.3.1                 
## [13] ggplot2_2.2.1                 tidyverse_1.1.1              
## [15] prophet_0.1.1                 Rcpp_0.12.11                 
## 
## loaded via a namespace (and not attached):
##  [1] rstan_2.15.1         reshape2_1.4.2       haven_1.0.0         
##  [4] lattice_0.20-35      colorspace_1.3-2     htmltools_0.3.6     
##  [7] stats4_3.4.0         yaml_2.1.14          rlang_0.1.1         
## [10] foreign_0.8-68       DBI_0.6-1            modelr_0.1.0        
## [13] readxl_1.0.0         plyr_1.8.4           stringr_1.2.0       
## [16] Quandl_2.8.0         munsell_0.4.3        gtable_0.2.0        
## [19] cellranger_1.1.0     rvest_0.3.2          codetools_0.2-15    
## [22] psych_1.7.5          evaluate_0.10        labeling_0.3        
## [25] inline_0.3.14        knitr_1.16           forcats_0.2.0       
## [28] parallel_3.4.0       broom_0.4.2          scales_0.4.1        
## [31] backports_1.0.5      StanHeaders_2.15.0-1 jsonlite_1.5        
## [34] gridExtra_2.2.1      mnormt_1.5-5         hms_0.3             
## [37] digest_0.6.12        stringi_1.1.5        grid_3.4.0          
## [40] rprojroot_1.2        tools_3.4.0          magrittr_1.5        
## [43] lazyeval_0.2.0       xml2_1.1.1           extraDistr_1.8.5    
## [46] assertthat_0.2.0     rmarkdown_1.5        httr_1.2.1          
## [49] R6_2.2.1             nlme_3.1-131         compiler_3.4.0

Data Science for Business - Time Series Forecasting Part 2: Forecasting with timekit

2017-06-09 00:00:00 +0000

In my last post, I prepared and visually explored time series data.

Now, I will use this data to test the timekit package for time series forecasting with machine learning.


Forecasting

In time series forecasting, we use models to predict future time points based on past observations.

As mentioned in timekit’s vignette, “as with most machine learning applications, the prediction is only as good as the patterns in the data. Forecasting using this approach may not be suitable when patterns are not present or when the future is highly uncertain (i.e. past is not a suitable predictor of future performance).”

And while this is certainly true, we don’t always have data with a strong regular pattern. And, I would argue, data that has very obvious patterns doesn’t need a complicated model to generate forecasts - we can already guess the future curve just by looking at it. So, if we think of use cases for businesses that want to predict e.g. product sales, forecasting models are especially relevant in cases where we can’t make predictions manually or based on experience.

The packages I am using are timekit for forecasting, tidyverse for data wrangling and visualization, caret for additional modeling functions, tidyquant for its ggplot theme, broom and modelr for (tidy) modeling.

library(tidyverse)
library(caret)

library(tidyquant)
library(broom)
library(timekit)
library(modelr)
options(na.action = na.warn)


Training and test data

My input data is the tibble retail_p_day, that was created in my last post.

I am splitting this dataset into training (all data points before/on Nov. 1st 2011) and test samples (all data points after Nov. 1st 2011).

retail_p_day <- retail_p_day %>%
  mutate(model = ifelse(day <= "2011-11-01", "train", "test"))

colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))] <- paste0("P_", colnames(retail_p_day)[grep("^[0-9]+", colnames(retail_p_day))])

Here, I am testing out timekit’s functions with the net income per day as response variable. Because the time series in our data set is relatively short and doesn’t cover multiple years, this forecast will only be able to capture recurring variation in days and weeks. Variations like increased sales before holidays, etc. would need additional data from several years to be accurately forecast.

As we can see in the plot below, the net income shows variation between days.

retail_p_day %>%
  ggplot(aes(x = day, y = sum_income, color = model)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()


Augmenting the time series signature

With timekit, we can do forecasting with only a time series signature (a series of dates and times) and a corresponding response variable. If we had additional features that could be forecast independently, we could also introduce these into the model, but here, I will only work with the minimal data set.

A central function of timekit is tk_augment_timeseries_signature(), which adds a number of features based on the properties of our time series signature:

  • index: The index value that was decomposed
  • index.num: The numeric value of the index in seconds. The base is “1970-01-01 00:00:00”.
  • diff: The difference in seconds from the previous numeric index value.
  • year: The year component of the index.
  • half: The half component of the index.
  • quarter: The quarter component of the index.
  • month: The month component of the index with base 1.
  • month.xts: The month component of the index with base 0, which is what xts implements.
  • month.lbl: The month label as an ordered factor beginning with January and ending with December.
  • day: The day component of the index.
  • hour: The hour component of the index.
  • minute: The minute component of the index.
  • second: The second component of the index.
  • hour12: The hour component on a 12 hour scale.
  • am.pm: Morning (AM) = 1, Afternoon (PM) = 2.
  • wday: The day of the week with base 1. Sunday = 1 and Saturday = 7.
  • wday.xts: The day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
  • wday.lbl: The day of the week label as an ordered factor beginning with Sunday and ending with Saturday.
  • mday: The day of the month.
  • qday: The day of the quarter.
  • yday: The day of the year.
  • mweek: The week of the month.
  • week: The week number of the year (Sunday start).
  • week.iso: The ISO week number of the year (Monday start).
  • week2: The modulus for bi-weekly frequency.
  • week3: The modulus for tri-weekly frequency.
  • week4: The modulus for quad-weekly frequency.
  • mday7: The integer division of day of the month by seven, which returns the first, second, third, … instance the day has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1. The second has mday7 = 2.

Because the first value of diff is missing, I am removing this row. We also need to keep in mind that we have an irregular time series: we never have data on Saturdays. This will affect the modeling and results, and we need to account for it later on! Alternatively, it might make sense to compare the results when setting all NAs/Saturdays to 0, assuming that no information means that there was no income on a given day. Or we could impute missing values. Which strategy most accurately represents your data needs to be decided based on a good understanding of the business and how the data was collected.
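The zero-filling alternative could be sketched like this (not used in the remainder of this post; it assumes the day and sum_income columns of retail_p_day from above):

retail_p_day_zerofill <- retail_p_day %>%
  complete(day = seq(min(day), max(day), by = "day")) %>%   # add rows for all missing dates
  replace_na(list(sum_income = 0))                          # assume no record means no income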

retail_p_day_aug <- retail_p_day %>%
  rename(date = day) %>%
  select(model, date, sum_income) %>% 
  tk_augment_timeseries_signature() %>%
  select(-contains("month"))
  
retail_p_day_aug <- retail_p_day_aug[complete.cases(retail_p_day_aug), ]


Preprocessing

Not all of these augmented features will be informative for our model. For example, we don’t have information about time of day, so features like hour, minute, second, etc. will be irrelevant here.

Let’s look at column variation for all numeric features and remove those with a variance of 0.

library(matrixStats)

(var <- data.frame(colnames = colnames(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]),
           colvars = colVars(as.matrix(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]))) %>%
  filter(colvars == 0))
##   colnames colvars
## 1     hour       0
## 2   minute       0
## 3   second       0
## 4   hour12       0
## 5    am.pm       0
retail_p_day_aug <- select(retail_p_day_aug, -one_of(as.character(var$colnames)))

Next, we want to remove highly correlated features. By plotting them, we can get an idea about which cutoff to set.

library(ggcorrplot)

cor <- cor(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])
p.cor <- cor_pmat(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])

ggcorrplot(cor,  type = "upper", outline.col = "white", hc.order = TRUE, p.mat = p.cor,
           colors = c(palette_light()[1], "white", palette_light()[2]))

I am going to choose a cutoff of 0.9 for removing features:

cor_cut <- findCorrelation(cor, cutoff=0.9) 
retail_p_day_aug <- select(retail_p_day_aug, -one_of(colnames(cor)[cor_cut]))


Now, I can split the data into training and test sets:

train <- filter(retail_p_day_aug, model == "train") %>%
  select(-model)
test <- filter(retail_p_day_aug, model == "test")


Modeling

To model the time series of the response variable sum_income, I am using a generalized linear model. We could try all kinds of different models and modeling parameters, but for this test I am keeping it simple.

fit_lm <- glm(sum_income ~ ., data = train)


We can examine our model, e.g. by visualizing its coefficient estimates and residuals:

tidy(fit_lm) %>%
  gather(x, y, estimate:p.value) %>%
  ggplot(aes(x = term, y = y, color = x, fill = x)) +
    facet_wrap(~ x, scales = "free", ncol = 4) +
    geom_bar(stat = "identity", alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

augment(fit_lm) %>%
  ggplot(aes(x = date, y = .resid)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()


With this model, we can now add predictions and residuals for the test data…

pred_test <- test %>%
  add_predictions(fit_lm, "pred_lm") %>%
  add_residuals(fit_lm, "resid_lm")

… and visualize the residuals.

pred_test %>%
    ggplot(aes(x = date, y = resid_lm)) +
    geom_hline(yintercept = 0, color = "red") +
    geom_point(alpha = 0.5, color = palette_light()[[1]]) +
    geom_smooth() +
    theme_tq()

We can also compare the predicted with the actual sum income in the test set.

pred_test %>%
  gather(x, y, sum_income, pred_lm) %>%
  ggplot(aes(x = date, y = y, color = x)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()


Forecasting

Once we are satisfied with our model’s performance on the test set, we can use it to forecast future events. To create future time points for modeling, we need to extract the time index (the date column day in our data frame).

# Extract index
idx <- retail_p_day %>%
    tk_index()

From this index we can generate the future time series.

Here, we need to beware of a couple of things. Most importantly, we need to account for the irregularity of our data: We never have data for Saturdays and we have a few random missing values in between, as can be seen in the diff column of retail_p_day_aug (1 day difference == 86400 seconds).

retail_p_day_aug %>%
  ggplot(aes(x = date, y = diff)) +
    geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

What dates are these? Let’s filter for dates that follow a gap of more than one day since the last recorded day and that are not Sundays (Saturdays are always off-days).

retail_p_day_aug %>%
  select(date, wday.lbl, diff) %>%
  filter(wday.lbl != "Sunday" & diff > 86400) %>%
  mutate(days_missing = diff / 86400 -1)
## # A tibble: 5 x 4
##         date wday.lbl    diff days_missing
##       <date>    <ord>   <int>        <dbl>
## 1 2011-01-04  Tuesday 1036800           11
## 2 2011-04-26  Tuesday  432000            4
## 3 2011-05-03  Tuesday  172800            1
## 4 2011-05-31  Tuesday  172800            1
## 5 2011-08-30  Tuesday  172800            1
retail_p_day_aug %>%
  select(date, wday.lbl, diff) %>%
  filter(wday.lbl == "Sunday" & diff > 172800) %>%
  mutate(days_missing = diff / 86400 -1)
## # A tibble: 1 x 4
##         date wday.lbl   diff days_missing
##       <date>    <ord>  <int>        <dbl>
## 1 2011-05-01   Sunday 259200            2

Let’s create a list of all missing days:

off_days <- c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27", "2010-12-28", "2010-12-29", "2010-12-30", "2010-01-01", "2010-01-02", "2010-01-03",
              "2011-04-22", "2011-04-23", "2011-04-24", "2011-04-25", "2011-05-02", "2011-05-30", "2011-08-29", "2011-04-29", "2011-04-30") %>% 
  ymd()

Official UK holidays during that time were:

  • 2011:
      • Boxing Day December 26
      • Christmas Day Holiday December 27

  • 2012:
      • New Year’s Day Holiday January 2
      • Good Friday April 6
      • Easter Monday April 9
      • Early May Bank Holiday May 7
      • Spring Bank Holiday June 4
      • Diamond Jubilee Holiday June 5
      • Summer Bank Holiday August 27


We can account for the missing Saturdays with inspect_weekdays = TRUE.

Ideally, we would now use skip_values and insert_values to specifically account for days with irregular missing data in our future time series, e.g. by accounting for holidays. Generally, it is very difficult to account for holidays because they don’t follow an easy-to-model rule (e.g. Easter is on the first Sunday after the first full moon in spring). Unfortunately, in our dataset we have seen that holidays and randomly missing days did not have a big overlap in the past.

Because not all holidays are missing days and we have more missing days than official holidays, I am using the list of missing days for skipping values - even though this is only a best-guess approach and likely not going to match all days that will be missing in reality during the future time series.

idx_future <- idx %>%
    tk_make_future_timeseries(n_future = 300, inspect_weekdays = TRUE, inspect_months = FALSE, skip_values = off_days)
idx_future %>%
    tk_get_timeseries_signature() %>%
    ggplot(aes(x = index, y = diff)) +
    geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
    geom_line(alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq()

Then, we can build the data frame for forecasting by using tk_get_timeseries_signature() and renaming the index column to date, so that it matches the features in the model. With this data frame, we can now predict future values and add this to the data frame.

data_future <- idx_future %>%
    tk_get_timeseries_signature() %>%
    rename(date = index)

pred_future <- predict(fit_lm, newdata = data_future)

pred_future <- data_future %>%
    select(date) %>%
    add_column(sum_income = pred_future)
retail_p_day %>%
  select(day, sum_income) %>%
  rename(date = day) %>%
  rbind(pred_future) %>%
  ggplot(aes(x = date, y = sum_income)) +
    scale_x_date() +
    geom_vline(xintercept = as.numeric(max(retail_p_day$day)), color = "red", size = 1) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    theme_tq()


When we evaluate the forecast, we also want to convey its uncertainty, e.g. by using the standard deviation of the test residuals to construct prediction intervals.

test_residuals <- pred_test$resid_lm
test_resid_sd <- sd(test_residuals, na.rm = TRUE)

pred_future <- pred_future %>%
    mutate(
        lo.95 = sum_income - 1.96 * test_resid_sd,
        lo.80 = sum_income - 1.28 * test_resid_sd,
        hi.80 = sum_income + 1.28 * test_resid_sd,
        hi.95 = sum_income + 1.96 * test_resid_sd
        )
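
The multipliers 1.96 and 1.28 are the quantiles of the standard normal distribution that correspond to two-sided 95% and 80% intervals, assuming the test residuals are approximately normally distributed (a quick check):

# ~1.96 standard deviations cover the central 95% of a normal distribution,
# ~1.28 standard deviations cover the central 80%
qnorm(0.975)
qnorm(0.90)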

This, we can then plot to show the forecast with confidence intervals:

retail_p_day %>%
  select(day, sum_income) %>%
  rename(date = day) %>%
  ggplot(aes(x = date, y = sum_income)) +
    geom_point(alpha = 0.5) +
    geom_line(alpha = 0.5) +
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = pred_future, 
                fill = "#D5DBFF", color = NA, size = 0) +
    geom_ribbon(aes(ymin = lo.80, ymax = hi.80), data = pred_future,
                fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
    geom_point(aes(x = date, y = sum_income), data = pred_future,
               alpha = 0.5, color = palette_light()[[2]]) +
    geom_smooth(aes(x = date, y = sum_income), data = pred_future,
                method = 'loess', color = "white") +
    theme_tq()

Our model predicts that income will follow a curve very similar to last year’s, with a drop after Christmas and an increase towards the later months of the year. In and of itself, this sounds reasonable. However, because we only have data from one year, we do not know whether the decline in January/February and the increase towards Christmas are an annually recurring pattern, or whether the increase we see at the end of 2011 is independent of seasonality and will continue to rise in the future.

Next time, I’ll compare how Facebook’s prophet will predict the future income.



sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggcorrplot_0.1.1              matrixStats_0.52.2           
##  [3] modelr_0.1.0                  timekit_0.3.0                
##  [5] broom_0.4.2                   tidyquant_0.5.1              
##  [7] quantmod_0.4-8                TTR_0.23-1                   
##  [9] PerformanceAnalytics_1.4.3541 xts_0.9-7                    
## [11] zoo_1.8-0                     lubridate_1.6.0              
## [13] caret_6.0-76                  lattice_0.20-35              
## [15] dplyr_0.5.0                   purrr_0.2.2.2                
## [17] readr_1.1.1                   tidyr_0.6.3                  
## [19] tibble_1.3.1                  ggplot2_2.2.1                
## [21] tidyverse_1.1.1              
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.2.1         jsonlite_1.5       splines_3.4.0     
##  [4] foreach_1.4.3      assertthat_0.2.0   stats4_3.4.0      
##  [7] cellranger_1.1.0   yaml_2.1.14        backports_1.0.5   
## [10] quantreg_5.33      digest_0.6.12      rvest_0.3.2       
## [13] minqa_1.2.4        colorspace_1.3-2   htmltools_0.3.6   
## [16] Matrix_1.2-10      plyr_1.8.4         psych_1.7.5       
## [19] SparseM_1.77       haven_1.0.0        padr_0.3.0        
## [22] scales_0.4.1       lme4_1.1-13        MatrixModels_0.4-1
## [25] mgcv_1.8-17        car_2.1-4          nnet_7.3-12       
## [28] lazyeval_0.2.0     pbkrtest_0.4-7     mnormt_1.5-5      
## [31] magrittr_1.5       readxl_1.0.0       evaluate_0.10     
## [34] nlme_3.1-131       MASS_7.3-47        forcats_0.2.0     
## [37] xml2_1.1.1         foreign_0.8-68     tools_3.4.0       
## [40] hms_0.3            stringr_1.2.0      munsell_0.4.3     
## [43] compiler_3.4.0     rlang_0.1.1        grid_3.4.0        
## [46] nloptr_1.0.4       iterators_1.0.8    labeling_0.3      
## [49] rmarkdown_1.5      gtable_0.2.0       ModelMetrics_1.1.0
## [52] codetools_0.2-15   DBI_0.6-1          reshape2_1.4.2    
## [55] R6_2.2.1           knitr_1.16         rprojroot_1.2     
## [58] Quandl_2.8.0       stringi_1.1.5      parallel_3.4.0    
## [61] Rcpp_0.12.11

Data Science for Business - Time Series Forecasting Part 1: EDA & Data Preparation

2017-05-28 00:00:00 +0000

Data Science is a fairly broad term and encompasses a wide range of techniques from data visualization to statistics and machine learning models. But the techniques are only tools in a - sometimes very messy - toolbox. And while it is important to know and understand these tools, here, I want to go at it from a different angle: What is the task at hand that data science tools can help tackle, and what question do we want to have answered?

A straightforward business problem is to estimate future sales and future income. Based on past experience, i.e. data from past sales, data science can help improve forecasts and generate models that describe the main factors of influence. This, in turn, can then be used to develop actions based on what we have learned, like where to increase advertisement, how much of which products to keep in stock, etc.


Data preparation

While it isn’t the most exciting aspect of data science (and is therefore often neglected in favor of fancy modeling techniques), getting the data into the right format and extracting meaningful features is arguably THE most essential part of any analysis!

Therefore, I have chosen to dedicate an entire article to this part and will discuss modeling and time series forecasting in separate blog posts.


Many of the formal concepts I am using when dealing with data in a tidy way come from Hadley Wickham & Garrett Grolemund’s “R for Data Science”.

The central package is tidyverse, which contains tidyr, dplyr, ggplot2, etc. Other packages I am using are tidyquant for its nice ggplot theme, modelr, and gridExtra and grid for additional plotting functionality.

library(tidyverse)
library(tidyquant)
library(modelr)
library(gridExtra)
library(grid)


The data

I am again using a dataset from UC Irvine’s machine learning repository (converted to csv from xlsx).

From the dataset description:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).


Reading in the data

A fast way to read in data in csv format is to use readr’s read_csv() function. With a small dataset like this, it makes sense to specifically define what format each column should have (e.g. integers, character, etc.). In our case, this is particularly convenient for defining the date/time column to be read in correctly with col_datetime().

The original data contains the following features (description from UC Irvine’s machine learning repository):

  • InvoiceNo: Invoice number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
  • StockCode: Product (item) code uniquely assigned to each distinct product.
  • Description: Product (item) name.
  • Quantity: The quantities of each product (item) per transaction.
  • InvoiceDate: Invoice Date and time, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Product price per unit in sterling.
  • CustomerID: Customer number uniquely assigned to each customer.
  • Country: Country name. The name of the country where each customer resides.

Because read_csv() generates a tibble (a specific data frame class) and plays nicely with the rest of the tidyverse, we can pipe its output directly into mutate() to create a few additional columns:

  • day: Invoice date, the day when each transaction was generated.
  • day_of_week: The day of the week on which each transaction was generated.
  • time: Invoice time, the time when each transaction was generated.
  • month: The month when each transaction was generated.
  • income: The amount of income generated from each transaction (Quantity * UnitPrice), negative in case of returns
  • income_return: Description of whether a transaction generated income or loss (i.e. purchase or return)
retail <- read_csv("OnlineRetail.csv",
                   col_types = cols(
                      InvoiceNo = col_character(),
                      StockCode = col_character(),
                      Description = col_character(),
                      Quantity = col_integer(),
                      InvoiceDate = col_datetime("%m/%d/%Y %H:%M"),
                      UnitPrice = col_double(),
                      CustomerID = col_integer(),
                      Country = col_character()
                      )) %>%
  mutate(day = parse_date(format(InvoiceDate, "%Y-%m-%d")),
         day_of_week = wday(day, label = TRUE),
         time = parse_time(format(InvoiceDate, "%H:%M")),
         month = format(InvoiceDate, "%m"),
         income = Quantity * UnitPrice,
         income_return = ifelse(Quantity > 0, "income", "return"))


Exploratory Data Analysis (EDA)

In order to decide which features to use in the final dataset for modeling, we want to get a feel for our data, and a good way to do this is by creating different visualizations. It also helps with assessing your models later on, because the closer you are acquainted with the data’s properties, the better you’ll be able to pick up on things that might have gone wrong in your analysis (think of it as a kind of sanity check for your data).


Transactions by country

The online retailer is UK-based, but its customers come from all over the world. However, the plots below tell us very quickly that the main customer base is from the UK, followed by Germany and France.

p1 <- retail %>%
  filter(Country == "United Kingdom") %>%
  ggplot(aes(x = Country, fill = income_return)) +
    geom_bar(alpha = 0.8) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    guides(fill = FALSE) +
    labs(x = "")

p2 <- retail %>%
  filter(Country != "United Kingdom") %>%
  ggplot(aes(x = Country, fill = income_return)) +
    geom_bar(alpha = 0.8) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(legend.position = "right") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "",
         fill = "")

grid.arrange(p1, p2, widths = c(0.2, 0.8))


Transactions over time

To get an idea of the number of transactions over time, we can use a frequency polygon. Here, we can see that the number of purchases slightly increased during the last two months of recording, while the number of returns remained relatively stable.

retail %>%
  ggplot(aes(x = day, color = income_return)) +
    facet_grid(income_return ~ ., scales = "free") +
    geom_freqpoly(bins = 100, size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    guides(color = FALSE) +
    labs(title = "Number of purchases/returns over time",
         x = "")


Because the number of returns is much smaller than the number of purchases, it is difficult to visualize and compare them in the same plot. While above, I split them into two facets with free scales, we can also compare the density of values. From this plot, we can more easily see the relationship between purchases and returns over time: except for the last month, the proportion of both remained relatively stable.

retail %>%
  ggplot(aes(x = day, y = ..density.., color = income_return)) +
    geom_freqpoly(size = 1, alpha = 0.8, bins = 100) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(title = "Density of purchases/returns over time",
         x = "",
         color = "")


Income/loss from transactions

Let’s look at the income/loss from transactions over time. Here, we plot the sum of income and losses for each day. The income seems to increase slightly during the last month, while losses remained more stable. The only severe outlier is the last day.

retail %>%
  group_by(day, income_return) %>%
  summarise(sum_income = sum(income)) %>%
  ggplot(aes(x = day, y = sum_income, color = income_return)) +
    facet_grid(income_return ~ ., scales = "free") +
    geom_ref_line(h = 0, colour = "grey") +
    geom_line(size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    guides(color = FALSE) +
    labs(title = "Income/loss from transactions per day",
         x = "",
         y = "sum of income/losses",
         color = "")


We can also look at the sum of income and losses according to time of day of the transaction. Not surprisingly, transactions happen mostly during business hours.

retail %>%
  group_by(time, income_return) %>%
  summarise(sum_income = sum(income)) %>%
  ggplot(aes(x = time, y = sum_income, color = income_return)) +
    facet_grid(income_return ~ ., scales = "free") +
    geom_ref_line(h = 0, colour = "grey") +
    geom_line(size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    guides(color = FALSE) +
    labs(title = "Income from purchases/returns per time of day",
         x = "time of day",
         y = "sum of income/losses",
         color = "")

Here, we again see the two extreme outliers. Let’s look at them in the dataset. This purchase of 80995 paper craft birdies might have been a mistake, because we can see that the same customer who bought them at 09:15 cancelled the order only 15 minutes later and didn’t re-order a smaller quantity afterwards.

retail %>%
  filter(day == "2011-12-09") %>%
  arrange(-Quantity) %>%
  .[1:3, ]
## # A tibble: 3 x 14
##   InvoiceNo StockCode                         Description Quantity
##       <chr>     <chr>                               <chr>    <int>
## 1    581483     23843         PAPER CRAFT , LITTLE BIRDIE    80995
## 2    581476     16008 SMALL FOLDING SCISSOR(POINTED EDGE)      240
## 3    581476     22693  GROW A FLYTRAP OR SUNFLOWER IN TIN      192
## # ... with 10 more variables: InvoiceDate <dttm>, UnitPrice <dbl>,
## #   CustomerID <int>, Country <chr>, day <date>, day_of_week <ord>,
## #   time <time>, month <chr>, income <dbl>, income_return <chr>
retail %>%
  filter(day == "2011-12-09") %>%
  arrange(Quantity) %>%
  .[1:3, ]
## # A tibble: 3 x 14
##   InvoiceNo StockCode                     Description Quantity
##       <chr>     <chr>                           <chr>    <int>
## 1   C581484     23843     PAPER CRAFT , LITTLE BIRDIE   -80995
## 2   C581490     22178 VICTORIAN GLASS HANGING T-LIGHT      -12
## 3   C581490     23144 ZINC T-LIGHT HOLDER STARS SMALL      -11
## # ... with 10 more variables: InvoiceDate <dttm>, UnitPrice <dbl>,
## #   CustomerID <int>, Country <chr>, day <date>, day_of_week <ord>,
## #   time <time>, month <chr>, income <dbl>, income_return <chr>
retail %>%
  filter(CustomerID == 16446)
## # A tibble: 4 x 14
##   InvoiceNo StockCode                 Description Quantity
##       <chr>     <chr>                       <chr>    <int>
## 1    553573     22980      PANTRY SCRUBBING BRUSH        1
## 2    553573     22982         PANTRY PASTRY BRUSH        1
## 3    581483     23843 PAPER CRAFT , LITTLE BIRDIE    80995
## 4   C581484     23843 PAPER CRAFT , LITTLE BIRDIE   -80995
## # ... with 10 more variables: InvoiceDate <dttm>, UnitPrice <dbl>,
## #   CustomerID <int>, Country <chr>, day <date>, day_of_week <ord>,
## #   time <time>, month <chr>, income <dbl>, income_return <chr>


Transactions by day and time

The last plot told us that in general, transactions were done during business hours. We can look at this in even more detail by comparing the day and time of transactions in a 2D-bin-plot where the tile colors indicate transaction numbers.

retail %>%
  ggplot(aes(x = time, y = day)) +
    stat_bin2d(alpha = 0.8, bins = 25, color = "white") +
    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
    theme_tq() +
    theme(legend.position = "right") +
    labs(title = "Purchases/returns per day and time")


Net income

The net income can be plotted in a similar way, e.g. by comparing the month and day of the month of the transactions in a tile plot:

retail %>%
  mutate(day2 = format(InvoiceDate, "%d")) %>%
  group_by(month, day2) %>%
  summarise(sum_income = sum(income)) %>%
  ggplot(aes(x = month, y = day2, fill = sum_income)) +
    geom_tile(alpha = 0.8, color = "white") +
    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
    theme_tq() +
    theme(legend.position = "right") +
    labs(title = "Net income per month and day",
         y = "day of the month",
         fill = "net sum of income")


Items

Also of interest are the items that are being purchased or returned. Here, we sum up the net quantities for each item.

retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  arrange(-sum)
## Source: local data frame [5,749 x 3]
## Groups: StockCode [4,070]
## 
## # A tibble: 5,749 x 3
##    StockCode                        Description   sum
##        <chr>                              <chr> <int>
##  1     84077  WORLD WAR 2 GLIDERS ASSTD DESIGNS 53847
##  2    85099B            JUMBO BAG RED RETROSPOT 47363
##  3     84879      ASSORTED COLOUR BIRD ORNAMENT 36381
##  4     22197                     POPCORN HOLDER 36334
##  5     21212    PACK OF 72 RETROSPOT CAKE CASES 36039
##  6    85123A WHITE HANGING HEART T-LIGHT HOLDER 35025
##  7     23084                 RABBIT NIGHT LIGHT 30680
##  8     22492             MINI PAINT SET VINTAGE 26437
##  9     22616          PACK OF 12 LONDON TISSUES 26315
## 10     21977 PACK OF 60 PINK PAISLEY CAKE CASES 24753
## # ... with 5,739 more rows

As we can see in the plots below, the majority of items is purchased only occasionally, while a few items are purchased a lot.

p1 <- retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  ggplot(aes(x = sum)) +
    geom_density(fill = palette_light()[[1]], alpha = 0.8) +
    theme_tq()

p2 <- retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  filter(sum > 1) %>%
  ggplot(aes(x = sum)) +
    geom_density(fill = palette_light()[[1]], alpha = 0.8) +
    theme_tq()

p3 <- retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  filter(sum > 10000) %>%
  ggplot(aes(x = sum)) +
    geom_density(fill = palette_light()[[1]], alpha = 0.8) +
    theme_tq()
    
grid.arrange(p1, p2, p3, ncol = 3)


We can also calculate on how many different days each item has been purchased.

most_sold <- retail %>%
  group_by(day, StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  group_by(StockCode, Description) %>%
  summarise(n = n()) %>%
  arrange(-n)

head(most_sold)
## Source: local data frame [6 x 3]
## Groups: StockCode [6]
## 
## # A tibble: 6 x 3
##   StockCode                        Description     n
##       <chr>                              <chr> <int>
## 1    85123A WHITE HANGING HEART T-LIGHT HOLDER   304
## 2    85099B            JUMBO BAG RED RETROSPOT   302
## 3     22423           REGENCY CAKESTAND 3 TIER   301
## 4     84879      ASSORTED COLOUR BIRD ORNAMENT   300
## 5     20725            LUNCH BAG RED RETROSPOT   299
## 6     21212    PACK OF 72 RETROSPOT CAKE CASES   299

The item that has been purchased most often in terms of days is the white hanging heart t-light holder. Let’s look at its distribution of sold/returned quantities per day:

retail %>%
  filter(StockCode == "85123A") %>%
  group_by(day, income_return) %>%
  summarise(sum = sum(Quantity)) %>%
  ggplot(aes(x = day, y = sum, color = income_return)) +
    facet_wrap(~ income_return, ncol = 1, scales = "free") +
    geom_line(size = 1, alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(x = "",
         y = "sum of quantities",
         color = "",
         title = "Transactions of WHITE HANGING HEART T-LIGHT HOLDER")


Preparing data for modeling by day

There are, of course, infinitely more ways to visualize data, but for now I think we have enough of a feel for the data to start preparing a dataset for modeling. Because we only have data for one year, we might not have enough information to accurately forecast or model time-dependent trends. But we can try by creating a table of features per day.


Which customers are repeat customers?

If the customer ID has been recorded on more than one day (i.e., they have made multiple transactions during the year of recording), they are considered repeat customers. As we can see in the plot, the majority of customers are repeat customers.

rep_customer <- retail %>%
  group_by(day, CustomerID) %>%
  summarise(sum = sum(Quantity)) %>%
  group_by(CustomerID) %>%
  summarise(n = n()) %>%
  mutate(repeat_customer = ifelse(n > 1, "repeat_cust", "one_time_cust"))

length(which(rep_customer$repeat_customer == "repeat_cust"))
## [1] 2992
rep_customer_day <- left_join(retail, rep_customer, by = "CustomerID") %>%
  distinct(day, CustomerID, repeat_customer) %>%
  group_by(day, repeat_customer) %>%
  summarise(n = n()) %>%
  spread(key = repeat_customer, value = n)
rep_customer %>%
  group_by(repeat_customer) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = "", y = prop, fill = repeat_customer)) +
    geom_bar(stat = "identity", alpha = 0.8) +
    coord_polar("y", start = 0) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(legend.position = "right") +
    labs(x = "",
         y = "",
         fill = "",
         title = "Proportion of one-time & repeat customers")


Transactions, quantities and items per customer and day

I am calculating the following metrics per day and customer.

  • n: number of different items purchased/returned per customer per day
  • sum_it: net sum of items (quantity) purchased/returned per customer per day
  • sum_in: net sum of income per customer per day

From these, I can further calculate mean numbers per day:

  • mean_in_cust: mean net income from all customers per day
  • mean_quant_cust: mean net quantities of items from all customers per day
  • mean_items_cust: mean number of items from all customers per day
customer_purch <- retail %>%
  group_by(day, CustomerID) %>%
  summarise(n = n(),
            sum_it = sum(Quantity),
            sum_in = sum(income)) %>%
  group_by(day) %>%
  summarise(mean_in_cust = mean(sum_in),
            mean_quant_cust = mean(sum_it),
            mean_items_cust = mean(n))
customer_purch %>%
  gather(x, y, mean_in_cust:mean_items_cust) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "")


Purchases/returns per day

This calculates the quantities of items that are purchased and returned per day:

income_return <- retail %>%
  group_by(day, income_return) %>%
  summarise(sum = sum(Quantity)) %>%
  spread(key = income_return, value = sum)
income_return %>%
  gather(x, y, income:return) %>%
  ggplot(aes(x = day, y = y, color = x)) +
    geom_line(size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(x = "", 
         y = "quantity of items",
         color = "")


How many items are purchased/returned per country?

Because the majority of transactions came from UK customers, I am dividing customers into UK-based and other nationalities and calculating the net quantities of items purchased and returned by day and country.

country_purch <- retail %>%
  mutate(Country2 = ifelse(Country == "United Kingdom", "uk", "other_country")) %>%
  group_by(day, Country2) %>%
  summarise(sum = sum(Quantity)) %>%
  spread(key = Country2, value = sum) %>%
  mutate(prop_other_country = other_country / sum(other_country + uk),
         prop_uk = uk / sum(other_country + uk))
country_purch %>%
  gather(x, y, prop_other_country:prop_uk) %>%
  ggplot(aes(x = day, y = y)) +
    geom_bar(aes(fill = x), stat = "identity", alpha = 0.6) +
    scale_fill_manual(values = palette_light()) +
    geom_line(data = country_purch, aes(x = day, y = prop_uk), size = 1) +
    theme_tq() +
    labs(x = "", 
         y = "proportion of quantity of items",
         fill = "")

Here, we can now also see that not every day has a corresponding data point: there are regular (i.e. no Saturdays are recorded) and irregular patterns of missing data.


How many different items are purchased/returned per day?

I also want to know how many different items were purchased or returned. Here, we see a similar pattern where the number of different items per day increased during the last two months.

n_items <- retail %>%
  group_by(day, StockCode) %>%
  summarise(n = n()) %>%
  group_by(day) %>%
  summarise(n_items = n())
n_items %>%
  ggplot(aes(x = day, y = n_items)) +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "number of different items",
         color = "")


Net income & quantities summaries

Finally, I am calculating the sum and mean of the net income and net quantities sold per day.

income <- retail %>%
  group_by(day) %>%
  summarise(sum_income = sum(income),
            mean_income = mean(income),
            sum_quantity = sum(Quantity),
            mean_quantity = mean(Quantity))
income %>%
  gather(x, y, sum_income:mean_quantity) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "")


Income from purchases and returns

I am calculating the same metrics for purchases and returns separately.

purchases <- retail %>%
  filter(income > 0) %>%
  group_by(day) %>%
  summarise(sum_income_purch = sum(income),
            mean_income_purch = mean(income),
            sum_quantity_purch = sum(Quantity),
            mean_quantity_purch = mean(Quantity))
purchases %>%
  gather(x, y, sum_income_purch:mean_quantity_purch) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "")

returns <- retail %>%
  filter(income < 0) %>%
  group_by(day) %>%
  summarise(sum_income_return = sum(income),
            mean_income_return = mean(income),
            sum_quantity_return = sum(Quantity),
            mean_quantity_return = mean(Quantity))
returns %>%
  gather(x, y, sum_income_return:mean_quantity_return) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    theme_tq() +
    labs(x = "", 
         y = "")


Mean price per units sold per day

This is a bit more complicated than one might think, because some items are sold multiple times per day and unit prices of the same item change between transactions and days. Therefore, I am creating a temporary dataframe with the distinct combinations of day, StockCode and UnitPrice, keyed by a combined day & StockCode column. This, I can then join to a table of the number of transactions per day and item to find the price(s) per item and day. Based on this, I calculate the mean unit price per item and day and, from that, the mean unit price of all items per day.

temp <- distinct(select(retail, day, StockCode, UnitPrice)) %>%
  mutate(temp = paste(day, StockCode, sep = "_")) %>%
  select(temp, UnitPrice)

mean_unit_price <- retail %>%
  filter(income_return == "income") %>%
  group_by(day, StockCode) %>%
  summarise(n = n()) %>%
  mutate(temp = paste(day, StockCode, sep = "_")) %>%
  left_join(temp, by = "temp") %>%
  group_by(day, StockCode) %>%
  summarise(mean = mean(UnitPrice)) %>%
  group_by(day) %>%
  summarise(mean_unit_price = mean(mean))
mean_unit_price %>%
  ggplot(aes(x = day, y = mean_unit_price)) +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    theme_tq() +
    labs(x = "", 
         y = "mean unit price of sold items")


Purchases of most sold items

The 10 items that have been sold on the most days throughout the year of recording are also introduced as separate features.

most_sold_day <- retail %>%
  filter(StockCode %in% most_sold$StockCode[1:10]) %>%
  group_by(day, StockCode) %>%
  summarise(sum = sum(Quantity)) %>%
  spread(key = StockCode, value = sum)
retail %>%
  filter(StockCode %in% most_sold$StockCode[1:10]) %>%
  group_by(day, StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  ggplot(aes(x = day, y = sum)) +
    facet_wrap(~ StockCode, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    theme_tq() +
    labs(x = "", 
         y = "net sum of quantites sold")


Combining data

Now, we can combine all the different per-day features into one dataframe. Here, I am also calculating the difference of the net sum of income to the previous day by using the lag() function, and I am creating a column for seasons.

Additional time-based information will be added in a later part when we actually do forecasting with the timekit package.

retail_p_day <- distinct(select(retail, day, day_of_week, month)) %>%
  left_join(income, by = "day") %>%
  left_join(mean_unit_price, by = "day") %>%
  left_join(purchases, by = "day") %>%
  left_join(returns, by = "day") %>%
  left_join(customer_purch, by = "day") %>%
  left_join(rep_customer_day, by = "day") %>%
  left_join(income_return, by = "day") %>%
  left_join(country_purch, by = "day") %>%
  left_join(n_items, by = "day") %>%
  left_join(most_sold_day, by = "day") %>%
  mutate(diff_sum_income = sum_income - lag(sum_income),
         season = ifelse(month %in% c("03", "04", "05"), "spring",
                         ifelse(month %in% c("06", "07", "08"), "summer",
                                ifelse(month %in% c("09", "10", "11"), "fall", "winter"))))

To keep track of the features, I am creating an additional dataframe/tibble containing the descriptions of every feature column.

meta <- tibble(colnames_retail_p_day = colnames(retail_p_day)) %>%
  mutate(description = c(
    "day in YYYY-MM-DD",
    "weekday (Sun - Fri, there are no Sat in the dataset)",
    "month (as month number)",
    
    "sum of net income per day (all purchases & losses per day combined)",
    "mean net income per day (all purchases & losses per day combined)",
    "sum of net quantities sold/returned per day (all purchases & returns per day combined)",
    "mean net quantities sold/returned per day (all purchases & returns per day combined)",
    
    "mean price per unit sold (returns excluded)",
    
    "sum of income from purchases per day (losses excluded)",
    "mean income from purchases per day (losses excluded)",
    "sum of quantities from purchases per day (losses excluded)",
    "mean quantities from purchases per day (losses excluded)",
    
    "sum of losses from returns per day (purchases excluded)",
    "mean losses from returns per day (purchases excluded)",
    "sum of quantities from returns per day (purchases excluded)",
    "mean quantities from returns per day (purchases excluded)",
    
    "mean net income from all customers per day",
    "mean number of items from all customers per day",
    "mean number of items from all customers per day",

    "number of one-time customers per day",
    "number of repeat customers per day",
    
    "sum of items (quantities) purchased per day (returns excluded)",
    "sum of items (quantities) returned per day (purchases excluded)",
    
    "net sum of items (quantities) purchased per day from countries other than the UK",
    "net sum of items (quantities) purchased per day from the UK",
    "proportion of net sum of items (quantities) purchased per day from countries other than the UK",
    "proportion of net sum of items (quantities) purchased per day from the UK",
    
    "number of different items purchased/returned per day",
    
    "net sum of quantities sold of item with StockCode 20725",
    "net sum of quantities sold of item with StockCode 21212",
    "net sum of quantities sold of item with StockCode 22423",
    "net sum of quantities sold of item with StockCode 22457",
    "net sum of quantities sold of item with StockCode 22666",
    "net sum of quantities sold of item with StockCode 22960",
    "net sum of quantities sold of item with StockCode 47566",
    "net sum of quantities sold of item with StockCode 84879",
    "net sum of quantities sold of item with StockCode 85099B",
    "net sum of quantities sold of item with StockCode 85123A",
    
    "difference in sum of net income (purchases - returns) to previous day",
    "season"
    ))


Now that we have all kinds of features per day, we can gather them in a final visualization to identify patterns and potential response variables suitable for modeling. I am also adding points colored by day of the week, so that we’ll be able to judge potential weekday biases.

retail_p_day %>%
  # remove last day because it is so extreme
  filter(day != max(retail_p_day$day)) %>%
  gather(x, y, sum_income:diff_sum_income) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, scales = "free", ncol = 2) +
    geom_line(alpha = 0.8, color = palette_light()[[1]]) +
    geom_point(aes(color = day_of_week)) +
    geom_smooth() +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(x = "",
         y = "",
         color = "day of the week")

From the plots we can see that there has been a slight increase in sales during the months of November and December. Unfortunately, we don’t have data from additional years to judge whether that has to do with Christmas sales or whether it marks a general increase in sales that will continue into January and February.

And indeed, for many features we see lower sales on Sundays than on Monday through Friday (Saturdays have not been recorded).

Nevertheless, I will attempt a time-series forecast on this data. First, however, I will be developing some simple models to test for weekday biases and other statistics of the data. Stay tuned for my next post… :-)



sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gridExtra_2.2.1               modelr_0.1.0                 
##  [3] tidyquant_0.5.1               quantmod_0.4-8               
##  [5] TTR_0.23-1                    PerformanceAnalytics_1.4.3541
##  [7] xts_0.9-7                     zoo_1.8-0                    
##  [9] lubridate_1.6.0               dplyr_0.5.0                  
## [11] purrr_0.2.2.2                 readr_1.1.1                  
## [13] tidyr_0.6.3                   tibble_1.3.1                 
## [15] ggplot2_2.2.1                 tidyverse_1.1.1              
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2   haven_1.0.0      lattice_0.20-35  colorspace_1.3-2
##  [5] htmltools_0.3.6  yaml_2.1.14      rlang_0.1.1      foreign_0.8-68  
##  [9] DBI_0.6-1        readxl_1.0.0     plyr_1.8.4       stringr_1.2.0   
## [13] Quandl_2.8.0     munsell_0.4.3    gtable_0.2.0     cellranger_1.1.0
## [17] rvest_0.3.2      psych_1.7.5      evaluate_0.10    labeling_0.3    
## [21] knitr_1.16       forcats_0.2.0    parallel_3.4.0   broom_0.4.2     
## [25] Rcpp_0.12.11     scales_0.4.1     backports_1.1.0  jsonlite_1.4    
## [29] mnormt_1.5-5     hms_0.3          digest_0.6.12    stringi_1.1.5   
## [33] rprojroot_1.2    tools_3.4.0      magrittr_1.5     lazyeval_0.2.0  
## [37] xml2_1.1.1       assertthat_0.2.0 rmarkdown_1.5    httr_1.2.1      
## [41] R6_2.2.1         nlme_3.1-131     compiler_3.4.0

New R Users group in Münster!

2017-05-20 00:00:00 +0000

This is to announce that Münster now has its very own R users group!


If you are from the area, come join us (or if you happen to know someone who is and who might be interested, please forward the info).

You can find us on meetup.com: https://www.meetup.com/Munster-R-Users-Group/ and we are also on the R User groups list.


Code for the logo, which of course has been created in R:

library(ggplot2)
library(ggmap)
library(ggthemes)

munster <- get_map(location = 'Munster', zoom = 12, maptype = "watercolor")
ggmap(munster) +
  scale_y_continuous(limits=c(51.94, 51.97)) +
  annotate("text", x=7.565, y=51.96, label= "MünsteR", size = 8) +
  annotate("text", x=7.685, y=51.945, label= "R Users Group", size = 5) +
  theme_map()

Network analysis of Game of Thrones family ties

2017-05-15 00:00:00 +0000

In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.


What is a network?

A network in this context is a graph of interconnected nodes/vertices. Nodes can e.g. be people in a social network, genes in a co-expression network, etc. Nodes are connected via ties/edges.
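
As a minimal toy example (not part of the Game of Thrones data; the names are made up), three people connected by two ties could be built and plotted like this:

library(igraph)

# two undirected ties: Alice - Bob and Bob - Carol
toy_graph <- graph_from_literal(Alice - Bob, Bob - Carol)
plot(toy_graph)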


What can network analysis tell us?

Network analysis can e.g. be used to explore relationships in social or professional networks. In such cases, we would typically ask questions like:

  • How many connections does each person have?
  • Who is the most connected (i.e. influential or “important”) person?
  • Are there clusters of tightly connected people?
  • Are there a few key players that connect clusters of people?
  • etc.

These answers can give us a lot of information about the patterns of how people interact.


The Game of Thrones character network

The basis for this network is Kaggle’s Game of Thrones dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information in part by hand (based on A Wiki of Ice and Fire) and by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.

library(tidyverse)
library(igraph)
library(statnet)
load("union_edges.RData")
load("union_characters.RData")


I am using igraph to plot the initial network. To do so, I first create the graph from an edge table and a node table. The edge table contains source and target nodes in the first two columns and, optionally, additional columns with edge attributes. Here, I have the type of interaction (mother, father or spouse) and the color and line type I want to assign to each edge.

Because the books and the TV series differ slightly, I have introduced edges that are only supported or hinted at by the TV series and are not part of the original narrative in the books. These edges are marked by being dotted instead of solid. An additional color for edges with unspecified parental origin is introduced as well. Originally, these served for interactions that were extracted from character names (i.e. characters whose names ended with “… son/daughter of …”) and could either mean mother or father. Now, they show unclear parentage or cases where there are a biological and a de facto father, as in the case of Jon Snow.

head(union_edges)
##               source            target   type   color   lty
## 1         Lysa Arryn      Robert Arryn mother #7570B3 solid
## 2       Jasper Arryn        Alys Arryn father #1B9E77 solid
## 3       Jasper Arryn         Jon Arryn father #1B9E77 solid
## 4          Jon Arryn      Robert Arryn father #1B9E77 solid
## 110 Cersei Lannister  Tommen Baratheon mother #7570B3 solid
## 210 Cersei Lannister Joffrey Baratheon mother #7570B3 solid

The node table contains one row for each character that is either a source or a target in the edge table. We can give any number and type of node attributes. Here, I chose the following columns from the original Kaggle dataset: gender/male (male = 1, female = 0), house (as the house each character was born into) and popularity. house2 was meant to assign a color only to the major houses. Shape represents the gender.

head(union_characters)
##            name male culture          house popularity      house2   color  shape
## 1    Alys Arryn    0    <NA>    House Arryn 0.08026756        <NA>    <NA> circle
## 2 Elys Waynwood    0    <NA> House Waynwood 0.07023411        <NA>    <NA> circle
## 3  Jasper Arryn    1    <NA>    House Arryn 0.04347826        <NA>    <NA> square
## 4   Jeyne Royce    0    <NA>    House Royce 0.00000000        <NA>    <NA> circle
## 5     Jon Arryn    1 Valemen    House Arryn 0.83612040        <NA>    <NA> square
## 6    Lysa Arryn    0    <NA>    House Tully 0.00000000 House Tully #F781BF circle

By default, we have a directed graph.

union_graph <- graph_from_data_frame(union_edges, directed = TRUE, vertices = union_characters)

For plotting the legend, I am summarizing the edge and node colors.

color_vertices <- union_characters %>%
  group_by(house, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))

colors_edges <- union_edges %>%
  group_by(type, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))

Now, we can plot the graph object (here with Fruchterman-Reingold layout):

layout <- layout_with_fr(union_graph)

Click on the image to get to the high resolution pdf:

plot(union_graph,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph)$name),
     vertex.shape = V(union_graph)$shape,
     vertex.color = V(union_graph)$color, 
     vertex.size = (V(union_graph)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph)$color,
     edge.lty = E(union_graph)$lty)
legend("topleft", legend = c(NA, "Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1,
       title = "") 
legend("topleft", legend = "", cex = 4, bty = "n", ncol = 1,
       title = "Game of Thrones Family Ties")

family_ties_1

Node color shows the major houses, node size the character’s popularity and node shape their gender (square for male, circle for female). Edge color shows interaction type.

As we can see, even with only a subset of characters from the Game of Thrones world, the network is already quite big. You can click on the image to open the pdf and zoom into specific parts of the plot and read the node labels/character names.

What we can see right away is that there are only limited connections between houses and that the Greyjoys are the only house that has no ties to any of the others.


Network analysis

How do we find out who the most important characters are in this network?

We consider a character “important” if they have connections to many other characters. There are a few network properties that tell us more about this. For this, I am considering the network as undirected, treating parent/child relationships as mutual.

union_graph_undir <- as.undirected(union_graph, mode = "collapse")


Centrality

Centrality describes the number of edges that go into or out of a node. Networks with high centralization have a few nodes with many connections, while networks with low centralization have many nodes with similar numbers of edges.

“Centralization is a method for creating a graph level centralization measure from the centrality scores of the vertices.” centralize() help

For the whole network, we can calculate centrality by degree (centr_degree()), closeness (centr_clo()) or eigenvector centrality (centr_eigen()) of vertices.

centr_degree(union_graph_undir, mode = "total")$centralization
## [1] 0.04282795
centr_clo(union_graph_undir, mode = "total")$centralization
## [1] 0.01414082
centr_eigen(union_graph_undir, directed = FALSE)$centralization
## [1] 0.8787532


Node degree

Node degree or degree centrality describes how central a node is in the network (i.e. how many in- and outgoing edges it has or to how many other nodes it is directly connected via one edge).

“The degree of a vertex is its most basic structural property, the number of its adjacent edges.” From the help pages of degree()

We can calculate the number of out- or ingoing edges of each node, or - as I am doing here - the sum of both.

union_graph_undir_degree <- igraph::degree(union_graph_undir, mode = "total")

#standardized by number of nodes
union_graph_undir_degree_std <- union_graph_undir_degree / (vcount(union_graph_undir) - 1)
node_degree <- data.frame(degree = union_graph_undir_degree,
                          degree_std = union_graph_undir_degree_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_degree, by = c("name" = "rowname"))

node_degree %>%
  arrange(-degree) %>%
  .[1:10, ]
##            rowname degree degree_std
## 1  Quellon Greyjoy     12 0.05797101
## 2      Walder Frey     10 0.04830918
## 3   Oberyn Martell     10 0.04830918
## 4     Eddard Stark      9 0.04347826
## 5    Catelyn Stark      8 0.03864734
## 6       Emmon Frey      7 0.03381643
## 7  Genna Lannister      7 0.03381643
## 8     Merrett Frey      7 0.03381643
## 9    Balon Greyjoy      7 0.03381643
## 10 Jason Lannister      7 0.03381643

In this case, the node degree reflects how many offspring and spouses a character had. With 3 wives and several children, Quellon Greyjoy, the grandfather of Theon and Asha/Yara, comes out on top (of course, had I included all offspring and wives of Walder Frey, he would easily be on top, but the network would have gotten infinitely more confusing).


Closeness

The closeness of a node describes its distance to all other nodes. A node with highest closeness is more central and can spread information to many other nodes.

closeness <- igraph::closeness(union_graph_undir, mode = "total")

#standardized by number of nodes
closeness_std <- closeness / (vcount(union_graph_undir) - 1)
node_closeness <- data.frame(closeness = closeness,
                          closeness_std = closeness_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_closeness, by = c("name" = "rowname"))

node_closeness %>%
  arrange(-closeness) %>%
  .[1:10, ]
##             rowname    closeness closeness_std
## 1       Sansa Stark 0.0002013288  9.726028e-07
## 2  Tyrion Lannister 0.0002012882  9.724070e-07
## 3   Tywin Lannister 0.0002011668  9.718201e-07
## 4  Joanna Lannister 0.0002005616  9.688965e-07
## 5      Eddard Stark 0.0002002804  9.675381e-07
## 6     Catelyn Stark 0.0001986492  9.596579e-07
## 7  Cersei Lannister 0.0001984915  9.588960e-07
## 8   Jaime Lannister 0.0001975894  9.545382e-07
## 9    Jeyne Marbrand 0.0001966568  9.500330e-07
## 10  Tytos Lannister 0.0001966568  9.500330e-07

The characters with highest closeness all surround central characters that connect various storylines and houses in Game of Thrones.


Betweenness centrality

Betweenness describes the number of shortest paths between nodes. Nodes with high betweenness centrality are on the path between many other nodes, i.e. they are people who are key connections or bridges between different groups of nodes. In a social network, these nodes would be very important because they are likely to pass on information to a wide reach of people.

The igraph function betweenness() calculates vertex betweenness, edge_betweenness() calculates edge betweenness:

“The vertex and edge betweenness are (roughly) defined by the number of geodesics (shortest paths) going through a vertex or an edge.” igraph help for estimate_betweenness()

betweenness <- igraph::betweenness(union_graph_undir, directed = FALSE)

# standardize by number of node pairs
betweenness_std <- betweenness / ((vcount(union_graph_undir) - 1) * (vcount(union_graph_undir) - 2) / 2)

node_betweenness <- data.frame(betweenness = betweenness,
                               betweenness_std = betweenness_std) %>%
  tibble::rownames_to_column() 

union_characters <- left_join(union_characters, node_betweenness, by = c("name" = "rowname"))

node_betweenness %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##              rowname betweenness betweenness_std
## 1       Eddard Stark    6926.864       0.3248846
## 2        Sansa Stark    6165.667       0.2891828
## 3   Tyrion Lannister    5617.482       0.2634718
## 4    Tywin Lannister    5070.395       0.2378123
## 5   Joanna Lannister    4737.524       0.2221999
## 6  Rhaegar Targaryen    4301.583       0.2017533
## 7    Margaery Tyrell    4016.417       0.1883784
## 8           Jon Snow    3558.884       0.1669192
## 9        Mace Tyrell    3392.500       0.1591154
## 10   Jason Lannister    3068.500       0.1439191
edge_betweenness <- igraph::edge_betweenness(union_graph_undir, directed = FALSE)

data.frame(edge = attr(E(union_graph_undir), "vnames"),
           betweenness = edge_betweenness) %>%
  tibble::rownames_to_column() %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##    rowname                              edge betweenness
## 1      160      Sansa Stark|Tyrion Lannister    5604.149
## 2      207          Sansa Stark|Eddard Stark    4709.852
## 3      212        Rhaegar Targaryen|Jon Snow    3560.083
## 4      296       Margaery Tyrell|Mace Tyrell    3465.000
## 5      213             Eddard Stark|Jon Snow    3163.048
## 6      131  Jason Lannister|Joanna Lannister    3089.500
## 7      159 Joanna Lannister|Tyrion Lannister    2983.591
## 8      171  Tyrion Lannister|Tywin Lannister    2647.224
## 9      192    Elia Martell|Rhaegar Targaryen    2580.000
## 10     300         Luthor Tyrell|Mace Tyrell    2565.000

This, we can now plot by feeding the node betweenness as vertex.size and edge betweenness as edge.width to our plot function:

plot(union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.color = V(union_graph_undir)$color, 
     vertex.size = betweenness * 0.001, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = edge_betweenness * 0.01,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph_undir)$color,
     edge.lty = E(union_graph_undir)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1)

family_ties_1

Ned Stark is the character with the highest betweenness. This makes sense, as he and his children (specifically Sansa and her arranged marriage to Tyrion) connect to other houses and are the central points from which the story unfolds. However, we have to keep in mind here that my choice of who is important enough to include in the network (e.g. the Stark ancestors) and who is not (e.g. the whole complicated mess that is the Targaryen and Frey family tree) makes this result somewhat biased.


Diameter

In contrast to the shortest path between two specific nodes, we can also calculate the longest of all shortest paths in the network, i.e. the diameter:

diameter(union_graph_undir, directed = FALSE)
## [1] 21

In our network, the longest shortest path (the diameter) spans 21 edges.
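
For comparison, the shortest path between two specific characters can be extracted as well. This is just a sketch; I am picking Sansa Stark and Tywin Lannister as examples because both appear in the tables above:

# length of the shortest path (number of edges) between the two characters ...
igraph::distances(union_graph_undir, v = "Sansa Stark", to = "Tywin Lannister")

# ... and the characters along that path
igraph::shortest_paths(union_graph_undir, from = "Sansa Stark", to = "Tywin Lannister")$vpath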

“get_diameter returns a path with the actual diameter. If there are many shortest paths of the length of the diameter, then it returns the first one found.” diameter() help

This, we can also plot:

union_graph_undir_diameter <- union_graph_undir
node_diameter <- get.diameter(union_graph_undir_diameter,  directed = FALSE)

V(union_graph_undir_diameter)$color <- scales::alpha(V(union_graph_undir_diameter)$color, alpha = 0.5)
V(union_graph_undir_diameter)$size <- 2

V(union_graph_undir_diameter)[node_diameter]$color <- "red"
V(union_graph_undir_diameter)[node_diameter]$size <- 5

E(union_graph_undir_diameter)$color <- "grey"
E(union_graph_undir_diameter)$width <- 1

E(union_graph_undir_diameter, path = node_diameter)$color <- "red"
E(union_graph_undir_diameter, path = node_diameter)$width <- 5

plot(union_graph_undir_diameter,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir_diameter)$name),
     vertex.shape = V(union_graph_undir_diameter)$shape,
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.lty = E(union_graph_undir_diameter)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)), pch = 19,
       col = c(NA, color_vertices$color, NA, NA, colors_edges$color), pt.cex = 5, cex = 2, bty = "n", ncol = 1)

family_ties_1


Transitivity

“Transitivity measures the probability that the adjacent vertices of a vertex are connected. This is sometimes also called the clustering coefficient.” transitivity() help

We can calculate the transitivity or ratio of triangles to connected triples for the whole network:

transitivity(union_graph_undir, type = "global")
## [1] 0.2850679

Or for each node:

transitivity <- data.frame(name = V(union_graph_undir)$name,
      transitivity = transitivity(union_graph_undir, type = "local")) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, transitivity, by = "name")

transitivity %>%
  arrange(-transitivity) %>%
  .[1:10, ]
##                 name transitivity
## 1       Robert Arryn            1
## 2   Ormund Baratheon            1
## 3     Selyse Florent            1
## 4  Shireen Baratheon            1
## 5   Amarei Crakehall            1
## 6       Marissa Frey            1
## 7        Olyvar Frey            1
## 8        Perra Royce            1
## 9        Perwyn Frey            1
## 10         Tion Frey            1

Because ours is a family network, characters with a transitivity of one form triangles with their parents or offspring.


PageRank centrality

PageRank (originally used by Google to rank the importance of search results) is similar to eigenvector centrality. Eigenvector centrality scores nodes in a network according to the number of connections to high-degree nodes they have. It is therefore a measure of node importance. PageRank similarly considers nodes as more important if they have many incoming edges (or links).

page_rank <- page.rank(union_graph_undir, directed = FALSE)

page_rank_centrality <- data.frame(name = names(page_rank$vector),
      page_rank = page_rank$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, page_rank_centrality, by = "name")

page_rank_centrality %>%
  arrange(-page_rank) %>%
  .[1:10, ]
##                 name   page_rank
## 1     Oberyn Martell 0.018402407
## 2    Quellon Greyjoy 0.016128129
## 3        Walder Frey 0.012956029
## 4       Eddard Stark 0.011725019
## 5       Cregan Stark 0.010983561
## 6      Catelyn Stark 0.010555473
## 7       Lyarra Stark 0.009876629
## 8  Aegon V Targaryen 0.009688458
## 9      Balon Greyjoy 0.009647049
## 10         Jon Arryn 0.009623742

Oberyn Martell, Quellon Greyjoy and Walder Frey have the highest numbers of spouses, children and grandchildren and are therefore scored highest by PageRank.


Matrix representation of a network

Connections between nodes can also be represented as an adjacency matrix. We can convert our graph object to an adjacency matrix with igraph’s as_adjacency_matrix() function. Whenever there is an edge between two nodes, the corresponding entry of the matrix is assigned a 1, otherwise it is 0.

adjacency <- as.matrix(as_adjacency_matrix(union_graph_undir))
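
To convince ourselves of this encoding, we can peek at the matrix (a small sketch; the two characters named below are a hypothetical example of a pair I expect to be connected by a spouse edge, and both names appear in the tables above):

# the matrix should have one row and one column per character
dim(adjacency)

# a small corner of the matrix
adjacency[1:5, 1:5]

# should be 1 if the two characters are connected by an edge
adjacency["Eddard Stark", "Catelyn Stark"]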


Eigenvector centrality

We can now calculate the eigenvalues and eigenvectors of the adjacency matrix.

#degree diagonal matrix
degree_diag <- diag(1 / igraph::degree(union_graph_undir))

# PageRank matrix
pagerank <- adjacency %*% degree_diag

eigenvalues <- eigen(pagerank)

The eigenvector corresponding to the highest eigenvalue scores those vertices highly that have many edges or that are connected to vertices with many edges.

eigenvector <- data.frame(name = rownames(pagerank),
           eigenvector = as.numeric(eigenvalues$vectors[, which.max(eigenvalues$values)]))

union_characters <- left_join(union_characters, eigenvector, by = "name")

eigenvector %>%
  arrange(eigenvector) %>%
  .[1:10, ]
##                       name eigenvector
## 1          Quellon Greyjoy  -0.6625628
## 2            Balon Greyjoy  -0.3864950
## 3   Lady of House Sunderly  -0.3312814
## 4           Alannys Harlaw  -0.2760678
## 5  Lady of House Stonetree  -0.2208543
## 6      Asha (Yara) Greyjoy  -0.1656407
## 7            Robin Greyjoy  -0.1104271
## 8            Euron Greyjoy  -0.1104271
## 9          Urrigon Greyjoy  -0.1104271
## 10       Victarion Greyjoy  -0.1104271

Because of their highly connected family ties (i.e. there are only a handful of connections, but they are almost all triangles), the Greyjoys receive the highest scores in this eigenvector.

We can find the eigenvector centrality scores with:

eigen_centrality <- igraph::eigen_centrality(union_graph_undir, directed = FALSE)

eigen_centrality <- data.frame(name = names(eigen_centrality$vector),
           eigen_centrality = eigen_centrality$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, eigen_centrality, by = "name")

eigen_centrality %>%
  arrange(-eigen_centrality) %>%
  .[1:10, ]
##                name eigen_centrality
## 1   Tywin Lannister        1.0000000
## 2  Cersei Lannister        0.9168980
## 3  Joanna Lannister        0.8358122
## 4    Jeyne Marbrand        0.8190076
## 5   Tytos Lannister        0.8190076
## 6   Genna Lannister        0.7788376
## 7   Jaime Lannister        0.7642870
## 8  Robert Baratheon        0.7087042
## 9        Emmon Frey        0.6538709
## 10      Walder Frey        0.6516021

When we consider eigenvector centrality, Tywin and the core Lannister family score highest.


Who are the most important characters?

We can now compare all the node-level information to decide which characters are the most important in Game of Thrones. Such node level characteristics could also be used as input for machine learning algorithms.

Let’s look at all characters from the major houses:

union_characters %>%
  filter(!is.na(house2)) %>%
  dplyr::select(-contains("_std")) %>%
  gather(x, y, degree:eigen_centrality) %>%
  ggplot(aes(x = name, y = y, color = house2)) +
    geom_point(size = 3) +
    facet_grid(x ~ house2, scales = "free") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

family_ties_1

Taken together, we could say that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones.


Groups of nodes

We can also analyze dyads (pairs of two nodes), triads (groups of three nodes) and bigger cliques in our network. For dyads, we can use the function dyad_census() from igraph or dyad.census() from sna. Both are identical and calculate a Holland and Leinhardt dyad census with

  • mut: The number of pairs with mutual connections (in our case, spouses).
  • asym: The number of pairs with non-mutual connections (in the original network: mother-child and father-child relationships; but in the undirected network, there are none).
  • null: The number of pairs with no connection between them.
#igraph::dyad_census(union_graph_undir)
sna::dyad.census(adjacency)
##      Mut Asym  Null
## [1,] 326    0 21202
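
As a quick plausibility check (a minimal sketch), the three counts should add up to the total number of unordered node pairs in the network:

# total number of possible dyads, i.e. unordered pairs of nodes
choose(vcount(union_graph_undir), 2)

# which should equal Mut + Asym + Null
326 + 0 + 21202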

The same can be calculated for triads (see ?triad_census for details on what each output means).

#igraph::triad_census(union_graph_undir)
sna::triad.census(adjacency)
##          003 012   102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C 210 300
## [1,] 1412100   0 65261    0    0    0    0    0    0    0 790    0    0    0   0 105
triad.classify(adjacency, mode = "graph")
## [1] 2

We can also count the paths and cycles up to a maximum length that we specify, here e.g. length <= 5. On the graph level, we obtain the counts of paths or cycles of each length up to that maximum; tabulated by vertex, we obtain the number of paths or cycles to which each node belongs.

node_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = TRUE, dyadic.tabulation = "sum")
edge_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = FALSE)
edge_kpath
## $path.count
##     1     2     3     4     5 
##   326  1105  2973  7183 17026

This, we could plot with (but here, it does not give much additional information):

gplot(node_kpath$paths.bydyad,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")
node_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = TRUE, cycle.comembership = "sum")
edge_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = FALSE)
edge_kcycle
## $cycle.count
##   2   3   4   5   6   7   8 
##   0 105 136  27  57  58  86
node_kcycle_reduced <- node_kcycle$cycle.comemb
node_kcycle_reduced <- node_kcycle_reduced[which(rowSums(node_kcycle_reduced) > 0), which(colSums(node_kcycle_reduced) > 0)]

gplot(node_kcycle_reduced,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")

family_ties_1

“A (maximal) clique is a maximal set of mutually adjacent vertices.” clique.census() help

node_clique <- clique.census(adjacency, mode = "graph", tabulate.by.vertex = TRUE, clique.comembership = "sum")
edge_clique <- clique.census(adjacency, mode = "graph", tabulate.by.vertex = FALSE, clique.comembership = "sum")
edge_clique$clique.count
##   1   2   3 
##   0  74 105
node_clique_reduced <- node_clique$clique.comemb
node_clique_reduced <- node_clique_reduced[which(rowSums(node_clique_reduced) > 0), which(colSums(node_clique_reduced) > 0)]

gplot(node_clique_reduced,
      label.cex = 0.5, 
      vertex.cex = 0.75,
      displaylabels = TRUE,
      edge.col = "grey")

family_ties_1

The largest cliques in this network contain three nodes, i.e. they are parent/child relationships. Therefore, it does not really make sense to plot them all, but we could plot the network and highlight one of the largest cliques with:

vcol <- rep("grey80", vcount(union_graph_undir))

# highlight first of largest cliques
vcol[unlist(largest_cliques(union_graph_undir)[[1]])] <- "red"

plot(union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.color = vcol, 
     vertex.size = 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = 2,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph_undir)$color,
     edge.lty = E(union_graph_undir)$lty)


Clustering

We can also look for groups within our network by clustering node groups according to their edge betweenness:

ceb <- cluster_edge_betweenness(union_graph_undir)
modularity(ceb)
## [1] 0.8359884
plot(ceb,
     union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8)

family_ties_1
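
We can also inspect these communities directly; here is a minimal sketch using igraph’s standard accessor functions for communities objects:

# number of communities found by edge betweenness clustering
length(ceb)

# size of each community
sizes(ceb)

# community membership per character (named vector)
head(membership(ceb))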

Or based on propagating labels:

clp <- cluster_label_prop(union_graph_undir)

plot(clp,
     union_graph_undir,
     layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, 
     vertex.frame.color = "gray", 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8)

family_ties_1


Network properties

We can also feed our adjacency matrix to other functions, like GenInd() from the NetIndices package. This function calculates a number of network properties, like the number of compartments (N), total system throughput (T..), total system throughflow (TST), number of internal links (Lint), total number of links (Ltot), link density (LD), connectance (C), average link weight (Tijbar), average compartment throughflow (TSTbar) and compartmentalization or degree of connectedness of subsystems in the network (Cbar).

library(NetIndices)
graph.properties <- GenInd(adjacency)
graph.properties
## $N
## [1] 208
## 
## $T..
## [1] 652
## 
## $TST
## [1] 652
## 
## $Lint
## [1] 652
## 
## $Ltot
## [1] 652
## 
## $LD
## [1] 3.134615
## 
## $C
## [1] 0.01514307
## 
## $Tijbar
## [1] 1
## 
## $TSTbar
## [1] 3.134615
## 
## $Cbar
## [1] 0.01086163


Alternatively, the network package provides additional functions to obtain network properties. Here, we can again feed in the adjacency matrix of our network and convert it to a network object.

library(network)
adj_network <- network(adjacency, directed = TRUE)
adj_network
##  Network attributes:
##   vertices = 208 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 652 
##     missing edges= 0 
##     non-missing edges= 652 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
## No edge attributes

From this network object, we can e.g. get the number of dyads and edges within a network and the network size.

network.dyadcount(adj_network)
## [1] 43056
network.edgecount(adj_network)
## [1] 652
network.size(adj_network)
## [1] 208
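
Because we created a directed network object, the dyad count is simply the number of ordered node pairs; a quick sketch to verify this:

# for a directed network: n * (n - 1) possible dyads
network.size(adj_network) * (network.size(adj_network) - 1)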

“equiv.clust uses a definition of approximate equivalence (equiv.fun) to form a hierarchical clustering of network positions. Where dat consists of multiple relations, all specified relations are considered jointly in forming the equivalence clustering.” equiv.clust() help

ec <- equiv.clust(adj_network, mode = "graph", cluster.method = "average", plabels = network.vertex.names(adj_network))
ec
## Position Clustering:
## 
##  Equivalence function: sedist 
##  Equivalence metric: hamming 
##  Cluster method: average 
##  Graph order: 208
ec$cluster$labels <- ec$plabels
plot(ec)

family_ties_1

From the sna package, we can e.g. use functions that tell us the graph density and the dyadic reciprocity of the vertices or edges:

gden(adjacency)
## [1] 0.01514307
grecip(adjacency)
## Mut 
##   1
grecip(adjacency, measure = "edgewise")
## Mut 
##   1
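
The graph density reported by gden() should simply be the number of observed edges divided by the number of possible dyads, which we can verify with the network functions from above (a minimal sketch):

# density = observed edges / possible dyads
network.edgecount(adj_network) / network.dyadcount(adj_network)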


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NetIndices_1.4.4     MASS_7.3-47          statnet_2016.9       sna_2.4              ergm.count_3.2.2     tergm_3.4.0          networkDynamic_0.9.0 ergm_3.7.1           network_1.13.0       statnet.common_3.3.0 igraph_1.0.1         dplyr_0.5.0          purrr_0.2.2          readr_1.1.0          tidyr_0.6.2          tibble_1.3.0         ggplot2_2.2.1        tidyverse_1.1.1     
## 
## loaded via a namespace (and not attached):
##  [1] lpSolve_5.6.13    reshape2_1.4.2    haven_1.0.0       lattice_0.20-35   colorspace_1.3-2  htmltools_0.3.6   yaml_2.1.14       foreign_0.8-68    DBI_0.6-1         modelr_0.1.0      readxl_1.0.0      trust_0.1-7       plyr_1.8.4        robustbase_0.92-7 stringr_1.2.0     munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0  rvest_0.3.2       coda_0.19-1       psych_1.7.5       evaluate_0.10     labeling_0.3      knitr_1.15.1      forcats_0.2.0     parallel_3.4.0    DEoptimR_1.0-8    broom_0.4.2       Rcpp_0.12.10      scales_0.4.1      backports_1.0.5   jsonlite_1.4      mnormt_1.5-5      hms_0.3           digest_0.6.12     stringi_1.1.5     grid_3.4.0        rprojroot_1.2     tools_3.4.0       magrittr_1.5      lazyeval_0.2.0    Matrix_1.2-10     xml2_1.1.1        lubridate_1.6.0   assertthat_0.2.0  rmarkdown_1.5     httr_1.2.1        R6_2.2.0          nlme_3.1-131      compiler_3.4.0

Update to autoencoders and anomaly detection with machine learning in fraud analytics

2017-05-02 00:00:00 +0000

This is a reply to Wojciech Indyk’s comment on yesterday’s post on autoencoders and anomaly detection with machine learning in fraud analytics:

“I think you can improve the detection of anomalies if you change the training set to the deep-autoencoder. As I understand the train_unsupervised contains both class 0 and class 1. If you put only class 0 as the input of the autoencoder, the network should learn the “normal” pattern. Then the MSE should be higher for 1 class in the test set (of course if anomaly==fraud).”

To test this, I follow the same workflow as in yesterday’s post, but this time I am moving all fraud instances from the first training set (used for unsupervised learning) to the second training set (used for supervised learning). Now, the autoencoder learns a pattern solely on non-fraud cases.

library(tidyverse)
library(h2o)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         20 minutes 11 seconds 
##     H2O cluster version:        3.10.4.4 
##     H2O cluster version age:    17 days  
##     H2O cluster name:           H2O_started_from_R_Shirin_erp741 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.54 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)
# convert data to H2OFrame
creditcard_hf <- as.h2o(creditcard)
splits <- h2o.splitFrame(creditcard_hf, 
                         ratios = c(0.4, 0.4), 
                         seed = 42)

train_unsupervised  <- splits[[1]]
train_supervised  <- splits[[2]]
test <- splits[[3]]

# move class 1 instances to second training set...
train_supervised <- rbind(as.data.frame(train_supervised), as.data.frame(train_unsupervised[train_unsupervised$Class == "1", ])) %>%
  as.h2o()

# ... and remove from first training set
train_unsupervised <- train_unsupervised[train_unsupervised$Class == "0", ]

response <- "Class"
features <- setdiff(colnames(train_unsupervised), response)
model_nn <- h2o.deeplearning(x = features,
                             training_frame = train_unsupervised,
                             model_id = "model_nn",
                             autoencoder = TRUE,
                             reproducible = TRUE, #slow - turn off for real problems
                             ignore_const_cols = FALSE,
                             seed = 42,
                             hidden = c(10, 2, 10), 
                             epochs = 100,
                             activation = "Tanh")
anomaly <- h2o.anomaly(model_nn, test) %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  mutate(Class = as.vector(test[, 31]))

mean_mse <- anomaly %>%
  group_by(Class) %>%
  summarise(mean = mean(Reconstruction.MSE))
ggplot(anomaly, aes(x = as.numeric(rowname), y = Reconstruction.MSE, color = as.factor(Class))) +
  geom_point(alpha = 0.3) +
  geom_hline(data = mean_mse, aes(yintercept = mean, color = Class)) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "instance number",
       color = "Class")

Compared to the results from yesterday’s post, this model seems to have learned a pattern that separates the test instances into two major groups: the mean reconstruction MSE was slightly higher for class 0 and clearly higher for class 1.

anomaly <- anomaly %>%
  mutate(outlier = ifelse(Reconstruction.MSE > 0.04, "outlier", "no_outlier"))

anomaly %>%
  group_by(Class, outlier) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: Class [2]
## 
##   Class    outlier     n         freq
##   <chr>      <chr> <int>        <dbl>
## 1     0 no_outlier 56608 0.9995762113
## 2     0    outlier    24 0.0004237887
## 3     1 no_outlier    60 0.6521739130
## 4     1    outlier    32 0.3478260870

Anomaly detection with a higher threshold based on the plot above did not improve the results compared to yesterday’s post.

With a lower threshold of 0.2 (not shown here), the test set performed much better for detecting fraud cases as outliers (65 vs 27, compared to 32 vs 60 in yesterday’s post). However, this model also categorized many more non-fraud cases as outliers (2803 vs 53829, compared to only 30 vs 56602).


Now, I am again using the autoencoder model as pre-training input for supervised learning.

model_nn_2 <- h2o.deeplearning(y = response,
                               x = features,
                               training_frame = train_supervised,
                               pretrained_autoencoder  = "model_nn",
                               reproducible = TRUE, #slow - turn off for real problems
                               balance_classes = TRUE,
                               ignore_const_cols = FALSE,
                               seed = 42,
                               hidden = c(10, 2, 10), 
                               epochs = 100,
                               activation = "Tanh")
pred <- as.data.frame(h2o.predict(object = model_nn_2, newdata = test)) %>%
  mutate(actual = as.vector(test[, 31]))
pred %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n       freq
##    <chr>  <fctr> <int>      <dbl>
## 1      0       0 56347 0.99496751
## 2      0       1   285 0.00503249
## 3      1       0     9 0.09782609
## 4      1       1    83 0.90217391

This model is now much better at identifying fraud cases than in yesterday’s post (90%, compared to 83% - even though we can’t directly compare the two models as they were trained on different training sets) but it is also slightly less accurate at predicting non-fraud cases (99.5%, compared to 99.8%).


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] h2o_3.10.4.4    dplyr_0.5.0     purrr_0.2.2     readr_1.1.0    
## [5] tidyr_0.6.1     tibble_1.3.0    ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10       RColorBrewer_1.1-2 cellranger_1.1.0  
##  [4] compiler_3.4.0     plyr_1.8.4         bitops_1.0-6      
##  [7] forcats_0.2.0      tools_3.4.0        digest_0.6.12     
## [10] lubridate_1.6.0    jsonlite_1.4       evaluate_0.10     
## [13] nlme_3.1-131       gtable_0.2.0       lattice_0.20-35   
## [16] psych_1.7.3.21     DBI_0.6-1          yaml_2.1.14       
## [19] parallel_3.4.0     haven_1.0.0        xml2_1.1.1        
## [22] stringr_1.2.0      httr_1.2.1         knitr_1.15.1      
## [25] hms_0.3            rprojroot_1.2      grid_3.4.0        
## [28] R6_2.2.0           readxl_1.0.0       foreign_0.8-68    
## [31] rmarkdown_1.5      modelr_0.1.0       reshape2_1.4.2    
## [34] magrittr_1.5       backports_1.0.5    scales_0.4.1      
## [37] htmltools_0.3.6    rvest_0.3.2        assertthat_0.2.0  
## [40] mnormt_1.5-5       colorspace_1.3-2   labeling_0.3      
## [43] stringi_1.1.5      RCurl_1.95-4.8     lazyeval_0.2.0    
## [46] munsell_0.4.3      broom_0.4.2

Autoencoders and anomaly detection with machine learning in fraud analytics

2017-05-01 00:00:00 +0000

All my previous posts on machine learning have dealt with supervised learning. But we can also use machine learning in an unsupervised way, e.g. for clustering and (non-linear) dimensionality reduction.

For this task, I am using Kaggle’s credit card fraud dataset from the following study:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

The dataset gives > 280,000 instances of credit card use and for each transaction, we know whether it was fraudulent or not.

Datasets like this need special treatment when performing machine learning because they are severely unbalanced: in this case, only 0.17% of all transactions are fraudulent.

While we could try to work with classifiers, like random forests or support vector machines, by applying over- or under-sampling techniques, we can alternatively try to find anomalies in the data (assuming we expect our fraud cases to be anomalies within the whole dataset).

When dealing with such a severe unbalance of response labels, we also need to be careful when measuring model performance. Because there are only a handful of fraudulent instances, a model that predicts everything as non-fraud will already achieve a > 99% accuracy. But despite its high accuracy, such a model won’t necessarily help us find fraudulent cases - the proverbial “needle-in-a-haystack” - that we actually want to find!

Below, I will show how you can use autoencoders and anomaly detection, how you can use autoencoders to pre-train a classification model and how you can measure model performance on unbalanced data.


Exploring the data

library(tidyverse)
# download from https://www.kaggle.com/dalpozz/creditcardfraud
creditcard <- read.csv("creditcard.csv")

The dataset contains numerical input variables V1 to V28, which are the result of a PCA transformation of the original features (which could not be provided due to confidentiality issues).

The response variable Class tells us whether a transaction was fraudulent (value = 1) or not (value = 0).

creditcard %>%
  ggplot(aes(x = Class)) +
    geom_bar(color = "grey", fill = "lightgrey") +
    theme_bw()
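
To put numbers on this imbalance (a small sketch using base R), we can look at the class proportions and at the accuracy a trivial model would achieve by always predicting “non-fraud”; this should reproduce the roughly 0.17% of fraudulent transactions mentioned above:

# proportion of non-fraud (0) and fraud (1) transactions
prop.table(table(creditcard$Class))

# accuracy of a model that always predicts "non-fraud"
mean(creditcard$Class == 0)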

There are two additional features, Time (time in seconds between each transaction and the first transaction) and Amount (how much money was transferred in this transaction).

Because Time only tells us the order in which transactions have been done, it doesn’t actually tell us anything about the actual times (i.e. time of day) of the transactions. Therefore, I am normalizing the times by day and binning them into four groups according to time of day.

summary(creditcard$Time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   54202   84692   94814  139320  172792
# how many seconds are 24 hours
# 1 hr = 60 mins = 60 x 60 s = 3600 s
3600 * 24
## [1] 86400
# separate transactions by day
creditcard$day <- ifelse(creditcard$Time > 3600 * 24, "day2", "day1")

# make transaction relative to day
creditcard$Time_day <- ifelse(creditcard$day == "day2", creditcard$Time - 86400, creditcard$Time)

summary(creditcard[creditcard$day == "day1", ]$Time_day)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   38432   54689   52948   70977   86400
summary(creditcard[creditcard$day == "day2", ]$Time_day)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   37843   53425   51705   68182   86392
# bin transactions according to time of day
creditcard$Time <- as.factor(ifelse(creditcard$Time_day <= 38138, "gr1", # mean 1st Qu.
                          ifelse(creditcard$Time_day <= 52327, "gr2", # mean mean
                                 ifelse(creditcard$Time_day <= 69580, "gr3", # mean 3rd Qu
                                        "gr4"))))
creditcard %>%
  ggplot(aes(x = day)) +
    geom_bar(color = "grey", fill = "lightgrey") +
    theme_bw()

We can see now that the transactions in this dataset have all been recorded on two consecutive days and there are roughly the same number of transactions on these two days.

Now, I remove the columns I used to create the Time bins.

creditcard <- select(creditcard, -Time_day, -day)

# convert class variable to factor
creditcard$Class <- factor(creditcard$Class)
creditcard %>%
  ggplot(aes(x = Time)) +
    geom_bar(color = "grey", fill = "lightgrey") +
    theme_bw() +
    facet_wrap( ~ Class, scales = "free", ncol = 2)

The distribution of transactions over the four Time bins shows that the majority of fraud cases happened in group 1 (although I can’t say when exactly, because the original Time feature did not tell us when the first transaction occurred).

I also want to look at the distribution of the amounts of money that were transferred:

summary(creditcard[creditcard$Class == "0", ]$Amount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.65    22.00    88.29    77.05 25691.16
summary(creditcard[creditcard$Class == "1", ]$Amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    9.25  122.21  105.89 2125.87
creditcard %>%
  ggplot(aes(x = Amount)) +
    geom_histogram(color = "grey", fill = "lightgrey", bins = 50) +
    theme_bw() +
    facet_wrap( ~ Class, scales = "free", ncol = 2)

Interestingly, fraudulent credit card transactions had a higher mean amount of money that was transferred, but the maximum amount was much lower compared to regular transactions.


Modeling

For modeling, I am using R’s H2O implementation with the h2o package. For more details and other examples, see my posts on my machine learning webinar, on building neural nets with h2o and on performing grid search for hyperparameter tuning.

library(h2o)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         7 minutes 6 seconds 
##     H2O cluster version:        3.10.4.4 
##     H2O cluster version age:    16 days  
##     H2O cluster name:           H2O_started_from_R_Shirin_nly512 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.60 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)
# convert data to H2OFrame
creditcard_hf <- as.h2o(creditcard)


Then, I am splitting the dataset into training and test sets. Because I want to check how a pre-trained model will perform, I am splitting my data into two separate training sets and one independent test set for final model comparison.

splits <- h2o.splitFrame(creditcard_hf, 
                         ratios = c(0.4, 0.4), 
                         seed = 42)

train_unsupervised  <- splits[[1]]
train_supervised  <- splits[[2]]
test <- splits[[3]]

response <- "Class"
features <- setdiff(colnames(train_unsupervised), response)


Autoencoders

First, I am training the unsupervised neural network model using deep learning autoencoders. With h2o, we can simply set autoencoder = TRUE.

Here, I am applying a technique called “bottleneck” training, where the hidden layer in the middle is very small. This means that my model will have to reduce the dimensionality of the input data (in this case, down to 2 nodes/dimensions).

The autoencoder model will then learn the patterns of the input data irrespective of given class labels. Here, it will learn which credit card transactions are similar and which transactions are outliers or anomalies. We need to keep in mind, though, that autoencoder models will be sensitive to outliers in our data, which might throw off otherwise typical patterns.

model_nn <- h2o.deeplearning(x = features,
                             training_frame = train_unsupervised,
                             model_id = "model_nn",
                             autoencoder = TRUE,
                             reproducible = TRUE, #slow - turn off for real problems
                             ignore_const_cols = FALSE,
                             seed = 42,
                             hidden = c(10, 2, 10), 
                             epochs = 100,
                             activation = "Tanh")

Because training can take a while, I am saving the model:

h2o.saveModel(model_nn, path="model_nn", force = TRUE)
model_nn <- h2o.loadModel("model_nn")
model_nn
## Model Details:
## ==============
## 
## H2OAutoEncoderModel: deeplearning
## Model ID:  model_nn 
## Status of Neuron Layers: auto-encoder, gaussian distribution, Quadratic loss, 776 weights/biases, 16.0 KB, 2,622,851 training samples, mini-batch size 1
##   layer units  type dropout       l1       l2 mean_rate rate_rms momentum
## 1     1    34 Input  0.00 %                                              
## 2     2    10  Tanh  0.00 % 0.000000 0.000000  0.709865 0.320108 0.000000
## 3     3     2  Tanh  0.00 % 0.000000 0.000000  0.048458 0.109033 0.000000
## 4     4    10  Tanh  0.00 % 0.000000 0.000000  0.164717 0.192053 0.000000
## 5     5    34  Tanh         0.000000 0.000000  0.369681 0.425672 0.000000
##   mean_weight weight_rms mean_bias bias_rms
## 1                                          
## 2   -0.039307   0.691302  0.008052 0.965178
## 3   -0.097383   0.314106  0.226376 0.067917
## 4    0.227664   1.089589  0.112032 0.672444
## 5    0.011072   0.605586  0.091124 0.722602
## 
## 
## H2OAutoEncoderMetrics: deeplearning
## ** Reported on training data. **
## 
## Training Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.001472071
## RMSE: (Extract with `h2o.rmse`) 0.03836757
#Convert to autoencoded representation
test_autoenc <- h2o.predict(model_nn, test)


Dimensionality reduction with hidden layers

Because I had used a bottleneck model with two nodes in the hidden layer in the middle, we can use this dimensionality reduction to explore our feature space (similar to what we could do with a principal component analysis). We can extract this hidden feature with the h2o.deepfeatures() function and plot it to show the reduced representation of the input data.

train_features <- h2o.deepfeatures(model_nn, train_unsupervised, layer = 2) %>%
  as.data.frame() %>%
  mutate(Class = as.vector(train_unsupervised[, 31]))
ggplot(train_features, aes(x = DF.L2.C1, y = DF.L2.C2, color = Class)) +
  geom_point(alpha = 0.1)

Here, we do not see a cluster of fraudulent transactions that is distinct from non-fraud instances, so dimensionality reduction with our autoencoder model alone is not sufficient to identify fraud in this dataset.

But we could use the reduced dimensionality representation of one of the hidden layers as features for model training. An example would be to use the 10 features from the first or third hidden layer:

# let's take the third hidden layer
train_features <- h2o.deepfeatures(model_nn, train_unsupervised, layer = 3) %>%
  as.data.frame() %>%
  mutate(Class = as.factor(as.vector(train_unsupervised[, 31]))) %>%
  as.h2o()
features_dim <- setdiff(colnames(train_features), response)
model_nn_dim <- h2o.deeplearning(y = response,
                               x = features_dim,
                               training_frame = train_features,
                               reproducible = TRUE, #slow - turn off for real problems
                               balance_classes = TRUE,
                               ignore_const_cols = FALSE,
                               seed = 42,
                               hidden = c(10, 2, 10), 
                               epochs = 100,
                               activation = "Tanh")
h2o.saveModel(model_nn_dim, path="model_nn_dim", force = TRUE)
model_nn_dim <- h2o.loadModel("model_nn_dim/DeepLearning_model_R_1493574057843_49")
model_nn_dim
## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1493574057843_49 
## Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 184 weights/biases, 7.1 KB, 4,098,024 training samples, mini-batch size 1
##   layer units    type dropout       l1       l2 mean_rate rate_rms
## 1     1    10   Input  0.00 %                                     
## 2     2    10    Tanh  0.00 % 0.000000 0.000000  0.299549 0.363303
## 3     3     2    Tanh  0.00 % 0.000000 0.000000  0.009907 0.012193
## 4     4    10    Tanh  0.00 % 0.000000 0.000000  0.248854 0.126722
## 5     5     2 Softmax         0.000000 0.000000  0.160990 0.078139
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.539949   4.601841 -0.060170 2.765655
## 3 0.000000   -0.384247   1.763649 -1.429832 0.570700
## 4 0.000000   -0.004865   0.774441 -0.320202 0.668747
## 5 0.000000    0.563951   1.365125  0.000000 0.397460
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9974 samples **
## 
## MSE:  0.4905394
## RMSE:  0.7003852
## LogLoss:  2.763673
## Mean Per-Class Error:  0.3856454
## AUC:  0.6656933
## Gini:  0.3313867
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1438 3594 0.714229  =3594/5032
## 1       282 4660 0.057062   =282/4942
## Totals 1720 8254 0.388610  =3876/9974
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.002173 0.706275 300
## 2                       max f2  0.001540 0.842820 335
## 3                 max f0point5  0.002531 0.619024 280
## 4                 max accuracy  0.002531 0.619711 280
## 5                max precision  0.030005 1.000000   0
## 6                   max recall  0.000496 1.000000 388
## 7              max specificity  0.030005 1.000000   0
## 8             max absolute_mcc  0.002173 0.302697 300
## 9   max min_per_class_accuracy  0.002590 0.554634 277
## 10 max mean_per_class_accuracy  0.002531 0.622276 280
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`


For measuring model performance on test data, we need to convert the test data to the same reduced dimensions as the training data:

test_dim <- h2o.deepfeatures(model_nn, test, layer = 3)
h2o.predict(model_nn_dim, test_dim) %>%
  as.data.frame() %>%
  mutate(actual = as.vector(test[, 31])) %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n       freq
##    <chr>  <fctr> <int>      <dbl>
## 1      0       0 16710 0.29506286
## 2      0       1 39922 0.70493714
## 3      1       0     7 0.07608696
## 4      1       1    85 0.92391304

Now, this actually looks quite good in terms of identifying fraud cases: 92% of fraud cases were identified! However, many non-fraud cases were also classified as fraud. For real-life application, this wouldn’t be a good model. Let’s try some other techniques…


Anomaly detection

We can also ask which instances were considered outliers or anomalies within our test data, using the h2o.anomaly() function. Based on the autoencoder model that was trained before, the input data will be reconstructed and for each instance, the mean squared error (MSE) between actual value and reconstruction is calculated.

I am also calculating the mean MSE for both class labels.

anomaly <- h2o.anomaly(model_nn, test) %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  mutate(Class = as.vector(test[, 31]))

mean_mse <- anomaly %>%
  group_by(Class) %>%
  summarise(mean = mean(Reconstruction.MSE))

This, we can now plot:

ggplot(anomaly, aes(x = as.numeric(rowname), y = Reconstruction.MSE, color = as.factor(Class))) +
  geom_point(alpha = 0.3) +
  geom_hline(data = mean_mse, aes(yintercept = mean, color = Class)) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "instance number",
       color = "Class")

As we can see in the plot, there is no perfect classification into fraud and non-fraud cases but the mean MSE is definitely higher for fraudulent transactions than for regular ones.

We can now identify outlier instances by applying an MSE threshold for what we consider outliers. We could e.g. say that we consider every instance with an MSE > 0.02 (chosen according to the plot above) to be an anomaly/outlier.

anomaly <- anomaly %>%
  mutate(outlier = ifelse(Reconstruction.MSE > 0.02, "outlier", "no_outlier"))

anomaly %>%
  group_by(Class, outlier) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: Class [2]
## 
##   Class    outlier     n         freq
##   <chr>      <chr> <int>        <dbl>
## 1     0 no_outlier 56602 0.9994702642
## 2     0    outlier    30 0.0005297358
## 3     1 no_outlier    60 0.6521739130
## 4     1    outlier    32 0.3478260870

As we can see, outlier detection is not sufficient to correctly classify fraudulent credit card transactions either (at least not with this dataset).


Pre-trained supervised model

We can now try using the autoencoder model as a pre-training input for a supervised model. Here, I am again using a neural network. This model will now use the weights from the autoencoder for model fitting.

model_nn_2 <- h2o.deeplearning(y = response,
                               x = features,
                               training_frame = train_supervised,
                               pretrained_autoencoder  = "model_nn",
                               reproducible = TRUE, #slow - turn off for real problems
                               balance_classes = TRUE,
                               ignore_const_cols = FALSE,
                               seed = 42,
                               hidden = c(10, 2, 10), 
                               epochs = 100,
                               activation = "Tanh")
h2o.saveModel(model_nn_2, path="model_nn_2", force = TRUE)
model_nn_2 <- h2o.loadModel("model_nn_2/DeepLearning_model_R_1493574057843_9")
model_nn_2
## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1493574057843_9 
## Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 424 weights/biases, 11.7 KB, 3,643,248 training samples, mini-batch size 1
##   layer units    type dropout       l1       l2 mean_rate rate_rms
## 1     1    34   Input  0.00 %                                     
## 2     2    10    Tanh  0.00 % 0.000000 0.000000  0.274110 0.291925
## 3     3     2    Tanh  0.00 % 0.000000 0.000000  0.004480 0.002651
## 4     4    10    Tanh  0.00 % 0.000000 0.000000  0.210337 0.326239
## 5     5     2 Softmax         0.000000 0.000000  0.137694 0.123830
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1                                                   
## 2 0.000000   -0.006945   1.058583 -0.501068 1.576798
## 3 0.000000    0.021012   0.309286  0.360286 0.105522
## 4 0.000000    0.934361   1.759343  0.304729 0.532527
## 5 0.000000   -0.133149   2.641065  2.392999 3.137845
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9973 samples **
## 
## MSE:  0.01500211
## RMSE:  0.1224831
## LogLoss:  0.04408663
## Mean Per-Class Error:  0.00169424
## AUC:  0.9995075
## Gini:  0.9990149
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error      Rate
## 0      5000   17 0.003388  =17/5017
## 1         0 4956 0.000000   =0/4956
## Totals 5000 4973 0.001705  =17/9973
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.134320 0.998288  96
## 2                       max f2  0.134320 0.999314  96
## 3                 max f0point5  0.134320 0.997263  96
## 4                 max accuracy  0.134320 0.998295  96
## 5                max precision  1.000000 1.000000   0
## 6                   max recall  0.134320 1.000000  96
## 7              max specificity  1.000000 1.000000   0
## 8             max absolute_mcc  0.134320 0.996597  96
## 9   max min_per_class_accuracy  0.134320 0.996612  96
## 10 max mean_per_class_accuracy  0.134320 0.998306  96
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
pred <- as.data.frame(h2o.predict(object = model_nn_2, newdata = test)) %>%
  mutate(actual = as.vector(test[, 31]))
pred %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n        freq
##    <chr>  <fctr> <int>       <dbl>
## 1      0       0 56512 0.997881057
## 2      0       1   120 0.002118943
## 3      1       0    16 0.173913043
## 4      1       1    76 0.826086957
pred %>%
  ggplot(aes(x = actual, fill = predict)) +
    geom_bar() +
    theme_bw() +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap( ~ actual, scales = "free", ncol = 2)

Now, this looks much better! We did miss 17% of the fraud cases but we also did not mis-classify too many of the non-fraud cases.

In real-life, we would now spend some more time trying to improve the model by e.g. performing grid search for hyperparameter tuning, going back to the original features (which we did not have here) and trying different engineered features and/or trying different algorithms. But here, I will leave it at that.


Measuring model performance on highly unbalanced data

Because of the severe bias towards non-fraud cases, we cannot use performance measures like accuracy or area under the curve (AUC), as they would give overly optimistic results based on the high percentage of correct classifications of the majority class.

An alternative to AUC is to use the precision-recall curve or the sensitivity (recall)-specificity curve. To calculate and plot these metrics, we can use the ROCR package. There are different ways to calculate the area under a curve (see the PRROC package for details), but I am going to use a simple function that calculates the area between every consecutive pair of x values (i.e. x1 - x0, x2 - x1, etc.) under the corresponding values of y, which is essentially the trapezoidal rule.

library(ROCR)

# http://stackoverflow.com/questions/24563061/computing-integral-of-a-line-plot-in-r
line_integral <- function(x, y) {
  dx <- diff(x)
  end <- length(y)
  my <- (y[1:(end - 1)] + y[2:end]) / 2
  sum(dx * my)
} 
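
# quick sanity check of line_integral() on a toy example (unrelated to the fraud data):
# the area under y = x on [0, 1] should come out as roughly 0.5
x_test <- seq(0, 1, by = 0.01)
line_integral(x_test, x_test)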

prediction_obj <- prediction(pred$p1, pred$actual)
par(mfrow = c(1, 2))
par(mar = c(5.1,4.1,4.1,2.1))

# precision-recall curve
perf1 <- performance(prediction_obj, measure = "prec", x.measure = "rec") 

x <- perf1@x.values[[1]]
y <- perf1@y.values[[1]]
y[1] <- 0

plot(perf1, main = paste("Area Under the\nPrecision-Recall Curve:\n", round(abs(line_integral(x,y)), digits = 3)))

# sensitivity-specificity curve
perf2 <- performance(prediction_obj, measure = "sens", x.measure = "spec") 

x <- perf2@x.values[[1]]
y <- perf2@y.values[[1]]
y[1] <- 0

plot(perf2, main = paste("Area Under the\nSensitivity-Specificity Curve:\n", round(abs(line_integral(x,y)), digits = 3)))

Precision is the proportion of test cases predicted to be fraud that were indeed fraudulent (i.e. the true positive predictions), while recall or sensitivity is the proportion of fraud cases that were identified as fraud. And specificity is the proportion of non-fraud cases that are identified as non-fraud.

The precision-recall curve tells us the relationship between correct fraud predictions and the proportion of fraud cases that were detected (e.g. if all or most fraud cases were identified, we also have many non-fraud cases predicted as fraud and vice versa). The sensitivity-specificity curve thus tells us the relationship between correctly identified classes of both labels (e.g. if we have 100% correctly classified fraud cases, we will have no correctly classified non-fraud cases and vice versa).
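
To make these definitions concrete, here is a small sketch that computes the three quantities directly from the pred data frame at the default threshold (using the column names from above):

pred %>%
  summarise(precision = sum(predict == "1" & actual == "1") / sum(predict == "1"),
            recall = sum(predict == "1" & actual == "1") / sum(actual == "1"),
            specificity = sum(predict == "0" & actual == "0") / sum(actual == "0"))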

We can also look at this a little bit differently, by manually going through different prediction thresholds and calculating how many cases were correctly classified in the two classes:

thresholds <- seq(from = 0, to = 1, by = 0.1)
pred_thresholds <- data.frame(actual = pred$actual)

for (threshold in thresholds) {
  
  prediction <- ifelse(pred$p1 > threshold, 1, 0)
  prediction_true <- ifelse(pred_thresholds$actual == prediction, TRUE, FALSE)
  pred_thresholds <- cbind(pred_thresholds, prediction_true)

}

colnames(pred_thresholds)[-1] <- thresholds
pred_thresholds %>%
  gather(x, y, 2:ncol(pred_thresholds)) %>%
  group_by(actual, x, y) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = as.numeric(x), y = n, color = actual)) +
    geom_vline(xintercept = 0.6, alpha = 0.5) +
    geom_line() +
    geom_point(alpha = 0.5) +
    theme_bw() +
    facet_wrap(actual ~ y, scales = "free", ncol = 2) +
    labs(x = "prediction threshold",
         y = "number of instances")

This plot tells us that we can increase the number of correctly classified non-fraud cases without losing correctly classified fraud cases when we increase the prediction threshold from the default 0.5 to 0.6:

pred %>%
  mutate(predict = ifelse(pred$p1 > 0.6, 1, 0)) %>%
  group_by(actual, predict) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) 
## Source: local data frame [4 x 4]
## Groups: actual [2]
## 
##   actual predict     n        freq
##    <chr>   <dbl> <int>       <dbl>
## 1      0       0 56558 0.998693318
## 2      0       1    74 0.001306682
## 3      1       0    16 0.173913043
## 4      1       1    76 0.826086957

Our final model now correctly identified 83% of fraud cases and almost 100% of non-fraud cases.


If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ROCR_1.0-7      gplots_3.0.1    h2o_3.10.4.4    dplyr_0.5.0    
##  [5] purrr_0.2.2     readr_1.1.0     tidyr_0.6.1     tibble_1.3.0   
##  [9] ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0       reshape2_1.4.2     haven_1.0.0       
##  [4] lattice_0.20-35    colorspace_1.3-2   htmltools_0.3.6   
##  [7] yaml_2.1.14        foreign_0.8-68     DBI_0.6-1         
## [10] RColorBrewer_1.1-2 modelr_0.1.0       readxl_1.0.0      
## [13] plyr_1.8.4         stringr_1.2.0      munsell_0.4.3     
## [16] gtable_0.2.0       cellranger_1.1.0   rvest_0.3.2       
## [19] caTools_1.17.1     psych_1.7.3.21     evaluate_0.10     
## [22] labeling_0.3       knitr_1.15.1       forcats_0.2.0     
## [25] parallel_3.4.0     broom_0.4.2        Rcpp_0.12.10      
## [28] KernSmooth_2.23-15 scales_0.4.1       backports_1.0.5   
## [31] gdata_2.17.0       jsonlite_1.4       mnormt_1.5-5      
## [34] hms_0.3            digest_0.6.12      stringi_1.1.5     
## [37] grid_3.4.0         rprojroot_1.2      tools_3.4.0       
## [40] bitops_1.0-6       magrittr_1.5       lazyeval_0.2.0    
## [43] RCurl_1.95-4.8     xml2_1.1.1         lubridate_1.6.0   
## [46] assertthat_0.2.0   rmarkdown_1.5      httr_1.2.1        
## [49] R6_2.2.0           nlme_3.1-131       compiler_3.4.0