Explore Predictive Maintenance with flexdashboard

2017-11-02 00:00:00 +0000

I have written the following post about Predictive Maintenance and flexdashboard at my company codecentric’s blog:

Predictive Maintenance is an increasingly popular strategy associated with Industry 4.0; it uses advanced analytics and machine learning to optimize machine costs and output (see Google Trends plot below). A common use-case for Predictive Maintenance is to proactively monitor machines, so as to predict when a check-up is needed to reduce failure and maximize performance. In contrast to traditional maintenance, where each machine has to undergo regular routine check-ups, Predictive Maintenance can save costs and reduce downtime. A machine learning approach to such a problem would be to analyze machine failure over time to train a supervised classification model that predicts failure. Data from sensors and weather information is often used as features in modeling.

With flexdashboard RStudio provides a great way to create interactive dashboards with R. It is an easy and very fast way to present analyses or create story maps. Here, I have used it to demonstrate different analysis techniques for Predictive Maintenance. It uses Shiny run-time to create interactive content.

Continue reading at https://blog.codecentric.de/en/2017/11/explore-predictive-maintenance-flexdashboard/

Blockchain & distributed ML - my report from the data2day conference

2017-09-28 00:00:00 +0000

Yesterday and today I attended the data2day, a conference about Big Data, Machine Learning and Data Science in Heidelberg, Germany. Topics and workshops covered a range of topics surrounding (big) data analysis and Machine Learning, like Deep Learning, Reinforcement Learning, TensorFlow applications, etc. Distributed systems and scalability were a major part of a lot of the talks as well, reflecting the growing desire to build bigger and more complex models that can’t (or would take too long to) run on a single computer. Most of the application examples were built in Python but one talk by Andreas Prawitt was specifically titled “Using R for Predictive Maintenance: an example from the TRUMPF Laser GmbH”. I also saw quite a few graphs that were obviously made with ggplot!

The keynote lecture on Wednesday about Blockchains for AI was given by Trent McConaghy. Blockchain technology is based on a decentralized system of storing and validating data and changes in data. It experiences a huge hype at the moment but it is only starting to gain track in Data Science and Machine Learning as well. I therefore found it a very fitting topic for the keynote lecture! Trent and his colleagues at BigchainDB are implementing an “internet-scale blockchain database for the world” - the Interplanetary Database (IPDB).

“IPDB is a blockchain database that offers decentralized control, immutability and the creation and trading of digital assets. […] As a database for the world, IPDB offers decentralized control, strong governance and universal accessibility. IPDB relies on “caretaker” organizations around the world, who share responsibility for managing the network and governing the IPDB Foundation. Anyone in the world will be able to use IPDB. […]” https://blog.bigchaindb.com/ipdb-announced-as-public-planetary-scale-blockchain-database-7a363824fc14

He presented a number of examples where blockchain technology for decentralized data storage/access can be beneficial to Machine Learning and AI, like exchanging data from self-driving cars, of online market places and for generating art with computers. You can learn more about him here:

The other talks were a mix of high- and low-level topics: from introductions to machine learning, Apache Spark and data analysis with Python to (distributed) data streaming with Kappa architecture or Apache Kafka, containerization with Docker and Kubernetes, data archiving with Apache Cassandra, relevance tuning with Solr and much more. While I spent most of the time at my company’s conference stand, I did hear three of the talks. I summarize each of them below…

  1. Scalable Machine Learning with Apache Spark for Fraud Detection

In this first talk I heard, Dr. Patrick Baier and Dr. Stanimir Dragiev presented their work at Zalando. They built a scalable machine learning framework with Apache Spark, Scala and AWS to assess and predict fraud in online transactions. Zalando is a German online store that sells clothes, shoes and accessories. Normally, they allow registered customers to buy via invoice, i.e. they receive their ordered items before they pay them. This leaves them vulnerable to fraud where item are not paid for. The goal of their data science team is to use customer and basket data to obtain a probability score for how likely a transaction is going to be fraudulent. High-risk payment options, like invoice, can then be disabled in transactions with high fraud probability. To build and run such machine learning models, the Zalando data science team uses a combination of Spark, Scala, R, AWS, SQL, Python, Docker, etc. In their workflow, they use a combination of static and dynamic features, imputing missing values and building a decision model. In order to scale their modeling workflow to process more requests, use more data in training, update models more frequently and/or run more models, they described a workflow that uses Spark, Scala and Amazon Web Services (AWS). Spark’s machine learning library can be used for modeling and scaled horizontally by increasing the number of clusters on which to run the models. Scala provides multi-threading functionality and AWS is used for storing data in S3 and extending computation power depending on changing needs. Finally, they include a model inspector into their workflow to assure comparability of training and test data and check that the model is behaving as expected. Problems that they are dealing with include highly unbalanced data (which is getting even worse the better their models work as they keep reducing the number of fraud cases), delayed labeling due to the long process of the transactions, seasonality in data.

  1. Sparse Data: Don’t Mind the Gap!

In this talk, my colleagues from codecentric Dr. Daniel Pape and Dr. Michael Plümacher showed an example from ad targeting of how to deal with sparse data. Sparse data occurs in many areas, e.g. as rare events over a long period of time or in areas where there are many items and few occurrences per item, like in recommender systems or in natural language processing (NLP). In ad targeting, the measure of success is the rate of the click-through rate (CRT): this is the number of clicks on a given advertisement displayed to a user on a website divided by the total number of advertisements, or impressions. Because financial revenue comes from a high CTR, advertisements should be placed in a way that maximizes their chance of being clicked, i.e. we want to recommend advertisements for specific users that match their interests or are of actual relevance. Sparsity come into play with ad targeting because the number of clicks is very low compared to two metrics: a) from all the potential ads that a user could see, only a small proportion is actually shown to her/him and b) of the ads that a user sees, she/he only clicks on very few. This means that, a CTR matrix of advertisements x targets will have very few combinations that have been clicked (the mean CTR is 0.003) and contain many missing values. The approach they took was to impute the missing values and predict for each target/user the most similar ads from the imputed CTR matrix. This approach worked well for a reasonably large data set but it didn’t perform so well with smaller (and therefore even sparser) data. They then talked about alternative approaches, like grouping users and/or ads into groups in order to reduce the sparsity of the data. Their take-home messages were that 1) there is no one-size-fits-all solution, what works depends on the context and 2) if the underlying data is of bad quality, the results will be sub-optimal - no matter how sophisticated the model.

  1. Distributed TensorFlow with Kubernetes

In the third talk, another colleague of mine from codecentric, Jakob Karalus, explained in detail how to set up a distributed machine learning modelling set-up with TensorFlow and Kubernetes. TensorFlow is used to build neural networks in a graph-based manner. Distributed and parallel machine learning can be necessary when training big neural networks with a lot of training data, very deep neural networks, with complex parameters, grid search for hyper-parameter tuning, etc. A good way to build neural networks in a controlled and stable environment is to use Docker containers. Kubernetes is a container orchestration tool that can set up distribution of nodes from our TensorFlow modeling container. Setting up this distributed system is quite complex, though and Jakob recommended to try to stay on one CPU/GPU as long as possible.

From Biology to Industry. A Blogger’s Journey to Data Science.

2017-09-20 00:00:00 +0000

Today, I have given a webinar for the Applied Epidemiology Didactic of the University of Wisconsin - Madison titled “From Biology to Industry. A Blogger’s Journey to Data Science.”

I talked about how blogging about R and Data Science helped me become a Data Scientist. I also gave a short introduction to Machine Learning, Big Data and Neural Networks.

My slides can be found here: https://www.slideshare.net/ShirinGlander/from-biology-to-industry-a-bloggers-journey-to-data-science

Why I use R for Data Science - An Ode to R

2017-09-19 00:00:00 +0000

I have written a blog post about why I love R and prefer it to other languages. The post is on my new site, but since it isn’t on R-bloggers yet I am also posting the link here:

Working in Data Science, I often feel like I have to justify using R over Python. And while I do use Python for running scripts in production, I am much more comfortable with the R environment. Basically, whenever I can, I use R for prototyping, testing, visualizing and teaching. But because personal gut-feeling preference isn’t a very good reason to give to (scientifically minded) people, I’ve thought a lot about the pros and cons of using R. This is what I came up with why I still prefer R.

I can’t stress enough how much I appreciate all the people who are involved in the R-community; who write packages, tutorials, blogs, who share information, provide support and who think about how to make data analysis easy, more convenient and - dare I say - fun!

Continue reading…

Moving my blog to blogdown

2017-09-14 00:00:00 +0000

It’s been a long time coming but I finally moved my blog from Jekyll/Bootstrap on Github pages to blogdown, Hugo and Netlify! Moreover, I also now have my own domain name www.shirin-glander.de. :-)

I followed the blogdown ebook to set up my blog. I chose Thibaud Leprêtre’s tranquilpeak theme. It looks much more polished than my old blog.

My old blog will remain where it is, so that all the links that are out there will still work (and I don’t have to go through the hassle of migrating all my posts to my new site).

From this point on, all new posts will go up on my new page!

Data Science for Fraud Detection

2017-09-06 00:00:00 +0000

I have written the following post about Data Science for Fraud Detection at my company codecentric’s blog:

Fraud can be defined as “the crime of getting money by deceiving people” (Cambridge Dictionary); it is as old as humanity: whenever two parties exchange goods or conduct business there is the potential for one party scamming the other. With an ever increasing use of the internet for shopping, banking, filing insurance claims, etc. these businesses have become targets of fraud in a whole new dimension. Fraud has become a major problem in e-commerce and a lot of resources are being invested to recognize and prevent it.

Traditional approaches to identifying fraud have been rule-based. This means that hard and fast rules for flagging a transaction as fraudulent have to be established manually and in advance. But this system isn’t flexible and inevitably results in an arms-race between the seller’s fraud detection system and criminals finding ways to circumnavigate these rules. The modern alternative is to leverage the vast amounts of Big Data that can be collected from online transactions and model it in a way that allows us to flag or predict fraud in future transactions. For this, Data Science and Machine Learning techniques, like Deep Neural Networks (DNNs), are the obvious solution!

Here, I am going to show an example of how Data Science techniques can be used to identify fraud in financial transactions. I will offer some insights into the inner workings of fraud analysis, aimed at non-experts to understand.

Continue reading at https://blog.codecentric.de/en/2017/09/data-science-fraud-detection/

The blog post is also available in German.

Migrating from GitHub to GitLab with RStudio (Tutorial)

2017-09-04 00:00:00 +0000

GitHub vs. GitLab

Git is a distributed implementation of version control. Many people have written very eloquently about why it is a good idea to use version control, not only if you collaborate in a team but also if you work on your own; one example is this article from RStudio’s Support pages.

In short, its main feature is that version control allows you to keep track of the changes you make to your code. It will also keep a history of all the changes you have made in the past and allows you to go back to specific versions if you made a major mistake. And Git makes collaborating on the same code very easy.

Most R packages are also hosted on GitHub. You can check out their R code in the repositories if you want to get a deeper understanding of the functions, you can install the latest development versions of packages or install packages that are not on CRAN. The issue tracker function of GitHub also makes it easy to report and respond to issues/problems with your code.

Why would you want to leave GitHub?

Public repositories are free on GitHub but you need to pay for private repos (if you are a student or work in academia, you get private repos for free). Since I switched from academia to industry lately and no longer fulfil these criteria, all my private repos would have to be switched to public in the future. Here, GitLab is a great alternative!

GitLab offers very similar functionalities as GitHub. There are many pros and cons for using GitHub versus GitLab but for me, the selling point was that GitLab offers unlimited private projects and collaborators in its free plan.


Migrating from GitHub to GitLab with RStudio is very easy! Here, I will show how I migrated my GitHub repositories of R projects, that I work with from within RStudio, to GitLab.

Beware, that ALL code snippets below show Terminal code (they are NOT from the R console)!

Migrating existing repositories

You first need to set up your GitLab account (you can login with your GitHub account) and connect your old GitHub account. Under Settings &Account, you will find “Social sign-in”; here click on “Connect” next to the GitHub symbol (if you signed in with your GitHub account, it will already be connected).

Once you have done this, you can import all your GitHub repositories to GitLab. To do this, you first need to create a new project. Click on the drop-down arrow next to the plus sign in the top-right corner and select “New project”. This will open the following window:

Here, choose “Import project from GitHub” and choose the repositories you want to import.

If you go into one of your repositories, GitLab will show you a message at the top of the site that tells you that you need to add an SSH key. The SSH key is used for secure communication between the GitLab server and your computer when you want to share information, like push/pull commits.

If you already work with GitHub on your computer, you will have an SSH key set up and you can copy your public SSH key to GitLab. Follow the instructions here.

Here is how you do it on a Mac:

  1. Look for your public key and copy it to the clipboard
cat ~/.ssh/id_rsa.pub
pbcopy < ~/.ssh/id_rsa.pub

Then paste it into the respective field here.

The next step is to change the remote URL for pushing/pulling your project from RStudio. In your Git window (tab next to “Environment” and “History” for me), click on Settings and “Shell”.

Then write in the shell window that opened:

git remote set-url origin git@<GITLABHOST>:<ORGNAME>/<REPO>.git

You can copy the link in the navigation bar of your repo on GitLab.

Check that you now have the correct new gitlab path by going to “Tools”, “Project Options” and “Git/SVN”.

Also check your SSH key configuration with:

ssh -T git@<GITLABHOST>

If you get the following message

The authenticity of host 'gitlab.com (' can't be established.
ECDSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?

type “yes” (and enter passphrase if prompted).

If everything is okay, you now get a message saying Welcome to GitLab!

Now, you can commit, push and pull from within RStudio just as you have done before!

In case of problems with pushing/pulling

In my case, I migrated both, my private as well as my company’s GitHub repos to GitLab. While my private repos could be migrated without a hitch, migrating my company’s repos was a bit more tricky (because they had additional security settings, I assume).

Here is how I solved this problem with my company’s repos:

I have protected my SSH key with a passphrase. When pushing or pulling commits via the shell with git pull and git push origin master, I am prompted to enter my passphrase and everything works fine. Pushing/pulling from within RStudio, however, threw an error:

ssh_askpass: exec(/usr/X11R6/bin/ssh-askpass): No such file or directory
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I am using a MacBook Pro with MacOS Sierra version 10.12.6, so this might not be an issue with another operating system.

The following solution worked for me:

  1. Add your SSH key
ssh-add ~/.ssh/id_rsa
  1. And reinstall VS Code

Now I could commit, push and pull from within RStudio just as before!

Social Network Analysis and Topic Modeling of codecentric’s Twitter friends and followers

2017-07-28 00:00:00 +0000

I have written the following post about Social Network Analysis and Topic Modeling of codecentric’€™s Twitter friends and followers for codecentric’s blog:

Recently, Matthias Radtke has written a very nice blog post on Topic Modeling of the codecentric Blog Articles, where he is giving a comprehensive introduction to Topic Modeling. In this article I am showing a real-world example of how we can use Data Science to gain insights from text data and social network analysis.

I am using publicly available Twitter data to characterize codecentric’€™s friends and followers for identifying the most “influential”€ followers and using text analysis tools like sentiment analysis to characterize their interests from their user descriptions, performing Social Network Analysis on friends, followers and a subset of second degree connections to identify key players who will be able to pass on information to a wide reach of other users and combing this network analysis with topic modeling to identify meta-groups with similar interests.

Knowing the interests and social network positions of our followers allows us to identify key users who are likely to retweet posts that fall within their range of interests and who will reach a wide audience.

Continue reading at https://blog.codecentric.de/en/2017/07/combining-social-network-analysis-topic-modeling-characterize-codecentrics-twitter-friends-followers/

The entire analysis has been done in R 3.4.0 and you can find my code on Github.

How to do Optical Character Recognition (OCR) of non-English documents in R using Tesseract?

2017-07-17 00:00:00 +0000

One of the many great packages of rOpenSci has implemented the open source engine Tesseract.

Optical character recognition (OCR) is used to digitize written or typed documents, i.e. photos or scans of text documents are “translated” into a digital text on your computer.

While this might seem like a trivial task at first glance, because it is so easy for our human brains. When reading text, we make use of our built-in word and sentence “autocomplete” that we learned from experience. But the same task is really quite difficult for a computer to recognize typed words correctly, especially if the document is of low quality.

One of the best open-source engines today is Tesseract. You can run tesseract from the command-line or - with the help of rOpenSci’s tesseract package - run it conveniently from within R!

Tesseract uses language specific training data to optimize OCR based on learned context. Therefore, it is much better at recognizing words in coherent sentences than at recognizing single words or abbreviations (we can see this e.g. with address lines in documents).

The default language is English, and you can find numerous examples on how to run OCR with this default. But running tesseract with a different language turned out to need a few additional tweaks, which I want to present here.

Installing tesseract

Check out the package vignette for instructions on how to install the libtesseract C++ library and the tesseract R package on your computer.


Because I’m German and will therefore use a German scanned document as an example, I first needed to install the German training dataset. Even though I installed it together with tesseract (the command-line tool), I still had to install it again for use with R:

If you happen to get an error, that it still didn’t find the correct language data where it expected (as I did), note the path that is given in the error message. This tells you where it looks for the training data and you can simply download the training data manually and copy it to the given path.

Here again, you need to make sure that you download training data for the correct version of tesseract. The link that is given in the package documentation turned out to point to a different version of tesseract than I was using. If this is the case, you will get a warning (Params model::Incomplete line) when running ocr().

As you can see, I had Tesseract version 3 installed:

system("tesseract --version")
tesseract 3.05.01
  libjpeg 8d : libpng 1.6.29 : libtiff 4.0.8 : zlib 1.2.8

So, I also needed to download the training data for version 3 (the default is for version 4), which you can find here and copy it to tesseracts language folder.

Image processing

Image quality is essential for good OCR! Tesseract performs different image processing steps internally with the Leptonica library but it is still a good idea to improve the image manually before running tesseract.

Here, I am using two random images from the internet:

  1. a manual for a printer and

  2. a description of a game

Image nr. 1 is machine-generated and should therefore yield much better results than image nr. 2, which looks like it was typewriter-written and moreover, is skewed.

This, we can also do from within R, making use of another one of rOpenSci’s packages: magick.

However, it seems that magick doesn’t have a function that converts an image to black and white, so I am using EBImage first. Because magick wants an magick image object for input, I saved the black-and-white image first and then read it in again with image_read(). This isn’t exactly an elegant solution, so if someone knows of a better way to do this, please let me know!

library(EBImage) # install from Bioconductor!
color.image <- readImage("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan.png")
bw.image <- channel(color.image,"gray")
color.image <- readImage("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2.jpg")
bw.image <- channel(color.image,"gray")
image1 <- image_read("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_bw.png")
image2 <- image_read("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2_bw.jpg")

Now, we can follow Tesseract’s guidelines on how to improve image quality:

image_bearb1 <- image1 %>%
    image_scale("x2000") %>%                        # rescale
    image_background("white", flatten = TRUE) %>%   # set background to white
    image_trim() %>%                                # Trim edges that are the background color from the image.
    image_noise() %>%                               # Reduce noise in image using a noise peak elimination filter
    image_enhance() %>%                             # Enhance image (minimize noise)
    image_normalize() %>%                           # Normalize image (increase contrast by normalizing the pixel values to span the full range of color values).
    image_contrast(sharpen = 1) %>%                 # increase contrast
    image_deskew(treshold = 40)                     # deskew image -> creates negative offset in some scans
image_bearb2 <- image2 %>%
    image_scale("x2000") %>%                        # rescale
    image_background("white", flatten = TRUE) %>%   # set background to white
    image_trim() %>%                                # Trim edges that are the background color from the image.
    image_noise() %>%                               # Reduce noise in image using a noise peak elimination filter
    image_enhance() %>%                             # Enhance image (minimize noise)
    image_normalize() %>%                           # Normalize image (increase contrast by normalizing the pixel values to span the full range of color values).
    image_contrast(sharpen = 1) %>%                 # increase contrast
    image_deskew(treshold = 40)                     # deskew image -> creates negative offset in some scans

If we want to inspect our images from within R, we can use magick’s image_browse() function.



Tesseract’s central function is called ocr().

For some reason, I couldn’t solve the error message I got when directly pointing the processed images to the ocr() function: Magick: TIFF: negative image positions unsupported. This error results from image_deskew() but all potential solutions don’t seem to be implemented with the R package of ImageMagick, so I had to resort to a work-around: I saved the images first and then pointed ocr() to the images. This worked without a hitch!

image_write(image_bearb1, path = "/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_bearb.png", format = "png")
image_write(image_bearb2, path = "/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2_bearb.jpg", format = "jpg")

When we are not working with an English document, we can specify the language with the engine = tesseract(language = "deu") option. We can also set a whole range of parameters but for now, I will only show how to use a whitelist: by specifying characters in a whitelist, tesseract will only look for these in the document. You can e.g. use this if you only expect numbers in you document. Here I am not being very restrictive, though…

whitelist <- "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&/()=?+"

text1 <- ocr("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_bearb.png",
             engine = tesseract(language = "deu",
                                options = list(tessedit_char_whitelist = whitelist)))

text2 <- ocr("/Users/shiringlander/Documents/Github/Blog_posts_prep/ocr/beispiel_scan_2_bearb.jpg",
             engine = tesseract(language = "deu",
                                options = list(tessedit_char_whitelist = whitelist)))

Performance evaluation

Evaluating how good our model performed isn’t trivial either. While we as humans can very easily judge the accuracy of our digitzed text just by looking at it, we need a standardized and automated way to evaluate model performance.

One way would be to manually transcribe your document and check it against the text that tesseract extracted. However, this is very labor-intensive and time-consuming, especially when you’re working with a large number of documents!

Another, slightly less arduous way would be to create a list of relevant words that we want to have recognized. This could be adapted to include any list of given words that are important for you to be recognized (if you wanted to be able to search for them among your documents, for example).

Or, as I am going to do here, we can go a much easier way and compare the words that tesseract recognized to a list of words from a dictionary to get an idea about how many of them are actual words. This is of course not without problems: Some words may not be in our dictionary or they are borrowed from another language, they might be names of people or places, addresses, abbreviations, etc. Also, because we don’t have a benchmark for our document, we will only get an estimation of how many words were correctly recognized. But we can really only compare different tesseract models against each other with this method.

Because we use different grammatical forms of words, we also want to stem both our recognized words and the dictionary before comparing them!

Here, I am using the German Open Thesaurus.

To separate the text and the dictionary into distinct words, I am using the tidytext package together with SnowballC for word stemming. Finally, I am going to remove duplicate entries.


openthes <- data.frame(words = read_lines("/Users/shiringlander/Documents/Projekte/OCR_Provinzial/German_dict/OpenThesaurus-Textversion/openthesaurus.txt", skip = 18)) %>%
  mutate(words = as.character(words)) %>%
  unnest_tokens(word, words) %>%
  mutate(word = wordStem(word, language = "deu")) %>%

This leaves us with 77453 German reference words.

Now, we can do the same with our tesseract output.

text_1_df <- data.frame(text = read.delim(textConnection(text1),    # make text into dataframe
                                          header = FALSE, 
                                          sep = "\n", 
                                          strip.white = TRUE)) %>%
  mutate(text = as.character(V1)) %>%
  unnest_tokens(word, text) %>%                                      # separate words
  mutate(word = wordStem(word, language = "deu"))                    # use word stem

text_2_df <- data.frame(text = read.delim(textConnection(text2),    # make text into dataframe
                                          header = FALSE, 
                                          sep = "\n", 
                                          strip.white = TRUE)) %>%
  mutate(text = as.character(V1)) %>%
  unnest_tokens(word, text) %>%                                      # separate words
  mutate(word = wordStem(word, language = "deu"))                    # use word stem

To get an estimate about model performance, I am then counting how many of the words that were recognized in each document are also part of the German dictionary (and therefore likely to be real words) and plot the results.

res1 <- text_1_df %>%
    mutate(in_dict = ifelse(word %in% openthes$word, TRUE, FALSE)) %>%
    count(in_dict) %>%
    mutate(percent = n / nrow(text_1_df) * 100,
           image = "image 1")

res2 <- text_2_df %>%
    mutate(in_dict = ifelse(word %in% openthes$word, TRUE, FALSE)) %>%
    count(in_dict) %>%
    mutate(percent = n / nrow(text_2_df) * 100,
           image = "image 2")

Between 80 and 90% of recognized words seem to be actual words!

Of course, the less “standard” a document is, the more difficult it will be to extract text correctly but I’d say this isn’t bad for a first attempt!

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
##  [1] bindrcpp_0.2    SnowballC_0.5.1 tidytext_0.1.3  dplyr_0.7.1    
##  [5] purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3   
##  [9] ggplot2_2.2.1   tidyverse_1.1.1 magick_0.4      EBImage_4.18.0 
## [13] tesseract_1.4  
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1      reshape2_1.4.2      haven_1.1.0        
##  [4] lattice_0.20-35     colorspace_1.3-2    htmltools_0.3.6    
##  [7] yaml_2.1.14         rlang_0.1.1         foreign_0.8-69     
## [10] glue_1.1.1          RColorBrewer_1.1-2  BiocGenerics_0.22.0
## [13] modelr_0.1.0        readxl_1.0.0        jpeg_0.1-8         
## [16] bindr_0.1           plyr_1.8.4          stringr_1.2.0      
## [19] cellranger_1.1.0    munsell_0.4.3       gtable_0.2.0       
## [22] rvest_0.3.2         psych_1.7.5         evaluate_0.10.1    
## [25] labeling_0.3        knitr_1.16          forcats_0.2.0      
## [28] parallel_3.4.0      tokenizers_0.1.4    broom_0.4.2        
## [31] Rcpp_0.12.11        backports_1.1.0     scales_0.4.1       
## [34] jsonlite_1.5        abind_1.4-5         mnormt_1.5-5       
## [37] hms_0.3             png_0.1-7           digest_0.6.12      
## [40] stringi_1.1.5       tiff_0.1-5          grid_3.4.0         
## [43] rprojroot_1.2       tools_3.4.0         magrittr_1.5       
## [46] lazyeval_0.2.0      janeaustenr_0.1.5   pkgconfig_2.0.1    
## [49] Matrix_1.2-10       xml2_1.1.1          lubridate_1.6.0    
## [52] assertthat_0.2.0    rmarkdown_1.6       httr_1.2.1         
## [55] R6_2.2.2            fftwtools_0.9-8     nlme_3.1-131       
## [58] compiler_3.4.0

Characterizing Twitter followers with tidytext

2017-06-28 00:00:00 +0000

Lately, I have been more and more taken with tidy principles of data analysis. They are elegant and make analyses clearer and easier to comprehend. Following the tidyverse and ggraph, I have been quite intrigued by applying tidy principles to text analysis with Julia Silge and David Robinson’s tidytext.

In this post, I will explore tidytext with an analysis of my Twitter followers’ descriptions to try and learn more about the people who are interested in my tweets, which are mainly about Data Science and Machine Learning.

Resources I found useful for this analysis were http://www.rdatamining.com/docs/twitter-analysis-with-r and http://tidytextmining.com/tidytext.html

Retrieving Twitter data

I am using twitteR to retrieve data from Twitter (I have also tried rtweet but for some reason, my API key, secret and token (that worked with twitteR) resulted in a “failed to authorize” error with rtweet’s functions).


Once we have set up our Twitter REST API, we get the necessary information to authenticate our access.

consumerKey = "INSERT KEY HERE"
consumerSecret = "INSERT SECRET KEY HERE"
accessToken = "INSERT TOKEN HERE"
options(httr_oauth_cache = TRUE)

setup_twitter_oauth(consumer_key = consumerKey, 
                    consumer_secret = consumerSecret, 
                    access_token = accessToken, 
                    access_secret = accessSecret)

Now, we can access information from Twitter, like timeline tweets, user timelines, mentions, tweets & retweets, followers, etc.

All the following datasets were retrieved on June 7th 2017, converted to a data frame for tidy analysis and saved for later use:

  • the last 3200 tweets on my timeline
my_name <- userTimeline("ShirinGlander", n = 3200, includeRts=TRUE)
my_name_df <- twListToDF(my_name)
save(my_name_df, file = "my_name.RData")
  • my last 3200 mentions and retweets
my_mentions <- mentions(n = 3200)
my_mentions_df <- twListToDF(my_mentions)
save(my_mentions_df, file = "my_mentions.RData")

my_retweets <- retweetsOfMe(n = 3200)
my_retweets_df <- twListToDF(my_retweets)
save(my_retweets_df, file = "my_retweets.RData")
  • the last 3200 tweets to me
tweetstome <- searchTwitter("@ShirinGlander", n = 3200)
tweetstome_df <- twListToDF(tweetstome)
save(tweetstome_df, file = "tweetstome.RData")
  • my friends and followers
user <- getUser("ShirinGlander")

friends <- user$getFriends() # who I follow
friends_df <- twListToDF(friends)
save(friends_df, file = "my_friends.RData")

followers <- user$getFollowers() # my followers
followers_df <- twListToDF(followers)
save(followers_df, file = "my_followers.RData")

Analyzing friends and followers

In this post, I will have a look at my friends and followers.


I am going to use packages from the tidyverse (tidyquant for plotting).

  • Number of friends (who I follow on Twitter): 225

  • Number of followers (who follows me on Twitter): 324

  • Number of friends who are also followers: 97

What languages do my followers speak?

One of the columns describing my followers is which language they have set for their Twitter account. Not surprisingly, English is by far the most predominant language of my followers, followed by German, Spanish and French.

followers_df %>%
  count(lang) %>%
  droplevels() %>%
  ggplot(aes(x = reorder(lang, desc(n)), y = n)) +
    geom_bar(stat = "identity", color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "language ISO 639-1 code",
         y = "number of followers")

Who are my most “influential” followers (i.e. followers with the biggest network)?

I also have information about the number of followers that each of my followers have (2nd degree followers). Most of my followers are followed by up to ~ 1000 people, while only a few have a very large network.

followers_df %>%
  ggplot(aes(x = log2(followersCount))) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq() +
    labs(x = "log2 of number of followers",
         y = "density")

How active are my followers (i.e. how often do they tweet)

The followers data frame also tells me how many statuses (i.e. tweets) each of followers have. To make the numbers comparable, I am normalizing them by the number of days that they have had their accounts to calculate the average number of tweets per day.

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  ggplot(aes(x = log2(statusesCount_pDay))) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +

Who are my followers with the biggest network and who tweet the most?

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  select(screenName, followersCount, statusesCount_pDay) %>%
  arrange(desc(followersCount)) %>%
##         screenName followersCount statusesCount_pDay
## 1        dr_morton         150937           71.35193
## 2    Scientists4EU          66117           17.64389
## 3       dr_morton_          63467           46.57763
## 4   NewScienceWrld          60092           54.65874
## 5     RubenRabines          42286           25.99592
## 6  machinelearnbot          27427          204.67061
## 7  BecomingDataSci          16807           25.24069
## 8       joelgombin           6566           21.24094
## 9    renato_umeton           1998           19.58387
## 10 FranPatogenLoco            311           28.92593
followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  select(screenName, followersCount, statusesCount_pDay) %>%
  arrange(desc(statusesCount_pDay)) %>%
##         screenName followersCount statusesCount_pDay
## 1  machinelearnbot          27427          204.67061
## 2        dr_morton         150937           71.35193
## 3   NewScienceWrld          60092           54.65874
## 4       dr_morton_          63467           46.57763
## 5  FranPatogenLoco            311           28.92593
## 6     RubenRabines          42286           25.99592
## 7  BecomingDataSci          16807           25.24069
## 8       joelgombin           6566           21.24094
## 9    renato_umeton           1998           19.58387
## 10   Scientists4EU          66117           17.64389

Is there a correlation between number of followers and number of tweets?

Indeed, there seems to be a correlation that users with many followers also tend to tweet more often.

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  ggplot(aes(x = followersCount, y = statusesCount_pDay, color = days)) +
    geom_smooth(method = "lm") +
    geom_point() +
    scale_color_continuous(low = palette_light()[1], high = palette_light()[2]) +

Tidy text analysis

Next, I want to know more about my followers by analyzing their Twitter descriptions with the tidytext package.


To prepare the data, I am going to unnest the words (or tokens) in the user descriptions, convert them to the word stem, remove stop words and urls.


tidy_descr <- followers_df %>%
  unnest_tokens(word, description) %>%
  mutate(word_stem = wordStem(word)) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl("\\.|http", word))

What are the most commonly used words in my followers’ descriptions?

The first question I want to ask is what words are most common in my followers’ descriptions.

Not surprisingly, the most common word is “data”. I do tweet mostly about data related topics, so it makes sense that my followers are mostly likeminded. The rest is also related to data science, machine learning and R.

tidy_descr %>%
  count(word_stem, sort = TRUE) %>%
  filter(n > 20) %>%
  ggplot(aes(x = reorder(word_stem, n), y = n)) +
    geom_col(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "count of word stem in all followers' descriptions")

This, we can also show with a word cloud.

tidy_descr %>%
  count(word_stem) %>%
  mutate(word_stem = removeNumbers(word_stem)) %>%
  with(wordcloud(word_stem, n, max.words = 100, colors = palette_light()))

Instead of looking for the most common words, we can also look for the most common ngrams: here, for the most common word pairs (bigrams) in my followers’ descriptions.

tidy_descr_ngrams <- followers_df %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  filter(!grepl("\\.|http", bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts <- tidy_descr_ngrams %>%
  count(word1, word2, sort = TRUE)
bigram_counts %>%
  filter(n > 10) %>%
  ggplot(aes(x = reorder(word1, -n), y = reorder(word2, -n), fill = n)) +
    geom_tile(alpha = 0.8, color = "white") +
    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
    coord_flip() +
    theme_tq() +
    theme(legend.position = "right") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "first word in pair",
         y = "second word in pair")

These, we can also show as a graph:

bigram_graph <- bigram_counts %>%
  filter(n > 5) %>%


a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color =  palette_light()[1], size = 5, alpha = 0.8) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 0.5) +

We can also use bigram analysis to identify negated meanings (this will become relevant for sentiment analysis later). So, let’s look at which words are preceded by “not” or “no”.

bigrams_separated <- followers_df %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  filter(!grepl("\\.|http", bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not" | word1 == "no") %>%
  filter(!word2 %in% stop_words$word)

not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE) %>%
not_words %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * score, fill = n * score > 0)) +
    geom_col(show.legend = FALSE) +
    scale_fill_manual(values = palette_light()) +
    labs(x = "",
         y = "Sentiment score * number of occurrences",
         title = "Words preceded by \"not\"") +
    coord_flip() +

What’s the predominant sentiment in my followers’ descriptions?

For sentiment analysis, I will exclude the words with a negated meaning from nrc and switch their positive and negative meanings from bing (although in this case, there was only one negated word, “endorsement”, so it won’t make a real difference).

tidy_descr_sentiment <- tidy_descr %>%
  left_join(select(bigrams_separated, word1, word2), by = c("word" = "word2")) %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  rename(nrc = sentiment.x, bing = sentiment.y) %>%
  mutate(nrc = ifelse(!is.na(word1), NA, nrc),
         bing = ifelse(!is.na(word1) & bing == "positive", "negative", 
                       ifelse(!is.na(word1) & bing == "negative", "positive", bing)))
tidy_descr_sentiment %>%
  filter(nrc != "positive") %>%
  filter(nrc != "negative") %>%
  gather(x, y, nrc, bing) %>%
  count(x, y, sort = TRUE) %>%
  filter(n > 10) %>%
  ggplot(aes(x = reorder(y, n), y = n)) +
    facet_wrap(~ x, scales = "free") +
    geom_col(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "count of sentiment in followers' descriptions")

Are followers’ descriptions mostly positive or negative?

The majority of my followers have predominantly positive descriptions.

tidy_descr_sentiment %>%
  count(screenName, word, bing) %>%
  group_by(screenName, bing) %>%
  summarise(sum = sum(n)) %>%
  spread(bing, sum, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(x = sentiment)) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +

What are the most common positive and negative words in followers’ descriptions?

tidy_descr_sentiment %>%
  count(word, bing, sort = TRUE) %>%
  acast(word ~ bing, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = palette_light()[1:2],
                   max.words = 100)

Topic modeling: are there groups of followers with specific interests?

Topic modeling can be used to categorize words into groups. Here, we can use it to see whether (some) of my followers can be grouped into subgroups according to their descriptions.

dtm_words_count <- tidy_descr %>%
  mutate(word_stem = removeNumbers(word_stem)) %>%
  count(screenName, word_stem, sort = TRUE) %>%
  ungroup() %>%
  filter(word_stem != "") %>%
  cast_dtm(screenName, word_stem, n)

# set a seed so that the output of the model is predictable
dtm_lda <- LDA(dtm_words_count, k = 5, control = list(seed = 1234))

topics_beta <- tidy(dtm_lda, matrix = "beta")
p1 <- topics_beta %>%
  filter(grepl("[a-z]+", term)) %>% # some words are Chinese, etc. I don't want these because ggplot doesn't plot them correctly
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, color = factor(topic), fill = factor(topic))) +
    geom_col(show.legend = FALSE, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    facet_wrap(~ topic, ncol = 5) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "beta (~ occurrence in topics 1-5)",
         title = "The top 10 most characteristic words describe topic categories.")
user_topic <- tidy(dtm_lda, matrix = "gamma") %>%
  arrange(desc(gamma)) %>%
  group_by(document) %>%
  top_n(1, gamma)
p2 <- user_topic %>%
  group_by(topic) %>%
  top_n(10, gamma) %>%
  ggplot(aes(x = reorder(document, -gamma), y = gamma, color = factor(topic))) +
    facet_wrap(~ topic, scales = "free", ncol = 5) +
    geom_point(show.legend = FALSE, size = 4, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    coord_flip() +
    labs(x = "",
         y = "gamma\n(~ affiliation with topics 1-5)")
grid.arrange(p1, p2, ncol = 1, heights = c(0.7, 0.3))

The upper of the two plots above show the words that were most strongly grouped to five topics. The lower plots show my followers with the strongest affiliation with these five topics.

Because in my tweets I only cover a relatively narrow range of topics (i.e. related to data), my followers are not very diverse in terms of their descriptions and the five topics are not very distinct.

If you find yourself in any of the topics, let me know if you agree with the topic that was modeled for you!

For more text analysis, see my post about text mining and sentiment analysis of a Stuff You Should Know Podcast.

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## other attached packages:
##  [1] gridExtra_2.2.1               topicmodels_0.2-6            
##  [3] reshape2_1.4.2                ggraph_1.0.0                 
##  [5] igraph_1.0.1                  tm_0.7-1                     
##  [7] NLP_0.1-10                    wordcloud_2.5                
##  [9] RColorBrewer_1.1-2            SnowballC_0.5.1              
## [11] tidytext_0.1.3                bindrcpp_0.2                 
## [13] tidyquant_0.5.1               quantmod_0.4-10              
## [15] TTR_0.23-1                    PerformanceAnalytics_1.4.3541
## [17] xts_0.9-7                     zoo_1.8-0                    
## [19] lubridate_1.6.0               dplyr_0.7.0                  
## [21] purrr_0.2.2.2                 readr_1.1.1                  
## [23] tidyr_0.6.3                   tibble_1.3.3                 
## [25] ggplot2_2.2.1                 tidyverse_1.1.1              
## [27] twitteR_1.1.9                
## loaded via a namespace (and not attached):
##  [1] viridis_0.4.0     httr_1.2.1        viridisLite_0.2.0
##  [4] bit64_0.9-7       jsonlite_1.5      modelr_0.1.0     
##  [7] assertthat_0.2.0  stats4_3.4.0      cellranger_1.1.0 
## [10] ggrepel_0.6.5     yaml_2.1.14       slam_0.1-40      
## [13] backports_1.1.0   lattice_0.20-35   glue_1.1.0       
## [16] digest_0.6.12     rvest_0.3.2       colorspace_1.3-2 
## [19] htmltools_0.3.6   Matrix_1.2-10     plyr_1.8.4       
## [22] psych_1.7.5       pkgconfig_2.0.1   broom_0.4.2      
## [25] haven_1.0.0       scales_0.4.1      tweenr_0.1.5     
## [28] ggforce_0.1.1     lazyeval_0.2.0    mnormt_1.5-5     
## [31] magrittr_1.5      readxl_1.0.0      evaluate_0.10    
## [34] tokenizers_0.1.4  janeaustenr_0.1.5 nlme_3.1-131     
## [37] MASS_7.3-47       forcats_0.2.0     xml2_1.1.1       
## [40] foreign_0.8-68    tools_3.4.0       hms_0.3          
## [43] stringr_1.2.0     munsell_0.4.3     compiler_3.4.0   
## [46] rlang_0.1.1       units_0.4-5       rjson_0.2.15     
## [49] labeling_0.3      rmarkdown_1.6     gtable_0.2.0     
## [52] DBI_0.7           R6_2.2.2          knitr_1.16       
## [55] udunits2_0.13     bit_1.1-12        bindr_0.1        
## [58] rprojroot_1.2     Quandl_2.8.0      modeltools_0.2-21
## [61] stringi_1.1.5     parallel_3.4.0    Rcpp_0.12.11