Learning R for visualising humanities data
I think you should learn R! No really – I’ve spent the last 6-7 weeks learning R so I can visualise the data we’ve collected in the Database of Machine Vision in Art, Games and Narratives, and it’s not as hard as I’d imagined, and I’m thrilled at all I can do with it.
Previously I’ve used Gephi and Excel, and I guess I thought it would just be too hard to learn a programming language, but honestly, it’s been a blast learning R and I wish I’d started ages ago. This blog post is a collection of the resources I’ve found most useful in teaching myself. I hope it helps other humanities scholars who would like to learn R.
But why learn R?
- R is a powerful tool for creating data visualisations and analysing data. The basics are pretty simple, and depending on what you want to do you’ll find specialised tools for almost anything, from statistics, to network analysis, to mapping or interactive graphs and more.
- The community is huge, and R is one of the standard languages for data science in both academia and industry. There are lots of tutorials and examples. If you’re stuck, post your question on a site like Stack Overflow and you’ll often get answers within minutes.
- R makes the analysis process explicit. I can publish the scripts I used to generate a visualisation or a table, and other people can check my steps or build upon my work to do something similar with a different dataset. It makes it easy to follow FAIR principles: making data Findable, Accessible, Interoperable and Reusable. You can even publish an R notebook with embedded code so people can see (and reuse) exactly what you did. Here’s an example I’m working on with our data about machine vision in art, games and narratives. I’ve been learning a lot about FAIR data by working with Jenny Ostrop, who has been helping me work out how to best format and present our data when we deposit it. This is still in progress since I’m still figuring stuff out.
- It’s fun! Kind of like the puzzle appeal of Wordle, but much more satisfying. Because each step is reasonably simple and you can find so many tutorials and examples, you keep feeling that rush of figuring out something new! And then you see something else you’d like to learn and get excited because you actually know what you need to do to figure that out.
If you want to see where I’m up to after six weeks, here is an R notebook with the code and visualisations I’ve been working on just to sort out my data, and here are the network visualisations I’ve been working on this week. It’s very much work-in-progress, but it shows that working through tutorials using my own data is a really productive way of exploring that data – and the network analysis is starting to give unexpected but generative findings that I want to explore more.
How to get started with R
Here’s my recommended progression for a humanities or social science scholar who wants to learn how to visualise categorical data or textual data rather than numeric data.
I started on Coursera and found some tutorials, and then was lucky enough to find Jeffrey Tharsen’s Data Analysis for Linguistic, Cultural, and Historical Research course at the University of Chicago, where I’m a visiting scholar this semester. Jeffrey has let me audit the course, which has been great. There are lots of online courses and tutorials for general data science and for people with programming backgrounds, but when I was starting out (all of six weeks ago) I found it hard to figure out how to analyse textual data, and categorical data like we have in our database, because most of the standard tutorials use numerical datasets and do statistical analyses. This list is for people who are more interested in finding patterns and sorting through categories and words than in figuring out the mean or the standard deviation of census data.
- Take the Coursera course The Data Scientist’s Toolbox to learn how to set up RStudio and GitHub. You don’t have to do this, but I was so glad that I did, because it made everything else much simpler. Weeks 2 (setting up RStudio) and 3 (version control and GitHub) are the most important, although the material about R notebooks and sharing your process is really good too. Coursera has little videos and quizzes, and this course is really pretty nicely set up. It says 18 hours over 4 weeks, but I think I used about 6-7 hours, doing most of the tasks. If you don’t want to take this course, the bare minimum you need is to install R and RStudio on your computer, and you can Google other tutorials for that.
- Start visualising! Start with Chapter 3: Data visualisation in R for Data Science. This walks you through creating your first data visualisations – and it’s such a fun way to get started. They use a built-in dataset, mpg, which is used a lot in R tutorials. It’s very much numerical data, lots of stats about car models, so the kinds of visualisations it produces aren’t much like what many of us need in the humanities, where we more often use textual or categorical data. But it gets you doing so much so fast.
- (optional) Here is another tutorial specifically for digital humanities. I did this one before I found the Data visualisation chapter in step 2, and that worked – I just think going straight to the visualisation instead of starting with data organisation would be more fun. This is definitely a helpful tutorial though.
- Learn the difference between base R and the Tidyverse. Base R is the stuff that’s been in R for decades. The Tidyverse is a collection of packages that includes really easy-to-use visualisation and analysis tools, and I highly recommend focusing on this. However, you want to learn a bit of the base syntax and what a function is and so on, even though you may not really need it much. This free Introduction to R course on DataCamp takes about 4 hours and goes through the basics in an easy-to-grasp way. You could start here instead of with my recommended steps 1 and 2, but I think you’ll have more fun if you see the potential before learning how to subset a data frame or write a function.
- Work through chapter 4 Data Visualisation in R to learn how to organise your datasets using the Tidyverse system, and/or try Rob Kabacoff’s book Data Visualization with R, which has lots of good information. For my data, looking at how to visualise different kinds of categorical data was really helpful and Kabacoff has lots of examples. Kieran Healy’s Data Visualization: A Practical Introduction is also excellent. The trick with any of these is to skim to identify the bits that look interesting to you, based on the kinds of data you want to work with. Then download the datasets they use and work through the examples, following the book exactly. If you have a dataset of your own, try adapting the same scripts to use on your data.
- If you want to learn network visualisation in R, Katherine Ognyanova’s Network Visualization with R tutorial is brilliant. David Schoch’s Network Visualizations in R is good too, and if, like me, you want to convert bipartite networks to one-mode networks, Phil Murphy and Brendan Knapp’s instructions from Bipartite/Two-Mode Networks in igraph will help. And of course you can look at the code I used for my analysis, and even try running the code with our dataset – though be aware it’s in progress!
- If taking the full text of a novel and analysing that is your goal, you could try Matthew Jockers and Rosamond Thalken’s Text Analysis with R: For Students of Literature (see if your library has the second edition online), but be aware they mostly use base R instead of the Tidyverse. I did a bit of this but it’s not my main interest so I haven’t delved deep.
- This week the topic in Jeffrey Tharsen’s class is machine learning using R. I haven’t really tried this but plan to. Here is a tutorial Jeffrey recommended. I’ve read a few chapters of the textbook Jeffrey’s assigned (Brett Lantz: Machine Learning with R (Packt, 2015)), but it’s not open access. If your university library has electronic access, you may want to take a look though.
- When you’re stuck, post a question to Stack Overflow. Read a few questions first to see how to ask a question that’s easy to answer – you want to provide a “miniature” dataset and show the code that you’ve tried that’s not working.
- Finally, I’m really enjoying using R Notebooks to write up my work-in-progress in a way where the code itself is embedded within regular text, and can easily be exported to HTML, PDF, Word or LaTeX. Here’s an in-progress version of the network analysis I’ve been working on this week – if you click on the CODE buttons on the right above each visualisation you can see exactly what I did, and even copy it and try it out yourself. Here is a detailed guide to how to do this, though the basics are very easy and don’t require much.
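To give a flavour of what a first visualisation of categorical data can look like, here’s a minimal sketch of my own (it isn’t taken from any of the tutorials above). It assumes you’ve installed the ggplot2 and dplyr packages, which are both part of the Tidyverse, and it uses the built-in mpg dataset mentioned in step 2. The name class_counts and the axis labels are just my choices for illustration.

```r
# Assumes ggplot2 and dplyr are installed, e.g. via install.packages("tidyverse")
library(ggplot2)  # plotting; also provides the mpg dataset
library(dplyr)    # data wrangling; also provides the %>% pipe

# Count how many car models fall into each category of `class`,
# with the most common category first
class_counts <- mpg %>%
  count(class, sort = TRUE)
print(class_counts)

# Bar chart of the same categorical variable:
# geom_bar() counts the rows in each category for us
p <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  labs(x = "Type of car", y = "Number of models")
print(p)
```

Swap mpg and class for your own data frame and one of its categorical columns, and the same few lines should give you a first bar chart to start exploring from.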
Btw, the comments on my blog aren’t working right now, but if you have questions or suggestions, feel free to ask me on Twitter – I’m @jilltxt.