Learning R for visualising humanities data

I think you should learn R! No really – I’ve spent the last 6-7 weeks learning R so I can visualise the data we’ve collected in the Database of Machine Vision in Art, Games and Narratives, and it’s not as hard as I’d imagined, and I’m thrilled at all I can do with it.

Previously I’ve used Gephi and Excel, and I guess I thought it would just be too hard to learn a programming language, but honestly, it’s been a blast learning R and I wish I’d started ages ago. This blog post is a collection of the resources I’ve found most useful in teaching myself. I hope it helps other humanities scholars who would like to learn R.

A network diagram of the most common actions taken by characters interacting with machine vision technologies in videogames, movies, novels etc.

But why learn R?

  1. R is a powerful tool for creating data visualisations and analysing data. The basics are pretty simple, and depending on what you want to do you’ll find specialised tools for almost anything, from statistics, to network analysis, to mapping or interactive graphs and more.
  2. The community is huge, and R is one of the standard languages for data science in both academia and industry. There are lots of tutorials and examples. If you’re stuck, post your question on a site like Stack Overflow and you’ll often get answers within minutes.
  3. R makes the analysis process explicit. I can publish the scripts I used to generate a visualisation or a table, and other people can check my steps or build upon my work to do something similar with a different dataset. It makes it easy to follow FAIR principles: making data Findable, Accessible, Interoperable and Reusable. You can even publish an R notebook with embedded code so people can see (and reuse) exactly what you did. Here’s an example I’m working on with our data about machine vision in art, games and narratives. I’ve been learning a lot about FAIR data by working with Jenny Ostrop, who has been helping me work out how to best format and present our data when we deposit it. This is still in progress since I’m still figuring stuff out.
  4. It’s fun! Kind of like the puzzle appeal of Wordle, but much more satisfying. Because each step is reasonably simple and you can find so many tutorials and examples, you keep feeling that rush of figuring out something new! And then you see something else you’d like to learn and get excited because you actually know what you need to do to figure that out.

If you want to see where I’m up to after six weeks, here is an R notebook with the code and visualisations I’ve been working on just to sort out my data, and here are the network visualisations I’ve been working on this week. It’s very much work-in-progress, but it shows how working through various tutorials but with my own data is a really productive way of exploring my data – and the network analysis is starting to give unexpected but generative findings that I want to explore more.

How to get started with R

Here’s my recommended progression for a humanities or social science scholar who wants to learn how to visualise categorical data or textual data rather than numeric data.

I started on Coursera and found some tutorials, and then was lucky enough to find Jeffrey Tharsen’s Data Analysis for Linguistic, Cultural, and Historical Research course at the University of Chicago, where I’m a visiting scholar this semester. Jeffrey has let me audit the course, which has been great. There are lots of online courses and tutorials for general data science and for people with programming backgrounds, but when I was starting out (all of six weeks ago) I found it hard to figure out how to analyse textual data and categorical data like we have in our database, because most of the standard tutorials use numerical datasets and do statistical analyses. This list is for people who are more interested in finding patterns and sorting through categories and words than in figuring out the mean or the standard deviation of census data.
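To make the difference from numerical tutorials concrete, here is a minimal sketch of what working with categorical data can look like in base R. The toy data frame and its column names (`genre`, `action`) are invented for illustration and are not taken from our database.

```r
# Hypothetical toy data: one row per situation in which a character
# interacts with a technology. All values are invented for illustration.
situations <- data.frame(
  genre  = c("Game", "Game", "Movie", "Novel", "Movie", "Game"),
  action = c("scanning", "hiding", "scanning", "watching", "hiding", "scanning"),
  stringsAsFactors = FALSE
)

# Count how often each action occurs, most frequent first.
action_counts <- sort(table(situations$action), decreasing = TRUE)
print(action_counts)

# Cross-tabulate two categorical variables: genre by action.
genre_by_action <- table(situations$genre, situations$action)
print(genre_by_action)
```

Instead of means and standard deviations, the questions here are about counts and co-occurrences. Passing `action_counts` to base R's `barplot()` should give a quick bar chart of the same counts.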

  1. Take the Coursera course The Data Scientist’s Toolbox to learn how to set up RStudio and GitHub. You don’t have to do this, but I was so glad that I did, because it made everything else much simpler. Weeks 2 (setting up RStudio) and 3 (version control and GitHub) are the most important, although the material on R notebooks and sharing your process is really good too. Coursera has little videos and quizzes, and this course is nicely set up. It says 18 hours over 4 weeks, but I think I spent about 6-7 hours, doing most of the tasks. If you don’t want to take this course, the bare minimum you need is to install R and RStudio on your computer, and you can Google other tutorials for that.
  2. Start visualising! Start with Chapter 3: Data visualisation in R for Data Science. This walks you through creating your first data visualisations – and it’s such a fun way to get started. They use a built-in dataset, mpg, which is used a lot in R tutorials. It’s very much numerical data, lots of stats about car models, so the kind of visualisations this does isn’t much like what many of us need in the humanities where we more often use textual or categorical data. But it gets you doing so much so fast.
  3. (optional) Here is another tutorial specifically for digital humanities. I did this one before I found the data visualisation chapter in step 2, and that worked – I just think going straight to the visualisation instead of starting with data organisation would be more fun. This is definitely a helpful tutorial though.
  4. Learn the difference between base R and the Tidyverse. Base R is the stuff that’s been in R for decades. The Tidyverse is a collection of packages with really easy-to-use visualisation and analysis tools, and I highly recommend focusing on this. However, you want to learn a bit of the base syntax and what a function is and so on, even though you may not really need it much. This free Introduction to R course on DataCamp takes about 4 hours and goes through the basics in an easy-to-grasp way. You could start here instead of with my recommended steps 1 and 2, but I think you’ll have more fun if you see the potential before learning how to subset a data frame or write a function.
  5. Work through chapter 4 Data Visualisation in R to learn how to organise your datasets using the Tidyverse system, and/or try Rob Kabacoff’s book Data Visualization with R, which has lots of good information. For my data, looking at how to visualise different kinds of categorical data was really helpful and Kabacoff has lots of examples. Kieran Healy’s Data Visualization: A Practical Introduction is also excellent. The trick with any of these is to skim to identify the bits that look interesting to you, based on the kinds of data you want to work with. Then download the datasets they use and work through the examples, following the book exactly. If you have a dataset of your own, try adapting the same scripts to use on your data.
  6. If you want to learn network visualisation in R, Katherine Ognyanova’s Network Visualization with R tutorial is brilliant. David Schoch’s Network Visualizations in R is good too, and if, like me, you want to convert bipartite networks to one-mode networks, Phil Murphy and Brendan Knapp’s instructions from Bipartite/Two-Mode Networks in igraph will help. And of course you can look at the code I used for my analysis, and even try running the code with our dataset – though be aware it’s in progress!
  7. If taking the full text of a novel and analysing that is your goal, you could try Matthew Jockers and Rosamond Thalken’s Text Analysis with R: For Students of Literature (see if your library has the second edition online), but be aware they mostly use base R instead of the Tidyverse. I did a bit of this but it’s not my main interest so I haven’t delved deep.
  8. This week the topic in Jeffrey Tharsen’s class is machine learning using R. I haven’t really tried this but plan to. Here is a tutorial Jeffrey recommended. I’ve read a few chapters of the textbook Jeffrey’s assigned, Brett Lantz’s Machine Learning with R (Packt, 2015), but it’s not open access. If your university library has electronic access, you may want to take a look though.
  9. When you’re stuck, post a question to Stack Overflow. Read a few questions first to see how to ask a question that’s easy to answer – you want to provide a “miniature” dataset and show the code that you’ve tried that’s not working.
  10. Finally, I’m really enjoying using R Notebooks to write up my work-in-progress in a way where the code itself is embedded within regular text, and can easily be exported to HTML, PDF, Word or LaTeX. Here’s an in-progress version of the network analysis I’ve been working on this week – if you click on the CODE buttons on the right above each visualisation you can see exactly what I did, and even copy it and try it out yourself. Here is a detailed guide to how to do this, though the basics are very easy and don’t require much.
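As an illustration of the bipartite conversion mentioned in step 6, here is a minimal sketch in base R of projecting a two-mode network onto one mode. The character and action names are hypothetical, and this matrix approach is just one way to do it, not necessarily the method used in the tutorials or in my own scripts.

```r
# Hypothetical two-mode data: which characters (rows) perform which
# actions (columns). 1 means the character performs the action.
incidence <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 1, 1,
    1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("character_a", "character_b", "character_c", "character_d"),
    c("scanning", "hiding", "watching")
  )
)

# One-mode projection onto actions: entry [i, j] counts the characters
# that perform both action i and action j.
action_network <- t(incidence) %*% incidence

# The diagonal counts how many characters perform each action. That is
# not a tie between two actions, so set it to zero.
diag(action_network) <- 0
print(action_network)
```

The result is a weighted adjacency matrix in which a higher number means two actions are shared by more characters. If you have igraph installed, `graph_from_adjacency_matrix()` with `mode = "undirected"` and `weighted = TRUE` should turn it into a graph you can plot, and igraph's own `bipartite_projection()` does the projection directly on a graph object.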

Btw, the comments on my blog aren’t working right now, but if you have questions or suggestions, feel free to ask me on Twitter – I’m @jilltxt.

17. February 2022 by Jill

Visiting scholar at the University of Chicago

I’m a visiting scholar at the University of Chicago this year, affiliated with the Center for Applied AI at Booth School of Business. I’m excited about the opportunity to learn from a different disciplinary approach to AI and machine vision. I discovered the work of Sendhil Mullainathan, who leads the Center, when I heard one of his collaborators, Ziad Obermeyer, speak about a study they did showing how running machine learning on medical records to find predictors of a stroke gives misleading results because so much context is missing from medical data.

The Center for Applied AI focuses on algorithmic bias, and works to find solutions to it. Most of the humanities and social science approaches I’ve seen focus on what’s wrong with algorithmic bias, with less work on how to solve the problems, so I’m particularly looking forward to learning more about that.

I’m also eager to learn about other research and activities at UChicago. Today I started auditing a digital humanities course on data analysis using R, which is brilliant since I have been teaching myself R for the last few weeks so I can do more with our data on cultural representations of machine vision. And there’s the Weston Game Lab here, too. I just hope campus opens up as planned next week so I get to actually spend more time there in person!

I’ll be continuing to lead the Machine Vision project of course. We’ve finished the data collection phase and are just beginning to analyse the data, so you’ll hear more about that soon, too.

If you know of something at UChicago or elsewhere in Chicago I should be aware of, please let me know!

18. January 2022 by Jill

My talk on caring AIs in recent sci-fi novels

I’m giving a talk at an actual f2f academic conference today, Critical Borders, Radical Re(visions) of AI, in Cambridge. I was particularly excited to see this conference because it’s organised by the people who edited AI Narratives: A History of Imaginative Thinking about Intelligent Machines, a really useful anthology of essays on stories about AI, ranging from the Ancient Greek myths about the autonomous machines Hephaistos built, via medieval ideas about magical mirrors and golems, to twentieth-century sci-fi.

Thumbnail image of PDF handout for my talk.

Here is the PDF Handout for my paper “Artificial Intelligence is Social and Embodied: AIs that Care in Contemporary Science Fiction”. I would love feedback, if you have any – I’ll be revising this before it eventually hopefully ends up as a full paper. I’m presenting at 13:20 UK time today, and it’ll be streamed (and I think archived?) on YouTube here, with the many other interesting talks happening today.

Yesterday I was able to sit in an actual auditorium and hear Ruha Benjamin speak, and today the conference itself starts. It’ll be live-streamed on YouTube as well and about 2/3 of the speakers will be remote. It’ll be interesting to see how a hybrid conference works.

My paper is about eleven science fiction novels published in the last five years where caring AIs are main characters:

Author | Title | Year | AI character
Becky Chambers | A Closed and Common Orbit | 2016 | Sidra, Owl
Annalee Newitz | Autonomous | 2017 | Paladin, Med
Martha Wells | Murderbot series (2017-21) | 2017 | Murderbot
Neal Shusterman | Thunderhead | 2018 | Thunderhead
Yudhanjaya Wijeratne | The Salvage Crew | 2018 | Amber Rose 348
Ian McEwan | Machines Like Me | 2019 | Adam
Carole Stivers | The Mother Code | 2020 | Rho-Z (Rosie)
Bjørn Vatne | Død og oppstandelse | 2020 | Oda
William Gibson | Agency | 2020 | UNISS (Eunice)
S. B. Divya | Machinehood | 2021 | Welga/dakini
Kazuo Ishiguro | Klara and the Sun | 2021 | Klara

I’m using this to explore how actual (non-fictional) AI is also always social and embodied. This is a work in progress, and I’m trying out the Mumford Method, where you write up a concise handout, present from it several times, revising the handout each time to integrate feedback, and then write it up as a full paper when the ideas are thoroughly worked through.

If you’d like to look at my thoughts, I would really love feedback, as this is very much a work-in-progress.

I usually just start writing without a clear idea of where I’m going, which often works well, but also often requires a LOT of work and confusion when I’m revising, and has led to many abandoned half-written papers. So I’m curious as to whether this method will work for me. I’ve enjoyed writing the handout – I like the constrained space and having the whole structure laid out. It really makes me think about what the point of the paper is rather than just writing out bits I enjoy. I do wonder whether the very concise style of writing will affect my final writing style though. Will I end up less essayistic than I’d like?

Here is the video feed from today’s conference – I’m in the 13:20 panel (UK time) today.

19. October 2021 by Jill

Google is your first reader: How to write an academic paper that machines can easily read

  1. Put your main concept in the first part of your title, not in the subtitle.
  2. Use the format “X is [simple definition]” in your abstract.
  3. Use images and be aware that the first image will be used in many previews, so consider thinking of it as a graphical abstract.

OK, that was a test. The real title of this blog post is:

Google is your first reader: How to write research papers for machines

You see, after reading more about how Google selects its “Featured Snippets”, I learnt that in order to blog for machines (and thus get more human readers too) you should start with a question and then immediately follow with a short list.

To be honest, I don’t even know whether the rules I’ve listed are what really work, but then, community folklore about how algorithms work is common (Bishop 2019, 2020; van der Nagel 2018), so why not add to it.

And yes, this is all a bit tongue in cheek. I do research on algorithmic culture, and so I’m fascinated by how algorithmic decisions are made. I want to understand how algorithms try to select what is the most important point of an academic paper.

This is not to say that we should all write our papers for machines and ignore human readers, or that we should optimise our research for Google rather than aiming primarily to do robust, useful and interesting research. Write the paper you want to write. But if you’re interested in how algorithms and search engines are reading your papers, read on.

Many platforms display the first image in your article when someone shares it, so choose wisely. This is a stock image of clockwork because the screenshots in the rest of the blog post don’t look great when this post is shared on Twitter. Also, metaphors. (Source: Colourbox)

26. March 2021 by Jill

Energy marking for privacy?

I’m attending a two-day digital meeting of Personvernskommisjonen, the privacy commission tasked with surveying the state of privacy in Norway and recommending policy for the future. We just had professor of ICT and private law, Frederik Zuiderveen Borgesius, visit for a short talk, and he raised some really pertinent points. The discussion was much richer than I’m able to summarise here, but I wanted to capture a few points.

Borgesius argued that there is a fundamental information asymmetry in privacy online today, where consumers don’t know what data is collected about them or how it will be used or what the consequences could be. In economics, information asymmetry tends to be bad not just for consumers but for everyone, as described by Nobel Prize-winning economist George Akerlof in his analysis of “The Market for Lemons”. The lemons are used cars, and of course typical consumers can’t really tell whether the car is a “lemon” or not.

How do you reduce information asymmetries? Well, in supermarkets, we have systems that make sure all the food is safe to eat. There are also various labelling schemes, showing if food is healthy, organic and so on – but some of these are not very helpful, or are even designed to confuse us. Different countries have different laws. So for instance in Norway, the country that fruit and vegetables come from has to be marked. But all countries have systems so you can trust that food in the supermarket isn’t poisonous.

So we could imagine legally banning certain kinds of privacy-destroying websites. No collecting personal data for advertising on potentially sensitive sites, for instance, like sites about medical information. But where to draw the line? The articles I choose to read in a newspaper can also be used to infer very sensitive information about my sexuality, mental health or political standpoint, for instance.

Perhaps energy efficiency labelling of fridges and other electrical appliances is a better example? Here the EU has established clear guidelines, and they’re clearly displayed when you buy a new fridge.

Could we imagine something similar for websites? And if so, how would it be implemented?

Arguably, this is what Apple is doing with its privacy labels for apps. I love being able to see this more clearly and in a structured way, and this is a reason I like using Apple products (although I realise that’s a privilege: they’re expensive, which is an issue if only the rich get privacy). But is it a problem that a commercial company is doing this for us, rather than it being democratically defined?

Not all labelling works. You need a lot of basic things in place. (The Norwegian Consumer Council has an overview of labelling in Norway that explains some of the issues.) You also need genuine competition so users have a real choice based on the labelling. That means you need data portability, so users can easily switch to a different social media platform for instance.

Another online pandemic meeting. Speaker’s face blanked out, it’s the privacy commission after all.

We’re not concluding anything yet, and I’m sure there’ll be lots more discussions in the year to come.

09. March 2021 by Jill

Exhibition setup!

OK, this is extremely exciting: the University Museum is making an exhibition about research in our Machine Vision in Everyday Life project! They’ve been working on it for months, and COVID has made everything look very iffy, but now it really looks as though it is nearly ready, and we hope to be able to open it on March 18. Fingers crossed there isn’t a new lockdown before then…

Marit Amundsen is one of the curators on the project.

You’ll enter the exhibition through this tunnel, which will be lit up and “scan” you – then a mechanical guide (a talking head video) will greet you and instruct you to draw a card that gives you a specific role. You’ll view the exhibition from the point of view of your role – a wonderful touch that our larp development team came up with. (The actual larp is planned for November.)

Kurdin Jacob and Magnus Knustad worked with the curators to map surveillance cameras and other tracking around Bergen.

The exhibition is beautifully thought out. Our whole project team contributed ideas, with Andreas Zingerle, Linda Kronman and Gabriele de Seta leading, and the curator team came up with excellent ideas for how to actually make the exhibition exciting to visit. The focus is on the research, but as we research art and games and narratives there are also a lot of artworks in the exhibition – I am thrilled we were able to include this, because there is a lot of evocative art that directly engages with or critiques machine vision technologies.

Eli Hausken and Marit Amundsen looking at James Bridle’s nearly-fully installed artwork, Cloud Index.
Mattias and Eli working on getting the artwork set up – this is Forensic Architecture’s The Battle of Ilovaisk.

There are a lot of details involved with organising an exhibition. Thankfully the Museum are extremely professional, and Andreas was immensely helpful in coordinating with artists and so on (if any of the artists are reading this, your contracts are on the way, we had a few bureaucratic hurdles that are figured out now).

I also loved learning how the museum curators work when designing an exhibition like this. This is the floor plan they developed after many conversations with the research team.

The floor plan for the exhibition.

Sadly the University Museum isn’t able to accept school classes now due to COVID, which is a real shame, as this is an exhibition that would have been so well suited to school visits – and that is usually a major part of what the Museum does. If we’re lucky we’ll be able to have visiting school classes later this spring.

Woman smiling and holding a poster.
This will be on one of the walls. They let me hold it up for the photo though!

I’ll share more as the exhibition is closer to being ready to show!

26. February 2021 by Jill

Can I blog my way out of my pandemic slump?

I’m going to try to start blogging again, as a way to make myself more accountable to myself. I used to use my blog as a research journal, writing little bits and pieces and saving links and stray thoughts, and often I would use bits of blog posts when writing papers and books. People don’t really use blogs like that any more. Most blog posts are more like polished essays than thinking-while-you-write, or Thinking With My Fingers as Torill titled her blog years ago. Some people used their blogs to document their research process, but I used mine to do my research. And to think about how to organise my days and my time, or how to deal with new responsibilities and tasks. Now I’m nearly 50 and nobody accuses me of looking like a student anymore as they did in 2005 (ha) and that outsider peeking into the ivory tower stance certainly doesn’t work anymore. I’ll have to find a new voice, perhaps, if I start blogging again.

I started today by writing my notes about a novel I just read as a blog post instead of just for myself.

One step at a time.

Maybe trying to write a short blog post every day is a way to get myself back into research, get myself back, after this pandemic slump of a year.

25. February 2021 by Jill

Robot mothers: The Mother Code

cover of the novel The Mother Code by Carole Stivers

I’m fascinated by fleshy, emotional ideas about AI and robots. A lot of recent science fiction I’ve been reading explores this: what would a sentient, emotional AI be like? How would they experience the world? What would their material form mean? Would they love? So much of being human is about our bodily emotions and gut feelings and our physical responses to our experiences.

I just finished reading The Mother Code by Carole Stivers. I found the book quite annoying in many ways, but towards the end there are some really interesting descriptions of the relationship between “the Mothers” and the children they have incubated, birthed and brought up. The Mothers are repurposed military bots, designed to nurture human babies after an out-of-control bioweapon kills all humans.


25. February 2021 by Jill

Research on the chilling effect and privacy: a literature review

I’m a member of Personvernskommisjonen, a committee appointed by the Norwegian government to write a report that assesses the state of current privacy regulations and practices and gives recommendations on policies to meet current challenges to privacy (here is our mandate). I was asked to have a look at research on the “chilling effect” and privacy, and to be honest, I got a bit carried away, because I really love exploring new research areas and seeing all the new connections, and constructing new searches and seeing how everything interconnects. This is the informal summary I wrote for the commission, with an annotated bibliography at the end. It’s in Norwegian, but if you don’t read Norwegian, Google will translate it reasonably well, and the bibliography mostly references English language research so you can scroll down to that as well.

Please let me know if you have suggestions or more to add! We have almost a year more to finish the report so will definitely be looking for more material.

The chilling effect (nedkjølingseffekten) arises “in situations where the exercise of legitimate actions is restricted or discouraged through the threat of possible sanctions” (NOU 2016:19: Samhandling for sikkerhet; translated from Norwegian).


18. January 2021 by Jill

VR Narratives: A Workshop in VR, about VR

Now that I have a VR headset at home I’m both enjoying VR experiences and exploring social interaction in VR spaces. I’ll write more about the pros and cons of VR meetings vs Zoom later, but right now I want to share this recording of a conference panel we organised in VR about VR narratives, for ELO2020 last week.


26. July 2020 by Jill
