ChatGPT is multilingual but monocultural, and it’s learning your values
Like the rest of the internet, I’ve been playing with ChatGPT, the new AI chatbot released by OpenAI, and I’ve been fascinated by how much it does well and how it still gets a lot wrong.
ChatGPT is built on a foundation model, that is, a deep learning model (also called a neural network) that is trained on so much data and with so many parameters that it is qualitatively different from models you could feasibly train yourself. I wanted to know what data ChatGPT is trained on, but it turns out that information is not readily available.
My conclusion, after reading up on all this, is that ChatGPT is multilingual but monocultural – but that by using it, we’re all helping to train it to align its values with our own.
Let me explain.
What is ChatGPT trained on?
The basics are clear. ChatGPT is based on the GPT models (GPT-1, GPT-2, GPT-3, and the current GPT-3.5 series), which are trained on data scraped from the web and some books. I’ll discuss them in more detail below.
In addition, as described in Ouyang et al. 2022 (or, for non-scholars, in OpenAI’s blog post “Aligning Language Models to Follow Instructions”), ChatGPT is based on InstructGPT. InstructGPT was first fine-tuned on “desired responses” that humans wrote for a set of prompts. After that, human labellers rated GPT-3’s responses (presumably similarly to the way ChatGPT asks us to label its responses). A model was trained on the labelled responses to predict what the humans would prefer, and GPT-3 was then optimised against that model’s predictions. That gave us InstructGPT, which is what ChatGPT is based on.
Here’s OpenAI’s visual explanation of the process for the value alignment training.
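To make the “predict what the humans would prefer” step a little more concrete, here is a minimal sketch of the pairwise comparison loss that Ouyang et al. 2022 describe for the reward model: the loss is small when the model scores the response the labeller preferred above the one they rejected, and large when it gets the order wrong. The function name and the scores are made up for illustration; this is not OpenAI’s code.

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    The scores stand in for the reward model's ratings of two responses
    to the same prompt (hypothetical numbers, not real model output).
    """
    diff = score_preferred - score_rejected
    # -log(sigmoid(d)) is the same as log(1 + exp(-d))
    return math.log1p(math.exp(-diff))

# The reward model agrees with the human labeller: low loss.
agrees = pairwise_preference_loss(score_preferred=2.0, score_rejected=-1.0)

# The reward model disagrees with the human labeller: high loss.
disagrees = pairwise_preference_loss(score_preferred=-1.0, score_rejected=2.0)

print(f"agrees: {agrees:.3f}, disagrees: {disagrees:.3f}")
```

Training nudges the reward model to make this loss smaller across many labelled comparisons, which is how the labellers’ preferences end up encoded in it.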
The team describes InstructGPT (which ChatGPT is based on) as aligned with the values of the 40 contractors they initially hired to test it. It is also “biased towards the cultural values of English-speaking people”.
“More generally, aligning model outputs to the values of specific humans introduces difficult choices with societal implications, and ultimately we must establish responsible, inclusive processes for making these decisions.”
OpenAI: “Aligning Language Models to Follow Instructions” (2022)
The model card for InstructGPT explains that it still has issues. For instance, and rather seriously, it makes up “facts”. Unfortunately, it’s really good at making its “facts” sound quite convincing.
In this blog post, I’ll explain more about the data it’s trained on, how it all works, and how you and I are training the model each time we use it.
How deep learning models make sense of the world
First of all: what is the data the GPT series of AI models was trained on? Here is the table from the paper that introduced GPT-3 in 2020 (Brown et al. 2020).
I’ll return to each of these datasets below, but first I need to explain tokens and vectors and latent space.
Models like GPT-3 count things in tokens. A token is the smallest unit of text that a machine learning model works with, much as a phoneme is the smallest unit of sound that can distinguish one word from another in spoken language. Often a token corresponds to a word, although it gets more complicated. The basic GPT-3 model is trained on unlabelled data, so nobody tells it what the tokens mean; it has to work out how they relate to each other from the text itself. A model like GPT-3 calculates how tokens (let’s just say words) relate to each other by assigning each word a vector. For example, in a specific model trained on Wikipedia and newswire data, McCoy and Ullman explain that “the word ‘dog’ is represented as the vector [0.308, 0.309, 0.528, -0.925, ….]”. If you plot that into a coordinate system, then words that often co-occur with “dog” in the training data will be positioned close to “dog”. This “map” of how words are related to each other is also called the “vector space” or “latent space” or even just “space”.
Remember those x/y coordinate grids we drew in 6th grade? It’s kind of like that. Except instead of two dimensions (an x-axis and a y-axis) there are thousands of axes: in the largest GPT-3 model each token’s vector has 12,288 numbers. (The 175 billion you may have read about is the number of parameters, the weights the model learns, not the number of axes.)
Once GPT-3 is trained, it doesn’t “know” anything about its training data any more. All it knows is those coordinates. Dog is [0.308, 0.309, 0.528, -0.925, ….], and that …. stands for a lot more numbers. It also knows what other words (or tokens) “dog” is close to. All those tokens and their coordinates across thousands of dimensions make up the “latent space” of the model.
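If you want to see what “close to” means in practice, here is a toy sketch. The three four-dimensional vectors are invented for illustration (real models use thousands of dimensions and learn the numbers from data), and cosine similarity is one common way of measuring closeness that I am assuming here; it is not necessarily what any particular model does internally.

```python
import math

# Hypothetical word vectors, made up for this example.
vectors = {
    "dog":   [0.9, 0.8, 0.1, 0.0],
    "puppy": [0.8, 0.9, 0.2, 0.1],
    "tax":   [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Near 1.0 means the vectors point the same way; near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(nearest("dog"))
```

With these made-up numbers, “puppy” sits right next to “dog” while “tax” is far away. The model has no definition of any of these words, only their positions relative to each other.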
OK, so back to the table about the data GPT-3 was trained on.
The Common Crawl is a lot of scraped web data. WebText2 is webpages that have been shared in Reddit posts that have received at least three upvotes. Books1 and Books2 are not specified, but people have suggested the Gutenberg library, BookCorpus (free, self-published books) and libgen as possibilities. Finally, the Wikipedia means the English-language Wikipedia, not all of them. The Quantity (tokens) shows how much is in each dataset, but they’re not equally weighted. Here is a table showing the relative weighting.
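To see what “not equally weighted” means, you can compare each dataset’s share of the tokens with its share of the training mix. The sketch below uses the rounded figures reported in Brown et al. 2020 (Table 2.2); the variable and function names are my own, and the ratios are only as precise as those rounded numbers.

```python
# (tokens in billions, weight in the training mix), rounded figures
# from Brown et al. 2020, Table 2.2.
datasets = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}

total_tokens = sum(tokens for tokens, _ in datasets.values())

def oversampling(name):
    """Weight in the training mix divided by share of all tokens.

    Above 1.0: the dataset is seen more often than its size suggests.
    Below 1.0: it is seen less often.
    """
    tokens, weight = datasets[name]
    return weight / (tokens / total_tokens)

for name, (tokens, weight) in datasets.items():
    share = tokens / total_tokens
    print(f"{name:25s} {share:6.1%} of tokens, {weight:4.0%} of mix, "
          f"ratio {oversampling(name):.1f}")
```

The Common Crawl is by far the largest dataset but is sampled less than its size would suggest, while the smaller, supposedly higher-quality datasets are sampled several times more often than their size alone would give them.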
OpenAI are pretty vague about what, exactly, some of those datasets are, but here’s what I’ve found out so far:
Common Crawl (filtered)
The Common Crawl is an open repository of web crawl data in more than 40 languages. In the paper Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, Jesse Dodge and co-authors (including Margaret Mitchell who was fired from Google’s AI ethics team last year with Timnit Gebru but now works for Hugging Face) document a version of the Common Crawl that is filtered in the way described for the GPT-2 training data.
Dodge et al. analyse three levels of the Common Crawl dataset: the metadata (what domains the data is from and when it was created or collected), the text itself, and what is missing or not included.
At the metadata level, they found that US domains dominate, with far more content than domains with many native English speakers like India or Pakistan.
“51.3% pages are hosted in the United States. The countries with the estimated 2nd, 3rd, 4th largest English speaking populations—India, Pakistan, Nigeria, and The Philippines—have only 3.4%, 0.06%, 0.03%, 0.1% the URLs of the United States, despite having many tens of millions of English speakers.” (Dodge et al., 2021, p. 4)
They found a surprising amount of patent data, and a lot of this is machine-translated, because various countries require patents to be in their languages. There are even patents run through OCR, so quite a bit of text is machine-generated in one way or another. Finally, they found that filters that remove documents containing words on a banned words list “disproportionately remove documents in dialects of English associated with minority identities (e.g., text in African American English, text discussing LGBTQ+ identities).” (Dodge et al., 2021, p. 2) You can take a look at the “bad words list” yourself. It’s clear that most of these words are on it so porn can be filtered out, and there are some slurs and swearwords on there as well. This means that texts representing minorities are missing. Removing sex words also means that non-offensive material about queer culture, including legal documents about same sex marriage, has been filtered out.
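Here is a rough sketch of why a word list filter is such a blunt instrument. It assumes the filter works at the document level, dropping any page that contains even one listed word, which is how Dodge et al. describe it. The two blocklist entries are harmless placeholders I made up, and the function name is hypothetical.

```python
import re

# Placeholder entries standing in for the real list.
BLOCKLIST = {"badword", "sexword"}

def keep_document(text: str) -> bool:
    """Keep a document only if none of its words are on the blocklist."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(BLOCKLIST)

documents = [
    "A recipe for lentil soup.",
    "An explicit page full of badword content.",
    "A court ruling on marriage law that uses sexword once, in a neutral legal sense.",
]

kept = [doc for doc in documents if keep_document(doc)]
print(kept)
```

The filter has no sense of context: the court ruling is thrown out along with the explicit page because both contain a listed word, and only the soup recipe survives.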
OK, so there is some bias there, and a crawl of “all the web” is bound to have a lot of not exactly high quality language. The next corpus is meant to remedy that.
WebText2 is a corpus of web pages that have been linked from Reddit posts with three or more upvotes. The idea is that three upvotes from Reddit users ensure that the web pages have a certain level of quality. The exact corpus used to train GPT-3 is not available, but it has been recreated and can be downloaded as OpenWebText2, which also has instructions for how to recreate the dataset.
“Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.”
Radford et al. 2019, page 3.
Unfortunately Reddit users are not a representative sample of humanity, so there is likely to be bias in this too. And three upvotes is not a lot. But OpenAI must trust this dataset, because WebText2 is one of the most heavily upweighted of the five datasets used to train GPT-3: it holds about 19 billion of the roughly 500 billion tokens, yet it makes up 22% of the training mix.
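The selection rule itself is very simple, which is part of the point. A minimal sketch, assuming we already have a list of outbound links with their karma scores: the `RedditLink` class, the example URLs and the deduplication step are my own assumptions for illustration, not the actual scraping code.

```python
from dataclasses import dataclass

@dataclass
class RedditLink:
    url: str    # outbound link shared in a Reddit post
    karma: int  # score of the post that shared it

MIN_KARMA = 3  # threshold quoted from Radford et al. 2019

def select_urls(links):
    """Return each URL once, if any post sharing it reached MIN_KARMA."""
    seen = set()
    selected = []
    for link in links:
        if link.karma >= MIN_KARMA and link.url not in seen:
            seen.add(link.url)
            selected.append(link.url)
    return selected

links = [
    RedditLink("https://example.com/a", 12),
    RedditLink("https://example.com/b", 2),
    RedditLink("https://example.com/a", 5),
    RedditLink("https://example.org/c", 3),
]

print(select_urls(links))
```

Nothing in this rule looks at the page itself. The only quality signal is that a handful of Reddit users clicked an arrow.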
Books1 and Books2
The description of these datasets in the original paper is disappointingly vague: “two internet-based books corpora (Books1 and Books2).”
Presumably the reason that OpenAI is so vague about what, exactly, these two datasets are is that it’s a bit dodgy in terms of copyright. I assume (hope?) that at least one of these is the Gutenberg library, which is books in the public domain. But if it is, why not just say so?
Many assume that one of these is BookCorpus, which consists of 11,038 books that were self-published on Smashwords and are available for free. BookCorpus was definitely used for training GPT-1 and BERT (another large language model). The BookCorpus dataset is available from Hugging Face, and Jack Bandy and Nicholas Vincent have published a paper retrospectively documenting it.
The biggest issues Bandy and Vincent identify in the BookCorpus dataset are:
- Even though the books are free, the licence doesn’t really permit this use. It’s legally dubious.
- There are lots of duplicates (particularly of romance novels) and some authors have published hundreds of novels in the dataset, so it’s not exactly representative.
- There’s a skewed religious orientation – Christianity is overrepresented.
Another group of authors found that there is a lot of toxic language in BookCorpus (Gehman et al. 2020). That’s honestly not surprising given that “toxic language” includes flirtation, and anything sexual, threatening or insulting. I mean, this is literature. You want that kind of stuff in literature. But the context is crucial. Does the language model recognise this context?
Here is an example of a book published a few days ago that is currently available for free on Smashwords.
Looking at it as a human, I’m pretty sure you won’t be surprised that the first two paragraphs of the novel are, well, in the style you’d expect:
“Not many people went up against me or any of the members of my family, but when someone actually did, I found that shit entertaining as hell.
Staring down at the girl yelling at me, it was almost a shame that I was going to have to sick D.J., Maggie, or Lennon on her. Though our code of honor might not always be so honorable, the rules when it came to the fairer sex had been ingrained in us since birth. While us guys handled any males that were stupid enough to tangle with our family, we left any girls that were just as stupid to my sisters.”
(Enticing the Enemy by M.E. Clayton)
Imagine you’re a neural network taking in this data and assigning values to the tokens so you can organise them in your vector space. The token “girl” is close to “yelling” and to “stupid”. So “yelling” and “stupid” are probably also related to each other.
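A crude way to imitate this is to simply count which words appear near “girl”. Real language models do not keep counts like this (they adjust vectors during training), so treat the sketch below as a stand-in for the intuition, not a description of how GPT-3 works. It uses two fragments of the passage quoted above; the window size and the trick of matching “girl” and “girls” by prefix are my own simplifications.

```python
import re
from collections import Counter

# Two fragments from the quoted passage.
fragments = [
    "Staring down at the girl yelling at me",
    "we left any girls that were just as stupid to my sisters",
]

def cooccurrences(texts, target_prefix, window=5):
    """Count words appearing within `window` words of the target."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token.startswith(target_prefix):
                start = max(0, i - window)
                neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts

counts = cooccurrences(fragments, "girl")
print(counts["yelling"], counts["stupid"], counts["sisters"])
```

Both “yelling” and “stupid” land inside the window around “girl”, while “sisters” falls outside it. Repeat that over thousands of similar novels and the association starts to add up.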
The idea is, of course, that with enough data you end up with so many parameters that you don’t need to worry about this kind of gender stereotype, because there’ll be enough good stuff to even it out. Maybe.
InstructGPT solves it by having humans label responses, and I’m guessing most of the humans would have labelled output that assumed girls are stupid and yell all the time as bad and thereby trained the model to avoid that.
Here is the full model card Bandy and Vincent made for BookCorpus:
The final dataset specified in Brown et al. 2020 as one of the datasets GPT-3 was trained on is English-language Wikipedia pages. Wikipedia has a lot of great information, but we know there is a strong bias in who edits it. A 2015 analysis of differences between articles about women and men found clear differences not only in coverage and interlinking but also in how women are described. Interestingly, given that GPT-3 is only trained on the English Wikipedia, the English and Russian language versions have the strongest gender bias.
What does this mean?
AI is getting very, very good. It still has issues, but it now generates convincing language. It is trained on somewhat dubious data that we know is biased – Wikipedia, self-published novels and pages linked from Reddit in particular. But with the addition of labelled value alignment it does a much better job of avoiding the most obvious toxicity and bias, though it still often fabricates information. It seems to use some templates to deal with potentially difficult questions related to bias or values or violence. Also, its answers tend to follow US-centric genres like the three paragraph essay. This “pay for essay” website has a decent explanation of how the three paragraph essay formula works.
ChatGPT is multilingual but monocultural
I was surprised at how good ChatGPT is at answering questions in Norwegian. Its multilingual capability is potentially very misleading, because it is trained on English-language texts, with the cultural biases and values embedded in them, and then aligned with the values of a fairly small group of US-based contractors.
- ChatGPT doesn’t know much about Norwegian culture. Or rather, whatever it knows about Norwegian culture is presumably mostly learned from English language sources. It translates that into Norwegian on the fly.
- ChatGPT is explicitly aligned with US values and laws. In many cases these are close to Norwegian and European values, but presumably this will not always be the case.
- ChatGPT frequently uses US genres and templates to answer questions, like the three paragraph essay or standard self-help strategies.
Customised values: we’re training AI to align with our values
But this won’t last. By playing with ChatGPT, we’re training it to align more with our values. We’re giving OpenAI a vast, human-labelled dataset showing what responses we like and don’t like. InstructGPT was trained by 40 human contractors in the USA. ChatGPT is being trained by thousands and thousands of people all over the world.
Right now, ChatGPT is free for use, and there’s a thumbs up or thumbs down option for every answer it gives. If you click the icon, it asks for additional feedback.
InstructGPT was “value aligned” using labels from just 40 contractors (Ouyang et al. 2022, page 2). ChatGPT is providing vastly more data. Although it’s free to use ChatGPT, you do have to create an account. OpenAI knows my email and the country I am connecting from, so they can assume my judgements about how ChatGPT responds to me align with “Norwegian values”. OpenAI also knows what device, browser and operating system I am using, which can be a proxy for class and socio-economic status.
Presumably OpenAI will fine-tune their AI to learn what people in different countries and using different devices and browsers prefer. Perhaps they’ll align future GPT values with those of “Mac OS users in Norway who sometimes connect using an iPhone that is recent but not this year’s model”. It’ll be like customised advertising only it’s an artificial friend, a companion, a conversation partner. (I wrote about how apps are like our companions now a few years ago, and I’m working on a piece about our companionable relationships with AIs.)
Perhaps we don’t need to create a “Norwegian AI” to get Norwegian values. OpenAI will do it for us, learning our preferences, adding parameters to the vectors of each token in our semantic constructs of reality. It will align with us perfectly.
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv. http://arxiv.org/abs/2005.14165. (This is the paper that introduced GPT-3)
Bandy, Jack, and Nicholas Vincent. 2021. “Addressing ‘Documentation Debt’ in Machine Learning Research: A Retrospective Datasheet for BookCorpus.” arXiv. http://arxiv.org/abs/2105.05241.
Dodge, Jesse, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. “Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.” arXiv. http://arxiv.org/abs/2104.08758.
Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv. http://arxiv.org/abs/2101.00027.
Gehman, Samuel, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. “RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models.” arXiv. http://arxiv.org/abs/2009.11462.
McCoy, John P., and Tomer D. Ullman. 2018. “A Minimal Turing Test.” Journal of Experimental Social Psychology 79 (November): 1–8. https://doi.org/10.1016/j.jesp.2018.05.007.
Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” arXiv. http://arxiv.org/abs/2203.02155.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners”. OpenAI. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Rettberg, Jill Walker. 2018. “Apps as Companions: How Quantified Self Apps Become Our Audience and Our Companions.” In Self-Tracking: Empirical and Philosophical Investigations, edited by Btihaj Ajana, 27–42. Basingstoke: Palgrave Macmillan. https://doi.org/10.1007/978-3-319-65379-2_3.
Solaiman, Irene, and Christy Dennison. 2021. “Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets.” In Advances in Neural Information Processing Systems, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, 34:5861–73. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2021/file/2e855f9489df0712b4bd8ea9e2848c5a-Paper.pdf. (This paper explains how you can improve the bias and values in GPT-3 by adding a relatively small “value-targeted dataset”)
Wagner, Claudia, David Garcia, Mohsen Jadidi, and Markus Strohmaier. 2015. “It’s a Man’s Wikipedia? Assessing Gender Inequality in an Online Encyclopedia.” arXiv. http://arxiv.org/abs/1501.06307.