Language is a funny thing. Sometimes a whole article cannot evoke anything, but a single word can evoke a lot of things for people. For example, the word “learning”. The moment we see it, we think back to our own learning process and we assume that the learning someone is talking about it is similar.
So when GPT-3 was introduced in the paper titled “Language Models are Few-Shot Learners”, what we feel is that, LMs are close to that holy grail where we give a few examples and it learns a new concept.
Though the abstract and introduction wax lyrical of the capabilities, the footnote on page 4 tempers down the expectations:
“In the context of language models this has sometimes been called “zero-shot transfer”, but this term is potentially ambiguous: the method is “zero-shot” in the sense that no gradient updates are performed, but it often involves providing inference-time demonstrations to the model, so is not truly learning from zero examples. To avoid this confusion, we use the term “meta-learning” to capture the inner-loop / outer-loop structure of the general method, and the term “in context-learning” to refer to the inner loop of meta-learning. We further specialize the description to “zero-shot”, “one-shot”, or “few-shot” depending on how many demonstrations are provided at inference time. These terms are intended to remain agnostic on the question of whether the model learns new tasks from scratch at inference time or simply recognizes patterns seen during training – this is an important issue which we discuss later in the paper, but “meta-learning” is intended to encompass both possibilities, and simply describes the inner-outer loop structure. “
The bold line in the footnote basically says even if the model is just repeating things which it has seen from training(and it has seen a lot of data, almost all of the data on the web), it will be considered as few shot “learning”.
This assumption is ok to make, though it dilutes a lot of the enthusiasm for actual few shot learning. Since GPT-3 has been trained on a lot of data, it is equal to few shot learning for almost all practical cases. But semantically it’s not actually learning but just regurgitating from a huge database of data it has already seen.
I compare this to the way students learn for exams(especially in India :)). One option is to learn by rote every single page in the text book. The other option is to actually understand the subject while learning so variations don’t affect the learner. The examination system in India has actually proven that it takes a real good examiner or examination paper to find out who has actually learnt the subject and who has just mugged up the whole syllabus. Invariably the student who has just put the whole text book in memory gets the best marks by just doing a pattern match of the questions and spitting out the most probable answer. But when a tricky question is given, the student stumbles.
GPT-3 is like that student who has learnt everything by rote. It’s very hard to trip it because almost everything is in it’s memory :)
It’s hard to find a topic where it already does not know something about the topic(Don’t get me wrong. This is an awesome achievement in itself. I am just talking about semantics).
But when it does encounter a topic it has no idea about(like the COVID topic as the current data GPT-3 is trained on is before the COVID pandemic), it fails badly.
Now, given the fact that the paper itself accepts that it may or may not be learning(“remain agnostic on the question of whether the model learns”), let us see how far we can go with GPT-3.
There have been a lot of articles about what is great about GPT-3. And it really is great. I have also done a bunch of cool experiments which I have documented at https://gpt3experiments.substack.com/.
We also have an excellent skeptical GPT-3 review by Gary Marcus and Ernest Davis at https://www.technologyreview.com/2020/08/22/1007539/gpt3-openai-language-generator-artificial-intelligence-ai-opinion/
There have been a lot of articles about what is great about GPT-3. And it really is great. But as a person who deploys things in production I would also like to know what can GPT-3 not do? This is important to understand so that we can deploy the right systems. I will mostly work on the examples provided in the GPT-3 paper itself instead of going beyond it and make some remarks on what I think about the capability in each area and also comment on few shot learning or not.
TL;DR: GPT-3 is just average when it comes to regular NLP tasks like Summary generation, Winograd, Translation, Closed Book answering, Reading comprehension, Common sense reasoning, SuperGLUE and NLI. GPT-3 is the best when it comes to NLG. (This is from the results of the paper itself.)
Before going into an analysis of the paper I will list some experiments that I tried to see if it does some few shot learning. The competitor to GPT-3 was by 6 year old kid :)
The kid was able to crack all these few shot examples.
I tried to keep the experiments grounded in language and avoided experiments where extra world knowledge might be needed. For example, skip alternate words problem. The kid does it easily, but GPT-3 is expected to know math(counting) for this. So I avoided experiments of those kinds.
Experiment 1: Copycat Analogies by Melanie Mitchell, https://firstname.lastname@example.org/can-gpt-3-make-analogies-16436605c446. My kid was able to get the answers after few shot examples. GPT-3 mostly fails. I tried with multiple other priming examples also. But could not make GPT-3 learn about copycat analogies. But I wouldn’t say its a complete failure as it seems to get some patterns right. I dont think thats because of coincidence. If I was able to explain why it got the patterns right, I would have claimed there is no few shot learning.
Experiment 2: The P language game. p language is a language where every word is appended with p.
Result: Works for the basic version. It has understood that you should append a p to every word. But if we change the game to change only some words, like animal names for example, then GPT-3 stumbles while my kid doesnt.
Experiment 3: Reversing words. The paper talks about reversing letters and blames the BPE encoding. So I thought let me try reversing words in a sentence.
Result: GPT-3 does not learn from few shot that it has to reverse the words. My kid gets it in 2 sentences.
Experiment 4: Train GPT-3 to reject words.
Result: GPT-3 works well in replacing specified words. Though I have to say, GPT-2 also works well for this example, https://transformer.huggingface.co/doc/gpt2-large.
Experiment 5: Creating opposite sentences.
Result: Works mostly, but suddenly gives weird answers. But I would still consider this a win.
Now onto the experiments mentioned in the paper.
I will not go into detail about the standard tests and tasks as they are pretty standardized and results have been explained in detail in the paper. And since in most cases SOTA is better than GPT-3 it might be better to use the SOTA systems for those tasks than GPT-3. I will just list the tasks out below:
Language Modeling, Cloze, and Completion Tasks:
Language Modeling: GPT-3 beats the socks of everything else. If you have an LM task, you just have ti use GPT-3.
LAMBADA: Predicting next word is what makes GPT-3 great. So as expected, GPT-3 is the best in class in this also.
Comment: I loved the few shot priming idea. Unfortunately, on SOTA systems we have not tried the priming idea to see how they perform. The closest I have seen a “priming” example is with T5 where you train with primed samples like “translate English to German”. With GPT-3, few shot is only few sentences, but for regular systems I think if we give more priming example(within context size), the results should improve over SOTA.
HellaSwag: GPT-3 does not outperform SOTA here. The fine-tuned multi-task model ALUM performs better.
StoryCloze: GPT-3 does not outperform SOTA here. The fine-tuned BERT model performs better.
Closed Book Question Answering: Have to say, given the fact that GPT-3 is trained on the whole of the Internet knowledge base, I expected it to perform great in this test. Unfortunately, it performs poorly in some tests and just a little better in one test. NaturalQS and WebQS, T% and other systems are much better. In TriviaQA GPT-3 comes out on top by just a little bit.
Analysis: From the above tests the following is clear. GPT-3 is the go to model for word completion and text completion tasks. But based on the results of Closed book QnA I wouldn’t use GPT-3 in a system when accurately predicting an answer is more important.
Translation: GPT-3 performs capably on this. As the authors mention, since most of its training data is in English, it performs better when we have to translate into English. But if you have to translate English to some other language, other systems are SOTA.
Winograd-Style Tasks: GPT-3 is not better than SOTA systems in this too. Unfortunately GPT-3 does not perform well in this task.
Common sense reasoning: GPT-3 perfroms better on one dataset. On the rest its results are not great. It especially has pretty bad results in ARC tasks.
Reading Comprehension: GPT-3 performs worse than SOTA in all datasets.
SuperGLUE: GPT-3 doesn’t perform better than SOTA. In fact for most tasks fine tuned BERT performs better.
NLI: GPT-3 performs abysmally in these tasks.
Comment: In my opinion, the above tasks are good to find if a model “understands” language. Unfortunately GPT-3 doesn’t really push the SOTA. Also, unfortunately, comparative results with fine tuned GPT-2 models are not available. This would have given us an idea of how much importance more data is.
After the regular tasks, GPT-3 paper talks about “synthetic and qualitative tasks”. This is where the actual “few shot learning” capabilities are discussed. So let me share my thoughts on each experiment in detail.
Wouldn’t it be awesome if a language model can learn math :). But does it? The jury is still out. From the paper:”Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175), with the latter being able to reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot are shown in the appendix.” So, the 175B model performs better. But only upto 2 digits(3 digits, if 80% accuracy is fine). So unless we know what is there in the training set, its hard to verify if it really did “learn” math. For example, did it parse websites like https://adding.info/Sum-Difference/the-sum-of-two-numbers-is-43-and-their-difference-is-21.html. The authors do mention that from their tests, less than 0.5% data was there in training.
My experiments also suggest the same. 2 digit math is good. Rest, doubtful. The log probabilities show that it is predicting the whole 2 digits instead of digit by digit.
Also, it does not seem to know about the numbers mentioned as text and misses sometimes.
Comment: If someone where to ask me to guess, I would say in some way or the other, all the two digits and many of the three digit operations are present in the 175B parameter dataset. But whether it has learnt by rote or understood it, GPT-3 performs 2 digit math(addition and subtraction) pretty well.
2. Word Scrambling and Manipulation Tasks:
The authors tried out Cycle letters in word (CL), Anagrams of all but first and last characters (A1), Anagrams of all but first and last 2 characters (A2), Random insertion in word (RI), Reversed words (RW) tasks. The authors mention that performance might not be great because of BPE encoding. But the thing is, I can prime the model to output individual words. So not sure if this theory holds.
I tried the above tasks with different experiments and mostly its hit or miss.
These tasks are anyway same as any language model, for example if you ask Google for criroptuon , it says did you mean corruption.
Comment: No model got the word reversal. Even with few shot. That's a big problem. Most kids can easily figure it out to reverse and get the word. Though I agree it might be a problem with the encoding. Only a character level model might get these right. I would not use this in a production system for these use cases. Go with regular statistical or character level models for these.
3. SAT Analogies:
GPT-3 performs well on this as per the paper. I would still like to see the training data to understand how many of the test analogies were already there in the training data.
4. News Article Generation:
Without a doubt, this is where GPT-3 shines. I would use GPT-3 for this use case in production.
5. Learning and Using Novel Words:
GPT-3 is pretty good at this. Though it mostly works for single novel words. When you try to teach it multiple words at the same time it gets confused as shown below.
Comment: Maybe different types of priming might help. But I couldn’t get this to work in a predictable mode.
6. Correcting English Grammar and spelling mistakes:
GPT-3 works perfectly for this scenario. We can easily use this in production.
In this article I wanted to look at where GPT-3 can be used for production and where it fails. I mainly discussed the points in the paper from a real world scenario. In addition to the tasks mentioned in the paper, I have done numerous other experiments to see where we can use GPT-3 in production. I will write in detail about those in the next blog. But for now, the following are some of my observations:
GPT-3 is not good for creating summaries.
GPT-3 cannot be used reliably do detect curse words in text.
GPT-3 works great for slot filling.
Using a GPT-3 as a chatbot backend is possible but tough. Especially for long conversations.