Takeaways from "How does ChatGPT work" blog
The best explanation of how ChatGPT works that I have found is the one written by Stephen Wolfram. If you have around half a day to spend, I suggest you go to that link and read it thoroughly. Wolfram lays the groundwork and explains how ChatGPT works from scratch.
I mention that blog because it mirrors my thought process almost exactly; if I had Wolfram's writing skills, I would have written a similar article :)
But in any case, since Wolfram has already written it, I would like to highlight the main points that I think are most important and that will form the foundations of my research into making a better, cheaper, and greener ChatGPT.
The most important takeaway from that blog is in its very first line:
It’s [ChatGPT] Just Adding One Word at a Time
Note: Because of RLHF and other nifty techniques, this is not entirely true, but the core of ChatGPT is still predicting the next word.
I have already written about how GPT-3 and other LLMs just add one word at a time (they are sometimes called stochastic parrots for this very reason), and why, because of this, I don’t consider LLMs the path to AGI: I don’t think the mind generates one word at a time.
Since ChatGPT generates just one word at a time, building a ChatGPT comes down to building a system that predicts the next word given a set of words.
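To make that concrete, here is a minimal sketch in Python of the outer loop such a system would run. `predict_next_word` is a hypothetical placeholder for whatever actually scores candidate next words, not anything from Wolfram's blog:

```python
# Minimal sketch of "adding one word at a time": repeatedly ask a
# next-word predictor for a continuation and append it to the text.
# `predict_next_word` is a hypothetical stand-in for whatever model
# or statistical method produces the next word.

def generate(prompt: str, predict_next_word, max_words: int = 20) -> str:
    words = prompt.split()
    for _ in range(max_words):
        next_word = predict_next_word(words)  # returns a single word
        if next_word is None:                 # predictor has no continuation
            break
        words.append(next_word)
    return " ".join(words)
```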
So, assume you have millions of words written as sentences. Let’s assume you have all the sentences ever written online. Your job is to predict the next word given a set of words. In other words, it’s a statistics problem.
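As a toy illustration of that statistical view, here is one of the simplest possible predictors: count which word follows which in a tiny stand-in corpus, and always pick the most frequent follower. The corpus here is made up; real data would be vastly larger. The resulting `predict_next_word` could be dropped straight into the loop above:

```python
from collections import Counter, defaultdict

# Tiny stand-in for "all the sentences ever written online".
corpus = [
    "the cat sat on the mat",
    "the cat ran after the dog",
    "the dog sat on the rug",
]

# Count, for every word, which words follow it and how often.
follower_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        follower_counts[current][following] += 1

def predict_next_word(words):
    """Return the most frequent follower of the last word, if any."""
    counts = follower_counts.get(words[-1])
    return counts.most_common(1)[0][0] if counts else None

print(predict_next_word(["the", "cat"]))  # -> "sat"
```

A predictor this simple quickly degenerates into repetitive loops, since it only conditions on one previous word; that is exactly why the real problem is hard.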
Wolfram’s blog goes into detail about why it’s a hard problem and why we need deep learning to solve it. In his own words, ChatGPT works as follows:
First, it takes the sequence of tokens that corresponds to the text so far, and finds an embedding (i.e. an array of numbers) that represents these.
Then it operates on this embedding—in a “standard neural net way”, with values “rippling through” successive layers in a network—to produce a new embedding (i.e. a new array of numbers).
It then takes the last part of this array and generates from it an array of about 50,000 values that turn into probabilities for different possible next tokens.
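Here is a toy numpy sketch of those three steps, with random weights standing in for the trained ones. The dimensions and layer count are made up; only the ~50,000 vocabulary size comes from the quote:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 50_000, 64

token_ids = np.array([17, 42, 7])          # "the text so far", as token ids

# Step 1: look up an embedding (an array of numbers) for each token.
embedding_table = rng.standard_normal((vocab_size, embed_dim))
x = embedding_table[token_ids]              # shape (3, embed_dim)

# Step 2: ripple through successive layers to produce a new embedding.
for _ in range(4):                          # 4 stand-in layers
    w = rng.standard_normal((embed_dim, embed_dim)) / np.sqrt(embed_dim)
    x = np.tanh(x @ w)

# Step 3: from the last position, produce ~50,000 values and turn them
# into probabilities over possible next tokens via a softmax.
unembedding = rng.standard_normal((embed_dim, vocab_size)) / np.sqrt(embed_dim)
logits = x[-1] @ unembedding                # shape (vocab_size,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape, probs.sum())             # (50000,) and ~1.0
```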
The hard problem of representing a sentence as a bunch of numbers that computers can work with has already been solved: we can represent a sentence with an “embedding”. So, if we can take these embeddings and find a statistical solution, we can replace the deep-learning solution with our own.
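For example, one way to get such embeddings today is the sentence-transformers library; my choice of library and model here is an assumption for illustration, not something the blog prescribes:

```python
# Assumed setup: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one commonly used model
embeddings = model.encode(["The cat sat on the"])
print(embeddings.shape)  # (1, 384): one 384-dimensional vector per sentence
```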
Another big takeaway from the blog is the following passage:
“The success of ChatGPT is, I think, giving us evidence of a fundamental and important piece of science: it’s suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover. In ChatGPT—built as it is as a neural net—those laws are at best implicit. But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.”
The biggest obstacle to solving any problem is believing that it cannot be done. Thanks to ChatGPT, we know that a solution exists. So there is a pattern, and it can be found.
And I am optimistic about finding these laws, because if an artificial neural network can find them implicitly, I am sure some real neural networks (humans) can look at the same patterns and find the laws explicitly, thereby breaking open the compute problem of ChatGPT.
So let me redefine the problem statement.
We have:
Billions of sentences
A good representation of the sentences as numbers
The next word for each of those sentences
One constraint: we don’t have access to deep learning or neural networks
We have to find:
Given a sentence (a bunch of words), the top 100 highest-probability next words.
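To make the target concrete, here is a sketch of one possible non-neural approach: embed the query sentence, find its nearest neighbours in the corpus, and tally what word followed each of them. Every detail here (the similarity measure, k, the random toy data) is an assumption for illustration; it is a starting point, not our actual method:

```python
import numpy as np
from collections import Counter

def top_next_words(query_emb, corpus_embs, next_words, k=1000, top_n=100):
    """Return the top_n most common next words among the k nearest sentences."""
    # Cosine similarity between the query and every corpus sentence.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    nearest = np.argsort(sims)[-k:]        # indices of the k most similar
    counts = Counter(next_words[i] for i in nearest)
    total = sum(counts.values())
    # Turn raw counts into empirical probabilities.
    return [(word, n / total) for word, n in counts.most_common(top_n)]

# Toy usage with random embeddings standing in for the real ones.
rng = np.random.default_rng(0)
corpus_embs = rng.standard_normal((10_000, 384))
next_words = [f"word{i % 500}" for i in range(10_000)]
print(top_next_words(rng.standard_normal(384), corpus_embs, next_words)[:5])
```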
Let’s see who can solve this. We have already started on it, and I will share our approach and progress in the next blog.
P.S.: The idea for this blog came from this tweet from Neal Khosla. Thank you, Neal.