The Generative AI Revolution - Fri, Jan 5, 2024
Foreword
This blog will attempt to provide an extensive, math-free background on modern AI in a manner accessible to a non-technical audience. I include links and footnotes to facilitate deeper learning for those interested, but this is by no means comprehensive. It is, hopefully, enough to help you build a deeper understanding of AI beyond the hype (and maybe impress your friends with some fun facts). My hope is that the mental models herein will help develop clearer mathematical intuition for technical and non-technical readers alike.
To learn more about how the current AI revolution actually affects you or your business, you can also reach out to us directly to schedule a seminar, consultation, or development project.
The Generative AI Revolution
On November 30, 2022, the artificial intelligence company OpenAI launched its flagship product: ChatGPT. The product, a chatbot that lets users ask questions in plain English about nearly anything, was an instant sensation. Why? Because it gave regular people, outside the world of tech, a glimpse into the practical power of state-of-the-art generative AI.
Though ChatGPT did come with its own breakthrough innovations, the underlying technology had been around for years. In reality, ChatGPT was more a breakthrough in productizing AI, specifically something called a “Transformer.”
Transformers
Originally designed at Google and introduced in the 2017 paper “Attention Is All You Need” (Vaswani et al.), the Transformer was a novel architecture for neural networks that not only improved on previous designs but massively “parallelized” training, making it possible to efficiently scale neural nets to enormous sizes. With enormous size has come the ability to tackle problems of enormous complexity, namely understanding language. In other words, it is a type of deep neural network that can learn how to write. The first Transformer, as described in the paper, was used to translate between languages, using what’s called an “encoder-decoder” Transformer architecture, but the first prominent “generative” application was the “decoder-only” GPT-2¹ in 2019.
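For the curious, here is a minimal sketch of what generating text with the openly released GPT-2 weights looks like today. It assumes the Hugging Face `transformers` library (my choice for illustration; the model itself is not tied to any particular toolkit):

```python
# A minimal sketch, assuming `pip install transformers torch` and an internet
# connection to download the small (~500MB) GPT-2 checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# GPT-2 simply continues whatever text you give it, one token at a time.
result = generator("The Transformer architecture, introduced in 2017,",
                   max_new_tokens=40)
print(result[0]["generated_text"])
```

Even this small, early model produces surprisingly fluent (if often meandering) continuations, which is exactly what made it feel like such a leap at the time.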
I first encountered GPT-2 that same year, in an application called TabNine, which originally used the GPT-2 model to generate code as a type of autocomplete. At the same time, my then-startup (Flowbot, Y-Combinator, class of Winter 2020) was working on a competitor that also used AI for programming. Then, in early 2020, we received an email from Sam Altman (then, and still now, CEO of OpenAI) previewing a new AI model, codenamed Da Vinci, which was even better. That model is better known today as GPT-3, which would ultimately become ChatGPT.² It was then, in 2020, that I knew where things were headed. My startup was probably going to need to pivot.
The model no longer just sped up typing; with some creative “prompting,” it could be instructed to write entire blocks of code. Not only that, the same model could also be used to mimic the writing style of famous authors, answer questions, and generate its own creative writing. This was understandable, given its enormous size: where GPT-2 was about 3GB³ in size, GPT-3 was 350GB. The limitation, at the time, was that all of this required significant coaxing, which was finicky and not particularly user-friendly, requiring an understanding of how the model actually worked. Thankfully, I’d been studying the Transformer architecture since GPT-2, but most people hadn’t, and the technology largely flew under the radar. In essence, the model was like an ineffectual teacher that knew a lot but couldn’t communicate it.
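To give a flavor of what that coaxing looked like, here is a purely hypothetical example of a “few-shot” prompt: rather than asking a question directly, you wrote the beginning of a document that the raw completion model would plausibly continue in the way you wanted.

```python
# A hypothetical illustration (not an actual OpenAI example) of pre-ChatGPT
# prompting: the raw GPT-3 model only continues text, so you framed your
# request as a half-finished document, often seeded with worked examples.
few_shot_prompt = """Translate English to French.

English: Where is the library?
French: Où est la bibliothèque?

English: The weather is nice today.
French:"""

# This string would then be sent to a text-completion endpoint; the model
# "answers" only in the sense that the most likely next words happen to be
# the French translation.
print(few_shot_prompt)
```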
The great technological advance behind ChatGPT came from a paper known as InstructGPT, which used a reinforcement learning technique called RLHF⁴ to eliminate the need to “coax” the model into usefulness. Now the model could communicate with anyone, just like a human. A human that has read most of the internet, that is.
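For the technically curious, here is a toy sketch (my own illustrative code, not OpenAI’s) of the core idea behind the “reward model” at the heart of RLHF: replies that human annotators preferred should receive higher scores than replies they rejected, and the model is nudged until that is the case.

```python
# A toy sketch of the pairwise preference objective used to train an RLHF
# reward model. All names and numbers here are illustrative assumptions.
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    # The loss shrinks as the human-preferred reply scores higher than the
    # rejected one, so training pushes the two scores apart.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical scores a reward model might assign to two candidate replies.
print(float(preference_loss(torch.tensor([2.1]), torch.tensor([0.3]))))
```

A reward model trained this way is then used to steer the language model itself toward the kinds of replies a human would rate highly.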
Since then, OpenAI has continued to scale up its model size: GPT-4, the current state of the art, is unofficially believed to be an 8x440GB mixture-of-experts model, roughly 3.5 terabytes in total. The mixture-of-experts (MoE) architecture combines multiple transformer models with the addition of a “gating network” (itself another neural net), which looks at each input and decides which “expert” models should handle it and how heavily to weight their outputs. In short, this may allow for a broader scope of knowledge, due to the specialization of each expert, than a single monolithic model of equal parameter count could encode, and it offers a natural way of divvying up the model across hardware components, making it well-suited to parallelization. The specifics of GPT-4 are not publicly known, however, as they remain a closely guarded trade secret for “OpenAI” and its philanthropic partner, Microsoft.
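To make the gating idea concrete, here is a deliberately tiny sketch in Python. Every number and name in it is invented for illustration; nothing reflects GPT-4’s actual, non-public design.

```python
# A toy mixture-of-experts forward pass: a gating network scores each expert
# for a given input, and the output is a weighted blend of the experts.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                                      # size of the input vector
experts = [rng.standard_normal((d, d)) for _ in range(4)]  # four "experts" (here just plain matrices)
gate = rng.standard_normal((d, 4))                         # gating network: input -> one score per expert

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate                                      # how relevant is each expert to this input?
    weights = np.exp(scores) / np.exp(scores).sum()        # softmax over experts
    # Blend the experts' outputs according to the gate's weights. Real systems
    # usually keep only the top-scoring experts, so most of the model sits idle
    # for any given input, which is part of what makes the approach easy to
    # spread across hardware.
    return sum(w * (x @ E) for w, E in zip(weights, experts))

print(moe_forward(rng.standard_normal(d)).shape)           # (8,)
```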
Background
“Generative AI” generally refers to applications of artificial intelligence, also known as machine learning and abbreviated either “AI” or “ML,” that generate something new. This is in contrast to the many other forms of machine learning that only interpret what already exists, such as computer vision models, which might identify that something is a cat but cannot themselves draw one.
The field of machine learning has been around about as long as modern computers, and a lot of it could also be called “statistics.” If you remember learning about linear regression in school, you’ve encountered a primitive form of “machine learning” that actually precedes the computer… by well over a century, in fact. The original “machine learning” was done on paper by Johann Carl Friedrich Gauss in 1809 to predict planetary motion, using a method called Least Squares Linear Regression.
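For a sense of how little machinery this takes today, here is Gauss’s least-squares idea in a few lines of Python (the library calls and data are my own illustrative choice, fitting a straight line rather than a planetary orbit):

```python
# Fit a line y = a*x + b to noisy points by minimizing the squared error,
# the same criterion Gauss used, applied here to made-up data.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 1.5 + rng.normal(scale=2.0, size=x.size)  # noisy "observations"

# np.polyfit solves the least-squares problem for the line's coefficients.
a, b = np.polyfit(x, y, deg=1)
print(f"recovered slope ≈ {a:.2f}, intercept ≈ {b:.2f}")  # close to 3.0 and 1.5
```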
However, much of modern machine learning, at least at the state of the art, uses something called a “Neural Network,” which builds on the idea of the “Perceptron” from 1958⁵. Neural networks, whose name comes by analogy to the synaptic connections between neurons in the human brain, first arrived in 1965 and can be thought of in two parts: inference and learning. The “inference” side refers to the ability of the trained model to actually perform some task, while the “learning” side refers to the computational framework used to actually train the model. By analogy, you may choose to think of these as a student and a teacher: without first learning, the model cannot know or do anything, but, once taught, the “teacher” is no longer needed for the model to perform its function.
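Here is a toy sketch of that student/teacher split, with every name and number invented for illustration: “inference” is just arithmetic on a fixed set of weights, while “learning” is a separate routine that nudges those weights and can be thrown away once training is done.

```python
# A two-layer toy network in plain NumPy, split into the two halves
# described above. All values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 4)) * 0.5   # the "model" is nothing but these numbers
W2 = rng.standard_normal((4, 1)) * 0.5

def infer(x):
    # Inference: multiply by the stored numbers. Cheap, and runs anywhere.
    return np.maximum(x @ W1, 0) @ W2

def learn_step(x, target, lr=0.01):
    # Learning: the expensive "teacher" side that adjusts the numbers.
    global W1, W2
    hidden = np.maximum(x @ W1, 0)
    error = hidden @ W2 - target
    grad_W2 = hidden.T @ error
    grad_W1 = x.T @ ((error @ W2.T) * (hidden > 0))
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

x, target = rng.standard_normal((8, 2)), np.ones((8, 1))
for _ in range(200):
    learn_step(x, target)
print(infer(x).round(2))  # outputs have moved toward the target of 1.0
```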
It is exactly this aspect that helps explain why you can “run” a model on a gaming PC, even though these models require massive supercomputer clusters for training. The “model” that you can run on your computer went to computer college, but it is not the college itself. The silver lining is that, unlike college students, the model only has to learn once, and then it can be copy-and-pasted. That’s because the model itself is just a collection of numbers. It is a very specific set of numbers, but nothing more than data nonetheless. And like any other kind of data, you can copy it, download it, and put it on a (very large) flash drive. You still need something to “run” it, but that part is comparatively easy.⁶ There is nothing magical about today’s “artificial intelligence” models, and they can technically run on any computer (though perhaps slowly). The only caveat is that they run significantly faster on GPUs.
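As a small illustration of “the model is just data,” the following sketch (toy numbers, hypothetical file name) saves a pretend set of FP16 weights to disk, checks the file size, and loads back an identical copy. The mechanics are the same as downloading a real model’s multi-gigabyte weight files, just a thousand times smaller than GPT-2.

```python
# Save, measure, and reload a toy set of weights; a real model differs only in size.
import os
import numpy as np

weights = np.random.default_rng(1).standard_normal(1_500_000).astype(np.float16)  # 1.5M "parameters" in FP16
np.save("toy_model.npy", weights)

print(os.path.getsize("toy_model.npy") / 1e6, "MB on disk")  # ~3 MB: 1.5M params x 2 bytes each
copied = np.load("toy_model.npy")  # "downloading" the model elsewhere is just copying these numbers
print(np.array_equal(weights, copied))  # True: a perfect copy of the "model"
```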
Origami Interlude (How Computers Think)
…
This section has been extracted and moved to another blog post
If the section title sounds intriguing and you’d like to go deeper on neural networks, click here to see it. It does not require any of the relevant mathematics, but it would probably be gratuitous information for most readers, hence its removal.
Embeddings
Stay Tuned: More Content Coming Soon
1. “GPT” stands for “generative pre-trained transformer” and was coined by OpenAI in the paper “Improving Language Understanding by Generative Pre-Training,” downloadable here. ↩︎
2. Technically, ChatGPT originally used GPT-3.5-turbo, which, as the name suggests, is just a modified version of the original GPT-3 model, which preceded today’s state-of-the-art model: GPT-4. ↩︎
3. Here and elsewhere, I use “3GB” to describe the memory footprint instead of the traditional parameter-count figure, which would be 1.5B. In general, neural networks’ weights are stored in FP16 (16-bit floating-point numbers), which means each number is two bytes (a byte is 8 bits), so 1.5B parameters times two is 3B bytes, also written as 3GB. To convert back to the parameter count in billions, just divide the number of gigabytes by two. ↩︎
4. RLHF stands for “Reinforcement Learning from Human Feedback” and used a large body of annotated example conversations, written by humans, to teach the model how it should “behave.” ↩︎
5. source. Note that the relevant dates here are subject to debate. Modern “deep learning” neural networks look a lot more like those featured in the 1967, 1970, or even 1982 breakthroughs, but that has more to do with the teaching methodology than the neural network itself. ↩︎
6. The “inference” code required to run the most popular downloadable model, Llama 2, is only ~1,000 lines of code that can run on any computer. For more on so-called “local” models, stay tuned for a post on the subject here. ↩︎
Free Consultation for New Clients
The AI revolution underway **right now** is changing the world. We're here to help you understand it and keep up. If you'd like to learn more about what we do, we'd like to meet you.
Get in touch with us today to find out how we can help your business grow.