Four Things to Know About GPT-4

On Tuesday, March 14, OpenAI released its latest machine learning model, GPT-4. While it hasn’t immediately rocked the world in the same way ChatGPT did, that’s mostly because there wasn’t a shiny new interface to go along with it. Trust us — it’s still incredibly exciting.

Thing #1: Multimodality isn’t here yet

Pre-launch, a lot of the hype around GPT-4 was about its being multimodal, or able to accept both text and images as input. Currently, to upload images you need access to the developer API, which is obviously not for everyone. For everyone else, GPT-4 still only accepts text input.
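Once you do have image-input access, a request will presumably look something like the following Python sketch. It assumes the content-parts request format OpenAI uses for vision-capable chat models; the model name, image URL, and v1-style client are illustrative placeholders, not confirmed details from the launch.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's happening in this image?"},
                # Placeholder URL; a base64 data URL also works in this slot
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)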

The hype around multimodality is likely warranted. Expanding the input options from text alone to text and images could (should?) dramatically expand what the model can do, and could pave the way for video, audio, and other multimodal inputs and outputs in the future.

Thing #2: GPT-4 can accept much larger inputs

In the absence of multimodality, one of the most obvious ways GPT-4 differs from GPT-3.5 is that it can accept much larger inputs (and produce larger outputs, though that’s not useful in quite the same way).

The maximum number of tokens you can use at a time with GPT-3.5 is 4,096. With the base model of GPT-4, that max doubles to 8,192 tokens, and there’s even a second GPT-4 model that can handle up to 32,768 tokens.
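If you want to check whether a prompt will fit before you send it, OpenAI’s tiktoken library lets you count tokens locally. Here’s a minimal sketch; the limits dictionary just restates the numbers above:

# pip install tiktoken
import tiktoken

# Context-window sizes mentioned above, per model
TOKEN_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
}

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count the tokens the model's tokenizer sees in the text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def fits_in_context(text: str, model: str = "gpt-4") -> bool:
    """True if the text fits within the model's context window."""
    return count_tokens(text, model) <= TOKEN_LIMITS[model]

Keep in mind that the limit covers the prompt and the completion combined, so you’ll want to leave some headroom for the response.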

What does that mean in practice?

For starters, it means I can give GPT-4 OpenAI’s entire technical report on GPT-4 (minus the appendices) to read. (That’s over 5,000 words of content.) I asked it to summarize the report and call out any important information that was missing.

Here was GPT-4’s response: 

Prompt: Summarize the main points of this research paper. What important information is missing? (followed by the full text of OpenAI’s GPT-4 Technical Report)

This technical report summarizes the development of OpenAI’s GPT-4, a large-scale, multimodal model that accepts image and text inputs to generate text outputs. GPT-4 demonstrates human-level performance on various professional and academic benchmarks, such as scoring in the top 10% on a simulated bar exam. It is a Transformer-based model, with its performance enhanced using the post-training alignment process.

GPT-4’s primary capabilities include:

1. Processing image and text inputs, with potential applications in dialogue systems, text summarization, and machine translation.

2. Scoring well on exams designed for humans, with a performance on a simulated bar exam in the top 10% of test takers.

3. Outperforming previous language models and state-of-the-art systems on traditional NLP benchmarks, and performing strongly on the translated versions of MMLU in 24 of 26 languages considered.

The report also highlights the development of deep learning infrastructure and optimization methods that behave predictably across a range of scales. This enabled the researchers to make accurate predictions on GPT-4’s performance based on smaller models trained with 1/1,000th the compute.

However, GPT-4 exhibits several limitations, such as not being fully reliable, having a limited context window, and not learning from experience. Its capabilities and limitations pose novel safety challenges, with bias, disinformation, over-reliance, privacy, and cybersecurity among the risks identified.

The report refrains from discussing the architecture, hardware, training compute, dataset construction, or training methods of GPT-4 in detail due to competitive concerns and safety implications.

On the whole, GPT-4’s development marks significant progress in the field of AI and demonstrates considerable potential for transformative applications in natural language processing, while also emphasizing the need for ongoing refinement and attention to safety challenges.

The research paper does not disclose specific details about GPT-4’s hardware, architecture, dataset construction, or training methods. Getting more information on these technical aspects would be helpful for researchers looking to replicate or build upon GPT-4’s work.

With GPT-3.5 and earlier models, you couldn’t provide an input anywhere near as long as an entire technical report. This is a really cool advancement: you can now give the model far more information as context.

This capability is especially useful since the model isn’t hooked up to the internet. The only way for it to have new information is if you provide it — and you can now provide it a lot more.
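To make that concrete, here’s a rough sketch of the report-summarizing experiment above as an API call. The filename is a placeholder for a local text copy of the report, and the v1-style openai client is my assumption, not something from the original experiment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path to a local text copy of the report (minus appendices)
with open("gpt-4-technical-report.txt") as f:
    report = f.read()

response = client.chat.completions.create(
    model="gpt-4",  # or "gpt-4-32k" for even longer documents
    messages=[
        {
            "role": "user",
            "content": (
                "Summarize the main points of this research paper. "
                "What important information is missing?\n\n" + report
            ),
        }
    ],
)
print(response.choices[0].message.content)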

For contrast, if I ask what GPT-4 is without providing the technical report, here’s what I get:

Prompt: What is GPT-4?

GPT-4 is the hypothetical successor to OpenAI’s GPT-3, the state-of-the-art language model that has achieved remarkable results in a variety of natural language processing tasks. GPT-4, although not currently available, is expected to further improve on GPT-3’s capabilities, potentially enabling even more human-like text generation and understanding.

As a language model, GPT-4’s goal would be to generate human-like text based on the context of provided input. To understand how it might work, let’s dive into some real-life scenarios where the power of GPT-4 could be utilized.

Suppose you’re a writer

…and so on.

As far as GPT-4 knows, GPT-4 is still a hypothetical successor to GPT-3. Which makes sense, because obviously it couldn’t have been trained on text from a world in which GPT-4 already existed. In all the content the model has seen, GPT-4 is still a future development.

What this means, though, is that we can now get much better results from GPT-4 on things like new events or extremely in-depth topics, by providing it much more information in the prompt.

In addition to what this improvement enables, it’s also really interesting to consider from an architecture standpoint. To accept more tokens, the model has to be able to recall and synthesize information over a much larger window. Was this done simply by building a larger model with more layers and parameters, or were fundamental changes made to how it processes and stores information?

Unfortunately, the lack of any answer to that question brings us to our third point.

Thing #3: OpenAI isn’t quite so…open…anymore

One fascinating thing about GPT-4 has absolutely nothing to do with its abilities. From OpenAI’s research paper on it:

This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

(Emphasis mine)

No further details about the model size, dataset, training…anything?

That is wildly not open. It’s also a big departure from OpenAI’s public research on earlier GPTs.

It’s also worth noting how at odds those two reasons for secrecy are: the competitive landscape, and the safety implications of large-scale models. “Safety implications” call for caution and prudence, but a “competitive landscape” demands full steam ahead to beat out everyone else.

Leaving users in the dark about dataset construction and training methods means we’ll struggle to identify potential biases in the AI’s output. After all, human beings made the decisions about those models and training datasets, and those humans have implicit biases. The training data, in turn, has built-in bias.

Eliminating that bias is messy, complex, and quickly descends into a rabbit hole of debate only enjoyed by philosophy majors and people who like commenting on local news articles. However, being aware of that bias is important for everyone using AI to create new content.

On a totally unrelated note, two other major AI advancements were released the same day as GPT-4: Anthropic’s Claude model and Google’s PaLM API. Since then, Anthropic has launched Claude 2 and Meta has thrown their hat in the ring with Llama 2. Claude 2 offers up to 100,000 tokens.

Clearly, this arms race is in full swing.

Thing #4: AI is becoming a star student (but still lies)

One of the most widely shared graphs from the launch shows GPT-4’s performance on various tests. It’s almost like OpenAI is still under the illusion, shared by high-achieving high schoolers everywhere, that standardized test scores in some way correlate to real-world success.

Lol.

What is worth noting, however, is that GPT-4 was not specifically trained to take any of these tests. This isn’t a case of an AI model being specifically trained to play Go and eventually beating the best human player; rather, its ability to ace these tests reflects a more “emergent” intelligence.

Previous models like GPT-3 also weren’t trained to take particular tests, but, as you can see, GPT-4’s performance has improved significantly over GPT-3’s:

(Chart: GPT-4 vs. GPT-3.5 performance on academic and professional exams, from OpenAI’s GPT-4 Technical Report)

These graphs look good and have become staples of articles and press announcements featuring new models. But ask yourself: do you really want an AP English student – even a particularly skilled one – in control of your marketing messaging and copywriting? Me neither.

If you don’t care about AI’s ability to take standardized tests and just want to know how well it’s going to do what you want, this is still good news. From the report:

GPT-4 substantially improves over previous models in the ability to follow user intent. On a dataset of 5,214 prompts submitted to ChatGPT and the OpenAI API, the responses generated by GPT-4 were preferred over the responses generated by GPT-3.5 on 70.2% of prompts.

So, GPT-4 is more likely to give you what you’re looking for than GPT-3.5. That’s great. It’s important to keep in mind, though, that in spite of its improved performance, the new model still has all the same limitations we know and love from our existing AI friends.

Another quote from the report:

Despite its capabilities, GPT-4 has similar limitations to earlier GPT models: it is not fully reliable (e.g. can suffer from ‘hallucinations’), has a limited context window, and does not learn from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important.

In fact, hallucinations could become an even bigger problem than they already were, simply because the better the AI gets, the easier it is to believe what it says. With GPT-3 and GPT-3.5, people are well aware the model will totally make stuff up, because it happens so frequently. As newer and better models do it less often, there’s a greater risk that when they do hallucinate, we’ll fail to notice or fact-check it.

So stay vigilant, friends. But also, these are very exciting times.



Megan Skalbeck

Megan traffics in words. Whether that’s spinning up a story on the blog or paring down a conversation on the podcast, it’s all elementary math in the end: She adds, subtracts, multiplies for effect, and divides for readability. When she’s not helping words live their most meaningful life, she’s usually in the woods, in the ocean, on a rock, or on the road.
