OpenAI’s GPT-4 is Here. It’s Passing More Exams

The new model of OpenAI’s ChatGPT landed in the 90th percentile on the Bar Exam.

Updated on March 16, 2023

Edited by

SHANGHAI, CHINA - MARCH 15, 2023 - A young man visits and tries out OpenAi's new GPT-4 in Shanghai, China, March 15, 2023. On March 14 Eastern time, OpenAI officially announced the launch of the large multimodal model GPT-4. This is the latest in its line of AI language models that power applications such as ChatGPT and the new Bing. (Photo credit should read CFOTO/Future Publishing via Getty Images)

Credit: Image Credit: CFOTO / Future Publishing / Getty Images

The new model of OpenAI’s ChatGPT can pass more exams with higher scores, accurately process images, and adopt different personalities.
The model still makes simple factual and logical reasoning errors.
The model is available to some developers and researchers. Users must sign up for the waitlist for access.

OpenAI’s ChatGPT-3.5 artificial intelligence has proved to be a passable student for law, medical, and college-level exams. But OpenAI’s new model, GPT-4, looks like it wants to be the top student in the class.

On March 14, OpenAI released a technical report of GPT-4, the newest iteration of the ChatGPT artificial intelligence showcasing the model’s capabilities and limitations. College students, professors and administrators take note: This version of the AI chatbot improves academic performance, tunes AI personalities, and even shows the ability to assess images.

However, the AI can still make simple logical and factual mistakes.

Here’s a first look at how GPT-4 performed on college- and graduate-level exams and other benchmarks.

What Exams Can GPT-4 Pass?

One of GPT-4’s biggest accomplishments is becoming a licensed practitioner of law.

GPT-4 shot beyond GPT-3.5’s performance in the Uniform Bar Exam with a 298/400, landing in the 90th percentile of students. GPT-3.5’s test score was 213/400, in the 10th percentile of students.

GPT-3.5 vs. GPT-4 Exam Performance

Uniform Bar Exam (MBE+MEE+MPT)

GPT-4 Score: 298/400 (90th percentile)
GPT-3.5 Score: 213/400 (10th percentile)

Law School Admission Test (LSAT)

GPT-4 Score: 163/180 (88th percentile)
GPT-3.5 Score: 149/180 (40th percentile)

Scholastic Assessment Test (SAT) Evidence-Based Reading & Writing

GPT-4 Score: 710/800 (93rd percentile)
GPT-3.5 Score: 670/800 (87th percentile)

Scholastic Assessment Test (SAT) Math

GPT-4 Score: 700/800 (89th percentile)
GPT-3.5 Score: 590/800 (70th percentile)

Graduate Record Examination (GRE) Quantative

GPT-4 Score: 163/170 (80th percentile)
GPT-3.5 Score: 147/170 (25th percentile)

Graduate Record Examination (GRE) Verbal

GPT-4 Score: 169/170 (99th percentile)
GPT-3.5 Score: 154/170 (63rd percentile)

Graduate Record Examination (GRE) Writing

GPT-4 Score: 4/6 (54th percentile)
GPT-3.5 Score: 4/6 (54th percentile)

The USA Biology Olympiad (USABO) Semifinal Exam 2020

GPT-4 Score: 87/150 (99th-100th percentile)
GPT-3.5 Score: 43/150 (31st-33rd percentile)

The U.S. National Chemistry Olympiad (USNCO) Local Section Exam 2022

GPT-4 Score: 36/60
GPT-3.5 Score: 24/60

Medical Knowledge Self-Assessment Program

GPT-4 Score: 75%
GPT-3.5 Score: 53%

Codeforces Rating

GPT-4 Score: 392 (below 5th percentile)
GPT-3.5 Score: 260 (below 5th percentile)

AP Art History

GPT-4 Score: 5 (86th-100th percentile)
GPT-3.5 Score: 5 (86th-100th percentile)

AP Biology

GPT-4 Score: 5 (85th-100th percentile)
GPT-3.5 Score: 4 (62nd-85th percentile)

AP Calculus BC

GPT-4 Score: 4 (43rd-59th percentile)
GPT-3.5 Score: 1 (0th-7th percentile)

AP Chemistry

GPT-4 Score: 4 (71st-88th percentile)
GPT-3.5 Score: 2 (22nd-46th percentile)

AP English Language and Composition

GPT-4 Score: 2 (14th-44th percentile)
GPT-3.5 Score: 2 (14th-44th percentile)

AP English Literature and Composition

GPT-4 Score: 2 (8th-22nd percentile)
GPT-3.5 Score: 2 (8th-22nd percentile)

AP Environmental Science

GPT-4 Score: 5 (91sth-100th percentile)
GPT-3.5 Score: 5 (91st-100th percentile)

AP Macroeconomics

GPT-4 Score: 5 (84th-100th percentile)
GPT-3.5 Score: 2 (33rd-48th percentile)

AP Microeconomics

GPT-4 Score: 5 (82nd-100th percentile)
GPT-3.5 Score: 4 (60th-82nd percentile)

AP Physics 2

GPT-4 Score: 4 (66th-84th percentile)
GPT-3.5 Score: 3 (30th-66th percentile)

AP Psychology

GPT-4 Score: 5 (83rd-100th percentile)
GPT-3.5 Score: 5 (83rd-100th percentile)

AP Statistics

GPT-4 Score: 5 (85th-100th percentile)
GPT-3.5 Score: 3 (40th-63rd percentile)

AP U.S. Government

GPT-4 Score: 5 (88th-100th percentile)
GPT-3.5 Score: 4 (77th-88th percentile)

AP U.S. History

GPT-4 Score: 5 (89th-100th percentile)
GPT-3.5 Score: 4 (74th-89th percentile)

AP World History

GPT-4 Score: 5 (89th-100th percentile)
GPT-3.5 Score: 4 (74th-89th percentile)

Image Processing

One of the biggest differences between GPT-3.5 and GPT-4 is the AI’s ability to accurately see and assess images. Previously, a study testing GPT-3.5 on the United States Medical Exam removed all questions containing visual assets due to the model’s inability to determine what was in an image.

OpenAI submitted a combination of text and images to ask the AI, “What’s funny about this image? Describe it panel by panel.”

From OpenAI’s GPT-4 Technical Report.

Steerability

Developers, and later users, can change the AI’s “character” to be different from the usual style of ChatGPT. For example, students can now change GPT-4 into a Socratic tutor that will never give students the answer but guide them through problem-solving.

Or they can turn the AI into a Shakespearean pirate.

Limitations

GPT-4, like its predecessors, can still “hallucinate” facts and make reasoning errors. The base model is slightly better than GPT-3.5. The gap widens after Reinforcement Learning from Human Feedback (RLHF) training.

Like GPT-3.5, GPT-4’s brain is stuck in the past. It generally lacks knowledge of any event after September 2021.

GPT-4 is only available to some developers and researchers, but you can join OpenAI’s waitlist. Text-only requests are currently available, and pricing is $.03 per 1k prompt tokens and $.06 per 1k completion tokens.

“We look forward to GPT-4 becoming a valuable tool in improving people’s lives by powering many applications,” OpenAI said. “There’s still a lot of work to do, and we look forward to improving this model through the collective efforts of the community building on top of, exploring, and contributing to the model.”