OpenAI put out a paper recently called Why Language Models Hallucinate. The core claim is that LLMs hallucinate (make stuff up) because the way they're trained mathematically incentivizes confident-sounding answers over admitting uncertainty. The paper opens with an analogy that sent my mind reeling: "Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty." It's like the ACT or SAT, where you're told to never leave a question blank.
In my mind, this could so easily be flipped. What if our education system produces similar failure modes in humans? What is the "objective function" that we train our students against?
The Objective Function Problem
Oftentimes in undergrad (like PHYS1120 at CU Boulder) we were served multiple-choice tests where leaving an answer blank scored a 0, so you were instructed to guess rather than leave any bubbles blank, because even a blind guess gave you a 25% shot at the point. Models are evaluated with similar strategies: the benchmarks that score AI work this way, so models trained against these benchmarks learn to produce plausible-sounding output regardless of whether it's actually accurate.
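To make the incentive concrete, here's a toy bit of arithmetic (my own numbers, not from the paper) comparing the expected score of guessing versus abstaining under that kind of grading:

```python
# Toy arithmetic (my numbers, not from the paper): expected points per question
# when blanks score 0 and wrong answers carry no penalty.
p_guess = 0.25          # four answer choices, pure blind guess

score_if_guess = p_guess * 1 + (1 - p_guess) * 0   # = 0.25 points on average
score_if_blank = 0.0                                # a blank always earns 0

print(score_if_guess > score_if_blank)  # True: guessing strictly dominates leaving it blank
```

Under that scoring rule, "always produce an answer" is simply the optimal policy, for students and for models alike.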
If you zoom out a bit and look at the education system as a whole, you see a similar incentive structure built around benchmarks rather than true knowledge. Pass your high school exams, get through college admissions, pass your college exams, get professional certifications: these are compound filters optimized to test performance, not understanding. I've passed tests and remembered nothing the day after. The system doesn't really distinguish between a student who truly gained an understanding of thermodynamics and someone who got the same score by cramming the night before.
This kind of evaluation structure arises from the scale of the university. Unlike smaller, more directly guided class formats, a single professor (especially in general courses) may need to manage a class of over 100 people. The professorship system has its own objective functions: profs are expected to balance what amounts to three full-time jobs, with expectations around teaching responsibilities, administration, and research. Trying to excel at three full-time jobs is already excessive; throw in managing a class of over 100 students, and the system breaks down. Professors get stretched thin, and the students don't end up learning at the end of the day. The closest thing to direct, guided, active learning available to students is office hours or recitations, but even those don't guarantee that the TA or prof will be able to get around to them.
The Purpose of a System Is What It Does, and the systems for teaching, and the metrics used to evaluate both students and professors, don't align with the incentives for robust education and student success. In fact, strategies exist to game the system.
Gaming the Test
Here too we can strike a parallel between the benchmarks AI companies are competing over and the metrics used to evaluate success in education.
Chatbot Arena is a crowdsourced site that hosts a leaderboard of chatbots. It lets a human user blindly compare two anonymous AI chatbots side by side and vote for the response they prefer, then aggregates all those votes into a leaderboard. It's a great way to get community input and understand human preferences around LLMs, and AI companies are incentivized to get their products ranked higher.
A paper called "The Leaderboard Illusion" found that companies like Meta, OpenAI, and Google were privately testing dozens of model variants and only publishing scores for the ones that performed best. Meta allegedly tested 27 different versions before the Llama 4 release and only revealed the top scorer. Some models were tuned specifically for Arena preferences: verbose answers with emojis that charmed voters but didn't reflect what the public release actually did. The scores didn't reflect the true landscape of the models so much as reveal how companies approached the test.
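A quick way to see why publish-the-best inflates scores: if every variant has the same underlying skill but the measurement is noisy, the maximum of many private runs systematically overestimates the model. A toy simulation (made-up numbers, not from the paper):

```python
# Toy simulation (made-up numbers): best-of-N selection bias on a noisy leaderboard.
# Every variant shares the same true skill; the crowdsourced score is just a noisy measurement.
import random

random.seed(0)
TRUE_SKILL = 1200.0   # hypothetical "true" rating shared by every variant
NOISE_SD = 30.0       # spread of the noisy leaderboard measurement

def measured_score() -> float:
    return random.gauss(TRUE_SKILL, NOISE_SD)

single_run = measured_score()                            # what an honest single submission shows
best_of_27 = max(measured_score() for _ in range(27))    # test 27 privately, publish only the best

print(f"honest single measurement: {single_run:.0f}")
print(f"best of 27 private runs:   {best_of_27:.0f}")    # reliably lands well above 1200
```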
Open science encourages sharing all useful data, even negative or messy results; this is the opposite. It's a clear case of Goodhart's law: once a metric becomes the goal, people are incentivized to manipulate that metric, undermining the whole original point of the measurement. Once people know what's being measured, they game the system for the rewards.
Students learn to write to their teachers' preferences, learn to cram to pass a test, learn to bullshit essays at the last minute. ML algorithms playing video games discover glitches and exploits to win faster while breaking the game. Tenure-track professors often have to pour far more into their research than their teaching. It's just what the system rewards. The metrics get hit, but the true goal is left in the dust.
Scale Cuts Both Ways
One of the greatest things about a large university is the random encounters that scale gives you. You'll bump into people with completely different life lore, academic backgrounds, and current disciplines. It's the perfect storm: you'll have conversations you would never have had otherwise, and the benefit that has for the creative process (scientific investigation is as much a creative process as art is) is hard to replicate.
But scale in the classroom is a different thing. CU says the average class size is 97 students (with a median of 51), and there's a whole chapter in How to Lie with Statistics about why that kind of statistic is slippery. The median being so much lower means most classes are small, but massive intro lectures drag the average up. And those intro courses, where students are supposed to be building that critical foundational knowledge, are exactly where you end up in a hall with 200-300 people. Class sizes thin out as you progress through the curriculum, but at higher levels tenure-track professors are often even more focused on research and administrative responsibilities than on teaching. A professor's attention gets spread impossibly thin, and the students get marginal benefit from being talked at.
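To see the mechanism, here's a toy example with made-up class sizes (not CU's actual distribution):

```python
# Made-up class sizes, purely to show how a few huge lectures pull the mean above the median.
from statistics import mean, median

class_sizes = [20] * 40 + [35] * 30 + [60] * 15 + [300] * 10 + [450] * 5

print(f"mean:   {mean(class_sizes):.0f}")    # 80
print(f"median: {median(class_sizes):.0f}")  # 35
```

And the gap is even worse from the student's seat: more students are enrolled in the big classes, so the class size the average student actually experiences is larger than either number suggests.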
For me personally, I've found active learning to be the best way to internalize material. At some point in the semester, for classes like Calculus and Physics, I found myself learning much more effectively by reading the textbook at home (and watching The Organic Chemistry Tutor on YouTube) and just grinding practice problems. One could argue that such a system was just teaching me how to pass the test, but I want to draw the emphasis away from the practice problems themselves and toward the framework: by seeking out the knowledge for myself and trying to actually do the thing, with guidance from experts online, I internalized the learning far more than I ever did in lecture, frantically trying to keep up while copying down the chicken scrawl on the board.
There's a huge distinction to be made here between the system that worked for me and the flipped classroom.
Flipping the Classroom Didn't Work
The chemical engineering thermo class at CU has been taught in a flipped-classroom style for the past few years. A flipped classroom "flips" the instructional strategy: the instructor provides lectures and foundational content online, students watch them at home, and in-class time is used for interactive activities like problem-solving and discussion.
Of the Chemical and Biological Engineering students I've talked to over the past 4 years, I've yet to hear a single good thing about the flipped classroom. The professors themselves are largely adored by the students, but the flipped-classroom structure they've inherited is universally despised.
Arguments could be made about student-focused pacing, deeper knowledge development during class time, personalized support, and increased engagement, but I've yet to find any of these to hold true. Individual pacing doesn't really work if all you have is one day to learn the content before the next class. Personalized support and increased engagement also don't hold up if the support structure isn't given to the professor: if you have a huge lecture hall with 150 students and a single professor walking around with 2 TAs answering questions, most students don't get any assistance. Usually groups form, and students try solving problems on their own (funnily enough, this unintended consequence actually does develop learning really effectively). But even that hits a wall with a subject as brutal as thermo: you need that guidance.
The end result is a system where students not only feel like they're wasting their lecture hours without any guidance from an expert, but also have to spend 2-3x their own time at home developing foundational knowledge rather than applying higher-order skills. When I studied on my own, following YouTube lectures and grinding practice problems, that replaced the time I'd have spent in lecture; here it comes on top of it. The entire success of the model depends on students completing the work before class (and there's a whole other can of worms in the incentive structures there).
IMHO the Best Way to Learn
As I've briefly touched on throughout this piece, the best way to learn something is to actually do it, with guidance from someone who knows more than you and can help you develop your own path to understanding. I learned more sitting on floors and swapping desks, working through problems with other students, than I did waiting for the professor to come around to me during lecture. It's a shared struggle (:
Professors for sure know this, but you just can't feasibly run a Socratic dialogue or a HW problem-solving session with 200 students. Professors don't get enough support for the educational side of their responsibilities as it is, and unless the incentive structures and objective functions we use to evaluate both professors and students change, I don't think more support will be coming anytime soon.
University tuition has increased about 312.4% (after adjusting for inflation). CU Boulder, a public university, charges $45k in out-of-state tuition before room and board, and tuition has outpaced inflation for decades. It's harder to get in, it's more financially inaccessible, and yet you still have to pass through that societal filter of "having a college diploma". Ideally you go to uni to grow as a person and learn, but what you're paying for is to attend classes, pass tests, and get your diploma. A lot of the educational substance, the actual learning and growth, is elsewhere.
AI has shifted this landscape pretty significantly. It's the perfect platform for active learning, as long as you're diligent about double-checking and bringing in other sources. Using an AI tool to help break down a paper, gently introduce a concept, or build foundational knowledge is invaluable before trying to attack more advanced material. And being able to go back and forth with as many clarifying dumb questions as I want is a huge help, especially when learning things like new programming libraries and languages.
It's interactive, always available, and can be truly tuned to your pace and adapted to your thinking framework. If used in certain ways, AI can be a huge boon to individual learning.
Changing the Objective Function
The social value of the university is very real, and very impactful: the collisions, the perspectives, the relationships, and the discovery it facilitates. But if educational substance is increasingly accessible outside the traditional structure (and the traditional structure is crumbling), maybe it's time for the structure to shift to actually serve learning better.
Professors are stretched across research, service, and teaching. Give them time to actually focus on teaching well: it takes real effort to structure a course thoughtfully! Hire more people whose job is specifically to teach. Lower class sizes so students can get actual guidance instead of bouncing between overworked TAs.
The "Why Language Models Hallucinate" paper argues that fixing hallucination requires "modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations." You don't fix the problem by adding more tests. You fix it by changing what the tests reward.
The same thing applies here. As long as the system optimizes for throughput and production (graduation rates, enrollment numbers, research output), it'll keep producing students who sound confident and can pass tests but don't actually know. We'll keep training humans to hallucinate, when I think we should focus on training humans to really learn. We're at the perfect point in time: with so many issues in the educational system and such powerful technology in our hands, this is the perfect chance to try to make that shift.