2025 Year in Review for LLM Evaluation: When the Scorecard Broke
Randy Olson, PhD · 15 min read
In 2025, we discovered we'd been measuring memorization, not intelligence. Models scored 80-90% on static benchmarks but dropped to 60-70% on truly novel problems. This year exposed a fundamental crisis in AI evaluation and taught us what to build instead.