Insights from Goodeye Labs

Expert insights on LLM evaluation and AI quality assessment

2025 Year in Review for LLM Evaluation: When the Scorecard Broke

Randy Olson, PhD · December 28, 2025 · 15 min read

In 2025, we discovered we'd been measuring memorization, not intelligence. Models scored 80-90% on static benchmarks but dropped to 60-70% on truly novel problems. This year exposed the fundamental crisis in AI evaluation and taught us what to build instead.
