
Viral BridgeBench Post Claims Claude Opus 4.6 Was ‘Nerfed,’ Critics Call It Bad Science

Yahoo Tech 3 min read
Summary (TL;DR)
BridgeMind AI claimed that Anthropic's Claude Opus 4.6 had been secretly downgraded after its score on the BridgeBench hallucination benchmark dropped from 83.3% to 68.3%. Critics, including computer scientist Paul Calcraft, called the comparison flawed because the original score was based on 6 tasks and the retest on 30 tasks, so the two numbers do not measure the same thing. On the 6 overlapping tasks, performance was nearly identical (87.6% vs 85.4%), and the difference was attributed to a single fabrication, which is within normal variance. Broader complaints about Claude Opus 4.6's quality may stem from Anthropic's adaptive thinking controls, which prioritize efficiency. The benchmark does not prove a deliberate downgrade.
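The arithmetic behind the critics' objection can be sketched in a few lines of Python. The four scores come from the summary above; the helper names and the assumption that the score is an unweighted mean over tasks are illustrative only, since the summary does not say how BridgeBench aggregates task results.

```python
# Scores quoted in the summary (percent).
headline_before, headline_after = 83.3, 68.3  # 6-task run vs 30-task run
overlap_before, overlap_after = 87.6, 85.4    # the same 6 tasks in both runs


def drop(before: float, after: float) -> float:
    """Percentage-point drop between two scores, rounded to one decimal."""
    return round(before - after, 1)


def single_task_change(mean_shift: float, n_tasks: int) -> float:
    """Change in ONE task's score that would shift the mean by `mean_shift`.

    Assumption (not stated in the source): the benchmark score is an
    unweighted mean over tasks, so one task contributes 1/n of the total.
    """
    return round(mean_shift * n_tasks, 1)


headline_drop = drop(headline_before, headline_after)  # mixes different task sets
overlap_drop = drop(overlap_before, overlap_after)     # like-for-like comparison

print(f"Headline drop:      {headline_drop} points")
print(f"Like-for-like drop: {overlap_drop} points")
print(
    "One task changing by "
    f"{single_task_change(overlap_drop, 6)} points would explain the like-for-like gap"
)
```

Under that equal-weighting assumption, a 6-task average is sensitive enough that a modest change on a single task accounts for the entire like-for-like gap, which fits the critics' reading that one fabrication, not a model downgrade, explains it.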