Summary (TL;DR)
BridgeMind AI claimed that Anthropic had secretly downgraded Claude Opus 4.6 after the model's hallucination benchmark score dropped from 83.3% to 68.3%. Critics, including computer scientist Paul Calcraft, called the comparison flawed: the original score was based on 6 tasks, while the retest covered 30. On the 6 tasks common to both runs, performance was nearly identical (87.6% vs 85.4%), and the difference traces to a single fabrication, which is within normal variance. Broader complaints about Claude Opus 4.6's quality may instead stem from Anthropic's adaptive thinking controls, which prioritize efficiency. The benchmark result does not prove a deliberate downgrade.
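To see why a 6-task score is a weak baseline, here is a rough back-of-the-envelope sketch of the sampling noise involved. It assumes, purely for illustration, that each task is an independent pass/fail trial with equal weight; the benchmark's actual scoring method is not described in this summary, and the function name is hypothetical.

```python
import math


def binomial_standard_error(score: float, n_tasks: int) -> float:
    """Standard error of a pass rate measured on n_tasks tasks.

    Assumption (not from the source): tasks are independent,
    equally weighted pass/fail trials.
    """
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a fraction between 0 and 1")
    return math.sqrt(score * (1.0 - score) / n_tasks)


# Scores and task counts as reported in the summary above.
original_score, original_tasks = 0.833, 6
retest_score, retest_tasks = 0.683, 30

# Under equal weighting, one task out of six moves the score by 1/6.
one_task_weight = 1.0 / original_tasks

original_se = binomial_standard_error(original_score, original_tasks)
retest_se = binomial_standard_error(retest_score, retest_tasks)

# Standard error of the difference between two independent measurements.
combined_se = math.hypot(original_se, retest_se)
gap = original_score - retest_score

print(f"one task is worth {one_task_weight:.1%} of the 6-task score")
print(f"standard error, 6 tasks:  {original_se:.1%}")
print(f"standard error, 30 tasks: {retest_se:.1%}")
print(f"observed gap: {gap:.1%}, combined standard error: {combined_se:.1%}")
```

Under these simplified assumptions, the reported 15-point gap is smaller than the combined standard error of the two measurements, which is consistent with the critics' point that the headline comparison cannot distinguish a real regression from noise.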