Something being simultaneously described as a "30 sheet, mind-numbingly complex ...

martinald · 2026-02-02T01:55:29 1769997329

It compacted at least twice but continued with no real issues.

Anyway, please try it if you find it unbelievable. FWIW, I didn't expect it to work as well as it did. Opus 4.5 is pretty amazing at long-running tasks like this.

moregrist · 2026-02-02T02:11:22 1769998282

I think the skepticism here is that without tests or a _lot_ of manual QA, how would you know that it did it correctly?

Maybe you did one or the other, but “nearly one-shotted” doesn’t tend to mean that.

Claude Code more than occasionally likes to make weird assumptions, and it’s well known that it hallucinates quite a bit more near the context limit, and that compaction only partially helps with this.

skybrian · 2026-02-02T06:02:10 1770012130

If you’re porting some formulas from one language to another, “correct” can be defined as “gets the same answers as before.” Assuming you can run both easily, this is easy to write a property test for.

Sure, maybe that’s just building something that’s bug-for-bug compatible, but it’s something Claude can work with.
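A rough sketch of what I mean, using only the standard library. Everything named here is made up for illustration: `reference_payment` stands in for whatever the original computes (in practice you'd pull values from the original system), `ported_payment` stands in for the port, and the input ranges and tolerance are arbitrary assumptions.

```python
import math
import random


def reference_payment(rate, nper, pv):
    """Hypothetical stand-in for the ORIGINAL formula (e.g. values read from the old system)."""
    growth = (1 + rate) ** nper
    return -pv * rate * growth / (growth - 1)


def ported_payment(rate, nper, pv):
    """Hypothetical stand-in for the PORTED code, written in an equivalent algebraic form."""
    return -pv * rate / (1 - (1 + rate) ** -nper)


def find_mismatches(old, new, trials=1000, seed=0, rel_tol=1e-9):
    """Feed the same random inputs to both versions and collect any disagreements."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    mismatches = []
    for _ in range(trials):
        args = (
            rng.uniform(0.001, 0.2),         # assumed range for the rate
            rng.randint(1, 360),             # assumed range for the number of periods
            rng.uniform(1_000, 1_000_000),   # assumed range for the principal
        )
        a, b = old(*args), new(*args)
        if not math.isclose(a, b, rel_tol=rel_tol):
            mismatches.append((args, a, b))
    return mismatches


if __name__ == "__main__":
    bad = find_mismatches(reference_payment, ported_payment)
    print(f"{len(bad)} mismatches")
```

An empty list only says the two versions agree on the sampled inputs, so the input ranges matter a lot. A library like Hypothesis would also shrink failing cases for you, but the idea is the same.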

gregoryl · 2026-02-02T08:10:37 1770019837

For starters, Python uses IEEE 754, and Excel uses IEEE 754 (with caveats). I wonder if that's being emulated.
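To make that concrete, here's a small sketch. Python floats are IEEE 754 doubles, so `0.1 + 0.2` is not exactly `0.3`. Excel is documented as keeping 15 significant digits, so one crude way to compare results is to round both sides to 15 significant digits first. The helper `round_sig15` is my own name, and it is only an approximation for comparison purposes, not an emulation of what Excel actually does internally.

```python
def round_sig15(x: float) -> float:
    """Round to 15 significant digits (a rough approximation, not real Excel behavior)."""
    return float(f"{x:.15g}")


raw = 0.1 + 0.2

# Plain IEEE 754 double arithmetic: the sum is slightly off from 0.3.
print(raw == 0.3)               # False

# After rounding to 15 significant digits, the two values compare equal.
print(round_sig15(raw) == 0.3)  # True
```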

stavros · 2026-02-02T02:19:26 1769998766

I generally agree with you, but I tried to get it to modernize a fairly old SaaS codebase, and it couldn't. It had all the code right there; all it had to do was change a few lines, upgrade a few libraries, etc., but it kept getting lots of things wrong. The HTML was wrong, the CSS was completely missing, basic views wouldn't work, things like that.

I have no idea why it had so much trouble with this generally easy task. Bizarre.