That, in part, comes from x86 vs arm. x86, with its variable length instructions, means your decode logic gets increasingly complex as you widen. Meanwhile with arm, you have constant instruction size (other than thumb, but that’s more straightforward than variable size), that lets Apple do things like an 8-wide decode block (vs. 4-wide for Intel) and a humongous 630-entry OOB (almost double Intel’s deepest cores).
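The core of the argument can be shown with a toy sketch (this is illustrative pseudocode in Python, not real ISA decoding; the instruction lengths are made up): with a fixed width, every instruction's start offset is known up front, so a hardware decoder can crack N instructions in parallel, while variable-length encodings force a sequential (or speculative) scan for boundaries.

```python
def decode_fixed(blob, width=4):
    # Fixed width: all start offsets are multiples of `width`, known in
    # advance -- the N decode slots can work independently in parallel.
    return [blob[i:i + width] for i in range(0, len(blob), width)]

def decode_variable(blob, length_of):
    # Variable length: each instruction's start depends on the length of
    # the previous one, so boundaries are discovered one at a time.
    out, i = [], 0
    while i < len(blob):
        n = length_of(blob[i])       # length decided by the leading byte
        out.append(blob[i:i + n])
        i += n
    return out
```

Real x86 decoders mitigate this with predecode marker bits and length speculation, but that machinery is exactly the complexity that grows as you widen.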
I think this marks the definitive point where RISC dominates CISC. Been a long time coming, but M1 spells it out in bold. Sure, variable size instructions are great when cache is limited and page faults are expensive. But with clock speeds and node sizes asymptoting, the only way to scale is horizontal: more chiplets, more cores, bigger caches, bringing DRAM closer to cache. Minimizing losses from cache flushes through more threads and less context-switching.
Basically computers that start looking and acting more like clusters. Smarter memory and caching, zero-copy/fast-copy-on-mutate IPC as first-class citizens. More system primitives like io_uring to facilitate i/o.
The popularity of modern languages that make concurrency easy means more leveraging of all those cores.
Agree with everything except the RISC vs CISC bit. Modern Arm isn't really RISC, and x86 gets translated into RISC-like microinstructions anyway. To the extent that ARMv8 has an advantage, I think it's due to being a nearly clean design from 10 years ago rather than one carrying the x86 legacy.
AnandTech's analysis corroborates what DeRock is saying here about the x86 variable length instructions being the limiting factor on how wide the decode stage can be.
The other factor is Apple is using a "better" process node (TSMC 5nm). I put it in quotes because Intel's 10nm and upcoming nodes may be competitive, but Intel's 14nm is what Apple gets to compete against today, right now.
> but Intel's 14nm is what Apple gets to compete against today, right now.
Intel's 10nm node is out; I'm typing this on one right now. It's competitive in single-core performance against what we've seen from the M1. In graphics and multi-core it gets beaten, though...
Or do you mean what Apple used to use? (edit: the following is incorrect) It's true Apple never used an Intel 10nm part.
EDIT: I was wrong! Apple has used an Intel 10nm part. Thanks for the correction!
I'm using a MacBook Pro with an Intel 10nm part in it. The 4 port 13" MBP still comes with one. I think the MacBook Air might have had a 10nm part before it went ARM, too.
There are still no 10nm parts for the desktop or the high-end/high-TDP laptops anyway afaik.
Tiger Lake is objectively the fastest thing in any laptop you can buy, regardless of TDP class. You're right about the lack of desktop parts, though.
I'm typing this on a Tiger Lake, too, and it is very fast and draws modest power. But are they making any money on it? If they lose Apple as a laptop customer, how much does that hurt their business?
There was a lot of discussion on a related, recent post about Apple buying many of Intel’s most desirable chips. Will be interesting to see whether the loss of a high-end customer translates into more woes for Intel.
> Meanwhile with arm, you have constant instruction size (other than thumb, but that’s more straightforward than variable size),
That's only for 32-bit ARM; for 64-bit ARM, the instruction size is always constant (there's no Thumb/Thumb2/ThumbEE/etc). It won't surprise me at all if Apple's new ARM processor is 64-bit only (being 64-bit only is allowed for ARM, unlike x86), which means the decoder doesn't have to worry about the 2-byte instructions from Thumb, nor about the older 32-bit ARM instructions (including quirky ones like LDM/STM).
That would also explain why Apple can have wide decoders while their competitors can't: those competitors want to keep compatibility with 32-bit ARM, while Apple doesn't care.
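The contrast is concrete: a Thumb-2 decoder has to inspect each halfword to learn whether the instruction is 16 or 32 bits (per the Arm manual, a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 starts a 32-bit encoding), while an A64-only decoder knows every boundary in advance. A sketch in Python (the byte values in the comments are stand-ins, not a claim about any real code stream):

```python
import struct

def thumb2_lengths(blob):
    # T32: length of each instruction depends on bits [15:11] of its
    # first halfword, so the stream must be walked sequentially.
    lengths, i = [], 0
    while i < len(blob):
        (hw,) = struct.unpack_from("<H", blob, i)  # little-endian halfword
        n = 4 if (hw >> 11) in (0b11101, 0b11110, 0b11111) else 2
        lengths.append(n)
        i += n
    return lengths

def a64_lengths(blob):
    # A64: every instruction is exactly 4 bytes -- no scan needed, and a
    # wide decoder can crack all of them in the same cycle.
    return [4] * (len(blob) // 4)
```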
From the Anandtech article[1] on the M1, I think this is referring to the reorder buffer, or ROB. Reorder buffers allow instructions to execute out-of-order, so maybe that's where the "OO" comes from.
Yes, I was going by memory, which thought “out-of-order buffer” was a thing. I meant ROB. The point still stands: scaling these structures is very difficult with variable instruction size.