That, in part, comes from x86 vs arm. x86, with its variable length instructions, means your decode logic gets increasingly complex as you widen. Meanwhile with arm, you have constant instruction size (other than thumb, but that’s more straightforward than variable size), that lets Apple do things like an 8-wide decode block (vs. 4-wide for Intel) and a humongous 630-entry OOB (almost double Intel’s deepest cores).
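The core of the argument can be shown with a toy sketch (this is illustrative pseudocode in Python, not real ISA decoding; the instruction lengths are made up): with a fixed width, every instruction's start offset is known up front, so a hardware decoder can crack N instructions in parallel, while variable-length encodings force a sequential (or speculative) scan for boundaries.

```python
def decode_fixed(blob, width=4):
    # Fixed width: all start offsets are multiples of `width`, known in
    # advance -- the N decode slots can work independently in parallel.
    return [blob[i:i + width] for i in range(0, len(blob), width)]

def decode_variable(blob, length_of):
    # Variable length: each instruction's start depends on the length of
    # the previous one, so boundaries are discovered one at a time.
    out, i = [], 0
    while i < len(blob):
        n = length_of(blob[i])       # length decided by the leading byte
        out.append(blob[i:i + n])
        i += n
    return out
```

Real x86 decoders mitigate this with predecode marker bits and length speculation, but that machinery is exactly the complexity that grows as you widen.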
I think this marks the definitive point where RISC dominates CISC. Been a long time coming, but M1 spells it out in bold. Sure, variable size instructions are great when cache is limited and page faults are expensive. But with clock speeds and node sizes asymptoting, the only way to scale is horizontal: more chiplets, more cores, bigger caches, bringing DRAM closer to cache. Minimizing losses from cache flushes through more threads and less context-switching.
Basically computers that start looking and acting more like clusters. Smarter memory and caching, zero-copy/fast-copy-on-mutate IPC as first-class citizens. More system primitives like io_uring to facilitate i/o.
The popularity of modern languages that make concurrency easy means more leveraging of all those cores.
Agree with everything except the RISC vs CISC bit. Modern Arm isn't really RISC, and x86 gets translated into RISC-like microinstructions anyway. To the extent that ARMv8 has an advantage, I think it's due to being a nearly clean design from 10 years ago rather than one carrying the x86 legacy.
AnandTech's analysis corroborates what DeRock is saying here about the x86 variable length instructions being the limiting factor on how wide the decode stage can be.
The other factor is Apple is using a "better" process node (TSMC 5nm). I put it in quotes because Intel's 10nm and upcoming nodes may be competitive, but Intel's 14nm is what Apple gets to compete against today, right now.
> but Intel's 14nm is what Apple gets to compete against today, right now.
Intel's 10nm node is out; I'm typing this on one right now. It's competitive in single-core performance against what we've seen from the M1. In graphics and multi-core it gets beaten, though...
Or do you mean what Apple used to use? (edit: the following is incorrect) It's true Apple never used an Intel 10nm part.
EDIT: I was wrong! Apple has used an Intel 10nm part. Thanks for the correction!
I'm using a MacBook Pro with an Intel 10nm part in it. The 4 port 13" MBP still comes with one. I think the MacBook Air might have had a 10nm part before it went ARM, too.
There are still no 10nm parts for the desktop or the high-end/high-TDP laptops anyway afaik.
Tiger Lake is objectively the fastest thing in any laptop you can buy, regardless of TDP class. You're right about the lack of desktop parts, though.
I'm typing this on a Tiger Lake, too, and it is very fast and draws modest power. But are they making any money on it? If they lose Apple as a laptop customer, how much does that hurt their business?
There was a lot of discussion on a related, recent post about Apple buying many of Intel’s most desirable chips. Will be interesting to see whether the loss of a high-end customer translates into more woes for Intel.
> Meanwhile with arm, you have constant instruction size (other than thumb, but that’s more straightforward than variable size),
That's only for 32-bit ARM; for 64-bit ARM, the instruction size is always constant (there's no Thumb/Thumb2/ThumbEE/etc). It won't surprise me at all if Apple's new ARM processor is 64-bit only (being 64-bit only is allowed for ARM, unlike x86), which means the decoder doesn't have to worry about the 2-byte instructions from Thumb, nor about the older 32-bit ARM instructions (including quirky ones like LDM/STM).
That would also explain why Apple can have wide decoders while their competitors can't: those competitors want to keep compatibility with 32-bit ARM, while Apple doesn't care.
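The contrast is concrete: a Thumb-2 decoder has to inspect each halfword to learn whether the instruction is 16 or 32 bits (per the Arm manual, a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 starts a 32-bit encoding), while an A64-only decoder knows every boundary in advance. A sketch in Python (the byte values in the comments are stand-ins, not a claim about any real code stream):

```python
import struct

def thumb2_lengths(blob):
    # T32: length of each instruction depends on bits [15:11] of its
    # first halfword, so the stream must be walked sequentially.
    lengths, i = [], 0
    while i < len(blob):
        (hw,) = struct.unpack_from("<H", blob, i)  # little-endian halfword
        n = 4 if (hw >> 11) in (0b11101, 0b11110, 0b11111) else 2
        lengths.append(n)
        i += n
    return lengths

def a64_lengths(blob):
    # A64: every instruction is exactly 4 bytes -- no scan needed, and a
    # wide decoder can crack all of them in the same cycle.
    return [4] * (len(blob) // 4)
```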
From the Anandtech article[1] on the M1, I think this is referring to the reorder buffer, or ROB. Reorder buffers allow instructions to execute out-of-order, so maybe that's where the "OO" comes from.
Yes, I was going by memory, which thought “out-of-order buffer” was a thing. I meant ROB. The point still stands: scaling these structures is very difficult with variable instruction size.