Apple M1 foreshadows Rise of RISC-V

noizejoy · on Dec 20, 2020

With all the discussion about what the “big trick” is that makes the M1 seem to be such a breakthrough, I can’t help but wonder, if the M1 is more like the iPhone: The sum of a large number of small engineering improvements, coupled with a lot of component integration detail work, topped off by some very shrewd supply chain arrangements.

Analogous to the iPhone being foreshadowed by the iPod without most experts believing Apple could make a mobile phone from that, the M1 was foreshadowed by the A1 for mobile devices with many(most?) experts not forecasting how much it could be the base for laptops and desktops.

It seems, the M1 includes numerous small engineering advances and the near term lockup of the top of the line fab in the supply chain also reminds me of how Apple had secured exclusivity for some key leading edge iPhone parts (was it the screens?).

So the M1 strikes me as the result of something that Apple has the ability to pull off from time to time.

And that is rather hard to pull off financially, organizationally and culturally. And it more than makes up for some pretty spectacular tactical mis-steps (I’m looking at you, puck mouse, cube mac, butterfly keyboard).

EDIT for typo

gchadwick · on Dec 20, 2020

> The sum of a large number of small engineering improvements, coupled with a lot of component integration detail work, topped off by some very shrewd supply chain arrangements.

I think the vertical integration they have is a major advantage too.

I used to work at arm on CPUs. One thing I worked on was memory prefetching which is critical to performance. When designing a prefetcher you can do a better job if you have some understanding or guarantees as to the behaviour of the wider memory system (better yet if you can add prefetching specific functionality to it). The issue I faced is the partners (Samsung, Qualcomm etc) are the ones implementing the SoC and hence controlling the wider memory system. They don't give you detailed specs of how that works, nor is there an method where you can discuss with them appropriate ways to build things to enable better prefetching performance. You end up building something that's hopefully adaptable for multiple scenarios and no one ever gets a chance to do some decent end to end performance tuning. I'm either working with a model of what the memory system might be and Qualcomm/Samsung etc engineers are working with the CPU as a black box trying to tune their side of things to work better. Were we all under one roof I suspect we could easily have got more out of it.

You also get requirements based upon targets to hit for some specific IP, rather than requirements around the final product, e.g. silicon area. Generally arm will be keen to keep area increase low or improve performance / area ratio without any huge shocks on overall area. If you're apple you just care about the final end user experience and the potential profit margin. You can run the numbers and realise you can go big on silicon area and get where you want to be. With a multi-company/vendor chain each link is trying to optimise for some number they control, even if that overal has a negative impact on the final product.

socialdemocrat · on Dec 20, 2020

Very interesting comment. I mean you see some of the same things with companies like Tesla also pushing vertical integration.

A lot of the examples you see are similar to what you talk about. You can cut down on the friction between different parts.

I remember an example of software controlling a blinking icon on the dashboard, where this was a 10 minute code change for Tesla but a 2-3 month update cycle for a traditional automaker due to the dashboard hardware coming from a supplier.

twic · on Dec 20, 2020

If we're comparing the M1 to x86, though, then all the prefetching and other memory shenanigans are on the CPU die. The A1 had an advantage over the SoCs used in Android phones here, but the M1 doesn't have an advantage over Intel and AMD CPUs.

wffurr · on Dec 20, 2020

In the case of the memory prefetcher story, this was about other ARM CPUs, which the M1 and A-series have left completely in the dust some years ago.

bloopernova · on Dec 20, 2020

May I ask what your opinion is on the NVidia ARM purchase?

aledthemathguy · on Dec 20, 2020

thanks for sharing this. you are an insider. did u think of putting your knowledge on writing? i will sure like to read that kind of content :)

danarmak · on Dec 20, 2020

> the partners (Samsung, Qualcomm etc) are the ones implementing the SoC and hence controlling the wider memory system.

And I assume the partners also do some things differently, for at least somewhat legitimate reasons, and no one ARM design can be optimal for everyone.

fluffy87 · on Dec 20, 2020

Nvidia makes arm processors, GPUs and SoCs, so this integration will be good for them if the arm sale is approved.

sitkack · on Dec 20, 2020

You use the word partner with the proper noun Qualcomm but there are no quotes. Qualcomm's only focus is to make money while delivering the worst experience in every direction. They are often stuck in local maximums and they are too big to just flow around.

? shared prefetch queue ?

simonh · on Dec 20, 2020

Apple has used exclusive access to advanced hardware as a differentiator several times. With screens it was Retina. They funded the development and actually owned the manufacturing equipment and leased it to the manufacturing subcontractors.

Also in 2008 they secured exclusive access to a then new laser cutting technology that they used to etch detail cuts in the unibody enclosures of their MacBooks, and then iPads. This enables them to mill and then precision cut the device bodies out of single blocks of Aluminium.

They’ve also frequently bought small companies to secure exclusive use of their specialist tech, like Anobit for their flash memory controllers, Primesense for the face mapping tech in FaceID, and there are many more. For Apple simply having the best isn’t enough, they want to be the only people with the best.

_ph_ · on Dec 20, 2020

Retina is a very interesting example for how Apple works. They have identified the necessary resolution (200+ ppi) for this technology and worked towards across their whole product range. The technology isn't exclusive to Apple, but they are the only company which pushes it, even if it sometimes means quite odd display resolutions.

Other manufacturers seem to be completely oblivious to it. They still equip their laptops either with full hd or 4k screens. The resulting ppi are all over the place. Sometimes way to low (bad quality) or way to high (4k in an 13" laptop, halves the runtime). Same with standalone screens, there is a good selection around 100ppi, but for "high res" the manufacturers just offer 4k in whatever size, so once again the ppi are all over the place again.

whynotminot · on Dec 20, 2020

Retina is a great example of how Apple operates in general. They care about outcomes, not spec sheets. Sure, they'll take the time to spec dunk when the opportunity presents itself. It's just rarely the reason they do something, whereas Dell/HP/whatever want to say "4K SCREEN OMG" on the box regardless of whether that actually leads to a better experience.

Apple realizes there's diminishing returns in upping the resolution beyond what your eyes can see. So they hit Retina, and then move on to focus on color accuracy, brightness, and contrast.

They do this throughout their product stack. Only as much RAM as their phones actually need, because RAM is both costly in terms of BOM and also consumes battery life.

Using fewer Megapixels than competing phones because it's about the quality, not quantity, when other manufacturers trip over themselves to have the most Megapixels.

macintux · on Dec 20, 2020

> Using fewer Megapixels than competing phones because it's about the quality, not quantity, when other manufacturers trip over themselves to have the most Megapixels.

When your customers are picking a product based off a spec sheet, that's the trap you easily fall into.

fomine3 · on Dec 23, 2020

> Apple realizes there's diminishing returns in upping the resolution beyond what your eyes can see. So they hit Retina, and then move on to focus on color accuracy, brightness, and contrast.

It's still not enough resolution for x2 upscaling that Apple argues optimal.

macintux · on Dec 20, 2020

> The technology isn't exclusive to Apple, but they are the only company which pushes it, even if it sometimes means quite odd display resolutions.

Apple's focus and commitment can be a pain to users who want something different, but overall it's a huge strategic advantage.

Software developers are building their apps for M1 because they know beyond a shadow of a doubt that Apple isn't going to keep Intel around any longer than they have to, whereas Microsoft had and will continue to have a hard time persuading anyone to adopt ARM Windows because the opposite is true.

AtlasBarfed · on Dec 21, 2020

That would imply split-brain development. Yes I agree that devs will support M1 out of necessity in supporting a minority market share OS that will expand from M1's apparent superiority....

But Wintel will own business desktops for probably a decade, unfortunately.

abcxjeb284 · on Dec 21, 2020

> But Wintel will own business desktops for probably a decade, unfortunately.

Do you see this as being different from what was going on pre-M1?

Aside from shared math libraries, seems like most stuff required cross compilation just to work - I’m not clear that M1 adds more work on top of that.

withaplomb · on Dec 20, 2020

There are a few PC manufacturers offering 3K laptops at least (Lenovo, Huawei when I looked). For monitors it’s nuts, just Iiyama has a 5k screen with multiple inputs (the LG has only one input so useless for switching between pc and Mac)

musicale · on Dec 20, 2020

Perhaps I'm lucky, but my LG 5K has aged really well. Fantastic upgrade in 2016 and still going strong. I'm surprised there aren't more 5K 27" monitors because they are great.

_ph_ · on Dec 20, 2020

Personally, I hope Dell makes their own screen with the panel of the 6k Apple display. I would probably grab it in a heartbeat. I would consider the Apple display, but I don't need the reference video quality and especially I would like to have more than one input on a display that expensive.

Apple, what are you thinking? People who can afford that display might want to connect their desktop mac and their laptop to it.

AtlasBarfed · on Dec 21, 2020

8k TVs are imminent. The only real market for these with their detail only apparent up close is PC monitors and gaming.

guerby · on Dec 20, 2020

Dell Inspiron 8000 serie was released in 2000 with 15 inch 1600x1200 screen resolution:

https://en.wikipedia.org/wiki/Dell_Inspiron_laptops

That's about 130+ish dpi and it was unmatched for years: I know I had one and was depressed to see dpi do down and down afterwards :)

Hopefully Apple restarted the dpi race.

bitwize · on Dec 20, 2020

I have a Chromebook with a 2400x1600 display. Not one of the standard resolutions, but it is 200-some ppi.

brandonmenc · on Dec 20, 2020

FingerWorks was one of the best of these acquisitions.

By the time I was ready to purchase one of their keyboards to put in my iBook G3 Snow, they had shut down. Little did I know...

https://technical.ly/philly/2013/01/09/jeff-white-fingerwork...

vizzier · on Dec 20, 2020

I believe this is the only consumer 5nm chip currently available as well. Ryzen gen 3 is still on 7nm. I'd be interested to see how well general purpose compute on the m1 vs ryzen gen 3 mobile will be.

simonh · on Dec 20, 2020

The thing is M1 performance isn't really the point in itself though. This is the lowest performance core architecture Apple will ever produce for MacOS, aimed at their lowest end cheapest hardware. It's only one data point towards what we can expect when they take the gloves off and go after the performance end of the market for real.

bloopernova · on Dec 20, 2020

I hope so, I'd love to see a more heterogeneous chip market, with x64, ARM, M1+, RISC-V all competing with each other.

Maybe compilers would be default spit out binaries that run on all of them, like Apple's Rosetta or whatever it's called.

mbreese · on Dec 20, 2020

> binaries that run on all of them

Universal binaries. Apple calls those binaries “Universal binaries”. Rosetta is more like running different arch binaries through qemu. A brief look through Google says that there was a FatELF specification created years back, but never really went anywhere. Presumably because Linux users tend to know what arch they are using.

Fat binaries would make distribution easier, but would double (or triple) the size of a binary. I doubt it would be worth the size trade off.

sitkack · on Dec 20, 2020

Binary size is a small percentage of the overall asset bundle.

Docker also supports multiarch images.

Given how easy it is to JIT or BT RISC ISAs, the future is good for binary portability.

mbreese · on Dec 20, 2020

I’m thinking of all of the small utilities and small command line programs that make up a stock Linux distro. Those don’t have many resources other than the binaries. Sure, the size of each is not much in absolute scale, but combined, you have a pretty significant increase if they were all fat binaries.

That said, I don’t know what Apple does. For example, in the main download for Big Sur, is (for example) zsh a universal binary, or are there a specific x86/M1 downloads. I haven’t looked.

musicale · on Dec 20, 2020

Looking at /usr/bin, /usr/lib/ etc. it's not quite as big as I expected. Going fat might be feasible, especially considering how large storage is at this point.

sitkack · on Dec 20, 2020

My /usr/bin is 800M

    /usr/bin$ du -schL .
    813M total

leecb · on Dec 20, 2020

The Huawei Mate 40 came out two months ago, and uses the Hisilicon Kirin 9000, which is also on TSMC's 5nm. It has about 30% more transistors and a higher clockspeed than the A14.

1123581321 · on Dec 21, 2020

Would you say that the comparison to the A13 from this article is fair? https://www.notebookcheck.net/HiSilicon-Kirin-9000-Processor...

AsyncAwait · on Dec 20, 2020

AMD has substantially more CPU/Microarch design experience.

molszanski · on Dec 20, 2020

"still on 7nm" :)

FullyFunctional · on Dec 20, 2020

They bought Intrinsity as well, leaving us wonder about the fate of Fast14.

intricatedetail · on Dec 20, 2020

What could be the other motivation to be "The only one with the best" apart from greed? That's a pretty strong giveaway of being evil.

simonh · on Dec 20, 2020

They're not 'the only ones' in that childish playground bragging kind of sense. If you want one you can buy one from them. It's just business.

Also much of this tech only existed when it did due to their investment in technology development and tooling. Microsoft had dragged their feet on Hi-DPI for years before Apple jumped ahead with retina and 5K and forced them to pull their socks up. Android phones probably went to 64-bit two years earlier than they otherwise would have. With Surface devices Microsoft has come up with some genuinely innovative and forward looking designs and engineering, but would they have done that without the spur of competing with the iPad and Mac? Apple have been single handedly dragging the rest of the industry kicking and screaming into the future, to everyone's benefit. How is that outcome 'evil'?

alanfalcon · on Dec 20, 2020

I think it starts, at least, with simple practicality. Apple products usually sell in massive volumes. When they introduce some new aspect, they have to be able to produce huge numbers of this thing, and since it's new, the industrial capacity to produce at those numbers reliably does not exist. They have no choice but to buy out and invest in all the available capacity, because they rarely release innovations in some small-scale experimental product. They swing for the fences for a mass market home run almost every time.

And they do this so well because they're led by Tim Cook who approaches the big picture in part in terms of supply chain.

I don't argue that cornering the market isn't an accounted-for and desirable side-benefit of this arrangement. I just don't think it's usually the starting point for them.

christophilus · on Dec 20, 2020

It’s business. This is all part of the reputation and image Apple wants. Any reasonably managed business will lean into its differentiations. This happens in every business sector at any measurable scale.

calvano915 · on Dec 20, 2020

Greed sure, but it's also control. Apple has always been fanatical about controlling what the user, seller, repair shops, retail, etc can do with their products. Among other reasons, its one of the top reasons I'll never use Apple computers. I want complete control of my devices, and the ability to break my machines while pushing boundaries. Apple would rather have me locked in to their 30% fee App Store cash cow with apps that require their approval to exist.

musicale · on Dec 20, 2020

I have heard that it might be possible to hack your Mac laptop (using a setting in System Preferences>Security&Privacy>General) to allow it to run code on that isn't from Apple's App Store.

simonh · on Dec 29, 2020

There is no lock preventing you installing non-Apple executable. Theres a warning and such but I install apps from third parties all the time. There are even several package managers, macports and homebrew, that will install userland unix binaries for you.

jononor · on Dec 20, 2020

Where is that setting on the iPhone and iPad?

EricE · on Dec 24, 2020

There isn't one because the iPhone/iPad is a closed ecosystem, whereas macOS is not.

There are pro's and con's to both models; choose the one that works best for you. Even if it's not Apple :)

sneak · on Dec 20, 2020

Brand marketing. It cements the Apple brand as being high quality and Apple products as being better than anything else available, justifying a premium price in the minds of their customers.

It's usually a legitimate justification, too. Nobody else makes hardware this nice, almost as a general rule.

Shivetya · on Dec 20, 2020

It is not greed, it is ego. Apple as a whole has an ego complex and it shows in so much of what they do. They always, to the point of harping, emphasize the Pro label and premium experience and premium materials. Even their head quarters is an expression of pure ego.

So when compared to other companies which do put out products with similar materials the difference is that Apple never reaches low. They won't sacrifice their premium material usage and expression to reach lower end markets.

Air Pods Max might be the situation backfiring on them. They were so wrapped up in having aluminum enclosures for the headphones they ended up with an overweight and overpriced to many product. It is going to be real curious what a lower cost version of these entail; the rumor is they will come but when and of what materials?

_ph_ · on Dec 20, 2020

The sum of a large number of small engineering improvements, coupled with a lot of component integration detail work, topped off by some very shrewd supply chain arrangements.

I think you precisely have it. There is no single magic reason the M1 is so good, just a lot of things coming together. They start with a better instruction set than x86, have of course the best process available, and perhaps the largest part, they have built up an increadible team over a decade. And they are extremely focussed in what they target. If anything, that is Apples "magic". They are not making a chip which is built in an abstract manner to be sold to random customers. They have exactly their needs in mind and execute towards those. In a sense AMD did that with the chips for the Playstation/XBox. Like the M1 it is basically a SOC. There optimized for great graphics performance. Unfortunately, those chips are not sold separately for building your own PC.

alwillis · on Dec 20, 2020

So the M1 strikes me as the result of something that Apple has the ability to pull off from time to time.

Perhaps you haven't been paying attention?

Apple shipped 64-bit ARM processors for the iPhone at least a year before Qualcomm could do it for Android devices. The reaction to the A7 was similar to what we're seeing now with the M1—not possible, there's some trickery going on, etc.

Apple is pretty good at this processor transition thing, going from 68k to PowerPC to Intel to ARM.

And it more than makes up for some pretty spectacular tactical mis-steps (I’m looking at you, puck mouse, cube mac, butterfly keyboard).

Except for the recent keyboard issues, you're literally talking about another era. I wouldn't put the shape of the mouse for the 1998 iMac in the same category as transitioning a $9 billion revenue product line to a radically different processor architecture.

dwighttk · on Dec 20, 2020

> not possible, there's some trickery going on

“...and even if they did it, it doesn’t matter” (though I guess not many are saying that about M1)

alwillis · on Dec 21, 2020

Yes—there was a lot of “it won’t matter” talk when the A7 shipped.

DonHopkins · on Dec 20, 2020

Don't forget Apple's first transition from 6502 to 68k. That was pretty bumpy though.

alwillis · on Dec 20, 2020

I was only talking about the Mac, which never used the 6502–that was the Apple II.

The Apple IIgs did end up on the 65816.

hajile · on Dec 20, 2020

> Apple shipped 64-bit ARM processors for the iPhone at least a year before Qualcomm could do it for Android devices. The reaction to the A7 was similar to what we're seeing now with the M1—not possible, there's some trickery going on, etc.

That is because it is not possible to ship a top-end design for a new ISA in that amount of time. The more reasonable answer is they had been working on a new core design for some years before. AMD has hinted that their Zen design makes it relatively easy to swap the x86 frontend for an ARM frontend.

Apple was considering buying MIPS around that time. I suspect they strong-armed ARM into accepting their ARMv8 proposal because it was good and because Apple buying MIPS would be disastrous for ARM's share price. At that point, it wasn't faster than possible, it was just designing the last part of the chip (or if both frontends were being worked on in tandem, cancelling one of them and focusing everyone on the other).

This explains why ARM announced v8 and then took the full 4 years to ship their first low-power core (A53) and even longer to ship their bad first try at a high-performance core (A57 -- with the more baked A72 being superior in almost every way).

jonas21 · on Dec 20, 2020

> the near term lockup of the top of the line fab in the supply chain also reminds me of how Apple had secured exclusivity for some key leading edge iPhone parts (was it the screens?).

Yes, Apple managed to lock up most of the global supply of capacitive touchscreens for about a year after the iPhone came out. The iPhone wasn't the first phone to use a capacitive touchscreen, but for a while, it seemed like it was because nobody else could produce devices with them in large volumes.

People used to dismiss Tim Cook as "just" a supply chain guy. But I think it's become clear that supply chain management is at least as important to Apple's success as anything on the design or marketing side.

Zenst · on Dec 20, 2020

In some ways it is the environment the M1 was born from that helped. mobile space CPU's focus upon low power usage and that has seen many core software tasks get dedicated instructions and why you end up with the M1 in some tests utterly trouncing competition as it has dedicated hardware catering for the common niche things that software ends up doing - the hardware video encoding being a small area, but deep down, more than that. This along with advances in software/hardware integration and being able to synergies that at a level nobody else can. The way to think of it is - if Intel did an operating system from scratch, it would tap the CPU extremely well compared to others due to them knowing the internals better and fully. Then add the ability for them to see that adding some dedicated hardware to replace some software instruction combinations and you start to see a tightly integrated team of CPU and Operating system/software.

One area that I've always wished CPU's would take would be a dedicated core or two for the OS that is completely isolated from the other cores, which would be for the software/applications you run. Now if those ran about two different architectures - darn that would be the inner geek in me appeased.

setpatchaddress · on Dec 20, 2020

What would your goal be? I think locking a modern, general-purpose OS to a small number of cores would artificially constrain performance, assuming a reasonable scheduler.

Zenst · on Dec 20, 2020

The OS don't need that much grunt, some drivers maybe - hence one or two cores, isolated away from user software and secure.

Scheduler wise it would become easier as no juggling cycles upon main core that the OS is running upon. Whilst not hitting real-time OS levels, it would be a nice middle ground in some area's as well.

Also for cores, the OS and user software don't need the same OS when the OS is just API calls passing parameters, so would be viable to have the OS upon one core type and the user-space cores a totally different architecture. That again would add another level of security.

billsnow · on Dec 20, 2020

You can already do this. There's a kernel boot parameter called isocpu or something and the kernel will only run on the logical cpus listed. Furthermore, you can isolate your user programs. The general benefit is latency, but there's a theoretical trade-off in scheduling bandwidth. The memory heirarchy will be utilized less efficiently too.

africanboy · on Dec 20, 2020

The project HydrOS aimed to build an Erlang operating system that way: one core for the OS, one scheduler per core other than the OS one.

Unfortunately it looks the project is now abandoned

Zenst · on Dec 20, 2020

Thank you, not heard of that, had a look https://www.youtube.com/watch?v=8OyRFbf6MDk Seems the main/only dev moved onto other things and it stagnated.

stoat_meniscus · on Dec 20, 2020

The PS3 did something like this. One processing unit of the Cell processor was reserved for the OS while the rest could be used for games.

jolux · on Dec 20, 2020

The first A-series chip was the A4, because it came out with the iPhone 4.

ogre_codes · on Dec 20, 2020

> The sum of a large number of small engineering improvements, coupled with a lot of component integration detail work,

Exactly. ARM has been progressing faster than Intel. For the past 8 years or so, Apple has had the fastest ARM CPU out there on the iPhone/ iPad. Apple has sucked up TSMC's 5nm production. They've integrated a pile of relevant coprocessors into the CPU and put fast RAM on the package. The SSD is lightning fast and SSD encryption is done via a dedicated coprocessor.

It's not one magic trick, it is countless bits of engineering, manufacturing, and purchase choices.

xorcist · on Dec 20, 2020

> by the iPod without most experts believing Apple could make a mobile phone

Except for all the people practically begging Apple to make a phone for years, except all the analysts who wrote essays on how computer companies could make successful phones, except for all the fanboys making fan-art of phones with that big circular wheel.

api · on Dec 20, 2020

I don't buy it. I think there is in fact one "trick," which is shedding the X86 decode bottleneck.

People always make the point that the X86 decoder is only ~5% of the die. Sure, that's true, but keep two things in mind:

(1) While it's only 5% of the die, it runs constantly at full utilization. The ALU is also only a small percentage of the die (5-10%). How hot does your CPU get when you're running the ALU full blast? Now consider that there is a roughly ALU-sized piece always running full blast no matter what the CPU is doing because X86 instructions are so complex to decode. Not only does this give X86 a higher power use "floor," but it means there's always more heat being dissipated. This extra heat limits thermal throttling and thus sustained clock speed unless you have really good cooling, which is why the super high performance X86 chips need beefy heatsinks or water cooling.

(2) It apparently takes exponentially more silicon to decode X86 instructions with parallelism beyond 4 instructions at once. This limits instruction level parallelism unless you're willing to add heat dissipation and power, which is a show stopper for phones and laptops and undesirable even for servers and desktops.

People make the point that ARM64 (and even RISC-V) are not really "RISC" in the classic "reduced" sense as they have a lot of instructions, but that's not really relevant. The complexity in X86 decoding does not come from the number of instructions or even the number of legacy modes and instructions but from the variable length of these instructions and the complexity of determining that length during pipelined decode.

M1 leverages the ARM64 instruction set's relative decode simplicity to do 8X parallel decode and keep a really deep reorder buffer full, permitting a lot of reordering and instruction level parallelism for a very low cost in power and complexity. That's a huge win. Moreover there is nothing stopping them from going to 12X, 16X, 24X, and so on if it's profitable to do so.

The second big win is probably weaker memory ordering requirements in multiprocessor ARM, which allows more reordering.

There are other wins in M1 like shared memory between CPU, GPU, and I/O, but those are smaller wins compared to the big decoder win.

So yes this does foreshadow the rise of RISC-V as RISC-V also has a simple-to-decode instruction set. It would be much easier to "pull an M1" with RISC-V than with X86. Apple could have gone RISC-V, but they already had a huge investment in ARM64 due to the iPhone and iPad.

X86 isn't quite on its death bed, but it's been delivered a fatal prognosis. It'll be around for a long long time due to legacy demand but it won't be where the action is.

imtringued · on Dec 21, 2020

>This extra heat limits thermal throttling and thus sustained clock speed unless you have really good cooling, which is why the super high performance X86 chips need beefy heatsinks or water cooling.

The 16 core Ryzen has the same TDP as the 8 core Ryzen. Increasing the clock speed for slightly more single core performance is an intentional design decision, not an engineering flaw. Clock up those Apple chips and they are going to guzzle more power than AMD's chips. https://images.anandtech.com/doci/14892/a12-fvcurve_575px.pn...

Apple's preference for manufacturing processes that optimize for mobile low ower consumption below the 4Ghz range mean scaling up is harder than just slapping a higher TDP on the chips. Remember the TDP of the whole package already exceeds the TDP of the most power hungry Ryzen core running at 4.8Ghz. Apple has enough headroom to boost to the same frequencies but they don't, because of the manufacturing process they have chosen which loses all of its efficiency beyond 4Ghz.

pmlnr · on Dec 20, 2020

Or... in 10 years, we'll have another round of Meltdown, Spectre, etc, because of the big tricks.

scottlocklin · on Dec 20, 2020

I haven't studied it carefully, but it sure looks like 90% of the performance improvement is using a big cache, which is a totally obvious thing to do. Also the big x86 guys have more or less been asleep at the wheel for almost a decade.

My go to example: my 2011 x220 sandybridge stinkpad is faster than my 2017 kaby lake mbp. 2005 machines (I dunno, Lakeport?) aren't even in the same ballpark as modern machines. Had that pace continued up to current year, the M1 chip would be a stinker. As it is, AMD is close and could smoke M1 in next generation 5nm chips, restoring order to the universe.

whynotminot · on Dec 20, 2020

> I haven't studied it carefully, but it sure looks like 90% of the performance improvement is using a big cache, which is a totally obvious thing to do. Also the big x86 guys have more or less been asleep at the wheel for almost a decade.

Dude, has Intel called you yet? You've got some serious CTO chops.

CrazyStat · on Dec 20, 2020

>As it is, AMD is close and could smoke M1 in next generation 5nm chips, restoring order to the universe.

Comparing next gen to current gen is a strange way to do things. Apple will also have a next gen M chip.

NathanOsullivan · on Dec 20, 2020

AMD has a public roadmap, and (assuming they execute) their next CPU in 2021 will be 5nm.

Apple do not of course, but TSMC have stated they expect 3nm to be produced in volume in 2H22 which tells us we have at least one more round of Apple Silicon on 5nm.

In 2021 we will presumably see Apple Silicon parts with increased core counts and desktop TDP

This will be our first chance to see how the best of ARM goes against the best of X86 in like-for-like configurations.

enos_feedler · on Dec 20, 2020

I think the narrative around instruction set is a bit overblown. I was a chip architect for the shader core at a major GPU company. I worked on simulators and modeling performance for next generation chips where we changed the ISA for each family of chips. The big reason why Apple Silicon is so damn fast is because they were able to shape the design at modeling time exclusively around Mac system level workloads. At best, Intel would have some subset of traces come from Apple for important traces to optimize for. Combine getting to narrow traces down exclusively for one platform, and heterogeneous design space (cpu + coprocessors) and you can really tune a monster.

skohan · on Dec 20, 2020

> The big reason why Apple Silicon is so damn fast is because they were able to shape the design at modeling time exclusively around Mac system level workloads

Is that really the case? My understanding was that M1 is fast because it's able to keep the chip saturated with instructions due to a large L1 cache and wide instruction decoders. Is anything about that specific to mac workloads?

helsinkiandrew · on Dec 20, 2020

> Is anything about that specific to mac workloads?

The memory and instruction architecture may be more 'generic' but it and the neural engine, storage and media controllers, image processor etc will have been shaped and fine tuned by the requirements of the mac.

It is probably the marginal gains of each subsystem being 5-10% better for purpose that gives it the edge.

rhencke · on Dec 20, 2020

It's mostly just inertia.

https://images.anandtech.com/doci/16226/perf-trajectory_575p...

eru · on Dec 20, 2020

Interesting!

Compare https://slatestarcodex.com/2019/03/13/does-reality-drive-str...

skavi · on Dec 20, 2020

Does optimizing for a single system really improve performance significantly in general purpose computing benchmarks like SPEC? IIRC, the M1 also does fairly well with virtualized Linux.

kbumsik · on Dec 20, 2020

It does. The point is how Apple could use so much silicon space for the CPU in the SoC. Most AP and CPUs are designed to be general purpose as possible, so there are some spaces used for unnecessary interfaces. But Apple could use those spaces for CPU. Also Apple could increase the die size without thinking about profit in chip production because they earn money from their devices, so by selling SoCs like others. No other companies can do silicon business like Apple.

beagle3 · on Dec 20, 2020

That doesn’t sound reasonable to my (electrical engineer but 15 years since last involvement in processor design) ears.

Intel has enough product lines that there are no “unnecessary interfaces”; what were the unnecessary interfaces in the Intel chip used by Apple?

Similarly, so does Qualcomm and the tens of other of ARM licensees - any mass market design can find or customize a core with no meaningful dead weight.

It may be a contributing factor, but I have not seen anything to indicate it’s very important.

I give the “many small things done right” theory much higher likelihood - just as in the case if the iPhone, there wasn’t any specific thing that wasn’t done before - except for a winning combination.

phkahler · on Dec 20, 2020

AVX512 is a waste of transistors for most workloads.

beagle3 · on Dec 20, 2020

Yes. But Apple has the neural accelerators which are a waste of transistors for most workloads, and Intel has a bunch of chips without AVX512.

So that does not fit the GPs claim that the M1 is fast because Apple does vertical integration and others have to support many systems.

EricE · on Dec 24, 2020

"But Apple has the neural accelerators which are a waste of transistors for most workloads"

Based on?

Many Apple frameworks tie into those accelerators; including frameworks like Metal that one wouldn't necessarily expect.

Another strength of Apple - as a programmer your workloads can gain automatic acceleration as frameworks are extended to leverage new hardware. Note this isn't anything new; it's been going on for years already.

mschuster91 · on Dec 20, 2020

> what were the unnecessary interfaces in the Intel chip used by Apple?

Everything Management Engine related, for one.

ithkuil · on Dec 21, 2020

Do we have any indication apple has a similar management cpu buried somewhere?

mschuster91 · on Dec 21, 2020

All the new models have the T2 chip.

whynotminot · on Dec 20, 2020

The M1 runs Windows through virtualization faster than Windows runs natively on Qualcomm ARM.

People need to stop saying stupid stuff like "It's only fast because it's so intensely focused on macOS."

saagarjha · on Dec 20, 2020

Apple silicon is plenty fast at “non-Mac” workloads.

diroussel · on Dec 20, 2020

How do you run an M1 not in a Mac? It only runs with apple RAM and Apple IO and apple firmware.

Wowfunhappy · on Dec 20, 2020

I assume the parent means UNIXy tools which target ARM broadly are still very fast on Apple Silicon. Not to mention VM performance.

But your point, that there's still a lot of Apple stuff at play there, is a fair one. It will be very interesting to see how (native) Linux runs once that porting effort gets more under way...

artemonster · on Dec 20, 2020

being "a chip architect for the shader core at a major GPU company" sounds like a dream job for me. Do you have any interesting tips or books to read for a fellow hardware design engineer? :)

socialdemocrat · on Dec 20, 2020

I don't see how an ISA doesn't matter. While not a chip architect like you, I do work as a developer and I know that the interface you make to something affects what kind of performance you can build in the backend.

In principle whether you are using Python or C++ doesn't matter. It is just an interface. The compiler or interpreter in the back decides the actual performance. Yet it is pretty obvious that the specifications of C++ syntax makes it much easier to create a high performance compiler than the specification for Python.

I have been quite involved with Julia. It is a dynamic language like Python, but specific language syntax choices has made it possible to create a JIT that rivals Fortran and C in performance.

Likewise we have seen from Nvidia slides when the went with e.g. RISC-V over ARM, that the simple and smaller instruction-set of RISC-V allowed them to make much cores consuming much smaller silicon, better fitting their silicon budget.

When you worked as a chip architect didn't the ISA affect in any way how hard or easy it would be for you to make an optimization/improvement in silicon?

I mean if one ISA requires memory writes to happen in order, or have variable length, or left too little space for encoding register targets etc. All that kind of stuff is going to make your job as a chip architect harder isn't it?

Also I don't quite get your argument about modeling the M1 around Mac workloads. We know the M1 is having great performance on Geekbench and other benchmarks which have not been specifically designed for Mac workloads.

Only things I can see with M1 which is specific to Mac is:

1) They do the code needed for automatic reference counting faster. Big deal on Mac since more software is Objective-C or Swift which uses automatic reference counting.

2) They prioritized faster single cores over multiple cores. Hence optimizing more for a desktop type workload than a server workload.

3) A number of coprocessors/accelerators for tasks popular on Macs such as image processing and video encoding. But that is orthogonal to the design of the Firestorm cores.

I don't claim to know this remotely as well as you. I am just trying to reason based on what you said and what I know. Would be interested in hearing your thoughts/response. Thanks.

auggierose · on Dec 20, 2020

It's not the C++ syntax that makes it fast, it's the semantics.

socialdemocrat · on Dec 20, 2020

Yes but that is all included in an ISA. A CPU ISA specifies opcodes as well as semantics. Together that affects what optimizations you can do. Or rather how much silicon and brain power you need to accomplish it.

gcanyon · on Dec 20, 2020

Does this potentially mean that as the OS evolves, the chip will likely become less efficient, as it becomes "out of tune"? Apple could mitigate this obviously.

Animats · on Dec 20, 2020

Not clear what RISC-V has to do with the Apple M1.

Also not clear what benefit RISC-V would have for "coprocessors". GPUs and various machine learning speedup devices are massively parallel devices, intended to run small, specialized programs in parallel on multiple specialized execution units.

Also note that the real win of the Apple M1 is lower power consumption. In terms of basic compute speed, there are Intel products that are roughly comparable. But they use more much more power. This is more about battery life than compute power. (Also heat. Apple laptops have had a long-standing problem with running too hot, from having too much electronics in a thin fanless enclosure. The M1 gets them past that.)

The hardware video decoder is probably to make it possible to play movies with most of the rest of the machine in a sleep mode. The CPU is probably fast enough to do the decode in software, but would use more power than the video decoder.

webmobdev · on Dec 20, 2020

> Also not clear what benefit RISC-V would have for "coprocessors".

As the article states, the architects of RISC-V recognized that co-processors that assist the CPU to do more and more specialised repetitive tasks will be the norm. Thus, RISC-V was designed in a way to accommodate such co-processors, with limited instruction sets that make its CPU design simpler.

The Apple ARM processor is also similar - the ARM system-on-chip they have designed is highly customised with many co-processors all optimised for the macos / ios platform.

Apple's SoC contains a GPU, an image processing unit, a digital signal processing unit, Neural processing unit, video encoder / decoder, a "secure enclave" for encryption / decryption and unified memory (RAM integrated) etc. (Note that this is not a unique innovation - many ARM SoCs like these already exist in different variations. In fact, it's what made ARM popular.) Obviously, when a system software or application uses these specific units of an SoC to process specific data, they may be faster than a processor that doesn't have these units. And Intel and AMD processors currently don't have these specific units integrated with their CPU.

Anyway, the point the article is making that the RISC-V architects recognized that such co-processors will be the norm in the future, and thus the author is predicting that RISC-V will become more popular, now that the M1 acts a showpiece for the architectural idea the RISC-V wants to popularise.

FullyFunctional · on Dec 20, 2020

And this is just idle speculation and I (and I guess Animats) don't necessarily buy. I think RISC-V is doing fine and will grow regardless.

Where more Arm mainstream success will have a slipstream effect on RISC-V is in app porting. There are significant differences between x86 and Arm, notably memory model (AS does support TSO with a flag, but native apps use the weak mode). Porting from x86 to Arm can be non-trivial, whereas porting from Arm to RISC-V is far easier.

socialdemocrat · on Dec 20, 2020

As mentioned in the article Nvidia has selected RISC-V to use in their graphics cards after careful evaluation of alternatives. RISC-V beat all the alternatives. A lot of other accelerator card makers are reaching the same conclusions. You can google and find many white papers detailing how RISC-V is getting incorporated into accelerators/coprocessors with great success.

FullyFunctional · on Dec 21, 2020

I'm aware of this (I've been doing since RISC-V before it was public), but it seems unlikely to me, knowing how these companies operate, that Apple would use RISC-V for anything when they already have extensive HW and SW Arm64 IP, expertise, and infrastructure.

NVIDIA's use of RISC-V is for the internal power sequencing cores and similar tasks. This is not going into the shaders, nor is Arm64. Their future use of RISC-V for stuff like this will probably continue, but it seems unlikely that they will ever release anything with a user-visible ISA based on RISC-V.

floatboth · on Dec 20, 2020

RISC-V is just an ISA. How exactly can it popularize the already extremely popular idea of shoving a bunch of peripherals onto a SoC?

> RISC-V was designed in a way to accommodate such co-processors, with limited instruction sets that make its CPU design simpler

This only affects the extremely tiny embedded space, only under the most extreme constraints you have the "simpler ISA → simpler CPU core design → more space on the silicon for coprocessors" thing.

For a general purpose high performance SoC, you don't want a simple CPU design, you want a fast CPU design, and you have space for all the coprocessors you want anyway.

Other than "being simple", there's nothing an ISA has to do with coprocessors. There's nothing ISA-specific about having memory-mapped peripherals.

Adding custom instructions directly to the CPU ISA instead? That's not exactly coprocessors, that's more like modifying the main processor, it's annoying (fragmentation) and Apple for some reason was allowed to do it with Arm anyway >_<

> Intel and AMD processors currently don't have these specific units integrated with their CPU

Of course they have GPU, video encode/decode, "secure enclave" (fTPM).

There's even an ISP on some Intel laptop chips: https://www.kernel.org/doc/html/v5.4/media/v4l-drivers/ipu3....

Neural thingy.. I'm happy not to pay for one :P

hajile · on Dec 20, 2020

Consider tensor cores. They are basically just 4/8-bit SIMD units. Add a very basic integer control unit and some specialized, custom tensor instructions. It's literally just another CPU core from an integration perspective. It shares the same memory/coherency architecture in every way without a bunch of subtle edge cases laying in wait. It even largely shares the same programming model for software making optimizations easier and faster to create in the compiler. Along the same lines, if the OS is aware of the extensions on each core, it can view all processors in the system as "cpu cores", but target specific cores based on their available extensions.

This same process applies to most of the GPU. Nvidia uses RISC-V controllers for this reason. AMD uses a scalar unit of whatever ISA to do one-off calculations and control SIMD flow. A shared privilege model would also be good here. The current GPU landscape is rife with ways to bleed into privileged space. A shared hardware privilege model would go a long way toward dealing with this issue.

> For a general purpose high performance SoC, you don't want a simple CPU design, you want a fast CPU design, and you have space for all the coprocessors you want anyway.

You don't want your one core that does fast CPU execution OR fast tensor OR gpu OR whatever else. You want dedicated cores to do those things AND fast CPU cores too. There are quite a few RISC-V extensions aimed at improving single-thread performance and code density. Like with other ISAs, if there's any low-hanging fruit at the instruction level, it will be added to an extension soon enough.

rjsw · on Dec 20, 2020

It could be good to use the same ISA for the coprocessor as for the main CPU, worked well for IBM mainframes.

Smarter Ethernet NICs often contain a CPU to do things like TCP offload, the "microcode" for them can be just machine code for this CPU, having a more open architecture for the peripheral devices would make it easier to add support for more network protocols.

webmobdev · on Dec 21, 2020

> Of course they have GPU, video encode/decode, "secure enclave" (fTPM).

Yes, they do, but they are limited to that in comparison to the M1. That was my point. The M1 has a lot more co-processors than the current offerings of Intel / AMD.

simonh · on Dec 20, 2020

Hardware accelerator modules also often need their own mini-cpu built in to them as a controller, apart from the main CPU cores. RISC-V was specifically designed for this use case to have a very light weight core ISA, with an extensions mechanism that make it easily customisable for specific accelerator design. Even the lightest weight ARM cores are monsters in comparison.

Thus we are likely to see a lot of new ARM chips containing a few RISC-V cores tucked away inside the design. In fact NVIDIA already does this on some graphics cards and it’s not impossible M1 does as well.

jillesvangurp · on Dec 20, 2020

They are both RISC instruction sets used for SOC style chips. So what makes the M1 successful should also work for RISC-V. The key non-technical difference is that you won't have to license it from NVidia which might make it attractive to some companies that don't want to pay as many license fees to these companies.

Licensing and patents have historically been in the hands of only a few companies; which limits other companies doing custom designs. With RISC-V, that could change. Of course that's only the instruction set and you'd likely still need to license lots of patents to get anything shipping. But it fits the pattern of OSS software driving a lot of innovation and hardware design becoming more like software design.

Theoretically, if Intel wanted to make a comeback, RISC-V might actually be interesting for them. Right now they would have to compete with Apple, Nvidia/ARM and the likes of Qualcomm for non X-86 based CPUs. Those three are basically using ARM based designs and you need to license patents and designs to do anything there. Intel having to license chip designs + patents from their competitors is likely not compatible with their ambitions of wanting to dominate that market (like they dominated X86 for nearly half a century). They are clearly having issues keeping X86 relevant. So, RISC-V might provide them an alternative. The question is if they have enough will left to think laterally like that or whether they are just doomed to slowly become less relevant as they milk the X86 architecture.

runeks · on Dec 21, 2020

> Also note that the real win of the Apple M1 is lower power consumption.

Doesn’t this automatically translate into higher performance — by adding more cores or increasing clock rate — since TDP is the limiting factor for CPU speed?

I mean, if someone created a 1W CPU that performed as well as a 100W CPU, would you say “lovely, a lower power CPU” or “overclock/add cores until it reaches 100W and give me that”?

webwielder2 · on Dec 20, 2020

Why do people ascribe broader ARM implications to the M1? Apple uses the ARM instruction set to make an amazing CPU. It could probably make one with the x86 set too. It doesn’t mean everyone else making ARM processors will suddenly get much better. Not to mention that Apple’s very similar A series has already been around for years.

scoopertrooper · on Dec 20, 2020

There was an article posted not long ago that suggested the variable length instruction set in x86 chips prevented some of M1's most important design innovations being replicated by Intel and AMD.

https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...

psykotic · on Dec 20, 2020

It's true that ARM64 has a load-store architecture and fixed-length instructions (the latter depending on the former for encoding space efficiency). Other than that, the instruction set design is very far from minimalist textbook-style RISC ISAs like RISC-V. It has both flag-based branches and fused compare-to-zero-and-branch instructions. It has very complex immediate encodings. It has instructions for loading/storing register pairs. It has pre-increment/post-increment addressing modes of the kind that were hallmarks of CISCs like M68K and VAX.

It seems unwise to draw far-reaching conclusions about RISC-V or even ARM64's intrinsic merits versus Apple's CPU designers when there are so many variables. The frontend decoder hasn't been a frequent bottleneck in Intel cores for a long time and they could scale it up more aggressively if they wanted.

Apple's engineers did a great job. That seems to be the conclusion we can draw based on currently available evidence.

FullyFunctional · on Dec 20, 2020

> The frontend decoder hasn't been a frequent bottleneck in Intel cores for a long time and they could scale it up more aggressively if they wanted.

This isn't grounded in any facts. Decoding the variable length x86 ISA costs you exponentially in decoding width, both power and area. You can scale it, but it will never be efficient. The way Intel and AMD combat this is by having a decoded uOP cache from which the issue width is typically twice that of the frontend decoder. Arm64 has an inherent advantage here (RISC-V does not have quite the same advantage as RV64GC instructions are a mix of 16- and 32-bit). Arm64 also is much more recent design than x86_64 that learned a lot from the past experience and isn't bogged down by a lot of useless legacy. This helps.

Arm64 is rather large for a RISC ISA, but it's mostly pretty good (however IMO RISC-V's lack of flags and implementation of conditional branches is superior).

psykotic · on Dec 20, 2020

Of course a fixed-length ISA has an inherent advantage for parallel decoding efficiency. The question is whether that is a decisive advantage in M1's impressive performance. After Intel refined their decoder and uop cache, you virtually never see that part of the frontend as a bottleneck when doing microarchitectural profiling. That's been true since Sandy Bridge but even more so since Skylake.

All the legacy junk in x86 is obviously a pain for Intel to support. Any blank-slate ISA is going to have an advantage there.

twic · on Dec 20, 2020

Presumably the importance of decode bandwidth depends on what you're decoding.

Most classic computationally intensive work (video encoding, science, but also benchmarks) spends its time in fairly tight loops or small kernels, running over large data. uop caches make decode bandwidth irrelevant here.

But general usage of a machine sees the instruction pointer wander all over the place (particularly if you have multiple tabs of JavaScript open). More decode bandwidth means more performance here.

Are compilers are an an example of a heavy workload with a large hot code size? It would be interesting to compare the M1's advantage in compiling to its advantage in, say, video encoding.

Symmetry · on Dec 20, 2020

It doesn't take exponential power. My understanding is that the basic approach for instructions without boundary tagging in L1I$ is to start decoding every byte in the stream in parallel, discard the ones that don't make sense, and then later propagate boundary to boundary across the length of the fetch window. Sort of similar to how a carry-bypass adder works. This is expensive but not that expensive compared to other structures.

But it does mean that x86 designs tend to carefully balance the size of the decoders to other structures to make sure they're not the binding constraint too often. With ARM the approach seems to be more to make the front end 50% bigger than you think you need to be sure it's never a problem and refill the front end buffers more quickly after a mispredict.

psykotic · on Dec 21, 2020

Yeah, the algorithm for parallel decoding you outlined scales linearly in area and power with respect to the speculative look-ahead depth. This is true even if you speculate on more than the per-byte "boundary or not boundary" condition. A parallel-prefix circuit for processing a DFA with m states where you speculate on all m possible initial states for each of n bytes "only" consumes O(m n) power. [1] In absolute terms this is obviously still a problem as you crank up m or n, but the scaling is certainly not exponential. You do see local exponential scaling if the state space is large enough (think of minimax search in chess) but for these decoding problems the state space is tiny and you don't even need to speculate over all possible states (e.g. you're not going to decode all possible combinations of 4 instructions per cycle, only certain prefixes, etc).

[1] The Hillis-Steele paper on data-parallel algorithms from 1986 describes this algorithm for parallel lexing.

FullyFunctional · on Dec 21, 2020

You are right it's x^2 not 2^x. However it's bit worse still because the area grows too which either hurts your timing (longer distances) or forces more stages (power, yet more area, and mispredict penalty).

It simply isn't practically scalable much beyond where we are; if it were, you can be sure Intel would have scaled it instead of using µOP caches.

psykotic · on Dec 21, 2020

The power scales linearly, not quadratically with the amount of look-ahead. The "m" is the number of states you're speculating on which doesn't grow with look-ahead length. In the case where you're just speculating on whether an instruction starts at a given byte offset, you would have m = 2.

I don't think anyone is saying they could scale up the decoder "for free". If they had a fixed-length ISA, I'm sure they would have increased the decoder width sooner (and using different techniques) since with high-end out-of-order cores you're always looking for cheap ways to over-provision your pipeline even if it only helps on some workloads some of the time. Their current use of the uop cache tells us that they consider it the most economical trade-off at that point in the design space (where the decoder can output up to 4 instructions and the uop cache can output up to 6 instructions); you can't infer that they've hit an impassable brick-wall with instruction decoding.

amelius · on Dec 20, 2020

In any case, it would be relatively simple for intel/amd engineers to evaluate the effect of different parameters using their quantitative analysis tools which include an emulation environment. I don't think it makes much sense to speculate here about these parameters.

Symmetry · on Dec 20, 2020

The traditional RISC philosophy stemmed from the constraints on chip development that mostly existed from the 80s to the mid 90s. After that ballooning transistor counts and design effort for top line out of order application processors made reducing the number of instructions pretty pointless in that design space, though limiting the number of ways instructions could interact through a load-store architecture and keeping decode simple through fixed length instructions (plus longer jumps) remain relevant useful things to take away from RISC. All the complexities that ARM has let it do more with fewer instructions and a high performance RISC-V core is going to have to do enough instruction fusion that its internal ops will end up being just as complicates as those that ARM uses, but it'll also have the disadvantage of having to do that extra fusion.

But of course if the target isn't a high end application processor but instead a microcontroller, say, RISC-V's simplicity has a lot going for it. Or for a grad student implementing a simple OoO processor in a semester long class. Or back when I was doing my thesis having an open source core to modify would have been a huge advantage. As the article says RISC-V can be a success without replacing ARM, POWER, and x86.

IshKebab · on Dec 20, 2020

I'm not sure what your point is, but no modern ISA is really bare-bones RISC. They're all somewhere in the middle, including RISC-V, despite the name (it just puts the more complex instructions in optional extensions).

userbinator · on Dec 20, 2020

I doubt that --- modern x86 (everything since the original Pentium) breaks instructions into uops anyway and caches those, so if anything I'd say the M1 is impressive despite having relatively large fixed-length instructions.

There's some more discussion in here about the source of the M1's performance, and it largely seems to come down to the smaller process size that enabled Apple to scale up a lot of the structures in the uarch:

https://news.ycombinator.com/item?id=25394301

eredengrin · on Dec 20, 2020

> modern x86 (everything since the original Pentium) breaks instructions into uops anyway and caches those, so if anything I'd say the M1 is impressive despite having relatively large fixed-length instructions

I believe this is covered in the medium article that was linked, in part of the discussion on the number of decoders x86 processors have vs the m1. It is in fact this process of breaking instructions into uops that seems to be the bottleneck, and it is apparently not easy to improve this due to the complexities introduced by variable length instructions. If you have reason to think that part of the article is wrong I'm interested in hearing it, I'm not an expert on modern day processor architecture techniques so I don't really have an insider perspective on this issue.

sharpneli · on Dec 20, 2020

And M1 does the same. All modern CPU's are really similar internally in that sense. ISA is just a frontend that gets translated into micro-ops that then get scheduled based on dependencies and available execution ports. Even the registers in ISA don't match the internal registers at all. ARM64 has 32 general purpose registers in ISA level. M1 seems to have 354 internally [Anandtech]

This is also the reason why the whole CISC and RISC debate in it's original form is outdated. The processors internally are all RISC. But the ISA can be more complex.

The x86 ISA makes the decoder harder to parallelize, so it takes more chip area compared to equivalent width for ARM64. And the wider you want to go the harder it becomes, whereas with ARM64 you just slap more decoders.

Another is the x86 memory model that restricts how stores can be issued into memory so that they're visible to other cores.

This is also a good thing for AMD. They could "just" make a Zen ARM CPU. Sure it would be a lot of work, but vast majority of the chip is shared.

hajile · on Dec 20, 2020

That frontend isn't free and it isn't small. Look at modern Intel processors and you'll see that the decoder takes as much die area as the entire Integer ALU (if you don't count caches). Unlike the ALU which powergates unused ports, the decoder almost never turns off.

The more 1-to-1 your translations to uops are, the less power and die area you need to spend decoding them. In addition, less complex translations means fewer pipeline stages needed for the same design which also has lots of ramifications.

sharpneli · on Dec 20, 2020

Yap. And ARM has the advantage of requiring a smaller frontend. Especially when one looks at wider decoders.

On the other hand if your ISA is the micro-ops directly then the instructions start to take ridiculous amounts of space. It's a balancing act between instruction size (to save instruction cache) and the complexity of decoding them.

And it's not just about being 1:1. It's also about how wide you can go. And variable length encodings simply are fundamentally more hard to parse in parallel fashion. That means a wider unit is harder to achieve, needing more space and power.

tanilama · on Dec 20, 2020

New to the hardware land, so the core argument here is that CISC instructions are not fixed in length, so decoding becomes less efficient?

stingraycharles · on Dec 20, 2020

It’s mostly that you can only pull instructions off the queue from the front, whereas with ARM since the size is fixed you can just pull them off anywhere.

I think with x86 Intel and AMD are basically brute forcing this by just pulling instructions off a random position and hoping it’s a correct offset, but it’s very inefficient.

amelius · on Dec 20, 2020

Yes, it seems that CISC is at a dead end.

But perhaps Intel/AMD can surprise us with a dynamic allocator that runs in the reorder buffer. Or perhaps they can still push the limit one more time with more transistors. Another option would be to implement a fast-path for small instructions, so in effect they would be moving from CISC to RISC but only for parts of the code that need the extra performance.

skohan · on Dec 20, 2020

A lot of it has to do with the stricter requirements on ordering of memory operations on x86 right?

sliken · on Dec 20, 2020

Well the perception was that AMD and Intel had a unassailable lead. That even with a power and clock speed disadvantage that the M1 can be quite competitive with several other serious mobile chips, like the Intel I9.

Now apple has proved that a cool running chip that sips power can run a wide variety of intensive applications well.

People were quite dubious of apple's chances on a competitive desktop chip and have just received a wake up call with a relatively conservative M1 chip (3.2 GHz and 4 fast cores).

fulafel · on Dec 20, 2020

It wasn't generally thought that amd/intel were unassailable engineering wise, just that sw compatibility, x86 patents and volumes were important enough that it was economically hard to go against them. But other chips (eg IBM) regularly challenged them on speed despite relatively tiny volumes and budgets. And of course years earlier the Itanium debacle + exponentially increasing fab costs (favouring volumes) killed off most of the RISC competition.

Trivia: Simultaneously to the previous Mac ISA transition, Apple acquired PA Semi who had a power efficient and fast PPC chip. Then, Apple decided to go to Intel anyway instead of betting on their new in-house chip. Discarding their newly acquired highly acclaimed chip design, they put the newly acquired semi team to work on the A series of chips instead.

webwielder2 · on Dec 20, 2020

But they had no reason to be skeptical, given the A series. To only take a processor seriously once it’s housed in a case with a keyboard attached is ridiculous.

sliken · on Dec 20, 2020

Well, I kind of agreed with them. It's easy to assume that phone apps written for a few GB ram would not be representative of high end use like compiling large projects, editing 4k video, manipulating large datasets, etc.

But as it turns out the M1 does quite well at quite a few real world aggressive desktop applications.

mrtksn · on Dec 20, 2020

I was on the other side: Bits are bits, loops are loops. Why there’s a distinction between phone app and desktop app? Isn’t it that the difference is only the UI input method?

At least for the last 5 years an iPhone and a desktop would handle the exact same images, videos, spreadsheets and websites with any desktop and often with better performance. Why would anyone think that these are fundamentally different?

sliken · on Dec 21, 2020

Iphones are generally media consumption devices. Sure they can take pictures and video, but most often they are processed on laptops or desktops.

Similarly programming, compiling, making videos, making music, etc is most often done on laptops and desktops.

Certainly you are right, the M1 is a very capable CPU for all of these things, just that it was a surprise for many.

Retric · on Dec 20, 2020

RAM latency for one thing. Phones have a huge advantage in that area which translates very well to most tasks as long as they have enough memory. OS differences also favor phones in many ways.

BeeOnRope · on Dec 20, 2020

No, they don't. Memory latencies for top-end phones like the iPhone are generally over 100 ns, which far from being a huge advantage is consderably worse than the best desktop and latpop x86 chip latencies which hover around 50 ns.

sliken · on Dec 21, 2020

I generally try to quantify TLB latency and DRAM latency separately. The M1 chip has quite impressive memory latency around 30ns, assuming you aren't thrashing the TLB.

Desktops tend to be much higher latency, like say the AMD ryzen 5950X. In particular the R per RV range benchmark uses between 1 and 32 pages in a sliding window. The M1 gets around 30ns (which I've personally verified with code I wrote) and the Ryzen 5950x gets around 65ns. Intel does a bit better than AMD, but nowhere close to the M1.

Retric · on Dec 20, 2020

You’re numbers are way off, where did you get them?

Ex: https://en.wikipedia.org/wiki/CAS_latency

BeeOnRope · on Dec 20, 2020

Benchmarks. CAS latency is only a small part of total memory latency as seen by the CPU.

Retric · on Dec 20, 2020

As we are talking about comparing different CPU architectures it’s the external latency that’s important. Even then I have been looking at sub 25ns on AMD CPU memory benchmarks and I assume similar Intel numbers.

skohan · on Dec 20, 2020

If anything it's made me more interested in micro-PC's. I have been playing a lot with Raspberry Pi's lately, and the M1 shows how much better we could do in that form factor. Why can't I have an Apple Silicon SOC the size of an iPhone, with the battery and screen replaced with a heat sync and a couple of ports?

hajile · on Dec 20, 2020

We do and it's called the mac mini. The actual circuit board looks to be about 2x the size of the pi. Their only reason for the massive mini case is economies of scale. They have to leave that case around for the intel mini for another year or two and all the tooling exists, so they didn't have any costs there either.

I think the real next-gen mini should have been the iPhone 12 pro. Imagine a desk with just an Apple monitor, charger, and wireless keyboard. Pop your mag-safe phone onto the charger so you have a trackpad. The mm-wave connects to the monitor for video and giving access to it's USB-c ports. You get the normal mac desktop ready to start working for the day. It's not as powerful as most desktops, but it can give tons of plugged-in laptops a run for their money. More than good enough for 95% of users.

thekyle · on Dec 20, 2020

You also have to leave some room for ports.

jackcodes · on Dec 20, 2020

It’s slightly tangential, but I’m managing an engineering team of 28-30 and we’re currently considering a wholesale change to ARM CPUs across the board.

MacBooks are our de facto development laptop and all our services use skaffold for local development, Docker basically. If we consider the perhaps likely outcome that MacBooks will one day be ARM-only, that Docker will not offer cross-arch emulation, and that our development environment will be ARM only, it then becomes likely that we’ll migrate our UAT and PROD to ARM based instances.

If we go that route it’ll mean more money to the AWS Graviton programme and likely further development of ARM chips. I can’t see this affecting RISC-V but the M1 switch could very well benefit the wider ARM ecosystem.

thawaway1837 · on Dec 20, 2020

I don’t get this.

You’re basically locking yourself to a single development eco system, and a highly limited deployment eco system.

It’s not clear what the benefits of either are either. I get that the MacBook gets great performance for battery life but the majority of work is gonna be done in desktop settings, so simply using more/equally powerful x86 chips is only gonna cost you a few dollars a developer per year in electricity costs.

And all that despite the fact that your development is on Docker which doesn’t even have a working solution for the workflow you’re considering at the moment.

jackcodes · on Dec 20, 2020

It‘s currently in consideration and by the time we’re ready to make a call on it, Docker will be too. They almost are in fact.

But consider that we may be optimising for different things. Most new developers I hire can be thrown a MacBook and they’ll know what to do, Linux on the other hand doesn’t have that guarantee especially towards the junior and front-end market segments of where I work. It’s a (real) broad strokes opinion, but I’m of the belief that macOS and by extension MacBooks offer us fewer overheads in terms of setup, maintenance, onboarding, tooling suitability for the median developer. So that leaves us using macOS.

This is the factor we’re optimising for more than deployment portability - we optimise for vendor lock-in in less than the developer experience for the median of our developers. For many of us on this forum we may be best with Linux on a bleeding edge distro, but for our preferences we deploy MacBooks for portability. Whether it helps things overall, this is in Manila where a net monthly salary is often less than the cost of a laptop, so we deploy one device that can be transported between home and work as required for those that don’t have a personal device.

With that, I see this as Apple locking us into that ecosystem rather than a choice we’re making on our side, so I’d rather lean into this and explore it further than doing nothing. If it comes out positive then we’ll be ready to make the switch before Apple forces us into it, and if not we’ll deploy something thinkpad-esque and keep our production instances x86.

nbzso · on Dec 20, 2020

"With that, I see this as Apple locking us into that ecosystem rather than a choice we’re making on our side, so I’d rather lean into this and explore it further than doing nothing. If it comes out positive then we’ll be ready to make the switch before Apple forces us into it, and if not we’ll deploy something thinkpad-esque and keep our production instances x86."

As a long time Apple user (personally and staff wise), please don't tie your business decisions with company that treats professional users badly, every time they can. Your median developer benefits from Linux knowledge in general, you can deploy stable distribution without fear of compatibility problems after minor software update.

Apple marketing and lure is great, I have fallen for their game for 20 years. But I cannot be comfortable with ideas, business and management practices that this generation of Apple deploys.

Toutouxc · on Dec 20, 2020

> company that treats professional users badly > ideas, business and management practices that this generation of Apple deploys

Can you please elaborate?

AsyncAwait · on Dec 20, 2020

They destroyed entire indie businesses by arbitrary changes and/or enforcement of App Store policies, not to mention they're leading the war on general purpose computing as we know it by locking everything down.

I want to be able to tell my children I didn't participate in that.

nbzso · on Dec 20, 2020

If I compile the list of all anti professional moves that Apple has made in recent years I will get depressed and I don't like to be depressed:)) Here, watch this funny rant from proven Apple professional user, may be it will give you some insight. https://www.youtube.com/watch?v=MKJjLwMUPJI

On other hand most valuable company in the world uses slave labor and gives the consumer highest possible price, I cannot support this dynamic anymore. https://www.youtube.com/watch?v=zeEERdbfH0c

simonh · on Dec 20, 2020

M1 performance is about much more than just battery life, it’s screaming fast is raw execution power as well. In single core it’s even competitive with Ryzen for goodness sake. That’s just mental.

I don’t see this as a significant lock in risk. It’s not like Apple are the only company selling ARM laptops and desktops, and it seems clear Google, Microsoft and Amazon among others are serious about ARM.

trimbo · on Dec 20, 2020

99.9999..% of servers in the world run x64.

x64 virtual machines, Docker, etc have to be supported on Apple's M chips for a long time to come. There's zero risk of this changing soon unless Apple wants to scuttle the non-iOS/non-Mac developer market for Mac.

M1 is a cool chip, but there's no reason for an average development company to rush into it unless targeting M1 MacOS specifically. Maybe the server world swings to ARM, but that will take decades to sort out, if it actually happens at all.

tybit · on Dec 20, 2020

AWS have thrown their weight behind ARM server side and are really ramping it up for internal and customer usage.

x64 is still going strong but the competition has massively heated up over the past few years.

sharpneli · on Dec 20, 2020

It took about 10 years for x86 to go to zero marketshare in servers into 80%+ in the 90's. Similar change in HPC market etc. So based on history the transition time is around 10 years, not tens of years.

CyberDildonics · on Dec 20, 2020

That would mean 1 out of every million servers is not x64, which seems hyperbolic when amazon is making ARM servers and power chips are still out there.

zapita · on Dec 20, 2020

Doesn’t Docker already support cross-arch emulation?

jackcodes · on Dec 20, 2020

Cross-arch images are supported, but you can only virtualise the arch you’re on. No running x86 on ARM for example.

Edit; see replies, I got this wrong

tempay · on Dec 20, 2020

This isn’t true, there is a --platform flag which can be used to emulate another arch: https://docs.docker.com/docker-for-mac/apple-m1/

bboreham · on Dec 20, 2020

It’s fairly straightforward to configure QEMU to do the emulation. And the Docker preview released a few days ago does that without even asking.

wmf · on Dec 20, 2020

"ARM is killing x86" is a cooler narrative than "Macs are now crazy-fast but they're still Macs so few people will switch".

alwillis · on Dec 20, 2020

"Macs are now crazy-fast but they're still Macs so few people will switch".

Anecdotally there have been a bunch of posts on HN since the M1 Macs shipped by people who've either stopped using Macs years ago or who've never bought a Mac previously who are happy M1 Mac owners.

The M1 Mac mini retails at $699, but I've already seen it as low as $625. There's certainly nothing in that price range that's better.

And even before the M1 Macs shipped in November, Mac revenue hit an all-time high of $9 billion in the quarter that ended September 26, 2020 [1]. Apple often highlights that about 50% of Mac customers are new to the Mac, a trend that's likely to accelerate.

[1]: https://www.apple.com/newsroom/2020/10/apple-reports-fourth-...

novaRom · on Dec 20, 2020

>> There's certainly nothing in that price range that's better.

You can buy very good AMD based PC nowdays for way below $700. All modern Linux kernels fully support AMD GPUs.

tybit · on Dec 20, 2020

The second narrative doesn’t explain AWS throwing its weight behind ARM.

Not to say that ARM is killing x64, it’s definitely not, but ARM is clearly being invested in and rolled out at a massive scale by 2 of the biggest tech companies in the world in both consumer devices and server side. To me that’s quite something.

giantrobot · on Dec 20, 2020

AWS is throwing their weight behind ARM because they have thousands upon thousands of servers. Their Graviton2 chip runs between 1-1.8W/core [0] and has 64 cores. Their TDP is half that of the EPYC and Xeon equivalents. At data center scale that's a huge savings in power. They could also increase density in the same power envelope.

If Amazon were to move significant amounts of their non-EC2 server hardware to Graviton they can see cost savings with no end-user impact. If an AWS product doesn't run client code there's no real requirement to run it on x86.

[0] https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...

cinquemb · on Dec 20, 2020

I wish the t4g was available in Singapore… I had to settle for t3a, but the cost savings there combined with using spot instances behind a custom load balancing/dynamic scaling setup (nginx+3rd party modules for hot ip reloads+python daemon using cloudwatch free tier stats) deploying from AMI (docker and k8s needed machines that were 2x ram than just using AMI) made server costs at my last company super cheap, and probably saved amazon even more than when we were running on t2.

danellis · on Dec 20, 2020

Few people will switch, but those people may still end up running Windows on ARM.

vlovich123 · on Dec 20, 2020

Apple is playing the margin game not the volume game. Just like Apple takes something like 98% of the profit in the global phone manufacturing business, I wouldn’t be surprised if they’re doing the same thing in the developer compute market.

johnbender · on Dec 20, 2020

Worth noting that the ISA is more than a set of instructions it’s also a semantics for those instructions. For example the concurrent semantics of ARM processors permits a much larger array of optimizations on the per thread level which is good for performance.

socialdemocrat · on Dec 20, 2020

That is like saying, why do people use C++ to make fast compilers, couldn't they just add a fast compiler for Python and everybody is happy?

You interface whether a programming language, library API or and ISA has strong implications for what optimization and implementer than do.

The ARM ISA has many advantages over x86:

1) Fixed sized instructions, which make it easier to add more instruction decoders. Discussed here: https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...

2) More registers. ARM64 has 32 general purpose registers and 32 registers for SIMD stuff. x86 has fewer registers which are also wasted on all sorts of legacy junk.

3) More lax restrictions on memory-write back. It is easier to optimize the Out-of-Order execution on ARM, as you don't need to write back everything in order to memory.

As for everybody else. ARM designs from ARM Ltd. is showing rapid performance increases and gradually closing the gap to x86. It really is inevitable as there is NOTHING special about the x86 ISA that gives it higher performance. Nothing prevents other ARM makers from catching up: https://medium.com/swlh/is-it-game-over-for-the-x86-isa-and-...

iridium_core · on Dec 20, 2020

With all of these advantages why don't the PS5 and XboxSeriesX use ARM? Do they have plans to for the next generation?

ksec · on Dec 20, 2020

1. They want some level of backward compatibility, and they were on x86 before.

2. Most of the Tools and Library they used are on x86. Not only from MS & Sony, but also Game developers. It is going to take some courage to change all of that.

3. When the design of PS5 and Xbox X started, none of the ARM CPU IP design, ( In fact even as of today and next year, or may be even 2022 ) has a Single Core Performance that rivals AMD's Zen 2, and certainly not Apple's M1.

4. GPU IP, both are from AMD. Which may actually be more important that the CPU IP from a gaming perspective.

5. In case you ask why not Nvidia then since they made ARM SoC like those in Nintendo. Nvidia's pricing for latest Gen IP is way out of touch. Hence that is why you only see non-leading edge tech used in Switch. ( Nvidia is not a player that wants to sacrifice Margin for Market Share, which is actually a fair point )

socialdemocrat · on Dec 20, 2020

Nobody is selling high performance ARM chips. Apple isn't selling their chips. ARM from others have been optimized for micro controllers, phones, servers and super computers. Not for high performance desktop system of game consoles.

If you wanted a high performance alternative to x86 in the past, you would have to go with PowerPC, which Playstation famously did with Playstation 3. Except that was a disaster because the architecture was too novel.

That experience made Sony, very afraid of doing anything unconventional hardware wise.

But I supposed Apple may have opened the flood gates and my prediction would be that next generation Playstation will be ARM based.

terramex · on Dec 20, 2020

> Except that was a disaster because the architecture was too novel.

But it had nothing to do with PowerPC instruction set. Xbox 360's CPU used PowerPC instruction set too, but Microsoft went with much more classical 3-core design instead of single core + 8 SPUs that Sony used.

The only impact PowerPC had on game developers was making them aware of load-hit-store performance hit.

Macha · on Dec 20, 2020

It's not even clear that the M1's big leap is due to ARM vs x86 rather than say 5nm vs 7nm (amd) or 14nm (Intel), or design ideas such as big/little cores and more specialized accelerators (which is ironically against the risc idea which people are claiming as the reason why arm vs x86 so the reason m1 does well)

mlyle · on Dec 20, 2020

Specialized accelerators doesn't explain it, because we're measuring a lot of general purpose CPU tasks for the most part.

Big/little is good for power consumption, not so much for performance which is still good.

There's a lot of microarchitectural goodness here beyond ARM, though. Apple's got lots of little details right, and fat connection to memory helps, too. It doesn't hurt to be on leading fab, too.

vinay_ys · on Dec 20, 2020

Having dedicated silicon for most frequently used primitives (specialized accelerators) helps in getting those out of the way for the main core's pipeline execution to run predictably fast.

mlyle · on Dec 20, 2020

That makes no sense here. For compilation workloads and a lot of these other tests where we're showing benefits, basically any machine is able to give well over 99% of CPU to the task. Just how exactly do you think that having any dedicated silicon is helping clang compile benchmarks, etc?

floatboth · on Dec 20, 2020

> fat connection to memory

It's the same "thickness" as desktop (or good laptop) DDR4, i.e. 128 bits. Apple is running a very high clock though, and with other manufacturers, even for laptops with soldered memory they were quite conservative with memory tuning, basically running JEDEC spec. Maybe now they'll feel the kick in the ass and overclock their RAM already.

enos_feedler · on Dec 20, 2020

The top level things like process node, ISA and memory controller are big. But a lot of boils down to being able to shape the entire chip design exclusively around system level traces of real mac workloads. Intel needs to factor so many different kinds of traces into their chip design. Even windows vs apple makes a huge difference.

tsimionescu · on Dec 20, 2020

So your prediction is that the chip will be bad at running Linux and windows?

To me it seems a priori quite unlikely that the patterns of MacOS, windows and Linux are so different that this would be a major win. There may be a few specific things, but any CPU that prioritizes to much for some particular os would have big problems running CPU-intensive user-space-only workloads.

Gibbon1 · on Dec 20, 2020

I'm out of my league here but I've seen references to 8 bit cores that can run at a couple of giga instructions a second. It's hard to understate the performance vs power cores like that are cable of. Also sub nanosecond interrupt latency.

Think a small coprocessor with local memory that's pulling commands out of a queue and managing an io controller. Couple of wins, lower power consumption, fewer context switches, and cache pressure.

cpuguy83 · on Dec 20, 2020

They can't make an x86 one because Intel holds the rights on it and only AMD has a license to do this (there is a history here).

Also I wouldn't be surprised if one actually could not build anything like M1 (at that power usage) w/ x86.... Intel certainly hasn't been able to.

RuggedPineapple · on Dec 20, 2020

The history is actually pretty simple.

Originally it was because the US government requires a second source for any components and so Intel had to license it to somebody to supply the US government.

Then later AMD's 64 bit instructions became the standard, so Intel needed the license for the 64 bit extensions and AMD needed the x86 base and so they just decided to cross-license and call it good.

gonzo · on Dec 20, 2020

You seem to have not mentioned the lawsuit and clean room effort.

RuggedPineapple · on Dec 20, 2020

Which lawsuit? The mid-late 2000's one after all this history? Or the 1991 one that was about intel seeking ways to squeeze out AMD from it's license? The first one isn't really relevant to this and the second one could have been more direct but AMD having a license and then getting another later as part of the deal on x86-64 does allude to behind the scenes shenanigans.

puetzk · on Dec 20, 2020

There's actually a 3rd x86 license, that has changed hands quite a few times (Cyrix -> National Semiconductor -> Centaur(IDT) -> VIA -> Zhaoxin, I think, unless I missed a few transitions?)

There's also the https://opencores.org/projects/ao486 - the relevant patents on a 486-era design would have expired

userbinator · on Dec 20, 2020

The Pentium patents have expired too, and if we go by the usual 20-year expiry, then that would mean everything up to the Pentium III (1999) would be in the public domain. IANAL.

garmaine · on Dec 20, 2020

Patents have expired. That doesn't make it public domain.

jecel · on Dec 20, 2020

Isn't the point of patents that instead of ideas being kept secret (and often vanishing when their inventors die) they are published in exchange for a limited time monopoly on their use? In that case when a patent expires it does indeed become public domain.

garmaine · on Dec 20, 2020

Public domain means no IP applies. Patents are only one kind of IP. There’s also copyrights, trademarks, etc.

jecel · on Dec 21, 2020

Sure, Intel didn't have any serious patents on the 8080 but they did copyright the assembly language mnemonics. So the Z80 was 100% compatible at the binary level but had to use other names for the instructions. Same thing with trademarks.

What I was saying is that what is described in the patent text can be freely used after the patent expires.

Taniwha · on Dec 20, 2020

and at least 2-3 other attempts at cloning I know of including Linus's Transmeta and the one I worked on

strenholme · on Dec 20, 2020

ARM has been eating away at Intel for a while now, the same way Intel ate away at the mainframe and minicomputer market in the 1980s and the MIPS/Sparc/HPPA/Alpha workstation market in the 1990s. While the mainframes and minis ate the low-cost PCs for lunch in the 1980s, and the $20,000 workstations of the 1990s had far better performance than a 386 (or even 486 did), PCs were cheaper and more widely available.

It was the economies of scale and the standardization on x86_64 that made the PC the performance king in the first 2000s decade. Intel (and, of course, AMD) x86 did not have the best ISA but they, because of economies of scale, had the best fabs which let them outperform anything else.

While Intel was dominating with raw performance in the first 2000s decade, embedded chipsets slowly coalesced around the ARM ISA, a process which was accelerated by Apple choosing ARM for the iPhone (Nokia also used ARM in a lot of their phones).

Moore’s Law finally stopped working for Intel and they stopped being able to outfab everyone else in the mid-to-late 2010s; a 2012 x86 chip has about the same performance as, say, a 2017 x86 chip.

Intel saw the writing on the wall with people using non-Intel ISAs for phones, and tried to make an Atom chip which would work in a phone; it was a flop. Nobody wanted the x86 ISA unless they needed it in systems which ran legacy applications.

With the Raspberry Pi moving up from only suitable for specialized embedded applications to having near-desktop level performance, and with Apple finally making an ARM chip which is competitive (and in some cases superior) to Intel’s desktop chips, and with legacy x86 Windows applications being in many cases replaced with webpage and smart phone applications, it looks the industry as a whole is finally moving past x86 and its bloated instruction set.

This is a much needed breath of fresh air for the computer industry. I like the M1 because I like that we now have mainstream non-x86 desktop/laptop computers again.

I think RISC-V has a lot of potential, and I am interested in what comes of it in the 2020s, whether it blooms like the ARM did, or if it goes the way of the HPPA, Alpha, or Sparc.

sharpneli · on Dec 20, 2020

> Nobody wanted the x86 ISA unless they needed it in systems which ran legacy applications.

It wasn't fast/efficient enough. If it had been faster or with better power consumption it would've been fine. There was a massive push to get Android studio to automatically compile X86 binaries for you etc.

But why would you put an Atom in your phone if it means it's slower and worse battery life? That's the reason it flopped. ISA change is a hindrance for switching, but it can be overcome. Even if Intel had sold Atom chips with ARM ISA and identical performance to the X86 variant it would have still flopped due to the poor performance and efficiency.

willis936 · on Dec 20, 2020

>legacy x86 Windows applications being in many cases replaced with webpage and smart phone applications, it looks the industry as a whole is finally moving past x86 and its bloated instruction set

Similar predictions about lack of importance of legacy support made over the past 40 years have not borne out. Performant x86 emulation is an absolute must for a replacement ISA.

strenholme · on Dec 20, 2020

A lot has changed over the last 40 years:

• Smartphones

• iPads and Android tablets

• Chromebooks

• Game consoles

• The Phoenix-like return of the Mac (About half of the shops I have seen in the last decade were mixed or Mac shops)

Point being, it’s a different world than it was 20 years ago, when one needed to run on Windows to be a viable software product.

Anyway, the M1 has excellent x86 emulation; it’s 50% as fast (if not better) as native ARM code.

And, yes, I felt Windows for ARM was not viable just 10 years ago: https://www.samiam.org/blog/20101224.html But a lot has changed since then.

dkjaudyeqooe · on Dec 20, 2020

The broader point is that RISC-V provides the freedom and a practical ecosystem in which to innovate. Custom instructions may only be a small part of that.