Matrix Multiplication Inches Closer To Mythic Goal (quantamagazine.org)
119 points by optimalsolver on Dec 18, 2021 | hide | past | favorite | 48 comments


Just for reference: a Google TPU can multiply a 256x256 matrix in 1 clock cycle[1]. That seems astonishingly fast, considering the TPU operates at 700MHz:

The TPU Matrix Multiplication Unit has a systolic array mechanism that contains 256 × 256 = 65,536 ALUs. That means a TPU can process 65,536 multiply-and-adds for 8-bit integers every cycle. Because a TPU runs at 700MHz, it can compute 65,536 × 700,000,000 ≈ 46 × 10^12 multiply-and-add operations, or 92 teraops per second (92 × 10^12), in the matrix unit.
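The arithmetic in that quote is easy to check; a quick back-of-the-envelope sketch (all numbers taken from the quote above):

```python
# Back-of-the-envelope check of the quoted TPU v1 figures.
alus = 256 * 256             # systolic array of 8-bit MAC units
clock_hz = 700_000_000       # 700 MHz
macs_per_sec = alus * clock_hz

print(alus)                     # 65536
print(macs_per_sec / 1e12)      # ~45.9 trillion multiply-and-adds per second
print(2 * macs_per_sec / 1e12)  # ~91.8 "teraops"/s, counting each MAC as 2 ops
```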

[1] https://cloud.google.com/blog/products/ai-machine-learning/a...


It appears that while the TPU is capable of ingesting the inputs of a few matrices each clock cycle, it likely takes several clock cycles for the matrix multiplication operation to percolate through the systolic array pipeline. That means while it may be able to sustain an average of one matrix multiplication per clock cycle, any particular matrix multiplication operation will require multiple cycles before the result is valid.


This isn’t exactly correct. A TPU can probably issue one multiplication of two matrices per clock cycle, but it most likely takes a few hundred cycles for any single operation to complete. This is similar to how a modern CPU can execute many instructions per cycle, even though most individual instructions take multiple clock cycles to complete.


Just to clarify, the TPU (version 1) does not multiply two 256x256 matrices per clock cycle as that takes 256^3 MAC operations to perform. It produces one row of the product matrix per clock cycle (each 256 element row takes 256^2 MAC operations to calculate).
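That one-row-per-cycle rate follows directly from the quoted numbers; a sketch of the arithmetic:

```python
# One 256x256 matmul needs n^3 MACs; the systolic array supplies n^2 MACs per cycle.
n = 256
macs_total = n ** 3           # 16,777,216 MACs for the full product
macs_per_cycle = n * n        # one MAC per ALU per cycle
cycles = macs_total // macs_per_cycle
print(cycles)  # 256 -> one 256-element row of the product per cycle
```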


...and in case anyone is wondering, no, this is still not a practical speedup. https://en.wikipedia.org/wiki/Galactic_algorithm#Matrix_mult...


Fortunately, there's another approach based on group theory that's able to hit n^2.41 and shows some promise.

Here's the paper: https://arxiv.org/abs/math/0511460

Here's a set of lecture notes that are simpler to read and give an easy example of how group theoretic methods work: http://www.ccs.neu.edu/home/viola/classes/gems-08/lectures/l...


This thought has been churning around in my mind for some years now — we focus too much on processing speed and reductions in time complexity and not enough on increasing the size and efficiency of our cache and stack.

MM (especially MM on large number types, as in e.g. hashing algorithms) is very reliant on the cache because you can’t always fit numbers that big into a register. Side note: I was reading some Abseil code last night that did some funky bit twiddling on ARM: https://github.com/abseil/abseil-cpp/blob/master/absl/hash/i...

Off the top of my head, isn’t it about 200ns [edit, not ms] to query, bus, and read something from memory? Just a thought, perhaps the cache and memory is where we should focus our efforts.


This is kind of like "doctors are spending too much time curing cancer, they should worry about my bathroom, which is flooding!"

Theoretical Computer Scientists aren't going to increase the size of your cache, that's just not what they're into or what they know anything about.


> We nevertheless stress that such improvements are only of theoretical interest, since the huge constants involved in the complexity of fast matrix multiplication usually make these algorithms impractical.

It's only of mathematical interest


this article isn’t at all about doing it faster on actual hardware, and everyone has worried a lot about cache for decades now.


Not relevant to the article, but well-written single and double precision floating point large matrix GEMM on mainstream CPUs is basically computation-bound -- look at performance figures -- and is relatively unusual in that. (You probably need to worry about prefetching to get there, though.)


200ns, not 200ms


Fun article. The slow approach toward n^2 is fascinating.

> The result, by Josh Alman, a postdoctoral researcher at Harvard University, and Virginia Vassilevska Williams of the Massachusetts Institute of Technology, shaves about one-hundred-thousandth off the exponent of the previous best mark. It’s typical of the recent painstaking gains in the field.


Quanta has used this graphic in previous articles, and I find it very hard to grok. Time on the y-axis is just strange and confusing. Rotating 90° clockwise makes it much clearer.


The dotted line just compounds this. Very strange choice


Which one of these algos does Intel/Nvidia/Google use in their AVX/Cuda/TPU implementations?


Simple block partitioning, usually 4x4, with the full ordinary algorithm per block. This is the most numerically-stable and parallel.

https://developer.nvidia.com/blog/cutlass-linear-algebra-cud...
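For the curious, here's a minimal sketch of the block-partitioning idea in pure Python (the block size `bs` and the function name are mine; real kernels like CUTLASS tile for registers and shared memory, which this deliberately ignores):

```python
def blocked_matmul(A, B, bs=2):
    """Multiply square matrices (lists of lists) with bs x bs block tiling.

    Same O(n^3) arithmetic as the naive algorithm, just reordered so each
    block of C is accumulated from small tiles -- the cache/register-friendly
    structure that tiled GPU/CPU kernels use, minus all the hardware detail.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                # Accumulate the (i0, j0) tile of C from one tile of A and B.
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```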


However, on recent CPUs 4x4 is small for the innermost block size of the non-trivial hierarchy you need. You can see examples under https://github.com/flame/blis/tree/master/config with an a priori procedure for determining them in https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analyti... (but compare with what's actually used for SKX, in particular). OpenBLAS will normally be similar, though it may come out somewhat faster, but it's easier to see in BLIS.


and the most cache-friendly


Strassen's method is the only one practical on modern hardware


If it's of interest, there's an account of fairly recent work in the "Strassen reloaded" paper https://dl.acm.org/doi/10.5555/3014904.3014983 or otherwise in https://www.cs.utexas.edu/users/flame/pubs/FLAWN79.pdf


And only in fixed precision.


What makes you say that? I've had good speedups with Strassen multiplication in variable precision (or with floating-point, in case you meant fixed-point arithmetic).


Strassen's method is faster than naive multiplication, but it is not as stable: you will get much larger rounding errors when implementing it with floating-point numbers, compared to naive multiplication. And high precision is required by a lot of algorithms implemented on top of matrix multiplication, such as matrix inversion or gradient descent, so this is often a problem.
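To make this concrete, here's a textbook Strassen sketch (my own minimal version, assuming NumPy and power-of-two sizes; the `leaf` cutoff, below which it falls back to the ordinary product, is a standard practical trick):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Textbook Strassen recursion for square power-of-two matrices.
    The extra additions/subtractions below are where the larger
    rounding error (relative to the naive product) comes from."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # conventional product below the cutoff
    h = n // 2
    a, b, c, d = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    e, f, g, i = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    p1 = strassen(a + d, e + i, leaf)
    p2 = strassen(c + d, e, leaf)
    p3 = strassen(a, f - i, leaf)
    p4 = strassen(d, g - e, leaf)
    p5 = strassen(a + b, i, leaf)
    p6 = strassen(c - a, e + f, leaf)
    p7 = strassen(b - d, g + i, leaf)
    top = np.hstack([p1 + p4 - p5 + p7, p3 + p5])
    bot = np.hstack([p2 + p4, p1 - p2 + p3 + p6])
    return np.vstack([top, bot])
```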


do we know if the rounding errors are a big deal for numerical methods that can tolerate inaccuracy (like gradient descent for machine learning)?


It depends. I have seen some algorithms (the example that comes to mind was a clustering) become worse solely due to numerical error.

When that happens, if you are not equipped to measure the numerical error, or at least trained to suspect it, you might think that it is just the algorithm that is not working.


Rounding errors, underflows and instability in floats are very well known problems, a big deal if you do anything but graphics.


You want to converge on a local minimum, don't you? We can't guarantee that with unstable algorithms.


Speed is fine. The issue is numerical stability.


Yes, indeed, and Strassen is a bad default algorithm because of this. There are specialized situations where the numerical stability isn't an issue though.



The part I don't follow is that I thought multiplication used the same number of CPU cycles as addition, so I didn't get the part where one multiplication replaced with many additions was obviously better. Could someone explain that part? I feel I must be fundamentally misunderstanding something.


It has nothing to do with clock cycles, it's about how the number of additions and multiplications scale for large matrices.


If you look at operations per cycle, in this example for Intel Haswell processors, you'll find that an equivalent add and multiply are quite a bit different.

An add/subtract from a 32-bit register to register is 0.25 cycles per operation.

A multiply from 32-bit register to register is 2 cycles per operation. So an 8x speed difference, not counting latency, etc.

See pages 230 and 231 of this document: https://www.agner.org/optimize/instruction_tables.pdf


For what it's worth, what counts for normal implementations is the number of multiply-add operations per cycle, typically two on current hardware with FMA.


I feel the other answers you got do not directly answer your question for the following reason.

In practice, I believe it is Strassen's algorithm that is used for very large matrix multiplication. Strassen's algorithm improves on both the number of scalar multiplications and the number of scalar additions, both of which end up bounded by O(n^(lg 7)).

This is because Strassen's algorithm replaces a single large conventional matrix multiplication, consisting of n^3 multiplications and n^3 - n^2 additions, with 7 recursive applications of Strassen's multiplication on matrices of size n/2, along with 15 additions of matrices of size n/2.

The number of additions in a matrix addition is n^2 for a matrix of size n. Thus Strassen's requires 7 recursive calls as well as 15(n/2)^2 additions of scalars.

You win on both fronts if the matrices are sufficiently large and you default to conventional matrix multiplication on a recursive call below a certain threshold.
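The counts above can be tabulated with a small recurrence (a sketch under the parent comment's assumptions: power-of-two n, 15 half-size matrix additions per step, base case a single scalar multiply):

```python
def strassen_ops(n):
    """Scalar (multiplies, additions) for Strassen on an n x n matrix,
    n a power of two: 7 recursive products plus 15 additions of
    (n/2) x (n/2) matrices, as described above."""
    if n == 1:
        return 1, 0
    m, a = strassen_ops(n // 2)
    return 7 * m, 7 * a + 15 * (n // 2) ** 2

def naive_ops(n):
    """Conventional algorithm: n^3 multiplies, n^3 - n^2 additions."""
    return n ** 3, n ** 3 - n ** 2

# Multiplies win quickly; with this pure recursion the addition count
# only drops below the conventional one at quite large n, which is one
# reason real implementations switch to the naive product below a threshold.
print(strassen_ops(1024), naive_ops(1024))
print(strassen_ops(8192), naive_ops(8192))
```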


This isn’t about running stuff on hardware. The point is that for a large matrix the number of additions grows much more slowly, so they represent an increasingly small fraction of the operations. Past some sufficiently large matrix size they don't matter much, no matter the hardware.


The multipliers in CPUs are a mass of logic gates and are challenging to make work in a single cycle. Integer multiplication can be made to complete in a single cycle, but larger data types may take multiple cycles. Division is even more complex, although it is not part of the question here.

Additions really can be done in 1 cycle, so it's possible to optimize by using them instead of multiplications in operations like this.

Here's a Berkeley EECS lecture on multiplier circuits if you want more info: https://inst.eecs.berkeley.edu/~eecs151/sp18/files/Lecture21...


> multiplication used the same number of CPU cycles as addition

Only thanks to lookup tables and parallelism. And that's not even true for old computers, where multiplication could take multiple cycles. Also, I think complexity is calculated with a pen-and-paper model in mind; the number of CPU cycles may differ from the theoretical complexity when chips are optimised for certain tasks.


While that's true for fixed-precision integer arithmetic, this doesn't hold true for floating point multiplication.


From a complexity perspective, addition of n-digit numbers takes linear time, whereas multiplication is superlinear (O(n log n) at best)


It would be interesting to put this in perspective in regards to total energy savings.

Since so much of computation is matrix multiplication, how much computational power integrated over the next decade would we say that the exponent going from 3->2.3 has saved?


None. Afaik, all practical implementations of matrix multiplication execute the naive n^3 algorithm because sub-cubic algorithms have too large constant overheads and/or are numerically unstable.


People definitely use the more sophisticated algorithms; it just depends on the size of the matrix. Usually exploiting the computer architecture matters more than the algorithm, but at a certain size MMM is definitely faster using state-of-the-art algos.


Fascinating! I wonder if the instability could be kept small enough to be accurate enough for graphic applications (for power savings)… What if the constants were hardware stored in the silicon? :)


I want to learn everything about this, so I can help solving it.

Where do I start?



Well, that article isn't helpful by itself, but there's tons of words I can google for better information and tutorials.

Thanks!



