Matrix Multiplication Inches Closer To Mythic Goal (quantamagazine.org)
119 points by optimalsolver on Dec 18, 2021 | hide | past | favorite | 48 comments


Just for reference: a Google TPU can multiply a 256x256 matrix in 1 clock cycle[1]. That seems astonishingly fast, considering the TPU operates at 700MHz:

The TPU Matrix Multiplication Unit has a systolic array mechanism that contains 256 × 256 = 65,536 ALUs. That means a TPU can process 65,536 multiply-and-adds for 8-bit integers every cycle. Because a TPU runs at 700MHz, it can compute 65,536 × 700,000,000 ≈ 46 × 10^12 multiply-and-add operations, or 92 teraops per second (92 × 10^12), in the matrix unit.
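The arithmetic in that quote is easy to check; a quick back-of-the-envelope sketch (all numbers taken from the quote above):

```python
# Back-of-the-envelope check of the quoted TPU v1 figures.
alus = 256 * 256             # systolic array of 8-bit MAC units
clock_hz = 700_000_000       # 700 MHz
macs_per_sec = alus * clock_hz

print(alus)                     # 65536
print(macs_per_sec / 1e12)      # ~45.9 trillion multiply-and-adds per second
print(2 * macs_per_sec / 1e12)  # ~91.8 "teraops"/s, counting each MAC as 2 ops
```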

[1] https://cloud.google.com/blog/products/ai-machine-learning/a...


It appears that while the TPU is capable of ingesting the inputs of a few matrices each clock cycle, it likely takes several clock cycles for the matrix multiplication operation to percolate through the systolic array pipeline. That means while it may be able to sustain an average of one matrix multiplication per clock cycle, any particular matrix multiplication operation will require multiple cycles before the result is valid.


This isn’t exactly correct. A TPU can probably issue one multiplication of two matrices per clock cycle, but it most likely takes a few hundred cycles for any single operation to complete. This is similar to how a modern CPU can execute many instructions per cycle, even though most individual instructions take multiple clock cycles to complete.


Just to clarify, the TPU (version 1) does not multiply two 256x256 matrices per clock cycle as that takes 256^3 MAC operations to perform. It produces one row of the product matrix per clock cycle (each 256 element row takes 256^2 MAC operations to calculate).
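That one-row-per-cycle rate follows directly from the quoted numbers; a sketch of the arithmetic:

```python
# One 256x256 matmul needs n^3 MACs; the systolic array supplies n^2 MACs per cycle.
n = 256
macs_total = n ** 3           # 16,777,216 MACs for the full product
macs_per_cycle = n * n        # one MAC per ALU per cycle
cycles = macs_total // macs_per_cycle
print(cycles)  # 256 -> one 256-element row of the product per cycle
```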


...and in case anyone is wondering, no, this is still not a practical speedup. https://en.wikipedia.org/wiki/Galactic_algorithm#Matrix_mult...


Fortunately, there's another approach based on group theory that's able to hit n^2.41 and shows some promise.

Here's the paper: https://arxiv.org/abs/math/0511460

Here's a set of lecture notes that are simpler to read and give an easy example of how group theoretic methods work: http://www.ccs.neu.edu/home/viola/classes/gems-08/lectures/l...


This thought has been churning around in my mind for some years now — we focus too much on processing speed and reductions in time complexity and not enough on increasing the size and efficiency of our cache and stack.

MM (especially MM on large number types, as in e.g. hashing algorithms) is very reliant on the cache because you can’t always fit numbers that big into a register. Side note: I was reading some Abseil code last night that did some funky bit twiddling on ARM: https://github.com/abseil/abseil-cpp/blob/master/absl/hash/i...

Off the top of my head, isn’t it about 200ns [edit, not ms] to query, bus, and read something from memory? Just a thought, perhaps the cache and memory is where we should focus our efforts.


This is kind of like "doctors are spending too much time curing cancer, they should worry about my bathroom, which is flooding!"

Theoretical Computer Scientists aren't going to increase the size of your cache, that's just not what they're into or what they know anything about.


> We nevertheless stress that such improvements are only of theoretical interest, since the huge constants involved in the complexity of fast matrix multiplication usually make these algorithms impractical.

It's only of mathematical interest


this article isn’t at all about doing it faster on actual hardware, and everyone has worried a lot about cache for decades now.


Not relevant to the article, but well-written single and double precision floating point large matrix GEMM on mainstream CPUs is basically computation-bound -- look at performance figures -- and is relatively unusual in that. (You probably need to worry about prefetching to get there, though.)


200ns, not 200ms


Fun article. The slow approach toward n^2 is fascinating.

> The result, by Josh Alman, a postdoctoral researcher at Harvard University, and Virginia Vassilevska Williams of the Massachusetts Institute of Technology, shaves about one-hundred-thousandth off the exponent of the previous best mark. It’s typical of the recent painstaking gains in the field.


Quanta has used this graphic in previous articles, and I find it very hard to grok. Time on the y-axis is just strange and confusing. Rotating 90° clockwise makes it much clearer.


The dotted line just compounds this. Very strange choice


Which one of these algos does Intel/Nvidia/Google use in their AVX/Cuda/TPU implementations?


Simple block partitioning, usually 4x4, with the full ordinary algorithm per block. This is the most numerically-stable and parallel.

https://developer.nvidia.com/blog/cutlass-linear-algebra-cud...
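For the curious, here's a minimal sketch of the block-partitioning idea in pure Python (the block size `bs` and the function name are mine; real kernels like CUTLASS tile for registers and shared memory, which this deliberately ignores):

```python
def blocked_matmul(A, B, bs=2):
    """Multiply square matrices (lists of lists) with bs x bs block tiling.

    Same O(n^3) arithmetic as the naive algorithm, just reordered so each
    block of C is accumulated from small tiles -- the cache/register-friendly
    structure that tiled GPU/CPU kernels use, minus all the hardware detail.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                # Accumulate the (i0, j0) tile of C from one tile of A and B.
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```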


However, on recent CPUs 4x4 is small for the innermost block size of the non-trivial hierarchy you need. You can see examples under https://github.com/flame/blis/tree/master/config with an a priori procedure for determining them in https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analyti... (but compare with what's actually used for SKX, in particular). OpenBLAS will normally be similar, though it may come out somewhat faster, but it's easier to see in BLIS.


and the most cache-friendly


Strassen's method is the only one practical on modern hardware


If it's of interest, there's an account of fairly recent work in the "Strassen reloaded" paper https://dl.acm.org/doi/10.5555/3014904.3014983 or otherwise in https://www.cs.utexas.edu/users/flame/pubs/FLAWN79.pdf


And only in fixed precision.


What makes you say that? I've had good speedups with Strassen multiplication in variable precision (or with floating-point, in case you meant fixed-point arithmetic).


Strassen's method is faster than naive multiplication, but it is not as stable: you will get much larger rounding errors when implementing it with floating-point numbers, compared to naive multiplication. And high precision is required by a lot of algorithms implemented on top of matrix multiplication, such as matrix inversion or gradient descent, so this is often a problem.
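To make this concrete, here's a textbook Strassen sketch (my own minimal version, assuming NumPy and power-of-two sizes; the `leaf` cutoff, below which it falls back to the ordinary product, is a standard practical trick):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Textbook Strassen recursion for square power-of-two matrices.
    The extra additions/subtractions below are where the larger
    rounding error (relative to the naive product) comes from."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # conventional product below the cutoff
    h = n // 2
    a, b, c, d = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    e, f, g, i = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    p1 = strassen(a + d, e + i, leaf)
    p2 = strassen(c + d, e, leaf)
    p3 = strassen(a, f - i, leaf)
    p4 = strassen(d, g - e, leaf)
    p5 = strassen(a + b, i, leaf)
    p6 = strassen(c - a, e + f, leaf)
    p7 = strassen(b - d, g + i, leaf)
    top = np.hstack([p1 + p4 - p5 + p7, p3 + p5])
    bot = np.hstack([p2 + p4, p1 - p2 + p3 + p6])
    return np.vstack([top, bot])
```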


do we know if the rounding errors are a big deal for numerical methods that can tolerate inaccuracy (like gradient descent for machine learning)?


It depends. I have seen some algorithms (the example that comes to mind was a clustering) become worse solely due to numerical error.

When that happens, if you are not equipped to measure the numerical error, or at least trained to suspect it, you might think that it is just the algorithm that is not working.


Rounding errors, underflows and instability in floats are very well known problems, a big deal if you do anything but graphics.


You want to converge on a local minimum, don't you? We can't guarantee that with unstable algorithms.


Speed is fine. The issue is numerical stability.


Yes, indeed, and Strassen is a bad default algorithm because of this. There are specialized situations where the numerical stability isn't an issue though.



The part I don't follow is that I thought multiplication used the same number of CPU cycles as addition, so I didn't get the part where one multiplication replaced with many additions was obviously better. Could someone explain that part? I feel I must be fundamentally misunderstanding something.


It has nothing to do with clock cycles, it's about how the number of additions and multiplications scale for large matrices.


If you look at operations per cycle, in this example for Intel Haswell processors, you'll find that an equivalent add and multiply are quite a bit different.

An add/subtract from a 32-bit register to register is 0.25 cycles per operation.

A multiply from 32-bit register to register is 2 cycles per operation. So an 8x speed difference, not counting latency, etc.

See pages 230 and 231 of this document: https://www.agner.org/optimize/instruction_tables.pdf


For what it's worth, what counts for normal implementations is the number of multiply-add operations per cycle, typically two on current hardware with FMA.


I feel the other answers you got do not directly answer your question for the following reason.

In practice, I believe it is Strassen's algorithm that is used for very large matrix multiplication. Strassen's algorithm improves on both the number of scalar multiplications and the number of scalar additions, both of which end up bounded by O(n^(lg 7)).

This is because Strassen's algorithm replaces a single large conventional matrix multiplication, consisting of n^3 multiplications and n^3 - n^2 additions, with 7 recursive applications of Strassen's multiplication on matrices of size n/2, along with 15 additions of matrices of size n/2.

The number of additions in a matrix addition is n^2 for a matrix of size n. Thus Strassen's requires 7 recursive calls as well as 15(n/2)^2 additions of scalars.

You win on both fronts if the matrices are sufficiently large and you default to conventional matrix multiplication on a recursive call below a certain threshold.
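The counts above can be tabulated with a small recurrence (a sketch under the parent comment's assumptions: power-of-two n, 15 half-size matrix additions per step, base case a single scalar multiply):

```python
def strassen_ops(n):
    """Scalar (multiplies, additions) for Strassen on an n x n matrix,
    n a power of two: 7 recursive products plus 15 additions of
    (n/2) x (n/2) matrices, as described above."""
    if n == 1:
        return 1, 0
    m, a = strassen_ops(n // 2)
    return 7 * m, 7 * a + 15 * (n // 2) ** 2

def naive_ops(n):
    """Conventional algorithm: n^3 multiplies, n^3 - n^2 additions."""
    return n ** 3, n ** 3 - n ** 2

# Multiplies win quickly; with this pure recursion the addition count
# only drops below the conventional one at quite large n, which is one
# reason real implementations switch to the naive product below a threshold.
print(strassen_ops(1024), naive_ops(1024))
print(strassen_ops(8192), naive_ops(8192))
```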


This isn’t about running stuff on hardware. The point is that for a large matrix the number of additions grows much more slowly, so they represent an increasingly small fraction of the operations. Past some sufficiently large matrix size they don't matter much, no matter the hardware.


The multipliers in CPUs are a mass of logic gates and are challenging to make work in a single cycle. Integer multiplication can be made to complete in a single cycle, but larger data types may take multiple cycles. Division is even more complex, although it is not part of the question here.

Additions really can be done in 1 cycle, so it's possible to optimize by using them instead of multiplications in operations like this.

Here's a Berkeley EECS lecture on multiplier circuits if you want more info: https://inst.eecs.berkeley.edu/~eecs151/sp18/files/Lecture21...


> multiplication used the same number of CPU cycles as addition

Only thanks to lookup tables and parallelism. And that's not even true for old computers, where multiplication could take multiple cycles. Also, I think complexity is calculated with a pen-and-paper model in mind; the number of CPU cycles may differ from the theoretical complexity when chips are optimised for certain tasks.


While that's true for fixed-precision integer arithmetic, this doesn't hold true for floating point multiplication.


From a complexity perspective, addition of n-digit numbers takes linear time, whereas multiplication is superlinear (O(n log n) at best)


It would be interesting to put this in perspective in regards to total energy savings.

Since so much of computation is matrix multiplication, how much computational power integrated over the next decade would we say that the exponent going from 3->2.3 has saved?


None. Afaik, all practical implementations of matrix multiplication execute the naive n^3 algorithm because sub-cubic algorithms have too large constant overheads and/or are numerically unstable.


People definitely use the more sophisticated algorithms; it just depends on the size of the matrix. Usually exploiting the computer architecture matters more than the algorithm, but at a certain size MMM is definitely faster using state-of-the-art algos.


Fascinating! I wonder if the instability could be kept small enough to be accurate enough for graphic applications (for power savings)… What if the constants were hardware stored in the silicon? :)


I want to learn everything about this, so I can help solving it.

Where do I start?



Well, that article isn't helpful by itself, but there's tons of words I can google for better information and tutorials.

Thanks!



