Just for reference: a Google TPU can multiply a 256x256 matrix in 1 clock cycle[1]. That seems astonishingly fast, considering the TPU operates at 700MHz:
The TPU Matrix Multiplication Unit has a systolic array mechanism that contains 256 × 256 = 65,536 ALUs in total. That means a TPU can process 65,536 multiply-and-adds for 8-bit integers every cycle. Because a TPU runs at 700MHz, a TPU can compute 65,536 × 700,000,000 = 46 × 10^12 multiply-and-add operations, or 92 Teraops per second (92 × 10^12), in the matrix unit.
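A quick back-of-the-envelope check of the quoted figures (just redoing the arithmetic from the quote, not an official spec):

```python
# Re-derive the quoted TPU v1 throughput numbers.
alus = 256 * 256            # systolic array size
clock_hz = 700_000_000      # 700 MHz
macs_per_sec = alus * clock_hz
ops_per_sec = 2 * macs_per_sec  # one MAC = one multiply + one add

print(f"{macs_per_sec:.2e} MAC/s")          # ~4.59e13, i.e. ~46 x 10^12
print(f"{ops_per_sec / 1e12:.0f} Teraops/s")  # ~92
```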
It appears that while the TPU is capable of ingesting the inputs of a few matrices each clock cycle, it likely takes several clock cycles for the matrix multiplication operation to percolate through the systolic array pipeline. That means while it may be able to sustain an average of one matrix multiplication per clock cycle, any particular matrix multiplication operation will require multiple cycles before the result is valid.
This isn’t exactly correct. A TPU can probably multiply two matrices per clock cycle, but it most likely takes a few hundred cycles for a single operation to be performed. This is similar to how a modern CPU can execute many instructions per cycle, even though most instructions take multiple clock cycles to complete.
Just to clarify, the TPU (version 1) does not multiply two 256x256 matrices per clock cycle as that takes 256^3 MAC operations to perform. It produces one row of the product matrix per clock cycle (each 256 element row takes 256^2 MAC operations to calculate).
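To make the "one row per cycle" picture concrete, here's a toy Python model of steady-state throughput: each "cycle" emits one row of the product. This is purely illustrative and ignores pipeline fill latency; it is not a cycle-accurate TPU model.

```python
# Toy model: a systolic matrix unit at steady state produces one output row
# of C = A * B per "cycle" (each row costs n^2 MACs, the full product n^3).
def row_per_cycle_matmul(A, B):
    n = len(A)
    C = []
    cycles = 0
    for i in range(n):  # one output row per cycle at steady state
        row = [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        C.append(row)
        cycles += 1
    return C, cycles

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, cycles = row_per_cycle_matmul(A, B)
print(C, cycles)  # [[19, 22], [43, 50]] produced in 2 "cycles"
```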
This thought has been churning around in my mind for some years now — we focus too much on processing speed and reductions in time complexity and not enough on increasing the size and efficiency of our cache and stack.
Matrix multiplication (especially on large number types, e.g. in hashing algorithms) is very reliant on the cache because you can't always fit a number that big into a register. Side note, I was reading some Abseil code last night that did some funky bit twiddling on ARM: https://github.com/abseil/abseil-cpp/blob/master/absl/hash/i...
Off the top of my head, isn’t it about 200ns [edit, not ms] to query, bus, and read something from memory? Just a thought, but perhaps cache and memory are where we should focus our efforts.
> We nevertheless stress that such improvements are only of theoretical interest, since the huge constants involved in the complexity of fast matrix multiplication usually make these algorithms impractical.
Not relevant to the article, but well-written single and double precision floating point large matrix GEMM on mainstream CPUs is basically computation-bound -- look at performance figures -- and is relatively unusual in that. (You probably need to worry about prefetching to get there, though.)
> The result, by Josh Alman, a postdoctoral researcher at Harvard University, and Virginia Vassilevska Williams of the Massachusetts Institute of Technology, shaves about one-hundred-thousandth off the exponent of the previous best mark. It’s typical of the recent painstaking gains in the field.
Quanta has used this graphic in previous articles, and I find it very hard to grok. Time on the y-axis is just strange and confusing. Rotating 90° clockwise makes it much clearer.
However, on recent CPUs 4x4 is small for the innermost block size of the non-trivial hierarchy you need. You can see examples under https://github.com/flame/blis/tree/master/config with an a priori procedure for determining them in https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analyti... (but compare with what's actually used for SKX, in particular). OpenBLAS will normally be similar, though it may come out somewhat faster, but it's easier to see in BLIS.
What makes you say that? I've had good speedups with Strassen multiplication in variable precision (or with floating-point, in case you meant fixed-point arithmetic).
Strassen's method is faster than naive multiplication, but it is not as stable: you will get much larger rounding errors when implementing it with floating-point numbers, compared to naive multiplication. And fine precision is required for a lot of algorithms implemented using matrix multiplication, such as matrix inversion or gradient descent, so this is often a problem.
It depends. I have seen some algorithms (the example that comes to mind was a clustering algorithm) become worse solely due to numerical error.
When that happens, if you are not equipped to measure the numerical error, or at least trained to suspect it, you might think it is just the algorithm that is not working.
Yes, indeed, and Strassen is a bad default algorithm because of this. There are specialized situations where the numerical stability isn't an issue though.
The part I don't follow is that I thought multiplication used the same number of CPU cycles as addition, so I didn't get the part where one multiplication replaced with many additions was obviously better. Could someone explain that part? I feel I must be fundamentally misunderstanding something.
If you look at operations per cycle, in this example for Intel Haswell processors, you'll find that an equivalent add and multiply are quite a bit different.
An add/subtract between 32-bit registers is 0.25 cycles per operation.
A multiply between 32-bit registers is 2 cycles per operation. So an 8x difference in throughput, not counting latency, etc.
For what it's worth, what counts for normal implementations is the number of multiply-add operations per cycle, typically two on current hardware with FMA.
I feel the other answers you got do not directly answer your question for the following reason.
In practice, I believe it is Strassen's algorithm that is used for very large matrix multiplication. Strassen's algorithm improves on both the number of multiplications of scalars and the number of additions of scalars, both of which end up bounded by O(n^(lg 7)).
This is because Strassen's algorithm replaces a single large conventional matrix multiplication, consisting of n^3 multiplications and n^3 - n^2 additions, with 7 recursive applications of Strassen's multiplication on matrices of size n/2, along with 15 additions of matrices of size n/2.
The number of scalar additions in a matrix addition is n^2 for a matrix of size n. Thus Strassen's requires 7 recursive calls as well as 15(n/2)^2 additions of scalars.
You win on both fronts if the matrices are sufficiently large and you default to conventional matrix multiplication on a recursive call below a certain threshold.
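The scheme described above can be sketched in Python. This is a hedged, minimal implementation assuming power-of-two matrix sizes; the threshold value is illustrative, not tuned, and real libraries pick it by benchmarking.

```python
# Sketch of Strassen's algorithm with a cutoff: 7 recursive half-size
# multiplies plus matrix additions/subtractions, falling back to the
# conventional O(n^3) method below a threshold. Assumes n is a power of two.
def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def naive_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def strassen(X, Y, threshold=2):
    n = len(X)
    if n <= threshold:          # small case: conventional multiplication
        return naive_mul(X, Y)
    h = n // 2
    split = lambda M: ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                       [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    A11, A12, A21, A22 = split(X)
    B11, B12, B21, B22 = split(Y)
    # The 7 recursive products (Strassen's original formulation).
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), threshold)
    M2 = strassen(mat_add(A21, A22), B11, threshold)
    M3 = strassen(A11, mat_sub(B12, B22), threshold)
    M4 = strassen(A22, mat_sub(B21, B11), threshold)
    M5 = strassen(mat_add(A11, A12), B22, threshold)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), threshold)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), threshold)
    # Recombine into the four quadrants of the product.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = [[1, 0, 0, 1], [0, 1, 1, 0], [2, 0, 1, 0], [0, 3, 0, 1]]
print(strassen(X, Y) == naive_mul(X, Y))  # True
```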
This isn’t about running stuff on hardware. The point is that for a large matrix the number of additions grows much more slowly, such that they represent an increasingly small percentage of operations. So at some sufficiently large matrix size they don't matter much, no matter the hardware.
The multipliers in CPUs are a mass of logic gates and are challenging to make work in a single cycle. While I believe it is possible with integer multiplication to do so, doing so with other larger data types might take multiple cycles. Division is even more complex, although not part of the question here.
Additions really can be done in 1 cycle, so it's possible to optimize by using them instead of multiplications in operations like this.
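To illustrate why a multiplier is so much more logic than an adder, here is a shift-and-add decomposition of integer multiplication (one addition per set bit of the multiplier). This is an illustration only; real hardware multipliers reduce partial products in parallel rather than sequentially like this.

```python
# Integer multiplication built from shifts and additions. The number of
# partial-product additions a hardware multiplier must account for is one
# reason multiplier circuits dwarf single adders in gate count.
def shift_add_multiply(a, b):
    assert b >= 0
    result = 0
    while b:
        if b & 1:        # low bit of the multiplier set: add this partial product
            result += a
        a <<= 1          # shift multiplicand for the next bit position
        b >>= 1
    return result

print(shift_add_multiply(37, 41))  # 1517
```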
> multiplication used the same number of CPU cycles as addition
Only thanks to lookup tables and parallelism. And it isn't even true on old computers, where a multiplication could take multiple cycles.
Also, I think complexity is calculated with a pen-and-paper model in mind. The number of CPU cycles might vary from the theoretical complexity when chips are optimised for certain tasks.
It would be interesting to put this in perspective in regards to total energy savings.
Since so much of computation is matrix multiplication, how much computational power integrated over the next decade would we say that the exponent going from 3->2.3 has saved?
None. Afaik, all practical implementations of matrix multiplication execute the naive n^3 algorithm because sub-cubic algorithms have too large constant overheads and/or are numerically unstable.
People definitely use the more sophisticated algorithms; it just depends on the size of the matrix. Usually exploiting the computer architecture is more important than the algorithm, but at a certain size MMM is definitely faster using state-of-the-art algorithms.
Fascinating! I wonder if the instability could be kept small enough to be accurate enough for graphics applications (for power savings)… What if the constants were stored in the silicon itself? :)
[1] https://cloud.google.com/blog/products/ai-machine-learning/a...