Could you elaborate on what you mean by real vectors? It's not immediately obvious looking at the Cray specs what differentiates it from the kind of SIMD we have today.
The Cray vector processors had a set of 8 64-element x 64-bit 'vector' (V) registers, as well as 8 64-bit 'scalar' (S) and 8 24-bit 'address' (A) registers - so it would sort of be similar to 4096-bit wide SIMD. When you did an operation like a vector add, you could do "V0=V1+V2", and it would automatically do 64 consecutive adds, finishing in 64 + a few cycles (since the hardware was still only doing 1 add per cycle). As someone else mentioned, it also supported "vector chaining", so if your next instruction was "V2=V0*V3", it could take each result from the adder and pipe it straight into the multiplier, so now your addition and multiplication are nearly fully overlapped (and you're cruising along at 160 MFLOPS in 1976!). I think it might have supported 3 chains, so you could very briefly peak at 240 MFLOPS, but you couldn't sustain it because of the startup latencies involved.
As a 'practical' example, I was able to write an N-body simulator of Jupiter and 63 of its moons (using the vector registers) orbiting one another in only 127 total instructions!
Another key feature of these architectures is that they had a vector length register. This allowed you to write strip-mined loops that would move through arbitrary-size vectors in units of the hardware vector lane width, without knowing that width until runtime. This means that, unlike MMX/SSE, the same binary works on machines with different numbers of lanes.
This idea has been resurrected recently with the RISC-V vector extension and Arm's Scalable Vector Extension (SVE). There the general idea is an instruction that writes the minimum of an argument value and the hardware vector register length into a register, and sets the masking appropriately if the argument is smaller. This makes for a very straightforward strip-mined loop without a branch to check for and handle the remainder in the last iteration.
A few things: vector operations were controlled by a VL (vector length) register, so the length (SIMD "width") was dynamic. On the Cray-2/-3, you could set the length to zero and turn vector operations into no-ops. So vectorization of a loop with an unknown trip count generated a "strip-mined" loop in which each pass processed 1-64 (later 128) elements, and there was no epilogue problem as with SIMD. The last proprietary vector ISA from Cray had a "compute VL" instruction that attempted to smooth out the lengths of the final iterations.
The Cray-1 line could "chain" the results of one vector operation into operand(s) of another without waiting for the first to complete. (On the Cray-1, the later operation had to issue at the exact "chain slot" cycle at which the first result element appeared, so scheduling was fun; on the X-MP and later, "flexible chaining" relaxed this.) Scheduling vector code involved grouping operations into "chimes" that would run as parallel chained operations, and so long as you could pack more vector instructions into a chime without forcing synchronization through register reuse or blocking on a busy functional unit, you won. Getting a 3-chime loop down to 2 chimes was fun puzzle solving, and if the loop used (say) the floating adder twice, you knew you could stop optimizing.
The Cray-2 didn't chain, but the Cray-3 had "tailgating", which was kind of the opposite -- a new vector result could start writing to a vector register that was in use as an operand without having to wait for that operand use to complete.
It helps to think about these vector machines as being pipelined (which they were). A single chime sequence was basically flowing data from memory to functional units and back to memory without really needing to use the vector registers per se for anything unless an interrupt arrived in the middle of the sequence.