LDM/STM are awful for modern cores, and were even microcoded on at least the early ARM cores despite those being, you know, RISC.
As for why they're awful, it's the number of loads and stores that can be in flight for a single instruction when an exception is taken, and how the instruction is restarted afterwards. ARM-M cores make this a little better by exposing "I've executed this many loads so far in the LDM" in architectural state, so the instruction can be restarted without going back and reissuing loads (which lets it be used in MMIO ranges), but the instruction really only shines in simple, in-order designs. They're almost as bad to implement on modern cores as the load-indirect instructions you see on some older CISC chips.
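To make that concrete, here's a sketch (arbitrary register choices, ARM-M style syntax) of what a single LDM is hiding:

```
    ; One instruction, eight loads in flight:
    LDMIA   r0!, {r4-r11}

    ; Morally equivalent to eight separate loads:
    ;   LDR r4, [r0], #4
    ;   LDR r5, [r0], #4
    ;   ... and so on through r11.
    ;
    ; If an interrupt lands partway through, the core either has to
    ; replay the whole LDM (bad news if [r0] points at MMIO with
    ; read side effects) or, on ARMv7-M, stash its progress in the
    ; EPSR ICI bits so it can resume mid-instruction.
```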
Additionally, LDM/STM really shone in cache-less or cache-poor designs where instruction fetch competes for memory bandwidth with the transfer itself. That doesn't really apply to modern cores with their fairly Harvard-looking memory access patterns, so getting rid of these instructions isn't the biggest deal in the world.
So to answer your question, they absolutely could have done that, but chose to use the transition to AArch64 to remove albatrosses like LDM/STM from around their necks, because they're more trouble than they're worth from a hardware perspective on modern OoO cores. The LDP/STP instructions are the bone they throw you to improve instruction density for transfers to/from the registers, but they really don't want any one instruction being responsible for more than a single memory transfer, for core-internal bookkeeping reasons.
LDM/STM were also the source of some really wacky hardware bugs on some STM32 microcontrollers, such as:
> If an interrupt occurs during an CPU AHB burst read access to an end of SDRAM row, it may result in wrong data read from the next row if all the conditions below are met:
> • The SDRAM data bus is 16-bit or 8-bit wide. 32-bit SDRAM mode is not affected.
> • RBURST bit is reset in the FMC_SDCR1 register (read FIFO disabled).
> • An interrupt occurs while CPU is performing an AHB incrementing bursts read access of unspecified length (using LDM = Load Multiple instruction).
> • The address of the burst operation includes the end of an SDRAM row.
FWIW, disabling the read FIFO like that would be a really goofy choice, but yeah, very good point. These are very special-cased instructions that don't even behave like the DMA transfers you might expect them to resemble.
There's a reason for that. Another erratum for the same part explains that:
> If an interrupt occurs during an CPU AHB burst read access to one SDRAM internal bank followed by a second read to another SDRAM internal bank, it may result in wrong data read if all the conditions below are met:
> • SDRAM read FIFO enabled. RBURST bit is set in the FMC_SDCR1 register
> • An interrupt occurs while CPU is performing an AHB incrementing bursts read access of unspecified length (using LDM = Load Multiple instruction) to one SDRAM internal bank and followed by another CPU read access to another SDRAM internal bank.
Could it also be the case that this is rendered partially obsolete by vector instructions? Obviously vector loads/stores don't cover all these cases, but I have to imagine they cover quite a few, and without all the bookkeeping (who knew loading one big thing would be so much easier to keep track of than loading a handful of tiny things).
No, because you still want fairly dense dumps of registers out to cache for function prologues. So blits from the integer register file still show up in your profile traces, hence LDP/STP.
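For example, a typical AArch64 prologue spills callee-saved registers two at a time (a hand-written sketch, not any particular compiler's output):

```
fn:
    stp     x29, x30, [sp, #-48]!   ; frame pointer + link register, pre-indexed
    stp     x19, x20, [sp, #16]     ; callee-saved pairs
    stp     x21, x22, [sp, #32]
    mov     x29, sp
```

Each STP is a single 16-byte transfer, so the core never has more than one memory operation's worth of restart state to track per instruction.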