I'm concerned this is another patch on a very difficult problem. A memory move has something like 16 different cases, from the combinations of source alignment, destination alignment, partial starting word, and partial ending word. What's needed is a single efficient move that does the right thing at runtime: fetch the largest chunks the bus allows and realign as it goes.
That means a pipeline to realign data from source to destination, a partial fill of the pipeline at the start and a partial dump at the end, and page-sensitive fault and restart logic throughout.
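To make the realignment pipeline concrete, here is a rough user-space sketch in C. It is only an illustration under stated assumptions, not the code from any real libc or kernel: the name `copy_realign` is made up, the word size is fixed at 64 bits, the shift directions assume a little-endian machine (anything else falls back to a byte loop), and the fault and restart logic is left out entirely because it can't be expressed in portable C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t word_t;
#define W sizeof(word_t)

/* Runtime endianness probe; the shift pipeline below assumes little-endian. */
static int is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Hypothetical name. Non-overlapping copy that picks its strategy at runtime. */
void *copy_realign(void *dst_v, const void *src_v, size_t n)
{
    unsigned char *d = dst_v;
    const unsigned char *s = src_v;

    /* Partial starting word: byte copy until the destination is word aligned. */
    while (n > 0 && (uintptr_t)d % W != 0) { *d++ = *s++; n--; }
    assert(n == 0 || (uintptr_t)d % W == 0);

    size_t k = (uintptr_t)s % W;   /* source misalignment once dst is aligned */

    if (k == 0) {
        /* Both aligned: plain word moves (memcpy of W bytes compiles to one load/store). */
        while (n >= W) {
            word_t w;
            memcpy(&w, s, W);
            memcpy(d, &w, W);
            s += W; d += W; n -= W;
        }
    } else if (is_little_endian() && n >= 2 * W) {
        size_t have = W - k;       /* bytes held in the pipeline between iterations */
        word_t carry = 0;

        /* Partial fill: load the bytes up to the next aligned source word. */
        for (size_t i = 0; i < have; i++)
            carry |= (word_t)s[i] << (8 * i);
        s += have;                 /* s is now word aligned */

        /* Steady state: one aligned load and one aligned store per iteration. */
        while (n - have >= W) {
            word_t next, out;
            memcpy(&next, s, W);
            out = carry | (next << (8 * have));
            memcpy(d, &out, W);
            carry = next >> (8 * k);
            s += W; d += W; n -= W;
        }

        /* Partial dump: flush the bytes still sitting in the pipeline. */
        for (size_t i = 0; i < have; i++)
            *d++ = (unsigned char)(carry >> (8 * i));
        n -= have;
    }

    /* Partial ending word (and the fallback path): byte copy the rest. */
    while (n > 0) { *d++ = *s++; n--; }
    return dst_v;
}
```

Even this toy version has four distinct paths (head, aligned body, shifted body, tail), and it ignores overlap, wider vector loads, and faults partway through, which is roughly why doing it by hand in software is so error-prone.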
Having multiple versions of memcpy is suspicious to start with: is the compiler expected to know the alignment statically at code generation time? The arguments may be arbitrary pointers, so alignment is best determined at runtime. Each pass through the same memcpy code may see a different alignment, and so on.
Years ago I debugged the standard Linux copy routine on a RISC machine. It had a dozen bugs related to this. I remember thinking at the time that this should all be resolved at runtime by microcode inside the processor. It's been years now, and we get this. Sigh. It's a step anyway.