How does ChaCha20 compare to the established AES standard? Is it stronger? weaker? faster? slower? easier to implement correctly? harder to implement correctly? better for some other reason? worse for some other reason?
* As an ARX design, doesn't need S-boxes, and so doesn't leave a cache footprint
* Has free key setup
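To make the ARX point concrete, here is a minimal sketch of the ChaCha quarter round in Python. The helper names (`rotl32`, `quarter_round`) are my own; the operations follow the quarter round as specified in RFC 8439. Note that there is no table lookup anywhere, only 32-bit add, rotate, and xor, so no memory access depends on secret data.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a, b, c, d):
    """ChaCha quarter round: nothing but add, rotate and xor on 32-bit words."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d
```

The full cipher is just this function applied to columns and diagonals of a 4x4 word state, which is a large part of why it is hard to get wrong.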
AES is:
* A global standard
* Available in hardware on most platforms (extremely important)
* A conventional block cipher for which a bunch of modes (in particular: wide-block and AEAD) are already defined
But unlike Salsa, AES:
* Has a relatively complicated key schedule (you have to expand its key input to a series of per-round keys, which imposes a cost when you switch keys)
* Relies on S-boxes for security and so must carefully avoid microarchitectural side channels
* Is much harder to implement
* Is not a native stream cipher, so requires an adapter (usually: GCM mode) to use safely.
AES is usually faster on modern systems because it's implemented directly in silicon. Salsa is usually the fastest pure-software option. Both are so fast that the speed difference is not particularly important, but most systems will prefer AES when hardware support is present.
Salsa is almost certainly the better choice for new designs just because of its simplicity. It's harder to screw up Salsa20 or its derivatives than it is to screw up AES (it is very easy to screw up AES), and its performance is more than satisfactory.
Even then, they actually used a tweaked version of ChaCha20 that uses a 96-bit nonce (just barely large enough to be suitable for randomly-generated nonces) and a 32-bit counter (limiting its use to 256 GiB for a given nonce, since each counter value covers one 64-byte block). Also, an extension, XChaCha20, was recently published which performs an extra 20 rounds to initialize the cipher state, allowing for 192-bit nonces with no corresponding reduction in counter size.
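The per-nonce limit and the "just barely large enough" remark can both be checked with a few lines of arithmetic. This is a rough sketch: the function names are mine, and the collision estimate uses the usual birthday approximation n^2 / 2^(bits+1), which is only an approximation and only meaningful while the result is small.

```python
BLOCK_BYTES = 64  # one ChaCha20 block is 16 words of 32 bits


def max_bytes_per_nonce(counter_bits):
    """Keystream bytes available under one nonce before the block counter wraps."""
    return (2 ** counter_bits) * BLOCK_BYTES


def nonce_collision_estimate(nonce_bits, messages):
    """Birthday approximation for at least one repeated random nonce."""
    return messages * messages / 2 ** (nonce_bits + 1)


print(max_bytes_per_nonce(32) // 2 ** 30, "GiB per nonce with a 32-bit counter")
print(nonce_collision_estimate(96, 2 ** 32), "for 2^32 messages with 96-bit nonces")
print(nonce_collision_estimate(192, 2 ** 32), "for 2^32 messages with 192-bit nonces")
```

With 96-bit random nonces the estimate is still small after billions of messages, but it is close enough to practical volumes that people prefer the 192-bit variant when nonces are random.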
Yes, but that's true of all sorts of things that aren't really global standards. Don't get me wrong: you should use Salsa ciphers. I'm just trying to provide the most honest possible accounting.
AES is hard to implement on a general purpose computer in a way that is both fast and doesn't leak through cache timing attacks.
The safe way to use AES is with a hardware implementation, like those in modern x86 and some ARM CPUs.
The best software implementations use bitslicing and SSE, but are still slow. The best I saw is a 2009 paper by Emilia Käsper and Peter Schwabe[1] on bitsliced AES-GCM, which reports 21.99 cycles/byte for a constant-time implementation of authenticated AES-GCM.
For comparison, Intel shows[2] 0.77 cycles/byte for the same with a hardware implementation, albeit on a newer CPU.
ChaCha is fast on modern general-purpose CPUs without needing a hardware implementation. One reason is that it was designed so that a normal compiler can turn regular-looking C code into machine code that uses vector (wide) registers and keeps many independent operations in flight in the same clock cycle, without requiring an assembly wizard to do that. Intel can afford assembly wizards (e.g. Shay Gueron); other people can't.
Modern TLS stacks prefer AES when running on a CPU that has AES hardware and fall back to ChaCha otherwise. They of course fall back to either a slow or an insecure implementation of AES if the other side doesn't support ChaCha.
Basic Chacha C implementations do not get auto-vectorized down to ultra-efficient code. The most efficient implementations are intrinsic/assembler that process 4 (SSE2/AVX/NEON) or 8 (AVX2) Chacha20 blocks at once. This is due to layout of variables and operations being designed for efficient SIMD use and the blocks being independent of each other. (Shay's Chacha20 implementation is also not the fastest!)
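The "4 blocks at once" layout is easier to see in code. The sketch below is only an illustration of the data layout, not a fast implementation: a plain Python list of 4 words stands in for one 128-bit SIMD register, and the names (`vadd`, `vxor`, `vrotl`, `quarter_round_x4`) are mine. Lane i of every vector holds the same state word from block i, so every operation is lane-wise and no shuffling between lanes is needed.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


# A "vector" is a list of 4 words standing in for one SIMD register.
# Lane i of every vector belongs to block i, so lanes never mix.
def vadd(x, y):
    return [(p + q) & MASK32 for p, q in zip(x, y)]


def vxor(x, y):
    return [p ^ q for p, q in zip(x, y)]


def vrotl(x, n):
    return [rotl32(p, n) for p in x]


def quarter_round_x4(a, b, c, d):
    """One ChaCha quarter round applied to 4 independent blocks at once."""
    a = vadd(a, b); d = vrotl(vxor(d, a), 16)
    c = vadd(c, d); b = vrotl(vxor(b, c), 12)
    a = vadd(a, b); d = vrotl(vxor(d, a), 8)
    c = vadd(c, d); b = vrotl(vxor(b, c), 7)
    return a, b, c, d
```

In a real SSE2/NEON implementation each of `vadd`, `vxor` and `vrotl` maps to one or a few vector instructions, and the four blocks differ only in their counter word.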
Basic GNU C implementations don't get auto-vectorised full stop. But with a little bit of effort Chacha20 can be made to vectorise. The implementation in here is vectorised by GNU C:
If "ultra-efficient code" means what could be produced by a programmer highly skilled in some amd64 implementation (intel core2, amd bulldozer, ...) for that implementation, then yes, I doubt GNU C produces it. But the odds are GNU C's output runs faster than that guru's code on other amd64 implementations.
That Salsa implementation is not being vectorized? Salsa also requires some values to be shuffled around to actually work in SSE registers, djb made a bit of a boo-boo when designing it. Chacha fixes that, so its SIMD implementations are a bit more straightforward.
If you don't have hardware acceleration, for example the AES-NI instructions on x86-64, ChaCha will normally be quite a lot faster, especially on 32-bit and 64-bit architectures.
Because it's a stream cipher, you can also precompute the keystream. This reduces encrypt/decrypt to a simple XOR when handling the message (depending on message length, of course). And yes, AES-CTR can also be used like this.
Chacha20 can do random access. See the end of my article, when I talk about counter mode. To get the part of the stream you want, you just generate the block you need (they're all the same, only the counter changes), then encrypt it. No need to generate all previous blocks.
Indeed, one reason for using AES in counter mode is this random access, which among other things enables parallel encryption. The same strategy works with Chacha20.
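Here is a minimal sketch of that random access, assuming the RFC 8439 layout (32-byte key, 32-bit block counter, 12-byte nonce). The helper names `keystream` and `xor_at` and the `first_counter` parameter are my own. It is for illustration only: it is unauthenticated and slow, so don't ship it.

```python
import struct

MASK32 = 0xFFFFFFFF
CONSTANTS = struct.unpack("<4I", b"expand 32-byte k")


def rotl32(x, n):
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """In-place quarter round on state indices a, b, c, d."""
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key, counter, nonce):
    """One 64-byte keystream block; only `counter` differs between blocks."""
    init = (list(CONSTANTS) + list(struct.unpack("<8I", key))
            + [counter & MASK32] + list(struct.unpack("<3I", nonce)))
    s = init[:]
    for _ in range(10):  # 10 double rounds = 20 rounds
        quarter_round(s, 0, 4, 8, 12); quarter_round(s, 1, 5, 9, 13)
        quarter_round(s, 2, 6, 10, 14); quarter_round(s, 3, 7, 11, 15)
        quarter_round(s, 0, 5, 10, 15); quarter_round(s, 1, 6, 11, 12)
        quarter_round(s, 2, 7, 8, 13); quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16I", *[(x + y) & MASK32 for x, y in zip(s, init)])


def keystream(key, nonce, offset, length, first_counter=0):
    """Keystream bytes [offset, offset + length) without computing earlier blocks."""
    first, last = offset // 64, (offset + length - 1) // 64
    blocks = b"".join(chacha20_block(key, first_counter + n, nonce)
                      for n in range(first, last + 1))
    start = offset - first * 64
    return blocks[start:start + length]


def xor_at(key, nonce, data, offset):
    """Encrypt or decrypt `data` as if it sat at byte `offset` of the stream."""
    ks = keystream(key, nonce, offset, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))
```

Because each call to `chacha20_block` is independent, the same structure gives you parallel encryption and keystream precomputation for free: generate the blocks ahead of time, then XOR when the message arrives.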
I'm probably stating the obvious here, but whatever your strategy for decrypting is, you still must verify the ciphertext integrity, which unfortunately for you is calculated over the whole ciphertext. You may save some time by not reading the stuff before the block you're interested in, but you will have to read the whole thing anyway if you want to be safe.
I'm no expert of course so I don't even know if there's an AEAD that can bring you integrity on parts of the input; at least I know that minilock (https://github.com/kaepora/miniLock/blob/master/README.md#-m...) builds some kind of counter mode where each chunk is properly encrypted and has everything needed to check its integrity.
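One way to get integrity on parts of the input is to tag each chunk separately. The sketch below is a hypothetical layout of my own, not miniLock's actual format: it treats the ciphertext as opaque bytes already produced by some stream cipher, and uses HMAC-SHA256 from the Python standard library with a MAC key that is assumed to be separate from the encryption key. Each tag covers the chunk index and a last-chunk flag, so chunks cannot be reordered or the stream silently truncated.

```python
import hashlib
import hmac
import struct


def chunk_tag(mac_key, index, is_last, chunk):
    """Tag binds the chunk to its position and to whether it ends the stream."""
    header = struct.pack("<QB", index, 1 if is_last else 0)
    return hmac.new(mac_key, header + chunk, hashlib.sha256).digest()


def seal_chunks(mac_key, ciphertext, chunk_size=4096):
    """Split ciphertext into chunks and return a list of (chunk, tag) pairs."""
    chunks = [ciphertext[i:i + chunk_size]
              for i in range(0, len(ciphertext), chunk_size)] or [b""]
    last = len(chunks) - 1
    return [(c, chunk_tag(mac_key, i, i == last, c)) for i, c in enumerate(chunks)]


def verify_chunk(mac_key, index, is_last, chunk, tag):
    """Constant-time check of one chunk; no other chunk needs to be read."""
    return hmac.compare_digest(tag, chunk_tag(mac_key, index, is_last, chunk))
```

A reader who wants only chunk 7 verifies and decrypts chunk 7 alone. The cost is one tag per chunk of storage, and you still only learn that the chunks you checked are intact.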
The most widespread way of using Salsa/ChaCha is in the "Chapoly" construction, which combines ChaCha20 with DJB's Poly1305 polynomial MAC; this is an authenticated construction. Pretty much every mainstream application of Salsa20 is in fact a Salsa/Poly1305 construction.
You can also just combine Salsa and HMAC.
It's true that you need to authenticate your data, but this is true for any cipher that you use.
It's a bad idea to implement your own cipher code, no matter what you're doing. If you're looking to include Salsa/ChaCha in an application, use NaCl, which refuses to give you unauthenticated ciphertext.
There's virtually no difference in utility, since pretty much the only thing we ever do with a block cipher is adapt it to encrypt streams --- this is true conceptually even when we're not literally turning the block cipher into a PRF with something like CTR mode.