They aren't really created equally though. epoll and kqueue really are just edge...

o11c · on Jan 11, 2024

A point that people seem to miss: epoll supports both level-triggered and edge-triggered. Most similar APIs only support level-triggered.

Edge-triggered is theoretically less work for the kernel than level-triggered, but requires that your application not be buggy. People tend to either assume "nobody uses edge-triggered" or "everybody uses edge-triggered".
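To make the level/edge distinction concrete, here is a minimal Linux-only sketch using Python's `select.epoll`. The helper name `polls_without_reading` is made up for illustration; it polls twice without consuming the pending data and reports how many events each poll returned.

```python
import select
import socket


def polls_without_reading(flags):
    """Register a readable socket with `flags`, poll twice without reading,
    and return how many events each poll reported."""
    reader, writer = socket.socketpair()
    ep = select.epoll()
    try:
        writer.send(b"hello")            # data is pending before we poll
        ep.register(reader.fileno(), flags)
        first = len(ep.poll(0.1))
        second = len(ep.poll(0.1))       # nothing was read in between
        return first, second
    finally:
        ep.close()
        reader.close()
        writer.close()


level = polls_without_reading(select.EPOLLIN)
edge = polls_without_reading(select.EPOLLIN | select.EPOLLET)
print("level-triggered:", level)
print("edge-triggered: ", edge)
```

Level-triggered should report the socket on both polls, since the data is still sitting there. Edge-triggered should report it only once, which is why an application that forgets to drain the socket stalls.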

Completion-based is far from trivial; since the memory traffic can happen at any time, the kernel has to consider "what if somebody changes the memory map between the start syscall and the end syscall". It complicates the application too, since now you have to keep ownership of a buffer but you aren't allowed to touch it.

AIX and Solaris apparently also support completion-based APIs, but I've never seen anyone actually run these OSes.

(Aside: `poll` is the easiest API to use for just a few file descriptors, and `select` is more flexible than it appears if you ignore the value-based API assumptions and do your own allocation.)
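As a rough illustration of how little setup `poll` needs for a handful of descriptors, here is a sketch using Python's `select.poll`. The helper `wait_readable` is a hypothetical name, not part of any library.

```python
import select
import socket


def wait_readable(socks, timeout_ms):
    """Return the subset of `socks` that poll() reports as readable."""
    poller = select.poll()
    by_fd = {}
    for s in socks:
        poller.register(s.fileno(), select.POLLIN)
        by_fd[s.fileno()] = s
    return [by_fd[fd] for fd, ev in poller.poll(timeout_ms) if ev & select.POLLIN]


a, a_peer = socket.socketpair()
b, b_peer = socket.socketpair()
a_peer.send(b"ping")                     # only `a` has data waiting
ready = wait_readable([a, b], 100)

for s in (a, a_peer, b, b_peer):
    s.close()
```

There is no persistent kernel object to manage here: the interest set is rebuilt and handed over on each call, which is fine for a few descriptors and is what stops scaling with thousands.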

manwe150 · on Jan 11, 2024

Edge-triggered requires an extra read/write on every epoll wakeup relative to level-triggered, though, because you must keep going until you actually hit the error state (EAGAIN). So it can actually be much slower (libuv considered switching at one point, but it wasn't clear the extra syscalls required by edge triggering were worthwhile).
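The extra syscall is easy to see in a sketch. Assuming a non-blocking socket and a hypothetical `drain` helper that reads until EAGAIN (which Python surfaces as `BlockingIOError`), a single small message costs two `recv` calls: one that returns the data and one that only confirms there is nothing left.

```python
import socket


def drain(sock, bufsize=4096):
    """Read a non-blocking socket until EAGAIN; return (data, recv_calls)."""
    chunks, calls = [], 0
    while True:
        calls += 1
        try:
            chunk = sock.recv(bufsize)
        except BlockingIOError:          # EAGAIN / EWOULDBLOCK: nothing left
            break
        if not chunk:                    # peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks), calls


reader, writer = socket.socketpair()
reader.setblocking(False)
writer.send(b"x" * 10)
data, calls = drain(reader)
print(len(data), "bytes in", calls, "recv calls")

reader.close()
writer.close()
```

With level-triggered notification the second call could be skipped, because the next poll would report the socket again if more data had arrived.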

o11c · on Jan 11, 2024

Only on reads. For writes you always want to loop until the kernel buffer really is full (remember the kernel can do I/O while you're working). Writes, incidentally, are a case where epoll is awkward since you have to EPOLL_CTL_MOD it every single time the buffer empties/fills (though you should only do this after a full tick of the event loop of course ... but the bursty nature means that you often do have to, thus you get many more syscalls than `select`).
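A sketch of that write pattern, Linux-only and under some assumptions: `flush` is a hypothetical helper, the socket is non-blocking and already registered for `EPOLLIN`, and `epoll.modify` is Python's wrapper for `EPOLL_CTL_MOD`. The send buffer is shrunk here only so the demo fills it quickly.

```python
import select
import socket


def flush(sock, ep, pending):
    """Send until `pending` is empty or the kernel buffer is full, then set
    EPOLLOUT interest to match. Returns the bytes still unsent."""
    view = memoryview(pending)
    while len(view) > 0:
        try:
            sent = sock.send(view)
        except BlockingIOError:
            # Kernel buffer is full: ask to be woken when it drains.
            ep.modify(sock.fileno(), select.EPOLLIN | select.EPOLLOUT)
            return bytes(view)
        view = view[sent:]
    # Everything was written: stop asking for writability events.
    ep.modify(sock.fileno(), select.EPOLLIN)
    return b""


a, b = socket.socketpair()
a.setblocking(False)
a.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)  # small buffer for the demo
ep = select.epoll()
ep.register(a.fileno(), select.EPOLLIN)

payload = b"x" * (1 << 20)               # far more than the buffer can hold
left = flush(a, ep, payload)             # nobody is reading from `b` yet
print(len(payload) - len(left), "bytes sent,", len(left), "still pending")

ep.close()
a.close()
b.close()
```

Each call to `flush` ends in a `modify`, which is the per-transition syscall being described. A real event loop would probably track the current interest mask and skip the call when nothing changes.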

Even for reads, there exist plenty of scenarios where you will get short reads despite more data being available by the time you check. Though ... I wonder if deferring that and doing a second pass over all your FDs might be more efficient, since that gives more time for real data to arrive again?

manwe150 · on Jan 12, 2024

True, I don’t remember the details for writes, and the complexity of managing high/low water marks makes it even trickier to write optimal code. And large kernel send buffers mostly avoid the performance problem here anyway. But on a short write, I am not sure I see the value in testing for EAGAIN over looping through epoll and getting a fresh set of events for everything instead of just this one fd.

Right, for reads, epoll will happily tell you if there is more data still there. If the read buffer size is reasonable, short reads should not be common. And if the buffer is huge, a trip around the event loop is probably better at that point to avoid starvation of the other events.

marssaxman · on Jan 11, 2024

> Completion based APIs are superior IMO

Perhaps it's just that I cut my teeth on the classic Mac OS and absorbed its way of thinking, but after using its asynchronous, callback-driven IO API, the multithreaded polling/blocking approach dominant in the Unix world felt like a clunky step backward. I've been glad to see a steady shift toward asynchronous state machines as the preferred approach for IO.

geertj · on Jan 11, 2024

> Completion based APIs are superior IMO

I probably agree with that, but curious to know what your reasons are.