
They aren't really created equally though.

epoll and kqueue really are just edge-triggered select/poll.

However IOCP and the new io_uring are different beasts, they are completion based APIs vs readiness based.

To quickly explain the difference:

readiness based: tell me all sockets that are ready to be read from

completion based: do this, tell me when you are done

The "tell me when you are done" part is usually handled in the form of a message on a queue (or ring buffer, hence the name io_uring, with the u being for userspace). That also generally means really high scalability, both for submitting tons of tasks and for processing tons of completions.
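To make the two shapes concrete, here is a rough Python sketch. The readiness half uses the standard `selectors` module. The completion half is only a simulation: a worker thread and a `queue.Queue` stand in for the kernel and its completion queue, which is an assumption for illustration and not how io_uring or IOCP work internally. The function names are made up.

```python
import queue
import selectors
import socket
import threading


def readiness_style(payload: bytes) -> bytes:
    """Readiness: ask which sockets are readable, then do the read ourselves."""
    a, b = socket.socketpair()
    sel = selectors.DefaultSelector()
    try:
        sel.register(b, selectors.EVENT_READ)
        a.sendall(payload)
        # "Tell me all sockets that are ready to be read from."
        for key, _events in sel.select(timeout=1.0):
            return key.fileobj.recv(4096)  # the application performs the I/O
        return b""
    finally:
        sel.close()
        a.close()
        b.close()


def completion_style(payload: bytes) -> bytes:
    """Completion: submit "do this read", then take the result off a queue.

    The worker thread is a stand-in for the kernel (illustration only).
    """
    a, b = socket.socketpair()
    completions = queue.Queue()  # plays the role of the completion queue

    def do_read() -> None:
        # The "kernel" does the read and posts a tagged completion message.
        completions.put(("read-1", b.recv(4096)))

    worker = threading.Thread(target=do_read)
    try:
        worker.start()  # "Do this..."
        a.sendall(payload)
        _tag, data = completions.get(timeout=1.0)  # "...tell me when you are done."
        return data
    finally:
        worker.join()
        a.close()
        b.close()
```

In the first function the application learns that a read would succeed and then issues it. In the second it only ever sees the finished result.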

Completion based APIs are superior IMO and it was always sad to me that Windows had one and Linux didn't so it's awesome Jens Axboe got his hands dirty to implement it. It beats the pants off of libaio, eventfd, epoll and piles of hacks.



A point that people seem to miss: epoll supports both level-triggered and edge-triggered. Most similar APIs only support level-triggered.

Edge-triggered is theoretically less work for the kernel than level-triggered, but requires that your application not be buggy. People tend to either assume "nobody uses edge-triggered" or "everybody uses edge-triggered".
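A small sketch of the difference, assuming Linux and Python's `select.epoll` (the helper name `count_wakeups` is made up). One byte arrives and is deliberately never read. Level-triggered keeps reporting the socket on every wait, while edge-triggered reports it only once, which is exactly where a buggy application loses data.

```python
import select
import socket


def count_wakeups(edge_triggered: bool, polls: int = 3) -> int:
    """Return how many of `polls` epoll waits report unread data on a socket."""
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        flags = select.EPOLLIN | (select.EPOLLET if edge_triggered else 0)
        ep.register(b.fileno(), flags)
        a.sendall(b"x")  # data arrives once and is deliberately never read
        wakeups = 0
        for _ in range(polls):
            if ep.poll(timeout=0.1):
                wakeups += 1
        return wakeups
    finally:
        ep.close()
        a.close()
        b.close()
```

Calling `count_wakeups(False)` and `count_wakeups(True)` side by side shows the level-triggered socket being reported on every wait and the edge-triggered one only on the first.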

Completion-based is far from trivial; since the memory traffic can happen at any time, the kernel has to consider "what if somebody changes the memory map between the start syscall and the end syscall". It complicates the application too, since now you have to keep ownership of a buffer but you aren't allowed to touch it.

AIX and Solaris apparently also support completion-based APIs, but I've never seen anyone actually run these OSes.

(aside, `poll` is the easiest API to use for just a few file descriptors, and `select` is more flexible than it appears if you ignore the value-based API assumptions and do your own allocation)
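For the "just a few file descriptors" case, a minimal sketch with Python's `select.poll` (Unix only; `wait_readable` and `demo` are hypothetical helper names):

```python
import select
import socket


def wait_readable(socks, timeout_ms: int = 1000):
    """Return the subset of `socks` that poll() reports as readable."""
    poller = select.poll()
    by_fd = {s.fileno(): s for s in socks}
    for fd in by_fd:
        poller.register(fd, select.POLLIN)
    return [by_fd[fd] for fd, ev in poller.poll(timeout_ms) if ev & select.POLLIN]


def demo() -> bool:
    a, b = socket.socketpair()
    c, d = socket.socketpair()
    try:
        a.sendall(b"ping")  # only b has data waiting; d stays idle
        ready = wait_readable([b, d], timeout_ms=100)
        return ready == [b]
    finally:
        for s in (a, b, c, d):
            s.close()
```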


Edge-triggered requires an extra read/write on every epoll relative to level-triggered though, because you must exactly trigger reading the error state (EAGAIN), so it actually can be much slower (libuv considered switching at one point, but it wasn’t clear the extra syscalls required by edge triggering were worthwhile)
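The extra syscall is easy to see in a sketch. Assuming a non-blocking socket in Python, where EAGAIN surfaces as `BlockingIOError`, an edge-triggered reader has to keep calling `recv` until it fails. The helper names are made up.

```python
import socket


def drain(sock: socket.socket, chunk: int = 4096):
    """Read a non-blocking socket until EAGAIN; return (data, recv_calls)."""
    data = bytearray()
    calls = 0
    while True:
        calls += 1
        try:
            part = sock.recv(chunk)
        except BlockingIOError:  # EAGAIN/EWOULDBLOCK: the socket is now empty
            break
        if not part:  # peer closed the connection
            break
        data += part
    return bytes(data), calls


def demo():
    a, b = socket.socketpair()
    try:
        b.setblocking(False)
        a.sendall(b"hello")
        return drain(b)
    finally:
        a.close()
        b.close()
```

Here five bytes cost two `recv` calls: one that returns the data and one that exists only to observe EAGAIN.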


Only on reads. For writes you always want to loop until the kernel buffer really is full (remember the kernel can do I/O while you're working). Writes, incidentally, are a case where epoll is awkward since you have to EPOLL_CTL_MOD it every single time the buffer empties/fills (though you should only do this after a full tick of the event loop of course ... but the bursty nature means that you often do have to, thus you get many more syscalls than `select`).
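A rough sketch of that write path, assuming Linux and Python's `select.epoll`, where `epoll.modify` is the EPOLL_CTL_MOD call. The helper names and the 4 MiB payload size are arbitrary choices for the demo.

```python
import select
import socket


def write_until_full(sock, ep, payload: bytes) -> int:
    """Send until the kernel buffer is full; return the bytes accepted.

    On EAGAIN, add EPOLLOUT via epoll.modify (EPOLL_CTL_MOD) so the event
    loop is woken when there is room again.
    """
    view = memoryview(payload)
    sent = 0
    while sent < len(view):
        try:
            sent += sock.send(view[sent:])
        except BlockingIOError:  # kernel send buffer really is full
            ep.modify(sock.fileno(), select.EPOLLIN | select.EPOLLOUT)
            break
    return sent


def demo():
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    ep = select.epoll()
    try:
        ep.register(a.fileno(), select.EPOLLIN)  # no write interest yet
        payload = b"x" * (4 * 1024 * 1024)  # assumed larger than the socket buffer
        sent = write_until_full(a, ep, payload)
        # While the buffer is full, EPOLLOUT is not reported.
        blocked = not any(ev & select.EPOLLOUT for _fd, ev in ep.poll(timeout=0))
        # The peer drains everything, which frees space in the send buffer.
        received = 0
        while True:
            try:
                received += len(b.recv(65536))
            except BlockingIOError:
                break
        writable = any(ev & select.EPOLLOUT for _fd, ev in ep.poll(timeout=0.5))
        # Once the pending write is flushed, a second MOD drops EPOLLOUT again.
        ep.modify(a.fileno(), select.EPOLLIN)
        return sent, received, blocked, writable
    finally:
        ep.close()
        a.close()
        b.close()
```

Each fill/empty cycle costs two `modify` calls on top of the sends, which is the overhead being described.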

Even for reads, there exist plenty of scenarios where you will get short reads despite more data being available by the time you check. Though ... I wonder if deferring that and doing a second pass over all your FDs might be more efficient, since that gives more time for real data to arrive again?


True, I don’t remember the details for writes, and the complexity of managing high/low water marks makes it even trickier for optimal code. And large kernel send buffers mostly avoid the performance problem here anyways. But on a short write, I am not sure I see the value in testing for EAGAIN over looping through epoll and getting a fresh set of events for everything instead of just this one fd.

Right, for reads, epoll will happily tell you if there is more data still there. If the read buffer size is reasonable, short reads should not be common. And if the buffer is huge, a trip around the event loop is probably better at that point to avoid starvation of the other events


> Completion based APIs are superior IMO

Perhaps it's just that I cut my teeth on the classic Mac OS and absorbed its way of thinking, but after using its asynchronous, callback-driven IO API, the multithreaded polling/blocking approach dominant in the Unix world felt like a clunky step backward. I've been glad to see a steady shift toward asynchronous state machines as the preferred approach for IO.


> Completion based APIs are superior IMO

I probably agree with that, but curious to know what your reasons are.



