You appear to know some people and organizations who have solved the problem. What I am saying is that, other than GitHub, you apparently don't know about the multitude of other organizations that haven't solved the problem yet, and you seem to believe, falsely, that they don't exist. (And yet you also claim to know people from Percona and Pythian; what do they tell you they do for a living?)
As I said, a lot of the problems organizations have are not publicly disclosed, and the people who know aren't authorized to talk about it. So I'm sorry, you're not going to get the examples you seek here. Companies like to keep their weaknesses close to their chest.
> i am NOT saying that github can hand wave this away with money and hiring. you misunderstand. what i am saying is company of their business size and scale should have already handled this long ago,
If not through money and hiring, how should they have accomplished it? You propose a goal but no actionable plan. That's worth diddly squat, both in engineering and in business.
i have personally been on teams that worked on and solved this problem, across multiple companies, with larger DB sizes than github's.
if you think Percona and Pythian implement sharding solutions, you are deeply mistaken; that is not what they focus on at all. advice, sure, mostly on the perf and ops side. but not sharding implementation, and definitely not the application side of things.
furthermore i am saying that if other major companies were having this type of issue, everyone in the public would know about it! because the product/site/company would be down all the time. this isn't something you can hush up. close to the chest? how do you keep daily outages close to the chest? completely absurd
there is literally no analog in the US tech world to a large 14-year-old company having daily outages for a week, preceded by multiple outages per month consistently for the past year+.
i will stop replying now because it is clear we live in different realities or smth
Somehow you are very confident that this github issue can be avoided by horizontal sharding? I am not so sure about that. If my hypothesis is right (see my other comment in this HN discussion), then it can be rather trivial to bring down any MySQL instance or shard, such that the end result would be similar regardless of horizontal sharding.
I would also back otterley's position that there exist many other companies with larger non-sharded MySQL instances than github's. They may or may not have better reliability than github.
innodb global lock architecture has improved greatly since 5.1. that ancient bug report from harrison is not indicative of the problem here. what, you think facebook has had unsolvable database cpu pileups for 12 years straight? harrison would not have been promoted to vice president if that was acceptable in his shop
"They probably outsource the lock queuing logic to their Memcached layer." you are grasping at straws. the wrong straws. go attend some facebook conf talks, meet their engineers, read some arch papers, and stop speculating this nonsense. largest db tier there does not even use memcached, hasn't for 9 years. read the TAO paper!
horizontal sharding works great for many many companies with databases several orders of magnitude larger than github, why would it not work for github?
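to be concrete about what horizontal sharding means in practice, here is a minimal sketch (the table/column names and shard count are hypothetical, not github's actual setup): pick a shard deterministically from a tenant key such as the repository id, so each single-repo transaction stays on one node.

    # Minimal illustration of hash-based horizontal sharding (hypothetical
    # schema; github's real partitioning scheme is not public).
    import hashlib

    SHARD_DSNS = [
        "mysql://db-shard-0/app",
        "mysql://db-shard-1/app",
        "mysql://db-shard-2/app",
        "mysql://db-shard-3/app",
    ]

    def shard_for(repo_id: int) -> str:
        """Pick a shard deterministically from the sharding key."""
        digest = hashlib.md5(str(repo_id).encode()).hexdigest()
        return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

    # All rows for one repository live on one shard, so single-repo
    # transactions stay single-node; cross-shard joins move to the app layer.
    print(shard_for(42))

the routing function is the trivial part; the multi-year work is everything around it: resharding, cross-shard queries, and moving the joins into the application.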
you are also saying there may be large companies with lower reliability than github? ok name them. again, if this was the case, everyone would know! their availability at this point is like what, 1 nine?
> innodb global lock architecture has improved greatly since 5.1. ancient bug report from harrison is not indicative of problem here.
No, this performance cliff does still exist in the current mysql 8.0 branch. The Contention-Aware Transaction Scheduling added to 8.0 doesn't help the queuing problem with a hot row at all. So facebook either doesn't suffer from the problem (e.g. all their writes are append-only, so there is no hot row) or prevents it from happening (e.g. with API rate limiting built on top of something like memcached).
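To be concrete about that second option, the usual pattern is a fixed-window counter keyed on the hot entity, checked before the write ever reaches InnoDB. A rough sketch, with made-up key names and limits; a plain dict stands in for memcached so it runs as-is (a real memcached key would just carry a short TTL):

    # Sketch of fixed-window rate limiting in front of a hot-row write.
    # A real deployment would keep the counter in memcached/redis so every
    # app server shares it; a dict stands in here so the example runs as-is.
    import time

    WINDOW_SECONDS = 1
    MAX_WRITES_PER_WINDOW = 100
    _counters: dict[str, int] = {}

    def allow_write(hot_key: str) -> bool:
        """Return True if this write may proceed, False if it should be shed or queued."""
        window = int(time.time()) // WINDOW_SECONDS
        bucket = f"{hot_key}:{window}"
        _counters[bucket] = _counters.get(bucket, 0) + 1
        return _counters[bucket] <= MAX_WRITES_PER_WINDOW

    if allow_write("repo:123:stars"):
        pass  # issue the UPDATE against MySQL
    else:
        pass  # back off / enqueue / return 429 instead of piling onto the row lock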
The improvement in the innodb lock architecture mostly comes from splitting locks into more granular levels. However, updates to the same hot row still contend on the same record lock, which is already as granular as locking can get. So the tag on the bug report is correct: this is a problem of "lock queuing".
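For anyone who hasn't watched this happen, it is easy to reproduce against a throwaway database: every concurrent transaction updating the same hot row waits behind the same lock, so adding connections only deepens the queue and lengthens commit latency. A sketch, with placeholder connection parameters and a hypothetical counters table:

    # Reproduces the hot-row pileup: many threads updating the same row all
    # queue on one InnoDB lock. Connection parameters and table name are
    # placeholders for a throwaway test database.
    import threading
    import mysql.connector

    def hammer(row_id: int, iterations: int) -> None:
        conn = mysql.connector.connect(
            host="127.0.0.1", user="test", password="test", database="test"
        )
        cur = conn.cursor()
        for _ in range(iterations):
            cur.execute("UPDATE counters SET n = n + 1 WHERE id = %s", (row_id,))
            conn.commit()  # each commit releases the lock for the next waiter
        conn.close()

    # 64 threads, one shared row: throughput is bounded by the single lock,
    # and lock-wait timeouts start appearing under real load.
    threads = [threading.Thread(target=hammer, args=(1, 1000)) for _ in range(64)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()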
Not every organization that is behind the 8-ball on its MySQL scalability projects, or that is suffering incidents as a result, is making headlines. That's just a fact, and you're just going to have to accept that you don't know everything.
And frankly, given your attitude on this, I think you’re going to need to present some bona fides for anyone to take you seriously. Maybe you can even convince GitHub to hire you and solve their problems. I bet they’d be happy to listen to you berate them on how pathetic they are.
i was not even the commenter who said they were "pathetic". sheesh, now you are attributing others' comments to me!
i have posted many correct technical details in various subthreads of this post. i don't care if you believe me; i do not live in your reality where constant downtime is totally fine and major companies are down all the time but no one knows about it