- The team was not able to sufficiently diagnose the bad query program, and/or devise emergency mitigations, by 1400 UTC March 17 (24h after the first incident).
- They were not able to reduce the load on the db more quickly (e.g., cranking up rate limiting, shutting down async or noncritical services and features, etc.). At $prevjob we had the ability to do things like this at the push of a button, and generally would have done so within 90 minutes of incident onset to help systems heal. Here they did eventually figure out they could throttle webhooks, but it took several days of repeated incidents.
- They did not see connections creeping up towards their limits and have emergency mitigations prepared in case of spikes like this.
- The simple failover described on March 22nd (for example) took almost 3h to complete.
- They decided to enable profiling during the approximate time window of their load spikes, without sufficient spare resources to absorb the overhead.
- They leapt to fail over their db when the proxy had issues, when it seems the failover can take multiple hours to successfully complete (??).
- They did not have sufficient sampled trace/log data to form a hypothesis without this profiling.
- New services like Packages and Codespaces rely on the `mysql1` primary db.
Certainly I have much empathy for the responding teams – I have scars from responding to some very painful degradations myself – but as a Github customer this does leave me concerned.
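For what it's worth, the push-button mitigation described above is often just a shared feature flag consulted in the hot path. A minimal sketch, with all names hypothetical and the flag store simplified to in-process memory (a real deployment would back it with a shared config service so flipping a flag takes effect fleet-wide without a deploy):

```python
import threading

class KillSwitches:
    """In-memory registry of load-shedding flags (illustrative only)."""
    def __init__(self):
        self._flags = set()
        self._lock = threading.Lock()

    def enable(self, name):
        with self._lock:
            self._flags.add(name)

    def disable(self, name):
        with self._lock:
            self._flags.discard(name)

    def is_enabled(self, name):
        with self._lock:
            return name in self._flags

switches = KillSwitches()

def deliver_webhook(payload):
    # Shed noncritical work first when the primary DB is struggling;
    # dropped deliveries could instead be queued for later replay.
    if switches.is_enabled("shed_webhooks"):
        return "dropped"
    return "delivered"
```

Flipping `shed_webhooks` on during an incident sheds that load immediately, which is the "within 90 minutes" posture described above.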
Former tech lead on Packages here (no longer at GitHub).
> New services like Packages and Codespaces rely on the `mysql1` primary db.
A lot of this is due to that DB handling core data to GitHub (notably organization and repository data), which the integrated nature of the new feature offerings forces them to interact with (syncing permissions from repos for packages, publishing from actions, etc.). The link between repos and codespaces is even more unavoidable.
We were careful with Packages to keep as few service dependencies as possible in the critical path, especially for things like the anonymous read path, so serving packages to users of open source projects or package managers such as Homebrew is as insulated as it can be (and I suspect was unaffected by this incident).
But at the end of the day, there is some data that is central to most everything at the company, unfortunately.
Thanks, that makes sense. I would have imagined that most of what you describe could hit the `mysql1` replica (not primary) but certainly it's imaginable that this would not be possible or wise in all cases.
And of course it's worth applauding that I think most read traffic (whether to Packages or other services) worked just fine through these incidents, if I understand correctly.
this sounds backwards. i’ve worked at multiple places with the one big legacy mysql db. all of them had a blanket rule: no new features live in this db. whatever complications that creates, you design around them
Live in the DB and use data from the DB are two separate things. Packages has its own dedicated MySQL cluster for all package related info. Auth also lives in a separate database. But it still needs to access org or repo information in many use cases.
I've not been part of designing systems at that scale (yet!) but isn't event sourcing kinda envisioned for just these kinds of problems in mind?
Like you would have all of that "core" data as Kafka topics and can safely interact with them without affecting the core services?
I know the answer is always "legacy" and "it's been designed that way from the beginning", but I was wondering: what do you think would have been the right way to design something like this to mitigate the risk of current/future problems?
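To make the event-sourcing idea concrete, here's a toy sketch: a service keeps its own read model of core repo data by applying change events, with a plain list standing in for a Kafka topic (event shapes are invented for illustration):

```python
# Change events as they might appear on a "core data" topic (hypothetical shapes).
events = [
    {"type": "repo_created", "repo_id": 1, "org": "acme", "visibility": "private"},
    {"type": "repo_made_public", "repo_id": 1},
    {"type": "repo_created", "repo_id": 2, "org": "acme", "visibility": "public"},
]

class RepoReadModel:
    """Local copy of repo data, built purely from the event stream, so
    reads never touch the core database."""
    def __init__(self):
        self.repos = {}

    def apply(self, event):
        if event["type"] == "repo_created":
            self.repos[event["repo_id"]] = {
                "org": event["org"], "visibility": event["visibility"]}
        elif event["type"] == "repo_made_public":
            self.repos[event["repo_id"]]["visibility"] = "public"

model = RepoReadModel()
for e in events:
    model.apply(e)
```

The tradeoff is eventual consistency: the local copy can lag the source of truth, which is presumably one reason synchronous reads against the core DB are hard to avoid for things like permission checks.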
They still could have designed around it. At my current job, all of our services are in separate clusters, with none of them largely interdependent on the others. It's possible our front-end could be bottlenecked, but as the front-end is read-only, that is an easier problem to fix.
While we obviously have our own unique issues, this could never happen (in its entirety) at our company.
Separate clusters? That was not the issue here. It makes complete sense that a GitHub feature would have to interact with their core organization data/service. It’s crazy that you immediately assume that you’re smarter than the actual tech lead on the project.
To be fair, financial symbology is its own hellish rabbit hole. $prevjob is probably a symbol in someone's pet obscure symbology schema for some stock I've never heard of traded in a country I don't even know exists.
Most NASDAQ tickers are 4 upper case letters. Most NYSE tickers are 3 upper case letters. I've never seen a 7 letter lower-case ticker. Tickers aren't unique across exchanges, so you need a pair of <ticker, exchange> to avoid ambiguity, but then you have to deal with shares being fungible across primary and secondary exchanges. CUSIPs aren't memorable and only apply to North American securities. ISINs are even less memorable. CUSIPs and ISINs include Luhn check-digits, which help catch typos. RICs are pretty good, except that they're copyrighted by Reuters, so if you build your systems around them, you may be obligating yourself to buy their market data services. Conversions between different types of symbols are sometimes many-to-one. (For instance composite RICs like AAPL.OQ and their primary exchange RICs like AAPL.O would have the same CUSIP and ISIN.)
In most contexts, you can get away with treating tickers case-insensitively, except (last I checked) 1 pair of case-colliding tickers on the Toronto exchange and 3 colliding pairs of tickers on the Bangkok exchange.
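The check digit mentioned above is a modified Luhn; a sketch of the CUSIP version, assuming the standard character mapping (digits map to themselves, letters A-Z to 10-35, and `*`, `@`, `#` to 36-38):

```python
def cusip_check_digit(base8):
    """Compute the 9th (check) digit for an 8-character CUSIP base."""
    assert len(base8) == 8
    total = 0
    for i, ch in enumerate(base8):
        if ch.isdigit():
            v = int(ch)
        elif ch.isalpha():
            v = ord(ch.upper()) - ord("A") + 10
        else:
            v = {"*": 36, "@": 37, "#": 38}[ch]
        if i % 2 == 1:          # double every second character
            v *= 2
        total += v // 10 + v % 10   # sum the digits of each value
    return (10 - total % 10) % 10

# Apple's CUSIP is 037833100; the trailing 0 is the check digit.
assert cusip_check_digit("03783310") == 0
```

Validating the check digit at the edges of a system catches most typos before they propagate.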
If you're dealing with financial symbols, implement them as an interface with methods to perform conversions, fungibility checks, etc. Don't pass around financial symbols as strings if you can help it. At my previous job, we had our own internal hierarchical symbology that covered rates, credit, currencies, commodities, equities, and more. Unfortunately, some systems pass these around as underscore-delimited case-insensitive strings. Externally, some symbology conventions distinguish day count conventions case-sensitively with an m or M suffix, which then had to be translated to ^M and M, respectively (and confusingly, since ^ looks a bit like an up arrow but marks the lower-case version). They've luckily replaced most usages of the hierarchical symbols with objects that pretty-print to the older textual representation and have constructors that take the older textual representation.
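A minimal sketch of the interface approach, with invented method names and a hard-coded RIC suffix table (real RIC data is licensed, so this is purely illustrative):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

class Symbol(ABC):
    """Pass these around instead of raw strings; parsing and
    pretty-printing happen only at the edges of the system."""
    @abstractmethod
    def to_ric(self) -> str: ...
    @abstractmethod
    def is_fungible_with(self, other: "Symbol") -> bool: ...

@dataclass(frozen=True)
class Equity(Symbol):
    ticker: str
    exchange: str   # the <ticker, exchange> pair avoids cross-exchange ambiguity
    isin: str

    def to_ric(self) -> str:
        # Toy suffix mapping for illustration only.
        suffixes = {"NASDAQ": ".O", "NYSE": ".N"}
        return self.ticker + suffixes.get(self.exchange, "")

    def is_fungible_with(self, other: "Symbol") -> bool:
        # Shares on primary and secondary exchanges share an ISIN.
        return isinstance(other, Equity) and self.isin == other.isin
```

Because the objects are typed and comparable, the many-to-one conversions and fungibility checks live in one place instead of in ad hoc string parsing scattered through the codebase.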
(For more than a decade, I maintained/improved the functional reactive time series subset of a domain-specific financial programming language and integrated globally replicated distributed NoSQL DB for a large global financial firm. By default strings are treated case-insensitively. The case insensitivity helped lazy typists when the system was first used in New York and London, but is now illiquid technical debt. The ability to put spaces in variable names feels odd at first, but you quickly get used to it. The pain of case insensitivity by default never goes away, especially when someone from New York or London modifies symbology code. On the other hand, the lazily evaluated dataflow subset of the language with close database integration is really elegant, and it deals with collections much more elegantly and uniformly than Java/C++/Python/JavaScript.)
There should be a special badge you get to wear forever once you have been In Charge of Symbology at some point in your life..
In a previous job, after a few years I settled on (Type, Expiration, ISIN, Listing CCY, Listing MIC, Trading CCY, Trading MIC, Fudge) as The Tuple that could handle almost everything…. Once in a while I would still get a messed up case of non-uniqueness, that’s what the “Fudge” is for :)
I fell into this rabbit hole when trying to just write some hobby stock tracking stuff, it sucks. ISINs seemed decent but I couldn't find an open source of data about them. Lots of legacy fragmentation in this space and everyone is trying to take your money if you so much as look at them
Don't be. I just explained this on HN a few days ago: it's why I can't handle Perl and PHP. My brain just can't register the $ symbol without going into Finance or Stock mode.
My guess is that part of the issue could be lock contention creating unexpected interactions between multiple simultaneous transactions. Issues like that can be really hard to pin down since there isn't a single transaction causing the problem; instead the load is the result of idiosyncratic interactions between multiple different transactions.
This reminds me of something I often forget about. The value of GitHub to Microsoft may be significant brand-wise, but it might not be a huge revenue source comparatively. Even if it were a huge revenue source, most of the revenue might still be shuffled off to the corporate behemoth to cover its balance sheet rather than reinvested in improving the product. Meanwhile the actual value of GitHub to organizations around the world is probably a couple of orders of magnitude higher than its actual revenue.
At some point it's actually cheaper for a coalition of international organizations to fund inventing a backwards-compatible GitHub replacement that is very resilient to failure, rather than wait for GitHub to get a measly enough budget increase from Daddy Micro$oft to shore up their legacy MySQL database.
GitHub is becoming a huge revenue source for them FYI. They are stealing a TON of business from competitors. Shoot, if Microsoft had a real competitor for Jira, they would probably steal even more business.
> if Microsoft had a real competitor for Jira, they would probably steal even more business.
But they have a real competitor for Jira. Check out the recently beefed up GitHub Projects, my previous (scrum-ish) team was running on it and we've liked it more than Jira. Much simpler, but just enough for a dev.
Didn't check out whether it's available on enterprise github though (i.e. local instances).
Our internal monitoring has seen more outages than they listed here. There have been 4 full days where GitHub Actions has been somewhere between completely broken and degraded.
It's nice to finally get some comms, but this is incredibly late and incomplete.
GH Actions felt so fundamentally janky and under-documented when I last looked at it (October?) that this doesn't come as a surprise. I really can't foresee using it with requirements any stricter than "nice when it works."
Yup, same here. We had periods of more than 8 hours (during Asia daytime / US nighttime) where we couldn't push or pull, which were not mentioned on their status page at all.
I would not be surprised to learn that they have similar procedures to one of my past employers, where exec signoff was needed to even publish an update to the service status dashboard...
I mean, they have $MEGABUCKS, they could probably get 1/2 the team who maintains mariadb to come in and work for them if they wanted, and they still have a giant single db node doing writes and struggle to fail it over.
We're doomed >_<
You would think it wouldn't be THAT hard to shard something like GitHub effectively.
I mean, all user accounts/repos starting with the letter 'a' go to the 'a' cluster and so on seems not exactly science-fiction levels of technology.
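A sketch of that routing scheme next to a hash-based one (shard count invented): routing by first letter is trivial to implement but skews badly, since far more names start with 's' than 'x', while hashing spreads keys evenly at the cost of range locality:

```python
import hashlib

def shard_by_letter(name):
    """The scheme described above: 'a' names to the 'a' cluster, etc.
    Simple, but shard sizes end up very uneven."""
    return name[0].lower()

NUM_SHARDS = 64  # assumed cluster count, purely illustrative

def shard_by_hash(name):
    """Deterministic hash routing: even distribution across shards."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Either way, the routing function is the easy part; the hard part is that every cross-shard query or join the application used to get for free from one big database becomes application-level work.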
it's profoundly strange that github has not properly sharded yet. essentially all large social networks are or have used sharded mysql successfully, this is not rocket science
livejournal, facebook, twitter, linkedin, tumblr, pinterest all use (or formerly used) sharded mysql and most of these are at larger db size than github
i will also repeat my comment from another recent thread: i just cannot understand how 20+ former github db and infra people recently left to join a db sharding company. this makes no sense whatsoever in light of github's lack of successful sharding. wtf is going on in the tech world these days
> i just cannot understand how 20+ former github db and infra people recently left to join a db sharding company. this makes no sense whatsoever in light of github's lack of successful sharding.
I believe you have the chain of causality backwards here. In fact, I think it suggests that talent that went to planet scale is perhaps not the issue.
i'm not suggesting that these departures are the cause of github's issue. rather, i'm saying i don't understand why such a large group from github was hired by planetscale if they did not have experience successfully sharding or successfully leveraging vitess
this is like if you were building a high-rise condo, would you hire the architects or management company from the building that collapsed in surfside florida? sure, they know what NOT to do next time, but that doesn't mean they do know what TO do
Hi, former GitHubber here and CEO of PlanetScale. I came to PlanetScale after seeing the incredible impact that Vitess had at GitHub. The team we have hired here successfully sharded parts of GitHub's very large platform. We continue to shard very large customers at PlanetScale.
Platform migrations take a very long time and are very complicated, especially with decade-old codebases. I will say the current team at GitHub are nothing but outstanding people and engineers, with the difficult task of managing a very large deployment.
I understand what you’re getting at… but why is it you presume that the employees now at planetscale are the reason GH couldn’t shard out their databases?
Like, there’s another angle here: management, yeah?
Another way of reframing it is “maybe the folks hiring at planetscale know the inside baseball about GH infrastructure”. For example: https://www.linkedin.com/in/isamlambert
this is my exact point: they hired the former head of GH infrastructure - literally the person directly responsible for all this at github for years - and made him their ceo
github should have sharded years ago, every other large mysql user did so much earlier in their growth trajectory
This is just a gross simplification of the situation. You're commenting from the outside without much context. GitHub is a very complex 14-year-old Rails app; large migrations of database platforms can be very difficult and take a long time.
Blaming software/database complexity as an excuse for why GitHub hasn't done so is extremely pathetic. Systems almost always get more complex over time, not less.
There's a lot of people on HN, including myself, that have scaled MySQL beyond what we thought was possible.
The hard part isn't solving the scaling problem, but the uncomfortable conversations with leadership about how chunks of roadmap will be drastically delayed. Inevitably, leadership makes a decision to kick the can down the road, further exacerbating the problem.
My experience is that people who know how challenging big, complex changes can be know better than to trivialize their difficulty when discussing them.
so i would agree when talking about comments like "i could code twitter in a weekend! why so many engineers there!"
but this is different. github is a 14 year old company, with annual revenue in the hundreds of millions USD
I literally cannot think of any other comparable size and age mysql user who has not successfully sharded long ago and avoided outages of this magnitude. and in return we get hand-wavy excuses of "complexity!!!" from their former vice president of engineering who was previously also their first DBA.
criticizing these excuses is not trivializing the complexity, it's more of a "we all did this at our respective companies, who also had a lot of complexity! why can't you? we are github users, we rely on you and are very unhappy, we want to know how this happened" and just getting excuses as an answer.
> I literally cannot think of any other comparable size and age mysql user who has not successfully sharded long ago and avoided outages of this magnitude.
How about a little humility? You have no idea who else has similar problems out there. I'm sure Percona and other DB consulting companies would tell you otherwise, as would PlanetScale. (If this weren't true, they wouldn't be viable businesses.)
> github is a 14 year old company, with annual revenue in the hundreds of millions USD
You're making the classic mistake of thinking that having money means you can just wave your hand and hire whoever you want with whatever talent you need to solve your problems. The world just doesn't work that way.
As for the "hand-wavy" excuses, well, the people who know the issues don't need to disclose the dirty details in a public forum, nor are they often able to because of NDAs and other legal encumbrances. And it can be a career-limiting move to throw your colleagues under the bus.
You can choose to assume people are incompetent, or instead choose to assume that people are working as best they can under the constraints placed on them. I think it's better to assume the latter.
I do in fact know who else would have had similar problems, directly from personal experience and experience of coworkers and friends. large-scale mysql community is fairly insular, lots of talent rotation between the large scale companies and between percona and pythian, everyone knows everyone else, everyone talks war stories at the conference bar and elsewhere. but thanks for making assumptions once again!
i am NOT saying that github can hand wave this away with money and hiring. you misunderstand. what i am saying is company of their business size and scale should have already handled this long ago, because that's what literally every other large comparable company has done.
there are front-page-of-HN threads about multi-hour github outages every single day this week. show me an example of another similar-size company having an equivalent meltdown please. only real equivalent is twitter during fail-whale days when they were only a few years old. they solved it early on, as should have been done.
i am not saying the entirety or even majority of github is incompetent, but i am saying there is clearly something extremely wrong and extremely unusual that led to this, and citing "complexity!" is just a pile of BS.
You appear to know some people and organizations who have solved the problem. What I am saying is that, other than GitHub, you apparently don't know the multitude of other organizations that haven't solved the problem yet, and you appear to falsely believe that they do not exist. (And yet you also claim to know people from Percona and Pythian; what do they tell you they do for a living?)
As I said, a lot of the problems organizations have are not publicly disclosed, and the people who know aren't authorized to talk about it. So I'm sorry, you're not going to get the examples you seek here. Companies like to keep their weaknesses close to their chest.
> i am NOT saying that github can hand wave this away with money and hiring. you misunderstand. what i am saying is company of their business size and scale should have already handled this long ago,
If not through money and hiring, how should they have accomplished it? You propose a goal but no actionable plan. That's worth diddly squat, both in engineering and in business.
i have personally been on teams that solved this problem and worked on this problem. across multiple companies. with larger DB size than github's.
if you think Percona and Pythian implement sharding solutions, you are deeply mistaken, this is not what they focus on at all. advice, sure. more on the perf and ops side. but not sharding implementation, and definitely not application side of things.
furthermore i am saying if other major companies were having this type of issue, everyone in the public would know about it! because the product/site/company would be down all the time. this isn't some thing you can hush hush. close to the chest? how do you keep daily outages close to the chest? completely absurd
there is literally no analog in the US tech world to a large 14 year old company having daily outages for a week, preceded by multiple outages per month consistently for the past year+.
i will stop replying now because it is clear we live in different realities or smth
Somehow you are very confident that this github issue can be avoided by horizontal sharding? I am not so sure about that. If my hypothesis is right (see my other comment in this HN discussion), then it can be rather trivial to bring down any MySQL instance or shard, such that the end result would be similar regardless of horizontal sharding.
I would also back otterley's position that there exist many other companies with larger non-sharded MySQL instances than github's. They may or may not have better reliability than github.
innodb global lock architecture has improved greatly since 5.1. the ancient bug report from harrison is not indicative of the problem here. what, you think facebook has had unsolvable database cpu pileups for 12 years straight? harrison would not have been promoted to vice president if this was acceptable in his shop
"They probably outsource the lock queuing logic to their Memcached layer." you are grasping at straws. the wrong straws. go attend some facebook conf talks, meet their engineers, read some arch papers, and stop speculating this nonsense. largest db tier there does not even use memcached, hasn't for 9 years. read the TAO paper!
horizontal sharding works great for many many companies with databases several orders of magnitude larger than github, why would it not work for github?
you are also saying there may be large companies with lower reliability than github? ok name them. again, if this was the case, everyone would know! their availability at this point is like what, 1 nine?
> innodb global lock architecture has improved greatly since 5.1. ancient bug report from harrison is not indicative of problem here.
No, this performance cliff does still exist in the current mysql 8.0 branch. The Contention-Aware Transaction Scheduling added to 8.0 doesn't help the queuing problem with hot row at all. So facebook either doesn't suffer from the problem, e.g. all their writes are append-only so there is no hot row, or prevents the problem from happening, e.g. with API rate limiting built on top of something like memcached.
The improvement in innodb lock architecture mostly comes from splitting the locks into more granular levels. However, updates to the same hot row would require the same lock at the page-level. It is already as granular as it can be. So the tag of the bug report is correct, this is a problem of "lock queuing".
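The append-only workaround mentioned above can be sketched like this: instead of repeated `UPDATE`s against one hot counter row, each write inserts its own row and reads aggregate (SQLite used as a stand-in; in production a background job would periodically roll the event rows up into a summary):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE counter_events (counter_id INTEGER, delta INTEGER)")

def increment(counter_id, delta=1):
    # Append-only: each increment is its own INSERT, so concurrent
    # writers never queue on a single hot row the way repeated
    # UPDATEs to it would.
    db.execute("INSERT INTO counter_events VALUES (?, ?)", (counter_id, delta))

def read_counter(counter_id):
    (total,) = db.execute(
        "SELECT COALESCE(SUM(delta), 0) FROM counter_events WHERE counter_id = ?",
        (counter_id,)).fetchone()
    return total

for _ in range(5):
    increment(1)
```

The cost is that reads do more work and the event table grows, which is why the pattern is usually paired with periodic compaction.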
Not every organization that is either behind the 8-ball on their MySQL scalability projects or who is suffering incidents as a result is making headlines. That’s just a fact, and you’re just going to have to accept that you don’t know everything.
And frankly, given your attitude on this, I think you’re going to need to present some bona fides for anyone to take you seriously. Maybe you can even convince GitHub to hire you and solve their problems. I bet they’d be happy to listen to you berate them on how pathetic they are.
i was not even the commenter who said they were "pathetic", sheesh now you are attributing others' comments to me!
i have posted many correct technical details in various subthreads of this post. i don't care if you believe me, i do not live in your reality where constant downtime is totally fine and major companies are down all the time but no one knows about it
This is a super, super annoying type of comment, and against HN guidelines. It adds absolutely nothing to the conversation. Don't go digging through people's comment history to throw mud.
i did not dig through his comment history. i was directly participating in that thread i quoted! and it was a recent thread. please check it again.
i agree though my comment could have been nicer. apologies.
as to adding nothing to the thread: disagree. he literally dumped on AWS with false claims. how do you know i don't work at AWS? why is it ok for him to say stuff like that?
Let these types live in their supreme know-it-all fantasy world, they're not worth a thought. Anyone with a little common sense will see through their shenannigans and proceed to ignore them the way you'd ignore an ant on the sidewalk as you go about your busy day.
Don't take it personally, their ignorance has nothing to do with you.
The issue isn't just a lack of "niceness" but also how this type of comment derails the conversation into an offtopic argument which should have been pursued only in the original context.
If you disagree with the particular assertions here (that you are grossly oversimplifying without context), you should address them directly. Saying "it's fine because you do it too" isn't so much a lack of "niceness"; rather, this type of comment unproductively derails the conversation into an offtopic argument while adding zero constructive discussion.
> offtopic argument which should have been pursued only in the original context
I would have gladly had that discussion in the original context! but he did not reply to me there, nor to any of the other 6 people who replied to that horrendously incorrect comment of his
i would have let it go, but yet here he calls me out saying my own comment is grossly oversimplified, but he offered little explanation of his perspective. if i am understanding his brief argument it is:
1) github is old.
2) sharding is hard.
ok so here is my response: it was a lot less old when he started working there in 2013. many other companies have sharded several years after launch. yes it is hard. i have directly worked on teams sharding larger DB's than github's, sharding legacy apps. and so forth. it can be done.
it is really hard to justify github to be down this many times from the same root cause. year after year
> I would have gladly had that discussion in the original context! but he did not reply to me there, nor to any of the other 6 people who replied to that horrendously incorrect comment of his
I think your response in that thread was sufficient. If you feel it was insufficient, that thread would be the correct place to continue the discussion or further clarify any misinformation where people may actually see it.
People rarely admit when they are wrong, and you don't have a right to (or even a reasonable expectation of) any such admission in this context. Harassing them about it in other threads is not a positive contribution to the discussion or the community.
he accused me of making a gross simplification. i responded, perhaps unkindly. but at no point have i "harassed" him.
github's outages have made my life difficult. planetscale's pricing scheme has also burned me. i am not just attacking this guy at random.
based on upvote points i can conclude many people do think i am contributing positively to this discussion. please refrain from tone-policing me. i don't see any technical comments from you in this discussion whatsoever so how exactly are your comments positively contributing?
i did so because it is directly relevant to this topic!
the person who for many years was directly responsible for github's databases, and later their overall infrastructure, recently made a horrendously inaccurate statement about the industry's most popular managed database offering
this shows either a severe lack of technical understanding, or a willingness to make deliberate technical misstatements
now github is down literally every single day due to database problems. it is very relevant! accountability is important!
Just because you have money doesn’t mean you can hire anyone you want to. Talent is scarce, especially in highly-specialized roles, and people have free will.
I think the complexity of GitHub’s data management requirements would surprise you. Better to withhold judgment until you possess all the facts.
My fear is that this seems like a cover excuse for moving off MySQL. The bug will be too hard and they’ll move off. They will choose SQLServer and take a lot of time to convert and then have even more outages.
100% agreed. a lift-and-shift migration to Galera and modern MariaDB wouldn't be hard, but knowing MS there are mid-level managers waiting in the wings to swoop in and drive this into the ground with azure/sqlserver, the former of which posted 8 outages in the past 90 days alone.
this is classic Microsoft. spend a ton of money for something very valuable -- in this case virtually all developer marketshare -- and then casually run it into the ground while you lie about the KPI's to C levels (IIS marketshare on netcraft as a function of parked websites at GoDaddy to dominate over Apache) and keep it on life support with other revenue streams (XBox) for the next 16 quarters until it becomes a repulsive enough carbuncle to shareholders that it gets the axe (Microsoft phone.) then in a year, limp into the barn with another product nobody else but you could afford to buy (minecraft) and slowly turn it into a KPI farm for Microsoft account metrics to drive some other failing product (Azure) and keep the C level happy while you alienate virtually every player with mechanics or requirements they hate.
It’s funny you mention Minecraft. My kids recently said “I hate Microsoft. All they did is ruin Mojang.”
They don’t know Microsoft for anything other than ruining Minecraft. They didn’t know Microsoft made the Xbox or even Windows.
They made this statement after Microsoft forced them to migrate their account they’ve had for 5 years to a Microsoft account. That broke their computer for a few days and reset their games. For no useful reason.
My son is eleven years old, a huge Minecraft fan, and is exactly like that. Now, he does know that MSFT makes Windows, but that's not an existential threat to him like ruining Minecraft is.
yes! i signed up for an account here just so i could say this. we got refurbished computers for my children just to play Minecraft. Minecraft didn't work in windows, so i set up a dual boot Ubuntu system (while biting my nails that i might totally destroy the os). that was working well for months. when they had to migrate their account, somehow it switched my daughter's local login to a Microsoft family login, which screwed up her other apps (like adventure academy). i'm tearing my hair out trying to figure out how to switch it back to a local login. i already have two other apps for monitoring computer time (which already don't always work out as well as you'd think, so i didn't want to have to deal with a third one). i guess i'll eventually figure it out, but i found this page looking for a solution. (and i'm not that techy right now, so it takes a good chunk of time.... sigh)
galera is not a solution for scaling out writes, full stop
galera has lower max writes/sec than a traditional async single master because it's a cluster. the other members of the cluster need to ack the writes, and all members are doing all the writes, so adding machines does not increase your max writes
Looking for education here: my understanding was that MariaDB is not as well suited to large databases and heavy read/write performance in those databases as Postgres. Would the primary argument for MariaDB be its first-class sharding support?
At GitHub’s scale, you don’t just “move off” a database. At best it would be a gradual project that would take years for the company to complete, and likely trigger additional incidents along the way.
Why would Microsoft not migrate to SQL Server? MySQL is owned by Oracle. Microsoft cannot be happy about using a product from Oracle. SQL Server is a pretty good product and the conversion will give them even more tools and expertise for their consulting wing to do it for other companies.
Microsoft could easily fork and maintain mysql if oracle did anything wonky. And they could use MySQL without paying any licensing fees due to its OSS license.
Does it matter? GitHub is a closed source service. If they run Microsoft SQL Server vs. MySQL, it doesn't matter or change anything for its users. All that we as customers and users care about is the experience and performance. Whatever they want to do internally to achieve that doesn't matter at all IMHO.
It matters because it makes their product worse. I’m a customer so I’m impacted by using a crappier database that has higher costs and worse performance.
SQL Server is at least as good as any variant of MySQL and that’s only if you want to be very generous for MySQL.
If you meant that and I misunderstood, the other issue is that you don't just migrate to a different RDBMS. You rewrite half of the app and then spend a couple of years fixing issues.
It’s not a horrible database, but it’s just expensive and pretty much never a better choice than Postgres or other OSS dbs.
There are certainly some very niche use cases where you need it, but there's a reason most tech stacks don't use SQL Server. I think there are better ways to handle durable transactions and redundancy than using SQL Server Enterprise with SQL Server's replication.
Counterpoint - with GitHub now being part of a large database and cloud vendor (i.e. SQL Server and Azure), surely there are at least some merits to moving workloads to those technologies? Leverage internal knowledge, talent, etc.
Don't take my word for it. Spin up an Azure SQL database and try it yourself.
"Ping" it using a trivial query such as "SELECT 1" a thousand times in a row and draw a histogram. Or just eyeball the numbers. I did this recently and had response times over 12 milliseconds regularly. For comparison, a small and cheap IaaS VM running SQL Server can get down to the 150 microsecond range and stay there.
For this 100x degradation in performance you get the privilege of paying several times the cost of an IaaS VM + SQL license.
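The "ping" test described above is easy to reproduce. Here's a rough sketch using an in-memory SQLite database as a stand-in target (against Azure SQL you'd swap in pyodbc or pymssql and your own connection string):

```python
import sqlite3
import statistics
import time

def ping_latencies(conn, n=1000):
    """Time a trivial SELECT 1 query n times; return latencies in ms."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        conn.execute("SELECT 1").fetchone()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return samples

# Stand-in target so the sketch is self-contained; replace with your
# real database connection to measure what you're actually paying for.
conn = sqlite3.connect(":memory:")
lat = ping_latencies(conn)
print(f"p50={statistics.median(lat):.3f}ms  "
      f"p99={statistics.quantiles(lat, n=100)[98]:.3f}ms")
```

Eyeballing the median versus the tail (p99) is usually enough to spot a proxy hop or an underprovisioned gateway in the path.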
Azure SQL proxies connections. I mean sure, the documentation says that they can do a "redirect" instead of a proxy, but not if you use any of their "private" network connection options. You would think that is some sort of simple IP header change implemented by the switching gear in hardware, but you'd be wrong -- it tunnels through what is essentially a VPN -- with all of the predictable issues. These tunnel VMs don't have accelerated networking turned on, for example. So you can have "business critical" tier on one end, and huge VMs with accelerated networking on the other end, but the traffic in between is being routed through some 1 vCPU virtual router appliance processing packets in software.
Anyone with database admin rights can alter firewall rules via SQL commands. These are via SQL Server accounts and hence they're just a username and password, no MFA or anything. If you can find a SQL injection vulnerability you can punch a hole through the "firewall". Brilliant.
The firewall supports IPv4 only, and uses different CIDR syntax to everything else in Azure. It doesn't support Service tags, or logging, or monitoring, or anything really. It exists only to tick a checkbox.
Unlike most other Azure services, SQL doesn't integrate with Azure Active Directory or RBAC properly. So for example you can have one AAD group as the SQL Admin. No other rights, no list of principal IDs... just one admin group. All other delegated permissions must be done through SQL commands, blocking the use of ARM templates, Policy, custom roles, etc...
If you delete an Azure SQL Server instance, it deletes all backups associated with it. Sure, they recommend that you put a "delete lock" on the server, but then most administrative operations become impossible because you can't then delete any child objects. And even if you do create a delete lock, that can just be deleted. There is no way to say "backups cannot be deleted for at least 'x' days", even though the underlying storage accounts support this feature.
DISCLAIMER: Some of the above may have changed since I last checked, always read the documentation and/or verify with support if your data matters to you.
Somehow, I just assumed that because git is based on content-addressed storage DAG that GitHub internally heavily leveraged a distributed hash table and content-addressed storage DAG. With content-addressed storage, all of the stored data is immutable, so caching is significantly simplified, and a DHT can scale very well horizontally. Care still needs to be taken around the transactions that change DAG roots (heads of branches), but the rest is just making sure immutable blobs are sufficiently replicated in the DHT.
A DHT isn't a great fit for really rapidly changing data like metrics and such, but I figured that everything required to keep the basic git clone, commit, and web serving functionality is a pretty natural fit for a DHT.
I figured even code search had a relatively small amount of storage for indexing recent commits, with compaction of that data resulting in per-token skiplists stored into the DHT. That way, failure outside of the DHT serving path still allows code search for all commits older than the latest (incremental) compaction. Distributed refcounting or Bloom filters could be used for garbage collecting the immutable blobs in the DHT. The probability of a reference cycle in SHA-256 hashes of even quadrillions of immutable blobs is vanishingly small.
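For readers unfamiliar with content addressing, a minimal sketch of why caching immutable blobs is trivial: the key *is* the hash of the value, so an entry can never change under you. (The class and the git-like payload below are illustrative, not GitHub's actual design; a real DHT would partition the key space across nodes.)

```python
import hashlib

class ContentStore:
    """Minimal content-addressed store: keys are the SHA-256 of the
    value, so entries are immutable and can be cached and replicated
    freely without invalidation logic."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        key = hashlib.sha256(blob).hexdigest()
        self._blobs[key] = blob  # idempotent: same blob, same key
        return key

    def get(self, key: str) -> bytes:
        blob = self._blobs[key]
        # Readers can verify any replica's answer themselves:
        assert hashlib.sha256(blob).hexdigest() == key
        return blob

store = ContentStore()
k = store.put(b"tree 4a7d...\nparent 9f3c...\n")  # git-like object
assert store.get(k).startswith(b"tree")
```

The mutable surface area shrinks to the branch-head references, which is exactly where the transactional care mentioned above gets spent.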
Permissioning/ACLs is something that you really want to get synchronous whenever possible, you don't want caching and pipeline delays if you can help it.
Say a maintainer is acting in bad faith, and someone else quickly locks them out of all the repositories they have not yet thought to corrupt. It then looks really, really bad if it turns out they can still run the CI/CD scripts, maybe with malicious substitutions, even though you blocked them from pushing commits.
The other part of that is, it's not too hard to get right. See what GitHub was doing. People don't change ACLs that often, nowhere near as often as you read them. If they did, you could rate-limit them. And the rest of the problem is, you have one writer and many read copies in one place and you can enforce boundaries on how stale the data gets on the read copies.
Two hard problems, first is cache invalidation.
Of course, they have found out that they are big enough to shard, one would have hoped that they would have found that out in a gentler way, much sooner. But let's not pretend that their architecture makes no sense, it's a fine architecture, they just happened to outgrow it in a bad way.
I think it's wise to minimize the number of synchronous reads necessary for correct operation. Pervasive usage of immutable data and (expiring) cryptographic capability tokens makes this much easier. Typically, the use of immutable data results in slower writes, but simpler caching logic, which can often result in faster reads in distributed systems. Luckily, many workloads are very read-heavy.
If your CI/CD pipeline uses capability-based permissions, utilizing delegatable expiring cryptographic tokens (e.g. Google's Macaroons), then a stale read of the ACL (presenting a token incompatible with the latest ACL) would allow the CI/CD pipeline to start running. Completing all non-local I/O (i.e. observable side-effects) would require synchronous reads of a cryptographic hash of the latest repository ACL and comparison with an ACL hash in the token (with a slow path of re-verifying that the token could still be issued in case of ACL changes). Presumably, this ACL hash would be part of the repository's immutable root node, so in the common case of CI/CD pipelines storing outputs into a repository, this synchronous read is "free" in the sense that it's part of the transaction to change the reference to the repository's immutable root node.
Granted, you have to be careful and make sure all of the I/O that's not local to the CI/CD pipeline (i.e. all observable side-effects, e.g. releases, sending emails) goes through an ACL check that involves a synchronized read of the repository root reference. The fast path is just checking the cryptographic hash of the ACL in the capability token against the hash in the repository's root node. The slow path (when the ACL has changed since the capability token was generated) involves re-checking the ACL and verifying that the user named in the token is still allowed to create the presented token.
The downside is a racy read of the ACL (in the form of presenting a cryptographic token based on a revoked permission) when starting the CI/CD pipeline results in wasteful resource usage. If this is a worry, an additional synchronous read can be added at pipeline start time to avoid this race. If the cost of this synchronous read is prohibitive, a cache of <user, revoked_permission, revocation_time> can be added to the CI/CD start logic.
The upsides are that you can cache ACL reads client-side (in the form of your cryptographic capability tokens), needing only synchronous ACL reads for non-local I/O (i.e. observable side-effects) and if an attacker starts a malicious CI/CD operation, locking out the attacker before the malicious I/O (releases, emails, etc.) completes results in the malicious non-local I/O (i.e. observable side-effects) getting discarded.
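A minimal sketch of the fast/slow path described above, using a plain HMAC where a real system might use Macaroons (the signing key, ACL shape, and all names here are hypothetical):

```python
import hashlib
import hmac

SECRET = b"issuer-signing-key"  # hypothetical issuer key

def acl_hash(acl: dict) -> str:
    """Deterministic hash of an ACL mapping user -> permission."""
    return hashlib.sha256(repr(sorted(acl.items())).encode()).hexdigest()

def issue_token(user: str, acl: dict) -> dict:
    """Capability token binding the user to the ACL as of issue time."""
    h = acl_hash(acl)
    sig = hmac.new(SECRET, f"{user}:{h}".encode(), "sha256").hexdigest()
    return {"user": user, "acl_hash": h, "sig": sig}

def check_side_effect(token: dict, current_acl: dict) -> bool:
    """Gate for observable side-effects (releases, emails, ...)."""
    expected = hmac.new(
        SECRET, f"{token['user']}:{token['acl_hash']}".encode(), "sha256"
    ).hexdigest()
    if not hmac.compare_digest(token["sig"], expected):
        return False
    if token["acl_hash"] == acl_hash(current_acl):  # fast path: ACL unchanged
        return True
    # Slow path: ACL changed since issue; re-check the user directly.
    return current_acl.get(token["user"]) == "write"

acl = {"alice": "write"}
tok = issue_token("alice", acl)
assert check_side_effect(tok, acl)                    # fast path
assert not check_side_effect(tok, {"alice": "read"})  # revoked -> denied
```

The pipeline itself can start on a stale token; only the side-effect gate needs the synchronous comparison against the current ACL hash.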
Linked in the article is this other one, "Partitioning GitHub’s relational databases to handle scale" (https://github.blog/2021-09-27-partitioning-githubs-relation...). That describes how there isn't just one "main primary" node; there are multiple clusters, of which `mysql1` is just one (the original one — since then, many others have been partitioned off).
from that article it sounds like they are mostly doing "functional partitioning" (moving tables off to other db primary/replica clusters) rather than true sharding (splitting up tables by ranges of data)
functional partitioning is a band-aid. you do it when your main cluster is exploding but you need to buy time. it ultimately is a very bad thing, because generally your whole site is dependent on every single functional partition being up. it moves you from 1 single point of failure to N single points of failure!
I disagree, functional partitioning is not a band-aid, but an architectural change that in the end can reap much more benefit than simple data sharding.
>> your whole site is dependent on every single functional partition being up. it moves you from 1 single point of failure to N single points of failure!
Not necessarily, it can also be that only some parts of your site are dead while others work perfectly fine.
too idealistic. invariably some team (or usually many teams) don't properly gate some critical-path logic, they depend on some functional partition always being online and then boom much larger blast radius than intended
then they fix it in the post-mortem but the pattern just repeats. i have seen it so many times! it used to be much worse in the earlier days of the cloud when VMs would go poof more often
> In addition to vertical partitioning to move database tables, we also use horizontal partitioning (aka sharding). This allows us to split database tables across multiple clusters, enabling more sustainable growth.
It seems to me like one thing that would have made these failures way more manageable would be a solid adaptive load throttling framework. Presumably, the total load only exceeded their maximum capacity by a couple percent, so a good identity-aware throttling framework should be able to maintain full functionality for ~97% of users by shedding requests early in the flow, or probably better if some automated traffic is prioritized lower and can be dropped first. If I remember correctly, Tencent had a nice paper about how they do this, but I can't find a link now.
When a request hits the first server in the backend, it gets assigned a priority based on the service priority (i.e. wechat pay is prioritized over games) and a hash of the user's ID and the current date. When any service gets overloaded, it propagates a message to the other services that depends on it telling them to drop traffic below some threshold priority. That means the traffic gets dropped as early as possible, and the username hashing means people don't just keep refreshing until they can get through.
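A rough sketch of that scheme (the service names and priority arithmetic below are made up for illustration; this is not the actual Tencent design):

```python
import datetime
import hashlib

# Lower number = more important. Hypothetical service tiers.
SERVICE_PRIORITY = {"payments": 0, "web": 1, "games": 2}

def request_priority(service: str, user_id: str, day: str = None) -> int:
    """Base priority from the service tier, plus a per-user offset from
    hash(user_id + date) -- so refreshing doesn't change your fate, but
    the unlucky cohort rotates daily."""
    day = day or datetime.date.today().isoformat()
    h = int(hashlib.sha256(f"{user_id}:{day}".encode()).hexdigest(), 16)
    return SERVICE_PRIORITY[service] * 100 + h % 100

def admit(service: str, user_id: str, threshold: int, day: str = None) -> bool:
    # When a downstream service is overloaded, it broadcasts a lower
    # threshold upstream, and everything above it is shed at the edge.
    return request_priority(service, user_id, day) <= threshold
```

With a threshold of 99 here, all "payments" traffic passes while all "games" traffic is shed at the first hop, and within a tier a deterministic slice of users stays fully served rather than everyone seeing random failures.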
BGP itself didn’t cause this problem. An operational error/bug/whatever that manipulated network configurations was the cause of the outage. BGP is just the messenger: if you misconfigure it your network will break, but it’s not BGP’s fault.
A more accurate checklist item would be “is it the network?”
Built and configured the server, drove it a few hours down to the DC, racked it. Got back home, tried to access it. No luck. Turned out I had forgotten to connect the power and turn it on.
I know this could be taken as a compliment but I don’t feel it’s a fair assessment. Maintaining large scale SaaS services is a constant challenge, and the GitHub team work really hard to build a platform that serves a gigantic community of developers. Downtime happens to everyone.
I worked at GitHub up until last October. I’m keenly aware of the challenges of maintaining it, and how much hard work the teams put in.
However I think it’s hard to ignore in light of recent issues that there has been a large amount of attrition in the last year or so, specifically to PlanetScale by some of the most senior/database focused engineers.
Edit: just noticed that you’re the CEO of PlanetScale. I don’t mean this as a slight against you or GitHub or the folks that moved. I genuinely do just think that the number of people who shifted over left GitHub with a fairly large experience and knowledge gap. I don’t think anyone is to blame, it’s just an outcome of organic evolution and changing of a technical organization.
I mean, you can rest assured that you will have problems with Vitess at some point. All software is buggy, and there is no guarantee one of those bugs won't be catastrophic. You should plan for a method to completely replace your entire production cluster in the event of failure. This can be quite difficult with horizontally-scaled databases. It's also not uncommon to hit a scaling limit for which there's no simple solution, or not be able to build out a replacement production cluster due to unplanned-for limitations.
One thing that has caused me much frustration and is probably putting Microsoft in tremendous legal risk is that GitHub Actions will often be incorrectly billed during these incidents.
Others have said the only recourse is to contact support. I finally pulled the trigger today.
As I write this, yet again, I cannot push to or pull from GitHub repositories, for about an hour now, and yet again the GitHub status page is all green. It's starting to seem like it never gets updated outside of US business hours.
Yikes. "All write operations were unable to function during this outage, including git operations".
I know git does not need to consult any database but its own when committing. Auth is solved with tokens and url matching.
I get the need for the fancy database for all the non-git stuff, but it is very concerning that this stuff is in front of the meat and potatoes. Sounds like Microsoft is making another Skype/Teams disaster...
> Unfortunately, this caused a new load pattern that introduced connectivity issues on the new failed-over primary, and applications were once again unable to connect to mysql1 while we worked to reset these connections.
> We were able to identify the load pattern during this incident and subsequently implemented an index to fix the main performance problem.
Probably cascading failures - that's certainly an unhealthy pattern of loads. One part of the house of cards collapses putting previously unseen pressure on another part of the house of cards - that then promptly collapses - so forth and so on.
Technical debt - here's a real world example of what those words really mean!
They mentioned nothing about CPU or disk I/O, with the immediate cause being "reaching maximum number of connections", i.e. the following error from the Vitess vttablet:
ResourceExhausted desc = transaction pool connection limit exceeded
* Oracle added NOWAIT and SKIP LOCKED to MySQL 8.0, and called it a day. This is a bad solution because it just changes from one extreme (unbounded queue) to another (no queue).
* At $prevjob we added block-level queue within InnoDB, based on an older experimental patch from AliSQL which they didn't actually bring into production. No app change required.
* That bug was reported by Facebook, but there is nothing relevant in their MySQL fork. They probably outsource the lock queuing logic to their Memcached layer.
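To make the three queueing behaviours concrete, here is a toy single-process model, with `threading.Lock` standing in for an InnoDB row lock (this is a sketch of the semantics, not of how InnoDB implements any of it):

```python
import threading

class Row:
    """Toy row lock contrasting three lock-queueing strategies."""

    def __init__(self, max_waiters=2):
        self._lock = threading.Lock()
        self._waiters = 0          # not thread-safe itself; toy model only
        self._max_waiters = max_waiters

    def lock_wait(self, timeout=None):
        # Classic SELECT ... FOR UPDATE: unbounded queue, waiters pile up.
        return self._lock.acquire(timeout=-1 if timeout is None else timeout)

    def lock_nowait(self):
        # FOR UPDATE NOWAIT: no queue at all, fail immediately if held.
        return self._lock.acquire(blocking=False)

    def lock_bounded(self, timeout=1.0):
        # The middle ground: a bounded queue (AliSQL-patch-style).
        if self._lock.acquire(blocking=False):
            return True
        if self._waiters >= self._max_waiters:
            return False           # queue full: reject instead of piling up
        self._waiters += 1
        try:
            return self._lock.acquire(timeout=timeout)
        finally:
            self._waiters -= 1

    def unlock(self):
        self._lock.release()
```

The failure mode in the unbounded case is exactly a connection-pool exhaustion: every blocked waiter pins a connection while it queues.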
The problem with this bug is that, without a correct analysis, the knee-jerk reactions are typically:
* adding more CPUs
* throwing in even faster disks
* bolting on a sharding layer like vitess
* blaming your DBaaS providers (well, they are to be blamed if they intend to stick with "vanilla Oracle MySQL")
Question to Github: I see you running pprof on vttablet, but have you run perf on mysqld?
I gathered from another comment here that this mysql1 database is still fronted by ProxySQL? In that case, the advice remains the same: don't bother with the proxy layer, focus on mysqld instead.
Oh dear. So other than those double outages, I should expect at the very least one GitHub outage every month, if not 3 or 5 in one month?
I don't think we are going to see GitHub be up for a full month without an incident anytime soon and I guess my entire comment chain [0] on the whole situation has aged for two straight years in a row, especially yesterday's one from [1]:
>> Until the next time GitHub goes down again (hopefully that won't be in another month's time).
*Goes down the very next day*
That says it all really. Lets reset the counter and try this again.
I’m surprised when these increasingly frequent outages at large internet companies happen I don’t see tons of suggestions that Russia (or China, or North Korea) is behind them.
Is all cyber warfare completely underground and invisible?
Cyberattacks look very different to degraded query performance.
You could in theory pull off an attack like that, but it seems sort of dumb because it's not destructive/permanent, so unless the timing is really critical it's probably a huge waste of a (likely expensive) exploit chain.
If you have deep enough access to degrade performance of the database you likely have deep enough access for exfiltration or other more nefarious activities.
I sometimes think in that direction as well... but in the case of degraded query performance it could simply be a DoS attack, where they found some combination of operations that brings the db down... and they exploit it because they don't expect it to do anything more than this.
If you want your enemy to fear the repercussions of attacking you, you need him to understand the strength you have to fight back. And in the case of cyberwarfare the only way to show strength is by hacking stuff... so I imagine there's some value in smaller not-so-destructive attacks.
MySQL scales out just fine. Anyone who thinks otherwise is simply not well versed enough in how to properly shard data and manage replication topologies.
I say this as someone that despises MySQL.
The same obviously applies to PostgreSQL, but it's actually slightly harder for a number of reasons, with fewer out-of-the-box solutions like Vitess available.
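At the routing layer, hash-based sharding is conceptually simple. A sketch with hypothetical shard names (the genuinely hard parts in practice are resharding when the shard count changes and cross-shard queries, which is what tools like Vitess exist to manage):

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Stable hash routing: the same key always lands on the same shard."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n_shards

# Hypothetical cluster names; a real topology would carry replica sets,
# not just primaries.
SHARDS = [f"mysql-shard-{i}" for i in range(4)]

def route(repo_id: str) -> str:
    """Pick the cluster responsible for a given sharding key."""
    return SHARDS[shard_for(repo_id, len(SHARDS))]
```

The key design decision is choosing a sharding key (e.g. repository or organization ID) such that hot queries stay single-shard; get that wrong and you've traded one overloaded primary for a scatter-gather problem.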
The hardest part is drafting a series of questions that the end-user can understand and answer before you can, presto-chango, emit configuration files that just work.
I blame the program providers.
Some Debian maintainers are trying to do this simple querying for complex configurations (`dpkg-reconfigure <package-name>`). And I applaud their limited inroads there, because no one else seems to have bothered.
I have made a bash script to configure each of chronyd, named, sshd, dhcpd, dhclient, NetworkManager, systemd-networkd, /etc/resolv.conf, among many others. The scripts ask simple questions, glue the appropriate settings together, and then run each tool's own syntax checker (most are provided upstream).
Postfix, Shorewall, and Exim4 remain a nightmare to my evolving design.
CISecurity and other government hardening docs were applied as well, and some settings I took even further, like tightening Chrony's file permissions/ownership and enabling its MitM-blocking feature.
These are dangerous scripts in the sense that they could write files as root, but run as a regular user they instead write the configuration files into the appropriate directories under a `build` subdirectory.
If these designs work across Redhat/Fedora/CentOS, Debian/Devuan, and ArchLinux well, I may forge even further.