I work in platform. A quick project is one or two quarters; a decent-sized one is multi-year, will likely take several deployments to roll out, and needs at least a quarter for testing and stabilization. But when we deliver it to the several thousand clusters it will serve, the chance of a serious incident is close to zero.
For this to work, we need to anticipate future requirements. We monitor usage patterns, make projections, study the market our company serves, and try to spot trends and shifts. We care about what competitors are pushing and what complementary industries are doing. We are aware of the business, but not in the same way product is.
When our plans work (and they mostly do), we deliver at the right time, when the company needs it. We saw a shift in market patterns that would affect load a year ago, so we started working on scaling a certain part of the system. It took us a year to get there, but when it was needed, we had it ready.
Sometimes people ask for things that come out of the blue, things we were unable to anticipate, and the work to get them done will take at least several quarters.
These scenarios are the hardest, the ones where you need to push back. But it's not that we're doing nothing: we're reshuffling several quarters of planning to adjust to the shift in reality and correct the blind spot, so that the next time you ask for something, it's ready, because we started a year ago.
Can you give some more detail? I am trying to work out what is deployed to thousands of clusters. Is this like a base image containing a new rollout of, say, Postgres? Or internal tools?
I am genuinely interested - it sounds like you really know your onions and I am trying to work out what scale you do onions on :-)
The main one is a real-time analytical engine (think OLAP: billions of records, sub-second response times, and dynamic queries rather than pre-baked ones). The core products depend on it.
Each client has its own cluster due to data privacy and isolation requirements (some infra is shared; analytics is not), and some are FedRAMP compliant, which makes matters even harder.
Most clusters are small (4-5 small nodes, running on Kubernetes or Docker), some are huge (many terabytes of RAM across dozens of machines/containers).
For example, we saw a shift in growth patterns a couple of years back (both in the datasets themselves and in product's talk about features and the future), so we invested in distributing the dataset and the query engine. We started early because we knew it would take time to build. So we projected our "death date" and planned the project backwards from it. By the time the storm came, the scalability was there. We had to rush the deployment a bit, but we were almost done (perf testing, forcing failures and recovery, updating SOPs, etc.).
At about the same time, we were somewhat blindsided in part by a 1000x to 10000x increase in data volume on one particular part of the product related to text analytics data (it came after an acquisition), which forced us to scramble and change some of the internal data structures to better compress this data, plus product changes to reduce the volume needed. In the end, the impact was under 10x, which we were able to handle with the distribution work I mentioned earlier.
In any case this "scramble" took a quarter, and the strategy we used to increase compression was something that we had thought about internally and discussed before, but we didn't have an incentive to implement it until the demand was there.
Sometimes it is like that: we play out likely scenarios (usually looking for what will kill us next), think roughly about how to address them, assess the likelihood of each happening and when, and decide whether it's something we need to do now or later.
Most scrambling is to bring forward something we planned for later and adjust it to the new reality.
As a recovering platform engineer, nearly 2 years out of the game, I identify with a lot here. My experience is that platform engineers are mostly people who never actually write code for the platform they have built, other than toy services to test with, which are usually written in a language that isn't the company default.
Communication is difficult because platform engineers, like SWEs, are supporting people, but the support is direct, so platform often faces a torrent of people asking for things. This is a problem when the SWE doesn't know what they want because they are a newbie fresh out of university, or worse, they know what they want but don't know how to work the insane platform monorepo that requires 9 items on a checklist to get an IAM permission added, which a platform team member told them about in a support Slack channel.
I think a lot of platform teams also make their own lives harder by making too many abstractions. If it is impossible for an SWE to add a new resource to the cloud without a platform engineer helping, then platform are always going to be disadvantaged.
Control is also a big issue.
Platform want all the control and believe giving control away is going to cause a disaster. In reality, SWEs want 5% control, just to do the thing they have to ask you as a platform engineer to do anyway. But maybe platform engineers like that hero complex.
An internal "platform" team is often just a mismanaged infrastructure team. Careerist managers and engineers advocate for a sweeping, complex, centralized system which competes with real-world products sold by outside companies. It starts with "my team will reduce costs" and then it snowballs into "the platform delivers value! Look at all the people working on the platform! We need to hire more people to build even more! We make money!"
A few years later, the platform's code is rotting, its interfaces are unusable, and all the original proponents have chosen to depart rather than to maintain the mess they created. The company is stuck with a legacy system, far inferior to the real-world product, which has been shaped by market forces and not the political dynamics of one, single company.
Polemic aside, I think internal engineering platforms are ok to build when the industry solution hasn't matured yet, isn't available, or is otherwise too expensive. Otherwise, I think the infrastructure manager should seek out the minimum abstraction (and team) that enables the simplest path towards buying from the outside.
Platform don’t want control; they just don’t want to inherit infra that was set up by someone who didn’t know wtf they were doing. The common antipattern is that fullstack sets something up, then hands it over to platform when they leave or can’t keep it up and running.
My experience as a product engineer without root is that I spend 50% of my time asking people to do things that I could have done myself in 5 minutes.
I’ll confess that I sometimes break something that takes a day to fix, but I’m not sure if that’s worth it.
Especially considering I regularly wait weeks until platform can get around to fixing my specific issue.
I’d say you’re still in a somewhat decent position. I miss those times. The real worries start when you realize you’re spending the other half of your time fixing bugs that would never have come into existence if you had written it yourself the first time.
When the "platform" is a cobbled-together mangle of CloudFormation, Python Lambdas, and Terraform. They call the people who did that SREs.
And by "They" I mean the Sales and HR people who get assigned to run engineering orgs at 100-200mm private businesses all over the US. AKA "Enterprise."
You would be surprised how many devops/platform engineers cannot write code that would get them a junior position working on the code that runs on the platform.
That's where people still follow the old SysAdmin vs Dev silos. If you have to follow checklists, open tickets, and follow up on Slack, you are not really working on a platform, and thus not really a platform engineer either.
The problem with product is that they care about individual features and not features in general.
This leads to an inability to see the overlap across product's distinct features, which all seem largely the same from a platform perspective.
The goal for platform is to take the overlap and push it down the stack, behind an API - hiding the subset of complexity that product does not need to be exposed to.
Platform gets to hide the dragons, and product gets an API that allows for features to be implemented quickly and cleanly (it also provides improved constraints from a testing perspective).
However, when there's a feature that cannot be implemented against the constraints and features currently imposed and available, things slow down - until platform figures out what new abstractions should be presented to allow for the new feature to be implemented.
It does not matter much whether platform's initial implementation is well engineered or hacked into place; fix it later, but try to do a good job on the abstractions (this allows product to cleanly implement the new feature, and the platform-side technical debt can be handled later without major impact to product).
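To make that concrete, here is a minimal, hypothetical sketch in Python (all names are invented, not from the thread): several "distinct" product features that each need report export share one platform-owned API, and the dragons stay behind it.

```python
# Hypothetical sketch (names invented): product features each need
# "export a report", differing only in output format. Platform hides the
# shared machinery behind one small API; product code never touches
# storage, retries, or auth.

class InMemoryStorage:
    """Trivial stand-in for a real blob store, so the sketch is runnable."""

    def __init__(self):
        self.blobs = {}

    def put(self, key, data):
        self.blobs[key] = data
        return f"mem://{key}"


class ReportExporter:
    """Platform-owned: the dragons (auth, retries, storage) would live here."""

    def __init__(self, storage):
        self.storage = storage  # injected, so tests can swap in a fake

    def export(self, report_id, fmt):
        payload = self._render(report_id, fmt)
        return self.storage.put(f"{report_id}.{fmt}", payload)

    def _render(self, report_id, fmt):
        # one implementation of the overlap, instead of one copy per feature
        return f"report {report_id} as {fmt}".encode()


# Product side: each "distinct" feature becomes one call against the API.
exporter = ReportExporter(InMemoryStorage())
url = exporter.export("q3-revenue", "csv")
print(url)  # mem://q3-revenue.csv
```

The injected storage is the testing-constraint payoff mentioned above: product code can be exercised against a fake without standing up real infra.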
> This leads to an inability to see the overlap across product's distinct features, which all seem largely the same from a platform perspective.
Bingo!
The worst is when a product team circumvents platform or doesn't invest in the overlap, resulting in 2 complex things to maintain, slowing down the organization overall and creating future migration problems. Alternatively they create it, but then platform ends up paying the maintenance cost.
This is very on point. We have platform and product in entirely different reporting structures that meet just below Sundar; this leads to hysterically different cultures. My team is trying to straddle the product/platform divide right now and it’s a constant learning experience.
A recent example: someone asked me for my team’s planning schedule, specifically when we figure out what we will be doing for the next six months.
My only response is “we don’t do that here”; my team has to constantly juggle projects to load-balance their downtime (most of our work is very sinusoidal in how much time it demands in a week), and we pick up new ones that are important enough to add to the load from time to time.
The platform folks set out in December last year with a mission for all of 2021; the design is flexible, but they know exactly what the goals for their year are.
When one team has projects that take an arbitrary amount of time to “get right”, and the other has more reliably predicted cycles that just depend on coding output, things get a little hairy. Mostly just takes a lot of empathy for the folks on the other side, though.
I do platform. I have a rough idea of what I need to accomplish in a year, in two, and a reasonable definition of where we'll need to be in the next 5 to 10 years.
The plans move all the time, but we have them. We usually plan rough deliverables one to two years in advance. Detailed plans usually cover one to two quarters.
Yeah; my product team will get wildly different priorities over the course of a month as execs spin plates, PMs and managers scramble around, and we nail down what our next round of experimentation will involve. On top of that, relying on A/B testing for all decision making means you’re liable to discover at any point that there’s a totally different piece of software that you should have built instead.
As someone who works on an infra/platform team at work, this is a nice thing to read. I wish more people at work had some empathy for the teams that support them. I really hate having to do customer service tasks at work. Surprisingly, there are a lot of entitled engineers to deal with, despite the fact that I only support a few hundred. In a company we are all coworkers; we work together. I'm not here to respond to demands from anyone. I'll stop before this turns into a rant hah
Could it be that the engineers need changes in order to deliver on a product they’re working on?
Being on an infra or platform team is always going to involve receiving a lot of demands from engineers, since you become the bottleneck preventing them from adding/modifying the overall system.
Genuinely curious to hear how engineers act entitled.
There are a couple of things that make someone come off as entitled. Like when someone comes to me with a solution to a problem they don't understand but still demands I implement their solution. Someone wrote a whole doc about how we don't test enough and need to test more, as if we were unaware of it. I wrote the same thing myself before that; we're just understaffed, so things fall to the side. Why go into the interaction with the assumption that we don't know what we're doing? Then when I get back to work, I have someone opening a ticket telling me how to run the service. They're wrong in their assessment, so their solution is wrong. I've told them, yet they still insist and don't want to help with the debugging.
Or it's the attitude that my time isn't worth as much as theirs or others'. I'm on the receiving end of everyone's fire. Any team's fire can be twisted to suddenly be the most important thing for the company. Your problem gets put in a queue. And let's be honest, most people are full of shit about how important their problem is. Like this article said, I have my own projects, with cross-team and company-wide implications if they don't get done on time or done well.
A recent scenario: someone asked for help, and I told them I'll get to it soon, just not now. Apparently that wasn't fast enough (we have defined SLAs for responses). Managers got involved. Fine, I dropped everything to figure out their issue. It was a cross-team thing, so I pulled in other teams to debug too. I was getting pressure for updates for a few days, then silence. I asked for a follow-up and got a response that plans had changed and it wasn't important anymore. They couldn't be bothered to even tell me?
Some team started using our service in a way we didn't intend (revolving around persisting state and data; I don't want to give too much away about what I do, just trust that I'm right here). So they complain: "why's the data gone?" The answer: because we never guaranteed that, and we have what we do guarantee published. So now this team is demanding special treatment. They can't fix it now, but they promise that in the future they will. Guess what happened? We're still special-casing their ridiculous situation.
There's little things people do, generally revolving around unrealistic expectations. Asking for off hours help with test/staging/dev/"whatever phrase you use" environments. Asking for multiple updates a day. Meetings that can be emails.
> Being on an infra or platform team is always going to involve receiving a lot of demands from engineers, since you become the bottleneck preventing them from adding/modifying the overall system.
Of course this is the case. There are ways to interact with people, though. Like ordering at a restaurant: "Give me the food already, I'm hungry! Is it ready?" just pisses people off. Have empathy and assume the best intentions until you have a reason not to. There are teams that are pleasant to work with and coworkers who are nice and friendly; there are also teams and names I cringe at when I see them show up.
That's my off the top of my head examples and part of the rant I tried to avoid.
Sounds like in your third scenario you possibly should have given that team an EOL date on support for their special use case from the beginning. If it was an open source platform deprecating support for a feature they use, then they'd make the time to implement a fix. But since you're part of the same org they're able to indefinitely delay when they have to do additional work, but at the cost of your team constantly having a slightly higher workload.
Yeah, we need more empathy all around. Really, we should all just admit that no one's calling in life is to make their CEO and shareholders more money, take it easy at work, and stop torturing ourselves and others over meaningless shit.
I believe the differentiation between product and platform engineers is creating an unnecessary divide:
A platform engineer builds something for a customer, just that in this case the customer is a developer, usually within the same organisation.
If anything applying product thinking to the problem helps the platform engineer: They have a clearer persona for their customers. They can more easily speak to their customers (if internally). They want to deliver value to their customers and fix their problems.
Why create this artificial divide and silo thinking? If your customers are not happy with your work you're doing a bad job and your product sucks.
> A platform engineer builds something for a customer, just that in this case the customer is a developer, usually within the same organisation.
> Why create this artificial divide and silo thinking? If your customers are not happy with your work you're doing a bad job and your product sucks.
The problem is that most companies don't start out with a platform team. They grow and grow until there are a set of problems that are ignored because there is no direct owner until they can't be ignored any more. A team gets created devoted to these problems and ultimately becomes a platform team, because it's an easy way to allocate work. It's one thing to design a "product" from the ground up. It's entirely another to be trying to build a building that is simultaneously crumbling and growing around you at every moment.
Platform teams generally don't get the headcount allocation to get ahead of their work either; this results in having to make prioritization choices. When a platform engineer makes a prioritization choice, that means a product engineer is hearing "no." The problem then becomes that the product engineer might do something to circumvent you, or any number of other failure modes; when platform is circumvented or battled, this creates animosity. If a product team is told no, platform likely has to explain it, which is yet another time investment/context switch that is not building the platform "product."
I assure you most platform developers are also "product" developers and see the platform as a product.
The fundamental organizational difference between a platform developer and a product developer is that a platform developer is seen as a "cost" and a product developer is seen as "income." This ignores that platform teams are a product multiplier. A correct measurement of platform teams is aggregate product creation. When viewed this way, it's pretty easy to see that a 5% boost across all teams might be a better investment than a 30% boost to 1 team. Unfortunately, this is very hard if not impossible to measure, so it's a matter of leadership's gut feeling and the platform team's ability to market work that could be seen as fruitful.
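As a back-of-the-envelope illustration of that 5%-vs-30% claim (the team count and output numbers here are invented, purely for the arithmetic):

```python
# Hypothetical napkin math: 20 product teams, each producing 100 "units"
# of product value per quarter. Compare a platform investment that lifts
# every team by 5% with a single-team investment that lifts one team by 30%.
teams = 20
baseline_per_team = 100

platform_gain = teams * baseline_per_team * 0.05   # +5% everywhere
single_team_gain = 1 * baseline_per_team * 0.30    # +30% in one place

print(platform_gain, single_team_gain)  # 100.0 30.0 -- org-wide lift wins
```

The catch the comment names still applies: the 5% is spread thin and invisible, while the 30% shows up in one team's launch announcements.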
If ALL of your customers are not happy, then you're doing a bad job. But if only a small subset of your customers are not happy, then that's just par for the course. You can't please all of your customers all of the time because they all have slightly different needs, so you have to make difficult decisions on how you want to optimize overall customer happiness.
I think it's more like a team that works on roads or power. Platform teams are usually monopolies, and they usually work on critical infrastructure. They are almost a kind of utility. When a power company fails, you see blackouts; when a city fails to plan for public transportation or load on roads, everyone gets stuck in traffic.
A lot of the time, all of a platform team's customers are unhappy because there are very systemic problems with the platform (slow tests, hard/slow releases, brittle services, slow feature production) that result in an overall bad experience. In addition, just like a freeway that's run out of capacity, adding capacity will often involve even worse traffic while the situation is being improved.
Now if a lot of people are experiencing black/brown outs and a lot of people are stuck in traffic, who is to blame? Is it the power companies? Is it the road construction companies? No. It's city hall. It's the people who make prioritization choices. More lanes on a highway is a major investment and building a rail system is a major strategic choice. Those choices are not made by construction companies, they just build. Those decisions are made by city hall.
If you are experiencing traffic it's because someone in an authority position looked at the opportunity cost of the tax dollars that might be spent on solving traffic and said, it's too high. Similarly, a platform that has too much traffic (dev experience sucks), probably indicates a CTO that said "we need features more than we need a stable platform."
An even more apt similarity is that infrastructure projects often fail because term limits/vote cycle mean that people can make short term decisions they benefit from with long term consequences they are not responsible for.
The model of platform as a product breaks down when you understand that the platform is both a monopoly and a cost center. Platform is much closer to a government (trying to keep people as happy as it can with as low taxes as possible) than a business (able to charge as much as the market will bear, and not obliged to serve any particular customer or use case). The difference between pleasing everyone enough and pleasing some specific people a lot is quite a large paradigm difference.
Platform is a product in the same way that Comcast is a product. Platform is a product in the same way that the Flint Michigan water company is a product.
Blaming a platform team is almost always wrong. The right people to blame is leadership.
> I can’t stop everything and clean up. As much as I would like to, I have a deadline.
In my experience, if this way of thinking has become institutionalized you have bad leadership. It is literally unsustainable to constantly develop without any aftercare or clean up. It doesn't matter if it is on the Product or Platform side.
The best solution I have found starts with leadership: leadership that understands the idea of building codes, and understands that concept as applied to software.
I have found that when quality standards are agreed by all teams and supported by leadership, a lot of the inter-team friction falls away.
This lines up with my experience as well. At worst, you should be able to work solely on getting your sprint tasks done, but also be making notes of any clean-up/aftercare that needs to be done and any other basic codebase maintenance that you notice as well. Then these get added as future sprint tasks.
On the best team I worked on, all of our sprint tasks had soft deadlines unless explicitly stated. So it was perfectly justifiable to not fully complete a sprint task because you needed to do some unforeseen codebase maintenance in that part of the project before you could get to work. Before I got there, the team had inherited a Python 2 project that hadn't been fully maintained in a while. Our PM always emphasized "clean up your workplace before and after every task." Adhering to that, we were able to prepare for a migration to Python 3, add tests and documentation, and enforce PEP 8, all while implementing new features and setting up a continuous delivery system.
She leaves out a third category: AI/ML/DS, which also has huge cultural differences with traditional product and infra/platform teams. Check out Hilary Mason's recent interview on the TWIML podcast. She confirms something I've suspected for a very long time: agile is a bad fit for ML-enhanced product/project development.
The biggest issue I've had with bridging the DS<>Product gap is that our (DS) work is not linear and often not incremental. Sometimes we go backwards, and sometimes we have to build for a really long time only to discover that this way doesn't work and we'll have to do something else, and there is usually no stopgap. ML-based solutions often don't work at all until they work.
Is this "friction" just the same as the traditional orthogonal relationship between developers and operations teams?
One wants to introduce new features into the system; the other has the job of maintaining stability so the service is actually available to its users.
In my opinion, a good metric for how fast to move is hard to define. The "move fast and break things" that is popular in Silicon Valley does not work when your services are used by critical systems (think water supply, power supply, medical, etc.). But moving too slowly has the disadvantage of being too late to market.
It's a very fine balance to strike, if not impossible to reach.
I think it's more like ordering food in a restaurant than dev and ops, where the platform team is the restaurant, the workers are some of the engineers, and the customers/eaters are the users of the platform. The kitchen needs to cook in a certain way and can't be doing a million substitutions, the waiter can only do so much, and customers can't cut the line if it's busy; they should have made a reservation.
This perspective is wrong because it assumes that there are platform engineers and product engineers and that they need to negotiate with each other as equals in the absence of an arbiter that can set priorities and explain the long term consequences of unreleased product to platform or the consequences of unmaintainable code to product. The absence of the big picture creates agitation rather than alignment.
This blog post is an indicator of extremely weak leadership. Lack of a leader with both authority (from the org chart) and legitimacy (because people believe in their technical expertise) is exactly the situation that results in this kind of blog post being written.
Full disclosure: I am a platform engineer.
My take:
* Product engineers overestimate their ability and expertise
* Product engineers underestimate the damage they cause with several classes of errors, especially dependency errors
* Organizations generally prioritize product engineers' concerns over platform engineers' concerns: product = money, platform = cost
* Organizations under-invest in platform headcount, even when all product teams complain about the dev experience.
* Platform engineers fail to create pleasant frameworks
* Platform teams repeatedly fail to make the right choice the easiest/only choice
* Product wants more from platform, but won't give up their headcount
* Product is more than happy to spend platform's money (they should be on call/do maintenance), but isn't happy when platform spends product's money (we don't have time to refactor/migrate).
Here is a list of things product teams might do that would make platform engineers lives hell:
* add new dependencies, especially without consulting anyone
* create circular dependencies/inline imports/other dependency abominations
* use the global scope
* change how the server works in the wrong location
* fail to or completely ignore dependency injection
* implement a solution to a problem they haven't yet understood
* hope a shortcut solution will work because they don't want to solve the problem the hard, direct way
* try to solve a performance problem by adding complexity, rather than reducing complexity
* demand specific infrastructure changes in an area they don't maintain
* make significant changes for a service they are not on call for
* write crazy complicated tests that take forever and completely fail to unit test
* contract out features to other companies without consulting platform
* ask for new languages/paradigms to be supported (which also means maintained)
* not do napkin math to determine how much space their new data will take
* not do napkin math to determine how much a new feature will increase utilization
* not warn platform teams of new/impending load
* not consulting platform teams about complex new features BEFORE implementation
* abandoning services/code they don't want to maintain, but can still break
* failing to migrate or finish a migration
* failing to actively or passively monitor the capacity/health of their own features
* create product that someone else will be on the hook to maintain
* implement the same behavior another team needs in their own way because coordinating is too hard
* implement the same behavior another team is responsible for providing because they need it now
* creating a new system without first understanding what was bad or refactoring the old system
* failing to quarantine business logic, especially from infra logic.
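The two "napkin math" items in the list above are cheap to actually do. Here is a hypothetical sketch with invented numbers, just to show the shape of the estimate a product team could hand platform before shipping a new feature:

```python
# Hypothetical napkin math for "how much space will this new data take?"
# Every number below is made up; the point is the shape of the estimate.
rows_per_day = 50_000_000   # projected event volume for the new feature
bytes_per_row = 200         # rough serialized record size
retention_days = 90         # how long the data is kept
replication = 3             # copies kept by the storage layer

raw_bytes = rows_per_day * bytes_per_row * retention_days * replication
terabytes = raw_bytes / 1e12
print(round(terabytes, 1))  # 2.7 -- i.e. ~2.7 TB before compression
```

Five minutes of this kind of arithmetic is exactly the "warn platform teams of new/impending load" step: it turns a surprise into a line item.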
All the product, platform, and infra engineers in these toxic environments sit down trying to diagnose the key problems about why the dev experience sucks. Then ask themselves "how are we failing so bad? Why does product do such insane things? Why does oncall suck so much? Why is the platform so hard to use? What is infra even doing? Who is solving these problems?" It's clear something is wrong, but nobody knows exactly what it is.
The problem is a lack of leadership. When people write posts about product and platform being at each other's throats, that is an abject failure of senior leadership at a company.
A leadership that fails to interface with or solicit feedback from line workers is a failed leadership. A leadership that fails to see the bigger picture is a failed leadership. A leadership that fails to see long-term costs is a failed leadership. A leadership that promotes short-term prolific product devs who squirt out product, rather than foundational devs who do the hard thing because it's necessary even if it takes more time, is doomed.
If you go around and all the people in your technical leadership positions can't tell you what the real architectural problems are because they are too busy plopping out features, that means leadership has failed.
And guess what! All the senior engineers are avoiding these problems like the plague, because sane solutions require a lot of really unpleasant, really unsexy work that is not guaranteed to succeed. Why would a senior engineer work on a hard problem when they can write a horribly complex feature, get a promotion and a 20% bonus, and quit the instant there are any repercussions for their bad decisions, with a resume that talks about the amazing features they created? I'm not asking that facetiously. Is hard, grunty refactoring work valued by the company's reward/authority system? Do platform engineers get to write performance reviews for the senior engineers who cause them pain? Where and how does quality/architecture feedback even get considered systematically?
How many kids out of college even know what dependency injection or the global scope is? How many of them can mock a canonical easily understood teaching example on the white board of a global scope violation or dependency that was not injected? How many senior engineers can do that? How many onboardings, documentations, or classes exist that explain these concepts and the long term consequences of ignoring them? What do you do if your senior engineers are leading by example by taking those shortcuts if it gets their feature out faster?
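For anyone who wants the whiteboard version of those two concepts, here is a minimal, hypothetical Python example (all names invented): the first function reaches into module-global state, the second takes its dependency as a parameter.

```python
# Whiteboard-sized versions of the two concepts (invented example).

# Antipattern: hidden dependency via global scope. The function's real
# input doesn't appear in its signature, and tests can't substitute the
# database without monkeypatching the module.
_db = {"user:1": "alice"}

def get_user_bad(user_id):
    return _db[f"user:{user_id}"]  # reaches into module-global state

# Dependency injection: the collaborator is an explicit parameter, so the
# dependency graph is visible and a test can pass in a fake.
def get_user(db, user_id):
    return db[f"user:{user_id}"]

fake_db = {"user:7": "bob"}
print(get_user(fake_db, 7))   # bob   (trivially testable)
print(get_user_bad(1))        # alice (bound to module state)
```

The long-term consequence the comment alludes to: every `get_user_bad`-style function couples the whole codebase to one shared object, which is exactly the complexity platform teams end up saying "no" to.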
What every product engineer needs to understand is that a platform is a commons, as in the tragedy of the commons. A grassy meadow with a bunch of cattle ranchers might let the cattle eat their fill, but if all the cattle eat all the grass, the cattle will starve or the ranchers will fight. In the same way, if a bunch of devs are allowed to create as much complexity as they want, the server will starve (tests will take too long, the build will always be broken, pages will always be sent) or engineers will start to fight.
Every individual dev is incentivized to create as much complexity as quickly as they can, because they want to meet their goals. With no force regulating the complexity, it is not a stable system and will definitely collapse. So the platform team comes in and says "no," because that's the only way to really limit complexity. Leadership's job is to regulate this relationship and ensure that everyone is aligned and properly incentivized, so that "the cattle ranchers can have as many cattle as the fields can sustain."