We had a Postgres database that used pgbouncer for connection pooling. The most senior developer on the project (24 or so) was using Go to connect to the database to write some simple reports, but each report took hours to run and often had to sleep for 30+ minutes. So, after a while, pgbouncer would kill the connection and the report would die. No other application among the many that connected to that DB did this, so it was definitely strange.
Found out pretty early on in troubleshooting that they had no mechanism to keep the connection alive, which made total sense as a reason the app died. So, they put the library's standard keepalive function in a loop for when the report wasn't doing anything... but that didn't fix it. It made no friggin' sense. After bashing my head against it for a while, I finally threw my hands up and asked if they could just run a "SELECT 1" as a keepalive instead of whatever the Go library was doing. Got a bit of pushback, but I told them to do it and walked away. That ended up fixing the problem.
Turns out the Go library was trying to be clever with its keepalives (I can't remember what it was doing exactly), making some silly assumptions that there was nothing in the middle managing connections.
I like to think that dev learned a lot about trust in "magical" libraries after that.
Go's database/sql uses its own connection pool. But that still shouldn't create any problems. I have seen the reverse, where apps that assumed temporary tables stick around from statement to statement without an explicit `txn` (which regular Postgres connections don't need) clearly failed. But I have not seen the issue you talk about.
My wild guess would be that the Go code never closed the result/rows, which either exhausted the pool or left connections hanging until they eventually hit a timeout. Consider it similar to how HTTP clients need to close the response.Body or else connections can't be reused.
At a previous employer, I was forced to use a certain 3rd-party ODBC library. Under certain circumstances, it would just do `exit(1)`, with nothing even written to stderr. Very frustrating and annoying to debug/fix. Had I personally known the developer responsible for that behavior, I'd probably have faced murder charges and pleaded temporary insanity.
Any sufficiently advanced technology is indistinguishable from magic. If your source of knowledge is stack overflow or a youtube video posted on linkedin you already live in a world filled with magic.
Interesting take, since you're still stuck with "simple reports" that take hours to run. What about writing more efficient queries, adding indexes, normalizing the DB... ?
It amazes me the complexity of solutions these days, when 20 years ago almost everything ran in relational databases and query tuning was usually the solution to performance problems.
It no longer makes sense to tune your queries, or to wait weeks/months for the vendor to tune theirs, when you can just slap a few RAM sticks in and call it a day.
Tuning queries still makes plenty of sense. Slow/expensive queries need extra infrastructure like a memory cache server or more app servers to paper over the inefficiency.
That's a bunch of added effort that might as well be spent on understanding your database. Even something simple like using materialized views can significantly increase performance of expensive queries.
If you're like a company I was at before, you'd pay $10k+ for DB consultants to tune some queries and your prod database, then forget to re-tune it when migrating the DB to new hardware, wasting the extra 64GB of RAM and even the SSDs you installed. There should still be a bare minimum floor of organizational competence for actually developing against and maintaining databases, whether that's a DBA, better engineers, etc. Throwing hardware at a problem is fine when you're sure you're actually throwing it at the problem in the first place, which I have seen surprisingly few places do well.
But we aren't doing something simple like that, we are building monstrosities based around the theory of micro-services in the cloud. Kubernetes. It takes hours to get a development environment put together to try and reproduce / debug a problem. We are adding complexity instead of keeping things simple.
And query tuning isn't that difficult. Spend a few hours on this site and you will be better than 90% of devs out there: https://use-the-index-luke.com/
> we are building monstrosities based around the theory of micro-services in the cloud. Kubernetes
This isn't something I'm doing and none of the people I personally know are doing this (disclaimer: im not in SV or the "startup scene")
> And query tuning isn't that difficult. Spend a few hours on this site and you will be better than 90% of devs out there: https://use-the-index-luke.com/
It used to be that companies had DBAs and could call out their developers/vendors on their sloppy queries. They have been replaced by 64GB of RAM.
My point is that the technologically inept organization can't even figure out how to use that 64GB of RAM properly, and more RAM won't fix a full-blown table scan on every other incoming query when people have no way to stop queries without limits. However, you can add modern SSDs with millions of IOPS to paper over complete ineptitude at using the free query analyzer in your database. To me, the inability to bring oneself to use that is about as incompetent as not being able to use a debugger or profiler. I can certainly understand a level of inability when hiring only junior devs, or having one's hands tied when dealing with production systems locked behind compliance and bureaucracy, but honestly, these are all problems that neither hardware nor more people will fix.
And to be fair, I am mostly familiar with the guts of technology laggards, and they're adopting K8s at frighteningly fast rates compared to anything else I've seen in almost two decades. It's very common to see pretty smart people familiar with the latest tech who nonetheless have some serious holes in the fundamentals familiar to those with experience.
I've heard about some database drivers/middleware being smart and optimizing/caching queries like "SELECT 1" to save the round trip, so now I do something like "SELECT now()" or similar to do health checks.
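Assuming such caching middleware exists in a given stack (I haven't verified which drivers actually do this), the idea is simply to ask for something only the server can answer:

```go
package main

import (
	"context"
	"database/sql"
	"time"
)

// healthQuery asks for the server's clock, which middleware can't
// sensibly answer from a cache, so a reply proves a real round trip.
const healthQuery = "SELECT now()"

// healthCheck returns nil only if the server answered within the
// deadline carried by ctx.
func healthCheck(ctx context.Context, db *sql.DB) error {
	var serverTime time.Time
	return db.QueryRowContext(ctx, healthQuery).Scan(&serverTime)
}

func main() {
	// Compile-time wiring check only; a real call needs a live DB.
	var _ func(context.Context, *sql.DB) error = healthCheck
}
```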
Let’s not just target Go with that sentiment, it applies almost universally, just in varying degree.
Counterpoint: how is anyone supposed to learn, if not from their mistakes? We might worry about the blast radius, but there’s no compression algorithm for experience.
Well, for one, you can learn from other people's mistakes, which is better than learning from your own, because then you don't have to feel the pain.
Go is a deserving target for this criticism because the language itself deliberately made a lot of the mistakes other language communities made and learned from, like weak typing[1] and naive garbage collection algorithms. Literally if you opened an undergraduate textbook on either topic you'd see much better ways to do things. But early adopters argued vehemently that Go was simple and didn't need those things.
It does seem like Go is learning from its mistakes here: they've introduced precise garbage collection, and it seems some form of generic or template types is inbound in the next few releases. Perhaps in a few years Go will be a language I am willing to work in. But it would have been nice if a new language that already had these problems worked out had become popular, instead of Go, which reached popularity through hype rather than technical merit[2].
Tracking the history of template/generic types has been somewhat humorous: you can almost see it in this article[3] where the author starts in with the title "Who needs generics!", goes on to describe some frankly horrible ways to get around the lack of generics (it's amazing how complex Go's simplicity can be!) and finally backpedals in an update ("I am the last one to balk at the idea of generics in Go."). I don't mean this to be picking on the author here though--I've seen this history played out on other blogs and in the comments of Hacker News as well.
[1] Before you flame me on this, ask yourself if you can articulate the difference between strong and static types, because if you can't, you don't have the prerequisite knowledge to have an opinion on this.
[2] It's worth noting that the decisions made in Go probably have merit within the context of Google. The problem is that most Go users aren't at Google, and have different problems than Google, so the tradeoffs made by Go are nonsensical for their use cases.
The question was and is rhetorical. People don’t inwardly digest the mistakes of others. And anthropomorphising a language? Most peculiar.
None of this makes NIH a less than widespread phenomenon.
The readers of this forum often do know their type theory. Gatekeeping otherwise won’t go over well, it just reads like an arrogant insult from someone utterly lacking in self-awareness and accustomed to presuming themselves the smartest person in the room with the only relevant opinion. Ironically, given the subtopic, much like Google often does.
The fact that you ask a question not expecting a direct answer is not proof that a direct answer does not exist.
> People don’t inwardly digest the mistakes of others.
In my spare time, I'm a rock climber, and mistakes in my rope systems can kill me. The same is true in mycology, firearms, airplane piloting, civil engineering. If you really feel that you can only learn from your own mistakes then I guess it's lucky for you that you've chosen to learn in a field where the stakes aren't life and death.
> The readers of this forum often do know their type theory.
That's true. The same is not true for the many Gophers who repeatedly claim that Go has a strong type system, which is who that comment was directed at.
No, but it’s a trap. If someone answers an obviously rhetorical question, they’re inadvertently demonstrating a predilection for engaging the construction, not the substance, of a statement, and almost certainly missing the ironic subtext.
I’d be happy to repeat my assertion though. People don’t inwardly digest the mistakes of others, which is why educators on safety-critical topics such as those mentioned must go to extraordinary lengths to extract and convey the salient teachings, translated into better practices, drills, equipment etc.
Reading the archives of the Dropzone fatalities database, for example, won’t make me a better skydiver.
Conversely, the best structured educational processes I’ve experienced are essentially offering the student the opportunity to make their own mistakes, but under circumstances that don’t have consequences (other than pedagogical or scholastic)
> No, but it’s a trap. If someone answers an obviously rhetorical question, they’re inadvertently demonstrating a predilection for engaging the construction, not the substance, of a statement, and almost certainly missing the ironic subtext.
That's a pretty self-aggrandizing analysis of the situation.
From my perspective, I got you to make the statement, "People don’t inwardly digest the mistakes of others", which sounds a lot more absurd when you actually say what you mean plainly instead of hiding it in rhetoric.
Maybe some people go through life that way, but that's a pretty poor life strategy and I personally make a pretty big effort to learn from other peoples' mistakes. Maybe I haven't been successful always, but I can point to lots of examples of where I have.
That's quite the signal of bad faith debate. I don't think it's my own aggrandizement in play here. Quite the reverse. c.f. remarks passim re. hubris. So there the conversation must end.
The guy who thinks he "trapped" me with a rhetorical question accuses me of arguing in bad faith? Okay...
All I did was get you to say clearly what you believe. If what you believe is so embarrassing that it's a sign of bad faith to get you to say it clearly, maybe believe better things?
The problem is you don't even know where to learn about the mistakes of others until you make the mistake yourself; in making the mistake, you get some clue as to what to search for, which then uncovers the mistakes of others.
I've found that when I ask others what mistakes I should avoid, they tend to answer the question.
Another way to discover any mistakes is to ask people why they didn't do certain things which you think are good ideas. Often that reveals that they did do that, and it turned out poorly.
Go loses a lot of its allure if "the world" is not implemented in it (mostly the safety guarantees); I think that is the core reason everything gets reimplemented in Go.
Of course, the real reason is the same reason Go even exists: we always think we can do something better the next time :) But hindsight is not 20/20 when it comes to software development.
Go also commits another un-Unix-y sin, in my opinion, in that it doesn't respect kernel keepalive. In other words, you can't use sysctl to configure it. Every single Go app has its own behaviour, requiring a recompile to change.
As far as I remember, Go didn't enable TCP keepalives by default until 2018.
Node.js's Sequelize does this transparently when pinging a connection.
That aside, I wonder why you need to keep the connection alive for over 30 minutes when SQL connections are usually short-lived. Why can't you just close and reopen them? Is it a temporary table?
Not the GP, but I assume transactional consistency was important for the report, hence the need to keep the connection alive. That's a pretty common situation.
Reporting can typically be split into two transactions. One long-running readonly transaction with snapshot consistency obtaining the data for the report and a separate transaction which publishes the result.
Interesting. I've run into similar issues when we put an AWS network load balancer in front of a DB. It has a fixed TCP connection timeout, so similarly, if a query takes a long time, it'll disconnect. We fixed this at the socket level by setting the Linux sysctl config to send TCP keepalives at an interval below the load balancer's timeout. It was a tricky problem to figure out.
I've seen older developers that call themselves senior, but lack basic knowledge. I've seen younger developers, wise well beyond their years. Age simply isn't a big factor in how you judge a developer.
Just because you saw a few exceptions does not mean the rule does not hold in general. Or are you saying most people don't learn with time (a corollary of your theory that age is not a big factor)?
I would say that I have seen no discernible pattern, so I have learnt that it is imprudent to judge a developer by their age. People do learn over time, but some people learn more "cogently" than others, i.e. some get more out of one year's worth of learning than others get in 10.
That's misleading because it's too simplistic. A smart person could spend 10 years gaining real, legitimate experience and they could still be eclipsed by someone with little experience but much more talent.
I'd say it's mostly the other way around, experience can't substitute for high natural cognitive ability. Of course a person still needs experience, but people with high cognitive ability don't need nearly as much, and people with less cognitive ability will hit thresholds of capability much more quickly. A lot of people live in a bubble of people with similar ability so they don't grasp the true importance of ability. And fakers who learn nothing year over year but have "years of experience", extremely common in this industry, don't like being told there are 14 year olds way more capable than them at their own jobs.
I mean, talent is no substitute for experience in the sense that experience teaches you things about how the world works which you can't figure out just by thinking, however smart you are. But sure, cognitive ability is a multiplier, and if you don't learn from experience, it is wasted.
I have met people much smarter than me who wrote bad code because they were working from a theoretical framework which just didn't correspond to reality.
But your point is only valid if those who do learn with time are either rare (so do not make a dent in the general trend) or they somehow also do not learn much after a certain age, both of which I find deeply problematic.
Programmers by the nature of the profession must be good at learning over time... and I find it difficult to believe someone who is good at learning when at 20, will be bad at learning when at 40...
> some get more out of one year's worth of learning than others get in 10.
Sure, but they probably continue learning the year after?? And the year after??? Or they too, stop learning (or are just too rare)?
I think this is correct. The curve flattens with age (in my opinion).
> you saw a few exceptions
I think the seniors you met (the knowledgeable ones) /were/ the exceptions. Most seniors I have seen have convinced themselves that they know all there is to know.
I am getting on a bit, and it's not that I think I know all there is to know, but I see so much new tech that seems like pointless, unnecessary complexity. I reckon that more than 90% of apps could be done with a classical MVP, a bit of jQuery, and a well-tuned database. Most new ideas are a rehash of something from decades ago. It gets rather tiring.
> Most seniors I have seen have convinced themselves that they know all there is to know.
Yeah, seems we've met different kinds of seniors... I don't know any senior who wouldn't understand that "the more you know, the more you understand just how much you still don't know".
Most people don’t magically grow wiser with time. They need to be in an environment where they can grow, otherwise they’ll emerge just as stupid as before.
If you start out with a cohort of individuals across the spectrum of ability, over time many of those of lesser ability will self-select out of the pool. It is my experience (oops, see what I did there) that people rarely spend a career doing what they're not very good at. The exceptions inevitably stand out.
So, combining the winnowing of the not-very-apt with the gaining of knowledge through experience, the end result is that you have a preponderance of wise old experienced contributors.
If you are one of the younger, inexperienced ones who believe they know better, it's likely you'll self-select at some point, secure in your Dunning-Kruger knowledge, and move to some career in which your high level of competence is valued more.
Do not accuse people without evidence! I'm all against age discrimination.
...and when you see a company where the ages cluster in a very small range, you can tell age discrimination is happening. Which might very well be the case in the parent post.
It's even more complex than that, as you can't even say dev X > dev Y for all tasks. So the age thing is even sillier. After enough years, you start forgetting the details of the stuff you did in your first years anyway.
Because then it’s harder to deal with our own imposter syndromes if we can’t blame it on the youth and hold their heads in the toilet while giving them the professional-development equivalent of a wedgie.
This was discussed at length in last week’s “Grey Beard Weekly” newsletter.
It seems so, and the fact that people are accusing me of ageism as I pointed out that a company hires only people below 24 years of age is very telling...
You didn't point that out, you made a vague complaint about someones age, and the comment you replied to doesn't even give you a good basis to assume that, so it's no wonder nobody understood what you tried to hint at. Hint: in many companies, a project involves only a small part of the workforce.
Edit, because meh: I'm making no claims about go itself. No idea what makes you think that's what I'm saying, since I'm clearly talking about a library, and not even any stdlibs. "Magic" is just a term useful for describing systems that sweep much of their abstractions under the carpet in a way that probably has gotchas. Granted, the term itself is magical.
In terms of fixing the problem, I knew for a fact that the keepalives I was seeing were nothing like what I've seen in the past at many companies, across Oracle, Postgres, and MySQL, all of which implemented "SELECT 1" for the sake of keepalives, by devs who've been in the field much longer than me. The suggestion was by no means blind, unless you consider implementing a widely used method for this exact purpose "blind". Had I gone a different route and fixed the problem within the stable, existing system, it would likely have broken many other teams' database connections. I'll pass on that, since frankly, even ignoring the risk of such a change, the dev should have done the investigation themselves.
Your post was unnecessarily aggressive and seems to come from me having struck a nerve somehow.
Genuinely hoping you're doing alright. Peace.
I presume they took it as an attack on Go. Truth is, it's an attack on the library developer who themselves may have found their keep-alive solution by stumbling blindly on it.
I have no affection towards go or any language. They're just tools. You sounded elitist calling something magic and pointing out someone's age as part of your point. And your passive aggressiveness to my response is proof of it. I "genuinely" hope you're doing alright too.
(The 80/20 rule applies below, some developers do care)
Developers... just don't care. They want to spin up an ORM, point it at a URI, and forget about it.
I've fought this for over a decade now as a DBA, SRE, DevOps, and architect. Most of the developers don't want to deal with anything infrastructure-wise; they want to spend all the time they can just focusing on the problem they're writing software to solve.
Observability, reliability, scalability: these are all words that translate to either "someone else's problem" or "unproductive busywork" in their minds.
Many interests are pulling developers' attention in several different directions all the time.
Database, security, accessibility, performance, infrastructure, tooling and productivity, business concerns, workflow processes (agile), language concerns, new things
All of these like to say "if only the developer could do $MY_AREA better, they'd be better developers and we'd have better software". Each of them wants to pile on more requirements for what devs ought to know.
Let's say we agree that devs should know all of these things. After you tally up every area's demand, you're probably looking at a 10 year timeline before someone's decent in all of these areas by spending approx. 1 year on each of these areas doing meaningful work (as opposed to contrived tutorials).
Even then, things change constantly, so you'd have to continually practice all of these things. Who knows if we'll add a new category next year? Then it becomes an 11 year timeline.
At some point, developer responsibility has to stop.
I have done exactly this and found it to be a frustrating and thoroughly thankless exercise. No stakeholder for development teams care about any of this, and your allocation of shared resources plummets as people realize they can dump problems on your team and instead focus their attention on the ones who do not care.
I don't agree. If you're designing data structures in a code base you shoulder some of the responsibility for the persistence characteristics of that data.
There's a lot of devs that think database design is the same as starting a new ORM class and generating a migration file.
> Database, security, accessibility, performance, infrastructure, tooling and productivity, business concerns, workflow processes (agile), language concerns, new things
Yes, these are all things that devs should strive to know as much about as possible. Software isn't easy. It takes a long time to become an expert. 10 years sounds about right.
> Who knows if we'll add a new category next year?
Skill domains do come and go, but I think the ones you've listed are solid staples of web development for the past decade and likely will still be for a decade more.
> I don't agree. If you're designing data structures in a code base you shoulder some of the responsibility for the persistence characteristics of that data.
There's usually not a relationship that goes the other way though, for example, developers don't tell DBAs to pick up code so they can write the models in our language in addition to the underlying SQL. This highlights a trend of increasing responsibilities pushed onto the developer.
>Yes, these are all things that devs should strive to know as much about as possible. Software isn't easy. It takes a long time to become an expert. 10 years sounds about right.
But in some cases it's specialists designating what an expert developer should know. It's giving away control in some respects. This turns into new job requirements and a higher barrier for entry. The growth will need to stop at some point.
> There's usually not a relationship that goes the other way though, for example, developers don't tell DBAs to pick up code so they can write the models in our language in addition to the underlying SQL. This highlights a trend of increasing responsibilities pushed onto the developer.
Yes, because as a developer you're the one who has the responsibility of implementing the business requirements of the app. There's no trend here; this is the way it's always been. The buck stops with development for a lot of things. The developer is in a unique position to respond to many incidents because they have an intimate understanding of how the business requirements wed with the technology in ways someone like a DBA does not.
DBAs have plenty of responsibilities of their own, such as handling a 3 AM alarm that goes off when some part of the application starts hammering the DB thanks to a poorly designed N+1 query in the codebase. Oftentimes, when the DBA tries to teach the developer, it's because he's sick of getting those 3 AM wake-up calls.
> But in some cases it's specialists designating what an expert developer should know. It's giving away control in some respects. This turns into new job requirements and a higher barrier for entry. The growth will need to stop at some point.
That's a strange mindset to have. Different jobs have different technologies, and as a developer you learn how to work with them. The specialists aren't designating anything, the job you're responsible for is.
>Often times when the DBA tries to teach the developer it's because he's sick of getting those 3AM wake-up calls.
And if he just learned to code, he could write all the queries himself, check in his own code, and never have 3AM calls. I see no reason why a DBA couldn't manage to write queries in source, especially with some hand-holding w.r.t. source control and check-in rituals. If they are already doing too much, just hire or train more of them. DBAs are equally as smart as devs. Loop DBAs into the business requirements.
>That's a strange mindset to have. Different jobs have different technologies, and as a developer you learn how to work with them. The specialists aren't designating anything, the job you're responsible for is.
This is very push/pull about who should own what and I personally think devs are often responsible for too much. In silo'd places, make no mistake that input is gathered about who owns what. In more collaborative environments, you see the other sides willing to step in and share the responsibilities; blurred lines, but they are still there. What we've adopted is an over-reliance that a dev will just learn many things and we hope he's good enough at all of them that we don't really need deep knowledge or need to delegate work.
Agree with this. There's the whole DevOps thing nowadays as well, which basically shifts what used to be an entirely separate, full-time role onto the developer. Adding DBA to that sounds like it would benefit no one, except perhaps business owners looking for short-term savings at the expense of productivity, a la open-office floor plans.
In ~50 lines of code I can erect a load balanced set of web servers with my app binary preloaded, secured in a private VPN subnet with an auto-scaling policy attached to it.
15 years ago this project would have meant:
- Racking new hardware somewhere on premise
- Configuring several routers and networking equipment
- Tweaking the different software responsible for the different app layers (load balancing, web servers, etc)
- Hoping we forecasted demand correctly and whatever hardware we racked can handle a traffic spike or growth surge
You needed a lot more expertise back then.
And really you can trade all of what I'm doing for a higher level of abstraction that requires even less knowledge of the underlying technology (netlify, heroku, amplify, etc) if you're willing to pay a little more.
In most places I have worked, knowing the database well enough to manage it and at very least write stored procedures has always been expected from developers writing database related code.
I have been doing this for a little more than 30 years now, so it isn't "nowadays" thing.
Sure, knowing your database well enough to "manage it" in terms of indexes, bloat, writing procedures, that's a developer responsibility.
What is "new", is developers having to manage the DB backups, k8s clusters, Docker images, Terraform, load balancers, CI/CD, proxies etc.
This used to be a role for a whole another role. Some places do have an "Ops" role, but most tasks are still expected off the developer, with Ops at best providing a helping hand, rather than taking over a task completely.
As someone with 6 YOE who considers himself a decent developer, I am looking for ways to become a so-called expert developer (I don't mean to say that mockingly).
Genuinely interested in the advice which can help to gain more than I will on my own with my current trajectory.
You should be constantly learning everything about the whole stack so that you can actually build functional, reliable, manageable and maintainable systems.
I expect a competent developer to be able to build a modern multi-page web application, with a HTML/JS front end, relational database back end, appropriately configured certificates and DNS/CNAME/URL, build basic uptime and application monitoring and do a basic SQL ETL data retrieval process.
That seems like a reasonable bar, and while the specific tools have changed over the years, that stack is basically the same as it's been since the 90s.
Those are all areas I've seen "Developers should learn X" calls-to-action.
Sometimes it's outright loathing over developers that don't know specific topics. Other times it's wishful thinking, or maybe a nice-to-have.
If you, as a developer, read all of those posts, they probably all make fair points about the importance of knowing each of those things as they relate to development. You decide there is some merit to learning it.
So you decide to make a to-do list to go learn each of those topics, because you want to listen to the blog posts and be a good developer, and do some real work to prove you know it. That's what's going to take you a while.
You seem to be contradicting yourself without realizing it.
> You should be constantly learning everything
This is an in-progress action i.e. the developer is still learning.
> I expect a competent developer to be able to build
This is now considering a "learned" developer.
I am not sure you are making the point you think you are making. The point I think you are trying to make is your expectations of what an experience developer should know. But, you seem to be expressing it as what a new developer should be doing.
While the core discussion of the article might be in regards to what developers do and do not know, I can't help but notice that a developer knowing about something does not necessarily allow them to be productive in that area of their knowledge (especially relative to another co-worker with both the knowledge and the dedicated focus in that area).
Also very important to note: the comment you are addressing seems to be referring to knowledge that is local (particular customer necessity/problems, particular architecture choices for infrastructure, particular product design decisions, particular ways to answer the quirky CEO/CTO in a way that they understand, etc.). There is a lot of locale-based knowledge that a developer must learn at a company/job/project and can even change over time (temporal-locales).
Globally-applicable knowledge like frontend, backend, and general CS concepts are for sure a reasonable expectation of an experienced developer. But, there is a delicate balance a developer must take in the real working world that is subject to not attempting to master every aspect of the product/business (especially if it overlaps with someone else's job/focus) just because you have a high-level understanding of the global concepts. In other words, it is not necessarily efficient for a developer to know every aspect of every language and every database in the company unless that actually buys the company more customers and money.
I would expect any decent manager to understand this very basic principle. Everyone in the company trying to be a master at everyone else's job does not help the company make more money. Being reasonable about expectations in the moment is also a critical asset of working together to make money. :)
He's not contradicting himself. The fact is there's a large percentage of developers who just don't care. It's not a matter of still learning; they just don't care to learn.
To be clear, I was not trying to say his points were void of validity. I was trying to add some clarity by properly differentiating between a developer's general knowledge base, a developer's temporary knowledge base, and the actual day-to-day doings of the developer. They are not equivalent sets of things, even if they intersect.
As a personal anecdote, I know I have learned many things on a project that really helped improve the code, that today I am not able to recall and would have to go back and re-learn with a minimum of a refresher. This happens a lot too. :)
> I expect a competent developer to be able to build a modern multi-page web application, with a HTML/JS front end, relational database back end, appropriately configured certificates and DNS/CNAME/URL, build basic uptime and application monitoring and do a basic SQL ETL data retrieval process.
What does all that have to do with OP's assertion that expecting application developers to understand the nuances of DBs listed in the article is noble, but unrealistic? A "competent developer" could fulfill your requirements and still not understand the implications of time drift or how to scale horizontally or other deep topics. Application developers are the hub to many spokes, but expecting them to have deep knowledge across all technologies is as unrealistic as expecting a DBA to have a deep understanding of how a certain application framework works.
The problem is that all abstractions are leaky in some way or another. If you're going to use an abstraction, you should at least learn enough about it to know where to put the buckets.
Not really. That is exactly the point the article is trying to make. Developers need to care about these things - _enough to know who to go get help from_. That is the minimum. Also 10 years is an exaggeration. 2 years working on a non-trivial backend should expose one to these problems.
From what I have seen, products built without caring about these will usually get rebuilt a year from the original release - either by the same company or by a competitor who killed them.
> 2 years working on a non-trivial backend should expose one to these problems.
You can be exposed to them, but without understanding them, experiencing good, bad, and really bad 'solutions' to them, and understanding the impact (on the business, on the code, on security, on maintainability, etc.)... you just can't really get all that in 2 years.
I know plenty of people who've been 'exposed' to certain type of tech problems, and the solutions they decide on are objectively really bad for any metric other than "stop this error from showing up on the screen right now".
I've been doing this for a bit over 25 years, professionally now, and... there's a lot I don't 'get' with current stuff. But I've seen and lived enough projects, in enough different situations, to have a good idea of impact of tech decisions, and to understand how to make tradeoff decisions.
I had someone call me up to 'fix' a problem in a system I'd given them 15 years earlier. It was still running, more or less the same, and... spelunking your own code 15 years later gives you a new perspective on the impact and value of the decisions you make. Many of the things people get hung up on (code style, tabs/spaces/etc, particular naming conventions, etc) provide pretty much no value when digging into old code that no one has touched or thought about in a decade. Correct comments, sample data, and repeatable tests hold so much more value, but are harder to get people to commit to following through on.
I wish I could slap anyone who gives a hoot about tabs vs spaces. Fortunately, modern languages like Go are removing the version-control churn that comes from people not caring about style while their IDEs auto-format in different ways.
I'm working in a couple of projects where there's a bunch of linter-checker things that prevent any PR merges (another... imo somewhat over-used tool) and... I split my time between Java, PHP, various SQL engines and various JS frameworks (react, extjs, vue, etc) and I'm constantly battling different mental models with various IDEs always showing different colored squiggles and highlights telling me all the ways I'm "wrong" about the code I've just typed.
Can't use double quotes!
Always use double quotes!
"Prefer string/template interpolation" in JS
Always do string concatenation per another project's standards.
Lots of different frameworks, languages, projects and companies all force different types of ceremony on formatting. "Hey, just let the tooling tell you!" turns into constant UI distractions telling you that you're 'wrong', which degrades (my?) performance. And... the value of most of these formatting things is pretty low, long term. I know that's heresy to some folks, and everyone I talk to sort of agrees, then says "yeah, but I really think standard XYZ is a good thing", but... it's nearly all preference, just like tabs/spaces.
That sounds like bad tooling. Why not have every project use https://editorconfig.org/ and then have your IDE auto-format? It shouldn't be popping up and making you fix it, it should fix it for you.
without visually flagging it, just autocorrects as I type?
And... I can just quickly swap settings for different clients? Because one has eslint block any PRs that don't have "prefer template" rules followed but another client doesn't like that style, and don't want that style in their code because it conflicts with existing style.
I don't think that counts as a formatting issue. Yes, if your clients have hard rules about different coding styles at that level then it's not a technology problem (nor is it likely solvable with technology). I assumed we were talking about formatting issues like tabs-vs-spaces, in which case yes every single project could be different but auto-fixed.
thx. sorry, i sounded a bit snarky before and wasn't meaning to be. it's just easier for people to focus on visual issues vs operational/functionality. and switching between multiple projects/clients/standards illustrates to me how relatively unimportant some of these things are (but of course just imo).
I would encourage you to check out editorconfig more closely. The whole idea is that each project has a file that defines simple formatting rules for the various file types in the project/directory tree, and your editor will automatically follow them on a per-project basis. It's surprisingly well supported across editors.
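For a concrete picture, a minimal `.editorconfig` might look like this (the per-language rules here are illustrative, not a recommendation):

```ini
# .editorconfig at the project root; supporting editors pick it up automatically
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true

[*.go]
indent_style = tab

[*.{js,ts,json}]
indent_style = space
indent_size = 2
```

Because the file is checked into the repo, every contributor's editor applies the same rules without any per-person setup.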
I strongly suspect a lot of what remains of the tabs-vs-spaces debate is actually about how it either constrains editor/IDE choices OR requires an investment of time to deal with whichever choice someone else made.
"Why do you care, my IDE just handles this" is pretty close to saying "use my IDE," on top of "use my convention."
That's why people are asking to push this problem down to the language level the way Go has done. Go defines both the "correct" format of the code as well as a standard way for any/every editor to enforce it (gofmt etc). That eliminates the double-headed subjectivity of both "use my IDE" AND "use my convention" down to just "use the standard convention defined by the language". And people love it because finally we can stop arguing about the stupid color of paint for the bikeshed and just bloody build the damn thing.
You end up in pointless arguments. I don't care about this sort of formatting very much (I have my own default style developed over years), but I do care when other people care about it, often to the exclusion of other factors.
"but we need these tools so that we don't argue about how to format code". Well... you could... just not argue about it in the first place.
Disagree. There's value in consistency, even for the shallow matter of formatting. There's a reason so many large-scale software-development companies care about coding standards.
You're right that it's not a matter of there being one true style. I agree with Kevlin Henney though that there are certainly wrong ways to format code - https://youtu.be/ZsHMHukIlJY?t=1027
I think it's not at all pointless to have consistent code style throughout codebase. I agree, though, it's pointless to argue (thus losing time) about "correct" style. It should be enforced on project or company level, no discussions between developers there.
> You can be exposed to them, but without understanding them, and experiencing both good, bad, and really bad 'solutions' to them, and understand the impact (on the business, on the code, on security, on maintainability, etc)... you just can't really get all that in 2 years.
I don't think this is about becoming an expert. This is about learning at least the very basics in multiple areas so that you are not completely clueless and know the issues exist. When you redefine the requirement as "2 years' worth of active learning", you've moved the goalposts quite far.
> I've been doing this for a bit over 25 years, professionally now, and... there's a lot I don't 'get' with current stuff. But I've seen and lived enough projects, in enough different situations, to have a good idea of impact of tech decisions, and to understand how to make tradeoff decisions.
>Not really. That is exactly the point the article is trying to make. Developers need to care about these things - _enough to know who to go get help from_.
That isn't it. They aren't telling you to learn about database details for the sole purpose of "DBA handles databases. Go ask the DBA database questions".
It's more than that. They're offloading specific knowledge onto the dev and then making them accountable for it. It's a reaction to common questions and an attempt to answer them all at once by teaching devs answers to common questions. This is a noble goal, but it's problematic in aggregate from multiple perspectives.
The aggregate goal of all of these areas trying to teach developers their own specialties is to make developers the masters of low-hanging fruit.
> Database, security, accessibility, performance, infrastructure, tooling and productivity, business concerns, workflow processes (agile), language concerns, new things
Hammers, nails, tape measures, saws, levels, reading blueprints, adhesives... If you want to be a professional, you need to learn the tools of your trade. Being able to work directly with a SQL database is a foundational capability in the software development trade. ORMs are a mental crutch that are overused to the detriment of many systems.
Speed squares are a mental crutch that are overused to the detriment of many structures...
Look, I actually agree with you about ORMs, but come on, this is a pretty bad take on the problem.
The issue is rarely the tool itself, it's the changing requirements around how it's used.
You can have the best hammer swing in the world, and maybe you sling the best tape ever. But if the building codes are revamped in non-trivial ways every 6 months, you're still going to want someone spending dedicated time understanding that. If your job is to assemble the stairs, you focus on that instead of wasting time asking why this blueprint happens to place the stairs at a slightly different angle, or why some places demand kiln dried timber vs simple construction grade.
Your area of expertise is NOT the building code, it's integrating the stairs into the rest of the structure and actually having people walk up them.
>If you want to be a professional, you need to learn the tools of your trade.
The tools of the trade have constantly been expanding; they're not set in stone as you imply.
It's like me telling a carpenter that they now also must become an electrician, window installer, insulation installer, HVAC installer, steelworker, concrete worker, brick layer, and security system installer.
Your new job title is "Fullstack Building Developer" instead of "Carpenter". We can't afford to have 10 specialists that all do a great job, we can only afford 1 person doing a poor/mediocre job in 10 things.
> Perhaps there is a reason why DBA, SRE, Devops and Architects are separate roles?
It's mainly an artifact of the way we've broken up degree tracks, and the boundaries that each group is taught to stay inside, lest any particular group actually ends up culpable for a failure.
Make fun of the "full-stack rockstar ninja" all you want, but the reality is that it is possible to have a functional understanding of all of these areas. It's just a matter of having the confidence and doing the legwork.
To the extent that more of your people have a working knowledge of these domains, you'll not only have a better end product, but a much easier time getting stuff done.
There's room for hyper-specialized expert consultants in each field, of course, but the myth of the myth of the full-stack developer exists primarily for political convenience. Most of this stuff is not any harder than the rest of it, and can be learned by anyone willing to sit down and learn it.
The phrases "functional understanding" and "working knowledge" are gigantic sucking tarpits.
Before I got my current job, I felt confident in my knowledge of computer networking at the LAN level. I knew I wasn't going into the telecom world and I knew I didn't have the knowledge to debug BGP or ensure a CO was doing everything right, but LANs? Sure. No problem. I knew DHCP, Ethernet, TCP/IP, even stuff like PPP which is more niche now. Heck, I'd even passed a college course on the subject. OK... set up an ICMP server and make it useful. That's LAN, right? Certainly gonna be used on a LAN.
I'm not saying it was hard. I'm saying that I'd never touched ICMP before except for ping and didn't know what the more advanced stuff even was. Did I have a "working knowledge" of basic networking? Did I have a "functional understanding" of how to get a building full of computers to talk to each other?
It's always something. I thought I had a good, working knowledge of networking. Someone who'd actually done networking in a corporate setting would have disagreed, and pointed to a list of things I'd never touched because those things aren't useful in SoHo LANs and aren't theoretical enough for a classroom. Multiply that by a few dozen topic headings and watch people sink under the load.
Unless you got yourself into a situation where you were expected to set up a whole office with the same speed and expertise as a full-time network engineer based on some gross misrepresentation of your skillset, I don't see how this story is particularly relevant. Maybe it's a good cautionary tale about presuming that SoHo is the limit of networking?
Technical topics are indeed both very deep and very broad. The level of sophistication and depth is how you choose your specialization, but that doesn't mean you're never allowed to learn anything else. You should learn enough about each field to know the shores when you're standing there, to be able to communicate with the "natives"/specialists, and to know when you're getting out of the shallow end. This expectation should exist for everyone: DBA, application developer, devops, network, whatever. They should all know the territory and be able to work together cohesively to identify the best place to take something down deep.
Depending on the constraints of the project, leaving the shallows means either a) developing more proficiency directly and getting deeper yourself; or b) acknowledging that you need someone with more expertise in that area to take it from there while you go back to handle things in areas you know better.
The thing we must avoid is "well I'm not a network engineer so I don't look at Cisco configs, sorry". That should be replaced with "well I'm not a network engineer, so I have no idea what's happening here, but it's still interesting, can I sit behind you and learn?"
> Unless you got yourself into a situation where you were expected to set up a whole office with the same speed and expertise as a full-time network engineer based on some gross misrepresentation of your skillset, I don't see how this story is particularly relevant.
Taking a job as one of a business's "computer people" puts you in the path of a whole lot of interesting tasks, even if your main job is programming.
> Maybe it's a good cautionary tale about presuming that SoHo is the limit of networking?
It's certainly that, but I want to expand on this a bit: SoHo is the only stuff most people can play with. For example, I can make programs and package them in Docker containers and run them that way, but I don't know how I'd play with Kubernetes in a realistic fashion. There's whole genres of technology most people can't get realistic access to without some institutional support. It's an effective ceiling on some kinds of knowledge.
As far as learning how to learn, I agree with you. I think a lot of it comes down to vocabulary: Once you know the terms the experts use, you can bootstrap effectively and learn more terms and bootstrap even more effectively. Plus, words have a way of coming back to you at odd intervals, effectively dropping you hints when you see something you vaguely recognize.
Maybe we should all have Word Of The Day calendars.
> For example, I can make programs and package them in Docker containers and run them that way, but I don't know how I'd play with Kubernetes in a realistic fashion.
Alternately, you could take advantage of $CLOUD_PROVIDER's initial signup credit and spin up cloud-instance equivalents.
Nowadays most things have good virtual environments floating around (you can even download virtualized mainframes if you want). A little bit of time tinkering with such environments will take you surprisingly far -- especially in fields like network engineering, where even most professionals don't know how to experiment.
Well likewise, many of the DBAs I've worked with don't try and understand the software needs, and want the world to revolve around what is optimal for them.
In either case, someone needs to have moderate expertise in multiple areas of tech. If it's wrapped up in one person, the business needs backups to deal with the bus factor. If it's spread across multiple people, you now need to be selecting for people with moderate tech expertise AND the ability to communicate and work together with other people effectively. And you need someone to manage their process (in some manner - doesn't necessarily mean micromanaging).
So the 'full-stack rockstar', if they have good communication skills, can exist and be valuable, but still isn't the best long term solution for a business of any size.
Hell Yes. Break it down further: the front end demands that. You could split further into front end architecture, front end UI performance (DOM etc.), web performance (taking into account network etc.), tooling. Even getting NPM to play nice could become a full time expertise!
The "developer". Can do anything that involves smashing at a keyboard! It's like calling a politician a "talker". Oh well you could do ground control for the next moon launch then, that's just a "talker".
As a developer: I do care, but it's hard for me to focus on building software if I also need to think about DBA tasks, DevOps tasks, and so on. These things all take time, patience, and energy. I've just spent most of today running and re-running a CloudFormation template to create a SQL database. Most. Of. A. Day. It's partly because I'm not a DevOps expert and even if I wanted to be one, that would also take time, patience and energy.
> It's partly because I'm not a DevOps expert and even if I wanted to be one, that would also take time, patience and energy.
It's this exact divergence that creates the disconnect. If someone doesn't understand and doesn't have to care about the whole experience, they're going to focus on their side and stop when their side is good enough.
On the other hand, if that same someone is going to be regularly developing the application, making changes to the database, and triggering deployments, they will find a way to make the process flow. They'll make it adaptable enough that it's not a day-long pain any time they need to run a migration or spin up a test DB.
The whole toolkit can and should be available because every piece of the stack opens up new possibilities. You want, and at least at some level, can have, people who know this well enough to make good use of it. Nitpicking over "not my specialization" is the antithesis of a smooth engineering process.
> Nitpicking over "not my specialization" is the antithesis of a smooth engineering process.
Counter argument: jack of all trades, master of none.
I'm already a full stack developer. I work on a complex front end application, a GraphQL server, a .net core platform split into multiple microservices, and a MSSQL database. I know my way around these components fairly well now but it's taken a good couple of years to get to this point.
I could also invest a bunch of my time learning all the intricacies of cloudformation templates and how IAM roles work too, sure. But is it the best use of my time as a developer, when I'm much more productive writing code?
Stubborn people like me keep doing the right thing, but the fact is that there are kudos and recognition for implementing a new feature. There is nothing for keeping servers from crashing, and only a little for reducing server count if you wait for things to get bad first.
The problem with doing things right the first time is nobody appreciates how hard it was. And you will sometimes get questioned about your loyalty and your competence to do the job.
This kind of "top-to-bottom" architecture approach reminds me of Apple. They have this notion that you can't provide a good product if you control only the software or only the hardware. You need full control of both, designing in the synergies to produce a really top-notch outcome.
If you read the blogs by large shops like Google, CloudFlare, or Facebook, they do the same thing on the server side. They design the software for the failure modes of the hardware, and conversely the hardware is designed to be low cost in full knowledge that the software can tolerate high failure rates.
Why am I talking about hardware? Because in a typical n-Tier design, "The Database" is just one of the pancake layers between the metal and the Internet. Every layer matters, and every layer interacts with the others. Hearing developers call themselves "full stack" is hilarious to me. They're basically saying that they know 2 out of about 20 layers! Do they know about load balancing persistence? The security tradeoffs made by TLS 1.3 zero-RTT? Cache-control headers? Disk latencies? Automatic scale-out? Virtual machine affinity and anti-affinity rules? Backup and restore? Etc...
The original post about how various databases treat transaction isolation modes is a tiny, tiny fraction of what a typical developer ought to know about the "layers below the web server" but likely doesn't. For example, about 50-70% of bespoke software I've seen in the wild either doesn't use database indexes at all, or uses them ineffectively. About 80% of websites either do not use cache-control headers, or if they do, they'll often end up with front-end errors due to the caching layers violating the consistency of data coming in from the database. Sure, it's a maxim of the industry that cache invalidation is hard, but there are workarounds such as constructing URLs based on the content hash.
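That content-hash workaround can be sketched in a few lines; `hashed_url` below is a hypothetical helper, not any framework's API:

```python
import hashlib

def hashed_url(path: str, content: bytes) -> str:
    """Build a cache-busting URL: embed the first 12 hex chars of the
    content's SHA-256 digest in the filename, so the URL changes
    whenever the content changes and stale caches are never reused."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, dot, ext = path.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{path}.{digest}"

# e.g. hashed_url("app.js", js_bytes) -> "app.<digest>.js"
```

Assets served under such URLs can then carry a long-lived `Cache-Control: max-age=..., immutable` header, since a new deploy produces new URLs rather than mutating old ones.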
All of the fancy transaction isolation modes are, to me, wishful thinking. Get more developers to use indexes first, and then come back and teach them the esoteric stuff after that!
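The index point is easy to demonstrate without any infrastructure: SQLite's EXPLAIN QUERY PLAN switches from a full table scan to an index search the moment an index exists (a sketch; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

def plan(sql: str) -> str:
    # The last column of an EXPLAIN QUERY PLAN row describes the access strategy
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

query = "SELECT total FROM orders WHERE customer_id = 42"
before = plan(query)  # a full table scan, e.g. "SCAN orders"
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # an index search, e.g. "SEARCH orders USING INDEX idx_orders_customer ..."
print(before)
print(after)
```

The same before/after check (via EXPLAIN in Postgres or MySQL) is a cheap habit that catches most missing-index problems before they reach production.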
Application developers don’t care about data - they want the database to be a dumb storage unit because they only care about their application functionality.
If you can convince your application developers of the prime importance of data, they will start caring about their database.
You’d be surprised how much rank and file developers can care about observability and reliability. The key to unlocking this is making them responsible for how their code runs in production by adding them to the on call schedule.
I'd guess that problem might go away if you make their bonuses contingent on sleep numbers (along with the SLAs) for whoever has to run the application or its downstreams, whether that's a dev or an ops type. Base the number on PagerDuty calls or something like that, so they can't game it.
That is a wonderful idea. Managers are responsible for the performance of their teams. NOBODY wants to sandbag sleep numbers, so it'd be a decent comp metric.
> Observability, reliability, scalability - these are all words that are translated into either "someone else's problem" or "unproductive busywork" in their minds.
The root of the problem is the same reason why Google keeps creating completely brand new applications instead of just maintaining and improving their existing ones. Maintenance is not rewarded. Anything existing is not rewarded. Management only focuses on new customers brought in by new features or product asap. Management doesn't care about maintenance; they care about growth, so developers have no real interest in it. Words tend to be empty. If you want to see how maintenance is really valued, look at the company's promotion system.
So, they really aren't that easy to bolt on, if not considered from the beginning. Monoliths, for example, are a real PITA to make reliable and scalable.
Worse, once your company is successful, there will be an endless list of features to add to your product, meaning nobody has the time to "bolt those features on". How the product begins is how it often lives on, well past its expected lifespan.
I sometimes feel in the minority. I love databases. When I work in the Ruby on Rails ORM ActiveRecord I can actually visualize the SQL it is generating in my head and also do all sorts of tricks when needed.
That's the key. ORMs get a bad name but most of the time you just want to display a list of things, or one thing in more depth or maybe create a new thing.
ORMs, unfortunately, have a habit of getting in the way when you want to do something they don't natively support. When they ignore things the database provides, people end up reinventing the wheel. Rails' implementation of enums is a good example of this.
> Rails' implementation of enums is a good example of this.
The advantage of ActiveRecord enums (vs db-native) is that you can change the list of valid values without having to run a whole ALTER TYPE - and all the overhead that entails if you want to do it without downtime in a production db.
A nice "trick" is to declare your enum column on the database as a string, and your enum in the code as a hash with string values. So you can have self-explained data saved onto the database, and all the niceties Rails gives you from the enum.
> That's the key. ORMs get a bad name but most of the time you just want to display a list of things, or one thing in more depth or maybe create a new thing.
If you just need to display a list of things then there's nothing much simpler than:
    var result = exec_query("select * from things")
    foreach (var row in result) {
        // output html or something here
    }
The problem is we decided this was bad and had to add more layers, abstractions, translations and complications in between the database and the output, then we needed tools like ORMs to help with that.
In Ruby on Rails, ActiveRecord has a method called pluck that returns the result set as an array of values, or an array of arrays when you pluck multiple columns. I've totally done this in areas that need performance. In my experience, if you are just concerned with showing some data, you should specify the columns instead of doing a SELECT *. ActiveRecord also supports that.
So, for the bulk of cases, an ORM is safer than straight SQL and allows for more easily testable business logic. It doesn't block doing the things you suggest, even if that's not the most common approach.
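For the straight-SQL path, the two habits that matter most are selecting only the columns you need and binding inputs as parameters rather than concatenating them into the query. A minimal sketch of the "display a list of things" case using Python's built-in sqlite3 (the schema is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, name TEXT, owner TEXT)")
conn.executemany(
    "INSERT INTO things (name, owner) VALUES (?, ?)",
    [("widget", "alice"), ("gadget", "bob"), ("gizmo", "alice")],
)

owner = "alice"  # imagine this arrived as user input
# Explicit column list (no SELECT *) and a bound parameter (no string concatenation)
rows = conn.execute(
    "SELECT id, name FROM things WHERE owner = ? ORDER BY id", (owner,)
).fetchall()
for row_id, name in rows:
    print(f"<li>{name}</li>")  # "output html or something here"
```

The bound `?` parameter gets you the injection safety an ORM would otherwise provide, while keeping the query as simple as the pseudocode above.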
There's truth to this, but there are also reasons for this.
First, it may be fair to say that infrastructure is in another domain of problem solving than structuring and implementing program logic. (Speed requirements and crunch time tend to lock a person into a single domain.) Second, those developers who do understand the importance of infrastructure tend quite naturally to care for the levels which are closer to them and which they actually may understand (like the programming language and how it actually implements things, standard libraries, dedicated libraries, the OS, any middleware). Finally, SQL via networks (and all the flavors this may come in) is hard; there's a reason why there are specialists.
Speed of development, of course, is orthogonal to concerns about infrastructure. Which may be a major concern. Communication (a constant source of enterprise horror stories) and general awareness (on all levels) may help. These are not problems that can be solved on a single side of the implementation. And then, there's certainly some kind of "developer heroism", which doesn't especially help.
(I guess, I've successfully managed to have messed up with everyone? ;-) )
Probably because they don't need to think about those things.
I started my career as an Embedded Software Engineer, where memory allocation and clock cycles had to be managed. Our software ran on systems with limited memory and had to fit in one 60 Hz cycle. We supported VAX systems that used VAX floating point, and had to be cognizant of both floating-point conversions AND endian byte encoding.
These days, such concepts are basically just trivia answers for interview questions.
I had to think about those things because it was required. System software would crash, and debugging it on an expensive government flight sim 1000 miles away was impossible.
Perhaps for the developers you work with, they don't need to think about those things because they're not required to. After all they have access to a dedicated DBA/SRE/DevOps/Architect guru.
Not meaning to come off as trite, but I know my response will sound like I am.
But, what you described is exactly why senior devs/engineers make 2x, 3x, 4x a junior.
Seniors make more because they have that extra knowledge you're talking about, which can take a decade or more to accumulate. The idiosyncrasies between different DBMSs, or even languages, or even specific versions of languages. That knowledge doesn't come cheap. It literally takes years to develop.
This is partially why I think bootcamps are bullshit. I'm sorry, but there is no way to become a competent full-stack developer in a few months, especially if you're not targeting web/browser. Give me someone with 3 months of bootcamp C++, and I'd be surprised if they can get a nontrivial program to compile, let alone link.
For those devs the solution is that DBAs provide stored procedures and predefined views and block them from accessing anything else. Or address it on the app level, have someone knowledgeable create repositories with all the queries, and mandate the rest of team to stick to using only these predefined methods. Either way, people who don't care about infrastructure should have their access to that infrastructure maximally limited and wrapped in safety nets. For both sides to sleep better.
>Observability, reliability, scalability - these are all words that are translated into either "someone else's problem" or "unproductive busywork" in their minds.
hence why you work in a team.. everyone provides their domain expertise.. we don't need a DBA dictating strategies for immutable state management in the UI to FE devs.
add to that "don't want to declare/worry about types", "can't be bothered about RAM usage/cache coherency/etc.", "concurrency", etc. etc. All of those are "premature optimizations".
And they are, until they aren't. And I really do mean that in both directions: much of the time, the simplest, dumbest, most naive solution is 100% fine for realistic load for the foreseeable future. And then some of the time it breaks terribly and you do need to spend effort optimizing it, whatever that entails.
To be clear, I am not advocating optimizing everything down to the last bit here. The problem is that when there IS a need, many would not have even the slightest clue where to begin, because all those "low level" concepts were relegated to oblivion.
The question rarely asked is, "Can your team go from zero to sixty on this sort of thing when this happens?"
Technically optimizing as you go might be a 'waste' of time, until it isn't, because your team can handle a serious performance problem in days instead of months (or IME, never).
I never realized this before but many excellent developers struggle with SQL beyond simple SELECT statements. I have a colleague who is by all accounts a deeply technical person but one day he confessed to me that he didn't really grok SQL and that he'd rather work with a "real" procedural programming language to just store and retrieve data.
Part of it may be due to the fact SQL isn't really a programming language but a declarative DSL for manipulating sets and tables. Things like GROUP BYs and PARTITION BYs (window functions) that come naturally to mathematical types/functional programmers are less intuitive to procedural programmers.
I suspect this was what attracted developers to noSQL databases like Mongo in the first place -- it's more attuned to a programmatic mindset.
(this is not universally true of course -- many programmers have no issues with SQL at all.)
I firmly believe that every developer should spend 2-3 weeks early in their career working with nothing but SQL. It will pay huge dividends for the rest of it.
IMO a lot of the issue is that developers using Java or PHP were, for many years, using SQL to handle everything. The application language was a pass-through layer between the client and the database.
Your goal was to accomplish as much as possible in a single query and then simply return the results of that query to the interface. That meant formatting numbers or currency in your SQL. Optimizing inserts or updates to be handled in a single query. Grouping, counting, left/inner joins, HAVING clauses to filter on aggregate results. More than 1 or 2 queries for the primary area of the screen was both a rare and foreign experience.
And then ORMs started to slowly integrate themselves into the flow of various frameworks to automate the repetitive CRUD tasks. Then, to address scaling and bloat problems, we saw an uptick in REST APIs and microservices that further made those ORMs the norm... and then many developers started actively trying to stay within those API constraints to an almost religious degree, which led to nested payloads becoming acceptable, fueling the whole "NoSQL" situation, along with the idea that it was somehow better to repeat the same data thousands of times over.
A whole lot of people pushed back against this and eventually it mostly ran its course. I've often seen resistance to SQL driven by fear of SQL more than anything else. As soon as people get a basic comfort level with SQL, it becomes almost automatic.
> The application language was a pass-through layer between the client and the database.
This style of doing things resulted in spaghetti style unmanageable databases, filled with an unknowable number of triggers and procedures, all written in PL/SQL (which is much, much worse than either Java or PHP). The reason why ORMs started to become popular is that you can write your application without filling your DB with arcane and inscrutable logic
There's a middle way which is very powerful: SQL views (just SQL queries; no triggers or procedures)
Here's a powerful mindset trick: think of SQL views as a sort of REST API, but whose access language is SQL rather than HTTP, and which returns data in tabular form rather than hierarchical JSON.
I once tried to build a REST API to a database, and someone told me I already had a battle-tested and highly performant API that outperformed REST at scale -- it's called SQL. A SQL view is a dynamic lens into the underlying tables, so even if the underlying tables/schemas were to change, your consumers don't care as long as they can access the SQL View.
SQL views are also composable: you can build SQL views on top of other SQL views, and any changes made in the base views are propagated throughout. Need to add/transform a field? Do it in the view. Need pull in auxiliary data? Bring it in through a JOIN in the view. I've built many systems by composing SQL views and they're very maintainable and very flexible. They're kind of like function compositions but on tabular data.
The rule of thumb is: always access a database through a view, never the underlying raw tables. In computer science, a great many maintainability issues are alleviated through a layer of abstraction/indirection, and SQL views provide exactly that.
This centralization of the core logic becomes especially powerful if the database is accessed from multiple consumers (webapps, analytics backends, Tableau, ML tools, etc.) The "API" remains consistent throughout.
Do you have any example code that shows how this works? I get what you’re saying intuitively but example code will help me bring it to the table.
What about cross cutting concerns? I’ve found stored procedures to be a performant solution here. By version controlling them, and limiting to pure functions, I found them quite maintainable. Would you instead just define a new view, or extend an existing one, or refactor into a separate view that’s then joined into the existing views?
I haven’t delved as far as views, admittedly. One app featured a bit of complicated logic and eschewing the ORM in favour of raw SQL helped (instead of getting tangled up in Demeter chains). Despite new developers, who have used purely ORM for years, shitting their pants at the raw SQL, both of us who worked on it felt it was the right call. We feel much better about leveraging more of the database in new projects.
In fact, when we took our experience to a Django project, my colleague wrote a Manager method in such a way that an ORM-favouring developer questioned it because it looked too much like SQL. But it was the obvious implementation to us after using raw SQL. And, after benchmarking, the most performant.
1. Let me try with a simple example. Suppose you have a fact table A with fields (ItemID, Item, Amt) where Amt is in USD. Rule of thumb is: don't expose A to the consumer; instead write a SQL View V_A and expose that instead:
CREATE VIEW V_A AS SELECT ItemID, Item, Amt FROM A
Then suppose a European counterpart wants to use the same API but needs the amounts to be in Euros. You can write another view: (in practice the conversion 0.92 shouldn't be a static number, this is just for illustration)
CREATE VIEW V_A_EURO AS SELECT ItemID, Item, Amt * 0.92 AS AmtEUR FROM V_A
Expose this to the Europeans. You can keep stacking views on top of other views. Your U.S. consumers will always see the data through the lens of V_A and your European consumers will always see it through V_A_Euro.
Suppose the underlying table A now changes. There's been a merger and the company now stops reporting currencies in USD, and everything is now in British Pounds so your DBA adds a field AmtGBP and starts populating that field instead. Amt still contains historical data, but moving forward the data in Amt will be NULLs; AmtGBP is the new internal baseline currency. From a VIEW perspective, all you have to do is:
ALTER VIEW V_A AS SELECT ItemID, Item, ISNULL(Amt, AmtGBP * 1.23) AS Amt FROM A
Your V_A and V_A_EURO consumers (could be Tableau, Excel, other SQL views, etc.) will still happily receive data per usual, unaware of the internal changes (the British are coming!) that have occurred. Contract kept.
Table A <- View V_A <- View V_A_Euro
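The whole worked example above can be run end to end in a few lines. This is a sketch using SQLite via Python's sqlite3 module, purely for illustration: SQLite has no ALTER VIEW, so the views are dropped and recreated where Postgres/MSSQL would use CREATE OR REPLACE VIEW or ALTER VIEW, and ISNULL becomes COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base fact table and the two views that consumers actually query.
conn.executescript("""
CREATE TABLE A (ItemID INTEGER, Item TEXT, Amt REAL);
INSERT INTO A VALUES (1, 'widget', 10.0), (2, 'gadget', 20.0);
CREATE VIEW V_A AS SELECT ItemID, Item, Amt FROM A;
CREATE VIEW V_A_EURO AS SELECT ItemID, Item, Amt * 0.92 AS AmtEUR FROM V_A;
""")
print([round(amt, 2) for (amt,) in conn.execute("SELECT AmtEUR FROM V_A_EURO")])
# [9.2, 18.4]

# Schema change: AmtGBP becomes the internal baseline; Amt is NULL for new rows.
# SQLite cannot ALTER VIEW, so drop and recreate (Postgres: CREATE OR REPLACE VIEW).
conn.executescript("""
ALTER TABLE A ADD COLUMN AmtGBP REAL;
INSERT INTO A (ItemID, Item, AmtGBP) VALUES (3, 'gizmo', 5.0);
DROP VIEW V_A_EURO;
DROP VIEW V_A;
CREATE VIEW V_A AS SELECT ItemID, Item, COALESCE(Amt, AmtGBP * 1.23) AS Amt FROM A;
CREATE VIEW V_A_EURO AS SELECT ItemID, Item, Amt * 0.92 AS AmtEUR FROM V_A;
""")
# Consumers of V_A still see a single Amt column, old and new rows alike.
print([round(amt, 2) for (amt,) in conn.execute("SELECT Amt FROM V_A ORDER BY ItemID")])
# [10.0, 20.0, 6.15]
```

The key observation is the second SELECT: V_A's consumers keep receiving a single Amt column even though the underlying table now stores GBP. Contract kept.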
2. Cross-cutting concerns come in many forms, so I'm not sure I can address them all. Stored procedures are definitely an acceptable abstraction -- they accept parameters and can return tabular results just like VIEWs. They do however work in a procedural manner (like subroutines) and can produce side effects, which is sometimes necessary to accomplish very specific tasks. VIEWs on the other hand are more like pure functions (unless random number generation is involved), with no side effects. Because views are dynamic, they flex with your data and VIEW definitions.
There's another step that could be added there, too: after the ALTER VIEW, A could be slowly and incrementally updated over however long you need to back-populate AmtGBP, and the views will continue to just work the whole time. Once done, V_A can be simplified to remove the ISNULL, and Amt can be dropped from A. That way you don't get build-up of cruft over the years, and the experience isn't interrupted for the migration.
(Possibly a bad idea for currency conversion for various reasons, but just wanted to mention it since this type of migration may be just right for other data)
Is there anything you recommend for handling SQL definitions in version control, development and production envs?
For production, I created a command on the app that loads the stored procedures into the DB idempotently on each deployment/configuration. This won’t work if the app server scales but allowed us to store stored procs in VC.
For development, we ran the command on each page load as a sort of hacky “live reload”. It didn’t work well (which highlighted the issue with scalability in production) because Postgres, fairly, doesn’t like parallel redefinitions of the same stored proc.
I’m not sure how best to automate this. For production, seems like a case of running a command once per DB server.
And in development, using a fs watcher that loads changes in.
But I don’t know, this is new territory for us and I couldn’t find anything out there to manage it within the context of a web framework. Perhaps I’m searching for the wrong thing.
Web frameworks like Rails/Django use the idea of migrations to make changes to the database. The idea is that you have a set of migration scripts like:
migrations/1765_create_table_users.sql
migrations/2891_store_procedure_x.sql
migrations/5892_change_store_procedure_x.sql
(.sql/.rb/.py, it doesn't matter).
And you have a "migrations" table in your database that contains the numbers of the migrations that have been run:
select * from migrations;
version
----------------
1765
2891
Every time you deploy to production automatically check which scripts in your db/migrations folder don't exist in the migrations table and run them. (In the current example, you'd run the 5892_change_store_procedure_x.sql that hasn't been run yet).
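That bookkeeping is only a few lines of code. Here's a minimal sketch of such a migration runner in Python with SQLite; the migration scripts are inlined in a dict rather than read from a db/migrations/ folder, and the table and column names are just illustrative:

```python
import sqlite3

# Hypothetical migration scripts keyed by version number; in a real
# project these would live as files under db/migrations/.
MIGRATIONS = {
    1765: "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);",
    2891: "CREATE TABLE reports (id INTEGER PRIMARY KEY, title TEXT);",
    5892: "ALTER TABLE reports ADD COLUMN owner_id INTEGER;",
}

def migrate(conn):
    # The bookkeeping table: one row per migration that has been run.
    conn.execute("CREATE TABLE IF NOT EXISTS migrations (version INTEGER PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM migrations")}
    for version in sorted(MIGRATIONS):
        if version not in applied:
            conn.executescript(MIGRATIONS[version])
            conn.execute("INSERT INTO migrations (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # safe to re-run: already-applied versions are skipped
print([r[0] for r in conn.execute("SELECT version FROM migrations ORDER BY version")])
# [1765, 2891, 5892]
```

Running it twice shows the idempotency: the second call finds every version already recorded and does nothing, which is exactly the property you rely on at deploy time.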
How to do this with functions/procedures?
You commit the function definitions in a functions folder to your version control system, like:
db/functions/report_x.sql
CREATE or REPLACE function report_x() returns ...
When you change this file, nothing happens; you need to create a migration to re-run this code once. In Rails, the migration would be:
class UpdateReportXFun < ActiveRecord::Migration[5.2]
  def up
    execute File.read(
      Rails.root.join('db', 'functions', 'report_x.sql')
    )
  end
end
Yeah, I’m aware of that, thank you. I was wondering if there was a way that gave a faster feedback loop and allowed for bug fixes without creating a new migration.
You don't need to write the migration until you're done. It's possible to have a very tight feedback loop in any case.
I'm doing a lot of work in a Rails codebase where I edit views/functions/procedures all the time. My setup is quite usable.
My current setup: I edit those .sql files and run them with psql in my local while developing (without writing any migration yet).
I have something like this running on one screen to make sure the modified files are executed by psql immediately as I change them (you could use `guard` too):
and I edit the db/functions/*.sql files freely, adding things, changing behaviour of functions and they are updated on the fly. (I can run tests -or try things in the browser- to verify my changes work as I expect).
--
Once I finish and I know everything is great, I just add the migration. The migration is simply an indicator of which files I've modified, and specifies the right order to run them (which is useful if there are dependencies between them), like:
# migration
def up
  execute File.read(function1_sql_file)
  execute File.read(function2_sql_file)
end
I could have an alias that automates generating that migration but it's just 4 lines...
[ I'm also using pgTAP to write tests for functions, it's quite nice :) ]
Oh wow, now I see what you mean. Thank you! That’ll work great. I wasn’t aware of ‘entr’ either, that’s exactly what I had in mind!
I’ll have a look at pgTAP too. Naturally we want to test in CI, I can see this working really well. I did look at myTAP too, since we have a few MySQL instances.
Agreed, and a good example of this is implementing search. You can define a view on top of your searchable entities that includes the urls to the entities, as well as searchable metadata (entity descriptions or whatever). So when you add new searchable items, you just update the view to include them and the code to select from the view doesn't change.
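A minimal sketch of that idea, using SQLite via Python's sqlite3 (the table names, URL scheme, and the UNION ALL approach are all illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER, title TEXT);
CREATE TABLE products (id INTEGER, name TEXT);
INSERT INTO articles VALUES (1, 'SQL views explained');
INSERT INTO products VALUES (7, 'SQL cookbook');

-- One searchable surface over several entity tables; the app only
-- queries V_SEARCH and never needs to know what is behind it.
CREATE VIEW V_SEARCH AS
    SELECT 'article' AS kind, title AS text, '/articles/' || id AS url FROM articles
    UNION ALL
    SELECT 'product', name, '/products/' || id FROM products;
""")
print(conn.execute(
    "SELECT kind, url FROM V_SEARCH WHERE text LIKE '%SQL%' ORDER BY kind"
).fetchall())
# [('article', '/articles/1'), ('product', '/products/7')]
```

Adding a new searchable entity is then just another UNION ALL branch in the view; the application's search query doesn't change.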
> highly performant API that outperformed REST at scale -- it's called SQL
You are conflating many disparate things here.
SQL is a language (a DSL) for accessing data. REST is a protocol and a data transport method (one could conceive of a way to do REST without HTTP, but when reasonable people refer to REST they mean HTTP (over TCP (over IP (etc.)))).
Even REST is not an API. You can't do anything with a GET or a POST without other abstractions built on top of that. So I don't understand how anyone could make performance claims beyond something like "HTTP is slow" and "binary transport is faster", with regards to SQL vs. REST/HTTP.
SQL does not define how you receive your data. Databases each have their own methods of sending SQL and responding to it: Oracle, MSSQL, MySQL, etc.
> This centralization of the core logic becomes especially powerful if the database is accessed from multiple consumers
That's the entire point of an API. Any API. REST APIs, even.
Right, and this is normal, except that people take it too far. I love finding views nested 10(!) levels deep, hiding table-valued functions and scalar functions everywhere - you can't reason about the rat's nest created.
If you want to make simple views that expose useful nouns I am down with it, but I have seen it taken too far too many times.
Postgres has something similar called PostgREST too. I think I would only adopt these kinds of interfaces if the consumer insists on accessing the data through a REST interface.
If you are building something from scratch, or your consumers don't have a hard requirement for going through REST, I would go directly to the database view.
This is trying to avoid the reality that the data is in a database. A database is not just a flat storage area that can be poked by code from outside. It has a lot of relationships and integrity constraints that need to be maintained by the DB itself, and all this is expressed in SQL/stored procs. Trying to avoid this is basically removing the power of a DB and converting it into just an expensive and cumbersome storage area.
But now you're filling your application with arcane and inscrutable logic, with an extra layer of abstraction via the ORM to make it even less scrutable.
I think one should view a SQL DB like a microservice. Instead of REST endpoints (or gRPC or whatever), create stored procedures. These define a strong contract with your DB, the capabilities that it provides to your app(s). Now you know what the query and insert patterns are, and can tweak the table layout under the covers without screwing up your application code.
Of course you can abuse this into a spaghetti monolith, just like you can evolve a microservice into a spaghetti monolith, but you shouldn't. There's no technology that will prevent you from making poor architectural decisions, you just have to not go down those dark paths.
> Instead of REST endpoints (or gRPC or whatever), create
> stored procedures. These define a strong contract with your DB
Exactly. It took me years to grasp this, but when I did my code became much simpler.
A REST API (that returns JSON) has to contort a tabular data structure into a loosely-typed hierarchical data structure (JSON) which has to be read back, reconstructed and in many instances type-checked (e.g. DATETIMEs are not native to JSON, nuanced datatypes like DECIMAL(18,3) are lost).
Whereas a SQL interface returns data in its native tabular format with all the correct types.
I remember people I used to work with arguing against stored procedures for two main reasons.
1) Version Control - I guess a lot of the stored procedures were being put straight into the DB without recording a history of the changes. These days you could easily do this using DB migrations I guess.
2) Testing - is unit testing a thing for stored procedures? I guess again, you might be able to do this from code as well, programmatically adding a stored procedure, running a bunch of tests, and removing it again.
I do wonder - what do people generally do in practice for overcoming these objections?
Does anyone have any other objections around using stored procedures?
1) I find this kinda funny. Why are you worried about this in SQL but not for other code? It's not like it's hard to chuck Python, JS, Ruby, ASP, etc. code straight into prod; you just don't do that because it's stupid. Don't do it for SQL either. If you really want to, set up user permissions that only allow your CI/CD system to change them.
2) TBH we never built anything complex enough to need this, and I would tend to think that if you do need this you're probably overcomplicating your DB. But you could probably do something that creates a temp DB, populates example data, and then runs tests.
But going too far the other way is how you end up with performance 100-1000x worse than it should be. Which is an actual thing that happens quite often in the wild. Also often ends up heavily dependent on some ORM or framework, making rewrites or multi-client DB access dangerous and painful.
This is the exact fear of SQL that's being talked about. You can make a mess of anything, but it's not hard to have a well maintained relational DB with minimal cruft.
In my observation, ORM use has a conspicuous relationship to piles of arcane spaghetti code.
Not to mention, 50%+ of ORM managed DB schemas that I've observed don't have proper constraints, indexes, relationships, etc. Because the developers using the ORM think it's a magical tool that makes understanding SQL and relational databases optional.
There is an entire world between ORMs and PL/SQL. Programmatically constructing SQL statements is also a thing. Just because someone writes SQL does not mean SQL needs to be spread throughout the code, or that we need to have a lot of logic in PL/SQL. Of course, there will be cases where a stored procedure is desired (any kind of validation that cannot be expressed as foreign keys, canonicalization of some core data components, etc.). I worked on DBs writing SQL for several years with only minimal code in stored procedures.
Of course, the database vendor and culture also play a role. We were primarily a MySQL/PostgreSQL shop.
How did you minimize roundtrips between server and DB, or did you find that they were not a big concern?
I'm working on a project with a Postgres database, and as it gets more complex I'm moving more stuff into stored procedures, pretty much wherever a single action requires multiple statements in series (e.g. check if this thing exists, check a value, get the id of some other thing, on success update another table).
Of course I would prefer to leave as much of it in the middle tier as possible because the ergonomics are better there but I don't want to sacrifice performance.
> I firmly believe that every developer should spend 2-3 weeks early in their career working with nothing but SQL. It will pay huge dividends for the rest of it.
That pretty much describes my career, except it was more than 2-3 weeks. I agree there is value in being familiar with the ins and outs of SQL, but...
> As soon as people get a basic comfort level with SQL, it becomes almost automatic.
I don't know about that so much. I find that as soon as I need to write something much more complex than a simple SELECT statement, I am all but guaranteed to get caught by SQL's many gotchas. Eventually you realize the mistake, and that experience provides the knowledge of how to correct it, but a good language helps guide you away from being trapped by those mistakes in the first place.
SQL is not a good language. It was clever for its time, but we've learned a lot about language theory in the many decades since. It is a travesty that we haven't put more effort into designing a modern database query language. To use SQL in 2020 is like writing software in COBOL when you could be writing software in Rust. Where is the declarative query language equivalent of Rust, Haskell, etc.?
The NoSQL movement was supposed to be about improving on query languages, but sadly it soon turned into "NotRelational" instead, which killed off any momentum away from SQL that was built.
> I firmly believe that every developer should spend 2-3 weeks early in their career working with nothing but SQL. It will pay huge dividends for the rest of it.
I did this (actually for longer), but it always washes away.
One challenge for me is I don't see an incentive to learn complex SQL setups unless absolutely necessary. Since I tend to work at startups, development velocity is way more important than optimized codebases. I'd rather write the "dumb" solution that nearly every engineer can grok than the "elegant" solution that goes above people's heads.
That aggregate function, yes, I know I can do it 100% in SQL. However, it's much easier for me to selectively pull parts into memory and modify them with language tools I work with everyday. When the junior needs to modify that function, they can do it with tools they're most familiar with.
With the whole shelter-at-home thing, I had a chance to work on a simple app for my kids. One of the corollaries of "simple" was avoiding an ORM.
Implementing logic in single SQL queries can definitely be a bit challenging, but I thought it was also quite rewarding and liberating - raw SQL is incredibly powerful!
> I firmly believe that every developer should spend 2-3 weeks early in their career working with nothing but SQL. It will pay huge dividends for the rest of it.
i did tons of sql a couple of years ago on a reporting team. Now I do android dev fulltime and don't remember any SQL beyond the basics; highly doubt it will all come back to me if i tried.
it doesn't matter, you'll google the details. many people aren't aware that sql isn't just a stupid way of filtering tables, and that it can do things server-side that you'd spend hours implementing and testing on the application side, with worse performance in the end most of the time.
It's declarative, the primitives seem entirely non-intuitive, it often takes a lot of fiddling to get what you want, the behind-the-scenes execution is mostly a black box, and while it's supposed to work the same on different implementations (of browsers/databases), there are tons of little gotcha quirks.
All in all, they're both entirely different skill sets from traditional programming, and also where experience counts for a ton more than just normal logical thinking.
> It's declarative, the primitives seem entirely non-intuitive, it often takes a lot of fiddling to get what you want, the behind-the-scenes execution is mostly a black box, and while it's supposed to work the same on different implementations (of browsers/databases), there are tons of little gotcha quirks.
My reaction to that is that it's similar in a different way: everyone needs to use it but many developers don't take it seriously, avoid learning how it works, and then complain that it's unintuitive (i.e. not something they had already known) and hard to use because they're basically just poking randomly until they get close enough to what they want.
It's true that not every implementation is the same but … where else is that not true? This is also true of operating systems, filesystems, every library implementing a standard format (“Why doesn't this PDF open in …”), etc. I would have trouble supporting the belief that databases are an outlier in this regard.
If I write a C program, especially with the appropriate compiler warnings enabled, or some filesystem code, there is a high chance that it will work across platforms with no further coding necessary. Anecdotally, the same cannot be said for CSS or SQL.
It doesn't matter much for SQL, because I always know what database I'm using, but for CSS, it's a massive pain.
That's a charming belief about C which is only true if you work on the same operating system, processor, and don't do anything complicated. If those are not true, as I've experienced many times over the years, you will learn otherwise about things like differences in floating point behaviour, memory allocation and access patterns, which filesystem and/or locking semantics, etc.
In other words, it's about as true as it is for CSS and SQL, where that kind of simple use is also stable.
Yes, I write fairly boring applications. But the point is that C is mostly cross-platform for those boring applications, whereas the "cross-platformness" of CSS and SQL falls apart for even the simplest of tasks.
Again, I think you're viewing this through the lens of relative experience. I have seen tons of C code which required substantial fixes to work portably (bonus points when this was exploitable) and I have plenty of CSS and SQL which hasn't had to be touched in years.
When you look at SQL from a logical/set-based perspective, it is by no means unintuitive. Basically, all you do is join all the tables you need and then filter out everything you don't need and maybe do an aggregation here and there.
- When is a subquery actually a correlated subquery? Will this destroy your performance? Or is it a critical feature?
- Should you put constraints in the JOIN or in the WHERE? Will the distinction drastically affect performance?
- When do you use WHERE vs HAVING?
- Is the NULL from the join because no joined row was found, or because the joined row had a NULL value itself?
Etc. etc. The basic concepts are simple, but the implementation details quickly become very complex, particularly when you're ensuring high performance with indices, and making sure the query uses the indices.
And all of the questions I pose above have clear answers... but the answers certainly aren't obvious from SQL basic concepts.
And also, while JOIN seems like it ought to be intuitive, in real life it seems like it's like pointers in C -- some people get it pretty quickly, other people struggle forever.
Like everything else, write it the simple/elegant way then profile it and tweak if you have to.
Once you're at the point where you have to worry about these things, tuning the SQL is still probably much less complex than writing the query in your app language or figuring out how a NOSQL db can do these joins.
In reality it rarely works this way - there are plenty of systems falling apart due to "death by a thousand cuts" type issues. You run a profiler and most of the queries are slow, and there's no one obvious part to optimize - because developers over the years ignored basic optimizations and there are inefficiencies everywhere.
E.g., for a quick practice run, try optimizing Wordpress without making it a static page via caching - how many queries will you have to optimize and how much of a codebase rewrite will it be to make it significantly more performant?
The problem you describe is completely different; it's more a measure of the health of the system in its entirety, DB included. It can be easier or harder to fix depending on the precise cause, but that's a standard debugging skill which isn't strictly in the DBA skillset.
This is my experience especially with ORMs. It's really hard to optimize when your "top query time" list is just a list of your most frequently called methods, where there's no single obvious pathological case, but lots of little inefficiencies.
> The basic concepts are simple, but the implementation details quickly become very complex
True, but those kinds of questions come up in every language: Should I use an array or a dictionary? Should people have references to projects or should projects have references to people or both? Is money a float, an int, a decimal or should I write my own money class? Should I memoize the results? Is it thread-safe?
As you can see, this could go on forever, pretty much for any language.
They do. And language designers seek to smooth the edges and develop ways to encourage devs to write clear and intuitive code. When a language says "well, these are hairy questions that you should just figure out" we tend to criticize those languages unless they have clear reasons for that decision. It should be obvious that something is thread-compatible. "Well, if you model it in pi-calculus it is easy" is a crappy way of handling criticism.
"Its easy if you think about it mathematically" is not enough.
In which language is it obvious not to put money into a float? In which language is it obvious if I want a synchronizedSortedMap or an arrayList?
There are simply things you have to learn to use a technology. If you want to write Java, you need to know what a variable is, what a for loop is, and what inheritance/interfaces are. Likewise, in the case of SQL, you need to know concepts like normalization, ACID, and joins. Just poking around until the code works won't do it.
No language is perfect. But surely you'd agree that there is a spectrum here. In C++ you have to think "should this be a pointer or a reference" all the time. In most modern languages, you never do. And although there are clear methods to write C++ well, it is rightly criticized for being overly complex and unintuitive. For example, std::set is ordered. That's a basic mismatch with expectations. Sure, you can just check the docs. But how much code would be more efficient if they had just spent a bit of time making things easier to work with?
You need to spend time working with the language and understanding how the set based operations used in SQL work. Declarative languages means you can express things in multiple ways and get the same results. This is an incredibly important aspect of SQL
>the basic concepts are simple, but the implementation details quickly become very complex
This is no different than programming. Assuming that SQL doesn't have complexity because you can only SELECT, INSERT, UPDATE or DELETE is going to have you banging your head against the wall. Tackle the complexity in SQL like you'd tackle the complexity in your programming language of choice: read the docs, work through examples, and read how other people solve the problem. There's a ton out there for SQL.
>When do you use WHERE vs HAVING?
HAVING lets you add a condition on an aggregate function. So SUM(myColumn) > 5 would be something you put in a HAVING clause. Honestly, this is pretty clear cut.
>Should you put constraints in the JOIN or in the WHERE? Will the distinction drastically affect performance?
The first thing to understand is a condition in a join versus a where might return a different result set, specifically on anything other than an INNER join. The impact to performance will depend on the rest of your query, your data, and your index coverage. For simple cases, there is likely no difference. For complex ones, there may be an impact
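A tiny sketch of that different-result-set point on a LEFT JOIN (Python's sqlite3, hypothetical tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER, name TEXT);
    CREATE TABLE pets   (owner_id INTEGER, pet TEXT);
    INSERT INTO people VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO pets   VALUES (1, 'cat');
""")

# Extra condition in the ON clause: the LEFT JOIN still keeps every
# person; rows that fail the condition just get NULLs.
in_join = conn.execute("""
    SELECT name, pet FROM people
    LEFT JOIN pets ON people.id = pets.owner_id AND pets.pet = 'dog'
    ORDER BY name
""").fetchall()

# Same condition in WHERE: it runs after the join, so the NULL rows
# fail the test and everyone disappears.
in_where = conn.execute("""
    SELECT name, pet FROM people
    LEFT JOIN pets ON people.id = pets.owner_id
    WHERE pets.pet = 'dog'
    ORDER BY name
""").fetchall()

print(in_join)   # [('ann', None), ('ben', None)]
print(in_where)  # []
```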
> Is the NULL from the join because no joined row was found, or because the joined row had a NULL value itself?
An inner join shouldn't produce a null. That's why it's an inner join: the data needs to exist in both places. If you want to "test" whether a join found a row, look at the field you were joining to and see if it's a non-null value. Nulls won't join to nulls unless you've changed some settings in most RDBMSs. If you're looking at other fields to determine the presence of a row from a join, make sure you're looking at a non-nullable field.
>When to use JOIN vs a subquery?
A better way to phrase this would be when to just join the table, vs writing a sub query and joining to that. When is a question of the complexity of the query and performance characteristics, and that can't be answered in the abstract. The most important thing is that in a large number of cases you can do both, and knowing how to express things in both ways is powerful.
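A minimal example of "writing a subquery and joining to that" (a derived table), using Python's sqlite3 with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 10), ('north', 20), ('south', 5);
""")

# The subquery in FROM pre-aggregates a per-region total; the outer
# query then joins each detail row against that derived table, which
# is something a plain scalar subquery in SELECT can't express as
# naturally once you need several of its columns.
rows = conn.execute("""
    SELECT s.region, s.amount, t.total
    FROM sales s
    JOIN (SELECT region, SUM(amount) AS total
          FROM sales
          GROUP BY region) t
      ON s.region = t.region
    ORDER BY s.region, s.amount
""").fetchall()
print(rows)
# [('north', 10, 30), ('north', 20, 30), ('south', 5, 5)]
```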
>And all of the questions I pose above have clear answers
No they don't. Any time you're wondering how different SQL impacts performance, there's absolutely a huge "it depends" angle on it, because how you've structured the tables, index coverage, and the volume of data can have a significant impact. This is why DBAs still have jobs: the database is an incredibly complex system. You seem to be complaining that SQL shouldn't be complex, yet are not willing to accept that it is more complex than you'd assumed. It's complex. You don't need to know everything if you're just a dev, but don't just assume it's simple.
>while JOIN seems like it ought to be intuitive
I'd check the diagram here -> https://stackoverflow.com/questions/13997365/sql-joins-as-ve... Half those joins aren't needed as you can re-order a right join into a left join. For 95% of development Inner joins and left joins are all you need. The other 5% is an outer join and that's mainly needed in report writing, not app development.
The answer to the performance-related ones is simple. It always depends on your query, your indexes, and your data. The same query can produce wildly different query plans if you have different data (or even if you have the same data, but the engine decided to sample different rows!)
Best thing you can do is to learn how to read EXPLAIN ANALYZE results.
Normally, subqueries return a single value whereas joins can result in n rows of output for 1 row being joined on, and you can access all the columns of those n rows. There are ways to make use of more than one value (e.g. (tuple) IN (subquery)) but if you want to SELECT more than one value, you need to join.
Depending on the database, it might be slower to do a correlated subquery than a join though (MySQL especially).
> - When is a subquery actually a correlated subquery? Will this destroy your performance? Or is it a critical feature?
A subquery is a correlated subquery when it references symbols from the outer query. That means it needs to be evaluated once per row, and can't be evaluated once at the start of query execution. It can destroy performance if it needs to be evaluated too often - if it's in your 'where' clause and is evaluated over too many rows, e.g. it's mixed in with a boolean expression that can't be short cut.
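A small illustration of that definition (Python's sqlite3; the per-department max-salary query is a made-up example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO emp VALUES
        ('ann', 'eng', 100), ('ben', 'eng', 80),
        ('cat', 'ops', 60),  ('dan', 'ops', 70);
""")

# The inner query references e.dept from the outer query, so it is
# correlated: conceptually it must be re-evaluated for every outer
# row rather than once up front.
rows = conn.execute("""
    SELECT name FROM emp AS e
    WHERE salary = (SELECT MAX(salary) FROM emp
                    WHERE dept = e.dept)
    ORDER BY name
""").fetchall()
print(rows)  # [('ann',), ('dan',)]
```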
> - Should you put constraints in the JOIN or in the WHERE? Will the distinction drastically affect performance?
Conventionally, you should put equi-join constraints (equality expressions with foreign / primary keys in other tables) in the JOIN clause and other constraints in the WHERE clause. For inner joins, it doesn't make a difference where you put the predicate, semantically. There is a semantic difference for outer joins though (left join, right join, full outer join): failure to join results in a tuple worth of null values from one or both sides (left/right vs full), rather than eliminating the row.
Where the semantics aren't different, performance should not be affected. Of course the database engine might be stupid, but a fundamental requirement of a reasonable query planner is determining (a) join order and (b) which indexes to use for the combination of join predicate and where predicate. Any query planner worth its salt will consider using the WHERE clause along with the ON clause of the JOIN when fetching rows from the joined table.
> - When do you use WHERE vs HAVING?
WHERE is before GROUP BY and filters the rows that enter aggregation (if any), HAVING comes after GROUP BY and filters the aggregated rows. If you use a derived table (a nested query with a table alias), then you can use WHERE instead of HAVING for no semantic difference, but derived tables may execute differently (MySQL will generally materialize them, PostgreSQL will see through them).
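The derived-table equivalence can be sketched like this (Python's sqlite3, invented table; the threshold is arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (customer TEXT, amount INTEGER);
    INSERT INTO payments VALUES ('a', 6), ('a', 6), ('b', 1);
""")

# HAVING filters the aggregated rows after GROUP BY.
having = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM payments
    GROUP BY customer
    HAVING SUM(amount) > 10
""").fetchall()

# Same thing via a derived table: the aggregation happens inside the
# nested query, so a plain WHERE can filter the already-aggregated
# rows with no semantic difference.
derived = conn.execute("""
    SELECT * FROM (SELECT customer, SUM(amount) AS total
                   FROM payments
                   GROUP BY customer) t
    WHERE t.total > 10
""").fetchall()

print(having, derived)  # both [('a', 12)]
```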
> - Is the NULL from the join because no joined row was found, or because the joined row had a NULL value itself?
If the column is nullable, and you used an outer join, you can't tell. Normally you check for the primary key or some other non-nullable column to discover if a join failed (most often used in anti-join, when you want to find all rows that don't have corresponding rows in the join).
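A quick sketch of that anti-join pattern (Python's sqlite3, hypothetical tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books   (author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO books   VALUES (1, 'book one');
""")

# Anti-join: keep authors for whom the join found nothing. Testing a
# column that can't be NULL in a real matched row (here the join key
# itself) is what makes "IS NULL" mean "no match" rather than
# "matched row happened to contain a NULL".
rows = conn.execute("""
    SELECT authors.name
    FROM authors
    LEFT JOIN books ON books.author_id = authors.id
    WHERE books.author_id IS NULL
""").fetchall()
print(rows)  # [('ben',)]
```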
I had way less trouble in college with relational algebra, compared with SQL. SQL is by no means intuitive. Projection, which is what you do last, comes first in the select statement. Then there are all the join types.
Relational algebra, which is what SQL is ultimately based on, is much more elegant.
This is... one of the best comparisons I've seen and sums up the reason why I dislike SQL as well (although I know how to use it).
Sometimes it really feels like you're trying to give instructions to someone via chat which get Google-translated to Chinese, Japanese and Russian on the way - it's this very lossy communication channel where you need to tweak the language "just so" to get maximum performance. I think it's no wonder that newer DB designs opted for more direct and tailored APIs.
> it often takes a lot of fiddling to get what you want
and yet, for any nontrivial operations, opting for a few simple K:V stores instead means one basically ends up implementing an unrolled SQL engine processing loop to do the application-specific select statements you would otherwise need.
there's a reason SQLite took over from dbm files ...
I’ve been casually trying to crack open the black box of MySQL’s internals with no success yet. I have queries that do GROUP BYs across multiple unindexed columns in frequently large tables, so it ends up with temp tables on disk and filesort. Can someone point to the source code for processing this kind of GROUP BY situation?
SQL feels to me like bash or regular expressions. You can do amazing things if you do it full time. But if you do it only a few times per month or year you quickly forget all the subtleties and it gets hard to understand even the stuff you wrote half a year ago. I guess in the end things have become so complex that as a dev you can’t be good at everything. I often wish there were dedicated database guys but if you have one usually that person is sysadmin and won’t help much with coding against the database.
This echos my experience. I've written some complicated SQL queries, but I do it rarely enough that I always have to re-learn a lot of it.
There have been a number of instances where I have had to look something up seemingly for the first time, found a good Stack Overflow answer, and then chuckled to myself because I had already upvoted that exact answer at some point in the past. SQL queries are a pretty common source of this, along with uncommon git and terminal commands.
Very true about git. I would add C++ to that list. Done full time it’s super powerful. But later, or with less experience, you look at the code and think “WTF?”.
I tend to disagree, since I find SQL orders of magnitude more intuitive than regular expressions, having no problem picking it up once every couple of months without requiring to fallback on books or cheat sheets for implementing the majority of queries. Sure, every now and then we have to optimize a query or implement a complex data extraction pattern forcing us to study how to do it, but for me this is unusual.
The key aspect of SQL that makes it easy for me is readability. Once you know the basics you can understand what a query is doing just by looking at it. Regular expression notation, on the other hand, is really tough to read, and indeed I keep forgetting its syntax.
Common table expressions were all I needed to write <a href="https://modern-sql.com/use-case/literate-sql">literate SQL</a> and make it perfectly readable when I come back. But I was doing it almost full time so I wasn't forgetting much either.
In a given week I might work with all of the following: SQL, C#, Python, JS (Kendo, Vue, React), XSL, bash, and more. I'm a quick learner and I pick things up fast, always have, but I've got deadlines and I don't have the time or capacity to fully internalize the minutiae of all the technologies I have to work with. In other words I depend on the tools to show me the options at my disposal and construct syntactically correct expressions.
What I really dislike about SQL is common to most systems I dislike, where the tools aren't discoverable to me. In SQL it boils down to the fundamental syntactical requirement to put the SELECT clause before the FROM clause. So I have to build up my statement in this weird spiral pattern where I change something deeper in before I know what I can SELECT in the first place. The ability to give tables shorthand names with MS-SQL, i.e.
`left join [dbo].[sometable] st`
...is very helpful but I gather this isn't common to all dialects. Working with XML in SQL is a nightmare as the tool cannot tell you whether you can do obvious things like pass an sql:variable into nodes() until you actually query the server. (spoiler: you can't.)
I vastly prefer the functional programming approach to working with data e.g. C#'s lambda style linq. The tool shows me all the pieces I have to work with, and all I have to do is piece them together the right way.
Reading SQL is like reading German, where the last word in a long sentence determines the meaning of the entire sentence.
An SQL statement starts with "select ABC.XYZ", but you have no idea what it means, because only one screen later it is written that "ABC" is actually an alias for "T_ACCOUNT_BUSINESS_CREDITS" or something. The logical order would be "from ... where ... select ...".
Imagine a programming language designed like SQL. It would look approximately like this:
function foo(a) {
return s;
b = join(a);
s = concatenate("[", b, "]");
by the way, a is list of strings, b is string, s is string;
also, function "join" is imported from "lists", and "concatenate" is imported from "string" module;
actually, don't return the result, just tell me how many characters it would have;
}
I don't know why people think this. After much thought, I think it's because they read chronologically, which isn't how SQL works. It doesn't 'execute' in the written sequence.
The entire statement is a logical one, like a mathematical equation. You wouldn't think that the meaning of (a + b) * (c + d) implies that (a + b) has to be done first. Maybe in some computer languages it might, like early C compilers.
Given that: "select ABC.XYZ" is perfectly well formed, it's like a forward reference declaration where ABC is the alias for a table-like thing and XYZ is the alias for the column-like thing. I mostly read my SQL inside-out, starting with uncorrelated subqueries.
Another pet peeve: using the keyword INNER or OUTER is just noise.
> In SQL it boils down to the fundamental syntactical requirement to put the SELECT clause before the FROM clause
No offence but if you are at the level of struggling with the syntax then you should not be using SQL until you have more experience. There have been plenty of well-founded critiques of the crappy syntax of SQL, but it's not ultimately that hard. There are worse things you'll have to cope with such as the implications of 3-valued logic that comes with nulls.
The giving-tables-alias-names feature has been standard probably since the very first standard was released. In some cases such as self-joins, it is necessary.
If I can give you a piece of advice, arrange things using the 'with' clause and learn to break things down as simply as possible at each step. Let the optimiser sort things out for you. If performance is poor, look at the query plan. Also understand you don't have to write everything in one huge statement. Spool intermediate results off to temp tables if that helps (not @tables but #tables; @tables have their own problems).
As for the XML, I guess that may be better handled outside the DB. IMO XML should never be made part of SQL. Good luck.
I would not throw someone at CTEs who is using SQL Server - just use temp tables for each component you would be CTE-ing. CTEs don't get any benefit from re-use except from a code perspective, whereas composing your sets into temp tables will often get you exactly what you want: individual sets that you can re-use throughout your code.
Very good point indeed. I'll add to that, that you can trivially examine temporary tables whereas CTEs don't have that transparency. Thanks!
Edit: I'm going to clarify this. It is a mistake to throw CTEs at a beginner but to be clear, CTEs have very important advantages over using temp tables to hold intermediate results. It comes down to efficiency.
MSSQL only optimises within a single statement; it does not optimise across statements, so if you have a several queries comprising a single CTE, the optimiser has plenty to get its teeth into and may produce a much more efficient query plan.
Also a query plan of a complex CTE can (depending on what you're doing) end up being a straightforward pipeline where one result feeds into another into another and finally gets spat out at the end. That can be very efficient. If you use temp tables you spool intermediate results (which may be large), re-read, spool into another temp table etc. If you're working with a large data set that can use up a lot of memory, and if it's large enough that it has to spill to disk... oh dear.
So for beginners, yes, temp tables; as you gain expertise, CTEs are the way to go (depending, of course, on various factors)
(Conversely, temp tables do have a definite cardinality, whereas queries in a CTE are estimates and can be badly out leading to very poor query plans. So temp tables can work to your advantage here).
May I propose, for your long expression of frustration, a simple fix. Don't use left join. Instead select x.x,x.y,y.x,y.z from x,y where x.x=y.z and ...;
This is the same operation but without the screwup syntax. The whole notion of left join style syntax is an abomination and - excepting for those who have internalized it to the detriment of normal discourse - represents needless cognitive load.
I like SQL when I'm not writing reporting queries. GROUP BYs bite me (With MySQL 8 I end up reaching for the ANY_VALUE() function), and I end up with more subqueries than I feel I should need.
When working with time-indexed data I feel I'm forcing the database to do something it doesn't want to. E.g. if I want to answer the query "How many sales are there per day this month?" and I want an entry for every day in the month, even when there were zero sales. Or another query asking "Which days didn't have sales?" and listing the days. I haven't found a way to do this in-database. I end up answering in code based on the data I fetch.
Definitely feel it's me not SQL, but not found the answer.
Thinking about the intersection of sets is important for grokking what you're trying to do in that example (and with SQL in general). For me, at least, SQL suddenly made a lot of sense once it clicked that I was working with intersections and subsets of sets.
In your example you have (1) a set of all dates in a range, and (2) a set of sales totals for days that had sales. Set 1 could be a "numbers table" or generated with something like "generate_series()" in PostgreSQL. Set 2 is made by summarizing the data in a sales table by date using "GROUP BY" and "SUM".
Then you're just looking at JOIN'ing those sets and COALESCE'ing the NULL returned from days when there are no sales into 0.
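Here's a runnable sketch of that join-and-COALESCE shape. SQLite (used here so the example is self-contained in Python) has no built-in generate_series, so a recursive CTE plays the role of set 1; the table and dates are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sold_on TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('2020-04-01', 10), ('2020-04-01', 5), ('2020-04-03', 2);
""")

# Set 1: every date in the range, generated by a recursive CTE.
# Set 2: per-day totals for days that had sales (the GROUP BY subquery).
# LEFT JOIN keeps every date; COALESCE turns missed days into 0.
rows = conn.execute("""
    WITH RECURSIVE days(d) AS (
        SELECT '2020-04-01'
        UNION ALL
        SELECT date(d, '+1 day') FROM days WHERE d < '2020-04-04'
    )
    SELECT days.d, COALESCE(t.total, 0)
    FROM days
    LEFT JOIN (SELECT sold_on, SUM(amount) AS total
               FROM sales
               GROUP BY sold_on) t
      ON t.sold_on = days.d
    ORDER BY days.d
""").fetchall()
print(rows)
# [('2020-04-01', 15), ('2020-04-02', 0),
#  ('2020-04-03', 2),  ('2020-04-04', 0)]
```

Filtering that result for the zero rows also answers the "which days didn't have sales?" variant directly in SQL.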
> Set 1 could be a "numbers table" or generated with something like "generate_series()"
I've done it this way when doing it per-minute or per-hour - i.e. for 60 minutes or 24 hours; fixed, constant ranges. I picked this example because (to me at least) it's more difficult :)
Calendar months have different lengths, so something would have to tell it which month to use (and account for leap years). But say you wanted the last 60, 90 or 120 days? I suppose you could first create a temporary table to be the "numbers table", custom made for the range you want to fetch - is there an alternative?
I have done this many times. My preferred solution was to create a recursive CTE beginning with the desired start date and ending with the desired end date. I often based these on MIN and MAX subqueries, but obviously you could also pick arbitrary dates.
You can use days, months, weeks, years, or whatever you wanted as the "increment" in the CTE using the DATEADD function. Then (for nulls) you simply LEFT JOIN your CTE with the desired date part (aggregated) of your table. This gives you your first answer, and a simple filter will get you your second answer.
It's also common for people to just create all these date tables beforehand as actual, materialized tables. In my opinion, this is less elegant (what happens in 2101?! Somebody had better remember to add to the table!) but it naturally works just as well and probably saves some perf.
In many databases you would create a function to fill the role of that "numbers table" and, effectively, map that function onto rows. That function can express whatever "richness" you need (i.e. civil calendar month, phase of the moon, etc... >smile<) that might not be easily expressed as a series.
There are date functions in most SQL packages which can create the limiting factors you're looking for, just as in procedural languages. Once you have the list of dates you're interested in, the problem becomes simple.
I learned SQL working in Access. There wasn't anything like generate_series(), and I remember having to do things like above by having a start and end date, cross joined with a table of nonnegative integers I made, and creating a field which added an integer to the start date, subject to being no greater than the end date.
The answer is likely that your storage schema is incorrect. You have things stored in OLTP (i.e. app database) but are trying to read it as OLAP (i.e. reporting database). Once you reimagine the data in the OLAP style then these kinds of queries become simple.
EDIT: specifically for your example, in an OLAP style you would generate a Times table and then foreign key the sales table to it based on the date. Then you can easily query against that Times table as the filter/bucket for your queries.
Yes, agreed. It was designed for OLTP not OLAP, and I have to get my mindset into that.
For days as my example has, would the Time table be generated for say 1970-2050? Given months are of different lengths and there are leap years, I'm assuming this is needed, rather than storing a single year.
Yes, the time table is a pregenerated set of all possible moments in time for whatever resolution you care about. So, for example, if you cared about day resolution it might be
date_key | year_num | month_num | day_num | quarter_num | week_num | month_name | month_name_short
...
20200101 |     2020 |         1 |       1 |           1 |        1 | january    | jan
20200102 |     2020 |         1 |       2 |           1 |        1 | january    | jan
...
20200401 |     2020 |         4 |       1 |           2 |       14 | april      | apr
...
You would FK on the `date_key` and add as many columns as you need to support querying against the dimension. I also like to add a proper `datetime` representation of the date so I can easily do a date range query.
I found Kimball's book to be interesting and helpful.
I think the Kimball approach gets bogged down in temporal evolution (Slowly Changing Dimension type 1? 2? 3? 4? 5?), but that's more about the underlying absence of meaningful bitemporalism than the dimensional schema approach per se.
For anything other than the absolute simplest case you should have two databases. One for OLTP (the application(s)) and one for OLAP (the reporting).
If your reporting queries are hard or complex it's because you did a bad job architecting the reporting tables. Reporting queries should almost always be the simplest form of query if you designed your warehouse properly.
Days with sales is easy with a simple SUM or COUNT and GROUP BY.
Days without sales is easy if you think about this dimensionally.
SELECT dim_days.day_of_month,
COUNT(fact_orders.order_id)
/*
Select from dim_days first because you want every day.
*/
FROM dim_days
/*
Outer join to the fact table to pull in the data you have and add that to your dates.
*/
LEFT JOIN fact_orders
ON dim_days.calendar_day = fact_orders.order_day
/*
Filter results by the desired range.
*/
WHERE dim_days.month = 4
AND dim_days.year = 2020
GROUP BY dim_days.day_of_month;
A more functional mindset can definitely help here. Think of your "sales per day" model in terms of starting with a sequence of days--startingDay up to startingDay + n--as the input to a function that maps to an aggregate of that day's activity.
Aggregate functions in SQL are IMHO quite awesome once you develop a comfort level to stop worrying about them per se. I wouldn't like to try to get Excel to tell me--or write the code to do manually--something like "show me the standard deviation in units sold by day of week over the last ten summers."
Some of the most fun I've ever had coding has been creating "complex" SQL queries. The syntax is something that can only be overcome with memorization, but it becomes second-nature fairly quickly.
Until I started thinking about SQL as manipulating sets it never "clicked". Once it did, though, a whole world of applications filled with mismatched procedural thinking mapped on to SQL was revealed to me, and my own work became much easier.
Despite our industry's habit of treating mathy whiteboard things as indicative of one's "programming talent", real application of type, set, and graph theory are not tested nearly as much as they end up being in practice, and I think that developers' reticence to get deep into SQL is symptomatic of the weaknesses there.
The first time I accidentally typed "select 8 from users..." and, instead of an error, it returned a column called "8" with a bunch of rows with the value "8", my mind was blown.
> I suspect this was what attracted developers to noSQL databases like Mongo in the first place -- it's more attuned to a programmatic mindset.
I suspect it was mostly because the tree-like structures people were trying to represent are an absolute pain to work with when you have to shove them into two dimensional rows and columns. I doubt SQL itself had much to do with it as ORMs were already all the rage when Mongo emerged. But ORMs only slightly improve on papering over the data structure impedance mismatch. Most of the pain points present with using SQL directly remain in ORMs when it comes to this problem.
With Mongo you can just throw the tree at it and it will happily store it and nicely give it back again. Which eventually leads to its own set of problems due to how it handles said tree internally, but that's why it has fallen out of favour and SQL databases are the new hotness again.
You'll notice the frontrunners in the NoSQL movement of the time weren't relational databases with a different query language. They all took different approaches to dealing with data itself. Sometimes, humorously, they even maintained SQL as the query language. NoSQL didn't come to mean "No SQL" at all, but rather "Not Relational".
As an aside, while it might go a bit outside the spirit of SQL and introduces its own challenges, I remain somewhat amazed that we haven't seen a standard way emerge to query tree-like structures to better reflect the needs of a fairly common use case. You can kind of get there in unconventional ways, like using json_agg in Postgres, but that all seems pretty hacky.
I couldn't agree more. I think people underestimate how much complexity got introduced thanks to the object-relational mismatch.
It would be really nice if there were databases that attempted to expose a data interface that is more tree-friendly. That's why I'm keeping an eye out on projects like EdgeDB and DGraph, and even Hasura.
I think you're onto something with the observation about what type of developers take to SQL. We have a spectrum in which people think more like how the processor operates to people who think more in classes of problems and their solutions.
We run into problems when someone with a proclivity to one side meets a problem best solved by thinking on the other side of the spectrum. An example is a young dev in my organization who was given requirements for a new app that required a data store. Not being comfortable with SQL or relational databases, he chose a document store. It wasn't a good fit. Very quickly requirements expanded and caused his code to balloon into a mess of nested loops with lots of if-checks. Performance has progressively decayed as well. A simple multi-table join with filtering would have knocked off 2/3 of the code.
TFA isn't even about SQL the language at all though, it's about the scalability and reliability characteristics of databases, especially in a distributed environment.
> I suspect this was what attracted developers to noSQL databases like Mongo in the first place -- it's more attuned to a programmatic mindset.
Well, it's more attuned to the dynamically typed mindset, sure. Programmers who understand the value of static type systems should understand the value of relational schemas.
I disagree. There's more to the relational model than typing. I'm very pro-dynamic languages and still chafe at static typing, but the wonder of the relational model fits nicely with my liking for declarative and functional approaches.
(EDIT - and as another data point I dislike SQL's syntax. The semantics are bearable but the syntax just makes my brain melt)
They're not identical concepts, but both relational schemas and statically typed programming languages provide assurances about basic structure. Mongo, and Python, offer no such assurances, and leave it to the developer to get it right.
MongoDB supports Schema Validation since MongoDB 3.2 (Dec. 2015) and JSON Schema since MongoDB 3.6 (Nov. 2017) so MongoDB can enforce types strictly if you want to at the collection level.
NoSQL doesn't mean NoSchema or NoTypes.
https://docs.mongodb.com/manual/core/schema-validation/#json...
So they bolted it on eventually then. The relational databases went the other way and bolted-on schemaless data [0] [1]. We can still compare the schema-based and schemaless approaches.
> both relational schemas and statically typed programming languages provide assurances about basic structure.
Yes I get that but I was trying to make the point that there's more to the relational model than just assurances. It offers a richly entwined semantic structure and an elegant route to reducing duplication and the risk of inconsistency that comes with it. None of this is in tension with dynamic typing.
(Also - dynamic languages aren't just for sloppy thinkers and cowboy coders. There's actual value that is lost in moving to static typing even if you think the trade-off is well worth it)
Part of the issue is that a complicated database can handle the same SQL query many different ways based on indexes and other configurations.
This kind of "magic" isn't always clear when programmers are mostly used to working with data structures and procedural code.
The other problem, IMO, is that programming languages are very poor at bridging the difference between the SQL domain and the language domain. We really need plugins for compilers because ORM libraries often are harder to learn than the database itself.
> The other problem, IMO, is that programming languages are very poor at bridging the difference between the SQL domain and the language domain. We really need plugins for compilers because ORM libraries often are harder to learn than the database itself.
Actually, in the past year I discovered that JetBrains IDEs (in my case PyCharm) have a nice feature that it seems not everyone is aware of. It comes from their integration with DataGrip. If you configure the IDE to connect to your database, it will start looking for SQL in your strings, then do code highlighting and autocomplete table names and fields. If you use refactoring it will understand SQL, and if you refactor the database it will produce migration statements. It appears to solve all the issues that an ORM is supposed to solve, and you still have full control since SQL isn't abstracted away from you.
>This kind of "magic" isn't always clear when programmers are mostly used to working with data structures and procedural code.
My problem is that it is like some sort of black magic to me. If I write a complex query I have no idea if what is spit back to me is actually what I want. The only way is seeding lots of records and then manually checking that each filter and calculation is doing what I want.
In code complex things can be broken down into more simple items. Then I can reason about and test those building blocks into something I understand and am confident that is working as intended.
Yes, SQL's biggest fault is that it's not very composable. Complex queries end up being long and repetitive, and the order of the parts of a query is totally unintuitive (it should go something like: FROM, GROUP BY, SELECT, ORDER BY rather than SELECT, FROM, GROUP BY, ORDER BY, which makes autocompletion hard).
One thing I never understood is that, since SQL and its alternatives share the same theoretical IR (the relational algebra), it shouldn't be that difficult to implement alternative relational languages like QUEL or Datalog for postgres/mysql. Or even just a simplified SQL with a sane, consistent syntax.
I know PG has a bunch of procedural-language alternatives, but afaik, no relational-language alternatives.
At the same time, I'd also expect it to not be that difficult to transpile from say mysql to postgres, yet there's very little in that space, at least in open source (there is however many ORMS that map to either mysql or postgres..)
Amongst procedural languages, by contrast, you'd find a hundred transpilers (JS->C) and VM languages (Clojure, Scala, etc. on the JVM) implemented even by bored/curious students. Which makes me suspect that there's no technical blocker, just a cultural one.
That is, QUEL losing to SQL so completely that you can't find any implementation is an absurd outcome -- it should be available on postgres (perhaps behind some special starting keyword) -- but somehow it is not.
>programming languages are very poor at bridging the difference between the SQL domain and the language domain
Depends a lot on the language. I've lost countless hours to things like JOOQ trying to figure out how to get it to do what I want, or express the query in its quirky not-quite-right DSL, plus dealing with mappings, pojos, auto-generation, and so on.
However, on the other hand, in a dynamic language with just enough support to move the result of your queries into a map of key/value pairs, I feel no friction at all. I'm using little more than a simple jdbc wrapper in Clojure and even after months on the project, I'm still continuously stoked with how seamless the whole thing is.
jOOQ can be used in a less-type-safe way. For example, `fetchMaps` [1] does more-or-less what you describe.
However, I have found it worthwhile to learn to use the more advanced features you mention. Extending type safety to queries is incredibly useful. Consider cases when developers are making code and schema changes concurrently that overlap.
jOOQ can effectively extend the type system of Java to the construction of ad-hoc queries by way of code generation. jOOQ generates Java classes that correspond to the database schema, which can be used in its query building DSL.
For example, if I have a `timestamp with time zone` column in PostgreSQL, that is represented in the generated code. I will be prevented from inserting a Java `String` into that column - I will have to provide an `OffsetDateTime`. (This is how it would be in a typical configuration - it is flexible enough to do pretty much anything).
Another example, let's say I reference some column in a query in one of my feature branches. Meanwhile, somebody has dropped that column on the master branch. When I rebase my feature branch onto master, my build will fail.
The thing that really worries me when I'm writing SQL is the possibility that the query planner will get frisky and choose some disastrously slow join order, but only once in a while. I've run into way too many hard-to-reproduce performance issues like this.
Having the database automatically figure out how to run a query is a great feature, but most of the time I'd happily just write explicit nested loops for the sake of predictability.
When I was in undergrad I was part of a program that was heavily programming focused, but was actually part of the business school. Several of the classes I took were heavily SQL focused, with at least one class where every single assignment required designing database schemas and writing extensive SQL. Now, years later, I still think those classes were some of the most valuable to my career as a programmer.
What's interesting to me is that apparently the CS program at my university hardly did anything with SQL, and I notice too that most programmers I meet "in the wild" are lacking in SQL skills, as you mentioned. It's led to some interesting interview situations where I really struggle with any questions about algorithms (my college courses didn't cover algorithms at all) while the interviewer will tell me that I have the best SQL skills of anyone they interviewed.
Speaking with others that went to other universities, I've heard that it's similar elsewhere for the "business/programming" to include SQL classes but eschew algorithms, while CS programs will ignore SQL but focus heavily on algorithms. It seems to me like both programs could benefit from meeting in the middle a bit.
I have a very similar background. My degrees are in Business Computer Systems and we had SQL hammered into us. We learned how to program with Java and .NET, but all from a very high level. It wasn't until a couple years into my career after school that I started learning DS+A fundamentals on my own. It blew my mind. I agree that there should be more of a mid-point available.
This is partly why I love LINQ: it's a more C# flavoured way of expressing queries. And it works on objects as well as databases.
Most ORMs are bad for queries (pull over all the objects and look at their properties!), but LINQ will actually turn your code into SQL under the hood with some remarkable machinery.
>Most ORMs are bad for queries (pull over all the objects and look at their properties!)
i would change that to ORMs CAN be bad for queries. most ORMs are super configurable and can be changed only to pull specific stuff.
i would argue that most ORMs are a plus to productivity because for most of cases, you just need simple querying (select * from, simple updates, simple deletes).
also, if you really need the power of raw sql you can just use that -- and even then, the most advanced ORMs (django's, for example) expose a LOT of really advanced sql stuff in python which, for a lot of developers, is a lot more expressive.
Which ORMs are bad for queries? Most popular ORMs these days expose most of sql in a language-specific DSL (and some allow you to splice in bits of raw sql in a semi-structured way as an escape hatch). Sure, you still need to learn to "think in sql", to use these effectively, but the ORM has not been the problem for a long time in most languages.
ActiveRecord, Rails' ORM, was the poster child of ORMs for a while, and is completely abysmal for non-trivial left joins, non-trivial aggregations and some subqueries.
Of course using raw SQL is possible but might force you into converting other parts of the code into raw SQL. There is Arel, but it's just a more verbose SQL in quasi-AST form.
On the other hand, I never had much problems with LINQ or Hibernate.
Not doubting there are bad ORMs, but none of the ones I've dived into (SQLAlchemy and Django's ORM) is quite that bad.
I sometimes wonder if the debate about ORMs is mainly driven by some traumatic horror witnessed at some point in every developer's career. If we judged programming languages by the same metric... Oh... Actually we kind of often do that too!
I struggle with a lack of experience with SQL because I was always told to use an ORM or I would regret it in the future when we changed database technology. I'm in that future now, and I spend a lot of time debugging the ORM and the SQL statements it produces, when I could cut that work in half by not using the ORM at all. I'd also have a lot more experience with SQL, so there would probably be fewer bugs in the first place.
If you're writing a CRUD application, an ORM saves a lot of headaches.
If you're doing complex reporting queries, an ORM is strictly worse. And yes, I've seen developers, architects, and authors of ORMs that believed otherwise. They are wrong.
As an example, very, very few ORMs can make the distinction between
SELECT ...
FROM foo
LEFT JOIN bar
ON foo.id = bar.foo_id
AND bar.category_id = 5
LEFT JOIN baz
ON bar.id = baz.bar_id
...
and
SELECT ...
FROM foo
LEFT JOIN bar
ON foo.id = bar.foo_id
LEFT JOIN baz
ON bar.id = baz.bar_id
AND bar.category_id = 5
...
(And if you get into things like analytic queries, just forget about it.)
The extra "AND expr" in the join clause makes it more strict, which means the outer join can produce more nulls in the joined tuple. In a sequence like A left join B left join C, moving the extra join filter between the two joins is the difference between getting tuples like (a..., null..., null...) vs (a..., b..., null...).
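The difference is easy to demonstrate with a toy dataset. Here's a sketch using Python's sqlite3, keeping the foo/bar/baz names from the queries above but with invented sample rows:

```python
import sqlite3

# Toy schema mirroring the foo/bar/baz example (sample rows invented).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, foo_id INT, category_id INT);
    CREATE TABLE baz (id INTEGER PRIMARY KEY, bar_id INT);
    INSERT INTO foo VALUES (1);
    INSERT INTO bar VALUES (10, 1, 7);   -- category_id <> 5
    INSERT INTO baz VALUES (100, 10);
""")

# Filter attached to the FIRST join: bar fails its ON clause entirely,
# so we get (foo, NULL, NULL).
q1 = db.execute("""
    SELECT foo.id, bar.id, baz.id FROM foo
    LEFT JOIN bar ON foo.id = bar.foo_id AND bar.category_id = 5
    LEFT JOIN baz ON bar.id = baz.bar_id
""").fetchall()

# Filter attached to the SECOND join: bar matches, only baz is suppressed,
# so we get (foo, bar, NULL).
q2 = db.execute("""
    SELECT foo.id, bar.id, baz.id FROM foo
    LEFT JOIN bar ON foo.id = bar.foo_id
    LEFT JOIN baz ON bar.id = baz.bar_id AND bar.category_id = 5
""").fetchall()

print(q1)  # [(1, None, None)]
print(q2)  # [(1, 10, None)]
```

Same tables, same filter, different tuple shapes -- exactly the distinction most ORM query builders can't express.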
I believe they may have intended to write this query with INNER JOINs instead of LEFT JOINs.
If so, the result of the query would be identical but the second one would likely have performance problems given the category_id filter is not being applied at the point where bar is joined. A clever database engine might find an optimisation but I wouldn't count on it.
This illustrates the subtle problem that ORMs can introduce - all logical tests pass but issues emerge over time. I found that typically every ORM based problem was resolved by rewriting it using SQL.
I used to use ORMs extensively, but my philosophy now is that you can make simple queries by using a query generator (i.e. connection.GetByID("MyTable", 100)) and complex queries have to be written by hand. I would not choose to use an ORM again.
However, I'm reasonably certain we'd see poor performance with that query on MySQL 5.7; I don't think its optimiser is as good as Postgres's, for example.
In the first case, if for a given row bar.category_id <> 5 then there's no match and hence null will be returned for any value from bar. Thus there will be no match in baz either (assuming the id columns are non-null).
In the second case, the first join will always return a row from bar for a given foo_id if it exists, but the second join does not return values from baz for rows where bar.category_id <> 5.
Swapping out database platforms is pretty deep into YAGNI territory for most. Unless you know you are selling an on-prem software product to some customers who will demand MSSQL and others who will demand Oracle, or whatever, this "swap databases" justification for scrupulously using an ORM is not well grounded in reality.
From my experience it's just unfamiliarity with the SQL language. I've spent thousands of hours writing PHP, Java, Scala, Kotlin, Python and Go and doing fairly complex things with them since I started programming.
SQL maybe a few hundred hours? And off and on again rather than constantly so I don't always remember beyond the basics. The concepts make sense to me but the language always feels foreign.
I'm pretty sure I would be good at SQL if I devoted time to it, because I've worked in other declarative DSLs and generally outperformed other programmers using them, but basically every place I've ever worked at that used SQL extensively had already employed SQL experts that just handled the problems on that level so that it didn't seem worthwhile bothering with it.
In the minds of many developers, SQL is not a "real programming language". That's why they don't see the value of spending the time learning it. Many also believe that SQL is something that they should better avoid, using an ORM, and let it be handled by libraries written by "experts". All of this contributes to developers not having a firm grasp of SQL.
I don't think it's as elaborate as you're saying -- the clearest road into a productive job for many years has been "join a fast growing tech company who uses <cool_framework> to do the magic, so the tech team is always building more business logic". When you've been working with (eg:) ActiveRecord and Ruby on your database data, direct SQL just doesn't come up until you're trying to wrangle more efficiency in your queries or do more complicated joins.
It certainly makes sense for all developers to learn more SQL! But it just hasn't been part of the daily work requirement for many.
As for the attraction of noSQL databases, I think perhaps it's more of a new-hotness trend rather than a serious technical decision. NoSQL has its place, but like most of this kind of tooling, it's often applied incorrectly because of a lack of thinking through the business logic.
I agree that this is in part why nosql became popular. I'm quite experienced with systems programming but databases were never central to my interests.
So when I needed one for my personal stuff, I went with CouchDB, because it's just a very nice RESTful API around a data structure I understand well.
As a programmer who is admittedly attracted to Mongo, a large part of it is simply its ease of integration. I'd happily spend more time familiarizing myself with SQL if it were less of a pain to integrate into my projects.
The first time ever using Mongo I had a cloud cluster connected and working in about 10 minutes after signing up. Trying to integrate SQLite took me around 1.5 hours before it was functional. To this day I have yet to set up a real cloud SQL database because the one time I tried it with PostgreSQL I just couldn't get it to work.
Is initial integration cost really the most important metric you look at when designing a system that will probably run for years and will have to be maintained and scaled?
However, most of the time we only deal with simple queries: select-where and joins. A more complex case is just GROUP BY with simple aggregates such as MAX, MIN, COUNT.
People struggle with SQL because they usually don't know the easier ways to query, so they end up reaching for more complex queries. For example, `select ... where id in` and `case when` are both powerful constructs that many don't know or underutilize.
With `select ... where id in`, for example, we can do the object mapping / join at the application level instead of at the db level.
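As an illustration of that `where id in` approach, here's a sketch of an application-level join using Python's sqlite3 (schema and data invented):

```python
import sqlite3

# Invented schema: orders reference users; the "join" happens in the app.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total INT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (10, 1, 50), (11, 1, 30), (12, 3, 99);
""")

orders = db.execute(
    "SELECT id, user_id, total FROM orders ORDER BY id").fetchall()

# One WHERE id IN query for exactly the users we saw, then a dict lookup.
user_ids = sorted({user_id for _, user_id, _ in orders})
placeholders = ",".join("?" * len(user_ids))
names = dict(db.execute(
    f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids))

report = [(oid, names[uid], total) for oid, uid, total in orders]
print(report)  # [(10, 'alice', 50), (11, 'alice', 30), (12, 'carol', 99)]
```

Two simple queries and an in-memory map, instead of one join -- often easier to reason about, at the cost of an extra round trip.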
SQL definitely solves a lot of performance issues for typical users who think they can do better.
But throwing strings over the wire, trying to get as much as possible without crossing cardinalities, and ignoring the N+1 query problem make SQL at scale disastrous.
Devs always get one of these things wrong, and not being able to use my programming language in the query makes some things hard to express.
I find the SQL INSERT statement unintuitive. I can understand why SQL requires me to declare the field names and then the values of the new row I am inserting, but it would've been a huge time saver if SQL had a key-value, dictionary-like syntax:
INSERT INTO "my_table"
"col1": value1,
"col2": value2,
...
Single-row inserts are super common in both application code and in interactive use of SQL, so I think it is worth having a syntax for them that avoids the common error of pairing the wrong value with the wrong column, especially when a table has many columns of the same type (like booleans).
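Until SQL grows such a syntax, a small helper can fake it by building the column list and placeholders from a dict. A sketch in Python with sqlite3 (the `insert_row` helper and table are invented for illustration; table and column names must come from trusted code, never user input, since they're interpolated into the SQL):

```python
import sqlite3

def insert_row(db, table, row):
    # Build "INSERT INTO t (c1, c2) VALUES (?, ?)" from a dict, so each
    # value sits right next to its column name at the call site.
    cols = list(row)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(cols), ", ".join("?" * len(cols)))
    db.execute(sql, [row[c] for c in cols])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE my_table (col1 TEXT, col2 INT, col3 INT)")
insert_row(db, "my_table", {"col1": "value1", "col2": 2, "col3": 3})
rows = db.execute("SELECT * FROM my_table").fetchall()
print(rows)  # [('value1', 2, 3)]
```

Values still go through placeholders, so the data itself is parameterized as usual.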
When I wrote my (now unmaintained) statically typed SQL dialect for F#, which compiles to underlying SQLite/Postgres/MSSQL, I added a single-row insert with Field=Value because it's nice to have and took no more than 30 minutes to do. It's only a tweak to the parser after all -- you just parse it to the same AST used to represent the `INSERT ... VALUES` clause and all later stages of the compilation do not need to know about it.
The DB we use at work (SQLAnywhere) supports inserting from select, and using auto column name matching, so for a table with columns (name, en1, en2) you could have
INSERT INTO tbl WITH AUTO NAME
SELECT 'foo' AS name, 1 AS en1, 0 AS en2;
Not nearly as neat as the direct key=value syntax but...
It depends pretty much on the quality of one's degree.
Thankfully we had a one-semester course on databases covering all the respective layers; we got to implement B+Tree indexes in C, and built a GUI application using Oracle Forms (their 90s Visual Basic variant), but with PL/SQL instead.
When I first encountered SQL, my immediate reaction was to start writing my own lightweight ORM (two days later I learned ORMs were a thing that already existed) because I took one look at that syntax and decided it was insane to work with directly.
I definitely don't think in SQL.
That said, I like working w SQL databases -- just please give me a battle tested ORM...
Developers do not want to deal with data in 2D tables all the time. The expressive power SQL offers for manipulating these 2D tables is pretty good, but it can't compare to the ease of working with data not confined to that shape, which is what devs are used to from "real" languages - whether imperative, pure functional, or whatever.
In my mind the best example of this is grouping.
Just like most languages now have "Map" and "Filter" functions in their standard libraries, many now also have some form of "GroupBy" function for working with lists and objects in memory. Invariably the meaning of this is that it takes:
1. A list of Xs
2. A key function from an X to a Y, where Ys can be tested for equality
And spits out:
A list of pairs (Y, Xs), each pair containing:
1. A key value Y
2. A list of Xs from the input list matching that key
That list-of-(keys and sublists) can then be fed into whatever .Map(), .Filter(), etc. you want to use next. It is perfectly good for simple aggregates like "give me the total salary we're paying each department", but it's also great for stuff like "give me the 3 highest-paid employees within each department". It just makes sense - map over your groups, sorting each sub-list by salary descending and taking the first 3. Nice clean pure functional stuff.
By contrast in SQL, the first problem is super easy - group by department, select sum(salary). The second problem trips a lot of people up because suddenly you aren't really using GROUP BY anymore, even though the thing you're doing is still a "grouping" task. I think the main way to do it is to use ROW_NUMBER() OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) to get a ranking number for each employee then select only those with ranking <= 3. Why do I think that's worse?
1. If you were to describe to a person what you're trying to accomplish, the first step would be the same for problems 1 and 2 - get the employees for each department. But the queries in SQL are not alike and do not appear to share the same first step.
2. There needed to be dedicated extra language syntax to do this, instead of having one function (not even special syntax) that gives you the essence of grouping - which other functions then aggregate/massage further into whatever you need. This makes it harder to learn and harder to pick the right tool for a given query job.
3. Locality is almost fundamental to programming. I like writing the inner dept.OrderByDescending(e => e.Salary).Take(3) in the real programming language, because I can think locally to "this department's employees" and even down to "this employee's salary". In SQL, you're a mile above the data tying things together with push pins and string by ID, row number, etc. You rarely get to think locally.
4. In the same vein as above, structured data maps better to how I think. When I write data code, the metaphors in my head involve places and containers - drawers, shelves, moving things around. I am organizing the objects into boxes, going into each box, and picking my top 3. I am not marking each object with a number then going over them all as a big set and tossing out the ones with numbers greater than 3.
5. When I get my result from this query, I will probably want to work with it in the nested-list form, even if only to display it with merged cells for the department (so I don't repeat every department name 3 times). Funny enough, if not using an ORM, I'll end up using my language's GroupBy to massage the flat rows back into this form.
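For comparison, here are both versions of the "top 3 best-paid per department" task side by side, runnable with Python's sqlite3 (sample data invented; the SQL version assumes a SQLite build with window functions, 3.25+):

```python
import sqlite3
from itertools import groupby

# (department, name, salary) -- invented sample data
employees = [
    ("Eng", "ann", 120), ("Eng", "bob", 110), ("Eng", "cat", 100),
    ("Eng", "dan", 90), ("Ops", "eve", 80), ("Ops", "fay", 70),
]

# In-memory style: group into sub-lists, then think locally per group.
top3_mem = {
    dept: sorted(rows, key=lambda e: -e[2])[:3]
    for dept, rows in groupby(sorted(employees), key=lambda e: e[0])
}

# SQL style: a window function numbers rows within each partition,
# and an outer query filters on that number.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emp (dept TEXT, name TEXT, salary INT)")
db.executemany("INSERT INTO emp VALUES (?, ?, ?)", employees)
top3_sql = db.execute("""
    SELECT dept, name, salary FROM (
        SELECT dept, name, salary,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM emp)
    WHERE rn <= 3
    ORDER BY dept, salary DESC
""").fetchall()
print(top3_sql)
```

Both produce the same rows; the in-memory version also keeps the nested per-department structure, while the SQL version hands back a flat list.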
Despite all this I want even less to do with NoSQL. I don't want to chuck blobs of JSON into the DB and end up with duplicate data, orphan data, ill-defined schemas, data that's only efficiently queryable one way, etc. I really do want to store my data in a relational form, and a normalized one where possible. I am happy with my tables. I just would love to query them in a LINQ-like language that is capable of representing and manipulating nested data structures -- and returning them to my application as such.
Learn about modelling. A database is more than just storing data. Drink less of the NoSQL koolaid -- any NoSQL. It trades initial results for future development time. SQL has been battle-tested. No amount of "convenience" is more convenient than learning the fundamentals.
I'm disappointed that there isn't more criticism of the SQL language. The whole NoSQL buzz got me excited, then turned out to actually mean NoRelational.
It is wild that we are still using a language that looks and feels like COBOL, and any criticism is met with drive-by disapproval (downvotes and no comments) or an argument about why relational databases are important.
SQL is a deeply flawed language by standards that are pretty much universal today - we on HN discuss them daily on posts about new programming languages. For example, one of the top level comments here:
> In SQL it boils down to the fundamental syntactical requirement to put the SELECT clause before the FROM clause. So I have to build up my statement in this weird spiral pattern where I change something deeper in before I know what I can SELECT in the first place.
If this were the case for a new language post on HN, the author would get run out of town for this reason alone. And yet, the person who wrote that comment feels the need to hedge by saying they are "an SQL hater in remediation".
Relational data does require a mode of thinking that many programmers are not practiced with. Wouldn't it help to have a language for working with it that isn't outright terrible?
SQL stood the test of time. SQL is widely adopted. Once you get the hang of it, it can be applied on a wide range of RDBMS.
I would argue that countless productivity has been lost to learning yet another query language for yet another NoSQL db. Mongo has its own query language. Cassandra has its own. Neo4j has its own. What not. Guess what, few engineers need these to solve their actual problem. Be it building an application, a library, a SaaS, a tool with wide database support. The yet another query language is an imaginary solution to an imaginary problem (most of the time). Your problem isn't big enough to use whatever NoSQL of choice.
When developers are starting out, they get attracted to the technologies with the most marketing money behind them. They beat around the bush, learning things that might not matter anymore 5 years down the road.
Is english the best language? No, not even close. Should one learn it? Probably.
Is Esperanto a better language? I don't know, maybe. People invented it for a reason (a problem to solve), after all. Should one learn it? Probably not.
> SQL stood the test of time. SQL is widely adopted.
I'm highly dubious of "it is widely used, so there" arguments. In the case of SQL, I think the ubiquity of ORMs and SQL generators is evidence that a huge proportion of engineers would rather not write it, and is part of why it is still widely used (most of the time, most people can avoid actually touching it).
> I would argue that countless productivity has been lost to learning yet another query language for yet another NoSQL db. Mongo has its own query language. Cassandra has its own. Neo4j has its own. What not. Guess what, few engineers need these to solve their actual problem.
This is exactly one of the points I addressed in my comment... none of those are even relational databases. And this is where the discussion ends up every time someone says SQL sucks.
> Is english the best language? No, not even close. Should one learn it? Probably.
The vast majority of the time, when you are using English, you have no control over the receiving end. The vast majority of time, when you are using SQL, you / your org grabbed an SQL RDBMS to use.
Furthermore, I don't see why a modern, reasonable query language couldn't transpile to SQL easily when necessary, making all path-dependency / adoption arguments void.
> Wouldn't it help to have a language for working with it that isn't outright terrible?
There is a language like that, and it's downright elegant. It's called LINQ, and it's built into the C# language. It inverts FROM and SELECT, which lets it reliably support autocomplete. LINQ was built using concepts from category theory/relational theory and is widely used in C# to make code tighter and simpler (even though some eschew it because simple for loops are faster). It also transpiles to SQL via Linq-to-SQL. It looks like this:
var results = from c in collection
where c.Fld1 < 10
select new {c.Fld1, c.Fld2};
LINQ is to SQL something like what TypeScript is to JavaScript.
That said, most people who access databases in C# still either write SQL or use an ORM like EF or Dapper.
Databases speak SQL natively for better or for worse. And the SQL language, despite its flaws, is actually mathematically very deep because it's been developed and extended over so many years. See [2] for some serious SQL-fu. To replicate all this functionality into a new language that transpiles into performant SQL is a huge undertaking.
SQL is one of those incumbents that is extremely hard to displace, partly because it's so ubiquitous and that in practice, not everyone finds outright terrible.
True, it's not good, but it's not terrible. I write it day in and day out. The analogy I can think of is the Erlang language -- it doesn't have a pretty syntax but it solves the concurrency problem really well. SQL for better or for worse solves the complex analytic query problem really well.
In regards to 'select' before 'from', there could be a similar argument to be made for declaring imports at the top of a file. When you write a program, you may not know what libraries you need. Variable declaration at the beginning of a function is also a common pattern.
The SQL language stood the test of time whereas COBOL did not. I think that says something about how well it was designed.
I strongly suspect that developers have such a hard time with it is because of ORM. ORM trains you to think of the database as objects. Another poster mentioned that treating databases like sets made SQL click. I think that is a good way of thinking about SQL.
To be fair, SQL does have some archaic things in there that make it annoying to work with. However, once it clicks, it fits naturally with how a relational db works.
>Variable declaration at the beginning of a function is also a common pattern
What language requires you to declare variables at the beginning? C even stopped doing that. People choose to do that, but ime that's after the writing phase, to make it more readable
>When you write a program, you may not know what libraries you need
Sort of -- the difference is that when I'm writing code in my IDE, I know all database objects available to me. It's in the schema. A library I import once (to the project itself), and the IDE can always assist from then-on. But SQL is designed such that despite the library (schema) being imported, the IDE can't actually assist, unless I write the code out of order (eg start with SELECT * FROM table and then start working)
>However, once it clicks, it fits naturally with how a relational db works
I think you've misunderstood the complaint -- the relational model is a very strong concept which has stood the test of time and is difficult to complain about; it does its job well and fits naturally with how a relational db works.
SQL the language however is:
- a hodgepodge of random keywords tossed about in a totally inconsistent fashion (e.g. postgres's OVERLAY('Txxxxas' PLACING 'hom' FROM 2 FOR 4) -- though postgres at least generally offers a consistent comma-separated syntax for every function)
- full of weird and technically unnecessary limitations (like SELECT being evaluated after the WHERE clause, so you can't use the aliases defined in the SELECT clause)
- putting SELECT before FROM (disabling IDE auto-complete support)
- stuffing a 3-valued logic system into a 2-valued logic interface, so boolean operations break silently and produce nonsense in the face of a database with NULL values, because no mapping of NOT/AND over (TRUE, FALSE, NULL) down to two values makes sense. [0]
- generally very poorly composable, leading to redundant, long and convoluted queries
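The three-valued-logic complaint is easy to reproduce; a small sketch with Python's sqlite3 and an invented table:

```python
import sqlite3

# Invented table: one row has a NULL category.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE t (id INT, category INT);
    INSERT INTO t VALUES (1, 5), (2, 7), (3, NULL);
""")

# Intuitively "category = 5" and "category <> 5" should cover every row,
# but NULL <> 5 evaluates to UNKNOWN, so row 3 silently falls out of both.
eq  = db.execute("SELECT id FROM t WHERE category = 5").fetchall()
neq = db.execute("SELECT id FROM t WHERE category <> 5").fetchall()
print(eq, neq)  # [(1,)] [(2,)] -- row 3 is in neither result

# NOT IN against a set containing NULL is worse: every row goes UNKNOWN.
not_in = db.execute(
    "SELECT id FROM t WHERE category NOT IN (SELECT category FROM t WHERE id = 3)"
).fetchall()
print(not_in)  # [] -- the predicate is never TRUE for any row
```

Nothing errors; the rows just quietly vanish, which is exactly the "breaks silently" failure mode described above.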
I agree. The SQL language is in desperate need of an "upgrade" to a proper functional language. It's missing so many basic features that it's just painful.
For example, why do I have to repeat expressions in the SELECT, GROUP BY and ORDER BY clauses!?
E.g:
SELECT
LEFT(Foo,4) as Prefix,
COUNT(1) as N
FROM MyTable
GROUP BY LEFT(Foo,4)
ORDER BY LEFT(Foo,4)
This gets really obnoxious for complex expressions. I mean sure, I can break things out into functions, or use the WITH clause, but both of those often end up being more verbose, not less. This is solved in functional languages such as Haskell with the "where" or "let" clauses (not to be confused with the SQL WHERE filter).
I wish I could do something like:
SELECT
Prefix,
COUNT(1) as N
FROM MyTable
GROUP BY Prefix
ORDER BY Prefix
LET Prefix = LEFT(Foo,4)
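For what it's worth, a CTE gets close to the wished-for LET, naming the expression once at the cost of one extra layer of nesting. A sketch with Python's sqlite3 (substr stands in for LEFT, which SQLite lacks; table and data invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE MyTable (Foo TEXT)")
db.executemany("INSERT INTO MyTable VALUES (?)",
               [("abcd-1",), ("abcd-2",), ("wxyz-1",)])

# The expression is written once in the CTE and referenced by name below,
# so SELECT, GROUP BY and ORDER BY no longer repeat it.
rows = db.execute("""
    WITH prefixed AS (SELECT substr(Foo, 1, 4) AS Prefix FROM MyTable)
    SELECT Prefix, COUNT(1) AS N
    FROM prefixed
    GROUP BY Prefix
    ORDER BY Prefix
""").fetchall()
print(rows)  # [('abcd', 2), ('wxyz', 1)]
```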
Similarly, it's crazy to me that most SQL dialects don't allow relational functions to be passed around as proper data types. This kind of thing comes up a lot when implementing row-level security.
I'd like to be able to write "view functions" that can take any relation as an input, as long as it contains some column, and then add on some joins or filters based on that column. Imagine you have a bunch of tables that link to the "Customers" table via a "CustomerID" column or whatever. It would be great to be able to natively write a query that says "AddCustomerName" or "FilterDisabledCustomers". These would take an existing relation (table/view/function) as an input, and return a new relation as the output with an extra column or filtered rows.
This kind of modular style made up of small chunks of code that can be elegantly composed is just not possible or very messy in most SQL platforms. It's typical of most modern functional languages, and I think the "next gen" query language that replaces SQL will look more like Haskell and less like COBOL.
So true. I worked at a database-centric place for a while, so I read up on database stuff to fit in.
Learning about first, second, and third normal forms was very enlightening. If you learn these thoroughly enough that they are second nature, it really helps you see modeling mistakes that you might be making. It just becomes a lot easier to think clearly about how to lay out data. Just like there are code smells, if you learn normalization, you will immediately detect data structure smells.
Third normal form is not always the one and only right answer for modeling data, but it's a pretty good starting point from which you can refine and adjust if needed. Just because you understand normalization doesn't mean you have to do it all the time, of course. But if your data isn't normalized, it should be for a specific reason (performance), not by accident or because you don't know how to keep it organized. It is one of those things where knowing the rules gives you the freedom to know when it's right to break them.
It's even useful when thinking about organizing data in RAM. The situation is a little bit different because in RAM you follow pointers instead of doing joins. But there are still cases where it helps. For example, you might have one big struct, and you realize it should be two different structs because you are filling the same (redundant) data into multiple instances. And you know how to fix it.
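As a sketch of that in-RAM analogy (types invented for illustration): the denormalized struct copies the shared fields into every instance, while the normalized version keeps them in one place and follows a reference instead of doing a join.

```python
from dataclasses import dataclass

# Denormalized: every order carries its own copy of the customer fields,
# so a rename means hunting down every copy (an update anomaly, in RAM).
@dataclass
class OrderDenorm:
    order_id: int
    customer_name: str
    customer_email: str
    total: int

# Normalized: shared data lives in one struct; orders hold a reference --
# the in-memory analogue of a foreign key, followed as a pointer, not a join.
@dataclass
class Customer:
    name: str
    email: str

@dataclass
class Order:
    order_id: int
    customer: Customer
    total: int

acme = Customer("Acme", "ops@acme.example")
orders = [Order(1, acme, 100), Order(2, acme, 250)]

acme.name = "LargeClient"                   # one update...
names = {o.customer.name for o in orders}   # ...visible from every order
print(names)  # {'LargeClient'}
```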
Another very useful concept from modeling is choosing keys. One of the lessons of databases is that if you don't use synthetic keys, you're going to have a bad time. When you choose real-world data (like first name and last name as a key for a table of people), you're going to have problems like non-uniqueness.
Relational databases aren't the last word on how to organize and store data, but there are a lot of good ideas you pick up if you learn about them.
I'll add that they should never be trusted to not jump around either! I imagine everyone makes this mistake at least once in their life.
There is a very high chance that the database will skip a few numbers from time to time. You will then have someone from an accounting department asking where Record #XX is.
This is why I'm skeptical of the suggestion to prefer a natural primary key like a username. It works fine... until the day the business asks for changeable usernames because BigClient is now LargeClient and can't stand anything to still have their old brand identity.
Even in the natural key's playground (in my mind) of data warehousing, you still have to manage changing dimensions, and the inevitable reality that your fact table was actually another dimension table all along.
Quite. Not only does it protect you from architecture pain later, but it also allows you to create business-keys that humans have an easier time reading, comparing, and typing.
The implicit requirement of a global always-online single master counter is great until it isn't. Perhaps one day it becomes a performance bottleneck, or you have to support systems that create provisional items offline, or some law or client-contract stipulates everything for them must live in a daughter-system hosted inside/not-inside a particular country...
If it was a business requirement that you have perfectly sequential invoice numbers with no gaps, do it at the application level, not at the storage level.
Let the database do what it's great at doing: efficiently store and retrieve data.
A database is also exceptionally good at doing transactional stuff like atomically incrementing something. I'd even say: This is something that belongs in the database and not in some brittle application logic.
Generally, I agree. But in many RDBMSs, auto-increment features explicitly do not guarantee gap-less sequential ordering. If anything throws, it's easy to end up with discarded numbers, in which case you need to catch all errors, inspect the state of the sequence and/or the entity you're populating, and re-seed the sequence. At this point, I'd argue you've already left the pristine gardens of set theory and wandered into the thorny brambles of app dev.
I don't think it is possible to enforce perfectly sequential numbers purely at the application level.
This is the same problem as multiple threads trying to increment the same global variable. Unless there is mutual exclusion while the variable is being read/incremented there will be race conditions.
I think the only way to enforce mutual exclusion for applications would be at the database layer (or any other layer where there is 1 resource that is shared between each application peer).
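Within a single process the fix is an ordinary mutex; the point above is that separate application peers share no memory, so the one shared resource that can play the role of the lock is usually the database itself. A single-process sketch of the mutual exclusion being described:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:          # mutual exclusion around the read + increment
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Without the lock, interleaved read-modify-write cycles could lose updates.
```

Across processes or machines, the equivalent of `lock` has to live in a shared layer: a database row lock, an advisory lock, or a serialized transaction.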
I think by "application level" they mean "using transactions at the application level". If you require a strict sequence of numbers, then BEGIN, READ, INSERT, COMMIT is really what you want, so just do it explicitly. If the COMMIT fails because of a duplicate key, you start over.
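That BEGIN/READ/INSERT/COMMIT loop can be sketched like this (SQLite for portability; Postgres behaves analogously, with the primary key catching the duplicate):

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so we control
# transactions explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE invoice (number INTEGER PRIMARY KEY)")

def next_invoice_number(conn):
    """BEGIN, READ the current max, INSERT max+1, COMMIT; retry on a dup."""
    while True:
        try:
            conn.execute("BEGIN IMMEDIATE")
            (current,) = conn.execute(
                "SELECT COALESCE(MAX(number), 0) FROM invoice").fetchone()
            conn.execute("INSERT INTO invoice (number) VALUES (?)",
                         (current + 1,))
            conn.execute("COMMIT")
            return current + 1
        except sqlite3.IntegrityError:
            conn.execute("ROLLBACK")  # duplicate key: a peer won, start over

numbers = [next_invoice_number(conn) for _ in range(5)]
```

Because the number is only handed out on a successful COMMIT, aborted attempts leave no gaps, unlike an auto-increment sequence.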
Not the OP, but perhaps if only accounting cared about it you could assign these IDs separately in some asynchronous single-threaded process. This would only work if insert rates were low enough, or if the assignment of the gapless sequence only needed to be eventually consistent over some window of time (i.e. before the next payroll we must have assigned gapless IDs to all the new user accounts).
One way to do it: create a helper table with two columns:
1. an auto-incrementing primary key, let's call it 'Key'
2. an integer for the actual number you're generating, let's call it 'IDValue'
Seed the new table with a single row with IDValue set to one less than your minimum value (say 0), then use the following process to generate a new number:
1. Insert a new row into the table, with a known invalid value (e.g. -1) for IDValue (note this must not be the same as the IDValue from your initial row)
2. Get the primary key of the newly inserted row
3. Get all the rows from the table (in primary key order) with primary key < the new id. This will consist of one or more rows of prior valid values (or just the initial seed), followed by one or more rows that are either valid or invalid values (other clients may be running through this process concurrently and finishing at different times) - something like this:
Key / IDValue
61 / 1119
62 / 1120
64 / 1121
65 / -1
67 / 1123
70 / -1
71 / -1
(your row is the next one after this)
4. Your new IDValue == the last valid IDValue in that set of rows + the number of rows between that and your new row + 1 - update your row with this new value - in the above example, 1123 + 2 + 1 - i.e. 1126
5. Delete the first unbroken sequence of valid rows except for the latest one, to keep the table small but leave at least one valid IDValue (IDValues 1119 and 1120 in the above example) - just something like DELETE FROM table WHERE Id < 64 (in this case) should be safe.
The database takes care of atomically creating rows which is the tricky bit, and then you can generate your own number at your leisure regardless of gaps in the Key numbering sequence.
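A single-threaded sketch of those steps in SQLite (concurrency isn't exercised here, but the arithmetic in steps 3-5 is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gapless ("
             "key INTEGER PRIMARY KEY AUTOINCREMENT, idvalue INTEGER)")
# Seed: one row with IDValue one less than the minimum (0 here).
conn.execute("INSERT INTO gapless (idvalue) VALUES (0)")

def next_gapless(conn):
    # Step 1: insert a row with a known-invalid IDValue.
    cur = conn.execute("INSERT INTO gapless (idvalue) VALUES (-1)")
    # Step 2: grab the primary key the database just assigned.
    my_key = cur.lastrowid
    # Step 3: read every earlier row in key order.
    rows = conn.execute(
        "SELECT key, idvalue FROM gapless WHERE key < ? ORDER BY key",
        (my_key,)).fetchall()
    # Step 4: new IDValue = last valid IDValue + rows after it + 1.
    last_valid = max(i for i, (_, v) in enumerate(rows) if v != -1)
    new_value = rows[last_valid][1] + (len(rows) - 1 - last_valid) + 1
    conn.execute("UPDATE gapless SET idvalue = ? WHERE key = ?",
                 (new_value, my_key))
    # Step 5: trim the leading unbroken run of valid rows, keeping the latest.
    first_invalid = next(
        (i for i, (_, v) in enumerate(rows) if v == -1), len(rows))
    if first_invalid > 1:
        conn.execute("DELETE FROM gapless WHERE key < ?",
                     (rows[first_invalid - 1][0],))
    return new_value

values = [next_gapless(conn) for _ in range(5)]
```

On the worked example above (last valid 1123 at key 67, two pending rows at 70 and 71), step 4 computes 1123 + 2 + 1 = 1126, matching the description.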
Usually, you don't. What kind of messed up requirement is that? Do you want the sky to be green at sunrise tomorrow too?
In the exceptional case where the user is in the clear and there is an unavoidable reason to create this beast, you create a separate numbering application that uses database transactions independent from those of the main application and only cares about numbering. It will still have holes, but few enough that you can manually inspect every so often and explain why they happened.
If you look at the documentation for something like "create sequence", databases often mention that the database will reserve blocks of numbers so that they can hand out values from memory (and also without coordinating with peer databases). There's normally a way to turn that caching off.
But you may still need to check what the behavior is during a transaction, as you could have two requests trying to add a bunch of rows at the same time.
Auto-increments are fine for primary keys, until they aren't. I think the list of items from the linked article are things that may cause delayed problems.
Since a long time ago. This is old news. iFrame sandboxing is now a two-way street, depending on configuration. Not only can an iFrame be prevented from accessing the parent frame, a parent frame can be prevented from accessing the child frame. It is still vulnerable to clickjacking, but to reduce the impact of that, using one-tap sign up only allows the most basic Google permissions. https://news.ycombinator.com/item?id=17044518 https://developers.google.com/identity/one-tap/web
"The fastest way to access to a row in a database is by its primary key. If you have better ways to identify records, sequential IDs may make the most significant column in tables a meaningless value. Please pick a globally unique natural primary key (e.g. a username) where possible."
I can agree with everything in the article except this one.
>Has anyone had a problem due to surrogate keys?
There's one problem with surrogate keys: they are not convenient to users (too long and not meaningful).
There are two problems with natural primary keys, and you are guaranteed to hit one of them at some point.
1. It turns out your key isn't actually unique. To resolve the collision you have to replace the natural key with a surrogate for one of the conflicting entities, which is not always possible without risk of another collision.
2. It turns out your key isn't persistent. You have to change it for some entities, but you can't because of so many FKs.
The article mentions auto-incrementing keys, not surrogates. Not the same thing at all. Not all incrementing keys are surrogates and not all surrogates are incrementing keys.
Also, your problem number 1 is a problem whether the natural key in question is the "primary" one or not. Certainly if you choose the wrong natural key then you'll have to fix that - that's why you should take care to make a wise choice of natural key regardless of whether you are also using a surrogate.
> You are lucky if 99.999% of the time network is not a problem.
This reminds me of one time I was having networking issues (around a CRUD GUI). Oddly, I identified the issue the (business) day before my users did. I was working on a rather large change (OS updates, 32->64 bit) and noticed that multicast updates broke. The timing was funny. I was working towards a minimal reproducible example already, and I had a small test app that showed that multicast was broken on my PC. So, I ran down and tried the app on their computers, and it worked... between the two of them, but not the company at large.
That's when I vaguely remembered an email from Network Ops about a router change on my users' floor over a weekend. I went by the Ops team (they didn't like email for some reason) and told them what I was seeing.
The short of it was, yeah, the router change screwed my 2 users on the floor (they got put on the printer VLAN by accident, which didn't receive multicast). Separately, the issue I saw was a bug in how we built a 3rd party lib on Windows that provided 64-bit multicast support.
THE thing I wished more developers knew about databases: they are badass when it comes to data manipulation.
Stop sucking all the data out and manipulating it in your language of choice. Tell your DBA what you want done and let her do it for you.
Really, DBs may look like they are the special needs kid in the chain, but they're magnificent powerhouses when it comes to 90% of what you are trying to do to data.
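For instance, instead of pulling every row across the wire and aggregating in application code, push the work down to the database (SQLite here, with a hypothetical sales table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 10), ('north', 20), ('south', 5);
""")

# The tempting way: suck all the rows out and loop in your language.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0) + amount

# The database way: one aggregate query, only the answer comes back.
rows = conn.execute("""
    SELECT region, SUM(amount) FROM sales
    GROUP BY region ORDER BY region
""").fetchall()
```

Both produce the same totals, but the second version transfers two rows instead of the whole table, and the optimizer gets to use indexes and statistics you never see.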
The most important and overlooked characteristic is that most RDBMSs use b-trees for their table space, meaning all operations (including search/lookup) are O(log n). For online (OLTP) applications, this means you will have to shard sooner or later (assuming your audience is growing).
I agree with your first statement, but could you explain how b-trees logically lead to sharding (at some growth point)? What storage structure doesn't lead to sharding eventually?
One thing I've noticed is that in the medium term of a software service (2-5 years), your software should have the ability to do double-writes to, and flip reads between, two different datastores.
That affords as close to a transparent migration as you can get, with reduced or no downtime.
One more book on the subject, possibly even closer related to the field of databases (should cover many of the items mentioned in the post): https://www.databass.dev
An odd thing happened to me yesterday with the PostgreSQL ODBC driver (psqlODBC) v9.3.400. It wouldn't let me insert a string longer than 255 characters into a character varying field in a local v9.4 database on Windows using a recordset update. I didn't have a problem pasting it in via pgAdmin. Altered the field to text and the problem went away.
I have a suspicion that there is a limit on text in the tens of thousands of characters too, though, despite both those types being essentially the same thing and limitless.
I learned SQL working on a "database as a product". The database was filled with medical ontologies. It was a perfect environment for learning. We were always looking for obscure things in that database, and rarely changed data, just selects for days. In the end I once wrote a query spanning about 100 lines that used common table expressions and found it quite maintainable.
Where can an SQL beginner find such a database to experiment with? I was fortunate because my job provided me this database to experiment with, but what about those who are not so fortunate?
There seems to be a wealth of information here. I wonder, though, if the goal would be better served breaking this up into a series. Or perhaps applying the 80/20 rule to the list to come up with truly “a few” items that will have the highest leverage. Otherwise, I think it will provoke a lot of discussion and analysis from people who are experts in databases, but it might remain impenetrable for the core audience: the majority of developers who ignore this stuff (according to the author.)
I could be wrong, but it seems to me that she is confusing not storing state with idempotent calls to the DB. Everything you store in the DB is some form of state.
Many years ago I read some chapters of Itzik Ben-Gan's "Inside Microsoft SQL Server 2008 T-SQL Querying." It's an excellent book for anyone who wants to know how things work under the hood. While the title says SQL Server, so many of the things explained there apply to almost any SQL engine that it is worth the purchase.
These sorts of articles are very effective: if you nail at least one thing in your checklist that the reader does not know or is unsure about, it often encourages them to continue reading. I know I did.
In my experience, if you think of a database as a 3rd party API that you do not control sitting somewhere unknown on the internet, your expectations become more reasonable.