I think you're asking the wrong question. The question should be: How did MongoDB become so successful?
IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API. MongoDB was the natural choice because it excelled at being accessible to devs who were already familiar with Javascript and JSON.
Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?
I think you'll enjoy the series then; I spent several months investigating and made the same point about JSON and the Javascript-like CLI (plus great Node support, plus savvy marketing). For example:
> 10gen's key contribution to databases — and to our industry — was their laser focus on four critical things: onboarding, usability, libraries and support. For startup teams, these were important factors in choosing MongoDB — and a key reason for its powerful word of mouth.
> I think you're asking the wrong question. The question should be: How did MongoDB become so successful?
Marketing, marketing, and more marketing. Mongo was written by a couple of adtech guys.
> Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?
I remember being underwhelmed by two things at the one MongoConf I went to earlier this decade:
1.) My immediate boss was an unfathomable creep who was there mostly to pick up women
2.) Mongo was focused on how to work around the problems (e.g. the aggregation framework) rather than how to solve them.
I can't recall ever seeing a Postgres rep, but I can recall having worked out a PostGIS bug with a fantastically tight feedback loop. The Postgres documentation and community are nothing short of amazing.
Meanwhile with Mongo I watched as jawdropping bugs languished. IDGAF what the reps say, anyone with even a few years experience should've been able to see through the bullshit that Mongo/10gen was/is selling.
> IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API.
The thing I dislike about this type of comment – although I now notice yours doesn't explicitly say this – is the implication that devs don't like SQL because they're lazy or stupid. Well, sometimes that is probably true! But there are some tasks where you need to build the query dynamically at run time, and for those tasks MongoDB's usual query API, or especially its aggregation pipeline API, is genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties. For anyone not familiar, just look at how close to being a native Python API pymongo is:
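(The snippet originally pasted here didn't survive the thread. As a stand-in, here's a minimal sketch of the idea: an aggregation pipeline is just plain Python dicts and lists, so it can be assembled piecewise at runtime. The collection and field names below are invented for illustration.)

```python
# Build a Mongo-style aggregation pipeline dynamically, as plain dicts and lists.
# Field names ("age", "city") and the collection ("users") are made up.
def build_pipeline(min_age=None, group_by=None):
    pipeline = []
    if min_age is not None:
        # Only include the $match stage when a filter was actually requested
        pipeline.append({"$match": {"age": {"$gte": min_age}}})
    if group_by is not None:
        # Group on the requested field and count documents per group
        pipeline.append({"$group": {"_id": f"${group_by}", "count": {"$sum": 1}}})
    return pipeline

pipeline = build_pipeline(min_age=18, group_by="city")
# With pymongo you would then run: db.users.aggregate(pipeline)
```

No string splicing, no comma bookkeeping, no injection surface: the driver serializes the structure for you.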
Of course you could write an SQL query that does this particular job and is probably clearer. But if you need to compose a bunch of operations arbitrarily at runtime then using dicts and lists like this is clearly better.
Of course pipelines like this will typically be slow as hell because arbitrary queries, by their nature, cannot take advantage of indices. But sometimes that's OK. We do this in one of our products and it works great.
With JSONB and replication enhancements, Postgres is close to wiping out all of MongoDB's advantages. I would love to see a more native-like API like Mongo's aggregation pipeline, even if it's just a wrapper for composing SQL strings. I think that would finish off the job.
Elixir's primary database wrapper, Ecto [0], lets you dynamically build queries at runtime, and also isn't an ORM. Here are two examples directly from the docs:
  # Query all rows in the "users" table, filtering for users whose age is > 18, and selecting their name
  "users"
  |> where([u], u.age > 18)
  |> select([u], u.name)

  # Build a dynamic query fragment based on some parameters
  dynamic = false

  dynamic =
    if params["is_public"] do
      dynamic([p], p.is_public or ^dynamic)
    else
      dynamic
    end

  dynamic =
    if params["allow_reviewers"] do
      dynamic([p, a], a.reviewer == true or ^dynamic)
    else
      dynamic
    end

  from "posts", where: ^dynamic
Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.
My main point, though, is that you don't need to reach for NoSQL if all you need is a way to compose queries without string interpolation.
As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline unless it can do analogous things to Postgres's JSONB fields. For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?
Also, one of the benefits of Mongo's API is that it has excellent native implementations in numerous languages (we already use C++ and Python), so a suggestion to switch language entirely is not really equivalent.
> As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline
Huh? The aggregation framework is a solution to a Mongo-only problem. Most other databases are performant, but Mongo suffers wildly from coarse locking and from the cost of moving data into and out of the javascript VM.
> For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?
This sounds suspiciously like a SQL view.
Edit: But if you actually need an array in a cell, Postgres has an array type that's also a first-class citizen with plenty of tooling around it.
The "this" was referring to dynamically building queries (the GP comment by me) in Ecto (the parent comment by QuinnWilton). What you've said is a non-sequitur in the context of this little discussion. My whole original point is that raw SQL isn't right in all situations, and you appear to be arguing that I just use SQL instead.
I can't speak to every ORM or database interface in existence, but ActiveRecord will happily handle Postgres arrays and let you use the built-in array functions without having to write queries by hand. Ecto is less elegant, but you can still finagle some arrays with it.
As far as views are concerned, I don't know what to tell you. Sure, you'll probably have to craft the view itself by hand. The result is that you can then use most abstractions of your choosing on top of it though.
There's also the possibility of using automation to create, update, and manage views. That lets your app be 'dynamic' with regards to new data and new datatypes, but also preserves the performance, debugging, segregation, and maintenance benefits of the underlying DB.
> Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.
Ahh Elixir. My favorite language that really just tries so hard to shoot itself in the foot. I'm currently in the protracted process of trying to upgrade a Phoenix app to the current versions. Currently I'm at the "rewrite it in Rust and try out Rocket + Diesel" stage.
Diesel is... interesting and makes me long for Ecto (which is often used as an ORM although the model bits got split off into a different project).
Love the downvotes instead of comments. I've walked away from Elixir as the best practice deployment methodology (Distillery) is non-op on FreeBSD[1] and has been for a few months while the Distillery author is mum. All of this despite the vast love that the Elixir community seems to heap on FreeBSD.
Erlang and Elixir have plenty of promise but there simply is no good story for production deployments. Distillery and edeliver approximate capistrano, and that sounds great when it works (although I'd just as soon skip edeliver). But when it doesn't I'd much rather dig into the mess of ruby that is Capistrano than the mess of shell scripts, erlang, and god knows what else goes into a Distillery release.
Elixir is a really interesting language, but Phoenix seems to still be pretty wet behind the ears and very much in flux. Ecto too to a much smaller extent.
1: Some of the distillery scripts can communicate with epmd, some just give up.
Well... you can also use a modern ORM. I think "stitching ... text strings" is definitely not the way to go when interfacing with a SQL database. My go-to ORM is Sequel[1]. I think their API is one of the best I've seen: you can choose to use models, but you can also work directly with "datasets" (tables or views, or queries) and compose them as you like. It's really powerful and simple.
> genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties.
You're using the Pymongo library as an example. Someone can just as easily use SQLAlchemy and not have to worry about those things.
I'm confused by the implication that someone doing things like the above would be writing in SQL. SQL is a little like assembly language in a game: You may need to drop down to it for some key highly-optimized areas, but you rarely need to directly use it for most tasks. While it's true that you should understand how it works so you don't generate queries that suck performance-wise, the same goes for Mongo's intricacies too.
Every language I know of has great ORMs which do this for whatever SQL flavors people tend to use on that platform. I write things like this all the time, and it gets turned into SQL for Postgres:
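(The example the commenter pasted here didn't survive the thread. As a rough stand-in, here's the same idea in plain Python over sqlite3: compose a query programmatically, keep column names and operators behind whitelists, and pass values as bound parameters rather than interpolated text. The table and column names are invented.)

```python
import sqlite3

def build_query(filters):
    """Compose a WHERE clause from (column, op, value) triples.
    Column names and operators are checked against whitelists;
    values travel as bound parameters, never as SQL text."""
    allowed_cols = {"age", "city"}
    allowed_ops = {"=", "<", ">", "<=", ">="}
    clauses, params = [], []
    for col, op, val in filters:
        if col not in allowed_cols or op not in allowed_ops:
            raise ValueError(f"disallowed filter: {col} {op}")
        clauses.append(f"{col} {op} ?")
        params.append(val)
    sql = "SELECT name FROM users"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [("ada", 36, "london"), ("bob", 17, "paris")])

sql, params = build_query([("age", ">", 18)])
names = [row[0] for row in conn.execute(sql, params)]
# names == ["ada"]
```

A real ORM or query builder does exactly this under the hood, which is why it also protects against injection for free.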
When using an ORM correctly (and indeed, the less I'm using any of my own bits of SQL the more this is true) I am also protected against injection attacks.
I'm not saying NoSQL has no value, but I believe it to be the wrong tool for data that lends itself to an RDBMS. If you have a bunch of documents that have deeply nested or inconsistent structures and where it makes no sense that you'd want to query by something other than the primary key, sure, it's a no-brainer to use a NoSQL system. For a CMS, which has been implemented thousands of times in RDBMSs, it is madness though. I cringe at realizing that apparently there are developers out there who have avoided learning SQL entirely in their career out of fear, and as a result have to use Mongo for every application because that's the only thing they know how to do. I'm sure they're out there, but I wouldn't hire one.
The jsonb_array_elements function is roughly similar to Mongo’s $unwind pipeline op. It explodes a JSON array into a set of rows. From there it’s a matter of fairly simple aggregates to achieve what you’re looking for.
I was evaluating Mongo a couple months back to solve roughly the same problems. Eventually discovered Postgres already had what I was looking for.
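A runnable sketch of that shape, using SQLite's json_each as a stand-in for Postgres's jsonb_array_elements (the table and the key/value layout are invented; in Postgres the SQL is nearly identical with jsonb_array_elements and ->> accessors):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, attrs TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    (1, json.dumps([{"key": "score", "value": 10}, {"key": "other", "value": 99}])),
    (2, json.dumps([{"key": "score", "value": 3}])),
])

# "Unwind" each row's JSON array into one row per element, keep elements
# whose key matches and whose value exceeds a threshold, then use that to
# filter the parent rows: roughly $unwind + $match in Mongo terms.
matching = conn.execute("""
    SELECT DISTINCT d.id
    FROM docs AS d, json_each(d.attrs) AS e
    WHERE json_extract(e.value, '$.key') = 'score'
      AND json_extract(e.value, '$.value') > 5
""").fetchall()
# matching == [(1,)]
```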
More the point Postgres has an actual array data type (and has for a while). You don't need to shove everything into a JSON/JSONB blob unless you absolutely cannot have any sort of schema.
Not only arrays, you can, with some limitations, create proper types with field names, if your ORM supports that you should use that over JSONB if it fits.
It was supposed to be clear from the context that this meant:
> Does building queries programmatically with SQLAlchemy do that?
Maybe I'm misreading your comment, but you seem to just be talking about writing queries directly in SQL.
If not, could you give an example/link of how to programmatically build a query in SQLAlchemy that dynamically makes use of jsonb_array_elements? It would be hugely useful if I could do that.
I was speaking of SQL, but if you can write it in SQL you can usually map it to SQLAlchemy. If worse comes to worst, you can use text() to drop down to raw SQL for just a portion of the query.
SQLAlchemy's Postgres JSONB type allows subscription, so you can do Model.col['arrayfield']. You can also manually invoke the operator with Model.col.op('->')('arrayfield').
To add to the pile of responses: in Scala, Slick is a great library that lets you compose SQL queries and fragments of queries quite effectively. (http://slick.lightbend.com/)
At my company we built a UI on top of Slick that lets users of our web app define complex triggers based on dynamic fields and conditions which are translated to type-safe SQL queries.
From my POV the rise of 'NoSQL' some years back was tied into a number of things:
- Misunderstanding by most developers of the relational model (I heard a lot of blathering about 'tabular data', which is missing the point entirely).
- The awkwardness and mismatchiness of object-relational mappers -- and the insistence of most web frameworks on object-oriented modeling.
- The fact that Amazon & Google etc. make/made heavy use of distributed key-value stores with relatively unstructured data in order to scale -- and everyone seemed to think they needed to scale at that level. (Worth pointing out that since then Google & Amazon have been able to roll out data stores that scale but use something closer to the relational model). This despite the fact that many of the hip NoSQL solutions didn't even have a reasonable distribution story.
- Simple trending. NoSQL was cool. Mongo had a 'cool' sheen by nature of the demographic that was working there, the marketing of the company itself.
I remember going to a Mongo meet-up in NYC back in 2010 or so, because some people in the company I was at at the time (ad-tech) were interested in it. We walked away skeptical and convinced it was more cargo-cult than solution.
I'm _very_ glad the pendulum is swinging back and that Postgres (which I've pretty much always been an advocate of in my 15-20 year career) is now seeing something of a surge of use.
I remember a hyperbolic readme or other such txt file for Postgres in the far-away long-ago time when everyone was on Slashdot. The author had written one of the most enthusiastic lovenotes to software I'd ever read, and that includes Stephenson's "In The Beginning Was The Commandline." It was a Thomas Wolfe level of ejaculatory keenness. I'd love to read it again if anyone else knows where I can find the file. So, even if there aren't actual Postgres reps, there are most assuredly evangelists.
Saying "I don't know SQL so I will just use JSON" really misses the point though. SQL is easy. Data is hard. NoSQL products offer to get rid of SQL which includes an implication that SQL itself was the challenge in the first place. The problem then is that you have lost one of the best tools for working with data.
I dunno that SQL is exactly easy, though. It's one thing to say "select statements are essentially identical to Python list comprehensions", but in practice I still have to look up the Venn diagram chart every time I need to join anything, and performance optimization is still a dark art. I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.
You could solve that by altogether dropping the Venn diagram metaphor when reasoning about joins. This is the number one problem I see with junior devs who have a hard time grokking SQL. If you think about a join as a cartesian product with a filter, where the type of join defines the type of filter, the reasoning is extremely easy.
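That mental model is easy to demonstrate in a few lines of Python (the toy tables here are invented): an inner join is literally the cross product of the two tables filtered on the join condition, and a left join just adds back the unmatched left-hand rows.

```python
users = [(1, "ada"), (2, "bob")]          # (id, name)
orders = [(10, 1), (11, 1), (12, 3)]      # (order_id, user_id)

# INNER JOIN: cartesian product, filtered on the join condition
inner = [(u, o) for u in users for o in orders if u[0] == o[1]]

# LEFT JOIN: same product and filter, but keep left rows with no match,
# padding the right side with None (SQL's NULL)
left = []
for u in users:
    matches = [(u, o) for o in orders if u[0] == o[1]]
    left.extend(matches or [(u, None)])
```

The join *type* only changes what happens to unmatched rows; the product-plus-filter core is the same every time, with no Venn diagram required.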
The hard parts of "SQL" are the hard parts of data. Joins aren't easier in Mongo. The performance optimizations you reference are tuning of a relational database, not SQL itself.
If you want to work with databases a domain specific language like SQL really provides a lot of value in solving these hard data problems.
The idea is, in relational databases, that the vast majority of the time you shouldn't have to do it. Because you're writing your queries in a higher level (nay, functional) language, the query planner can understand a lot more about what you're trying to do and actually choose algorithms and implementations that are appropriate for the shape and size of your data. And in 6 months time when your tables are ten times the size, it is able to automatically make new decisions.
More explicit forms of expressing queries have no hope of being able to do this and any performance optimization you do is appropriate only for now and this current dataset.
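This is easy to watch happen with SQLite's EXPLAIN QUERY PLAN (Postgres's EXPLAIN plays the same role; the table and index names below are invented): the query text never changes, but the planner picks a different strategy once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (ts INTEGER, val REAL)")

query = "SELECT val FROM measurements WHERE ts = 42"

# No index yet: the planner has no choice but a full table scan
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][3]

conn.execute("CREATE INDEX idx_ts ON measurements (ts)")

# Same query text, new plan: the planner now uses the index on its own
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][3]
# `before` mentions a SCAN; `after` mentions USING INDEX idx_ts
```

Nothing in the application had to change; the declarative query let the engine re-decide the algorithm.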
> I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.
Mongo and Javascript don't solve that either. In fact you get additional problems by virtue of not being able to do a variety of joins. For extra points, you're going to need to go well beyond javascript with mongo if you want performance. 10gen invented this whole "aggregation framework" to sidestep the performance penalty that javascript brings to the table.
On the other side, the postgresql documentation is second to none. SQL isn't necessarily easy but the postgres documentation gives you an excellent starting point.
> You make it sound like learning SQL is like learning Assembler
It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."
You seem like the kind of person ready and willing to learn the right tool for the job. From my experience a few years ago on an accredited computing course that covered database admin and programming, this attitude is not representative of most software engineering students //unless// there's a specific assignment that requires particular knowledge.
Cs get degrees. And for plenty of developers out there, knowing one language (not even particularly well) gets jobs.
> It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."
And that's a big fat mistake. There are so many ways to shoot yourself in the foot with mongo such that simply knowing the language mongo uses for most of its queries while not actually knowing the particulars of how mongo uses that language… well that's just a road to a world of hurt.
For example, when I first inherited a mongo deployment I noticed the queries were painfully slow. Ah hah says me, let's index some shit. Guess what? Creating an index on a running system with that version of mongo = segfault.
After a bunch of hair pulling I got mongo up and running and got the data indexed. But the map reduce job was STILL running so slowly that we couldn't ingest data from a few tens of sensors in real time. So I made sure to set up queues locally on the sensors to buy myself some time.
Even in my little test environment with nothing else hitting the mongo server, mongod was still completely unable to run its map reduce nonsense in a performant manner. Mongo wisdom was: shard it! wait for our magical aggregation framework! Here's the thing: working at a dinky startup we can't afford to throw hardware at it especially that early in the game. Sharding the damn thing would also bring in mongo's inflexible and somewhat magical and unreliable sharding doohickey.
So I thought back to previous experience with time series data. BTDT with MySQL, you're just trading one awful lock (javascript vm) for another (auto increment). So I set up a test rig with postgres. Bam. I was able to ingest the data around 18x faster.
And that's the thing. Mongo appeals to people who are comfortable with javascript and resistant to learning domain specific knowledge. All that appealing javascript goodness comes with a gigantic cost. If you're blindly following the path of least resistance you're in for a bad time.
P.S. plv8 is a thing, and you can script postgres in javascript if you really wanted to.
I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline. So while "sure I'll just learn SQL" is great for a personal or school project, when you've got to get something done next week, it's better to take maximal advantage of the tools/skills/workflow that you already have.
IOW, it's not just laziness, it's a kind of professional conservatism. which is partly what gets older engineers stuck in a particular mindset, but it's also a very effective learned skill. The opposite is being a magpie developer, which results in things like MongoDB taking off :)
> I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline.
You have to do the exact same things with Mongo+JS (e.g. learning when to avoid the JS bits like the plague).
> "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language
SQL is a skill that rewards investment in it 1000x over, in terms of longevity. It has spanned people’s entire careers! What’s the shelf life of the latest JS framework, 18 months at most...
Yes, I know that, and that's why I know and use SQL instead of MongoDB. But that's a very similar reason to why I've resisted learning Rust, and Ruby, and React, and Docker, and Scala, and many more. I know I could learn the utter basics in a weekend, but I also know that those basics are utterly useless in a real-world context, and I would prefer to spend the weekend hacking on my open-source project in Python or C, which I've already invested the years into. And that's how engineers age into irrelevance..
Well, that and SQL has a somewhat undeserved reputation for being easy to learn, but also easy to screw up. Like you write a simple-looking query and it turns out to have O(n²) complexity and your system ends up bogged down in the database forever.
In practice people who fall into complexity traps are usually asking a lot more of their database engine than any beginner. It's usually not that hard to figure out the approximate cost of a particular query.
> Like you write a simple-looking query and it turns out to have O(n²) complexity
Or you have a simple fast query with a lovely plan until the database engine decides that because you now have 75 records in the middle table instead of 74, the indexes are suddenly made of lava and now you're table-scanning the big tables and your plan looks like an eldritch horror.
> Not only that, their marketing/outreach efforts were also aimed at younger developers.
I do remember a lot of MongoDB t-shirts, cups and pens around every office I was in around 2011-2013. When I would ask they would tell me that a MongoDB developer flew halfway across the world to give them all a workshop on it.
> The question should be: How did MongoDB become so successful?
Ability to store Algebraic Data Types and values with lists without the hassle of creating a ton of tables and JOINs. Postgres has since added JSON support, plus there are now things like TimescaleDB, which didn't exist previously.
ORMs have existed for decades so developers can use a SQL database just fine without knowing the language. So it's definitely not this.
It's more likely because Mongo (a) is extremely fast, (b) is the easiest database to manage and (c) has a flexible schema, which aligns better with the dynamic languages that are more popular amongst younger developers.
Postgres is faster at JSON than Mongo. Also, Mongo's pipeline query strategy is terrible to deal with. A schema should not be flexible: now I have to write a bunch of code to handle things that should have been enforced by the database. Postgres is incredibly easy to manage, with actual default security. I know the Mongo tutorial says not to run the default configuration; then why is it the default configuration? It's so easy to manage that anyone can take it over for ransom.
At "large financial news company" we had a "designed for the CV" tag that applied to stupid architectural decisions (of which there were many).
One of the biggest and most expensive was using Cassandra to store membership details. Something like 4 years of work, by a team of 40, wasted by stupid decisions.
They included:
o Using Cassandra to store 6 million rows of highly structured, mostly readonly data
o hosting it on real tin, expensive tin, in multiple continents (looking at >million quid in costs)
o writing the stack in java, getting bored, re-writing it as a micro service, before actually finishing the original system
o getting bored of writing micro services in java, switching to scala, which only 2/15 devs knew.
o writing mission critical services in elixir, of which only 1 dev knew.
o refusing to use other teams tools
o refusing to use the company wiki, opting for their own Confluence instance, which barred access to anyone else, including the teams they supported
I think the worst place I ever worked was like that. Was going back quite a number of years now but it was a startup fired up by a one of the lesser MBAs to utilise a legal loophole to slice cash off the public via a web app and a metric ton of marketing to desperate people.
Step one was hire everyone the guy had worked with at his previous company. They were all winforms / excel / SQL / sharepoint / office developers from big finance and had no idea where to go really. None of them had even touched asp.net.
Cue "what's popular". Well that was Ruby on Rails back then on top of MySQL and Linux. 4 people with zero experience pulled this stack in and basically wrote winforms on top of it. Page hits were 5-8 seconds each. Infrastructure was owned by SSH worms and they hadn't even noticed.
I think I lasted two days there before I said "I'm done".
I once saw a CEO who wrote his own Web Framework and forced the entire company to use it.
At the time, under the influence of React, the idea was to "build web applications solely based on Functional Programming". Since after years of trying no one could figure out what that meant, the company ditched the CEO and ended up wasting a couple of years of work.
It's not that you don't write anything in a new thing. It's that you start with small, less critical projects. Get your feet wet, give people a chance to get a feel for the pros and cons, that sort of thing.
Jumping right into writing "mission critical services" in the brand new language that few people at the company know well is asking for trouble.
There were no problems of scale, speed or latency. It was migrating from one terrible system to something that should be smaller, simpler, cheaper and easier to run.
The API is/was supposed to do precisely four things:
o provide an authentication flow for customers
o provide an authentication flow for enterprises to allow SSO
o handle payment info
o count the number of articles read for billing/freemium/premium
That is all. It's a solved problem. Don't innovate, do.
Spend the time instead innovating on the bits that are useful to the business and provide an edge: CRM, pattern recognition and data analytics.
This actually resonates with me, maybe not in the way Ellison intended as I’m not familiar with the context he said it in. A bit off the main topic, but the more I revert to just using emacs for some task I previously used a .app bundle or web page for, the more I question how much we in the computer industry have just been spinning our wheels for the last 30+ years. I honestly can’t really tell what value WIMP-centric GUIs have brought to the table besides fashion, let alone the endless debates in the form of actual implementations about the best way to build one. Possibly the best argument ever made died with Mac OS 9.
I suspect that in a year I’ll be using what is effectively an Elisp machine with an Apple logo paired with an iPhone.
Discoverability and self documentation is generally the advantage of GUI systems. Well built systems can be understood in a few moments without needing to consult a manual. That's almost impossible in a pure CLI environment.
Of course it's entirely possible to screw this up, modern phone-centric design standards are really bad about it for example, but in general you need to consult the manual (or Google) far less often.
I’m not convinced these are inherent properties of GUIness. We have had GUIs in use for long enough that many elements, even across slightly different GUIs, are familiar; this has allowed conventions and then conventional wisdom to develop.
Early GUI developers also put effort into making their GUIs at least somewhat intuitive, but they also bundled thick books of documentation on how to use their systems.
To a degree, they are self-documenting, but not because of their GUIness so much as their design choices; menus helped a lot with this. Menus actually are a good subject to touch upon, they are definitely one of the better conventions we developed, and they were based upon and analogous to a restaurant menu. However their very nature as a list of commands you can issue a GUI does also typically but not always, limit the application. If the only options available to you are what is on the menu and there is no other interface, then an application is much more limited. This isn’t even true of most restaurants which will often allow you to issue an order for something not on the menu if they have the ingredients, equipment and expertise to make it.
But as a convention, it is not limited solely to GUIs, you can incorporate menus into any interactive interface.
I think the real innovation wasn’t GUIs, it was interactive software. The innovation beyond that is scriptable software.
What is important isn’t GUIness, but the developer’s intent. If you develop software with the intent to be self-documenting and discoverable, you will end up with an interface that is both of these things provided you did a competent job of it. A GUI might help, relying on platform conventions might help, and using common cultural conventions might help, but these aren’t the necessary ingredients for those qualities.
Emacs has the quality of self-documentation, but it is emphatically not a GUI even though it is interactive.
I would say well built systems, provided you intend to do anything productive and even slightly complex, should have a manual included, or else it isn’t a well built system.
It isn’t all bad. My laptop is 6, almost 7 years old at this point and it has received a few upgrades in that time. This change in habits does lower the minimum system requirements for its eventual replacement from “runs Mac OS X” to “runs emacs”, but I might be able to stretch its life out a bit longer now.
IMO there is still a place for schema-less document databases. It's just that Postgres's JSON columns mean you can get the best of both worlds, which makes Mongo look weak by comparison.
I would rather say, "Postgres's JSONB provides a hybrid compromise that may meet the needs of many users." JSONB feels like it's closer to creating more complex data types that are less primitive than those that SQL currently allows.
The real driver of the NoSQL movement, I believe, was that everybody wanted to be the next big social network or content aggregation site. Everybody wanted to be the next Facebook, Instagram, Twitter, etc., and that's what people were trying to build. Ginormous sites like these are one of the applications that strongly favor availability/eventual consistency over guaranteed consistency, whereas most other applications are quite the opposite.
Nobody really cares if your Instagram post shows up 10 minutes later in New York than it does in LA, and certainly not if the comments appear in similarly inconsistent order. It's one step above best-effort delivery. However, your bank, hospital, etc. often care quite a bit that their systems always represent reality as of right now and not as of half an hour ago because there's a network problem in Wichita.
The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."
> The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."
Even that's too simplistic. For most RDBMSes, the answer depends on how you have it configured, and usually isn't "Error". If you're using a serializable transaction isolation level, it usually means, "you might have to wait an extra few milliseconds for your answer, but we'll make sure we get you a good one." Other isolation levels allow varying levels of dirty reads and race conditions, but typically won't flat out fail the query. This is probably the situation most people are working under, since, in the name of performance, very few RDBMSes' default configurations give you full ACID guarantees.
To the "DB in NY knows something different from DB in LA" example, there are RDBMSes such as the nicer versions of MSSQL that allow you to have a geographically distributed database with eventual consistency among nodes. They're admittedly quite expensive, but, given some of the stories I've heard about trying to use many NoSQL offerings at that kind of scale, I wouldn't be surprised if they're still cheaper if you're looking at TCO instead of sticker price.
Many ATMs will still give you money when they're offline, and things become eventually consistent by comparing the ledger.
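That reconciliation step can be sketched in a few lines. This is a toy illustration of merging offline ledgers, not how any real ATM network works, and all the names and numbers are invented:

```python
# Toy reconciliation: each ATM appends to its own ledger while offline,
# and balances converge once the ledgers are merged.
def merge_ledgers(*ledgers):
    # Union of entries, deduplicated by transaction id, replayed in
    # timestamp order so every node ends up with the same history.
    seen = {}
    for ledger in ledgers:
        for entry in ledger:
            seen[entry["txid"]] = entry
    return sorted(seen.values(), key=lambda e: e["ts"])

atm_ny = [{"txid": "a1", "ts": 1, "amount": -40}]  # withdrawal in NY
atm_la = [{"txid": "b1", "ts": 2, "amount": -60}]  # withdrawal in LA

merged = merge_ledgers(atm_ny, atm_la)
balance = 200 + sum(e["amount"] for e in merged)
print(balance)  # 100
```

The point is that each node can keep accepting transactions while partitioned, at the cost of the balance being temporarily stale; consistency is restored by comparing ledgers afterward.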
Shops also generally want to take orders and payments regardless of network availability, so whilst they might generally act as CP systems, they'll be AP in the event of network downtime. They will likely lose access to fraud checks, though, so they may put limits on purchases, etc.
They're probably all CP locally and AP (or switchable from CP) from an entire system perspective.
I like the JSONB support. Not for storing data but to query it with shitty drivers.
Some array magic and jsonb_agg and suddenly you can get easy-to-decode JSON results instead of having to play with rows in your app. Yes, you can also do it with xml_agg, but these days people tend to consider anything XML as evil (they're wrong).
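A minimal sketch of that pattern, using Python's sqlite3 with SQLite's json_group_array as a stand-in for Postgres's jsonb_agg (table and column names invented; assumes an SQLite build with the JSON1 functions):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (post_id INTEGER, tag TEXT)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [(1, "nosql"), (1, "mongodb"), (2, "postgres")],
)

# One row per post, with the tags already packed as a JSON array:
# the driver only hands back a single string per row, which the app
# decodes in one json.loads call instead of assembling rows itself.
results = {}
for post_id, tags_json in conn.execute(
    "SELECT post_id, json_group_array(tag) FROM tags GROUP BY post_id"
):
    results[post_id] = json.loads(tags_json)
print(results)
```

In Postgres you'd write `jsonb_agg(tag)` and get a jsonb value back directly, which most drivers will hand to you as already-decoded JSON.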
I would say that development of JSON fields in Postgres and MySQL was accelerated by the adoption of Mongo.
Speed to market/first version using JSON stores is attractive, especially when you're still prototyping your product and won't have an idea of exact data structures until there's been some real world usage.
Standard rules about performance optimization apply. Denormalization can improve performance, but it can just as easily harm it.
For code, I think by now we all understand that you should always start with clean, well-factored code, and then optimize only as much as is necessary, which is usually not at all, and always under profiler guidance. It's the same with DBs: You start with a clean, well-normalized schema, and then de-normalize only as much as is necessary, which is usually not at all, and always under profiler guidance.
Also, keep in mind that improvements in compiler technology over time mean that the performance tricks of old can be useless or even actively harmful nowadays. This is true of SQL every bit as much as C.
Denormalization as a way of dealing with performance issues is like a guillotine to cure a headache. And more often than not, the headache is still there even after that.
It is true that for some narrow class of analytical workloads 20-25 years ago (behold BW of 199x), denormalized performance was better compared with straight, non-optimized runs of the same queries over a normalized schema. Since then, the exponential growth of available RAM and the huge increase in HDD streaming speed alongside stagnating IOPS (the main mistake in analytical workloads on HDD in the last 10-15 years: using a nested loop join with indexed lookups into large fact tables :) have made denormalization obsolete and harmful. If anything, the emergence of SSDs and huge RAM moved things even further toward and beyond normalization, by making "super-normalization", i.e. columnar tables, a viable everyday thing.
The idea that joins are slow is a holdover from the bad old days when everyone used MySQL and MySQL sucked at joins. On a more robust DBMS, a normalized schema will often yield better performance than the denormalized one. Less time will be lost to disk I/O thanks to the more compact format, the working set will be smaller so the DBMS can make more efficient use of its cache, and the optimizer will have more degrees of freedom to work with when trying to work out the most efficient way to execute the query.
(edited to add: If you're having performance problems with a normalized schema, the first place to look isn't denormalization, it's the indexing strategy. And also making sure the queries aren't doing anything to prevent proper index usage. Use the Index, Luke! is a great, DB-agnostic resource for understanding how to go about this stuff: https://use-the-index-luke.com )
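To make the "look at indexing first" point concrete, here is a tiny before/after using Python's sqlite3, with EXPLAIN QUERY PLAN standing in for Postgres's EXPLAIN (the schema is invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

query = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index, the planner has no choice but a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchone()[3]
print(plan_before)

# The same query after adding an index on the filtered column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchone()[3]
print(plan_after)
```

The first plan reports a scan of `orders`; the second reports a search using `idx_orders_customer`. The exact wording varies between SQLite versions, but the scan-vs-search distinction is the thing to check before reaching for denormalization.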
It's not; it's a strategy to improve the performance of a particular query or access pattern, and is usually the last resort after things like proper indexing, aggregations and materialized views.
JOINs are fast, and it all comes down to how much data you're moving. If it's a large table joining to a small set of values, then the joined data is quickly retrieved and applied to the bigger table, with great performance.
If the join is between two large tables where every combination is unique then that's the unique case where the joined table is just adding another hop to get the full set of data for each row and is a perfect candidate for denormalization, although in that case it probably should've been a single table to begin with. Of course there's a spectrum between these two scenarios but it takes a lot before denormalization makes sense on any modern RDBMS.
2. You denormalize in a structured way (e.g. dimensional modelling), rather than any old how.
3. You test the change.
Database query planners work better when they can take logically-safe shortcuts in their work. In large part that comes down to a properly-constructed schema.
Denormalizing makes it harder for the query planner. It also means you will probably lose out on future query planner enhancements.
Generally speaking, if someone wants to denormalize, I want to know the actual business value created and that the business risk is properly understood.
What really happened is people wanted something new but did not want to change the way they DESIGN their applications and processes. There is a place for NoSQL, but if you are going to use it as if it were SQL, you would be better served by an SQL database.
Also, I think MongoDB tried to be everything and failed to be good at anything. It offers neither stellar performance nor scalability, and I guess for most projects there is not much advantage over a regular SQL database. Certainly nothing to fight over when there are many more technology choices to make.
Just a few thoughts. A lot of little companies like picking hyped tools because they think it will differentiate them and serve them well later, but most companies never get to that stage and don't really need what is being offered. It seems much safer to take the tried and true standard solution that has worked for decades and just make your UX outstanding, rather than try to put together a "dream team" of new tech...
Another thought: most companies have absolutely no idea how to select technology. A lot of people in a position to decide don't have much hands-on experience with the new technology, and so they will decide based on other factors, like whether their manager would or would not like to see new technology, what other companies are doing, etc.
Another example is "Agile". Everybody is doing it, yet I have still to see a single company that understands what the term means. My current boss is a big promoter of "Agile", which in his language is a synonym for "Scrum". Yet when asked, he has never heard of The Phoenix Project, The Goal, the Theory of Constraints or basically any theory at all. So what the people are doing is fighting fires almost 100% of the time, with not much project work done for the effort and absolutely no improvements. Yet, because everybody complies with daily standups and Jira updates, we are 100% agile.
Nice writeup. I love this quote: "The first few times an engineer sees this kind of hype, they often think it's a structural shift. For engineers later in our career, we’ll often dismiss structural shifts as misplaced hype after getting burned too many times"
I found it a little funny that NoSQL started becoming popular during at least some of the same years that static typing started becoming popular (again).
I don't see it that way -- I feel like NoSQL's rise was more or less coincident with the adoption of Rails, Django, and Node over Java. The surge in interest in new static languages has mirrored the resurgence of Postgres, right around version 9.4 (and JSONB).
I think they're both responses to the same challenges: distributed, web-enabled applications.
On the server front, that means ever-increasing complexity, with decoupled microservices and latency issues that play nicely with the classic approaches to those domains (static typing, functional programming).
On the data front, sites like HN, Reddit, or Facebook need scalability more than consistency, and have oodles of 'uninteresting' data that jibes nicely with a schemaless document store.
I've noticed that, too. There's an interesting ping-pong effect where a fair number of people have flipped from strong typing at the database layer & dynamic typing at the view layer to the reverse — it seems like someone could write an interesting group psychology paper about how that cycle has repeated over the years.
People realized that a lot of the claimed "not only SQL" offerings were actually no SQL at all. It turns out it is nice to have those extra NoSQL features alongside a more traditional RDBMS, instead of just the NoSQL parts.
Also, the RDBMS world is full of some of the oldest scaling experts in the yellow pages. It's interesting how surprised people were that many of the "traditional" RDBMS were able to catch up on some of the scaling support that were the biggest advertised advantages of NoSQL.
It's interesting because I feel like the NoSQL world spent a lot of time reinventing the RDBMS from the "opposite direction". For all its faults, SQL is a fascinating language because it mostly ignores low level details of how the database scales, how it operates under the hood. It's not that SQL is intrinsically hard to scale (certainly at the relational algebra roots it shouldn't be hard, in theory), but it certainly leaves a lot of work to anyone building a database engine to figure out what/when/why/how to scale. I feel like a lot of RDBMS' query analyzers/planners resemble things like HBase a lot more than folks realize.
It's great that NoSQL realized that sometimes those "low level details" in an SQL engine are useful in their own ways, and have increased the spectrum of performance versus power/flexibility trade-off options. But it shouldn't be that big of a surprise that SQL databases remain competitive in that trade-off space, given under the hood they've had to think about a lot of that stuff over many decades.
https://www.mongodb.com/customers/guardian
https://www.mongodb.com/presentations/mongodb-guardian
https://www.slideshare.net/tackers/why-we-chose-mongodb-for-...
And reupping my previous, three-part series on MongoDB:
On MongoDB
NoSQL databases were the future. MongoDB was the database for "modern" web engineers and used by countless startups. What happened?
https://www.nemil.com/mongo/index.html