From what I understood, Subuser does introduce the duplication problem just because of the way it works. It didn't exist with traditional package management when done right (because shared components are packaged separately), and it didn't exist with Docker (because it uses layers). If software (any software) essentially adds some issue and then tries to fight it off, it's very likely that things will be ugly in an architectural sense.
> But when you update a dependency, even if only one line of one file has changed, that dependency must be downloaded completely anew
> and it didn't exist with Docker (because it uses layers)
There's a big spectrum of 'solving the deduplication problem' and Docker is towards the 'when all you have is a sledgehammer' end.
Saying "you can manually arrange Dockerfiles so that they cooperate and share layers" is not solving things in an architectural sense! You essentially have to be using a single custom tool (Dockerfiles) to create your images, and then you need to apply thinking power to consider how best to arrange your images (e.g. having a 'base' package.json to install a bunch of things common to many apps, then an additional package.json per app - see the sketch below).
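For concreteness, that manual arrangement looks roughly like this - a sketch only, where every name (node:14, mycompany/node-base, the split package.json files) is made up for illustration:

```
# base/Dockerfile -- packages common to many apps, built once and
# tagged as mycompany/node-base; these layers get shared on disk
FROM node:14
WORKDIR /app
COPY package.base.json package.json
RUN npm install

# app/Dockerfile -- each app stacks only its own delta on top of the
# shared base layers
FROM mycompany/node-base
COPY package.app.json extra/package.json
RUN cd extra && npm install
COPY src/ src/
```

Note the sharing is purely positional: it works only because every app starts FROM the same base image, and any change to the base means every downstream image must be rebuilt before it shares layers again. No tool checks that the sharing actually happens.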
It's getting better with Docker 1.10 (layer content is no longer linked to parent layers, so ADDing the same file in different places should reuse the same layer) but it's still pretty imperfect. I created https://github.com/aidanhs/dayer to demonstrate the ability to extract common files from a set of images and use them as a base layer, which is another improvement. Even better would be a pool of data split with a rolling checksum, like the one bup creates - short of domain-specific improvements (e.g. LTO when compiling C), I think this is probably the best architectural thing you can do.
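To make "a pool of data split with a rolling checksum" concrete, here is a minimal Python sketch of content-defined chunking (a toy Rabin-Karp rolling hash; the window size, mask, and constants are illustrative choices, not bup's actual parameters):

```python
import hashlib

WINDOW = 48              # rolling-hash window in bytes
MASK = (1 << 12) - 1     # cut on average every ~4 KiB
B, M = 257, 1 << 32      # hash base and modulus (illustrative values)
BW = pow(B, WINDOW, M)   # B**WINDOW, precomputed for the sliding step

def chunks(data: bytes):
    """Yield chunks whose boundaries depend only on nearby bytes."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % M                   # new byte enters the window
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * BW) % M  # oldest byte leaves it
        if i + 1 - start >= WINDOW and (h & MASK) == MASK:
            yield data[start:i + 1]              # content-defined cut point
            start = i + 1
    if start < len(data):
        yield data[start:]

def store(blobs):
    """Store every distinct chunk once, keyed by its hash."""
    pool, recipes = {}, []
    for blob in blobs:
        keys = []
        for c in chunks(blob):
            k = hashlib.sha1(c).hexdigest()
            pool.setdefault(k, c)    # duplicate chunks land on one entry
            keys.append(k)
        recipes.append(keys)         # a blob = an ordered list of chunk keys
    return pool, recipes
```

Because the cut points depend only on the last few dozen bytes, a one-line edit to a file only changes the chunk(s) around the edit; every other chunk keeps its hash and deduplicates against what's already in the pool.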
Oh, sorry. Yes. I think I'm with you on this. I didn't mean to comment on the quality of this approach. I just meant that Docker has its means of avoiding duplication by having shared base layers - so we can say the problem [mostly] didn't exist in the systems Subuser started from - but it's a completely different matter whether the approach it takes is good or not.
No, algorithmic deduplication is not a kludge! It is the opposite of a kludge. It is a beautiful way of letting the computer do hard work for you! Rather than trying to deduplicate things by hand (aka traditional dependency management), you let the computer do it for you :)
Oh, no, I guess I wasn't clear about what I meant. Sorry.
Data compression (which deduplication is essentially a subset of) absolutely isn't a kludge. I meant that introducing duplication (by design) and then inventing some workaround to get back to square one - that's the suspicious part. It adds complexity where there could be none.
Square one being, in your opinion, cramming all of the dependencies together in one place? There is a big advantage to immutable dependencies, though... Your code NEVER breaks. If the dependencies are immutable, and the architecture stays the same, then your code will run. Of course, you can still use apt-get or another package manager to build your images, so the whole updating-of-dependencies thing is no worse than where you started. It's just that you have the option of not changing things, and thus of not breaking things as well.
Isn't it Subuser that crams all the dependencies together, in one place (an image)? So there was that proposal to deduplicate images, in the sense that shared files are automatically detected and stored only once, saving disk space. But then, isn't Subuser completely unaware of any metadata a particular file may have, so it can't really tell the difference between libz and libm, or know that all those binary-different libpng16.so files have 100% compatible ABIs and are interchangeable (while not all libpng12.so files are)?
From what I understood, Subuser is a package manager (plus permission manager) that doesn't know a thing about what it's packaging - only the large-scale image.
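That proposal amounts to something like the following sketch (hypothetical, not Subuser's actual code): walk several unpacked image trees, hash file contents, and keep one copy per distinct hash. Note exactly what it can't see: two binary-different builds of libpng16.so hash differently and stay duplicated, compatible ABI or not.

```python
import hashlib, os

def dedup(image_roots):
    """Replace byte-identical files across image trees with hardlinks.
    Assumes all trees live on one filesystem (hardlinks require that)."""
    seen = {}    # content hash -> path of the first copy kept
    saved = 0
    for root in image_roots:
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                with open(path, 'rb') as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                if digest in seen:
                    saved += os.path.getsize(path)
                    os.remove(path)
                    os.link(seen[digest], path)   # share one inode
                else:
                    seen[digest] = path
    return saved   # bytes reclaimed; the tool saw only bytes, never packages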
Package managers keep all libraries separate; that's the whole point of why package managers were invented in the first place. If a package manager uses some database, dependencies may lie together in the filesystem, but they're completely separate in the package manager's database. If the package management system uses the filesystem as its database, then packages are separate in that regard, too. There's immutability as well, sometimes enforced, sometimes along the lines of storing `dpkg -l | grep '^ii' | awk '{ print $2 "=" $3 }'` output to save the exact state.
But... Won't it be an ugly kludge?
> But when you update a dependency, even if only one line of one file has changed, that dependency must be downloaded completely anew
Are you aware of debdelta?