
A similar issue has come up for codes that I write. Among other things, I write low-level mathematical optimization codes that need fast linear algebra to run effectively. While there's a lot of emphasis on BLAS/LAPACK, those libraries work on dense linear algebra. In the sparse world, there are fewer good options. For things like sparse QR and Cholesky, the two fastest codes that I know about are SuiteSparse and Intel MKL. I've not tried it, but the SuiteSparse routines will probably work fine on ARM chips; however, they're dual-licensed GPL/commercial and the commercial license is incredibly expensive. MKL has faster routines and is completely free, but it won't work on ARM. (Note that it works fantastically well on AMD chips.) Anyway, it's not that I can't make my codes work on the new Apple chips, but I'd have to explain to my commercial clients that there's another $50-100k upcharge due to the architecture change and licensing costs due to GPL restrictions. That's a lot to stomach.
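For reference, the sparse direct-solve workflow being described looks roughly like this in SciPy (a minimal sketch on a made-up matrix; SciPy's default backend here is SuperLU, not the CHOLMOD/SPQR routines from SuiteSparse or MKL's sparse solvers discussed above):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Build a small sparse symmetric positive-definite system
# (a 1-D Laplacian), then factor and solve it.
n = 100
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csc")
b = np.ones(n)

# splu computes a sparse LU factorization (SuperLU under the hood);
# the factor object can be reused for many right-hand sides.
lu = spla.splu(A)
x = lu.solve(b)

print(np.linalg.norm(A @ x - b))  # residual should be tiny
```

The point of the specialized libraries is that, at large scale, the quality of the fill-reducing ordering and the factorization kernels dominates the cost, which is where SuiteSparse and MKL differentiate themselves.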


Apple's own Accelerate Framework offers both BLAS/LAPACK and a set of sparse solvers that include Cholesky and QR.

https://developer.apple.com/documentation/accelerate/sparse_...

Accelerate is highly performant on Apple hardware (the current Intel arch). I expect Apple to ensure the same for their M-series CPUs, potentially even taking advantage of the tensor and GPGPU capabilities available in the SoC.


Huh, this actually may end up solving many of my issues, so thanks for finding that! Outside of their documentation being terrible, they do claim the correct algorithms, so it's something to at least investigate.

By the way, if anyone at Apple reads this, thanks for the library, but, you know, calling conventions, algorithms, and options would really help on pages like this:

https://developer.apple.com/documentation/accelerate/sparsef...


That's the documentation page for an enumeration value, not a factorization routine (hence there are no calling conventions, etc, to document; it's just a constant).

Start here: https://developer.apple.com/documentation/accelerate/solving... and also watch the WWDC session from 2017 https://developer.apple.com/videos/play/wwdc2017/711/ (the section on sparse begins around 21:00).

There is also _extensive_ documentation in the Accelerate headers, maintained by the Accelerate team rather than a documentation team, which should always be considered ground truth. Start with Accelerate/vecLib/Sparse/Solve.h (for a normal Xcode install, that's in the file system here):

    /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Headers/Sparse/Solve.h


NumPy and SciPy reject the use of Accelerate due to faulty implementations of some routines. https://github.com/scipy/scipy/wiki/Dropping-support-for-Acc... We have never received any feedback from Apple about these bugs.


I noticed that SciPy has dropped support. I believe it wasn't only related to bugs, but also to a very dated LAPACK implementation (circa 2009). I can't tell from Apple's developer docs whether this has changed.

My sense is that Apple's focus is less on scientific computing and more on enabling developers to build computation-heavy multimedia applications.


Accelerate is also available (and highly performant) on ARM. I was not able to beat it with anything on ARM, including hand-coded assembly, at least for sgemm and simple dot products, which are the bread and butter of deep learning. It actually baffles me that Microsoft is not offering linear algebra and DSP acceleration in Windows out of the box. This creates friction, and most devs don't give a shit, so Windows users end up with worse perf on essentially the same hardware.


ARM themselves made a half-hearted attempt at addressing this with their Ne10 project (https://github.com/projectNe10/Ne10), but as far as I could see from the outside they never committed any real resources to it, and it now seems to be abandoned (no public commits for three years).


There's also https://github.com/ARM-software/ComputeLibrary, but Accelerate easily blows the doors off it, on the same hardware.


It worked well on PowerPC too and helped with the Intel transition.


> I'd have to explain to my commercial clients that there's another $50-100k upcharge due to the architecture change and licensing costs due to GPL restrictions.

Your complaint is kind of strange. You're blaming "GPL restrictions" but the cost is for a commercial license.


Well, if the FOSS license used were e.g. MIT, he wouldn't have to buy a commercial license; that's the parent's point. With GPL, he does, because otherwise his clients would have to make their own code/project conformant...


Yes, that's correct. I write open source software as well and I don't begrudge anyone for licensing under GPL. And, I'm perfectly willing to obtain a commercial license, but I'm going to pass that cost on to my customers. In this particular case, though, the question for them is whether they want Apple silicon bad enough to pay an additional $50-100k in software licensing costs to keep their code private or to just buy an Intel or AMD chip. I know where I'd spend my money.


How do these types of licenses deal with software updates in general? Presumably, at some point they'll need to buy a new license anyway, and the issue will be moot, right?

And Rosetta will probably be around for a while...


> How do these types of licenses deal with software updates in general? Presumably, at some point they'll need to buy a new license anyway, and the issue will be moot, right?

It sounds like Intel produces an implementation of this thing that works on Intel and makes it available for free, whereas ARM don't (although another comment suggests Apple actually do), so you have to buy an expensive third-party implementation instead. That's not a difference that'll go away in the short term, and you can see why a processor company might legitimately choose one or the other approach.


Apple released the first Intel Macs to consumers in 2006, and in 2011 removed Rosetta from Mac OS X, so I guess it depends on what you mean by a while.


You were pretty specific that it was entirely the fault of the GPL:

> I'd have to explain to my commercial clients that there's another $50-100k upcharge due to the architecture change and licensing costs due to GPL restrictions.


What point are you trying to make here? The poster has been very clear on the mechanics, which are quite understandable, but I don't understand what you are trying to say. Is it just that you think it does not put the GPL in a positive enough light? I don't mean to put words in your mouth, but that's my current best guess.


Apple forced them into a situation that gives them fewer options. That isn't a statement about how good or bad each option is. It's a statement about the consequences that Apple's choices have for developers.

If I'm a travel agent and an affordable hotel near a travel destination closes down, I might have to book my clients in a nicer but more expensive hotel. Their trip will be a bit more expensive. Or maybe they'll travel to a different city. It doesn't mean I dislike the nicer hotel.


It seems clear enough from context that the "GPL restrictions" are that if they used the GPL-licensed library, the commercial clients might run into legal issues with their use of it, necessitating that they purchase the commercial license. It's not uncommon for businesses to have a prohibition against using GPL software not only in their shipping products but anywhere in their toolchain. (You can argue that's a counterproductive prohibition, but "your legal department just needs to change their mind on this" may not be an argument a vendor can effectively make.)


I would not make an argument even if I thought a client would accept it. If they are incompetent they will decide to use the GPL code with sloppy oversight, violate the terms of the GPL, then they will hold a little grudge against you for the advice that got them in trouble. Sloppy companies have no internal accountability, so it's your fault.

I use GPL code all the time at home and I would license many things GPL, but there's no reason to push GPL software at corporations. They should have limited options and spend money, possibly expanding MIT code, possibly just raising the price of engineers by keeping engineers occupied.


No, he was pretty clear that it was due to needing to use that solver because it's the only one that works on ARM right now. The dual licensing was only relevant in that the client would have to pay for the commercial license (due to the GPL restrictions).

> MKL has faster routines and is completely free, but it won't work on ARM


That's still pretty silly. If the thing wasn't open source at all, you would still have to buy a license.

If your complaint is boo hoo, some people charge for software...well consider me unsympathetic.


Oh, so it's terrible to pay for software? How awful! Especially ironic because I'm sure the parent isn't working for free.


We all pay for software, but it's the amount that really shapes decisions. Most organizations have a dollar limit below which we can just charge a purchase card and above which they have to seek approval. In this particular case, the software costs are higher than what can likely go onto a p-card, so it becomes a real pain to acquire. In fact, the software is so expensive that its cost would likely eclipse the cost of the computer itself. So basically, we're looking at a decision where the client can use a more performant library and save $100k as long as they stay off of Apple silicon.

That's really the point I'm trying to make and not to criticize anyone for using a GPL license. Moving to these new chips, in many cases, will be a much larger cost to an organization than just the cost of the computer.


>Oh, so it's terrible to pay for software?

Compared to not paying for it? Yes.

>Especially ironic because I'm sure the parent isn't working for free.

So? Who said that when you get paid yourself it stops being awful to have to pay for things?


I imagine the conversation with the clients will go like this:

- Here is a quote for 100k for adding SuiteSparse to the code.

- 100k‽ But I have found on the internet that SuiteSparse is free! Justify your quote.

At that point, they will have to explain to the client what GPL is and why they cannot use the free version.


> optimization codes

I'm curious do people in numerical specialties say "codes" (instead of "code")? I don't often hear it that way but I'm not in that specialty.


Really common usage in science/numerical computing.

I was trying to identify when, in normal usage, you'd say "numerical codes" rather than "numerical software" or just "numerical code". It seems a bit slippery!

Some contexts where it's prevalent: supercomputing, Fortran, national labs, large or multifaceted software. I also associate it with manager-speak ("our team has ported 77% of the simulation codes to HPSS").


Yes, this is a Fortran-ism which persists unto the present day.


Yes. e.g., "I work on multiphysics codes."

software => codes


Have you tried PETSc? It does sparse (and dense) LU and Cholesky, plus a wide variety of Krylov methods with preconditioners.

It can be compiled to use MKL, MUMPS, or SuiteSparse if available, but also has its own implementations. So you could easily use it as a wrapper to give you the freedom to write code that you could compile on many targets with varying degrees of library support.


I like PETSc, but how do its internal algorithms compare on shared memory architectures? I'd be curious if anyone has updated benchmarks between the libraries. I suppose I ought to run some in my copious amount of free time.

Sadly, the factorization I personally need the most is a sparse QR factorization and PETSc doesn't really support that according to their documentation [1]. Or, really, if anyone knows a good rank-revealing factorization of A A'. I don't really need Q in the QR factorization, but I do need the rank-revealing feature.

[1] https://www.mcs.anl.gov/petsc/documentation/linearsolvertabl...


PETSc developer here. You're correct that we don't have a sparse QR. I'm curious about the shapes in your problem and how you use the rank-revealed factors.

If you're a heavy user of SuiteSparse and upset about the license, you might want to check out Catamari (https://gitlab.com/hodge_star/catamari), which is MPLv2 and on par with, or faster than, CHOLMOD (especially in multithreaded performance).

As for PETSc's preference for processes over threads, we've found it to be every bit as fast as threads while offering more reliable placement/affinity and less opportunity for confusing user errors. OpenMP fork-join/barriers incur a similar latency cost to messaging, but accidental sharing is a concern and OpenMP applications are rarely written to minimize synchronization overhead as effectively as is common with MPI. PETSc can share memory between processes internally (e.g., MPI_Win_allocate_shared) to bypass the MPI stack within a node.


I'll have a look at Catamari and thanks for the link. Maybe you'll have a better idea, but essentially I need a generalized inverse of AA' where A has more columns than rows (short and fat). Often, A becomes underdetermined enough that AA' no longer has full rank, but I need a generalized inverse nonetheless. If A' were full rank, then the R in the QR factorization of A' would be upper triangular. If A' is not full rank, but we can permute the columns so that the R in the QR factorization of A' has the form [RR S], where RR is upper triangular and S is rectangular, we can still find the generalized inverse. As far as I know, the permutation that ensures this form requires a rank-revealing QR factorization.

For dense matrices, I believe GEQP3 in LAPACK pivots so that the diagonal elements of R are decreasing, so we can just threshold and figure out when to cut things off. For sparse, the only code I've tried that's done this properly is SPQR with its rank-revealing features.
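For the dense case, the GEQP3-style thresholding described above can be sketched with SciPy's column-pivoted QR, which calls GEQP3 underneath (the matrix and tolerance here are made up for illustration; this is the dense path, not SPQR):

```python
import numpy as np
from scipy.linalg import qr

# Short-and-fat, rank-deficient A: 4 rows but only rank 2
# (the last two rows duplicate the first two).
rng = np.random.default_rng(0)
B = rng.standard_normal((2, 10))
A = np.vstack([B, B])  # shape (4, 10), rank 2

# Column-pivoted QR (GEQP3): |diag(R)| is non-increasing,
# so thresholding it yields a numerical rank estimate.
Q, R, piv = qr(A, mode="economic", pivoting=True)
d = np.abs(np.diag(R))
tol = d[0] * 1e-10
rank = int(np.sum(d > tol))
print(rank)  # 2
```

The returned permutation `piv` is exactly the column ordering that puts R into the [RR S] form described above, with the well-conditioned triangular block in front.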

In truth, there may be a better way to do this, so I might as well ask: Is there a good way to find the generalized inverse of AA' where A is rank-deficient as well as short and fat?

As far as where they come from, it's related to finding minimum norm solutions to Ax=b even when A is rank-deficient. In my case, I know the solution exists for a given b, even though the solution may not exist in general.


If you have one (or a small number of) right-hand sides, I would try to make LSQR work. It can find a minimum norm solution even if A is rank-deficient, and you can use preconditioning.

Also, if your problem is a good fit for a method like this, it could be impetus to add it to PETSc. https://epubs.siam.org/doi/pdf/10.1137/120866580
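A minimal sketch of the LSQR suggestion, using SciPy's implementation on a made-up rank-deficient but consistent system (starting from the default zero initial guess, the iterates stay in the row space of A, so LSQR lands on the minimum-norm solution):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# Rank-deficient, short-and-fat A: shape (4, 6), rank 2.
rng = np.random.default_rng(1)
B = rng.standard_normal((2, 6))
A = np.vstack([B, B])

# Consistent right-hand side, so an exact solution exists.
x_true = rng.standard_normal(6)
b = A @ x_true

# LSQR with tight tolerances; [0] extracts the solution vector.
x = lsqr(A, b, atol=1e-12, btol=1e-12)[0]

# Compare against the pseudoinverse (minimum-norm) solution.
x_mn = np.linalg.pinv(A) @ b
print(np.linalg.norm(x - x_mn))
```

For the large sparse systems under discussion, A would be a sparse matrix or LinearOperator rather than a dense array, and preconditioning would be applied by passing the preconditioned operator to LSQR.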


Unfortunately, in my case, the generalized inverse of AA' is the preconditioner for the system, which is why I need the factorization of A'. Essentially, I take this factorization and then run it through my own iterative method. When I run tests in MATLAB, SPQR scales fine for matrices of at least a few hundred thousand rows and columns. For larger problems, it would be nice to essentially have an incomplete Q-less QR factorization, which I don't think exists, but it should be an extension of the incomplete Cholesky work.

But, yes, LSQR, or more fittingly LSMR, solves a similar problem, but those are the iterative solvers and I need the preconditioner, which is what I'm using the factorization for.
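SciPy has no incomplete QR or Cholesky, but its incomplete LU can sketch the general pattern being described: compute an (incomplete) factorization once, then feed its solve as a preconditioner to an iterative method (illustrative tridiagonal matrix, not the AA' system above):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A nonsymmetric sparse system (1-D convection-diffusion flavor).
n = 200
A = sp.diags([-1.0 * np.ones(n - 1),
              2.5 * np.ones(n),
              -1.2 * np.ones(n - 1)],
             [-1, 0, 1], format="csc")
b = np.ones(n)

# Incomplete LU as a preconditioner; wrap its triangular solves
# in a LinearOperator so the Krylov method can apply M^{-1}.
ilu = spla.spilu(A, drop_tol=1e-4)
M = spla.LinearOperator((n, n), matvec=ilu.solve)

x, info = spla.gmres(A, b, M=M)
print(info)  # info == 0 means GMRES converged
```

An incomplete Q-less QR, if it existed, would slot into the same role: an approximate triangular factor R whose solves precondition the iteration on AA'.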


I've made the point that GCC and free linear algebra is infinitely faster on platforms of interest (geometric mean of x86_64, aarch64, ppc64le) while still having similar performance on x86_64. I thought MKL used suitesparse, or is that just matlab?


As far as I know, MKL has its own implementation. As some evidence of this, here's an article comparing their sparse QR factorization to SPQR, which is part of SuiteSparse [1].

As far as MATLAB goes, I believe it uses both. I have a MATLAB license and it definitely contains a copy of MKL along with the other libraries. At the same time, its sparse QR factorization definitely uses SPQR, which is part of SuiteSparse. In fact, there are some undocumented options to tune that algorithm directly from MATLAB, such as spparms('spqrtol', tol).

As a minor aside, this is actually one of the benefits of a MATLAB license: since MathWorks has purchased the requisite commercial licenses for the SuiteSparse codes, it's easier to deal with some commercial clients who need this capability, at a lower price than a direct license itself. This, of course, means using MATLAB and not calling the library directly. It's one of the challenges to using, for example, Julia, which I believe does not bundle the commercial license, but instead relies on the GPL.

[1] https://software.intel.com/content/www/us/en/develop/article...


Just a note in support of Matlab's sparse capabilities. For the last couple of years, I used Matlab successfully on large, sparse multiplication and factorization problems. A friend who was using R simply could not approach the scale I was able to work at, and I assume it's due to weak sparse support.

I was multiplying and inverting sparse triangular matrices of size 650K x 650K with Matlab, on a laptop. Just amazing.


I'm surprised there doesn't seem to be anything in CRAN using SuiteSparse. It could presumably run at petascale, similarly to the dense support, if someone did similar work.


I doubtless mis-remembered about MKL, thanks.

I'm baffled why there would be a problem with commercial users running a free software program like Julia or GNU Octave+SuiteSparse; that's Freedom 0. (And commercial /= proprietary, of course.)


Most of the time, you're absolutely right especially with how Octave or Julia code is normally distributed. The code is delivered to the client and the client runs the code on their system. No GPL violations have occurred.

That said, I believe it gets trickier once we start compiling the code. Say I want to develop a piece of software for my client and I don't want them to have the source. Octave doesn't really have a way to do this, but MATLAB does, and since MATLAB has purchased all of the requisite licenses, we're good to go. Julia makes me more uncomfortable. We can make binaries with PackageCompiler.jl, but if we do, we should be subject to the provisions in the GPL. That's no different than any other piece of software, but Julia, Octave, and MATLAB all use these libraries, and most people don't know that something like the chol command hooks into SuiteSparse in the backend.


Yeah, the Julia devs are quite interested in removing our last few GPL dependencies and replacing them with something in pure julia. It'll take time though.


SuiteSparse switched from GPL to LGPL about a year ago if that makes a difference (for the couple of components I was looking at anyway).


Very cool and thanks for the heads up. I just went and checked and here's where it's at:

  SLIP_LU: GPL or LGPL
  AMD: BSD3
  BTF: LGPL
  CAMD: BSD3
  CCOLAMD: BSD3
  CHOLMOD Check: LGPL
  CHOLMOD Cholesky: LGPL
  CHOLMOD Core: LGPL
  CHOLMOD Demo: GPL
  CHOLMOD Include: Various (mostly LGPL)
  CHOLMOD MATLAB: GPL
  CHOLMOD MatrixOps: GPL
  CHOLMOD Modify: GPL
  CHOLMOD Partition: LGPL
  CHOLMOD Supernodal: GPL
  CHOLMOD Tcov: GPL
  CHOLMOD Valgrind: GPL
  CHOLMOD COLAMD: BSD3
  CSparse: LGPL
  CXSparse: LGPL
  GPUQREngine: GPL
  KLU: LGPL
  LDL: LGPL
  MATLAB_Tools: BSD3
  SuiteSparseCollection: GPL
  SSMULT: GPL
  RBio: GPL
  SPQR: GPL
  SuiteSparse_GPURuntime: GPL
  UMFPACK: GPL
  CSparse/ssget: BSD3
  CXSparse/ssget: BSD3
  GraphBLAS: Apache2
  Mongoose: GPL
There's probably a bunch of mistakes in there, but that's what I found scraping things moderately quickly. Selfishly, I'd love SPQR to be LGPL, but everyone is free to choose a license as they see fit.


Would their workflow allow just keeping a server on hand to do the number crunching, and still getting to use Apple Silicon on a relatively thin client?


>MKL has faster routines and is completely free, but it won't work on ARM.

It will probably be ported though, if there's a demand...


Maybe, but note that this is the Intel MKL, a library developed and maintained by Intel. It is not a secret that Intel does this to support their ecosystem, and they have been caught intentionally crippling support for AMD processors in the past [1]. Intel has recently been adding better support for AMD processors [2], but many suspect that is intended to help x86 as a whole better compete with ARM. If it does get ported, it is highly unlikely to have competitive performance.

[1] https://news.ycombinator.com/item?id=24307596

[2] https://news.ycombinator.com/item?id=24332825


Thanks for the links. If anyone is wondering about some of the hoops that need to be jumped through to make it work, here's another guide [1].

One question in case you or anyone else knows: what's the story behind AMD's apparent lack of math library development? Years ago, AMD had ACML as their high-performance BLAS competitor to MKL. Eventually, it hit end of life and became AOCL [2]. I've not tried it, but I'm sure it's fine. That said, Intel has done steady, consistent work on MKL and added a huge amount of really important functionality, such as its sparse libraries. When it works, AMD has also benefited from this work, but I've been surprised that they haven't made similar investments.

Also, in case anyone is wondering, ARM's competing library is called the Arm Performance Libraries. Not sure how well it works and it's only available under a commercial license. I just went to check and pricing is not immediately available. All that said, it looks to be dense BLAS/LAPACK along with FFT and no sparse.

[1] https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AM...

[2] https://developer.amd.com/amd-aocl/


> Eventually, it hit end of life and became AOCL [2]. I've not tried it, but I'm sure it's fine.

It's ok. I did some experiments with transformer networks using libtorch. The numbers on a Ryzen 3700X were (sentences per second, 4 threads):

OpenBLAS: 83, BLIS: 69, AMD BLIS: 80, MKL: 119

On a Xeon Gold 6138:

OpenBLAS: 88, BLIS: 52, AMD BLIS: 59, MKL: 128

OpenBLAS was faster than AMD BLIS. But MKL beats everyone else by a wide margin because it has a special batched GEMM operation. Not only do they have very optimized kernels, they actively participate in the various ecosystems (such as PyTorch) and provide specialized implementations.
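The batched-GEMM pattern can be sketched in NumPy (this shows the access pattern only; MKL's batched sgemm and torch.bmm are the optimized kernels being discussed):

```python
import numpy as np

# Batched GEMM: multiply a whole stack of small matrices in one call,
# instead of issuing one GEMM per matrix from a Python loop.
rng = np.random.default_rng(0)
batch, m, k, n = 64, 32, 48, 16
A = rng.standard_normal((batch, m, k)).astype(np.float32)
B = rng.standard_normal((batch, k, n)).astype(np.float32)

# One library call over the whole batch...
C_batched = np.matmul(A, B)

# ...numerically equivalent to a loop of individual GEMMs, but with
# far less per-call overhead when the backend has a batched kernel.
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])

print(np.allclose(C_batched, C_loop, atol=1e-4))
```

Transformer inference is dominated by exactly these small, uniformly-shaped multiplies (one per attention head per layer), which is why a specialized batched kernel moves the sentences-per-second numbers so much.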

AMD is doing well with hardware, but it's surprising how much they drop the ball with ROCm and the CPU software ecosystem. (Of course, they are doing great work with open sourcing GPU drivers, AMDVLK, etc.)


If you care about small matrices on x86_64, you should look at libxsmm, which is the reason MKL now does well in that regime. (Those numbers aren't representative of large BLAS.)


A free version of the Arm Performance Libraries is available at:

https://developer.arm.com/tools-and-software/server-and-hpc/...


> What's the story behind AMD's apparent lack of math library development?

I don't see a story. AMD supports a proper libm for gcc and llvm, and has its own libm, BLAS, LAPACK, ... at https://developer.amd.com/amd-aocl/

It's just that their rdrand instruction is broken on most Ryzens if you don't patch it, and Fedora's firmware updates don't patch it for you.


You just run MKL from the oneapi distribution, and it gives decent performance on EPYC2, but basically only for double precision, and I don't remember if that includes complex.

ACML was never competitive in my comparisons with Goto/OpenBLAS on a variety of opterons. It's been discarded, and AMD now use a somewhat enhanced version of BLIS.

BLIS is similar to, sometimes better than, ARMPL on aarch64, like thunderx2.


In what world will Intel port MKL - Intel intellectual property - to ARM? The whole purpose of Intel's software tools is as an enabler and differentiator for their architecture and specifically their parts.


I don't know about this proprietary technology specifically, but Intel is a huge company with some FOSS friendliness. USB 4 is based on Thunderbolt 3, so I guess they licensed that one.


In a world where Intel already had licensed ARM and built it in the past:

https://newsroom.intel.com/editorials/accelerating-foundry-i...


That linked article from 2016 is about Intel's Custom Foundry program, which I'm fairly sure is for building chips under contract to other companies. It promotes that they have "access to ARM Artisan IP," but doesn't specifically mention an ARM version of MKL that I see. Intel's page on MKL itself lists compatible processors, and ARM is conspicuously absent:

https://software.intel.com/content/www/us/en/develop/tools/m...

And, this question on Intel's own forums from 2016 at least suggests that there wasn't an MKL version for ARM in the time frame of the article you're linking to, either:

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Libr...

So, from what I can tell, while Intel is an ARM licensee and made ARM CPUs in the past, they haven't made their own ARM CPUs for years and there's no sign they ever made MKL for any ARM platform. Never say never, but I think the OP is basically right -- there's not a lot of incentive for Intel to produce one.


Intel had sold most of the relevant ARM IP and product lines to Marvell in 2006.


MKL is heavily optimized for Intel microarchs and purposely crippled on AMD (I believe dgemm is fast, sgemm slow). I don't think MKL benefits from optimizing it for Apple Silicon, especially considering Apple ditched Intel's hardware.


No it won't. MKL is an Intel toolkit, so they will surely not support Apple's move to dump Intel processors.



