> our faster function could take advantage of no more than 8 cores; beyond that it started slowing down. Perhaps it started hitting some bottleneck other than computation, like memory bandwidth.
@itamarst Yes, this in interesting, you should profile it and get to the bottom of the issue! It seems like in my experience that being limited by hyperthreading or instruction-level parallism is relatively rare, and much more often it’s cache or memory access patterns or implicit synchronization or contention for a hardware resource. There’s a good chance you’ll learn something useful by figuring it out. Maybe it’s memory bus contention, maybe it’s cache, maybe numba compiled in something you aren’t expecting.
Worth nothing that using 20 on the fast test isn’t that much slower than using 8. A good first guess/proxy for number of threads to use is the number of cores, and that pays off in this case compared to using too few cores.
Out of curiosity, do you know if your images are stored row-major or column major? I see the outer loop over shape[0] and inner loop over shape[1]. Is the compiled code stepping in memory by 1 pixel at time, or by a whole column? If your stride is a column, you may be thrashing the cache.
I’d also be curious to hear how the speed of this compiled code compares to a numpy or PIL image threshold operation, if you happen to know.
NumPy default is that you iterate over the earlier dimensions first.
The slow code is likely at least partially slow due to branch misprediction (this is specific to my CPU, not true on CPUs with AVX-512), see https://pythonspeed.com/articles/speeding-up-numba/ where I use `perf stat` to get branch misprediction numbers on similar code.
With SIMD disabled there's also a clear difference in IPC, I believe.
The bigger picture though is that the goal of this article is not to demonstrate speeding up code, it's to ask about level of parallelism given unchanging code. Obviously all things being equal you'll do better if you can make your code faster, but code does get deployed, and when it's deployed you need to choose parallelism levels, regardless of how good the code is.
> when it's deployed you need to choose parallelism levels, regardless of how good the code is.
Yes, absolutely, exactly. That’s why it can be really helpful to pinpoint that cause of slowdown, right? It might not matter at deployment time if you have an automated shmoo that calculates the optimal thread load, but knowing the actual causes might be critical to the process, and/or really help if you don’t do it automatically. (For one, it’s possible the conditions for optimal thread load could change over the course of a run.)
Is there a good method to split out memory and CPU bottlenecks on a CPU? most folks simply optimize until they hit 100% and assume all CPU usage is created equal.
Seems like there needs to be a tool similar to htop with utiization stats for different instructions to indicate how much of the CPU is idle/busy.
I think most people would measure instructions per cycle as an estimate of how busy the CPU execution units actually are. Lots of performance tools do measure it.
IPC is not a good performance metric - especially for an instruction set like x86, but particularly for SIMD ISAs. Two seemingly similar instructions sequences may get decoded into a different number of uops, which in turn occupy different execution ports. The most extreme example of this being scalar code without stalls. It tends to have high IPC, even though it makes no use of FP/SIMD ports.
Keeping the execution pipeline busy is not a useful goal in isolation. Really, IPC is better left as a sanity check.
Consider appropriate hardware counters. (Note that multiplexing to measure more then three(?) at once can be misleading.) At least for serial code on x86, there's "top-down performance analysis". Some tools (I know Maqao) suggest how to address bottlenecks.
Personally, I dislike configuration parameters that let future admins of the system change parameters like concurrency of certain processes, etc.
A lot of the time even I can't tell what is going to be the optimum setting.
So for a number of years, rather than expose a configuration parameter, I am building in a performance test that establishes the values of these parameters.
This is a little bit of search and a hill climb through the multidimensional space of all relevant parameters. It is not perfect, but in my experience it is almost always better than I can do manually and always better than an operator without intimate understanding of the system.
The results are cached in a database and can be re-executed if change to configuration is found (just take a number of parameters from your environment, version of the application, etc. and hash the value and check if you already have parameters in the database).
This works well for expensive long-running things, but works less well for more short-lived programs, and even for expensive long-running things there may be more considerations than "what's the fastest value?" I intentionally set the default jobs of "make" to "4" rather than "8" because I don't want to utilize my full system when compiling stuff, so I can still use it for other things at a reasonable speed. There's lots of reasons you might want to do this: on a web platform you don't really want to use up all resources by the one expensive thing, but what's "reasonable" and "desired" often differs – often there is no objectively best "optimum setting".
You do not have to target best absolute performance. You can choose your own fitness functions to balance absolute performance and other aspects of the system like resource usage, latency, or any other aspects of the application. Think about what logic the operator would use to establish the parameter and capture it as code or as fitness function.
As usual, there is no single universal solution to optimising performance (and in fact it can be trivially proven to not exist in general case). But you do not have to find perfect solution, you just have to be able to find as good or better solution than a human operator would.
When you have a large system you can be working with thousands or tens of thousands of parameters. I usually see that even if the application is migrated to different hardware, the configuration parameters like buffer sizes, concurrency, connection pools, etc. are not really revisited. The effect is that usually the application will work suboptimally for the environment it is in.
My solution is an attempt to keep this part of application configuration sufficiently close enough to optimal to be usable and do it automatically without operator involvement (and associated cost and possibility for defects).
Yeah I’m in agreement with you. It can be hard for performance things (vs tuning parameters for numerical algorithm performance) because such tests can be inherently difficult to reproduce and writing the hill climbing can be tricky. It’s a goal to aspire to for sure. Have you found any tooling to do the hill climb search for you in a multidimensional space?
Make also has an option to only spawn new jobs when the load average is below a given value, which can be used to get make to reduce the number of jobs running in parallel when there are plenty of other processes using CPU time.
Put it in a container where you can explicitly limit CPU / memory usage?
I run all builds in containers/vms anyway as I am too tired to keep my personal machine able to build hundreds of different things. It is better to have a dedicated development environment frozen for each of the projects I work on.
I just want to run make and not slow down Firefox or maybe some other things I'm doing. This simple alias I've had for over 20 years works, on all (Unix-y) systems. Why make everything a hypercomplex story?
I am not surprised. I like to say that it is hard to have an original thought. Even if you think you are original, it is more likely somebody else has already done it in the past. Most of the computing novelties I am seeing are really old ideas, revisited and potentially improved upon.
Thanks for the link, I will take a look when I have a moment:)
Eventually the pendulum swung back for BLAS libraries, the currently popular strategy is hand-tuned assembly kernel in libraries where the most important kernels have been identified.
But BLAS is a very weird case, there are only a couple kernels that you care about (and the most important, GEMM, has been studied like crazy). And BLAS is used under the hood by pretty much the whole HPC community, so there’s incentive for tons of expert human time investment.
You can look at ilbflame/BLIS for the modern strategy.
ATLAS is still neat, perhaps one can even derive more generally applicable lessons from that library, not sure…
I thought the point of ATLAS wasn't the idea that automagical tuning was going to necessarily be best. Rather that it was expensive to [edit hand tune] for all the hardware variations that were around and this was a practical approach to get to pretty damn good even if nobody had spent the money to produce a hand optimized library for your hardware (or, they had, but you didn't want to pay what they were asking).
e.g. I remember vaguely people using it on AMD cpus because they couldn't use icc/mkl effectively...
> Rather that it was expensive to automate for all the hardware variations that were around
Is the “automate” that I’ve italicized supposed to be hand-tune or something along those lines? If so, I think I agree. I think this was the justification for it, and it made sense at the time, and it might make sense for future architectures or exotic ones now.
But eventually Kazushige Gotō wrote gotoBLAS, which really emphasized the importance of the matrix-multiplication kernel and showed how much performance could come from getting it right. Since it is only a handful of routines, it is possible to hand-tune them. BLIS went on to formalize this in a nice open-source library, so any vendor can drop their kernels in.
What is special about a particular CPU from this point of view is the SIMD extension and the cache hierarchy/memory system. SIMD extensions, somebody will write a matrix-multiply kernel in assembly or intrinsics before optimizing compilers get really good at using that extension (if they ever get to it!). Cache hierarchy is more abstractable I guess, but there are lots of rectangles and other nice shapes that dedicated tuners can think about, i.e, (PDF warning)
I’m sure one could search across all the possible sizes, but it is easier to just have a human identify the right ones.
Maybe someone can train an ML model to identify good sizes. Machine learning uses lots of linear algebra, so I guess we’ll have a nice feedback loop, maybe that will kick off the singularity, haha.
Also worth observing BLAS is an even weirder case as odds are the CPU architecture designed to ensure they could hit a high % of peak FMA throughput for that kernel, so there's a lot of SW/HW co-design going on there too.
FFTW [1] does something like this, not just for parallelism but also for selecting optimal algorithms for a given data set and architecture. It uses "wisdom" to store known-good combinations, falling back on quick heuristics and brief benchmarking when needed. If you're going to do a lot of something you can have it update the wisdom from full benchmarks.
As a devops type person, I've been advocating for basic performance testing to be part of microservice unit tests. It'd be really wonderful to get rough numbers for peak and idle cpu/memory usage along with other test output.
But now that I think about it, I could write a small library to automate most of it while delegating some tasks to the application (providing functionality to be tested, parameters and possible ranges to be searched, a way to save/restore parameters to/from the database, provide fitness function, etc.) It is essentially a small benchmarking framework with some added functionality. I think it could be quite useful to some people.
Use hwloc[1] to determine the hardware topology of the system, and pin processes/threads appropriately. For a heterogeneous system, you presumably want to avoid low-performance cores. OpenMP has a standard way of assigning threads to cores. Otherwise use hwloc, or perhaps likwid[2] explicitly.
Most things turn out to be memory-limited, and getting the algorithm right may win more than simple parallelism; GEMM is a case in point. As ever, profile first, and worry about thread numbers after you've pinned appropriately and looked for simple things like initialization interacting with memory policy.
The main HPC performance tools support Python and native code; Threadspotter was good for threading/cache analysis, but seems moribund. Note that SMT on POWER, for instance, is different from x86.
(This is general HPC performance engineering; I haven't considered the code in point.)
> But there is something unexpected too: the optimal number of threads is different for each function.
Nothing unexpected there. Amdahl's Law in its glory.
A fast running function finishes fast, and coordinating the job's execution over many cores requires doing work every time something finishes. If you split your job into chunks in the microsecond time range, then you'll be handling lots and lots of tiny minutia.
You want to set up your tasks such that each thread has a good amount of stuff to chew through before it needs to communicate with anything.
That's not it. I updated the article with an experiment of processing 5 items at a time. The fast function doing 5 images at a time is slower than the slow function doing 1 image at a time (24*5 > 90).
If your theory was correct, we would expect the optimal number of threads for the fast function processing 5 images at a time to be similar to that of the slow function processing 1 image at a time.
In fact, the optimal threads in this case (5 images at a time) was 20 for slow function, 10 for fast function, so essentially the same as the original setup.
Based on the graph the fast function runtime is really short. You might be just seeing effects of efficiency vs performance cores. Lower thread count makes most of the stuff run on performance cores and task end times align more nicely. With larger number of threads tasks running on performance cores complete first and you are left waiting for efficiency tasks running on efficiency cores to complete, or context switches have to be done and tasks are moved between cores which causes overhead.
You could try what happens if you have 10 times more images when running fast function.
Also you have just 8 physical performance cores and 4 physical efficiency cores. Performance cores have hyper threading so they act as 2 logical cores but that doesn't mean that they can actually execute 2 threads at maximum performance. If processing tasks use the same parts of the processor core, then processor cannot run both threads at the same time and IPC will suffer. Slow task maybe uses more varied parts of the core which allows better IPC with hyper threading. So that also may reduce optimal thread count.
Is it a caching thing? The slow version seems less cache efficient, so if it is waiting due to cache misses, that could create an opportunity for something else to get scheduled in.
I doubt it the slow version uses division instead of bit shifting. My guess would be the fast version saturated like i/o or some non cpu portion of the processor and the division one was bottle necked by the division logic in the processor.
Recalibrate how you feel about division and multiplication. It turns out, integer division on new processors is a 1 cycle process (and has been for a while now). Most of the multicycle instructions now-a-days are things like SIMD and encryption.
Which processor has 1-cycle latency for integer division? Even Apple Silicon, which has the mostly highly optimized implementation I am aware of appears to have 2-cycle latency. Recent x86 are much worse, though greatly improved. Integer division is much faster than it used to be but not single cycle.
Also, most of those SIMD and encryption instructions can retire one per cycle on modern cores, but that isn't the same as latency.
2-cycle latency for division seems extremely unlikely. From the instruction tables it seems that firestorm has 7-9 cycles latency SDIV (which is excellent). It also has an impressive 2-cycles reciprocal throughput.
Quite possible. I am not confident Apple Silicon has 2-cycle div latency, that seems improbably fast to me, but I had heard some reasonably well-sourced rumors to that effect. I have not measured it myself.
Even at somewhat higher latency it is still fast enough to not be worth optimizing around in most cases, which is great.
But it's iterating through the result vectors twice, so that's basically guaranteed to miss. Moving the threshold check into the loop above would at least eliminate that factor.
Maybe division vs bit shifting does play a factor, but it's hard to compare that while the cache behavior is so different.
>Nothing unexpected there. Amdahl's Law in its glory.
That's not Amdahl's Law. Amdahl's Law describes the expected total speedup from a speedup to a segment of the program. It is usually used in the context of adding parallelism but is more general than that. It does not assume parallelism has any overhead.
Amdahl's Law predicts performance will approach a limit with added cores, but still predicts that each core will improve performance.
>A fast running function finishes fast, and coordinating the job's execution over many cores requires doing work every time something finishes. If you split your job into chunks in the microsecond time range, then you'll be handling lots and lots of tiny minutia.
Not necessarily. Modern systems have pools with independent per-core queues and clever ways of distributing work to them with minimal overhead. Work-stealing pools don't have any overhead from synchronization except when a core runs out of work, so each core just runs a `for` loop most of the time. I have seen a simple (pseudocode) `list.parallelMap(foo)` give a linear speedup on a modern JVM even on a large host with a `foo` that takes <100ns.
Trying to manually batch work risks one thread running faster than the others and then idling when it runs out of work. This is common on modern hardware with heterogeneous cores and dynamic frequency scaling.
If the tasks require their own coördination or the runtime isn't smart, then manual batching may be needed. I wouldn't assume Python does the right thing.
> That's not Amdahl's Law. Amdahl's Law describes the expected total speedup from a speedup to a segment of the program. It is usually used in the context of adding parallelism but is more general than that. It does not assume parallelism has any overhead.
Yeah, it's a fiction. An useful one, but still a sort of spherical cow in a vacuum scenario. It doesn't exactly describe real hardware.
> Amdahl's Law predicts performance will approach a limit with added cores, but still predicts that each core will improve performance.
Because it's a fiction. You can have slowdowns with bigger tasks on real hardware, like if you run out of L1 cache.
> Not necessarily. Modern systems have pools with independent per-core queues and clever ways of distributing work to them with minimal overhead.
True, but that's Java and this is Python. I may be wrong, but since the GIL removal is very new still, I don't expect Python to have nearly the same level of efficiency in this regard.
> Amdahl's Law predicts performance will approach a limit with added cores, but still predicts that each core will improve performance.
And I would agree: on my systems, efficiency cores are used to handle other tasks and keep the power core cold and ready to use when needed (dynamic frequency scaling)
> Trying to manually batch work risks one thread running faster than the others and then idling when it runs out of work. This is common on modern hardware with heterogeneous cores and dynamic frequency scaling.
Yes, I think the issue here is more that python can't introspect the cgroup artificial limitations placed by docker on the CPU it's using, causing the kink past 9.
If the goal is performance, I think it would be better to assign the cores manually + use cpu pinning + declare to the containers what resources it really has.
On a NOHZ kernel, you can do manual assignments like nohz_full=1-3,5-7 rcu_nocbs=0-3,5-7 irqaffinity=4
This puts the IRQ burden on one core but you can do 2 (with =4,5) etc
Starting from Python 3.13 there should be a new method os.process_cpu_count() that aims to get the actually available number of cores our process can run on.
I'm still waiting for multithreaded c++ compiler. The project I work on has some nasty templating, and there are a few compilation units that take a long long time.
Wouldn't refactoring to reduce the templating be helpful? Templates are an horror if it's out of control. I have a small project that uses templates and there is a bug with one of the templates in it that I haven't been able to resolve yet. I do look at it occasionally see if I get anywhere with it. Frankly I found Rust a lot easier to work with.
The problem of "how many CPUs should I use?" is really only answerable empirically in the general case.
The problem of "how many CPUs are available?" is a little more tractable. Currently when running podman, the cpus allocated seems to be available in the /sys fs. I wonder if it's the same under k8s?
Per cgroups v2 documentation [0], `cpu.max` exists for all but the root cgroup. If it doesn't exist, that means the kernel at least is putting no restrictions compared to what the affinity says (use `nproc` to call `sched_getaffinity` from the command line). I'm not sure what hypervisor-based throttling exposes.
***
That said, there are several rules of thumb to guess the optimal number of CPUs, even before you measure - and to figure out how to improve the situation (since profiling is even harder for threaded code):
* if you are contending for a shared resource, even relatively rarely [1] (that is, assuming you're already using per-thread resources when reasonable and only sharing big chunks), you're likely to hit a limit for the maximum number of CPUs regardless. Numbers I've seen are often 20 or lower. Using a "nested" approach for resource sharing can improve this, but this may require pinning threads (which in turn threatens starvation - REMEMBER, YOU ARE NOT THE ONLY PROGRAM IN THE WORLD!).
* if you are truly CPU-bound, microoptimizing to avoid (mostly: random-access) memory stalls (usually only possible for purely numerical code), logical CPUs are useless so you should limit yourself to the physical CPU count. Usually you should have an idea if this is the case.
* if your program is actively using an amount of memory similar to a shared cache size, using fewer CPUs is often better. This is not restricted to the exact counts of logical and physical CPUs, except for caches that are physical-CPU-specific (L1 almost always is, and L2 usually is these days for P-cores but not E-cores. L3 - and, for that matter, total RAM size to avoid swapping - is the one where the optimal number of CPUs varies arbitrarily in practice)
* if your code does even a small number of unpredicted loads (typical object-oriented code does far more than that!) logical CPUs are easily a win. Most code belongs here so you can assume even if you're not familiar with it, unless one of the hard-to-know points applies.
PythonSpeed is one of my favorite websites. However, this article leaves me with more questions than answers. (I do appreciate the benchmarking code.) For example, at one point, the author mentions that hyper-threading can be disabled in BIOS. Should I disable it? Based on the author's own description, it sounds like hyper-threading is pretty useful.
Like much of the advice "it depends" and "measure it".
Basically hyper-threading shares a whole load of resources between threads. If your code ends up contending on those resources you probably run slower. If your code is blocked due to lack of instruction-level parallelism, it probably runs faster. Generally an expert-optimized numerical code (e.g. machine learning) is going to fall into the first camp as it's pretty easy to kill the ILP bottlenecks in that sort of code.
When I wrote some parallel code in C, enabling hyper-threading resulted only 30% increase in speed, which not too little, but you would expect more for doubling the thread count. I don't remember what was the bottleneck, but it was not I/O.
On an other occasion a consultant for a HPC consultant advised us to disable some physical cores for the optimal performance of a geological simulator application, because the core/cache ratio is higher that way.
I understand this is a Python-centric source, but without having done my homework I'd have thought Python wouldn't be a particularly great language for dealing with these low level concerns. Wouldn't it be much easier in C? In java it's as simple as Runtime.getRuntime().availableProcessors()
> you can spend a little bit of time at the beginning to empirically measure the optimal number of threads, perhaps with some heuristics to compensate for noise.
I'd like to point out that there is a lot of stuff published on ArXiv; this appears to be a very active research subject. Don't start from scratch :)
Remember kids, classical x86 hyperthreading goes freeganning for additional under-scheduled execution units of each physical core to create another virtual core that's not going to be as fast as doubling the count of physical cores.
In my programs if a physical core performs 1000 operations/sec, with an added HT core it performs 1070 ops/ second, not 2000 as I naively expected. I understand you should specifically program so that the main core performs integer arithmetics, and the HT core - floating point arithmetics, to get closer to 2000.
One of the hard things about writing parallelized programs is that you may want different algorithms for different environments. The other day I joked about CUDA being harder than rocket science[0] and this is part of it.
The thing with high parallelism is that you can't just think about your program in terms of clock and total memory. You have to consider all parts of the system, and this is likely the large reason most programs don't have high parallelism despite computers being many cores for decades. OpenMP is going to have different settings than OpenMPI. How you share the memory is an essential part of the algorithm you're going to write.
There's also issues about real cores and hyperthreads/logical. I find that in most computation I never want to use the number of logical cores but only rely on physical cores. You can also get a double descent like behavior[1]. The author here seems to be running into an even newer problem that is the difference between performance cores and efficiency cores. I'm not actually sure that this is surprising and makes the headline feel clickbaity.
For more, this is actually part of why CUDA is making a big boom lately. One of my first forays into CUDA was trying to write some kernels to speed up GEANT4 simulations, which are not too dissimilar from path tracing but they are more computationally heavy. So extremely parallelization but the operations per cycle wasn't high enough back then so it was still better to use CPU (I'm sure there were other bottlenecks as well but did find someone else confirming my results). This goes for lots of scientific compute too, like a lot of mesh solvers and FEA (finite element analysis: how you simulate physical parts). But not IPC has gone way up (AND cores increase!! AND bandwith! AND memory!) so now we start seeing these things start to leverage graphics cards a lot more.
In HPC and it is worth noting that the big bottleneck actually isn't compute, it's throughput. More data can be created than can be processed. Many teams are switching to "in situ" visualization/analysis methods where you offload your data to another machine which does that processing. You also simulate at FAR higher resolution than you visualize or even analyze. So if you're interested in tackling problems that are becoming more and more important, this is one of them and involves both hardware and software.
[1] Say you have 16 physical cores with hyperthreading so 32 logical cores. Your program speeds up as number of processes -> 16. But when you go to 17 cores you have a big jump (loss in speedup) and follow a shallower curve as number of processes -> 32.
Edit: If you're not on an HPC machine, I actually might suggest setting your max parallelism to (number_of_physical_cores - 1) because speed difference often is trivial (lots of asterisks here) and you have a core available to kill the process. More noob you are, the stronger this recommendation.
Is there some way that I can GPL the output people get from use of my program? For example, if my program is used to develop hardware designs, can I require that these designs must be free?
In general this is legally impossible; copyright law does not give you any say in the use of the output people make from their data using your program. If the user uses your program to enter or convert her own data, the copyright on the output belongs to her, not you. More generally, when a program translates its input into some other form, the copyright status of the output inherits that of the input it was generated from.
So the only way you have a say in the use of the output is if substantial parts of the output are copied (more or less) from text in your program. For instance, part of the output of Bison (see above) would be covered by the GNU GPL, if we had not made an exception in this specific case.
You could artificially make a program copy certain text into its output even if there is no technical reason to do so. But if that copied text serves no practical purpose, the user could simply delete that text from the output and use only the rest. Then he would not have to obey the conditions on redistribution of the copied text.
In what cases is the output of a GPL program covered by the GPL too?
The output of a program is not, in general, covered by the copyright on the code of the program. So the license of the code of the program does not apply to the output, whether you pipe it into a file, make a screenshot, screencast, or video.
The exception would be when the program displays a full screen of text and/or art that comes from the program. Then the copyright on that text and/or art covers the output. Programs that output audio, such as video games, would also fit into this exception.
If the art/music is under the GPL, then the GPL applies when you copy it no matter how you copy it. However, fair use may still apply.
Keep in mind that some programs, particularly video games, can have artwork/audio that is licensed separately from the underlying GPLed game. In such cases, the license on the artwork/audio would dictate the terms under which video/streaming may occur.
If it were just a standalone tool it wouldn't affect the licencing of the project, but it's not a tool, it's a software library, so projects using the library need to follow the terms of the GPL.
The tricky part here is that it's a library that produces graphs which themselves aren't covered by the terms of the GPL. The intent of the tool seems to be that you use it to produce graphs during development for benchmarking or during research or whatever where the tool is internal but the graphs aren't necessarily internal, in which case the licence of the library is mostly irrelevant as the GPL is mostly concerned with distributing software to other users of said software, not the output of the software, but if you later choose to want to release that tool it really aught to follow the terms of the GPL and be under a licence compatible with the GPL, so the note about it being a GPL licenced library is completely valid. You should use a more permissive LGPL/MIT/BSD/whatever licenced library like benchit if you don't want to have to think about this sort of stuff.
> if you later choose to want to release that tool it really aught to follow the terms of the GPL and be under a licence compatible with the GPL
This may be in the spirit of Free-as-in-freedom software, but I don't think it's grounded in the reality of copyright and the GPL. With the usual disclaimer that IANAL: copyright attaches to the work (the code), and is retained by the author. The author grants you certain rights to use, modify, and redistribute the code subject to conditions (the GPL). Those constraints only encumber your works that are derivative of the GPL-licensed code i.e. either direct modifications of original source or linking original/derived object code.
A product that directly and necessarily imported perfplot would probably itself need to be GPL licensed to be distributed. A product developed with merely the assistance of something like perfplot would not need to be licensed in any particular way, any more than software written in Emacs needs to be GPL licensed.
As far as I understand, I don't think even importing is an issue, unless you are distributing the module too as a combined artifact (e.g. docker image). See e.g. https://github.com/pyFFTW/pyFFTW/issues/229
There's a lot of GPL FUD though from commercial interests though.
Wait! You can’t benchmark on a CPU with unpredictable speed and mix of slow and fast cores. All kinds of effects come into play here, and the code itself is not the most prominent among them.
To measure which _code_ is better, you should use real SMP machine with fixed clock speed and turned off HT. On the machine from TFA you are just fighting with Intel’s thermal smarts and OS scheduling shenanigans. (edit: you can use the same machine, but configure it in BIOS. I, myself, use i9-12900k fixed to 5ghz@8P-cores as "Intel testing machine")
Except then you're not testing on the hardware configurations people will actually run the software on.
People do run software with HyperThreading on and on E cores and with Turbo boost and throttling. Having code that behaves better in these dynamic, heterogeneous environments is a net benefit to the user.
It is easier to deal with all the issues separately. First your algo (in single thread), then threading, then memory accesses and cache effects. After you sort everything in your control, you try to deal with the real world (like counting only “real” cores, ignoring/or enjoying hyperthreads, OS scheduling, but most of these are quite unpredictable, and if you get your code run faster in 12700k, the same setting will be slower on, say, 5800X or on server machines, while basic stuff speeds up the program on any hardware).
Those issues are not separable. The performance of memory and caches can determine which algorithm is best. This is the basis of the entire field of cache-aware algorithms, which routinely beat the pants off of theoretically superior algorithms.
In my experience (which is in video transcoding and in research astrophysics - both domains where it matters!), if you really need to squeeze out performance, you have to design with the target platform available for profiling and benchmarking from the beginning.
Edit to add: I agree wholeheartedly with your top-level comment! I just am perhaps more extreme than you; I don’t think “laptop benchmarks” can ever be twisted into being useful.
You usually don't design an algorithm for a single CPU. Most software has to run on different tiers and generations from Intel, AMD, a whole herd of different ARM processors.
Even the very hottest code paths will mostly do CPU feature detection but not "if cpu == i7-12700K { ... }"
I get what you are saying, but in some domains, you really do design for a single SKU, beyond even just the CPU. Supercomputers and computing centers like SLAC have a very constrained set of SKUs that your software will ever run on.
I know this is not how most cloud or consumer software works, but that stuff usually isnt as performance-sensitive.
There's plenty of CPU-cycle-munching work in the pro segment that does care about performance and can't assume fixed hardware. CI, compression, CGI, ...
«Most software» - lots of software is run on cloud compute / backend servers and can be built or tuned specifically for the box/vm/container you’re running it on. I would expect AWS/Azure/Google to do that for their PaaS offerings, for cost savings.
youre missing the point. you need to remove as many variables as possible when benchmarking because benchmarks are always relative to each other.
in absolute terms testing is required on a standard config, but benchmarks should mostly be conducted with a minimum of external variants if you want to draw meaningful conclusions
I barely even try to run micro benchmarks on my work laptop, and I’m the guy who introduced them. It’s just so random.
Pretty much I’m looking for a slowdown that’s more than 25% and anything less could just be noise. When you’re reaching for 5% improvements, that means you have to let the CI system tell you, and production response times are the final arbiter of truth.
Software should be tested on a system like the one that will run it. All modern machines use dynamic frequency scaling. It should rarely be disabled in production. Many modern machines have heterogeneous cores. Your testing procedure might be reproducible but it's unrealistic.
lol that’s a Wordpress bug, I don’t remember how I hit it, but I’ve done it before too. Might have to do with not having the datetime set correctly on the server.
@itamarst Yes, this in interesting, you should profile it and get to the bottom of the issue! It seems like in my experience that being limited by hyperthreading or instruction-level parallism is relatively rare, and much more often it’s cache or memory access patterns or implicit synchronization or contention for a hardware resource. There’s a good chance you’ll learn something useful by figuring it out. Maybe it’s memory bus contention, maybe it’s cache, maybe numba compiled in something you aren’t expecting.
Worth nothing that using 20 on the fast test isn’t that much slower than using 8. A good first guess/proxy for number of threads to use is the number of cores, and that pays off in this case compared to using too few cores.
Out of curiosity, do you know if your images are stored row-major or column major? I see the outer loop over shape[0] and inner loop over shape[1]. Is the compiled code stepping in memory by 1 pixel at time, or by a whole column? If your stride is a column, you may be thrashing the cache.
I’d also be curious to hear how the speed of this compiled code compares to a numpy or PIL image threshold operation, if you happen to know.