There is no mention in the article of whether the software suite was vetted for support of more than 64 threads on Win32. The API has a peculiar weakness: by default, thread scheduling is limited to a single processor group, and a group can contain no more than 64 hardware threads. To get above this limit, the application must explicitly adjust the processor affinity of its threads to include the additional hardware threads. MS was not in a hurry to adjust the C++ STL and their OpenMP runtime after the basic processor group API appeared with Windows 7. I am not sure whether they have managed to do it by now. Some of the benchmark results look to me like the missing scaling from 64 to 128 hardware threads on Windows might be caused by this.
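For reference, the opt-in looks roughly like this (a minimal sketch, no error handling; the group index 1 is just an example, and these APIs need Windows 7 or later):

    // Minimal sketch: opting the calling thread into a second processor group.
    // KAFFINITY is ULONG_PTR, i.e. 64 bits on a 64-bit build, which is where
    // the 64-logical-processors-per-group ceiling comes from.
    #include <windows.h>
    #include <cstdio>

    int main() {
        WORD groups = GetActiveProcessorGroupCount();
        std::printf("processor groups: %u\n", groups);

        if (groups > 1) {
            // Ask for every logical processor in group 1 (arbitrary example).
            DWORD cpusInGroup = GetActiveProcessorCount(1);
            GROUP_AFFINITY ga = {};   // Reserved fields must stay zero
            ga.Group = 1;
            ga.Mask  = (cpusInGroup >= 64) ? ~KAFFINITY(0)
                                           : (KAFFINITY(1) << cpusInGroup) - 1;

            GROUP_AFFINITY previous = {};
            if (!SetThreadGroupAffinity(GetCurrentThread(), &ga, &previous))
                std::printf("SetThreadGroupAffinity failed: %lu\n", GetLastError());
        }
        return 0;
    }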
It's not just the API; it's the scheduler in NT itself that won't move threads from one processor group (of up to 64 hardware threads on a 64-bit system) to another, so it has to be managed manually by the application if you want to scale out farther than that on NT.
Given that it's a fundamental limitation of the NT scheduler (not present in Linux), it seems like it should be on the table as "yeah, Windows makes this way harder, and a lot of applications won't scale the same way on Windows as their Linux versions will", rather than "oh, that just doesn't count because they aren't using it right".
EDIT: As an aside, this kind of thing is exactly why Linux doesn't provide binary compatibility at the driver level. It's easy to paint yourself into a corner by making decisions that were perfectly sane 20 years ago. Now NT has fundamental limitations, hitting even harder in kernel space, where nearly every driver out there has macros compiled in that touch these structures. It's bad enough at the syscall layer, but it's even worse when you can't change things because third-party code is directly modifying your internal structures.
This is exactly why Linux won't provide a driver ABI, and why it's a good thing.
I forgot that the API doesn't allow thread affinity across processor group boundaries. It's been a couple of years since I last touched all of this. Revisiting it, it becomes clear that this limitation actually prevents transparent support for >64 hardware threads in the C++ STL or pthreads on Windows.
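A quick (and admittedly rough) way to check whether a given toolchain is group-aware is to compare what the standard library reports against the total count across all groups; in my experience, older MSVC runtimes only report the current group:

    // Rough check: does the runtime see all groups, or just the current one?
    #include <windows.h>
    #include <thread>
    #include <cstdio>

    int main() {
        unsigned stl = std::thread::hardware_concurrency();
        DWORD all    = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
        std::printf("std::thread::hardware_concurrency(): %u\n", stl);
        std::printf("all groups combined:                  %lu\n", all);
        return 0;
    }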
> the issue does not exist in Windows Enterprise or Server.
It does. There's a maximum of 64 hardware threads in a processor group. This is because the affinity masks in tons of internal data structures inside NT are pointer-width (so 64 bits on a 64-bit platform).
Server and Enterprise's scheduler adjustments are just about making better decisions when balancing which processor group a new process is assigned to at creation time.
So it is, thanks for the link. Does anyone know how to access this[1] page that is referenced towards the end of that? It comes up as "Access Denied" with no hint as to what access is needed, but it's referenced all over the place in these documentation pages.
How is it a "benchmark" if it doesn't fully utilize the operating system's APIs to maximize performance?
I am not defending the utter shittiness of the Win32 Processor group API design, which is the usual overly complicated Win32 API that only makes sense to an NT kernel developer. I can see why no application developer bothered to incorporate that in their software. It's basically asking you to do thread scheduling yourself for >64 cores.
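Just to illustrate what "doing the scheduling yourself" ends up looking like, here's a rough sketch (no error handling; the one-thread-per-logical-processor layout is only an example):

    // Sketch: spawn one worker per logical processor across every processor
    // group, assigning group affinity by hand because the OS won't spread
    // threads across groups for you.
    #include <windows.h>
    #include <algorithm>
    #include <vector>

    static DWORD WINAPI Worker(LPVOID) {
        // ... real work goes here ...
        return 0;
    }

    int main() {
        std::vector<HANDLE> workers;
        WORD groupCount = GetActiveProcessorGroupCount();

        for (WORD group = 0; group < groupCount; ++group) {
            DWORD cpus = GetActiveProcessorCount(group);
            for (DWORD cpu = 0; cpu < cpus; ++cpu) {
                // Create suspended so the affinity is in place before it runs.
                HANDLE h = CreateThread(nullptr, 0, Worker, nullptr,
                                        CREATE_SUSPENDED, nullptr);
                GROUP_AFFINITY ga = {};
                ga.Group = group;
                ga.Mask  = KAFFINITY(1) << cpu;   // pin to one logical processor
                SetThreadGroupAffinity(h, &ga, nullptr);
                ResumeThread(h);
                workers.push_back(h);
            }
        }

        // WaitForMultipleObjects has its own 64-handle limit, so wait in batches.
        for (size_t i = 0; i < workers.size(); i += MAXIMUM_WAIT_OBJECTS) {
            DWORD batch = (DWORD)std::min<size_t>(MAXIMUM_WAIT_OBJECTS,
                                                  workers.size() - i);
            WaitForMultipleObjects(batch, workers.data() + i, TRUE, INFINITE);
        }
        for (HANDLE h : workers) CloseHandle(h);
        return 0;
    }

Note that even WaitForMultipleObjects tops out at 64 handles, so the 64-everywhere assumption leaks into more than just the scheduler.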
But I'm always annoyed to see people whinging about Windows performance when they're using "cross-platform" applications (that are using the lowest common denominator APIs) to supposedly measure perf.
For an application that knows the OS environment it's running in, the "scheduler" or the "kernel" is very unlikely to be an actual impediment most of the time. I use quotes because those words are usually used to mean "my app doesn't run as I expect and it can't be my fault".
This is true of all operating systems.
An old MSDN doc [1] actually says "The reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application." There is no newer doc that I could find, so perhaps the design hasn't changed in Windows 8 or 10.
"XYZ should be more than adequate for the typical application" is like a Microsoft basic design principle or something :-)
The limit of 64 processors per processor group makes sense to me. A lot of >64-processor configs on the market are NUMA systems, where it's probably best to pin threads to the NUMA node where their data will be hosted.
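Roughly like this, assuming node 0 is where the data lives (a sketch; the node index and allocation size are arbitrary, and error handling is omitted):

    // Sketch: keep the calling thread and its memory on the same NUMA node.
    #include <windows.h>
    #include <cstdio>

    int main() {
        ULONG highestNode = 0;
        GetNumaHighestNodeNumber(&highestNode);

        GROUP_AFFINITY nodeAffinity = {};
        if (GetNumaNodeProcessorMaskEx(0 /* node */, &nodeAffinity)) {
            // Restrict the thread to the processors backing node 0 ...
            SetThreadGroupAffinity(GetCurrentThread(), &nodeAffinity, nullptr);

            // ... and prefer node 0 for its working set as well.
            void* buf = VirtualAllocExNuma(GetCurrentProcess(), nullptr, 1 << 20,
                                           MEM_RESERVE | MEM_COMMIT,
                                           PAGE_READWRITE, 0 /* node */);
            std::printf("nodes: %lu, node-local buffer at %p\n",
                        highestNode + 1, buf);
        }
        return 0;
    }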
They also used a version of Windows aimed at desktop/graphical use, not Server. I wouldn't be surprised if desktop Windows didn't like having apps use a lot of CPU (making the system feel unresponsive) while Server didn't mind.
This reveals one weakness of the Windows development model: if something isn't a feature driven by a PM behind it, it won't happen. On the Linux side, on the other hand, if some obscure internal thing isn't optimal yet, you can bet some obsessed hacker is going to tackle it one day. How many schedulers has Linux had already?
I'm not sure about the Windows org, but Office always made time for dev-led architectural improvements. Back in the old days it was called Milestone Zero.
I’d be absolutely shocked if Windows didn’t have a similar process.
Ok, it's a trade-off. The year of the Linux desktop is surely coming soon.
I must say I did not expect the scaling performance difference to be so large though.
Do the server variants do better? If yes, one would at least wonder why the variants you'd expect on high-end workstations didn't get the same options.
The charts in the article literally show the Enterprise version scales well where the "Pro" version doesn't. I'd suggest that this is largely because 64 threads is the point at which you should be moving from "Pro" to "Enterprise" Windows. Frankly, I don't think this is unreasonable; you can go to several computer vendors' websites and what they consider high-end can be as few as 6 cores.
It looks like this difference might just be a fluke, normal sample variation. Tom's Hardware could not replicate it, and AMD has weighed in affirming there's no difference with Enterprise versions if the same build numbers are used:
I don't know, they didn't benchmark it and I don't have the config to try it myself.
My gut feeling is that it's possible the thread scheduling was tweaked so that the UI is more responsive on the Workstation variant, since there's generally a user sitting in front of it who might be doing something else while some video renders or code compiles. Server, on the other hand, cares less about that, and if a process really wants to max out the CPU (like a build or a rendering job) it should be allowed to do so, since there likely won't be any sudden UI users on the machine.
From my experience, for a long time the Windows NT kernel and scheduler (from Win2K onwards) were actually better than Linux's in several ways. That always amazed me, because Linux was a better server OS in many other ways.
Now, at 64 cores and above, it is clear that the Linux developers have spent a lot of time making the Linux kernel better. May have something to do with the fact that a big proportion of servers in production with many CPUs/cores are Linux servers so they started investing in this quite early?
> May have something to do with the fact that a big proportion of servers in production with many CPUs/cores are Linux servers so they started investing in this quite early?
"At product introduction, the system supported up to 64 processors running Linux as a single system image and shipped with a Linux distribution called SGI Advanced Linux Environment, which was compatible with Red Hat Advanced Server. By August 2003, many SGI Altix customers were running Linux on 128- and even 256-processor SGI Altix systems. SGI officially announced 256-processor support within a single system image of Linux on March 10, 2004 using a 2.4-based Linux kernel. [...]"
And as I recall, back then Linux suffered an issue similar to what Windows is facing now: a single 64-bit integer was no longer enough for a processor bitmask when you have more than 64 processors. IIRC, a lot of code had to be refactored to allow for a bigger bitmask, and an abstraction layer was put in place. Nowadays, the limit is 8192 processors according to arch/x86/Kconfig (see also: https://access.redhat.com/articles/rhel-limits).
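For contrast, the userspace side on Linux ended up with dynamically sized CPU sets, so the same affinity call works no matter how far past 64 CPUs the box goes. A minimal sketch using glibc's CPU_ALLOC/CPU_*_S macros (the CPU count of 256 is arbitrary):

    // Sketch: Linux thread affinity with a dynamically sized CPU set, so the
    // mask is not tied to a single 64-bit word.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <stdio.h>

    int main() {
        int ncpus = 256;                        // arbitrary; same code for 64 or 8192
        cpu_set_t *set = CPU_ALLOC(ncpus);
        size_t setsize = CPU_ALLOC_SIZE(ncpus);

        CPU_ZERO_S(setsize, set);
        for (int cpu = 0; cpu < ncpus; ++cpu)
            CPU_SET_S(cpu, setsize, set);       // allow every CPU up to ncpus

        if (sched_setaffinity(0 /* calling thread */, setsize, set) != 0)
            perror("sched_setaffinity");

        CPU_FREE(set);
        return 0;
    }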
It wouldn’t surprise me if Microsoft spends more time and effort on tuning and providing OS services for the particular commercial applications that are typically used on high core-count Windows server boxes, several of the most prominent of which are also Microsoft products.
I suspect SQL Server has no trouble scaling to 64 cores and beyond.
I suspect it's just simple math. In the same way that Windows deployments vastly outnumber mainframes, large-scale server Linux deployments vastly outnumber Windows ones. FAANG (and want-to-be FAANG) companies have been hammering away at Linux for decades with workloads that have grown to scales Windows has never seen. For example, how many billion-user web services are running on Windows? What percentage of supercomputers today run Windows? etc.
Well, the Windows kernel is known to be well engineered. The userland, including the user interface, is a different story. Here the backward compatibility and the huge feature set kick in, including all their disadvantages (and advantages).
My guess is that Linux is still mainly used in server environments, so utilizing as many threads as possible is a necessity, so more contributions are going into this.
Honestly a little surprised it's as close as it is. I have consistently hated having to deploy anything that requires lots of cores on a Windows machine.
I have been keeping an eye on DragonflyBSD for years now; it does some very interesting things, so this:
> Coming up next I will be looking at the FreeBSD / DragonFlyBSD performance on the Threadripper 3990X
I'd like to see comparisons of compilation time. I wish there were a standard for benchmarking CPUs by compilation time. I know quite often a compilation of the Firefox source code is used, as well as the Linux kernel; I just wish it was more prevalent in these reviews.
Since Windows was mostly running on x86, and the memory controllers were in the northbridge back then, even multi-socket systems wouldn't have been affected by NUMA. Moving them on-die only happened later.
There were NUMA x86 rigs long before the memory controller moved to the CPU. IBM xSeries and ServerWorks chipsets from around 2000 had NUMA topologies.
Windows Server 2003 had NUMA support. I am not sure that Windows 2000 Server exposed any NUMA capability, but there were a lot of things cut from that project because it was running late. My guess is NUMA was one of the things that got pushed (in terms of release) to 2003.
They used to have a "Datacenter" SKU of the server where you'd find most of these kinds of features. This was only available with OEM hardware IIRC.
These are all embarrassingly parallel multiplication workloads. It would be nice for a change if anyone would run something like MySQL or a gRPC server, you know, one where it actually makes a difference how threads get scheduled as they go to sleep and wake up and as packets arrive, and so forth.
Looking at the results makes me wonder if MS is keeping separate branches of Win 10 internally, or if some CPU-hogging services are disabled on the Win 10 Enterprise version.
Windows 10 Pro crippled the scheduler. Windows 10 Enterprise uses the same uncrippled scheduler as Windows Server. "CPU-hogging services" don't consume 32 full cores.
But this proves my theory that MS is keeping different internal repositories for Win 10. Also, we know that some tracking services are disabled for Win 10 Enterprise, which leads to the logical conclusion that tracking services could potentially limit OS I/O ops.
In the general case, maximizing performance on any platform requires you to use platform-specific code.
There are some decent "cross-platform" platforms, such as Java or C#, which have a better degree of performance compatibility. But if you're working at the system level (e.g. pthreads / epoll on Linux, or Windows threads / critical sections / completion ports), you need to use OS-specific code to truly reach the best performance.
Java, especially with high-performance JVMs like Azul's, can be surprisingly efficient. But achieving the best performance on the Azul Zing runtime means using Azul-specific libraries! Once again, tying yourself down to a platform.
As it turns out, performance is the hardest thing to port. You can somewhat easily port functionality to any system and kludge things together (with effort, your C# code can port over to .Net Mono and run on Linux). But actually getting performance guarantees from primitives almost always requires platform-specific testing.
Case in point: you may make certain assumptions about the Linux scheduler, only for it to change from O(n) to O(1) to the Completely Fair Scheduler, and today the sysadmin can change scheduler details to better tune for the needs of your application. These things have an effect on performance that makes it difficult to port between systems... or even between the SAME system running slightly different configurations (e.g. huge pages misconfigured on one box).
I don't know what to say about the article, aside from the fact that Windows vs. Linux comparisons almost always have a degree of inaccuracy. In this case, Phoronix are clearly Linux experts and I fully trust their Linux data.
It's very difficult to find someone who knows how to optimally compare Windows vs. Linux, because most people only really learn one platform. I've taken it upon myself to become a "jack" of both platforms (having the expertise of neither a Windows expert nor a Linux expert), so I'm better positioned than most to see and understand cross-platform issues.
But very few people bother to learn both systems. (And frankly, most people don't have to learn the other system, so why bother learning? You really can make a solid career on one OS without ever thinking about the other one...)
To the point where the benchmarks in the post get a bit misleading, as Clear Linux will outperform Ubuntu (or Fedora or whatever a user is more likely to install) by a quite big margin.