Windows vs. Linux Scaling Performance 16 to 128 Threads with Threadripper 3990X (phoronix.com)
108 points by jjuhl on Feb 15, 2020 | 52 comments


The article makes no mention of whether the software suite was vetted for support of more than 64 threads on Win32. The API has a peculiar weakness: by default, thread scheduling is limited to a single processor group, and a group can contain no more than 64 hardware threads. To get above this limit, the application must explicitly adjust the processor affinity of its threads to include the additional hardware threads. MS was not in a hurry to adjust the C++ STL and their OpenMP runtime after the basic processor group API appeared in Windows 7, and I am not sure whether they have managed to do it by now. Some of the benchmark results look to me as if the missing scaling from 64 to 128 hardware threads on Windows might be caused by this.
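
To make the mechanics concrete, here is a minimal sketch (assuming the Windows 7+ processor group APIs) of querying the group topology, which is the first step an application has to take before it can schedule past 64 threads:

    #include <windows.h>
    #include <cstdio>

    int main() {
        // Each processor group holds at most 64 logical processors.
        WORD groups = GetActiveProcessorGroupCount();
        // Total logical processors across every group.
        DWORD total = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
        std::printf("%u group(s), %lu logical processors total\n",
                    groups, total);
        for (WORD g = 0; g < groups; ++g)
            std::printf("  group %u: %lu processors\n",
                        g, GetActiveProcessorCount(g));
    }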


It's not just the API; the scheduler in NT itself won't move threads from one processor group of up to 64 hardware threads (on a 64-bit system) to another, so it has to be managed manually by the application if you want to scale out farther than that on NT.

Given that it's a fundamental limitation of the NT scheduler (not present in Linux), it seems like the fair conclusion is "yeah, Windows makes this way harder, and a lot of applications won't scale the same way on Windows as their Linux versions will", rather than "oh, that just doesn't count because they aren't using it right".

EDIT: As an aside, this kind of thing is exactly why Linux doesn't provide binary compatibility at the driver level. It's easy to paint yourself into a corner by making decisions that were perfectly sane 20 years ago. Now NT has fundamental limitations, and they hit even harder in kernel space, where nearly every driver out there has macros compiled in that touch these structures. It's bad enough at the syscall layer, but it's even worse when you can't change things because compiled code is directly modifying your internal structures.

This is exactly why Linux won't provide a driver ABI, and why it's a good thing.


I forgot that the API doesn't allow thread affinity across processor group boundaries. It's been a couple of years since I last touched all of this. Revisiting it, it becomes clear that this limitation actually prevents transparent support for >64 hardware threads in the C++ STL or pthreads on Windows.
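
For illustration, a small check of the gap this leaves, comparing what the STL reports against the machine-wide count (a sketch; whether hardware_concurrency spans groups depends on the toolchain and Windows version):

    #include <windows.h>
    #include <thread>
    #include <cstdio>

    int main() {
        // What the C++ runtime reports; on affected toolchain/Windows
        // combinations this reflects only the current processor group.
        unsigned stl_view = std::thread::hardware_concurrency();
        // The machine-wide count across all processor groups.
        DWORD machine = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
        std::printf("STL sees %u hardware threads, machine has %lu\n",
                    stl_view, machine);
    }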


I don't think it's a fundamental ABI-limitation type of thing, as the issue does not exist in Windows Enterprise or Server. This was covered the last time this was posted: https://www.anandtech.com/show/15483/amd-threadripper-3990x-...


> the issue does not exist in Windows Enterprise or Server.

It does. There's a maximum of 64 hardware threads in a processor group. This is because the affinity masks in tons of internal data structures inside NT are pointer-width (so 64 bits on a 64-bit platform).

Server's and Enterprise's scheduler adjustments are just about making better decisions when balancing which processor group a new process is assigned to at creation time.

You can read more about processor groups, and the manual work user space needs to do to manage them on all flavors of Windows that support them, here: https://docs.microsoft.com/en-us/windows/win32/procthread/pr...


So it is, thanks for the link. Does anyone know how to access the page[1] referenced toward the end of that? It comes up as "Access Denied" with no hint as to what access is needed, yet it's referenced all over the place in these documentation pages.

[1] https://www.microsoft.com/whdc/system/Sysinternals/MoreThan6...


It's available on archive.org


link in my comment below.


There was another Phoronix article recently that mentioned this, so they must know this exists.

Either way, if it doesn't work in the benchmark, it doesn't really matter who's at fault (as long as the benchmark isn't completely synthetic)?


How is it a "benchmark" if it doesn't fully utilize the operating system's APIs to maximize performance?

I am not defending the utter shittiness of the Win32 processor group API design, which is the usual overly complicated Win32 API that only makes sense to an NT kernel developer. I can see why no application developer bothered to incorporate it in their software: it's basically asking you to do the thread scheduling yourself for >64 cores.
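
Roughly, the manual bookkeeping it demands looks like this (a sketch with no error handling, assuming Windows 7+; the group/processor walk is the part applications have to write themselves):

    #include <windows.h>
    #include <vector>

    DWORD WINAPI Work(LPVOID) { /* ... actual workload ... */ return 0; }

    int main() {
        std::vector<HANDLE> threads;
        for (WORD g = 0; g < GetActiveProcessorGroupCount(); ++g) {
            for (DWORD i = 0; i < GetActiveProcessorCount(g); ++i) {
                // Start suspended so the thread can be moved before it runs.
                HANDLE h = CreateThread(nullptr, 0, Work, nullptr,
                                        CREATE_SUSPENDED, nullptr);
                GROUP_AFFINITY ga = {};
                ga.Group = g;
                ga.Mask  = KAFFINITY(1) << i;  // pin to one logical processor
                SetThreadGroupAffinity(h, &ga, nullptr);
                ResumeThread(h);
                threads.push_back(h);
            }
        }
        for (HANDLE h : threads) {             // WaitForMultipleObjects
            WaitForSingleObject(h, INFINITE);  // itself caps at 64 handles
            CloseHandle(h);
        }
    }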

But I'm always annoyed to see people whinging about Windows performance when they're using "cross-platform" applications (that are using the lowest common denominator APIs) to supposedly measure perf.

For an application that knows the OS environment it's running in, the "scheduler" or the "kernel" is very unlikely to be an actual impediment most of the time. I use quotes because those words are usually used to mean "my app doesn't run as I expect and it can't be my fault".

This is true of all operating systems.

An old MSDN doc [1] actually says "The reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application." There is no newer doc that I could find, so perhaps the design hasn't changed in Windows 8 or 10.

"XYZ should be more than adequate for the typical application" is like a Microsoft basic design principle or something :-)

[1] https://docs.microsoft.com/en-us/previous-versions/windows/h...


> How is it a "benchmark" if it doesn't fully utilize the operating system's APIs to maximize performance?

Why should it do that? Isn't it better to reflect how applications will realistically use the APIs?


It depends on what you want to measure. Do you want to measure the maximum raw power the OS can possibly give you? Then sure, it's not a fair benchmark.

If you want to measure "real world" performance, it is a fair benchmark.


The 64-processors-per-processor-group limitation makes sense to me. A lot of >64 processor configs on the market are NUMA systems, where it's probably best to pin threads to the NUMA node where their data will be hosted.
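
For instance, a sketch of such node pinning on Windows (assuming the Windows 7+ NUMA APIs; error handling omitted):

    #include <windows.h>
    #include <cstdio>

    int main() {
        // How many NUMA nodes does the machine have?
        ULONG highest = 0;
        GetNumaHighestNodeNumber(&highest);
        std::printf("NUMA nodes: %lu\n", highest + 1);

        // Pin the current thread to node 0's processors so its cycles
        // stay close to the memory it allocates there.
        GROUP_AFFINITY node0 = {};
        if (GetNumaNodeProcessorMaskEx(0, &node0))
            SetThreadGroupAffinity(GetCurrentThread(), &node0, nullptr);
    }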

They also used a version of Windows aimed at desktop/graphical use rather than Server. I wouldn't be surprised if desktop Windows disliked apps using a lot of CPU (it makes the system feel unresponsive) while Server didn't mind.


This reveals one weakness of the Windows development model: if something isn't a feature driven by a PM, it won't happen. On the other hand, in Linux, if some obscure internal thing isn't optimal yet, you can bet some obsessed hacker is going to tackle it one day. How many schedulers has Linux had already?


I'm not sure about the Windows org, but Office always made time for dev-led architectural improvements. Back in the old days it was called Milestone Zero.

I’d be absolutely shocked if Windows didn’t have a similar process.


I don't necessarily consider that a weakness.


OK, it's a trade-off. The year of the Linux desktop is surely coming soon. I must say I did not expect the scaling performance difference to be so large, though.


Performance for more than 64 cores on the retail non-server OS seems like a feature not a lot of users are asking for...


Do the server variants do better? If yes, one would at least wonder why the variants you'd expect on high-end workstations didn't get the same options.


The charts in the article literally show the Enterprise version scaling well where the "Pro" version doesn't. I'd suggest that this is largely because 64 threads is the point at which you are expected to move from "Pro" to "Enterprise" Windows. Frankly, I don't think this is unreasonable: you can go to several computer vendors' websites, and what they consider high end can be as few as 6 cores.


It looks like this difference might just be a fluke, normal sample variation. Tom's Hardware could not replicate it, and AMD has weighed in affirming there's no difference with the Enterprise version when the same build numbers are used:

  https://www.tomshardware.com/news/amd-threadripper-3990x-performance-windows-10-enterprise


I don't know, they didn't benchmark it and I don't have the config to try it myself.

My gut feeling is that it's possible the thread scheduling was tweaked so that the UI is more responsive in the workstation variants, since there's generally a user sitting in front of one who might be doing something else while a video renders or code compiles. Server, on the other hand, cares less about that, and if a process really wants to max out the CPU (like a build or a rendering job) it should be allowed to, since there likely won't be any sudden UI users on the machine.


Three, four if you count Con Kolivas's.


In my experience, for a long time the Windows NT kernel and scheduler (from Win2K onwards) were actually better than Linux's in several ways. That always amazed me, because Linux was a better server OS in many other ways.

Now, at 64 cores and above, it is clear that the Linux developers have spent a lot of time making the Linux kernel better. May have something to do with the fact that a big proportion of servers in production with many CPUs/cores are Linux servers so they started investing in this quite early?


> May have something to do with the fact that a big proportion of servers in production with many CPUs/cores are Linux servers so they started investing in this quite early?

Not servers, but single system image supercomputers. Quoting Wikipedia (https://en.wikipedia.org/wiki/Altix):

"At product introduction, the system supported up to 64 processors running Linux as a single system image and shipped with a Linux distribution called SGI Advanced Linux Environment, which was compatible with Red Hat Advanced Server. By August 2003, many SGI Altix customers were running Linux on 128- and even 256-processor SGI Altix systems. SGI officially announced 256-processor support within a single system image of Linux on March 10, 2004 using a 2.4-based Linux kernel. [...]"

And as I recall, back then Linux suffered from an issue similar to what Windows is facing now: a single 64-bit integer was no longer enough for a processor bitmask once you have more than 64 processors. IIRC, a lot of code had to be refactored to allow for a bigger bitmask, and an abstraction layer was put in place. Nowadays the limit is 8192 processors according to arch/x86/Kconfig (see also: https://access.redhat.com/articles/rhel-limits).
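
That abstraction shows through in the user-space API as well: the Linux affinity mask is an arbitrarily wide bitmap rather than a single machine word. A sketch (the CPU number is illustrative and must exist on the machine; built with g++ on glibc, where _GNU_SOURCE is predefined):

    #include <sched.h>
    #include <cstdio>

    int main() {
        // The default cpu_set_t is 1024 bits wide -- no 64-CPU cliff.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(100, &set);                       // pin to CPU 100
        sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread

        // Beyond 1024 CPUs, allocate a dynamically sized mask instead.
        cpu_set_t *big = CPU_ALLOC(8192);
        size_t sz = CPU_ALLOC_SIZE(8192);
        CPU_ZERO_S(sz, big);
        CPU_SET_S(100, sz, big);
        sched_setaffinity(0, sz, big);
        CPU_FREE(big);
    }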


It wouldn’t surprise me if Microsoft spends more time and effort on tuning and providing OS services for the particular commercial applications that are typically used on high core-count Windows server boxes, several of the most prominent of which are also Microsoft products.

I suspect SQL Server has no trouble scaling to 64 cores and beyond.


That is true. I have seen some terrible queries written on enterprise hardware that returned at the drop of a hat without being cached ahead of time.


I suspect it's just simple math. In the same way that Windows deployments vastly outnumber mainframes, (large-scale server) Linux deployments vastly outnumber Windows. FAANG (and want-to-be-FAANG) companies have been hammering away at Linux for decades with workloads that have grown to scales Windows has never seen. For example, how many billion-user web services are running on Windows? What percentage of supercomputers today run Windows? Etc.


Well, the Windows kernel is known to be well engineered. The userland, including the user interface, is a different story: here backward compatibility and the huge feature set kick in, with all their disadvantages (and advantages).


I have often wondered if Microsoft will one day open-source the Windows kernel.


My guess is that Linux is still mainly used in server environments, where utilizing as many threads as possible is a necessity, so more contributions are going into this.


Honestly a little surprised it's as close as it is. I have consistently hated having to deploy anything that requires lots of cores on a windows machine.

I have been keeping an eye on DragonFlyBSD for years now; it does some very interesting things. So this:

> Coming up next I will be looking at the FreeBSD / DragonFlyBSD performance on the Threadripper 3990X

has me excited.


I'd like to see comparisons of compilation time. I wish there were a standard for benchmarking CPUs by compilation time. I know a compilation of the Firefox source code is used quite often, as is the Linux kernel; I just wish it were more prevalent in these reviews.


The Linux kernel has been run on "big iron" for a long time now; it would be surprising if it weren't better prepared for scaling to 128+ cores.

linux/Documentation/vm/numa.rst states the effort was started in 1999. Was Windows going anywhere near NUMA architectures back then?


Since Windows was mostly running on x86, and the memory controllers were in the northbridge back then, even multi-socket systems wouldn't have been affected by NUMA. Moving the controllers on-die only happened later.


There were NUMA x86 rigs long before the memory controller moved to the CPU. IBM xSeries and ServerWorks chipsets from around 2000 had NUMA topologies.


Windows Server 2003 had NUMA support. I am not sure that Server 2000 exposed any NUMA capability; there were a lot of things cut from that project because it was running late. My guess is NUMA was one of the things whose release got pushed to 2003.

They used to have a "Datacenter" SKU of the server where you'd find most of these kinds of features. This was only available with OEM hardware IIRC.


These are all embarrassingly parallel multiplication workloads. It would be nice, for a change, if someone would run something like MySQL or a gRPC server: you know, one where it actually makes a difference how threads get scheduled, when they go to sleep and wake up, when packets arrive, and so forth.


With no clear explanation for the wildly varying results between different benchmarks, I wonder if the analysis is flawed.

Were those programs built with the same toolchain? Could it be that some library the lagging ones use is causing the problem?


Looking at the results makes me wonder if MS is keeping separate branches of Win 10 internally, or if some CPU-hogging services are disabled in the Win 10 Enterprise version.


Windows 10 Pro has a crippled scheduler; Windows 10 Enterprise uses the same uncrippled scheduler as Windows Server. "CPU-hogging services" don't consume 32 full cores.


But this proves my theory that MS is internally keeping different repositories for Win 10. Also, we know that some tracking services are disabled in Win 10 Enterprise, which leads to the logical conclusion that the tracking services could potentially be limiting OS I/O ops.


Lol no, it's just licensing policies and nothing more.

By the way, going Enterprise -> Pro or Pro -> Enterprise doesn't need a reboot.


Hard to believe, when you've seen reboot prompts after plugging in USB drives.


Any insight into whether Pro for Workstations is better here?


So, to get max perf from the Windows kernel, the software should use the completion port API, and not regular threads/locks.

https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-co...
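
A minimal sketch of the worker loop that API implies (illustrative only; associating handles with the port and real error handling are omitted):

    #include <windows.h>

    int main() {
        // Concurrency 0 = let the kernel run one active worker per processor.
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
                                             nullptr, 0, 0);
        // File/socket handles get associated with the port elsewhere
        // via CreateIoCompletionPort(handle, iocp, key, 0); then:
        for (;;) {
            DWORD bytes = 0;
            ULONG_PTR key = 0;
            LPOVERLAPPED ov = nullptr;
            if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
                break;  // failed dequeue or failed I/O; real code distinguishes
            // ... dispatch the completed operation identified by (key, ov) ...
        }
        CloseHandle(iocp);
    }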

However, any software that does that will likely NOT be cross-platform.

In addition, if you want to benchmark the kernel, you should run against a RAM disk, not an SSD.


In the general case, maximizing performance on any platform requires platform-specific code.

There are some decent "cross-platform" platforms, such as Java or C#, which offer a better degree of performance compatibility. But if you're working at the system level (i.e., pthreads/epoll on Linux, or Windows threads, critical sections, and completion ports), you need to use the OS-specific code to truly reach the best performance.

Java, especially with high-performance JVMs like Azul's, can be surprisingly efficient. But achieving the best performance on Azul's Zing runtime means using Azul-specific libraries! Once again, you're tying yourself down to a platform.

As it turns out, performance is the hardest thing to port. You can somewhat easily port functionality to any system and kludge things together (with effort, your C# code can be ported to .NET Mono and run on Linux). But actually getting performance guarantees from primitives almost always comes down to platform-specific testing.

Case in point: you may make certain assumptions about the Linux scheduler, only for it to change from O(n) to O(1) to Completely Fair; and today the system admin can tweak scheduler details to better suit the needs of a given application. These things have effects on performance that make it difficult to port between systems... or even between instances of the SAME system running slightly different configurations (e.g., huge pages misconfigured on one box).


Right. So in this case, what does the article compare?


Oh yeah, I'm agreeing with you for sure.

I don't know what to say about the article, aside from that Windows vs. Linux comparisons almost always have a degree of inaccuracy. In this case, Phoronix are clearly Linux experts and I fully trust their Linux data.

It's very difficult to find someone who knows how to optimally compare Windows and Linux, because most people only really learn one platform. I've taken it upon myself to become a "jack" of both platforms (with neither the expertise of a Windows expert nor that of a Linux expert), so I'm better positioned than most to see and understand cross-platform issues.

But very few people bother to learn both systems. (And frankly, most people don't have to learn the other system, so why bother learning? You really can make a solid career on one OS without ever thinking about the other one...)


They're using `Clear Linux 32280`, which is a distro produced by Intel.

Presumably it was built using the Intel compiler, which specifically penalizes AMD CPUs.

That would explain the advantage Windows has at low core counts.


Interestingly, the opposite is true: AMD performs surprisingly well on Intel's Clear Linux. https://www.forbes.com/sites/jasonevangelho/2020/02/12/surpr...


To the point where the benchmarks in the post get a bit misleading, as Clear Linux will outperform Ubuntu (or Fedora, or whatever a user is more likely to install) by a quite big margin.



