Hacker News | pythomancer's comments

(author here) check_brackets_bt is actually exactly the algorithm that Raph mentions


Right. This is the binary tree version of the algorithm, and it's nice, concise, and very readable. What would take it to the next level for me is the version in the stack monoid paper, which chunks things up into workgroups. I haven't done benchmarks against the Pareas version (unfortunately it's not that easy), but I would expect the workgroup-optimized version to be quite a bit faster.
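For readers following along, here is a minimal Python sketch of the monoid idea behind this style of parallel bracket checking (my own illustration, not the check_brackets_bt code or the workgroup version from the stack monoid paper): each chunk reduces to a pair (unmatched closes, unmatched opens), and combining two chunks cancels the left chunk's opens against the right chunk's closes.

```python
from functools import reduce

def lift(c):
    # ')' contributes one unmatched close, '(' one unmatched open
    return (1, 0) if c == ')' else (0, 1)

def combine(x, y):
    # cancel opens carried from the left against closes on the right
    a1, b1 = x
    a2, b2 = y
    matched = min(b1, a2)
    return (a1 + (a2 - matched), b2 + (b1 - matched))

def check_brackets(s):
    # reduce is sequential here, but combine is associative, so the
    # same reduction can run as a parallel tree or a scan on a GPU
    return reduce(combine, map(lift, s), (0, 0)) == (0, 0)
```

The point is that combine is associative, which is exactly what lets a compiler (or a hand-written kernel) turn the sequential fold into a balanced tree or a workgroup-level scan.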


To be clear, you can express workgroup parallelism in Futhark, or rather, if the compiler sees that you've programmed your problem in such a way that it can take advantage of workgroup parallelism, it will.

But you're right, it would be interesting to see how the different approaches stack up against each other. The Pareas project linked above also includes an implementation using radix sort.


I've been playing with one using scans. Too bad that's not really on the map for architectural reasons; it would open up a lot of uses.


Yeah, monoid prefix sum is a surprisingly powerful tool for parallel and incremental algorithm design!
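As one illustration of that power (a hypothetical example of mine, not something from the thread): the linear recurrence x[k+1] = a[k]*x[k] + b[k] looks inherently sequential, but each step is the affine map x -> a*x + b, and affine maps compose associatively, so a prefix scan over (a, b) pairs parallelizes it.

```python
from itertools import accumulate

def compose(f, g):
    # "apply f, then g": g(f(x)) = (g_a * f_a) * x + (g_a * f_b + g_b)
    fa, fb = f
    ga, gb = g
    return (ga * fa, ga * fb + gb)

def recurrence(x0, steps):
    # accumulate is a sequential scan; because compose is associative,
    # the same scan can run as a work-efficient parallel prefix scan
    maps = accumulate(steps, compose)
    return [a * x0 + b for a, b in maps]
```

The same trick handles carry propagation, regex matching via transition matrices, and plenty of other "sequential-looking" problems.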


Thanks for clarifying! It would indeed be interesting to see a comparison between similar implementations in other languages, both in terms of readability and performance. I feel like the readability can hardly get much better than what you wrote, but I don't know!


In BLAS terminology this is usually called CGEMM (complex single precision) or ZGEMM (complex double precision). Both cuBLAS and rocBLAS support ZGEMM, and the latter is open source. rocBLAS is also pretty complicated, though, and perhaps not such a good learning resource. This is a more readable library that implements CGEMM, or at least a similar operation: https://git.astron.nl/RD/recruit/ccglib.
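For anyone unfamiliar with the naming, here is a minimal pure-Python sketch of what CGEMM/ZGEMM compute, C := alpha*A@B + beta*C with complex operands (just to pin down the semantics; the real libraries tile and vectorize this across the GPU):

```python
def gemm(alpha, A, B, beta, C):
    # naive complex GEMM: C := alpha * A @ B + beta * C
    m, k, n = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a = alpha * A[i][p]
            for j in range(n):
                # each complex multiply-add costs 4 real multiplies
                # and 4 real adds on typical hardware
                out[i][j] += a * B[p][j]
    return out
```

The C/Z prefixes just select the element type; the S/D variants are the same operation over real floats.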

The main issue is that double precision is not so interesting for AI and graphics, so die area is instead spent on those features rather than on double precision. Not so for HPC, though, and GPUs specialized for it usually have better throughput. For example, the AMD MI210 has the same performance in single- and double-precision (matrix) operations, while graphics GPUs either have something like a 1/2, 1/4, or 1/16 fp64:fp32 rate, or no fp64 support at all.

