I use C++ Eigen extensively. I wonder if it is feasible to write a C++ program that hands off blocks of linear algebra code to Fortran subroutines. Eigen does a wonderful job of late binding and lazy evaluation, so that something like the following is fairly efficient:
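A plausible form of that expression (an assumption on my part, using the `diagonal()` approach mentioned downthread), together with a plain-C++ reference computation of the same quantity:

```cpp
#include <cassert>
#include <vector>

// Hypothetical Eigen one-liner (an assumption, not necessarily the original):
//   double r = (L.transpose() * P).diagonal().mean();  // lazily fused by Eigen
//   r *= r;
// Plain C++ reference computation for column-major n x n matrices L and P:
double squared_avg_coldots(const std::vector<double>& L,
                           const std::vector<double>& P, int n) {
    double sum = 0.0;
    for (int j = 0; j < n; ++j) {          // dot product of L(:,j) with P(:,j)
        double dot = 0.0;
        for (int i = 0; i < n; ++i)
            dot += L[j * n + i] * P[j * n + i];
        sum += dot;
    }
    double avg = sum / n;                  // average the per-column dots...
    return avg * avg;                      // ...then square
}
```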
which computes the squared average of column by column dot products between matrices L and P. I assume Fortran, since matrices are native types, does this kind of thing even better.
Aside: Fortran 77 was the first programming language I learned in 1985 and I hated programming for a while after that. We called it the "F" language. Now that I am a bit older and do a lot of linear algebra coding... I wonder if I would like it now.
I think Eigen's template strategy is probably hard to beat from the perspective of combining operations.
How well Fortran does is probably mostly up to the compiler implementation. In some of my benchmarks, gfortran sometimes seems to fuse the operations and end up much slower than if it had performed them separately in succession, but I wouldn't be surprised if compiling with `-fexternal-blas` (and linking MKL) would've solved that.
I have `A * B`, `A * B'`, `A' * B` and `A' * B'` small-single-threaded-matmul benchmarks here: https://chriselrod.github.io/LoopVectorization.jl/latest/exa...
I compared triple nested loops with Clang, icc, ifort, gfortran, Julia, and LoopVectorization.jl with matmul routines from ifort, gfortran, OpenBLAS, MKL, and Eigen.
While gfortran's builtin hit over 40 GFLOPS with `A * B` and `A' * B'`, it failed to reach half that if only one argument was transposed. I suspected awkward fusing at the start of this post because, if it had performed the operations one after the other, it should still have hit >40 GFLOPS when only one matrix was transposed.
These benchmarks are very interesting! From my experience gfortran’s MATMUL is very good for small matrices, but OpenBLAS gets better from ~10x10 elements. Not quite sure that’s the same as your benchmark; its colour code is a bit confusing. Would you mind making the data available?
Certainly, gfortran’s default implementation works well as a quick and easy solution for small vectors and matrices outside of hot paths. Also, I remember discussions about improving matmul(transpose(A),B) somewhat recently. I don’t remember which version it was for, but it’s the sort of thing that gets improved regularly.
I would love an option to align arrays in gfortran. It’s very important to take advantage of automatic vectorisation but I haven’t found a reliable way to do it.
My color coding is bad (and there's an open issue about it), but I haven't come up with a great solution. One idea is to make everything relative to LoopVectorization, but that wouldn't help those interested in other comparisons, like Eigen vs gfortran.
I'm on gfortran 10.1.1, so I'd only be missing out on very recent improvements (i.e., to trunk).
Note that these benchmarks were dynamically sized. If you write a Fortran program like
real(kind=8), dimension(10,10) :: A, B, C
! fill A and B somehow
C = matmul(A, B)
the compiler will take advantage of the known dimensions and inline the call, making it much faster.
> I would love an option to align arrays in gfortran. It’s very important to take advantage of automatic vectorisation but I haven’t found a reliable way to do it.
I wish more compilers could use masks to vectorize without padding. When multiplying 7x7 matrices with AVX-512, the obvious solution is to just mask the loads/stores of each column, with no need for padding.
Of course, padding would also ensure all your loads/stores are aligned, which can be nice.
I think what the plots need is an option to show or hide each curve. It’s a rather large increase in complexity compared to just generating images, though.
Thanks for the benchmarks. Do you see some intrinsic reason why a Fortran compiler couldn't do these optimizations? I would think it would know all the information so it would be the ideal place to optimize it.
There is no reason it could not. Those optimizations just have to be implemented.
Flang (the one merged into LLVM) is using MLIR, which has all the required code-gen abilities. That just leaves the cost modeling / deciding which optimizations/transformations to apply.
For BLAS in particular, this paper can give you an idea of some of MLIR's capabilities: https://arxiv.org/pdf/2003.00532.pdf
(But maybe you already know them better than I do.)
LoopVectorization can't do many of these yet, so its performance will fall off a cliff shortly after the largest size on the plots (and at much smaller sizes for CPUs with a smaller L2 cache). I had to add code to perform packing/tiling in my actual matmul code on top of what it did.
So that MLIR can generate that sort of code already looks promising.
Still, the work of telling it what to do isn't easy.
I'm not involved in any of those projects, so everything I say here is pure speculation. But I imagine pragmas and the like would be important whenever the compiler doesn't know the sizes at compile time.
Otherwise, you probably don't want it to generate massive amounts of code through multiple extra blocking loops, massive unrolling in a main kernel, and multiple clean up kernels for every random loop nest.
If you have a look at the different Fortran implementations of BLAS gemm (matrix multiplication), you'll see that the transposed-matrix cases are treated specifically. In fact, IIRC, the gemm routine takes a flag for each matrix indicating whether it is transposed.
That's what I did when calling OpenBLAS and MKL, but I confess I don't know the internal details of a non-inlined `matmul` call in gfortran when you don't use `-fexternal-blas`.
Just writing three loops and letting the compiler optimize it was much faster for `A * B'`, so it must be a pretty naive implementation getting called.
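For the curious, the "just write three loops" version for `A * B'` can be sketched like this (my own illustration, with the loop order chosen so the innermost loop runs down contiguous columns of the column-major matrices):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Triple-loop C = A * B^T for column-major n x n matrices. With the k loop
// outside the i loop, the innermost loop strides contiguously through
// columns of A and C, which the compiler can vectorize.
void a_bt(const std::vector<double>& A, const std::vector<double>& B,
          std::vector<double>& C, int n) {
    std::fill(C.begin(), C.end(), 0.0);
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            const double bjk = B[k * n + j];          // B(j,k) = (B^T)(k,j)
            for (int i = 0; i < n; ++i)
                C[j * n + i] += A[k * n + i] * bjk;   // C(:,j) += A(:,k) * B(j,k)
        }
}
```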
You can define EIGEN_USE_BLAS and then Eigen will delegate matrix operations to BLAS routines (such as gemm) instead of its own engine [0]. Then, you could link your program to a BLAS implementation written in Fortran.
However, when I last researched this for my previous work, OpenBLAS was the fastest open source implementation and is written in a mix of C and assembly, not Fortran. The repository contains quite a lot of Fortran, but it is mostly for the LAPACK implementation, which is not part of BLAS.
In our testing we got a nice speedup of our signal processing pipeline when enabling EIGEN_USE_BLAS, but this was on ARM64, so your experience may differ.
You can do this with Julia. Julia has insanely good interop with essentially any language you can name, including C++ and Fortran. You will go through Julia, but hey, for linear algebra it's the shit.
If you use LAPACK there’s a chance that some Fortran code will be run at some point. For example, OpenBLAS does it behind the scenes if you give it a matrix to diagonalise.
I have been using Armadillo rather than Eigen for a couple of years now; it can be compiled against OpenBLAS and LAPACK, so matrix ops are handled by Fortran routines.
I've never used Eigen, could you explain specifically what lazy evaluation implies for this expression? Is it to do with the diagonal() call constraining the amount of work to be done?
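Roughly, the laziness comes from expression templates: operators return lightweight proxy objects instead of computing anything, so a whole chain evaluates in one fused loop, and a call like `diagonal()` can, in principle, restrict which entries ever get computed. A minimal sketch of the idea (invented names, nothing like Eigen's actual internals):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain value type: owns its data.
struct Vec {
    std::vector<double> d;
    double operator[](std::size_t i) const { return d[i]; }
};

// Proxy: represents l + r without computing it. Indexing does the work.
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

Add<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <class L, class R>
Add<Add<L, R>, Vec> operator+(const Add<L, R>& a, const Vec& b) { return {a, b}; }

// Force evaluation: a + b + c runs as ONE loop, no intermediate vectors.
template <class E>
Vec eval(const E& e, std::size_t n) {
    Vec out{std::vector<double>(n)};
    for (std::size_t i = 0; i < n; ++i) out.d[i] = e[i];
    return out;
}
```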
The last time optimisation came up, I tried the author's problem - multiplying two 4096x4096 matrices - in numpy, FORTRAN and Rust, on a standard laptop. Supposedly it took 9 hours in naive Python (I didn't try). Results were:
What surprised me is that the Python and Fortran solutions were written in minutes. The Rust solution took hours and multiple forum posts for help. There's no obvious way to do it, and every single way I tried had "gotchas" and utterly astonishing behaviour in it, particularly with simple, statically-allocated arrays.
Fortran could do with proper namespaces, and properly dropping some of its older conventions, though. Reading it feels like archaeology.
I used to essentially be a professional Fortran programmer. It's been 7-8 years since I've done that, so this was a nice trip down memory lane. Anyway.
I suspect that you could get the Fortran and numpy results to converge by turning on the compiler option to use the local BLAS library for MATMUL, which will normally use an OpenMP-threaded solution.
Upfront note: these days "Fortran" is preferred to "FORTRAN."
> Fortran could do with proper namespaces, and properly dropping some of its older conventions, though. Reading it feels like archaeology.
The newer Fortran standards support pretty decent namespaces (`use x, only : y`) and limited OO with classes and class methods. The problem, at least when I was doing Fortran work every day, was that the compilers did not fully support the newer standards. We found the Intel compiler to be the best of the bunch, but we also hit a non-trivial number of compiler bugs and standard features that weren't supported. The Intel Math Kernel Library offers pretty good linear algebra performance, too.
Re: older conventions, we had a lot of modern Fortran code that was easy to read and avoided all of the old Fortran 77 nonsense that was really confusing to read if you weren't from that era (e.g., the infamous arithmetic IF).
However, it was nice being able to still incorporate "just works" code from the 80s that did complicated calculational things that nobody really wanted to touch too much. You know, the kind of subroutine that starts with the comment, "This code _seems to_ ..." I suspect we're not the only organization that wanted to keep that code around but be able to add modern Fortran around and on top of it.
So to be clear, numpy gives you a highly optimised matrix multiplication algorithm, and your FORTRAN solution was just something you threw together yourself, right? Little surprise that numpy roundly outperforms it, especially if it's using a different algorithm, such as Strassen rather than naive matrix multiplication. [0]
> every single way I tried had "gotchas" and utterly astonishing behaviour in it
As someone who doesn't know much about Rust (or FORTRAN for that matter), what kinds of gotchas? I'd expect numeric code like this to be quite straightforward, as it presumably doesn't lean heavily on memory-management cleverness in the language, or anything like that.
Oh no, simpler than that even. The FORTRAN solution was also using somebody else's algorithm (matmul), which is intrinsic, and, I believe, not multithreaded. I'm not employed to develop matrix multiplication algorithms, but it will benefit me to know which tools work best.
Re Rust gotchas: the standard array container breaks with lengths > 32 - alright, it doesn't "break", but half of the traits, including really basic stuff - more basic than "add" - aren't implemented. So if you write something at small scale, then scale it up slightly - gotcha! And you can't easily reimplement them, either. Basically, the easiest way I've found to do it that way is to fork rustc.
That leaves using dynamic allocation, which I wanted to avoid, or external libraries (nalgebra), which aren't necessarily mature yet. And since type parameters are sometimes namespace separated, and sometimes not, and type specification is sometimes necessary, and sometimes not - examples work, but any tiny modification of them does not, because the example relied on type inference, and what looks like a type, and is declared as a type, is actually a generic.
It also, incidentally, compiled to ninety-three megabytes. To multiply two 4096x4096 f64's.
The main reason for numpy’s speed is that it uses highly optimized code written in another language (for matrix multiplications, quite possibly FORTRAN).
Yes, of course numpy's heavy-lifting code isn't written in Python.
My point was that in comparing numpy to yodelshady's FORTRAN code, we're probably just comparing two different FORTRAN implementations, one much better than the other (perhaps using a different algorithm, perhaps just better optimised).
It does seem surprising (and I was all ready to slate rust until I remembered to build for release). If you care that much - naive FORTRAN iteration on the same machine, same compiler, took 1200 s.
(I am, obviously, taking inspiration from "There's plenty of room at the top": https://news.ycombinator.com/item?id=23442123. Naive FORTRAN's slow even compared to C there, but it's not the same machine anymore so not apples to apples. Also, as a small point - I initialised all matrices with random numbers. Below half a second and that might start to be a significant time demand.)
I commented on this thread, because it definitely seems to be in the realm where the compiler matters. So there is opportunity.
LFortran author here. As kergonath said below, both the "new" Flang that you linked above and LFortran started at about the same time. I know the Flang developers well, and I am hoping we'll be able to collaborate more in the future. MLIR looks like one option: they are going to use it for Flang, and LFortran could have a backend that uses it as well.
LFortran's main advantage is that it is interactive, so the parser and semantics have to be a little different. We also designed it to be quite approachable for people writing tools on top of it.
Interesting to see an interactive Fortran. I started with Fortran in my first job: on my first day I was told there was a book in the company library, go and learn Fortran.
For the really big Fortran systems I worked on in the 1980s (Map Reduce), the edit/compile/link process could take several minutes for a single module. We had a build system written in JCL to help automate this.
Not exactly. Development of both started at about the same time, but it was a coincidence. Flang is a standard compiler, whilst LFortran is more geared towards interactive use à la Jupyter, or source-to-source compilation.
LFortran author here. There is a Python prototype version that you can see in the notebook, and it has very limited support for arrays. I stopped spending time on it once I got far enough to validate that the idea works, and I am now spending all my time on the production LFortran implementation in C++. I am very close to making it usable for some very simple things, and then we'll go from there.
Not sure about arrays (I am not involved in the project, I just thought it was cool). It’s still in a very preliminary stage, though, so there certainly are rough edges.
LFortran author here. We will get there. I spent a lot of time in the last year bootstrapping our efforts around https://fortran-lang.org/, which was very successful. I have shifted my focus on LFortran again now.
> Not much of a Fortran yet without arrays, complex numbers, or strings.
A fortran without strings would still be a fortran, for many people. In fact, it could arguably be "a better fortran". Without complex numbers or arrays, not at all.
Not a low-level one but the CLR natively supports multidimensional arrays although I don't think they are exposed to C#.
I've written some F# code and for some trivial cases I was benchmarking there was some interesting performance differences between a native multi-dimensional array and an array-of-arrays version.
It wasn't important enough to look into, so I don't know if the difference is due to quirks of the CLR's implementation or due to memory access patterns.
Is that more of a jagged / ragged array, or array of arrays? Sometimes these are called multidimensional arrays, but these are not the same thing. Dimensions and lengths of the dimensions should be a property of the multidimensional object.
in C. Is that jagged? I wouldn’t have thought so, since AFAIK you can store exactly 6 elements in that, no more and no fewer. Interestingly, you cannot take `&a[0][0]`, assign it to an `int **b` pointer, and then access it via `b[0][0]`, because the array is stored as a contiguous block of integers rather than as an array of pointers that point to arrays, as you might expect.
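A small check of that contiguity claim (the 2x3 shape here is just an assumed example, since the original declaration isn't shown):

```cpp
// int a[2][3] is one contiguous block of six ints laid out row after row,
// so the element immediately after a[0][2] in memory is a[1][0].
int element_after_first_row() {
    int a[2][3] = {{1, 2, 3}, {4, 5, 6}};
    int* flat = &a[0][0];   // pointer to the first int of the block
    // flat[3] walks past the end of row 0 straight into row 1.
    // By contrast, reinterpreting &a[0][0] as int** and doing b[0][0] would
    // dereference the stored integer 1 as a pointer: undefined behavior.
    return flat[3];
}
```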
Yes, that's a jagged/ragged/array-of-arrays layout, and most languages support them. There isn't a requirement for the arrays to be the same length. There are many arguments about the pros and cons of each, and about how different they are the closer you get to assembly. Golang had a proposal to include multidimensional arrays [1]; there were lots of arguments. I think the different perspectives depend on what you want to use it for and how much numerical processing you do.
According to this thread, it is not a true jagged array because each row has the same number of columns and each column the same number of rows. You can initialize it as a jagged array but it will still allocate it like a matrix.
How is that different from Fortran's multi-dimensional arrays? The interface can be almost identical, and the memory layout in C++ can be whatever you want it to be.
My first computing lecture, many decades ago, was really about how the lecturer hated Fortran and why FORTRAN was still not dead (a mainly negative view at the time, due to GOTO). And it is still not dead.
Lisps were at the genesis of the original concept, it was called Lisp Machine, and the experience can still be replicated when using commercial Common Lisps like Allegro.
Fortran has been continuously updated, supports OOP, modules, and even generics.
Sadly, AFAIK, BLAS hasn't been updated to use generics. It would be so nice to have support for integer vector/matrix operations in there (not that it would require generics, but generics could make integer support easier to implement, though I suspect that for efficiency reasons they might not be used in the end).
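For illustration, a hedged sketch of what a generics-based interface could give you; with C++ templates the same naive kernel (invented here, nothing like a real BLAS kernel) instantiates for floating-point and integer element types alike:

```cpp
#include <vector>

// Hypothetical generic matmul: C = A * B for column-major n x n matrices.
// One definition serves double, float, int, int64_t, ... which is the kind
// of integer support a generics-aware BLAS could expose.
template <typename T>
void gemm_naive(const std::vector<T>& A, const std::vector<T>& B,
                std::vector<T>& C, int n) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            T acc{};                                 // zero of the element type
            for (int k = 0; k < n; ++k)
                acc += A[k * n + i] * B[j * n + k];  // A(i,k) * B(k,j)
            C[j * n + i] = acc;
        }
}
```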