I use C++ Eigen extensively. I wonder if it is feasible to write a C++ program that hands off blocks of linear algebra code to Fortran subroutines. Eigen does a wonderful job of late binding and lazy evaluation, so that something like the following is fairly efficient:
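A plausible form of that expression (an assumption on my part, using the `diagonal()` approach mentioned downthread), together with a plain-C++ reference computation of the same quantity:

```cpp
#include <cassert>
#include <vector>

// Hypothetical Eigen one-liner (an assumption, not necessarily the original):
//   double r = (L.transpose() * P).diagonal().mean();  // lazily fused by Eigen
//   r *= r;
// Plain C++ reference computation for column-major n x n matrices L and P:
double squared_avg_coldots(const std::vector<double>& L,
                           const std::vector<double>& P, int n) {
    double sum = 0.0;
    for (int j = 0; j < n; ++j) {          // dot product of L(:,j) with P(:,j)
        double dot = 0.0;
        for (int i = 0; i < n; ++i)
            dot += L[j * n + i] * P[j * n + i];
        sum += dot;
    }
    double avg = sum / n;                  // average the per-column dots...
    return avg * avg;                      // ...then square
}
```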
which computes the squared average of column by column dot products between matrices L and P. I assume Fortran, since matrices are native types, does this kind of thing even better.
Aside: Fortran 77 was the first programming language I learned in 1985 and I hated programming for a while after that. We called it the "F" language. Now that I am a bit older and do a lot of linear algebra coding... I wonder if I would like it now.
I think Eigen's template strategy is probably hard to beat from the perspective of combining operations.
How well Fortran does is probably mostly up to the compiler implementation. In some of my benchmarks, gfortran sometimes seems to fuse the operations and end up much slower than if it had performed them separately in succession, but I wouldn't be surprised if compiling with `-fexternal-blas` (and linking MKL) would've solved that.
I have `A * B`, `A * B'`, `A' * B` and `A' * B'` small-single-threaded-matmul benchmarks here: https://chriselrod.github.io/LoopVectorization.jl/latest/exa...
I compared triple nested loops with Clang, icc, ifort, gfortran, Julia, and LoopVectorization.jl with matmul routines from ifort, gfortran, OpenBLAS, MKL, and Eigen.
While gfortran's builtin hit over 40 GFLOPS with `A * B` and `A' * B'`, it failed to reach half that if only one argument was transposed. I suspected awkward fusing at the start of this post because, if it had performed the operations one after the other, it should still have hit >40 GFLOPS when only one matrix was transposed.
These benchmarks are very interesting! From my experience gfortran’s MATMUL is very good for small matrices, but OpenBLAS gets better from ~10x10 elements. Not quite sure that’s the same as your benchmark; its colour code is a bit confusing. Would you mind making the data available?
Certainly, gfortran’s default implementation works well as a quick and easy solution for small vectors and matrices outside of hot paths. Also, I remember discussions about improving matmul(transpose(A),B) somewhat recently. I don’t remember which version it was for, but it’s the sort of thing that gets improved regularly.
I would love an option to align arrays in gfortran. It’s very important to take advantage of automatic vectorisation but I haven’t found a reliable way to do it.
My color coding is bad (and there's an open issue about it), but I haven't come up with a great solution. One idea is to make everything relative to LoopVectorization, but that wouldn't help those interested in other comparisons, like Eigen vs gfortran.
I'm on gfortran 10.1.1, so I'd only be missing out on very recent improvements (i.e., to trunk).
Note that these benchmarks were dynamically sized. If you write a Fortran program like
real(kind=8), dimension(10,10) :: A, B, C
! fill A and B somehow
C = matmul(A, B)
the compiler will take advantage of the known dimensions and inline the call, making it much faster.
> I would love an option to align arrays in gfortran. It’s very important to take advantage of automatic vectorisation but I haven’t found a reliable way to do it.
I wish more compilers could use masks to vectorize without padding. When multiplying 7x7 matrices with AVX-512, the obvious solution is to just mask the loads/stores of each column, with no need for padding.
Of course, padding would also ensure all your loads/stores are aligned, which can be nice.
I think what the plots need is an option to show or hide each curve. It’s a rather large increase in complexity compared to just generating images, though.
Thanks for the benchmarks. Do you see some intrinsic reason why a Fortran compiler couldn't do these optimizations? I would think it would know all the information so it would be the ideal place to optimize it.
There is no reason it could not. Those optimizations just have to be implemented.
Flang (the one merged into LLVM) is using MLIR, which has all the required code-gen abilities. That just leaves the cost modeling / deciding which optimizations/transformations to apply.
For BLAS in particular, this paper can give you an idea of some of MLIR's capabilities: https://arxiv.org/pdf/2003.00532.pdf
(But maybe you already know them better than I do.)
LoopVectorization can't do many of these yet, so its performance will fall off a cliff shortly after the largest size on the plots (and at much smaller sizes for CPUs with a smaller L2 cache). I had to add code to perform packing/tiling in my actual matmul code on top of what it did.
So that MLIR can generate that sort of code already looks promising.
Still, the work of telling it what to do isn't easy.
I'm not involved in any of those projects, so everything I say here is pure speculation. But I imagine pragmas and the like would be important whenever the compiler doesn't know the sizes at compile time.
Otherwise, you probably don't want it to generate massive amounts of code through multiple extra blocking loops, massive unrolling in a main kernel, and multiple clean up kernels for every random loop nest.
If you have a look at the different Fortran implementations of BLAS gemm (matrix multiplication), you'll see that the transposed-matrix cases are treated specifically. In fact, IIRC, the gemm routine takes a flag for each matrix indicating whether it is transposed.
That's what I did when calling OpenBLAS and MKL, but I confess I don't know the internal details of a non-inlined `matmul` call in gfortran when you don't use `-fexternal-blas`.
Just writing three loops and letting the compiler optimize it was much faster for `A * B'`, so it must be a pretty naive implementation getting called.
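For the curious, the "just write three loops" version for `A * B'` can be sketched like this (my own illustration, with the loop order chosen so the innermost loop runs down contiguous columns of the column-major matrices):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Triple-loop C = A * B^T for column-major n x n matrices. With the k loop
// outside the i loop, the innermost loop strides contiguously through
// columns of A and C, which the compiler can vectorize.
void a_bt(const std::vector<double>& A, const std::vector<double>& B,
          std::vector<double>& C, int n) {
    std::fill(C.begin(), C.end(), 0.0);
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            const double bjk = B[k * n + j];          // B(j,k) = (B^T)(k,j)
            for (int i = 0; i < n; ++i)
                C[j * n + i] += A[k * n + i] * bjk;   // C(:,j) += A(:,k) * B(j,k)
        }
}
```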
You can define EIGEN_USE_BLAS and then Eigen will delegate matrix operations to BLAS routines (such as gemm) instead of its own engine [0]. Then, you could link your program to a BLAS implementation written in Fortran.
However, when I last researched this for my previous work, OpenBLAS was the fastest open source implementation and is written in a mix of C and assembly, not Fortran. The repository contains quite a lot of Fortran, but it is mostly for the LAPACK implementation, which is not part of BLAS.
In our testing we got a nice speedup of our signal processing pipeline when enabling EIGEN_USE_BLAS, but this was on ARM64, so your experience may differ.
You can do this with Julia. Julia has insanely good interop with essentially any language you can name, including C++ and Fortran. You will go through Julia, but hey, for linear algebra it's the shit.
If you use LAPACK there’s a chance that some Fortran code will be run at some point. For example, OpenBLAS does it behind the scenes if you give it a matrix to diagonalise.
I have been using Armadillo rather than Eigen for a couple of years now; it can be compiled against OpenBLAS and LAPACK, so matrix ops are handled by Fortran routines.
I've never used Eigen, could you explain specifically what lazy evaluation implies for this expression? Is it to do with the diagonal() call constraining the amount of work to be done?
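Roughly, the laziness comes from expression templates: operators return lightweight proxy objects instead of computing anything, so a whole chain evaluates in one fused loop, and a call like `diagonal()` can, in principle, restrict which entries ever get computed. A minimal sketch of the idea (invented names, nothing like Eigen's actual internals):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain value type: owns its data.
struct Vec {
    std::vector<double> d;
    double operator[](std::size_t i) const { return d[i]; }
};

// Proxy: represents l + r without computing it. Indexing does the work.
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

Add<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <class L, class R>
Add<Add<L, R>, Vec> operator+(const Add<L, R>& a, const Vec& b) { return {a, b}; }

// Force evaluation: a + b + c runs as ONE loop, no intermediate vectors.
template <class E>
Vec eval(const E& e, std::size_t n) {
    Vec out{std::vector<double>(n)};
    for (std::size_t i = 0; i < n; ++i) out.d[i] = e[i];
    return out;
}
```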
The last time optimisation came up, I tried the author's problem - multiplying two 4096x4096 matrices - in numpy, FORTRAN and Rust, on a standard laptop. Supposedly it took 9 hours in naive Python (I didn't try). Results were:
What surprised me is that the Python and Fortran solutions were written in minutes. The Rust solution took hours and multiple forum posts for help. There's no obvious way to do it, and every single way I tried had "gotchas" and utterly astonishing behaviour in it, particularly with simple, statically-allocated arrays.
Fortran could do with proper namespaces, and properly dropping some of its older conventions, though. Reading it feels like archaeology.
I used to essentially be a professional Fortran programmer. It's been 7-8 years since I've done that, so this was a nice trip down memory lane. Anyway.
I suspect that you could get the Fortran and numpy results to converge by turning on the compiler option to use the local BLAS library for MATMUL, which will normally use an OpenMP-threaded solution.
Upfront note: these days "Fortran" is preferred to "FORTRAN."
> Fortran could do with proper namespaces, and properly dropping some of its older conventions, though. Reading it feels like archaeology.
The newer Fortran standards support pretty decent namespaces (`use x, only : y`) and limited OO with classes and class methods. The problem, at least when I was doing Fortran work every day, was that the compilers did not fully support the newer standards. We found the Intel compiler to be the best of the bunch, but we also hit a non-trivial number of compiler bugs and standard features that weren't supported. The Intel Math Kernel Library offers pretty good linear algebra performance, too.
Re: older conventions, we had a lot of modern Fortran code that was easy to read and avoided all of the old Fortran 77 nonsense that was really confusing to read if you weren't from that era (e.g., the infamous arithmetic IF).
However, it was nice being able to still incorporate "just works" code from the 80s that did complicated calculational things that nobody really wanted to touch too much. You know, the kind of subroutine that starts with the comment, "This code _seems to_ ..." I suspect we're not the only organization that wanted to keep that code around but be able to add modern Fortran around and on top of it.
So to be clear, numpy gives you a highly optimised matrix multiplication algorithm, and your FORTRAN solution was just something you threw together yourself, right? Little surprise that numpy roundly outperforms it, especially if it's using a different algorithm, such as Strassen rather than naive matrix multiplication. [0]
> every single way I tried had "gotchas" and utterly astonishing behaviour in it
As someone who doesn't know much about Rust (or FORTRAN for that matter), what kinds of gotchas? I'd expect numeric code like this to be quite straightforward, as it presumably doesn't lean heavily on memory-management cleverness in the language, or anything like that.
Oh no, simpler than that even. The FORTRAN solution was also using somebody else's algorithm (matmul), which is intrinsic, and, I believe, not multithreaded. I'm not employed to develop matrix multiplication algorithms, but it will benefit me to know which tools work best.
Re Rust gotchas: the standard array container breaks with lengths > 32 - alright, it doesn't "break", but half of the traits, including really basic stuff - more basic than "add" - aren't implemented. So if you write something at small scale, then scale it up slightly - gotcha! And you can't easily reimplement them, either. Basically, the easiest way I've found to do it that way is to fork rustc.
That leaves using dynamic allocation, which I wanted to avoid, or external libraries (nalgebra), which aren't necessarily mature yet. And since type parameters are sometimes namespace separated, and sometimes not, and type specification is sometimes necessary, and sometimes not - examples work, but any tiny modification of them does not, because the example relied on type inference, and what looks like a type, and is declared as a type, is actually a generic.
It also, incidentally, compiled to ninety-three megabytes. To multiply two 4096x4096 f64's.
The main reason for numpy’s speed is that it uses highly optimized code written in another language (for matrix multiplications, quite possibly FORTRAN).
Yes, of course numpy's heavy-lifting code isn't written in Python.
My point was that in comparing numpy to yodelshady's FORTRAN code, we're probably just comparing two different FORTRAN implementations, one much better than the other (perhaps using a different algorithm, perhaps just better optimised).
It does seem surprising (and I was all ready to slate rust until I remembered to build for release). If you care that much - naive FORTRAN iteration on the same machine, same compiler, took 1200 s.
(I am, obviously, taking inspiration from "There's plenty of room at the top": https://news.ycombinator.com/item?id=23442123. Naive FORTRAN's slow even compared to C there, but it's not the same machine anymore so not apples to apples. Also, as a small point - I initialised all matrices with random numbers. Below half a second and that might start to be a significant time demand.)
I commented on this thread, because it definitely seems to be in the realm where the compiler matters. So there is opportunity.
LFortran author here. As kergonath said below, both the "new" Flang that you linked above and LFortran started at about the same time. I know the Flang developers well, and I am hoping we'll be able to collaborate more in the future. MLIR looks like one option: they are going to use it for Flang, and LFortran could have a backend that uses it as well.
LFortran's main advantage is that it is interactive, so the parser and semantics have to be a little different. We also designed it to be quite approachable for people writing tools on top of it.
Interesting to see an interactive Fortran. I started with Fortran in my first job: on my first day I was told there was a book in the company library, go and learn Fortran.
For the really big Fortran systems I worked on in the 1980s (Map Reduce), the edit/compile/link process could take several minutes for a single module. We had a build system written in JCL to help automate this.
Not exactly. Development of both started at about the same time, but it was a coincidence. Flang is a standard compiler, whilst LFortran is more geared towards interactive use à la Jupyter, or source-to-source compilation.
LFortran author here. There is a Python prototype version that you can see in the notebook, and it has very limited support for arrays. I stopped spending time on it once I got far enough to validate that the idea works, and I am now spending all my time on the production LFortran implementation in C++. I am very close to making it usable for some very simple things, and then we'll go from there.
Not sure about arrays (I am not involved in the project, I just thought it was cool). It’s still in a very preliminary stage, though, so there certainly are rough edges.
LFortran author here. We will get there. I spent a lot of time in the last year bootstrapping our efforts around https://fortran-lang.org/, which was very successful. I have shifted my focus on LFortran again now.
> Not much of a Fortran yet without arrays, complex numbers, or strings.
A fortran without strings would still be a fortran, for many people. In fact, it could arguably be "a better fortran". Without complex numbers or arrays, not at all.
Not a low-level one but the CLR natively supports multidimensional arrays although I don't think they are exposed to C#.
I've written some F# code and for some trivial cases I was benchmarking there was some interesting performance differences between a native multi-dimensional array and an array-of-arrays version.
It wasn't important enough to look into, so I don't know if the difference is due to quirks of the CLR's implementation or due to memory access patterns.
Is that more of a jagged / ragged array, or array of arrays? Sometimes these are called multidimensional arrays, but these are not the same thing. Dimensions and lengths of the dimensions should be a property of the multidimensional object.
in C. Is that jagged? I wouldn’t have thought so, since AFAIK you can store exactly 6 elements in that, no more and no fewer. Interestingly, you cannot take `&a[0][0]`, assign it to an `int **b` pointer, and then access it via `b[0][0]`, because the array is stored as a contiguous block of integers rather than as an array of pointers that point to arrays, as you might expect.
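A small check of that contiguity claim (the 2x3 shape here is just an assumed example, since the original declaration isn't shown):

```cpp
// int a[2][3] is one contiguous block of six ints laid out row after row,
// so the element immediately after a[0][2] in memory is a[1][0].
int element_after_first_row() {
    int a[2][3] = {{1, 2, 3}, {4, 5, 6}};
    int* flat = &a[0][0];   // pointer to the first int of the block
    // flat[3] walks past the end of row 0 straight into row 1.
    // By contrast, reinterpreting &a[0][0] as int** and doing b[0][0] would
    // dereference the stored integer 1 as a pointer: undefined behavior.
    return flat[3];
}
```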
Yes, that's a jagged/ragged/array-of-arrays layout, and most languages support them. There isn't a requirement for the arrays to be the same length. There are many arguments about the pros and cons of each, and about how different they are the closer you get to assembly. Golang had a proposal to include multidimensional arrays [1]; there were lots of arguments. I think the different perspectives depend on what you want to use it for and how much numerical processing you do.
According to this thread, it is not a true jagged array because each row has the same number of columns and each column the same number of rows. You can initialize it as a jagged array but it will still allocate it like a matrix.
How is that different from Fortran's multi-dimensional arrays? The interface can be almost identical, and the memory layout in C++ can be whatever you want it to be.
My first computing lecture, many decades ago, was really about how the lecturer hated Fortran and why FORTRAN was still not dead (a mainly negative view at the time, due to GOTO). And it is still not dead.
Lisps were at the genesis of the original concept, it was called Lisp Machine, and the experience can still be replicated when using commercial Common Lisps like Allegro.
Fortran has been continuously updated, supports OOP, modules, and even generics.
Sadly, AFAIK, BLAS hasn't been updated to use generics. It would be so nice to have support for integer vector/matrix operations in there (not that it would require generics, but generics could make integer support easier to implement, though I suspect that for efficiency reasons they might not be used in the end).
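For illustration, a hedged sketch of what a generics-based interface could give you; with C++ templates the same naive kernel (invented here, nothing like a real BLAS kernel) instantiates for floating-point and integer element types alike:

```cpp
#include <vector>

// Hypothetical generic matmul: C = A * B for column-major n x n matrices.
// One definition serves double, float, int, int64_t, ... which is the kind
// of integer support a generics-aware BLAS could expose.
template <typename T>
void gemm_naive(const std::vector<T>& A, const std::vector<T>& B,
                std::vector<T>& C, int n) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            T acc{};                                 // zero of the element type
            for (int k = 0; k < n; ++k)
                acc += A[k * n + i] * B[j * n + k];  // A(i,k) * B(k,j)
            C[j * n + i] = acc;
        }
}
```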