I'm curious about past examples of failures, can you provide some? Thanks!

bayindirh · on Jan 22, 2021

The thing with Owncloud/Nextcloud is the complexity of the system. When many users are working with the files at the same time, there's a lot of messaging going on in the backbone. If people are working in shared folders, that traffic goes visibly up.

Messaging between systems is hard. When things are async, it's harder. When there's async concurrency, it becomes spicy.

I'm managing a Nextcloud instance at the office. Even with this relatively mature state, it can bork itself pretty decently. Also, it has separation of responsibilities, and a system admin is not an admin in the traditional sense, so a lot of things are hidden between logs and other layers.

Complicating this further by adding microservices is a brave idea, to say the least. We're going to see what happens.

I hope they succeed but it's not a piece of cake.

ThePhysicist · on Jan 22, 2021

There are many examples where complete rewrites have caused problems and killed or hampered companies; Netscape and Firefox come to mind.

Regarding microservices, I don't know of any open-source projects that have failed because of that (because there aren't many open-source products that use a microservices architecture), but personally I have witnessed many corporate software projects burn and fail when trying to rewrite an "ugly" but functional monolith as a collection of microservices.

The thing is that the energy which you invest into rewriting your entire software in a new stack could also be invested into improving your existing stack. LAMP is not dead and actually evolving quite well, so there are definitely ways to improve the stability and speed of a LAMP application without rewriting it in Golang (even Facebook still uses PHP, albeit with some modifications). Having written extensive applications in both scripting languages and Golang I can say that the latter has its advantages but is also slower to develop in and in many respects not as agile as a scripting language like PHP or Python.

Doing a rewrite of a popular project with thousands of deploys in the wild will also force you to split your attention between keeping the old system running and building the new one. Since most of your users will want to migrate from one system to the other without losing their existing data and configuration you'll have to ensure that there is a clear migration path that works, which can be quite hard. Also, all external systems that are interacting with existing installations (e.g. via APIs) will also need to be supported by the new system.

In the end it might be possible to make it, but in my humble experience I think a complete rewrite is almost never a good idea and the energy you put into it is often better invested in making gradual improvements to your existing codebase. But that's just my 2c of course.

rimliu · on Jan 22, 2021

I'd say in your examples there were problems at companies which killed or hampered them and also caused rewrites.

ThePhysicist · on Jan 22, 2021

Probably! From my experience I think the main problem is that developers often don't like to work with "legacy" code, where legacy means anything they didn't write themselves. Also, most developers tend to drastically underestimate the cost of a full rewrite as they don't know all the intricacies of the old system. From the outside most systems seem easy and straightforward, so people think they should be easy to rewrite. During the rewrite people then discover all the little edge cases that the legacy system handled and that they didn't think about.

There are legitimate reasons to rewrite a system in a new language, but it's not a decision that should be taken lightly. From a business perspective it's also very expensive, as having a team of 5-10 developers rewrite a system that already exists and works can cost millions of dollars, so you should have a really good reason and a clear ROI objective when deciding to do this.

butonic · on Jan 26, 2021

Hm, let me give a short history of ownCloud, reva and ocis.

Many years ago CERN chose ownCloud over other solutions because of the state-based sync. They could use that to let researchers sync petabytes of data residing on EOS, their custom-built storage solution. That ownCloud supports custom storage implementations made this a lot easier.

They did suffer some database bottlenecks and decided to extend EOS with features that the sync clients need: tree modification time propagation (so that the etag of the root changes when anything in the tree changes) and tree size accounting (so you can see how many bytes are hidden in a folder including all children).
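To make those two features concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the `Node` class, `add_child`, `write_file` and the way the etag is derived are all hypothetical and are not taken from EOS or ownCloud. The idea it shows is that every write walks up to the root, so each folder always knows the total size and newest modification time of its subtree.

```python
import hashlib
from typing import Dict, Optional


class Node:
    """Hypothetical tree node; not the EOS or ownCloud data model."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "Node"] = {}
        self.own_size = 0    # bytes stored in this node itself (0 for folders)
        self.tree_size = 0   # bytes in this node plus all descendants
        self.tree_mtime = 0  # newest modification time anywhere in this subtree

    @property
    def etag(self) -> str:
        # Assumption: the etag is a hash of the propagated subtree state.
        raw = f"{self.name}:{self.tree_mtime}:{self.tree_size}"
        return hashlib.sha1(raw.encode()).hexdigest()


def add_child(parent: Node, name: str) -> Node:
    """Attach a new, empty child node to parent."""
    child = Node(name, parent)
    parent.children[name] = child
    return child


def write_file(node: Node, new_size: int, mtime: int) -> None:
    """Record a write, then propagate the size delta and mtime up to the root."""
    delta = new_size - node.own_size
    node.own_size = new_size
    current: Optional[Node] = node
    while current is not None:
        current.tree_size += delta
        current.tree_mtime = max(current.tree_mtime, mtime)
        current = current.parent


root = Node("root")
docs = add_child(root, "docs")
report = add_child(docs, "report.txt")

etag_before = root.etag
write_file(report, new_size=1024, mtime=100)
etag_after = root.etag

print(etag_before != etag_after)  # True: the root etag changed
print(root.tree_size)             # 1024
```

With this in place a sync client only has to compare the root etag to know whether anything below changed, and can descend only into folders whose etag differs. A real implementation would also have to handle deletes, moves and concurrent writers, which this sketch ignores.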

They basically maintained a fork that was half PHP, half C++. Together, we added APIs and interfaces to the codebase to make the file cache implementation exchangeable.

They went ahead and implemented a golang service that could serve the api requests, while the web ui was rendered in PHP. It should be possible to dig into the details by looking at the public CernBox repo. Code archeology ;)

Anyway, all that was before the Nextcloud fork.

To be honest, CERN has tried to convince ownCloud to switch to a different architecture for years. With some long held opinions leaving the company we were free to reassess our options.

ownCloud has long had problems with long-running operations being killed by e.g. php-fpm timeouts. So I was evaluating and comparing different PHP frameworks like reactphp, swoole and amphp. I wanted to be able to offload workflows that are triggered after an upload has finished to a proper background job. And I really wanted to stay in PHP land because of all the already written code and existing apps.

But I noticed that they all had one thing in common. They all reimplemented a redis and a mysql library, which made me wonder why. The existing drivers would block on network IO, killing any concurrency gains you could get by using a reactive framework or the Go-like coroutines of swoole.
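The effect is easy to reproduce in any event-loop runtime. The sketch below is an analogy in Python's asyncio, not PHP, and `blocking_query` and `async_query` are hypothetical stand-ins for a database driver call, with sleeps in place of real network IO. It shows that a single blocking call inside a coroutine serializes everything, while a driver built on the loop lets requests overlap.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def blocking_query(name: str, log: List[str]) -> None:
    """Stand-in for an existing driver that blocks on network IO."""
    log.append(f"{name} start")
    time.sleep(0.05)  # blocks the whole event loop; no other task can run
    log.append(f"{name} end")


async def async_query(name: str, log: List[str]) -> None:
    """Stand-in for a driver reimplemented on top of the event loop."""
    log.append(f"{name} start")
    await asyncio.sleep(0.05)  # yields to the loop while "waiting on the network"
    log.append(f"{name} end")


async def run_pair(query: Callable[[str, List[str]], Awaitable[None]]) -> List[str]:
    """Run two queries concurrently and return the order of events."""
    log: List[str] = []
    await asyncio.gather(query("A", log), query("B", log))
    return log


blocking_log = asyncio.run(run_pair(blocking_query))
async_log = asyncio.run(run_pair(async_query))

print(blocking_log)   # A finishes before B even starts
print(async_log[:2])  # A and B both start before either finishes
```

In the blocking variant the second request cannot even start until the first one has finished, which is presumably why those frameworks ship their own redis and mysql clients instead of reusing the existing ones.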

It finally dawned on me that PHP may not be the best language to implement a service that has to deal with file IO. A systems language is more fitting.

Go or Rust? Two years ago that was way easier to answer. Furthermore, CERN had a working server-side implementation of the ownCloud webdav and ocs endpoints written in golang.

We sat down and discussed what a file sync and share solution would have to look like at the protocol level. What services are necessary, and how could we make the existing code more modular to support storage backends other than EOS? How can we get rid of the centralised database?

The result was changes to the CS3 api, making reva a reference implementation, using ocis to tie it together with user management and OpenID Connect, and thinking about migration strategies. There is still the possibility to wrap PHP in a sandbox-like service using roadrunner.

I would not have dared to start from scratch. But with an existing codebase that was used in production the decision became easier. The story is not over, yet.

Yes, we are leaving things behind. But we can embrace new things as well. And I am happy to be able to work with an awesome team to see this through. Every helping hand is welcome.

fundatus · on Jan 22, 2021

Netscape for example: https://www.joelonsoftware.com/2000/04/06/things-you-should-...

rimliu · on Jan 22, 2021

It is telling when a single example is used for 20+ years.

mjw1007 · on Jan 22, 2021

And in hindsight it doesn't seem a terribly good example.

Gecko was a technical success.

On the non-technical side, it isn't at all clear that Netscape-the-company came out worse than it would have done if it had tried to stick with the buggy Netscape 4 rendering engine.

ThePhysicist · on Jan 22, 2021

I'd argue that their attempted rewrite of Firefox in Rust also caused them to actually fall behind in terms of features and speed, which is reflected in the diminishing market share of Firefox. Technologically Rust is an awesome language and a boon to secure software development, I'm just not sure it was such a great idea for Mozilla to put so much energy into this, which could've gone into improving their main product instead.

ikt · on Jan 22, 2021

I'd argue not because I switched from Firefox to Chrome at v1

https://ubuntuforums.org/showthread.php?t=1398220&page=11&p=...

Chrome was absolutely faster in web rendering than Firefox, more stable, and the UX was cleaner. It made Firefox, with its slow, buggy performance and awful themes, look like MySpace to Chrome's Facebook.

I also remember being very frustrated with Firefox on Ubuntu ~10.04 at the time and when Chrome came along it was exactly what I'd been waiting for.

The Quantum/Servo/Rust/UI refreshes seem like Firefox catching up to Chrome v1, and honestly in 2020 it feels like Firefox has finally caught up significantly.

It's still not as stable as Chrome but it's getting there, webrender is a massive leap forward in performance:

https://testdrive-archive.azurewebsites.net/Performance/Chal...

That test is now showing at 6.65 seconds down from 45 seconds without it.

superjan · on Jan 22, 2021

I know 3 examples from personal experience. One that I worked on myself was scrapped before completion. The other two were finished, but they severely impacted the business. They are not so useful to share, though, because you would not know about these products. My assumption is that we all know these kinds of projects, but the Netscape story is a great reference because it explains why it is a bad idea.

dosethree · on Jan 22, 2021

Incremental rewrites are just always better. The issue with full rewrites is that you have to rewrite everything, which I suppose sounds good to some engineers, but I dread being forced to spend weeks on parts of the product that are not an issue. Usually you do the rewrite to make things faster, more maintainable, and easier to add features to. The issue is, the larger the rewrite, the worse you make all those things in the short term.