I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.
The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.
The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (i.e., guarantee that you can unrar a file in 20 years with minimal work).
My philosophy is to have a very simple custom language (which I called G) for which it is easier to write a compiler in (x86) assembly than it would be for C, then to write a C compiler in G that is good enough to compile tcc. tcc, in turn, is known to be able to compile gcc 4.7.4 (the last version that does not require a C++ compiler).
My C compiler is starting to take shape, and in theory (if I find the time to work on it) it should not be far from supporting everything that tcc needs.
The README in the linked page contains more information.
In theory yes, but at the moment my C compiler is very tied to the x86 platform (there is no separate code generation module; emit calls are directly in the code). Since the compiler is very simple, though, I think it should not be difficult to factor out the handful of machine code gadgets that need to be implemented and add support for other back ends. Or you might also consider other ways, like using the toolchain in OP's repository.
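To make the "factor out the gadgets" idea concrete, here is a rough sketch in Python (not the actual compiler, which is written in G). It assumes a tiny stack-machine gadget set; the names `push_const`, `add`, `ret_top`, `compile_sum` and the `FooArchBackend` mnemonics are all made up for illustration. The point is only that the front end never emits target code directly.

```python
class X86Backend:
    """Emits 32-bit x86 assembly text for a tiny, illustrative gadget set."""

    def push_const(self, value):
        return [f"push {value}"]

    def add(self):
        # Pop two operands, add them, push the result back.
        return ["pop ecx", "pop eax", "add eax, ecx", "push eax"]

    def ret_top(self):
        # Return the value on top of the stack in eax.
        return ["pop eax", "ret"]


class FooArchBackend:
    """Hypothetical second target; these mnemonics are invented."""

    def push_const(self, value):
        return [f"PUSHI {value}"]

    def add(self):
        return ["ADDS"]

    def ret_top(self):
        return ["POPRET"]


def compile_sum(terms, backend):
    """Compile 'return t0 + t1 + ...' using only the backend's gadgets."""
    out = list(backend.push_const(terms[0]))
    for term in terms[1:]:
        out += backend.push_const(term)
        out += backend.add()
    out += backend.ret_top()
    return out


if __name__ == "__main__":
    for backend in (X86Backend(), FooArchBackend()):
        print("\n".join(compile_sum([2, 3], backend)))
        print()
```

Porting to a new architecture then means writing one more small class rather than touching the parser.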
If you want to do some brainstorming, there are often interesting discussions about these things in #bootstrappable on freenode.
Wow, great to meet both of you. I've been chasing bootstrapping as well. My approach has been to make programming in raw machine code more ergonomic with some simple text-based tools (that should in principle be easy to write in machine code).
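As an illustration of what such a text-based tool could look like, here is a minimal sketch in Python of a commented-hex-to-binary converter. This is not the tool described above; the `#` comment syntax and the name `hex_to_bytes` are assumptions made for the example. The logic is simple enough that the same thing is plausible to hand-write in machine code.

```python
def hex_to_bytes(source):
    """Turn commented hex text into raw bytes.

    Assumed format: pairs of hex digits separated by whitespace,
    with '#' starting a comment that runs to the end of the line.
    """
    digits = []
    for line in source.splitlines():
        code = line.split("#", 1)[0]   # drop the comment, if any
        digits.append("".join(code.split()))  # drop all whitespace
    text = "".join(digits)
    if len(text) % 2 != 0:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex(text)


if __name__ == "__main__":
    program = """
    B8 2A 00 00 00   # mov eax, 42
    C3               # ret
    """
    print(hex_to_bytes(program).hex())
```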
> guarantee that you can unrar a file in 20 years with minimal work
Does this take into account a potential change in the predominant architecture? I.e., we move from x86 to fooarch? Presumably there's more work than "implement the VM in fooarch instructions"? You'd have to write a fooarch assembler as well, right? As well as fooarch C compiler backends?
As the code is compiled to a well-specified, small VM it should only be a matter of writing the VM in whatever language is already available for that particular platform/architecture.
You can also choose to write the VM in raw assembly. While this isn't ideal, the VM itself is mainly just straightforward register operations and should map trivially to any hardware that has bit operations and hardware 32-bit multiply/divide.
If it comes down to it, you can implement the VM itself on bare metal, but you'll need to do some work implementing things like a filesystem (not terribly hard to get a basic, non-scalable one up-and-running).
I suppose there's an assumption that the platform provides 32-bit integers, but I _think_ that's a safe assumption.
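To give a feel for how small such a VM can be, here is a sketch of a register machine in Python. It is not the instruction set from my repo: the opcode names, the tuple encoding and the `run` function are invented for this example. The only things it assumes are 32-bit wraparound arithmetic, bit operations and a conditional jump.

```python
MASK32 = 0xFFFFFFFF  # keep every register value within 32 bits


def run(program, num_regs=4):
    """Execute a list of (op, a, b) tuples and return the final registers."""
    regs = [0] * num_regs
    pc = 0
    while pc < len(program):
        op, a, b = program[pc]
        pc += 1
        if op == "set":      # regs[a] = immediate b
            regs[a] = b & MASK32
        elif op == "add":
            regs[a] = (regs[a] + regs[b]) & MASK32
        elif op == "mul":
            regs[a] = (regs[a] * regs[b]) & MASK32
        elif op == "div":    # unsigned division
            regs[a] = regs[a] // regs[b]
        elif op == "and":
            regs[a] = regs[a] & regs[b]
        elif op == "shl":
            regs[a] = (regs[a] << (regs[b] & 31)) & MASK32
        elif op == "jnz":    # jump to instruction index b if regs[a] != 0
            if regs[a] != 0:
                pc = b
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


# 5! computed with a loop; adding 0xFFFFFFFF acts as "subtract 1" mod 2**32.
FACTORIAL_5 = [
    ("set", 0, 1),
    ("set", 1, 5),
    ("set", 2, 0xFFFFFFFF),
    ("mul", 0, 1),
    ("add", 1, 2),
    ("jnz", 1, 3),
    ("halt", 0, 0),
]

if __name__ == "__main__":
    print(run(FACTORIAL_5)[0])
```

Each branch of the dispatch is one or two native instructions on most hardware, which is why porting the VM should be the easy part.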
Bootstrapping isn't an issue of convenience, it is an issue of trust. You can't trust the compiler doing the cross-compilation. You literally have to start from the smallest chunks of assembly code and build your way up to a fully featured compiler through several stages, each of which is more complex and can in turn compile more complex code.
That it's an issue of trust is the reason that's not true. You know certain groups are highly unlikely to work together on a backdoor that works the same for all of them. There are also folks like Wirth's group at ETH who are unlikely to be backdooring things at all. So, the easiest route is to write your bootstrap phase in several languages that use tools from very different people and countries. You can trust it once they all produce the same output. Use that output to bootstrap the rest. Also, do it on different hardware and OSes if you are concerned at that level. Make sure the CPUs were made at different fabs.
Aside from trust, the other reason people are doing this is for the fun challenge of building things from the ground up. They are also learning about interpreters, compilers, assembly, etc. The author of this work talked like he is doing it this way mainly for the challenge.
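The "trust it once they produce the same output" check is easy to automate. A minimal sketch in Python, assuming you already have the bootstrap binary as produced by each independent toolchain; the function name `agreed_output` and the toolchain labels are placeholders, not real tools.

```python
import hashlib


def agreed_output(builds):
    """Return the common binary if every independent build is identical.

    builds maps a label for each toolchain/machine to the bytes it produced.
    Raises ValueError if there are fewer than two builds or if any differ.
    """
    if len(builds) < 2:
        raise ValueError("need at least two independent builds to compare")
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in builds.items()}
    if len(set(digests.values())) != 1:
        raise ValueError(f"builds disagree: {digests}")
    return next(iter(builds.values()))


if __name__ == "__main__":
    # Placeholder labels and bytes; real input would be files read from disk.
    builds = {
        "toolchain-a": b"\x90\xc3",
        "toolchain-b": b"\x90\xc3",
        "toolchain-c": b"\x90\xc3",
    }
    print(len(agreed_output(builds)), "bytes agreed by all builds")
```

This only works if the bootstrap stage is deterministic, so that honest toolchains really do produce byte-identical output.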
I think this is a very cool concept, but it doesn't seem to protect you from your environment. When you write the program in your editor, how do you know it's not inserting rogue code before it writes the file to disk? When you run an assembly program stored on disk, how do you know the OS or even the hardware isn't patching it before running it?
Yes, but if you want to rebuild the trust chain you cannot restart from an untrusted compiler. The idea is to bootstrap a trusted compiler from nothing (i.e., from little enough so that you can check it directly), and then use the trusted compiler for everything else.
When you compile regularly you don't, mainly because you're more interested in ensuring your code isn't broken and you have no intention to ship the binaries to anyone.
EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap