A concrete case where this comes up is multi-omics research. A single study routinely combines ~20k gene expression values, 100k–1M SNPs, thousands of proteins and metabolites, plus clinical metadata — all per patient.
Today, this data is almost never stored in relational tables. It lives in files and in-memory matrices, and a large part of the work is repeatedly rebuilding wide matrices just to explore subsets of features or cohorts.
In that context, a “wide table” isn’t about transactions or joins — it’s about having a persistent, queryable representation of a matrix that already exists conceptually. Integration becomes “load patients”, and exploration becomes SELECT statements.
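To make that concrete, here is a minimal sketch of what exploration looks like once the matrix is persisted as one wide table. The table and column names below are invented for illustration, not taken from any real schema:

    -- Hypothetical wide table: one row per patient, one column per feature.
    -- Names like expr_TP53 or snp_rs429358 are illustrative only.
    SELECT patient_id,
           age,
           diagnosis,
           expr_TP53,        -- one of ~20k expression columns
           expr_BRCA1,
           snp_rs429358,     -- one of 100k-1M SNP columns
           prot_CRP          -- one of thousands of protein columns
    FROM   omics_wide
    WHERE  diagnosis = 'T2D'
      AND  expr_TP53 IS NOT NULL;

Cohort and feature selection become ordinary filtering and projection instead of rebuilding a matrix in memory for every question.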
I’m not claiming this fits every workload, but based on how much time is currently spent on data reshaping in multi-omics, I’m confident there is a real need for this kind of model.
Interesting. Are you willing to try out some 'experimental' software?
As I indicated in my previous post, I have a unique kind of data management system that I have built over the years as a hobby project.
It was originally designed to be a replacement for conventional file systems. It is an object store where you can store millions or billions of files in a single container and attach metadata tags to each one. Searches can then be based on those tags, and I had to design a whole new kind of metadata manager to handle them.
Since thousands or millions of different kinds of tags can be defined, each with thousands or millions of unique values, the whole system started to look like a very wide, sparse relational table.
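As a rough analogy (this is not how my engine actually stores anything, just an illustration in plain SQL of why a tag store ends up looking that way): a tag store is conceptually a pile of (object, tag, value) triples, and pivoting the tags back out gives one very wide, very sparse row per object.

    -- Conceptual model only: (object, tag, value) triples.
    CREATE TABLE object_tags (
        object_id BIGINT      NOT NULL,
        tag_name  VARCHAR(64) NOT NULL,
        tag_value VARCHAR(255),
        PRIMARY KEY (object_id, tag_name)
    );

    -- Pivoting a few tags out yields one sparse wide row per object;
    -- with millions of distinct tag names, the implied table has
    -- millions of mostly empty columns.
    SELECT object_id,
           MAX(CASE WHEN tag_name = 'project'    THEN tag_value END) AS project,
           MAX(CASE WHEN tag_name = 'instrument' THEN tag_value END) AS instrument,
           MAX(CASE WHEN tag_name = 'sample_id'  THEN tag_value END) AS sample_id
    FROM   object_tags
    GROUP  BY object_id;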
I found that I could use the individual 'columnar stores' I had built to also construct conventional database tables. I was actually surprised at how well it worked when I started benchmarking it against popular database engines.
I would test my code by downloading and importing various public datasets and then doing analytics against that data. My system does both analytic and transactional operations pretty well.
Most of the datasets had only a few dozen columns, and many had millions of rows, but I didn't find any with over a thousand columns.
As I said before, I had previously only tested it out to 10,000 columns. But since reading your original question, I started to play with large numbers of columns.
After tweaking the code, I got it to create tables with up to a million columns and to add some random test data to them. A 'SELECT *' query against such a table can take a long time, but queries that returned only a few dozen of the columns ran very fast.
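Roughly, the two access patterns look like this (the table and column names are made up for the example):

    -- Slow: forces the engine to materialize all ~1,000,000 columns.
    SELECT * FROM wide_test LIMIT 10;

    -- Fast: only the columnar stores for the named columns are touched.
    SELECT col_000042, col_517773, col_998001
    FROM   wide_test
    WHERE  col_000042 > 0.5
    LIMIT  10;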
How many patients were represented in your dataset? I assume that most rows did not have a value in every column.
ClickHouse and Scuba are extremely good at what they’re designed for: fast OLAP over relatively narrow schemas (dozens to hundreds of columns) with heavy aggregation.
The issue I kept running into was extreme width: tens or hundreds of thousands of columns per row, where metadata handling, query planning, and even column enumeration start to dominate.
In those cases, I found that pushing width this far forces very different tradeoffs (e.g. giving up joins and transactions, distributing columns instead of rows, and making SELECT projection part of the contract).
If you’ve seen ClickHouse or Scuba used successfully at that kind of width, I’d genuinely be interested in the details.
Scuba could handle 100,000 columns, probably more. But yes, the model is that you have one table, you can only do self-joins, it's more or less append-only, and you typically access maybe dozens of columns in a single query.
From what I understand, Exasol is a very fast analytical database for traditional data warehouses.
My engine doesn't replace a data warehouse; it handles a kind of table that data warehouses simply can't: tables with hundreds of thousands or millions of columns, with an access model that keeps response times interactive even in these extreme cases.
In a few words: table data is stored on hundreds of MariaDB servers. Each table has a set of user-designed hash key columns (1 to 32 of them) that drives the automatic partitioning.
Wide tables are split into chunks: one chunk = the hash key columns + a slice of the table's columns = one MariaDB server. The data dictionary is stored on dedicated, mirrored MariaDB servers.
The engine itself uses a massive fork policy.
In my lab, the k1000 table is stored as 500 chunks. I used a small trick: where I say one MariaDB server, you can instead use one database inside a MariaDB server. So I only have 20 VMware Linux servers, each containing 25 databases.
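If it helps, here is a minimal sketch of what one chunk could look like on one of those MariaDB databases (the names are invented, and this is only a simplified reading of the layout described above):

    -- One chunk of the logical k1000 table: the hash key columns are
    -- repeated in every chunk, and each chunk carries its own slice of
    -- the remaining columns.
    CREATE TABLE k1000_chunk_001 (
        hk1   BIGINT NOT NULL,    -- user-designed hash key columns
        hk2   BIGINT NOT NULL,
        c0001 DOUBLE,
        c0002 DOUBLE,
        -- ... more columns assigned to this chunk ...
        c2000 DOUBLE,
        PRIMARY KEY (hk1, hk2)
    );

A projected query then only has to touch the chunks (databases) that hold the requested columns, and the engine stitches the results back together on the hash key.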