> This is just the beginning of modernizing geospatial data storage. We’re already looking ahead to other types of geospatial data such as raster, point cloud, spatial indexes…
In my experience, NWP data is big. Like, really big! Data over HTTP calls seems to limit the use case a bit; have you considered making it possible to mount storage directly (fsspec), and use e.g. the Zarr format? In this way, querying with xarray would be much more flexible.
The point of GribStream is that you don't need to download the gridded data (because, as you say, it is huge, and it grows by many GB every single hour, for years).
This is an API that does streaming extraction, so you only download what you actually need.
When you make the HTTP request to the API, the API will be processing up to terabytes of data only to respond to you with maybe a few KB of CSV.
> It looks like e.g. sqlite-parquet-vtable implements shadow tables to memoize row group filters. How does JOIN performance vary amongst sqlite virtual table implementations?
datasette can JOIN across multiple SQLite databases.
Perhaps datasette and datasette-lite could support xarray, and thus NetCDF-style multidimensional arrays, in WASM in the browser, with HTTP Content Range requests to fetch and cache just the data requested.
> The header can also be used to verify the order of dimensions that a variable is saved in (which you will have to know to use, unless you’re using a tool like xarray that lets you refer to dimensions by name) - for a 3-dimensional variable, `lon,lat,time` is common, but some files will have the `time` variable first.
> Why is weather data stored in netcdf instead of tensors or sparse tensors?
NetCDF is a "tensor", at least in the sense of being a self-describing multi-dimensional array format. The bigger problem is that it's not a Cloud-Optimized format, which is why Zarr has become popular.
> Also, SQLite supports virtual tables that can be backed by Content Range requests
The multi-dimensional equivalent of this is "virtual Zarr". I made this library to create virtual Zarr stores pointing at archival data (e.g. netCDF and GRIB).
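For illustration of the general idea only (this is not the commenter's library; the URL, offset and length below are made up): a kerchunk-style reference mapping plus fsspec's "reference" filesystem lets each Zarr chunk key point at a byte range inside an existing archival file, so nothing is copied or rewritten.

```python
import fsspec

# Each chunk key maps to [url, byte_offset, length] inside an archival file.
# Values here are placeholders, not real offsets.
refs = {
    "version": 1,
    "refs": {
        "temperature/0.0.0": ["https://example.com/archive/2024-01-01.nc", 31245, 1048576],
    },
}

# The "reference" filesystem resolves a key to a single HTTP range request
# for exactly those bytes; a Zarr reader on top sees an ordinary store.
fs = fsspec.filesystem("reference", fo=refs)
chunk_bytes = fs.cat("temperature/0.0.0")
```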
> xarray and thus NetCDF-style multidimensional arrays in WASM in the browser with HTTP Content Range requests to fetch and cache just the data requested
Pretty sure you can do this today already using Xarray and fsspec.
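A minimal sketch of that approach (the URL, variable name and coordinates are placeholders): fsspec mounts the remote store, and xarray only fetches the chunks that the selection actually touches.

```python
import fsspec
import xarray as xr

# Lazily open a Zarr store over HTTP; no data is downloaded yet.
store = fsspec.get_mapper("https://example.com/path/to/forecast.zarr")
ds = xr.open_zarr(store, consolidated=True)

# Only the chunks covering this selection are fetched, via range requests.
point = ds["temperature"].sel(lat=55.7, lon=12.6, method="nearest").isel(time=0)
print(float(point))
```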
In theory it could be done. It is sort of analogous to what GribStream is doing already.
The GRIB2 files are the storage. They are sorted by time in the path, so that is used like a primary index. And GRIB2 is just a binary format you decode to extract what you want.
I originally was going to write this as a plugin for ClickHouse, but in the end I made it a Golang API because then I'm less constrained by other things. For example, I'd like to create an endpoint to live-encode the GRIB files into MP4 so the data can be served as video. Then, with any video player, you would be able to play back, jump to times, etc.
I might still write a ClickHouse integration though, because it would be amazing to join and combine with other datasets on the fly.
> It is sort of analogous to what GribStream is doing already.
The difference is presumably that you are doing some large rechunking operation on your server to hide from the user the fact that the data is actually in multiple files?
Cool project btw, would love to hear a little more about how it works underneath :)
I basically scrape all the GRIB index files to know all the offsets into all variables for all time. I store that in ClickHouse.
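For context, the .idx files published alongside NOAA GRIB2 data are small text files with one line per GRIB message; the second field is the message's starting byte offset, and the next line's offset marks the end. A rough sketch of turning them into byte-range records (the exact line layout is an assumption here, based on the common wgrib-style format):

```python
def parse_grib_idx(idx_text, file_size):
    """Parse wgrib-style index lines such as
    '31:12034567:d=2024010100:TMP:2 m above ground:anl:'
    into (variable, level, start, length) records."""
    lines = [ln for ln in idx_text.splitlines() if ln.strip()]
    records = []
    for i, line in enumerate(lines):
        fields = line.split(":")
        start = int(fields[1])
        variable, level = fields[3], fields[4]
        # A message ends where the next one starts (or at end of file).
        end = int(lines[i + 1].split(":")[1]) if i + 1 < len(lines) else file_size
        records.append((variable, level, start, end - start))
    return records
```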
When the API gets a request for a time range, a set of coordinates, and a set of weather parameters, I first pre-compute the mapping of (lat,lon) into the 1-dimensional index in the gridded data. That mapping is a constant across the whole dataset.
Then I query the ClickHouse table to find all the files and offsets that need to be processed, and all of them are queued into a multiprocessing pool.
Processing each parameter implies parsing a GRIB file. I wrote a GRIB2 parser from scratch in Golang so as to extract the data in a streaming fashion. As in... I don't extract the whole grid only to look up the coordinates in it. I already pre-computed the indexes, so I can just decode every value in the grid in order, and when I hit an index that I'm looking for, I copy it to a fixed-size buffer with the extracted data. When you have all the pre-computed indexes, you don't even need to finish downloading the file; I just drop the connection immediately.
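The real parser is written in Go; the snippet below is only a small Python sketch of that core idea (names are made up): decode the grid values in storage order, keep just the pre-computed flat indexes, and stop as soon as the last wanted index has been seen.

```python
def extract_points(decoded_values, wanted_indexes):
    """decoded_values: iterator yielding grid values in storage order.
    wanted_indexes: flat indexes pre-computed from (lat, lon) pairs."""
    wanted = set(wanted_indexes)
    last = max(wanted_indexes)
    found = {}
    for i, value in enumerate(decoded_values):
        if i in wanted:
            found[i] = value
        if i >= last:
            # Everything we need has been seen: stop decoding
            # (and the download can be dropped early).
            break
    return [found[i] for i in wanted_indexes]
```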
It is pretty cool. It is running on very humble hardware, so I'm hoping I'll get some traction so I can throw more money at it. It should scale pretty linearly.
I've tested doing multi-year requests and the Golang program never goes over 80 MB of memory usage. The CPUs get pegged, so that is the limiting factor.
GRIB2 complex packing (what the NBM dataset uses) implies lots of bit-packing. So there is a ton more to optimize using SIMD instructions. I've been toying with it a bit, but I don't want to mission-creep into that yet (fascinating, though!).
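As a rough illustration of what bit-packing means here (this is not the actual complex-packing algorithm): values are stored as fixed-width bit fields that must be unpacked one by one, a tight inner loop that SIMD can speed up considerably.

```python
def unpack_bits(data, bit_width, count):
    """Unpack `count` unsigned integers of `bit_width` bits from `data` (bytes)."""
    values, acc, nbits = [], 0, 0
    for byte in data:
        acc = (acc << 8) | byte   # accumulate bits most-significant first
        nbits += 8
        while nbits >= bit_width and len(values) < count:
            nbits -= bit_width
            values.append((acc >> nbits) & ((1 << bit_width) - 1))
    return values

# 0b101_101_00 -> two 3-bit values: 5 and 5
assert unpack_bits(bytes([0b10110100]), bit_width=3, count=2) == [5, 5]
```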
That's pretty cool! Quite specific to this file format/workload, but this is an important enough problem that people might well be interested in a tailored solution like this :)
They wouldn't necessarily need to do 50 revolutions per second to generate 50 Hz. There could be any number of poles on the generator. At least in theory... Not saying you're wrong about them not being synced, just that they wouldn't need to do 50 rps.
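For reference, the standard relation is: electrical frequency = (number of pole pairs) × (revolutions per second). A 2-pole generator (one pole pair) needs 50 rps for 50 Hz, but a 20-pole generator (ten pole pairs) only needs 5 rps.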
Gearboxes. Windmills of the same design will turn at the same rate in lockstep with each other. A gearbox converts that rate into what is necessary for the generator. The generator turns in sync with the grid.
In my opinion this proposal only considers simple cases, but there are many not-so-simple relationship types:
Consider a ‘sales’ table with columns [time] and [sold_by_employee_id], and a periodized ‘employee’ table with columns [employee_id], [valid_from] and [valid_to]. There is a perfectly valid relationship between the two tables, but you can't join them using only equality conditions (you need a BETWEEN condition as well).
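For illustration only (pandas rather than SQL; the rows and the extra "department" column are made up), the equality match on the key is not enough -- the validity interval has to be applied as well:

```python
import pandas as pd

sales = pd.DataFrame({
    "time": pd.to_datetime(["2021-03-01", "2021-09-01"]),
    "sold_by_employee_id": [1, 1],
})
employee = pd.DataFrame({
    "employee_id": [1, 1],
    "department": ["east", "west"],
    "valid_from": pd.to_datetime(["2021-01-01", "2021-07-01"]),
    "valid_to": pd.to_datetime(["2021-06-30", "2021-12-31"]),
})

# Join on the key, then keep only rows where the sale falls inside the
# employee row's validity interval (the BETWEEN part of the SQL join).
joined = sales.merge(employee, left_on="sold_by_employee_id", right_on="employee_id")
joined = joined[(joined["time"] >= joined["valid_from"]) & (joined["time"] <= joined["valid_to"])]
print(joined[["time", "sold_by_employee_id", "department"]])
```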
Nice example! The join you describe would remain as a JOIN ON.
This is by design. Quote from the proposal:
"The idea is to improve the SQL language, specifically the join syntax, for the special but common case when joining on foreign key columns."
...
"If the common simple joins (when joining on foreign key columns) would be written in a different syntax, the remaining joins would visually stand out and we could focus on making sure we understand them when reading a large SQL query."
So the special non-equality join condition you describe would become more visible and stand out, allowing readers to pay more attention to it.
The hypothesis is that most joins are made on foreign key columns, so if we can improve those cases, a lot can be won.
Good example too, but this one can usefully be written using the JOIN FOREIGN syntax; you just need to give the foreign keys suitable names, such as "shipped_date", "order_date", "received_date". Or, to remind you of which table is referenced, you could include it in the names -- "shipped_date_calendar", "order_date_calendar", "received_date_calendar" -- but personally I would prefer just "shipped_date", or perhaps even just "shipped".
Are those timestamps, or actual dates, i.e. a whole day encompassing (usually) 24 hours?
In many (most?) data warehousing projects I've seen, you make a "fake date" integer the primary key column of your Dates dimension. This integer consists of 10000 × YEAR_PART + 100 × MONTH_PART + 1 × DAY_PART of the date in question, so yesterday's New Year's Eve would get 10000 × 2021 + 100 × 12 + 1 × 31 = 20211231. The date dimension itself has many more columns (often booleans: IS_WEEKEND, IS_HOLIDAY, etc.; also the date parts themselves in both numeric and character form (12, 'December'), day of week (5 [or 6, depending on convention], 'Friday'), etc.) that are used for BI and reporting.
But since this generated ISO-8601-date-as-integer column is the primary key of the Dates table, it is also the value of the Date foreign key in all tables that reference Dates. That makes it incredibly handy in queries -- both during development and for ad-hoc reports -- of those tables without joining to the Dates dimension at all: grouping, sorting, limiting to a more manageable date interval in the WHERE clause... And it tells the reader exactly what the actual date in question is. (Well, at least readers who are used to ISO-8601-format dates.) As jerryp would have said, recommended.
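A minimal sketch of that key construction, and of the kind of integer-only filtering it enables (the column name in the comment is hypothetical):

```python
from datetime import date

def date_key(d):
    """Surrogate integer key: 10000*year + 100*month + day."""
    return d.year * 10000 + d.month * 100 + d.day

assert date_key(date(2021, 12, 31)) == 20211231

# Because the keys sort exactly like the dates themselves, a plain integer
# comparison restricts to a date interval without touching the Dates dimension,
# e.g. WHERE sale_date_key BETWEEN 20211201 AND 20211231 in an ad-hoc query.
```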
The author makes a case against using “valid_from”/“valid_to” columns on your table. But in my experience, that is the only viable solution. E.g., joining an employee table and a department table needs to take into account when an employee was in which department. How would you do that with some git-like version control for data? That might be useful for some machine learning training dataset, but in an operational system? No thanks.
Hi! Author here. The git model lets you track multiple tables in the same repo. You can query (and join) both tables starting with the same commit_id, which is actually a nice solution to the problem you describe :)
If he wants to do ML research, why doesn't he want to do a PhD program? I don't think you can get both: if you do research, there is less focus on making operational software; if you make operational software, there is less focus on doing research. Find out what you really want to do, and go with that.
Isomorphic as vector spaces, meaning that their additive structure is the same. But the complex numbers are usually not used as a vector space, but rather as an algebraic field, i.e. considering both their additive and multiplicative structure.
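Concretely, the extra structure beyond the R^2 vector space is the multiplication rule (a + bi)(c + di) = (ac - bd) + (ad + bc)i; it is this multiplication, together with the addition, that makes the complex numbers a field.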
Would either of these database systems be a good fit for a case where you have a mix of large measurement data and small reference/master data that you need to join, filter, etc.? Example:
SELECT r.country, m.time, SUM(m.measurement)
FROM measurement_table AS m
INNER JOIN reference_table AS r
ON m.device_id = r.device_id
GROUP BY r.country, m.time
In its current form, ClickHouse is not optimized for typical JOIN-type queries, a point we make in the post. You would have to rewrite your statement to get better performance. The other main point is that all data is "immutable", so if your reference data needs to be updated, it would still need to go through some kind of asynchronous transform process to ensure you're getting the correct values at query time.
TimescaleDB is PostgreSQL, so it can easily handle this kind of join aggregate like you would expect. If "m.measurement" were compressed, historical queries with a time predicate would likely be faster than in the uncompressed state.
These are fine intros, but then you have to actually dive into these topics. Sometimes there is no really interesting way to explain them. E.g., IIRC, the construction of the real numbers is rather tedious. But I agree that more effort could be put into motivating many of these topics (at least that was my experience studying math).
"Tedious"? Yup! Can use Dedekind cuts or maybe something called the normal completion, and especially the first is darned tedious!
There is a good math writer, G. F. Simmons, author of Introduction to Topology and Modern Analysis, who stated that the two pillars of analysis were linearity and continuity -- nice remark. He also stated that really to understand, you have to chew on all the arguments, etc., or some such.
Then I decided to study the proofs really carefully, so as to "chew", and in hopes of finding techniques I could use elsewhere. When I mentioned that study technique and objective to my department Chair, his remark was "There is no time." -- he also had a point.
Commonly there is an intuitive explanation of what is going on and some views that can provide motivation to study the stuff at all.
There are a lot of books and papers. As a student, I saw a lot of the books, got copies of some of them, put them on my TODO reading list, etc. Eventually, after falling far enough behind on the list, I wondered just where all those books were coming from?
It dawned on me, profs need to publish so they do. They are also supposed to have grad students and do. Then the grad students take the advanced course by their major prof and end up with a big pile of notes. Then the grad student, as an assistant prof, wants to publish so cleans up the pile of notes and contacts the usual publishers to publish a book. The top university libraries are essentially required to buy the books, so they get published and bought. And, then, often, there the books sit, gathering dust.
I won't say that writing those books was a total waste, and I won't say that students should spend more time reading those books. Or, the books are there on the shelves. They are not really difficult to find. The books have work that was done. Maybe the work is useful now; maybe someday it will be useful; whatever, the work is done, the results found, and there in case they do become useful.
In the meanwhile, back to the mainline of math education, research, applications, usually there can be some helpful intuitive explanations and motivating example applications!
Apparently some authors just give up and assume that their books will mostly just gather dust. But once I wrote Paul Halmos, likely my favorite author, and got back a nice letter from him with "It warms the heart of an author actually to be read, and clearly understood, by ordinary humans." -- at the time I had no academic affiliation and was just reading his book on my own. So, Halmos was surprised that an ordinary human would be reading and understanding his book.
Ah, in what I wrote I left out that, also in linear algebra in n-dimensional space, the Pythagorean theorem still holds -- that is, an n-dimensional version holds!
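Concretely, the n-dimensional statement: if u and v are orthogonal vectors in R^n, then ||u + v||^2 = ||u||^2 + ||v||^2; in particular, for x = (x_1, ..., x_n), ||x||^2 = x_1^2 + ... + x_n^2.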