> This is just the beginning of modernizing geospatial data storage. We’re already looking ahead to other types of geospatial data such as raster, point cloud, spatial indexes…
In my experience, NWP data is big. Like, really big! Data over HTTP calls seems to limit the use case a bit; have you considered making it possible to mount storage directly (fsspec), and use e.g. the Zarr format? In this way, querying with xarray would be much more flexible.
The point of GribStream is that you don't need to download the gridded data (because, as you say, it is huge, and it grows by many GB every single hour, for years).
This is an API that does streaming extraction, so you only download what you actually need.
When you make the HTTP request to the API, the API will be processing up to terabytes of data only to respond to you with maybe a few KB of CSV.
> It looks like e.g. sqlite-parquet-vtable implements shadow tables to memoize row group filters. How does JOIN performance vary amongst sqlite virtual table implementations?
datasette can JOIN across multiple SQLite databases.
Perhaps datasette and datasette-lite could support xarray, and thus NetCDF-style multidimensional arrays, in WASM in the browser, with HTTP Content Range requests to fetch and cache just the data requested.
> The header can also be used to verify the order of dimensions that a variable is saved in (which you will have to know to use, unless you’re using a tool like xarray that lets you refer to dimensions by name) - for a 3-dimensional variable, `lon,lat,time` is common, but some files will have the `time` variable first.
> Why is weather data stored in netcdf instead of tensors or sparse tensors?
NetCDF is a "tensor", at least in the sense of being a self-describing multi-dimensional array format. The bigger problem is that it's not a Cloud-Optimized format, which is why Zarr has become popular.
> Also, SQLite supports virtual tables that can be backed by Content Range requests
The multi-dimensional equivalent of this is "virtual Zarr". I made this library to create virtual Zarr stores pointing at archival data (e.g. netCDF and GRIB).
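For illustration of the general idea only (this is not the commenter's library; the URL, offset and length below are made up): a kerchunk-style reference mapping plus fsspec's "reference" filesystem lets each Zarr chunk key point at a byte range inside an existing archival file, so nothing is copied or rewritten.

```python
import fsspec

# Each chunk key maps to [url, byte_offset, length] inside an archival file.
# Values here are placeholders, not real offsets.
refs = {
    "version": 1,
    "refs": {
        "temperature/0.0.0": ["https://example.com/archive/2024-01-01.nc", 31245, 1048576],
    },
}

# The "reference" filesystem resolves a key to a single HTTP range request
# for exactly those bytes; a Zarr reader on top sees an ordinary store.
fs = fsspec.filesystem("reference", fo=refs)
chunk_bytes = fs.cat("temperature/0.0.0")
```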
> xarray and thus NetCDF-style multidimensional arrays in WASM in the browser with HTTP Content Range requests to fetch and cache just the data requested
Pretty sure you can do this today already using Xarray and fsspec.
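A minimal sketch of that approach (the URL, variable name and coordinates are placeholders): fsspec mounts the remote store, and xarray only fetches the chunks that the selection actually touches.

```python
import fsspec
import xarray as xr

# Lazily open a Zarr store over HTTP; no data is downloaded yet.
store = fsspec.get_mapper("https://example.com/path/to/forecast.zarr")
ds = xr.open_zarr(store, consolidated=True)

# Only the chunks covering this selection are fetched, via range requests.
point = ds["temperature"].sel(lat=55.7, lon=12.6, method="nearest").isel(time=0)
print(float(point))
```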
In theory it could be done. It is sort of analogous to what GribStream is doing already.
The GRIB2 files are the storage. They are sorted by time in the path, so that is used like a primary index. And GRIB2 is just a binary format you decode to extract what you want.
I originally was going to write this as a plugin for ClickHouse, but in the end I made it a Golang API because then I'm less constrained by other things. For example, I'd like to create an endpoint to live-encode the GRIB files into MP4 so the data can be served as video. Then, with any video player, you would be able to play back, jump to times, etc.
I might still write a ClickHouse integration though, because it would be amazing to join and combine with other datasets on the fly.
> It is sort of analogous to what GribStream is doing already.
The difference is presumably that you are doing some large rechunking operation on your server to hide from the user the fact that the data is actually in multiple files?
Cool project btw, would love to hear a little more about how it works underneath :)
I basically scrape all the GRIB index files to know all the offsets into all variables for all time. I store that in ClickHouse.
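For context, the .idx files published alongside NOAA GRIB2 data are small text files with one line per GRIB message; the second field is the message's starting byte offset, and the next line's offset marks the end. A rough sketch of turning them into byte-range records (the exact line layout is an assumption here, based on the common wgrib-style format):

```python
def parse_grib_idx(idx_text, file_size):
    """Parse wgrib-style index lines such as
    '31:12034567:d=2024010100:TMP:2 m above ground:anl:'
    into (variable, level, start, length) records."""
    lines = [ln for ln in idx_text.splitlines() if ln.strip()]
    records = []
    for i, line in enumerate(lines):
        fields = line.split(":")
        start = int(fields[1])
        variable, level = fields[3], fields[4]
        # A message ends where the next one starts (or at end of file).
        end = int(lines[i + 1].split(":")[1]) if i + 1 < len(lines) else file_size
        records.append((variable, level, start, end - start))
    return records
```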
When the API gets a request for a time range, a set of coordinates, and a set of weather parameters, I first pre-compute the mapping of (lat,lon) into the 1-dimensional index in the gridded data. That mapping is a constant across the whole dataset.
Then I query the ClickHouse table to find all the files and offsets that need to be processed, and all of them are queued into a multiprocessing pool.
Processing each parameter implies parsing a GRIB file. I wrote a GRIB2 parser from scratch in Golang so as to extract the data in a streaming fashion. As in... I don't extract the whole grid only to look up the coordinates in it. I already pre-computed the indexes, so I can just decode every value in the grid in order, and when I hit an index that I'm looking for, I copy it to a fixed-size buffer with the extracted data. When you have all the pre-computed indexes, you don't even need to finish downloading the file; I just drop the connection immediately.
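The real parser is written in Go; the snippet below is only a small Python sketch of that core idea (names are made up): decode the grid values in storage order, keep just the pre-computed flat indexes, and stop as soon as the last wanted index has been seen.

```python
def extract_points(decoded_values, wanted_indexes):
    """decoded_values: iterator yielding grid values in storage order.
    wanted_indexes: flat indexes pre-computed from (lat, lon) pairs."""
    wanted = set(wanted_indexes)
    last = max(wanted_indexes)
    found = {}
    for i, value in enumerate(decoded_values):
        if i in wanted:
            found[i] = value
        if i >= last:
            # Everything we need has been seen: stop decoding
            # (and the download can be dropped early).
            break
    return [found[i] for i in wanted_indexes]
```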
It is pretty cool. It is running on very humble hardware, so I'm hoping I'll get some traction so I can throw more money at it. It should scale pretty linearly.
I've tested doing multi-year requests and the Golang program never goes over 80 MB of memory usage. The CPUs get pegged, so that is the limiting factor.
GRIB2 complex packing (what the NBM dataset uses) implies lots of bit-packing. So there is a ton more to optimize using SIMD instructions. I've been toying with it a bit, but I don't want to mission-creep into that yet (fascinating, though!).
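As a rough illustration of what bit-packing means here (this is not the actual complex-packing algorithm): values are stored as fixed-width bit fields that must be unpacked one by one, a tight inner loop that SIMD can speed up considerably.

```python
def unpack_bits(data, bit_width, count):
    """Unpack `count` unsigned integers of `bit_width` bits from `data` (bytes)."""
    values, acc, nbits = [], 0, 0
    for byte in data:
        acc = (acc << 8) | byte   # accumulate bits most-significant first
        nbits += 8
        while nbits >= bit_width and len(values) < count:
            nbits -= bit_width
            values.append((acc >> nbits) & ((1 << bit_width) - 1))
    return values

# 0b101_101_00 -> two 3-bit values: 5 and 5
assert unpack_bits(bytes([0b10110100]), bit_width=3, count=2) == [5, 5]
```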
That's pretty cool! Quite specific to this file format/workload, but this is an important enough problem that people might well be interested in a tailored solution like this :)
They wouldn't necessarily need to do 50 revolutions per second to generate 50 Hz. There could be any number of poles on the generator. At least in theory... Not saying you're wrong about them not being synced, just that they wouldn't need to do 50 rps.
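For reference, the standard relation is: electrical frequency = (number of pole pairs) × (revolutions per second). A 2-pole generator (one pole pair) needs 50 rps for 50 Hz, but a 20-pole generator (ten pole pairs) only needs 5 rps.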
Gearboxes. Windmills of the same design will turn at the same rate in lockstep with each other. A gearbox converts that rate into what is necessary for the generator. The generator turns in sync with the grid.
In my opinion this proposal only considers simple cases, but there are many not-so-simple relationship types:
Consider a ‘sales’ table with columns [time] and [sold_by_employee_id], and a periodized ‘employee’ table with columns [employee_id], [valid_from] and [valid_to]. There is a perfectly valid relationship between the two tables, but you can't join them using only equality conditions (you need a BETWEEN condition as well).
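For illustration only (pandas rather than SQL; the rows and the extra "department" column are made up), the equality match on the key is not enough -- the validity interval has to be applied as well:

```python
import pandas as pd

sales = pd.DataFrame({
    "time": pd.to_datetime(["2021-03-01", "2021-09-01"]),
    "sold_by_employee_id": [1, 1],
})
employee = pd.DataFrame({
    "employee_id": [1, 1],
    "department": ["east", "west"],
    "valid_from": pd.to_datetime(["2021-01-01", "2021-07-01"]),
    "valid_to": pd.to_datetime(["2021-06-30", "2021-12-31"]),
})

# Join on the key, then keep only rows where the sale falls inside the
# employee row's validity interval (the BETWEEN part of the SQL join).
joined = sales.merge(employee, left_on="sold_by_employee_id", right_on="employee_id")
joined = joined[(joined["time"] >= joined["valid_from"]) & (joined["time"] <= joined["valid_to"])]
print(joined[["time", "sold_by_employee_id", "department"]])
```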
Nice example! The join you describe would remain as a JOIN ON.
This is by design. Quote from the proposal:
"The idea is to improve the SQL language, specifically the join syntax, for the special but common case when joining on foreign key columns."
...
"If the common simple joins (when joining on foreign key columns) would be written in a different syntax, the remaining joins would visually stand out and we could focus on making sure we understand them when reading a large SQL query."
So the special non-equality join condition you describe would become more visible and stand out, allowing readers to pay more attention to it.
The hypothesis is that most joins are made on foreign key columns, so if we can improve those cases, a lot can be won.
Good example too, but this one can usefully be written using the JOIN FOREIGN syntax; you just need to give the foreign keys suitable names, such as "shipped_date", "order_date", "received_date". Or, to remind you of which table is referenced, you could include it in the names -- "shipped_date_calendar", "order_date_calendar", "received_date_calendar" -- but personally I would prefer just "shipped_date", or perhaps even just "shipped".
Are those timestamps, or actual dates, i.e. a whole day encompassing (usually) 24 hours?
In many (most?) data warehousing projects I've seen, you make a "fake date" integer the primary key column of your Dates dimension. This integer consists of 10000 × YEAR_PART + 100 × MONTH_PART + 1 × DAY_PART of the date in question, so yesterday's New Year's Eve would get 10000 × 2021 + 100 × 12 + 1 × 31 = 20211231. The date dimension itself has many more columns (often booleans: IS_WEEKEND, IS_HOLIDAY, etc.; also the date parts themselves in both numeric and character form (12, 'December'), day of week (5 [or 6, depending on convention], 'Friday'), etc.) that are used for BI and reporting.
But since this generated ISO-8601-date-as-integer column is the primary key of the Dates table, it is also the value of the Date foreign key in all tables that reference Dates. That makes it incredibly handy in queries -- both during development and for ad-hoc reports -- of those tables without joining to the Dates dimension at all: grouping, sorting, limiting to a more manageable date interval in the WHERE clause... And it tells the reader exactly what the actual date in question is. (Well, at least readers who are used to ISO-8601-format dates.) As jerryp would have said, recommended.
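A minimal sketch of that key construction, and of the kind of integer-only filtering it enables (the column name in the comment is hypothetical):

```python
from datetime import date

def date_key(d):
    """Surrogate integer key: 10000*year + 100*month + day."""
    return d.year * 10000 + d.month * 100 + d.day

assert date_key(date(2021, 12, 31)) == 20211231

# Because the keys sort exactly like the dates themselves, a plain integer
# comparison restricts to a date interval without touching the Dates dimension,
# e.g. WHERE sale_date_key BETWEEN 20211201 AND 20211231 in an ad-hoc query.
```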
The author makes a case against using “valid_from”/“valid_to” columns on your table. But in my experience, that is the only viable solution. E.g., joining an employee table and a department table needs to take into account when an employee was in which department. How would you do that with some git-like version control for data? That might be useful for some machine learning training dataset, but in an operational system? No thanks.
Hi! Author here. The git model lets you track multiple tables in the same repo. You can query (and join) both tables starting with the same commit_id, which is actually a nice solution to the problem you describe :)
If he wants to do ML research, why doesn't he want to do a PhD program? I don't think you can get both: if you do research, there is less focus on making operational software; if you make operational software, there is less focus on doing research. Find out what you really want to do, and go with that.
Isomorphic as vector spaces, meaning that their additive structure is the same. But the complex numbers are usually not used as a vector space, but rather as an algebraic field, i.e. considering both their additive and multiplicative structure.
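Concretely, the extra structure beyond the R^2 vector space is the multiplication rule (a + bi)(c + di) = (ac - bd) + (ad + bc)i; it is this multiplication, together with the addition, that makes the complex numbers a field.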
Would either of these database systems be a good fit for a case where you have a mix of large measurement data and small reference/master data that you need to join, filter, etc.? Example:
SELECT r.country, m.time, SUM(m.measurement)
FROM measurement_table AS m
INNER JOIN reference_table AS r
ON m.device_id = r.device_id
GROUP BY r.country, m.time
In its current form, ClickHouse is not optimized for typical JOIN-type queries, a point we make in the post. You would have to rewrite your statement to get better performance. The other main point is that all data is "immutable", so if your reference data needs to be updated, it would still need to go through some kind of asynchronous transform process to ensure you're getting the correct values at query time.
TimescaleDB is PostgreSQL, so it can easily handle this kind of join aggregate like you would expect. If "m.measurement" were compressed, historical queries with a time predicate would likely be faster than in the uncompressed state.
These are fine intros, but then you have to actually dive into these topics. Sometimes there is no really interesting way to explain them. E.g., IIRC, the construction of the real numbers is rather tedious. But I agree that more effort could be put into motivating many of these topics (at least that was my experience studying math).
"Tedious"? Yup! Can use Dedekind cuts or maybe something called the normal completion, and especially the first is darned tedious!
There is a good math writer, G. F. Simmons, author of Introduction to Topology and Modern Analysis, who stated that the two pillars of analysis were linearity and continuity -- nice remark. He also stated that really to understand, you have to chew on all the arguments, etc., or some such.
Then I decided to study the proofs really carefully, so as to "chew", and in hopes of finding techniques I could use elsewhere. When I mentioned that study technique and objective to my department Chair, his remark was "There is no time." -- he also had a point.
Commonly there is an intuitive explanation of what is going on and some views that can provide motivation to study the stuff at all.
There are a lot of books and papers. As a student, I saw a lot of the books, got copies of some of them, put them on my TODO reading list, etc. Eventually, after falling far enough behind on the list, I wondered just where all those books were coming from?
It dawned on me, profs need to publish so they do. They are also supposed to have grad students and do. Then the grad students take the advanced course by their major prof and end up with a big pile of notes. Then the grad student, as an assistant prof, wants to publish so cleans up the pile of notes and contacts the usual publishers to publish a book. The top university libraries are essentially required to buy the books, so they get published and bought. And, then, often, there the books sit, gathering dust.
I won't say that writing those books was a total waste, and I won't say that students should spend more time reading those books. Or, the books are there on the shelves. They are not really difficult to find. The books have work that was done. Maybe the work is useful now; maybe someday it will be useful; whatever, the work is done, the results found, and there in case they do become useful.
In the meanwhile, back to the mainline of math education, research, applications, usually there can be some helpful intuitive explanations and motivating example applications!
Apparently some authors just give up and assume that their books will mostly just gather dust. But once I wrote Paul Halmos, likely my favorite author, and got back a nice letter from him with "It warms the heart of an author actually to be read, and clearly understood, by ordinary humans." -- at the time I had no academic affiliation and was just reading his book on my own. So, Halmos was surprised that an ordinary human would be reading and understanding his book.
Ah, in what I wrote I left out that, also in linear algebra in n-dimensional space, the Pythagorean theorem still holds -- that is, an n-dimensional version holds!
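Concretely, the n-dimensional statement: if u and v are orthogonal vectors in R^n, then ||u + v||^2 = ||u||^2 + ||v||^2; in particular, for x = (x_1, ..., x_n), ||x||^2 = x_1^2 + ... + x_n^2.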