
Reminds me of this:

"""On the first day of class, Jerry Uelsmann, a professor at the University of Florida, divided his film photography students into two groups.

Everyone on the left side of the classroom, he explained, would be in the “quantity” group. They would be graded solely on the amount of work they produced. On the final day of class, he would tally the number of photos submitted by each student. One hundred photos would rate an A, ninety photos a B, eighty photos a C, and so on.

Meanwhile, everyone on the right side of the room would be in the “quality” group. They would be graded only on the excellence of their work. They would only need to produce one photo during the semester, but to get an A, it had to be a nearly perfect image.

At the end of the term, he was surprised to find that all the best photos were produced by the quantity group. During the semester, these students were busy taking photos, experimenting with composition and lighting, testing out various methods in the darkroom, and learning from their mistakes. In the process of creating hundreds of photos, they honed their skills. Meanwhile, the quality group sat around speculating about perfection. In the end, they had little to show for their efforts other than unverified theories and one mediocre photo."""

from https://www.thehuntingphotographer.com/blog/qualityvsquantit...


https://cafe.io is a DNS service for AWS EC2 that keeps up with ever-changing instance IPs when you cannot use an Elastic IP (for example with an ASG) or when you don't want to install any third-party clients on your instances.

It fetches the IPs regularly via the AWS API and assigns them to fixed subdomains.
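
A rough sketch of the general idea (not cafe.io's actual implementation; the hosted zone ID, domain suffix, and tag filter below are made up) using boto3 to poll instance IPs and upsert them into Route 53:

  # Hypothetical sketch of the general idea: poll EC2 for current public IPs
  # and upsert each one into a fixed, per-instance DNS name. The zone ID,
  # domain suffix, and tag filter are made up for illustration.
  import boto3

  ec2 = boto3.client("ec2")
  route53 = boto3.client("route53")

  HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # hypothetical hosted zone
  DOMAIN_SUFFIX = "asg.example.com"    # hypothetical domain

  reservations = ec2.describe_instances(
      Filters=[
          {"Name": "tag:aws:autoscaling:groupName", "Values": ["my-asg"]},
          {"Name": "instance-state-name", "Values": ["running"]},
      ]
  )["Reservations"]

  changes = []
  for reservation in reservations:
      for instance in reservation["Instances"]:
          ip = instance.get("PublicIpAddress")
          if not ip:
              continue
          changes.append({
              "Action": "UPSERT",
              "ResourceRecordSet": {
                  "Name": f"{instance['InstanceId']}.{DOMAIN_SUFFIX}",
                  "Type": "A",
                  "TTL": 60,
                  "ResourceRecords": [{"Value": ip}],
              },
          })

  if changes:
      route53.change_resource_record_sets(
          HostedZoneId=HOSTED_ZONE_ID,
          ChangeBatch={"Changes": changes},
      )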

It is pretty new :) and still under active development.


When I worked at RSA over a decade ago, we developed Bloom filter-based indexing to speed up querying on a proprietary database that was specialised for storing petabytes of network events and packet data. I implemented the core Bloom filter-based indexer based on MurmurHash2 functions and I was quite proud of the work I did back then. The resulting improvement in query performance looked impressive to our customers. I remember the querying speed went up from roughly 49,000 records per second to roughly 1,490,000 records per second, so nearly a 30-fold increase.

However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.
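
As a rough illustration of the idea (not the original implementation; salted SHA-256 stands in for MurmurHash2 here, and all names are made up), a per-block Bloom filter index looks something like this:

  # Sketch of a per-block Bloom filter index. Not the original code:
  # salted SHA-256 stands in for MurmurHash2, and the names are made up.
  import hashlib

  class BloomFilter:
      def __init__(self, m_bits=10007, k_hashes=4):
          self.m = m_bits
          self.k = k_hashes
          self.bits = bytearray((m_bits + 7) // 8)

      def _positions(self, key: bytes):
          for seed in range(self.k):
              digest = hashlib.sha256(bytes([seed]) + key).digest()
              yield int.from_bytes(digest[:8], "big") % self.m

      def add(self, key: bytes):
          for pos in self._positions(key):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def might_contain(self, key: bytes) -> bool:
          # False means definitely absent; True means "probably present".
          return all(self.bits[pos // 8] & (1 << (pos % 8))
                     for pos in self._positions(key))

  # One filter per block of records: the query engine only reads the blocks
  # whose filter says the key might be present.
  def blocks_to_read(block_filters, key: bytes):
      return [i for i, bf in enumerate(block_filters) if bf.might_contain(key)]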

With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007) and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-k * n / m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
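
The theoretical rate falls straight out of the standard formula; a quick check with the parameters above:

  # Theoretical false-positive rate for k = 4, m = 10007, n = 1000.
  import math

  k, m, n = 4, 10007, 1000
  fp_rate = (1 - math.exp(-k * n / m)) ** k
  print(f"{fp_rate:.4f}")  # 0.0118, i.e. about 1.18%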

The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like how CPU branch misprediction affects performance but not correctness.

A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of just about 5 seconds, or, for larger queries, reducing a 30-minute wait to about 1 minute.


You can use it as a read layer for a specific metadata JSON URL or for a table in a REST catalog. The latter got merged quite recently and is not yet in the docs.

In any query engine you can execute the same query in different ways. The more restrictions you can apply on the DuckDB side, the less data you need to return to Postgres.

For instance, you could compute `SELECT COUNT(*) FROM mytable WHERE first_name = 'David'` by querying all the rows from `mytable` on the DuckDB side, returning them to Postgres, and letting Postgres itself count the number of results, but this is extremely inefficient, since the same value can be computed remotely.
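
To make the trade-off concrete outside of pg_lake (the file name is hypothetical and this uses the DuckDB Python client rather than pgduck_server), compare returning one aggregated value with returning every matching row:

  # Illustration only: hypothetical data file, DuckDB Python client rather
  # than pg_lake/pgduck_server.
  import duckdb

  con = duckdb.connect()

  # Pushed down: DuckDB does the counting and a single value comes back.
  count = con.execute(
      "SELECT count(*) FROM 'mytable.parquet' WHERE first_name = 'David'"
  ).fetchone()[0]

  # Not pushed down: every matching row comes back and the caller counts
  # them itself, shipping far more data for the same answer.
  rows = con.execute(
      "SELECT * FROM 'mytable.parquet' WHERE first_name = 'David'"
  ).fetchall()
  assert count == len(rows)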

In a simple query like this with well-defined semantics that match between Postgres and DuckDB, you can run the query entirely on the remote side, just using Postgres as a go-between.

Not all functions and operators work in the same way between the two systems, so you cannot just push things down unconditionally; `pg_lake` does some analysis to see what can run on the DuckDB side and what needs to stick around on the Postgres side.

There is only a single "executor" from the perspective of pg_lake, but the pgduck_server embeds a multi-threaded duckdb instance.

How DuckDB executes the portion of the query it gets is up to it; it often involves parallelism, and it can use metadata about the files it is querying to speed up its own processing without even needing to visit every file. For instance, it can look at the `first_name` filter in the incoming query and skip any files whose min_value/max_value range could not contain that value.
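
The same kind of metadata pruning can be seen directly in Parquet footers; a hedged sketch (file and column names are made up, and this uses pyarrow rather than DuckDB's internals) of skipping files whose statistics rule a value out:

  # Sketch of metadata-based file skipping (hypothetical files and column,
  # using pyarrow rather than DuckDB's internals): read only the footers and
  # skip files whose min/max statistics show the value cannot be present.
  import pyarrow.parquet as pq

  def files_that_might_match(paths, column, value):
      candidates = []
      for path in paths:
          pf = pq.ParquetFile(path)
          col_idx = pf.schema_arrow.names.index(column)
          meta = pf.metadata
          keep = False
          for rg in range(meta.num_row_groups):
              stats = meta.row_group(rg).column(col_idx).statistics
              if stats is None or not stats.has_min_max:
                  keep = True  # no stats: cannot rule this file out
                  break
              if stats.min <= value <= stats.max:
                  keep = True  # value falls inside the min/max range
                  break
          if keep:
              candidates.append(path)
      return candidates

  # Only the candidate files need to be scanned for first_name = 'David'.
  to_scan = files_that_might_match(
      ["part-0.parquet", "part-1.parquet"], "first_name", "David")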


When we first developed pg_lake at Crunchy Data and defined its go-to-market (GTM), we considered whether it could be a Snowflake competitor, but we quickly realised that did not make sense.

Data platforms like Snowflake are built as a central place to collect your organisation's data, do governance, large scale analytics, AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server.

pg_lake foremost targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, as well as accidental Postgres data warehouses, where you ended up overloading your server with slow analytical queries but still want to keep using Postgres.


Not here to shill for AWS, but S3 Vectors is hands down the SOTA here. That combined with a Bedrock Knowledge Base to handle Discovery/Rebalance tasks makes for the simplest implementation on the market.

Once Bedrock KB backed by S3 Vectors is released from beta, it'll eat everybody's lunch.


I got your model working on CPU on macOS by having Claude Code hack away furiously for a while. Here's a script that should work for anyone: https://gist.github.com/simonw/912623bf00d6c13cc0211508969a1...

You can run it like this:

  cd /tmp
  git clone https://huggingface.co/sdobson/nanochat
  uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
    --model-dir /tmp/nanochat \
    --prompt "Tell me about dogs."

> But his biggest users (tigerbeetle and bun especially) will only be taken seriously once Zig is 1.0.

TB is only 5 years old but is already migrating some of the largest brokerages, exchanges and wealth managers in their respective jurisdictions.

Zig’s quality for us here holds up under some pretty extreme fuzzing (a fleet of 1000 dedicated CPU cores), Deterministic Simulation Testing and Jepsen auditing (TB did 4x the typical audit engagement duration), and is orthogonal to 1.0 backwards compatibility.

Zig version upgrades for our team are no big deal, compared to the difficulty of the consensus and local storage engine challenges we work on, and we vendor most of our std lib usage in stdx.

> They’ll nudge him towards 1.0.

On the contrary, we want Andrew to take his time and get it right on the big decisions, because the half life of these projects can be decades.

We’re in no rush. For example, TigerBeetle is designed to power the next 30 years of transaction processing and Zig’s trajectory here is what’s important.

That said, Zig and Zig’s toolchain today are already better, at least for our purposes, than anything else we considered using.


Kafka isn’t a queue, it’s a distributed log. A partitioned topic can take very large volumes of message writes, persist them indefinitely, deliver them to any subscriber in-order and at-least-once (even for subscribers added after the message was published), and do all of that distributed and HA.
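
For a feel of what that buys you, a minimal sketch with kafka-python (topic, broker address, and group id are made up): a subscriber added long after the messages were published can still replay the whole partitioned log, in order and at-least-once:

  # Minimal sketch with kafka-python; topic, broker, and group id are made up.
  from kafka import KafkaConsumer

  consumer = KafkaConsumer(
      "orders",                        # hypothetical topic
      bootstrap_servers="localhost:9092",
      group_id="late-subscriber",      # hypothetical consumer group
      auto_offset_reset="earliest",    # start from the oldest retained record
      enable_auto_commit=False,        # commit only after processing succeeds
  )

  for record in consumer:
      # Records arrive in order within each partition. Committing only after
      # processing gives at-least-once delivery.
      print(record.partition, record.offset, record.value)
      consumer.commit()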

If you need all those things, there just are not a lot of options.


(Since DX is already explained...)

Grafana/Prom/Loki is an awesome stack. Overall I'd say that we try to correlate more signals in one place (your logs <> traces <> session replay), and we also take a more developer-friendly approach to querying instead of going the PromQL/LogQL route.

It's a stack I really wanted to love myself as well, but I've personally run into a few issues when using it:

Loki is a handful to get right: you have to think about your labels, they can't be high-cardinality (e.g. IDs), search is really slow if the field isn't a label, and the syntax is complex because it's derived from PromQL, which I don't think is a good fit for logs. This means an engineer on your team can't just jump in and start typing keywords to match on, nor can they just emit logs and know they can quickly find them again in prod. Engineers need to filter logs by a label first and then wait for a regex to run if they want full-text search.

Prometheus is pretty good; my only qualm is again the approachability of PromQL. It's rare to see an engineer who isn't already fluent with time-series/metrics systems pick up all the concepts quickly. This means that metrics access is largely limited to premade dashboards or a certain set of engineers who know the Prometheus setup really well.

Grafana has definitely set the standard for OSS metrics, but I personally haven't had a lot of success using their tools outside of metrics, though ymmv and it's all about the tradeoffs you're looking for in an observability tool.


what is DX?

why not grafana / prometheus / loki?


God, ClickHouse is such great software; if only it were as ergonomic as DuckDB, and management wasn't doing some questionable things (deleting references to competitors in GH issues, weird legal letters, etc.)

The CH contributors are really stellar, from multiple companies (Altinity, Tinybird, Cloudflare, ClickHouse)


People passing cynical comments at Google need to understand that at a big co like Google, something like this doesn't 'just happen'. It probably happened because some passionate L6/L7 engineer wanted to do it and pushed through the bureaucracy to get approvals for it, probably largely on their own time (by which I mean that this was at best a side-project for them and at worst a distraction that was losing them favor with their bosses). At every point in the process, they probably had to justify what they were doing to their leads, to lawyers, to privacy reviewers, who had no real stake in it and so had nothing to lose by saying No. They almost certainly won't receive any career progress out of this and would risk a setback if something slips through the cracks (such as some unredacted proprietary information).

They did it because they felt it was the right thing to do. Good things happen through the actions of individuals like this. We should acknowledge and celebrate it when they do, anti-big-tech cynicism can wait.


The website http://astronaut.io/ does a similar thing but for recent videos, and not just from iPhones. From the home page:

> These videos come from YouTube. They were uploaded in the last week and have titles like DSC 1234 and IMG 4321. They have almost zero previous views. They are unnamed, unedited, and unseen (by anyone but you).

At one point you might be at a school recital in Malaysia, and the next minute you are at a birthday in Ecuador. It's amazing!


I've worked in the field, even with the biggest and fastest ECUs in existence, in Formula 1, with 10 kHz cycles on 4 GHz CPUs.

We never saw the need to use the other cores. The CPU is a few million times faster than the cycles; if there is a problem, it's a hardware I/O issue or a blocking driver. Even with normal 500 Hz ECUs you can always raise the CPU frequency to run faster rather than adding more cores. Even better would be to use better drivers with zero copy and MMU access.

It's nice that threading recently got better in embedded real-time OSes; hardware always gets better, but drivers and software certainly do not, and developers even less so. Maybe they think Rust will help, but then they should choose a properly concurrency-safe system, and certainly not Rust, which only lies about it.

Also, this thing is only simulation-verified, not formally verified.


Yes, there are tons of resources but I'll try to offer some simple tips.

1. Sales is a lot like golf. You can make it so complicated as to be impossible or you can simply walk up and hit the ball. I've been leading and building sales orgs for almost 20 years and my advice is to walk up and hit the ball.

2. Sales is about people and it's about problem solving. It is not about solutions or technology or chemicals or lines of code or artichokes. It's about people and it's about solving problems.

3. People buy 4 things and 4 things only. Ever. Those 4 things are time, money, sex, and approval/peace of mind. If you try selling something other than those 4 things you will fail.

4. People buy aspirin, always. They buy vitamins only occasionally and at unpredictable times. Sell aspirin.

5. I say in every talk I give: "all things being equal people buy from their friends. So make everything else equal then go make a lot of friends."

6. Being valuable and useful is all you ever need to do to sell things. Help people out. Send interesting posts. Write birthday cards. Record videos sharing your ideas for growing their business. Introduce people who would benefit from knowing each other then get out of the way, expecting nothing in return. Do this consistently and authentically and people will find ways to give you money. I promise.

7. No one cares about your quota, your payroll, your opex, your burn rate, etc. No one. They care about the problem you are solving for them.

There is more than 100 trillion dollars in the global economy just waiting for you to breathe it in. Good luck.


We repair these as part of our business, and to be clear, both the keyboards and the screens are failing on these at an alarming rate.

iFixit detailed the issues with the screens, which (in Apple's unending quest for "thinness") use a thinner flex cable to connect the display to the rest of the laptop. This thinner cable is prone to breakage, and we are already seeing 2016-2017 MacBook Pros in our shop regularly for this issue.

Since Apple built the flex cable into the display, the only solution (even from third parties like us) is a new display. At $600-$700 each, this is unacceptable.

And, like the keyboards, this is a part that's pretty much guaranteed to fail (unless you basically never open your laptop.)

Apple hasn't announced a fix yet, even with a petition with over 11,000 signatures, and more screens failing by the day.

From the time the keyboard issues happened, I made a strong recommendation to avoid buying these. If you can do your work on a PC, do so. (Personally, I now use a Dell XPS 15 as a "desktop replacement", and kept my old 2013 MacBook Pro around too.) If you need a Mac, consider a desktop version (with a SSD!), or stick with the 2015 or older MacBooks.

Even if you think the keyboard issues are fixed, consider too that this is the 4th generation of these keyboards--and Apple promised that the 2nd and 3rd generation would fix these as well. This plus the screen issues means switching to PC if you need speed should be a serious consideration.

iFixit article on "stage light" display issues/"flexgate": https://ifixit.org/blog/12903/flexgate/


I am not sure whether WhatsApp engineers chose ejabberd because it was written in Erlang or because ejabberd was the de facto implementation of XMPP. They stumbled upon and fixed bugs in BEAM/OTP at their scale [0][1][2]. They also ran FreeBSD (for its superior networking?) on bare-metal hosts running customised system images [3] and employed networking experts at some point.

[0] https://www.youtube-nocookie.com/embed/c12cYAUTXXs

[1] https://www.youtube-nocookie.com/embed/wDk6l3tPBuw

[2] https://www.youtube-nocookie.com/embed/93MA0VUWP9w

[3] https://www.youtube-nocookie.com/embed/TneLO5TdW_M


> I just want to say this is a very dangerous assumption to make.

I think we're actually arguing the same points here. It's not that every use case needs single-digit millisecond latencies! There are plenty of use cases that are satisfied by batch jobs running every hour or every night.

But when you do need real-time processing, the current infrastructure is insufficient. When you do need single-digit millisecond latency, running your batch jobs every second, or every millisecond, is computationally infeasible. What you need is a reactive, streaming infrastructure that's as powerful as your existing batch infrastructure. Existing streaming infrastructure requires you to make tradeoffs on consistency, computational expressiveness, or both; we're rapidly evolving Materialize so that you don't need to compromise on either point.

And once you have a streaming data warehouse in place for the use cases that really demand single-digit latencies, you might as well plug your analysts and data scientists into that same warehouse, so you're not maintaining two separate data warehouses. That's what we mean by ideal: not only does it work for the systems with real-time requirements, but it works just as well for the humans with looser requirements.

To give you an example, let me respond to this point directly:

> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.

The idea is that you would have your analysts write these ETL pipelines directly in Materialize. If you can express the cleaning/de-duplication/aggregation/projection in SQL, Materialize can incrementally maintain it for you. I'm familiar with a fair few ETL pipelines that are just SQL, though there are some transformations that are awkward to express in SQL. Down the road we might expose something closer to the raw differential dataflow API [0] for power users.

[0]: https://github.com/TimelyDataflow/differential-dataflow
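
As a sketch of what that looks like from the analyst's side (connection details, table, and column names are hypothetical, and source setup is omitted), an incrementally maintained transformation is just SQL submitted over the Postgres wire protocol:

  # Hedged sketch: connection details, table and column names are hypothetical,
  # and creating the source is omitted. Materialize speaks the Postgres wire
  # protocol, so a plain Postgres driver works.
  import psycopg2

  conn = psycopg2.connect("postgresql://materialize@localhost:6875/materialize")
  conn.autocommit = True
  cur = conn.cursor()

  # A dedupe + aggregation step written once as SQL; Materialize keeps the
  # result incrementally up to date as new events arrive.
  cur.execute("""
      CREATE MATERIALIZED VIEW order_totals AS
      SELECT customer_id,
             count(*)    AS order_count,
             sum(amount) AS total_spent
      FROM (SELECT DISTINCT ON (order_id) * FROM raw_orders) AS deduped
      GROUP BY customer_id
  """)

  # Reads are cheap: the view is maintained continuously, not recomputed
  # per query.
  cur.execute("SELECT * FROM order_totals WHERE total_spent > 1000")
  print(cur.fetchall())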


I work in this field so I'm incredibly biased: automated business solutions that cut entry-level data employees out of the equation. You save TONS on the bottom line, and cut out human-driven processes that are error-prone and difficult to manage. I'm talking about things beyond "API-driven dev", more in the realms of Puppeteer, Microsoft Office automation, screen-scraping (mouse/keyboard), etc. I make APIs out of things that other devs balk at - and trust me, it has a lot of market value.
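
To make "APIs out of things other devs balk at" concrete, here is a hedged toy example (the coordinates, window layout, and field order are entirely made up) of wrapping a mouse/keyboard-driven step in a callable function:

  # Toy sketch only: coordinates, window layout, and field order are made up.
  # The point is wrapping a screen-driven workflow in a function that the
  # rest of your code can call like any other API.
  import pyautogui

  def enter_invoice(invoice_id: str, amount: str) -> None:
      """Type an invoice into a legacy desktop app that has no real API."""
      pyautogui.click(200, 150)              # focus the invoice-ID field
      pyautogui.write(invoice_id, interval=0.05)
      pyautogui.press("tab")                 # move to the amount field
      pyautogui.write(amount, interval=0.05)
      pyautogui.hotkey("ctrl", "s")          # save the record

  enter_invoice("INV-1042", "199.99")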

This isn't as "up and coming" as all of the other items people are mentioning, but I'd put it on a "always increasing in popularity" trajectory due to an ever-increasing need. It's not really sexy or interesting, but there will always be a HUGE market for the things that I can do =)

I will warn people that "up and coming" tech is often fad-based and has boom and bust cycles, and personally I'd rather be working for a paycheck than waiting to win the lottery in this regard.

