Hacker News | lsorber's comments

For those who want to dive deeper, here’s a 300 LOC implementation of GRPO in pure NumPy: https://github.com/superlinear-ai/microGRPO

The implementation learns to play Battleship in about 2000 steps, pretty neat!
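For context, here's a toy sketch of GRPO's core idea (my own illustration, not code from that repo): each sampled completion's reward is normalized against its group's mean and standard deviation to get an advantage, with no value network needed.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: each completion's reward is normalized
    by its own group's mean and standard deviation."""
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)

# One group of 4 sampled completions: two hit the reward, two miss.
adv = grpo_advantages(np.array([[1.0, 0.0, 0.0, 1.0]]))
print(adv)  # approximately [[ 1. -1. -1.  1.]]
```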


You don’t have to reduce a long context to a single embedding vector. Instead, you can compute the token embeddings of a long context and then pool those into say sentence embeddings.

The benefit is that each sentence’s embedding is informed by all of the other sentences in the context. So when a sentence refers to “The company” for example, the sentence embedding will have captured which company that is based on the other sentences in the context.

This technique is called ‘late chunking’ [1], and is based on another technique called ‘late interaction’ [2].

And you can combine late chunking (to pool token embeddings) with semantic chunking (to partition the document) for even better retrieval results. For an example implementation that applies both techniques, check out RAGLite [3].

[1] https://weaviate.io/blog/late-chunking

[2] https://jina.ai/news/what-is-colbert-and-late-interaction-an...

[3] https://github.com/superlinear-ai/raglite


You can achieve the same effect by using an LLM to do question answering prior to embedding. It's much more flexible but slower: you can use CoT, or even graph RAG. Late chunking is a faster, implicit alternative.


I read both those articles, but I still don't get how to do it. It seems the idea is that more of the embedding is informed by context, but how do I _do_ late chunking?

My best guess so far is that somehow I embed a long text and then I break up the returned embedding into multiple parts and search each separately? But that doesn't sound right.


The name ‘late chunking’ is indeed somewhat of a misnomer in the sense that the technique does not partition documents into document chunks. What it actually does is to pool token embeddings (of a large context) into say sentence embeddings. The result is that your document is now represented as a sequence of sentence embeddings, each of which is informed by the other sentences in the document.

Then, you want to partition the document into chunks. Late chunking pairs really well with semantic chunking because the latter can use late chunking's improved sentence embeddings to find semantically more cohesive chunks. In fact, you can cast this as a binary integer programming problem and find the ‘best’ chunks this way. See RAGLite [1] for an implementation of both techniques, including the formulation of semantic chunking as an optimization problem.

Finally, you have a sequence of document chunks, each represented as a multi-vector sequence of sentence embeddings. You could choose to pool these sentence embeddings into a single embedding vector per chunk. Or, you could leave the multi-vector chunk embeddings as-is and apply a more advanced querying technique like ColBERT's MaxSim [2].
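To make the pooling step concrete, here's a toy sketch in NumPy. The token embeddings and sentence spans are placeholders for what a real embedding model and sentence splitter would produce; this is my illustration, not RAGLite's actual API.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray,
               sentence_spans: list[tuple[int, int]]) -> np.ndarray:
    """Pool a long context's token embeddings into one embedding per sentence.

    Because the token embeddings come from a single forward pass over the
    whole context, each pooled sentence embedding is informed by every
    other sentence in that context.
    """
    return np.vstack([
        token_embeddings[start:end].mean(axis=0)  # mean pooling per span
        for start, end in sentence_spans
    ])

# Toy example: 10 tokens with 4-dim embeddings, split into two sentences.
rng = np.random.default_rng(42)
token_embeddings = rng.standard_normal((10, 4))
sentence_embeddings = late_chunk(token_embeddings, [(0, 6), (6, 10)])
print(sentence_embeddings.shape)  # (2, 4): one vector per sentence
```

In a real pipeline, `token_embeddings` would come from a long-context embedding model and the spans from a sentence splitter; the chunking step then groups consecutive sentence embeddings into chunks.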

[1] https://github.com/superlinear-ai/raglite

[2] https://huggingface.co/blog/fsommers/document-similarity-col...


What does it mean to "pool" embeddings? The first article seems to assume the reader is already familiar with the term.


“Pooling” just refers to aggregation methods. It could mean taking the max or average values, or something more exotic like attention pooling. It's meant to reduce the one-vector-per-token output down to one vector per passage or document.


You’d need to go a level below the API that most embedding services expose.

A transformer-based embedding model doesn't just give you a vector for the entire input string; it gives you a vector for each token. These are then “pooled” together (e.g. averaged, max-pooled, or other strategies) to reduce these many vectors down to a single vector.

Late chunking means changing this reduction to yield many vectors instead of just one.
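For instance, a toy sketch (assuming you have access to the per-token vectors rather than just the API's single pooled output; the numbers are made up):

```python
import numpy as np

# Suppose the model returned one 3-dim vector per token for a 6-token input.
token_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
])

# The usual API behavior: mean-pool everything into ONE vector.
single_vector = token_vectors.mean(axis=0)

# Late chunking: pool per chunk instead, yielding MANY vectors.
chunk_spans = [(0, 3), (3, 6)]
chunk_vectors = np.vstack(
    [token_vectors[a:b].mean(axis=0) for a, b in chunk_spans]
)
print(single_vector.shape, chunk_vectors.shape)  # (3,) (2, 3)
```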


Did you even need the D? Wouldn't a PI controller be sufficient?


The point is that once you know the word PID you can read about the concept and reach conclusions such as yours.


The project in the article is already using a Pi controller.


If TSMC buys its lithography machines, why should it even get any credit for 5nm at all?


Could you give an example of an unsolved riddle from linguistics?


In my opinion, the best solution to these issues is to:

1. Declare numbers as numbers in the configuration language. E.g. "decimal(1e1000)".

2. Parse declared numbers with a lossless format like Python's decimal.Decimal.

3. Let users decide at their own risk if they want to convert to a lossy format like float.
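A minimal sketch of those three steps (the `decimal(...)` declaration syntax is the hypothetical one from point 1, not an existing standard):

```python
import re
from decimal import Decimal

def parse_config_number(declaration: str) -> Decimal:
    """Step 2: parse a declared number losslessly with decimal.Decimal."""
    match = re.fullmatch(r"decimal\((.+)\)", declaration.strip())
    if match is None:
        raise ValueError(f"not a declared number: {declaration!r}")
    return Decimal(match.group(1))

number = parse_config_number("decimal(1e1000)")  # step 1's declaration
assert number == Decimal("1e1000")               # no precision lost
lossy = float(number)  # step 3: user opts into a lossy float (here: inf)
```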


Where's the data that says Moore's law no longer holds? I see comments and articles asserting this, but every time without evidence. The data that I do find certainly suggests Moore's law is doing fine.


Moore's law is still on pace, but Dennard scaling broke down about 15 years ago. That means we're no longer getting the speedups we were used to, and that gets reported as "Moore's law has broken down" in headlines.

https://en.wikipedia.org/wiki/Dennard_scaling


I did not know about Dennard scaling, thanks for that link. Any idea why it broke down?


Dennard scaling held until the approximately linear relations that made it work ended: delay stopped being dominated by the gate, and, more importantly, voltage couldn't drop forever, due to material limits and intrinsic silicon limits.


The trick is to look at transistor density. People will take the transistor count of, say, an Epyc CPU, but that is 9 chiplets with a huge total die area. Transistor counts doubling by doubling the die area isn't Moore's law.


Doubling the number of transistors on a die while reducing the power usage isn't Moore's law?

If I could swap out a chiplet, I might be inclined to agree with you.


I'm writing this comment on an 11-year-old machine. It doesn't feel much slower than the 1-year-old machine that I use at work. The number of transistors in its processor is also not 1/32nd or whatever of the office machine's (especially if you don't count cache). Its processor has a 45nm feature size; the modern machine's CPU is 14nm. In what way do you think Moore's law still holds?


First-generation Core i7 (960), 45nm, October 2009: 263mm^2 die area, 4 cores, 4 threads, 3.2-3.46GHz, 731 million transistors, 130W TDP.

Tenth-generation mobile Core i7 (i7-10875H), 14nm, Q2 2020: 25mm^2 per set of two cores[0], 8 cores, 16 threads, 2.3-5.1GHz. At 43 million transistors per square mm[1], that's over 4.3 billion transistors, at 45W TDP.

So you get 4 times the threads, at a hair under twice the speed (without stressing turbo boost much), at a third of the power at the wall, going from a top-of-the-line desktop CPU in October 2009 to an "it's OK" mobile CPU from April of this year.

Intel stopped publishing die sizes and transistor counts (anywhere I can find them). I found the [0] and [1] figures in a comparison between TSMC's 7nm fab runs and Intel's claimed 14nm fab runs, and I was only able to briefly confirm them for the mobile tenth-gen i7, not the desktops. The desktop CPUs are some 400mm^2 larger in physical size; no idea about die size. Sorry. Intel stopped publishing those, I guess, around the time Ryzen came out or slightly before.

Edit: 11 years is 5.5 doublings of transistors IIRC, so you should really compare like to like, but I don't have that much free time. My laptop's CPU has 7.8 billion transistors just for the cores, and an additional 2 billion for I/O.

Sorry, this information isn't all in one precise place, and I'm unused to HN's comment forms.


Sounds great until the client realises they can hire someone else who does charge by the hour, saving them a massive 100k - 10k = 90k compared to your proposition.


Or they look at your proposal for $100k and say, "Mmmmm... maybe next year" and the project never happens.

Really depends on the client and your reputation though, you can definitely pull it off with the right pitch to the right people at the right time.


Yeah, your ability to charge more scales with your proven ability to deliver. That's where referrals/portfolios/testimonials/case studies come in.

Basically, you want to charge a percentage of the value you're promising to provide for them. And then that's going to get multiplied by their confidence in you.

If I'm bidding on a project that I think can make $1MM/yr for my client, and I normally charge 10% of the first year's worth of value, then I'm looking at $100K. If the business I'm working with only has a 50% confidence that I'm a safe bet to produce that value, or an 80% confidence, that's going to get reflected in what price we negotiate to. Probably more like $50k in the 50% confidence bucket.

But the more proven you are, the more you can say, "I said this thing would make $2MM and by gum it actually made $5MM", and the more you're going to be able to charge.

N.B. That 50% isn't "It's a coin toss whether or not they'll complete the project", it's more like "When they complete the project, we're confident that we're going to get at least 50% of the value they're selling us on"


Are you sure about that? It depends on how Cloudflare defines what a cold start is. It might well include the initial loading of your code, with imports and init.


Have you benchmarked this against pickling those data files? In our experience, parquet's overhead isn't worth it for smaller data files.


I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.


Huh, makes a pretty big difference for us. We were using pandas' built-in to_parquet though, which seems to suffer from some overhead.


I'm not surprised, Parquet's columnar encoding and compression won't really kick in significantly for smaller files.


But with pickling you can only read the data in Python.


If pickling is what is working best for you, it can't be much data.

