Hacker News | lsorber's comments

For those who want to dive deeper, here’s a 300 LOC implementation of GRPO in pure NumPy: https://github.com/superlinear-ai/microGRPO

The implementation learns to play Battleship in about 2000 steps, pretty neat!
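For context, here's a toy sketch of GRPO's core idea (my own illustration, not code from that repo): each sampled completion's reward is normalized against its group's mean and standard deviation to get an advantage, with no value network needed.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: each completion's reward is normalized
    by its own group's mean and standard deviation."""
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)

# One group of 4 sampled completions: two hit the reward, two miss.
adv = grpo_advantages(np.array([[1.0, 0.0, 0.0, 1.0]]))
print(adv)  # approximately [[ 1. -1. -1.  1.]]
```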


You don’t have to reduce a long context to a single embedding vector. Instead, you can compute the token embeddings of a long context and then pool those into say sentence embeddings.

The benefit is that each sentence’s embedding is informed by all of the other sentences in the context. So when a sentence refers to “The company” for example, the sentence embedding will have captured which company that is based on the other sentences in the context.

This technique is called ‘late chunking’ [1], and is based on another technique called ‘late interaction’ [2].

And you can combine late chunking (to pool token embeddings) with semantic chunking (to partition the document) for even better retrieval results. For an example implementation that applies both techniques, check out RAGLite [3].

[1] https://weaviate.io/blog/late-chunking

[2] https://jina.ai/news/what-is-colbert-and-late-interaction-an...

[3] https://github.com/superlinear-ai/raglite


You can achieve the same effect by using an LLM to do question answering prior to embedding. It's much more flexible but slower: you can use CoT, or even graph RAG. Late chunking is a faster, implicit alternative.


I read both those articles, but I still don't get how to do it. It seems the idea is that more of the embedding is informed by context, but how do I _do_ late chunking?

My best guess so far is that somehow I embed a long text and then I break up the returned embedding into multiple parts and search each separately? But that doesn't sound right.


The name ‘late chunking’ is indeed somewhat of a misnomer in the sense that the technique does not partition documents into document chunks. What it actually does is to pool token embeddings (of a large context) into say sentence embeddings. The result is that your document is now represented as a sequence of sentence embeddings, each of which is informed by the other sentences in the document.

Then, you want to partition the document into chunks. Late chunking pairs really well with semantic chunking because the latter can use late chunking's improved sentence embeddings to find semantically more cohesive chunks. In fact, you can cast this as a binary integer programming problem and find the ‘best’ chunks this way. See RAGLite [1] for an implementation of both techniques, including the formulation of semantic chunking as an optimization problem.

Finally, you have a sequence of document chunks, each represented as a multi-vector sequence of sentence embeddings. You could choose to pool these sentence embeddings into a single embedding vector per chunk. Or, you could leave the multi-vector chunk embeddings as-is and apply a more advanced querying technique like ColBERT's MaxSim [2].
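To make the pooling step concrete, here's a toy sketch in NumPy. The token embeddings and sentence spans are placeholders for what a real embedding model and sentence splitter would produce; this is my illustration, not RAGLite's actual API.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray,
               sentence_spans: list[tuple[int, int]]) -> np.ndarray:
    """Pool a long context's token embeddings into one embedding per sentence.

    Because the token embeddings come from a single forward pass over the
    whole context, each pooled sentence embedding is informed by every
    other sentence in that context.
    """
    return np.vstack([
        token_embeddings[start:end].mean(axis=0)  # mean pooling per span
        for start, end in sentence_spans
    ])

# Toy example: 10 tokens with 4-dim embeddings, split into two sentences.
rng = np.random.default_rng(42)
token_embeddings = rng.standard_normal((10, 4))
sentence_embeddings = late_chunk(token_embeddings, [(0, 6), (6, 10)])
print(sentence_embeddings.shape)  # (2, 4): one vector per sentence
```

In a real pipeline, `token_embeddings` would come from a long-context embedding model and the spans from a sentence splitter; the chunking step then groups consecutive sentence embeddings into chunks.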

[1] https://github.com/superlinear-ai/raglite

[2] https://huggingface.co/blog/fsommers/document-similarity-col...


What does it mean to "pool" embeddings? The first article seems to assume the reader is already familiar with the term.


“Pooling” just refers to aggregation methods. It could mean taking the max or average values, or something more exotic like attention pooling. It's meant to reduce the one-vector-per-token output down to one vector per passage or document.


You’d need to go a level below the API that most embedding services expose.

A transformer-based embedding model doesn't just give you a vector for the entire input string; it gives you a vector for each token. These are then “pooled” together (e.g. averaged, max-pooled, or other strategies) to reduce these many vectors down to a single vector.

Late chunking means changing this reduction to yield many vectors instead of just one.
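For instance, a toy sketch (assuming you have access to the per-token vectors rather than just the API's single pooled output; the numbers are made up):

```python
import numpy as np

# Suppose the model returned one 3-dim vector per token for a 6-token input.
token_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
])

# The usual API behavior: mean-pool everything into ONE vector.
single_vector = token_vectors.mean(axis=0)

# Late chunking: pool per chunk instead, yielding MANY vectors.
chunk_spans = [(0, 3), (3, 6)]
chunk_vectors = np.vstack(
    [token_vectors[a:b].mean(axis=0) for a, b in chunk_spans]
)
print(single_vector.shape, chunk_vectors.shape)  # (3,) (2, 3)
```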


Did you even need the D? Wouldn't a PI controller be sufficient?


The point is that once you know the word PID you can read about the concept and reach conclusions such as yours.


The project in the article is already using a Pi controller.


If TSMC buys its lithography machines, why should it even get any credit for 5nm at all?


Could you give an example of an unsolved riddle from linguistics?


In my opinion, the best solution to these issues is to:

1. Declare numbers as numbers in the configuration language. E.g. "decimal(1e1000)".

2. Parse declared numbers with a lossless format like Python's decimal.Decimal.

3. Let users decide at their own risk if they want to convert to a lossy format like float.
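A minimal sketch of those three steps (the `decimal(...)` declaration syntax is the hypothetical one from point 1, not an existing standard):

```python
import re
from decimal import Decimal

def parse_config_number(declaration: str) -> Decimal:
    """Step 2: parse a declared number losslessly with decimal.Decimal."""
    match = re.fullmatch(r"decimal\((.+)\)", declaration.strip())
    if match is None:
        raise ValueError(f"not a declared number: {declaration!r}")
    return Decimal(match.group(1))

number = parse_config_number("decimal(1e1000)")  # step 1's declaration
assert number == Decimal("1e1000")               # no precision lost
lossy = float(number)  # step 3: user opts into a lossy float (here: inf)
```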


Where's the data that says Moore's law no longer holds? I see comments and articles asserting this, but every time without evidence. The data that I do find certainly suggests Moore's law is doing fine.


Moore's law is still on pace, but Dennard scaling broke down about 15 years ago. That means we're no longer getting the speedups we were used to, and that gets reported as "Moore's law has broken down" in headlines.

https://en.wikipedia.org/wiki/Dennard_scaling


I did not know about Dennard scaling, thanks for that link. Any idea why it broke down?


Dennard scaling held until the approximately linear relations that made it work ended: delay stopped being dominated by the gate, and, more importantly, voltage couldn't drop forever, due to material limits and intrinsic silicon limits.


The trick is to look at transistor density. People will take the transistor count of, say, an Epyc CPU, but that is 9 chiplets with a huge total die area. Transistor counts doubling by doubling the die area isn't Moore's law.


Doubling the number of transistors on a die while reducing the power usage isn't Moore's law?

If I could swap out a chiplet, I might be inclined to agree with you.


I'm writing this comment on an 11-year-old machine. It doesn't feel much slower than the 1-year-old machine that I use at work. The number of transistors in its processor is also not 1/32nd or whatever of the office machine's (especially if you don't count cache). Its processor has a 45nm feature size; the modern machine's CPU is 14nm. In what way do you think Moore's law still holds?


First-generation Core i7 (960), 45nm, October 2009: 263mm^2 die area, 4 cores, 4 threads, 3.2-3.46GHz, 731 million transistors, 130W TDP.

Tenth-generation mobile Core i7 (i7-10875H), 14nm, Q2 2020: 25mm^2 per set of two cores[0], 8 cores, 16 threads, 2.3-5.1GHz. At 43 million transistors per square mm[1], that's over 4.3 billion transistors, at 45W TDP.

So you get 4 times the threads, at a hair under twice the speed (without stressing turbo boost much), at a third of the power at the wall, going from a top-of-the-line desktop CPU in October 2009 to an "it's OK" mobile CPU from April of this year.

Intel stopped publishing die sizes and transistor counts (anywhere I can find them). I found the [0] and [1] figures in a comparison between TSMC's 7nm fab runs and Intel's claimed 14nm fab runs, and I was only able to briefly confirm them for the mobile tenth-gen i7, not the desktops. The desktop CPUs are some 400mm^2 larger in physical size; no idea about die size. Sorry. Intel stopped publishing those, I guess, around the time Ryzen came out or slightly before.

Edit: 11 years is 5.5 doublings of transistors IIRC, so you should really compare like to like, but I don't have that much free time. My laptop's CPU has 7.8 billion transistors just for the cores, and an additional 2 billion for I/O.

Sorry, this information isn't all in one precise place, and I'm unused to HN's comment forms.


Sounds great until the client realises they can hire someone else who does charge by the hour, saving them a massive 100k - 10k = 90k compared to your proposition.


Or they look at your proposal for $100k and say, "Mmmmm... maybe next year" and the project never happens.

Really depends on the client and your reputation though, you can definitely pull it off with the right pitch to the right people at the right time.


Yeah, your ability to charge more scales with your proven ability to deliver. That's where referrals/portfolios/testimonials/case studies come in.

Basically, you want to charge a percentage of the value you're promising to provide for them. And then that's going to get multiplied by their confidence in you.

If I'm bidding on a project that I think can make $1MM/yr for my client, and I normally charge 10% of the first year's worth of value, then I'm looking at $100K. If the business I'm working with only has a 50% confidence that I'm a safe bet to produce that value, or an 80% confidence, that's going to get reflected in what price we negotiate to. Probably more like $50k in the 50% confidence bucket.

But the more proven you are, the more you can say, "I said this thing would make $2MM and by gum it actually made $5MM", and the more you're going to be able to charge.

N.B. That 50% isn't "It's a coin toss whether or not they'll complete the project", it's more like "When they complete the project, we're confident that we're going to get at least 50% of the value they're selling us on"


Are you sure about that? It depends on how Cloudflare defines what a cold start is. It might well include the initial loading of your code, with imports and init.


Have you benchmarked this against pickling those data files? In our experience, parquet's overhead isn't worth it for smaller data files.


I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.


Huh, makes a pretty big difference for us. We were using pandas' built-in to_parquet though, which seems to suffer from some overhead.


I'm not surprised, Parquet's columnar encoding and compression won't really kick in significantly for smaller files.


But with pickling you can only read the data in Python.


If pickling is what is working best for you, it can't be much data.

