One upside of the deterministic schemes is that they carry provenance/lineage: you can literally "trace up" the history back to the original ID giver.
Kinda has me curious about how much information is required to represent any arbitrary provenance tree/graph on a network of N-nodes/objects (entirely via the self-described ID)?
(Thinking out loud in the comment: worst case is a linear chain, and if the full provenance must be recoverable from the ID alone, it scales as O(N x id_size), which is quite bad. But assuming the "best case" (any node is expected to be log(N) steps from the root, i.e. depth log(N)), it feels like global_id_size = log(N) x local_id_size is roughly the optimal limit? So effectively the global_id size grows as log(N)^2. Would that mean that, starting from the 399-bit number, a lower limit for a lineage-carrying global_id_size would be something like (400 bits)^2 = 160,000 bits ~= 20 kB? (Because it carries the ordered-local-id provenance information itself, rather than relying on locally shared knowledge.))
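A quick sanity check of that arithmetic (a toy calculation; `global_id_bits` is a made-up helper name, and the balanced-tree depth assumption is the comment's, not a proven bound):

```python
import math

# Back-of-envelope check of the estimate above. Assume a balanced
# provenance tree of N nodes: expected depth ~ log2(N), and each hop
# stores a local ID of ~log2(N) bits, so a self-describing global ID
# carries roughly log2(N)**2 bits.

def global_id_bits(n_nodes: int) -> float:
    local_id_bits = math.log2(n_nodes)  # bits to name one node locally
    depth = math.log2(n_nodes)          # expected hops back to the root
    return local_id_bits * depth

# The "400-bit" case from the comment: an ID space of 2**400 nodes.
bits = global_id_bits(2**400)           # 400 * 400 = 160,000 bits
print(bits / 8 / 1000)                  # ~20 kB, matching the estimate
```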
Provenance is a DAG, so you get a partial order for free via topological sort, and that can be extended to a compatible total order. Then the provenance ID for a node is just its position in that ordering. This kind of mapping from objects to the first N consecutive naturals is also a minimal perfect hash function, which has n log n bits of overhead. We can't navigate the tree to track ancestry, but equal IDs imply identical ancestry.
Alternatively, we could track the whole history in somewhat more bits with a succinct encoding: 2N bits if it's a binary tree.
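The topological-order idea above can be sketched with the standard library (a toy; the `provenance` DAG here is hypothetical, with each node mapped to its set of parents):

```python
from graphlib import TopologicalSorter

# Assign each node its index in a topological order of the provenance
# DAG. Every ancestor's index is strictly smaller than its descendants',
# and equal indices imply the same node (and hence the same ancestry).

provenance = {            # node -> set of parents (hypothetical DAG)
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a", "b"},
}

order = list(TopologicalSorter(provenance).static_order())
ids = {node: i for i, node in enumerate(order)}
print(ids)  # e.g. {'root': 0, 'a': 1, 'b': 2, 'c': 3}
```

Note that, as the comment says, the index alone doesn't let you walk back up the chain of ancestors; it only respects the ordering.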
In practice, deterministic ID schemes usually accept a 2^-N collision risk to get down to log n bits.
ATProto, the protocol underlying the Bluesky social network, is similar: it uses a content-addressed DAG.
Each “post” has a CID, which is a cryptographic hash of its data. To “prove” ownership of a post, a witness hash is sent along that can be verified all the way up the tree to the repo root hash, which is signed with the root key.
Neat way of having data say “here’s the data, and if you care to verify it, here’s an MST (Merkle Search Tree) path”.
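The verification idea can be sketched in a few lines. This is a hand-rolled toy Merkle proof, not the actual ATProto/MST wire format (and it skips the root-key signature check):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify(leaf: bytes, siblings: list[tuple[bytes, str]], root: bytes) -> bool:
    # Hash the leaf, then fold in each sibling hash on the correct side
    # while walking up the tree; the result must equal the root hash.
    node = h(leaf)
    for sib, side in siblings:
        pair = sib + node if side == "left" else node + sib
        node = h(pair)
    return node == root

# Tiny two-leaf tree: prove leaf_a is under the root.
leaf_a, leaf_b = b"post A", b"post B"
root = h(h(leaf_a) + h(leaf_b))
print(verify(leaf_a, [(h(leaf_b), "right")], root))  # True
```

In the real protocol, the root hash would additionally be checked against the repo's signature before trusting it.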
I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention/the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sort of implies this), and this is where the Taylor expansion would fail to represent the values well. This just... softens attention's ability to attend?
(I'm imagining that if the context has ~4-8 "similar" attention targets that should be sharp, and regular attention learns to select the correct one, this Taylor-approximation version would wash out any difference: they'd all loosely be attended to, and it'd fail to isolate the correct signal.)
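The worry can be illustrated with a scalar toy (this is my own simplification using a plain truncated Taylor series for exp, not the paper's actual feature-map construction):

```python
import math

# One "needle" logit among several near-misses: compare exact softmax
# weights against weights from a P-term Taylor expansion of exp(x).

def taylor_exp(x: float, p: int) -> float:
    # exp(x) ~= sum_{k=0}^{p} x^k / k!
    return sum(x**k / math.factorial(k) for k in range(p + 1))

def weights(logits, expf):
    vals = [expf(x) for x in logits]
    z = sum(vals)
    return [v / z for v in vals]

logits = [8.0, 6.0, 6.0, 6.0]   # a winner plus three similar targets
exact = weights(logits, math.exp)
approx = weights(logits, lambda x: taylor_exp(x, 4))
print(exact[0], approx[0])      # the exact softmax is noticeably sharper
```

With these (arbitrary) logits, the truncated expansion spreads weight more evenly across the near-misses than the true exponential does, which is the "washing out" concern in a nutshell.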
Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.
But also, maybe this combined with some other selective (sparse attention) trick means a hybrid model gets both the "fuzzy long tail" of attention and the sharpness well represented, and all together it could actually be part of the larger solution.
> this is where the Taylor expansion would fail to represent the values well.
"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"
I read that too, but I wondered whether elementwise error is the right metric. Surely the actual error metric should be to evaluate model performance for a conventional transformer model and then the same model with the attention mechanism replaced by this 4th order Taylor approximation?
To spell it out for myself and others: approaching equivalent calculations for each individual attention block means we also approach equivalent performance for the combination of them. And with an error bar approaching floating point accuracy, the performance should be practically identical to regular attention. Elementwise errors of this magnitude can't lead to any noteworthy changes in the overall result, especially given how robust LLM networks seem to be to small deviations.
> This just... softens attentions ability to attend?
I think this does soften, but not linearly. That is to say, the fixed state size limitation means it softens more the further into the past you go.
Right, and when they compare to floating-point accuracy they seem to be using the number of digits supported by the mantissa, but the exponent is important too, no?
When someone says the error is of a certain magnitude, they mean the absolute value of the difference between the two things. So what they're saying is that the values produced by their approximation are about as wrong as the difference between the actual values and those values cast to float16. The exponent is most definitely important and would be included in that.
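To make that concrete, here's a toy round-trip through IEEE 754 half precision using Python's `struct` format code `"e"` (the specific inputs are arbitrary):

```python
import struct

# Cast a float64 to float16 and back. The absolute round-trip error
# grows with the magnitude of the value, because the exponent sets the
# spacing between representable numbers; the *relative* error stays
# around 2**-11 either way.

def to_f16_and_back(x: float) -> float:
    return struct.unpack("e", struct.pack("e", x))[0]

for x in (1.0 + 1e-4, 1000.0 + 0.1):
    err = abs(to_f16_and_back(x) - x)
    print(x, err, err / x)   # absolute error scales with magnitude
```

So "elementwise error at float16 resolution" is a magnitude-relative statement, not a fixed absolute error bound.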
A previous paper from DeepSeek mentioned Anna’s Archive.
> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions.
https://arxiv.org/abs/2403.05525
DeepSeek-VL paper
After maintaining my own agents library for a while, I’ve switched over to pydantic ai recently. I have some minor nits, but overall it's been working great for me. I’ve especially liked combining it with langfuse.
On coding agents: I wonder if there are any good / efficient ways to measure how well different implementations work? SWE-bench seems good, but expensive to run. Effectively I’m curious about things like: given tool definition X vs Y (e.g. diff vs full-file edit), prompt for tool X vs Y (how it’s described, whether it uses examples), model choice (e.g. MCP with Claude vs inline python-exec with GPT-5), sub-agents, todo lists, etc. — how much does each ablation matter? And measure not just success, but cost to success too (efficiency).
Overall, it seems like in the phase space of options, everything “kinda works” but I’m very curious if there are any major lifts, big gotchas, etc.
I ask because it feels like the Claude Code CLI always does a little bit better, subjectively, for me, but I haven’t seen an LMArena-style or clear A-vs-B comparison or measure.
The first time I got off at Komagome and heard its tune, I mistakenly thought it was some Halloween special, because it was late October at the time and the song felt so distinct and unique.
Interestingly, this one seems to be from before 高輪ゲートウェイ (Takanawa Gateway) station, which opened in 2020, but the numbering shows the gap (JY 25 -> JY 27). That led me to look it up: it turns out they introduced the numbering in 2016, and it already came pre-planned with the gap in place [1].
On the street where I grew up, they had to renumber most of the houses one year because a row of new buildings was built. Everyone further down the street than the new houses had their numbers increased, so that the new houses could be given numbers in order with where along the street they stood.
I wonder if that sort of renumbering is common or not, and if Japan is better at planning that sort of thing also.
I was too young at the time to know if this led to any mail delivery issues, and I imagine the postal delivery service was made aware of the change. But I would think that even if they were notified, it would sometimes happen that if your house used to be, say, number 53 and is now 73, mail intended for you ends up in the mailbox of the house that used to be 33 and is now 53.
Even if not at first then at least like 3 years later when some random company still has your old address on file and most other mail for everyone in the street is usually addressed to updated numbers.
Not getting around it, just benefiting from the parallel compute / huge FLOPs of GPUs. Fundamentally, prefill compute is itself highly parallel, and HBM is just that much faster than LPDDR. Effectively, H100s and B100s can chew through prefill in under a second at ~50k-token lengths, so the TTFT (Time to First Token) can feel amazingly fast.
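A rough back-of-envelope for that claim, using the standard dense-transformer estimate of ~2 FLOPs per parameter per token for a forward pass (all numbers here are assumptions, not measurements):

```python
# Prefill FLOPs ~= 2 * params * tokens for a dense model's forward pass.
params = 20e9          # hypothetical ~20B-parameter dense model
tokens = 50_000        # prompt length from the comment
flops = 2 * params * tokens

h100_flops = 1e15      # ~1 PFLOP/s effective BF16 throughput, optimistic
print(flops / h100_flops)   # seconds of pure compute on one GPU
```

Even with these generous single-GPU numbers it lands at a couple of seconds; served MoE models (far fewer active parameters) plus multi-GPU prefill is what pushes real TTFT under a second.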
I was able to get gpt-oss:20b wired up to claude code locally via a thin proxy and ollama.
It's fun that it works, but the prefill time makes it feel unusable (2-3 minutes per tool use / completion). That means a ~10-20-tool-use interaction could take 30-60 minutes.
(This was editing a single server.py file of ~1000 lines; the tool definitions + Claude context were around 30k input tokens, and after the file read, input was around ~50k tokens. Definitely could be optimized. Also, I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help.)
I've been working on something very similar as a tool for my own AI research -- though I don't have the success they claim. Mine often plateaus on the optimization metric. I think there's secret sauce in the meta-prompting and meta-heuristic comments from the paper that are quite vague, but it makes sense -- it changes the dynamics of the search space and helps the LLM get out of ruts. I'm now going to try to integrate some ideas based off of my interpretation of their work to see how it goes.
If it goes well, I could open source it.
What are the things you would want to optimize with such a framework? (So far I've been focusing on optimizing ML training and architecture search itself). Hearing other ideas would help motivate me to open source if there's real demand for something like this.
This does seem similar to what has been done in the neural architecture search domain, doesn't it?
In my case, I'd mainly be interested in mathematics: I'd provide a mathematical problem and a baseline algorithm for it and would want an open source framework to be able to improve on that.
Also definitely interested in open-source ml search: there are so many new approaches (I follow this channel for innovations; it is overwhelming https://www.youtube.com/@code4AI) and it would be great being able to define a use case and having a search come up with the best approaches.
I work in the field of medical image processing. I haven't thought particularly hard about it, but I'm sure I could find a ton of use cases if I wanted to.