The Dyck grammar (balanced brackets) is not a^n b^n; there are several kinds of brackets.
I cannot find the probability of success in the paper you linked. Is it 100%? I believe it is less than 100%, because LLMs are intrinsically probabilistic machines.
Well, they used strings of < 800 chars; you probably run into context-window and training limits at some point (they mention a result that you need something of at least GPT-2 size to begin recognizing more intricate CFGs, e.g. their synthetic cfg3f). But then again, your physical real-world computer, which is conceptually "Turing complete", can't handle "infinite strings" either.
> Dyck/balanced-brackets grammar
Yes, it's not the Dyck grammar but another CFG they created; they call it the "cfg3" family.
Of course I agree the stack (/pushdown automaton) is the simpler and perfectly optimal structure for this task, but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
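The pushdown-automaton point is easy to make concrete. A minimal sketch of a multi-bracket Dyck recognizer with an explicit stack (the bracket alphabet here is an illustrative choice, not the one from the paper):

```python
# Recognize a Dyck language over several bracket kinds with an explicit stack.
PAIRS = {')': '(', ']': '[', '}': '{'}

def is_dyck(s: str) -> bool:
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False  # character outside the bracket alphabet
    return not stack  # balanced iff nothing is left open
```

The stack is exactly the pushdown store; the LLM has to approximate this with fixed-depth attention, which is why bounded string length matters.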
(Then again, I know you didn't make any broad refutation of that sort; I mostly wanted to bring up that paper to show that it is possible for them to at least "grok" certain CFGs with a low enough error rate that they must have internalized the underlying grammar [and in fact I believe the paper goes on to apply interpretability methods to actually trace the circuits with which the model encodes the learned grammar, which puts to rest any notion of it simply "parroting" the data].) But these were "synthetic" LLMs specifically trained on that grammar; these results probably don't apply in practice to your ChatGPT, which was trained mostly on human text.
> but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
They recognize and/or generate strings of bounded length (< 800 chars) from those grammars in that paper.
Usually, sizes of files on a typical Unix workstation follow a bimodal log-normal distribution (a mixture of two log-normal distributions), with heavy tails due to the log-normality [1]. The authors of the paper did not attempt to model that distribution.
[1] This was true for my home directories for several years.
> I do read the code, but reviewing code is very different from producing it, and surely teaches you less. If you don’t believe this, I doubt you work in software.
I work in software, and for a single line I write I read hundredths of them.
If I am fixing bugs in my own (mostly self-education) programs, I read my program several times, over and over again. If writing programs has taught me anything, it is how to read programs most effectively. And also how to write programs so that they can be read most effectively.
> I work in software and for single line I write I read hundredths of them.
I'm not sure whether this should humble or confuse me. I am definitely WAY heavier on the write-side of this equation. I love programming. And writing. I love them both so much that I wrote a book about programming. But I don't like reading other peoples' code. Nor reading generally. I can't read faster than I can talk. I envy those who can. So, reading code has always been a pain. That said, I love little clever golf-y code, nuggets of perl or bitwise magic. But whole reams of code? Hundreds upon hundreds of lines? Gosh no. But I respect anyone who has that patience. FWIW I find that one can still gain incredibly rich understanding without having to read too heavily by finding the implied contracts/interfaces and then writing up a bunch of assertions to see if you're right, TDD style.
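That assertion-probing approach can be shown in a few lines. A sketch of pinning down an implied contract with assertions instead of reading the implementation; the "unfamiliar" code here is just Python's `str.split`, standing in for someone else's module:

```python
# Instead of reading an implementation, pin down its implied contract with
# assertions, TDD style; str.split stands in for unfamiliar code.
assert "a,b,,c".split(",") == ["a", "b", "", "c"]   # keeps empty fields
assert "  a  b ".split() == ["a", "b"]              # no-arg split drops empties
assert "abc".split(",") == ["abc"]                  # no separator: whole string
```

Each assertion that survives is a piece of the contract you now know without having read a line of the source.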
Most software engineers out there do support work, augmenting source-code behemoths in the least invasive way possible to achieve the desired outcome. I believe that more than 90% of software development was support roles as early as 2000 or so.
It's not that I never had an opportunity to write new code, but most of my work throughout my career was either fixing bugs or adding new functionality to an existing system with as little code as possible. Both goals mean reuse and understanding of the existing code, and for both "reuse" and "understanding" you have to thoroughly read the existing code a dozen or so times over.
Tests (in TDD) can show you the presence of bugs, not their absence. To establish the absence of bugs one has to thoroughly know the problem domain and the source code solving its problems.
> If I am fixing bugs in my own (mostly self-education) programs, I read my program several times
I think here lies the difference OP is talking about. You are reading your own code, which means you had to first put in the effort to write it. If you use LLMs, you are reading code you didn't write.
I read other people's code all the time. I work as a platform engineer with SRE functions.
Gemini 3 by itself is insufficient. I often find myself tracing through things or testing during runtime to understand how things behave. Claude Opus is not much better for this.
On the other hand, pairing with Gemini 3 feels like pairing with other people. No one is going to get everything right all the time. I might ask Gemini to construct gcloud commands or look things up for me, but we’re trying to figure things out together.
It is literally not clear. OP could mean that they read hundredths of a line of code for each line they write, i.e. 100 lines of code written for every 1-3 lines read. That is in fact literally what they wrote.
I read hundredths (100ths) of lines of code for one line of code I write.
My last PR - three lines of code moved into a conditional - required me to read about 8000 lines of code to understand and justify the reason to do exactly that.
Man it would rule so much if programmers could manage not to be assholes by default so much of the time.
It's ironic when the more ignorant one is the one calling the other ignorant.
Alright, I've had my fun with the name-calling. I will now explain the stunningly obvious. Not a thing anyone should have to do for someone as sharp as yourself, but there we are...
For someone to produce that text after growing up in an English-speaking environment, they would indeed be a comically inept communicator. Which is why the more reasonable assumption is that English is not in fact their native language.
Not merely the more generous assumption. Being generous by default would be a better character trait than not, though still arguably a luxury. But it is also simply the more reasonable assumption by plain numbers and reasoning. So not only were you a douche, you had to go out of your way to select the less likely possibility to make the douche you wanted to be fit the situation.
Louise (Patsantzis & Muggleton 2021) is a machine learning system that learns Prolog programs.
Louise is a Meta-Interpretive Learning (MIL) system. MIL (Muggleton et al. 2014; 2015) is a relatively new setting for Inductive Logic Programming (ILP) (Muggleton 1991). ILP is a form of weakly-supervised machine learning of logic programs from examples of program behaviour (meaning examples of the inputs and outputs of the programs to be learned). Unlike conventional statistical machine learning algorithms, ILP approaches do not need to see examples of programs in order to learn new programs; instead they rely on background knowledge, a library of pre-existing logic programs that they reuse to compose new programs.
This is what Douglas Lenat did from the late 1970s on [1]. He did his work in Lisp; this thing does something close in Prolog.
If we're going down that path: Ehud Shapiro got there back in 1984 [1]. His PhD thesis is excellent and shows what logic programming could do (/could have been).
He viewed the task of learning predicates (programs/relations) as a debugging task. The magic is in a refinement operator that enumerates new programs. The diagnostic part was wildly insightful -- he showed how to operationalise Popper's notion of falsification. There are plenty of more modern accounts of that aspect but sadly the learning part was broadly neglected.
There are more recent probabilistic accounts of this approach to learning from the 1990s.
... and if you want to go all the way back you can dig up Gordon Plotkin's PhD thesis on antiunification from the early 1970s.
Imagine three joins of three queries A, B and C, where the first join J1 joins A and B, the second join J2 joins A and C, and the third join J3 joins J1 and J2. Note that I said "queries," not "tables" - these A, B and C can be complex things one would not want, or be able, to compute more than once. Forget about compute: A, B and C can be quite complex even to write down, and the user may really not want to repeat themselves. Look at TPC-DS: there are subqueries in the "WITH" sections that are quite complex.
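The J1/J2/J3 shape can be sketched with named subqueries (CTEs), so the "queries" A, B and C are written once and reused; the schema and data below are made up for illustration:

```python
# Three joins over three named subqueries a, b, c: j1 = a⋈b, j2 = a⋈c,
# and the final result joins j1 with j2. a, b, c are each written only once.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(id INTEGER, grp TEXT, val INTEGER);
    INSERT INTO t VALUES
        (1,'a',10),(1,'b',20),(1,'c',30),
        (2,'a',40),(2,'b',50),(2,'c',60);
""")
rows = con.execute("""
    WITH a  AS (SELECT id, val FROM t WHERE grp = 'a'),
         b  AS (SELECT id, val FROM t WHERE grp = 'b'),
         c  AS (SELECT id, val FROM t WHERE grp = 'c'),
         j1 AS (SELECT a.id, a.val + b.val AS ab FROM a JOIN b USING (id)),
         j2 AS (SELECT a.id, a.val + c.val AS ac FROM a JOIN c USING (id))
    SELECT j1.id, ab, ac FROM j1 JOIN j2 USING (id) ORDER BY j1.id
""").fetchall()
# rows == [(1, 30, 40), (2, 90, 100)]
```

The query plan here is a DAG, not a pipeline: `a` feeds both `j1` and `j2`, which is exactly the shape a linear pipe syntax struggles to express without repeating `a`.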
This is why pipeline replacements for SQL are more or less futile efforts. They simplify the simple part and avoid touching the complex one.
I think that something like Verse [1] is more or less the way to go. Not Verse itself, but functional logic programming as an idea, where you can have first-class data producers and an effect system to specify transactions.
TIL about Verse - looks cool, I'll have to check it out.
> SQL is not a pipeline, it is a graph.
Maybe it's both? And maybe there will always be hard-to-express queries in SQL, and that's OK?
The RDBMS's relational model is certainly a graph, and joins accordingly introduce complexity.
For me, just as the creators of the internet regret that subdomains come before domains, I really wish we could go back in time and have `FROM` be the first clause and not `SELECT`. This is much more intuitive and lends itself to the idea of a pipeline: a table scan (FROM) that is piped to a projection (SELECT).
I haven't seen anyone make the point about graphs before. FWIW PRQL allows defining named subqueries that can be reused, like J1 and J2 in your example.
You present "programs are graphs" as a trivial truth. Truly trivial truths are, as you pointed out, meaningless. But you leave out the degree of applicability: the information in the dependence graph differs between programming languages.
Dependencies form a graph, and the analyses needed to optimize execution of the program graph differ wildly between languages. Look at C++ aliasing rules and C's "restrict" keyword.
One can't escape the dependence graph. But one can execute the dependence graph better or worse, depending (pun intended) on the programming language.
8B coefficients are packed into 53B transistors, about 6.6 transistors per coefficient. A two-input NAND gate takes 4 transistors, and a register takes about the same. So one coefficient gets processed (multiplied, and the result added to a sum) with fewer than two two-input NAND gates' worth of transistors.
I think they used block quantization: one can enumerate all possible blocks of (sorted) coefficient values and, for each layer, place only the blocks that are needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 different blocks are needed.
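The 330 figure checks out: a block of 4 sorted 3-bit coefficients is a multiset of size 4 drawn from 8 possible values, i.e. combinations with repetition:

```python
# Count the distinct sorted blocks of 4 coefficients, each taking one of
# 2^3 = 8 values: C(8 + 4 - 1, 4) multisets.
from itertools import combinations_with_replacement
from math import comb

blocks = list(combinations_with_replacement(range(8), 4))
assert len(blocks) == comb(8 + 4 - 1, 4) == 330
```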
The matrices in llama 3.1 are 4096x4096, about 16M coefficients. They can be compressed into only 330 blocks, if we assume that all coefficient combinations occur, plus a network to route the correct permutations of inputs and outputs.
Assuming that blocks are the most area-consuming part, we have a transistor budget of about 250 thousand transistors per block, or 30 thousand two-input NAND gates per block.
250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
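The back-of-the-envelope arithmetic above, spelled out:

```python
# Transistor budget per coefficient under the assumptions above:
# 250K transistors per block, 330 blocks, one 4096x4096 matrix.
transistors_per_block = 250_000
blocks = 330
coefficients = 4096 * 4096                # ~16.8M coefficients per matrix
per_coeff = transistors_per_block * blocks / coefficients
assert 4.5 < per_coeff < 5.5              # about 5 transistors per coefficient
```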
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
The largest FPGAs have on the order of tens of millions of logic cells/elements. They're not even remotely big enough to emulate these designs, except to validate small parts at a time; and unlike memory chips or GPUs, companies don't need millions of them to scale infrastructure.
(The chips also cost tens of thousands of dollars each)
You can synthesize a logic circuit that is just as complex as it needs to be to reach a certain accuracy.
Deep differentiable logic networks, in my experience, do not scale well to larger (more-input) logic elements. One still has to apply logic optimization and synthesis afterwards. So why not synthesize one's own approximate circuit to the accuracy one desires?
I gave a short talk about compiling PyTorch to Verilog at Latte '22. Back then we were just looking at a simple dot product operation, but the approach could theoretically scale up to whole models.
They mentioned that they are using strong quantization (IIRC 3-bit) and that the model degraded from that. Also, they don't have to use transistors to store the bits.
gpt-oss is FP4 - they're saying they'll next try a mid-size model (I'm guessing gpt-oss-20b), then a large one (I'm guessing gpt-oss-120b), as their hardware is FP4-friendly.
This is 3% or infinitely far away from the perfect tech.
The perfect tech is the stack.