The Dyck grammar (balanced brackets) is not a^n b^n; there are several kinds of brackets.
I cannot find the probability of success in the paper you linked. Is it 100%? I believe it is less than 100%, because LLMs are intrinsically probabilistic machines.
Well, they used strings of < 800 chars; you probably run into context-window and training limits at some point (they mention a result that you need something of at least GPT-2 size to begin recognizing more intricate CFGs, e.g. their synthetic cfg3f). But then again, your physical real-world computer, which is conceptually "Turing complete", can't handle "infinite strings" either.
> Dyck/balanced-brackets grammar
Yes, it's not the Dyck grammar but another CFG they created; they call it the "cfg3" family.
Of course I agree the stack (/pushdown automaton) is the simpler and perfectly optimal structure for this task, but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
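The pushdown-automaton point is easy to make concrete. A minimal sketch of a multi-bracket Dyck recognizer with an explicit stack (the bracket alphabet here is an illustrative choice, not the one from the paper):

```python
# Recognize a Dyck language over several bracket kinds with an explicit stack.
PAIRS = {')': '(', ']': '[', '}': '{'}

def is_dyck(s: str) -> bool:
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False  # character outside the bracket alphabet
    return not stack  # balanced iff nothing is left open
```

The stack is exactly the pushdown store; the LLM has to approximate this with fixed-depth attention, which is why bounded string length matters.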
(Then again, I know you didn't make any broad refutation of that sort; I mostly wanted to bring up that paper to show that it is possible for them to at least "grok" certain CFGs with a low enough error rate that they must have internalized the underlying grammar [and in fact I believe the paper goes on to apply interpretability methods to actually trace the circuits with which the model encodes the learned grammar, which puts to rest any notion of it simply "parroting" the data].) But these were "synthetic" LLMs specifically trained on that grammar; these results probably don't apply in practice to your ChatGPT, which was trained mostly on human text.
> but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
They recognize and/or generate strings of bounded length (< 800 chars) from those grammars in that paper.
Usually, sizes of files on a typical Unix workstation follow a bimodal log-normal distribution (a mixture of two log-normal distributions), with heavy tails due to the log-normality [1]. The authors of the paper did not attempt to model that distribution.
[1] This was true for my home directories for several years.
> I do read the code, but reviewing code is very different from producing it, and surely teaches you less. If you don’t believe this, I doubt you work in software.
I work in software, and for a single line I write I read hundredths of them.
If I am fixing bugs in my own (mostly self-education) programs, I read my program several times, over and over again. If writing programs has taught me anything, it is how to read programs most effectively. And also how to write programs so that they can be read most effectively.
> I work in software and for single line I write I read hundredths of them.
I'm not sure whether this should humble or confuse me. I am definitely WAY heavier on the write-side of this equation. I love programming. And writing. I love them both so much that I wrote a book about programming. But I don't like reading other peoples' code. Nor reading generally. I can't read faster than I can talk. I envy those who can. So, reading code has always been a pain. That said, I love little clever golf-y code, nuggets of perl or bitwise magic. But whole reams of code? Hundreds upon hundreds of lines? Gosh no. But I respect anyone who has that patience. FWIW I find that one can still gain incredibly rich understanding without having to read too heavily by finding the implied contracts/interfaces and then writing up a bunch of assertions to see if you're right, TDD style.
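That assertion-probing approach can be shown in a few lines. A sketch of pinning down an implied contract with assertions instead of reading the implementation; the "unfamiliar" code here is just Python's `str.split`, standing in for someone else's module:

```python
# Instead of reading an implementation, pin down its implied contract with
# assertions, TDD style; str.split stands in for unfamiliar code.
assert "a,b,,c".split(",") == ["a", "b", "", "c"]   # keeps empty fields
assert "  a  b ".split() == ["a", "b"]              # no-arg split drops empties
assert "abc".split(",") == ["abc"]                  # no separator: whole string
```

Each assertion that survives is a piece of the contract you now know without having read a line of the source.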
Most software engineers out there do support work, augmenting source-code behemoths in the least invasive way possible to achieve the desired outcome. I believe that more than 90% of software development was support roles as early as 2000 or so.
It's not that I never had an opportunity to write new code, but most of my work throughout my career was either fixing bugs or adding new functionality to an existing system with as little code as possible. Both goals mean reuse and understanding of the existing code, and for both "reuse" and "understanding" you have to thoroughly read the existing code a dozen or so times over.
Tests (in TDD) can show you the presence of bugs, not their absence. To establish the absence of bugs one has to thoroughly know the problem domain and the source code solving its problems.
> If I am fixing bugs in my own (mostly self-education) programs, I read my program several times
I think here lies the difference OP is talking about. You are reading your own code, which means you had to first put in the effort to write it. If you use LLMs, you are reading code you didn't write.
I read other people's code all the time. I work as a platform engineer with SRE functions.
Gemini 3 by itself is insufficient. I often find myself tracing through things or testing during runtime to understand how things behave. Claude Opus is not much better for this.
On the other hand, pairing with Gemini 3 feels like pairing with other people. No one is going to get everything right all the time. I might ask Gemini to construct gcloud commands or look things up for me, but we’re trying to figure things out together.
It is literally not clear. OP could mean that they read hundredths of a line of code for each line they write, i.e. 100 lines of code written for every 1-3 lines read. That is in fact literally what they wrote.
I read hundredths (100ths) of lines of code for one line of code I write.
My last PR - three lines of code moved into a conditional - required me to read about 8000 lines of code to understand and justify the reason to do exactly that.
Man it would rule so much if programmers could manage not to be assholes by default so much of the time.
It's ironic when the more ignorant one is the one calling the other ignorant.
Alright, I've had my fun with the name-calling. I will now explain the stunningly obvious. Not a thing anyone should have to do for someone as sharp as yourself, but there we are...
For someone to produce that text after growing up in an English-speaking environment, they would indeed be a comically inept communicator. Which is why the more reasonable assumption is that English is not in fact their native language.
Not merely the more generous assumption. Being generous by default would be a better character trait than not, though still arguably a luxury. But it is also simply the more reasonable assumption by plain numbers and reasoning. So not only were you a douche, you had to go out of your way to select the less likely possibility to make the douche you wanted to be fit the situation.
Louise (Patsantzis & Muggleton 2021) is a machine learning system that learns Prolog programs.
Louise is a Meta-Interpretive Learning (MIL) system. MIL (Muggleton et al. 2014; 2015) is a relatively new setting for Inductive Logic Programming (ILP) (Muggleton 1991). ILP is a form of weakly-supervised machine learning of logic programs from examples of program behaviour (meaning examples of the inputs and outputs of the programs to be learned). Unlike conventional statistical machine learning algorithms, ILP approaches do not need to see examples of programs in order to learn new programs; instead they rely on background knowledge, a library of pre-existing logic programs that they reuse to compose new programs.
This is what Douglas Lenat did from the late 1970s on [1]. He did his work in Lisp; this thing does something close in Prolog.
If we're going down that path: Ehud Shapiro got there back in 1984 [1]. His PhD thesis is excellent and shows what logic programming could do (/could have been).
He viewed the task of learning predicates (programs/relations) as a debugging task. The magic is in a refinement operator that enumerates new programs. The diagnostic part was wildly insightful -- he showed how to operationalise Popper's notion of falsification. There are plenty of more modern accounts of that aspect but sadly the learning part was broadly neglected.
There are more recent probabilistic accounts of this approach to learning from the 1990s.
... and if you want to go all the way back you can dig up Gordon Plotkin's PhD thesis on antiunification from the early 1970s.
Imagine three joins of three queries A, B and C, where the first join J1 joins A and B, the second join J2 joins A and C, and the third join J3 joins J1 and J2. Note that I said "queries," not "tables" - these A, B and C can be complex things one would not want, or be able, to compute more than once. Forget about compute: A, B and C can be quite complex even to write down, and the user may really not want to repeat themselves. Look at TPC-DS: there are subqueries in the "WITH" sections that are quite complex.
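The J1/J2/J3 shape can be sketched with named subqueries (CTEs), so the "queries" A, B and C are written once and reused; the schema and data below are made up for illustration:

```python
# Three joins over three named subqueries a, b, c: j1 = a⋈b, j2 = a⋈c,
# and the final result joins j1 with j2. a, b, c are each written only once.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t(id INTEGER, grp TEXT, val INTEGER);
    INSERT INTO t VALUES
        (1,'a',10),(1,'b',20),(1,'c',30),
        (2,'a',40),(2,'b',50),(2,'c',60);
""")
rows = con.execute("""
    WITH a  AS (SELECT id, val FROM t WHERE grp = 'a'),
         b  AS (SELECT id, val FROM t WHERE grp = 'b'),
         c  AS (SELECT id, val FROM t WHERE grp = 'c'),
         j1 AS (SELECT a.id, a.val + b.val AS ab FROM a JOIN b USING (id)),
         j2 AS (SELECT a.id, a.val + c.val AS ac FROM a JOIN c USING (id))
    SELECT j1.id, ab, ac FROM j1 JOIN j2 USING (id) ORDER BY j1.id
""").fetchall()
# rows == [(1, 30, 40), (2, 90, 100)]
```

The query plan here is a DAG, not a pipeline: `a` feeds both `j1` and `j2`, which is exactly the shape a linear pipe syntax struggles to express without repeating `a`.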
This is why pipeline replacements for SQL are more or less futile efforts. They simplify the simple part and avoid touching the complex one.
I think that something like Verse [1] is more or less the way to go. Not Verse itself, but functional logic programming as an idea, where you can have first-class data producers and an effect system to specify transactions.
TIL about Verse - looks cool, I'll have to check it out.
> SQL is not a pipeline, it is a graph.
Maybe it's both? And maybe there will always be hard-to-express queries in SQL, and that's OK?
The RDBMS's relational model is certainly a graph, and joins accordingly introduce complexity.
For me, just as the creators of the internet regret that subdomains come before domains, I really wish we could go back in time and have `FROM` be the first clause and not `SELECT`. This is much more intuitive and lends itself to the idea of a pipeline: a table scan (FROM) that is piped to a projection (SELECT).
I haven't seen anyone make the point about graphs before. FWIW PRQL allows defining named subqueries that can be reused, like J1 and J2 in your example.
You present "programs are graphs" as a trivial truth. Truly trivial truths are, as you pointed out, meaningless. But you leave out the degree of applicability: the information in the dependence graph differs between programming languages.
Dependencies form a graph, and the analyses needed to optimize execution of the program graph differ wildly between languages. Look at C++ aliasing rules and C's "restrict" keyword.
One can't escape the dependence graph. But one can execute the dependence graph better or worse, depending (pun intended) on the programming language.
8B coefficients are packed into 53B transistors, about 6.6 transistors per coefficient. A two-input NAND gate takes 4 transistors, and a register takes about the same. So one coefficient gets processed (multiplied, and the result added to a sum) with fewer than two two-input NAND gates' worth of transistors.
I think they used block quantization: one can enumerate all possible blocks of (sorted) coefficient values and, for each layer, place only the blocks that are needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 different blocks are needed.
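The 330 figure checks out: a block of 4 sorted 3-bit coefficients is a multiset of size 4 drawn from 8 possible values, i.e. combinations with repetition:

```python
# Count the distinct sorted blocks of 4 coefficients, each taking one of
# 2^3 = 8 values: C(8 + 4 - 1, 4) multisets.
from itertools import combinations_with_replacement
from math import comb

blocks = list(combinations_with_replacement(range(8), 4))
assert len(blocks) == comb(8 + 4 - 1, 4) == 330
```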
The matrices in llama 3.1 are 4096x4096, about 16M coefficients. They can be compressed into only 330 blocks, if we assume that all coefficient combinations occur, plus a network to route the correct permutations of inputs and outputs.
Assuming that blocks are the most area-consuming part, we have a transistor budget of about 250 thousand transistors per block, or 30 thousand two-input NAND gates per block.
250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
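The back-of-the-envelope arithmetic above, spelled out:

```python
# Transistor budget per coefficient under the assumptions above:
# 250K transistors per block, 330 blocks, one 4096x4096 matrix.
transistors_per_block = 250_000
blocks = 330
coefficients = 4096 * 4096                # ~16.8M coefficients per matrix
per_coeff = transistors_per_block * blocks / coefficients
assert 4.5 < per_coeff < 5.5              # about 5 transistors per coefficient
```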
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
The largest FPGAs have on the order of tens of millions of logic cells/elements. They're not even remotely big enough to emulate these designs, except to validate small parts at a time; and unlike memory chips or GPUs, companies don't need millions of them to scale infrastructure.
(The chips also cost tens of thousands of dollars each)
You can synthesize a logic circuit that is just as complex as it needs to be to reach a certain accuracy.
Deep differentiable logic networks, in my experience, do not scale well to larger (more-input) logic elements. One still has to apply logic optimization and synthesis afterwards. So why not synthesize one's own approximate circuit to the accuracy one desires?
I gave a short talk about compiling PyTorch to Verilog at Latte '22. Back then we were just looking at a simple dot product operation, but the approach could theoretically scale up to whole models.
They mentioned that they are using strong quantization (IIRC 3-bit) and that the model degraded from that. Also, they don't have to use transistors to store the bits.
gpt-oss is FP4 - they're saying they'll next try a mid-size model (I'm guessing gpt-oss-20b), then a large one (I'm guessing gpt-oss-120b), as their hardware is FP4-friendly.
This is 3% or infinitely far away from the perfect tech.
The perfect tech is the stack.