I work on tracking, tracing, and timing transactions, and while this is possible when you know the identity of at least one participant, it's quite another thing when no identity is known at all. Criminals know that, so it's notoriously hard to pull off. Thanks to Delaware secrecy and lax Super PAC rules on disclosing sources of funds, it's not going to get easier.
So either your friends are genius sources, or they have access to genuinely effective government intelligence, which would be highly appreciated. I'd be interested.
You are spot on regarding the bedroom though. Exporting physical USD is far more lucrative, by the shipload, often by Chinese Money Laundering Organisations, for free.
Sorry, I'm a bit fuzzy on the details, but I know that he usually goes after big fish - people who deal in bulk or are somehow involved in manufacturing, not dudes who order pills in the mail. These people are usually being investigated/surveilled otherwise, and he does work with the police, so you could say he has 'government intelligence'.
That's why diversity of sources is the only way to escape censorship: you get one half-truth from one source, another half-truth from another, and the two halves make a whole truth.
That's also trivial to manipulate; control the narrative, and you control the Overton window. People picking the middle of two fake options are still under the influence of whoever chose those options — just ask any stage magician.
Propaganda works when it's the only source of information. This situation is created by censorship, especially on the internet, where you don't need to travel to reach a distant site.
Taking a step back, there is another way for propaganda to function that doesn't even require being the main source, but simply to make the lie so huge that people can't process the idea someone would be *that* level of dishonest: https://en.wikipedia.org/wiki/Big_lie
Consider your own previous comment:
> you get one half-truth from one source, another half-truth from another, and the two halves make a whole truth.
What happens when one source says that the Alpha Party* consists of child-eating devil-worshiping lizards from Alpha Ceti 5 who caused the 9/11 attacks to cover up how the mind-control chemtrail fluid they were making in the WTC burned hot enough to melt steel beams, and the other source says the Alpha Party is standing on a platform of reducing the tax burden on hard-working families?
The latter can be a half-truth, but you don't get even a little closer to a full truth by adding any part of the "other side".
* A made-up party; any similarity to actual persons is coincidental, and all the usual disclaimers apply.
Your links give examples of campaigns that happened, but didn't quite work. You think the problem is their very happening? And the very fact that you know about child-eating devil-worshiping lizards from Alpha Ceti 5 shows that an opinion is available no matter what propaganda you use against it as long as it's not censored. You can suppress it only by censorship, not by propaganda. In any case using shitposting sites as a source of information is tricky, journalism isn't that bad yet.
> Your links give examples of campaigns that happened, but didn't quite work. You think the problem is their very happening?
They clearly did work, though.
Problem? No, the problem isn't their very happening; it's that they are effective strategies, some of which are also used by advertising agencies.
> And the very fact that you know about child-eating devil-worshiping lizards from Alpha Ceti 5 shows that an opinion is available no matter what propaganda you use against it as long as it's not censored.
I don't know anything about child-eating devil-worshiping lizards from Alpha Ceti 5, that doesn't mean I can't talk about them. It's called "making stuff up".
Not sure where you're going with that sentence though. You do realise, I hope, that this was supposed to be a string of nonsense? That the point was that no matter which half you take from a string of nonsense, you can't combine it with a half-truth to get a full truth, you just get a half truth with a different false part.
Which in this example might be something like "the Alpha Party* caused the 9/11 attacks to cover up how the mind-control chemtrail fluid they were making in the WTC burned hot enough to melt steel beams, and wants to reduce the tax burden on families where the parents earn more than double the national average income between them".
The half-truth remains, at best, a half-truth. But that's the best case, and you only get that if you already knew what part was less than honest before you considered what to dismiss, at which point you didn't need anything from other statements in the first place.
> You can suppress it only by censorship, not by propaganda.
That's the point of disagreement: you, as a human, can only pay attention to so much. For example, if I buy all the ad space around you and fill it with only my own message, that is propaganda that denies the same space to anyone who wants to tell the truth.
> In any case using shitposting sites as a source of information is tricky, journalism isn't that bad yet.
This assumes you have the cognitive resources to do that. Most people just switch to someone they trust, to avoid exactly this. As a matter of fact, that was the major advantage of the net back in the day.
I think people have to deal with pluralism of opinions in everyday life too, since different people have different opinions. Aren't they socially maladapted if they can't do that?
> That's why diversity of sources is the only way to escape censorship:
No, it's a page out of the old fascist playbook where flooding the stage with propaganda generates enough confusion to help fascists further their hateful agenda.
I find it hilarious when people who are pro censorship bring up Karl Popper and the Paradox of Tolerance.
You can tell they've never read his work, because his conclusion in the end is that you should tolerate intolerance up until it promotes specific violence.
So total freedom of speech up until it starts inciting violence. It's basically the same stance the US Constitution takes.
Fascists in the original sense, Mussolini, didn't tolerate opposition.
I'm not sure about modern fascists, but US politics does look rather Kayfabe-y to me. Fake opposition, there for the purpose of being an opponent.
Of course then you get all the discourse about what even counts as fascism, and someone brings up that the origin of the word is the Roman "fasces" (bundle of sticks) and how that etymological root points to the concept of "strength through unity" which is also why the Lincoln memorial has Lincoln resting his hands on them[0] and why trade unions often use the "strength through unity" phrasing (and get annoyed/upset by the connection).
Notice how all the major AI companies (at least the ones that don't do open releases) stopped telling us how many parameters their models have. Parameter count was used as a measure for how great the proprietary models were until GPT3, then it suddenly stopped.
And how inference prices have come down a lot, despite increasing pressure to make money. Opus 4.6 is $25/MTok, Opus 4.1 was $75/MTok, the same as Opus 4 and Opus 3. OpenAI's o1 was $60/MTok, o1 pro $600/MTok, gpt-5.2 is $14/MTok and 5.2-pro is $168/MTok.
Also note how GPT-4 was rumored to be in the 1.8T realm, and now Chinese models in the 1T realm can match or surpass it. And I doubt the Chinese have a monopoly on those efficiency improvements.
I doubt frontier models have actually substantially grown in size in the last 1.5 years, and they potentially have a lot fewer parameters than the frontier models of old.
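For a rough sense of that trend, the per-million-token prices quoted above work out to these multipliers (figures as reported in the thread, not independently verified):

```python
# Per-million-token price drops, using the figures quoted in this thread
# (as reported, not independently verified).
prices = {
    ("Opus 4.1", "Opus 4.6"): (75, 25),
    ("o1", "gpt-5.2"): (60, 14),
    ("o1 pro", "gpt-5.2 pro"): (600, 168),
}

drops = {pair: old / new for pair, (old, new) in prices.items()}
# Every pair comes out roughly 3-4x cheaper within about a generation.
```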
You're hitting on something really important that barely gets discussed. For instance, notice how opus 4.5's speed essentially doubled, bringing it right in line with the speed of sonnet 4.5? (sonnet 4.6 got a speed bump too, though closer to 25%).
It was the very first thing I noticed: it looks suspiciously like they just rebranded sonnet as opus and raised the price.
I don't know why more people aren't talking about this. Even on X, where the owner directly competes in this market, it's rarely brought up. I strongly suspect there is a sort of tacit collusion between competitors in this space. They all share a strong motivation to kill any deep discussion of token economics, even about each other, because transparency only arms the customers.
By keeping the underlying mechanics nebulous, they can all justify higher prices. Just look at the subscription tiers: every single major player has settled on the exact same pricing model, a $20 floor and a $200 cap, no exceptions.
These AI companies are all in the same boat. At current operating costs and profit margins they can't hope to pay back the investment, so they have to pull tricks like rebranding models and downgrading offerings silently.
There's no oversight of this industry. The consumer protection department in the US was literally shut down by the administration, and even if it had not been, this technology is too opaque for anyone to tell whether today you're getting a lower model than what you paid for yesterday.
I'm convinced they're all doing everything they can in the background to cut costs and increase profits.
I can't prove that Gemini 3 is dumber than when it came out, because of the non-deterministic nature of this technology, but it sure feels like it.
Opus 4.6 was going to be Sonnet 5 up until the week of release. The price bump is even bigger than you realize, because they don't let you run Opus 4.6 at full speed unless you pay an extra 10x for the new "fast mode".
If that's true, it would be surprising; the current Sonnet 4.6 is not in the same league as either Opus 4.5 or 4.6, either anecdotally or on benchmarks.
Because Opus 4.6 is better than 4.5. So if it's true that Sonnet 5 was so good they gave it the Opus name, does that mean there was an Opus upgrade that didn't pan out? And what is Sonnet 4.6? An upgraded Haiku? Just trying to follow the red yarn in the conspiracy board here.
I don't know whether there was an opus that ran into trouble or if they just looked at the model they had and decided that they could charge more than originally intended.
Sonnet 4.6 is presumably either a version of Sonnet 4.5 with optimizations for cost instead of performance, or a Haiku that got upscaled.
Anthropic is preparing for IPO this year, so it's not exactly a stretch to suggest that they might be trying to decrease their losses and increase inference margin.
It's quite plausible to me that the difference is inference configuration. This could be done through configurable depth, MoE experts, layers, etc. Even beam-decoding changes can make a substantial performance difference.
Train one large model, then down configure it for different pricing tiers.
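The "train one large model, down-configure it for tiers" idea can be sketched as a toy mixture-of-experts layer where a cheaper tier simply activates fewer experts per token. All sizes and the routing scheme here are invented for illustration, not any vendor's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture-of-experts layer. The full model has 8 experts; a cheaper
# pricing tier could simply activate fewer of them per token.
N_EXPERTS, DIM = 8, 16
experts = [rng.standard_normal((DIM, DIM)) for _ in range(N_EXPERTS)]
router = rng.standard_normal((DIM, N_EXPERTS))

def forward(x, active_experts):
    """Route x through only the top-`active_experts` experts."""
    scores = x @ router                         # (N_EXPERTS,) routing logits
    top = np.argsort(scores)[-active_experts:]  # indices of the top-k experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.standard_normal(DIM)
full = forward(x, active_experts=8)   # "premium" tier: all experts
cheap = forward(x, active_experts=2)  # "budget" tier: same weights, less compute
```

The same checkpoint serves both tiers; only the per-token compute differs.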
I don't think that's plausible, because they also just launched a high-speed variant, which presumably has the inference optimization and smaller batching, and costs about 10x.
also, if you have inference optimizations why not apply them to all models?
It kind of makes sense. At least a year or so ago, I know the $20 unlimited plans were costing these companies ~$250 per user on average. They're still lighting money on fire at $200, but probably not nearly as badly. I'm not sure whether costs have gone up with changes in models; the agentic tooling seems more expensive for them (hence why they're pushing anyone they can to pay per token).
Cite a source. Your concrete claim is that, on average, for every $1 of subscription revenue on a monthly subscription, OpenAI and Anthropic were losing $11.50?
It seems completely implausible.
I could believe that if a $20 sub used every possible token granted, it would cost $250. But certainly almost no one was completely milking their subscription. In the same way that no one is streaming netflix literally 24/7.
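That objection is just expected-value arithmetic. With hypothetical numbers (the $250 full-use cost is the figure from the comment above; the utilization rates are invented, and serving cost is assumed to scale linearly with usage):

```python
# Hypothetical subscription economics. The $250 full-use cost is the figure
# quoted upthread; the utilization rates are invented. Assumes serving cost
# scales linearly with how much of the quota a subscriber actually uses.
PRICE = 20.0
COST_AT_FULL_USE = 250.0

def profit_per_user(mean_utilization):
    """Average monthly profit (negative = loss) per subscriber."""
    return PRICE - mean_utilization * COST_AT_FULL_USE

worst = profit_per_user(1.00)    # everyone maxes out the quota: deep loss
typical = profit_per_user(0.05)  # most people barely use it: profitable
```

So both claims can be true at once: a maxed-out subscription loses money, while the average subscriber is profitable.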
From what I've gathered, they've been mostly training-limited. Better training methods and cleaner training data allow smaller models to rival or outperform larger models trained with older methods and lower-quality data.
For example, the Qwen3 technical report[1] says that the Qwen3 models are architecturally very similar to Qwen2.5, with the main change being a tweak in the attention layers to stabilize training. And if you compare table 1 in the Qwen3 paper with table 1 in the Qwen2.5 technical report[2], the layer count, attention configuration and such are very similar. Yet Qwen3 was widely regarded as a significant upgrade over Qwen2.5.
However, for training, they doubled the pre-training token count and tripled the number of languages. It's been shown that training on more languages can actually help LLMs generalize better. They used Qwen2.5 VL and Qwen2.5 to generate additional training data by parsing a large number of PDFs and turning them into high-quality training tokens. They improved their annotation so they could more effectively provide diverse training tokens to the model, improving training efficiency.
They continued this trend with Qwen3.5, where even more and better training data[3] made their Qwen3.5-397B-A17B model match the 1T-parameter Qwen3-Max-Base.
That said, there's also been a lot of work on model architecture[4], getting more speed and quality per parameter. In the case of the Qwen3-Next architecture, which 3.5 is based on, that means things like hybrid attention for faster long-context operation, and sparse MoE and multi-token prediction for less compute per output token.
I used Qwen as an example here, from what I gather they're just an example of the general trend.
Similar trend in open text-to-image models: Flux.1 was 12B but now we have 6B models with much better quality. Qwen Image goes from 20B to 7B while merging the edit line and improving quality. Now that the cost of spot H200s at 140GB came down to A100 levels, you can finally try larger scale finetuning/distillation/rl with these models. Very promising direction for open tools and models if the trend continues.
> I doubt frontier models have actually substantially grown in size in the last 1.5 years
... and you'd be most likely very correct with your doubt, given the evidence we have.
What improved disproportionately more than the software or hardware side is density[1] per parameter, indicating that there's a "Moore's Law"-esque relationship between the number of parameters, the density per parameter, and compute requirements. As long as more and more information/abilities can be squeezed into the same number of parameters, inference will become cheaper and cheaper, quicker and quicker.
I write "quicker and quicker" because, next to improvements in density, there will still be additional architectural, software, and hardware improvements. It's almost as if it's going exponential and we're heading for a so-called Singularity.
Since it's far more efficient and "intelligent" to have many small models competing with and correcting each other for the best possible answer, in parallel, there simply is no need for giant, inefficient, monolithic monsters.
They ain't gonna tell us that, though, because then we'd know that we don't need them anymore.
[1] for lack of a better term that I am not aware of.
I'd suggest that a measure like 'density[1]/parameter' as you put it will asymptotically rise to a hard theoretical limit (that probably isn't much higher than what we have already). So quite unlike Moore's Law.
Obviously, there’s a limit to how much you can squeeze into a single parameter. I guess the low-hanging fruit will be picked soon, and scaling will continue with algorithmic improvements in training, like [1], to keep the training compute feasible.
I take "you can't have human-level intelligence without roughly the same number of parameters (hundreds of trillions)" as a null hypothesis: true until proven otherwise.
Why don't we need them? If I need to run a hundred small models to get a given level of quality, what's the difference to me between that and running one large model?
You can run smaller models on smaller compute hardware and split the compute. For large models you need to be able to fit the whole model in memory to get any decent throughput.
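The memory point is simple arithmetic: the weights-only footprint is parameter count times bytes per parameter (example sizes below are illustrative; KV cache, activations, and runtime overhead are ignored):

```python
# Back-of-envelope weights-only memory footprint: parameter count times
# bytes per parameter (ignores KV cache, activations, runtime overhead).
def weight_gb(params_in_billions, bytes_per_param):
    """Billions of params x bytes/param conveniently equals gigabytes."""
    return params_in_billions * bytes_per_param

one_t_fp16 = weight_gb(1000, 2)  # a 1T-param model at fp16: needs a cluster
small_int8 = weight_gb(7, 1)     # a 7B-param model at int8: one consumer GPU
```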
It's unfair to take some high number like that; it either just reflects disagreement, or assumes that size-equality is meaningful.
> level of quality
What is quality, though? What is high quality, though? Do MY FELLOW HUMANS really know what "quality" is comprised of? Do I hear someone yell "QUALITY IS SUBJECTIVE" from the cheap seats?
I'll explain.
You might care about accuracy (repetition of learned/given text) more than about actual cognitive abilities (clothesline/12 shirts/how long to dry).
From my perspective, the ability to repeat given/learned text has nothing to do with "high quality". Any idiot can do that.
Here's a simple example:
Stupid doctors exist. Plentifully so, even. Every doctor can pattern-match symptoms to medication or further tests, but not every doctor is capable of recognizing when two seemingly different symptoms are actually connected. (simple example: a stiff neck caused by sinus issues)
There is not one person on the planet, who wouldn't prefer a doctor who is deeply considerate of the complexities and feedback-loops of the human body, over a doctor who is simply not smart enough to do so and, thus, can't. He can learn texts all he wants, but the memorization of text does not require deeper understanding.
There are plenty of benefits for running multiple models in parallel. A big one is specialization and caching. Another is context expansion. Context expansion is what "reasoning" models can be observed doing, when they support themselves with their very own feedback loop.
One does not need "hundred" small models to achieve whatever you might consider worthy of being called "quality". All these models can not only reason independently of each other, but also interact contextually, expanding each other's contexts around what actually matters.
They also don't need to learn all the information about "everything", like big models do. It's simply not necessary anymore. We have very capable systems for retrieving information and feeding it to models with gigantic context windows, if needed. We can create purpose-built models. Density per parameter is always increasing.
Multiple small models, specifically trained for high reasoning/cognitive capabilities, given access to relevant texts, can disseminate multiple perspectives on a matter in parallel, boosting context expansion massively.
A single model cannot refactor its own chain of thought during an inference run, which makes it massively inefficient. A single model can only give itself feedback sequentially, while multiple models can do it all in parallel.
See ... there are two things which cover the above fundamentally:
1. No matter how you put it, we've learned that models are "smarter" when there is at least one feedback-loop involved.
2. No matter how you put it, you can always have yet another model process the output of a previously run model.
These two things, in combination, strongly indicate that multiple small, high-efficiency models running in parallel, providing themselves with the independent feedback they require to actually expand contexts in depth, is the way to go.
Or, in other words:
Big models scale Parameters, many small models scale Insight.
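The "many small models correcting each other" loop can be caricatured in a few lines. The "models" here are stand-in functions (a real system would call actual LLMs), so this only illustrates the aggregation idea, not a working system:

```python
from collections import Counter

# Caricature of "many small models correcting each other": each "model" is a
# stand-in function, and a trivial cross-check keeps the majority answer.
def model_a(q): return "4" if q == "2+2" else "?"
def model_b(q): return "4" if q == "2+2" else "?"
def model_c(q): return "5" if q == "2+2" else "?"  # one model is simply wrong

def ensemble(question, models):
    """Run every model (conceptually in parallel), keep the majority answer."""
    answers = [m(question) for m in models]
    return Counter(answers).most_common(1)[0][0]

answer = ensemble("2+2", [model_a, model_b, model_c])
```

One wrong model gets outvoted, which is the claimed benefit of running several small models instead of one big one.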
> There is not one person on the planet, who wouldn't prefer a doctor who is deeply considerate of the complexities and feedback-loops of the human body, over a doctor who is simply not smart enough to do so and, thus, can't. He can learn texts all he wants, but the memorization of text does not require deeper understanding.
But a smart person who hasn’t read all the texts won’t be a good doctor, either.
Chess players spend enormous amounts of time studying openings for a reason.
> Multiple small models, specifically trained for high reasoning/cognitive capabilities, given access to relevant texts
So, even assuming that one can train a model on reasoning/cognitive abilities, how does one pick the relevant texts for a desired outcome?
Bitter Lesson is about exploration and learning from experience. So RL (Sutton's own field) and meta learning. Specialized models are fine from Bitter Lesson standpoint if the specialization mixture is meta learned / searched / dynamically learned&routed.
The corollary to the bitter lesson is that on any market-meaningful time scale, a human-crafted solution will outperform one that relies on compute and data. It's only on time scales over 5 years that your bespoke solution will be overtaken. By which point you can hand-craft a new system that uses the brute-force model as part of it.
Repeat ad nauseam.
I wish the people who quote the blog post actually read it.
> Parameter count was used as a measure for how great the proprietary models were until GPT3, then it suddenly stopped.
AFAICT that's mostly because what you're getting when you select a "model" from most of these cloud chat model providers today, isn't a specific concrete model, but rather is a model family, where your inference request is being routed to varying models within the family during the request. There's thus no one number of weights for "the model", since several entirely-independent models can be involved in generating each response.
And to be clear, I'm not just talking about how selecting e.g. "ChatGPT 5.2" sometimes gets you a thinking model and sometimes doesn't, etc.
I'm rather saying that, even when specifically requesting the strongest / most intelligent "thinking" models, there are architectural reasons that the workload could be (and probably is) routed to several component "sub-models", that handle inference during different parts of the high-level response "lifecycle"; with the inference framework detecting transition points in the response stream, and "handing off" the context + response stream from one of these "sub-models" to another.
(Why? Well, imagine how much "smarter" a model could be if it had a lot more of its layers available for deliberation, because it didn't have to spend so many layers on full-fat NLP parsing of input or full-fat NLP generation of output. Split a model into a pipeline of three sub-models, where the first one is trained to "just understand" — i.e. deliberate by rephrasing whatever you say to it into simpler terms; the second one is trained to "just think" — i.e. assuming pre-"understood" input and doing deep scratch work in some arbitrary grammar to eventually write out a plan for a response; and the third one is trained to "just speak" — i.e. attend almost purely to the response plan and whatever context-tokens that plan attends to, to NLP-generate styled prose, in a given language, with whatever constraints the prompt required. Each of these sub-models can be far smaller and hotter in VRAM than a naive monolithic thinking model. And these sub-models can make a fixed assumption about which phase they're operating in, rather than having to spend precious layers just to make that determination, over and over again, on every single token generation step.)
And, presuming they're doing this, the cloud provider can then choose to route each response lifecycle phase to a different weight-complexity-variant for that lifecycle phase's sub-model. (Probably using a very cheap initial classifier model before each phase: context => scalar nextPhaseComplexityDemand.) Why? Because even if you choose the highest-intelligence model from the selector, and you give it a prompt that really depends on that intelligence for a response... your response will only require a complex understanding-phase sub-model if your input prose contained the high-NLP-complexity tokens that would confuse a lesser understanding-phase sub-model; and your response will only require a complex responding-phase sub-model if the thinking-phase model's emitted response plan specifies complex NLP or prompt-instruction-following requirements that only a more-complex responding-phase sub-model knows how to manage.
Which is great, because it means that now even when using the "thinking" model, most people with most requests are only holding a reservation on a GPU holding a copy of the (probably still hundreds-of-billions-of-weights) high-complexity-variant thinking-phase sub-model weights, for the limited part of that response generation lifecycle where the thinking phase is actually occurring. During the "understanding" and "responding" phases, that reservation can be released for someone else to use! And for the vast majority of requests, the "thinking" phase is the shortest phase. So users end up sitting around waiting for the "understanding" and "responding" phases to complete before triggering another inference request. Which brings the per-user duty cycle of thinking-phase sub-model use way down.
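A minimal sketch of the phase-routing idea described above, with a stand-in complexity score in place of a real classifier. All model names, sizes, and thresholds are invented; this is speculation about an architecture, not any vendor's actual system:

```python
# Hypothetical per-phase routing: a cheap scoring step decides whether each
# phase of the response lifecycle goes to a small or a large variant of that
# phase's sub-model. All names, sizes, and thresholds are invented.
VARIANTS = {
    "understand": {"small": "understand-8b", "large": "understand-70b"},
    "think":      {"small": "think-30b",     "large": "think-300b"},
    "respond":    {"small": "respond-8b",    "large": "respond-70b"},
}

def complexity(text):
    """Stand-in score; a real system might run a tiny classifier model here."""
    return min(1.0, len(set(text.split())) / 50)

def route(phase, context, threshold=0.5):
    """Pick the weight-complexity variant for this lifecycle phase."""
    size = "large" if complexity(context) > threshold else "small"
    return VARIANTS[phase][size]

simple = route("think", "what is 2+2")                       # low complexity
hard = route("think", " ".join(f"w{i}" for i in range(60)))  # high complexity
```

The big model's GPU reservation is then only held for the phases that actually score as complex.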
It's the same thing. Quantize your parameters? The "bigger" model runs faster. MoE base-model distillation? The "bigger" model runs as a smaller model.
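As a concrete instance of the quantization point: symmetric int8 quantization stores each weight in one byte instead of four, at the cost of a bounded rounding error. A minimal numpy sketch (toy matrix; real deployments quantize per-channel or per-group):

```python
import numpy as np

# Minimal symmetric int8 quantization of a weight matrix: each weight is
# stored in 1 byte instead of 4, at the cost of a bounded rounding error.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)

scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)    # what you store: 1 byte/param
w_restored = w_int8.astype(np.float32) * scale  # dequantized at load time

max_err = np.abs(w - w_restored).max()          # bounded by scale / 2
```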
There is no gain for anyone anywhere in reducing parameter count overall, if that's what you mean. That sounds more like you don't like transformer models than a real performance concern.
That’s such an amazing story. I can relate to much of the emotional, time, and financial commitment of a non-profit, and keeping at it like you have is absolutely commendable. Seeing you succeed is an inspiration to continue doing the right thing, for me and for many others!
+1, canceled all OpenAI and switched to Gemini hours after it dropped. I was tired of vaporware AI, facts obfuscated by hallucinations, and promises of future improvements.
Remarkably, some claim AI has now discovered a new drug candidate on its own. Reading the preprint (https://www.biorxiv.org/content/10.1101/2025.04.14.648850v2....), it appears the model was targeted at just one very specific task, without evaluating other models on the same task. I know nothing about genomics, but I can see that this is an important advance. Still, it seems a bit headline-grabbing to claim victory for one model without comparing against others using the same process.
If a simple majority classifier has the same performance as a fancy model with 58 layers of transformers, and you use your fancy model instead of the majority classifier, is it the model that's doing the discovery, or the operator who chose to look in a particular place?
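The majority-classifier baseline in question is a one-liner, which is exactly why it makes a useful sanity check. The labels below are made up for illustration:

```python
from collections import Counter

# Majority-class baseline: always predict the most common training label.
# If a big model can't beat this, the "discovery" credit belongs to whoever
# chose where to look, not to the model. Labels here are made up.
def majority_baseline(train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _x: majority

train_labels = ["inactive"] * 90 + ["active"] * 10
clf = majority_baseline(train_labels)

test_labels = ["inactive"] * 9 + ["active"]
preds = [clf(x) for x in test_labels]
accuracy = sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)
```

On a 90/10 class split, this "model" already scores around 90% accuracy while learning nothing, which is why headline numbers need a baseline comparison.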
I am all for crediting humans and I don't particularly fancy all the anthropomorphising myself. However rubbing it in now feels similarly pointless as suggesting the US should switch to metric.
Well it's important, because the particular new lead for drug targeting is not super valuable, they are a dime a dozen, easier to find than a startup idea. Actually driving a successful drug development program is an entirely different matter that can only be established with $10-$100M of early exploration, with a successful drug costing much more to get to market.
It could also be that particular prioritization method that uses Gemma is useful in its own, but we won't know that unless it is somehow benchmarked against the many alternatives that have been used up until now. And in other benchmark settings, these cell sentence methods have not been that impressive.
Would be great if more data were made available by OP to peer-review some of this. That said, making money from failure starts to look like a business model, which is highly unethical. Why help customers succeed when you lose money doing so?