indeed30's comments | Hacker News

I don’t think you can do anything sensible here without making much stronger modelling assumptions. A vanilla non-parametric bootstrap is only valid under a very specific generative story: IID sampling from a population. Many (most?) curve-fitting problems won't satisfy that.

For example, suppose you measure the decay of a radioactive source at fixed times t = 0,1,2,... and fit y = A e^{-kt}. The only randomness is small measurement error with, say, SD = 0.5. The bootstrap sees the huge spread in the y-values that comes from the deterministic decay curve itself, not from noise. It interprets that structural variation as sampling variability and you end up with absurdly wide bootstrap confidence intervals that have nothing to do with the actual uncertainty in the experiment.
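A quick numerical sketch of the mismatch (with assumed values A = 100, k = 0.5, and measurement SD = 0.5, purely for illustration): the spread in the y-values that a naive bootstrap would treat as sampling variability is dominated by the deterministic curve, not the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic decay sampled at fixed times, plus small measurement noise.
# A, k, and noise_sd are assumed values for illustration only.
A, k, noise_sd = 100.0, 0.5, 0.5
t = np.arange(0, 11)
y = A * np.exp(-k * t) + rng.normal(0.0, noise_sd, size=t.size)

# The spread a naive bootstrap "sees" in the y-values comes mostly from the
# deterministic decay curve, not from the measurement error.
structural_spread = np.std(y)
print(structural_spread)  # tens of units, vs. a true noise SD of 0.5
```

Any bootstrap interval built from that spread will be wildly out of proportion to the actual 0.5-unit measurement uncertainty.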


These are all big topics, but any "parametric curve fitting" like this tool uses is parameter estimation (the parameters of the various curves). That already makes strong modeling assumptions (usually including IID, Gaussian noise, etc.) to get the parameter estimates in the first place. I agree it would be even better to have ways to input measurement errors (in both x and y!) per your example, and to have non-bootstrap options (I only said "probably"), residual diagnostics, etc.

Maybe a residuals plot and IID tests on the residuals (i.e., tests of some of those strong assumptions!) would be a better next step for the author than error estimates, but I stand by my original feedback. Right now even the simplest case of a straight-line fit is reported with only exact slope & intercept (well, not exact, but to an almost surely meaningless 16 decimal places!), though I guess he thought to truncate the goodness-of-fit measures at ~4 digits.
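One simple diagnostic of the kind suggested above, sketched on hypothetical straight-line data: fit the line, then check the lag-1 autocorrelation of the residuals, which should be near zero if the residuals really are IID.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical straight-line data with IID Gaussian noise (assumed values).
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.size)

# Fit the line and compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Lag-1 autocorrelation: roughly zero if the residuals are IID; a large
# value suggests the model (or the IID assumption) is wrong.
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(round(slope, 3), round(intercept, 3), round(r1, 3))
```

A residuals-vs-x plot plus a statistic like this would surface most gross violations without any extra user input.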


I think we are just coming at this from different angles. I do understand and agree that we are estimating the parameters of the fit curves.

> That already makes strong modeling assumptions (usually including IID, Gaussian noise, etc.,) to get the parameter estimates in the first place

You lose me here - I don't agree with "usually". I guess you're thinking of examples where you are sampling from a population and estimating features of that population. There's nothing wrong with that, but that is a much smaller domain than curve fitting in general.

If you give me a set of x and y values, I can fit a parametric curve that minimizes the average squared distance between fitted and observed values of y without making any assumptions whatsoever. This is a purely mechanical, non-stochastic procedure.

For example, if you give me the points {(0,0), (1,1), (2,4), (3,9)} and the curve y = a x^b, then I'm going to fit a=1, b=2, and I certainly don't need to assume anything about the data generating process to do so. However there is no concept of a confidence interval in this example - the estimates are the estimates, the residual error is 0, and that is pretty much all that can be said.
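The example above can be reproduced mechanically. Since the points lie exactly on y = x^2, a log-log linear fit recovers a = 1, b = 2 with zero residual (a numpy sketch; the (0, 0) point is dropped before taking logs, and the fitted curve still passes through it).

```python
import numpy as np

# Points from the example; (0, 0) is excluded before taking logs.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 4.0, 9.0])

# Fitting y = a * x**b is linear in log space: log y = log a + b * log x.
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
a = np.exp(log_a)
print(a, b)  # a ≈ 1, b ≈ 2

resid = y - a * x**b
print(np.max(np.abs(resid)))  # ≈ 0
```

Minimizing in log space is a convenience here rather than direct least squares in y; because the data lie exactly on the curve, the two coincide, and no stochastic assumptions entered anywhere.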

If you go further and tell me that each of these pairs (x,y) is randomly sampled, or maybe the x is fixed and the y is sampled, then I can do more. But that is often not the case.


What methods can you use to estimate the standard error in this case?


The radioactive decay example specifically? Fit A and k (e.g. by nonlinear least squares) and then use the Jacobian to obtain the approximate covariance matrix. The square roots of the diagonal elements of that matrix give you the standard error estimates.
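A sketch of this with scipy, which returns exactly that Jacobian-based covariance matrix from `curve_fit` (the values A = 2, k = 0.3, and the tiny noise SD are assumed for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

def decay(t, A, k):
    return A * np.exp(-k * t)

# Simulated decay data with small known measurement noise (assumed values).
t = np.linspace(0, 10, 50)
y = decay(t, 2.0, 0.3) + rng.normal(0.0, 0.02, size=t.size)

# curve_fit returns the estimates and the approximate covariance matrix
# built from the Jacobian at the solution.
popt, pcov = curve_fit(decay, t, y, p0=[1.0, 1.0])
se = np.sqrt(np.diag(pcov))  # standard errors for A and k
print(popt, se)
```

Note these standard errors inherit the usual local-linearization and Gaussian-noise assumptions, so they are approximate.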


As long as UK taxes are flow-based and not stock-based, it seems a bit silly to base analysis on a stock-based denominator like the number of millionaires.


Only sound comment here.

You can usually tell a message board's prevailing politics by seeing which stuff gets demands for rigor and which stuff is accepted as-is.

It's a progressive organization releasing a "study".


I wouldn’t call the embedding layer "separate" from the LLM. It’s learned jointly with the rest of the network, and its dimensionality is one of the most fundamental architectural choices. You’re right though that, in principle, you can pick an embedding size independent of other hyperparameters like number of layers or heads, so I see where you're coming from.

However, the embedding dimension sets the rank of the token representation space. Each layer can transform or refine those vectors, but it can't expand their intrinsic capacity. A tall but narrow network is bottlenecked by that width. Width-first scaling tends to outperform pure depth scaling: you want enough representational richness per token before you start stacking more layers of processing.
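The width bottleneck can be seen with plain linear algebra: the token representation matrix has only d_model columns, so its rank never exceeds the embedding width no matter how many layers you stack (a toy numpy sketch with hypothetical sizes).

```python
import numpy as np

rng = np.random.default_rng(3)

vocab, d_model, n_layers = 100, 8, 12  # hypothetical sizes

# One-hot "tokens" projected through a narrow embedding...
tokens = np.eye(vocab)
emb = rng.normal(size=(vocab, d_model))
h = tokens @ emb

# ...then through many layers. The representation matrix stays 100 x 8,
# so depth never recovers the capacity lost to the narrow width.
for _ in range(n_layers):
    h = np.tanh(h @ rng.normal(size=(d_model, d_model)))

print(np.linalg.matrix_rank(h))  # at most d_model = 8
```

This is only the linear-algebraic part of the story, but it makes the "bottlenecked by width" intuition concrete.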

So yeah, embedding size doesn’t have to scale up in lockstep with model size, but in practice it usually does, because once models grow deeper and more capable, narrow embeddings quickly become the limiting factor.


I hear you, but the article is talking specifically about "embeddings as a product" -- not the embeddings that are within an LLM architecture. It starts:

> As a quick review, embeddings are compressed numerical representations of a variety of features (text, images, audio) that we can use for machine learning tasks like search, recommendations, RAG, and classification.

Current standalone embedding models are not intrinsically connected to SotA LLM architectures (e.g. the Qwen reference) -- right? The article seems to mix the two ideas together.


I think it's actually (2020)


Ten years ago, I worked for a company that had billions of sensor readings from mobile phones. The idea was to use crowdsourced data to create truly detailed, real-world coverage maps, and then sell that data to marketing and network operations teams at telcos.

We used reverse geocoding extensively — but never down to street addresses, always to a higher level. We wanted to split measurements by country, region, city — any geographic unit. When you deal with country borders, you get a lot of weird measurements as phones roam onto foreign networks. We weren’t interested in reporting on the experience of users roaming while abroad, so we needed shapefiles good enough to filter all that out and to partition the rest of the data cleanly.

We built a 30-machine Spark cluster on AWS back when Spark was still super early — around v0.7, definitely before 1.0. At the time, you pretty much had to use Scala with Spark if you cared about performance. Most of the workload was point-in-polygon tests. Before that, we were using a brutally hacky pipeline involving PostGIS, EMR, and Pig, and it was hell.
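The core of that workload is the classic ray-casting point-in-polygon test, sketched here in Python rather than the Scala we actually used (the polygon coordinates are hypothetical, not a real border):

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test: cast a ray from (px, py) to the right and count
    edge crossings. `polygon` is a list of (x, y) vertices; an odd number
    of crossings means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line y = py?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses that line.
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

# A rough bounding polygon (hypothetical coordinates, not a real border).
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon(2.0, 2.0, square))  # True
print(point_in_polygon(5.0, 2.0, square))  # False
```

In practice you'd run this against detailed country/region shapefiles and add an R-tree or bounding-box prefilter, since testing billions of points against every polygon naively is exactly the kind of thing that needed a cluster.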

It was incredibly fun, but looking back now, I can see so clearly all the mistakes I made.


I just jotted down your closing sentence. Equally insightful and touching.


I would be pretty confident that Disney has a pricing team whose entire job is to model those effects.


And I'm sure they'll do what every pricing model analyst does, which is simplify things to N = 1, average across all factors to determine what the market will bear according to GDP, and miss how wealth inequality plays into these market dynamics and the potential customer base. The price is affordable as per the model, but in reality the only people who can afford it are the fraction who hold most of the wealth, because N is not 1 but 8.2B, of whom only a fraction have money to spend on frivolities while the rest fight tooth and nail for food, housing, and healthcare.


So, can somebody in the know speculate about how Deepseek (or OpenAI, or whoever really) is actually running their API?

If I wanted to run a production-grade service using the full Deepseek model, with good tokens/sec and the ability to serve concurrent requests, what sort of hardware are we looking at?


Racks and racks of servers (likely NVIDIA HGX H100/H200 8-GPU servers) connected by at least 100 Gb/s (but more likely 400 and 800 Gb/s) links. The servers alone start at about $350k each. Then you need to supply power, cooling, networking, and a technical team to support the program.


That's interesting - I believe this is exactly how Sequelize implements soft-deletion.


To discover where else you then subsequently forwarded it.

I'm not suggesting this is actually a problem, but that's how an argument could go.


I don't disagree with what you say, but one difference is that we generally hold these people accountable and often shift liability to them when they are wrong (though not always, admittedly), which is not something I have ever seen done with any AI system.


This sounds like an argument in favor of AI personhood, not an argument against AI experts.

