Hacker News | kimukasetsu's comments

As others have pointed out, it is a good idea to encode domain knowledge in your time series model through its specification and priors. Prophet rarely beats a well-specified GLM or SARIMA in real-world applications, especially when uncertainty estimates are needed. Professionally, I have successfully applied Gaussian Processes to many such cases.

A GP is an intuitive and expressive way to encode temporal covariance in a model. A famous example is the birthdays model, which decomposes the relative number of births per day; it is discussed by Gelman et al. in Bayesian Data Analysis and here [1].

[1] https://avehtari.github.io/casestudies/Birthdays/birthdays.h...
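To give a concrete feel for what a GP time-series model looks like, here is a minimal from-scratch sketch on synthetic daily data (my own toy example, not the birthdays case study): a squared-exponential kernel captures the slow trend and a periodic kernel captures weekly seasonality, and their sum defines the covariance.

```python
import numpy as np

def sq_exp(x1, x2, ls=30.0, var=1.0):
    # Squared-exponential kernel: smooth long-term trend.
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def periodic(x1, x2, period=7.0, ls=1.0, var=0.5):
    # Periodic kernel: weekly seasonality.
    d = np.abs(x1[:, None] - x2[None, :])
    return var * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ls ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=0.1):
    # Standard GP regression posterior mean: K_* (K + sigma^2 I)^-1 y.
    K = sq_exp(x_train, x_train) + periodic(x_train, x_train)
    K += noise ** 2 * np.eye(len(x_train))
    K_s = sq_exp(x_test, x_train) + periodic(x_test, x_train)
    return K_s @ np.linalg.solve(K, y_train)

# Synthetic daily series: slow drift plus a weekly cycle plus noise.
rng = np.random.default_rng(0)
x = np.arange(100.0)
y = 0.01 * x + 0.5 * np.sin(2 * np.pi * x / 7) + 0.05 * rng.standard_normal(100)

# Forecast the next two weeks.
x_new = np.arange(100.0, 114.0)
pred = gp_posterior_mean(x, y, x_new)
```

In practice you would learn the kernel hyperparameters (length scales, variances) from the data rather than fix them, and a full Bayesian treatment would also give you the posterior covariance for uncertainty bands.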


The biggest mistake engineers make is in determining sample sizes. It is not trivial to pick a sample size for a trial without prior knowledge of effect sizes. Instead of waiting for a fixed sample size, I would recommend a sequential testing framework: set a stopping condition and run a test on each new batch of sample units.

This is called optional stopping, and it is not valid with a classic t-test, since its Type I and II error rates hold only at a predetermined sample size. Other tests do allow it, however: see safe anytime-valid statistics [1, 2] or, more simply, Bayesian testing [3, 4].

[1] https://arxiv.org/abs/2210.01948

[2] https://arxiv.org/abs/2011.03567

[3] https://pubmed.ncbi.nlm.nih.gov/24659049/

[4] http://doingbayesiandataanalysis.blogspot.com/2013/11/option...
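As a minimal sketch of the Bayesian flavor of this (a toy Beta-Binomial conversion test of my own, not an implementation from the linked papers): after each batch, update the Beta posteriors for the two arms and stop as soon as the posterior probability that one arm beats the other crosses a threshold.

```python
import random

random.seed(42)

def prob_b_beats_a(a_succ, a_fail, b_succ, b_fail, draws=20_000):
    # Monte Carlo estimate of P(p_B > p_A) under flat Beta(1, 1) priors.
    wins = 0
    for _ in range(draws):
        p_a = random.betavariate(1 + a_succ, 1 + a_fail)
        p_b = random.betavariate(1 + b_succ, 1 + b_fail)
        wins += p_b > p_a
    return wins / draws

def run_test(batches, threshold=0.95):
    # Sequential loop: stop once the posterior is confident either way.
    a_succ = a_fail = b_succ = b_fail = 0
    p = 0.5
    for s_a, f_a, s_b, f_b in batches:
        a_succ += s_a; a_fail += f_a
        b_succ += s_b; b_fail += f_b
        p = prob_b_beats_a(a_succ, a_fail, b_succ, b_fail)
        if p > threshold or p < 1 - threshold:
            return p, True   # stopped early with a confident result
    return p, False          # batch budget exhausted, inconclusive

# Toy data: B converts 70% vs A's 50% in batches of 100 users per arm.
p, stopped = run_test([(50, 50, 70, 30)] * 10)
```

With a clear difference like this, the loop stops on the first batch; with a null effect it typically runs to the end and reports an inconclusive result, which is exactly the behavior that makes peeking safe in this framework.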


People often don’t determine sample sizes at all! And doing power calculations without an idea of the effect size isn’t just hard, it’s impossible: effect size is one of the inputs to the formula. But at least the calculation is fast, so you can sort of guess and check.
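A back-of-the-envelope sketch of that formula (the standard normal approximation for a two-sample test; the function name is mine) makes the dependence on effect size obvious:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    # Per-group sample size to detect a standardized effect size d
    # with a two-sided two-sample z-test at the given alpha and power.
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # ~0.84 for power = 0.80
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

For example, `n_per_group(0.5)` gives 63 per group, while halving the effect size to 0.25 roughly quadruples it to 252: guessing the effect size wrong by a factor of two changes the required sample by a factor of four.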

Anytime-valid inference helps with this situation, but it doesn’t solve it. If you’re trying to detect a small effect, it’s much better to find out up front that you need a million samples than to learn it after your test, collecting 1,000 samples a day, has run for three years.

Still, anytime-valid is way better than fixed, IMO. Fixed almost never really exists: every A/B testing platform I’ve seen allows peeking.

I work with the author of the second paper you listed. The math looks advanced, but it’s very easy to implement.


The biggest mistake is engineers owning experimentation. It should be owned by data scientists.

I realize that’s a luxury, but I also see this trend at blue-chip companies.


Did a data scientist write this? You don't need to be a member of a priesthood to run experiments. You just need to know what you're doing.


I agree with both sides here. :) DS should own experimentation, AND engineers should be able to run a majority of experiments independently.

As a data scientist at a "blue chip company", my team owns experimentation, but that doesn't mean we run all the experiments. Our role is to create guidelines, processes, and tooling so that engineers can run their own experiments independently most of the time. Part of that is also helping engineers recognize when they're dealing with a difficult/complex/unusual case where they should bring DS in for more bespoke hands-on support. We probably only look at <10% of experiments (either in the setup or results phase or both), because engineers/PMs are able to set up, run, and draw conclusions from most of the experiments without needing us.


... and by some definition you'd be a data scientist yourself. (Regardless of your job title)


I had a similar experience. I took psilocybin a few times, but after the last one I started getting panic attacks and frequently feeling derealization. This led to recurring negative thoughts and I felt really bad for a while. I got better after a year of cognitive behavioral therapy and six months on a light dose of antidepressants. Routine exercise and Zen meditation helped tremendously as well. Get help; it is possible. Your brain can heal itself, with your help.


I have a similar setup with Emacs + Evil + Poly + Quarto Modes. Evil for vim keybindings, Poly to read cells as buffers and Quarto as a superior alternative to Jupytext [1].

[1] https://quarto.org/


This sounds interesting. Would you mind sharing how exactly you have this setup?


This.

Multilevel models relax the assumption of independent observations by specifying that repeated measures of the same experimental unit are dependent on each other. It's a way of telling your model that it has less information than it would have if all observations came from independent units. Consequently, the standard errors of effects are usually larger; without this adjustment, they are biased [1].

Since most researchers are not aware of multilevel models, they design their experiments and aggregate their data to fit the independence assumption, which is rarely a good idea. Many are not even aware of modeling beyond hypothesis tests, and are unable or unwilling to adjust their analyses for confounding factors or non-sampling errors that arise from experiment design flaws.

Also, p-values should be deprecated, since a) nil hypotheses are straw men at best and false by definition at worst [2] and b) they incentivize researchers not to think hard about effect sizes and uncertainty in their problems.

[1] https://academic.oup.com/biomet/article/73/1/13/246001

[2] http://www.stat.columbia.edu/~gelman/research/published/fail...
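To make the "less information" point concrete, here is a toy simulation (my own, not from the references above): 20 subjects are each measured 10 times, so there are 200 observations but only 20 effectively independent units. The naive standard error that pretends all 200 observations are independent is far smaller than the one computed from the 20 subject means.

```python
import math
import random
import statistics

random.seed(1)

# 20 subjects, 10 repeated measures each; per-subject effects induce
# within-cluster correlation (between-subject sd 1.0, noise sd 0.5).
subjects, reps = 20, 10
subject_effect = [random.gauss(0, 1.0) for _ in range(subjects)]
data = [[u + random.gauss(0, 0.5) for _ in range(reps)]
        for u in subject_effect]

# Naive SE: treats all 200 observations as independent.
flat = [y for row in data for y in row]
naive_se = statistics.stdev(flat) / math.sqrt(len(flat))

# Cluster-aware SE: one mean per subject, the effective sample units.
means = [statistics.mean(row) for row in data]
cluster_se = statistics.stdev(means) / math.sqrt(subjects)
```

Here `cluster_se` comes out roughly three times larger than `naive_se`; an analysis ignoring the repeated-measures structure would report confidence intervals that are far too narrow, which is exactly the bias [1] describes.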


Many articles in e.g. Nature Genetics cheat by hacking their p-values. These hacks would be much harder to get away with if, for a start, authors were required to use hierarchical models and continuous explanatory variables whenever possible.

The article from Andrew Gelman you cited explains this quite well. In general the review articles and books he has co-authored are incredibly helpful to learn how to avoid common issues that plague statistical inference.

We need to shift away from null hypotheses and p-values towards generative models, model selection and effect sizes. It leads to much more robust inference.


This. Copilot enables bad programming practices imo, especially through clunky APIs like pandas.


RMarkdown has neither of these issues, and it supports Python. It is baffling to me that most data scientists use Jupyter, since its diffs are meaningless. Its export options are very underwhelming compared to Rmd as well. Notebooks [1] are simply a special case of R Markdown formats. Besides, Rmd are literally text files that work with any text editor, including vim.

[1] https://bookdown.org/yihui/rmarkdown/notebook.html


You can use Jupytext and basically get the best of both worlds (it hooks into jupyterlab to save/restore a markdown version of the notebook). A possible downside is that it doesn't store the outputs of the cells, though that is intended as a feature.

And since RMarkdown just uses pandoc under the hood, it's a bit unfair to say it has better export options than ipynb, which pandoc also supports.


The closest thing to RMarkdown is MyST — the native .md format for the jupyterbook project: https://jupyterbook.org/content/myst.html

I've switched to that instead of notebooks, and loving the text-based life... gittable, diffable, peer-reviewable code, etc.


I strongly agree with this. Not to mention parameter interpretability and, in the case of Bayesian models, uncertainty estimates and convergence diagnostics. Such things are very important when making decisions under uncertainty. Kaggle competitions and empirical benchmarks are very biased samples of model performance in real life.

I feel these two things influence the course of Machine Learning research and communities too much, and this is not good. Most ML researchers and practitioners are barely aware of the latest advances in parametric modelling, which is a shame. Multilevel models let you model response variables with explicit dependence structures. This is done through random (sometimes hierarchical) effects constrained by variance parameters. These parameters regularize the effects themselves and converge very well when fitting factors with high cardinality.

Also, multilevel models are very interesting when it comes to the bias-variance tradeoff. Having more levels in a distribution of random effects actually DECREASES [1] overfitting, which is fascinating.

[1] https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixe...
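A toy illustration of that shrinkage (empirical-Bayes-style partial pooling with the variance components assumed known, which a real multilevel model would estimate from the data): shrinking noisy group means toward the grand mean by the classic ratio of between-group variance to total variance reduces estimation error overall.

```python
import random
import statistics

random.seed(7)

# 50 groups whose true means are drawn from N(0, tau); each group is
# observed only n_obs times with noise sd sigma.
tau, sigma, n_obs = 1.0, 2.0, 5
true_means = [random.gauss(0, tau) for _ in range(50)]
obs_means = [statistics.mean(random.gauss(m, sigma) for _ in range(n_obs))
             for m in true_means]

# Partial pooling: shrink each raw group mean toward the grand mean by
# a factor set by the variance components (assumed known here).
grand = statistics.mean(obs_means)
shrink = tau ** 2 / (tau ** 2 + sigma ** 2 / n_obs)
pooled = [grand + shrink * (m - grand) for m in obs_means]

# Compare estimation error of raw vs partially pooled estimates.
mse_raw = statistics.mean((a - b) ** 2 for a, b in zip(obs_means, true_means))
mse_pooled = statistics.mean((a - b) ** 2 for a, b in zip(pooled, true_means))
```

The pooled estimates have a lower mean squared error than the raw per-group means, and the effect is strongest for groups with little data, which is why adding more levels can decrease rather than increase overfitting.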


While I agree and it is surprising that multi-level/hierarchical modeling is rarely applied in industry (I used them extensively in academia and industry), dealing with hundreds or thousands of random effects in large data sets, especially in non-linear models, is a computational nightmare. And the benefits may not warrant those nightmares.


Finally multi-level/hierarchical modeling is starting to permeate industry thanks to Stan and company.

I use hierarchical modeling regularly to help build Zapier. So do other companies like Generable: https://www.generable.com/

I suspect hierarchical models will become the next “new” hot data structure in software engineering due to their ability to compact logic. https://twitter.com/statwonk/status/1363104221747421184?s=21


I don't know about it permeating the industry. I know, for example, that the model Airbnb used three years ago (things may have changed in the meantime) to forecast occupancy was a random-effects model maintained by a single person in Europe. I don't know how far Generable and companies providing similar probabilistic modeling solutions have penetrated, although I hope they succeed.

When I was working for one of the FAANGs, I was the only one (that I know of) using random effects models, in particular non-linear random effects models with hundreds of random effects. I was using a language/tool faster than Stan (fitting the same model with Stan would have taken hours, or more likely days), but making the models converge was always challenging. In addition, since most of my colleagues had a CS background, were in love with the latest uninterpretable brute-force algorithm, and were scared of a more statistical approach they made no effort to learn, I faced pushback and skepticism despite the model working very well.

I love random effects models, and I built my technical career on them.


I think one of the main reasons is that there is no good Python library for linear mixed effects models. There is no sklearn implementation. There are some libraries that wrap R's lmer (probably using rpy2 or something). The best native Python library I could find is statsmodels, and it has several shortfalls: saving a model to disk consumes hundreds of megabytes; the predict method is nearly useless, since it uses only the fixed effects; multilevel structure beyond a single group is not clearly documented, and the syntax gets ugly if you attempt it, never mind implementing a predict method that uses the random effects. I think once someone does a decent sklearn implementation, it might take off. I've been thinking of writing one as a side project, but I'm not an ML researcher, just a practitioner, so it might suck :)


I used statsmodels for a while ... it's definitely possible to predict on arbitrary inputs, it's just a pain to fiddle the inputs into the right shape ...

