> Partly because it's pretty easy to goose one or two metrics at the expense of a small global regression

That is true, but again I think it's more complicated than that. A/B tests are good for measuring incremental changes, but really bad at predicting the impact of 0-to-1 products and features, because so much of the success or failure depends on public perception and network effects. Twitter couldn't A/B test "Fleets" and have any confidence about what product adoption would look like. So companies have no choice but to launch and see.