I'm curious how this advance update thing is supposed to work. What does disclosing those details look like, actually?
The reason I'm asking is that as these things grow in complexity, even if you join the team that works on these systems, it will probably take you a long time to understand how they really work. Their actual behaviour is likely to remain mysterious a lot of the time because they're driven by data.
Is a high-level description in English OK? Do we need to see pseudocode? The source code? Do they have to open source it? What parts, if it's tied to internal frameworks? If there is ML, do they have to disclose all their secret sauce there? The trained network / weights? The training data, if the algorithm alone is useless without a data set?
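To make that last question concrete, here's a toy sketch (the model, feature names, and weights are all invented, nothing resembling what Google actually runs): with a learned ranker, the "algorithm" is a few lines of arithmetic, and everything interesting lives in the weights and the data they were fit on.

```python
import numpy as np

# Hypothetical learned ranking model: the "algorithm" is just a dot product,
# so disclosing this code alone reveals almost nothing about ranking behaviour.
class LearnedRanker:
    def __init__(self, weights: np.ndarray):
        # All of the interesting behaviour is encoded in these numbers,
        # which were fit on a (private) training set of queries and clicks.
        self.weights = weights

    def score(self, features: np.ndarray) -> float:
        # features: e.g. [term match, link score, freshness, spam signal]
        return float(features @ self.weights)

# Without the real weights (and the data that produced them), the same code
# can rank the same page in completely different ways.
ranker_a = LearnedRanker(np.array([0.9, 0.1, 0.3, -2.0]))
ranker_b = LearnedRanker(np.array([0.1, 0.9, -0.3, 0.0]))
page = np.array([0.5, 0.8, 0.2, 0.1])
print(ranker_a.score(page), ranker_b.score(page))
```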
Any human-initiated change to search algorithms is presumably human-understandable. If someone writes a rule to downrank certain terms or traits of a website, they presumably document it somewhere.
That documentation will need to be shared, and the implementation of the rule change will need to be delayed until the disclosure window has passed.
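Just to illustrate what I mean by a human-written rule, something like the sketch below (the signal names and thresholds are made up, not anything real) is small enough to document, and to disclose, in a sentence or two.

```python
from dataclasses import dataclass

@dataclass
class PageSignals:
    # Hypothetical per-page signals a ranking pipeline might compute.
    ad_density: float          # fraction of the page covered by ads
    interstitial_popup: bool   # shows a full-page popup on load
    base_score: float          # score from the rest of the ranking pipeline

def apply_downrank_rule(page: PageSignals) -> float:
    """Hand-written rule: demote ad-heavy pages with intrusive popups.

    A change like this is human-understandable and easy to document:
    "pages with >50% ad density and an interstitial lose 30% of their score".
    """
    score = page.base_score
    if page.ad_density > 0.5 and page.interstitial_popup:
        score *= 0.7
    return score

print(apply_downrank_rule(PageSignals(ad_density=0.6, interstitial_popup=True, base_score=1.0)))
```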
Honestly, first and foremost, I expect a firehose of documentation, if Google isn't lying about making dozens of changes to its algorithms every day. News companies might need a full-time guy (or team) just to sit there and read through them all.
But on the other hand, a bunch of journalists will have a ton of never-before-seen information about how the world's most powerful companies affect every other company on the planet. That alone is going to be worth some major exclusives.
Also, by the mere nature of being forced to share it, Google and Facebook will have to clean up their acts; they'll have to assume that any change they make that could open them up to legal scrutiny will be found.
You underestimate the complexity here by orders of magnitude. You also overestimate the usefulness to news companies. And you underestimate the harm that bad actors can do.
The search algorithm tells you the order of search results for a particular set of terms. Except that as input you need to feed it a graph of the entire indexed internet, which is re-indexed periodically as the content in the index changes. How does knowing that benefit news companies? What, exactly, would your hypothetical full-time guy/team, equipped with that index at huge cost, tell their company that would justify the time and expense? That they should write interesting content that lots of people consume?
Second, the general approach has been published and is well documented [1], as are its susceptibilities to attack [2]. So there's your algorithm; what does it tell you?
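For anyone who hasn't read [1], the core of that published approach fits in a toy script. This is a minimal PageRank power iteration on a four-page graph, not remotely how Google implements ranking today, but the formula itself is the public part. The part you can't reproduce is the continuously re-crawled graph of the whole web it runs over, plus everything proprietary layered on top.

```python
import numpy as np

# Toy link graph: page -> pages it links to (a four-page "web").
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = sorted(links)
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic transition matrix: M[j, i] = probability of moving i -> j.
M = np.zeros((n, n))
for page, outlinks in links.items():
    for target in outlinks:
        M[idx[target], idx[page]] = 1.0 / len(outlinks)

damping = 0.85
rank = np.full(n, 1.0 / n)

# Power iteration, as described in the original PageRank paper [1].
for _ in range(100):
    rank = (1 - damping) / n + damping * (M @ rank)

print({p: round(float(rank[idx[p]]), 3) for p in pages})
```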
Third, general SEO isn't the problem; it's coordinated attacks that can poison all search results / ads markets if enough detail is known. Google invests heavily [3] to address these areas [4].
Finally, you underestimate how much of a firehose you'd have to drink from: that documentation describes all of the internet.
You might want to note some very important parts of your first-listed source:
> Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline's homepage when the airline's name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.
Larry and Sergey themselves both believed that ad-funded search was problematic, and that a transparent search engine in the academic realm was "crucial".
Unfortunately, Larry and Sergey's price was clearly billions of dollars.
Instead of relying on ignorance to push your agenda, you could frame that paragraph against what search engines were doing at the time, how Google was doing something different, and what Google is still transparent about today [1][2].
Publishing "the algorithm" doesn't have value for most of the internet, it is impractical to use even if it was, and bad actors would use it to destroy search quality to the detriment of e-commerce everywhere.