That is a good starting point, but also something SEOs can very easily game. Obfuscating the usual hints of a popular CMS is not hard.
I wonder whether running a parallel search on Google and filtering their top results out of your own results would be a feasible solution. Add a filter on the top 500 websites, and on whether known ad sources are used, and you might slowly get there.
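A minimal sketch of what that subtractive filter might look like, purely as an assumption on my part: the helpers my_engine_results, google_top_results, and uses_known_ad_networks, plus the TOP_500_DOMAINS list, are hypothetical placeholders, not real APIs.

    from urllib.parse import urlparse

    # Hypothetical placeholder; a real deployment would load the actual top-500 list.
    TOP_500_DOMAINS = {"wikipedia.org", "amazon.com", "pinterest.com"}

    def filter_results(query, my_engine_results, google_top_results,
                       uses_known_ad_networks):
        """Keep only results that Google does not already surface, that are not
        on a top-500 domain, and that do not load known ad networks."""
        theirs = set(google_top_results(query))
        kept = []
        for url in my_engine_results(query):
            domain = urlparse(url).netloc.removeprefix("www.")
            if url in theirs:
                continue  # Google already ranks it highly
            if domain in TOP_500_DOMAINS:
                continue  # big, heavily SEOed site
            if uses_known_ad_networks(url):
                continue  # monetized page
            kept.append(url)
        return kept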
Maybe instead of a smart search engine it would be better to build a dumb one that also gives access to all the metadata of a page, and lets people optimize for themselves. Full text alone is not the only content relevant to good results. Google knows this and uses that metadata, but gives the end user very limited access to it.
I LOVE the idea of a search engine that lowers the rank of results with ads. This might be the key to success in finding projects of love rather than commercial click holes.
Oh. That’s also an idea, but maybe we’d still get tons of ad-less blogs that are just link farms to other sites.
Still, I guess that’s only viable because Google rewards lots of links. If you just disable link relevance, that part of gaming the system will be gone too.
How about TF(PageRank)/IDF(PageRank)? As in, start by ranking the individual result URLs by their PageRank for the given query; but then normalize those rankings by the PageRank each result URL’s domain/origin has across all known queries (this is the IDF part).
Then, the more distinct queries a given website ranks for (i.e. the more SEO battles it wins/the more generally optimal it is at “playing the game”), the less prominently any individual results from said website would be ranked for any given query.
So big sites that people link to for thousands of different reasons (Wikipedia, say) wouldn’t disappear from the results entirely; but they would rank below some person’s hand-written HTML website they made 100% just to answer your question, which only gets linked to on click-paths originating on sites that contain your exact search terms.
This would incentivize creating pages that are actually about one particular thing, while actively punishing not just SEO lead-gen bullshit, not just the keyword-stuffed landing pages we see on most modern corporate sites, but also content centralization in general (i.e. content platforms like Reddit, GitHub, Wikipedia, etc.), while leaving unaffected the actual hosting these platforms do of the kind that puts individual sites on their own domains (e.g. GitHub Pages, WordPress.com, Tumblr, specialty wikis, etc.).
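To make the shape of the idea concrete, here’s a rough sketch under assumptions of my own: per-query PageRank scores are already available in a query_pagerank[query][url] mapping, and the log-damped IDF term is just one plausible way to make “ranks for many distinct queries” count against an origin without zeroing it out entirely.

    import math
    from urllib.parse import urlparse

    def origin(url):
        return urlparse(url).netloc

    def rank_results(query, query_pagerank):
        """Score each URL by its PageRank for this query (the 'TF' part),
        damped by how many distinct queries its origin ranks for overall
        (the 'IDF' part)."""
        all_queries = list(query_pagerank)
        tf = query_pagerank[query]  # url -> PageRank for this query

        # Count, for every origin, how many distinct queries it ranks for.
        queries_won = {}
        for q in all_queries:
            for url in query_pagerank[q]:
                o = origin(url)
                queries_won[o] = queries_won.get(o, 0) + 1

        n = len(all_queries)
        scored = []
        for url, pr in tf.items():
            # Always positive, so broadly-ranking origins sink but never vanish.
            idf = math.log(1.0 + n / (1.0 + queries_won.get(origin(url), 0)))
            scored.append((pr * idf, url))
        return [url for _, url in sorted(scored, reverse=True)]

In this sketch a single-purpose personal page keeps most of its per-query PageRank, while a Wikipedia article gets roughly the treatment a stop-word gets in TF-IDF: still present, just no longer dominant.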
———
A fun way to think of this is that it’s similar to using a ladder ranking system (usually used for competitive games) to solve the stable-marriage problem on a dating site.
In such a system, you have two considerations:
• you want people to find someone who’s highly compatible with them, i.e. someone who ranks for their query
• you want to optimize for relationship length; and therefore, you want to lower the ranking of matches that, while theoretically compatible, would result in high relationship stress/tension.
Satisfying just the first constraint is pretty simple (and gets you a regular dating site.) To satisfy the second constraint, though, you need some way of computing relationship stress.
One large (and, more importantly, “amenable to analysis”) source of relationship stress comes from matches between highly-sought-after and not-highly-sought-after people, i.e. matches where one partner is “out of the league of” the other.
So, going with just that source for now (as fixing just that source of stress would go a long way to making a better dating site), to compute it, you would need some way to 1. globally rank users, and then 2. measure the “distance” between two users in this ranking.
The naive way of globally ranking users is with arbitrary heuristics. (OKCupid actually does this in a weak sense, sharding its users between two buckets/leagues: “very attractive” and “everyone else.”)
But the optimal way of globally ranking users, specifically in the context of a matching problem, is (AFAICT) with IDF(PageRank): a user’s “global rank” can just be the percentage of compatibility-queries that highly rank the given user. This is, strictly speaking, a measure of the user’s “optionality” in the dating pool: the number of potential suitors looking at them, that they can therefore choose between.
If you put each user on a global ladder by this “optionality” ranking, and normalize the returned compatibility-query results by the candidates’ positions on that ladder, then you’re basically returning a result set (partially) optimized for stability-of-relationship: compatibility over delta-optionality.
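A minimal sketch of that ladder, again under assumptions of mine: a compatibility(a, b) score in [0, 1] already exists, the 0.5 cutoff for “ranks highly” is arbitrary, and the 1 + |delta| damping is just one way to express “compatibility over delta-optionality.”

    def optionality(user, everyone, compatibility, cutoff=0.5):
        """Fraction of the pool whose compatibility query would rank this user
        highly -- i.e. how many suitors this user gets to pick between."""
        others = [u for u in everyone if u != user]
        hits = sum(1 for u in others if compatibility(u, user) >= cutoff)
        return hits / max(len(others), 1)

    def matches_for(seeker, everyone, compatibility):
        """Rank candidates by compatibility, damped by the optionality gap."""
        ladder = {u: optionality(u, everyone, compatibility) for u in everyone}
        scored = []
        for candidate in everyone:
            if candidate == seeker:
                continue
            compat = compatibility(seeker, candidate)
            delta = abs(ladder[seeker] - ladder[candidate])
            scored.append((compat / (1.0 + delta), candidate))
        return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)]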
———
All this leads back to a clean metaphor: highly-SEOed websites—or just large knots of Internet centralization—are like famous attractive people. “Everyone” wants to get with them; but that means that they’re much less likely to meet your individual needs, if you were to end up interacting with them. Ideally, you want a page that’s “just for you.” A page with low optionality, that can’t help but serve your particular needs.
Then we might go back to something sort of interesting.