> PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).
Yes, exactly.
> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.
I just skimmed the paper again (has been a while since a close reading), but my refreshed understanding is:
* Like the B-field, there are also false positives.
* When multiple hashed keys (k-mers) collide in the Kraken2 hash table, it has to store a "reduced" value for those key-value pairs. While there's an elegant solution for this issue for the problem of taxonomic classification (lowest common ancestor), it still results in a loss of specificity. There's a similar issue with "indeterminate" results in the B-field, but this rate can be reduced to ~0 with secondary arrays.
* The original Kraken2 paper describes using 17 bits for taxonomic IDs (~131K unique values). I don't know how many tax IDs current Kraken2 DB builds use offhand, but the error rate climbs significantly as you use additional bits for the value vs. key (e.g., to represent >=2^20 values, see Fig S4). I don't have a good sense for the performance and other engineering tradeoffs of just extending the hash code >32 bits. I also don't know what the data structure overhead is beyond those >32 bits/pair.
So, for a metagenomics classifier specifically, some subtle tradeoffs but honestly database quality and the classification algorithm likely matters a lot more than the marginal FP rates with either data structure -- we just happen to have come to this solution.
For other applications, my sense is a B-field is generally going to be much more flexible (e.g., supporting arbitrary keys vs. a specific fixed-length encoding) but of course it depends on the specifics.
Yes, exactly.
> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.
I just skimmed the paper again (has been a while since a close reading), but my refreshed understanding is:
* Like the B-field, there are also false positives.
* When multiple hashed keys (k-mers) collide in the Kraken2 hash table, it has to store a "reduced" value for those key-value pairs. While there's an elegant solution for this issue for the problem of taxonomic classification (lowest common ancestor), it still results in a loss of specificity. There's a similar issue with "indeterminate" results in the B-field, but this rate can be reduced to ~0 with secondary arrays.
* The original Kraken2 paper describes using 17 bits for taxonomic IDs (~131K unique values). I don't know how many tax IDs current Kraken2 DB builds use offhand, but the error rate climbs significantly as you use additional bits for the value vs. key (e.g., to represent >=2^20 values, see Fig S4). I don't have a good sense for the performance and other engineering tradeoffs of just extending the hash code >32 bits. I also don't know what the data structure overhead is beyond those >32 bits/pair.
So, for a metagenomics classifier specifically, some subtle tradeoffs but honestly database quality and the classification algorithm likely matters a lot more than the marginal FP rates with either data structure -- we just happen to have come to this solution.
For other applications, my sense is a B-field is generally going to be much more flexible (e.g., supporting arbitrary keys vs. a specific fixed-length encoding) but of course it depends on the specifics.