
This has been done for the last 20 years (sort of in secret) in the Automatic Speech Recognition community. In brief, if A is the auditory observation you're trying to transcribe and W is a possible word sequence, then you use Bayes' law to rewrite

    P(W|A) = P(A|W)P(W)
The first is known as the acoustic model, which is responsible for transcribing sounds into phonemes and candidate words. The second is the language model, responsible just for ranking likely word sequences.

The first tends to be a lot harder than the second; its models are fuzzier. So to compensate, the model actually used in practice is

    P(W|A) = P(A|W)P(W)^a
where a is a positive number. I forget the typical value of the fudge factor, but generally you use cross-validation to optimize it.
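As a toy sketch of how such a fudge factor might be tuned (all scores, candidates, and the grid below are made-up numbers, not real ASR data):

```python
# Toy sketch of tuning the fudge factor `a` in P(A|W) * P(W)^a by grid
# search on held-out data. All log-probabilities here are invented.

def combined_score(log_p_a_given_w, log_p_w, a):
    # log of P(A|W) * P(W)^a
    return log_p_a_given_w + a * log_p_w

# Held-out items: (candidate (acoustic log-prob, LM log-prob) pairs, truth index).
held_out = [
    ([(-2.0, -5.0), (-4.0, -3.0)], 1),  # needs a > 1 for the LM to win
    ([(-1.0, -4.0), (-4.0, -3.0)], 0),  # too large an a (> 3) gets this wrong
    ([(-2.0, -2.0), (-3.0, -3.0)], 0),  # easy: both models agree
]

def accuracy(a):
    correct = 0
    for candidates, truth in held_out:
        scores = [combined_score(am, lm, a) for am, lm in candidates]
        if scores.index(max(scores)) == truth:
            correct += 1
    return correct / len(held_out)

# Stand-in for cross-validation: a simple grid search over a.
best_a = max([0.5, 1.0, 2.0, 4.0], key=accuracy)   # picks a = 2.0 here
```

On this contrived data neither a = 1 (the "pure" Bayes combination) nor a very large a wins, which is exactly the situation the fudge factor exists for.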

Exactly one class I took told me about its existence. Most others pretend it's not there.



This is great to hear! I used a similar construction in my final project (electrical engineering) but left the appropriate mathematical proof (if any) to the Future Work section. The classification rule I found to work better was:

  c = argmax_{c \in C} (log(P(c)) + Sum_i (N_{d,d_i} \times log(P(d_i|c))))
where

  c - best class
  C - set of classes
  P(c) - prior prob. of class c
  P(d_i|c) - conditional probability of the word d_i given the class c, and
  N_{d,d_i} - frequency of word d_i in document d
Excuse me for the possibly weird notation.

Effectively, this expression raises each word probability to a frequency power, which basically assumes (naively, I think) that "words that occur more often should carry more weight".
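That rule is the standard multinomial naive-Bayes decision rule, and it fits in a few lines (the classes, priors, and conditionals below are invented for illustration):

```python
import math
from collections import Counter

# Toy multinomial naive-Bayes classifier implementing
#   c* = argmax_c [ log P(c) + sum_i N(d, d_i) * log P(d_i | c) ]
# All probabilities below are made up; real ones would be estimated
# from a corpus (with smoothing for unseen words).

priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"buy": 0.5, "now": 0.3, "meeting": 0.2},
    "ham":  {"buy": 0.1, "now": 0.2, "meeting": 0.7},
}

def classify(document_words):
    counts = Counter(document_words)          # N(d, d_i)
    def score(c):
        return math.log(priors[c]) + sum(
            n * math.log(cond[c][w]) for w, n in counts.items()
        )
    return max(priors, key=score)

print(classify(["buy", "now", "now"]))   # -> spam
print(classify(["meeting", "now"]))      # -> ham
```

Working in log space avoids floating-point underflow when documents are long, which is why the rule is written with logs in the first place.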


So what's the rigorous mathematical justification that this trick works and doesn't produce inconsistencies?

The power law smells a bit as if P(W) were itself some maximum-likelihood estimate.


The rigorous mathematical justification that this trick works better than not using it is that speech recognition works better when it's used. It is for theory to explain reality rather than the other way around.


Specifically, what justifies this is that the parameter is optimized via cross-validation, so within the space that we're optimizing over, we know that we're picking a good parameter. If it's not equal to 1, then it turns out that the "rigorous" probabilistic model was missing something that turned out to be useful.

In this case, it's that the probabilistic model only incorporates naively measured probabilities, and does not account for the error in those measured values. This is rather tricky to do right - you need to know a lot about the true prior distributions as well as the error distributions in order to account for it, and generally we don't have enough information to do this effectively.

So we "cheat", and just use the probabilistic model as a starting point to launch an optimization against.


Stated differently, it is rigorous if you accept that A and W are what we think they are: we want to find the best word sequence w*, and we can take it as w* = argmax_W P(A|W)P(W).

The problem is that we can't know P(A|W) or P(W). We must choose a family of distributions (a model) that we believe they live in and then perform statistical estimation to find the best representatives of those families; call them f*(A|W) and g*(W). We can incur error here in three ways: (1) choosing too small a family, (2) having too little data to find the best representative of our family, and (3) having finite resources and thus giving up before we even reach the optimum we could have found.

(1), (2), and (3) are in contention. For instance, we can accept more of (1) by choosing a smaller family, which lets us quickly process more data and so reduce (2) and (3). It's a battle of tradeoffs.

The fudge factor is introduced to compensate for the fact that P(W) is somewhat easier to estimate than P(A|W) (but don't get me wrong, they're both technically difficult). That means that f*(A|W) tends to be blurrier than g*(W) (it has lower Fisher information, perhaps). Considering f* and g* to be on equal ground is then pretty silly, so giving more power to the sharper model improves classification accuracy.
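You can see the "giving more power" effect directly: raising a distribution to a power a > 1 and renormalizing concentrates mass on the likelier outcomes, while a < 1 flattens it (toy numbers below):

```python
def sharpen(dist, a):
    # Raise each probability to the power a, then renormalize to sum to 1.
    powered = {k: v ** a for k, v in dist.items()}
    total = sum(powered.values())
    return {k: v / total for k, v in powered.items()}

# A made-up distribution over candidate words.
p = {"the": 0.6, "thee": 0.3, "The": 0.1}

print(sharpen(p, 2.0))   # a > 1: mass shifts toward "the"
print(sharpen(p, 0.5))   # a < 1: the distribution flattens out
```

This is the same mechanism as temperature scaling; the fudge factor effectively sets how loudly the language model gets to speak relative to the acoustic model.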

---

The point being, it's not actually surprising that the fudge factor exists. It's more of a conflation of ideas and notation that makes it seem confusing.

(P.S. read f* as "f star"; the asterisks were being eaten by the formatting.)


There isn't one. It's an engineering solution that just happens to work pretty often in ASR.

Complex P(W)'s are almost certainly fit by MLE. The models used are super complex, so MLE is often the only practical estimator. There are MaxEnt and Bayesian methods, but I think people generally trade those away for the computational tractability of enormous models, which wouldn't gain much from prior information anyway because there's just so much data available.
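For concreteness, the MLE fit of an n-gram language model really is just counting; here is a bigram version on a made-up corpus (real systems add smoothing for unseen pairs):

```python
from collections import Counter

# MLE for a bigram model: P(w2 | w1) = count(w1 w2) / count(w1 as a predecessor).
corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
prefix_counts = Counter(corpus[:-1])         # counts of each word as a predecessor

def p_mle(w2, w1):
    return bigrams[(w1, w2)] / prefix_counts[w1]

print(p_mle("cat", "the"))   # 2/3: "the" is followed by "cat" twice, "mat" once
```

The fragility is also visible here: any bigram absent from the corpus gets probability exactly zero, which is one reason smoothed or interpolated estimates are used in practice.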

The best data is more data.


Iwwelewant Cowwection! Actual Bayes law is:

P(W|A) = P(A|W)P(W)/P(A)


If you're maximizing over all possible values of W, A never changes, which is why the P(A) denominator is usually omitted from the function you're maximizing.

You're right, though, it shouldn't be written as P(W|A) in that case, that's misleading.
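That invariance is easy to check numerically (the candidate sequences and joint probabilities below are made up):

```python
# argmax_W P(A|W) P(W) is unchanged by dividing through by the constant P(A).
candidates = {
    "wreck a nice beach": 0.2 * 0.01,   # P(A|W) * P(W), invented numbers
    "recognize speech":   0.3 * 0.02,
}
p_a = sum(candidates.values())   # stand-in for the normalizer P(A)

unnormalized = max(candidates, key=candidates.get)
normalized = max(candidates, key=lambda w: candidates[w] / p_a)
assert unnormalized == normalized   # dividing by a positive constant changes nothing
```

Dividing every score by the same positive constant preserves the ordering, so the winner is identical either way.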


P(A) is the same for all candidates you are considering so you drop it.


True, but I didn't feel like writing \propto



