
This has been done for the last 20 years (sort of in secret) in the Automatic Speech Recognition community. In brief, if A is the auditory observation you're trying to transcribe and W is a possible word sequence, then you use Bayes' law to rewrite

    P(W|A) = P(A|W)P(W)
The first is known as the acoustic model, which is responsible for transcribing sounds into phonemes and candidate words. The second is the language model, responsible just for ranking likely word sequences.

The first tends to be a lot harder than the second; its models are fuzzier. So to compensate, the model actually used in practice is

    P(W|A) = P(A|W)P(W)^a
where a is a positive number. I forget the typical value of the fudge factor, but generally you use cross-validation to optimize it.
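As a toy sketch of how such a fudge factor might be tuned (all scores, candidates, and the grid below are made-up numbers, not real ASR data):

```python
# Toy sketch of tuning the fudge factor `a` in P(A|W) * P(W)^a by grid
# search on held-out data. All log-probabilities here are invented.

def combined_score(log_p_a_given_w, log_p_w, a):
    # log of P(A|W) * P(W)^a
    return log_p_a_given_w + a * log_p_w

# Held-out items: (candidate (acoustic log-prob, LM log-prob) pairs, truth index).
held_out = [
    ([(-2.0, -5.0), (-4.0, -3.0)], 1),  # needs a > 1 for the LM to win
    ([(-1.0, -4.0), (-4.0, -3.0)], 0),  # too large an a (> 3) gets this wrong
    ([(-2.0, -2.0), (-3.0, -3.0)], 0),  # easy: both models agree
]

def accuracy(a):
    correct = 0
    for candidates, truth in held_out:
        scores = [combined_score(am, lm, a) for am, lm in candidates]
        if scores.index(max(scores)) == truth:
            correct += 1
    return correct / len(held_out)

# Stand-in for cross-validation: a simple grid search over a.
best_a = max([0.5, 1.0, 2.0, 4.0], key=accuracy)   # picks a = 2.0 here
```

On this contrived data neither a = 1 (the "pure" Bayes combination) nor a very large a wins, which is exactly the situation the fudge factor exists for.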

Exactly one class I took told me about its existence. Most others pretend it's not there.



This is great to hear! I used a similar construction in my final project (electrical engineering) but left the appropriate mathematical proof (if any) to the Future Work section. The classification rule I found to work better was:

  c = argmax_{c \in C} (log(P(c)) + Sum_i (N_{d,d_i} \times log(P(d_i|c))))
where

  c - best class
  C - set of classes
  P(c) - prior prob. of class c
  P(d_i|c) - conditional probability of the word d_i given the class c, and
  N_{d,d_i} - frequency of word d_i in document d
Excuse me for the possibly weird notation.

Effectively, this expression raises each word probability to a frequency power, which basically assumes (naively, I think) that "words that occur more often should carry more weight".
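That rule is the standard multinomial naive-Bayes decision rule, and it fits in a few lines (the classes, priors, and conditionals below are invented for illustration):

```python
import math
from collections import Counter

# Toy multinomial naive-Bayes classifier implementing
#   c* = argmax_c [ log P(c) + sum_i N(d, d_i) * log P(d_i | c) ]
# All probabilities below are made up; real ones would be estimated
# from a corpus (with smoothing for unseen words).

priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"buy": 0.5, "now": 0.3, "meeting": 0.2},
    "ham":  {"buy": 0.1, "now": 0.2, "meeting": 0.7},
}

def classify(document_words):
    counts = Counter(document_words)          # N(d, d_i)
    def score(c):
        return math.log(priors[c]) + sum(
            n * math.log(cond[c][w]) for w, n in counts.items()
        )
    return max(priors, key=score)

print(classify(["buy", "now", "now"]))   # -> spam
print(classify(["meeting", "now"]))      # -> ham
```

Working in log space avoids floating-point underflow when documents are long, which is why the rule is written with logs in the first place.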


So what's the rigorous mathematical justification that this trick works and doesn't produce inconsistencies?

The power law smells a bit as if P(W) were itself some maximum-likelihood estimate.


The rigorous mathematical justification that this trick works better than not using it is that speech recognition works better when it's used. It is for theory to explain reality rather than the other way around.


Specifically, what justifies this is that the parameter is optimized via cross-validation, so within the space that we're optimizing over, we know that we're picking a good parameter. If it's not equal to 1, then it turns out that the "rigorous" probabilistic model was missing something that turned out to be useful.

In this case, it's that the probabilistic model only incorporates naively measured probabilities, and does not account for the error in those measured values. This is rather tricky to do right - you need to know a lot about the true prior distributions as well as the error distributions in order to account for it, and generally we don't have enough information to do this effectively.

So we "cheat", and just use the probabilistic model as a starting point to launch an optimization against.


Stated differently, it is rigorous if you accept that A and W are what we think they are: we want to find the best word sequence w*, and we can take it as w* = argmax_W P(A|W)P(W).

The problem is that we can't know P(A|W) or P(W). We must choose a family of distributions (a model) that we believe they live in and then perform statistical estimation to find the best representatives of those families; call them f*(A|W) and g*(W). We can incur error here in three ways: (1) choosing too small a family, (2) having too little data to find the best representative of our family, and (3) having finite resources and thus giving up before we even reach the optimum we could have found.

(1), (2), and (3) are in contention. For instance, we can accept more of (1) by choosing a smaller family, which lets us quickly process more data and so reduce (2) and (3). It's a battle of tradeoffs.

The fudge factor is introduced to compensate for the fact that P(W) is somewhat easier to estimate than P(A|W) (but don't get me wrong, they're both technically difficult). That means that f*(A|W) tends to be blurrier than g*(W) (it has lower Fisher information, perhaps). Considering f* and g* to be on equal ground is then pretty silly, so giving more power to the sharper model improves classification accuracy.
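You can see the "giving more power" effect directly: raising a distribution to a power a > 1 and renormalizing concentrates mass on the likelier outcomes, while a < 1 flattens it (toy numbers below):

```python
def sharpen(dist, a):
    # Raise each probability to the power a, then renormalize to sum to 1.
    powered = {k: v ** a for k, v in dist.items()}
    total = sum(powered.values())
    return {k: v / total for k, v in powered.items()}

# A made-up distribution over candidate words.
p = {"the": 0.6, "thee": 0.3, "The": 0.1}

print(sharpen(p, 2.0))   # a > 1: mass shifts toward "the"
print(sharpen(p, 0.5))   # a < 1: the distribution flattens out
```

This is the same mechanism as temperature scaling; the fudge factor effectively sets how loudly the language model gets to speak relative to the acoustic model.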

---

The point being, it's not actually surprising that the fudge factor exists. It's more of a conflation of ideas and notation that makes it seem confusing.

(P.S. read f* as "f star"; the asterisks were being eaten by the formatting.)


There isn't one. It's an engineering solution that just happens to work pretty often in ASR.

Complex P(W)'s are almost certainly fit by MLE. The models used are super complex, so MLE is often the only practical estimator. There are MaxEnt and Bayesian methods, but I think people generally trade those away for the computational tractability of enormous models, which wouldn't gain much from prior information anyway because there's just so much data available.
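For concreteness, the MLE fit of an n-gram language model really is just counting; here is a bigram version on a made-up corpus (real systems add smoothing for unseen pairs):

```python
from collections import Counter

# MLE for a bigram model: P(w2 | w1) = count(w1 w2) / count(w1 as a predecessor).
corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
prefix_counts = Counter(corpus[:-1])         # counts of each word as a predecessor

def p_mle(w2, w1):
    return bigrams[(w1, w2)] / prefix_counts[w1]

print(p_mle("cat", "the"))   # 2/3: "the" is followed by "cat" twice, "mat" once
```

The fragility is also visible here: any bigram absent from the corpus gets probability exactly zero, which is one reason smoothed or interpolated estimates are used in practice.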

The best data is more data.


Iwwelewant Cowwection! Actual Bayes law is:

P(W|A) = P(A|W)P(W)/P(A)


If you're maximizing over all possible values of W, A never changes, which is why the P(A) denominator is usually omitted from the function you're maximizing.

You're right, though, it shouldn't be written as P(W|A) in that case, that's misleading.
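That invariance is easy to check numerically (the candidate sequences and joint probabilities below are made up):

```python
# argmax_W P(A|W) P(W) is unchanged by dividing through by the constant P(A).
candidates = {
    "wreck a nice beach": 0.2 * 0.01,   # P(A|W) * P(W), invented numbers
    "recognize speech":   0.3 * 0.02,
}
p_a = sum(candidates.values())   # stand-in for the normalizer P(A)

unnormalized = max(candidates, key=candidates.get)
normalized = max(candidates, key=lambda w: candidates[w] / p_a)
assert unnormalized == normalized   # dividing by a positive constant changes nothing
```

Dividing every score by the same positive constant preserves the ordering, so the winner is identical either way.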


P(A) is the same for all candidates you are considering so you drop it.


True, but I didn't feel like writing \propto



