> I would say that 99% of all spam I receive is written in English, whereas about 80% of my personal emails are in German.
You're missing some important numbers here for this to be meaningful: Base rates.
If you're getting 10 spam emails per month, and 99% of them are in English, but you're also receiving 1000 legitimate personal emails per month, and 20% of them are in English, that's a total of about 210 English emails per month, of which only about 10 are spam. In this case, while it's technically true that an email written in English is more likely to be spam than an email written in German, it's such a poor indicator that you'd be better off using another indicator entirely.
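To put a number on that, here is a minimal sketch of the calculation using the figures from the paragraph above. The function and variable names are my own, and it assumes spam and legitimate personal mail are the only two categories.

```python
def p_spam_given_english(n_spam, p_en_spam, n_ham, p_en_ham):
    """Share of English emails that are spam, given monthly counts
    and the fraction of each category that is written in English."""
    english_spam = n_spam * p_en_spam   # expected English spam emails
    english_ham = n_ham * p_en_ham      # expected English legitimate emails
    return english_spam / (english_spam + english_ham)

# 10 spam/month (99% English), 1000 legitimate/month (20% English)
posterior = p_spam_given_english(10, 0.99, 1000, 0.20)
print(f"P(spam | English) = {posterior:.1%}")  # prints "P(spam | English) = 4.7%"
```

So with these base rates, more than 95% of English emails are still legitimate.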
On top of that, you've only pinpointed two categories out of many more possible categories of emails. What if someone is getting 50 spam emails per month, 5 job offer emails per month, 10 updates relevant to their profession per month, 100 professional emails per month, and 50 personal emails per month, and the English:German ratios for those are 99:1, 9:1, 4:1, 1:1 and 4:1 respectively? Suddenly just "this email is written in English" isn't all that important anymore.
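The multi-category scenario can be worked through the same way. This is only a sketch with the hypothetical numbers from the paragraph above. The category labels and variable names are mine.

```python
# (emails per month, English part of ratio, German part of ratio)
categories = {
    "spam": (50, 99, 1),
    "job offers": (5, 9, 1),
    "profession updates": (10, 4, 1),
    "professional": (100, 1, 1),
    "personal": (50, 4, 1),
}

# Expected number of English emails per category
english = {name: n * en / (en + de) for name, (n, en, de) in categories.items()}

total = sum(n for n, _, _ in categories.values())
total_english = sum(english.values())

prior_spam = categories["spam"][0] / total          # spam share before looking at language
posterior_spam = english["spam"] / total_english    # spam share among English emails

print(f"P(spam) = {prior_spam:.1%}")
print(f"P(spam | English) = {posterior_spam:.1%}")
```

With these numbers, knowing an email is in English moves the spam probability from roughly 23% to roughly 33%, so about two out of three English emails are still not spam.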
Base rates are very, very, very, super important when discussing stats like these.
That's why I said "one factor of many". If you have ten factors like that and combine them, you'll get an excellent spam filter. That's how Bayesian spam filters work. They combine a number of factors that are not very significant on their own but that add up when used together.
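As a rough illustration of how weak factors add up, here is a simplified naive-Bayes-style sketch, not the implementation of any particular filter. It assumes the factors are independent given the class, which real features rarely are. The function name is my own, and the nine extra factors are hypothetical, each assumed to be as informative as "written in English" in the 10 spam vs. 1000 legitimate example above.

```python
import math

def spam_posterior(prior_spam, likelihood_ratios):
    """Combine evidence in log-odds space under a naive independence assumption.

    Each likelihood ratio is P(factor | spam) / P(factor | not spam).
    """
    log_odds = math.log(prior_spam / (1 - prior_spam))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)
    return 1 / (1 + math.exp(-log_odds))

prior = 10 / 1010           # base rate: 10 spam out of 1010 emails
lr_english = 0.99 / 0.20    # "is English" is 4.95x more common in spam

one_factor = spam_posterior(prior, [lr_english])
# Hypothetical: ten independent factors, each as strong as the language one
ten_factors = spam_posterior(prior, [lr_english] * 10)

print(f"one factor:  {one_factor:.1%}")
print(f"ten factors: {ten_factors:.3%}")
```

A single factor leaves the posterior under 5%, while ten such factors together push it above 99%. The base rate still matters: it sets the starting point that the evidence has to overcome.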