python - Bias in Naive Bayes classifier - Data Science Stack Exchange

Unbalanced class distributions

First, an unbalanced dataset will bias your model towards the over-represented classes. If the imbalance is mild, this should not cause a significant problem with any algorithm you employ. However, as the imbalance becomes more severe, you should expect more false negatives for the under-represented class. Consider what you are asking of the model: it has to learn what it means for an example to belong to a class. If you do not provide enough examples of that class, the model cannot learn the extent of the variation that exists among them.
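To make the effect concrete for Naive Bayes, here is a minimal sketch of how the class prior alone can flip a prediction. The two class-conditional distributions (majority ~ N(0, 1), minority ~ N(2, 1)), the test point, and the function names are assumptions chosen for illustration, not something taken from the question.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def posterior_minority(x, prior_minority):
    """P(minority | x) for a one-feature Gaussian Naive Bayes model.

    Assumed (hypothetical) class-conditional distributions:
    majority ~ N(0, 1), minority ~ N(2, 1).
    """
    joint_major = gaussian_pdf(x, 0.0, 1.0) * (1.0 - prior_minority)
    joint_minor = gaussian_pdf(x, 2.0, 1.0) * prior_minority
    return joint_minor / (joint_major + joint_minor)

x = 1.5  # a point that lies closer to the minority-class mean

# Same likelihoods, different priors: only the class balance changes.
balanced = posterior_minority(x, prior_minority=0.5)
imbalanced = posterior_minority(x, prior_minority=0.05)

print(f"balanced prior:   P(minority | x) = {balanced:.3f}")
print(f"imbalanced prior: P(minority | x) = {imbalanced:.3f}")
```

With equal priors the point is assigned to the minority class, but with a 5% prior the same point falls below 0.5 and is classified as majority, which is exactly a false negative for the under-represented class.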

If the class distribution is extremely skewed, then I would suggest anomaly detection techniques. These techniques learn the distribution of a single class and then identify whether novel examples fall within this distribution or not.
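As a rough sketch of that idea, the simplest possible version fits a single Gaussian to the well-represented class and flags anything too many standard deviations away. The data, the threshold of 3, and the helper names are all made up for illustration; real anomaly detectors are usually more elaborate than this.

```python
import statistics

def fit_single_class(values):
    """Learn the distribution of one class as a mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(x, mean, std, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the mean."""
    return abs(x - mean) / std > threshold

# Hypothetical one-feature measurements from the well-represented class only.
normal_examples = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mean, std = fit_single_class(normal_examples)

print(is_anomaly(10.1, mean, std))  # inside the learned distribution
print(is_anomaly(14.0, mean, std))  # far outside it
```

Note that no examples of the rare class are needed for training here, which is why this approach is attractive when that class is almost absent from the data.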

Choosing an algorithm

More classes result in a higher-dimensional output, which adds to the complexity of your model. For example, suppose you have a model which discriminates between 2 classes using a dataset of fixed size. Further discrimination (increasing the number of output classes) leaves fewer examples per class and will cause the model to have higher bias. You should thus expect to see greater test error if you do not increase the size of your dataset.

Given a fixed dataset X, you need to find the right balance between bias and model complexity to get optimal results. For example, a neural network based technique (highly complex) is not a good choice for a limited dataset with many output classes. However, Naive Bayes or Random Forest would be.
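One crude way to see the difference in complexity is to count learnable parameters. The sketch below compares a Gaussian Naive Bayes model with a one-hidden-layer neural network; the dataset shape (500 samples, 20 features, 10 classes) and the hidden layer size of 64 are assumptions for illustration, and parameter count is only a rough proxy for model complexity.

```python
def gaussian_nb_params(n_features, n_classes):
    """One mean and one variance per feature per class, plus one prior per class."""
    return 2 * n_features * n_classes + n_classes

def mlp_params(n_features, n_hidden, n_classes):
    """Weights and biases of one hidden layer followed by an output layer."""
    hidden_layer = n_features * n_hidden + n_hidden
    output_layer = n_hidden * n_classes + n_classes
    return hidden_layer + output_layer

# Hypothetical limited dataset with many output classes.
n_samples, n_features, n_classes = 500, 20, 10

nb = gaussian_nb_params(n_features, n_classes)
mlp = mlp_params(n_features, n_hidden=64, n_classes=n_classes)

print(f"Gaussian Naive Bayes: {nb} parameters, {n_samples / nb:.2f} samples each")
print(f"One-hidden-layer MLP: {mlp} parameters, {n_samples / mlp:.2f} samples each")
```

In this setup the network has several times as many parameters as training examples, while Naive Bayes has more examples than parameters, which is the kind of mismatch to look for on a small dataset.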

https://datascience.stackexchange.com/questions/24651/bias-in-naive-bayes-classifier