BACK to VOLUME 40 NO.3

Kybernetika 40(3):293-304, 2004.

Text Document Classification Based on Mixture Models

Jana Novovičová and Antonín Malík


Abstract:

Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using a bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution for the multiclass text document classification task. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, and with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model in text classification.
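
As a rough illustration of the kind of model described above, the following Python sketch scores a bag-of-words count vector under a per-class mixture of multinomials and picks the class with the highest posterior score. It is not the authors' implementation: the function names, the toy parameters, and the assumption that every word probability is strictly positive (for example, after smoothing) are all hypothetical, and parameter estimation (typically done with the EM algorithm) is omitted.

```python
import math

def log_multinomial_mixture(counts, weights, components):
    """Log of sum_m w_m * prod_t theta[m][t] ** counts[t].

    The multinomial coefficient is dropped because it is identical for
    every class and component, so it does not affect the classification.
    Assumes all word probabilities are strictly positive.
    """
    log_terms = []
    for w, theta in zip(weights, components):
        log_lik = sum(n * math.log(p) for n, p in zip(counts, theta) if n > 0)
        log_terms.append(math.log(w) + log_lik)
    # Log-sum-exp for numerical stability on long documents.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def classify(counts, priors, class_models):
    """Return the class maximizing log prior + log class-conditional mixture."""
    scores = {
        c: math.log(priors[c]) + log_multinomial_mixture(counts, *class_models[c])
        for c in priors
    }
    return max(scores, key=scores.get)

# Hypothetical toy setup: 3-word vocabulary, two classes.
priors = {"sports": 0.5, "finance": 0.5}
class_models = {
    # (mixture weights, per-component word probabilities)
    "sports": ([0.5, 0.5], [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]),
    "finance": ([1.0], [[0.1, 0.2, 0.7]]),
}
label = classify([3, 1, 0], priors, class_models)
```

With a single component per class, this reduces to the standard multinomial model mentioned in the abstract.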


Keywords: text classification; multinomial mixture model


AMS: 62H30; 62G05; 68T10




BibTeX

@article{kyb:2004:3:293-304,
  author = {Novovi\v{c}ov\'{a}, Jana and Mal\'{\i}k, Anton\'{\i}n},
  title = {Text Document Classification Based on Mixture Models},
  journal = {Kybernetika},
  volume = {40},
  year = {2004},
  number = {3},
  pages = {293--304},
  publisher = {{\'U}TIA, AV {\v C}R, Prague},
}

