Institute of Information Theory and Automation

Details of the defence

Type of the defence: Ph.D.
Name: Mgr. Antonín Malík
Name of the thesis: Text classification
Abstract: The goal of text document classification (TC) is to automatically assign a new document to one or more predefined classes based on its content. A bag-of-words representation leads to a very high-dimensional feature (word) space. The dominant approach to dimensionality reduction in TC is feature selection (FS). FS methods in TC typically use an evaluation function that is applied to a single word; they score each word separately and completely ignore the other words and the way words work together. This thesis proposes novel algorithms for word selection: sequential forward selection methods based on proposed improved mutual information criterion functions. The performance of the proposed criteria is compared with that of information gain, the chi-squared statistic and odds ratio, which evaluate features individually. Experimental results on the Reuters-21578 data set, obtained with a naive Bayes classifier based on the multinomial model, a linear support vector machine and a k-nearest neighbor classifier, are analyzed from various perspectives, including recall, precision and the F1-measure. The results indicate the effectiveness of the proposed FS algorithms in TC. Probabilistic models for TC that use mixture models for the class-conditional probability functions are also presented, with a focus on mixtures of multivariate Bernoulli distributions and mixtures of multinomial distributions. The proposed approach is a generalization of naive Bayes that tries to properly model significant class-conditional dependencies by spreading them over different class mixture components. Maximum-likelihood estimates of the mixture parameters are obtained with the well-known expectation-maximization algorithm. Experimental results on the Reuters-21578 and Newsgroups data sets indicate the effectiveness of the proposed mixture models in TC; an increase in classification accuracy was achieved.
Supervisor: Doc. RNDr. Jana Novovičová, CSc.
Date and time of the defence: 9:00, 30.09.2008
Venue: Meeting room No. K112, Faculty of Electrical Engineering, Czech Technical University in Prague (ČVUT), Karlovo nám. 13, Praha 2
Date of record: 17.09.2008
Status: defended
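
The abstract mentions baseline criteria that score each word individually. As a rough illustration only, the sketch below computes two of them, the chi-squared statistic and information gain (in its two-class form, i.e. the mutual information between word presence and membership in one class), from document counts. It does not reproduce the sequential forward selection methods proposed in the thesis. The function names, the toy documents and the class labels are all hypothetical and were invented for this example.

```python
import math

def contingency(docs, labels, word, cls):
    """Count documents by (word present?, label == cls?).

    Returns (a, b, c, d): a = word and class, b = word and not class,
    c = no word and class, d = no word and not class.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_word = word in tokens
        in_class = label == cls
        if has_word and in_class:
            a += 1
        elif has_word:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_squared(a, b, c, d):
    """Chi-squared statistic of the 2x2 word/class contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: word or class never (or always) occurs
    return n * (a * d - b * c) ** 2 / denom

def information_gain(a, b, c, d):
    """Two-class information gain in bits: mutual information between
    word presence and class membership."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    )
    gain = 0.0
    for joint, row_total, col_total in cells:
        if joint > 0:  # empty cells contribute nothing
            gain += (joint / n) * math.log2(joint * n / (row_total * col_total))
    return gain

# Toy data (hypothetical): each document is a set of words.
docs = [
    {"wheat", "farm"},
    {"wheat", "price"},
    {"oil", "price"},
    {"oil", "barrel"},
]
labels = ["grain", "grain", "crude", "crude"]

# Rank the vocabulary for the class "grain" by chi-squared score.
vocabulary = sorted(set().union(*docs))
ranking = sorted(
    vocabulary,
    key=lambda w: chi_squared(*contingency(docs, labels, w, "grain")),
    reverse=True,
)
```

Both functions look at one word at a time, which is exactly the limitation the abstract points out: a word that is informative only in combination with others receives a low score under such criteria.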


Responsible for information: admin
Last modification: 20.01.2009