RNDr. Jan Kalina, Ph.D.

Scientist

Department: Machine Learning
Phone: +420 266 05 3099
Office:
Email: (email)

Selected publications

Likelihood ratio testing under measurement errors.

Broniatowski, M., Jurečková, Jana, Kalina, Jan.

In Entropy, volume 20, issue 12, 2018, ISSN 1099-4300.

The likelihood ratio test of a simple null hypothesis against a simple alternative is considered in the situation where observations are mismeasured due to the presence of measurement errors.
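As a toy illustration of the testing problem (not of the paper's results), the sketch below computes a Neyman-Pearson likelihood ratio statistic for a simple null against a simple alternative in a Gaussian location model, with the observed data carrying an additive measurement error. The model, the error distribution, and all parameter values are assumptions made only for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simple null and simple alternative in a Gaussian location model (assumed here).
theta0, theta1 = 0.0, 0.5
n = 200

# True observations plus additive measurement error (the mismeasurement).
x_true = rng.normal(theta1, 1.0, size=n)        # data generated under H1
x_obs = x_true + rng.normal(0.0, 0.3, size=n)   # what is actually observed

def log_likelihood_ratio(x, theta0, theta1):
    """Log of the likelihood ratio L1/L0 for N(theta, 1) observations."""
    ll1 = stats.norm.logpdf(x, loc=theta1, scale=1.0).sum()
    ll0 = stats.norm.logpdf(x, loc=theta0, scale=1.0).sum()
    return ll1 - ll0

# The statistic is computed on the mismeasured data; by the Neyman-Pearson lemma,
# H0 is rejected when the ratio exceeds a critical value chosen for the desired level.
print(log_likelihood_ratio(x_obs, theta0, theta1))
```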

A Robust Pre-processing of BeadChip Microarray Images.

Kalina, Jan.

In Biocybernetics and Biomedical Engineering, volume 38, issue 3, pages 556-563, 2018, ISSN 0208-5216.

Microarray images commonly used in gene expression studies are heavily contaminated by noise and/or outlying values (outliers). Unfortunately, standard methodology for the analysis of Illumina BeadChip microarray images turns out to be too vulnerable to data contamination by outliers. In this paper, an alternative approach to low-level pre-processing of images obtained by the BeadChip microarray technology is proposed. The novel approach robustifies the standard methodology in a comprehensive way and thus ensures sufficient robustness (resistance) to outliers. A gene expression data set from a cardiovascular genetic study is analyzed and the performance of the novel robust approach is compared with the standard methodology. The robust approach is able to detect and delete a larger percentage of outliers. More importantly, gene expressions are estimated more precisely. As a consequence, the performance of a subsequent classification task into two groups (patients vs. control persons) on the cardiovascular gene expression data set is also improved. A further improvement is obtained when weighted gene expression values are considered, with weights corresponding to a robust estimate of the variability of the measurements for each individual gene transcript.
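The weighting idea mentioned at the end of the abstract can be illustrated by a simplified sketch: bead-level intensities for one transcript are screened for outliers by a median/MAD rule, the expression is estimated robustly from the retained beads, and a precision weight is derived from the robust variability estimate. The cutoff, the MAD-based weight, and the function name are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def robust_bead_summary(bead_intensities, k=3.0):
    """Simplified sketch of robust summarization for one gene transcript.

    bead_intensities: 1-D array of bead-level intensities for a single probe.
    Beads farther than k robust standard deviations (MAD-based) from the median
    are flagged as outliers; the remaining beads give the expression estimate
    and a precision weight. Illustration of the median/MAD idea only.
    """
    x = np.asarray(bead_intensities, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) * 1.4826   # consistent with sd under normality
    keep = np.abs(x - med) <= k * mad if mad > 0 else np.ones_like(x, bool)

    expression = np.median(x[keep])             # robust expression estimate
    weight = 1.0 / max(mad, 1e-8) ** 2          # more stable probes get larger weight
    return expression, weight, int((~keep).sum())

intensities = np.array([512., 530., 498., 505., 2100., 520.])   # one artificial outlier
print(robust_bead_summary(intensities))
```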

On Locally Most Powerful Sequential Rank Tests.

Kalina, Jan.

In Sequential Analysis, volume 36, issue 1, pages 111-125, 2017, ISSN 0747-4946.

Sequential ranks are defined as the ranks of the observations that have been observed so far in a sequential design. This paper studies hypothesis tests based on sequential ranks for several situations. The locally most powerful sequential rank test is derived for the hypothesis of randomness against a general alternative, including the two-sample difference in location or regression in location as special cases of the alternative hypothesis. Further, the locally most powerful sequential rank tests are derived for the one-sample problem and for independence of two samples, in a spirit analogous to the classical results of Hájek and Šidák (1967) for (classical) ranks. The locally most powerful tests are derived for a fixed sample size and the results provide arguments in favor of existing tests. In addition, we propose a sequential testing procedure based on these statistics of the locally most powerful tests. The principles of such sequential testing are explained using the two-sample Wilcoxon test based on sequential ranks.
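To make the notion of sequential ranks concrete, the following sketch computes the sequential rank of each incoming observation and a simple Wilcoxon-type two-sample statistic built from them. The standardization and the score function used here are illustrative assumptions and do not reproduce the locally most powerful scores derived in the paper.

```python
import numpy as np

def sequential_ranks(x):
    """Sequential rank of the i-th observation: its rank among x[0..i]."""
    x = np.asarray(x, dtype=float)
    return np.array([np.sum(x[: i + 1] <= x[i]) for i in range(len(x))])

def sequential_wilcoxon(x, group):
    """Illustrative two-sample Wilcoxon-type statistic from sequential ranks.

    group[i] is 0 or 1; each sequential rank is standardized by the number of
    observations seen so far, and the contributions of group 1 are summed.
    This only illustrates how the statistic is updated observation by
    observation; the exact score function of the locally most powerful test
    differs.
    """
    sr = sequential_ranks(x)
    i = np.arange(1, len(x) + 1)
    return np.sum((sr / (i + 1.0) - 0.5) * np.asarray(group))

rng = np.random.default_rng(1)
x = rng.normal(size=10)
group = rng.integers(0, 2, size=10)
print(sequential_ranks(x))
print(sequential_wilcoxon(x, group))
```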

A Robust Supervised Variable Selection for Noisy High-Dimensional Data.

Kalina, Jan, Schlenker, Anna.

In BioMed Research International, volume 2015, article ID 320385, pages 1-10, 2015, ISSN 2314-6133.

The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, the usual relevance and redundancy criteria have the disadvantage of being too sensitive to the presence of outlying measurements and/or inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. In particular, redundancy is measured by a new regularized version of the coefficient of multiple correlation, and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers, we also perform the computations for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.
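The general MRMR trade-off that MRRMRR builds on can be sketched as a greedy forward selection that rewards relevance and penalizes redundancy. In the sketch below, the absolute Spearman correlation stands in for the highly robust relevance measure and a shrunken average correlation stands in for the regularized redundancy; both substitutions, as well as the parameter lam and the function name, are simplifying assumptions and not the MRRMRR method itself.

```python
import numpy as np
from scipy.stats import spearmanr

def mrmr_style_selection(X, y, n_select=10, lam=0.1):
    """Greedy forward selection balancing relevance against redundancy.

    Relevance: absolute Spearman correlation of each variable with the class
    label (a simple robust surrogate). Redundancy: average absolute Spearman
    correlation with the already selected variables, shrunk by lam. A sketch
    of the MRMR principle under the assumptions stated above.
    """
    n, p = X.shape
    relevance = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(p)])
    selected = [int(np.argmax(relevance))]

    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            redundancy = np.mean([abs(spearmanr(X[:, j], X[:, s])[0])
                                  for s in selected])
            score = relevance[j] - (1.0 - lam) * redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 200))      # 50 observations, 200 candidate variables
y = rng.integers(0, 2, size=50)     # two groups
print(mrmr_style_selection(X, y, n_select=5))
```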

Classification Methods for High-Dimensional Genetic Data.

Kalina, Jan.

In Biocybernetics and Biomedical Engineering, volume 34, issue 1, pages 10-18, 2014, ISSN 0208-5216.

Standard methods of multivariate statistics fail in the analysis of high-dimensional data. This paper gives an overview of recent classification methods proposed for the analysis of high-dimensional data, especially in the context of molecular genetics. We discuss methods from both biostatistics and data mining with various backgrounds, explain their principles, and compare their advantages and limitations. We also include dimension reduction methods tailor-made for classification analysis, as well as classification methods that reduce the dimension of the computation intrinsically. A common feature of numerous classification methods is the shrinkage estimation principle, which has recently received intensive attention in high-dimensional applications.
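The shrinkage estimation principle mentioned in the abstract can be illustrated by a covariance estimator that shrinks the empirical covariance matrix toward a scaled identity target and plugs it into linear discriminant scores. This is a generic sketch of regularized discriminant analysis under assumed Gaussian classes with a common covariance matrix, not a specific method from the survey; the shrinkage weight lam is an assumed tuning parameter.

```python
import numpy as np

def shrinkage_covariance(X, lam=0.2):
    """Shrinkage estimator: (1 - lam) * S + lam * (tr(S)/p) * I.

    A convex combination of the empirical covariance S and a scaled identity
    target, which keeps the estimate well conditioned when the number of
    variables exceeds the number of samples."""
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    target = np.trace(S) / p * np.eye(p)
    return (1.0 - lam) * S + lam * target

def regularized_lda_scores(X, y, X_new, lam=0.2):
    """Linear discriminant scores using the shrinkage covariance estimate.
    A minimal sketch of the shrinkage principle, not a method from the paper."""
    classes = np.unique(y)
    S_inv = np.linalg.inv(shrinkage_covariance(X, lam))
    scores = []
    for c in classes:
        mu = X[y == c].mean(axis=0)
        prior = np.mean(y == c)
        w = S_inv @ mu
        b = -0.5 * mu @ S_inv @ mu + np.log(prior)
        scores.append(X_new @ w + b)
    return np.column_stack(scores)   # the column with the largest score wins

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (20, 50)), rng.normal(0.5, 1.0, (20, 50))])
y = np.repeat([0, 1], 20)            # 40 samples, 50 features (p > n per class)
print(regularized_lda_scores(X, y, X[:3], lam=0.3).argmax(axis=1))
```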