Learning Classifiers When The Training Data Is Not IID

Murat Dundar, Balaji Krishnapuram, Jinbo Bi, R. Bharat Rao

Most methods for classifier design assume that the training samples are independent and identically distributed (IID) draws from an unknown data-generating distribution, although this assumption is violated in several real-life problems. Relaxing this IID assumption, we consider algorithms from the statistics literature for the more realistic situation where batches or sub-groups of training samples may have internal correlations, although the samples from different batches may be considered to be uncorrelated. Next, we propose simpler (more efficient) variants that scale well to large datasets; theoretical results are provided to support their validity. Experimental results from real-life computer-aided diagnosis (CAD) problems indicate that relaxing the IID assumption leads to statistically significant improvements in the accuracy of the learned classifier. Surprisingly, the simpler algorithm proposed here is experimentally found to be even more accurate than the original version.