Data Complexity Library in C++

Description

The data complexity library (DCoL) is a C++ library that implements a set of measures designed to characterize the apparent complexity of data sets for supervised learning. These measures were originally proposed by Ho and Basu (2002). In particular, the code supplies the following measures:

Measures of overlaps in the feature values from different classes. This library provides routines that compute:

The maximum Fisher's discriminant ratio (F1).
The directional-vector maximum Fisher's discriminant ratio (F1v).
The overlap of the per-class bounding boxes (F2).
The maximum (individual) feature efficiency (F3).
The collective feature efficiency (F4).
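
As an illustration of the first measure in this group, the following is a minimal sketch of the two-class maximum Fisher's discriminant ratio as defined by Ho and Basu (2002): for each feature, the squared difference of the class means divided by the sum of the class variances, maximized over features. The type aliases and function names are hypothetical and are not taken from the DCoL source. The sketch assumes continuous features and uses the population variance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; these are not DCoL's own data structures.
using Sample = std::vector<double>;   // one example: a vector of feature values
using Samples = std::vector<Sample>;  // all examples of one class

// Mean of feature f over a set of samples.
double featureMean(const Samples& s, std::size_t f) {
    double sum = 0.0;
    for (const Sample& x : s) sum += x[f];
    return sum / static_cast<double>(s.size());
}

// Population variance of feature f over a set of samples.
double featureVariance(const Samples& s, std::size_t f, double mean) {
    double sum = 0.0;
    for (const Sample& x : s) sum += (x[f] - mean) * (x[f] - mean);
    return sum / static_cast<double>(s.size());
}

// Two-class F1: the largest per-feature ratio (mu1 - mu2)^2 / (var1 + var2).
// Features whose summed variance is zero are skipped to avoid dividing by zero.
double maxFisherRatio(const Samples& c1, const Samples& c2) {
    assert(!c1.empty() && !c2.empty());
    assert(c1.front().size() == c2.front().size());
    double best = 0.0;
    for (std::size_t f = 0; f < c1.front().size(); ++f) {
        const double m1 = featureMean(c1, f);
        const double m2 = featureMean(c2, f);
        const double denom = featureVariance(c1, f, m1) + featureVariance(c2, f, m2);
        if (denom > 0.0) best = std::max(best, (m1 - m2) * (m1 - m2) / denom);
    }
    return best;
}
```

A large value indicates that at least one feature separates the two classes well on its own. Skipping zero-variance features is a choice made for this sketch only.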

Measures of class separability. This library provides routines that compute:

The minimized sum of the error distance of a linear classifier (L1).
The training error of a linear classifier (L2).
The fraction of points on the class boundary (N1).
The ratio of average intra/inter class nearest neighbor distance (N2).
The leave-one-out error rate of the one-nearest neighbor classifier (N3).
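
To make one of these measures concrete, the following is a minimal sketch of the leave-one-out error rate of the one-nearest neighbor classifier. It assumes continuous features and the Euclidean distance. The function names are hypothetical and do not reflect the DCoL API, and ties between equally distant neighbors are resolved in favor of the lower index.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean distance between two equally sized feature vectors.
double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return sum;
}

// Leave-one-out error of the 1-NN classifier: each point is classified by its
// nearest neighbor among all the other points, and mismatches are counted.
double looOneNnErrorRate(const std::vector<std::vector<double>>& x,
                         const std::vector<int>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    std::size_t errors = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double bestDist = std::numeric_limits<double>::infinity();
        std::size_t nearest = i;
        for (std::size_t j = 0; j < x.size(); ++j) {
            if (j == i) continue;  // leave the query point itself out
            const double d = squaredDistance(x[i], x[j]);
            if (d < bestDist) { bestDist = d; nearest = j; }
        }
        if (y[nearest] != y[i]) ++errors;
    }
    return static_cast<double>(errors) / static_cast<double>(x.size());
}
```

This brute-force search takes quadratic time in the number of examples, which is acceptable for a sketch but may be slow on large data sets.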

Measures of geometry, topology, and density of manifolds. This library provides routines that compute:

The nonlinearity of a linear classifier (L3).
The fraction of maximum covering spheres (T1).
The average number of points per dimension (T2).

The implementation of these complexity measures is based on the descriptions provided by Ho and Basu (2002) and Ho et al. (2006). However, some of the measures have been revised and updated from their initial definitions (Orriols-Puig et al., 2010). The majority of these measures were initially designed for two-class data sets and were only applied to problems with continuous attributes; nominal or categorical attributes were numerically coded and treated as continuous. The latter restriction arose because most of the complexity measures rely on distance functions between attributes. In our implementation, all the measures, except for those based on a linear discriminant and on the directional-vector Fisher's discriminant, have been extended to deal with m-class data sets (m > 2), following the guidelines suggested by Ho et al. (2006). Furthermore, the most relevant distance functions for continuous and nominal attributes have been implemented. In this way, the library can deal with m-class data sets that contain nominal and/or continuous attributes.
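
The text above does not name the distance functions that the library implements. As an illustration only, the following sketch shows one common way to measure distance between examples with mixed attributes: the overlap metric for nominal attributes and the range-normalized difference for continuous ones. The Attribute struct and the mixedDistance function are hypothetical and are not part of the DCoL API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical attribute description; not DCoL's own data structure.
struct Attribute {
    bool nominal;  // true: values are category codes; false: continuous
    double range;  // max - min over the data set (used for continuous attributes only)
};

// Heterogeneous distance: overlap metric for nominal attributes (0 if equal,
// 1 otherwise) and range-normalized absolute difference for continuous ones.
double mixedDistance(const std::vector<double>& a, const std::vector<double>& b,
                     const std::vector<Attribute>& attrs) {
    assert(a.size() == attrs.size() && b.size() == attrs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < attrs.size(); ++i) {
        double d = 0.0;
        if (attrs[i].nominal) {
            d = (a[i] == b[i]) ? 0.0 : 1.0;
        } else if (attrs[i].range > 0.0) {
            d = std::fabs(a[i] - b[i]) / attrs[i].range;
        }
        sum += d * d;  // a constant continuous attribute (range 0) contributes nothing
    }
    return std::sqrt(sum);
}
```

Normalizing by the range keeps continuous attributes on a scale comparable to the 0/1 contribution of nominal ones, so that no single attribute dominates the distance.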

The library also offers two other functionalities:

Partition of the data set by means of stratified k-fold cross-validation.
Conversion of an m-class data set (m > 2) to m two-class data sets. Each new data set discriminates one of the classes against the others.
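
The two functionalities above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function names are hypothetical, class labels are assumed to be integers, and the fold assignment deals the examples of each class to the folds in round-robin order without shuffling them first.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Assigns each example to one of k folds so that every class is spread as
// evenly as possible: examples of a class are dealt to folds in round-robin order.
// Returns, for each example index, the fold it belongs to.
std::vector<std::size_t> stratifiedFolds(const std::vector<int>& labels, std::size_t k) {
    assert(k >= 2);
    std::map<int, std::size_t> seenPerClass;  // examples of each class dealt so far
    std::vector<std::size_t> fold(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        fold[i] = seenPerClass[labels[i]]++ % k;
    }
    return fold;
}

// Builds the labels of the two-class problem "target versus the rest":
// 1 for examples of the target class, 0 for all others.
std::vector<int> oneVersusRest(const std::vector<int>& labels, int target) {
    std::vector<int> binary(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        binary[i] = (labels[i] == target) ? 1 : 0;
    }
    return binary;
}
```

Calling oneVersusRest once per class yields the label vectors of the m two-class data sets. In practice the examples would usually be shuffled before fold assignment so that the folds do not depend on the order of the input file.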

References

T. K. Ho and M. Basu. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289-300, 2002.

T. K. Ho, M. Basu, and M. Law. Measures of geometrical complexity in classification problems. In Data Complexity in Pattern Recognition, pages 1-23. Springer, 2006.

A. Orriols-Puig, N. Macià, and T. K. Ho. Documentation for the data complexity library in C++. Technical report, La Salle - Universitat Ramon Llull, 2010.

Comments and Suggestions

If you have any comments or find any bugs, please send an email to aorriols at gmail dot com or macia dot nuria at gmail dot com.