Description
The data complexity library (DCoL) is a C++ library that implements a set of measures designed to characterize the apparent complexity of data sets for supervised learning. These measures were originally proposed by Ho and Basu (2002). In particular, the code supplies the following measures:
Measures of overlaps in the feature values from different classes. The library provides routines that compute:
- The maximum Fisher's discriminant ratio (F1).
- The directional-vector maximum Fisher's discriminant ratio (F1v).
- The overlap of the per-class bounding boxes (F2).
- The maximum (individual) feature efficiency (F3).
- The collective feature efficiency (F4).
Measures of class separability. The library provides routines that compute:
- The minimized sum of the error distance of a linear classifier (L1).
- The training error of a linear classifier (L2).
- The fraction of points on the class boundary (N1).
- The ratio of average intra/inter class nearest neighbor distance (N2).
- The leave-one-out error rate of the one-nearest neighbor classifier (N3).
Measures of geometry, topology, and density of manifolds. The library provides routines that compute:
- The nonlinearity of a linear classifier (L3).
- The fraction of maximum covering spheres (T1).
- The average number of points per dimension (T2).
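As a rough illustration of what two of the simpler measures compute, the following C++ sketch evaluates F1 and T2 for a two-class data set with continuous attributes. It is not DCoL's code or API: the `Example` struct and the functions `classStats`, `maxFisherRatio`, and `pointsPerDimension` are hypothetical names chosen for this example. The sketch assumes the two-class form of Fisher's ratio from Ho and Basu (2002), (mean0 - mean1)^2 / (var0 + var1) per feature, uses population variances, and skips features whose combined variance is zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: continuous feature values plus a class label (0 or 1).
struct Example {
    std::vector<double> features;
    int label;
};

// Mean and population variance of feature f over the examples of class cls.
static void classStats(const std::vector<Example>& data, std::size_t f, int cls,
                       double& mean, double& var) {
    double sum = 0.0;
    std::size_t n = 0;
    for (const Example& e : data) {
        if (e.label == cls) { sum += e.features[f]; ++n; }
    }
    assert(n > 0 && "each class needs at least one example");
    mean = sum / n;
    double sq = 0.0;
    for (const Example& e : data) {
        if (e.label == cls) {
            const double diff = e.features[f] - mean;
            sq += diff * diff;
        }
    }
    var = sq / n;
}

// F1 (two-class sketch): the largest per-feature Fisher ratio.
double maxFisherRatio(const std::vector<Example>& data) {
    assert(!data.empty());
    const std::size_t d = data.front().features.size();
    double best = 0.0;
    for (std::size_t f = 0; f < d; ++f) {
        double m0, v0, m1, v1;
        classStats(data, f, 0, m0, v0);
        classStats(data, f, 1, m1, v1);
        const double denom = v0 + v1;
        if (denom <= 0.0) continue;  // assumption: skip zero-variance features
        const double diff = m0 - m1;
        best = std::max(best, diff * diff / denom);
    }
    return best;
}

// T2: number of examples divided by number of attributes.
double pointsPerDimension(const std::vector<Example>& data) {
    assert(!data.empty() && !data.front().features.empty());
    return static_cast<double>(data.size()) / data.front().features.size();
}
```

A larger F1 value indicates that at least one attribute separates the two classes well on its own; T2 simply relates sample size to dimensionality.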
The implementation of these complexity measures is based on the descriptions provided by Ho and Basu (2002) and Ho et al. (2006). However, some of them have been revised and updated from their initial definitions (Orriols-Puig et al., 2010). The majority of these measures were initially designed for two-class data sets and were only applied to problems with continuous attributes; nominal or categorical attributes were numerically coded and treated as continuous. This restriction arose because most of the complexity measures rely on distance functions defined over the attributes. In our implementation, all the measures, except for those based on a linear discriminant and on the directional-vector Fisher's discriminant, have been extended to deal with m-class data sets (m > 2), following the guidelines suggested by Ho et al. (2006). Furthermore, the most relevant distance functions for continuous and nominal attributes have been implemented. As a result, the library can handle m-class data sets that contain nominal and/or continuous attributes.
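To make the idea of a distance over mixed attributes concrete, here is a minimal C++ sketch of one commonly used choice, the heterogeneous Euclidean-overlap metric (range-normalized absolute difference for continuous attributes, 0/1 overlap for nominal ones). The text above does not say which distance functions DCoL implements, so this metric is an assumption picked for illustration, and `AttributeInfo` and `heomDistance` are hypothetical names, not part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical attribute description. Nominal values are stored as category
// indices in a double; `range` is only used for continuous attributes.
struct AttributeInfo {
    bool nominal;
    double range;  // max - min of the attribute over the data set
};

// Distance between two examples with mixed nominal/continuous attributes.
double heomDistance(const std::vector<double>& a, const std::vector<double>& b,
                    const std::vector<AttributeInfo>& attrs) {
    assert(a.size() == attrs.size() && b.size() == attrs.size());
    double sumSq = 0.0;
    for (std::size_t i = 0; i < attrs.size(); ++i) {
        double d;
        if (attrs[i].nominal) {
            d = (a[i] == b[i]) ? 0.0 : 1.0;               // overlap metric
        } else if (attrs[i].range > 0.0) {
            d = std::fabs(a[i] - b[i]) / attrs[i].range;  // range-normalized
        } else {
            d = 0.0;  // constant attribute contributes nothing
        }
        sumSq += d * d;
    }
    return std::sqrt(sumSq);
}
```

Normalizing by the range keeps each continuous attribute's contribution in [0, 1], so it is on the same scale as the 0/1 contribution of a nominal attribute.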
The library also offers two other functionalities:
- Partition of the data set by means of stratified k-fold cross-validation.
- Conversion of an m-class data set (m > 2) to m two-class data sets. Each new data set discriminates one of the classes against the others.
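Both functionalities can be sketched in a few lines of C++. The code below is an illustration under stated assumptions, not DCoL's implementation: `stratifiedFolds` and `oneVsRest` are hypothetical names, the fold assignment is deterministic (no shuffling) for brevity, and only the class labels are manipulated, since the attribute values stay unchanged in both operations.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Assigns every example to one of k folds so that each class is spread as
// evenly as possible across folds. Returns fold[i] in [0, k) for example i.
std::vector<std::size_t> stratifiedFolds(const std::vector<int>& labels,
                                         std::size_t k) {
    assert(k >= 2 && k <= labels.size());
    // Group example indices by class label.
    std::map<int, std::vector<std::size_t>> byClass;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        byClass[labels[i]].push_back(i);
    }
    // Deal the indices of each class to the folds in round-robin order. The
    // counter carries over between classes to keep fold sizes balanced.
    std::vector<std::size_t> fold(labels.size());
    std::size_t next = 0;
    for (const auto& entry : byClass) {
        for (std::size_t idx : entry.second) {
            fold[idx] = next;
            next = (next + 1) % k;
        }
    }
    return fold;
}

// Builds one two-class labeling per original class: 1 for examples of that
// class, 0 for examples of any other class.
std::map<int, std::vector<int>> oneVsRest(const std::vector<int>& labels) {
    std::map<int, std::vector<int>> problems;
    for (int c : labels) {
        problems.emplace(c, std::vector<int>());  // one entry per distinct class
    }
    for (auto& p : problems) {
        p.second.reserve(labels.size());
        for (int l : labels) {
            p.second.push_back(l == p.first ? 1 : 0);
        }
    }
    return problems;
}
```

For fold j, the test partition would be the examples with `fold[i] == j` and the training partition the remaining ones. A production version would typically shuffle the indices within each class before dealing them out.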