Histogram-valued data





This data set is "non-standard" in that it is composed of "histogram-valued" attributes, that is, attributes whose value is a histogram rather than, as usual, a single number or category. A precise description of the data can be found in the following paper (see Subsections 3.2 and 5.1):

T. Fober and E. Hüllermeier. Similarity Measures for Protein Structures based on Fuzzy Histogram Comparison. Proceedings WCCI-2010, World Congress on Computational Intelligence, Barcelona, Spain, 2010.

As can be seen from the title, the data set is from the bioinformatics domain. It consists of 355 protein structures (more specifically: so-called protein binding sites), each of which belongs to one of two possible classes: ATP or NADH. A description of each structure is given in terms of 28 variables, and, as mentioned before, each of these variables assumes as a value a histogram over the same domain (a discretization of Euclidean distance).

The data set is stored in plain text format. Each entry starts with the class label (+1 for ATP and -1 for NADH), followed by the description of the structure, in which each histogram is simply a sequence of numbers. These numbers are absolute frequencies; to obtain a relative frequency distribution, each histogram still needs to be normalized. Within a histogram, the frequencies are separated by commas; the histograms themselves are separated by semicolons. The histograms are arranged as follows:

Histogram  1: "Acceptor" -- "Acceptor"
Histogram  2: "Acceptor" -- "Donor-Acceptor"
Histogram  3: "Acceptor" -- "Donor"
Histogram  4: "Acceptor" -- "Aliphatic"
Histogram  5: "Acceptor" -- "Aromatic"
Histogram  6: "Acceptor" -- "Pi"
Histogram  7: "Acceptor" -- "Metal"
Histogram  8: "Donor-Acceptor" -- "Donor-Acceptor"
Histogram  9: "Donor-Acceptor" -- "Donor"
Histogram 10: "Donor-Acceptor" -- "Aliphatic"
Histogram 11: "Donor-Acceptor" -- "Aromatic"
Histogram 12: "Donor-Acceptor" -- "Pi"
Histogram 13: "Donor-Acceptor" -- "Metal"
Histogram 14: "Donor" -- "Donor"
Histogram 15: "Donor" -- "Aliphatic"
Histogram 16: "Donor" -- "Aromatic"
Histogram 17: "Donor" -- "Pi"
Histogram 18: "Donor" -- "Metal"
Histogram 19: "Aliphatic" -- "Aliphatic"
Histogram 20: "Aliphatic" -- "Aromatic"
Histogram 21: "Aliphatic" -- "Pi"
Histogram 22: "Aliphatic" -- "Metal"
Histogram 23: "Aromatic" -- "Aromatic"
Histogram 24: "Aromatic" -- "Pi"
Histogram 25: "Aromatic" -- "Metal"
Histogram 26: "Pi" -- "Pi"
Histogram 27: "Pi" -- "Metal"
Histogram 28: "Metal" -- "Metal"
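
The numbering above lists all unordered pairs of the seven physicochemical properties, including each property paired with itself. The following is a minimal Python sketch that reproduces this numbering; the names PROPERTIES, PAIRS and histogram_number are hypothetical helpers introduced here for illustration and are not part of the data set.

```python
from itertools import combinations_with_replacement

# Property order as used in the histogram list above.
PROPERTIES = ["Acceptor", "Donor-Acceptor", "Donor", "Aliphatic",
              "Aromatic", "Pi", "Metal"]

# All unordered pairs (a property may be paired with itself), generated
# in the same order as Histogram 1 to Histogram 28.
PAIRS = list(combinations_with_replacement(PROPERTIES, 2))


def histogram_number(first, second):
    """Return the 1-based histogram number for a pair of properties.

    The pair is unordered, so ("Metal", "Pi") and ("Pi", "Metal")
    map to the same histogram.
    """
    i, j = sorted((PROPERTIES.index(first), PROPERTIES.index(second)))
    return PAIRS.index((PROPERTIES[i], PROPERTIES[j])) + 1


if __name__ == "__main__":
    for number, (a, b) in enumerate(PAIRS, start=1):
        print(f'Histogram {number:2d}: "{a}" -- "{b}"')
```

With 7 properties there are 7 * 8 / 2 = 28 such pairs, which matches the 28 variables per structure.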

Please note that the data set contains an additional 29th histogram, which gives the number of occurrences of each physicochemical property. The ordering in this histogram is "Acceptor", "Donor-Acceptor", "Donor", "Aliphatic", "Aromatic", "Pi", "Metal".
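
The format described above can be read with a few lines of Python. The following is only a sketch under stated assumptions: it assumes one entry per line, and it assumes that the class label is separated from the first histogram by whitespace, a comma, a semicolon or a colon, since the description does not state this separator. The function names parse_entry and normalize are hypothetical.

```python
import re


def parse_entry(line):
    """Split one entry into (label, histograms).

    Assumption: the entry starts with +1 or -1, followed by some
    separator (whitespace, comma, semicolon or colon) and then the
    histograms. Frequencies are comma-separated, histograms are
    semicolon-separated, as described in the text.
    """
    match = re.match(r"\s*([+-]?1)[\s;,:]+(.*)", line.strip())
    if match is None:
        raise ValueError("entry does not start with a +1/-1 class label")
    label = int(match.group(1))
    histograms = [
        [float(value) for value in chunk.split(",") if value.strip()]
        for chunk in match.group(2).split(";")
        if chunk.strip()  # ignore an empty chunk after a trailing semicolon
    ]
    return label, histograms


def normalize(histogram):
    """Turn absolute frequencies into relative frequencies.

    An all-zero histogram is returned as all zeros to avoid dividing
    by zero.
    """
    total = sum(histogram)
    if total == 0:
        return [0.0] * len(histogram)
    return [count / total for count in histogram]


if __name__ == "__main__":
    # Made-up toy entry with three short histograms, not real data.
    label, histograms = parse_entry("+1 1,3;0,0;2,2")
    print(label, [normalize(h) for h in histograms])
```

For a real entry, the returned list should contain 29 histograms: the 28 pairwise histograms followed by the additional occurrence histogram, which has one count per property.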
Further information: