Software
Few things are more annoying in CS research than looking for that elusive reference implementation or data generator and failing to find it. I have not been very good at making my own code available in the past, but I'll try to do better from here on out.
Data set generation
- The PrePeP implementation described in "PrePeP: A light-weight, extensible tool for predicting frequent hitters"
- The PrePeP prototype described in "PrePeP - A Tool for the Identification and Characterization of Pan Assay Interference Compounds"
- Reimplementation of the CDK-Means algorithm described in "A Bi-clustering Framework for Categorical Data"
- (Partial) reimplementation of the Almaden QUEST generator for itemset data. A more comprehensive version (customer sequences, taxonomy) exists at http://miles.cnuce.cnr.it/~palmeri/datam/DCI/datasets.php, but even after wading through the modifications necessary to compile this legacy code, I did not get useful output and gave up...
Be sure to read the Readme for differences from the paper description. Used in "Objectively evaluating condensed representations and interestingness measures for frequent itemset mining".
- Extension of the QUEST generator, adding a number of distributions for itemset and transaction lengths, as well as itemset weights, to take a step towards more realistic data.
- Generator for event sequences - generates sequences of randomly distributed, time-stamped noise events in which source episodes are embedded, for evaluating episode mining approaches, as described and used in "Understanding episode mining techniques: Benchmarking on diverse, realistic, artificial data".
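To give a rough idea of what the event-sequence generator in the list above does, here is a minimal sketch in Python. It is not the released code: the function name `generate_sequence`, all of its parameters, the uniform distributions, and the treatment of episodes as ordered (serial) symbol tuples are assumptions made purely for illustration. The actual generator may well use different distributions and options.

```python
import random


def generate_sequence(alphabet, episodes, n_noise, n_embeddings,
                      horizon, max_gap, seed=0):
    """Return a time-sorted list of (timestamp, symbol) events.

    Hypothetical sketch: noise events are placed uniformly at random in
    [0, horizon]; each source episode (an ordered tuple of symbols) is
    embedded n_embeddings times, with random gaps of at most max_gap
    between consecutive symbols.
    """
    rng = random.Random(seed)

    # Noise: random symbols from the alphabet at random timestamps.
    events = [(rng.uniform(0, horizon), rng.choice(alphabet))
              for _ in range(n_noise)]

    # Embed each source episode at random start times.
    for episode in episodes:
        for _ in range(n_embeddings):
            t = rng.uniform(0, horizon)
            for symbol in episode:
                events.append((t, symbol))
                # The last events of an embedding may slightly exceed horizon.
                t += rng.uniform(0, max_gap)

    events.sort(key=lambda event: event[0])
    return events


if __name__ == "__main__":
    sequence = generate_sequence(
        alphabet=["a", "b", "c", "d"],
        episodes=[("x", "y", "z")],
        n_noise=20, n_embeddings=3, horizon=100.0, max_gap=2.0, seed=1,
    )
    for timestamp, symbol in sequence[:10]:
        print(f"{timestamp:8.3f}  {symbol}")
```

Because the embedded episodes are known in advance, the output of an episode miner run on such a sequence can be compared directly against them, which is the point of using artificial data for evaluation.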