Synopsis:

  java zimmermann_quest_reimplementation [-N int | -L int | -I int | -T int | -D int | -c double | -C | -?]

The following options are available:

  -N int     Number of used items (default 1000)
  -L int     Number of potentially large itemsets (default 2000)
  -I int     Average size of potentially large itemsets (default 4)
  -T int     Average size of transactions (default 10)
  -D int     Number of transactions (size of data set) (default 10000)
  -c double  Correlation level (default 0.5); if set to 0, all potentially
             large itemsets are drawn randomly from the pool of items
  -C         No corruption allowed -- all source itemsets are embedded as is
  -?         Help (exits program)

---

Algorithmic Remarks:

This program implements the data generation process described in:

  Rakesh Agrawal, Ramakrishnan Srikant '94,
  "Fast Algorithms for Mining Association Rules"

which is, however, ambiguous at some points.

First and foremost, this implementation *does not* allow duplicate items in
potentially large itemsets or transactions; duplicate draws therefore also do
not count against the size limit.

Furthermore:

-- "Items in the first itemset are chosen randomly."
   - Items are sampled uniformly.

-- "To model the phenomenon that large itemsets often have common items, some
   fraction of items in subsequent itemsets are chosen from the previous
   itemset generated. We use an exponentially distributed random variable with
   mean equal to the correlation level to decide this fraction for each
   itemset."
   - Sampling an exponential distribution can return values larger than 1;
     those are rejected and resampled. Also, if a potentially large itemset is
     much larger than its predecessor, a "fraction" of its items can exceed
     the size of said predecessor -- this would lead to redundant itemsets, so
     we resample such that only a true subset of the predecessor itemset is
     included (see the rejection-sampling sketch at the end of this file).

-- The interplay of "If the large itemset on hand does not fit in the
   transaction, the itemset is put in the transaction anyway in half the
   cases, and the itemset is moved to the next transaction the rest of the
   cases." and "To model the phenomenon that all the items in a large itemset
   are not always bought together, we assign each itemset in T a corruption
   level c. When adding an itemset to a transaction, we keep dropping an item
   from the itemset as long as a uniformly distributed random number between 0
   and 1 is less than c." is ambiguous.
   - This implementation checks the size of the potentially large itemset
     *before* embedding it in the transaction (and hence before corruption and
     before the inclusion check in the transaction) and on that basis decides
     whether to defer the itemset to the next transaction (see the corruption
     sketch at the end of this file).

-- "The corruption level for an itemset is fixed and is obtained from a normal
   distribution with mean 0.5 and variance 0.1."
   - Sampling from this distribution can yield corruption levels > 1 as well
     as < 0; these are rejected and resampled.

Potentially as a result of these decisions, the generated data shows somewhat
different characteristics w.r.t. item count distributions than the T10I4D100K
and T40I10D100K data available at http://fimi.ua.ac.be/.

---

Output:

Generator output goes to the command line and consists of:

- Parameter settings, each on a separate line, preceded by '#'
- Potentially large itemsets with their corresponding weight (separated by
  ':') and corruption level (separated by ','), each on a separate line
  preceded by '#'
- Transaction data, one transaction per line (see the reading sketch at the
  end of this file)
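
---

Illustrative Sketches:

The following Java sketch is not taken from the actual generator source; class
and method names are illustrative assumptions. It shows how the two
rejection-sampling decisions from the Algorithmic Remarks can be realized: the
reused-item fraction is drawn from an exponential distribution with mean equal
to the correlation level and resampled while it exceeds 1, and the corruption
level is drawn from a normal distribution with mean 0.5 and variance 0.1 and
resampled while it falls outside [0, 1].

  import java.util.Random;

  public class RejectionSamplingSketch {

      private static final Random RNG = new Random();

      // Fraction of items reused from the previous potentially large itemset:
      // Exp(mean = correlationLevel), resampled while the draw exceeds 1.
      static double sampleReusedFraction(double correlationLevel) {
          double fraction;
          do {
              // inverse-transform sampling of an exponential distribution
              fraction = -correlationLevel * Math.log(1.0 - RNG.nextDouble());
          } while (fraction > 1.0);
          return fraction;
      }

      // Corruption level: Normal(mean = 0.5, variance = 0.1), i.e. standard
      // deviation sqrt(0.1), resampled while the draw lies outside [0, 1].
      static double sampleCorruptionLevel() {
          double level;
          do {
              level = 0.5 + Math.sqrt(0.1) * RNG.nextGaussian();
          } while (level < 0.0 || level > 1.0);
          return level;
      }

      public static void main(String[] args) {
          System.out.println("fraction:   " + sampleReusedFraction(0.5));
          System.out.println("corruption: " + sampleCorruptionLevel());
      }
  }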
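
A second sketch, again only illustrative, shows the corruption loop quoted
above: items are dropped from a copy of the potentially large itemset as long
as a uniformly distributed random number between 0 and 1 is less than the
itemset's corruption level. Which item gets dropped is not specified in the
paper; dropping a randomly chosen item here is an assumption.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  public class CorruptionSketch {

      // Returns a (possibly) corrupted copy of the given potentially large
      // itemset; the original itemset is left untouched.
      static List<Integer> corrupt(List<Integer> itemset, double corruptionLevel,
                                   Random rng) {
          List<Integer> corrupted = new ArrayList<>(itemset);
          while (!corrupted.isEmpty() && rng.nextDouble() < corruptionLevel) {
              // drop a randomly chosen item (assumption, see above)
              corrupted.remove(rng.nextInt(corrupted.size()));
          }
          return corrupted;
      }
  }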
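
Finally, a sketch for consuming the generator output described in the Output
section, assuming the command-line output has been redirected to a file (the
name quest_output.txt is a placeholder). It skips the '#' lines (parameter
settings and potentially large itemsets) and collects the remaining lines as
transactions; it further assumes -- which is not stated above -- that items
within a transaction are whitespace-separated integer identifiers.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  public class OutputReaderSketch {

      public static void main(String[] args) throws IOException {
          List<int[]> transactions = new ArrayList<>();
          try (BufferedReader in =
                   new BufferedReader(new FileReader("quest_output.txt"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  if (line.isEmpty() || line.startsWith("#")) {
                      continue; // parameter settings / potentially large itemsets
                  }
                  String[] tokens = line.trim().split("\\s+");
                  int[] transaction = new int[tokens.length];
                  for (int i = 0; i < tokens.length; i++) {
                      transaction[i] = Integer.parseInt(tokens[i]);
                  }
                  transactions.add(transaction);
              }
          }
          System.out.println("read " + transactions.size() + " transactions");
      }
  }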