Abstract—Maximal Frequent Patterns can be mined

using breadth-first search or depth-first search. The pure BFS algorithms work

well when all maximal frequent itemsets are short. The pure DFS algorithms work

well when all maximal frequent itemsets are long. Both the pure BFS and pure

DFS techniques will not be efficient, when the dataset contains some of long

maximal frequent itemsets and some of short maximal frequent itemsets.

Efficient pruning techniques are required to mine MFI from these kinds of

datasets. An algorithm (MFIMiner) using Breadth-First search with efficient

pruning mechanism that competently mines both long and short maximal frequent

itemsets is proposed in this paper. The performance of the algorithm is

evaluated and compared with GenMax and Mafia algorithms for T40I10D100K, T10I4D100K and Retail,

dataset. The result shows that, proposed algorithm has significant improvement

than existing algorithms for sparse datasets.

Index Terms—

Breadth-First Search with Pruning, Mining MFIs.

————————————————

· Mrs.K.Sumathi is with the Department of Computer Applications,

K.L.N.College of Information and Technology, Pottapalayam, Sivagangai

District, India. E-mail: [email protected]

· Dr.S.Kannan is with the

Department of Computer Applications, Madurai Kamaraj University, Madurai

District, Tamil nadu, India

E-mail: [email protected]

· Mr.K. Nagarajan

is a chief architect in Tata Consultancy Service, Chennai, Tamil nadu, India.

E-mail: [email protected]

http://in.linkedin.com/in/nagarajank.

—————————— u ——————————

1 Introduction

Mining frequent patterns plays an

important role in data mining, and it is widely used in applications regarding

association rules and customer behavior analysis etc., All frequent itemsets can

be constructed from Maximal Frequent itemsets(MFI), since every subset of MFI

is a frequent itemset. So many data mining applications need only MFI instead

of frequent itemset.

Bayardo presented MaxMiner

algorithm5 for mining maximal frequent

itemset in 1998. Maxminer used

set-enumeration tree as the concept framework and followed breadth-first

searching technique, as well as superset pruning strategy and dynamic recording

strategy. Pincer-Search 4 algorithm was presented by D. Lin and Z. M. Kedem.

Pincer search used both top-down and bottom-up search for mining all MFI. The

Depth-Project algorithm adopted depth-first approach with superset pruning

strategy & dynamic recording methods to mine all MFI 6. Burdick proposed

an algorithm called MAFIA3 which used depth-first search and stored data-set

in longitudinal bitmaps. MAFIA used three pruning strategy including Parent

Equivalence Pruning (PEP), Frequent Head Union Tail (FHUT) and Head Union Tail

MFI (HUTMFI). GenMax2 combined the pruning and mining process, utilizing two

approaches for mining MFI efficiently. The first approach is GenMax constructs

Local Maximal Frequent Itemsets (LMFI) for fast superset checking. Another

approach is GenMax uses Diffset Propagation method for fast support calculation

when the dataset is highly dense in nature.

2 Proposed

Work

The MFIMiner algorithm that efficiently mines both

long and short maximal frequent itemsets is proposed in this paper. The

MFIMiner algorithm runs in bottom-up direction and employs pruning to

efficiently mine all MFIs from the dataset

In general, the structure of the dataset may be in

two different ways – horizontal data format and vertical data format. This

approach uses vertical data representation of dataset. In this format data is

represented in item-tidset format, where item is the name of the item and

tidset is the set of transaction identifiers containing the item.

MFIMiner computes the frequency of an item by

counting the number of tids in the tidset and computes the frequency of an

itemset by intersecting the tidsets of corresponding items. The data structures

used in the MFIMiner algorithm are FI which denotes set of frequent itemsets,

CI which denotes set of candidate items, MFI which denotes the set of Maximal

Frequent Itemsets.

2.1

MFIMiner Algorithm

The first step of MFIMiner algorithm is to mine all

frequent items in the dataset. The frequent items are reordered in increasing

order of their support and added to FI. Candidates for each frequent item in FI

are obtained and added to CI. For example candidates of frequent item FI1

are added to CI1, candidates of FI2 are added to CI2

and so on. Let FIj ( FIj ? FI) be the frequent

item and CIj (CIj ? CI)is the set of candidates

for FIj.

The frequent extension of FIj is (FIj1,

FIj2…FIjn) where n is the number of candidates in CIj.

Frequent extension of FIj1 is constructed from CIj1 i.e. CIj1 ?(FIj2, FIj3…FIjn).

Initial candidates of FIj1 are constructed from (FIj2, FIj3…FIjn)

and candidates of FIj2 are constructed from (FIj3, FIj4…FIjn)

and so on. Exact candidates for each frequent extension are constructed using

generateCandidate method. The exact candidates of a frequent itemset do not

include the infrequent candidates of the frequent itemset. When a frequent

itemset has no candidate and has no superset in MFI, then frequent itemset is

added to MFI.

In the next step, the FIi ? CIi (i is varied from 1 to n) is removed from FI

and CI respectively, and checked, whether it has superset in MFI or not. If it

has superset in MFI then ignored. Otherwise the frequency of FIi ? CIi is computed. If

FIi ? CIi

is frequent then it is added to MFI. The next pair (FIi+1 ? CIi+1) is taken for

test. This step is repeated until an infrequent FIk ? CIk is obtained

If FIk ? CIk has no superset

in MFI and it is not frequent, then the following steps are done.

1.

Parent

Equivalent Pruning

2.

Frequent

Extension to the next level

When an infrequent FIk ? CIk is found,

parent equivalent pruning is done. This pruning which was introduced by

Calimlim in Mafia algorithm3 removes items from the candidate set (CIk)

and add those items to the frequent itemset(FIk) if the tidset of

any item in CIk includes the tidset of FIk.

The frequent itemset (FIk) is extended to

next level using items of CIk in bottom-up direction. The candidates

of each frequent extension are generated and infrequent candidates are pruned.

The frequent extensions and candidates of each frequent extension are added to

FI and CI respectively.

The data structures FI and CI are updated when new

frequent extension and candidates of frequent extension are found. MFI is updated when the frequent extension

has no candidate and no superset in MFI or FIi ? CIi has no superset

in MFI and is frequent.

MFIMiner algorithm is explained with following

example. Consider the transaction database DB Table 1.1 which includes six

different items, I = {A, B, C, D, E, F} and six transactions T= {T1, T2, T3,

T4, T5, T6}. The vertical data format of the database DB is given below.

All frequent

items are extracted and reordered in ascending order with respect to the

support. The support of an item is directly given by the number of transactions

in the tidsets. For example, the minimum

support is considered to be of 3 transactions. In database DBTable 1.1, all

the items are having more than two tids in the tidset, and so all the items are

frequent.

The items A, B, C, D, E and F are frequent items and

will be considered to the next level. These items are sorted in ascending order

of their support.

Tables

1.1: Vertical Data Representation of the

transactional database DB

Items

Tidsets

A

T1, T2, T3, T4, T5

B

T2, T5, T6

C

T1, T2, T4,

T5, T6

D

T1, T3, T4, T5

E

T2, T3, T4,

T5, T6

F

T2, T4, T6

The reordered frequent items are B, F, D, A, C, E and

candidates of each frequent item are obtained. Frequent items and candidates of

each frequent item of database DB Table 1.1 is given in Table 1.2.

Table 1.2 :

Frequent items and Candidate Items in database DB with support 3 transactions

Frequent Items

Candidates of Frequent Item

B

C,E

F

C,E

D

A,C,E

A

C,E

C

E

E

–

FI1

and CI1 are removed from FI and CI respectively and Mining MFI is

started by checking the frequency of FI1 ? CI1, continued

until an infrequent FIn ? CIn is

obtained. When an infrequent FIi

? CIi

is obtained, the parent equivalent pruning is done and frequent extension of

the FIi is generated using CIi and new candidates of

frequent extension are also obtained. The frequent extensions of FIi are

generated from CIi and each frequent extension and its candidates

are added to FI and CI respectively. Frequent extensions of a frequent itemset

are generated as follows.

frequentExtension

(FIi, CIi)

For each x ?

CIi

FIi+1 = FIi ?

x; // new frequent extension

CIi.remove(x);

// initial candidates of new

frequent extension

CIi+1=generateCandidate(FIi+1,CIi)

; // exact candidates of new frequent

extension

If the CIi+1 is empty and the FIi+1

has no superset in MFI, then FIi+1 is added to MFI. In database DBTable 1.1, the first two

frequent item and its candidates are {B}-{C,E} and {F}-{C,E}, has no superset

in MFI, are frequent and added to MFI. The next pair {D}-{A,C,E} is not

frequent and parent equivalent pruning is done. The item A is trimmed from the

candidate set and added to the frequent item D. The frequent extension of {DA}

is generated from {C, E} and candidates of each frequent extension are

generated. Frequent extension and candidates are {D,A,C}-{} and {D,A,E}-{}.

Both of these frequent itemsets, have no superset in MFI, are frequent, and are

added to MFI. The next pair, is {A}-{C, E} has no superset in MFI, is frequent

and added to MFI. The subsequent pairs,

are {CE} {} and {E}{}, have superset in MFI and ignored. MFIMiner algorithm returns

a subset of the MFI and would

require post-pruning to eliminate non-maximal patterns. MFIs with minimum

support 3 in database DB are {ACD}, {ACE}, {ADE}, {BCE} and {CEF}.

Number of MFI generated depends on the support value

which is specified by the user is shown in table 1.3. MFIs with minimum support

2 in database DB are ACEF, ACDE, ABCE and BCEF.

Table

1.3. Maximal frequent itemsets.

Tid

Items

Maximal Frequent Itemsets

Minimum support = 3

Maximal Frequent Itemsets

Minimum support = 2

1

ACD

ACD

ACE

ADE

BCE

CEF

ACEF

ACDE

ABCE

BCEF

BCEA

2

ABCEF

3

ADE

4

ACDEF

5

ABCDE

6

BCEF

Algorithm –MFIMiner

Input: dataset D, support S

Output: Maximal Frequent Itemsets MFI

MFIMiner (Dataset D, Support S)

BEGIN

1.

Generate

frequent items and reorder them in ascending order of their support.

2.

Generate

candidate items for each frequent item and reorder the candidates in increasing

order of support.

3.

For each x ? FIs // FI-> set of frequent items

If candidate(x) is empty and x

has no superset in MFI

AddtoMFI(x);

continue;

c=cand(x);

FI=FI/x //

remove the frequent item from the set of frequent item(FI)

CI=CI/cand(x) // remove the candidates of FI from

candidate set

if x ? c has

superset in MFI continue;

if size of x ? c is 1 or 2 then add x

? c to MFI;

continue;

else if x ? c is frequent then add x

? c to MFI;

continue;

else

for each y ? c

If(tidset(y).containsAll(tidset(x))

x.add(y);

c.remove(y);

Bottomup_ExternFI(x,c,tidset(x))

END

Bottomup_ExternFI Method

Input: Frequent Itemset, Candidate set, tidset of Frequent Itemset

Output: Frequent extensions of frequent itemset and candidates

of each frequent extension, MFIs if generated.

Bottomup_ExternFI(frequent,candidate,

tidset(frequent))

BEGIN

For each x ? candidate

Nfrequent=frequent ?

x // current frequent itemset

candidate.remove(x)

// candidates of current frequent itemset

if candidate is empty && Nfrequent has

no superset in MFI

MFI.add(Nfrequent); continue;

newcandidate=GenerateCandidates(Nfrequent,candidate, tidset(frequent))

if newcandidate is

empty

if Nfrequent has no superset in MFI

MFI.add(Nfrequent);

else

FI.add(Nfrequent);

// New frequent

itemset is added to FI.

cand(Nfrequent)=newcandidate;

// candidates of new

frequent item is added to the candidate set.

END

GenerateCandidates

Input: Frequent Itemset, Candidate set, tidset of Frequent Itemset

Output: Exact Candidates of Frequent Itemset.

GenerateCandidates(frequent,candidate,

tid(frequent))

BEGIN

cand=null; // cand will contain exact candidates of

frequent item(frequent)

for each x ?

candidate

If(tid(frequent)

? tid(x) ? support)

cand.add(x); // candidates are stored in increasing order

of support.

return (cand);

END

The MFIMiner algorithm follows

breadth-first search strategy and candidate generation method. When a candidate

of a frequent itemset contains all tidsets of the frequent items, the candidate

is removed from the candidate set and added to the frequent itemsets. The tidset

of frequent itemset is sent to quickly generate exact candidates of the

frequent extensions. MFIMiner generates maximal frequent itemsets before

generating all frequent Itemsets. Frequent itemsets are added to the MFI

directly, if it has no superset in MFI.

2.2Pruning

MFIMiner

algorithm using four pruning techniques to reduce the search space are given

below.

1.

Reordering

of items with respect to its support in ascending order. This reordering

technique is introduced by Roberto Bayardo in

MaxMiner algorithm5 for mining maximal frequent itemsets. Once frequent items are generated, they are

reordered with respect to its support in ascending order.

2.

The

FIi ? CIi

can be directly added to MFI, if they have no superset in MFI. The elements of

frequent FIi ?

CIi are not used for finding frequent extension.

3.

Infrequent

candidates are pruned in bottom-up direction using GenearateCandidate method.

Infrequent candidates of frequent itemset are removed from initial candidates

to construct exact candidates.

4.

Parent equivalent pruning is done to trim the

item in candidate set of the frequent itemsets. This pruning was introduced by

calimlim in Mafia algorithm. Items in the candidate set are added to the

frequent itemset without performing frequency computation if the tidset of item

contains all tidset of frequent item.

3 Performance Evaluation

To evaluate the performance of the proposed MFIMiner

algorithm, it is implemented in Java language

on a 2.93 GHz Intel core PC with 2.96GB of main memory. The performance of

proposed MFIMiner is compared with java version of GenMax2 and Mafia3.

Timings in the figures are based on total time including all preprocessing

costs such as horizontal-to-vertical conversion.

The performance of MFIMiner algorithm is compared

with two different algorithms. It is observed that the performance can vary

significantly depending on the dataset characteristics. To evaluate the

performance of MFIMiner

algorithm, three different benchmark datasets are used. All these datasets can

be downloaded from FIMI Repository7. Datasets taken for experiment are

·

T10I4D100K Dataset

·

T40I10D100K Dataset

·

Retail Dataset

3.1 T10I4D100K

Dataset

The first dataset is T10I4D100K7 which contains

1000 attributes and 100,000 records. The

average record length is 10. In T10I4D100K dataset, the number of frequent

items is huge, the length of frequent itemset is short and frequent itemset

will have small number of candidates.

This dataset has very high concentrations of itemsets

around two and three items long. The number of maximal frequent itemsets is not

much smaller than number of frequent itemsets. The mean pattern length

is very small and it is around 2 to 6.

The performance of MFIMiner algorithm has been compared

with existing algorithms for various support values and the results show that

MFIMiner algorithm generates all MFI quickly than the existing algorithms. When

support values are decreased, performance of MFIMiner is significantly

increased than GenMax which is considered as an efficient algorithm as shown in

Figure 1.1. For T10I4D100K dataset, MFIMiner performance is around 2 to 4 times

better than Mafia for the taken support.

Figure 1.1 Performance of MFIMiner algorithm on T10I4D100K dataset.

3.2 T40I10D100K Dataset

The second dataset is T40I10D100K7 which contains

1000 attributes and 100,000 records. The

average record length is 40. In T40I10D100K

dataset, the number of frequent items is huge, the length of frequent

itemsets is short and frequent itemsets have small number of candidates. The T40I10D100K is characterized by very short maximal

frequent itemsets, where the average length of maximal frequent itemsets is

very small and it is around two to six items.

The maximum number of maximal

patterns is of length one or two.

Number of maximal patterns

is not much smaller than number of frequent itemsets.

The performance of MFIMiner algorithm has been

compared with existing algorithms for various support values and the results

show that MFIMiner algorithm gives better performance than the existing

algorithms. For T40I10D100K dataset, the performance of MFIMiner is slightly

better than GenMax as shown in Figure 1.2. For T40I10D100K dataset, MFIMiner

performance is up to 2 times better than Mafia for the taken support.

Figure

1.2 Performance of MFIMiner algorithm on T40I10D100K dataset.

3.3 Retail Dataset

The third dataset is Retail which contains 16,470

items and 88,162 transactions. The

average number of distinct items in each transaction is 13 and most transaction

contains items between 7 and 11 items. Mean pattern

length is long when compared to sparse datasets. In retail dataset, the number of frequent items

is huge and frequent itemsets have large number of candidates.

The performance of MFIMiner algorithm has been

compared with existing algorithms for various support values and results show

that MFIMiner algorithm gives better performance than the GenMax algorithm. On

Retail dataset, we find that MFIMiner is the fastest, having a slight edge over

GenMax.

Figure 1.3 Performance of MFIMiner

algorithm on Retail dataset.

Figure 1.3 illustrates that, the MFIMiner algorithm has

better performance than GenMax and Mafia has better performance than MaxMiner

on Retail dataset.

4 Result

and Discussion

The

MFIMiner algorithm is tested using different datasets. T10I4D100K and

T40I10D100K dataset have more number of frequent items for the taken support.

Most of the frequent items have less number of candidates. The length of MFI generated from these

candidates is close to the number of candidates. For these dataset, number of

frequent extensions is small and MFIMiner generates all MFI quickly.

Retail dataset has more number of

frequent items for the taken support and each frequent itemset has large number

of candidates when compared to sparse datasets. For Retail dataset, MFIMiner

takes some additional time to generate exact candidates. The number of frequent extension is huge

and FI, CI having huge number frequent items and candidate sets respectively.

The performance of MFIMiner is fine,

when number of frequent extension generation is small. The performance of

MFIMiner is not up to the level, when frequent items in the dataset have large

number of candidate items and length of maximal frequent itemsets generated for

these candidates is small.

5 Conclusion

MFIMiner

algorithm for mining maximal frequent patterns is introduced in this paper.

MFIMiner algorithm employs breadth-first search and pruning to remove the

infrequent candidates. The MFIMiner algorithm is compared with GenMax2 and

Mafia3 algorithms and the results show that MFIMiner algorithm performs

well than existing algorithms for

T40I10D100K and T10I4D100K dataset and performance of MFIMiner is improved than

GenMax for Retail datasets.

6 References

1 R. Agrawal, T. Imielienski and A. Swami,

“Mining association rules between sets of items in largedatabases. In P.

Bunemann and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD Conference

on Management of Data, Pages 207-216, Newyork, 1993, ACM Press

2 Karam Gouda and Mohammed

J. Zaki, 2005. GenMax: An Efficient Algorithm for Mining Maximal Frequent

Itemsets. In the Proceedings of the Data Mining and Knowledge Discovery, 11,

1-20, 2005.

3 Burdick, D., M. Calimlim

and J. Gehrke, “MAFIA: A maximal frequent itemset algorithm for transactional

databases”, In International Conference on Data Engineering, pp: 443 – 452,

April 2001, doi = 10.1.1.100.6805

4 D. Lin and Z. M. Kedem,

“Pincer-Search: A New Algorithm for Discovering the Maximum Frequent

Set”, In Proceedings of VI Intl. Conference on Extending Database

Technology, 1998.

5 Roberto Bayardo,

“Efficiently mining long patterns from databases”, in ACM SIGMOD Conference

1998.

6

Agrawal, R., Aggarwal, C., and Prasad, V. 2000. Depth first generation of long

patterns. In 7th Int’l Conference on Knowledge Discovery and Data Mining, pp.

108–118.

7

www.fimi.cs.helsinki.fi/fimi03/datasets.html.

Authors Biography

Mrs.K.Sumathi is an assistant professor of Computer Application

Department at K.L.N.College of Information Technology, Madurai. She received

her B.Sc. in Physics from Madurai Meenakshi College in 1998, her MCA from

Madurai Kamaraj old university in 2001 and her MPhil degree in Computer Science

from Madurai Kamaraj University in 2008. Her research Interests includes

Frequent Itemset Mining and Classification Model Generation.

Dr.S.Kannan is an associate professor of Computer Application

Department at Madurai Kamaraj University. He received his B.Sc. in Physics,

M.Sc. in Physics, M.Sc.in Computer Science from Madurai Kamaraj university in

1985,1987 and 1996 respectively. He received his Master of Philosophy and Ph.D.

degrees in Computer Science from Madurai Kamaraj University in 1998 and 2010

respectively. His experience also includes presenting more than 15 papers in

conference and seminars on Associative Classifications and Maximal Frequent

Itemset Mining. His research interests focus on Data Mining and Image Processing.

Mr.K. Nagarajan has been with Tata Consultancy Service, Chennai for over

15 years and has held leadership positions in various retail accounts across

the globe. He joined TCS in Mumbai, India in 1997. An active member of IEEE and

published various white papers on Enterprise Data warehouse, Business

Intelligence, Decision Sciences and Data Mining. He has filed patent on

advanced search algorithms and guided analytics (pending). His research work focus

on advanced Big Data Architectures and databases. He is a TOGAF 8.0 Certified

and Sliver Certified Netezza Architect.