Abstract—Maximal frequent itemsets can be mined using breadth-first or depth-first search. Pure BFS algorithms work well when all maximal frequent itemsets are short; pure DFS algorithms work well when all maximal frequent itemsets are long. Neither pure technique is efficient when the dataset contains a mixture of long and short maximal frequent itemsets, and efficient pruning techniques are required to mine MFIs from such datasets. This paper proposes MFIMiner, a breadth-first-search algorithm with an efficient pruning mechanism that competently mines both long and short maximal frequent itemsets. The performance of the algorithm is evaluated and compared with the GenMax and Mafia algorithms on the T40I10D100K, T10I4D100K and Retail datasets. The results show that the proposed algorithm improves significantly on the existing algorithms for sparse datasets.

 

Index Terms—
Breadth-First Search with Pruning, Mining MFIs.

————————————————
·   Mrs. K. Sumathi is with the Department of Computer Applications, K.L.N. College of Information Technology, Pottapalayam, Sivagangai District, India. E-mail: [email protected]
·   Dr. S. Kannan is with the Department of Computer Applications, Madurai Kamaraj University, Madurai District, Tamil Nadu, India. E-mail: [email protected]
·   Mr. K. Nagarajan is a chief architect at Tata Consultancy Services, Chennai, Tamil Nadu, India. E-mail: [email protected], http://in.linkedin.com/in/nagarajank
 

——————————————

1   Introduction

Mining frequent patterns plays an important role in data mining and is widely used in applications such as association rule mining and customer behavior analysis. All frequent itemsets can be constructed from the maximal frequent itemsets (MFIs), since every subset of an MFI is a frequent itemset. Many data mining applications therefore need only the MFIs instead of all frequent itemsets.

Bayardo presented the MaxMiner algorithm [5] for mining maximal frequent itemsets in 1998. MaxMiner used the set-enumeration tree as its conceptual framework and followed a breadth-first search technique, together with superset pruning and dynamic reordering strategies. The Pincer-Search algorithm [4] was presented by D. Lin and Z. M. Kedem; it used both top-down and bottom-up search for mining all MFIs. The DepthProject algorithm adopted a depth-first approach with superset pruning and dynamic reordering to mine all MFIs [6]. Burdick et al. proposed an algorithm called MAFIA [3], which used depth-first search and stored the dataset in vertical bitmaps. MAFIA used three pruning strategies: Parent Equivalence Pruning (PEP), Frequent Head Union Tail (FHUT) and Head Union Tail MFI (HUTMFI). GenMax [2] combined the pruning and mining processes, using two techniques to mine MFIs efficiently: it constructs Local Maximal Frequent Itemsets (LMFIs) for fast superset checking, and it uses diffset propagation for fast support computation when the dataset is highly dense.

2   Proposed Work

This paper proposes the MFIMiner algorithm, which efficiently mines both long and short maximal frequent itemsets. MFIMiner runs in the bottom-up direction and employs pruning to efficiently mine all MFIs from the dataset.

In general, a dataset may be structured in two different ways: horizontal data format and vertical data format. This approach uses the vertical representation, in which data is stored in item-tidset format, where item is the name of the item and tidset is the set of identifiers of the transactions containing the item.

MFIMiner computes the frequency of an item by counting the number of tids in its tidset, and the frequency of an itemset by intersecting the tidsets of the corresponding items. The data structures used in the MFIMiner algorithm are FI, the set of frequent itemsets; CI, the set of candidate items; and MFI, the set of maximal frequent itemsets.
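The vertical representation and the support computation can be illustrated with a minimal Python sketch. It uses the example database DB of Table 1.1; the function name `support` is illustrative, not from the paper.

```python
# Vertical (item -> tidset) representation of the example database DB
# of Table 1.1; tidsets are sets of transaction identifiers.
tidsets = {
    "A": {1, 2, 3, 4, 5},
    "B": {2, 5, 6},
    "C": {1, 2, 4, 5, 6},
    "D": {1, 3, 4, 5},
    "E": {2, 3, 4, 5, 6},
    "F": {2, 4, 6},
}

def support(itemset, tidsets):
    """Support of an itemset = size of the intersection of its tidsets."""
    tids = set.intersection(*(tidsets[i] for i in itemset))
    return len(tids)

print(support({"A"}, tidsets))            # frequency of a single item: 5
print(support({"A", "C"}, tidsets))       # |{1,2,4,5}| = 4
print(support({"A", "C", "E"}, tidsets))  # |{2,4,5}| = 3
```

This is exactly the tidset-intersection scheme MFIMiner relies on: no rescan of the transactions is needed to compute the support of an itemset.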

2.1   MFIMiner Algorithm

The first step of the MFIMiner algorithm is to mine all frequent items in the dataset. The frequent items are reordered in increasing order of their support and added to FI. Candidates for each frequent item in FI are obtained and added to CI: for example, candidates of frequent item FI1 are added to CI1, candidates of FI2 are added to CI2, and so on. Let FIj (FIj ∈ FI) be a frequent item and CIj (CIj ⊆ CI) the set of candidates for FIj.

The frequent extensions of FIj are (FIj1, FIj2, …, FIjn), where n is the number of candidates in CIj. The frequent extension FIj1 is constructed from CIj1, where CIj1 ⊆ (FIj2, FIj3, …, FIjn): the initial candidates of FIj1 are taken from (FIj2, FIj3, …, FIjn), the candidates of FIj2 from (FIj3, FIj4, …, FIjn), and so on. The exact candidates of each frequent extension are constructed using the generateCandidate method; the exact candidates of a frequent itemset exclude its infrequent candidates. When a frequent itemset has no candidate and no superset in MFI, it is added to MFI.

In the next step, each FIi ∪ CIi (i varying from 1 to n) is removed from FI and CI respectively and checked for a superset in MFI. If it has a superset in MFI, it is ignored; otherwise the frequency of FIi ∪ CIi is computed. If FIi ∪ CIi is frequent, it is added to MFI and the next pair, FIi+1 ∪ CIi+1, is tested. This step is repeated until an infrequent FIk ∪ CIk is obtained.

If FIk ∪ CIk has no superset in MFI and is not frequent, the following steps are performed.

1. Parent equivalence pruning

2. Frequent extension to the next level

 

When an infrequent FIk ∪ CIk is found, parent equivalence pruning is done. This pruning, introduced by Calimlim et al. in the Mafia algorithm [3], removes an item from the candidate set (CIk) and adds it to the frequent itemset (FIk) whenever the tidset of that item in CIk includes the tidset of FIk.
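Parent equivalence pruning can be sketched as follows, using the tidsets of Table 1.1; the function name `parent_equivalence_prune` is illustrative, not from the paper.

```python
def parent_equivalence_prune(fi, fi_tids, candidates, tidsets):
    """Move every candidate whose tidset is a superset of the frequent
    itemset's tidset from the candidate set into the itemset itself:
    each transaction containing FI also contains that item, so the
    support is unchanged and no frequency computation is needed."""
    for item in list(candidates):
        if fi_tids <= tidsets[item]:      # tidset(FI) included in tidset(item)
            fi.add(item)
            candidates.remove(item)
    return fi, candidates

tidsets = {"A": {1, 2, 3, 4, 5}, "C": {1, 2, 4, 5, 6},
           "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}}
fi, cands = parent_equivalence_prune({"D"}, tidsets["D"], {"A", "C", "E"}, tidsets)
print(sorted(fi), sorted(cands))   # ['A', 'D'] ['C', 'E']
```

This reproduces the step from the worked example: every transaction containing D also contains A, so A is moved from the candidate set into the itemset {D}.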

The frequent itemset FIk is then extended to the next level using the items of CIk in the bottom-up direction. The candidates of each frequent extension are generated and the infrequent candidates are pruned. The frequent extensions and the candidates of each frequent extension are added to FI and CI respectively.

The data structures FI and CI are updated whenever a new frequent extension and its candidates are found. MFI is updated when a frequent extension has no candidate and no superset in MFI, or when FIi ∪ CIi has no superset in MFI and is frequent.

The MFIMiner algorithm is explained with the following example. Consider the transaction database DB in Table 1.1, which includes six different items, I = {A, B, C, D, E, F}, and six transactions, T = {T1, T2, T3, T4, T5, T6}. The vertical data format of the database DB is given below.

All frequent items are extracted and reordered in ascending order of support. The support of an item is given directly by the number of transactions in its tidset. In this example, the minimum support is taken to be 3 transactions. In database DB (Table 1.1), every item has at least three tids in its tidset, so all the items are frequent.

The items A, B, C, D, E and F are frequent and are carried to the next level, sorted in ascending order of their support.

 

 

Table 1.1: Vertical data representation of the transactional database DB

Items   Tidsets
A       T1, T2, T3, T4, T5
B       T2, T5, T6
C       T1, T2, T4, T5, T6
D       T1, T3, T4, T5
E       T2, T3, T4, T5, T6
F       T2, T4, T6

The reordered frequent items are B, F, D, A, C, E, and the candidates of each frequent item are obtained. The frequent items and the candidates of each frequent item of database DB (Table 1.1) are given in Table 1.2.

 

Table 1.2: Frequent items and candidate items in database DB with support 3 transactions

Frequent Item   Candidates of Frequent Item
B               C, E
F               C, E
D               A, C, E
A               C, E
C               E
E               (none)
 
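Table 1.2 can be reproduced with a short sketch. It assumes, consistently with the paper's order B, F, D, A, C, E, that ties in support are broken alphabetically.

```python
tidsets = {"A": {1, 2, 3, 4, 5}, "B": {2, 5, 6}, "C": {1, 2, 4, 5, 6},
           "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}, "F": {2, 4, 6}}
minsup = 3

# Reorder items in ascending order of support (tidset size),
# breaking ties alphabetically (an assumption for this sketch).
items = sorted(tidsets, key=lambda i: (len(tidsets[i]), i))

# A later item is a candidate of an earlier one when the pair is frequent.
candidates = {it: [c for c in items[pos + 1:]
                   if len(tidsets[it] & tidsets[c]) >= minsup]
              for pos, it in enumerate(items)}

print(items)             # ['B', 'F', 'D', 'A', 'C', 'E']
print(candidates["D"])   # ['A', 'C', 'E']
```

Only items that come later in the support order are considered as candidates, which is what keeps the enumeration free of duplicates.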

FI1 and CI1 are removed from FI and CI respectively, and mining MFIs is started by checking the frequency of FI1 ∪ CI1; this continues until an infrequent FIi ∪ CIi is obtained. When an infrequent FIi ∪ CIi is obtained, parent equivalence pruning is done, the frequent extensions of FIi are generated from CIi, and the new candidates of each frequent extension are obtained. Each frequent extension and its candidates are added to FI and CI respectively. The frequent extensions of a frequent itemset are generated as follows.

 

frequentExtension(FIi, CIi)
   for each x ∈ CIi
       FIi+1 = FIi ∪ x                          // new frequent extension
       CIi.remove(x)                            // initial candidates of the new frequent extension
       CIi+1 = generateCandidate(FIi+1, CIi)    // exact candidates of the new frequent extension

 

If CIi+1 is empty and FIi+1 has no superset in MFI, then FIi+1 is added to MFI. In database DB (Table 1.1), the first two frequent items with their candidates, {B}-{C,E} and {F}-{C,E}, have no superset in MFI, are frequent, and are added to MFI. The next pair, {D}-{A,C,E}, is not frequent, so parent equivalence pruning is done: the item A is trimmed from the candidate set and added to the frequent item D. The frequent extensions of {D,A} are generated from {C,E} and the candidates of each frequent extension are generated; the resulting pairs are {D,A,C}-{} and {D,A,E}-{}. Both of these frequent itemsets have no superset in MFI, are frequent, and are added to MFI. The next pair, {A}-{C,E}, has no superset in MFI, is frequent, and is added to MFI. The subsequent pairs, {C,E}-{} and {E}-{}, have supersets in MFI and are ignored. Without the superset check, the algorithm would return a superset of the MFI and require post-pruning to eliminate non-maximal patterns. The MFIs with minimum support 3 in database DB are {ACD}, {ACE}, {ADE}, {BCE} and {CEF}.
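The result of the worked example can be checked by brute force: enumerate every itemset, keep the frequent ones, and discard those with a frequent proper superset. This exhaustive sketch is only a verification aid, not the MFIMiner algorithm.

```python
from itertools import combinations

tidsets = {"A": {1, 2, 3, 4, 5}, "B": {2, 5, 6}, "C": {1, 2, 4, 5, 6},
           "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}, "F": {2, 4, 6}}
minsup = 3
items = sorted(tidsets)

# Enumerate all non-empty itemsets and keep the frequent ones.
frequent = []
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        tids = set.intersection(*(tidsets[i] for i in combo))
        if len(tids) >= minsup:
            frequent.append(frozenset(combo))

# Maximal frequent itemsets: frequent itemsets with no frequent proper superset.
mfi = {f for f in frequent if not any(f < g for g in frequent)}
print(sorted("".join(sorted(m)) for m in mfi))
# ['ACD', 'ACE', 'ADE', 'BCE', 'CEF']
```

The output matches the five MFIs derived step by step above.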

The number of MFIs generated depends on the support value specified by the user, as shown in Table 1.3. The MFIs with minimum support 2 in database DB are {ACEF}, {ACDE}, {ABCE} and {BCEF}.

 

Table 1.3: Transactions of database DB and the maximal frequent itemsets

Tid   Items
1     ACD
2     ABCEF
3     ADE
4     ACDEF
5     ABCDE
6     BCEF

Maximal frequent itemsets, minimum support = 3: ACD, ACE, ADE, BCE, CEF
Maximal frequent itemsets, minimum support = 2: ACEF, ACDE, ABCE, BCEF

Algorithm: MFIMiner

Input: dataset D, support S
Output: maximal frequent itemsets MFI

MFIMiner(Dataset D, Support S)
BEGIN
1. Generate the frequent items and reorder them in ascending order of their support.
2. Generate the candidate items for each frequent item and reorder the candidates in increasing order of support.
3. for each x ∈ FI                      // FI -> set of frequent items
       if candidate(x) is empty and x has no superset in MFI
           AddtoMFI(x); continue;
       c = cand(x);
       FI = FI \ x;                     // remove the frequent item from the set of frequent items (FI)
       CI = CI \ cand(x);               // remove the candidates of x from the candidate set
       if x ∪ c has a superset in MFI
           continue;
       if size of x ∪ c is 1 or 2
           add x ∪ c to MFI; continue;
       else if x ∪ c is frequent
           add x ∪ c to MFI; continue;
       else
           for each y ∈ c               // parent equivalence pruning
               if tidset(y).containsAll(tidset(x))
                   x.add(y);
                   c.remove(y);
           Bottomup_ExternFI(x, c, tidset(x));
END

 

 

Bottomup_ExternFI Method

Input: frequent itemset, candidate set, tidset of the frequent itemset
Output: frequent extensions of the frequent itemset and the candidates of each frequent extension; MFIs, if generated

Bottomup_ExternFI(frequent, candidate, tidset(frequent))
BEGIN
for each x ∈ candidate
    Nfrequent = frequent ∪ x;          // current frequent itemset
    candidate.remove(x);               // candidates of the current frequent itemset
    if candidate is empty and Nfrequent has no superset in MFI
        MFI.add(Nfrequent); continue;
    newcandidate = GenerateCandidates(Nfrequent, candidate, tidset(frequent));
    if newcandidate is empty
        if Nfrequent has no superset in MFI
            MFI.add(Nfrequent);
    else
        FI.add(Nfrequent);                  // new frequent itemset is added to FI
        cand(Nfrequent) = newcandidate;     // its candidates are added to the candidate set
END

GenerateCandidates Method

Input: frequent itemset, candidate set, tidset of the frequent itemset
Output: exact candidates of the frequent itemset

GenerateCandidates(frequent, candidate, tid(frequent))
BEGIN
cand = ∅;                     // cand will contain the exact candidates of the frequent itemset
for each x ∈ candidate
    if |tid(frequent) ∩ tid(x)| ≥ support
        cand.add(x);          // candidates are stored in increasing order of support
return cand;
END
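The GenerateCandidates step can be rendered as a minimal Python sketch; the function name `generate_candidates` is illustrative, and the example below reuses the {D,A} itemset from the worked example.

```python
def generate_candidates(frequent_tids, candidates, tidsets, minsup):
    """Exact candidates: keep only the candidates whose tidset
    intersection with the frequent itemset meets the support threshold."""
    return [c for c in candidates
            if len(frequent_tids & tidsets[c]) >= minsup]

tidsets = {"A": {1, 2, 3, 4, 5}, "C": {1, 2, 4, 5, 6},
           "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}}

da_tids = tidsets["D"] & tidsets["A"]                          # {1, 3, 4, 5}
print(generate_candidates(da_tids, ["C", "E"], tidsets, 3))    # ['C', 'E']

dac_tids = da_tids & tidsets["C"]                              # {1, 4, 5}
print(generate_candidates(dac_tids, ["E"], tidsets, 3))        # []
```

Both C and E survive as exact candidates of {D,A}, but once C is added, E is pruned, which is why {D,A,C} ends up with an empty candidate set in the example.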

 

The MFIMiner algorithm follows a breadth-first search strategy with candidate generation. When the tidset of a candidate of a frequent itemset contains the whole tidset of that itemset, the candidate is removed from the candidate set and added to the frequent itemset. The tidset of the frequent itemset is passed along to quickly generate the exact candidates of its frequent extensions. MFIMiner generates maximal frequent itemsets before generating all frequent itemsets; a frequent itemset is added directly to MFI if it has no superset in MFI.
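The whole procedure can be condensed into a small runnable sketch. This is a simplified recursive rendering of the tidset-intersection, exact-candidate and superset-check ideas, not the authors' breadth-first Java implementation; the name `mine_mfi` and the final subsumption filter are additions of this sketch.

```python
def mine_mfi(tidsets, minsup):
    """Extend each itemset with its exact candidates (ascending support
    order) and record an itemset as maximal when no candidate survives
    and no known superset is already in MFI."""
    items = sorted(tidsets, key=lambda i: (len(tidsets[i]), i))
    mfi = []

    def extend(itemset, tids, cands):
        # Exact candidates: those still frequent together with itemset.
        exact = [(c, tids & tidsets[c]) for c in cands]
        exact = [(c, t) for c, t in exact if len(t) >= minsup]
        if not exact:
            if not any(itemset <= m for m in mfi):     # superset check
                mfi.append(itemset)
            return
        for pos, (c, t) in enumerate(exact):
            extend(itemset | {c}, t, [x for x, _ in exact[pos + 1:]])

    for pos, item in enumerate(items):
        extend(frozenset([item]), set(tidsets[item]), items[pos + 1:])
    # Safety net: drop any itemset subsumed by a later-found superset.
    return {m for m in mfi if not any(m < n for n in mfi)}

tidsets = {"A": {1, 2, 3, 4, 5}, "B": {2, 5, 6}, "C": {1, 2, 4, 5, 6},
           "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}, "F": {2, 4, 6}}
print(sorted("".join(sorted(m)) for m in mine_mfi(tidsets, 3)))
# ['ACD', 'ACE', 'ADE', 'BCE', 'CEF']
```

On the example database DB with minimum support 3, the sketch recovers the same five MFIs as the walkthrough above.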

 

2.2   Pruning

 

The MFIMiner algorithm uses the four pruning techniques given below to reduce the search space.

 

1. Items are reordered in ascending order of their support. This reordering technique was introduced by Roberto Bayardo in the MaxMiner algorithm [5] for mining maximal frequent itemsets: once the frequent items are generated, they are reordered in ascending order of support.

2. FIi ∪ CIi can be added to MFI directly if it is frequent and has no superset in MFI; in that case the elements of FIi ∪ CIi are not used for finding frequent extensions.

3. Infrequent candidates are pruned in the bottom-up direction using the GenerateCandidates method: the infrequent candidates of a frequent itemset are removed from its initial candidates to construct its exact candidates.

4. Parent equivalence pruning trims items from the candidate set of a frequent itemset. This pruning was introduced by Calimlim et al. in the Mafia algorithm [3]: an item in the candidate set is added to the frequent itemset without any frequency computation if the tidset of the item contains the whole tidset of the frequent itemset.

3   Performance Evaluation

To evaluate the performance of the proposed MFIMiner algorithm, it was implemented in Java on a 2.93 GHz Intel Core PC with 2.96 GB of main memory. Its performance is compared with Java versions of GenMax [2] and Mafia [3]. The timings in the figures are total times, including all preprocessing costs such as horizontal-to-vertical conversion.

The performance of the MFIMiner algorithm is compared with two different algorithms; it is observed that performance can vary significantly depending on the dataset characteristics. Three different benchmark datasets are used, all of which can be downloaded from the FIMI repository [7]. The datasets taken for the experiments are

 

·              
T10I4D100K Dataset

·              
T40I10D100K Dataset

·              
Retail Dataset

 

3.1 T10I4D100K Dataset

 

The first dataset is T10I4D100K [7], which contains 1,000 attributes and 100,000 records, with an average record length of 10. In this dataset the number of frequent items is large, the frequent itemsets are short, and each frequent itemset has a small number of candidates. The dataset has a very high concentration of itemsets two and three items long; the number of maximal frequent itemsets is not much smaller than the number of frequent itemsets, and the mean pattern length is very small, around 2 to 6.

The performance of the MFIMiner algorithm has been compared with the existing algorithms for various support values, and the results show that MFIMiner generates all MFIs more quickly than the existing algorithms. As the support value decreases, the advantage of MFIMiner over GenMax, which is considered an efficient algorithm, increases significantly, as shown in Figure 1.1. On the T10I4D100K dataset, MFIMiner is around 2 to 4 times faster than Mafia for the supports taken.

 

 Figure 1.1 Performance of MFIMiner algorithm on T10I4D100K dataset.

 

3.2 T40I10D100K Dataset

 

The second dataset is T40I10D100K [7], which contains 1,000 attributes and 100,000 records, with an average record length of 40. In this dataset the number of frequent items is large, the frequent itemsets are short, and each frequent itemset has a small number of candidates. T40I10D100K is characterized by very short maximal frequent itemsets; their average length is around two to six items.

Most of the maximal patterns are of length one or two, and the number of maximal patterns is not much smaller than the number of frequent itemsets.

The performance of the MFIMiner algorithm has been compared with the existing algorithms for various support values, and the results show that MFIMiner performs better than the existing algorithms. On the T40I10D100K dataset, MFIMiner is slightly faster than GenMax, as shown in Figure 1.2, and up to 2 times faster than Mafia for the supports taken.

 

Figure 1.2 Performance of MFIMiner algorithm on T40I10D100K dataset.

 

 

3.3 Retail Dataset

 

The third dataset is Retail, which contains 16,470 items and 88,162 transactions. The average number of distinct items per transaction is 13, and most transactions contain between 7 and 11 items. The mean pattern length is long compared to the previous datasets. In the Retail dataset, the number of frequent items is large and the frequent itemsets have a large number of candidates.

 

The performance of the MFIMiner algorithm has been compared with the existing algorithms for various support values, and the results show that MFIMiner performs better than the GenMax algorithm. On the Retail dataset, MFIMiner is the fastest, having a slight edge over GenMax.

 

Figure 1.3 Performance of MFIMiner algorithm on Retail dataset.

Figure 1.3 illustrates that on the Retail dataset the MFIMiner algorithm performs better than GenMax, and Mafia performs better than MaxMiner.

4   Results and Discussion

The MFIMiner algorithm was tested on different datasets. The T10I4D100K and T40I10D100K datasets have a large number of frequent items for the supports taken, and most of the frequent items have few candidates. The length of the MFIs generated from these candidates is close to the number of candidates. For these datasets the number of frequent extensions is small, and MFIMiner generates all MFIs quickly.

The Retail dataset also has a large number of frequent items for the supports taken, but each frequent itemset has a large number of candidates compared to the other datasets. For the Retail dataset, MFIMiner takes some additional time to generate the exact candidates: the number of frequent extensions is huge, and FI and CI hold a huge number of frequent itemsets and candidate sets respectively.

The performance of MFIMiner is good when the number of frequent extensions generated is small. Its performance suffers when the frequent items in the dataset have a large number of candidate items and the maximal frequent itemsets generated from these candidates are short.

5   Conclusion

 

The MFIMiner algorithm for mining maximal frequent patterns has been introduced in this paper. It employs breadth-first search and pruning to remove infrequent candidates. MFIMiner is compared with the GenMax [2] and Mafia [3] algorithms, and the results show that it performs better than the existing algorithms on the T40I10D100K and T10I4D100K datasets and improves on GenMax for the Retail dataset.

6   References

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," in P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD Conference on Management of Data, pp. 207-216, New York, 1993, ACM Press.

[2] K. Gouda and M. J. Zaki, "GenMax: An efficient algorithm for mining maximal frequent itemsets," Data Mining and Knowledge Discovery, 11, pp. 1-20, 2005.

[3] D. Burdick, M. Calimlim and J. Gehrke, "MAFIA: A maximal frequent itemset algorithm for transactional databases," in International Conference on Data Engineering, pp. 443-452, April 2001.

[4] D. Lin and Z. M. Kedem, "Pincer-Search: A new algorithm for discovering the maximum frequent set," in Proceedings of the 6th International Conference on Extending Database Technology, 1998.

[5] R. Bayardo, "Efficiently mining long patterns from databases," in ACM SIGMOD Conference, 1998.

[6] R. C. Agarwal, C. Aggarwal and V. Prasad, "Depth first generation of long patterns," in 7th International Conference on Knowledge Discovery and Data Mining, pp. 108-118, 2000.

[7] FIMI Repository, www.fimi.cs.helsinki.fi/fimi03/datasets.html.

 

Authors' Biographies

Mrs. K. Sumathi is an assistant professor in the Department of Computer Applications at K.L.N. College of Information Technology, Madurai. She received her B.Sc. in Physics from Madurai Meenakshi College in 1998, her MCA from Madurai Kamaraj University in 2001 and her M.Phil. degree in Computer Science from Madurai Kamaraj University in 2008. Her research interests include frequent itemset mining and classification model generation.

 

Dr. S. Kannan is an associate professor in the Department of Computer Applications at Madurai Kamaraj University. He received his B.Sc. in Physics, M.Sc. in Physics and M.Sc. in Computer Science from Madurai Kamaraj University in 1985, 1987 and 1996 respectively, and his Master of Philosophy and Ph.D. degrees in Computer Science from Madurai Kamaraj University in 1998 and 2010 respectively. His experience includes presenting more than 15 papers at conferences and seminars on associative classification and maximal frequent itemset mining. His research interests focus on data mining and image processing.

 

Mr. K. Nagarajan has been with Tata Consultancy Services, Chennai for over 15 years and has held leadership positions in various retail accounts across the globe. He joined TCS in Mumbai, India in 1997. He is an active member of IEEE and has published various white papers on enterprise data warehousing, business intelligence, decision sciences and data mining. He has filed a patent (pending) on advanced search algorithms and guided analytics. His research work focuses on advanced Big Data architectures and databases. He is TOGAF 8.0 Certified and a Silver Certified Netezza Architect.
