Abstract—Maximal frequent patterns can be mined using breadth-first search or depth-first search. Pure BFS algorithms work well when all maximal frequent itemsets are short, and pure DFS algorithms work well when all maximal frequent itemsets are long.

Neither pure BFS nor pure DFS techniques are efficient when the dataset contains both long and short maximal frequent itemsets; efficient pruning techniques are required to mine MFIs from such datasets. This paper proposes an algorithm (MFIMiner) that uses breadth-first search with an efficient pruning mechanism to competently mine both long and short maximal frequent itemsets. The performance of the algorithm is evaluated and compared with the GenMax and Mafia algorithms on the T40I10D100K, T10I4D100K and Retail datasets.

The results show that the proposed algorithm offers a significant improvement over existing algorithms for sparse datasets.

Index Terms—Breadth-First Search with Pruning, Mining MFIs.

————————————————
· Mrs. K.Sumathi is with the Department of Computer Applications, K.L.N.College of Information and Technology, Pottapalayam, Sivagangai District, India. E-mail: [email protected]

· Dr. S.Kannan is with the Department of Computer Applications, Madurai Kamaraj University, Madurai District, Tamil Nadu, India. E-mail: [email protected]

· Mr. K. Nagarajan is a chief architect at Tata Consultancy Services, Chennai, Tamil Nadu, India. E-mail: [email protected] http://in.linkedin.com/in/nagarajank

—————————— ◆ ——————————

1 Introduction

Mining frequent patterns plays an important role in data mining and is widely used in applications such as association-rule mining and customer behavior analysis. All frequent itemsets can be constructed from maximal frequent itemsets (MFIs), since every subset of an MFI is a frequent itemset, so many data mining applications need only the MFIs instead of all frequent itemsets. Bayardo presented the MaxMiner algorithm [5] for mining maximal frequent itemsets in 1998. MaxMiner used a set-enumeration tree as its conceptual framework and followed a breadth-first search technique, together with a superset pruning strategy and a dynamic reordering strategy. The Pincer-Search algorithm [4], presented by D. Lin and Z. M. Kedem, used both top-down and bottom-up search to mine all MFIs. The DepthProject algorithm adopted a depth-first approach with superset pruning and dynamic reordering to mine all MFIs [6]. Burdick et al. proposed an algorithm called MAFIA [3], which used depth-first search and stored the dataset in vertical bitmaps. MAFIA used three pruning strategies: Parent Equivalence Pruning (PEP), Frequent Head Union Tail (FHUT) and Head Union Tail MFI (HUTMFI).

GenMax [2] combined the pruning and mining processes, utilizing two approaches for mining MFIs efficiently. First, GenMax constructs Local Maximal Frequent Itemsets (LMFIs) for fast superset checking. Second, GenMax uses the diffset propagation method for fast support computation when the dataset is highly dense.

2 Proposed Work

The MFIMiner algorithm, which efficiently mines both long and short maximal frequent itemsets, is proposed in this paper. MFIMiner runs in the bottom-up direction and employs pruning to efficiently mine all MFIs from the dataset. In general, a dataset may be structured in two different ways: horizontal data format and vertical data format.

This approach uses a vertical representation of the dataset. In this format, data is represented in item-tidset form, where item is the name of the item and tidset is the set of identifiers of the transactions containing the item. MFIMiner computes the frequency of an item by counting the number of tids in its tidset, and computes the frequency of an itemset by intersecting the tidsets of the corresponding items. The data structures used in the MFIMiner algorithm are FI, which denotes the set of frequent itemsets; CI, which denotes the set of candidate items; and MFI, which denotes the set of maximal frequent itemsets.

2.1 MFIMiner Algorithm

The first step of the MFIMiner algorithm is to mine all frequent items in the dataset. The frequent items are reordered in increasing order of their support and added to FI. Candidates for each frequent item in FI are obtained and added to CI. For example, candidates of frequent item FI1 are added to CI1, candidates of FI2 are added to CI2, and so on. Let FIj (FIj ∈ FI) be a frequent item and CIj (CIj ∈ CI) be the set of candidates for FIj.
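The tidset-intersection support computation described above can be sketched as follows (a minimal Python illustration, not the paper's implementation; the tidsets are those of the running example in Table 1.1, and the names `tidsets` and `support` are ours):

```python
# Vertical representation: each item maps to the set of ids of the
# transactions that contain it (tidsets from Table 1.1 of the running example).
tidsets = {
    "A": {1, 2, 3, 4, 5},
    "B": {2, 5, 6},
    "C": {1, 2, 4, 5, 6},
    "D": {1, 3, 4, 5},
    "E": {2, 3, 4, 5, 6},
    "F": {2, 4, 6},
}

def support(itemset, tidsets):
    """Support of an itemset = size of the intersection of its items' tidsets."""
    return len(set.intersection(*(tidsets[i] for i in itemset)))

print(support({"A"}, tidsets))            # → 5 (a single item: its tidset size)
print(support({"B", "C", "E"}, tidsets))  # → 3 (intersect the three tidsets)
```

The intersection of tidsets is exactly how MFIMiner avoids rescanning the transactions when moving from items to itemsets.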

The frequent extensions of FIj are (FIj1, FIj2, …, FIjn), where n is the number of candidates in CIj. The frequent extension FIj1 is constructed from CIj1, i.e. CIj1 ⊆ (FIj2, FIj3, …, FIjn). Initial candidates of FIj1 are constructed from (FIj2, FIj3, …, FIjn), candidates of FIj2 are constructed from (FIj3, FIj4, …, FIjn), and so on. Exact candidates for each frequent extension are constructed using the generateCandidate method. The exact candidates of a frequent itemset exclude its infrequent candidates.
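The exact-candidate construction can be sketched as follows (a hedged sketch: `generate_candidates` is our illustrative counterpart of the paper's generateCandidate method; it keeps only the initial candidates that stay frequent together with the extended itemset):

```python
MIN_SUP = 3

# Tidsets of the running example (Table 1.1).
tidsets = {
    "A": {1, 2, 3, 4, 5}, "B": {2, 5, 6}, "C": {1, 2, 4, 5, 6},
    "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}, "F": {2, 4, 6},
}

def generate_candidates(frequent_tids, initial_candidates, tidsets, min_sup=MIN_SUP):
    """Exact candidates: keep only the initial candidates whose tidset
    intersection with the frequent itemset's tidset meets the threshold."""
    return [x for x in initial_candidates
            if len(frequent_tids & tidsets[x]) >= min_sup]

# Extending frequent item D (tidset {1,3,4,5}) keeps all of A, C, E ...
print(generate_candidates(tidsets["D"], ["A", "C", "E"], tidsets))  # → ['A', 'C', 'E']
# ... but at the next level {D,A,C} (tidset {1,4,5}) prunes E, since
# {1,4,5} ∩ tidset(E) = {4,5} falls below the threshold of 3.
dac_tids = tidsets["D"] & tidsets["A"] & tidsets["C"]
print(generate_candidates(dac_tids, ["E"], tidsets))  # → []
```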

When a frequent itemset has no candidates and no superset in MFI, the frequent itemset is added to MFI. In the next step, FIi ∪ CIi (with i varied from 1 to n) is removed from FI and CI respectively, and it is checked whether it has a superset in MFI. If it has a superset in MFI it is ignored; otherwise the frequency of FIi ∪ CIi is computed. If FIi ∪ CIi is frequent then it is added to MFI, and the next pair (FIi+1 ∪ CIi+1) is taken for testing.

This step is repeated until an infrequent FIk ∪ CIk is obtained. If FIk ∪ CIk has no superset in MFI and is not frequent, the following steps are performed: 1. parent equivalence pruning; 2. frequent extension to the next level. When an infrequent FIk ∪ CIk is found, parent equivalence pruning is done. This pruning, introduced by Calimlim et al. in the Mafia algorithm [3], removes an item from the candidate set (CIk) and adds it to the frequent itemset (FIk) if the tidset of the item in CIk includes the tidset of FIk.
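Parent equivalence pruning can be sketched as follows (an illustrative Python fragment; `pep` is our name, not the paper's, and the tidsets are those of the running example):

```python
# Tidsets of the running example (Table 1.1).
tidsets = {
    "A": {1, 2, 3, 4, 5}, "B": {2, 5, 6}, "C": {1, 2, 4, 5, 6},
    "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}, "F": {2, 4, 6},
}

def pep(frequent, candidates, tidsets):
    """Parent equivalence pruning: a candidate whose tidset contains the
    frequent itemset's tidset is absorbed into the itemset itself, since it
    appears in every supporting transaction (no support count needed)."""
    frequent_tids = set.intersection(*(tidsets[i] for i in frequent))
    kept = []
    for x in candidates:
        if frequent_tids <= tidsets[x]:   # tidset(FIk) ⊆ tidset(x)
            frequent = frequent | {x}
        else:
            kept.append(x)
    return frequent, kept

# Running example: after {D} ∪ {A,C,E} proves infrequent, A (whose tidset
# contains D's) is moved into the itemset; C and E remain candidates.
fi, ci = pep({"D"}, ["A", "C", "E"], tidsets)
print(sorted(fi), ci)  # → ['A', 'D'] ['C', 'E']
```

This matches the walk-through later in the section, where the item A is trimmed from the candidate set of {D}.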

The frequent itemset (FIk) is then extended to the next level using the items of CIk in the bottom-up direction. The candidates of each frequent extension are generated and infrequent candidates are pruned. The frequent extensions and the candidates of each frequent extension are added to FI and CI respectively; FI and CI are updated whenever a new frequent extension and its candidates are found. MFI is updated when a frequent extension has no candidates and no superset in MFI, or when FIi ∪ CIi has no superset in MFI and is frequent.

The MFIMiner algorithm is explained with the following example. Consider the transaction database DB in Table 1.1, which includes six different items, I = {A, B, C, D, E, F}, and six transactions, T = {T1, T2, T3, T4, T5, T6}. The vertical data format of the database DB is given below. All frequent items are extracted and reordered in ascending order of support.

The support of an item is directly given by the number of transactions in its tidset. In this example, the minimum support is taken to be 3 transactions. In database DB (Table 1.1), every item has more than two tids in its tidset, so all the items are frequent. The items A, B, C, D, E and F are frequent and are carried to the next level.

These items are sorted in ascending order of their support.

Table 1.1: Vertical data representation of the transactional database DB

Items   Tidsets
A       T1, T2, T3, T4, T5
B       T2, T5, T6
C       T1, T2, T4, T5, T6
D       T1, T3, T4, T5
E       T2, T3, T4, T5, T6
F       T2, T4, T6

The reordered frequent items are B, F, D, A, C, E, and the candidates of each frequent item are obtained.
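The vertical representation in Table 1.1 can be derived from the horizontal transactions of DB (listed later in Table 1.3) by a simple inversion; a sketch, assuming Python dictionaries of sets:

```python
# Horizontal format: transaction id -> items (database DB of the example).
transactions = {
    1: "ACD", 2: "ABCEF", 3: "ADE",
    4: "ACDEF", 5: "ABCDE", 6: "BCEF",
}

def to_vertical(transactions):
    """Invert tid -> items into item -> tidset (the item-tidset format)."""
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

tidsets = to_vertical(transactions)
print(sorted(tidsets["B"]))  # → [2, 5, 6], matching Table 1.1
```

This is the horizontal-to-vertical conversion whose cost is included in the timings reported in Section 3.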

The frequent items and the candidates of each frequent item of database DB (Table 1.1) are given in Table 1.2.

Table 1.2: Frequent items and candidate items in database DB with support of 3 transactions

Frequent Item   Candidates
B               C, E
F               C, E
D               A, C, E
A               C, E
C               E
E               –

FI1 and CI1 are removed from FI and CI respectively, and MFI mining starts by checking the frequency of FI1 ∪ CI1, continuing until an infrequent FIn ∪ CIn is obtained. When an infrequent FIi ∪ CIi is obtained, parent equivalence pruning is done, the frequent extensions of FIi are generated using CIi, and the new candidates of each frequent extension are obtained.

The frequent extensions of FIi are generated from CIi, and each frequent extension and its candidates are added to FI and CI respectively. Frequent extensions of a frequent itemset are generated as follows:

frequentExtension(FIi, CIi)
    for each x ∈ CIi
        FIi+1 = FIi ∪ x                        // new frequent extension
        CIi.remove(x)                          // initial candidates of the new frequent extension
        CIi+1 = generateCandidate(FIi+1, CIi)  // exact candidates of the new frequent extension

If CIi+1 is empty and FIi+1 has no superset in MFI, then FIi+1 is added to MFI.
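A hedged Python rendering of the frequentExtension loop above (the function and variable names are ours; the support check plays the role of generateCandidate):

```python
MIN_SUP = 3

# Tidsets of the running example (Table 1.1).
tidsets = {
    "A": {1, 2, 3, 4, 5}, "B": {2, 5, 6}, "C": {1, 2, 4, 5, 6},
    "D": {1, 3, 4, 5}, "E": {2, 3, 4, 5, 6}, "F": {2, 4, 6},
}

def frequent_extension(frequent, candidates, tidsets, min_sup=MIN_SUP):
    """Extend `frequent` by each candidate in turn; the remaining candidates
    become the initial candidates of the new extension, and a support check
    keeps only the exact candidates."""
    extensions = []
    work = list(candidates)
    while work:
        x = work.pop(0)                    # CIi.remove(x)
        new_frequent = frequent | {x}      # FIi+1 = FIi ∪ x
        new_tids = set.intersection(*(tidsets[i] for i in new_frequent))
        exact = [y for y in work if len(new_tids & tidsets[y]) >= min_sup]
        extensions.append((new_frequent, exact))
    return extensions

# Extending {D,A} with candidates (C, E), as in the walk-through below:
for fi, ci in frequent_extension({"D", "A"}, ["C", "E"], tidsets):
    print(sorted(fi), ci)
# → ['A', 'C', 'D'] []
#   ['A', 'D', 'E'] []
```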

In database DB (Table 1.1), the first two frequent items and their candidates, {B}-{C,E} and {F}-{C,E}, have no superset in MFI, are frequent, and are added to MFI. The next pair, {D}-{A,C,E}, is not frequent, so parent equivalence pruning is done: the item A is trimmed from the candidate set and added to the frequent item D.

The frequent extensions of {D,A} are generated from {C, E}, and the candidates of each frequent extension are generated. The frequent extensions and their candidates are {D,A,C}-{} and {D,A,E}-{}. Both of these frequent itemsets have no superset in MFI, are frequent, and are added to MFI.

The next pair, {A}-{C,E}, has no superset in MFI, is frequent, and is added to MFI. The subsequent pairs, {C,E}-{} and {E}-{}, have supersets in MFI and are ignored. The MFIMiner algorithm returns a superset of the MFI and would require post-pruning to eliminate non-maximal patterns. The MFIs with minimum support 3 in database DB are {ACD}, {ACE}, {ADE}, {BCE} and {CEF}. The number of MFIs generated depends on the user-specified support value, as shown in Table 1.3. The MFIs with minimum support 2 in database DB are {ACEF}, {ACDE}, {ABCE} and {BCEF}.

Table 1.3: Maximal frequent itemsets

Tid   Items    MFIs (minimum support = 3)   MFIs (minimum support = 2)
1     ACD      ACD, ACE, ADE, BCE, CEF      ACEF, ACDE, ABCE, BCEF
2     ABCEF
3     ADE
4     ACDEF
5     ABCDE
6     BCEF

Algorithm MFIMiner
Input: dataset D, support S
Output: maximal frequent itemsets MFI

MFIMiner(Dataset D, Support S)
BEGIN
1. Generate frequent items and reorder them in ascending order of their support.
2. Generate candidate items for each frequent item and reorder the candidates in increasing order of support.
3. For each x ∈ FI                       // FI -> set of frequent items
       if candidate(x) is empty and x has no superset in MFI
           AddtoMFI(x); continue;
       c = cand(x);
       FI = FI \ x                        // remove the frequent item from FI
       CI = CI \ cand(x)                  // remove its candidates from the candidate set
       if x ∪ c has a superset in MFI
           continue;
       if size of x ∪ c is 1 or 2 then
           add x ∪ c to MFI; continue;
       else if x ∪ c is frequent then
           add x ∪ c to MFI; continue;
       else
           for each y ∈ c
               if tidset(y).containsAll(tidset(x))
                   x.add(y); c.remove(y);
           Bottomup_ExternFI(x, c, tidset(x))
END

Bottomup_ExternFI Method
Input: frequent itemset, candidate set, tidset of the frequent itemset
Output: frequent extensions of the frequent itemset and the candidates of each frequent extension; MFIs if generated.

Bottomup_ExternFI(frequent, candidate, tidset(frequent))
BEGIN
For each x ∈ candidate
    Nfrequent = frequent ∪ x              // current frequent itemset
    candidate.remove(x)                   // candidates of the current frequent itemset
    if candidate is empty && Nfrequent has no superset in MFI
        MFI.add(Nfrequent); continue;
    newcandidate = GenerateCandidates(Nfrequent, candidate, tidset(frequent))
    if newcandidate is empty
        if Nfrequent has no superset in MFI
            MFI.add(Nfrequent);
    else
        FI.add(Nfrequent);                // new frequent itemset is added to FI
        cand(Nfrequent) = newcandidate;   // its candidates are added to the candidate set
END

GenerateCandidates Method
Input: frequent itemset, candidate set, tidset of the frequent itemset
Output: exact candidates of the frequent itemset.

GenerateCandidates(frequent, candidate, tidset(frequent))
BEGIN
cand = null;                              // cand will contain the exact candidates of frequent
for each x ∈ candidate
    if |tidset(frequent) ∩ tidset(x)| ≥ support
        cand.add(x);                      // candidates are stored in increasing order of support
return cand;
END

The MFIMiner algorithm follows a breadth-first search strategy with candidate generation. When the tidset of a candidate of a frequent itemset contains the entire tidset of the frequent itemset, the candidate is removed from the candidate set and added to the frequent itemset.

The tidset of the frequent itemset is passed along to quickly generate the exact candidates of the frequent extensions. MFIMiner generates maximal frequent itemsets before generating all frequent itemsets; a frequent itemset is added to MFI directly if it has no superset in MFI.

2.2 Pruning

The four pruning techniques that the MFIMiner algorithm uses to reduce the search space are given below.

1. Reordering of items in ascending order of their support. This reordering technique was introduced by Roberto Bayardo in the MaxMiner algorithm [5] for mining maximal frequent itemsets. Once the frequent items are generated, they are reordered in ascending order of support.

2. FIi ∪ CIi can be added directly to MFI if it has no superset in MFI. The elements of FIi ∪ CIi are then not used for finding frequent extensions.
3. Infrequent candidates are pruned in the bottom-up direction using the GenerateCandidates method. Infrequent candidates of a frequent itemset are removed from the initial candidates to construct the exact candidates.
4. Parent equivalence pruning is done to trim items from the candidate set of a frequent itemset. This pruning was introduced by Calimlim et al. in the Mafia algorithm [3]. An item in the candidate set is added to the frequent itemset without a frequency computation if the tidset of the item contains the entire tidset of the frequent itemset.
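As a sanity check on the running example, the MFI sets reported above can be reproduced with a brute-force enumerator (this is not the MFIMiner algorithm itself, just an independent verification sketch in Python):

```python
from itertools import combinations

# Transactions of database DB (Table 1.3).
transactions = ["ACD", "ABCEF", "ADE", "ACDEF", "ABCDE", "BCEF"]

def maximal_frequent(transactions, min_sup):
    """Enumerate all frequent itemsets, then keep only the maximal ones."""
    items = sorted(set("".join(transactions)))
    frequent = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            sup = sum(all(i in t for i in combo) for t in transactions)
            if sup >= min_sup:
                frequent.append(set(combo))
    # An itemset is maximal if no strictly larger frequent itemset contains it.
    return {"".join(sorted(s)) for s in frequent
            if not any(s < t for t in frequent)}

print(sorted(maximal_frequent(transactions, 3)))
# → ['ACD', 'ACE', 'ADE', 'BCE', 'CEF']
print(sorted(maximal_frequent(transactions, 2)))
# → ['ABCE', 'ACDE', 'ACEF', 'BCEF']
```

The outputs agree with the MFIs listed for minimum supports 3 and 2; a brute-force check like this is only feasible for toy databases, which is what motivates the pruning techniques above.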

3 Performance Evaluation

To evaluate the performance of the proposed MFIMiner algorithm, it was implemented in the Java language on a 2.93 GHz Intel Core PC with 2.96 GB of main memory. The performance of the proposed MFIMiner is compared with Java versions of GenMax [2] and Mafia [3]. Timings in the figures are based on total time, including all preprocessing costs such as horizontal-to-vertical conversion. The performance of the MFIMiner algorithm is compared with these two algorithms.

It is observed that the performance can vary significantly depending on the dataset characteristics. To evaluate the performance of the MFIMiner algorithm, three different benchmark datasets are used; all of them can be downloaded from the FIMI repository [7]. The datasets taken for the experiments are:
· T10I4D100K dataset
· T40I10D100K dataset
· Retail dataset

3.1 T10I4D100K Dataset

The first dataset is T10I4D100K [7], which contains 1000 attributes and 100,000 records.

The average record length is 10. In the T10I4D100K dataset, the number of frequent items is huge, the frequent itemsets are short, and each frequent itemset has a small number of candidates. This dataset has very high concentrations of itemsets around two and three items long.

The number of maximal frequent itemsets is not much smaller than the number of frequent itemsets, and the mean pattern length is very small, around 2 to 6. The performance of the MFIMiner algorithm has been compared with the existing algorithms for various support values, and the results show that MFIMiner generates all MFIs more quickly than the existing algorithms. As the support value decreases, MFIMiner's advantage over GenMax, which is considered an efficient algorithm, increases significantly, as shown in Figure 1.1. For the T10I4D100K dataset, MFIMiner is around 2 to 4 times faster than Mafia for the supports taken.

Figure 1.1: Performance of the MFIMiner algorithm on the T10I4D100K dataset.

3.2 T40I10D100K Dataset

The second dataset is T40I10D100K [7], which contains 1000 attributes and 100,000 records.

The average record length is 40. In the T40I10D100K dataset, the number of frequent items is huge, the frequent itemsets are short, and each frequent itemset has a small number of candidates. T40I10D100K is characterized by very short maximal frequent itemsets: their average length is around two to six items, and most maximal patterns are of length one or two. The number of maximal patterns is not much smaller than the number of frequent itemsets. The performance of the MFIMiner algorithm has been compared with the existing algorithms for various support values, and the results show that MFIMiner performs better than the existing algorithms. For the T40I10D100K dataset, the performance of MFIMiner is slightly better than GenMax, as shown in Figure 1.2.

For the T40I10D100K dataset, MFIMiner is up to 2 times faster than Mafia for the supports taken.

Figure 1.2: Performance of the MFIMiner algorithm on the T40I10D100K dataset.

3.3 Retail Dataset

The third dataset is Retail, which contains 16,470 items and 88,162 transactions. The average number of distinct items per transaction is 13, and most transactions contain between 7 and 11 items. The mean pattern length is long compared to the sparse datasets. In the Retail dataset, the number of frequent items is huge and the frequent itemsets have a large number of candidates. The performance of the MFIMiner algorithm has been compared with the existing algorithms for various support values, and the results show that MFIMiner performs better than the GenMax algorithm. On the Retail dataset, we find that MFIMiner is the fastest, having a slight edge over GenMax.

Figure 1.3: Performance of the MFIMiner algorithm on the Retail dataset.

Figure 1.3 illustrates that the MFIMiner algorithm performs better than GenMax and Mafia on the Retail dataset.

4 Result and Discussion

The MFIMiner algorithm was tested using different datasets. The T10I4D100K and T40I10D100K datasets have a large number of frequent items for the supports taken, and most of the frequent items have few candidates.

The length of the MFIs generated from these candidates is close to the number of candidates. For these datasets, the number of frequent extensions is small and MFIMiner generates all MFIs quickly. The Retail dataset has a large number of frequent items for the supports taken, and each frequent itemset has a large number of candidates compared to the sparse datasets. For the Retail dataset, MFIMiner takes some additional time to generate the exact candidates; the number of frequent extensions is huge, and FI and CI hold huge numbers of frequent itemsets and candidate sets respectively.

The performance of MFIMiner is good when the number of frequent extensions generated is small. Its performance falls short when the frequent items in the dataset have a large number of candidate items and the maximal frequent itemsets generated from these candidates are short.

5 Conclusion

The MFIMiner algorithm for mining maximal frequent patterns is introduced in this paper.

The MFIMiner algorithm employs breadth-first search and pruning to remove infrequent candidates. The MFIMiner algorithm is compared with the GenMax [2] and Mafia [3] algorithms, and the results show that MFIMiner performs better than the existing algorithms for the T40I10D100K and T10I4D100K datasets, and that MFIMiner improves on GenMax for the Retail dataset.

6 References

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases", in P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD Conference on Management of Data, pages 207-216, New York, 1993, ACM Press.
[2] K. Gouda and M. J. Zaki, "GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets", Data Mining and Knowledge Discovery, 11, 1-20, 2005.
[3] D. Burdick, M. Calimlim and J. Gehrke, "MAFIA: A maximal frequent itemset algorithm for transactional databases", in International Conference on Data Engineering, pp. 443-452, April 2001.
[4] D. Lin and Z. M. Kedem, "Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set", in Proceedings of the VI Intl. Conference on Extending Database Technology, 1998.
[5] R. Bayardo, "Efficiently mining long patterns from databases", in ACM SIGMOD Conference, 1998.
[6] R. Agrawal, C. Aggarwal and V. Prasad, "Depth first generation of long patterns", in 7th Int'l Conference on Knowledge Discovery and Data Mining, pp. 108-118, 2000.
[7] www.fimi.cs.helsinki.fi/fimi03/datasets.html

Authors' Biography

Mrs. K.Sumathi is an assistant professor in the Computer Applications Department at K.L.N.College of Information Technology, Madurai. She received her B.Sc. in Physics from Madurai Meenakshi College in 1998, her MCA from Madurai Kamaraj University in 2001, and her M.Phil. degree in Computer Science from Madurai Kamaraj University in 2008. Her research interests include frequent itemset mining and classification model generation.

Dr. S.Kannan is an associate professor in the Computer Applications Department at Madurai Kamaraj University. He received his B.Sc. in Physics, M.Sc. in Physics, and M.Sc. in Computer Science from Madurai Kamaraj University in 1985, 1987 and 1996 respectively. He received his Master of Philosophy and Ph.D. degrees in Computer Science from Madurai Kamaraj University in 1998 and 2010 respectively.

His experience also includes presenting more than 15 papers at conferences and seminars on associative classification and maximal frequent itemset mining. His research interests focus on data mining and image processing.

Mr. K. Nagarajan has been with Tata Consultancy Services, Chennai for over 15 years and has held leadership positions in various retail accounts across the globe. He joined TCS in Mumbai, India in 1997.

He is an active member of IEEE and has published various white papers on enterprise data warehousing, business intelligence, decision sciences and data mining. He has filed a patent (pending) on advanced search algorithms and guided analytics. His research work focuses on advanced Big Data architectures and databases.

He is a TOGAF 8.0 Certified and Silver Certified Netezza Architect.