Home > Published Issues > 2016 > Volume 7, No. 4, November 2016 >

Using Frequent Substring Mining Techniques for Indexing Genome Sequences: A Comparison of Frequent Substring and Frequent Max Substring Algorithms

Todsanai Chumwatana
College of Information and Communication Technology, Rangsit University, Thailand

Abstract—The amount of electronically stored information in genome sequence database has grown rapidly in the last decade. This makes frequent substring extraction an essential task as most frequent substrings are meaningful in genome sequences, in order to support the application in the area of information retrieval and data analytics. In this paper, two frequent substring mining techniques are investigated: frequent substring and frequent max substring mining algorithms. Many research communities have acknowledged that the frequent substring mining is one of the viable solutions for extracting the interesting patterns in genome or protein in area of bioinformatics. Beside this, the frequent max substring technique has been proposed as an alternative method to extract meaningful patterns. In this paper, experimental studies and comparison results are shown in order to compare two techniques. From the experimental results, the following observations can be made. The frequent max substring mining technique provides significant benefits over the frequent substring mining technique in term of storage space. Meanwhile, the frequent substring mining technique requires less computational time as this technique is straight forward.

Index Terms—frequent max substring mining, frequent substring mining, genome sequence, frequent max substrings, frequent substrings

Cite: Todsanai Chumwatana, "Using Frequent Substring Mining Techniques for Indexing Genome Sequences: A Comparison of Frequent Substring and Frequent Max Substring Algorithms," Vol. 7, No. 4, pp. 281-286, November, 2016. doi: 10.12720/jait.7.4.281-286