Abstract—The amount of electronically stored information in genome sequence databases has grown rapidly over the last decade. Because most frequent substrings in genome sequences are meaningful, frequent substring extraction has become an essential task for supporting applications in information retrieval and data analytics. In this paper, two mining techniques are investigated: the frequent substring mining algorithm and the frequent max substring mining algorithm. Many research communities have acknowledged frequent substring mining as a viable solution for extracting interesting patterns from genome or protein sequences in bioinformatics. In addition, the frequent max substring technique has been proposed as an alternative method for extracting meaningful patterns. Experimental studies are presented to compare the two techniques. The results show that the frequent max substring mining technique offers significant benefits over the frequent substring mining technique in terms of storage space, whereas the frequent substring mining technique requires less computational time because it is straightforward.
Index Terms—frequent max substring mining, frequent substring mining, genome sequence, frequent max substrings, frequent substrings
Cite: Todsanai Chumwatana, "Using Frequent Substring Mining Techniques for Indexing Genome Sequences: A Comparison of Frequent Substring and Frequent Max Substring Algorithms," Vol. 7, No. 4, pp. 281-286, November, 2016. doi: 10.12720/jait.7.4.281-286
This work is licensed under the Creative Commons Attribution License (CC BY-NC-ND 4.0)