Abstract—With the arrival of the data deluge, traditional centralized tools for extracting knowledge from data have become obsolete because of their limited ability to handle massive data. To meet the need for scalable solutions, a new framework has emerged: Hadoop, an open-source ecosystem designed for storage and large-scale processing on clusters of commodity hardware. To overcome the limitations of keyword-based information retrieval systems, an efficient methodology has been designed. A system built on this approach mimics the real world, where every task relies on some form of indexing; this is the basic idea behind knowledge processing. Hadoop and R, open-source frameworks for storing and processing large datasets, are used to preprocess the text documents. First, a set of text documents is considered. Preprocessing is performed on this large domain of data using R; it includes stop-word removal, stemming, and the exclusion of low-frequency words. Despite this preprocessing, the number of index terms remaining in the domain data is still very large, so the problem of high dimensionality is encountered. The dimensionality of this set of terms is therefore reduced by incorporating a keyword-based methodology into the Hadoop MapReduce framework. The developed model is useful for processing queries, returning relevant information from the data pool with low response time.
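The pipeline summarized in the abstract (stop-word removal, stemming, pruning of low-frequency terms, then a map and reduce pass that builds a keyword index for querying) can be illustrated with a minimal single-process sketch. It is written in Python rather than R or Hadoop so that it stays self-contained, and it is not the authors' implementation. The stop-word list, the suffix-stripping `naive_stem` (a crude stand-in for a real stemmer), the `min_count` threshold, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per surviving token."""
    return [(term, doc_id) for term in preprocess(text)]


def reduce_phase(pairs, min_count=2):
    """Reducer: group pairs by term, keeping terms that occur at least
    min_count times across the collection (assumed pruning rule)."""
    counts = Counter(term for term, _ in pairs)
    index = defaultdict(set)
    for term, doc_id in pairs:
        if counts[term] >= min_count:
            index[term].add(doc_id)
    return dict(index)


def build_index(docs, min_count=2):
    """Run the map phase over every document, then reduce into an index."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs, min_count)


def query(index, text):
    """Return ids of documents that contain every term of the query."""
    postings = [index.get(term, set()) for term in preprocess(text)]
    return set.intersection(*postings) if postings else set()
```

Calling `build_index` on a dictionary that maps document ids to text, and then `query(index, "large datasets")`, returns the ids of the documents whose retained index terms cover the whole query. In an actual Hadoop job the mapper and reducer would run as distributed tasks over HDFS blocks; here both phases run in one process purely to show the data flow.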
Copyright © 2013-2020. JAIT. All Rights Reserved
This work is licensed under the Creative Commons Attribution License (CC BY-NC-ND 4.0)