Document-Document similarity matrix and Multiple-Kernel Fuzzy C-Means Algorithm-based web document clustering for information retrieval

Poonam Yadav

Volume 3, Issue 10, October 2014

Abstract: With the continuous growth of the World Wide Web, web databases are expanding massively, and automatically grouping web documents so that information can be retrieved easily poses a new challenge for researchers. The literature presents various web document clustering algorithms that are useful for information retrieval. In this work, a web document clustering method based on a Document-Document (D-D) similarity matrix and the Multiple-Kernel Fuzzy C-Means algorithm is developed for information retrieval. First, the web documents are read and initial pre-processing is applied to extract the important words. Then, the feature space is constructed from the keywords and their frequencies. Subsequently, the document-to-document similarity matrix is constructed using a similarity measure called the semantic retrieval measure (SR). The measure considers four criteria: the probability of occurrence in the document, the probability of occurrence in the first document, the probability of occurrence in the second document, and the probability of occurrence in both synonym sets. Based on this measure, the D-D matrix is computed, and the final grouping is performed using the Multiple-Kernel Fuzzy C-Means algorithm. Experiments are conducted on 100 web documents, and the results are evaluated in terms of accuracy and entropy.
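The pipeline described in the abstract (keyword-frequency features, a D-D similarity matrix, and entropy-based evaluation) can be illustrated with a minimal Python sketch. The abstract does not give the SR formula, so plain cosine similarity stands in for it here. The stopword list, the function names, and the size-weighted cluster entropy are likewise illustrative assumptions rather than the author's definitions, and the Multiple-Kernel Fuzzy C-Means step is not shown.

```python
import math
from collections import Counter

# Assumed, tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}


def keyword_frequencies(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords, count terms."""
    words = [w.strip(".,;:!?()\"'") for w in text.lower().split()]
    return Counter(w for w in words if w and w not in STOPWORDS)


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two term-frequency maps (a stand-in for the SR measure)."""
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dd_matrix(documents):
    """Build the symmetric document-document similarity matrix."""
    feats = [keyword_frequencies(d) for d in documents]
    n = len(feats)
    return [[cosine_similarity(feats[i], feats[j]) for j in range(n)]
            for i in range(n)]


def cluster_entropy(labels_true, labels_pred):
    """Size-weighted average entropy (in bits) of true classes within each cluster.

    This is one common definition of clustering entropy; the paper may use another.
    """
    n = len(labels_true)
    total = 0.0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        counts = Counter(members)
        h = -sum((k / len(members)) * math.log2(k / len(members))
                 for k in counts.values())
        total += (len(members) / n) * h
    return total


if __name__ == "__main__":
    docs = ["web document clustering", "web document retrieval", "fuzzy kernel"]
    for row in dd_matrix(docs):
        print([round(x, 3) for x in row])
```

Each row of the resulting matrix can serve as the feature vector of one document for a subsequent clustering step. Under this definition, lower entropy indicates purer clusters.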

Keywords: Information retrieval, Similarity measure, Web document clustering, Entropy, Accuracy.

How to Cite:

[1] Poonam Yadav, “Document-Document similarity matrix and Multiple-Kernel Fuzzy C-Means Algorithm-based web document clustering for information retrieval,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE)

This work is licensed under a Creative Commons Attribution 4.0 International License.