Main Content Extraction From Web Page Using Dom

PRANJALI G.GONDSE; PROFESSOR ANJALI B.RAUT

Volume 3, Issue 3, March 2014

Abstract: The Internet has become indispensable to everyday life, and almost anything can be searched for online. The World Wide Web has grown rapidly in recent years, and with so much information available, web pages have become a major source for information retrieval and data mining technologies such as commercial search engines and web mining applications. Web pages, however, contain many items that cannot be classified as informative content, e.g., search and filtering panels, navigation links, and advertisements; these are called noisy parts. Most clients and end users look for the informative content and largely do not seek the non-informative content. A tool that helps an end user or application search and process information from web pages automatically must therefore separate the “primary or informative content sections” from the other content sections. These sections are known as “Web page blocks” or just “blocks.” First, the tool must segment a web page into blocks; second, it must separate the primary content blocks from the non-informative content blocks. The main focus of this paper is the review and evaluation of algorithms capable of extracting the main content from a web page. The proposed algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, it will be shown that a Web cache system that applies the proposed algorithms to remove non-informative content blocks and to identify similar blocks across web pages can achieve significant storage savings.

Keywords: DOM Tree, information extraction, web mining
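
The two steps named in the abstract (segment a page into blocks, then separate primary content blocks from non-informative ones) can be illustrated with a small sketch. This is not the algorithm proposed in the paper, whose details are not given here. It is a simple stand-in that assumes a link-density heuristic: a block is treated as noise if it is very short or if most of its text sits inside links. The names `BlockSegmenter` and `extract_main_content`, the set of block-level tags, and the thresholds `min_chars` and `max_link_density` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Illustrative assumptions, not taken from the paper.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "table", "section", "article", "nav"}
SKIP_TAGS = {"script", "style"}


class BlockSegmenter(HTMLParser):
    """Step 1: split a page into blocks, one per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks
        self._stack = []       # block elements that are currently open
        self._link_depth = 0   # > 0 while inside an <a> element
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "parts": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            # Close the nearest open block with this tag, plus anything
            # left open inside it (tolerates sloppy markup).
            for i in range(len(self._stack) - 1, -1, -1):
                if self._stack[i]["tag"] == tag:
                    self._flush(i)
                    break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self._skip_depth or not self._stack:
            return
        # Text belongs to the innermost open block only.
        block = self._stack[-1]
        block["parts"].append(text)
        if self._link_depth:
            block["link_chars"] += len(text)

    def finish(self):
        """Emit blocks that were never explicitly closed."""
        self._flush(0)

    def _flush(self, start):
        for block in reversed(self._stack[start:]):
            text_chars = sum(len(part) for part in block["parts"])
            if text_chars:
                self.blocks.append({
                    "tag": block["tag"],
                    "text": " ".join(block["parts"]),
                    "link_density": block["link_chars"] / text_chars,
                })
        del self._stack[start:]


def extract_main_content(html, min_chars=40, max_link_density=0.3):
    """Step 2: keep blocks that are long enough and not dominated by links."""
    segmenter = BlockSegmenter()
    segmenter.feed(html)
    segmenter.close()
    segmenter.finish()
    return [
        block["text"]
        for block in segmenter.blocks
        if len(block["text"]) >= min_chars
        and block["link_density"] <= max_link_density
    ]
```

Calling `extract_main_content(page_source)` on a page with a navigation list, an advertisement link, and one long paragraph would be expected to return only the paragraph text, since the other blocks consist almost entirely of link text. Real pages usually need richer features than these two thresholds, so the values above should be read as placeholders.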

How to Cite:

[1] MS. PRANJALI G.GONDSE, PROFESSOR ANJALI B.RAUT, “Main Content Extraction From Web Page Using Dom,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE)

This work is licensed under a Creative Commons Attribution 4.0 International License.