VOLUME 3, ISSUE 3, MARCH 2014
This work is licensed under a Creative Commons Attribution 4.0 International License.
Main Content Extraction From Web Page Using Dom
MS. PRANJALI G.GONDSE, PROFESSOR ANJALI B.RAUT
Abstract: People now depend on the Internet for almost every kind of information, and the World Wide Web has grown rapidly in recent years. With so much information online, web pages have become a major source for information retrieval and data mining technologies such as commercial search engines and web mining applications. Web pages, however, contain many items that cannot be classified as informative content, e.g., search and filtering panels, navigation links, and advertisements; these are called noisy parts. Most clients and end users look for the informative content and largely ignore the non-informative content. A tool that helps an end user or application search and process information from web pages automatically must therefore separate the "primary or informative content sections" from the other content sections. These sections are known as "web page blocks" or just "blocks." First, the tool must segment a web page into blocks; second, it must separate the primary content blocks from the non-informative content blocks. This paper focuses on the review and evaluation of algorithms capable of extracting the main content from a web page. The proposed algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, it will be shown that a web cache system that applies the proposed algorithms to remove non-informative content blocks and to identify similar blocks across web pages can achieve significant storage savings.
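The two steps named in the abstract (segment the page into blocks, then separate informative blocks from non-informative ones) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a simple link-density heuristic, and the block tag set, the thresholds, and all function names are hypothetical choices made only for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption, not taken from the paper).
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article",
              "nav", "header", "footer", "aside"}
SKIP_TAGS = {"script", "style"}  # their contents are never page text


class BlockSegmenter(HTMLParser):
    """Step 1: split the tag stream into text blocks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: {"text", "link_chars"}
        self._pieces = []      # text pieces of the block being built
        self._link_chars = 0   # characters that sit inside <a> elements
        self._in_link = 0
        self._skip = 0

    def flush(self):
        text = " ".join(self._pieces)
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._pieces, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1
        elif tag in SKIP_TAGS:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in SKIP_TAGS and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        piece = " ".join(data.split())  # normalize whitespace
        if self._skip or not piece:
            return
        self._pieces.append(piece)
        if self._in_link:
            self._link_chars += len(piece)


def segment(html):
    """Return the list of text blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a block tag
    return parser.blocks


def is_informative(block, min_chars=40, max_link_density=0.5):
    """Step 2 (assumed heuristic): enough text, and mostly not link text."""
    n = len(block["text"])
    return n >= min_chars and block["link_chars"] / n <= max_link_density


def extract_main_content(html):
    """Keep only the blocks classified as informative."""
    return [b["text"] for b in segment(html) if is_informative(b)]
```

Under these assumptions, a block made up only of navigation links or a linked advertisement has a link density close to 1 and is dropped, while a paragraph of plain prose is kept. A DOM-based method such as the one the title refers to would work on the parsed tree rather than on a flat tag stream; the sketch only shows the segment-then-classify shape of the pipeline.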
Keywords: DOM Tree, information extraction, web mining
How to Cite:
[1] MS. PRANJALI G.GONDSE, PROFESSOR ANJALI B.RAUT, "Main Content Extraction From Web Page Using Dom," International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE)
