An Enhanced Focused Web Crawler for Biomedical Topics Using Attention Enhanced Siamese Long Short Term Memory Networks

HIGHLIGHTS

  • This paper proposes a new focused crawler for biomedical topics.

  • This paper proposes a novel Attention Enhanced Siamese Long Short Term Memory (AE-SLSTM) network.

  • The proposed model is trained using the Adam optimizer with batch normalization.

  • The proposed crawler achieves an average harvest rate of 0.39.

Abstract

The Internet is one of the primary sources of biomedical information. To retrieve the biomedical information a user needs, a search engine requires an efficient focused crawler. Research on focused crawlers for biomedical topics, however, is notably scarce, and the quantity, velocity, diversity, and quality of the biomedical information available online make crawling difficult and call for better crawling support. This paper addresses these challenges and proposes a new learning approach to focused web crawling that adopts Attention Enhanced Siamese Long Short Term Memory (AE-SLSTM) networks with peephole connections to predict the topical relevance of a web page. The proposed AE-SLSTM model accurately computes the semantic similarity between the topic and each web page. The performance of the newly designed crawler is assessed using two well-known metrics, namely harvest rate (hrate) and irrelevance ratio (prate). The presented crawler surpasses existing focused crawlers, with an average hrate of 0.39 and an average prate of 0.61 after crawling 5,000 web pages related to biomedical topics. The results show that the proposed methodology helps to download more biomedical web pages relevant to a particular topic from the Internet.
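To make the two evaluation metrics concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that harvest rate is the fraction of crawled pages judged relevant and that irrelevance ratio is the fraction judged irrelevant, which is consistent with the reported averages (0.39 and 0.61) summing to 1. It also sketches a Manhattan-distance similarity of the form exp(-||a - b||_1), a form commonly used in Siamese LSTM models; the abstract does not state the exact formula, so that form is an assumption. All function names and the toy data are hypothetical.

```python
import math
from typing import Sequence


def harvest_rate(relevance_flags: Sequence[bool]) -> float:
    """Fraction of crawled pages judged relevant (assumed definition of hrate)."""
    if not relevance_flags:
        return 0.0
    relevant = sum(1 for flag in relevance_flags if flag)
    return relevant / len(relevance_flags)


def irrelevance_ratio(relevance_flags: Sequence[bool]) -> float:
    """Fraction of crawled pages judged irrelevant (assumed definition of prate)."""
    if not relevance_flags:
        return 0.0
    irrelevant = sum(1 for flag in relevance_flags if not flag)
    return irrelevant / len(relevance_flags)


def manhattan_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in (0, 1] computed as exp(-L1 distance).

    Assumption: this exponential form is a common choice for Siamese LSTMs;
    the abstract only names Manhattan distance, not the exact formula.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    l1_distance = sum(abs(x - y) for x, y in zip(a, b))
    return math.exp(-l1_distance)


if __name__ == "__main__":
    # Toy crawl of 10 pages, 4 of them judged relevant (illustrative only).
    flags = [True, False, True, False, False, True, False, False, True, False]
    print("hrate:", harvest_rate(flags))
    print("prate:", irrelevance_ratio(flags))

    # Hypothetical topic and page encodings; a real model would produce these.
    topic_vec = [0.2, 0.5, 0.1]
    page_vec = [0.1, 0.4, 0.3]
    print("similarity:", manhattan_similarity(topic_vec, page_vec))
```

In a focused crawler, a similarity score like this would typically be compared against a threshold to decide whether a page counts as relevant; the threshold value is a design choice that the abstract does not specify.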

Keywords:
focused crawler; attention mechanism; LSTM; peephole connection; Manhattan distance; Siamese network

Instituto de Tecnologia do Paraná - Tecpar, Rua Prof. Algacyr Munhoz Mader, 3775 - CIC, 81350-010 Curitiba - PR - Brazil, Tel.: +55 41 3316-3052/3054, Fax: +55 41 3346-2872
E-mail: babt@tecpar.br