GENERIC FRAMEWORK FOR AUTOMATIC SUBJECT GENERATION AND INDEXING IN A DIGITAL REPOSITORY

ABSTRACT

This study presents a generic framework for automatic subject generation using machine learning techniques in the Annif tool, followed by the indexing of data and metadata in a digital repository so that records can be retrieved through faceted search. To this end, the framework was applied to the field of Information Science: a knowledge corpus was built from the metadata of 438 articles from the Base Brasileira de Ciência da Informação (BRAPCI), with the Tesauro Brasileiro de Ciência da Informação (TBCI) as the controlled vocabulary. A “collector” application developed in Python was used to download the metadata and full-text files of dissertations and theses from existing collections in the Repositório Institucional da Universidade de Brasília (RiUnB). After the model was trained with Annif, subjects were automatically generated and indexed in the Tainacan digital repository, where taxonomies were created from the controlled vocabulary. Finally, faceted searches were configured so that users can enter their own labels and, at the same time, browse by selecting terms from the faceted taxonomy. The study concludes that the proposed generic framework can be applied in any area of knowledge, supporting the automatic generation of subjects, their indexing in a digital repository, and the configuration of faceted taxonomies for information retrieval.

Keywords:
Automatic Subject Generation; Indexing; Collections; Digital Repository; Faceted Search

Escola de Ciência da Informação da UFMG, Antonio Carlos, 6627 - Pampulha, 31270-901 - Belo Horizonte - MG, Brazil, Tel: (031) 3499-5227, Fax: (031) 3499-5200
E-mail: pci@eci.ufmg.br