A NLP Enhanced Visual Analytics Tool for Archives Metadata

Ozdemir, Anil; Müstecep, Dilara; Agaoglu, Orhan; Balcisoy, Selim

dc.contributor.author	Ozdemir, Anil	en_US
dc.contributor.author	Müstecep, Dilara	en_US
dc.contributor.author	Agaoglu, Orhan	en_US
dc.contributor.author	Balcisoy, Selim	en_US
dc.contributor.editor	Spagnuolo, Michela and Melero, Francisco Javier	en_US
dc.date.accessioned	2020-11-17T17:51:38Z
dc.date.available	2020-11-17T17:51:38Z
dc.date.issued	2020
dc.identifier.isbn	978-3-03868-110-6
dc.identifier.issn	2312-6124
dc.identifier.uri	https://doi.org/10.2312/gch.20201297
dc.identifier.uri	https://diglib.eg.org:443/handle/10.2312/gch20201297
dc.description.abstract	Today, almost all cultural heritage (CH) institutions are digitizing parts of their collections and archives to improve accessibility, the preservation of originals, publicity, and the institution's visibility on the Internet. As a result, digital document collections have been multiplying. These collections span many domains, including art, history, mathematics, and physics, and make a substantial volume of documents digitally available. This creates a need for approaches that allow users to understand latent meanings in collections, discover and investigate relationships, and extract the information they need. To address this need, we introduce a visual exploratory tool that helps uncover hidden information and stories underlying documents. The tool extracts the key individuals, temporal expressions, locations, entities, and keywords within the documents and establishes a network between documents, allowing researchers and archivists to form and test hypotheses and to observe individual relationships, networks, and stories present in archive metadata collections. We have designed and developed this visual exploration tool for large archives with limited metadata, employing state-of-the-art Natural Language Processing (NLP) techniques to assist cultural heritage researchers. To design the tool, we collaborated with archive professionals from a cultural institution, SALT (https://saltonline.org/), which focuses on public service, producing research-based exhibitions, publications, and digitization projects. Following our conversations with the SALT team, we decided to use Waqfs of Crete, an archive consisting of official records of the Muslim inhabitants of Crete. 
Documents spanning the period from 1825 to 1928, in Ottoman Turkish and Greek, provide an opportunity to examine the multi-layered social structure on the island, especially from a cultural and economic perspective. The metadata covers approximately 10 thousand documents and includes each document's summary, the year it was published, the location, the language used, and an image of the document. We extracted various features, including locations, key individuals, dates, entities, and keywords, from the document summaries in the metadata using NLP methods, including regular expressions for extraction and word embedding models for capturing similarities between documents. We have integrated all of these features into the designed tool to let the user see networks that represent the relationships between documents and easily access similar documents in the archive. In the network, nodes correspond to the documents themselves. To assign a weighted edge between two documents, the total number of individuals and keywords shared by the two documents is computed, and edges are set based on a predetermined threshold value. This threshold was found by manual tuning, considering both the speed at which the result is reflected in the application and the average number of shared attributes. To capture similarity between documents, we used state-of-the-art word embedding models, including Word2vec, FastText, and Transformer, which provide a way to compute dense vector representations for documents. Each document was thus represented as a fixed-size vector output by each model, and the similarity between documents was calculated as the cosine similarity of their vectors. 
The designed interface consists of six components. These include an interactive map that allows the user to view documents in different locations and to view the document networks formed by calculating the total number of shared attributes between documents. The remaining components include an information box that contains document-specific attributes such as location, time, person, entities, and keywords; a document browser that enables users and researchers to browse documents easily; an individual and keyword search menu; and a filtering panel. In this way, users can find roughly related documents very quickly. The user can then browse each document in its network and view documents that share individuals and keywords. Thus, the user can follow the interactions between documents like a story, and can do so for all the people who lived on the island of Crete in the 19th century.	en_US
dc.publisher	The Eurographics Association	en_US
dc.subject	Computing methodologies
dc.subject	Visualization
dc.subject	Natural Language Processing
dc.subject	Information Extraction
dc.subject	Cultural Heritage Preservation
dc.subject	Historic Document Analysis
dc.title	A NLP Enhanced Visual Analytics Tool for Archives Metadata	en_US
dc.description.seriesinformation	Eurographics Workshop on Graphics and Cultural Heritage
dc.description.sectionheaders	Posters
dc.identifier.doi	10.2312/gch.20201297
dc.identifier.pages	83-83
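
The two computations named in the abstract, threshold-based weighted edges and cosine similarity between document vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the placeholder attribute values, the threshold of 2, and the choice of "at least the threshold" as the edge rule are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def build_edges(docs, threshold):
    """Return {(id_a, id_b): weight} for every document pair whose number of
    shared individuals plus shared keywords is at least `threshold`.
    The ">=" comparison is an assumption; the abstract only says that edges
    are set based on a predetermined threshold."""
    edges = {}
    for (id_a, a), (id_b, b) in combinations(sorted(docs.items()), 2):
        shared = len(a["individuals"] & b["individuals"]) + len(
            a["keywords"] & b["keywords"]
        )
        if shared >= threshold:
            edges[(id_a, id_b)] = shared
    return edges


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.
    Returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical placeholder data, not taken from the archive.
docs = {
    "doc1": {"individuals": {"person_a", "person_b"}, "keywords": {"kw_x", "kw_y"}},
    "doc2": {"individuals": {"person_a"}, "keywords": {"kw_x", "kw_z"}},
    "doc3": {"individuals": {"person_c"}, "keywords": {"kw_y"}},
}

edges = build_edges(docs, threshold=2)
print(edges)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In this sketch only doc1 and doc2 are linked, because they share one individual and one keyword. In practice the input vectors would come from an embedding model such as those named in the abstract rather than being written by hand.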

Files in this item

Name: 083-083.pdf
Size: 143.5Kb
Format: PDF


This item appears in the following Collection(s)

GCH 2020 - Eurographics Workshop on Graphics and Cultural Heritage
ISBN 978-3-03868-110-6
