Defoe: A Spark-based toolbox for analysing digital historical textual data

Rosa Filgueira, Mariona Coll Ardanuy, Giovanni Colavizza, James Hetherington, Melissa Terras, Michael Jackson, Anna Roubickova, Amrey Krause, Ruth Ahnert, Tessa Hauswedell, Julianne Nyhan, David Beavan, Timothy Hobson

Publication: Contribution to book/anthology/report, Conference contribution in proceedings, Research, peer-reviewed

5 Citations (Scopus)

Abstract

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows text mining queries to be run in parallel, via Apache Spark, across large datasets such as historical newspapers and books. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

Original language: English
Title: Proceedings - IEEE 15th International Conference on eScience, eScience 2019
Number of pages: 8
Publisher: Institute of Electrical and Electronics Engineers Inc.
Publication date: 2019
Pages: 235-242
Article number: 9041813
ISBN (Electronic): 9781728124513
Status: Published - 2019
Published externally: Yes
Event: 15th IEEE International Conference on eScience, eScience 2019 - San Diego, USA
Duration: 24 Sep 2019 to 27 Sep 2019

Conference

Conference: 15th IEEE International Conference on eScience, eScience 2019
Country/Territory: USA
City: San Diego
Period: 24/09/2019 to 27/09/2019
Name: Proceedings - IEEE 15th International Conference on eScience, eScience 2019

Bibliographical note

Publisher Copyright:
© 2019 IEEE.
