An exploratory analysis of methods for real-time data deduplication in streaming processes

João Esteves, Rosa Costa, Yongluan Zhou, Ana Carolina Brito De Almeida

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

Modern stream processing systems typically need to ingest and correlate data from multiple sources. However, these sources lie outside the system's control and are prone to software errors and unavailability, causing data anomalies that must be remedied before the data is processed. In this context, anomalies such as data duplication are among the most prominent challenges in stream processing. Data duplication can hinder real-time analysis of data for decision making. This paper investigates these challenges and presents an experimental analysis of operators and auxiliary tools that help with data deduplication. The results show that using external mechanisms increases data delivery time. However, these mechanisms are essential for an ingestion process to guarantee that no data is lost and that no duplicates are persisted.
Original language: English
Title of host publication: DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems
Publisher: Association for Computing Machinery
Publication date: 27 Jun 2023
Pages: 91–102
ISBN (Electronic): 9798400701221
Publication status: Published - 27 Jun 2023
Event: 17th ACM International Conference on Distributed and Event-based Systems - DEBS '23 - Neuchatel, Switzerland
Duration: 27 Jun 2023 – 30 Jun 2023

Conference

Conference: 17th ACM International Conference on Distributed and Event-based Systems - DEBS '23
Country/Territory: Switzerland
City: Neuchatel
Period: 27/06/2023 – 30/06/2023
