DORAS | DCU Research Repository

An active learning framework for duplicate detection in SaaS platforms

Nguyen, Quy H., Nguyen, Dac, Dao, Minh-Son, Dang-Nguyen, Duc-Tien (ORCID: 0000-0002-2761-2213), Gurrin, Cathal (ORCID: 0000-0003-2903-3968) and Nguyen, Binh T. (2020) An active learning framework for duplicate detection in SaaS platforms. In: Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20), 26–29 Oct 2020, Dublin, Ireland. ISBN 978-1-4503-7087-5

Abstract
With the rapid growth of user data in SaaS (Software-as-a-Service) platforms using micro-services, detecting duplicated entities has become essential for ensuring the integrity and consistency of data in many companies and businesses (primarily multinational corporations). Because databases today are so large, duplicate detection algorithms need to be not only accurate but also practical, meaning that they return detection results as quickly as possible for a given request. Among existing algorithms for the duplicate detection problem, Siamese neural networks trained with the triplet loss have become one of the robust ways to measure the similarity of two entities (texts, paragraphs, or documents) and so identify all possible duplicated items. In this paper, we first propose a practical framework for building a duplicate detection system in a SaaS platform. Second, we present a new active learning schema for training and updating duplicate detection algorithms. In this schema, we not only allow the crowd to provide more annotated data to enhance the chosen learning model but also use Siamese neural networks with the triplet loss to construct an efficient model for the problem. Finally, we design a user interface for our proposed duplicate detection system, which can easily be applied in practice by different companies.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: active learning; datasets; triplet loss; duplicate removal
Subjects: UNSPECIFIED
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Initiatives and Centres > INSIGHT Centre for Data Analytics
Research Initiatives and Centres > ADAPT
Published in: Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20). Association for Computing Machinery (ACM). ISBN 978-1-4503-7087-5
Publisher: Association for Computing Machinery (ACM)
Official URL: https://doi.org/10.1145/3372278.3391933
Copyright Information: © 2020 The Authors
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland SFI/13/RC/2106, L. Meltzers Høyskolefonds, UiB 2019/2259-NILSO
ID Code: 24631
Deposited On: 17 Jun 2020 13:42 by Cathal Gurrin. Last Modified 15 Dec 2021 15:38
Documents

Full text available as:

p412-nguyenA.pdf (PDF, 1MB)