DORAS | DCU Research Repository

Duplicate identification algorithms in SaaS platforms

Nguyen, Dac, Nguyen, Quy H., Dao, Minh-Son, Dang-Nguyen, Duc-Tien (ORCID: 0000-0002-2761-2213), Gurrin, Cathal (ORCID: 0000-0003-2903-3968) and Nguyen, Binh T. (2020) Duplicate identification algorithms in SaaS platforms. In: 2020 Intelligent Cross-Data Analysis and Retrieval Workshop (ICDAR'20), 20-26 Oct 2020, Dublin, Ireland. ISBN 978-1-4503-7509-2

Abstract
Duplicate records are one of the most common issues in many Software-as-a-Service (SaaS) platforms. In this paper, we study the duplicate identification problem in one specific SaaS platform related to quality and compliance management by using address information. We analyze the typical user mistakes that can produce duplicate organizations in a dataset collected from the SaaS platform. We also create a second dataset by crawling location data from Open Address (US Zone). We compare different methods, including Bag-of-words (using Cosine Distance), Record Linkage Toolkits, and Siamese Neural Networks using the triplet loss, in terms of precision, recall, and F1-score. The experimental results show that Siamese Neural Networks achieve better performance than the other techniques. We plan to publish our Open Address dataset and all implementation code to facilitate further research in related fields.
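As a rough illustration of two of the techniques named in the abstract, the sketch below computes a bag-of-words cosine distance between two address strings and evaluates a triplet loss on small example vectors. It is not the authors' implementation: the function names, the tokenization, the Euclidean distance, and the margin value are all assumptions made for this example, since this record does not describe those details.

```python
import math
from collections import Counter


def bag_of_words(address: str) -> Counter:
    """Lowercase the address, drop commas and periods, and count tokens.

    The tokenization is an assumption for this sketch.
    """
    cleaned = address.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty address as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Uses Euclidean distance; the margin of 1.0 is an assumed default.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Two spellings of the same (made-up) address.
    first = bag_of_words("12 Main St., Springfield")
    second = bag_of_words("12 main st springfield")
    print(cosine_distance(first, second))

    # Embeddings here are hand-written placeholders, not model outputs.
    print(triplet_loss((0.0, 0.0), (0.0, 0.0), (3.0, 4.0)))
```

A small distance suggests that two records may describe the same organization; where to place the decision threshold would have to be tuned on labelled data. In a Siamese setup, the vectors passed to the triplet loss would come from a shared encoder applied to the anchor, a known duplicate, and a non-duplicate.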
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Refereed: Yes
Uncontrolled Keywords: siamese; software-as-a-service; bi-gru; triplet loss; duplicate identification
Subjects: Computer Science > Computer security
Computer Science > Software engineering
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Initiatives and Centres > ADAPT
Published in: Proceedings of the 2020 Intelligent Cross-Data Analysis and Retrieval Workshop (ICDAR'20). Association for Computing Machinery (ACM). ISBN 978-1-4503-7509-2
Publisher: Association for Computing Machinery (ACM)
Official URL: https://doi.org/10.1145/3379174.3392319
Copyright Information: © 2020 The Authors
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland under grant number SFI/13/RC/2106, L. Meltzers Høyskolefonds, UiB 2019/2259-NILSO
ID Code: 24667
Deposited On: 22 Jun 2020 15:32 by Cathal Gurrin. Last Modified 15 Dec 2021 15:40
Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
1MB