DORAS | DCU Research Repository

Coping with noise in a real-world weblog crawler and retrieval system

Lanagan, James, Ferguson, Paul, O'Hare, Neil and Smeaton, Alan F. (ORCID: 0000-0003-1028-8389) (2010) Coping with noise in a real-world weblog crawler and retrieval system. In: Fourth International AAAI Conference on Weblogs and Social Media, 23-26 May 2010, Washington, DC, USA.

Abstract
In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to noise removal from blog pages, examining the difficulties encountered when crawling the blogosphere during the creation of a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach in order to increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text to text ratio, and discover that the time-interval between crawls is more important to the successful application of noise-removal algorithms within the blog context than any additional improvements to the removal algorithm itself.
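The abstract names two ideas without spelling them out, so the following is only a minimal sketch of how they might fit together, not the authors' implementation. It assumes that DiffPost-style noise removal means discarding lines that a page shares with another page from the same blog, and that the anchor-text to text ratio is the share of a line's visible text that sits inside links. The function names, the regular expressions, and the `max_ratio` threshold of 0.5 are all hypothetical.

```python
import re

# Hypothetical helpers: none of these names come from the paper.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def diff_lines(page, sibling):
    """Keep the non-empty lines of `page` that do not also occur in `sibling`.

    Assumption: lines shared by two pages of the same blog are template
    noise (headers, sidebars, footers) rather than post content.
    """
    seen = {line.strip() for line in sibling.splitlines() if line.strip()}
    return [
        line.strip()
        for line in page.splitlines()
        if line.strip() and line.strip() not in seen
    ]


def anchor_text_ratio(line):
    """Return the fraction of a line's visible text that is inside <a> elements."""
    visible = TAG_RE.sub("", line).strip()
    if not visible:
        return 1.0  # a line with no visible text carries no content
    anchor = "".join(TAG_RE.sub("", m) for m in ANCHOR_RE.findall(line)).strip()
    return len(anchor) / len(visible)


def clean(page, sibling, max_ratio=0.5):
    """Apply the line difference, then drop link-dominated lines.

    `max_ratio` is an illustrative threshold, not a value from the paper.
    """
    return [
        line
        for line in diff_lines(page, sibling)
        if anchor_text_ratio(line) <= max_ratio
    ]


page_a = (
    "<div>My Blog</div>\n"
    "<p>Today I went hiking.</p>\n"
    "<li><a href='/x'>Archive</a></li>\n"
    "<li><a href='/new'>Newest comment</a></li>"
)
page_b = (
    "<div>My Blog</div>\n"
    "<p>Yesterday was rainy.</p>\n"
    "<li><a href='/x'>Archive</a></li>"
)

print(clean(page_a, page_b))
```

In this toy input the "Newest comment" sidebar entry appears on only one of the two pages, as it might if the pages were crawled at different times, so the line difference alone does not remove it; the ratio filter is what discards it here.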
Metadata
Item Type: Conference or Workshop Item (Poster)
Event Type: Conference
Refereed: Yes
Additional Information: Contact alan.smeaton@dcu.ie
Uncontrolled Keywords: blogs; blogging; social networks
Subjects: Computer Science > Interactive computer systems; Computer Science > Computer software; Computer Science > Artificial intelligence; Computer Science > Information retrieval
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > CLARITY: The Centre for Sensor Web Technologies
Official URL: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/pa...
Copyright Information: Copyright 2010 Association for the Advancement of Artificial Intelligence
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland
ID Code: 15439
Deposited On: 08 Jul 2010 09:15 by Alan Smeaton. Last Modified 02 Nov 2018 15:02
Documents

Full text available as:

1469-7876-1-PB.pdf (PDF, 1MB)