FooTweets: a bilingual parallel corpus of World Cup tweets

Sluyter-Gäthje, Henny, Lohar, Pintu ORCID: 0000-0002-5328-1585, Afli, Haithem ORCID: 0000-0002-7449-4707 and Way, Andy ORCID: 0000-0001-5736-5930 (2018) FooTweets: a bilingual parallel corpus of World Cup tweets. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, 7-12 May 2018, Miyazaki, Japan. ISBN 979-10-95546-00-9

Abstract

The way information spreads through society has changed significantly over the past decade with the advent of online social networking. Twitter, one of the most widely used social networking websites, is known as the real-time, public microblogging network where news breaks first. Most users love it for its iconic 140-character limitation and unfiltered feed that shows them news and opinions in the form of tweets. Tweets are usually multilingual in nature and of varying quality. However, machine translation (MT) of Twitter data is a challenging task, especially for the following two reasons: (i) tweets are informal in nature (i.e., they violate linguistic norms), and (ii) parallel resources for Twitter data are scarcely available on the Internet. In this paper, we develop FooTweets, a first parallel corpus of tweets for the English–German language pair. We extract 4,000 English tweets from the FIFA 2014 World Cup and manually translate them into German with a special focus on the informal nature of the tweets. In addition, we annotate all the tweets with sentiment scores between 0 and 1, depending upon the degree of sentiment associated with them. This data has recently been used to build sentiment translation engines, and an extensive evaluation revealed that such a resource is very useful in machine translation of user-generated content.

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	tweets; parallel data; sentiment translation
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Initiatives and Centres > ADAPT
Published in:	Proceedings of the 11th International Conference on Language Resources and Evaluation. European Language Resources Association. ISBN 979-10-95546-00-9
Publisher:	European Language Resources Association
Official URL:	https://www.aclweb.org/anthology/L18-1422
Copyright Information:	© 2018 ELRA
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	ADAPT Centre for Digital Content Technology at Dublin City University is funded under the Science Foundation Ireland Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund
ID Code:	23203
Deposited On:	25 Apr 2019 09:39 by Thomas Murtagh. Last Modified 05 May 2023 16:31

Documents

Full text available as: PDF (106kB)

DORAS | DCU Research Repository
