Time Extraction from Real-time Generated Football Reports

Where it all started

As this was the very first paper I sent for peer review, it also deserves my very first blog post! This piece of work was not only my first publication, it also resulted in my first attendance at a scientific conference – and boy, was I nervous! I spent hours preparing the presentation, and even had a recorded demo of the tool I presented… and then the computer in the room turned out to be too slow to play videos embedded in PowerPoint. I managed to quickly Alt+Tab and play it in Windows Media Player instead, but it surely put further pressure on my nerves. What also makes this paper a significant milestone in my research journey is that it came out of Prof. Pierre Nugues' course on natural language processing (in the fall of 2006) – the first contact I had with text processing. I really got hooked on the topic, and three years later, after a few years as a software developer with ABB, I ended up applying for an opening as a PhD student working on natural language processing in the software engineering context.

Blacklisted?

Writing about the paper forces me to make a confession. I double posted. I didn’t have a clue it was a big no-no! To me it made perfect sense: there were two related conferences with two overlapping calls for papers, both with suitable deadlines. Of course I submitted to both of them, why not? As far as I know, nobody noticed my big crime. The big conference, the 45th Annual Meeting of the Association for Computational Linguistics, rejected the paper first, but with reasonable comments. The killer comment was about my paper’s use of regular expressions to detect football phrases, a much too limited approach according to the reviewers. The reviewers at NODALIDA pointed this out as well, of course, but I was later told that my submission was accepted as a borderline paper. Awesome!

What’s in this paper?

When I did the project in the NLP course, I was a big fan of the online football manager game Hattrick. I actually still manage my team (created back in 2002!), but I really don’t spend much time on it nowadays. Back in 2006, though, it surely was an important hobby. In Hattrick, the game engine simulates matches in real time and adds a sentence to the match report when important things happen – goals, free kicks, corner kicks, injuries, cards and the like. My idea was to i) detect events in the live reports (e.g., a pass, a shot, or a save), and ii) sort the events using a few rules I developed. The event detection wasn’t really a contribution of the paper, as I simply relied on regular expressions – useful for extracting events from game engine output, but hardly from text written by a human journalist! Instead, the contribution was sorting events in time.
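To give a feel for what regex-based event detection looks like, here is a minimal sketch in Python (the actual tool was written in Java). The patterns are invented English stand-ins, not the Swedish phrases found in Hattrick reports, and the names `EVENT_PATTERNS` and `detect_events` are hypothetical.

```python
import re

# Hypothetical patterns: invented English stand-ins for illustration only,
# not the Swedish phrases that Hattrick actually generates.
EVENT_PATTERNS = {
    "goal": re.compile(r"\b(?:scored|made it \d+-\d+)\b", re.IGNORECASE),
    "save": re.compile(r"\bsaved\b", re.IGNORECASE),
    "shot": re.compile(r"\b(?:shot|fired)\b", re.IGNORECASE),
    "pass": re.compile(r"\bpass(?:ed)?\b", re.IGNORECASE),
}


def detect_events(sentence):
    """Return (event_type, matched_text, start_offset) tuples in textual order."""
    found = []
    for event_type, pattern in EVENT_PATTERNS.items():
        for match in pattern.finditer(sentence):
            found.append((event_type, match.group(0), match.start()))
    # Sort by position so the result mirrors the order of mention in the sentence.
    return sorted(found, key=lambda item: item[2])


if __name__ == "__main__":
    for event in detect_events("The keeper saved the shot after a pass from midfield."):
        print(event)
```

A fixed phrase list like this works only because a game engine repeats the same templates over and over, which is exactly why it would not carry over to free text.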

I limited my work to real-time generated match reports in Swedish. As the reports were generated in real time, I assumed that the events described in a sentence occurred after those in the previous sentences. Similar real-time reporting is used in other sports as well, and also when newspapers cover major events as they happen. The idea I came up with was based on categorizing each match event as 1) result change, 2) save, 3) finish, 4) pre-finish, 5) idle ball, or 6) other. The simple rule for sorting events within a sentence then stated that events in category 1 happen before 2, 2 before 3, and so on. I compared this approach to a (more) naïve baseline – that the events happened in the same order as they appear within a sentence. I evaluated the rule on a set of Hattrick match reports, and yes – the approach with the rule was more accurate for match reports in Swedish! Everything was then packaged in a prototype Java tool, using the TimeML Specification Language to encode events, and both the course project and the paper were happy. In conclusion, a very nice experience for me that got me interested in doing more research!
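The difference between the baseline and the category rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the code from the Java prototype: the event labels and function names are hypothetical, and the ordering follows the rule exactly as stated above (category 1 first). The `lowest_first` switch is an addition of the sketch so that the opposite direction can be tried as well.

```python
from enum import IntEnum


class Category(IntEnum):
    """The six event categories, numbered as in the text."""
    RESULT_CHANGE = 1
    SAVE = 2
    FINISH = 3
    PRE_FINISH = 4
    IDLE_BALL = 5
    OTHER = 6


def baseline_order(events):
    """Baseline: events happened in the order they are mentioned."""
    return list(events)


def rule_order(events, lowest_first=True):
    """Order events by category number.

    sorted() is stable, so events in the same category keep their
    order of mention. lowest_first=False flips the direction
    (an assumption of this sketch, not something from the paper).
    """
    return sorted(events, key=lambda event: event[1], reverse=not lowest_first)


if __name__ == "__main__":
    # Hypothetical sentence content, listed in order of mention.
    mentioned = [
        ("pass", Category.PRE_FINISH),
        ("shot", Category.FINISH),
        ("goal", Category.RESULT_CHANGE),
    ]
    print([label for label, _ in baseline_order(mentioned)])
    print([label for label, _ in rule_order(mentioned)])
```

Evaluating the two approaches then amounts to comparing each ordering against a manually annotated order and counting how often they agree.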

Tweetable Summary

Shoot first, save later – events described in Swedish real-time generated football match reports can be sorted using simple rules.

Implications for Practice

  • Real-time generated texts often appear online, so detecting and (time-)sorting the events in them could be useful.
  • Simple rules can sort events within sentences in real-time generated football reports.
  • Determining the order of events is fundamental to animating match sequences based on a match report.

Implications for Research

  • The rule-based approach to sorting events in a match report could be replicated for other languages.
  • Similar approaches could be explored in other domains, such as news reporting.
  • I use output from a game engine to study natural language, an idea that could also be used by other researchers to conduct preliminary studies on real but restricted text.

Markus Borg, Time Extraction from Real-time Generated Football Reports, In Proc. of the 16th Nordic Conference of Computational Linguistics (NODALIDA), pp. 37-43, Tartu, Estonia, 2007. (link, preprint)

Abstract

This paper describes a system to extract events and time information from football match reports generated through minute-by-minute reporting. We describe a method that uses regular expressions to find the events and divides them into different types to determine in which order they occurred. In addition, our system detects time expressions and we present a way to structure the collected data using XML.