Twitter Event #Hashtag Analysis

With the World Triathlon Stockholm event having taken place over the weekend, we decided to analyse the Tweets surrounding the event to see whether we could determine trends and sentiment, or extract any other valuable information from the Twitter stream. Fortunately, each event in the World Triathlon Series uses a standard hashtag for event Tweets, which in the case of Stockholm was #WTSStockholm. Naturally, many other Tweets referred to the race without using the hashtag; these are not included in the analysis below.

The Live Twitter Feed method of the API is designed to follow event hashtags using the Twitter Streaming API, such that all Tweets containing the event hashtag are published via a custom websocket API as well as persisted for on-demand retrieval. Whilst the primary use of the feed is to provide a curated live feed for applications such as TriathlonLive.tv, we can also use it for post-event analysis and potentially to garner feedback from the event.

As well as some basic Twitter analytics, we will also consider a few additional methods of extracting information from the Twitter feed using machine learning. First, we extract the raw data from the API using the following request, which covers the entire race week:

curl --header "apikey: [[app:key]]" "https://api.triathlon.org/v1/live/twitter?filter=WTSStockholm&start_date=2016-06-27&end_date=2016-07-04"
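
For post-processing it can be convenient to issue the same request from a script. The following is a minimal Python sketch using only the standard library. The function names are ours, the response is assumed only to be valid JSON, and pagination (if the API uses it) is not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://api.triathlon.org/v1/live/twitter"


def build_request(api_key, hashtag, start_date, end_date):
    """Build the GET request for the Live Twitter Feed (helper name is ours)."""
    query = urlencode(
        {"filter": hashtag, "start_date": start_date, "end_date": end_date}
    )
    # The API key travels in the "apikey" header, as in the curl example.
    return Request(f"{BASE_URL}?{query}", headers={"apikey": api_key})


def fetch_tweets(api_key, hashtag, start_date, end_date):
    """Send the request and return the decoded JSON body unchanged."""
    request = build_request(api_key, hashtag, start_date, end_date)
    with urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_tweets(key, "WTSStockholm", "2016-06-27", "2016-07-04")` with a valid key would then mirror the curl request.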

The response shows that during race week (Monday to Sunday) there were 2,799 Tweets using the hashtag #WTSStockholm, of which 1,232 were retweets. Using one of the date fields on each Tweet (either timestamp or created_at), we can plot the distribution of Tweets throughout the week. The following chart shows the number of Tweets per day together with the breakdown of retweets.
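
The per-day counts behind such a chart can be computed along the following lines. This is a sketch under stated assumptions: each Tweet is taken to be a dictionary whose `created_at` field is an ISO 8601 string and whose `text` field holds the Tweet body, and a retweet is recognised by the common "RT @" prefix. The real feed may format dates or mark retweets differently.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(tweets):
    """Return {date: (total, retweets)} sorted by date.

    Assumes ISO 8601 `created_at` values and the "RT @" prefix
    convention for retweets; both are assumptions, not API facts.
    """
    totals = Counter()
    retweets = Counter()
    for tweet in tweets:
        day = datetime.fromisoformat(tweet["created_at"]).date()
        totals[day] += 1
        if tweet["text"].startswith("RT @"):
            retweets[day] += 1
    return {day: (totals[day], retweets[day]) for day in sorted(totals)}
```

Applied to the race-week data, the totals across all days should sum to 2,799 and the retweets to 1,232, provided the retweet heuristic matches how the feed marks retweets.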

[Chart: Number of Tweets using #WTSStockholm by day during race week]

It should come as no surprise that Twitter activity spiked on the days immediately before and after the race, with race day itself dominating: ~75% of all Tweets occurred on that day alone. Breaking race day down by hour yields the following chart:

[Chart: Number of Tweets using #WTSStockholm per hour on race day (July 2nd)]

Again, spikes in activity occur at race times, with the Elite Women starting at 14:00 and the Elite Men at 16:45 (all times GMT) and the live TV broadcast lasting 140 minutes for each. The largest spike, at 18:00-19:00, coincides with the end of the men's race.
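
The hourly breakdown for a single day can be sketched as follows, again assuming that `created_at` is an ISO 8601 string and that timestamps are expressed in GMT, as the times in this article are. The function name is ours.

```python
from collections import Counter
from datetime import date, datetime


def tweets_per_hour(tweets, day):
    """Return a 24-element list of Tweet counts for the given day.

    Index 0 covers 00:00-00:59, index 23 covers 23:00-23:59.
    Assumes ISO 8601 `created_at` values already in GMT.
    """
    counts = Counter()
    for tweet in tweets:
        created = datetime.fromisoformat(tweet["created_at"])
        if created.date() == day:
            counts[created.hour] += 1
    return [counts[hour] for hour in range(24)]


def peak_hour(hourly):
    """Return the hour (0-23) with the most Tweets; earliest hour wins ties."""
    return max(range(24), key=lambda hour: hourly[hour])
```

For race day one would call `tweets_per_hour(tweets, date(2016, 7, 2))` and pass the result to `peak_hour`.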

Whilst Tweet volume alone is an interesting metric that could be compared across events and programs, we are looking to extract more detailed information from the Tweets themselves. For that we turn to machine learning algorithms that automate text classification and extraction, so that we can pass in our array of Tweets and have them classified by sentiment or language, or have keywords returned. Whilst such algorithms are beyond the scope of this article, MonkeyLearn provides a ready-made service with public classifiers and extractors that have already been suitably trained for this type of analysis and that we can access via an API.

You may download the PHP package used to run the following analysis from here.

Thanks to the availability of client libraries for the MonkeyLearn API, this process couldn't be much simpler. We have provided a quick and simple PHP package that performs sentiment analysis and keyword extraction using the MonkeyLearn platform, and we use it for the remainder of this article.

Running the keyword extractor (the keywords.php script) over the entire data set of Tweets, excluding retweets, gives the following result:

Keyword | Count | Relevance
---------------------------
Brownlee | 121 | 0.998 
Good luck | 26 | 0.640 
chase group | 40 | 0.582 
race | 147 | 0.537 
lead group | 22 | 0.495 
weekend | 61 | 0.486 
seconds | 89 | 0.460 
lap | 84 | 0.436 
triathlonlive | 150 | 0.431 
women | 78 | 0.356

Restricting this analysis by time (using the TV broadcast start and finish times), we may look at the Elite Women and Elite Men races separately, starting with the Elite Women (whilst not perfect, this is a decent approximation):

Keyword | Count | Relevance
---------------------------
chase group | 27 | 0.996 
seconds | 25 | 0.504 
lap | 37 | 0.472 
2nd chase group | 4 | 0.442 
triathlonlive | 50 | 0.398 
race | 34 | 0.365 
Ueda | 12 | 0.356 
Vuelta | 15 | 0.334 
much good sport | 3 | 0.332 
Juri Ide | 6 | 0.332

Despite leading the entirety of the bike leg, Flora Duffy is conspicuously missing from the top keywords, although there is reference to the group of chasers trying unsuccessfully to chase her down. This is likely a failure of the extractor to recognise athlete names across the different ways of expressing them (e.g. a nickname, a Twitter handle or a full name), in which case specific training of such an extraction module would be required.
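
Short of training a custom extractor, a much simpler alternative is to count mentions against a hand-maintained alias list. The sketch below illustrates that substitute technique and is not what the extractor does. The alias table is hypothetical and would need to be curated per athlete.

```python
import re
from collections import Counter

# Hypothetical alias table: canonical name -> ways the athlete may be written.
ALIASES = {
    "Flora Duffy": ["flora duffy", "@floraduffy", "duffy"],
}


def compile_aliases(aliases):
    """Compile one case-insensitive pattern per athlete.

    The lookarounds stop an alias matching inside a longer word,
    and they also work for aliases that start with "@".
    """
    return {
        name: re.compile(
            r"(?<!\w)(?:" + "|".join(re.escape(v) for v in variants) + r")(?!\w)",
            re.IGNORECASE,
        )
        for name, variants in aliases.items()
    }


def count_mentions(texts, aliases=ALIASES):
    """Count Tweets mentioning each athlete (at most once per Tweet)."""
    patterns = compile_aliases(aliases)
    counts = Counter()
    for text in texts:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts
```

Counting each athlete at most once per Tweet keeps the figures comparable with Tweet counts elsewhere in this article.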

Keyword | Count | Relevance
---------------------------
Brownlee | 63 | 0.997 
lead group | 17 | 0.857 
Brownlee Brothers | 9 | 0.454 
seconds | 48 | 0.414 
triathlonlive | 66 | 0.403 
WTS podium | 11 | 0.403 
chasing | 23 | 0.375 
first chase group | 4 | 0.336 
discounted season pass | 4 | 0.336 
220Triathlon | 34 | 0.323

Here, for the Elite Men, the Brownlee keyword dominates as expected, and the different groups and triathlonlive once again feature prominently.
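
The restriction by broadcast window used for the two tables above can be sketched as follows. The start times and the 140-minute broadcast length are those quoted earlier in this article. The field name `created_at` and its ISO 8601 format with an explicit UTC offset are assumptions.

```python
from datetime import datetime, timedelta, timezone

BROADCAST_LENGTH = timedelta(minutes=140)

# Race-day start times in GMT, as stated in the article.
WINDOW_STARTS = {
    "Elite Women": datetime(2016, 7, 2, 14, 0, tzinfo=timezone.utc),
    "Elite Men": datetime(2016, 7, 2, 16, 45, tzinfo=timezone.utc),
}


def in_window(tweets, start, length=BROADCAST_LENGTH):
    """Return the Tweets created in the half-open interval [start, start + length).

    `created_at` must carry a UTC offset so that it can be compared
    with the timezone-aware start time.
    """
    end = start + length
    return [
        tweet
        for tweet in tweets
        if start <= datetime.fromisoformat(tweet["created_at"]) < end
    ]
```

With these values the women's window runs from 14:00 to 16:20 and the men's from 16:45 to 19:05, so Tweets sent between the two broadcasts fall into neither set.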

Turning our attention to classification, we put all the Tweets (excluding retweets) through a public Twitter sentiment analysis module using the sentiment.php script, which yielded the following results:

Positive: 452 
Negative: 115 
Neutral: 1029
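
Expressed as shares of the classified Tweets, the counts above work out as follows. This is plain arithmetic on the three published figures.

```python
# Sentiment counts as reported above.
counts = {"Positive": 452, "Negative": 115, "Neutral": 1029}

total = sum(counts.values())

# Percentage share of each label, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share}%")
```

That is roughly 28% positive, 7% negative and 64% neutral, or about four positive Tweets for every negative one.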

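The script also provides examples of Tweets classified as positive and negative. Negative Tweets, for example, were mostly concerned with incidents in the race rather than the race itself. Of the two groups of examples below, the first was classified as negative and the second as positive.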

Unfortunately due to illness Jgomeznoya will not compete in #WTSStockholm...
And it looks like Gunderson has crashed out on that last lap #WTSStockholm
#bbcsport fuming you didn't show #WTSStockholm ruined my weekend and cost me £10
Extremely disappointed to not start #WTSStockholm. Sick since Friday with a GI bug and will now reload in #WTSHamburg

Hard to think of a more deserved win than the one of @floraduffy at #WTSStockholm , congratulations.
Love the #WTSStockholm course. Technical, fast, and scenic. How a race should be #fullgas
Stockholm hosting a great Tri event today @WTSStockholm with perfect weather ☀️
#WTSStockholm today - after Wales' heroics last night, I'm sure @NonStanford & @heljinx will be looking to achieve a similar feat.

Whilst it is feasible that, given enough data points, a baseline sentiment could be computed and compared between events, perhaps the most useful takeaway is the ability to quickly extract feedback from both positive and negative Tweets.

Finally, we used the Language Detection module to classify the languages most often used with the #WTSStockholm hashtag (see the module category tree for further information about specific language classification).

[Chart: Tweets using #WTSStockholm hashtag classified by language]

This analysis of #WTSStockholm Tweets provides a quick example of what can be achieved using the Live Twitter Feed method of the API in conjunction with the MonkeyLearn service. It would of course be possible to extend this analysis by creating custom-trained modules to extract different types of data or to analyse sentiment towards, say, individual athletes.