With the World Triathlon Stockholm event having taken place over the weekend, we decided to analyse the Tweets surrounding the event to see whether we could determine trends and sentiment, or extract any other valuable information from the Twitter stream. Fortunately, each event in the World Triathlon Series uses a standard hashtag for event Tweets, which in the case of Stockholm was #WTSStockholm. Naturally, many other Tweets referred to the race without using the hashtag; these are not included in the analysis below.
The Live Twitter Feed method of the API is designed to follow event hashtags using the Twitter Streaming API, so that all Tweets containing the event hashtag are published via a custom websocket API and persisted for on-demand retrieval. Whilst the primary use of the feed is to provide a curated live feed for applications such as TriathlonLive.tv, we can also use it for post-event analysis and potentially to garner feedback from the event.
As well as some basic Twitter analytics, we will consider a few additional methods of extracting information from the Twitter feed via a machine learning approach. First, we extract the raw information from the API using the following request, which covers the entire race week:
curl --header "apikey: [[app:key]]" "https://api.triathlon.org/v1/live/twitter?filter=WTSStockholm&start_date=2016-06-27&end_date=2016-07-04"
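For readers who prefer to script the request, the following is a minimal Python sketch of the same call using only the standard library. The endpoint, header name and query parameters are taken from the curl command above; the function names are our own, and the assumption that the response body is JSON is not confirmed here, so treat `fetch_tweets` as illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.triathlon.org/v1/live/twitter"


def build_request(api_key, hashtag, start_date, end_date):
    """Build the same GET request as the curl command, with an encoded query string."""
    query = urlencode(
        {"filter": hashtag, "start_date": start_date, "end_date": end_date}
    )
    return Request(f"{BASE_URL}?{query}", headers={"apikey": api_key})


def fetch_tweets(api_key, hashtag, start_date, end_date):
    """Send the request and decode the body.

    Assumption: the API answers with JSON; the exact response structure
    is not described in this article.
    """
    request = build_request(api_key, hashtag, start_date, end_date)
    with urlopen(request) as response:
        return json.load(response)
```

Building the query string with `urlencode` avoids the shell-quoting pitfalls of passing `&` characters on the command line.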
This tells us that during race week (Monday to Sunday) there were 2799 Tweets using the hashtag #WTSStockholm, of which 1232 were retweets. Using one of the date fields on each Tweet (for example created_at), we may plot the distribution of Tweets throughout the week. The following chart shows the number of Tweets per day as well as the respective breakdown of retweets.
Number of Tweets using #WTSStockholm by day during race week
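The per-day counts behind a chart like this can be produced with a few lines of Python. This is a sketch under two assumptions that are not confirmed by the API documentation: that created_at uses Twitter's classic timestamp format, and that retweets carry a retweeted_status object as they do in Twitter's own payloads.

```python
from collections import Counter
from datetime import datetime

# Assumption: created_at looks like "Sat Jul 02 14:03:11 +0000 2016"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def is_retweet(tweet):
    # Assumption: a retweet is marked by the presence of a retweeted_status key
    return "retweeted_status" in tweet


def tweets_per_day(tweets):
    """Return (totals, retweets): two Counters keyed by calendar date."""
    totals, retweets = Counter(), Counter()
    for tweet in tweets:
        day = datetime.strptime(tweet["created_at"], TWITTER_DATE_FORMAT).date()
        totals[day] += 1
        if is_retweet(tweet):
            retweets[day] += 1
    return totals, retweets
```

Grouping by hour instead of by day only requires replacing `.date()` with `.hour` after restricting the input to race day.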
It should come as no surprise that Twitter activity spiked on the days immediately before and after the race, with race day itself accounting for ~75% of all Tweets. Breaking race day down by hour yields the following chart:
Number of Tweets using #WTSStockholm per hour on race day (July 2nd)
Again, spikes in activity occur at race times, with the Elite Women starting at 14:00 and the Elite Men at 16:45 (all times GMT) and the live TV broadcast lasting 140 minutes for each. The largest spike, at 18:00-19:00, coincides with the end of the men's race.
Whilst Tweet volume alone is an interesting metric that could be compared across events and programs, we are looking to extract more detailed information from the Tweets themselves. For that we turn to machine learning algorithms that automate text classification and extraction, so that we can pass in our array of Tweets and have them classified by sentiment or language, or have keywords returned. Such algorithms are beyond the scope of this article, but MonkeyLearn provides a ready-made service with public classifiers and extractors that have already been trained for this type of analysis and that we can simply access via an API.
Download the PHP package used to run the following analysis
You may download the code used to run the following analysis from here.
Thanks to the availability of clients for the MonkeyLearn API, this process couldn't be much simpler. We have provided a small PHP package that performs sentiment analysis and keyword extraction using the MonkeyLearn platform, and we will use it for the remainder of this article.
Taking the entire data set of Tweets, excluding retweets, and running the keyword extractor via the keywords.php script, we get the following result:
Keyword        | Count | Relevance
----------------------------------
Brownlee       | 121   | 0.998
Good luck      | 26    | 0.640
chase group    | 40    | 0.582
race           | 147   | 0.537
lead group     | 22    | 0.495
weekend        | 61    | 0.486
seconds        | 89    | 0.460
lap            | 84    | 0.436
triathlonlive  | 150   | 0.431
women          | 78    | 0.356
Restricting this analysis by time (TV broadcast start and finish times), we may look at the Elite Women and Elite Men races separately, in that order (whilst not perfect, this is a decent approximation):
Keyword          | Count | Relevance
------------------------------------
chase group      | 27    | 0.996
seconds          | 25    | 0.504
lap              | 37    | 0.472
2nd chase group  | 4     | 0.442
triathlonlive    | 50    | 0.398
race             | 34    | 0.365
Ueda             | 12    | 0.356
Vuelta           | 15    | 0.334
much good sport  | 3     | 0.332
Juri Ide         | 6     | 0.332
Despite leading the entirety of the bike leg, Flora Duffy is conspicuously missing from the top keywords, although there are references to the group of chasers trying unsuccessfully to track her down. This is likely a failure of the extractor to identify an athlete's name when it is expressed in different ways, e.g. a nickname, Twitter handle or full name, in which case specific training of such an extraction module would be required.
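Short of training a custom extractor, one simple workaround is to count athlete mentions directly by matching a list of known aliases. The sketch below is illustrative only: the alias list is hypothetical (only the @floraduffy handle appears in the Tweets quoted in this article), and the function names are our own.

```python
import re

# Hypothetical alias list; extend with nicknames and handles as needed
ATHLETE_ALIASES = {
    "Flora Duffy": ["Flora Duffy", "Duffy", "@floraduffy"],
}


def compile_alias_patterns(aliases):
    """Compile one case-insensitive pattern per athlete from their aliases."""
    patterns = {}
    for name, variants in aliases.items():
        # Longest aliases first so the full name is preferred over the surname
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(variant) for variant in ordered)
        # The lookarounds stop an alias matching inside a longer word
        patterns[name] = re.compile(
            rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE
        )
    return patterns


def count_athlete_mentions(texts, aliases=ATHLETE_ALIASES):
    """Count how many texts mention each athlete under any alias."""
    patterns = compile_alias_patterns(aliases)
    return {
        name: sum(1 for text in texts if pattern.search(text))
        for name, pattern in patterns.items()
    }
```

Each Tweet is counted at most once per athlete, so a Tweet that uses both the handle and the surname does not inflate the total.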
Keyword                 | Count | Relevance
-------------------------------------------
Brownlee                | 63    | 0.997
lead group              | 17    | 0.857
Brownlee Brothers       | 9     | 0.454
seconds                 | 48    | 0.414
triathlonlive           | 66    | 0.403
WTS podium              | 11    | 0.403
chasing                 | 23    | 0.375
first chase group       | 4     | 0.336
discounted season pass  | 4     | 0.336
220Triathlon            | 34    | 0.323
Here, as expected, the Brownlee keyword dominates, and triathlonlive once again features prominently.
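The broadcast-window restriction used for the two per-race tables can be sketched as follows. The start times and the 140-minute broadcast length come from the race-day figures given earlier; the created_at format is again assumed to be Twitter's classic timestamp format, and the function names are our own.

```python
from datetime import datetime, timedelta, timezone

# Assumption: created_at looks like "Sat Jul 02 14:03:11 +0000 2016"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

BROADCAST_LENGTH = timedelta(minutes=140)
RACE_STARTS = {
    "Elite Women": datetime(2016, 7, 2, 14, 0, tzinfo=timezone.utc),
    "Elite Men": datetime(2016, 7, 2, 16, 45, tzinfo=timezone.utc),
}


def race_for(created_at):
    """Return the race whose broadcast window contains the timestamp, else None."""
    posted = datetime.strptime(created_at, TWITTER_DATE_FORMAT)
    for race, start in RACE_STARTS.items():
        if start <= posted < start + BROADCAST_LENGTH:
            return race
    return None


def split_by_race(tweets):
    """Group tweets by race; tweets outside both windows are dropped."""
    groups = {race: [] for race in RACE_STARTS}
    for tweet in tweets:
        race = race_for(tweet["created_at"])
        if race is not None:
            groups[race].append(tweet)
    return groups
```

With these values the windows run from 14:00 to 16:20 and from 16:45 to 19:05 GMT, so they do not overlap and each Tweet falls into at most one race.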
Turning our attention to classifying the data, we put all the Tweets (excluding retweets) through a public Twitter sentiment analysis module using the sentiment.php script, which yielded the following results:
Positive: 452
Negative: 115
Neutral: 1029
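To make these counts easier to compare with other events, they can be turned into percentage shares. The short sketch below does only that arithmetic on the three figures reported above; the function name is our own.

```python
def sentiment_shares(counts):
    """Convert raw label counts into percentage shares, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


counts = {"Positive": 452, "Negative": 115, "Neutral": 1029}
shares = sentiment_shares(counts)
print(shares)
# {'Positive': 28.3, 'Negative': 7.2, 'Neutral': 64.5}
```

In other words, of the 1596 classified Tweets roughly two thirds were neutral, and positive Tweets outnumbered negative ones by about four to one.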
The script also provides examples of Tweets classified as positive and negative. Negative Tweets, for example, were mostly concerned with incidents in the race rather than the race itself.
Unfortunately due to illness Jgomeznoya will not compete in #WTSStockholm...
And it looks like Gunderson has crashed out on that last lap #WTSStockholm
#bbcsport fuming you didn't show #WTSStockholm ruined my weekend and cost me £10
Extremely disappointed to not start #WTSStockholm. Sick since Friday with a GI bug and will now reload in #WTSHamburg
Hard to think of a more deserved win than the one of @floraduffy at #WTSStockholm , congratulations.
Love the #WTSStockholm course. Technical, fast, and scenic. How a race should be #fullgas
Stockholm hosting a great Tri event today @WTSStockholm with perfect weather ☀️
#WTSStockholm today - after Wales' heroics last night, I'm sure @NonStanford & @heljinx will be looking to achieve a similar feat.
Whilst it is feasible that, given enough data points, a baseline sentiment could be computed and compared between events, perhaps the most useful takeaway is the ability to quickly extract feedback from both positive and negative Tweets.
Finally, we used the Language Detection module to classify the languages most often used with the #WTSStockholm hashtag (see the module category tree for further information about specific language classification).
Tweets using #WTSStockholm hashtag classified by language
This analysis of #WTSStockholm Tweets provides a quick example of what can be achieved using the Live Twitter Feed method of the API in conjunction with the MonkeyLearn service. It would, of course, be possible to extend this analysis by creating custom-trained modules to extract different types of data or to analyse the sentiment towards, say, different athletes.