Twitter Event #Hashtag Analysis
With the World Triathlon Stockholm event having taken place over the weekend, we decided to analyse the tweets surrounding the event to see whether we could determine trends and sentiment or extract any other valuable information from the Twitter stream. Fortunately, each event in the World Triathlon Series uses a standard hashtag for event tweets, which in the case of Stockholm was #WTSStockholm. Naturally, many other Tweets referred to the race without using the hashtag; these are not included in the analysis below.
The Live Twitter Feed method of the API follows event hashtags using the Twitter Streaming API: all Tweets containing the event hashtag are published via a custom websocket API and also persisted for on-demand retrieval. Whilst the primary use of the feed is to provide a curated live feed for applications such as TriathlonLive.tv, we can also use it for post-event analysis and potentially to garner feedback from the event.
As well as some basic Twitter analytics, we will also consider a few additional methods of extracting information from the Twitter feed using machine learning. First, we extract the raw data from the API using the following request, which covers the entire race week:
curl --header "apikey: [[app:key]]" "https://api.triathlon.org/v1/live/twitter?filter=WTSStockholm&start_date=2016-06-27&end_date=2016-07-04"
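The same request can be prepared from a script. The following is a minimal sketch in Python using only the standard library; the function name build_feed_request and the placeholder key are illustrative and not part of the API, and the layout of the response is not described here, so the sketch stops at building the request.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.triathlon.org/v1/live/twitter"


def build_feed_request(api_key, hashtag, start_date, end_date):
    """Build (but do not send) the Live Twitter Feed request shown above."""
    query = urlencode(
        {"filter": hashtag, "start_date": start_date, "end_date": end_date}
    )
    # The API key travels in the "apikey" header, as in the curl command.
    return Request(f"{BASE_URL}?{query}", headers={"apikey": api_key})


# "YOUR_API_KEY" is a placeholder, not a working key.
request = build_feed_request(
    "YOUR_API_KEY", "WTSStockholm", "2016-06-27", "2016-07-04"
)
print(request.full_url)
# To send it: urllib.request.urlopen(request), then parse the body as JSON.
```

Building the query string with urlencode avoids the quoting problems that an unescaped & causes on the command line.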
This tells us that during the race week (Monday to Sunday) there were 2799 Tweets using the hashtag #WTSStockholm, of which 1232 were retweets. Using one of the date fields on each tweet (either timestamp or created_at) we may plot the distribution of Tweets throughout the week. The following chart shows the number of Tweets per day along with the breakdown of retweets.
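The per-day counts behind such a chart can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the article: it assumes created_at holds an ISO 8601 string, that each tweet has a text field, and that retweets can be recognised by the conventional "RT @" prefix. The sample tweets are made up.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(tweets):
    """Return {day: (total, retweets)} keyed by 'YYYY-MM-DD'."""
    totals, retweets = Counter(), Counter()
    for tweet in tweets:
        # Assumption: created_at is ISO 8601, e.g. 2016-07-02T14:05:00+00:00
        day = datetime.fromisoformat(tweet["created_at"]).date().isoformat()
        totals[day] += 1
        # Assumption: retweets start with "RT @" (no dedicated flag assumed)
        if tweet["text"].startswith("RT @"):
            retweets[day] += 1
    return {day: (totals[day], retweets[day]) for day in sorted(totals)}


# Made-up sample data for illustration only.
sample = [
    {"created_at": "2016-07-01T18:30:00+00:00", "text": "Good luck all #WTSStockholm"},
    {"created_at": "2016-07-02T14:05:00+00:00", "text": "RT @someone: Off they go #WTSStockholm"},
    {"created_at": "2016-07-02T16:50:00+00:00", "text": "Men under way #WTSStockholm"},
]
per_day = tweets_per_day(sample)
print(per_day)
```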
It should come as no surprise that Twitter activity spiked on the days immediately before and after the race, with race day itself dominating: ~75% of all Tweets were posted on that day alone. Breaking race day down by hour yields the following chart:
Again, spikes in activity occur at race times, with the Elite Women starting at 14:00 and the Elite Men at 16:45 (all times GMT) and the live TV broadcast lasting 140 minutes for each. The largest spike, at 18:00-19:00, coincides with the end of the men's race.
Whilst Tweet volume alone is an interesting metric that could be compared across events and programs, we would like to extract more detailed information from the Tweets themselves. For that we turn to machine learning algorithms that automate text classification and extraction, so that we can pass in our array of Tweets and have them classified by sentiment or language, or have keywords returned. Whilst such algorithms are beyond the scope of this article, MonkeyLearn provides a ready-made service with public classifiers and extractors that have already been trained for this type of analysis and that we can simply access via an API.
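Since the race times are quoted in GMT, an hourly breakdown needs every timestamp normalised to UTC before bucketing. The sketch below shows one way to do that; it assumes created_at is an ISO 8601 string carrying a UTC offset, and the race-day date and sample tweets are placeholders chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_hour(tweets, race_day):
    """Return 24 counts, one per UTC hour, for race_day ('YYYY-MM-DD')."""
    hours = Counter()
    for tweet in tweets:
        # Assumption: created_at includes an offset, so it can be moved to UTC.
        moment = datetime.fromisoformat(tweet["created_at"]).astimezone(timezone.utc)
        if moment.date().isoformat() == race_day:
            hours[moment.hour] += 1
    return [hours.get(hour, 0) for hour in range(24)]


# Made-up sample; the date is an assumed race day used only for illustration.
sample = [
    {"created_at": "2016-07-02T16:05:00+02:00"},  # 14:05 UTC
    {"created_at": "2016-07-02T18:20:00+00:00"},  # 18:20 UTC
    {"created_at": "2016-07-02T20:40:00+02:00"},  # 18:40 UTC
    {"created_at": "2016-07-03T09:00:00+00:00"},  # different day, ignored
]
counts = tweets_per_hour(sample, "2016-07-02")
peak_hour = counts.index(max(counts))
print(peak_hour, counts[peak_hour])
```

Converting to UTC first means that tweets stored with a local offset land in the same hourly bucket as the GMT race schedule.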
Download the PHP package used to run the following analysis
You may download the code used to run the following analysis from here.
Thanks to the availability of clients for the MonkeyLearn API, this process couldn't be much simpler. We have provided a quick and simple PHP package that performs sentiment analysis and keyword extraction using the MonkeyLearn platform, and we use it for the remainder of this article.
Running the keyword extractor (the keywords.php script) on the entire data set of Tweets, excluding retweets, we get the following result:
Keyword | Count | Relevance
---------------------------
Brownlee | 121 | 0.998
Good luck | 26 | 0.640
chase group | 40 | 0.582
race | 147 | 0.537
lead group | 22 | 0.495
weekend | 61 | 0.486
seconds | 89 | 0.460
lap | 84 | 0.436
triathlonlive | 150 | 0.431
women | 78 | 0.356
Restricting this analysis by time (TV broadcast start and finish times), we may look at the Elite Women and Elite Men races separately; whilst not perfect, this is a decent approximation. The Elite Women's window gives:
Keyword | Count | Relevance
---------------------------
chase group | 27 | 0.996
seconds | 25 | 0.504
lap | 37 | 0.472
2nd chase group | 4 | 0.442
triathlonlive | 50 | 0.398
race | 34 | 0.365
Ueda | 12 | 0.356
Vuelta | 15 | 0.334
much good sport | 3 | 0.332
Juri Ide | 6 | 0.332
Despite leading for the entirety of the bike leg, Flora Duffy is conspicuously missing from the top keywords, although there are references to the group of chasers trying unsuccessfully to track her down. This is likely a failure of the extractor to recognise athlete names expressed in different ways (e.g. a nickname, Twitter handle or full name), in which case an extraction module trained specifically for this purpose would be required. The Elite Men's window gives:
Keyword | Count | Relevance
---------------------------
Brownlee | 63 | 0.997
lead group | 17 | 0.857
Brownlee Brothers | 9 | 0.454
seconds | 48 | 0.414
triathlonlive | 66 | 0.403
WTS podium | 11 | 0.403
chasing | 23 | 0.375
first chase group | 4 | 0.336
discounted season pass | 4 | 0.336
220Triathlon | 34 | 0.323
Here, as expected, the Brownlee keyword dominates, and the different groups and triathlonlive once again feature prominently.
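The time restriction used for the two per-race tables can be sketched as a simple window filter: each race is approximated by its broadcast start time plus the 140-minute broadcast length quoted earlier. The function name, the race-day date and the sample tweets are assumptions for illustration, and created_at is again assumed to be an ISO 8601 string with a UTC offset.

```python
from datetime import datetime, timedelta, timezone

BROADCAST_LENGTH = timedelta(minutes=140)


def in_broadcast(tweets, start):
    """Keep tweets posted in [start, start + 140 minutes)."""
    end = start + BROADCAST_LENGTH
    return [
        tweet
        for tweet in tweets
        if start <= datetime.fromisoformat(tweet["created_at"]) < end
    ]


# Start times from the article (GMT); the date itself is an assumed race day.
women_start = datetime(2016, 7, 2, 14, 0, tzinfo=timezone.utc)
men_start = datetime(2016, 7, 2, 16, 45, tzinfo=timezone.utc)

# Made-up sample tweets for illustration only.
sample = [
    {"created_at": "2016-07-02T14:30:00+00:00", "text": "Chase group closing"},
    {"created_at": "2016-07-02T16:30:00+00:00", "text": "Between races"},
    {"created_at": "2016-07-02T18:55:00+00:00", "text": "Sprint finish"},
    {"created_at": "2016-07-02T19:05:00+00:00", "text": "Broadcast over"},
]
women_tweets = in_broadcast(sample, women_start)
men_tweets = in_broadcast(sample, men_start)
print(len(women_tweets), len(men_tweets))
```

With these start times the two windows (14:00 to 16:20 and 16:45 to 19:05) do not overlap, so no tweet is attributed to both races.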
Turning our attention to classifying the data, we put all the Tweets (excluding retweets) through a public Twitter sentiment analysis module using the sentiment.php script, which yielded the following results:
Positive: 452
Negative: 115
Neutral: 1029
The script also provides examples of Tweets classified as positive and negative. Negative Tweets, for example, were mostly concerned with incidents in the race rather than the race itself. Examples of each follow, negative first:
Unfortunately due to illness Jgomeznoya will not compete in #WTSStockholm...
And it looks like Gunderson has crashed out on that last lap #WTSStockholm
#bbcsport fuming you didn't show #WTSStockholm ruined my weekend and cost me £10
Extremely disappointed to not start #WTSStockholm. Sick since Friday with a GI bug and will now reload in #WTSHamburg
Hard to think of a more deserved win than the one of @floraduffy at #WTSStockholm , congratulations.
Love the #WTSStockholm course. Technical, fast, and scenic. How a race should be #fullgas
Stockholm hosting a great Tri event today @WTSStockholm with perfect weather ☀️
#WTSStockholm today - after Wales' heroics last night, I'm sure @NonStanford & @heljinx will be looking to achieve a similar feat.
Whilst it is feasible that, given enough data points, a baseline sentiment could be computed and compared between events, perhaps the most useful takeaway is the ability to quickly extract feedback from both positive and negative Tweets.
Finally, we used the Language Detection module to classify the languages most often used with the #WTSStockholm hashtag (see the module category tree for further information about specific language classification).
This analysis of #WTSStockholm Tweets provides a quick example of what is possible using the Live Twitter Feed method of the API in conjunction with the MonkeyLearn service. It would of course be possible to extend this analysis by creating custom-trained modules to extract different types of data or to analyse the sentiment towards, say, different athletes.
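One possible form for such a baseline is sketched below, using the counts reported above. The "net sentiment" figure (positive minus negative, divided by the total) is an illustrative definition chosen for this sketch, not a metric produced by MonkeyLearn or by the PHP package.

```python
def sentiment_summary(positive, negative, neutral):
    """Turn raw class counts into shares and a single net score."""
    total = positive + negative + neutral
    return {
        "positive_share": positive / total,
        "negative_share": negative / total,
        "neutral_share": neutral / total,
        # Illustrative definition: ranges from -1 (all negative) to +1 (all positive).
        "net_sentiment": (positive - negative) / total,
    }


# Counts taken from the sentiment results reported above.
summary = sentiment_summary(positive=452, negative=115, neutral=1029)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Normalising by the total makes the figure comparable between events with very different Tweet volumes.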