{"_id":"577acf4a8f467e0e00868c62","__v":0,"initVersion":{"_id":"55773a5ba042551900b002ce","version":"1"},"user":{"_id":"546d17e2eb9cfd1400dd4529","username":"","name":"World Triathlon"},"project":"55773a5ba042551900b002cb","createdAt":"2016-07-04T21:04:10.115Z","changelog":[],"slug":"twitter-event-hashtag-analysis","title":"Twitter Event #Hashtag Analysis"}

Twitter Event #Hashtag Analysis


Following last weekend's [World Triathlon Stockholm](http://stockholm.triathlon.org/) event, we analysed the Tweets surrounding it to see whether we could identify trends and sentiment or extract any other valuable information from the Twitter stream. Fortunately, each event in the World Triathlon Series uses a standard hashtag for event Tweets, which in the case of Stockholm was **#WTSStockholm**. Naturally, many other Tweets referred to the race without using the hashtag; these are not included in the analysis below.

The [Live Twitter Feed](https://developers.triathlon.org/docs/live-twitter-feed) method of the API follows event hashtags using the [Twitter Streaming API](https://dev.twitter.com/streaming/overview), so that all Tweets containing the event hashtag are published via a [custom websocket API](https://developers.triathlon.org/docs/twitter-stream) and persisted for on-demand retrieval. Whilst the primary use of the feed is to provide a curated live feed for applications such as [TriathlonLive.tv](https://triathlonlive.tv), we can also use it for post-event analysis and potentially to gather feedback on the event.

As well as some basic Twitter analytics, we will consider a few additional methods of extracting information from the Twitter feed using machine learning.
Firstly, we extract the raw data from the API using the following request, which covers the entire race week:
[block:code]
{
  "codes": [
    {
      "code": "curl --header \"apikey: [[app:key]]\" 'https://api.triathlon.org/v1/live/twitter?filter=WTSStockholm&start_date=2016-06-27&end_date=2016-07-04'",
      "language": "curl",
      "name": "Extracting all tweets from Live Twitter Feed endpoint"
    }
  ]
}
[/block]
This tells us that during race week (Monday to Sunday) there were 2799 Tweets using the hashtag #WTSStockholm, of which 1232 were retweets. Using one of the date fields on each Tweet (either `timestamp` or `created_at`) we can plot the distribution of Tweets throughout the week. The following chart shows the number of Tweets per day along with the breakdown of retweets.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/Jpt4X6hjReOSzem2JsKV_TweetsByDay.png",
        "TweetsByDay.png",
        "1144",
        "349",
        "#f97b0d",
        ""
      ],
      "caption": "Number of Tweets using #WTSStockholm by day during race week",
      "border": true
    }
  ]
}
[/block]
It should come as no surprise that Twitter activity rose on the days immediately before and after the race, with race day itself accounting for ~75% of all Tweets. Breaking race day down by hour yields the following chart:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/hZR743R2Rt60XwcqVraA_TweetsByHours-July2nd.png",
        "TweetsByHours-July2nd.png",
        "1026",
        "611",
        "#fa7b0d",
        ""
      ],
      "border": true,
      "sizing": "smart",
      "caption": "Number of Tweets using #WTSStockholm per hour on race day (July 2nd)"
    }
  ]
}
[/block]
Again, spikes in activity occur at race times, with the Elite Women starting at 14:00 and the Elite Men at 16:45 (all times GMT) and the live TV broadcast of each race lasting 140 minutes. The largest spike, between 18:00 and 19:00, coincides with the end of the men's race.

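
The per-day and per-hour counts behind the two charts above can be reproduced with a few lines of code. The following is a minimal Python sketch, not the code used for this article. It assumes the API response has already been decoded into a list of dicts, that `created_at` is an ISO 8601 string in GMT, and that a hypothetical boolean field `is_retweet` marks retweets; the actual field names and date format returned by the API may differ.

```python
from collections import Counter
from datetime import datetime


def parse_time(value):
    # Assumption: timestamps look like "2016-07-02T14:05:00" (ISO 8601, GMT).
    return datetime.fromisoformat(value)


def tweets_per_day(tweets):
    """Return {"YYYY-MM-DD": (total_tweets, retweets)} for each day seen."""
    totals = Counter()
    retweets = Counter()
    for tweet in tweets:
        day = parse_time(tweet["created_at"]).date().isoformat()
        totals[day] += 1
        # "is_retweet" is a hypothetical field name used for illustration.
        if tweet.get("is_retweet", False):
            retweets[day] += 1
    return {day: (totals[day], retweets[day]) for day in sorted(totals)}


def tweets_per_hour(tweets, day):
    """Return {hour: count} for the Tweets posted on the given day."""
    counts = Counter()
    for tweet in tweets:
        posted = parse_time(tweet["created_at"])
        if posted.date().isoformat() == day:
            counts[posted.hour] += 1
    return dict(sorted(counts.items()))
```

Calling `tweets_per_day(tweets)` gives the data for the daily chart, and `tweets_per_hour(tweets, "2016-07-02")` gives the hourly breakdown for race day.
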
Whilst Tweet volume alone is an interesting metric that could be compared across events and programs, we want to extract more detailed information from the Tweets themselves. For that we turn to [machine learning algorithms](https://en.wikipedia.org/wiki/Machine_learning) that automate text classification and extraction, so that we can pass in our array of Tweets and have them classified by sentiment or language, or have keywords returned. Whilst such algorithms are beyond the scope of this article, [MonkeyLearn](http://monkeylearn.com/) provides a ready-made service with public [classifiers](https://app.monkeylearn.com/main/classifiers/cl_qkjxv9Ly/) and [extractors](https://app.monkeylearn.com/main/extractors/ex_y7BPYzNG/) that have already been *trained* for this type of analysis and that we can access via an API.
[block:callout]
{
  "type": "success",
  "title": "Download the PHP package used to run the following analysis",
  "body": "You may download the code used to run the following analysis from [here](https://github.com/World-Triathlon/TwitterAnalysis)."
}
[/block]
Thanks to the availability of client libraries for the MonkeyLearn API, this process is straightforward, and we have provided a [quick and simple package in PHP](https://github.com/World-Triathlon/TwitterAnalysis) that performs sentiment analysis and keyword extraction using the MonkeyLearn platform. We use it for the remainder of this article.

Running the [keyword extractor](https://app.monkeylearn.com/main/extractors/ex_y7BPYzNG/) over the entire data set of Tweets, excluding retweets, using the `keywords.php` script gives the following result:
[block:code]
{
  "codes": [
    {
      "code": "Keyword | Count | Relevance\n---------------------------\nBrownlee | 121 | 0.998 \nGood luck | 26 | 0.640 \nchase group | 40 | 0.582 \nrace | 147 | 0.537 \nlead group | 22 | 0.495 \nweekend | 61 | 0.486 \nseconds | 89 | 0.460 \nlap | 84 | 0.436 \ntriathlonlive | 150 | 0.431 \nwomen | 78 | 0.356 ",
      "language": "text",
      "name": "Keyword extraction on all Tweets"
    }
  ]
}
[/block]
Restricting this analysis by time (TV broadcast start and finish times), we can look at the Elite Men and Elite Women races separately (whilst not perfect, this is a decent approximation):
[block:code]
{
  "codes": [
    {
      "code": "Keyword | Count | Relevance\n---------------------------\nchase group | 27 | 0.996 \nseconds | 25 | 0.504 \nlap | 37 | 0.472 \n2nd chase group | 4 | 0.442 \ntriathlonlive | 50 | 0.398 \nrace | 34 | 0.365 \nUeda | 12 | 0.356 \nVuelta | 15 | 0.334 \nmuch good sport | 3 | 0.332 \nJuri Ide | 6 | 0.332 ",
      "language": "text",
      "name": "Keyword extraction during Elite Women TV broadcast"
    }
  ]
}
[/block]
Despite leading for the entirety of the bike leg, Flora Duffy is conspicuously missing from the top keywords, although there are references to the group of chasers trying unsuccessfully to track her down. This is likely because the extractor fails to identify athlete names that are expressed in different ways, e.g. a nickname, Twitter handle or full name, in which case specific training of such an extraction module would be required.
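
The time restriction and the name-variant problem described above can both be illustrated in a short Python sketch. It is not taken from the PHP package. It assumes each broadcast window opens at the race start time quoted earlier and lasts 140 minutes, that `created_at` is a timezone-naive ISO 8601 string in GMT, and that `is_retweet` and `text` are the relevant field names (both hypothetical). The alias list is likewise only an example.

```python
from datetime import datetime, timedelta

BROADCAST_MINUTES = 140

# Assumption: each broadcast window opens at the race start time (GMT).
WINDOWS = {
    "elite_women": datetime(2016, 7, 2, 14, 0),
    "elite_men": datetime(2016, 7, 2, 16, 45),
}


def tweets_in_window(tweets, start, minutes=BROADCAST_MINUTES):
    """Keep original Tweets (no retweets) posted in [start, start + minutes)."""
    end = start + timedelta(minutes=minutes)
    return [
        tweet
        for tweet in tweets
        if not tweet.get("is_retweet", False)
        and start <= datetime.fromisoformat(tweet["created_at"]) < end
    ]


def count_alias_mentions(tweets, aliases):
    """Count Tweets whose text contains any of the aliases (case-insensitive)."""
    lowered = [alias.lower() for alias in aliases]
    return sum(
        1
        for tweet in tweets
        if any(alias in tweet["text"].lower() for alias in lowered)
    )
```

For example, `count_alias_mentions(tweets_in_window(tweets, WINDOWS["elite_women"]), ["Flora Duffy", "@floraduffy", "Duffy"])` counts the Tweets during the women's broadcast that mention the race leader in any of those forms, which is the kind of grouping a general-purpose keyword extractor does not perform on its own.
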
[block:code]
{
  "codes": [
    {
      "code": "Keyword | Count | Relevance\n---------------------------\nBrownlee | 63 | 0.997 \nlead group | 17 | 0.857 \nBrownlee Brothers | 9 | 0.454 \nseconds | 48 | 0.414 \ntriathlonlive | 66 | 0.403 \nWTS podium | 11 | 0.403 \nchasing | 23 | 0.375 \nfirst chase group | 4 | 0.336 \ndiscounted season pass | 4 | 0.336 \n220Triathlon | 34 | 0.323 ",
      "language": "text",
      "name": "Keyword extraction during Elite Men TV broadcast"
    }
  ]
}
[/block]
Here, as expected, the `Brownlee` keyword dominates, and the different `groups` and `triathlonlive` once again feature prominently.

Turning our attention to classification, we put all the Tweets (excluding retweets) through a public [Twitter Sentiment analysis module](https://app.monkeylearn.com/main/classifiers/cl_qkjxv9Ly/) using the `sentiment.php` script, which yielded the following results:
[block:code]
{
  "codes": [
    {
      "code": "Positive: 452 \nNegative: 115 \nNeutral: 1029 ",
      "language": "text",
      "name": "Sentiment Analysis"
    }
  ]
}
[/block]
The script also provides examples of Tweets classified as `positive` and `negative`. Negative Tweets, for example, were mostly concerned with incidents in the race rather than the race itself.
[block:code]
{
  "codes": [
    {
      "code": "Unfortunately due to illness Jgomeznoya will not compete in #WTSStockholm...\nAnd it looks like Gunderson has crashed out on that last lap #WTSStockholm\n#bbcsport fuming you didn't show #WTSStockholm ruined my weekend and cost me £10\nExtremely disappointed to not start #WTSStockholm. Sick since Friday with a GI bug and will now reload in #WTSHamburg",
      "language": "text",
      "name": "Example Negative Tweets"
    }
  ]
}
[/block]

[block:code]
{
  "codes": [
    {
      "code": "Hard to think of a more deserved win than the one of @floraduffy at #WTSStockholm , congratulations.\nLove the #WTSStockholm course. Technical, fast, and scenic. How a race should be #fullgas\nStockholm hosting a great Tri event today @WTSStockholm with perfect weather ☀️\n#WTSStockholm today - after Wales' heroics last night, I'm sure @NonStanford & @heljinx will be looking to achieve a similar feat. ",
      "language": "text",
      "name": "Example Positive Tweets"
    }
  ]
}
[/block]
Whilst it is feasible that, given enough data points, a baseline sentiment could be computed and compared between events, perhaps the most useful takeaway is the ability to quickly extract feedback from both positive and negative Tweets.

Finally, we used the [Language Detection](https://app.monkeylearn.com/main/classifiers/cl_oJNMkt2V/tab/sandbox-tab) module to classify the languages most often used with the #WTSStockholm hashtag (see the module category tree for further information about specific language classification).
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/EIa03xlqRAWnivapLYJU_TweetsByLanguage.png",
        "TweetsByLanguage.png",
        "1304",
        "280",
        "#1c73b3",
        ""
      ],
      "caption": "Tweets using #WTSStockholm hashtag classified by language"
    }
  ]
}
[/block]
This analysis of #WTSStockholm Tweets provides a quick example of what can be achieved using the [Live Twitter Feed](https://developers.triathlon.org/docs/live-twitter-feed) method of the API in conjunction with the [MonkeyLearn](http://monkeylearn.com/) service. It would of course be possible to extend this analysis by creating custom-trained modules to extract different types of data or to analyse sentiment towards, say, individual athletes.
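
As a starting point for the baseline sentiment comparison between events mentioned earlier, the raw counts can be turned into shares that do not depend on Tweet volume. The sketch below is one possible approach in Python rather than part of the published package. The "net sentiment" figure (positive share minus negative share) is an illustrative metric chosen here, not one defined by MonkeyLearn or by the API.

```python
def sentiment_summary(counts):
    """Return (shares, net) for a {"positive": n, "negative": n, "neutral": n} dict.

    shares maps each label to its fraction of all classified Tweets;
    net is the positive share minus the negative share (illustrative metric).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified Tweets to summarise")
    shares = {label: count / total for label, count in counts.items()}
    net = (counts.get("positive", 0) - counts.get("negative", 0)) / total
    return shares, net


# Counts reported by the sentiment analysis in this article.
stockholm = {"positive": 452, "negative": 115, "neutral": 1029}
shares, net = sentiment_summary(stockholm)
print({label: round(share, 3) for label, share in shares.items()}, round(net, 3))
```

For the Stockholm counts this gives roughly 28% positive, 7% negative and 64% neutral. Computing the same figures for other events would allow a like-for-like comparison regardless of how many Tweets each event attracted.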