Friday, May 13, 2016

Social Media Analytics - Twitter

Analysing voter sentiment during an election is a rewarding exercise for a data scientist. Social media is the one place where public sentiment and mood are reflected almost instantly. Tweets and posts appear in near real time, and comments, discussions and arguments explode while a topic is hot. Such a steady flow of public opinion is a goldmine for a data scientist.

Twitter hashtags, and the ability to fetch all tweets carrying a particular hashtag, let a data scientist collect tweets around a given topic. With the right tools, it is easy to interface with the Twitter APIs and pull those tweets.

The exercise below was done purely for its academic value. For real-world use, it would need further refinement. Yet it offers substantial insight into the process of mining Twitter.

I selected the #KeralaElections2016 and #KeralaPolls2016 hashtags, which are getting a decent number of tweets about the upcoming state elections. To pull tweets, I created a Twitter application under my own login at https://apps.twitter.com/. The application's API key and secret, together with its access token and token secret, are later used to gain access to the tweets we are looking for.

The tool I selected for this exercise is R, an open source analytics tool that is available on most OS platforms and has visualization capabilities. The standard R installation provides a command line interface for developers. However, an IDE like RStudio makes the development process a lot easier.

The following R libraries are needed at a minimum:

twitteR (Twitter interface library)
ROAuth  (R library for Open Authentication)
ggplot2 (library for plotting charts and graphs)

A new R script should load the necessary libraries, preferably at the beginning. The Twitter authentication function is then invoked.

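library(twitteR)  # Twitter interface
library(ROAuth)   # OAuth support
library(ggplot2)  # charts and graphs

api_key <- "YOUR API KEY"
api_secret <- "YOUR API SECRET"
access_token <- "YOUR ACCESS TOKEN"
access_token_secret <- "YOUR ACCESS TOKEN SECRET"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)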

The parameters to the function all come from the Twitter application created earlier. When the function is invoked, R asks whether to cache the OAuth credentials so that they can be reused across R sessions. Once authentication succeeds, we can invoke the searchTwitter() function to fetch tweets having the desired hashtags.

KeralaPollsTweets.list = searchTwitter("#KeralaElections2016 #KeralaPolls2016", n=1000)

The output of searchTwitter() is a list; for all practical purposes, it is better to store it as a data frame in R. The twitteR library provides the twListToDF() function for this conversion.

KeralaPollsTweets.df = twListToDF(KeralaPollsTweets.list)

Now that we have all the tweets in our data frame, we can proceed to analyze the text column, which contains the tweet itself. There are two things we are trying to find out from any given tweet:

  1. Overall sentiment/mood conveyed in the tweet

    This is accomplished by splitting the tweet into words (after removing all special characters) and then matching them against a standard set of positive and negative words. https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon provides a list of positive and negative opinion/sentiment words (English only) that can be used for this purpose. After analyzing the tweets, I added a number of commonly found words to the lists to improve accuracy.

    For each tweet, the matches against the positive and negative words are counted, and their difference (positive minus negative) gives the sentiment score. A negative score indicates negative sentiment and a positive score indicates positive sentiment. A score of 0 means either that positive and negative words balance each other out or that the tweet has no meaningful information from which to calculate a score.

    Note: Although this method is simple, its accuracy may vary, as mentioned by the lexicon authors themselves. For more accurate results, text mining should also take phrases and sentence grammar into account.

    References:

    1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
    2. Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

  2. Political party being referred to in the tweet

    To add meaning to the sentiment scores, it is better to identify the political party being referenced in the tweet. For simplicity, I selected only three major political fronts. They are: a. UDF (United Democratic Front) b. LDF (Left Democratic Front) c. NDA (National Democratic Alliance)

    To identify the party being referenced in a tweet, I followed the same approach as for sentiment and created a keyword file for each of the three parties. Each tweet was then run through these keyword files and the number of matches against each party's keywords was noted. The tweet was tagged to the party with the highest match count.

    Note: Not all tweets can be classified this way, as some tweets do not contain enough information to identify the party being referred to. Such tweets are tagged as UNKNOWN.
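
The scoring step described in item 1 can be sketched roughly as below. This is a minimal illustration, not the script used for this post: the function name score_tweet, the tiny word lists and the sample tweets are all made up for the example, whereas the real exercise matches against the full opinion lexicon plus the extra words added by hand.

```r
# Toy word lists, for illustration only (assumption): the post uses the much
# larger opinion lexicon linked above, extended with hand-picked words.
pos_words <- c("good", "great", "win", "support")
neg_words <- c("bad", "corrupt", "fail", "scam")

# Score one tweet as (number of positive matches) - (number of negative matches).
score_tweet <- function(text, pos_words, neg_words) {
  clean <- tolower(text)
  clean <- gsub("[^a-z0-9 ]", " ", clean)   # replace special characters with spaces
  words <- unlist(strsplit(clean, " +"))    # split into words
  words <- words[words != ""]               # drop empty tokens
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

# Made-up sample tweets (assumption), not taken from the collected data.
tweets <- c("Great rally, good support for the candidate!",
            "Another scam. Corrupt and bad governance",
            "Polling is on Monday #KeralaPolls2016")

scores <- vapply(tweets, score_tweet, numeric(1),
                 pos_words = pos_words, neg_words = neg_words,
                 USE.NAMES = FALSE)
print(scores)
```

With these toy lists, the first tweet scores positive, the second negative, and the third lands on 0 because none of its words appear in either list.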

To perform the above categorization, we can use a simple loop that goes through each tweet and performs the matching, or we can resort to “apply”-style functions in R. Using such a function (laply in this case) avoids explicit looping in the R script and hands the iteration over to library code. This keeps the script concise and can improve performance over loop-based matching. laply comes from the “plyr” library, which must be loaded to use it.
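
As a rough sketch of the loop-free style, the party tagging could look like the code below. It uses base R's vapply as a stand-in for plyr's laply so that the example runs without extra packages. The keyword lists, the function name tag_party and the sample tweets are hypothetical, and treating a tie between parties as UNKNOWN is my assumption here, since the rule for ties is not spelled out above.

```r
# Hypothetical keyword lists (assumption); the actual keyword files are not shown.
party_keywords <- list(
  UDF = c("udf", "congress"),
  LDF = c("ldf", "cpm"),
  NDA = c("nda", "bjp")
)

# Tag one tweet with the party whose keywords match most often.
tag_party <- function(text, keywords) {
  clean <- gsub("[^a-z0-9 ]", " ", tolower(text))
  words <- unlist(strsplit(clean, " +"))
  # One match count per party, computed without an explicit loop.
  counts <- vapply(keywords, function(k) sum(words %in% k), numeric(1))
  top <- names(counts)[counts == max(counts)]
  # No match at all, or a tie for the top count (assumed rule), gives UNKNOWN.
  if (max(counts) == 0 || length(top) > 1) "UNKNOWN" else top
}

# Made-up sample tweets (assumption).
tweets <- c("BJP and NDA rally in the city",
            "LDF manifesto released",
            "Heavy rain on polling day",
            "UDF vs LDF debate tonight")

parties <- vapply(tweets, tag_party, character(1),
                  keywords = party_keywords, USE.NAMES = FALSE)
print(parties)
```

Note that a hashtag such as #UrVoteForUDF becomes the single token "urvoteforudf" after cleaning, so whole-word matching like this would not pick up the party name inside it.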

Now that all the tweets are classified and scored, the next step is to plot a graph from this data. The ggplot2 library provides extensive facilities for charting.

ggplot call to plot the stacked bar chart from the processed data:
ggplot(data=KeralaPollsTweets.findf ) +  #dataframe having all processed data
  geom_bar(mapping=aes(x=score, fill=party)) +
  theme_bw() + # plain display, nicer colors
  labs(x="Sentiment Score",y="Volume of Tweets") +
  guides(fill=guide_legend(title="Political Front")) + 
  ggtitle("Twitter Sentiment on #KeralaElections2016, #KeralaPolls2016") + 
  scale_y_continuous(breaks=seq(0,260,10)) +
  scale_x_continuous(breaks=seq(-5,5,1)) 

A sample result is shown below. The total number of tweets considered is 333.

Sentiment Score   LDF   NDA   UDF   UNKNOWN
-3                  0     0     1         1
-2                  0     2     0         1
-1                  0     5     0        14
 0                 38   184    11        55
 1                  1     8     2         9
 2                  1     0     0         0

Stacked bar chart drawn with the tabulated data

Inferences drawn from the chart
  1. The most talked about political party on Twitter (at least in state election related tweets) is NDA. LDF comes next. This is evident from the volume of tweets against each sentiment score. The ruling party (UDF) gets the fewest tweets.
  2. NDA has the most positive tweets (8 at a score of +1). UDF and LDF get equal positive sentiment, with two positive tweets each, though only LDF touches a +2 sentiment score (just 1 tweet).
  3. NDA has the most negative tweets too (scores -1 and -2). LDF does not get any negative score, and the only negative score for UDF is the one discussed in the anomalies section.
  4. The maximum negative and positive sentiment being attributed to NDA could be a side effect of the higher tweet volume for NDA compared with the other two parties.
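
To check how much of inference 4 is down to volume, the counts in the table can be normalized by each party's total. The snippet below is a small sketch that re-enters the tabulated numbers by hand; the variable names are mine and it is not part of the original script.

```r
# Counts copied from the table above (rows: sentiment score, columns: party).
counts <- matrix(
  c( 0,   0,  1,  1,
     0,   2,  0,  1,
     0,   5,  0, 14,
    38, 184, 11, 55,
     1,   8,  2,  9,
     1,   0,  0,  0),
  ncol = 4, byrow = TRUE,
  dimnames = list(score = c("-3", "-2", "-1", "0", "1", "2"),
                  party = c("LDF", "NDA", "UDF", "UNKNOWN"))
)

totals <- colSums(counts)                       # tweets per party
neg <- colSums(counts[c("-3", "-2", "-1"), ])   # tweets with a negative score
pos <- colSums(counts[c("1", "2"), ])           # tweets with a positive score

# Percentage of each party's tweets that are negative or positive.
shares <- round(100 * sweep(rbind(negative = neg, positive = pos),
                            2, totals, "/"), 1)
print(totals)
print(shares)
```

On these numbers, NDA has 199 of the 333 tweets, but only about 3.5% of them are negative and about 4.0% positive, so in relative terms NDA does not stand out the way the raw counts suggest.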

Anomalies

As mentioned earlier, sentiment analysis by matching against a positive/negative sentiment lexicon can bring in a lot of false positives and false negatives.

The -3 score attributed to UDF in the above chart is a perfect example. On analyzing the set of tweets, I found that this single tweet was responsible:

#UrVoteForUDF is a vote agnst fascism. It's one smash against divisive forces #KeralaElections2016 #KeralaPolls2016

Words in the tweet that matched the negative word list took it to a score of -3. In reality, however, the tweet conveys positive sentiment towards UDF. Hence this entry should be treated as a +3 instead of -3.

For the record, this exercise was done 3 days before the polling day.
