Wednesday, May 18, 2016

Social Media Analytics - Facebook


Maintaining brand presence in social media has become one of the top priorities for marketing departments. Brands are doing everything possible to create buzz and present themselves in the digital window of every potential consumer. While we move towards an era where consumers can see, experience and buy almost everything online, it is crucial that companies retain their brand recall in the social media world.

This study analyzes the effort spent by two major car manufacturers on Facebook and the response from users. The products chosen are two mass-market cars: Maruti Suzuki Alto 800 and Tata Nano GenX. Both products fall under category A cars and are meant for volume sales. The original intention was to include Renault Kwid and Hyundai Eon too, but Renault has posted very little on its official Facebook page (hardly 4 posts over a year), while Hyundai has not created a page for the Eon at all.

Facebook posts made by each manufacturer and the number of likes they received were collected and analyzed. Posts were collected from the launch date of each car onward to analyze the trend.

The following chart shows the number of Facebook posts made by each manufacturer per month since market launch.

The blue line shows the monthly Facebook post count for Maruti Alto 800 from 2013 to date, and the red line shows Tata Nano GenX from May 2015.

Inferences:

  1. Alto 800 has put in considerably more effort than Nano GenX in maintaining buzz on Facebook, right from product launch to date. The post-count graph shows a significant difference until January 2016. Thereafter, both maintain almost the same number of posts.
  2. As expected, the initial euphoria on social media comes down as the product ages. In 2016, both products are spending almost the same amount of effort on Facebook. However, Alto 800 came down to this point two years after launch, whereas Nano GenX came down in a year's time.

Further to this, the monthly sales figures of Alto 800 and Nano GenX are compared against the number of likes they collected on Facebook.

Note:

  1. Nano GenX was launched in May 2015. Hence its sales data is captured only from May 2015.
  2. Monthly sales data was obtained manually from www.team-bhp.com
  3. These manufacturer-reported sales numbers are factory dispatches to dealerships. They are NOT retail sales figures to end customers.

There are some interesting patterns emerging out of these two graphs.

Inferences

  1. For both cars, the number of likes collected on Facebook (green line) follows the same pattern as the sales line (blue). Although the sales figures do not move with the same intensity as the Facebook likes, their peaks and dips show a distinct correlation with the pattern of like volumes.

    This suggests that buyers express their positive sentiment on social media around the time of purchase. The sales line is smoother than the likes line, which indicates that not everyone who likes the product buys it. The gap between the sales and likes figures could represent the layer of potential customers. This trend also suggests that Facebook is a very good medium for connecting with customers.

  2. Likes are substantially high during product launch and the interest flattens out as the product ages. The manufacturer runs events whenever there is a dip in users' Facebook activity. This is evident from the increased number of posts from Maruti in April/May 2015, and again in Nov/Dec 2015. During these months, although sales were flat, Maruti managed to improve user activity on Facebook by posting more.

    In Aug 2015, Maruti launched two contests that reached approximately 130,000 Facebook users, whereas the month before only around 5,000 to 10,000 people liked Alto. In the case of Nano, however, there were no events that broke the regular pattern of like counts (fewer than 5,000 per month).
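
The correlation between sales and likes noted in these inferences can be quantified rather than only read off the charts. The following is a minimal sketch using base R's cor(); the two vectors are placeholder values chosen for illustration, not the sales or like counts from this study.

```r
# Placeholder monthly series (NOT the figures used in this study).
sales <- c(100, 120, 90, 110, 130, 95)
likes <- c(5000, 9000, 4000, 7000, 12000, 4500)

# Pearson correlation between the two monthly series;
# values near +1 mean the peaks and dips move together.
r <- cor(sales, likes)

# Correlating month-over-month changes is a stricter check, because
# differencing reduces the effect of any shared long-term trend.
r_changes <- cor(diff(sales), diff(likes))

print(c(levels = r, changes = r_changes))
```

A high value on the differenced series would support the reading that peaks and dips coincide, but with only a few months of data such a number should be treated as indicative only.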

Technical Details

The technology platform used for this study is R (www.r-project.org). The “Rfacebook” library provides API-level connectivity to Facebook data if you have an application registered on Facebook.

Besides connectivity to social media like Facebook and Twitter, R provides a good ecosystem for statistical analysis and visualization.
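
As a minimal sketch of the monthly aggregation behind the charts above, assume the posts have already been fetched into a data frame with a created_time column (ISO 8601 text) and a likes_count column, which is the shape I understand Rfacebook's getPage() to return. The helper name monthly_summary and the three sample rows are made up for illustration and are not data from this study.

```r
# Hypothetical helper: summarise posts per calendar month.
# 'posts' is assumed to have a character column created_time
# (e.g. "2015-05-19T10:30:00+0000") and a numeric column likes_count.
monthly_summary <- function(posts) {
  month <- substr(posts$created_time, 1, 7)  # keep "YYYY-MM"
  counts <- aggregate(list(posts = posts$likes_count),
                      by = list(month = month), FUN = length)
  likes <- aggregate(list(likes = posts$likes_count),
                     by = list(month = month), FUN = sum)
  merge(counts, likes, by = "month")
}

# Illustrative rows only.
sample_posts <- data.frame(
  created_time = c("2015-05-19T10:30:00+0000",
                   "2015-05-25T08:00:00+0000",
                   "2015-06-02T12:15:00+0000"),
  likes_count = c(120, 80, 45),
  stringsAsFactors = FALSE
)

print(monthly_summary(sample_posts))
```

The resulting data frame has one row per month with the number of posts and the total likes, which is the form needed to plot the post-count and like-count lines.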

Further Analysis Proposed

It will be beneficial for manufacturers to understand what people are talking about in social media with respect to their products, so that they get better insight into the minds of their potential customers. Lexical analysis of comments made by Facebook users will produce a word cloud surrounding a product. Classifying and summarizing this data and then relating it to product-related features or shortcomings can be very helpful for the industry.
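
A first step toward such a word cloud is a plain word-frequency table. The sketch below uses base R only; the comments and the stop-word list are invented for illustration and are not taken from any Facebook page.

```r
# Made-up comments standing in for real Facebook comments.
comments <- c("Great mileage and great price",
              "Mileage is good but the boot is small",
              "Price is right")

# Lower-case, strip everything except letters and spaces, split on spaces.
clean <- gsub("[^a-z ]", "", tolower(comments))
words <- unlist(strsplit(clean, " +"))

# Drop a tiny, illustrative stop-word list and count what remains.
stop_words <- c("and", "is", "but", "the")
words <- words[nchar(words) > 0 & !(words %in% stop_words)]
freq <- sort(table(words), decreasing = TRUE)

print(head(freq, 5))
```

The frequency table can then be handed to a plotting library to draw the cloud, or grouped by product feature for the classification step described above.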

Friday, May 13, 2016

Social Media Analytics - Twitter

Analysing the sentiment of voters during election time is a really good experience for a data scientist. The one place where public sentiment and mood are reflected at the speed of thought is social media. Tweets and posts appear in near real time. Comments, discussions and arguments explode right when a topic is hot. Such a steady flow of public opinion is a goldmine for a data scientist.

Twitter hashtags, and the facility to fetch all tweets carrying a particular hashtag, help a data scientist collect tweets around a given topic. With the right tools, it is really easy to interface with the Twitter APIs and pull tweets for a given hashtag.

The exercise below was done purely for its academic value. For real-world usage, it may require further refinement. Yet it offers substantial insight into the process of mining Twitter.

I selected the #KeralaElections2016 and #KeralaPolls2016 hashtags, which are getting a decent number of tweets on the upcoming state elections. A Twitter application was created using my own login to pull tweets. This can be done by logging into https://apps.twitter.com/. Later, this application's access token and secret key are used to gain access to the tweets that we are looking for.

The tool I selected for this exercise is R. It is an open source analytics tool available on most OS platforms, and it also has visualization capabilities. A standard R installation provides a command line interface for developers. However, an IDE like RStudio makes the development process a lot easier.

The following R libraries are needed at the minimum:

twitteR (Twitter interface library)
ROAuth  (R library for Open Authentication)
ggplot2 (library for plotting charts and graphs)

A new R script should load the necessary libraries at the beginning (preferably). Then the Twitter authentication function is invoked.

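library(twitteR)   # Twitter interface: setup_twitter_oauth(), searchTwitter(), twListToDF()
library(ROAuth)    # Open Authentication support
library(ggplot2)   # charts and graphs
api_key <- "YOUR API KEY"
api_secret <- "YOUR API SECRET"
access_token <- "YOUR ACCESS TOKEN"
access_token_secret <- "YOUR ACCESS TOKEN SECRET"
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)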

The parameters to the function are all obtained from the Twitter application created using https://apps.twitter.com/. When the above function is invoked, R will prompt to cache the OAuth credentials so that they can be reused across R sessions. Once the authentication is successful, we can invoke the searchTwitter() function to fetch tweets having the desired hashtags.

KeralaPollsTweets.list = searchTwitter("#KeralaElections2016 #KeralaPolls2016", n=1000)

The output of the searchTwitter() function is a list, and for all practical purposes it is better to store it as a data frame in R. The twitteR library provides the function twListToDF() for this conversion.

KeralaPollsTweets.df = twListToDF(KeralaPollsTweets.list)

Now that we have all the tweets in our data frame, we can proceed to analyze the text column (which contains the tweet). There are two things that we are trying to find out from any given tweet:

  1. Overall sentiment/mood conveyed in the tweet

    This is accomplished by splitting the tweet into words (after removing all special characters) and then matching them against a standard set of positive and negative words. https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon provides a list of positive and negative opinion/sentiment words (English only) that can be used for this purpose. After analyzing the tweets, I added a number of commonly occurring words to improve the accuracy.

    For each tweet, the matches against positive and negative words are counted, and their difference gives the sentiment score. A negative score indicates a negative sentiment, a positive score indicates a positive sentiment, and 0 means that either the positive and negative sentiments balance out or there is no meaningful information from which to calculate a score.

    Note: Although this method is simple, its accuracy may vary, as mentioned by the lexicon authors themselves. For more accurate results, text mining should take phrases and sentence grammar into account.

    References:

    1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
    2. Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

  2. Political party being referred to in the tweet

    To add meaning to the sentiment scores, it is better to identify the political party being referenced in the tweet. For simplicity, I selected only three major political fronts: a. UDF (United Democratic Front) b. LDF (Left Democratic Front) c. NDA (National Democratic Alliance)

    To identify the party being referenced in a tweet, I followed the same approach as for sentiment and created a keyword file for each of the three parties. The tweets were then run through these keyword files and the number of matches against each party's keywords was noted for every tweet. The party with the highest match count was the one the tweet got tagged to.

    Note: Not all tweets can be classified this way, as some tweets do not contain sufficient information to identify the party being referred to. Such tweets are tagged as UNKNOWN.
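
The two steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the script used for this exercise: the function names tokenize, score_tweet and tag_party are hypothetical, the word lists are tiny stand-ins for the lexicon files, and the party keywords are my own guesses at what such keyword files might contain.

```r
# Split a tweet into lower-case words after removing special characters.
tokenize <- function(tweet) {
  clean <- gsub("[^a-z0-9 ]", " ", tolower(tweet))
  words <- unlist(strsplit(clean, " +"))
  words[nchar(words) > 0]
}

# Sentiment score = positive matches minus negative matches.
score_tweet <- function(tweet, pos_words, neg_words) {
  words <- tokenize(tweet)
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

# Tag a tweet with the party whose keyword list matches most often.
# Ties go to the first party in the list; no match at all gives "UNKNOWN".
tag_party <- function(tweet, party_keywords) {
  words <- tokenize(tweet)
  hits <- sapply(party_keywords, function(k) sum(words %in% k))
  if (max(hits) == 0) "UNKNOWN" else names(hits)[which.max(hits)]
}

# Tiny illustrative word lists (assumptions, not the real files).
pos_words <- c("good", "win", "support")
neg_words <- c("bad", "corrupt", "fail")
party_keywords <- list(UDF = c("udf", "congress"),
                       LDF = c("ldf", "cpm"),
                       NDA = c("nda", "bjp"))

tweet <- "LDF will win, good support on the ground #KeralaPolls2016"
print(score_tweet(tweet, pos_words, neg_words))
print(tag_party(tweet, party_keywords))
```

Because the matching is purely word based, a tweet that names two parties, or that uses a negative word about an opponent, will be scored and tagged without regard to who the sentiment is aimed at.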

To perform the above categorization, we can use a simple loop to go through each tweet and perform the matching, or we can use the “apply” family of functions available in R. Using an apply function (laply in this case) avoids writing explicit loops in the R script and keeps the code compact; it may also run faster than a hand-written loop, although the gain depends on the work done per tweet. The laply function comes from the “plyr” library.
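
The difference between the two styles can be seen in a small sketch. To stay self-contained it uses base R's vapply() rather than plyr; plyr::laply(tweets, toy_score) would take the place of the vapply() call. The scoring function and the tweets are toy examples, not the ones used in this exercise.

```r
# Toy scoring function: +1 per "good", -1 per "bad" (illustrative only).
toy_score <- function(tweet) {
  words <- unlist(strsplit(tolower(tweet), "[^a-z]+"))
  as.numeric(sum(words == "good") - sum(words == "bad"))
}

tweets <- c("good good day", "bad roads", "nothing to report")

# Explicit loop version: pre-allocate, then fill one element at a time.
scores_loop <- numeric(length(tweets))
for (i in seq_along(tweets)) {
  scores_loop[i] <- toy_score(tweets[i])
}

# Apply-style version: one call, result type checked by vapply().
scores_apply <- vapply(tweets, toy_score, numeric(1), USE.NAMES = FALSE)

print(identical(scores_loop, scores_apply))
```

Both versions produce the same vector of scores; the apply style mainly saves the bookkeeping of allocating and indexing the result.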

Now that all the tweets are classified and scored, the next step is to plot a graph from this data. The ggplot2 library provides extensive facilities for charting.

ggplot call to plot the stacked bar chart using the processed data:

ggplot(data=KeralaPollsTweets.findf) +  # data frame holding all processed data
  geom_bar(mapping=aes(x=score, fill=party)) +
  theme_bw() + # plain display, nicer colors
  labs(x="Sentiment Score",y="Volume of Tweets") +
  guides(fill=guide_legend(title="Political Front")) + 
  ggtitle("Twitter Sentiment on #KeralaElections2016, #KeralaPolls2016") + 
  scale_y_continuous(breaks=seq(0,260,10)) +
  scale_x_continuous(breaks=seq(-5,5,1)) 

A sample result is shown below. The total number of tweets considered is 333.

Sentiment Score   LDF   NDA   UDF   UNKNOWN
-3                  0     0     1         1
-2                  0     2     0         1
-1                  0     5     0        14
 0                 38   184    11        55
 1                  1     8     2         9
 2                  1     0     0         0
2 1 0 0 0

Stacked bar chart drawn with the tabulated data

Inferences drawn from the chart
  1. The most talked-about political party on Twitter (at least in state election related tweets) is NDA, and LDF comes next. This is evident from the volume of tweets marked against each sentiment score. The ruling party (UDF) gets the fewest tweets.
  2. Maximum positive sentiment is with NDA (8 tweets at +1). UDF and LDF get equal positive sentiment (2 tweets each), and LDF is the only one to touch a +2 sentiment score (only 1 tweet, though).
  3. Maximum negative sentiment is with NDA too (scores -1 and -2). LDF and UDF (see the Anomalies section) do not get any negative score.
  4. The maximum negative and positive tweet counts being attributed to NDA could be a side effect of the higher tweet volume for NDA compared to the other two parties.

Anomalies

As mentioned earlier, sentiment analysis by matching against a positive/negative sentiment lexicon may bring in a lot of false positives and false negatives.

The -3 score attributed to UDF in the above chart is a perfect example. On analyzing the set of tweets, I found that the single tweet below contributed this:

#UrVoteForUDF is a vote agnst fascism. It's one smash against divisive forces #KeralaElections2016 #KeralaPolls2016

The words matched against the negative lexicon (presumably "fascism", "smash" and "divisive") took the tweet to a score of -3. However, in reality, the tweet itself conveys positive sentiment towards UDF. Hence this entry should be treated as a +3 instead of -3.

For the record, this exercise was done 3 days before the polling day.