Track Consumer Engagement and Sentiment with Microsoft Codename "Social Analytics"

It isn't at all easy to make sense of the deluge of information from social media sites, particularly to measure opinion or sentiment, yet it has become increasingly important for marketing to be able to do so. Microsoft is keen to be in the vanguard with its "Social Analytics" research project.

Social analytics is a new marketing-oriented discipline that attempts to measure the engagement of Internet-connected individuals with products, services, brands, celebrities, politicians, and political parties. An important element of social analytics is estimating the opinion of participants about the entities with which they’re engaged by a process commonly called sentiment analysis or opinion mining. Social analytics relies on high-performance computing (HPC) techniques to filter a deluge of user-generated source data from social media sites, such as Twitter and Facebook, and linguistics methods to infer positive or negative sentiment, also called tone, from brief messages. For example, Twitter users generated an estimated 290 million tweets per day in mid-February 2012. Facebook currently receives about one billion posts and 2.7 billion likes/comments per day.

Filtering “firehose” data streams of these magnitudes requires more HPC horsepower than most organizations are willing to devote to yet-unproven analytic techniques. To fill that need, Microsoft’s SQL Azure Labs introduced its Codename “Social Analytics” Software as a Service (SaaS) application as a private Community Technology Preview (CTP) on October 25, 2011, and updated it on January 30, 2012. The Windows Azure Marketplace DataMarket (Azure DataMarket) currently delivers two no-charge social data streams having fixed topics: Windows 8 and Bill Gates. Microsoft has promised “Future releases will allow you to define your own topic(s) of interest,” but this capability hadn’t arrived by mid-February 2012. The “Social Analytics” CTPs require prospective users to apply for a DataMarket key for their stream of choice by completing a form hosted on Windows Azure. After receiving a key, you can test drive the data set with a sample Silverlight UI, the Engagement Client (see Figure 1), by following the instructions provided in Microsoft Connect’s Engagement Client – Social Analytics page.


Figure 1. The updated Codename “Social Analytics” CTP’s Engagement Client application with a user-configurable Silverlight UI for a predefined data set filtered for Bill Gates or Windows 8 (as shown here). This configuration’s left pane displays new filtered tweets in real-time, the middle pane shows the number of filtered tweets per day for the last seven days, which is indicative of engagement, and the right pane displays the count of tweets containing the keywords listed. The January 2012 CTP update added five new “analytic widgets” to the client.

Using Graphical Consumers for the Social Analytics API

The Social Analytics API is an alternative to the Engagement Client. It lets you access Social Analytics’ Open Data Protocol (OData) streams directly with any application or programming language that can consume OData feeds from the Azure DataMarket. My December 2011 post to the A Cloudy Place blog, Use OData to Execute RESTful CRUD Operations on Big Data in the Cloud, describes several OData consumers available from Microsoft and third parties. The PowerPivot for Excel add-in is especially suited for displaying raw Azure DataMarket streams because it provides a Table Import Wizard for connecting to and downloading OData-formatted datasets, as described on Microsoft Connect and shown in Figure 2.


Figure 2. Completing the PowerPivot for Excel Table Import Wizard’s Connection and Select Tables/Views dialogs and clicking Close to dismiss the Wizard displays a small dataset’s content in a worksheet (ContentItemTypes for this example). Opening a large dataset, such as ContentItems (usually ~800,000 items), is time consuming and usually warrants canceling the download.

My Querying Microsoft’s Codename “Social Analytics” OData Feeds with LINQPad blog post describes in detail how to use Joseph Albahari’s free LINQPad utility to display Social Analytics data sets in a data grid (see Figure 3) and export them to Excel.


Figure 3. Opening a large data set, such as ContentItems, in LINQPad displays by default the first 500 or fewer collection items in a table.
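Whichever consumer you choose, a request to an OData feed is an ordinary URL with system query options such as $top, $filter, and $orderby appended. The following Python sketch shows how such a URL might be composed. The service root and the PublishedOn property name are placeholders assumed for illustration, not values taken from the CTP documentation; substitute the details supplied with your DataMarket subscription.

```python
from urllib.parse import quote, urlencode

# Hypothetical service root (assumption); use the URL shown for your subscription.
SERVICE_ROOT = "https://example.invalid/SocialAnalytics/v1"


def build_odata_url(entity_set, top=None, filter_expr=None, order_by=None):
    """Compose an OData query URL from standard system query options."""
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr
    if order_by:
        options["$orderby"] = order_by
    if top is not None:
        options["$top"] = str(top)
    url = f"{SERVICE_ROOT}/{quote(entity_set)}"
    if options:
        # Keep "$" literal and encode spaces as %20 rather than "+".
        url += "?" + urlencode(options, safe="$", quote_via=quote)
    return url


# "PublishedOn" is an assumed property name used only for illustration.
url = build_odata_url("ContentItems", top=100, order_by="PublishedOn desc")
print(url)
```

Limiting a request with $top is one way to avoid pulling an entire large collection such as ContentItems when only a sample is needed.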

Programming the Social Analytics API with Visual Studio

Social data analysts usually are more interested in graphing engagement and sentiment trends over time spans of days, weeks, or months after the occurrence of an event, such as a marketing campaign, than in listing absolute numbers. Generating time-series graphs ordinarily requires programming a Web or desktop client application on a platform that supports an OData consumer API. The Codename “Social Analytics” team doesn’t provide many code samples in its “Social Analytics Authenticated API” documentation, which you can download from Microsoft Connect, and the docs are missing detailed descriptions of the API’s object model properties. To fill this gap, I’ve written a .NET Windows form OData client project, the C# source code and executable for which you can download from my Windows Live SkyDrive account. Figure 4 shows the main form of the project, which was updated on February 18, 2012 to add error handling and details about the CTP’s 10 supported data source types.


Figure 4. This sample Windows form client displays a gradual rising trend in social engagement for the most recent 100,000 ContentItems over the last six days of a ten-day period. Positive and negative sentiment trended upward for five of the last six days. The Types list box at the middle right displays the current number of items for each of the 10 supported ContentItemTypes.
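The per-day trend lines in a chart like Figure 4 come down to grouping content items by calendar day and counting them by tone. The Python sketch below illustrates that aggregation step only; it is not the code of the C# sample client. The field names published and tone_id are assumptions made for the example, while the tone values 5 (positive) and 6 (negative) are the CalculatedToneId values the API reports.

```python
from collections import defaultdict
from datetime import datetime

POSITIVE, NEGATIVE = 5, 6  # CalculatedToneId values for positive and negative tone


def daily_trend(items):
    """Return a list of (date, counts) pairs, oldest day first.

    Each item is a dict with assumed keys "published" (ISO 8601 string)
    and "tone_id" (integer CalculatedToneId).
    """
    days = defaultdict(lambda: {"items": 0, "positive": 0, "negative": 0})
    for item in items:
        day = datetime.fromisoformat(item["published"]).date()
        bucket = days[day]
        bucket["items"] += 1
        if item["tone_id"] == POSITIVE:
            bucket["positive"] += 1
        elif item["tone_id"] == NEGATIVE:
            bucket["negative"] += 1
    return sorted(days.items())


# Made-up sample records, for illustration only.
sample = [
    {"published": "2012-02-16T08:15:00", "tone_id": 5},
    {"published": "2012-02-15T09:30:00", "tone_id": 3},
    {"published": "2012-02-16T17:45:00", "tone_id": 6},
]
for day, counts in daily_trend(sample):
    print(day.isoformat(), counts)
```

The sorted list of per-day counts can be handed directly to whatever charting control or library the client uses.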

Analyzing the Sentiment of Brief Text Messages

It isn’t a simple task to determine whether short text phrases, typified by Twitter’s 140-character maximum tweets or brief comments about blog posts, unambiguously specify a positive or negative sentiment about a topic. The updated Social Analytics API uses recently enhanced sentiment analysis code, according to the Codename “Social Analytics” team’s post of February 2, 2012, Lab Bonus! Enhanced Sentiment Analysis for Twitter from Microsoft Research:

The sentiment analysis code we used in prior releases from Microsoft Research was trained on short sentences and paragraphs. We predict that the accuracy of sentiment analysis will improve in Social Analytics by using the classifier trained specifically on tweets for Twitter content items. We will continue to use the sentence and paragraph classifiers on all other content.

The tweet classifier was trained on nearly 4 million tweets from over a year’s worth of English Twitter data. It is based on a study of how people express their moods on Twitter with mood-indicating hashtags. We mapped over 150 different mood-bearing hashtags to positive and negative affect, and used the hashtags as a training signal to learn which words and word pairs in a tweet are highly correlated with positive or negative affect.
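The idea of using mood hashtags as a training signal can be illustrated with a toy Python sketch. It is emphatically not Microsoft Research’s classifier: the four hashtags and their labels are invented for the example, and it only counts single words per label, whereas the team describes learning from both words and word pairs with more than 150 hashtags.

```python
import re
from collections import Counter

# Hypothetical hashtag-to-affect map, invented for illustration.
MOOD_HASHTAGS = {
    "#happy": "positive",
    "#excited": "positive",
    "#sad": "negative",
    "#angry": "negative",
}

TOKEN = re.compile(r"#?\w+")


def label_from_hashtags(tweet):
    """Return (label, words); label is None unless exactly one affect is signaled."""
    tokens = TOKEN.findall(tweet.lower())
    labels = {MOOD_HASHTAGS[t] for t in tokens if t in MOOD_HASHTAGS}
    if len(labels) != 1:
        return None, []  # no mood hashtag, or conflicting ones
    # Drop the mood hashtags so the label does not leak into the features.
    words = [t for t in tokens if t not in MOOD_HASHTAGS]
    return labels.pop(), words


def word_counts_by_label(tweets):
    """Tally how often each word appears in positively and negatively labeled tweets."""
    counts = {"positive": Counter(), "negative": Counter()}
    for tweet in tweets:
        label, words = label_from_hashtags(tweet)
        if label is not None:
            counts[label].update(words)
    return counts


counts = word_counts_by_label([
    "Love my new phone #happy",
    "Battery died again #angry",
    "Just another Monday",
])
print(counts["positive"].most_common(3))
```

Words that appear far more often under one label than the other are the kind of signal a real classifier would weight; a production system would also need smoothing, far more data, and handling of negation.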

The API uses a tone reliability score to specify sentiment analysis accuracy, with an 80 percent score to qualify positive tone (CalculatedToneId = 5) and 90 percent for negative tone (CalculatedToneId = 6). Otherwise the tone is considered neutral (CalculatedToneId = 3). The ratio of average positive (3,397) plus negative (112) tones per day to average tweets per day (9,957) shown in Figure 4 is about 35 percent for the most recent 100,000 items, which include few Facebook posts, comments and likes. My Twitter Sentiment Analysis: A Brief Bibliography post, updated on February 17, 2012, provides excerpts from and links to recent technical papers from Microsoft Research and others about sentiment analysis of social Web data, with emphasis on processing content from Twitter.
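The threshold rule and the 35 percent figure can be checked with a few lines of Python. This is a sketch of the rule as described above, not the service’s code: the function signature, the raw polarity input, and the choice of an inclusive (>=) comparison are assumptions.

```python
POSITIVE, NEGATIVE, NEUTRAL = 5, 6, 3  # CalculatedToneId values


def calculated_tone_id(polarity, reliability):
    """Map a raw polarity ("positive" or "negative") and a reliability
    score between 0.0 and 1.0 to a CalculatedToneId.

    Inclusive thresholds are an assumption; the source states only
    80 percent for positive and 90 percent for negative tone.
    """
    if polarity == "positive" and reliability >= 0.80:
        return POSITIVE
    if polarity == "negative" and reliability >= 0.90:
        return NEGATIVE
    return NEUTRAL


def reliable_share(positive_per_day, negative_per_day, items_per_day):
    """Fraction of items per day that carry a non-neutral tone."""
    return (positive_per_day + negative_per_day) / items_per_day


share = reliable_share(3397, 112, 9957)  # daily averages from Figure 4
print(f"{share:.1%}")  # 35.2%
```

Note the asymmetry: a negative classification needs a higher reliability score than a positive one before it is reported, so an 85 percent reliable negative reading is returned as neutral.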

Figure 4’s ContentItem Title data grid column contains the text of tweets and the titles of Facebook and blog posts, as well as the text of Facebook comments and likes. Excerpts from posts appear in the Summary column, as shown in Figure 5, and Facebook comments and likes repeat the title text in the Summary column.


Figure 5. The Post ContentItemType for blog and Facebook posts, as well as the Facebook Comment type, populate Summary data to the right of the data grid’s Tone Reliability column; all items include URL links to the HTML source data.

Like text consists of “Firstname Lastname liked this” messages, which are totally ambiguous as to sentiment, and comments are little, if any, better. The updated Windows form sample client includes a check box to limit the ContentItems analyzed to Twitter tweets, retweets, and replies. Data for Twitter content for the same dates and number of items shows 3,486 positive and 122 negative average tones per day and 10,676 tweets, retweets, and replies, for a similar 34 percent of items having reliable sentiment values. With today’s sentiment analysis techniques, making marketing decisions based primarily or entirely on Twitter content is probably the safest bet.


The Codename “Social Analytics” API and Microsoft Research’s investment in enhancing the accuracy of sentiment measurement for tweets are promising advances in obtaining real-time, actionable social Web data for analyzing consumers’ perception of brands, technologies, politicians, celebrities, and many other entities. Analyst Barb Darrow (@gigabarb) asserted “Big data skills bring big dough” in a February 17, 2012 post to GigaOM’s Structure blog. However, it’s not only data scientists who stand to rake in the bucks from petabytes of social data; their employers are sure to take the lion’s share of the largess.