Social analytics is a new marketing-oriented discipline that attempts to measure the engagement of Internet-connected individuals with products, services, brands, celebrities, politicians, and political parties. An important element of social analytics is estimating participants’ opinions about the entities with which they’re engaged, a process commonly called sentiment analysis or opinion mining. Social analytics relies on high-performance computing (HPC) techniques to filter a deluge of user-generated source data from social media sites, such as Twitter and Facebook, and on linguistic methods to infer positive or negative sentiment, also called tone, from brief messages. For example, Twitter users generated an estimated 290 million tweets per day in mid-February 2012. Facebook currently receives about one billion posts and 2.7 billion likes/comments per day.
Filtering “firehose” data streams of these magnitudes requires more HPC horsepower than most organizations are willing to devote to yet-unproven analytic techniques. To fill that need, Microsoft’s SQL Azure Labs introduced its Codename “Social Analytics” Software as a Service (SaaS) application as a private Community Technical Preview (CTP) on October 25, 2011 and updated it on January 30, 2012. The Windows Azure Marketplace DataMarket (Azure DataMarket) currently delivers two no-charge social data streams having fixed topics: Windows 8 and Bill Gates. Microsoft has promised, “Future releases will allow you to define your own topic(s) of interest,” but this capability hadn’t arrived by mid-February 2012. The “Social Analytics” CTPs require prospective users to apply for a DataMarket key for their stream of choice by completing a form hosted on Windows Azure. After receiving a key, you can test drive the data set with a sample Silverlight UI – the Engagement Client (see Figure 1) – by following the instructions provided in Microsoft Connect’s Engagement Client – Social Analytics page.
Using Graphical Consumers for the Social Analytics API
The Social Analytics API is an alternative to the Engagement Client and lets you access Social Analytics’ Open Data Protocol (OData) streams directly with any application or programming language that can consume OData feeds from the Azure DataMarket. My “Use OData to Execute RESTful CRUD Operations on Big Data in the Cloud” post of December 2011 on the A Cloudy Place blog describes several OData consumers available from Microsoft and third parties. The PowerPivot for Excel add-in is especially well suited to displaying raw Azure DataMarket streams because it provides a Table Import Wizard for connecting to and downloading OData-formatted datasets, as described on Microsoft Connect and shown in Figure 2.
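As a rough sketch of what consuming the feed directly might involve, the following Python snippet composes an OData query URL from the protocol's standard $filter, $orderby, $top, and $skip system query options. Python is used here only for brevity. The service root is a placeholder, and the ContentItems entity set and CalculatedToneId property names are taken from the columns and values discussed in this article, so treat them as assumptions and substitute the values shown for your own DataMarket subscription. Authentication with the DataMarket key is deliberately left out.

```python
from urllib.parse import quote, urlencode

# Placeholder service root (an assumption): replace it with the URL that
# the DataMarket shows for your subscription.
SERVICE_ROOT = "https://example.invalid/SocialAnalytics/v1"


def build_odata_url(entity_set, filter_expr=None, orderby=None, top=None, skip=None):
    """Compose an OData query URL from standard system query options."""
    options = {}
    if filter_expr is not None:
        options["$filter"] = filter_expr
    if orderby is not None:
        options["$orderby"] = orderby
    if top is not None:
        options["$top"] = top
    if skip is not None:
        options["$skip"] = skip

    url = SERVICE_ROOT.rstrip("/") + "/" + entity_set
    if options:
        # Keep "$" literal in option names and encode spaces as %20.
        url += "?" + urlencode(options, quote_via=quote, safe="$")
    return url


# Hypothetical query: the first 100 items that carry a positive tone.
url = build_odata_url("ContentItems", filter_expr="CalculatedToneId eq 5", top=100)
print(url)
```

Paging through a large feed would then be a matter of repeating the request with an increasing skip value.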
My “Querying Microsoft’s Codename ‘Social Analytics’ OData Feeds with LINQPad” blog post describes in detail how to use Joseph Albahari’s free LINQPad utility to display Social Analytics data sets in a data grid (see Figure 3) and export them to Excel.
Programming the Social Analytics API with Visual Studio
Social data analysts are usually more interested in graphing engagement and sentiment trends over time spans of days, weeks, or months after an event, such as a marketing campaign, than in listing absolute numbers. Generating time-series graphs ordinarily requires programming a Web or desktop client application on a platform that supports an OData consumer API. The Codename “Social Analytics” team doesn’t provide many code samples in its “Social Analytics Authenticated API” documentation, which you can download from Microsoft Connect, and the docs are missing detailed descriptions of the API’s object model properties. To fill this gap, I’ve written a .NET Windows Forms OData client project, the C# source code and executable for which you can download from my Windows Live SkyDrive account. Figure 4 shows the main form of the project, which was updated on February 18, 2012 to add error handling and details about the CTP’s 10 supported data source types.
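The grouping step behind such a time-series graph can be sketched in a few lines. This is not the code from the downloadable C# project; it is a minimal Python illustration that assumes each downloaded item has already been turned into a dictionary with an ISO-formatted timestamp under a hypothetical PublishedOn key and a CalculatedToneId value. The sample items are invented for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def daily_tone_counts(items):
    """Group items by calendar day and count how often each tone ID occurs."""
    per_day = defaultdict(Counter)
    for item in items:
        # "PublishedOn" is an assumed field name holding an ISO 8601 timestamp.
        day = datetime.fromisoformat(item["PublishedOn"]).date()
        per_day[day][item["CalculatedToneId"]] += 1
    # Return the days in chronological order, ready for plotting.
    return dict(sorted(per_day.items()))


# Invented sample data, for illustration only.
items = [
    {"PublishedOn": "2012-02-17T08:15:00", "CalculatedToneId": 5},
    {"PublishedOn": "2012-02-16T23:50:00", "CalculatedToneId": 5},
    {"PublishedOn": "2012-02-17T09:00:00", "CalculatedToneId": 6},
    {"PublishedOn": "2012-02-17T10:30:00", "CalculatedToneId": 3},
]

series = daily_tone_counts(items)
for day, counts in series.items():
    print(day, dict(counts))
```

Each day's counts can then be handed to whatever charting component the client application uses.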
Analyzing the Sentiment of Brief Text Messages
It isn’t a simple task to determine whether short text phrases, typified by Twitter’s 140-character maximum tweets or brief comments about blog posts, unambiguously express a positive or negative sentiment about a topic. The updated Social Analytics API uses recently enhanced sentiment analysis code, according to the Codename “Social Analytics” team’s “Lab Bonus! Enhanced Sentiment Analysis for Twitter from Microsoft Research” post of February 2, 2012:
The sentiment analysis code we used in prior releases from Microsoft Research was trained on short sentences and paragraphs. We predict that the accuracy of sentiment analysis will improve in Social Analytics by using the classifier trained specifically on tweets for Twitter content items. We will continue to use the sentence and paragraph classifiers on all other content.
The tweet classifier was trained on nearly 4 million tweets from over a year’s worth of English Twitter data. It is based on a study of how people express their moods on Twitter with mood-indicating hashtags. We mapped over 150 different mood-bearing hashtags to positive and negative affect, and used the hashtags as a training signal to learn which words and word pairs in a tweet are highly correlated with positive or negative affect.
The API uses a tone reliability score to specify sentiment analysis accuracy, with an 80 percent score required to qualify as positive tone (CalculatedToneId = 5) and 90 percent as negative tone (CalculatedToneId = 6). Otherwise the tone is considered neutral (CalculatedToneId = 3). The ratio of average positive (3,397) plus negative (112) tones per day to average tweets per day (9,957) shown in Figure 4 is about 35 percent for the most recent 100,000 items, which include a few Facebook posts, comments, and likes. My “Twitter Sentiment Analysis: A Brief Bibliography” post, updated on February 17, 2012, provides excerpts from and links to recent technical papers from Microsoft Research and others about sentiment analysis of social Web data, with emphasis on processing content from Twitter.
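To make the thresholds and the arithmetic concrete, here is a small Python sketch. The calculated_tone_id function is hypothetical: the API's internals aren't documented, so the function only mirrors the 80 and 90 percent cutoffs described above, and treating a score exactly at the cutoff as qualifying is my assumption. The toned_share function simply reproduces the ratio calculated from the Figure 4 averages.

```python
POSITIVE, NEGATIVE, NEUTRAL = 5, 6, 3  # CalculatedToneId values


def calculated_tone_id(label, reliability):
    """Map a raw label and reliability score (0 to 1) to a tone ID.

    Hypothetical helper: it mirrors the published cutoffs, not the
    API's actual implementation.
    """
    if label == "positive" and reliability >= 0.80:
        return POSITIVE
    if label == "negative" and reliability >= 0.90:
        return NEGATIVE
    return NEUTRAL


def toned_share(positive, negative, total):
    """Fraction of items per day that carry a non-neutral tone."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (positive + negative) / total


# Daily averages reported in Figure 4 for all data sources.
share = toned_share(3397, 112, 9957)
print(f"{share:.1%}")  # prints 35.2%
```

Note the asymmetry: a negative label with an 85 percent reliability score would fall back to neutral, while a positive label with the same score would qualify.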
Figure 4’s ContentItem Title data grid column contains the text of tweets and the titles of Facebook and blog posts, as well as the text of Facebook comments and likes. Excerpts from posts appear in the Summary column, as shown in Figure 5, and Facebook comments and likes repeat the title text in the Summary column.
Like text consists of “Firstname Lastname liked this” messages, which are totally ambiguous as to sentiment, and comments are little, if any, better. The updated Windows Forms sample client includes a check box to limit the ContentItems analyzed to Twitter tweets, retweets, and replies. Data for Twitter content for the same dates and number of items shows 3,486 positive and 122 negative average tones per day and 10,676 tweets, retweets, and replies, so a similar share of items, about 34 percent, has reliable sentiment values. With today’s sentiment analysis techniques, making marketing decisions based primarily or entirely on Twitter content is probably the safest bet.
The Codename “Social Analytics” API and Microsoft Research’s investment in enhancing the accuracy of sentiment measurement for tweets are promising advances in obtaining real-time, actionable social Web data for analyzing consumers’ perception of brands, technologies, politicians, celebrities, and many other entities. Analyst Barb Darrow (@gigabarb) asserted “Big data skills bring big dough” in a February 17, 2012 post to GigaOM’s Structure blog. However, it’s not only data scientists who stand to rake in the bucks from petabytes of social data; their employers are sure to take the lion’s share of the largess.