{"id":87138,"date":"2020-05-13T15:15:36","date_gmt":"2020-05-13T15:15:36","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=87138"},"modified":"2026-04-15T19:03:59","modified_gmt":"2026-04-15T19:03:59","slug":"text-mining-and-sentiment-analysis-with-r","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/text-mining-and-sentiment-analysis-with-r\/","title":{"rendered":"Sentiment Analysis and Text Mining with R: Word Clouds, NRC Lexicon, and Scores"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"h-executive-summary\">Executive Summary<\/h2>\n\n\n\n<p><strong>This article demonstrates text mining and sentiment analysis in R using a team health survey dataset. The R packages covered are: tm (text cleaning and corpus management), ggplot2 (visualisation), wordcloud (word cloud generation), and syuzhet (sentiment scoring and NRC emotion classification). The workflow follows five steps: installing and loading packages; reading text file data into R; cleaning the text (removing punctuation, numbers, stop words, whitespace); building a term-document matrix to count word frequencies; and generating visualisations &#8211; word clouds, word association charts, sentiment score plots, and emotion classification bar charts. 
All code examples use the R base functions and the packages above.<\/strong><\/p>\n\n\n<p><strong>The series so far:<\/strong><\/p>\n<ol>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/text-mining-and-sentiment-analysis-introduction\/\">Text Mining and Sentiment Analysis: Introduction<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/text-mining-and-sentiment-analysis-power-bi-visualizations\">Text Mining and Sentiment Analysis: Power BI Visualizations<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/text-mining-and-sentiment-analysis-with-r\/\">Text Mining and Sentiment Analysis: Analysis with R<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/bi\/sentiment-analysis-oracle-text\/\">Text Mining and Sentiment Analysis: Oracle Text<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/text-mining-and-sentiment-analysis-data-visualization-in-tableau\/\">Text Mining and Sentiment Analysis:\u00a0Data Visualization in Tableau<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/development\/data-science-development\/sentiment-analysis-python\/\">Sentiment Analysis with Python<\/a><\/li>\n<\/ol>\n\n\n\n\n<p>This is the third article of the \u201cText Mining and Sentiment Analysis\u201d Series. The first article introduced Azure Cognitive Services and demonstrated the setup and use of Text Analytics APIs for extracting key Phrases &amp; Sentiment Scores from text data. The second article demonstrated Power BI visualizations for analyzing Key Phrases &amp; Sentiment Scores and interpreting them to gain insights. This article explores R for text mining and sentiment analysis. I will demonstrate several common text analytics techniques and visualizations in R.<\/p>\n\n\n\n<p>Note: This article assumes basic familiarity with R and RStudio. 
Please jump to the References section for more information on installing R and RStudio. The demo data raw text file and R script are available for download from my GitHub repository; please find the link in the References section.<\/p>\n\n\n\n<p>R is a language and environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques and is highly extensible. R is available as free software. It\u2019s easy to learn and use and can produce well-designed, publication-quality plots. For the demos in this article, I am using R version 3.5.3 (2019-03-11) and RStudio version 1.1.456.<\/p>\n\n\n\n<p>The input file for this article is a text file with only one column: the \u201cRaw text\u201d of survey responses.<\/p>\n\n\n\n<p>A sample of the first few rows is shown in Notepad++ (showing all characters) in Figure 1.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1911\" height=\"385\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-computer-description-automatica.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" class=\"wp-image-87139\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 1. Sample of the input text file<\/strong><\/p>\n\n\n\n<p>R has a rich set of packages for Natural Language Processing (NLP) and generating plots. The foundational steps involve loading the text file into an R Corpus, then cleaning and stemming the data before performing analysis. 
I will demonstrate these steps and analyses such as word frequency, word clouds, word association, sentiment scores and emotion classification, using various plots and charts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-installing-and-loading-r-packages\">Installing and loading R packages<\/h2>\n\n\n\n<p>The following packages are used in the examples in this article:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>tm<\/strong> for text mining operations like removing numbers, special characters, punctuation and stop words. (Stop words in any language are the most commonly occurring words that have very little value for NLP and should be filtered out. Examples of stop words in English are \u201cthe\u201d, \u201cis\u201d, \u201care\u201d.)<\/li>\n\n\n\n<li><strong>SnowballC<\/strong> for stemming, which is the process of reducing words to their base or root form. For example, a stemming algorithm would reduce the words \u201cfishing\u201d, \u201cfished\u201d and \u201cfisher\u201d to the stem \u201cfish\u201d.<\/li>\n\n\n\n<li><strong>wordcloud<\/strong> for generating the word cloud plot.<\/li>\n\n\n\n<li><strong>RColorBrewer<\/strong> for color palettes used in various plots.<\/li>\n\n\n\n<li><strong>syuzhet<\/strong> for sentiment scores and emotion classification.<\/li>\n\n\n\n<li><strong>ggplot2<\/strong> for plotting graphs.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>Open RStudio and create a new R Script. 
Use the following code to install and load these packages.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Install\ninstall.packages(\"tm\")  # for text mining\ninstall.packages(\"SnowballC\") # for text stemming\ninstall.packages(\"wordcloud\") # word-cloud generator \ninstall.packages(\"RColorBrewer\") # color palettes\ninstall.packages(\"syuzhet\") # for sentiment analysis\ninstall.packages(\"ggplot2\") # for plotting graphs\n# Load\nlibrary(\"tm\")\nlibrary(\"SnowballC\")\nlibrary(\"wordcloud\")\nlibrary(\"RColorBrewer\")\nlibrary(\"syuzhet\")\nlibrary(\"ggplot2\")<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-reading-file-data-into-r\">Reading file data into R<\/h2>\n\n\n\n<p>The R base function <code>read.table()<\/code> is generally used to read a file in table format and imports data as a data frame. Several variants of this function are available for importing different file formats:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>read.csv()<\/strong> is used for reading comma-separated value (csv) files, where a comma \u201c,\u201d is used as the field separator<\/li>\n\n\n\n<li><strong>read.delim()<\/strong> is used for reading tab-separated values (.txt) files<\/li>\n<\/ul>\n<\/div>\n\n\n<p>The input file has multiple lines of text and no columns\/fields (the data is not tabular), so you will use the <code>readLines<\/code> function. This function takes a file (or URL) as input and returns a vector containing as many elements as the number of lines in the file. The <code>readLines<\/code> function simply extracts the text from its input source and returns each line as a character string. The <code>n=<\/code> argument is useful for reading a limited number (subset) of lines from the input source (its default value is -1, which reads all lines from the input source). 
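<\/p>\n\n\n\n<p>As a quick illustration of these arguments (the filename \u201csurvey.txt\u201d below is hypothetical), you can preview just the first few lines of a file and check which directory R is reading from:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Illustration only: preview a file before loading all of it\n# (replace \"survey.txt\" with your own file name)\ngetwd()                                    # shows the current working directory\npreview &lt;- readLines(\"survey.txt\", n = 3)  # reads only the first 3 lines\nlength(preview)  # number of lines read (3, if the file has at least 3 lines)<\/pre>\n\n\n\n<p>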
When using the filename in this function\u2019s argument, R assumes the file is in your current working directory (you can use the <code>getwd()<\/code> function in R console to find your current working directory). You can also choose the input file interactively, using the <code>file.choose()<\/code> function within the argument. The next step is to load that Vector as a Corpus. In R, a Corpus is a collection of text document(s) to apply text mining or NLP routines on. Details of using the <code>readLines<\/code> function are sourced from: <a href=\"https:\/\/www.stat.berkeley.edu\/~spector\/s133\/Read.html\">https:\/\/www.stat.berkeley.edu\/~spector\/s133\/Read.html<\/a> .<\/p>\n\n\n\n<p>In your R script, add the following code to load the data into a corpus.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Read the text file from local machine , choose file interactively\ntext &lt;- readLines(file.choose())\n# Load the data as a corpus\nTextDoc &lt;- Corpus(VectorSource(text))<\/pre>\n\n\n\n<p>Upon running this, you will be prompted to select the input file. Navigate to your file and click <em>Open <\/em>as shown in Figure 2.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1576\" height=\"572\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-computer-description-automatica-1.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" class=\"wp-image-87140\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 2. Select input file<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-cleaning-up-text-data\">Cleaning up Text Data<\/h2>\n\n\n\n<p>Cleaning the text data starts with making transformations like removing special characters from the text. This is done using the <code>tm_map()<\/code> function to replace special characters like <code>\/<\/code>, <code>@<\/code> and <code>|<\/code> with a space. 
The next step is to remove the unnecessary whitespace and convert the text to lower case.<\/p>\n\n\n\n<p>Then remove the <em>stopwords<\/em>. They are the most commonly occurring words in a language and have very little value in terms of gaining useful information. They should be removed before performing further analysis. Examples of stopwords in English are \u201cthe\u201d, \u201cis\u201d, \u201cat\u201d, \u201con\u201d. There is no single universal list of stop words used by all NLP tools. The <code>stopwords()<\/code> function used with <code>tm_map()<\/code> supports several languages, such as English, French, German, Italian, and Spanish. Please note the language names are case sensitive. I will also demonstrate how to add your own list of stopwords, which is useful in this Team Health example for removing non-default stop words like \u201cteam\u201d, \u201ccompany\u201d, \u201chealth\u201d. Next, remove numbers and punctuation.<\/p>\n\n\n\n<p>The last step is text stemming, the process of reducing each word to its root form, its common origin. For example, the stemming process reduces the words \u201cfishing\u201d, \u201cfished\u201d and \u201cfisher\u201d to their stem \u201cfish\u201d. Please note stemming uses the <em>SnowballC<\/em> package. 
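<\/p>\n\n\n\n<p>As a minimal sketch of what the stemmer does, you can call <code>wordStem<\/code> from SnowballC directly on a few words (this standalone call is for illustration only; the corpus pipeline below uses <code>stemDocument<\/code> instead):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Illustration: stem a few words directly with SnowballC\nlibrary(\"SnowballC\")\nwordStem(c(\"fishing\", \"fished\"))  # both reduce to the stem \"fish\"<\/pre>\n\n\n\n<p>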
(You may want to skip the text stemming step if your users indicate a preference to see the original \u201cunstemmed\u201d words in the word cloud plot.)<\/p>\n\n\n\n<p>In your R script, add the following code and run it to clean up the text data.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Replace \"\/\", \"@\" and \"|\" with a space\ntoSpace &lt;- content_transformer(function (x , pattern ) gsub(pattern, \" \", x))\nTextDoc &lt;- tm_map(TextDoc, toSpace, \"\/\")\nTextDoc &lt;- tm_map(TextDoc, toSpace, \"@\")\nTextDoc &lt;- tm_map(TextDoc, toSpace, \"\\\\|\")\n# Convert the text to lower case\nTextDoc &lt;- tm_map(TextDoc, content_transformer(tolower))\n# Remove numbers\nTextDoc &lt;- tm_map(TextDoc, removeNumbers)\n# Remove common English stopwords\nTextDoc &lt;- tm_map(TextDoc, removeWords, stopwords(\"english\"))\n# Remove your own stop words\n# specify your custom stopwords as a character vector\nTextDoc &lt;- tm_map(TextDoc, removeWords, c(\"s\", \"company\", \"team\")) \n# Remove punctuation\nTextDoc &lt;- tm_map(TextDoc, removePunctuation)\n# Eliminate extra white spaces\nTextDoc &lt;- tm_map(TextDoc, stripWhitespace)\n# Text stemming - which reduces words to their root form\nTextDoc &lt;- tm_map(TextDoc, stemDocument)<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-building-the-term-document-matrix\">Building the term document matrix<\/h2>\n\n\n\n<p>After cleaning the text data, the next step is to count the occurrence of each word, to identify popular or trending topics. 
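<\/p>\n\n\n\n<p>To make the idea of counting concrete, here is a minimal sketch of a term-document matrix built on a tiny two-response corpus (the sentences are invented for illustration):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Illustration: a term-document matrix on a tiny corpus\nmini &lt;- Corpus(VectorSource(c(\"good work\", \"good team health\")))\nas.matrix(TermDocumentMatrix(mini))\n# rows are terms, columns are documents; each cell counts occurrences,\n# so the term \"good\" has a count of 1 in both document columns<\/pre>\n\n\n\n<p>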
Using the function <code>TermDocumentMatrix()<\/code> from the text mining package, you can build a term-document matrix \u2013 a table containing the frequency of words.<\/p>\n\n\n\n<p>In your R script, add the following code and run it to see the top 5 most frequently found words in your text.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Build a term-document matrix\nTextDoc_dtm &lt;- TermDocumentMatrix(TextDoc)\ndtm_m &lt;- as.matrix(TextDoc_dtm)\n# Sort by decreasing value of frequency\ndtm_v &lt;- sort(rowSums(dtm_m), decreasing=TRUE)\ndtm_d &lt;- data.frame(word = names(dtm_v), freq=dtm_v)\n# Display the top 5 most frequent words\nhead(dtm_d, 5)<\/pre>\n\n\n\n<p>The following table of word frequency is the expected output of the <code>head<\/code> command in the RStudio console.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"170\" height=\"146\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/word-image-1.png\" alt=\"\" class=\"wp-image-87141\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Plotting the top 5 most frequent words using a bar chart is a good basic way to visualize this word frequency data. 
In your R script, add the following code and run it to generate a bar chart, which will display in the <em>Plots<\/em> section of RStudio.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Plot the most frequent words\nbarplot(dtm_d[1:5,]$freq, las = 2, names.arg = dtm_d[1:5,]$word,\n        col =\"lightgreen\", main =\"Top 5 most frequent words\",\n        ylab = \"Word frequencies\")<\/pre>\n\n\n\n<p>The plot can be seen in Figure 3.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"867\" height=\"665\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87142\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 3. Bar chart of the top 5 most frequent words<\/strong><\/p>\n\n\n\n<p>One could interpret the following from this bar chart:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>The most frequently occurring word is \u201cgood\u201d. Also notice that negative words like \u201cnot\u201d don\u2019t feature in the bar chart, which indicates there are no negative prefixes to change the context or meaning of the word \u201cgood\u201d (in short, this indicates most responses don\u2019t mention negative phrases like \u201cnot good\u201d).<\/li>\n\n\n\n<li>\u201cwork\u201d, \u201chealth\u201d and \u201cfeel\u201d are the next three most frequently occurring words, which indicate that most people feel good about their work and their team\u2019s health.<\/li>\n\n\n\n<li>Finally, the root \u201cimprov\u201d for words like \u201cimprove\u201d, \u201cimprovement\u201d, \u201cimproving\u201d, etc. 
is also on the chart, and you need further analysis to infer whether its context is positive or negative.<\/li>\n<\/ul>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-generate-the-word-cloud\">Generate the Word Cloud<\/h2>\n\n\n\n<p>A word cloud is one of the most popular ways to visualize and analyze qualitative data. It\u2019s an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text. Use the word frequency data frame (table) created previously to generate the word cloud. In your R script, add the following code and run it to generate the word cloud and display it in the <em>Plots<\/em> section of RStudio.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Generate the word cloud\nset.seed(1234)\nwordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,\n          max.words=100, random.order=FALSE, rot.per=0.40, \n          colors=brewer.pal(8, \"Dark2\"))<\/pre>\n\n\n\n<p>Below is a brief description of the arguments used in the word cloud function:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>words<\/strong> &#8211; words to be plotted<\/li>\n\n\n\n<li><strong>freq<\/strong> &#8211; frequencies of words<\/li>\n\n\n\n<li><strong>min.freq<\/strong> \u2013 words whose frequency is at or above this threshold value are plotted (in this case, I have set it to 5)<\/li>\n\n\n\n<li><strong>max.words<\/strong> \u2013 the maximum number of words to display on the plot (in the code above, I have set it to 100)<\/li>\n\n\n\n<li><strong>random.order<\/strong> \u2013 I have set it to FALSE, so the words are plotted in order of decreasing frequency<\/li>\n\n\n\n<li><strong>rot.per<\/strong> \u2013 the percentage of words that are displayed as vertical text (with 90-degree rotation). 
I have set it to 0.40 (40%); feel free to adjust this setting to suit your preferences<\/li>\n\n\n\n<li><strong>colors<\/strong> \u2013 changes word colors going from lowest to highest frequencies<\/li>\n<\/ul>\n<\/div>\n\n\n<p>You can see the resulting word cloud in Figure 4.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"672\" height=\"612\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-1.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87143\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 4. Word cloud plot<\/strong><\/p>\n\n\n\n<p>The word cloud shows additional words that occur frequently and could be of interest for further analysis. Words like \u201cneed\u201d, \u201csupport\u201d and \u201cissu\u201d (the root for \u201cissue(s)\u201d), etc. could provide more context around the most frequently occurring words and help to gain a better understanding of the main themes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-word-association\">Word Association<\/h2>\n\n\n\n<p>Correlation is a statistical technique that can demonstrate whether, and how strongly, pairs of variables are related. 
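<\/p>\n\n\n\n<p>The underlying idea is the same as base R\u2019s <code>cor<\/code> function, sketched here on two invented vectors:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Illustration: correlation of two perfectly related vectors\ncor(c(1, 2, 3), c(2, 4, 6))  # returns 1, a perfect positive correlation\n# findAssocs() below applies the same idea to word frequencies across documents<\/pre>\n\n\n\n<p>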
This technique can be used effectively to analyze which words occur most often in association with the most frequently occurring words in the survey responses, which helps to see the context around these words.<\/p>\n\n\n\n<p>In your R script, add the following code and run it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Find associations \nfindAssocs(TextDoc_dtm, terms = c(\"good\",\"work\",\"health\"), corlimit = 0.25)<\/pre>\n\n\n\n<p>You should see the results as shown in Figure 5.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"643\" height=\"203\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-2.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87144\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 5. Word association analysis for the top three most frequent terms<\/strong><\/p>\n\n\n\n<p>This script shows which words are most frequently associated with the top three terms (<code>corlimit = 0.25<\/code> is the lower limit\/threshold I have set. You can set it lower to see more words, or higher to see fewer). The output indicates that \u201cintegr\u201d (the root of the word \u201cintegrity\u201d) and \u201csynergi\u201d (the root of words like \u201csynergy\u201d and \u201csynergies\u201d) have a correlation of 0.28 with the word \u201cgood\u201d. You can interpret this as the context around the most frequently occurring word (\u201cgood\u201d) being positive. Similarly, the root of the word \u201ctogether\u201d is highly correlated with the word \u201cwork\u201d. 
This indicates that most responses are saying that teams \u201cwork together\u201d and can be interpreted in a positive context.<\/p>\n\n\n\n<p>You can modify the above script to find terms associated with words that occur at least 50 times, instead of having to hard-code the terms in your script.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Find associations for words that occur at least 50 times\nfindAssocs(TextDoc_dtm, terms = findFreqTerms(TextDoc_dtm, lowfreq = 50), corlimit = 0.25)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"762\" height=\"455\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-3.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87145\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 6: Word association output for terms occurring at least 50 times <\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-sentiment-scores\">Sentiment Scores<\/h2>\n\n\n\n<p>Sentiments can be classified as positive, neutral or negative. They can also be represented on a numeric scale, to better express the degree of positive or negative strength of the sentiment contained in a body of text.<\/p>\n\n\n\n<p>This example uses the Syuzhet package for generating sentiment scores, which has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford. The <code>get_sentiment<\/code> function accepts two arguments: a character vector (of sentences or words) and a method. The selected method determines which of the four available sentiment extraction methods will be used. The four methods are <code>syuzhet<\/code> (this is the default), <code>bing<\/code>, <code>afinn<\/code> and <code>nrc<\/code>. Each method uses a different scale and hence returns slightly different results. 
Please note the outcome of the <code>nrc<\/code> method is more than just a numeric score; it requires additional interpretation and is out of scope for this article. The descriptions of the <code>get_sentiment<\/code> function have been sourced from: <a href=\"https:\/\/cran.r-project.org\/web\/packages\/syuzhet\/vignettes\/syuzhet-vignette.html?\">https:\/\/cran.r-project.org\/web\/packages\/syuzhet\/vignettes\/syuzhet-vignette.html?<\/a><\/p>\n\n\n\n<p>Add the following code to the R script and run it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># regular sentiment score using get_sentiment() function and method of your choice\n# please note that different methods may have different scales\nsyuzhet_vector &lt;- get_sentiment(text, method=\"syuzhet\")\n# see the first few elements of the vector\nhead(syuzhet_vector)\n# see summary statistics of the vector\nsummary(syuzhet_vector)<\/pre>\n\n\n\n<p>Your results should look similar to Figure 7.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"691\" height=\"206\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-4.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87146\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 7. Syuzhet vector<\/strong><\/p>\n\n\n\n<p>An inspection of the Syuzhet vector shows the first element has the value of <em>2.60<\/em>. It means the sum of the sentiment scores of all meaningful words in the first response (line) in the text file adds up to 2.60. With the <code>syuzhet<\/code> method, each word is scored on a decimal scale from -1 (indicating most negative) to +1 (indicating most positive), so the sum for a whole response can fall outside that range. 
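<\/p>\n\n\n\n<p>To see the scale differences directly, you can score a single invented sentence with more than one method (the sentence below is made up for illustration; the exact values depend on the lexicons):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Illustration: one sentence, two methods, two different scales\nget_sentiment(\"this is a good and happy team\", method = \"syuzhet\")\nget_sentiment(\"this is a good and happy team\", method = \"afinn\")<\/pre>\n\n\n\n<p>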
Note that the summary statistics of the <code>syuzhet<\/code> vector show a median value of 1.6, which is above zero and can be interpreted to mean that the overall sentiment across all the responses is positive.<\/p>\n\n\n\n<p>Next, run the same analysis for the remaining two methods and inspect their respective vectors. Add the following code to the R script and run it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># bing\nbing_vector &lt;- get_sentiment(text, method=\"bing\")\nhead(bing_vector)\nsummary(bing_vector)\n# afinn\nafinn_vector &lt;- get_sentiment(text, method=\"afinn\")\nhead(afinn_vector)\nsummary(afinn_vector)<\/pre>\n\n\n\n<p>Your results should resemble Figure 8.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"461\" height=\"237\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-5.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87147\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 8. bing and afinn vectors<\/strong><\/p>\n\n\n\n<p>Please note the scale of sentiment scores generated by:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>bing<\/strong> \u2013 binary scale with -1 indicating negative and +1 indicating positive sentiment<\/li>\n\n\n\n<li><strong>afinn<\/strong> \u2013 integer scale ranging from -5 to +5<\/li>\n<\/ul>\n<\/div>\n\n\n<p>The summary statistics of the <code>bing<\/code> and <code>afinn<\/code> vectors also show that the <code>Median<\/code> value of sentiment scores is above 0 and can be interpreted to mean that the overall sentiment across all the responses is positive.<\/p>\n\n\n\n<p>Because these different methods use different scales, it\u2019s better to convert their output to a common scale before comparing them. 
This basic scale conversion can be done easily using R\u2019s built-in <code>sign<\/code> function, which converts all positive numbers to 1, all negative numbers to -1, and leaves all zeros as 0.<\/p>\n\n\n\n<p>Add the following code to your R script and run it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Compare the first few elements of each vector using the sign function\nrbind(\n  sign(head(syuzhet_vector)),\n  sign(head(bing_vector)),\n  sign(head(afinn_vector))\n)<\/pre>\n\n\n\n<p>Figure 9 shows the results.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"497\" height=\"164\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-6.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87148\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong> Figure 9. Normalize scale and compare three vectors<\/strong><\/p>\n\n\n\n<p>Note the first element of each row (vector) is <em>1<\/em>, indicating that all three methods have calculated a positive sentiment score for the first response (line) in the text.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-emotion-classification\">Emotion Classification<\/h2>\n\n\n\n<p>Emotion classification is built on the NRC Word-Emotion Association Lexicon (aka EmoLex). The definition of \u201cNRC Emotion Lexicon\u201d, sourced from <a href=\"http:\/\/saifmohammad.com\/WebPages\/NRC-Emotion-Lexicon.htm\">http:\/\/saifmohammad.com\/WebPages\/NRC-Emotion-Lexicon.htm<\/a> is \u201cThe NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). 
The annotations were manually done by crowdsourcing.\u201d<\/p>\n\n\n\n<p>To understand this, explore the <code>get_nrc_sentiment<\/code> function, which returns a data frame with each row representing a sentence from the original file. The data frame has ten columns (one column for each of the eight emotions, one column for positive sentiment valence and one for negative sentiment valence). The data in the columns (anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive) can be accessed individually or in sets. The definition of <code>get_nrc_sentiment<\/code> has been sourced from: <a href=\"https:\/\/cran.r-project.org\/web\/packages\/syuzhet\/vignettes\/syuzhet-vignette.html?\">https:\/\/cran.r-project.org\/web\/packages\/syuzhet\/vignettes\/syuzhet-vignette.html?<\/a><\/p>\n\n\n\n<p>Add the following code to your R script and run it to execute the <code>get_nrc_sentiment<\/code> function and inspect the resulting data frame.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Run NRC sentiment analysis. For each row (response), it counts the words\n# associated with each of these eight emotions:\n# anger, anticipation, disgust, fear, joy, sadness, surprise, trust\n# It also counts the number of positive and negative words found in each row\nd &lt;- get_nrc_sentiment(text)\n# see the top 10 lines of the get_nrc_sentiment data frame\nhead(d, 10)<\/pre>\n\n\n\n<p>The results should look like Figure 10.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"686\" height=\"219\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-7.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87149\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 10. 
Data frame returned by get_nrc_sentiment function<\/strong><\/p>\n\n\n\n<p>The output shows that the first line of text has:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Zero occurrences of words associated with emotions of anger, disgust, fear, sadness and surprise<\/li>\n\n\n\n<li>One occurrence each of words associated with emotions of anticipation and joy<\/li>\n\n\n\n<li>Two occurrences of words associated with the emotion of trust<\/li>\n\n\n\n<li>A total of one occurrence of words associated with negative sentiment<\/li>\n\n\n\n<li>A total of two occurrences of words associated with positive sentiment<\/li>\n<\/ul>\n<\/div>\n\n\n<p>The next step is to create two charts to help visually analyze the emotions in this survey text. First, perform some data transformation and clean-up steps before plotting the charts. The first plot shows the total number of instances of words in the text associated with each of the eight emotions. Add the following code to your R script and run it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Transpose\ntd &lt;- data.frame(t(d))\n# Compute the total count for each emotion across the responses\n# (columns 2:253 select the responses in this demo file; adjust the range to your data)\ntd_new &lt;- data.frame(rowSums(td[2:253]))\n# Transformation and cleaning\nnames(td_new)[1] &lt;- \"count\"\ntd_new &lt;- cbind(\"sentiment\" = rownames(td_new), td_new)\nrownames(td_new) &lt;- NULL\ntd_new2 &lt;- td_new[1:8,]\n# Plot One - count of words associated with each sentiment\nquickplot(sentiment, data=td_new2, weight=count, geom=\"bar\", fill=sentiment, ylab=\"count\")+ggtitle(\"Survey sentiments\")<\/pre>\n\n\n\n<p>You can see the bar plot in Figure 11.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"928\" height=\"695\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-8.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" 
class=\"wp-image-87150\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 11. Bar Plot showing the count of words in the text, associated with each emotion<\/strong><\/p>\n\n\n\n<p>This bar chart shows that words associated with the positive emotion of \u201ctrust\u201d occurred about 500 times in the text, whereas words associated with the negative emotion of \u201cdisgust\u201d occurred fewer than 25 times. A deeper understanding of the overall emotions in the survey responses can be gained by comparing these numbers as percentages of the total number of meaningful words. Add the following code to your R script and run it.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">#Plot two - count of words associated with each sentiment, expressed as a percentage\nbarplot(\n  sort(colSums(prop.table(d[, 1:8]))), \n  horiz = TRUE, \n  cex.names = 0.7, \n  las = 1, \n  main = \"Emotions in Text\", xlab=\"Percentage\"\n)<\/pre>\n\n\n\n<p>The emotions bar plot can be seen in Figure 12.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"898\" height=\"683\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2020\/05\/a-screenshot-of-a-cell-phone-description-automati-9.png\" alt=\"A screenshot of a cell phone\n\nDescription automatically generated\" class=\"wp-image-87151\"\/><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Figure 12. Bar Plot showing the count of words associated with each sentiment, expressed as a percentage<\/strong><\/p>\n\n\n\n<p>This bar plot allows for a quick and easy comparison of the proportion of words associated with each emotion in the text. The emotion \u201ctrust\u201d has the longest bar and shows that words associated with this positive emotion constitute just over 35% of all the meaningful words in this text. 
On the other hand, the emotion of \u201cdisgust\u201d has the shortest bar and shows that words associated with this negative emotion constitute less than 2% of all the meaningful words in the text. Overall, words associated with the positive emotions of \u201ctrust\u201d and \u201cjoy\u201d account for almost 60% of the meaningful words in the text, which can be interpreted as a good sign of team health.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\n\n\n\n<p>This article demonstrated how to read text data into R and how to clean and transform it. It showed how to create a word frequency table and plot a word cloud to identify prominent themes in the text. Word association analysis using correlation helped provide context around those themes. It explored four methods of generating sentiment scores, which assign a numeric value to the strength (of positivity or negativity) of sentiment in the text, and showed that the average sentiment across this text trends positive. 
Lastly, it demonstrated how to implement emotion classification with the NRC emotion lexicon and created two plots to analyze and interpret the emotions found in the text.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-references\">References:<\/h2>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>R &#8211; <a href=\"https:\/\/www.r-project.org\/\">https:\/\/www.r-project.org\/<\/a><\/li>\n\n\n\n<li>Download and install R for Windows &#8211; <a href=\"https:\/\/cran.r-project.org\/bin\/windows\/base\/\">https:\/\/cran.r-project.org\/bin\/windows\/base\/<\/a><\/li>\n\n\n\n<li>Download and install RStudio &#8211; <a href=\"https:\/\/rstudio.com\/products\/rstudio\/download\/\">https:\/\/rstudio.com\/products\/rstudio\/download\/<\/a><\/li>\n\n\n\n<li>Reading data into R &#8211; <a href=\"https:\/\/www.stat.berkeley.edu\/~spector\/s133\/Read.html\">https:\/\/www.stat.berkeley.edu\/~spector\/s133\/Read.html<\/a><\/li>\n\n\n\n<li>Power BI visuals using R &#8211; <a href=\"https:\/\/docs.microsoft.com\/en-us\/power-bi\/desktop-r-visuals\">https:\/\/docs.microsoft.com\/en-us\/power-bi\/desktop-r-visuals<\/a><\/li>\n\n\n\n<li>Natural Language Processing (NLP) &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing<\/a><\/li>\n\n\n\n<li>Stop words &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stop_words\">https:\/\/en.wikipedia.org\/wiki\/Stop_words<\/a><\/li>\n\n\n\n<li>Stemming &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stemming\">https:\/\/en.wikipedia.org\/wiki\/Stemming<\/a><\/li>\n\n\n\n<li>Word cloud in R &#8211; <a href=\"http:\/\/www.sthda.com\/english\/wiki\/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know\">http:\/\/www.sthda.com\/english\/wiki\/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know<\/a><\/li>\n\n\n\n<li>Correlation &#8211; <a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Correlation_and_dependence\">https:\/\/en.wikipedia.org\/wiki\/Correlation_and_dependence<\/a><\/li>\n\n\n\n<li>R Syuzhet Package &#8211; <a href=\"https:\/\/cran.r-project.org\/web\/packages\/syuzhet\/vignettes\/syuzhet-vignette.html\">https:\/\/cran.r-project.org\/web\/packages\/syuzhet\/vignettes\/syuzhet-vignette.html<\/a><\/li>\n\n\n\n<li>NRC Emotion lexicon &#8211; <a href=\"http:\/\/saifmohammad.com\/WebPages\/NRC-Emotion-Lexicon.htm\">http:\/\/saifmohammad.com\/WebPages\/NRC-Emotion-Lexicon.htm<\/a><\/li>\n\n\n\n<li>Sanil Mhatre\u2019s GitHub Repo for R Script and Demo data file &#8211; <a href=\"https:\/\/github.com\/SQLSuperGuru\/SimpleTalkDemo_R\">https:\/\/github.com\/SQLSuperGuru\/SimpleTalkDemo_R<\/a><\/li>\n<\/ul>\n<\/div>\n\n\n<section id=\"faq\" class=\"faq-block my-5xl\">\n    <h2>FAQs: Text Mining and Sentiment Analysis: Analysis with R<\/h2>\n\n                        <h3 class=\"mt-4xl\">1. How do I do sentiment analysis in R?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Install the syuzhet package (install.packages('syuzhet')). Load a text vector into R, then call get_sentiment(text_vector, method = 'afinn') for a numeric sentiment score per sentence (positive values indicate positive sentiment, negative values indicate negative sentiment), or get_nrc_sentiment(text_vector) for NRC emotion classification (joy, trust, fear, surprise, sadness, disgust, anger, anticipation). For a corpus of documents, use the tm package to clean the text first, then apply syuzhet to the cleaned data. The result is a data frame of sentiment scores that can be visualized with ggplot2.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">2. 
What R packages do I need for text mining?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Core text mining packages: tm (corpus creation and text cleaning &#8211; removing punctuation, numbers, stop words and whitespace); SnowballC (word stemming to reduce words to their root form); RColorBrewer and wordcloud (word cloud visualization); ggplot2 (sentiment and frequency bar charts); syuzhet (sentiment scores and NRC emotion classification). Install them all with: install.packages(c('tm', 'SnowballC', 'wordcloud', 'RColorBrewer', 'ggplot2', 'syuzhet')). For reading various file formats: readr for CSV, readtext for Word documents and PDFs.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">3. What is the NRC Word-Emotion Association Lexicon?<\/h3>\n            <div class=\"faq-answer\">\n                <p>The NRC Emotion Lexicon (EmoLex) is a crowd-sourced lexicon of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, disgust) and two sentiments (positive, negative). Each word is tagged with 0 or 1 for each emotion\/sentiment category. In R, the syuzhet package&#8217;s get_nrc_sentiment() function looks up each word in the text against the NRC lexicon and returns counts for each emotion category. It provides richer analysis than positive\/negative scoring alone &#8211; useful for understanding the emotional tenor of customer feedback or survey responses.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">4. How do I generate a word cloud in R from text data?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Load the wordcloud and tm packages. Create a Corpus from your text data, clean it (tm_map for punctuation, stop words, whitespace), build a TermDocumentMatrix, and convert it to a matrix. 
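<\/p>\n                <p>Those first steps can be sketched as follows. This is a minimal example, and the survey responses in the text vector are invented for illustration:<\/p>\n<pre class=\"wp-block-preformatted\">library(tm)\nlibrary(wordcloud)\nlibrary(RColorBrewer)\n\n#Hypothetical character vector of survey responses\ntext &lt;- c(\"The team is great\", \"We trust our leaders\", \"Deadlines cause stress\")\n\n#Build a corpus and clean it\ndocs &lt;- Corpus(VectorSource(text))\ndocs &lt;- tm_map(docs, content_transformer(tolower))\ndocs &lt;- tm_map(docs, removePunctuation)\ndocs &lt;- tm_map(docs, removeNumbers)\ndocs &lt;- tm_map(docs, removeWords, stopwords(\"english\"))\ndocs &lt;- tm_map(docs, stripWhitespace)\n\n#Build the term-document matrix, then convert it to a matrix\ntdm &lt;- TermDocumentMatrix(docs)\nm &lt;- as.matrix(tdm)<\/pre>\n                <p>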
Sort words by frequency: word_freq = sort(rowSums(as.matrix(tdm)), decreasing = TRUE). Call wordcloud(words = names(word_freq), freq = word_freq, min.freq = 3, random.order = FALSE, colors = brewer.pal(8, 'Dark2')). Adjust min.freq to control how many words appear, and the scale parameter to control word sizes, based on your dataset size.<\/p>\n            <\/div>\n            <\/section>\n","protected":false},"excerpt":{"rendered":"<p>Text mining and sentiment analysis with R: install the tm package, clean text data, build term-document matrices, generate word clouds, analyse word associations, calculate sentiment scores, and classify emotions using the NRC lexicon.&hellip;<\/p>\n","protected":false},"author":317671,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":true,"footnotes":""},"categories":[143528,47,53,1],"tags":[95509],"coauthors":[101710],"class_list":["post-87138","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","category-data-science","category-featured","category-uncategorized","tag-standardize"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/87138","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/317671"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=87138"}],"version-history":[{"count":12,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/87138\/revisions"}],"predecessor-version":[{"id":109850,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/87138\/revisions\/109850"}],"wp:attachment":[{"href":"https:\/\/
www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=87138"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=87138"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=87138"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=87138"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}