{"id":93751,"date":"2022-04-06T19:58:31","date_gmt":"2022-04-06T19:58:31","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=93751"},"modified":"2022-04-27T21:27:23","modified_gmt":"2022-04-27T21:27:23","slug":"deeper-insights-topic-modeling","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/deeper-insights-topic-modeling\/","title":{"rendered":"Finding deeper insights with Topic Modeling"},"content":{"rendered":"<p>Topic modeling is a powerful Natural Language Processing technique for finding relationships among data in text documents. It falls under the category of unsupervised learning and works by representing a text document as a collection of topics (set of keywords) that best represent the prevalent contents of that document. This article will focus on a probabilistic modeling approach called Latent Dirichlet Allocation (LDA), by walking readers through topic modeling using the team <a href=\"https:\/\/github.com\/SQLSuperGuru\/SimpleTalkDemo_R\/blob\/master\/Oracle\/TeamHealthRawDataForDemo.xlsx\">health demo dataset<\/a>. Demonstrations will use Python and a Jupyter notebook running on Anaconda. Please follow instructions from the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/development\/data-science-development\/sentiment-analysis-python\/\">\u201cInitial setup\u201d section of the previous article<\/a> to install Anaconda and set up a Jupyter notebook.<\/p>\n<p>The second article of this series, <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/bi-sql-server\/text-mining-and-sentiment-analysis-power-bi-visualizations\/\">Text Mining and Sentiment Analysis: Power BI Visualizations<\/a>, introduced readers to the Word Cloud, a common technique to represent the frequency of keywords in a body of text. Word Cloud is an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text. 
This technique is limited in its ability to discover underlying topics and themes in the text, because it only relies on the frequency of keywords to determine their popularity. Topic modeling overcomes these limitations and uncovers deeper insights from text data using statistical modeling for discovering the topics (collection of words) that occur in text documents.<\/p>\n<h2>Topic modeling with LDA<\/h2>\n<p>Text data from surveys, reviews, social media posts, user feedback, customer complaints, etc. can contain insights valuable to businesses. It can reveal meaningful and actionable findings like top complaints from customers or user feedback for desired features in a product. Manually reading through a large volume of text to compile topics that reveal such valuable insights is neither practical nor scalable. Furthermore, basic <a href=\"https:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf\"><em>tf-idf schemes<\/em><\/a> and techniques like keywords, key phrases, or word cloud, which rely on word frequency, are severely limited in their ability to discover topics.<\/p>\n<p>Latent Dirichlet Allocation (LDA) is a popular and powerful topic modeling technique that applies generative, probabilistic models over collections of text documents. LDA treats each document as a collection of topics, and each topic is composed of a collection of words based on their probability distribution. 
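To make this generative view concrete, here is a small sketch of the process LDA assumes, using NumPy with a hypothetical six-word vocabulary and two topics (illustrative values only, not the demo dataset):

```python
import numpy as np

# Toy sketch of the generative process LDA assumes (hypothetical
# vocabulary and topics, not the demo dataset).
rng = np.random.default_rng(42)

vocab = ["team", "fun", "work", "deploy", "test", "bug"]
# Each topic is a probability distribution over the vocabulary.
topic_word = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],  # topic 0: team morale
    [0.05, 0.02, 0.13, 0.30, 0.25, 0.25],  # topic 1: engineering work
])

# Each document is a mixture of topics, drawn from a Dirichlet prior.
doc_topics = rng.dirichlet(alpha=[0.5, 0.5])

# Generate a document: pick a topic for each word position,
# then draw a word from that topic's distribution.
words = []
for _ in range(8):
    z = rng.choice(2, p=doc_topics)          # topic assignment
    words.append(rng.choice(vocab, p=topic_word[z]))
print(doc_topics)
print(words)
```

Fitting LDA runs this story in reverse: given only the documents, it infers plausible topic-word and document-topic distributions.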
Please refer to <a href=\"https:\/\/www.jmlr.org\/papers\/volume3\/blei03a\/blei03a.pdf\">this paper<\/a> in the Journal of Machine Learning research to learn more about LDA.<\/p>\n<h2>Demo environment setup<\/h2>\n<p>After following instructions from the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/development\/data-science-development\/sentiment-analysis-python\/\">\u201cInitial setup\u201d section of the previous article<\/a> to install Anaconda and set up a Jupyter notebook, return to <em>Anaconda Navigator<\/em> and launch <em>CMD.exe Prompt<\/em>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93780 size-full\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/Anaconda.png\" alt=\"An image showing Anaconda Navigator pointing to the CMDexe Prompt\" width=\"1281\" height=\"560\" \/><\/p>\n<p><strong>Figure 1. Launch CMD.exe Prompt <\/strong><\/p>\n<h3>Install spaCy<\/h3>\n<p><a href=\"https:\/\/spacy.io\/\"><em>spaCy<\/em><\/a> is a free, open-source library for Natural Language Processing in Python with features for common tasks like tagging, parsing, Named Entity Recognition (NER), lemmatization, etc. This article will use <em>spaCy<\/em> for lemmatization, which is the process of converting words to their root. For example, the lemma of words like \u2018walking\u2019, \u2018walks\u2019 is \u2018walk\u2019. Lemmatization uses contextual vocabulary and morphological analysis to produce better outcomes than stemming. 
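The contrast is easy to see in a toy comparison; the snippet below pairs a naive suffix stemmer with a small, hypothetical lemma lookup table (purely illustrative — spaCy's lemmatizer uses full vocabulary and morphological analysis rather than a lookup like this):

```python
# Toy contrast between crude suffix stemming and dictionary-based
# lemmatization (hypothetical lookup table, for illustration only).
def naive_stem(word):
    for suffix in ("ing", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

lemma_lookup = {"walking": "walk", "walks": "walk", "better": "good", "was": "be"}

def toy_lemmatize(word):
    return lemma_lookup.get(word, word)

for w in ["walking", "walks", "better", "was"]:
    print(w, "->", naive_stem(w), "vs", toy_lemmatize(w))
```

The stemmer handles 'walking' and 'walks' but can do nothing with irregular forms like 'better' or 'was', which is where lemmatization pays off.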
You can learn more about lemmatization and stemming <a href=\"https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/stemming-and-lemmatization-1.html\">here<\/a>.<\/p>\n<p>Run the following code in the <em>CMD.exe<\/em> Prompt to install (or update, if already installed) the <em>spaCy<\/em> package, along with its prerequisites.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">\tconda install -c conda-forge spacy<\/pre>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93781 size-full\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/install-packages.png\" alt=\"An image showing the command line installing conda-forge spacy\" width=\"871\" height=\"758\" \/><\/p>\n<p><strong>Figure 2. Install spaCy package<\/strong><\/p>\n<p>Run the following code in the <em>CMD.exe<\/em> Prompt to download the <code>en_core_web_sm<\/code> trained pipeline for the English language.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">\tpython -m spacy download en_core_web_sm<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93782 size-full\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/spacy.png\" alt=\"An image showing a command line installing en_core_web_sm\" width=\"931\" height=\"431\" \/><\/p>\n<p><strong>Figure 3. 
Download en_core_web_sm trained pipeline for English<\/strong><\/p>\n<p>Please refer to <a href=\"https:\/\/spacy.io\/usage\">spaCy Installation guide<\/a> for detailed instructions and troubleshooting of common issues.<\/p>\n<h3>Install wordcloud<\/h3>\n<p>The wordcloud package is used to create the word cloud visualization, where the size of each keyword indicates its relative frequency within a particular text document.<\/p>\n<p>Run the following code in CMD.exe Prompt to install the wordcloud package.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">\tconda install -c conda-forge wordcloud<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93783 size-full\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/conda.png\" alt=\"An image showing command line installing wordcloud\" width=\"933\" height=\"657\" \/><\/p>\n<p><strong>Figure 4. Install wordcloud package<\/strong><\/p>\n<p>Restart the Jupyter notebook Kernel so it can use the newly installed spaCy and wordcloud packages.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93784 size-full\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/restart.png\" alt=\"An image showing a Jupyter Notebook and how to restart and clear the output\" width=\"851\" height=\"259\" \/><\/p>\n<p><strong>Figure 5. Restart Jupyter notebook Kernel<\/strong><\/p>\n<h3>Install Gensim and pyLDAvis<\/h3>\n<p><a href=\"https:\/\/radimrehurek.com\/gensim\/\">Gensim<\/a> is an open-source library for Natural Language Processing focusing on performing unsupervised topic modeling. 
The demo code in this article uses features specific to gensim version 3.8.3 and may not work as expected with other versions of this library.<\/p>\n<p>pyLDAvis is an open-source package to build interactive web-based visualizations.<\/p>\n<p>Run the following commands in the first cell of the Jupyter notebook to install gensim and pyLDAvis.<\/p>\n<pre class=\"lang:ps theme:powershell-ise  \">!pip install gensim==3.8.3\r\n!pip install pyLDAvis<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93785\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/gensim.png\" alt=\"An image showing how to install gensim and pyLDavis from a notebook\" width=\"1049\" height=\"545\" \/><\/p>\n<p><strong>Figure 6. Install gensim and pyLDAvis from Jupyter notebook<\/strong><\/p>\n<h3>Import Natural Language Processing modules<\/h3>\n<p>This section introduces readers to the modules used for Natural Language Processing (NLP) in this article.<\/p>\n<ul>\n<li>The <a href=\"https:\/\/docs.python.org\/3\/library\/re.html\">re<\/a> module provides operations for regular expression matching, useful for pattern and string search.<\/li>\n<li><a href=\"https:\/\/pandas.pydata.org\/\">pandas<\/a> is one of the most widely used open-source tools for data manipulation and analysis. Developed in 2008, pandas provides an incredibly fast and efficient object with integrated indexing called the DataFrame. It comes with tools for reading and writing data from and to files and SQL databases. 
It can manipulate, reshape, filter, aggregate, merge, join and pivot large datasets and is highly optimized for performance.<\/li>\n<li><a href=\"https:\/\/numpy.org\/\">NumPy<\/a> is a library for Python that offers comprehensive mathematical functions that can operate upon large, multi-dimensional arrays and matrices.<\/li>\n<li>The <a href=\"https:\/\/docs.python.org\/3\/library\/pprint.html\">pprint<\/a> module in Python provides the ability to pretty-print arbitrary Python data structures that are otherwise hard to visualize.<\/li>\n<li><a href=\"https:\/\/radimrehurek.com\/gensim\/\">Gensim<\/a> is an open-source library for Natural Language Processing focusing on performing unsupervised topic modeling.<\/li>\n<li><a href=\"https:\/\/spacy.io\/\">spaCy<\/a> is a free open-source library for Natural Language Processing in Python with features for common tasks like tagging, parsing, Named Entity Recognition (NER), lemmatization, etc.<\/li>\n<li><a href=\"https:\/\/pyldavis.readthedocs.io\/en\/latest\/readme.html\">pyLDAvis<\/a> parses the output of a fitted LDA topic model into a user-friendly interactive web-based visualization. It enables data scientists to interpret the topics in a fitted topic model.<\/li>\n<li><a href=\"https:\/\/matplotlib.org\/stable\/index.html\">matplotlib<\/a> is an easy-to-use, popular, and comprehensive library in Python for creating visualizations. It supports basic plots (like line, bar, scatter, etc.), plots of arrays &amp; fields, statistical plots (like histogram, boxplot, violin, etc.), and plots with unstructured coordinates.<\/li>\n<li><a href=\"https:\/\/www.nltk.org\/\">Natural Language Toolkit<\/a>, commonly known as NLTK, is a comprehensive open-source platform for building applications to process human language data. It comes with powerful text processing libraries for typical Natural Language Processing (NLP) tasks like cleaning, parsing, stemming, tagging, tokenization, classification, semantic reasoning, etc. 
NLTK has user-friendly interfaces to several popular corpora and lexical resources, Word2Vec, WordNet, VADER Sentiment Lexicon, etc.<\/li>\n<li><a href=\"https:\/\/amueller.github.io\/word_cloud\/\">Wordcloud<\/a> library is used to create the word cloud visualization<\/li>\n<\/ul>\n<p>Run this code snippet in the next cell of the Jupyter notebook to load the necessary modules. Please note that this step may need a few minutes to complete.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">import re\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom pprint import pprint\r\n# Gensim\r\nimport gensim\r\nimport gensim.corpora as corpora\r\nfrom gensim.utils import simple_preprocess\r\nfrom gensim.models import CoherenceModel\r\n# spacy for lemmatization\r\nimport spacy\r\n# Plotting tools\r\nimport pyLDAvis\r\nimport pyLDAvis.gensim_models as gensimvis\r\nimport matplotlib.pyplot as plt\r\n%matplotlib inline\r\n# NLTK Stop words\r\nimport nltk\r\nnltk.download('stopwords')\r\nfrom nltk.corpus import stopwords\r\nstop_words = stopwords.words('english')\r\nstop_words.extend(['us', 're'])\r\n# load spacy\r\nimport en_core_web_sm\r\nnlp = en_core_web_sm.load()\r\n#wordcloud\r\nfrom wordcloud import WordCloud, STOPWORDS, ImageColorGenerator<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93758\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-text-description-automa.png\" alt=\"An image showing a Jupyter notebook and all the import statements needed\" width=\"1315\" height=\"879\" \/><\/p>\n<p><strong>Figure 7. Load NLP modules<\/strong><\/p>\n<h2>Load and clean demo dataset<\/h2>\n<p>This step uses <a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.read_excel.html\">read_excel<\/a> method from pandas to load the demo input datafile into a pandas dataframe. 
The code below also cleans text in the <code>Response<\/code> field by:<\/p>\n<ul>\n<li>Converting the <code>Response<\/code> field to string datatype<\/li>\n<li>Converting all text to lowercase<\/li>\n<li>Removing all non-alphabet characters<\/li>\n<\/ul>\n<p>Run this code snippet in the next cell of the Jupyter notebook.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Import input data file and clean up response text\r\n# Be sure to change the file path\r\ndf = pd.read_excel (r'C:\\Users\\mhatres\\Documents\\demo\\TeamHealthRawDataForDemo.xlsx')\r\nprint (df.head(10))\r\ndf2 = df[['Response']]\r\n#print (df2.head(10)) \r\n# convert to string\r\ndf3 = df2['Response'].apply(str)\r\n# convert to lowercase\r\ndf4 = df3.str.casefold()\r\n# remove all non-alphabet characters\r\n# regex=True keeps this working on pandas versions where the default changed\r\ndf5 = df4.str.replace(\"[^a-zA-Z#]\", \" \", regex=True)\r\nprint (df5.head(10))<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93759\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-text-description-automa-1.png\" alt=\"An image showing how to load and clean the dataset\" width=\"1447\" height=\"917\" \/><\/p>\n<p><strong>Figure 8. Load and clean demo dataset<\/strong><\/p>\n<h2>Generate a Word Cloud<\/h2>\n<p>The word cloud visualization shows which keywords occur most frequently in the given text. Words with higher frequency are indicated by their larger size in the visual. 
Run this code snippet in the next cell of the Jupyter notebook.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Create and generate a word cloud image:\r\ntext = df5[0]\r\nwordcloud = WordCloud().generate(text)\r\n# Display the generated image:\r\nplt.imshow(wordcloud, interpolation='bilinear')\r\nplt.axis(\"off\")\r\nplt.show()<\/pre>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93760\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-text-application-email.png\" alt=\"An image showing how to generate a word cloud and the word cloud. The big words are Together, works, fun, well, team\" width=\"1564\" height=\"791\" \/><\/p>\n<p><strong>Figure 9. Word Cloud visual<\/strong><\/p>\n<p>This word cloud visual indicates \u2018together\u2019 is the most frequent keyword in this text, followed by \u2018works,\u2019 \u2018fun,\u2019 \u2018well,\u2019 and \u2018team\u2019. While these words have positive connotations, it\u2019s difficult to gain deeper insights from this word cloud.<\/p>\n<h2>Topic Modeling<\/h2>\n<p>Topic modeling using LDA involves several steps and, in its most basic form, can be an iterative approach, with opportunities for further automating the process of running iterations. This section walks readers through the process of gaining deeper insights from this demo dataset using topic modeling.<\/p>\n<h3>Step 1: Tokenization<\/h3>\n<p><a href=\"https:\/\/aclanthology.org\/C92-4173.pdf\">Tokenization<\/a> in Natural Language Processing (NLP) is the process of separating a body of text into smaller units called tokens. Tokenization can be performed at \u2018word,\u2019 \u2018character,\u2019 and \u2018sub-word (<a href=\"https:\/\/en.wikipedia.org\/wiki\/N-gram\">n-gram character<\/a>)\u2019 levels. Tokens are considered the building blocks of any natural language, and popularly used NLP techniques work at the token level. 
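As a quick illustration of those three levels, the same short sentence can be tokenized with plain Python (a simplified sketch; the gensim `simple_preprocess` call used below performs word-level tokenization with additional cleanup):

```python
import re

sentence = "Team works well together"

# Word-level tokens
word_tokens = re.findall(r"[a-zA-Z]+", sentence.lower())

# Character-level tokens
char_tokens = list(sentence.replace(" ", ""))

# Sub-word tokens: character trigrams of each word (one flavor of n-gram)
char_trigrams = [w[i:i + 3] for w in word_tokens for i in range(len(w) - 2)]

print(word_tokens)        # ['team', 'works', 'well', 'together']
print(char_trigrams[:4])  # ['tea', 'eam', 'wor', 'ork']
```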
Tokenization is usually a fundamental step in most Natural Language Processing projects.<\/p>\n<p>A sentence or phrase composed of two words is called a bigram, and one composed of three words is called a trigram. Common examples of such phrases in this demo text include \u2018fun team,\u2019 \u2018small overhead,\u2019 and \u2018constantly learning together.\u2019<\/p>\n<p>Run the following code snippet in the next cell of the Jupyter notebook to tokenize words and create bigram\/trigram models.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Convert dataframe to list and tokenize words\r\ndata = df5.values.tolist()\r\n \r\ndef sent_to_words(sentences):\r\n    for sentence in sentences:\r\n        yield(gensim.utils.simple_preprocess(str(sentence)))  \r\ndata_words = list(sent_to_words(data))\r\n# Build the bigram and trigram models\r\nbigram = gensim.models.Phrases(data_words, min_count=3, threshold=10) \r\ntrigram = gensim.models.Phrases(bigram[data_words], threshold=8)  \r\nbigram_mod = gensim.models.phrases.Phraser(bigram)\r\ntrigram_mod = gensim.models.phrases.Phraser(trigram)\r\n# print a sample \r\nprint(trigram_mod[bigram_mod[data_words[3]]])<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93786 size-full\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/bigrams.png\" alt=\"An image showing how to generate the bigrams and trigrams. The bigrams are pointed out pretty_good and lack_of\" width=\"923\" height=\"463\" \/><\/p>\n<p><strong>Figure 10. word tokenization, bigrams, and trigrams <\/strong><\/p>\n<div class=\"note\">\n<p><em>NOTE: Depending on your versions of python and various packages\/libraries, your results may be different here and throughout the rest of the article.<\/em><\/p>\n<\/div>\n<p>You can tune the parameters of <code>min_count<\/code> and <code>threshold<\/code> and re-run this cell multiple times to arrive at a reasonable output sample. 
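To build intuition for these two parameters, the sketch below reproduces the kind of count-based scoring gensim's Phrases applies when deciding whether to promote a word pair to a bigram (my reading of its default scorer, with hypothetical counts; treat the formula as an approximation):

```python
# Rough sketch of the count-based scoring gensim's Phrases applies
# (my reading of its default scorer; all counts below are hypothetical).
def phrase_score(pair_count, count_a, count_b, vocab_size, min_count):
    # Pairs must co-occur more than min_count times; rarer constituent
    # words (smaller count_a / count_b) push the score higher.
    return (pair_count - min_count) * vocab_size / (count_a * count_b)

vocab_size = 500   # hypothetical corpus vocabulary size
# 'fun team' occurs 12 times; 'fun' 30 times; 'team' 80 times
score = phrase_score(12, 30, 80, vocab_size, min_count=3)
print(round(score, 3))  # the pair becomes 'fun_team' only if score >= threshold
```

Under these assumed counts the score is 1.875, so the article's threshold of 10 would not promote the pair; rarer constituent words or a lower threshold lift more pairs over the bar, matching the tuning behavior described above.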
The ability of these models to identify larger quantities of bigrams\/trigrams diminishes as these parameters are set to higher values; however, the quality of the ones identified can improve. The sweet spot can be identified after a few iterations.<\/p>\n<h3>Step 2: Stop words, n-grams, and lemmatization<\/h3>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Stop_word\">Stop words<\/a> are any words that should be filtered out (typically by adding to a stop list) during natural language processing. In the English language, words which don\u2019t add much value to a sentence, and can be safely ignored without compromising its meaning, are typically included in a predefined stop list. Examples of stop words in English include \u2018a,\u2019 \u2018the,\u2019 \u2018have,\u2019 etc. Stop lists can be augmented with custom stop words as needed.<\/p>\n<p>Run this code snippet in the next cell of the Jupyter notebook to define functions for removing stop words, creating n-grams, and performing lemmatization.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Define functions for stopwords, n-grams and lemmatization\r\ndef remove_stopwords(texts):\r\n    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]\r\ndef make_bigrams(texts):\r\n    return [bigram_mod[doc] for doc in texts]\r\ndef make_trigrams(texts):\r\n    return [trigram_mod[bigram_mod[doc]] for doc in texts]\r\ndef lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):\r\n    \"\"\"https:\/\/spacy.io\/api\/annotation\"\"\"\r\n    texts_out = []\r\n    for sent in texts:\r\n        doc = nlp(\" \".join(sent)) \r\n        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])\r\n    return texts_out<\/pre>\n<p>Then run this code snippet to call these functions in order.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># call the functions created above\r\n# Remove Stop Words\r\ndata_words_nostops = 
remove_stopwords(data_words)\r\n# Form Bigrams\r\ndata_words_bigrams = make_bigrams(data_words_nostops)\r\n# Do lemmatization keeping only noun, adj, vb, adv\r\ndata_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])\r\nprint(data_lemmatized[:1])<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93762\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-text-application-email-1.png\" alt=\"An image showing lemmatization of the words\" width=\"1422\" height=\"917\" \/><\/p>\n<p><strong>Figure 11. Stop words, n-grams, and lemmatization<\/strong><\/p>\n<h3>Step 3: Create dictionary and corpus<\/h3>\n<p>The LDA topic model needs a dictionary and a corpus as inputs. The dictionary is simply a collection of the lemmatized words. A unique id is assigned to each word in the dictionary and is used to map the frequency of each word and to produce a term document frequency corpus. A corpus (Latin for body) refers to a collection of texts.<\/p>\n<p>Run this code snippet in the next cell of the Jupyter notebook to create the dictionary, the term document frequency corpus, and view a sample of its contents.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Create Dictionary\r\nid2word = corpora.Dictionary(data_lemmatized)\r\n# Create Corpus\r\ntexts = data_lemmatized\r\n# Term Document Frequency\r\ncorpus = [id2word.doc2bow(text) for text in texts]\r\n# view corpus in human readable format\r\n[[(id2word[id], freq) for id, freq in cp] for cp in corpus[1:2]]<\/pre>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93779\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/word-freq.png\" alt=\"An image showing the frequency of the words. Expand is found once and product is found twice \" width=\"876\" height=\"755\" \/><\/p>\n<p><strong>Figure 12. 
Dictionary and corpus of word term-frequency<\/strong><\/p>\n<h3>Step 4: Build the LDA topic model<\/h3>\n<p>This section trains an LDA model from the Gensim library using the <a href=\"https:\/\/radimrehurek.com\/gensim\/models\/ldamodel.html\">models.ldamodel<\/a> module.<\/p>\n<ul>\n<li><code>corpus<\/code> and <code>id2word<\/code> (dictionary) are the two key input parameters prepared in the previous steps<\/li>\n<li><code>num_topics<\/code> parameter specifies the number of topics to be extracted from the input corpus. Set this value to 2 initially. I will iterate through a few values of this parameter to find the optimal topic model<\/li>\n<li><code>update_every<\/code> parameter determines how often the model parameters should be updated as several rounds of training passes are made. Set it to 1<\/li>\n<li><code>chunksize<\/code> parameter specifies the number of documents to be used in each training chunk. Set it to 100<\/li>\n<li><code>passes<\/code> parameter determines the number of training passes. Set it to 10<\/li>\n<li><code>per_word_topics<\/code> is set to True, which tells the model to compute a list of topics sorted in the descending order of most likely topics for each word<\/li>\n<li><code>alpha<\/code> is an optional parameter related to document-topic distribution. 
Setting it to <code>auto<\/code> allows the model to learn an asymmetric prior from the corpus<\/li>\n<\/ul>\n<p>Run this code snippet in the next cell of the Jupyter notebook to train the LDA topic model and print keywords for each topic along with their importance scores.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Build LDA model\r\nlda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,\r\n                                           id2word=id2word,\r\n                                           num_topics=2, # change this number and re-run as needed\r\n                                           random_state=100,\r\n                                           update_every=1,\r\n                                           chunksize=100,\r\n                                           passes=10,\r\n                                           alpha='auto',\r\n                                           per_word_topics=True)\r\n# Print the Keyword in the topics\r\npprint(lda_model.print_topics())\r\ndoc_lda = lda_model[corpus]<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93778\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/image-for-Jupyter.png\" alt=\"An image showing how to generate the topic number, importance score and keyword\" width=\"997\" height=\"539\" \/><\/p>\n<p><strong>Figure 13. 
Train LDA topic model and infer extracted topics<\/strong><\/p>\n<ul>\n<li>The output shows two topics (topic 0 and topic 1), along with the top ten keywords within each topic and their importance scores<\/li>\n<li>The keywords of topic 0 seem to indicate \u2018good overall team health\u2019<\/li>\n<li>The keywords of topic 1 seem to indicate \u2018great work, feel good support (in) team, lot (of) fun\u2019<\/li>\n<li>This manual interpretation of topics is commonly known as \u2018labeling\u2019 or \u2018tagging\u2019 of extracted topics<\/li>\n<\/ul>\n<p>Since this is an unsupervised learning technique, it remains unclear if two is the right number of topics present in this text. The subsequent steps will help answer the question: \u201cDoes this text document have any more useful topics to extract?\u201d<\/p>\n<h3>Step 5: Compute model performance metrics<\/h3>\n<p><a href=\"https:\/\/towardsdatascience.com\/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0\">Model perplexity and topic coherence<\/a> are useful metrics to evaluate the performance of a trained topic model.<\/p>\n<p>The model perplexity measures how perplexed or surprised a model is when it encounters new data. Measured as a normalized log-likelihood of a held-out test set, it\u2019s an intrinsic metric widely used for language model evaluation. While a perplexity value closer to zero indicates a better model, optimizing for perplexity may not always lead to human readable topics.<\/p>\n<p>Topic Coherence measures the degree of semantic similarity between high-scoring words within the same topic. A set of words, phrases, or statements can be defined as \u2018coherent\u2019 if they support each other. 
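The arithmetic behind coherence is simple; the sketch below computes a UMass-style score for one topic from hypothetical document-frequency counts (the demo uses gensim's more involved c_v measure, but the idea is similar: top words that frequently co-occur in documents yield a better score):

```python
import math

# Toy sketch of a UMass-style coherence score for one topic
# (hypothetical document-frequency counts, not the demo data).
doc_freq = {"team": 40, "fun": 25, "work": 30}    # docs containing each word
co_doc_freq = {("fun", "team"): 20, ("work", "team"): 12, ("work", "fun"): 8}

top_words = ["team", "fun", "work"]               # the topic's top words
score = 0.0
for i, wi in enumerate(top_words):
    for wj in top_words[:i]:
        # pairs that co-occur in many documents contribute less-negative terms
        score += math.log((co_doc_freq[(wi, wj)] + 1) / doc_freq[wj])
print(round(score, 3))
```

UMass scores are negative log ratios, so values closer to zero indicate a more coherent set of top words.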
This metric can help to differentiate human-interpretable semantic topics from topics that are outcomes of statistical inference but have very little semantic value.<\/p>\n<p>Run this code snippet in the next cell of the Jupyter notebook to generate the model perplexity and coherence scores.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Compute Perplexity\r\nprint('\\nPerplexity: ', lda_model.log_perplexity(corpus)) \r\n# Compute Coherence Score\r\ncoherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')\r\ncoherence_lda = coherence_model_lda.get_coherence()\r\nprint('\\nCoherence Score: ', coherence_lda)\r\n# change value of num_topics and re-run, you should notice these scores change<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93765\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-text-application-email-3.png\" alt=\"An image showing how to generate the Perplexity and Coherence Scores\" width=\"1473\" height=\"564\" \/><\/p>\n<p><strong>Figure 14. Perplexity and Coherence scores<\/strong><\/p>\n<p>Make a note of the <em>Perplexity<\/em> and <em>Coherence<\/em> scores in Figure 14, as you will retrain the model with updated values for the <code>num_topics<\/code> parameter and recompute these metrics.<\/p>\n<h3>Step 6: Visualization<\/h3>\n<p>The pyLDAvis package is a great tool to generate an interactive chart to visualize the inter-topic distance map and examine the keywords for each topic. 
Run this code snippet in the next cell of the Jupyter notebook to create this chart.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Visualize the topics\r\npyLDAvis.enable_notebook()\r\nvis = gensimvis.prepare(lda_model, corpus, id2word)\r\nvis\r\n# change num_topics and re-run until intertopic distance chart looks good<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93766\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-chart-description-autom.png\" alt=\"An image showing the Intertopic Distance Map with two discrete bubbles\" width=\"1118\" height=\"838\" \/><\/p>\n<p><strong>Figure 15. pyLDAvis chart of modeled topics<\/strong><\/p>\n<ul>\n<li>The <em>Intertopic Distance Map<\/em> on the left half of this chart represents each topic as a bubble, whose size correlates to the prevalence of its topic within the text document. An optimal topic model is represented by large, non-overlapping bubbles that are scattered throughout the chart.<\/li>\n<\/ul>\n<p>A poor topic model has many small bubbles that are overlapping and\/or clustered in one region of the chart. You can retrain the model by incrementing <code>num_topics<\/code> and recreating this visual, as well as use the model performance metrics to find the optimal number of topics.<\/p>\n<ul>\n<li>The Relevant terms per topic panel on the right half of this chart shows the top 30 terms (words) per topic, along with the percentage prevalence of the chosen topic. It\u2019s an interactive <a href=\"https:\/\/chartio.com\/learn\/charts\/stacked-bar-chart-complete-guide\/\">stacked bar chart<\/a> where each blue bar represents the overall frequency of a term (word) within the document. 
When you select a topic by clicking on one of the bubbles on the left side, overlapping red bars appear on the right side, indicating the estimated frequency of each term within that topic.<\/li>\n<li>While your output may look different, please make a note of the top ten words for each topic.<\/li>\n<\/ul>\n<h3>Step 7: Retrain topic model iteratively<\/h3>\n<p>This step is an iterative process involving:<\/p>\n<ul>\n<li>Incrementing the value of the <code>num_topics<\/code> parameter and rebuilding the LDA topic model (step 4)<\/li>\n<li>Noting the values of the model performance metrics (step 5)<\/li>\n<li>Generating the pyLDAvis chart and studying the Intertopic Distance Map (step 6)<\/li>\n<li>Repeating the above three steps until the Intertopic Distance Map looks optimal<\/li>\n<\/ul>\n<p>Run this code snippet in the next cell of the Jupyter notebook.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Retrain LDA model\r\nlda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,\r\n                                           id2word=id2word,\r\n                                           num_topics=3, # change this number and re-run as needed\r\n                                           random_state=100,\r\n                                           update_every=1,\r\n                                           chunksize=100,\r\n                                           passes=10,\r\n                                           alpha='auto',\r\n                                           per_word_topics=True)\r\n# Regenerate model performance metrics\r\nprint('\\nPerplexity: ', lda_model.log_perplexity(corpus)) \r\ncoherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')\r\ncoherence_lda = coherence_model_lda.get_coherence()\r\nprint('\\nCoherence Score: ', coherence_lda)\r\n# Recreate topic visualization\r\npyLDAvis.enable_notebook()\r\nvis = gensimvis.prepare(lda_model, corpus, id2word)\r\nvis<\/pre>\n<p><img 
loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93767\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-text-application-email-4.png\" alt=\"An image showing that the num_topics value can be change, in this case, to 3\" width=\"1682\" height=\"924\" \/><\/p>\n<p><strong>Figure 16. Retrain LDA topic model with num_topics = 3<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93768\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/chart-description-automatically-generated.png\" alt=\"An image showing the Intertopic Distance Map with three bubbles\" width=\"1340\" height=\"907\" \/><\/p>\n<p><strong>Figure 17. pyLDAvis chart for three topics<\/strong><\/p>\n<p>Repeat the above steps by setting value of <code>num_topics<\/code> parameter to four (4).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93769\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/graphical-user-interface-text-email-description.png\" alt=\"An image showing the num_topics changed to 4\" width=\"1251\" height=\"776\" \/><\/p>\n<p><strong>Figure 18. Retrain LDA topic model with num_topics = 4<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93770\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/chart-bubble-chart-description-automatically-gen.png\" alt=\"An image showing the Intertopic distance map with 4 bubbles. This time, two of them overlap\" width=\"1361\" height=\"899\" \/><\/p>\n<p>Overlapping bubbles indicate a poor model<\/p>\n<p><strong>Figure 19. pyLDAvis chart for four topics<\/strong><\/p>\n<p>The Intertopic Distance Map shows bubbles for topic number 1 and 3 are overlapping, which indicates I have overshot the optimal number of topics. 
At this point, I can stop running iterations and analyze their outputs.<\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p><strong>Num_topics<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>Model perplexity<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>Topic Coherence<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>Intertopic Distance Map<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>2<\/p>\n<\/td>\n<td>\n<p>-6.089<\/p>\n<\/td>\n<td>\n<p>0.221<\/p>\n<\/td>\n<td>\n<p>Two large bubbles well-spaced across chart quadrants<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>3<\/p>\n<\/td>\n<td>\n<p>-6.174<\/p>\n<\/td>\n<td>\n<p>0.245<\/p>\n<\/td>\n<td>\n<p>Three large bubbles well-spaced across chart quadrants<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>4<\/p>\n<\/td>\n<td>\n<p>-6.253<\/p>\n<\/td>\n<td>\n<p>0.274<\/p>\n<\/td>\n<td>\n<p>Three large bubbles and one small. Bubbles for topics 1 and 2 are overlapping<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>Figure 20. Table of observations for three iterations<\/strong><\/p>\n<p>The Model perplexity and Topic Coherence metrics seem to indicate that the model performs better as the value of the <code>num_topics<\/code> parameter increases. However, the Intertopic Distance Map is optimal when the <code>num_topics<\/code> parameter is set to 3. These factors lead me to conclude that three is the optimal number of topics I can extract from this text document.<\/p>\n<p>Your Intertopic Distance Map might not show overlapping bubbles for <code>num_topics<\/code> = 4, and you may need to continue running more iterations by incrementing <code>num_topics<\/code> until your Intertopic Distance Map shows overlapping bubbles. In this case, it\u2019s helpful to make a note of the top ten keywords for each topic in each iteration, so you can identify if keywords are repeating between different topics within the same iteration.<\/p>\n<h3>Step 8. 
Infer topic labels<\/h3>\n<p>After identifying the optimal number of topics, the next step is to infer human-readable labels for each topic using their frequent terms. This step is not an exact science and typically benefits from a good understanding of the business context of your dataset.<\/p>\n<p>Revisit the pyLDAvis chart for the iteration where <code>num_topics<\/code> is set to 3, then click on bubble 1.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93771\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/chart-bubble-chart-description-automatically-gen-1.png\" alt=\"An image showing the map with 3 bubbles. Bubble #1 is clicked so it shows the top 30 most relevant terms for that topic\" width=\"1312\" height=\"866\" \/><\/p>\n<p><strong>Figure 21. Three topic pyLDAvis chart, highlighting terms for topic 1<\/strong><\/p>\n<p>This figure focuses on the bubble for topic 1 and indicates:<\/p>\n<ul>\n<li>Topic 1 includes 43.4 % of tokens found in the text document<\/li>\n<li>The top 10 terms of topic 1 are used to infer the label \u2018Overall Team (health) feel(s) good, positive (and) green\u2019 (I have used business context knowledge of the Team Health survey process to associate \u2018green\u2019 with good team health).<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93772\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/chart-description-automatically-generated-with-me.png\" alt=\"An image showing three bubbles and bubble #2 is clicked to show its top 30 terms\" width=\"1307\" height=\"866\" \/><\/p>\n<p><strong>Figure 22. 
Three topic pyLDAvis chart, highlighting terms for topic 2<\/strong><\/p>\n<p>This figure focuses on the bubble for topic 2 and indicates:<\/p>\n<ul>\n<li>Topic 2 includes 36.9 % of tokens found in the text document<\/li>\n<li>The top 10 terms of topic 2 are used to infer the label \u2018Great work, good support (and) lot (of) fun\u2019<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-93773\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/04\/chart-bubble-chart-description-automatically-gen-2.png\" alt=\"bubble #3 is selected to show its top terms\" width=\"1309\" height=\"865\" \/><\/p>\n<p><strong>Figure 23. Three topic pyLDAvis chart, highlighting terms for topic 3<\/strong><\/p>\n<p>This figure focuses on the bubble for topic 3 and indicates:<\/p>\n<ul>\n<li>Topic 3 includes 19.8 % of tokens found in the text document<\/li>\n<li>The top 10 terms of topic 3 are used to infer the label \u2018Room (for) improvement\u2019<\/li>\n<\/ul>\n<p>As noted in Step 7, keep a record of the top ten keywords for each topic in each iteration. If keywords repeat between different topics within the same iteration, you have likely overshot the optimal number of topics, even if the Intertopic Distance Map looks good.<\/p>\n<h2>Automation<\/h2>\n<p>This use case is simple and needs only three iterations to arrive at an optimal solution. A very complex use case may need several tens of iterations to find the optimal topic model. Running so many iterations manually can become a tedious chore. 
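<\/p>\n<p>The bookkeeping for such an automated run can be sketched in a few lines of plain Python. Here <code>train_and_score<\/code> is a hypothetical stand-in for the <code>LdaModel<\/code> training and <code>CoherenceModel<\/code> scoring code shown earlier, so the harness itself stays simple:<\/p>\n<pre class=\"lang:ps theme:powershell-ise\"># Collect one coherence score per candidate topic count,\r\n# ready to plot as a line chart against num_topics\r\ndef coherence_curve(train_and_score, lo=2, hi=30):\r\n    return {k: train_and_score(k) for k in range(lo, hi + 1)}\r\n\r\n# Stub standing in for real LdaModel \/ CoherenceModel runs\r\ndemo_scores = {2: 0.221, 3: 0.245, 4: 0.274}\r\ncurve = coherence_curve(lambda k: demo_scores[k], lo=2, hi=4)\r\nprint(curve)  # prints {2: 0.221, 3: 0.245, 4: 0.274}<\/pre>\n<p>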
There are a few options to automate the process of running iterations:<\/p>\n<ul>\n<li>Program a Loop\n<ul>\n<li>Write a loop to iterate <code>num_topics<\/code> from 2 to 30 (or any upper value of your choice, based on your business context knowledge of the dataset)<\/li>\n<li>Plot the model performance metrics as a line chart against <code>num_topics<\/code> to identify the value where these metrics stop improving<\/li>\n<li>Save the pyLDAvis chart for each iteration in a folder and review the Intertopic Distance Maps to find the optimal number of human-readable topics<\/li>\n<\/ul>\n<\/li>\n<li>LDA Mallet Model\n<ul>\n<li><a href=\"https:\/\/www.tutorialspoint.com\/gensim\/gensim_creating_lda_mallet_model.htm\">Mallet<\/a> is an open-source toolkit for NLP with a package for LDA-based topic modeling<\/li>\n<li>Gensim provides a wrapper to facilitate Mallet\u2019s LDA topic model estimation and inference of topic distribution<\/li>\n<li>It handles running iterations without having to code a loop<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Exploring these automation options in detail is outside the scope of this article.<\/p>\n<h2>Compare Topic modeling and Word Cloud<\/h2>\n<p>The process of topic modeling with LDA helped discover deeper insights from the Team Health survey responses in the form of the following three topics:<\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p><strong>Topic Number<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>Percentage Composition of Tokens<\/strong><\/p>\n<\/td>\n<td>\n<p><strong>Topic Label<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>1<\/p>\n<\/td>\n<td>\n<p>43.3 %<\/p>\n<\/td>\n<td>\n<p>Overall Team health is good\/positive<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>2<\/p>\n<\/td>\n<td>\n<p>36.9 %<\/p>\n<\/td>\n<td>\n<p>Great work, lot of fun and supportive team<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>3<\/p>\n<\/td>\n<td>\n<p>19.8 %<\/p>\n<\/td>\n<td>\n<p>Some room for improvement<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>Figure 24. Summary of Insights from Topic modeling<\/strong><\/p>\n<p>These topics help deliver actionable analysis to business stakeholders. Combined with business context, subject matter expertise, and organizational knowledge, the following might be a good summary readout to a potential business stakeholder group, such as IT leadership:<\/p>\n<ul>\n<li>The prevalent consensus (43.3 %) amongst survey respondents indicates that overall Team health is good\/positive<\/li>\n<li>Over a third (36.9 %) of survey responses indicate that teams are supportive of their members, have a lot of fun, and find the work great (a positive environment for teamwork)<\/li>\n<li>Around one fifth (19.8 %) of survey responses hint at some room for improvement<\/li>\n<\/ul>\n<p>Comparing these richer insights against the Word Cloud from Figure 9 highlights the Word Cloud\u2019s shortcomings and helps readers gain an appreciation for the deeper insights uncovered by Topic Modeling.<\/p>\n<h2>Deeper Insights with Topic Modeling<\/h2>\n<p>This article walked readers through the process of Topic modeling with LDA and understanding the value of its deeper insights, especially when compared to easier techniques like Word Cloud. 
Through the course of this article, I demonstrated:<\/p>\n<ul>\n<li>Setup of Anaconda Jupyter notebook environment for performing topic modeling<\/li>\n<li>Data cleaning and preparation steps needed for topic modeling with LDA<\/li>\n<li>Iterative process of training topic models and identifying an optimal solution<\/li>\n<li>Interpreting human readable insights from topic model output charts<\/li>\n<li>Comparing these deeper insights with outcomes from the easier technique of word cloud<\/li>\n<li>Business value of topic modeling as a popular and practical Natural Language processing technique<\/li>\n<\/ul>\n<h2>References<\/h2>\n<ul>\n<li>Topic modeling &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Topic_model\">https:\/\/en.wikipedia.org\/wiki\/Topic_model<\/a><\/li>\n<li>Latent Dirichlet allocation &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Latent_Dirichlet_allocation\">https:\/\/en.wikipedia.org\/wiki\/Latent_Dirichlet_allocation<\/a><\/li>\n<li>LDA paper from Journal of Machine Learning Research &#8211; <a href=\"https:\/\/www.jmlr.org\/papers\/volume3\/blei03a\/blei03a.pdf\">https:\/\/www.jmlr.org\/papers\/volume3\/blei03a\/blei03a.pdf<\/a><\/li>\n<li>TF-IDF &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf\">https:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf<\/a><\/li>\n<li>Lemmatization and stemming &#8211; <a href=\"https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/stemming-and-lemmatization-1.html\">https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/stemming-and-lemmatization-1.html<\/a><\/li>\n<li>spaCy &#8211; <a href=\"https:\/\/spacy.io\/\">https:\/\/spacy.io\/<\/a><\/li>\n<li>tokenization &#8211; https:\/\/aclanthology.org\/C92-4173.pdf<\/li>\n<li>n-grams &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/N-gram\">https:\/\/en.wikipedia.org\/wiki\/N-gram<\/a><\/li>\n<li>Evaluate topic models using perplexity and coherence scores &#8211; <a 
href=\"https:\/\/towardsdatascience.com\/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0\">https:\/\/towardsdatascience.com\/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0<\/a><\/li>\n<li>pyLDAvis &#8211; <a href=\"https:\/\/pyldavis.readthedocs.io\/en\/latest\/readme.html\">https:\/\/pyldavis.readthedocs.io\/en\/latest\/readme.html<\/a><\/li>\n<li>Link to the Topic modeling jupyter notebook with code and results on my Github repository &#8211; <a href=\"https:\/\/github.com\/SQLSuperGuru\/SentimentAnalysis\/blob\/main\/TopicModeling.ipynb\">https:\/\/github.com\/SQLSuperGuru\/SentimentAnalysis\/blob\/main\/TopicModeling.ipynb<\/a><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Topic modeling can be used to find more detailed insights into text than a word cloud can provide. Sanil Mhatre walks you through an example using Python.&hellip;<\/p>\n","protected":false},"author":317671,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528,53,146042],"tags":[5134],"coauthors":[101710],"class_list":["post-93751","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","category-featured","category-python","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/93751","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/317671"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=93751"}],"version-history":[{"count":8,"href":"https:\/\/www.red-gate.com\/s
imple-talk\/wp-json\/wp\/v2\/posts\/93751\/revisions"}],"predecessor-version":[{"id":93790,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/93751\/revisions\/93790"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=93751"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=93751"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=93751"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=93751"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}