Sentiment Analysis of a Podcast

 

There has been an increase in the exploration of text as a rich data source. Quantifying textual data can reveal trends, patterns, and characteristics that might otherwise go unnoticed by human interpretation. Combining quantitative analyses with the computing capabilities of modern technology allows large amounts of text to be processed quickly.

Here we present a sentiment analysis of an intriguing form of text – podcast transcripts – to provide a discussion on the process of text analysis.

Podcast transcripts are a unique form of text because they were originally intended to be listened to, not read, creating a more intimate form of communication. The text used in this example consists of the transcripts of the “Serial” podcast hosted by Sarah Koenig. “Serial” explores the investigation and trial of Adnan Syed, who was accused of murdering his girlfriend in 1999. The podcast consists of 12 episodes averaging 43 minutes and 7,480 words each. Here we examine the 12 episodes together as a single text.

Sentiment analysis involves processing text data to identify, quantify, and extract subjective information from the text. Using tools from the tidytext package in R, we can examine the polarity (positive or negative tone) and the emotional associations (joy, anger, fear, trust, anticipation, surprise, disgust, and sadness) of the text. We present one method of sentiment analysis that involves referencing a sentiment dictionary, a list of words coded according to the objective. For examining polarity, each word is given a positive, negative, or neutral value; for examining emotions, each word is tagged with any associated emotions. As an example, the word “murder” is coded as negative and tagged with fear and sadness. We chose the NRC sentiment dictionary for this analysis as it is the only one that includes emotions, and it was created as a general-purpose lexicon that works well for different types of text.
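To make the dictionary-lookup step concrete, here is a minimal sketch in R using the tidytext package. The data frame serial_transcripts and its text column are hypothetical stand-ins for however the transcripts are stored; the NRC lexicon is downloaded the first time get_sentiments("nrc") is called.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per episode, with an `episode` id and a `text` column
# serial_transcripts <- ...

# Split the transcripts into one word per row and drop common stop words
serial_words <- serial_transcripts %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Tag each remaining word with its NRC polarity and emotion codes;
# a word can match several emotions, so it may appear on several rows
nrc <- get_sentiments("nrc")
word_sentiments <- serial_words %>%
  inner_join(nrc, by = "word")
```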

Starting with an overall visualization of the emotions and polarity of the podcast, a bar graph (Figure 1) displays the percentage of the text characterized by each emotion. Examining the text in this way reveals a particularly intriguing result: the most common emotion is trust, which may be surprising for a podcast about a murder investigation and trial. The next most common emotion is anticipation. This confirms what one may expect in the context of podcasts: hosts want to keep their listeners interested in the story, so anticipation plays a key role in getting people to listen regularly.
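A sketch of how such a bar graph could be produced from the coded words above, continuing the hypothetical word_sentiments data frame (the two polarity codes are set aside so only the eight emotions remain):

```r
library(ggplot2)

emotion_pct <- word_sentiments %>%
  filter(!sentiment %in% c("positive", "negative")) %>%  # keep only the eight emotions
  count(sentiment) %>%
  mutate(percent = 100 * n / sum(n))

# Bar graph of the percentage of emotion-coded words in each emotion
ggplot(emotion_pct, aes(x = reorder(sentiment, -percent), y = percent)) +
  geom_col() +
  labs(x = "Emotion", y = "Percent of emotion-coded words")
```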

Figure 2 shows that, overall, the text is positive, as a larger percentage of the words are coded as positive than negative. Looking more closely at which words occur most often within a specific sentiment or emotion, a sorted word cloud allows one to visually identify the most commonly used words coded as positive or negative.
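The counts behind such a display can be computed directly. A hedged sketch, again building on the hypothetical word_sentiments data frame, that pulls out the ten most frequent words coded positive and negative:

```r
# Most frequent words in each polarity category
top_polarity_words <- word_sentiments %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment, word, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup()

top_polarity_words
```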

The most frequently used negative words are crime, murder, kill, and calls. The most frequently used positive words are friend, talk, police, and pretty. It is important to examine the context of the most common words. Consider the word “pretty”: in the text, “pretty” was used as an adverb, not as an adjective (e.g., “I’m pretty sure I was in school. I think– no?”). All 53 instances of “pretty” in the text were used to convey uncertainty. However, the NRC dictionary defines and codes “pretty” as an adjective describing something as attractive. This mismatch between usage within the text and in the dictionary affects the sentiment analysis. One should carefully consider how to handle such words.

Similarly, we can examine each emotion in more detail. These graphs allow one to see which words are most represented within each emotion.

Figures: the most common words associated with each emotion.

This graph again illustrates the importance of critically examining the results. The word “don” is coded as a top positive word; however, in this text “Don” is a person’s name and, like the other names, should be coded as neutral. The NRC lexicon instead codes “don” as a noun referring to a gentleman or mentor. Similar concerns may arise for other words with multiple meanings. These words should be considered carefully, particularly if they are among the most frequently used words in the text.
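One simple way to handle such mismatches, sketched below, is to keep a short list of words to drop (or recode as neutral) before joining to the lexicon. The list here is illustrative only and would grow as more problem words are found.

```r
# Words whose NRC coding does not match their usage in this particular text
custom_stops <- tibble(word = c("pretty", "don"))

word_sentiments_clean <- serial_words %>%
  anti_join(custom_stops, by = "word") %>%
  inner_join(get_sentiments("nrc"), by = "word")
```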

These graphs show a few of the many ways to quantify and visualize text data through a sentiment analysis in order to understand a text more objectively. As text analyses become more prevalent, it is imperative to actively engage in the process and critically examine the results, paying attention not only to the numbers and graphs but also to the subject matter of the text.

 


Software Review: NVivo as a Teaching Tool

For the past few weeks, DASIL has been publishing a series of blog posts comparing the two presidential candidates this year – Hillary Clinton and Donald Trump – using NVivo, a text analysis software package. Given the increasing demand for qualitative data analysis in academic research and teaching, this blog post will discuss the strengths and weaknesses of NVivo as a teaching tool for qualitative analysis.

Efficiency and reliability

Using software like NVivo in content analysis can add rigor to qualitative research. Doing word searches or coding in NVivo produces more reliable results than doing so manually, since the software rules out human error. Furthermore, NVivo proves particularly useful with large data sets – it would be extremely time-consuming to code hundreds of documents by hand with a highlighter pen.

Ease of use

NVivo is relatively simple to use. Users can import documents directly from word processing packages in various formats, including Word documents and PDFs, and code these documents easily on screen via the point-and-click interface. Teachers and students can quickly become proficient in using the software.

NVivo and social media

NVivo allows users to import Tweets, Facebook posts, and YouTube comments and incorporate them as part of their data. Given the rise of social media and increased interest in studying its impact on our society, this capability of NVivo may become more heavily used.

Segmenting and identifying patterns 

NVivo allows users to create clusters of nodes and organize their data into categories and themes, making it easy for researchers to identify patterns. At the same time, the use of word clouds and cluster analysis also provides insight into prevailing themes and topics across data sets.

Limitations

While NVivo is a strong software package for providing a reliable, general picture of the data, it is important to be aware of its limitations. It may be tempting to limit the data analysis process to automatic word searches that yield a list of nodes and themes, but in-depth analysis and critical thinking skills are still needed for meaningful data analysis.

Although it is possible to search for particular words and derivations of those words, various ways in which ideas are expressed make it difficult to find all instances of a particular usage of words or ideas. Manual searches and evaluation of automatic word searches help to ensure that the data are, in fact, thoroughly examined.

Once individual themes in a data set are found, NVivo does not provide tools to map out how these themes relate to one another, making it difficult to visualize the interrelationships of nodes and topics across data sets. Users need to think critically about the ways in which these themes emerge and relate to each other to gain a deeper understanding of the data.


A Tool for Visualizing Regression Models

Will sales of a good increase when its price goes down? Does the life expectancy of a country have anything to do with its GDP? To help answer these questions concerning different measures, researchers and analysts often employ the use of regression techniques.

Linear regression is a widely-used tool for quantifying the relationship between two or more quantitative variables. The underlying premise is simple: no more complicated than drawing a straight line through a scatterplot! This simple tool is nevertheless used for everything from market forecasting to economic models. Due to its pervasiveness in analytical fields, it is important to develop an intuition behind regression models and what they actually do. For this, I have developed a visualization tool that allows you to explore the way regressions work.

You can import your own dataset or choose from a selection of others, but the default one contains information on a selection of movies. Suppose you want to know the strategy for making the most money from a film. In regression terminology, you ask: which variables (factors) might be good predictors of a film’s box office gross?

The response variable is the measure you want to predict, which in this situation will be the box office gross (BoxOfficeGross). The attribute that you think might be a good predictor is the explanatory variable. The budget of the film might be a good explanatory variable to predict the revenue a film might earn, for example. Let’s change the explanatory variable of interest to Budget to explore this relationship. Do you see a clear pattern emerge from the scatterplot? Can you find a better predictor of BoxOfficeGross?

If you want to control for the effects of other pesky variables without having to worry about them directly, you can include them in your model as control variables.

Below the scatterplot are two important measures used in evaluating regression models: the p-value and the R² value. The p-value tells us the probability of getting a result like ours just by chance. In the context of a regression model, it suggests whether the specific combination of explanatory and control variables really does seem to affect the response variable in some way: a lower p-value means there seems to be something actually going on in the data, as opposed to the points being scattered randomly. The R² value, on the other hand, tells us what proportion of the variability in the response (predicted) variable is explained by the explanatory (predictor) variables; in other words, how good the model is. If a model has a low R² value and is incredibly bad at predicting our response, it might not be such a good model after all.
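For readers who want to see where these numbers come from, here is a minimal sketch in R, assuming a data frame movies with the variables named above (the visualization tool itself may compute them differently):

```r
# Fit a simple linear regression of box office gross on budget
fit <- lm(BoxOfficeGross ~ Budget, data = movies)

summary(fit)            # the coefficient table reports the p-value for Budget
summary(fit)$r.squared  # proportion of variability in BoxOfficeGross explained by the model
```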

Figure: scatterplot of RottenTomatoesScore vs. RunTime from the visualization tool.

If you want to predict a movie’s RottenTomatoesScore from its RunTime, for example, the incredibly small p-value might tempt you to conclude that, yes, longer movies do get better reviews! However, if you look at the scatterplot, you might get the feeling that something’s not right. The R² value tells the other side of the story: though RunTime does appear to be correlated with RottenTomatoesScore, the relationship is far too weak for us to do anything useful with it!

Play around with the default dataset provided, or use your own dataset by going to the Change Dataset tab on top of the page. This visualization tool can be used to develop an intuition for regression analysis, to get a feel of a new dataset, or even in classrooms for a visual introduction to linear regression techniques.


10 Suggestions for Making an Effective Poster

Sara Casson

Written papers are the traditional way to share research results at professional meetings, but poster sessions have been gaining popularity in many fields. Posters are particularly effective for sharing quantitative data, as they provide a good format for presenting data visualizations and allow readers to peruse the information at leisure.  For students they are a great teaching tool, as preparing a good poster also requires clear and concise writing.

Making a poster is easy, but making a really good poster is hard.  I have found the guidelines below helpful to students.  The most important piece of advice, however, is the one true for all writing—write, read and revise; write, read and revise; write, read and revise!

  1. Make your poster using PowerPoint. This will allow you to put in text via text boxes as well as to paste in charts, graphs, tables, maps, and pictures. It is easy! To get your pictures and text boxes to line up consistently, use snap to grid. In the Format tab choose Arrange>>Align and then Grid Settings. Select to view the grid and to snap to the grid. You can set the grid size here as well.
  2. Use a single slide. In the Design tab pick Page Setup, select Custom, and then set the width and height to maximize your slide, given the locally available paper size. At Grinnell the paper width available is 36”, so we set the width to 45” and the height to 36”. Use “landscape” for your orientation.
  3. As in a written paper, have a descriptive title. Put the title (in 68-point type or larger) at the top of the poster. Place your name and college affiliation in slightly smaller type immediately below it.
  4. The exact sections of the poster will vary somewhat depending on the project, but include an abstract placed either under the title or in the upper left column.
  5. As in a written paper, be sure you have a good thesis and present it early in the poster, support it with evidence, and then remind your audience of it as you conclude. Finish with a minimum of citations and acknowledgements in the lower right-hand corner.
  6. Posters should read sequentially from the upper left, down the left column, then down the central column (if you have one), and finally down the right column. Alternative layouts are possible, but the order in which the poster is read must be obvious.
  7. Use a large font: a minimum of 28 points.
  8. Limit the number of words. Be concise and think of much of your text as captions for illustrations.
  9. Use lots of charts, graphs, maps, and other pictures. Be sure to label your figures and refer to them in the text.
  10. Make your poster attractive. Use color. Pay attention to layout. Do not have large empty areas.


Testing Weighted Data

In previous posts we discussed the challenges of accounting for weights in stratified random samples. While the calculation of population estimates is relatively standard, there is no universally accepted norm for statistical inference for weighted data. However, some methods are more appropriate than others. We will focus on examining three different methods for analyzing weighted data and discuss which is most appropriate to use, given the information available.

Three common methods for testing stratified random samples (weighted data) are:

  • The Simple Random Sample (SRS) Method assumes that the sample is an unweighted sample that is representative of the population, and does not include adjustments based on the weights that are assigned to each entry in the data set. This is the basic chi-square test taught in most introductory statistics classes.
  • The Raw Weight (RW) Method multiplies each entry by its respective weight and runs the analysis on this adjusted, weighted sample.
  • The Rao-Scott Method takes into account both sampling variability and variability among the assigned weights to adjust the chi-square from the RW method.

One example of a data set which incorporates a weight variable is the Complementary and Alternative Medicine (CAM) Survey, which was conducted by the National Center for Health Statistics (NCHS) in 2012. For the CAM survey, NCHS researchers gathered information on numerous variables such as race, sex, region, employment, marital status, and whether each individual surveyed used various types of CAM. In this dataset, weights were assigned based on race, sex, and age.

Among African Americans who used CAM for wellness, we conducted a chi-square test to determine whether there was a significant difference in the proportion of physical therapy users in each region. Below is a table comparing the test statistics and p-values for each of the three statistical tests:

Table: test statistics and p-values for the SRS, RW, and Rao-Scott methods.

The SRS method assumes that we are analyzing data collected from a simple random sample instead of a stratified random sample. Since the proportions in our sample do not represent the population, this method is inappropriate. The RW method multiplies each entry by its weight, giving a slightly more representative sample. While this method is useful for estimating populations, the multiplication of the weights tends to give p-values that are much too small. Thus, both the SRS and RW methods are inaccurate for testing this data set. The Rao-Scott method involves adjustments for non-SRS sample designs as well as accounting for the weights, resulting in a better representation of the population.
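For readers who analyze weighted data in R, a sketch of the three approaches is below. The data frame cam and its columns (region, ptherapy, weight) are hypothetical stand-ins for the CAM file; the Rao-Scott test comes from the survey package.

```r
library(survey)

# SRS method: ordinary chi-square test that ignores the weights entirely
chisq.test(table(cam$region, cam$ptherapy))

# RW method: chi-square test on weight-multiplied counts
chisq.test(xtabs(weight ~ region + ptherapy, data = cam))

# Rao-Scott method: design-based test that adjusts for the weights
cam_design <- svydesign(ids = ~1, weights = ~weight, data = cam)
svychisq(~region + ptherapy, design = cam_design)
```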
Try it on your own!
Through a summer MAP with Pam Fellers and Shonda Kuiper, we created a CAM Data Shiny app. Go to this app and compare how population estimates and test statistics can change based upon the statistical method that is used. For example, select the X Axis Variable to be Sex and the Color By variable to be Surgery. Examine the chi-square values from each of the three types of tests. Which test gives the most extreme p-value? The least extreme? You can also find multiple datasets and student lab activities giving details on how to properly analyze weighted data here.
