5 Things To Do with a Data Set

Clustering

Like prediction and classification, understanding how a data set is organized can aid analysis. One way to tease out this structure is through clustering: based on the patterns in the data, we group individual observational units into distinct clusters, defined so that observations within each cluster have similar characteristics. We can then analyze each group further and compare across groups. For example, marketers may want to identify customer segments in order to develop targeted marketing strategies. A cluster analysis groups customers so that people in the same segment tend to have similar needs but differ from those in other segments. Popular clustering methods include k-means, hierarchical clustering, and latent class analysis; multidimensional scaling is often used alongside them to visualize cluster structure.

Image source: http://www.dynamic-concepts.nl/en/segmentation/
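To make the idea concrete, here is a minimal sketch of one-dimensional k-means clustering in plain Python. The "customer spending" values and the choice of two clusters are hypothetical, purely for illustration:

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Cluster 1-D data into k groups by repeatedly assigning each
    point to its nearest center and recomputing each center as the
    mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        groups = {c: [] for c in range(k)}
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty
        centers = [sum(g) / len(g) if g else centers[c]
                   for c, g in groups.items()]
    return centers, groups

# Hypothetical customer spending values forming two segments
spending = [10, 12, 11, 13, 80, 82, 79, 85]
centers, groups = kmeans_1d(spending, k=2)
# The two recovered centers sit near 11.5 and 81.5, one per segment
```

Each center ends up at the mean of one spending segment, which is exactly the "similar within, different between" property described above.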

Classification

The first step is constructing a classification system. The categories can be created based either on theory or on observed statistical patterns, such as those detected using clustering techniques. The next step is to identify the category or group to which a new observation belongs. For example, a new email can be put in the spam bin or the non-spam bin based on its contents. In statistics and machine learning, logistic regression, linear classifiers, support vector machines, and linear discriminant analysis are popular techniques for classification problems.
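As a minimal sketch of the general idea (using a simple nearest-centroid rule rather than one of the specific methods above), the spam example might look like this. The word-count features and training emails are hypothetical:

```python
def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def classify(x, centroids):
    """Return the label of the nearest class centroid (squared
    Euclidean distance)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))

# Hypothetical features: [count of "free", count of "meeting"] per email
spam_emails     = [[5, 0], [4, 1], [6, 0]]
not_spam_emails = [[0, 3], [1, 4], [0, 2]]
centroids = {"spam": centroid(spam_emails),
             "not spam": centroid(not_spam_emails)}

label = classify([5, 1], centroids)   # a new email heavy on "free"
```

The new email lands in the spam bin because its feature vector is closest to the average spam email seen so far.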

Prediction

Predictive models can be built with the available data to tell you what is likely to happen. Such models assume either that past statistical patterns can be used to predict the future or that some theoretical model is valid. For example, Netflix recommends movies to users based on the movies and shows they have watched in the past.

Can we predict who will be the next president, Clinton or Trump? Yes, we can. Based on polling data or candidates’ speeches, you can build a predictive model for the 2016 presidential election. Nate Silver is well known for the accuracy of his predictions of both political and sporting events. Here is his prediction model for the 2016 presidential election:

Source: http://projects.fivethirtyeight.com/2016-election-forecast/

Predictive modeling utilizes regression analysis, including linear regression, multiple regression, and generalized linear models, as well as machine learning algorithms such as random forests. Time series analysis can be used to forecast the weather or next season’s sales of a product.
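The simplest of these techniques, a one-variable linear regression, can be fit by hand with the usual least-squares formulas. The advertising-versus-sales numbers below are hypothetical:

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for the model y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical data: advertising spend vs. sales
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]          # exactly linear: sales = 1 + 2*spend
a, b = fit_line(spend, sales)
prediction = a + b * 6            # predicted sales at spend = 6
```

The fitted line recovers intercept 1 and slope 2, and the prediction at a spend of 6 is 13, illustrating the "use past patterns to predict the future" assumption in its simplest form.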

Anomaly Detection

Anomaly detection identifies unexpected or abnormal events. In other words, we seek deviations from expected patterns. Detecting credit card fraud is a classic example: credit card companies analyze customers’ purchase behavior and history so that they can alert customers to possible fraud. Popular anomaly detection techniques include k-nearest neighbors, neural networks, support vector machines, and cluster analysis.
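A much simpler statistical rule than the methods above still conveys the idea: flag any new charge more than three standard deviations from the mean of the cardholder's past purchases. The purchase amounts here are hypothetical:

```python
import statistics

def anomalies(history, new_values, threshold=3.0):
    """Flag values more than `threshold` standard deviations away
    from the mean of the historical data."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return [v for v in new_values if abs(v - mean) / sd > threshold]

# Hypothetical purchase amounts for one cardholder
history = [25, 30, 28, 32, 27, 31, 29, 26, 30, 28]
flagged = anomalies(history, [29, 31, 950])   # only 950 deviates
```

The ordinary-looking charges pass silently, while the 950 purchase deviates wildly from the expected pattern and is flagged for review.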

Decision Making

One of the most common motivations for analyzing data is to drive better decision making. When a company needs to promote a new product, it can employ data analysis to set a price that maximizes profit while avoiding price wars with competitors. Data analysis is so central to decision making that almost all analytic techniques – including not only the ones mentioned above but also geographical information systems, social network analysis, and qualitative analysis – can be applied.
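As a toy illustration of the pricing decision (the linear demand curve and unit cost below are entirely hypothetical), one can search a grid of candidate prices for the profit-maximizing one:

```python
def best_price(prices, demand, unit_cost):
    """Pick the candidate price that maximizes
    (price - unit_cost) * expected units sold."""
    return max(prices, key=lambda p: (p - unit_cost) * demand(p))

# Hypothetical demand curve: each $1 of price loses 10 expected sales
demand = lambda p: max(0, 200 - 10 * p)

price = best_price(range(1, 21), demand, unit_cost=4)
```

With this made-up demand curve the profit function is a downward parabola, and the grid search lands on its peak at a price of 12.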

You can leave the list at any time. Removal instructions are included in each message.

The Common Mistakes Made in Creating a Data Visualization

Oftentimes the best way to learn how to do something right is to learn what not to do, and that is especially true of data visualization. WTF Visualizations is a website that compiles poorly crafted data visualizations from across the web and media. Below is a sampling of featured visualizations that illustrate some of the most common data visualization mistakes:

• Absence of Proper Scaling

Including proper scaling is essential in accurately representing your data. In the example below, the differentiation between values is misrepresented due to the absence of a clear scaling measure. The 52% measure does not appear to be as large as it should be in comparison to the other bars, and the 13% figure appears to be much larger than 3% when compared to the two 10% figures.

• Too Much Information

While the inclination is to include as much information in a visualization as possible, too much information often detracts from the clarity and concision that are essential to good data visualization. The example below illustrates how a myriad of different categories can muddle your visualization, and underscores the importance of clear axis labels and descriptive titles.

Ensure that the data you do decide to visualize is comprehensible to your audience: recode categories when there are too many; don’t include measures that illustrate the same phenomenon; don’t include 10 different variables when 3 will do. If need be, include more than one visualization to highlight different subcategories or variables.

• Inaccurate Math

Always double-check your math before sharing your visualization with the public. Otherwise you run the risk of misrepresenting your data, as well as appearing incapable of simple arithmetic. The example below illustrates this point perfectly: although the creator uses a pie chart, the sections add up not to 100% but to 128%. The sections also fail to reflect the values they supposedly represent: the “51% Today” section, for instance, should take up a little more than half of the pie chart.
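A sanity check like the following catches this mistake before a chart ever goes out (the category percentages are hypothetical):

```python
def check_pie(percentages, tolerance=0.5):
    """Return True if pie-chart slice percentages sum to roughly 100."""
    return abs(sum(percentages) - 100) <= tolerance

ok  = check_pie([51, 30, 19])   # sums to 100: fine
bad = check_pie([51, 40, 37])   # sums to 128: the mistake above
```

Any plausible slices-sum-to-100 tolerance works; the point is simply to verify the arithmetic programmatically rather than by eye.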


What Makes a Good Data Visualization?

Being able to represent data in a clear, concise, and engaging way is an essential skill. While an effective poster is key, data visualizations are a tool that enhances the communication of the narratives underlying the data. David McCandless, a world-renowned data visualization designer and creator of Information is Beautiful, constructed a Venn diagram that depicts the essential elements of a successful data visualization.

The irony of this data visualization, which aims to serve as an aesthetically pleasing vehicle for what makes a good visualization, is that several elements keep it from being a successful one. First, the information he wants to communicate is not immediately obvious: figuring out how each circle category and its intersections relate to the associated examples (e.g., information x goal = plot) takes time and is distracting. In addition, some of the examples he gives aren’t very descriptive. What does he mean by “pure data viz” at the visual form x information intersection? What about “proof of concept” at the intersection of goal and story? There isn’t enough context available to make sense of these examples and categories. While the visualization is accessible to colorblind audiences (an important element of good data visualization), the point McCandless wants to communicate is lost to its lack of description and overcomplicated use of the Venn diagram model.


Journalists and Maps: The Need for Teaching Quantitative Literacy to Everyone

In recent years, programs like ArcGIS and Tableau have made it very easy to produce maps, and journalists have responded by richly illustrating their articles with quantitative data displayed as maps. Maps are both attractive and easier to explore visually than the same data in tabular form, so in many ways they are ideal illustrations. To the average reader, information transmitted as quantitative data appears authoritative, and these maps are no exception: on the surface they seem real and informative. Unfortunately, just as with any data-driven information, maps can inadvertently be misleading.

In a recent example, NBC News illustrated an article about the Supreme Court’s consideration of a Texas law, one that would force the closure of a high percentage of existing abortion clinics across the country were similar laws enacted or enforced more broadly, with a map of the U.S. showing the number of abortions per state in 2012, using data from the CDC (Centers for Disease Control and Prevention).

A quick perusal of this map seems to show why Texas is so concerned about abortion: after all, it is one of the states with the most abortions. After a moment’s examination, the viewer might (or might not) note that the states with the largest populations also seem to have the largest numbers of abortions.

Thus, this map really tells us little about which states have the highest abortion rates; some kind of standardization by population is needed. One option is to use population size itself. But since the number of women who might potentially become pregnant and secure an abortion (usually defined as the number of women between 15 and 44) does not necessarily vary by state in direct proportion to population size, that count may be a better measure for standardization than simple population size. In terms of the number of abortions per 1,000 women ages 15-44, Florida and New York have high abortion rates, but Texas no longer looks unusual.

But even this might not be the most revealing measure, since the birth rate varies from state to state. In 2012 the average birth rate for the U.S. was 1.88, but Texas had a birth rate of 2.08; the highest among the 50 states was 2.37 in Utah and the lowest was 1.59 in Rhode Island. An option that takes these differential birth rates into account is the ratio between the number of births and the number of abortions. Using this measure, New York remains very high, but due to its relatively high birth rate, Texas is even lower.
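The difference between these measures is simple arithmetic. With hypothetical counts for two states of very different size (these are not the actual CDC figures), each standardization can be computed directly:

```python
def rate_per_1000(abortions, women_15_44):
    """Abortions per 1,000 women ages 15-44."""
    return 1000 * abortions / women_15_44

def births_per_abortion(births, abortions):
    """Ratio of births to abortions."""
    return births / abortions

# Hypothetical counts for a large state and a small state
big_state   = dict(abortions=60_000, women_15_44=5_500_000, births=380_000)
small_state = dict(abortions=12_000, women_15_44=1_000_000, births=70_000)

big_rate   = rate_per_1000(big_state["abortions"], big_state["women_15_44"])
small_rate = rate_per_1000(small_state["abortions"], small_state["women_15_44"])
# Raw counts make the big state look extreme (60,000 vs. 12,000),
# yet its standardized rate is actually the lower of the two.
```

Raw counts and standardized rates can rank the same two states in opposite orders, which is precisely why the choice of measure matters for the map.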

The CDC provides all of these statistics, but the journalist chose the least revealing of the possible measures to display.  Journalists are generally both well-educated and, we assume, well-meaning.  Why not pick a better measure to map when it would have been equally easy to do so?  I suspect that the answer lies firmly in the laps of educators like myself.  While we prioritize skills like writing and speaking well, we do not mandate that all students graduate statistically or even quantitatively literate, but we should.


Testing Weighted Data

In previous posts we discussed the challenges of accounting for weights in stratified random samples. While the calculation of population estimates is relatively standard, there is no universally accepted norm for statistical inference on weighted data. Some methods, however, are more appropriate than others. We will examine three methods for analyzing weighted data and discuss which is most appropriate, given the information available.

Three common methods for testing stratified random samples (weighted data) are:

• The Simple Random Sample (SRS) Method assumes that the sample is an unweighted sample that is representative of the population, and does not include adjustments based on the weights that are assigned to each entry in the data set. This is the basic chi-square test taught in most introductory statistics classes.
• The Raw Weight (RW) Method multiplies each entry by its respective weight and runs the analysis on this adjusted, weighted sample.
• The Rao-Scott Method takes into account both sampling variability and variability among the assigned weights to adjust the chi-square statistic from the RW method.
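To see why the RW method inflates significance, compare the chi-square statistic for an unweighted two-way table with the same table after every count is multiplied by a constant weight. The counts below are hypothetical, and a realistic Rao-Scott adjustment is beyond a short sketch:

```python
def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

observed = [[30, 20],    # hypothetical region-by-usage counts
            [25, 25]]
srs_stat = chi_square(observed)
rw_stat  = chi_square([[5 * c for c in row] for row in observed])
# Scaling every count by a weight of 5 scales the statistic by 5 too,
# shrinking the p-value even though the association is unchanged.
```

Because the statistic grows in proportion to the inflated total sample size while the underlying association stays fixed, raw-weight p-values come out far too small, which is exactly the flaw the Rao-Scott adjustment corrects.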

One example of a data set that incorporates a weight variable is the Complementary and Alternative Medicine (CAM) Survey, which was conducted by the National Center for Health Statistics (NCHS) in 2012. For the CAM survey, NCHS researchers gathered information on numerous variables such as race, sex, region, employment, marital status, and whether each individual surveyed used various types of CAM. In this data set, weights were assigned based on race, sex, and age.

Among African Americans who used CAM for wellness, we conducted a chi-square test to determine whether there was a significant difference in the proportion of physical therapy users in each region. Below is a table comparing the test statistics and p-values for each of the three statistical tests:

The SRS method assumes that we are analyzing data from a simple random sample rather than a stratified random sample. Since the proportions in our sample do not represent the population, this method is inappropriate. The RW method multiplies each entry by its weight, giving a slightly more representative sample; while this is useful for estimating population quantities, multiplying by the weights tends to produce p-values that are much too small. Thus, both the SRS and RW methods are inaccurate for testing this data set. The Rao-Scott method adjusts for the non-SRS sample design as well as the weights, resulting in a better representation of the population.