Student Spotlight: Racial Bias in the NYPD Stop-and-frisk Policy

Donald Trump recently came out in favor of the New York Police Department’s (NYPD) former “stop-and-frisk” policy, which allowed police officers to stop, question, and frisk individuals for weapons or illegal items. The policy came under harsh criticism for racial profiling and was declared unconstitutional by a federal judge in 2013.

An earlier post by Krit Petrachainan showed potential racial discrimination against African-Americans within different precincts. Expanding on this topic, we decided to look at data from 2014, one year after the policy had been reformed but before major official policy changes had taken place.

More specifically, this study examined whether race (Black or White) influenced the chance of being frisked after being stopped in NYC in 2014, after taking physical appearance, the racial composition of each precinct’s population, and suspected crime type into account.

2014 Data From NYPD and Study Results

For this study, we used the 2014 Stop, Question and Frisk dataset retrieved from the New York City Police Department. After data cleaning, the final dataset contained 22,054 observations. To address our research question, we built a logistic regression model and ran a drop-in-deviance test to determine the importance of the Race variable in our model.

Our results suggest that, once a suspect has been stopped, race does not significantly influence the chance of being frisked in NYC in 2014. A drop-in-deviance test on a logistic regression model predicting the likelihood of being frisked gave a G-statistic of 8.99 and a corresponding p-value of 0.061. This p-value, just above the conventional 0.05 threshold, means we do not have enough evidence to conclude that adding the terms associated with Race improves the predictive power of the model.
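To make the drop-in-deviance (likelihood ratio) test concrete, here is a minimal sketch in Python with statsmodels: a full logistic regression model containing Race and its interactions is compared against a reduced model without them, and the drop in deviance is referred to a chi-square distribution. The file and column names (sqf_2014_clean.csv, frisked, age, sex, pct_black, race) are hypothetical stand-ins for the cleaned NYPD variables, not the authors’ actual code.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical file and column names standing in for the cleaned 2014 stop-and-frisk data
df = pd.read_csv("sqf_2014_clean.csv")

# Reduced model: physical and contextual predictors only
reduced = smf.logit("frisked ~ age + sex + pct_black", data=df).fit()

# Full model: adds Race and its interactions with the other predictors
full = smf.logit(
    "frisked ~ age + sex + pct_black + race + race:age + race:pct_black",
    data=df,
).fit()

# Drop-in-deviance test: G = 2 * (log-likelihood of full - log-likelihood of reduced)
G = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model
p_value = stats.chi2.sf(G, df_diff)
print(f"G-statistic = {G:.2f}, df = {df_diff}, p-value = {p_value:.3f}")
```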


Figure 1. Logistic regression plot predicting probability of being frisked from precinct population Black, compared across race

To better visualize the interactions between race and other variables, we created logistic regression plots predicting the probability of being frisked from either Black Pop (precinct proportion Black) or Age, and bar charts comparing the proportion of suspects frisked across sex and race.

Interestingly, given that the suspects are stopped, as the precinct proportion of Blacks increases, both Black and White suspects are more likely to be frisked. Furthermore, this trend is more pronounced for Black than for White suspects (Figure 1).

Additionally, young Black suspects are much more likely than their White counterparts to be frisked, given that they are stopped. This difference diminishes as suspect age increases (Figure 2).


Figure 2. Logistic regression plot predicting probability of being frisked from age, compared across race

Finally, male suspects are much more likely to be frisked than female suspects, given that they are stopped (Figure 3). However, the bar charts indicate that the effect of race on the probability of being frisked does not depend on gender.


Figure 3. Proportion frisked by race, compared across sex

Is stop-and-frisk prone to racial bias?

Our results suggest that, given that the suspect is stopped and after taking other external factors into account, race does not significantly influence the chance of being frisked in NYC in 2014. However, after looking at the relationships between race and precinct population Black, age, and sex, there is a possibility that the NYPD “stop-and-frisk” practices are prone to racial bias, posing a threat to minority citizens in NYC. It is crucial that the NYPD continue to evaluate its “stop-and-frisk” policy and make appropriate changes to the policy and/or police officer training in order to prevent racial profiling at any level of investigation.

*** This study by Linh Pham, Takahiro Omura, and Han Trinh won 2nd place in the 2016 Undergraduate Class Project Competition (USCLAP).

Check out the 2016 USCLAP Winners here.


5 Things To Do with a Data Set

Clustering

Like prediction and classification, understanding how the data are organized can help us analyze them. One way to tease out the structure of the data is by examining clusters. Based on the patterns shown in the data, we can group individual observational units into distinct clusters, defined so that observations within each cluster have similar characteristics. We can then do further analysis of each group as well as compare groups to one another. For example, marketers may want to identify customer segments in order to develop targeted marketing strategies. A cluster analysis will group customers so that people in the same customer segment tend to have similar needs but differ from those in other customer segments. Some popular clustering methods are multi-dimensional scaling and latent class analysis.

An example of customer segmentation

Image source: http://www.dynamic-concepts.nl/en/segmentation/
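As a concrete illustration of the clustering idea (using k-means rather than the specific methods named above), here is a minimal sketch in Python on made-up customer data; the features and the number of segments are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up customer features: annual spend and store visits per month
spend = np.concatenate([rng.normal(200, 30, 100), rng.normal(800, 80, 100)])
visits = np.concatenate([rng.normal(2, 0.5, 100), rng.normal(10, 2, 100)])
X = np.column_stack([spend, visits])

# Group customers into an assumed number of segments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Observations within the same cluster share similar characteristics
for label in np.unique(kmeans.labels_):
    segment = X[kmeans.labels_ == label]
    print(f"Segment {label}: mean spend {segment[:, 0].mean():.0f}, "
          f"mean visits {segment[:, 1].mean():.1f}")
```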

Classification

The first step is constructing a classification system. The categories can be created based either on theory or on observed statistical patterns, such as those detected using clustering techniques. The next step is to identify the category or group to which a new observation belongs. For example, a new email can be put in the spam bin or the non-spam bin based on its contents. In statistics and machine learning, logistic regression, linear classifiers, support vector machines, and linear discriminant analysis are popular techniques for classification problems.
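Here is a minimal sketch of the spam example in Python, using word counts and a logistic regression classifier; the tiny training set is made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: 1 = spam, 0 = not spam
emails = [
    "win a free prize now",
    "cheap meds limited offer",
    "meeting moved to tuesday",
    "lunch with the project team",
]
labels = [1, 1, 0, 0]

# Turn email text into word-count features, then fit a logistic regression classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
classifier = LogisticRegression().fit(X, labels)

# Identify the bin to which a new email belongs, based on its contents
new_email = ["free offer just for you"]
prediction = classifier.predict(vectorizer.transform(new_email))
print("spam" if prediction[0] == 1 else "not spam")
```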

Prediction

Predictive models can be built with the available data to tell you what is likely to happen. Such models assume either that knowledge of past statistical patterns can be used to predict the future or that some type of theoretical model is valid. For example, Netflix recommends movies to users based on the movies and shows those users have watched in the past.

Can we predict who will be the next president, Clinton or Trump? Yes, we can. Based on polling data or the candidates’ speeches, you can build a predictive model for the 2016 presidential election. Nate Silver is well known for the accuracy of his predictions of both political and sporting events. Here is his prediction model for the 2016 presidential election:

A map of the polls-only forecast of the 2016 presidential election by Nate Silver

Source: http://projects.fivethirtyeight.com/2016-election-forecast/

Predictive modeling draws on regression analysis, including linear regression, multiple regression, and generalized linear models, as well as machine learning algorithms such as random forests and factor analysis. Time series analysis can be used to forecast the weather or next season’s sales of a product.
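As a small illustration of predictive modeling with regression, the sketch below fits a linear regression to made-up historical data and uses the fitted pattern to predict a future value; the variables (advertising spend and sales) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Made-up historical data: advertising spend (in $1,000s) and resulting sales
ad_spend = rng.uniform(10, 100, 50).reshape(-1, 1)
sales = 5 + 0.8 * ad_spend.ravel() + rng.normal(0, 4, 50)

# Fit a linear regression model to the past data
model = LinearRegression().fit(ad_spend, sales)

# Use the fitted relationship to predict what is likely to happen for a new budget
next_quarter_spend = np.array([[120]])
print(f"Predicted sales: {model.predict(next_quarter_spend)[0]:.1f}")
```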

Anomaly Detection

Anomaly detection identifies unexpected or abnormal events. In other words, we seek to find deviations from expected patterns. Detecting credit card fraud provides an example: credit card companies can analyze customers’ purchase behavior and history so that they can alert customers to possible fraud. Popular anomaly detection techniques include k-nearest neighbors, neural networks, support vector machines, and cluster analysis.
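Here is a minimal sketch of a k-nearest-neighbor style anomaly detector in Python: transactions whose distance to their k-th nearest neighbor is unusually large are flagged as deviations from the expected pattern. The purchase amounts, the choice of k, and the flagging threshold are all made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Made-up purchase amounts: mostly routine transactions plus a few large outliers
amounts = np.concatenate([rng.normal(50, 10, 200), [400, 650]]).reshape(-1, 1)

# Distance to the k-th nearest neighbor: a large distance signals unusual behavior
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(amounts)  # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(amounts)
kth_distance = distances[:, -1]

# Flag the transactions that deviate most from expected patterns
threshold = np.percentile(kth_distance, 99)
flagged = amounts[kth_distance > threshold].ravel()
print("Flagged amounts:", flagged)
```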

Decision Making

One of the most common motivations for analyzing data is to drive better decision making. When a company needs to promote a new product, it can employ data analysis to set a price that maximizes profit while avoiding price wars with competitors. Data analysis is so central to decision making that almost all analytic techniques – including not only the ones mentioned above but also geographical information systems, social network analysis, and qualitative analysis – can be applied.


A Tool for Visualizing Regression Models

Will sales of a good increase when its price goes down? Does the life expectancy of a country have anything to do with its GDP? To help answer questions like these, researchers and analysts often employ regression techniques.

Linear regression is a widely used tool for quantifying the relationship between two or more quantitative variables. The underlying premise is simple: no more complicated than drawing a straight line through a scatterplot! This simple tool is nevertheless used for everything from market forecasting to economic models. Because of its pervasiveness in analytical fields, it is important to develop an intuition for regression models and what they actually do. For this, I have developed a visualization tool that allows you to explore how regressions work.

You can import your own dataset or choose from a selection of others; the default one contains information on a selection of movies. Suppose you want to know the strategy for making the most money from a film. In regression terminology, you ask: what variables (factors) might be good predictors of a film’s box office gross?

The response variable is the measure you want to predict, which in this situation will be the box office gross (BoxOfficeGross). The attribute that you think might be a good predictor is the explanatory variable. The budget of the film might be a good explanatory variable to predict the revenue a film might earn, for example. Let’s change the explanatory variable of interest to Budget to explore this relationship. Do you see a clear pattern emerge from the scatterplot? Can you find a better predictor of BoxOfficeGross?

If you want to control for the effects of other pesky variables without having to worry about them directly, you can include them in your model as control variables.

Below the scatterplot are two important measures used in evaluating regression models: the p-value and the R² value. The p-value tells us roughly how likely we would be to get a result like ours just by chance. In the context of a regression model, it suggests whether the specific combination of explanatory and control variables really does seem to affect the response variable in some way: a lower p-value means that there seems to be something actually going on in the data, as opposed to the points being scattered randomly. The R² value, on the other hand, tells us what proportion of the variability in the response (predicted) variable is explained by the explanatory (predictor) variables – in other words, how good the model is. If a model has a low R² value and is incredibly bad at predicting our response, it might not be such a good model after all.

Scatterplot of RottenTomatoesScore versus RunTime

If you want to predict a movie’s RottenTomatoesScore from its RunTime, for example, the incredibly small p-value might tempt you to conclude that, yes, longer movies do get better reviews! However, if you look at the scatterplot, you might get the feeling that something’s not right. The R² value tells us the other side of the story: though RunTime does appear to be correlated with RottenTomatoesScore, the relationship is just too weak for us to do anything useful with!
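The sketch below reproduces this lesson on simulated data in Python, using hypothetical stand-ins for RunTime and RottenTomatoesScore rather than the tool’s actual movie dataset: with thousands of observations and a very weak relationship, the p-value comes out tiny while R² stays close to zero.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Simulated stand-ins for RunTime and RottenTomatoesScore: a real but very weak relationship
run_time = rng.normal(110, 20, 5000)
score = 50 + 0.05 * run_time + rng.normal(0, 15, 5000)

# Ordinary least squares regression of score on run time
X = sm.add_constant(run_time)
fit = sm.OLS(score, X).fit()

# With thousands of movies the p-value is tiny, yet R-squared shows the relationship
# explains only a small fraction of the variability in scores
print(f"slope p-value: {fit.pvalues[1]:.2e}")
print(f"R-squared:     {fit.rsquared:.3f}")
```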

Play around with the default dataset provided, or use your own dataset by going to the Change Dataset tab at the top of the page. This visualization tool can be used to develop an intuition for regression analysis, to get a feel for a new dataset, or even in classrooms as a visual introduction to linear regression techniques.


Testing Weighted Data

In previous posts we discussed the challenges of accounting for weights in stratified random samples. While the calculation of population estimates is relatively standard, there is no universally accepted norm for statistical inference for weighted data. However, some methods are more appropriate than others. We will focus on examining three different methods for analyzing weighted data and discuss which is most appropriate to use, given the information available.

Three common methods for testing stratified random samples (weighted data) are:

  • The Simple Random Sample (SRS) Method assumes that the sample is an unweighted sample that is representative of the population, and does not include adjustments based on the weights assigned to each entry in the data set. This is the basic chi-square test taught in most introductory statistics classes.
  • The Raw Weight (RW) Method multiplies each entry by its respective weight and runs the analysis on this adjusted, weighted sample.
  • The Rao-Scott Method takes into account both sampling variability and variability among the assigned weights to adjust the chi-square statistic from the RW method.

One example of a data set which incorporates a weight variable is the Complementary and Alternative Medicine (CAM) Survey, which was conducted by the National Center for Health Statistics (NCHS) in 2012. For the CAM survey, NCHS researchers gathered information on numerous variables such as race, sex, region, employment, marital status, and whether each individual surveyed used various types of CAM. In this dataset, weights were assigned based on race, sex, and age.

Among African Americans who used CAM for wellness, we conducted a chi-square test to determine whether there was a significant difference in the proportion of physical therapy users in each region. Below is a table comparing the test statistics and p-values for each of the three statistical tests:

Test statistics and p-values for the SRS, RW, and Rao-Scott methods

The SRS method assumes that we are analyzing data collected from a simple random sample rather than a stratified random sample. Since the proportions in our sample do not represent the population, this method is inappropriate. The RW method multiplies each entry by its weight, giving a slightly more representative sample. While this method is useful for estimating population quantities, multiplying by the weights inflates the apparent sample size and tends to give p-values that are much too small. Thus, both the SRS and RW methods are inappropriate for testing this data set. The Rao-Scott method adjusts for the non-SRS sample design as well as accounting for the weights, resulting in a better representation of the population.
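To make the difference between the first two methods concrete, here is a minimal sketch in Python that runs a chi-square test on a hypothetical region-by-use contingency table, once on the raw counts (the SRS method) and once after multiplying each row by an average weight (the RW method). The counts and weights are made up rather than taken from the CAM data, and the Rao-Scott adjustment is not shown because it is typically obtained from survey-analysis software such as the svychisq function in R’s survey package.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of physical therapy users (yes / no) by region -- not the actual CAM data
counts = np.array([[12,  88],
                   [20, 180],
                   [15, 135],
                   [ 9, 141]])

# Hypothetical average survey weight within each region
avg_weight = np.array([2.1, 1.4, 1.8, 2.5])

# SRS method: ordinary chi-square test on the raw (unweighted) counts
chi2_srs, p_srs, _, _ = chi2_contingency(counts)

# RW method: the same test after multiplying each row by its average weight;
# the inflated "sample size" tends to make the p-value much too small
chi2_rw, p_rw, _, _ = chi2_contingency(counts * avg_weight[:, None])

print(f"SRS: chi-square = {chi2_srs:.2f}, p = {p_srs:.4f}")
print(f"RW:  chi-square = {chi2_rw:.2f}, p = {p_rw:.4f}")
```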
Try it on your own!

Through a summer MAP with Pam Fellers and Shonda Kuiper, we created a CAM Data shiny app. Go to this app and compare how population estimates and test statistics can change based upon the statistical method that is used. For example, select the X Axis Variable to be Sex and the Color By variable to be Surgery. Examine the chi-square values from each of the three types of tests. Which test gives the most extreme p-value? The least extreme? You can also find multiple datasets and student lab activities giving details on how to properly analyze weighted data here.


Understanding Population Estimates Based Upon Stratified Random Samples

When a researcher is interested in examining distinct subgroups within a population, it is common to use a stratified random sample to better represent the entire population. This method involves dividing the population of interest into several small subgroups (called strata) based on specific variables of interest and then taking a simple random sample from each of these smaller groups. To account for stratified random samples, weights are used to better estimate population parameters.

Many people fail to recognize that data from a stratified random sample should not be treated as a simple random sample (SRS), as Kathy Kamp, Professor of Anthropology, mentions in an earlier blog post. The following example explains why it is important to treat stratified random samples and SRSs differently.

In 2010, CBS and the New York Times conducted a national phone survey (a stratified random sample) of 1,087 subjects as part of “a continuing series of monthly surveys that solicit[ed] public opinion on a range of political and social issues” (ICPSR 33183, 2012 March 15). In addition to political preference, they gathered information on race, sex, age, and region of residence.

The figure below demonstrates how population estimates vary depending on whether weights are used. The unweighted graph incorrectly overestimates the proportion of females identifying with the Democratic party (52% Democrat and 40% Republican), which in turn overestimates the number of Democrats in the nation. However, when weights are properly incorporated into the analysis, we see that the proportions are actually much closer (46% Democrat and 45% Republican).

 

Weighted and unweighted estimates of political preference by sex

 

As demonstrated above, there is a difference between the weighted and unweighted graphs and resulting proportions. Specifically, the number and percent of Republican supporters increases when we take into account the weights. The weighted graph and proportions give a more accurate estimation of Political Preference by Sex in the population than the unweighted graph.
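A minimal sketch of the weighted versus unweighted calculation in Python with pandas, assuming hypothetical column names (sex, party, weight) rather than the actual CBS/New York Times variables:

```python
import pandas as pd

# Hypothetical respondents; the column names are illustrative stand-ins
survey = pd.DataFrame({
    "sex":    ["F", "F", "F", "M", "M", "F", "M", "M"],
    "party":  ["Dem", "Dem", "Rep", "Rep", "Dem", "Dem", "Rep", "Rep"],
    "weight": [0.8, 0.9, 1.3, 1.2, 1.1, 0.7, 1.4, 1.5],
})

# Unweighted estimate: every respondent counts equally
unweighted = survey.groupby("sex")["party"].value_counts(normalize=True)

# Weighted estimate: each respondent counts in proportion to his or her survey weight
party_weight = survey.groupby(["sex", "party"])["weight"].sum()
total_weight = survey.groupby("sex")["weight"].sum()
weighted = party_weight.div(total_weight, level="sex")

print("Unweighted proportions:")
print(unweighted)
print("Weighted proportions:")
print(weighted)
```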

Try it on your own!

Through a summer MAP with Pam Fellers and Shonda Kuiper, we have created a Political Data app using this dataset. Follow this link to view the influence of weights on the population estimates for all the subgroups within this dataset. For example, select the X Axis Variable to be “Region” and the Y Axis Variable to be “Political Preference”. What do you notice about the weighted graph in comparison to the unweighted graph? You can also find datasets and several student lab activities giving details for proper estimation and testing for survey (weighted) data at this website.
