# Understanding Population Estimates Based Upon Stratified Random Samples

When a researcher is interested in examining distinct subgroups within a population, it is common to use a stratified random sample to better represent the entire population. This method involves dividing the population of interest into several small subgroups (called strata) based on specific variables of interest and then taking a simple random sample from each of these smaller groups. To account for stratified random samples, weights are used to better estimate population parameters.
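As a rough illustration of the procedure described above, the following Python sketch draws a simple random sample from each stratum and attaches a weight to every sampled unit. The population, the stratum labels, the per-stratum sample sizes, and the function name `stratified_sample` are all made up for illustration. The weight used here (stratum size divided by stratum sample size) is one common choice of design weight; real surveys often adjust weights further.

```python
import random

def stratified_sample(population, stratum_of, sizes, seed=0):
    """Draw a simple random sample from each stratum.

    population : list of units
    stratum_of : function mapping a unit to its stratum label
    sizes      : dict {stratum label: number of units to sample}
    Returns a list of (unit, weight) pairs, where weight = N_h / n_h
    (stratum population size over stratum sample size).
    """
    rng = random.Random(seed)

    # Divide the population into strata.
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)

    # Take a simple random sample within each stratum.
    sample = []
    for label, n_h in sizes.items():
        members = strata[label]
        weight = len(members) / n_h
        for unit in rng.sample(members, n_h):
            sample.append((unit, weight))
    return sample

# Hypothetical population: 900 units in region "A", 100 in region "B".
population = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]

# Sample 50 from each region, so region "B" is deliberately oversampled.
sample = stratified_sample(population, lambda u: u[0], {"A": 50, "B": 50})
total_weight = sum(w for _, w in sample)
print(len(sample), total_weight)
```

In this sketch each sampled unit from region "A" stands in for 18 population units and each unit from region "B" for 2, so the weights add back up to the population size.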

Many people fail to recognize that data from a stratified random sample should not be treated as a simple random sample (SRS), as Kathy Kamp, Professor of Anthropology, mentions in an earlier blog post. The following example explains why it is important to treat stratified random samples and simple random samples differently.

In 2010, CBS and the New York Times conducted a national phone survey (a stratified random sample) of 1,087 subjects as part of “a continuing series of monthly surveys that solicit[ed] public opinion on a range of political and social issues” (ICPSR 33183, 2012 March 15). In addition to political preference, they gathered information on race, sex, age, and region of residence.

The figure below demonstrates how population estimates vary depending on the use of weights. The unweighted graph overestimates the share of females who identify with the Democratic Party (52% Democrat and 40% Republican), which in turn leads to an overestimate of the number of Democrats in the nation. However, when weights are properly incorporated into the analysis, we see that the proportions are actually much closer (46% Democrat and 45% Republican).

As demonstrated above, the weighted and unweighted graphs, and the resulting proportions, differ. Specifically, the number and percentage of Republican supporters increase when we take the weights into account. The weighted graph and proportions give a more accurate estimate of Political Preference by Sex in the population than the unweighted graph.
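To make the effect of weights concrete, here is a minimal Python sketch that computes the same proportion with and without weights. The responses and weights are invented for illustration and are not the CBS/New York Times data; the function name `proportion` is also hypothetical.

```python
def proportion(responses, category, weights=None):
    """Share of respondents in `category`, optionally weighted."""
    if weights is None:
        # Unweighted: every respondent counts equally.
        weights = [1.0] * len(responses)
    total = sum(weights)
    matching = sum(w for r, w in zip(responses, weights) if r == category)
    return matching / total

# Hypothetical responses and survey weights.
party = ["Democrat", "Democrat", "Democrat", "Republican", "Republican"]
weight = [1.0, 1.0, 1.0, 2.0, 2.0]

unweighted = proportion(party, "Democrat")
weighted = proportion(party, "Democrat", weight)
print(round(unweighted, 3), round(weighted, 3))
```

Because the Republican respondents in this toy example carry larger weights, the weighted Democratic share is smaller than the unweighted one, which is the same direction of change described above.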

Through a summer MAP with Pam Fellers and Shonda Kuiper, we have created a Political Data app using this dataset. Follow this link to view the influence of weights on the population estimates for all the subgroups within this dataset. For example, select the X Axis Variable to be “Region” and the Y Axis Variable to be “Political Preference”. What do you notice about the weighted graph in comparison to the unweighted graph? You can also find datasets and several student lab activities giving details for proper estimation and testing for survey (weighted) data at this website.


# Data Across the Curriculum: Helping the Local and the International with Consulting Research

Students in Monty Roper’s Anthropology and Global Development Studies classes gain practical experience in fieldwork, data analysis, and working effectively with clients when they act as consultants, both for local organizations in Grinnell and internationally in an agricultural village in Costa Rica. The clients they work with get free research, which is presented to them both as an oral consultation and as a written report.

From left: Roni Finkelstein ’15, Ellen Pinnette ’15, Liberty Britton ’14, Rosalie Curtain ’15, Emily Nucaro ’14, Ben Mothershead ’15, Zhaoyi Chen ’14, and M’tep Blount ’15, listen to Juan Carlos Bejarono explain the palm growing process.

For a Global Development Studies/Anthropology seminar, students prepare research plans during the first half of the semester and then travel to a rural agricultural community in Costa Rica, where they spend the two weeks of spring break collecting data. The data is then analyzed and written up during the remaining weeks of the semester. In the first year of the project, the class conducted an in-depth community development diagnostic. Since then, students have investigated a variety of rural development issues, mainly focusing on tourism, women’s empowerment, and the organizational issues and agricultural projects of the town’s two cooperatives.

From left: Chloe Griffin ’14 and Samanea Karrfalt ’14 present their research on “Professional Black Hair Care in Grinnell, IA”

From left: Irene Bruce ’15 and Matt Miller ’15 present their research and answer questions.

In Grinnell, Monty works with Susan Sanning, Director of Service and Social Innovation, to identify and explore possible collaborations with community partners who have research needs. In the past, for example, Mid-Iowa Community Action (MICA) was interested in knowing why families dropped out of their Family Development and Self-Sufficiency Program (FaDSS) before their benefits were fully used; Drake Library was interested in what kinds of programming would best serve the town’s “tween” population; and a hair salon wanted to find out whether it was economically viable to invest in special hair care products and services for black customers.

Ideally, positive change occurs because of the class’s research. Grinnell students Dillon Fischer ’13 and Sarah Burnell ’13 interviewed graduates of Grinnell High School who had gone on to attend college about their preparedness for college academics. According to the GHS Principal, these findings led the school to revise its minimum writing standards, making them more challenging. The local after-school youth program, Galaxy, requested a study on donor perceptions and desires and subsequently used the results to write a successful grant proposal for support. This year’s class is planning to do more follow-ups on previous projects to ascertain longer-term results.


# Data Across the Curriculum: Using Real Data in Classes

What does eating a grub have to do with interviewing children in Costa Rica?  What does studying Don Quixote or Shakespeare have to do with examining business transactions, consulting for an NGO or designing a visualization on terrorist incidents?

ANSWER:  They all describe ways that Grinnell students are engaging with real-world data.

Well-educated individuals should be able to create, evaluate, and analyze data, so that they can ultimately engage in the most well-informed decision-making.  They will then need to be able to effectively communicate patterns in the data to others, using statistics and/or visualization tools in addition to, of course, well-crafted words.

None of these analytic or communicative skills are easy to learn, and acquiring expertise demands both theory and the opportunity for practice. In an age of ubiquitous data (much of it of dubious quality) and numerous computer-assisted visualization and analytic tools (all of which can be both used and misused), the pitfalls are many, although the rewards are great. Employers love the data-savvy, but data analysis is an important part of decision-making in daily life as well!

Grinnell College classes in disciplines ranging from Anthropology to Spanish, History to Biology, Psychology to English, and Political Science to Statistics are engaging with real data in a variety of ways.  Some classes include data collection as well as analysis and display; others are more focused on evaluating data, interpreting it and communicating the results.

DASIL’s mission is to help faculty and students explore the world using data. We are embarking on a series of profiles designed to highlight some of the innovative ways the Grinnell faculty incorporate data in their classes.

See future posts for more details about how grubs, high school students, Don Quixote, and even baboons all fit into the picture.


# Analyzing the American Political Sphere with “We the People”: Part 2

DASIL’s “We the People” data explorer allows users to search petitions based on subject: civil rights, economics, and defense, to name a few. My previous post about “We the People” briefly examined government responses to petitions but did not consider their subject. This analysis expands on that post by grouping together petitions with similar subjects into three broad categories: Government, Science, and Sociology. The Government category includes subjects like “Budget and Taxes” and “Defense”; Science includes “Technology and Telecommunication” and “Environment”; and Sociology includes “Disabilities,” “Education,” and “Poverty.” To limit the number of results, I included only petitions with over 5,000 signatures in my analysis.
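The grouping and filtering step described above could be sketched in Python roughly as follows. The mapping contains only the subjects named in this post (the full subject list for each category is not given here), and the petition records, the function name `category_counts`, and the signature counts are hypothetical.

```python
from collections import Counter

# Subject-to-category mapping, limited to the examples named in the post.
CATEGORY_OF = {
    "Budget and Taxes": "Government",
    "Defense": "Government",
    "Technology and Telecommunication": "Science",
    "Environment": "Science",
    "Disabilities": "Sociology",
    "Education": "Sociology",
    "Poverty": "Sociology",
}

def category_counts(petitions, min_signatures=5000):
    """Count petitions per broad category, keeping only those with
    more than `min_signatures` signatures."""
    counts = Counter()
    for subject, signatures in petitions:
        if signatures > min_signatures and subject in CATEGORY_OF:
            counts[CATEGORY_OF[subject]] += 1
    return counts

# Hypothetical (subject, signature count) pairs.
petitions = [
    ("Defense", 12000),
    ("Education", 7500),
    ("Poverty", 4000),        # dropped: not over 5,000 signatures
    ("Environment", 150000),
    ("Budget and Taxes", 5001),
]
counts = category_counts(petitions)
print(counts)
```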

A frequency analysis for each category reveals an interesting trend when compared with the analysis of the petitions with 100,000+ signatures.

# Historical Data Requires Historical Finesse

Utilizing contemporary tools to analyze historical data provides a unique way to approach historical research, but can prove to be an arduous process as modern tools may not be compatible with historical data. This summer, I have been working with Professor Sarah Purcell to create maps for her book on spectacle funerals of key figures during the U.S. Civil War and Reconstruction. Most commonly, famous bodily remains traveled from city to city on railroads, in some cases on a special funeral train, though they also traveled on rivers and in one case, across the Atlantic Ocean. Nearly every historical figure discussed in the book has an accompanying map which charts their extended funeral processional route. Using GIS technology, we are able to juxtapose census and election data with the geographic routes in highly analytical maps.

In order to layer election data onto the map for Col. Elmer Ellsworth (died 1861), I gathered county-level election data from the Interuniversity Consortium for Political and Social Research (ICPSR) and county-level census data from the National Historical Geographic Information System (NHGIS). I then needed to combine the ICPSR election data and the NHGIS census data in a joined spreadsheet before importing the data into ArcGIS software to link the data to its county location.

At first, I thought we could link the data using something called a “FIPS code.” In an effort to standardize big data and allow for easy joining of tables by location, the Federal Information Processing Standard, during the 1990s, assigned each county in the United States a unique five-digit code, more commonly known as a FIPS code. The first two digits are the state FIPS code and the last three are the county code within the state. For example, the FIPS code for Poweshiek County, Iowa is 19157. This code is assigned to the current borders of Poweshiek County. Yet the data I was analyzing is from 1860, and Poweshiek County in 2015 represents a different land area than Poweshiek County in 1860. Thus, joining ICPSR and NHGIS data from the 19th century could not be completed using FIPS codes without introducing historical inaccuracy in the maps.
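The two-part structure of a county FIPS code can be shown with a short Python sketch. The helper name `split_fips` is hypothetical; the padding step assumes the code may have been stored as a number, which drops leading zeros.

```python
def split_fips(code):
    """Split a five-digit county FIPS code into (state, county) parts."""
    code = str(code).zfill(5)  # restore leading zeros lost by numeric storage
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a five-digit FIPS code: {code!r}")
    # First two digits: state code. Last three digits: county within the state.
    return code[:2], code[2:]

state, county = split_fips("19157")  # Poweshiek County, Iowa
print(state, county)
```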

In order to join two tables of data in any computer program, there must be a common column between them. From ICPSR, I had a table of county-level election data from 1860, and from NHGIS, I had a table of county-level 1860 census data. If I were joining data tables of current counties, the FIPS code would serve as my common column. Instead of using FIPS codes, however, I created a common column using the name of the county and state. Creating a unique name for each county ensured that I correctly joined the historic county data to the historic county borders. For example, Poweshiek County’s unique identifier would be “PoweshiekIowa.” I quickly discovered that joining data by this concatenated column was not without error. I went through each county individually to find discrepancies, many of which resulted from spelling inconsistencies between the two databases.
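The join on a concatenated county-and-state key, along with a check for keys that fail to match, might look like the following Python sketch. This is not the spreadsheet workflow actually used for the maps. The rows, column names, values, and the "De Kalb" versus "DeKalb" spelling mismatch are all hypothetical stand-ins for the kind of discrepancy described above.

```python
def make_key(county, state):
    """Concatenate county and state names, e.g. 'PoweshiekIowa'."""
    return f"{county.strip()}{state.strip()}"

def join_by_key(election_rows, census_rows):
    """Join two lists of dict rows on the county+state key.

    Returns (joined, only_in_election, only_in_census) so that
    unmatched keys can be inspected and corrected by hand.
    """
    election = {make_key(r["county"], r["state"]): r for r in election_rows}
    census = {make_key(r["county"], r["state"]): r for r in census_rows}
    joined = {k: {**census[k], **election[k]} for k in election if k in census}
    only_in_election = sorted(set(election) - set(census))
    only_in_census = sorted(set(census) - set(election))
    return joined, only_in_election, only_in_census

# Hypothetical rows with placeholder values.
election_rows = [
    {"county": "Poweshiek", "state": "Iowa", "votes": 100},
    {"county": "De Kalb", "state": "Illinois", "votes": 200},
]
census_rows = [
    {"county": "Poweshiek", "state": "Iowa", "population": 1000},
    {"county": "DeKalb", "state": "Illinois", "population": 2000},
]

joined, only_in_election, only_in_census = join_by_key(election_rows, census_rows)
print(sorted(joined), only_in_election, only_in_census)
```

Listing the unmatched keys from both sides makes spelling inconsistencies easy to spot before the data is cleaned and joined again.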

After cleaning the data, the tables joined neatly. Using GIS, I then linked the combined election and census dataset to the geographic borders of the counties on the electronic map. I color coded the map by political party. The darker shade of each color shows where the political party won the majority of votes in the county (greater than 50%), while the lighter shade shows where the party won only a plurality of the votes in the county. As you can see from the map’s multiple colors, unlike modern American politics, the 1860 presidential election involved more than two prominent political parties, including Republicans, Northern and Southern Democrats, and the Constitutional Union Party. The political divide between North and South is clearly apparent along the Mason-Dixon Line between Pennsylvania and Maryland, foreshadowing the sectional conflict of the American Civil War nearly six months after the election.
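The majority-versus-plurality distinction behind the two shades can be expressed in a few lines of Python. This is only a sketch of the classification rule, not the ArcGIS symbology itself; the function name `classify_winner` and the vote counts are hypothetical, and ties are not handled.

```python
def classify_winner(votes):
    """Return (party, 'majority' or 'plurality') for a county.

    votes : dict {party name: vote count}
    A majority means more than 50% of all votes; otherwise the
    leading party holds only a plurality.
    """
    total = sum(votes.values())
    party = max(votes, key=votes.get)
    kind = "majority" if votes[party] * 2 > total else "plurality"
    return party, kind

# Hypothetical vote counts for two counties.
county_a = classify_winner(
    {"Republican": 600, "Northern Democrat": 300, "Constitutional Union": 100}
)
county_b = classify_winner(
    {"Republican": 400, "Southern Democrat": 350, "Constitutional Union": 250}
)
print(county_a, county_b)
```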

Mapping historical data is certainly a different process than mapping current data and can prove to be more time-consuming and complex. Though current tools (like FIPS codes) can help standardize mapping techniques, they may not be applicable to historical data and may need to be discarded or updated. Historical FIPS codes, anyone?