# A Tool for Visualizing Regression Models

Will sales of a good increase when its price goes down? Does the life expectancy of a country have anything to do with its GDP? To help answer questions like these about how different measures relate, researchers and analysts often use regression techniques.

Linear regression is a widely used tool for quantifying the relationship between two or more quantitative variables. The underlying premise is simple: no more complicated than drawing a straight line through a scatterplot! This simple tool is nevertheless used for everything from market forecasting to economic models. Because it is so pervasive in analytical fields, it is important to develop an intuition for regression models and what they actually do. For this, I have developed a visualization tool that allows you to explore the way regressions work.

You can import your own dataset or choose from a selection of others, but the default one contains information on a selection of movies. Suppose you want to know the strategy for making the most money from a film. In regression terminology, you are asking which variables (factors) might be good predictors of a film’s box office gross.

The response variable is the measure you want to predict, which in this situation will be the box office gross (BoxOfficeGross). The attribute that you think might be a good predictor is the explanatory variable. The budget of the film might be a good explanatory variable to predict the revenue a film might earn, for example. Let’s change the explanatory variable of interest to Budget to explore this relationship. Do you see a clear pattern emerge from the scatterplot? Can you find a better predictor of BoxOfficeGross?

If you want to control for the effects of other pesky variables without having to worry about them directly, you can include them in your model as control variables.

Below the scatterplot are two important measures that are used in evaluating regression models: the p-value and the R² value. The p-value tells us the probability of seeing a relationship at least as strong as the one in our data if there were really no relationship at all, that is, just by chance. In the context of a regression model, it suggests whether the specific combination of explanatory and control variables really does seem to affect the response variable in some way: a lower p-value means that there seems to be something actually going on with the data, as opposed to the points being just scattered randomly. The R² value, on the other hand, tells us what proportion of the variability in the response (predicted) variable is explained by the explanatory (predictor) variable; in other words, how well the model fits. If a model has a low R² value and is incredibly bad at predicting our response, it might not be such a good model after all.

If you want to predict a movie’s RottenTomatoesScore from its RunTime, for example, the incredibly small p-value might tempt you to conclude that, yes, longer movies do get better reviews! However, if you look at the scatterplot, you might get the feeling that something’s not right. The R² value tells us this other side of the story: though RunTime does appear to be correlated with RottenTomatoesScore, the strength of that relationship is just too weak for us to do anything with!
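To make the "small p-value, low R²" situation concrete, here is a minimal sketch in plain Python. It does not use the tool's movie dataset: the run times and scores are simulated, and the function name `simple_regression` and all the numbers are assumptions chosen for illustration. The p-value uses a large-sample normal approximation rather than the exact t-distribution, which is reasonable only because the simulated sample is large.

```python
import math
import random
from statistics import mean


def simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared, p_value). The p-value is a
    large-sample normal approximation to the usual t-test on the slope,
    and the function assumes the fit is not perfect (r_squared < 1).
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)  # share of variance explained

    # Test statistic for "slope = 0"; two-sided tail area under a normal curve.
    t_stat = math.sqrt(r_squared * (n - 2) / (1 - r_squared))
    p_value = math.erfc(t_stat / math.sqrt(2))
    return slope, intercept, r_squared, p_value


# Simulated (not real) data: a weak but genuine link between run time and score.
random.seed(0)
run_time = [random.gauss(110, 20) for _ in range(5000)]
score = [60 + 3 * (rt - 110) / 20 + random.gauss(0, 20) for rt in run_time]

slope, intercept, r_squared, p_value = simple_regression(run_time, score)
print(f"slope={slope:.3f}  R^2={r_squared:.3f}  p-value={p_value:.2g}")
```

With thousands of points, even a faint trend yields a tiny p-value, while R² stays close to zero because most of the variation in the simulated scores has nothing to do with run time.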

Play around with the default dataset provided, or use your own dataset by going to the Change Dataset tab at the top of the page. This visualization tool can be used to develop an intuition for regression analysis, to get a feel for a new dataset, or even in classrooms for a visual introduction to linear regression techniques.

You can leave the list at any time. Removal instructions are included in each message.

# Modeling Population Growth in Excel

The Malthus and Condorcet Equations, simple formulas that model relatively complex trends in population growth, are now accessible with an Excel calculator that allows the user full control over every component of the equations. Students can use the Excel file to model human population growth under the assumption that a human carrying capacity exists.

The Malthus Equation expresses the growth rate of a population as a function of the current population size and current carrying capacity. Specifically, the growth rate of a population is equal to a Malthusian parameter multiplied by the current population size multiplied by the difference between the current carrying capacity and the current population size. This relationship produces a high growth rate once a population is large enough to reproduce at its full potential, but a low growth rate when the population is very small or when it is nearing its carrying capacity and feeling the effect of constrained resources. The Malthusian parameter is almost invariably between zero and one because a negative Malthusian parameter would lead to a population’s gradual extinction, while a Malthusian parameter greater than one would lead to explosive population growth that would greatly exceed the carrying capacity. In the latter situation, unrealistically rapid and extreme periods of growth and contraction would ensue.

The Condorcet Equation expresses the growth rate of the carrying capacity of a population as equal to the growth rate of the population multiplied by a constant termed the Condorcet parameter. The logic behind this mathematical relationship is that the carrying capacity increases or decreases in proportion to the growth rate of the population, because each additional person can have a positive or negative effect on the carrying capacity. A Condorcet parameter greater than one therefore describes a society where an additional individual somehow increases the number of people that can be supported even after taking into account the resources that individual consumes; this could happen where there are increasing returns to labor. If doctors cure diseases better when more of them work together, this is reflected by a Condorcet parameter greater than one. A Condorcet parameter between zero and one is most realistic for human populations because the contribution of another person will probably grow the carrying capacity, but not by more than one person. A negative parameter implies that an additional person would actually lower the carrying capacity; perhaps every additional person would consume natural resources at a rate greater than the previous individual’s rate.

As Cohen (1995 Science 269: 341-346) points out, the equations are not necessarily realistic models of human population growth. There is no consensus about whether or not a human carrying capacity exists. In theory, we as a species might be able to continually develop technology at such a rate that we are unable to approach a carrying capacity. A slowdown in overall human population growth is more likely due to a global increase in income per capita that leads to altered reproductive strategies.

Figure 1: With r = 0.1 and c = 0.1 as parameters, the population experiences a positive but steadily decreasing growth rate. The carrying capacity increases at one tenth the rate of population growth, and population growth slows as the population size approaches the carrying capacity, so we observe almost asymptotic behavior. This is a realistic pattern for human population growth if a carrying capacity exists.

The calculator defines the Malthus Equation as dP(t)/dt = r P(t) [K(t) - P(t)] and the Condorcet Equation as dK(t)/dt = c dP(t)/dt (see Cohen 1995: 343). The user may enter values for r (the “Malthusian parameter”), c (the “Condorcet parameter”), the initial values of P(t) (population size) and K(t) (carrying capacity), t_0 (the starting time for the model), and dt (the length of one interval in time); together these determine all of the future changes in population size. The rates of change of population and carrying capacity at time t, dP(t)/dt and dK(t)/dt respectively, are determined by the equations. The Malthusian and Condorcet parameters are constant in a growth model provided that there are no exogenous shocks that affect the nature of population or carrying capacity growth. Because of this, they do not vary as a function of t.
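The two equations can also be stepped forward outside of Excel. The sketch below is one plausible way to do it, not a transcription of the spreadsheet: it assumes a simple forward Euler update (the calculator's own update rule is not described here), and the function name `simulate` and the starting values P = 1 and K = 2 are hypothetical. Only r = 0.1 and c = 0.1 come from Figure 1.

```python
def simulate(r, c, p0, k0, t0=0.0, dt=0.1, steps=2000):
    """Step dP/dt = r*P*(K - P) and dK/dt = c*dP/dt forward in time.

    Uses forward Euler (an assumption): each interval of length dt adds
    the current rate of change times dt to P and K.
    Returns a list of (t, P, K) tuples, starting with the initial state.
    """
    t, p, k = t0, p0, k0
    history = [(t, p, k)]
    for _ in range(steps):
        dp = r * p * (k - p) * dt  # Malthus: change in population this interval
        dk = c * dp                # Condorcet: capacity changes c times as much
        t, p, k = t + dt, p + dp, k + dk
        history.append((t, p, k))
    return history


# r and c as in Figure 1; the initial P and K are made-up illustrative values.
history = simulate(r=0.1, c=0.1, p0=1.0, k0=2.0)
t_end, p_end, k_end = history[-1]
print(f"t={t_end:.1f}  P={p_end:.4f}  K={k_end:.4f}")
```

Because the carrying capacity always changes by c times the change in population, K - cP stays fixed at its starting value. Under these assumptions the population levels off where P = K, at (K_0 - c P_0) / (1 - c), which is the near-asymptotic behavior described for Figure 1.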

# Mental Health Mortality, by Gender and Race

US President Barack Obama announced on January 5th that he would be taking executive action on gun control in light of a tragic trend of mass shootings in the last several years. Among the details of his gun control plan, he mentioned an increase in mental health services. While the expansion of mental health support may help ameliorate the mass shooting epidemic, it may also have positive implications for reducing the number of Americans who die due to mental health causes. Using DASIL’s United States Mortality by Cause of Death, Race, and Gender visualization, one can see how deaths due to mental illness have been on the rise since the 1990s, and how the trend has affected demographic groups differently.

When looking at strictly male versus female deaths due to mental health causes, males in recent years are slightly more affected than females, at an average 3.84 deaths compared to 3.50 as of 2009. However, the 90s saw the reverse, with female fatalities at 1.92 compared to 1.37 in 1994.

When breaking down each gender by race, a much different story emerges. For females, the sharp rise in deaths due to mental health is observed after the year 2000, which differs from the trend for all races and all genders. In addition, while each race follows the same sharp increase after the year 2000, white women are more adversely affected, at an average 5.92 deaths compared to 4.09 for blacks and 3.68 for other races in 2009. For males, the same sharp increase also appears after the year 2000, but the averages for each race are much lower than those of their female counterparts. White males are also more adversely affected than males of other races, at 3.11 deaths, while black males average 2.41 deaths and other races 2.33 deaths.

Why has mental health been more fatal for women across all demographics? One reason may be eating disorders. Women are more likely to develop an eating disorder than men (although men develop them too), and eating disorders have the highest mortality rate of any mental illness. For example, according to the South Carolina Department of Mental Health, the mortality rate associated with anorexia nervosa, one type of eating disorder, is 12 times higher than the death rate associated with all causes of death for females aged 15 to 24.

President Obama’s plan for enforced support and better resources for those suffering from mental illness may help not only in tackling the gun violence epidemic, but also in reducing the broader toll of deaths from mental illness.

# Highlighting the Importance of Intersectionality in the Gender Pay Gap

The gender pay gap has again received much-needed publicity in recent years, both as a topic of debate between US presidential hopefuls for 2016 and through information uncovered by Sony’s email hack this time last year. While the phrase “women get paid 78% of what men are paid” is touted frequently in discussion, the 78% figure is one-dimensional. Do all women get paid 78% of what men are paid, or is it just a subset of the female working population?

There is a lot more to the 78% figure than meets the eye, and the intersection of race and gender is important to telling the fuller story behind the 78%, and the wider issue of gender parity in earnings.

Using DASIL’s Pay by Race & Gender visualization, we can see that race plays a significant role in the pay of a full-time working woman, which adds nuance to the widely cited 78% figure. Asian women working full-time in the US are, and historically have been, the subset of women paid closest to what all men are paid, at 86% of men’s pay in 2013. However, Asian women were paid only 75% of what Asian men were paid in 2013. On the other end of the spectrum, Hispanic women were paid only 60% of men’s wages in 2013, the lowest of all recorded races. Hispanic men also earn the least relative to all men, at 64% of what all men earned in 2013 (not shown). As the graph indicates, the disparities for Hispanic and Black women have remained relatively constant for the past twenty years.
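The two percentages quoted for Asian women use different baselines, and a little arithmetic shows what they imply together. The sketch below is illustrative only: the function name `implied_baseline_ratio` is hypothetical, and it assumes both percentages are computed from the same pay measure, which is not stated here.

```python
def implied_baseline_ratio(share_of_all_men, share_of_own_group_men):
    """Pay of a group's men relative to all men, implied by two pay ratios.

    If women in a group earn `share_of_all_men` of what all men earn and
    `share_of_own_group_men` of what men in their own group earn, then the
    men in that group earn the quotient of the two, relative to all men.
    """
    return share_of_all_men / share_of_own_group_men


# 2013 full-time figures quoted in the text for Asian women.
asian_men_vs_all_men = implied_baseline_ratio(0.86, 0.75)
print(f"Asian men earn about {asian_men_vs_all_men:.0%} of what all men earn")
```

Under that assumption, Asian men would earn roughly 15% more than men overall, which is why Asian women can be closest to parity with all men while still facing a wide gap within their own racial group.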

With regard to part-time labor, however, there was virtually complete gender parity in 2013 when focusing on average figures, with “all women” receiving 99% of what a man earned. When filtering by race, part-time working White and Asian women were even paid more than the average man: White women received 106% of what a man earned, and Asian women 101%. However, racial disparity persists: both Black and Hispanic women in part-time labor received 85% of what part-time men were paid in 2013, and the closest Black and Hispanic women have come to pay parity with the average man was in 1994.

As this infographic suggests, one reason for full-time and part-time pay disparity may be industry: Black women are more likely to work in less lucrative jobs (e.g. service, healthcare) than in more lucrative ones (e.g. STEM, management). Relatedly, education can be a contributing factor: Hispanic and Black women are less likely to graduate than whites. Yet even when women of color have the same education levels as their white peers, they are still paid less; more is contributing to pay disparity than the educational attainment of women of color.

While there is clear cause for more work to be done in bridging the pay gap between men and women, recognizing the multiple dimensions of the issue will be key to creating meaningful and effective policy changes.

Explore more trends with our Pay by Gender and Race visualization here.

# Kick Off Summer Vacation With Tourism Data!

Commencement is over at Grinnell College, and here at DASIL we’re settling into our summer routine. What better way to start the summer than with a look at tourism data?

* Interested in knowing how many people visit U.S. national parks annually? The National Park Service has 111 years of visitor data for national parks.

* Can you guess which country sends the most visitors to the United States annually? When you have your guess, visit the site of the Office of Travel and Tourism Industries of the U.S. International Trade Administration to find out the answer!

* Thinking about vacationing somewhere chilly this year? The International Association of Antarctica Tour Operators has fifteen years of historical data about visitors to Antarctica.
