# 5 Things To Do with a Data Set

## Clustering

Understanding how the data is organized helps us analyze it, much as prediction and classification do. One way to tease out the structure of the data is to look for clusters. Based on the patterns in the data, we can group individual observational units into distinct clusters, defined so that observations within each cluster have similar characteristics. We can then analyze each group further and compare the groups with one another. For example, marketers may want to identify customer segments in order to develop targeted marketing strategies. A cluster analysis groups customers so that people in the same segment tend to have similar needs but differ from those in other segments. Popular clustering methods include k-means, hierarchical clustering, and latent class analysis; multidimensional scaling is often used alongside them to visualize how observations group.

Image source: http://www.dynamic-concepts.nl/en/segmentation/
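To make the idea of grouping similar observations concrete, here is a minimal sketch using k-means, one common clustering method. The customer data, the feature choice (visits and spend), and the deterministic starting centroids are all assumptions made for illustration, not part of any real marketing data set.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters with plain k-means."""
    # Assumption: start from the first k points so the result is
    # deterministic; real tools usually choose starting points at random.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments, centroids

# Hypothetical customers: (visits per month, average spend in dollars).
customers = [(2, 20), (20, 200), (3, 25), (1, 15), (22, 210), (19, 190)]
segments, centers = k_means(customers, k=2)
print(segments)  # one cluster index per customer
```

In this toy example the occasional shoppers and the frequent shoppers end up in separate segments. With real data you would normally rescale the features first so that one of them (here, spend) does not dominate the distances.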

## Classification

The first step in classification is constructing a classification system. The categories can be based either on theory or on observed statistical patterns, such as those detected using clustering techniques. The next step is to identify the category or group to which a new observation belongs. For example, a new email can be put in the spam bin or the non-spam bin based on its contents. In statistics and machine learning, logistic regression, linear classifiers, support vector machines, and linear discriminant analysis are popular techniques for classification problems.
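The spam example can be sketched with a tiny logistic regression fitted by gradient descent. Everything specific here is an assumption for illustration: the single feature (a count of suspicious words per email), the training data, and the learning rate and step count. A real spam filter would use many more features and a tested library.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, steps=2000):
    """Fit p(spam) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training email.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Step against the gradient of the average log loss.
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

def is_spam(x, w, b):
    """Classify a new email as spam when its predicted probability exceeds 0.5."""
    return sigmoid(w * x + b) > 0.5

# Hypothetical training set: suspicious-word counts and labels (1 = spam).
word_counts = [0, 1, 2, 4, 5, 6]
spam_labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(word_counts, spam_labels)

print(is_spam(6, w, b), is_spam(0, w, b))
```

The two steps from the text are both visible: the categories (spam, non-spam) are fixed up front, and `is_spam` assigns a new observation to one of them.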

## Prediction

Predictive models can be built from the available data to tell you what is likely to happen. They assume either that past statistical patterns can be used to predict the future or that some theoretical model is valid. For example, Netflix recommends movies to users based on the movies and shows they have watched in the past.

Can we predict who will be the next president, Clinton or Trump? Yes, we can. Using polling data or the candidates’ speeches, you can build a predictive model for the 2016 presidential election. Nate Silver is well known for the accuracy of his predictions of both political and sporting events. Here is his prediction model for the 2016 presidential election:

Source: http://projects.fivethirtyeight.com/2016-election-forecast/

Predictive modeling draws on regression analysis, including linear regression, multiple regression, and generalized linear models, as well as machine learning algorithms such as random forests. Factor analysis is sometimes used alongside these to condense many variables into a few underlying factors. Time series analysis can be used to forecast the weather or next season’s sales of a product.
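As a small illustration of using past patterns to predict the future, the sketch below fits a straight line to past sales by ordinary least squares and extends it one period ahead. The quarterly sales figures are made up for the example, and a straight-line trend is itself an assumption that real data may not satisfy.

```python
def fit_line(xs, ys):
    """Return the least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales (in thousands of units) for five past quarters.
quarters = [1, 2, 3, 4, 5]
sales = [110, 118, 133, 139, 150]

slope, intercept = fit_line(quarters, sales)
forecast = slope * 6 + intercept  # extend the trend to quarter 6
print(round(forecast, 1))
```

The forecast is only as good as the assumption that the past trend continues, which is exactly the first assumption described above.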

## Anomaly Detection

Anomaly detection identifies unexpected or abnormal events. In other words, we look for deviations from expected patterns. Credit card fraud detection is one example: credit card companies analyze customers’ purchase behavior and history so that they can alert customers to possible fraud. Popular anomaly detection techniques include k-nearest neighbors, neural networks, support vector machines, and cluster analysis.
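Here is a minimal sketch of the k-nearest-neighbor idea applied to purchase amounts: a purchase whose closest neighbors are far away is a candidate anomaly. The purchase amounts, the choice of k = 2, and the cutoff of three times the median score are all assumptions chosen for illustration.

```python
from statistics import median

def knn_scores(values, k=2):
    """Score each value by its mean distance to its k nearest neighbors."""
    scores = []
    for i, value in enumerate(values):
        distances = sorted(
            abs(value - other) for j, other in enumerate(values) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores

# Hypothetical purchase amounts in dollars for one customer.
purchases = [12, 15, 14, 13, 16, 250, 11]
scores = knn_scores(purchases)

# Assumed rule of thumb: flag scores far above the typical score.
threshold = 3 * median(scores)
flagged = [amount for amount, s in zip(purchases, scores) if s > threshold]
print(flagged)
```

Ordinary purchases sit close to one another and get small scores, while the one unusually large purchase has no near neighbors and is flagged.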

## Decision Making

One of the most common motivations for analyzing data is to drive better decision making. When a company needs to promote a new product, it can use data analysis to set a price that maximizes profit and avoids price wars with competitors. Data analysis is so central to decision making that almost all analytic techniques (including not only the ones mentioned above but also geographic information systems, social network analysis, and qualitative analysis) can be applied.
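The pricing decision can be sketched as a simple search over candidate prices. The linear demand curve, the unit cost, and every number below are hypothetical; in practice the demand relationship would be estimated from sales data, and competitors' reactions would need to be modeled separately.

```python
def profit(price, unit_cost=4.0, base_demand=100.0, sensitivity=5.0):
    """Profit at a given price under an assumed linear demand curve."""
    # Assumption: demand falls by `sensitivity` units for each extra dollar.
    demand = max(0.0, base_demand - sensitivity * price)
    return (price - unit_cost) * demand

# Candidate prices from $4.00 to $20.00 in 50-cent steps.
candidate_prices = [p / 2 for p in range(8, 41)]
best_price = max(candidate_prices, key=profit)
print(best_price, profit(best_price))
```

Under these assumed numbers the search settles on a price between the unit cost (no margin) and the price at which demand drops to zero (no sales).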


# 5 Must-See TED Talks on Data Visualization!

Data visualization is crucial in understanding data and identifying hidden connections that matter. Below are 5 TED talks on data visualization you don’t want to miss!

1. Hans Rosling: The best stats you’ve ever seen

Hans Rosling, cofounder of the Gapminder Foundation, developed the Trendalyzer software that converts international statistics (such as life expectancy and child mortality rate) into innovative, interactive graphics. The statistics guru is a strong advocate for public access to data and for the development of tools that make it accessible and usable for all. In this classic talk, Rosling highlights the importance of data in debunking myths about the gap between developed countries and the so-called “developing world.” Even though the talk was filmed 10 years ago, its message remains important and relevant.

Watch more of Rosling’s TED talks here.

2. David McCandless: The beauty of data visualization

In this visually captivating talk, data journalist David McCandless suggests that data visualization is a quick solution to our current problem of information overload. Visualizations allow us to see the hidden patterns, identify connections that matter, and tell stories with data. To McCandless, “even when the information is terrible, the visual can be quite beautiful”; this is a controversial claim, however, since the main goal of data visualization should be to communicate information effectively through graphical means.

3. Dave Troy: Social maps that reveal a city’s intersections – and separations

A serial entrepreneur and data-viz fan, Dave Troy takes a people-focused approach to data visualization. Troy has been mapping tweets among city dwellers, revealing what connects communities and what separates them – above and beyond demographic factors such as race or ethnicity. He compares a city to a “giant high school cafeteria” and suggests that we see “how everybody arranged themselves in a seating chart”, arguing that “maybe it’s time to shake up the seating chart a little bit” to reshape our cities.

4. Eric Berlow & Sean Gourley: Mapping ideas worth spreading

An ecologist and a physicist, Eric Berlow and Sean Gourley, collaborate in this presentation to create stunning 3D visualizations demonstrating the interconnectedness of ideas. Taking 4,000 TEDx talks from 147 countries representing 50 languages, they explore the talks’ “meme-omes” (the mathematical structures that underlie the ideas behind these talks) and discover similarities between seemingly unconnected topics. Berlow and Gourley also break complex themes down into more specific ones, showing which topics resonate with viewers and which audiences look at which topics. To Gourley, mapping ideas in this way will help us “to see what’s being said, to see what’s not being said, and to be a little bit more human and, hopefully, a little smarter.”

5. Manuel Lima: A visual history of human knowledge

Founder of VisualComplexity.com Manuel Lima, described by Wired Magazine as “the man who turns data into art,” explains the visual metaphor shift from the tree to the network as “a new lens to understand the world around us.” Lima argues that the tree – an important tool to map everything from genealogy to systems of law to Darwin’s “Tree of Life” – is being replaced by a new metaphor – the network. Rigid structures are evolving into interdependent systems, and networks emerge to embody the nonlinearity, decentralization, interconnectedness, and multiplicity of ideas and knowledge. The shift in visual metaphor also represents a new way of thinking – one that is critical for us to solve many complex problems we are facing.


# Meet Yujing Cao, DASIL’s new data scientist!

This year, DASIL welcomes a new member of our staff, Yujing Cao, who will be serving as the new data scientist. In her position at DASIL, Yujing will bring her expertise in data analysis and visualization to further expand DASIL’s capability to help students and faculty members integrate data analysis into research and classroom work.  In today’s big data era, enormous quantities of data are available, and Yujing will help Grinnell students and faculty explore them.

Yujing Cao is excited about joining DASIL and bringing a new level of data analysis to faculty research and teaching!

Originally from China, Yujing earned her bachelor’s degree in Statistics from Anhui University. Her passion for data science led her to a PhD program in Statistics at the University of Texas at Dallas, where she obtained her degree in 2016. Her research was on graphical modeling of biological pathways in genomic studies. She is also interested in network analysis, machine learning, and trying different tools for data visualization. In her spare time, she enjoys reading, hiking, and exercising.

Yujing was excited about the position at Grinnell because of her strong interests in teaching and in data visualization. As she puts it:

“I wanted to look for a position which provides opportunities to create interesting data visualizations along with other data analysis work. I love using graphs to tell stories behind different data sets.

Working environment is another factor that led to my decision to come to Grinnell. I strongly resonate with the core values of a liberal arts education. At Grinnell College, I can work in an academic environment helping faculty and students while promoting the use of data in research and learning.”

Yujing also discusses a number of skills crucial to succeeding in the field of data science. Data science is an interdisciplinary field requiring knowledge from mathematics, statistics, data mining, and machine learning. Statistical knowledge and knowledge from other fields can help form good questions and seek direction, while programming skills (e.g. joining data sets and visualizing data) are needed to implement ideas. To be a good data scientist, you should possess strong programming and analytical skills.

According to Yujing, “One of the most important qualities for any data scientist is curiosity. Curiosity encourages us to dig in and make interesting discoveries about data. Also, good communication skills can make a great data scientist. You should be able to clearly articulate your results and the implications of your findings to others, including other data scientists and people who don’t share a similar background.”

Her tip for students interested in a career in data science is to keep an open mind, learn from different disciplines, and sharpen their programming skills. In addition, a student who is interested in being a data scientist should take advantage of any opportunity to work on hands-on projects that use real data.

Faculty or students interested in meeting with Yujing should drop by DASIL (ARH 130) or her office (Goodnow 103), or contact her via email at caoyujin@grinnell.edu for an appointment.