How to Analyze Survey Data (by a Data Scientist)

So you just collected all this valuable data, but now what? How do you analyze it to draw conclusions and find correlations in the data? This guide will teach you how to analyze your dataset even with zero prior knowledge about stats. Let's get started!

What is Survey Data Analysis?

Survey data analysis is the process of extracting meaning from the dataset you've gathered. Using survey analysis techniques, you can find correlations, patterns, trends and other insights that help businesses guide their decision-making.

For example: You run an ecommerce store and conducted a survey asking your customers why they bought your product. You find out that most of your customers were on the fence about buying the product since there weren't any reviews online, but a Reddit post convinced them to buy it. You then invest more money into online reputation management & PR.

Why is Survey Data Analysis Important?

Surveys are used in four main areas:

  • Marketing/sales
  • Social sciences
  • Academia
  • Personal reasons

They are relatively cheap, easy to administer and allow you to get a good sample size quickly.

In marketing/sales, surveys are a great tool for understanding audience personas, and conducting market research. They can also be used to monitor trends over time. Analyzing survey data can be closely related to analyzing marketing data.

Surveys are also a key component in social sciences to study human behaviour. For example, you might want to find out how happy people are living with their partner and how that differs from country to country.

Sometimes surveys are used for academic or personal reasons. Example: You want to find out who people's favorite characters in Game of Thrones were, and whether this differs by demographic.

Surveys can provide really important insights to businesses that give them a market edge.

How to Analyze Survey Data

The easiest (and often the best) way to analyze survey data is by using univariate and bivariate analysis.

What is Univariate/Bivariate analysis?

Univariate analysis is the analysis of one variable. (One spreadsheet column).

Bivariate analysis is the analysis of two variables. (Two spreadsheet columns).

Let's take a look at an example:

example survey

To get a feel for your data, it's best to start off with a univariate analysis, by seeing how many males, females and other genders participated in the survey. To perform univariate analysis, bar charts are your best friend!

Univariate analysis provides a good starting point to your data and often is a good way to show what demographics took your survey. However, bivariate analysis is where all the interesting analysis happens.

Analyzing survey data is a matter of cross-checking how different variables interact with each other, for instance:

  • "How does gender/ethnicity/education level influence salary?"
  • "How does household income affect education levels?"
  • "Is there any correlation between age and salary?"

These are all examples of bivariate analysis. The graphs you use will differ depending on the types of data you have.
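As a rough illustration of the difference, the sketch below uses only Python's standard library on a handful of made-up responses. The column names `gender` and `education` and all of the values are hypothetical. Counting the values of one column is a univariate analysis; counting pairs of values from two columns is a simple bivariate analysis.

```python
from collections import Counter

# Hypothetical survey responses; real data would come from your spreadsheet export.
responses = [
    {"gender": "Female", "education": "Bachelor"},
    {"gender": "Male", "education": "High school"},
    {"gender": "Female", "education": "Master"},
    {"gender": "Male", "education": "Bachelor"},
    {"gender": "Other", "education": "Bachelor"},
    {"gender": "Female", "education": "Bachelor"},
]

# Univariate: how often does each value of ONE column occur?
gender_counts = Counter(row["gender"] for row in responses)

# Bivariate: how often does each PAIR of values from TWO columns occur?
gender_by_education = Counter(
    (row["gender"], row["education"]) for row in responses
)

print(gender_counts)
print(gender_by_education[("Female", "Bachelor")])
```

The univariate counts are what a bar chart of a single column would show, while the pair counts are the raw material for a cross-tabulation.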

How do I know which variables to cross check? Oftentimes, intuition is the best way, but there are some methods which I'll show you later.

Types of Survey Data:

The first step towards analyzing survey data is identifying the type of data you're dealing with.

There are 3 types of survey data: numerical, categorical and sentences.

Numerical Answers:

  • 53 kg
  • 170 cm
  • 100 IQ
  • Score: 50 points

Categorical answers: 

  • Yes/no
  • Male/female
  • USA, Canada, UK, Australia
  • Age ranges: 16-24, 25-34, 35-44
  • Never, rarely, sometimes, frequently

Sentence/long answers:

  • "I think the app can be improved if it loads faster."
  • "Customer service was rude."
  • "I loved the app!"

These can run to several paragraphs or even several documents.

Numeric answers are the easiest to analyze, followed by categorical answers. Long answers are the most difficult to analyze.
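If you have many columns, a crude first guess at each column's type can be automated. The helper below is only a heuristic sketch: the function name `infer_answer_type`, the `max_categories` threshold and the three-word limit are all assumptions chosen for illustration, and the result should be checked by eye.

```python
def infer_answer_type(values, max_categories=10):
    """Guess whether a column holds numerical, categorical or sentence answers."""
    answers = [v.strip() for v in values if v and v.strip()]
    if not answers:
        return "empty"

    # If every answer parses as a number, call the column numerical.
    try:
        for answer in answers:
            float(answer)
        return "numerical"
    except ValueError:
        pass

    # Few distinct, short answers suggest categories; otherwise treat as free text.
    distinct = set(answers)
    if len(distinct) <= max_categories and all(len(a.split()) <= 3 for a in distinct):
        return "categorical"
    return "sentence"


print(infer_answer_type(["53", "61.5", "70"]))
print(infer_answer_type(["Never", "Rarely", "Sometimes", "Never"]))
print(infer_answer_type(["Customer service was rude.", "I loved the app!"]))
```

Note that an answer such as "53 kg" does not parse as a number, so units would need to be stripped before a column like that is recognized as numerical.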

Types of Survey Questions

The most common types of survey questions are:

  • Short/long answer responses
  • Multiple choice & dropdowns
  • Checkboxes
  • Linear scales 
  • Ranking Questions
  • Date/Time
  • Geographic

Classifying Each One (With Examples)

Multiple choice example: What is your gender?

  • Male
  • Female
  • Other
  • Prefer Not to Say

Multiple choice questions provide you with categorical answers.

categorical survey data


Linear scale example: On a scale of 1-5, how severely do you experience VR motion sickness?

linear scale survey question example

It might seem confusing, but answers on a linear scale can be considered both categorical and numerical. This is because there are only 5 possible answers: 1, 2, 3, 4, 5 and therefore these can be classified into categories.

This type of data is known as ordinal data: a 3 is higher than a 2, but the distance between the points is unknown, i.e. we can't say that a 3 means 50% more motion sickness than a 2.
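To make the "both categorical and numerical" idea concrete, here is a small sketch with made-up answers to the 1-5 question. It counts the answers as categories and also summarizes them with the median, which depends only on the ordering of the scale and not on equal spacing between points.

```python
from collections import Counter
from statistics import median

# Hypothetical answers to the 1-5 motion sickness severity question.
answers = [1, 2, 2, 3, 3, 3, 4, 5, 5]

# Treated as categories: how many respondents chose each point on the scale.
distribution = Counter(answers)

# Treated as ordered values: the median relies on order, not on equal distances.
typical = median(answers)

for point in range(1, 6):
    print(point, "#" * distribution[point])
print("median:", typical)
```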

Ranking question example: Rank your most used VR locomotion styles:

ranking question example
Ranking Question Answer

Checkboxes example: What do you use VR for? Select all that applies:

Sometimes these questions may be in the form of "select up to 5." Checkboxes provide you with categorical answers, but the format is different to multiple choice questions. The output of your answers will look like this:

survey question: select up to 5

Analyzing these types of "list data" can be tricky, but we’ll show you a neat little trick that makes it extremely easy!
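Whatever tool you use, the underlying idea is to split each cell on its separator and count every option on its own. A minimal sketch, assuming the export puts all ticked options in one comma-separated cell (the option names here are invented):

```python
from collections import Counter

# Hypothetical export: each cell holds every option the respondent ticked.
answers = [
    "Gaming, Fitness",
    "Gaming",
    "Gaming, Social, Movies",
    "Fitness, Movies",
    "",  # respondent skipped the question
]


def split_options(cell):
    """Turn one comma-separated cell into a list of clean option names."""
    return [option.strip() for option in cell.split(",") if option.strip()]


option_counts = Counter(
    option for cell in answers for option in split_options(cell)
)
print(option_counts.most_common())
```

This breaks down if an option label itself contains a comma, so it is worth checking the option wording before splitting.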

Short answer responses example: What VR experience causes the worst motion sickness for you?

“Damn rollercoasters! I nearly vomited one time because of them. Also driving vehicles in VR.”

Short answer responses (and also longer answer responses) can fall under unstructured data. We’ll need to convert these to categorical answers.

Dates example: What is your date of birth?

22/11/1993

Dates are a weird one. They don’t fall into either categorical or numerical data. Just treat them as ‘dates.’

Geographic: What country are you from?

UK

Geographic data is categorical, but can also be visualized differently from other data types (e.g. geographical heatmaps).

Converting "Sentence Answers" into "Categorical Variables"

Sentence answers are incredibly difficult to analyze. That's why we often convert them into categorical answers whenever possible. Here's how to do that:

survey data analysis

  1. First, go through each answer one by one and take note of any commonalities.
  2. Next create a new column.
  3. Lastly, label each sentence into a specific category.
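The steps above are manual. For larger surveys, a first pass at the labeling step can be approximated with simple keyword matching. This is a swapped-in shortcut, not the manual method described above, and the categories and keywords below are hypothetical examples you would write yourself after reading through the answers.

```python
# Hypothetical categories and keywords, written after reading the answers by hand.
CATEGORY_KEYWORDS = {
    "Rollercoasters": ["rollercoaster", "roller coaster"],
    "Vehicles": ["driving", "vehicle"],
    "Smooth locomotion": ["smooth locomotion", "joystick"],
}


def label_answer(text):
    """Return every category whose keywords appear in the answer, or ['Other']."""
    lowered = text.lower()
    matches = [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return matches or ["Other"]


answer = (
    "Damn rollercoasters! I nearly vomited one time because of them. "
    "Also driving vehicles in VR."
)
print(label_answer(answer))
print(label_answer("Never had a problem"))
```

Keyword matching works on substrings, so short keywords can match unrelated words. Answers that land in "Other" still need a manual read.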

Decide Which Variables to Compare

How do you know which variables to compare? And which tools should you be using to analyze the data?

  1. Start with an initial question or hypothesis you have about the data.
  2. Cross check every variable against each other. Look at the column headers, and see which variables are worth comparing.
  3. Use Polymer’s Auto-Explainer feature for deeper insights.

Analyzing survey data is a matter of cross-checking every variable against each other and seeing which ones make sense to analyze.

If your survey has 10 questions or fewer, it's fine to do this manually. For larger surveys, however, Polymer's Auto-Explainer feature can speed up this process greatly.
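To see how quickly the number of comparisons grows, you can list every pair of columns. The sketch below uses a few shortened column names as stand-ins for your own headers.

```python
from itertools import combinations

# Stand-in column names; replace with your own survey's headers.
columns = ["age group", "gender", "VR sickness", "VR legs"]

# Every unordered pair of columns is one candidate comparison.
pairs = list(combinations(columns, 2))
for first, second in pairs:
    print(f"{first} vs. {second}")
print(len(pairs), "pairs to review")
```

Four columns give 6 pairs, and 10 columns give 45, which is roughly where reviewing them by hand stops being comfortable.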

Survey Data Analysis Methods

Now you’ll have a good idea of which variables you’re trying to compare. Here are the techniques you can use for each of them:

  • Categorical variable vs numeric variable -> Use a bar chart
  • Numeric vs. numeric variable -> Use a scatterplot
  • Changes over time: 1 numeric variable vs dates  -> Use a time series
  • See where all the volume is coming from -> Heatmaps
  • Summarize and aggregate/group data -> Pivot tables
  • Finding top/worst performers -> Polymer’s Auto-Insights tool

In order to make this step easier, we'll be using an online web tool called Polymer Search which allows us to generate AI insights about our data.

Analyzing surveys is a matter of comparing variables against each other. There are many different combinations for comparison e.g. comparing "gender vs. height" or "IQ vs. income vs. gender"

Polymer Search allows us to see all the different combinations and automatically ranks them from highest variability to lowest. More on how to use the tool later.

After doing conversions, we should have:

  • numerical data
  • categorical data

In general, these are the tools at your disposal:

  • Pivot tables -> Allows you to quickly get answers about your data
  • Polymer's Auto Explainer tool -> Allows you to generate AI insights about your data to find top rankings, anomalies and generate summaries.
  • Bar charts -> for comparing categorical data vs. numerical data.
  • Scatterplots -> for comparing numerical data vs. numerical data.
  • Heatmaps -> See where all the volume is coming from.
  • More about data visualization techniques.

Example to Learn From

For this example, we’ll be analyzing VR Heaven’s survey on motion sickness which contains all the different types of questions you’ll see.

The questions asked:

  • Select Your Age Group (Multiple Choice)
  • What is your Gender? (Multiple Choice)
  • Do you experience motion sickness in VR? (Multiple Choice: Frequently, Sometimes, Rarely, Never)
  • Were you able to grow your VR legs? (VR legs = overcoming VR sickness by using more VR). 
  • Do you experience motion sickness in cars? (Multiple Choice: Frequently, Sometimes, Rarely, Never)
  • Do you experience motion sickness in planes? 
  • Do you experience motion sickness in boats?
  • Rank your most used locomotion styles
  • What do you use VR for? Select all that applies.
  • If you have experienced motion sickness in VR before, can you describe what was the worst experience for you? (optional)

Step One: Clean the Data

To clean your data, first address 'null' or 'missing' fields.

Sometimes the respondents don't answer some questions or other times there are data collection errors.

clean data

Due to this being an online survey, there were a few people who never used VR before i.e. they weren't qualified for the survey. Delete these rows entirely. Their answers don't matter.

Next, there'll be answers like 'N/A' or 'Not applicable' or just a dash (-). These values can be deleted and left blank, whilst keeping the rest of the row intact. Leaving them empty can make the analysis step simpler.

To delete these, press CTRL + F -> Replace -> Find 'Not Applicable' and leave the replace field empty.

There are other methods of dealing with missing data, and it entirely depends on the situation. Sometimes you delete the entire row, sometimes you leave it blank and other times it's appropriate to get an estimate for that value.
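If you prefer a script to find-and-replace, the two cleaning rules above (drop unqualified respondents, blank out placeholder answers) could be sketched as follows. The `used VR` column and the sample rows are hypothetical; the placeholder list mirrors the examples given above.

```python
# Placeholder answers to blank out, compared in lowercase.
PLACEHOLDERS = {"n/a", "not applicable", "-"}


def clean_rows(rows, qualifier_column, disqualifying_value):
    """Drop unqualified respondents and blank out placeholder answers."""
    cleaned = []
    for row in rows:
        if row.get(qualifier_column) == disqualifying_value:
            continue  # respondent was not eligible, so the whole row is dropped
        cleaned.append({
            column: "" if value.strip().lower() in PLACEHOLDERS else value.strip()
            for column, value in row.items()
        })
    return cleaned


# Hypothetical rows; "used VR" is an assumed screening column.
rows = [
    {"used VR": "Yes", "VR sickness": "Rarely", "worst experience": "N/A"},
    {"used VR": "No", "VR sickness": "-", "worst experience": ""},
    {"used VR": "Yes", "VR sickness": "Sometimes", "worst experience": "Rollercoasters"},
]
cleaned = clean_rows(rows, qualifier_column="used VR", disqualifying_value="No")
print(cleaned)
```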

Step Two: Convert Qualitative Data into Quantitative (if Possible)

Quantifying qualitative data will make the analysis step much easier!

This particular survey was designed around the MSSQ-S, a series of questions that researchers use to determine a motion sickness susceptibility score. Following the research, we're able to transform the qualitative measurements into quantitative measurements:

  • Never = 0 points
  • Rarely = 1 point
  • Sometimes = 2 points
  • Frequently = 3 points

This allows us to calculate the VR sickness score and get an average, something we were unable to do before.

Using Excel, we can easily edit this data by pressing CTRL + F -> 'Replace' tab. Find all instances of 'Never' and replace it with the number 0. Find 'Rarely' and replace it with the number 1. Do the same for 'sometimes' and 'frequently.'

convert qualitative data into quantitative
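The same replacement can be expressed as a lookup table, which also makes it easy to compute the average afterwards. A minimal sketch using the point values listed above and made-up answers:

```python
from statistics import mean

# Point values from the list above, keyed by lowercase label.
SCALE = {"never": 0, "rarely": 1, "sometimes": 2, "frequently": 3}


def to_points(answer):
    """Map a frequency label to its points; blanks and unknown labels become None."""
    return SCALE.get(answer.strip().lower())


# Hypothetical answers, including one blank.
answers = ["Never", "Sometimes", "frequently", "", "Rarely", "Sometimes"]
points = [to_points(a) for a in answers]

# Compare against None explicitly so that a legitimate score of 0 is kept.
scored = [p for p in points if p is not None]
average = mean(scored)
print(points, average)
```

Blank answers are left out of the average rather than being counted as 0, which would otherwise pull the score down.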

Rename Columns:

The current column names are too long. This is because the data was imported from Google Forms, which uses the survey question as the column name. Longer column names are harder to work with, so we make them short and sweet:

"Do you experience motion sickness in VR?" -> VR sickness

Do the same for the rest of the columns.
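In code, renaming is a lookup from the long question text to a short name. In the sketch below only "VR sickness" comes from the example above; the other short names are illustrative.

```python
# Map long question text to short names. Only "VR sickness" comes from the
# article; the other short names are illustrative.
RENAMES = {
    "Do you experience motion sickness in VR?": "VR sickness",
    "Do you experience motion sickness in cars?": "car sickness",
    "What is your Gender?": "gender",
}


def rename_columns(row, renames=RENAMES):
    """Return a copy of the row with long column names replaced by short ones."""
    return {renames.get(column, column): value for column, value in row.items()}


row = {
    "What is your Gender?": "Female",
    "Do you experience motion sickness in VR?": "Rarely",
    "Select Your Age Group": "25-34",  # not in RENAMES, so it keeps its name
}
renamed = rename_columns(row)
print(renamed)
```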

Step Three: Add More Columns

Adding more columns = adding more dimensions for analysis.

Whilst it's useful to have scores for motion sickness in vehicles (cars, boats, planes), it'll be more useful to have a score that provides an overall motion sickness value.

We followed the steps in the MSSQ-S and created a new column called 'susceptibility score' using some basic Excel formulas which you can find online:

add columns

This score tells us the person's overall susceptibility to motion sickness in vehicles and will be a crucial component to analyzing this dataset.
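The exact formula used for the 'susceptibility score' column is not reproduced here. As a simplified stand-in (explicitly not the MSSQ-S calculation), the sketch below just sums the 0-3 point answers for cars, planes and boats. The column names are hypothetical.

```python
def vehicle_score(row, columns=("car sickness", "plane sickness", "boat sickness")):
    """Sum the 0-3 point answers across vehicle columns.

    A simplified stand-in, not the MSSQ-S formula. Unanswered columns (None)
    are skipped; returns None if no vehicle question was answered.
    """
    values = [row[c] for c in columns if row.get(c) is not None]
    return sum(values) if values else None


# Hypothetical respondent, already converted to points.
respondent = {"car sickness": 1, "plane sickness": 0, "boat sickness": 3}
print(vehicle_score(respondent))
```

Adding the result as a new column gives each respondent a single number that can be compared against age, gender or the VR sickness score.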

Here's another example of how Alex Almedia creates more dimensions in his dataset to find top converting Facebook ads.

Step Four: Start with Broad Questions

A question like "how many people experience VR motion sickness?" is a good starting point. Pie charts are great for yes/no answers:

pie chart

It's also a good idea to get to know who your demographics are - so find out the age and gender of the people who took your survey.

Head over to the 'visualize' tab in Polymer and input 'gender' and 'age group' into the y-axis (which is reserved for categorical variables):

Bar Chart male vs female vs other
age groups bar chart

Step Five: Cross-check Variables

Cross-check every variable against each other, using some logic to see what makes sense.

Start by cross-checking categorical variables against numeric variables.

We’ve identified that the categorical variables are:

  • Age
  • Gender
  • VR Legs
  • VR sickness score

Whilst the numerical variables are:

  • Vehicle susceptibility score
  • VR sickness score (both numerical and categorical)

So cross-checking these, we find these are the useful ones:

  • Age vs. Vehicle susceptibility score (i.e. do older people experience more real life vehicle sickness?)
  • Gender vs. Vehicle susceptibility score (i.e. does gender play a role in how likely someone experiences vehicle sickness?)
  • Age vs. VR sickness score (does age play a factor in how likely someone experiences VR sickness?)
  • Gender vs. VR sickness score (does gender play a factor?)
  • VR sickness score vs. vehicle susceptibility score (are you more likely to experience VR sickness if you experience vehicle sickness?)
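For the last pair, both variables are numeric, so one way to put a number on the relationship is a correlation coefficient. The sketch below computes Pearson's r by hand on made-up scores; it is an illustration, not a result from the survey.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six respondents.
vehicle = [0, 1, 2, 4, 6, 7]
vr = [0, 0, 1, 2, 3, 3]
r = pearson(vehicle, vr)
print(round(r, 2))
```

A value near 1 means the two scores rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship. The function fails with a division by zero if either list has no variation, and because the VR sickness score is ordinal, a rank-based measure such as Spearman's may be a better fit.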

Step Six: Visualize

Bar charts are your best friend for categorical vs numerical variables.

Again, head over to the ‘visualize’ section in Polymer and let’s set up a bar chart for age vs. VR sickness.  

  • Put “Age” into the Y-Axis (reserved for categorical variables)
  • Put “VR sickness score” into the X-Axis (for numerical variables)
  • Choose calculation “Average”

It’s immediately apparent that age is a big influencer in how often someone experiences VR sickness. 
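The calculation behind such a bar chart is an average per category. For readers working outside Polymer, a minimal sketch on made-up rows (these numbers are not the survey's results) might look like this:

```python
from collections import defaultdict
from statistics import mean


def average_by_group(rows, group_column, value_column):
    """Average a numeric column within each category, skipping missing values."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(value_column)
        if value is not None:
            groups[row[group_column]].append(value)
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical rows; VR sickness is already converted to 0-3 points.
rows = [
    {"age group": "16-24", "VR sickness": 1},
    {"age group": "16-24", "VR sickness": 0},
    {"age group": "25-34", "VR sickness": 2},
    {"age group": "25-34", "VR sickness": 1},
    {"age group": "35-44", "VR sickness": 3},
    {"age group": "35-44", "VR sickness": None},  # unanswered
]
averages = average_by_group(rows, "age group", "VR sickness")

# Print a rough text bar chart, one bar per age group.
for group, avg in sorted(averages.items()):
    print(f"{group:<6} {'#' * round(avg * 4)} {avg:.2f}")
```

Swapping "age group" for "gender" gives the second comparison with no other changes.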

Now let’s do the same for gender:


Again, there’s a big discrepancy between males and females whilst “other” remains in between.

If you only care about seeing male and female, you can filter out results by using the left sidebar. 

  • You can filter in tags by clicking on them
  • You can also filter out tags by clicking the minus symbol at the top right of the tag

Conclusion: In this survey, men reported less VR sickness than women.

Overall: Bar charts are your best friend when it comes to survey analysis! Use them well! 

Analyzing List Data "Select up to 5"

Let's say you asked the question: "What do you use VR for? Select all that applies:"

choose all that applies
survey question: select up to 5

And you want to compare these answers to "age group" and "gender" to see whether different demographics have different uses for VR.

Analyzing this can be a real pain in Excel and can be tricky even to professional data analysts, but Polymer Search makes analyzing this ultra simple. Polymer automatically recognizes that the answers are separated by commas, so all you have to do is put this data into the 'visualize' section (just like we did before) and you'll have your answer!
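If you want to reproduce this kind of breakdown without Polymer, the idea is to split each comma-separated cell and count the options separately within each demographic group. The sketch below is an assumption about how one might do it by hand, not a description of how Polymer works internally, and the rows are invented.

```python
from collections import Counter, defaultdict

# Hypothetical rows: one demographic column and one "select all" column.
rows = [
    {"gender": "Male", "uses": "Gaming, Fitness"},
    {"gender": "Female", "uses": "Fitness, Social"},
    {"gender": "Male", "uses": "Gaming"},
    {"gender": "Female", "uses": "Gaming, Fitness"},
]

# One counter of options per gender.
uses_by_gender = defaultdict(Counter)
for row in rows:
    for option in row["uses"].split(","):
        option = option.strip()
        if option:
            uses_by_gender[row["gender"]][option] += 1

for gender, counts in uses_by_gender.items():
    print(gender, counts.most_common())
```

Because groups can differ in size, dividing each count by the number of respondents in that group gives a fairer comparison than raw counts.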



Posted on
December 22, 2021
under Blog
Written by
Ash Gupta
Former Tech Lead for Machine Learning at Google AdWords (6 years) and a quant developer on Wall Street. Co-Founder & CEO of Polymer Search.