So you just collected all this valuable data, but now what? How do you analyze it to draw conclusions and find correlations in the data? This guide will teach you how to analyze your dataset even with zero prior knowledge about stats. Let's get started!
What is Survey Data Analysis?
Survey data analysis is the process of extracting meaning from the dataset you've gathered. Using survey analysis techniques, you can find correlations, patterns, trends and other insights that help businesses guide their decision-making process.
For example: You run an ecommerce store and conducted a survey asking your customers why they bought your product. You find out that most of your customers were on the fence about buying the product since there weren't any reviews online, but a Reddit post convinced them to buy it. You then invest more money into online reputation management & PR.
Why is Survey Data Analysis Important?
Surveys are important for 4 main things:
They are relatively cheap, easy to administer, and allow you to get a good sample size quickly.
In marketing/sales, surveys are a great tool for understanding audience personas, and conducting market research. They can also be used to monitor trends over time. Analyzing survey data can be closely related to analyzing marketing data.
Surveys are also a key component in social sciences to study human behaviour. For example, you might want to find out how happy people are living with their partner and how that differs from country to country.
Sometimes surveys can be used for academic or personal reasons. Example: You want to find out who people's favorite characters in Game of Thrones are, and whether this differs by demographic.
Surveys can provide really important insights to businesses that give them a market edge.
How to Analyze Survey Data
The easiest (and often the best) way to analyze survey data is by using univariate and bivariate analysis.
What is Univariate/Bivariate analysis?
Univariate analysis is the analysis of one variable. (One spreadsheet column).
Bivariate analysis is the analysis of two variables. (Two spreadsheet columns).
Let's take a look at an example:
To get a feel for your data, it's best to start off with a univariate analysis, by seeing how many males, females and other genders participated in the survey. To perform univariate analysis, bar charts are your best friend!
Univariate analysis provides a good starting point to your data and often is a good way to show what demographics took your survey. However, bivariate analysis is where all the interesting analysis happens.
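If you'd rather see what a univariate analysis boils down to in code, here's a minimal sketch. It assumes a made-up gender column (the answers are invented, not from a real survey) and simply counts how often each answer appears, which is exactly what a bar chart of one variable displays.

```python
from collections import Counter

# Hypothetical answers to "What is your gender?" (one spreadsheet column).
gender_column = ["Male", "Female", "Male", "Other", "Female", "Male"]

# Univariate analysis: count how often each answer appears.
gender_counts = Counter(gender_column)

# Print a crude text bar chart, most common answer first.
for answer, count in gender_counts.most_common():
    print(f"{answer:<8}{'#' * count} ({count})")
```

The same counting approach works for any single categorical column, such as age group or country.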
Analyzing survey data is a matter of cross-checking how different variables interact with each other, for instance:
"How does gender/ethnicity/education level influence salary?"
"How does household income affect education levels?"
"Is there any correlation between age and salary?"
These are all examples of bivariate analysis. The graphs you use will differ depending on the types of data you have.
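For a question like "Is there any correlation between age and salary?", both variables are numeric, so one common way to put a number on the relationship is the Pearson correlation coefficient. The sketch below is only an illustration: the ages and salaries are invented, and `pearson` is a helper name made up for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up ages and salaries, for illustration only.
ages = [22, 30, 41, 50, 58]
salaries = [30000, 42000, 55000, 61000, 70000]

r = pearson(ages, salaries)
print(f"correlation between age and salary: {r:.2f}")
```

A value close to 1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there's no clear linear relationship.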
How do I know which variables to cross check? Oftentimes, intuition is the best way, but there are some methods which I'll show you later.
Types of Survey Data:
The first step towards analyzing survey data is identifying the type of data you're dealing with.
There are 3 types of survey data: numerical, categorical and sentences.
Numerical:
Score: 50 points
Categorical:
USA, Canada, UK, Australia
Age ranges: 16-24, 25-34, 35-44
Never, rarely, sometimes, frequently
Sentences:
"I think the app can be improved if it loads faster."
"Customer service was rude."
"I loved the app!"
These can be as long as several paragraphs or even several documents.
Numeric answers are the easiest to analyze, followed by categorical answers. Long sentence answers are the most difficult.
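If you have lots of columns, a rough first guess at each column's type can be automated. The function below is just a heuristic sketch, not a standard rule: the name `classify_column` and the `max_categories` threshold are assumptions made for this example, and you should always sanity-check the result by eye.

```python
def classify_column(values, max_categories=10):
    """Guess whether a column holds numerical, categorical or sentence data.

    A rough heuristic: the max_categories threshold is an arbitrary assumption.
    """
    answers = [v.strip() for v in values if v.strip()]
    if not answers:
        return "empty"
    # Numerical: every non-empty answer parses as a number.
    try:
        for answer in answers:
            float(answer)
        return "numerical"
    except ValueError:
        pass
    # Categorical: a small set of answers that repeat across respondents.
    distinct = set(answers)
    if len(distinct) <= max_categories and len(distinct) < len(answers):
        return "categorical"
    # Everything else is treated as free-text sentences.
    return "sentences"

print(classify_column(["50", "42", "7.5"]))
print(classify_column(["USA", "Canada", "USA", "UK"]))
print(classify_column(["I loved the app!", "Customer service was rude."]))
```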
Types of Survey Questions
The most common types of survey questions are:
Short/long answer responses
Multiple choice & dropdowns
Classifying Each One (With Examples)
Multiple choice example: What is your gender?
Prefer Not to Say
Multiple choice questions provide you with categorical answers.
Linear scale example: On a scale of 1-5, how severely do you experience VR motion sickness?
It might seem confusing, but answers on a linear scale can be considered both categorical and numerical. This is because there are only 5 possible answers: 1, 2, 3, 4, 5 and therefore these can be classified into categories.
This type of data is known as ordinal data, meaning 3 is higher than 2, but the distance between them is unknown i.e. we can't say that 3 is 50% higher than 2.
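To see the "both categorical and numerical" idea in practice, here's a small sketch with invented 1-5 answers. It counts each point on the scale as a category, and it also takes the median, which only relies on the order of the answers and so sits comfortably with ordinal data.

```python
from collections import Counter
from statistics import median

# Hypothetical answers to a 1-5 linear scale question.
scale_answers = [1, 3, 3, 4, 5, 2, 3]

# Treated as categories: how many respondents picked each point on the scale?
answers_per_point = Counter(scale_answers)

# Treated as ordered values: the middle answer once everything is sorted.
middle_answer = median(scale_answers)

print(sorted(answers_per_point.items()))
print("median:", middle_answer)
```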
Ranking question example: Rank your most used VR locomotion styles:
Checkboxes example: What do you use VR for? Select all that applies:
Sometimes these questions may be in the form of "select up to 5." Checkboxes provide you with categorical answers, but the format is different to multiple choice questions. The output of your answers will look like this:
Analyzing these types of "list data" can be tricky, but we’ll show you a neat little trick that makes it extremely easy!
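If you ever need to handle this kind of list data by hand, one general approach (independent of any particular tool) is to split each answer on the commas and count the individual options. The responses below are invented, and the sketch assumes no option contains a comma itself.

```python
from collections import Counter

# Hypothetical checkbox answers as exported: one comma-separated string per respondent.
responses = [
    "Gaming, Fitness",
    "Gaming",
    "Social, Gaming, Movies",
    "",  # this respondent skipped the question
]

option_counts = Counter()
for response in responses:
    # Split on commas, trim whitespace and ignore empty pieces.
    options = [option.strip() for option in response.split(",") if option.strip()]
    option_counts.update(options)

for option, count in option_counts.most_common():
    print(option, count)
```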
Short answer responses example: What VR experience causes the worst motion sickness for you?
“Damn rollercoasters! I nearly vomited one time because of them. Also driving vehicles in VR.”
Short answer responses (and also longer answer responses) can fall under unstructured data. We’ll need to convert these to categorical answers.
Dates: What year were you born?
Dates are a weird one. They don’t fall into either categorical or numerical data. Just treat them as ‘dates.’
Geographic: What country are you from?
Geographic data is categorical, but can also be visualized differently from other data types (e.g. geographical heatmaps).
Converting "Sentence Answers" into "Categorical Variables"
Sentence answers are incredibly difficult to analyze. That's why we often convert them into categorical answers whenever possible. Here's how to do that:
First, go through each answer one by one and take note of any commonalities.
Next create a new column.
Lastly, label each sentence into a specific category.
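For larger surveys, the labelling step can be partly automated with simple keyword matching. The sketch below is one possible approach, not the only one: the categories and keywords in `CATEGORY_KEYWORDS` are hypothetical and would normally come from reading through your own answers first.

```python
# Hypothetical categories and trigger keywords, chosen after reading the answers.
CATEGORY_KEYWORDS = {
    "Rollercoasters": ["rollercoaster", "roller coaster"],
    "Vehicles": ["driving", "vehicle", "racing"],
    "Flying": ["flying", "flight", "plane"],
}

def label_answer(answer):
    """Return every category whose keyword appears in the answer, or ["Other"]."""
    text = answer.lower()
    labels = [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    return labels or ["Other"]

answer = ("Damn rollercoasters! I nearly vomited one time because of them. "
          "Also driving vehicles in VR.")
print(label_answer(answer))
```

Keyword matching is blunt (short keywords can match inside unrelated words), so it's worth spot-checking the labels and fixing any odd ones by hand.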
Decide Which Variables to Compare
How do you know which variables to compare? And which tools should you be using to analyze the data?
Start with an initial question or hypothesis you have about the data.
Cross check every variable against each other. Look at the column headers, and see which variables are worth comparing.
Use Polymer’s Auto-Explainer feature for deeper insights.
Analyzing survey data is a matter of cross-checking every variable against each other and seeing which ones make sense to analyze.
If your survey has 10 questions or fewer, it's fine to do this manually. For larger surveys, Polymer's Auto-Explainer feature can speed up the process greatly.
Survey Data Analysis Methods
Now you’ll have a good idea of which variables you’re trying to compare. Here are the techniques you can use for each of them:
Categorical variable vs numeric variable -> Use a bar chart
Numeric vs. numeric variable -> Use a scatterplot
Changes over time: 1 numeric variable vs dates -> Use a time series
See where all the volume is coming from -> Heatmaps
Summarize and aggregate/group data -> Pivot tables
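The pairings above can be written down as a tiny lookup function. This is only a sketch: `suggest_chart` is a made-up helper, the type labels are assumptions, and falling back to a pivot table for everything else (for example two categorical variables) is my own simplification.

```python
def suggest_chart(first_type, second_type):
    """Suggest a chart for a pair of variable types.

    Types are the strings "categorical", "numeric" or "date".
    """
    pair = {first_type, second_type}
    if pair == {"categorical", "numeric"}:
        return "bar chart"
    if pair == {"numeric"}:
        return "scatterplot"
    if pair == {"date", "numeric"}:
        return "time series"
    # Assumed fallback: summarize and group the data instead.
    return "pivot table"

print(suggest_chart("categorical", "numeric"))
print(suggest_chart("numeric", "numeric"))
print(suggest_chart("numeric", "date"))
```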
For this example, we’ll be analyzing VR Heaven’s survey on motion sickness which contains all the different types of questions you’ll see.
The questions asked:
Select Your Age Group (Multiple Choice)
What is your Gender? (Multiple Choice)
Do you experience motion sickness in VR? (Multiple Choice: Frequently, Sometimes, Rarely, Never)
Were you able to grow your VR legs? (VR legs = overcoming VR sickness by using more VR).
Do you experience motion sickness in cars? (Multiple Choice: Frequently, Sometimes, Rarely, Never)
Do you experience motion sickness in planes?
Do you experience motion sickness in boats?
Rank your most used locomotion styles
What do you use VR for? Select all that applies.
If you have experienced motion sickness in VR before, can you describe what was the worst experience for you? (optional)
Step One: Clean the Data
To clean your data, first address 'null' or 'missing' fields.
Sometimes the respondents don't answer some questions or other times there are data collection errors.
Due to this being an online survey, there were a few people who never used VR before i.e. they weren't qualified for the survey. Delete these rows entirely. Their answers don't matter.
Next, there'll be answers like 'N/A' or 'Not applicable' or just a dash (-). These values can be deleted and left blank, whilst keeping the rest of the row intact. Leaving them empty can make the analysis step simpler.
To delete these, press CTRL + F -> Replace -> Find 'Not Applicable' and leave the replace field empty.
There are other methods of dealing with missing data, and it entirely depends on the situation. Sometimes you delete the entire row, sometimes you leave it blank and other times it's appropriate to get an estimate for that value.
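The two cleaning rules used here (drop unqualified respondents entirely, blank out "N/A"-style answers) can also be scripted. The sketch below uses invented rows, and the `used_vr` column is a hypothetical name for whatever question tells you whether someone qualifies.

```python
# Answers that should be treated as missing (compared in lowercase).
MISSING_MARKERS = {"n/a", "not applicable", "-"}

# Hypothetical survey rows; "used_vr" is an assumed qualifying column.
rows = [
    {"used_vr": "Yes", "vr_sickness": "Sometimes", "worst_experience": "N/A"},
    {"used_vr": "No", "vr_sickness": "", "worst_experience": "-"},
    {"used_vr": "Yes", "vr_sickness": "Never", "worst_experience": "Rollercoasters"},
]

def clean(rows):
    """Drop unqualified respondents and blank out placeholder answers."""
    cleaned = []
    for row in rows:
        if row["used_vr"] != "Yes":
            continue  # not qualified for the survey: delete the whole row
        cleaned.append({
            column: "" if value.strip().lower() in MISSING_MARKERS else value.strip()
            for column, value in row.items()
        })
    return cleaned

cleaned_rows = clean(rows)
print(cleaned_rows)
```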
Step Two: Convert Qualitative Data into Quantitative (if Possible)
Quantifying qualitative data will make the analysis step tenfold easier!
This particular survey was designed around the MSSQ-S, a series of questions that researchers use to determine a motion sickness susceptibility score. Following the research, we can transform the qualitative measurements into quantitative ones:
Never = 0 points
Rarely = 1 point
Sometimes = 2 points
Frequently = 3 points
This allows us to calculate the VR sickness score and get an average, something we were unable to do before.
Using Excel, we can easily edit this data by pressing CTRL + F -> 'Replace' tab. Find all instances of 'Never' and replace it with the number 0. Find 'Rarely' and replace it with the number 1. Do the same for 'sometimes' and 'frequently.'
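If you're not working in Excel, the same find-and-replace can be done with a small lookup table. This sketch uses the point values listed above; the sample answers are invented, and `to_points` is just a helper name for this example.

```python
from statistics import mean

# Point values from the mapping above, keyed by lowercase answer.
FREQUENCY_POINTS = {"never": 0, "rarely": 1, "sometimes": 2, "frequently": 3}

def to_points(answer):
    """Convert a frequency answer to points; blank or unknown answers give None."""
    return FREQUENCY_POINTS.get(answer.strip().lower())

# Hypothetical answers, including one blank.
vr_answers = ["Never", "Sometimes", "Frequently", "", "Rarely"]
vr_points = [to_points(answer) for answer in vr_answers]

# Average over the respondents who actually answered.
average_points = mean([points for points in vr_points if points is not None])
print(vr_points, average_points)
```

Lowercasing the answers first means 'Sometimes' and 'sometimes' are treated the same, which a plain find-and-replace may or may not do depending on its settings.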
The current column names are too long. This was because it was imported from Google Forms, which uses the survey question as the column name. Longer column names are harder to analyze, so we make them short and sweet:
"Do you experience motion sickness in VR?" -> VR sickness
Do the same for the rest of the columns.
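Renaming can be scripted too. In the sketch below, only "VR sickness" comes from this guide; the other short names are assumptions picked for illustration, and any header not in the table is left unchanged.

```python
# Long Google Forms style headers mapped to short names.
# Only "VR sickness" is from the article; the other short names are assumed.
SHORT_NAMES = {
    "Do you experience motion sickness in VR?": "VR sickness",
    "Do you experience motion sickness in cars?": "Car sickness",
    "Do you experience motion sickness in planes?": "Plane sickness",
    "Do you experience motion sickness in boats?": "Boat sickness",
}

def rename_columns(row):
    """Return a copy of the row with long headers swapped for short ones."""
    return {SHORT_NAMES.get(header, header): value for header, value in row.items()}

row = {
    "Do you experience motion sickness in VR?": "Sometimes",
    "What is your Gender?": "Female",
}
short_row = rename_columns(row)
print(short_row)
```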
Step Three: Add More Columns
Adding more columns = adding more dimensions for analysis.
Whilst it's useful to have scores for motion sickness in vehicles (cars, boats, planes), it'll be more useful to have a score that provides an overall motion sickness value.
We followed the steps in the MSSQ-S and created a new column called 'susceptibility score' using some basic Excel formulas which you can find online:
This score tells us the person's overall susceptibility to motion sickness in vehicles and will be a crucial component to analyzing this dataset.
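To show the general idea of a derived column in code, here's a deliberately simplified sketch. It is NOT the MSSQ-S formula: it just adds up the car, plane and boat points as a stand-in, and the column names and values are hypothetical.

```python
def susceptibility_score(car, plane, boat):
    """Stand-in overall score: the sum of the three vehicle scores (0 to 9).

    This is an illustrative placeholder, not the official MSSQ-S calculation.
    """
    return car + plane + boat

# Hypothetical respondents with points already converted to the 0-3 scale.
respondents = [
    {"Car sickness": 0, "Plane sickness": 1, "Boat sickness": 2},
    {"Car sickness": 3, "Plane sickness": 3, "Boat sickness": 3},
]

# Add the new column to every row.
for person in respondents:
    person["Susceptibility score"] = susceptibility_score(
        person["Car sickness"], person["Plane sickness"], person["Boat sickness"]
    )

print(respondents)
```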
Here's another example of how Alex Almedia creates more dimensions in his dataset to find top converting Facebook ads.
Step Four: Start with Broad Questions
A question like "how many people experience VR motion sickness?" is a good starting point. Pie charts are great for yes/no answers:
It's also a good idea to get to know who your demographics are - so find out the age and gender of the people who took your survey.
Head over to the 'visualize' tab in Polymer and input 'gender' and 'age group' into the y-axis (which is reserved for categorical variables):
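For a broad question like "how many people experience VR motion sickness?", the number behind the pie chart is just a share of respondents. A small sketch with invented answers, assuming that anything other than "Never" counts as a "yes":

```python
# Hypothetical answers to "Do you experience motion sickness in VR?"
vr_sickness = ["Frequently", "Never", "Sometimes", "Rarely", "Never"]

# Assumption: any answer other than "Never" means the person has experienced it.
experienced = sum(1 for answer in vr_sickness if answer != "Never")
share = experienced / len(vr_sickness)

print(f"{experienced} of {len(vr_sickness)} respondents ({share:.0%}) experienced VR sickness")
```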
Step Five: Cross-check Variables
Cross-check every variable against each other, using some logic to see what makes sense.
Start by cross-checking categorical variables against numeric variables.
We’ve identified that the categorical variables are:
Age group
Gender
VR sickness score
Whilst the numerical variables are:
Vehicle susceptibility score
VR sickness score (both numerical and categorical)
So cross-checking these, we find these are the useful ones:
Age vs. Vehicle susceptibility score (i.e. do older people experience more real life vehicle sickness?)
Gender vs. Vehicle susceptibility score (i.e. does gender play a role in how likely someone experiences vehicle sickness?)
Age vs. VR sickness score (does age play a factor in how likely someone experiences VR sickness?)
Gender vs. VR sickness score (does gender play a factor?)
VR sickness score vs. vehicle susceptibility score (are you more likely to experience VR sickness if you experience vehicle sickness?)
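The list of comparisons above is every categorical variable paired with every numerical variable, minus a variable paired with itself. A quick sketch of how to generate those pairs (the variable names follow this example survey):

```python
from itertools import product

categorical = ["Age group", "Gender", "VR sickness score"]
numerical = ["Vehicle susceptibility score", "VR sickness score"]

# Pair each categorical variable with each numerical one, skipping self-pairs.
pairs = [(c, n) for c, n in product(categorical, numerical) if c != n]

for c, n in pairs:
    print(f"{c} vs. {n}")
```

This only lists the candidates. Deciding which pairs are actually worth charting still takes some logic and intuition.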
Step Six: Visualize
Bar charts are your best friend for categorical vs numerical variables.
Again, head over to the ‘visualize’ section in Polymer and let’s set up a bar chart for age vs. VR sickness.
Put “Age” into the Y-Axis (reserved for categorical variables)
Put “VR sickness score” into the X-Axis (for numerical variables)
Choose calculation “Average”
It’s immediately apparent that age is a big influencer in how often someone experiences VR sickness.
Now let’s do the same for gender:
Again, there’s a big discrepancy between males and females whilst “other” remains in between.
If you only care about seeing male and female, you can filter out results by using the left sidebar.
You can filter in tags by clicking on them
You can also filter out tags by clicking the minus symbol at the top right of the tag
Conclusion: Men experience less VR sickness than women.
Overall: Bar charts are your best friend when it comes to survey analysis! Use them well!
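Under the hood, a bar chart with the "Average" calculation is a group-by: collect the numeric scores for each category, then average them. Here's a minimal sketch using invented rows (these are not the survey's real results), with `average_by_category` as a made-up helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (age group, VR sickness score) rows, for illustration only.
rows = [
    ("16-24", 1),
    ("16-24", 0),
    ("25-34", 2),
    ("35-44", 3),
    ("35-44", 2),
]

def average_by_category(rows):
    """Group numeric scores by category and return the average per category."""
    groups = defaultdict(list)
    for category, score in rows:
        groups[category].append(score)
    return {category: mean(scores) for category, scores in groups.items()}

averages = average_by_category(rows)
for category, average in averages.items():
    print(category, average)
```

Swapping the age group for gender in each row gives the second chart's numbers in the same way.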
Analyzing List Data "Select up to 5"
Let's say you asked the question: "What do you use VR for? Select all that applies:"
And you want to compare these answers to "age group" and "gender" to see whether different demographics have different uses for VR.
Analyzing this can be a real pain in Excel and can be tricky even to professional data analysts, but Polymer Search makes analyzing this ultra simple. Polymer automatically recognizes that the answers are separated by commas, so all you have to do is put this data into the 'visualize' section (just like we did before) and you'll have your answer!
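If you're curious what that comparison involves, or need to do it without a dedicated tool, here's a rough sketch: split each comma-separated answer and keep a separate tally per demographic group. The rows are invented, and this is a generic approach rather than a description of how Polymer works internally.

```python
from collections import Counter, defaultdict

# Hypothetical (gender, "What do you use VR for?") rows.
rows = [
    ("Male", "Gaming, Fitness"),
    ("Female", "Gaming, Social"),
    ("Male", "Gaming"),
    ("Female", "Fitness"),
]

# One counter of VR uses per gender.
uses_by_gender = defaultdict(Counter)
for gender, uses in rows:
    for use in uses.split(","):
        use = use.strip()
        if use:
            uses_by_gender[gender][use] += 1

for gender, counts in uses_by_gender.items():
    print(gender, dict(counts))
```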