How to Prepare Your Thesis Data: Step-by-Step Guide for SPSS, Excel, and Jamovi

Data preparation is where most thesis students lose two to three days - or worse, discover critical errors during their defense. Preparing your data correctly means structuring rows and columns properly, coding variables with the right measurement level, recoding reversed Likert items, computing composite scores, and checking for outliers before you run a single test. This guide covers all five steps with SPSS, Excel, and Jamovi instructions.

Key takeaways

  • Tidy data structure: one row per participant, one column per variable. Any other layout must be reshaped before analysis.
  • Variable coding: set single Likert items to Ordinal in SPSS Variable View. Leaving them as Scale is a common coding error in student theses.
  • Reversed items: always recode negatively worded items before computing composite scores (formula: 6 − original score for a 1–5 scale; in general, scale minimum + scale maximum − original score).
  • Cronbach's alpha ≥ .70 is the conventional threshold for treating a composite score as a reliable scale - run this check before any hypothesis test.
  • Outlier threshold: |z| > 3.29 flags extreme outliers (two-tailed p < .001) - document your decision to keep or remove them.

Structure Your Dataset: One Row Per Participant, One Column Per Variable

Every row in your dataset must represent one observation - one participant, one case, one event. Every column must represent one variable. This is called 'tidy data' structure, and it is the layout that SPSS, Jamovi, and most other analysis tools expect.

If your data is in any other format (e.g., one row per question, or multiple measurements per row), you must reshape it before analysis.

| Correct Structure | What It Means | Example |
|---|---|---|
| One row = one participant | Each survey response is a single row | Row 1 = Participant 1's answers to all questions |
| One column = one variable | Each variable has its own column | Column A = Age, Column B = Gender, Column C = Q1 |
| No merged cells | Every cell has exactly one value | No 'spanning' headers across multiple columns |
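
If your export has one row per answer instead of one row per participant, the reshaping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `participant`, `question`, and `answer` and the helper `long_to_wide` are hypothetical, and SPSS can do the same job through Data → Restructure.

```python
from collections import defaultdict

# Hypothetical long-format export: one row per (participant, question) pair.
long_rows = [
    {"participant": 1, "question": "Q1", "answer": 4},
    {"participant": 1, "question": "Q2", "answer": 2},
    {"participant": 2, "question": "Q1", "answer": 5},
    {"participant": 2, "question": "Q2", "answer": 3},
]


def long_to_wide(rows, id_key="participant", var_key="question", value_key="answer"):
    """Pivot long rows into one dict per participant, one key per variable."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[id_key]][row[var_key]] = row[value_key]
    # Sort by participant ID so the output order is stable.
    return [{id_key: pid, **values} for pid, values in sorted(wide.items())]


wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each printed dict corresponds to one row of the tidy dataset, with the participant ID followed by one entry per question.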

Code Variables Correctly in SPSS and Excel

In SPSS Variable View, set the correct measurement level for every variable before running any analysis. A wrong level can lead to inappropriate charts and test choices, because some SPSS dialogs filter or treat variables according to their measurement level.

| Variable Type | SPSS Measurement Level | Example Variables |
|---|---|---|
| Categorical (no order) | Nominal | Gender (1=male, 2=female), Study programme |
| Ranked / Likert items | Ordinal | Satisfaction 1–5, Agreement 1–7 |
| Continuous / metric | Scale | Age, weight, exam score, composite scale score |
⚠️ Setting a single Likert item to 'Scale' in SPSS Variable View is a common coding error in student theses. SPSS will treat the item as metric and produce parametric test results without warning you.

Recode Reversed Likert Items Before Computing Composites

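Surveys often include reverse-worded items to detect careless responses. If your scale runs 1–5 and an item is negatively worded, agreeing with it (a high score) indicates a low level of the construct - the opposite of the other items. These items must be recoded before computing composite scores: subtract each score from the scale minimum plus the scale maximum (6 for a 1–5 scale, 8 for a 1–7 scale).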

| Original Score | Recoded Score (1–5 scale) | Formula |
|---|---|---|
| 1 (Strongly disagree) | 5 (Strongly agree) | 6 − 1 = 5 |
| 2 (Disagree) | 4 (Agree) | 6 − 2 = 4 |
| 3 (Neutral) | 3 (Neutral) | 6 − 3 = 3 |
| 4 (Agree) | 2 (Disagree) | 6 − 4 = 2 |
| 5 (Strongly agree) | 1 (Strongly disagree) | 6 − 5 = 1 |
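
The recoding rule in the table generalises to any scale range. The sketch below is one way to express it in Python; the function name `reverse_score` and its parameters are illustrative choices, and passing missing answers through as `None` is an assumption about how your data marks blanks.

```python
def reverse_score(score, scale_min=1, scale_max=5):
    """Reverse a Likert score: new = scale_min + scale_max - old."""
    if score is None:
        # Assumption: missing answers are stored as None and stay missing.
        return None
    if not scale_min <= score <= scale_max:
        raise ValueError(f"score {score} is outside {scale_min}-{scale_max}")
    return scale_min + scale_max - score


# Reproduces the 1-5 table above: 1 -> 5, 2 -> 4, 3 -> 3, 4 -> 2, 5 -> 1.
print([reverse_score(s) for s in range(1, 6)])

# The same rule on a 1-7 scale uses 8 - original score.
print(reverse_score(2, scale_min=1, scale_max=7))
```

The range check is deliberate: an out-of-range value usually means a typo or an undeclared missing code, and silently reversing it would hide the problem.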

Compute Composite Scores and Check Cronbach's Alpha

Most questionnaire-based theses analyse constructs (motivation, anxiety, satisfaction) measured across multiple items. Before analysing the construct, compute the mean or sum across all items.

  • In SPSS: Transform → Compute Variable → MEAN(item1, item2, item3)
  • In Excel: =AVERAGE(C2:G2) for the participant in row 2
  • In Jamovi: Variables → Computed Variable

Before using the composite in any hypothesis test, run Cronbach's alpha. In SPSS: Analyze → Scale → Reliability Analysis.

Target: α ≥ .70. Below this threshold, the items may not measure the same construct consistently.
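
To see what the reliability check computes, here is a minimal Python sketch of composite means and Cronbach's alpha using the standard formula α = k / (k − 1) × (1 − Σ item variances / variance of total scores). The data are made up for illustration, the function names are hypothetical, and the sketch assumes complete responses with reversed items already recoded. Use your statistics package for the value you report.

```python
from statistics import mean, variance

# Hypothetical responses: one tuple per participant, one value per item.
responses = [
    (2, 3, 3),
    (4, 4, 5),
    (3, 3, 4),
    (5, 4, 5),
    (1, 2, 2),
]


def composite_means(rows):
    """Mean across items for each participant (the composite score)."""
    return [mean(row) for row in rows]


def cronbach_alpha(rows):
    """Cronbach's alpha from sample variances (n - 1 in the denominator)."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two participants")
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


alpha = cronbach_alpha(responses)
print(composite_means(responses))
print(round(alpha, 3))  # about 0.956 for this toy data
```

If every participant has the same total score, the total variance is zero and the formula is undefined, so the function would raise a `ZeroDivisionError` in that case.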

Check for Outliers Before Running Statistical Tests

Extreme outliers distort means, inflate standard deviations, and destabilise regression coefficients. Check before running any test.

| Method | Threshold | How to Run in SPSS | Decision |
|---|---|---|---|
| Z-score | z beyond ±3.29 (two-tailed p < .001) | Analyze → Descriptive Statistics → Descriptives → Save standardized values as variables | Flag, inspect, document decision |
| Boxplot whiskers | Points beyond whiskers | Graphs → Chart Builder → Boxplot | Same as above |
| Mahalanobis distance | p < .001 (chi-square, df = number of predictors) | Analyze → Regression → Linear → Save → Mahalanobis | Multivariate outlier in regression |
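
The z-score rule from the table can be sketched in Python as follows. The scores are invented, `flag_outliers` is a hypothetical helper, and the sample standard deviation (n − 1) is used on the assumption that it matches the standardised values your statistics package saves.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.29):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1)
    if spread == 0:
        return []  # all values identical, so nothing can be an outlier
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs((value - centre) / spread) > threshold
    ]


# Hypothetical exam scores: 19 typical values and one extreme value.
scores = [48, 49, 50, 51, 52] * 3 + [49, 50, 51, 50, 95]

flagged = flag_outliers(scores)
print(flagged)  # [(19, 95)]
```

Flagging is only the first step: inspect each flagged case and record whether you kept or removed it. Note that with 12 or fewer cases no value can reach |z| = 3.29 when the sample standard deviation is used, so the boxplot is the more useful check in very small samples.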

Software Comparison: SPSS vs. Excel vs. Jamovi for Data Preparation

Choose the tool that fits your analysis and institution.

| Task | SPSS | Excel | Jamovi |
|---|---|---|---|
| Variable coding / labels | Variable View (best) | Manual codebook tab | Variable editor |
| Recode reversed items | Transform → Recode | Formula: =6-C2 | Data → Compute |
| Compute composite | Transform → Compute | =AVERAGE(C2:G2) | Variables → Computed |
| Cronbach's alpha | Scale → Reliability | Not available natively | Reliability module |
| Outlier detection | Descriptives + Boxplot | Conditional formatting on z-scores | Descriptives + Boxplot |

Frequently asked questions

Can I use Excel to prepare my data before importing into SPSS?

Yes. Clean and structure your data in Excel first, then import the .xlsx file directly into SPSS. Make sure column headers are variable names (short, no spaces or special characters), and numeric codes are used for categorical variables rather than text labels.
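
If you have many columns, renaming the headers by hand gets tedious. The sketch below shows one possible clean-up rule in Python. The function `clean_variable_name` and the `v_` prefix are illustrative assumptions, and the rule is a simplification: it does not check SPSS reserved words, name length, or duplicate names.

```python
import re


def clean_variable_name(header, fallback="var"):
    """Turn a free-text column header into a short name without spaces."""
    # Replace every run of characters other than letters, digits, or
    # underscores with a single underscore, then trim underscores at the ends.
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", header.strip()).strip("_")
    if not cleaned:
        return fallback
    if not cleaned[0].isalpha():
        # Assumption: names should start with a letter, so add a prefix.
        cleaned = "v_" + cleaned
    return cleaned


headers = ["Age (years)", "1st choice", "  Q1 ", "Gender"]
print([clean_variable_name(h) for h in headers])
# ['Age_years', 'v_1st_choice', 'Q1', 'Gender']
```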

What is a codebook and do I need one for my thesis?

A codebook documents what each variable represents, how it was measured, and what the numeric codes mean. You need one - not necessarily as a formal document, but for your own reference when writing the methods section and for your supervisor if they review your data.
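
A codebook does not need special software. As an illustration only, it could be kept as a small Python dictionary like the one below. The variable names, labels, and missing codes are hypothetical, and a table in a spreadsheet tab serves the same purpose.

```python
# Hypothetical codebook: one entry per variable in the dataset.
codebook = {
    "gender": {
        "label": "Gender",
        "level": "nominal",
        "values": {1: "male", 2: "female"},
        "missing": [99],
    },
    "q1": {
        "label": "Satisfaction item 1",
        "level": "ordinal",
        "values": {
            1: "Strongly disagree",
            2: "Disagree",
            3: "Neutral",
            4: "Agree",
            5: "Strongly agree",
        },
        "missing": [99],
    },
    "age": {
        "label": "Age in years",
        "level": "scale",
        "values": {},  # metric variable, so no value labels
        "missing": [-9],
    },
}


def describe(variable, code):
    """Translate a stored numeric code into its documented meaning."""
    entry = codebook[variable]
    if code in entry["missing"]:
        return "missing"
    return entry["values"].get(code, str(code))


print(describe("gender", 2))  # female
print(describe("q1", 99))     # missing
```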

What is Cronbach's alpha and what value is acceptable?

Cronbach's alpha measures the internal consistency of a multi-item scale - how well the items measure the same construct. A value of .70 or above is generally considered acceptable for thesis research. Values below .60 indicate the items may not all belong to the same scale.

What is the correct measurement level for a Likert scale variable in SPSS?

Single Likert items (e.g., one 5-point satisfaction question) should be set to Ordinal in SPSS Variable View. Composite scores - the mean or sum of multiple Likert items measuring the same construct - are often treated as Scale (metric) when Cronbach's alpha ≥ .70. Setting the wrong level is one of the most commonly flagged thesis statistics mistakes.

How do I handle missing data when preparing my thesis dataset?

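First, quantify how much data is missing per variable. Under 5% missing: listwise deletion is generally acceptable. Over 5%: assess whether the data are missing completely at random (MCAR) or whether the missingness is systematic (missing at random, MAR, or missing not at random, MNAR), and document your strategy in the methods section. In SPSS, declare your missing value code (e.g., 99 or -9) in the Missing column of Variable View so the software excludes those values from calculations automatically. Choose a code that cannot occur as a real value of the variable.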
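
The first step, quantifying missingness per variable, can be sketched in Python as below. The data, the helper `missing_share`, and the choice of 99 and -9 as missing codes are assumptions for illustration. The 5% cut-off is the rule of thumb from the answer above.

```python
# Assumption: 99 and -9 are the declared missing codes and blanks are None.
MISSING_CODES = {99, -9}

# Hypothetical dataset: one dict per participant.
rows = [
    {"age": 23, "q1": 4, "q2": 99},
    {"age": 31, "q1": -9, "q2": 2},
    {"age": 27, "q1": 3, "q2": 99},
    {"age": 25, "q1": 5, "q2": 1},
]


def missing_share(data, missing_codes=MISSING_CODES):
    """Return the proportion of missing values for each variable."""
    n = len(data)
    shares = {}
    for variable in data[0]:
        missing = sum(
            1
            for row in data
            if row.get(variable) is None or row.get(variable) in missing_codes
        )
        shares[variable] = missing / n
    return shares


shares = missing_share(rows)
for variable, share in shares.items():
    note = "check the missingness pattern" if share > 0.05 else "ok"
    print(f"{variable}: {share:.0%} missing ({note})")
```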

Statoria Team

Statistics educators & software developers

We build Statoria to help bachelor and master students get through their thesis data analysis without stress. Our guides are written by researchers with experience in social science statistics and student supervision.
