Unit Author: Professor Nick J Fox
Having successfully completed this chapter, you will be able to:
- differentiate between variables, values and cases;
- identify the correct measurement scale of your data;
- code your data for entry into a database;
- select an appropriate statistic to analyse your data.
Social scientists frequently report results in a form that quantifies their findings. Quantitative data analysis provides ways of representing study findings numerically, ranging from simple counts of events through to complex statistical interpretations of results. This unit looks at how to prepare data for analysis, and how to report it.
This course does not cover statistical analysis, but the principles of quantitative data analysis are central to the research process, and all those embarking on research should be familiar with these principles. Statistics and statistical computing are useful only within the context of the sound methodological approach to data analysis that this chapter will provide.
1. Why use quantitative data analysis methods?
Quantitative data analysis is about the use of numerical symbols (including graphical representations of numbers). It is worth thinking briefly about the reasons for reporting data in ways that provide such numerical summaries.
Firstly, there is always a trade-off in quantitative analysis between detail of data and its manageability. Numbers provide a means of reducing data (the range of ages of subjects in a study is simply reported as a mean value and a range), or of comparing things that are dissimilar from each other (rates of marriage against rates of co-habitation), or reporting proportions (percentages of vegetarians and meat-eaters in a population). We have lost some detail by this reduction, but what we have gained is a way of summarising the data that is easily comprehended or absorbed. These simple counts are examples of descriptive analysis, and provide a summary of study findings without any attempt to interpret or analyse.
Secondly, numbers can be submitted to logical operations (addition, multiplication, squares etc.), so they can be manipulated to draw conclusions. At the simplest level, one figure can be said to be larger than another. More complicated manipulations in the form of statistical analysis can be used analytically to discern trends or to test hypotheses. Such quantitative data analyses make claims about a study’s findings and seek to draw conclusions about their relevance to the natural or social world. Thus an analysis of food consumption might conclude that eating 5-a-day of fruit and vegetables is associated with educational background.
So the objective of any quantitative data analysis is to turn observations into some kind of numbers, which can then be summarised and perhaps analysed statistically. And a study which wishes to use such kinds of analysis must have a clear idea – before any data is actually collected – precisely how such numerical summaries or statistical tests are to be undertaken. As far as quantitative analysis goes, data that cannot be turned into meaningful numbers (and so cannot be analysed) is worthless. Of course, some data cannot be quantified, or quantification would result in the loss of crucial elements of what is to be studied. In such cases, a qualitative approach may be appropriate.
We will now consider the basics of how we turn our findings into numbers that can be analysed quantitatively.
2. Turning data into numbers: cases, variables and values
Before considering how to start quantitative analysis, three definitions are useful: cases, variables and values. The easiest way to understand these three aspects of data is by looking at a typical data array (or ‘dataset’), as would be produced by a statistics programme such as SPSS (Statistical Package for the Social Sciences). Look at Table 7.1, which shows data from a study of meat consumption among adults.
Table 7.1: Raw data on meat consumption
In Table 7.1, the cases are numbered 1–10, down the left-hand column of the array.
In this dataset, the cases are the subjects whose data has been collected and that we wish to analyse, and in this study the first ten cases are shown (more would be visible if we looked further down the array). In this example, the cases are individuals, but they could be anything that is our unit of analysis (for instance, a person, a family or even an entire town or city).
In Table 7.1, we can see that there are five data columns, with codes at the top. These are the variables: the characteristics of the subjects that we wish to measure. Another name for the variables would be indicators. Any measurement (for example, a question on a questionnaire) will need to have a variable (and label) in the array.
Each variable has a code at its top and we need to know what this code means. Some of these variables have been given labels, for example:
meat = grams of meat eaten per week
gend = gender of subject (1 = male; 2 = female)
age = age of subject in years
educ = years of education post 16
smoke = cigarette smoking (1 = yes; 2 = no)
The values in the table refer to the actual measurements we have made for each case on the five variables. In Table 7.1, these are the figures in the boxes. Each box represents the value of a measurement of one variable for one case.
In this dataset, the value of an individual measurement of the meat eaten per week for subject #5 can be read off by looking down the meat column to row 5. In this case the value is 453.
Note that values are numbers, but refer to specific units of measurement. Thus meat eaten will be measured in grams per week, while education post-16 will be in years. The data array does not show the units in which the measurements are made, but these are essential if the data is to have any meaning.
Also note that this array is a simple form of descriptive report. It would be very unlikely that a large array would be reported in a paper, as it would be totally useless as a manageable summary. However we can see the first step in data reduction in this representation of findings as an array of numbers, which can be subjected to manipulation.
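As a rough illustration of how cases, variables and values relate, the array can be sketched in Python as a set of cases, each holding one value per variable. Apart from the value of 453 for the meat variable on subject #5, which is quoted above, every figure here is made up for illustration and is not taken from Table 7.1. The helper names value_of and column are also hypothetical.

```python
# Variable codes from the text. Units (grams per week, years) are kept
# in the documentation, not in the array itself.
VARIABLES = ["meat", "gend", "age", "educ", "smoke"]

# Each case (row) maps a variable code to a value.
# Hypothetical values, except meat = 453 for case 5, which the text quotes.
dataset = {
    4: {"meat": 310, "gend": 2, "age": 41, "educ": 2, "smoke": 2},
    5: {"meat": 453, "gend": 1, "age": 29, "educ": 5, "smoke": 1},
    6: {"meat": 0, "gend": 2, "age": 35, "educ": 3, "smoke": 2},
}


def value_of(data, case, variable):
    """Return the value recorded for one variable on one case."""
    return data[case][variable]


def column(data, variable):
    """Return all values of one variable, ordered by case number."""
    return [data[case][variable] for case in sorted(data)]


meat_case_5 = value_of(dataset, 5, "meat")
gender_column = column(dataset, "gend")
```

Reading a value means picking a case (row) and a variable (column), exactly as when reading down the meat column to row 5.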
SAQ 7.1 Tables
Are the following underlined terms cases, variables or values?
3. Coding data for analysis
An array (or database) based on cross-referencing cases and variables in terms of values, such as that shown in Table 7.1, is a pre-requisite for quantitative analysis, and all the techniques covered in this chapter require such an array.
But we have glossed over one essential step before this array or database was constructed. Before it can be summarised, all data must be coded so that it can be represented numerically. This may seem obvious, but it is worth reflecting upon in some detail.
The simplest coding of data is based upon readings from an instrument: for example a thermometer. Readings may be taken from the instrument, and entered directly because they are pre-coded numerically, for instance from zero to 100 degrees Celsius. Few social research variables provide such an opportunity, though in the study reported in Table 7.1, the amount of meat eaten will have been weighed on a scale prior to consumption. Age is also a measure with a numeric value already attached to it.
In Table 7.1, gender has been coded according to an arbitrary basis so that male = 1 and female = 2. In each case, the transformation of the observation into a numerical value is an example of coding.
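The idea of a coding frame can be sketched as a simple lookup. The codes below are the ones given for gend and smoke earlier in this unit; the function name code_response and the sample answers are assumptions made for illustration.

```python
# Coding frame for two nominal variables, using the codes given in the text.
CODING_FRAME = {
    "gend": {"male": 1, "female": 2},
    "smoke": {"yes": 1, "no": 2},
}


def code_response(variable, response):
    """Translate a raw answer into its numeric code.

    Raises KeyError if the answer is not in the coding frame, so that
    unexpected responses are noticed rather than silently mis-coded.
    """
    return CODING_FRAME[variable][response.strip().lower()]


# Hypothetical raw answers as they might be written on a questionnaire.
coded = [code_response("gend", answer) for answer in ["Male", "female", " Female "]]
```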
Other data must be submitted to a more complex coding procedure to turn observation into numerical value. Look at the following questionnaire (Figure 7.1), which might be used to investigate meat eating.
Figure 7.1: Meat consumption questionnaire
|1. How old are you?|
|2. Are you male or female?|
|3. What is your occupation?|
|4. What is your annual income?|
|5000 to 20,000|
|20,001 to 40,000|
|5. Which of the following describes your highest educational achievement?|
|No qualification post-16|
|PhD or higher|
|6. Are you a smoker?|
|7. Mark below your enjoyment of eating meat.|
|Equal to other food|
These questions need to be coded so that values can be entered into the data array. Figure 7.2 shows the same questionnaire with a coding frame. The relevant code can be circled by the interviewer.
Figure 7.2: Meat consumption questionnaire with coding frame
Question 1 remains open-ended, with the proviso that we can convert into bands later if we wish.
Questions 2, 4 and 6 have been allocated numerical codes.
Question 3 is currently open-ended but once the data has been collected it will need to be coded using the SES scale, to transform it into socio-economic groupings.
Question 5 has been coded to calculate from the answer the numbers of years of education undertaken post-16.
Question 7 could be coded in a number of ways. You could visually assess the position of the respondent’s mark on the line but this could be difficult and would be subjective. It would be preferable to measure the line accurately.
3.1 Recoding data
Sometimes, having coded data initially, we may wish to re-code it later on in data analysis. For instance, in our questionnaire Question 1 asks people to state their age and provides a box to write their actual age in. It is useful to have data at this level of detail, because the type of statistical analysis that can be undertaken is more sophisticated. On the other hand, it can be more difficult to analyse the data when it is at this level of detail. So, in this case, we might wish to set up a new variable (called something like ‘ageband’) that re-codes the age variable data into age bands, as follows:
|under 20 years||1|
|20 – 29 years||2|
|30 – 49 years||3|
|50 – 59 years||4|
Software such as SPSS can automatically do this kind of re-coding very easily. Note that such banding makes data less detailed, but may make it easier to manipulate. However, we need to retain information about what the new codes mean: clearly, in the coding of age bands, a code of 2 does not have an obvious meaning unless one is aware of what the original data were!
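For readers not using SPSS, the same re-coding can be sketched in a few lines of Python. The band boundaries follow the list above; the function name age_band is hypothetical, and the treatment of ages of 60 and over (which the bands shown do not cover) is an assumption.

```python
def age_band(age):
    """Recode an age in years into the band codes listed above.

    Ages of 60 and over are not covered by the bands shown, so they are
    returned as None here (an assumption, not part of the original frame).
    """
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 20:
        return 1
    if age < 30:
        return 2
    if age < 50:
        return 3
    if age < 60:
        return 4
    return None


# Keep the meaning of each code alongside the recoded data.
AGEBAND_LABELS = {
    1: "under 20 years",
    2: "20-29 years",
    3: "30-49 years",
    4: "50-59 years",
}

ages = [18, 24, 37, 59]  # hypothetical ages
agebands = [age_band(a) for a in ages]
```

Storing the labels next to the codes is one way of retaining the information about what each new code means.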
4. Nominal, ordinal and interval scales of measurement
Although it is possible to code all data so that it can be represented numerically, the meanings of these numbers differ. There are three different kinds of numerical data that we encounter in quantitative research, each of which permits differing kinds of analysis. We can see this if we refer back to the example of the meat consumption questionnaire used earlier.
4.1 Nominal data
Nominal data is composed of a set of unordered categories. The allocation of numbers to a nominal scale is on an arbitrary basis, since the nominal codes have no numerical meaning. One example is the coding of the sex of respondents in a study (Male = 1, Female = 2). Another example might be the occupation of respondents in the study above, in which a coding frame is constructed so that various occupations are coded 1, 2, 3 and so on.
In all these examples, the numbers simply provide symbols for the categories. In no sense can a doctor (professional: code = 1) be considered half of a bank manager (manager: code = 2), nor could a female be assumed to be twice a male (at least, not in the sense used in this categorisation exercise!). Nominal data cannot be added or subtracted or submitted to other logical operations. You may also see nominal data referred to as categorical data.
Usually, nominal categories are used to provide descriptive data. So, for example, we could identify the ratio or the proportion of women in a sample. (We will look at such frequency charts in the section on descriptive analysis.)
Another type of descriptive statistic that can be used to describe nominal data is the mode. This is a measure of the most populous category in a sample. N.B. a mean cannot be used with nominal data: a mean gender of 1.5 for a sample is meaningless.
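A minimal sketch of describing nominal data, using a proportion and a mode, might look like this. The ten gender codes are hypothetical.

```python
from collections import Counter

# Hypothetical gender codes for ten respondents (1 = male, 2 = female).
gend = [1, 2, 2, 1, 2, 2, 2, 1, 2, 1]

counts = Counter(gend)  # frequency of each category
mode_code, mode_count = counts.most_common(1)[0]  # most populous category
proportion_female = counts[2] / len(gend)  # share of cases coded 2
```

The codes are only used as labels to be counted; at no point are they added together or averaged.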
4.2 Ordinal data
Sometimes the categories that one chooses for data are not simply arbitrary, but reflect some kind of order. Such data lies on an ordinal scale, but while the different categories have some form of order, they are not arithmetically related.
Look at question 5 on the questionnaire. Here there are five categories of educational achievement, and the fifth category (PhD or higher) is ‘greater’ in terms of achievement than the fourth category (Master’s degree), which in turn is ‘greater’ than the third, and so on. We can accept that there is an ‘order’ to these achievements.
However, ordinal data cannot be assumed to be additive, because the distances between items on an ordinal scale are not necessarily equivalent. This is actually made quite clear by the way we coded these achievements, which was in terms of the years of education needed post-16 to achieve them. The distances between the points on this ordinal scale are not equal.
This has implications for data analysis. We can do more with ordinal data than the frequency or proportion comparisons available with nominal data, but we cannot use the statistics that depend on additivity. Ordinal data can be described using ratios and percentages. As well as using a mode, it is also possible to calculate a median statistic for ordinal data. A median is the mid-point or middle value when all the observations are placed in order. It splits the observations into two equal-sized groups.
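As an illustration, the median and mode of an ordinal variable can be found with Python's standard statistics module. The codes below are hypothetical and simply number five ordered categories from 1 (lowest) to 5 (highest).

```python
import statistics

# Hypothetical ordinal codes for highest educational achievement,
# where 1 is the lowest category and 5 the highest.
educ = [2, 1, 3, 2, 5, 2, 4, 1, 3, 2]

# median_low returns an observed category even when the number of
# cases is even, so the result is always a valid code.
median_code = statistics.median_low(educ)
mode_code = statistics.mode(educ)
```

Using median_low rather than the ordinary median is a design choice: averaging the two middle codes would assume additivity, which ordinal data does not have.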
Most statistical tests for ordinal data avoid this problem of additivity in a clever way. What they do is rank the frequency of occurrence of different categories. In Table 7.2, the frequencies of different educational achievements for two samples have been ranked. The ranking can then be treated as an interval scale, and compared statistically, although in this case you could probably make some comment about the differences between these two samples simply by looking at the ranks.
Table 7.2 Ranking of ordinal data for two samples
|Urban Sample||Rural Sample|
|No qualification post-16||2||1||12||4|
|PhD or higher||3||2||0||1|
(Ranking: 1 is the lowest frequency and 5 is the highest frequency)
4.3 Interval and ratio data
Unlike the ordinal scale, an interval scale is one which shows the quality of additivity. This means that wherever one is along the scale then the gaps between the measurements are the same. In the questionnaire above, the first question provides interval data, because the distance between cases with age 10 and age 12 is the same as that between 13 and 15.
Sometimes one sees reference to ratio data. A ratio scale is simply an interval scale that has a true zero point. Not all interval data have an absolute zero. For instance, Celsius temperature is an interval scale, but 0°C is not an absolute zero: it is an arbitrary reference point (the freezing point of water). Ratio data, on the other hand, do have a true zero. For instance, height, age and income level all have a true zero.
4.4 Parametric and non-parametric statistics
Interval and ratio data can be manipulated by a wider range of statistics because they can be directly submitted to logical operations such as addition and squaring. At the simplest level, we can calculate averages using the mean. A mean is a measure of central tendency, and can be calculated by adding all the individual values and dividing this figure by the total number of individual cases.
Tests based on such operations are known as parametric (meaning ‘equal measures’). Parametric statistics are based on the assumptions both of additivity and that the data follows a normal distribution (the distribution of data forms a bell-shaped curve with most points grouped towards the middle). Examples of parametric statistics are t-tests and analysis of variance (ANOVA).
Non-parametric statistics, unlike parametric statistics, do not make any assumptions about additivity or the underlying distribution of data. Non-parametric statistics are therefore suitable for skewed interval data, and for ordinal and nominal levels of measurement.
Ordinal data can only be described in terms of a median (the value of the middle case) or a mode (the most frequent result). However, as was noted in 4.2, ordinal data can be manipulated to enable powerful statistical analyses to be undertaken.
Nominal data can only be described in terms of a mode (the most frequent result), and possible statistical analysis is quite limited. However, if an independent variable (cause) is nominal and the dependent variable (effect) is interval, a parametric test can still be used (see section 6.3 below).
SAQ 7.2 Kinds of data
You read an article in a journal which explores the effect of socio-economic status (a measure of social class based on occupational categories) upon amount of meat eaten per week. The investigator used the following coding frame for SES:
Professional = 4
Managerial = 3
Skilled manual = 2
Unskilled = 1
The investigator took two samples of 100 subjects from different neighbourhoods, devised average SES scores for the two neighbourhoods, and recorded these results:
[table id=40 /]
The investigator concluded: ‘The higher the social class, the lower the meat eaten per week’. What do you think?
5. Rates and percentages
Rates and percentages are simple ways to compare data.
Percentages are a way of standardising data that is familiar to most people, enabling samples of different sizes to be compared. Percentages should not normally be used when there are fewer than 100 cases in total, as they may appear to inflate the number of cases studied. In such circumstances, use a proportion (for example 0.33 or 0.5, in place of 33% or 50%).
Another way of standardising data is through the use of a rate, which is like a percentage, although the denominator may be 1000, 10,000 or even 100,000.
Consider, for example, a social survey on poverty. I would like to compare the levels of poverty in three areas of a city. As a proxy measure, I collect data on the number of people attending a food bank (a facility providing free food to people on low incomes) in each area. The survey shows that the number of registered food bank users in area A was 80; in area B it was 42 and in area C, 240. These figures seem to tell a story, but to compare them we need to use a rate to express the number of users as a proportion of the number of people in each area. This rate is a more accurate indicator of the amount of poverty than the raw count. This is illustrated in Table 7.3. Clearly, the rate was substantially higher in area C than in the other two areas.
Table 7.3 Food bank use rates
|Area||No. of food bank users||Area population||Rate/1000 population|
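The calculation of a rate per 1000 population can be sketched as follows. The user counts are those quoted in the text; the area populations are hypothetical figures chosen only to show the arithmetic, and the function name rate_per is an assumption.

```python
def rate_per(count, population, base=1000):
    """Express a count as a rate per `base` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * base


# User counts come from the text; the populations are hypothetical.
users = {"A": 80, "B": 42, "C": 240}
population = {"A": 20_000, "B": 12_000, "C": 15_000}

rates = {area: rate_per(users[area], population[area]) for area in users}
highest = max(rates, key=rates.get)  # area with the highest rate
```

Changing base to 10,000 or 100,000 gives the other kinds of rate mentioned above.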
6. Statistical analysis
One of the advantages of quantitative data is that it enables the use of statistics, and a range of different statistical tests have been devised to allow you to analyse data and make judgements about its contribution to knowledge. The two main kinds of statistics are descriptive and inferential.
6.1 Descriptive statistics
Descriptive statistics provide a means to summarise data in terms of numbers. The simplest descriptive statistic is an average. Earlier we looked at an array of data in which values for each case are entered for a number of variables. From such an array or database it is possible to calculate simple descriptive statistics. For example, an average can be calculated (mean, median or mode, dependent upon type of data). As already noted:
A mean is a calculation of the arithmetic average of a set of data. It can be used for interval or ratio data.
A median is the middle value when the data values are put in ascending or descending order. It can be used for ordinal data.
A mode is the most frequent value in a set of data. It can be used for ordinal or nominal data. N.B. in multimodal data, there can be more than one mode, suggesting that the data are made up of more than one kind of case.
For interval data, a standard deviation (a measure of the spread of values around the mean) may also be calculated, as can other measures of distribution such as skew (data shifted to one side or other of a distribution) and kurtosis (how long the ‘tails’ of a distribution curve are).
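These descriptive statistics can be computed with Python's standard statistics module, as in the following sketch. The ages are hypothetical.

```python
import statistics

# Hypothetical ages in years (interval/ratio data).
ages = [23, 31, 35, 35, 42, 50, 64]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value once sorted
mode_age = statistics.mode(ages)      # most frequent value
sd_age = statistics.stdev(ages)       # sample standard deviation (n - 1)
```

In this made-up sample the mean is pulled above the median by the oldest respondent, a small example of skew.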
Table 7.4 shows a typical frequency table and descriptive statistics for one of the variables in the data set we looked at earlier: post-16 education. Note that the data is ordinal, and a mean is not calculated.
Table 7.4 Descriptive statistics for post-16 education
|5||PhD or higher||5||3.8|
Median = 2 Mode = 2
Such tables can be presented in a research report, summarising the values for each variable.
6.2 Inferential statistics
Inferential statistics allow us to draw conclusions about the meaning of data, and to test hypotheses (for instance, what effect an independent variable such as social class has upon a dependent variable such as attitudes to politics). They are associated with ideas of statistical significance, which is a measure of how likely or unlikely it is that a finding occurred by chance.
This course cannot go into detail in terms of using statistical tests, which is a subject in its own right. However, we will quickly describe the different kinds of inferential statistics that are available, and then provide a guide to which kind of test can be used with different kinds of data.
(If you do not have expertise in statistics, it is good practice to consult a statistician when setting out to analyse your data, or even earlier, when you are designing the study, to make sure it will be possible to interpret the data you collect!)
There are four kinds of inferential statistics.
6.2.1 Contingency table
A contingency table gets its name from the extent to which variation in one variable is contingent upon another, for example a cause (independent variable) such as gender and an effect (dependent variable) such as voting intention (see Table 7.5).
Table 7.5 Contingency table for gender and voting intention
|Left-wing||33 (66%)||25 (50%)|
|Right-wing||17 (34%)||25 (50%)|
|Totals||50 (100%)||50 (100%)|
Such associations between interval or ordinal data can be assessed by correlations, which are discussed in a moment. But all data, suitably transformed, can be evaluated by means of a contingency table, using one of a number of statistics, the best known of which is chi-squared. This is of use if both independent and dependent variables are nominal. In the example just mentioned, data can be entered into a contingency table and chi-squared or another similar test used to see whether there is a significant (non-random) association.
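As a hedged sketch of what a statistics package does with such a table, the chi-squared statistic for the counts in Table 7.5 can be calculated by hand. The function name chi_squared is hypothetical, no continuity correction is applied, and the critical value of 3.841 is the standard tabulated figure for one degree of freedom at the 5% level.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Observed counts from Table 7.5: rows are left-wing and right-wing,
# columns are the two gender groups.
observed = [[33, 25], [17, 25]]

statistic = chi_squared(observed)
CRITICAL_5_PERCENT_DF1 = 3.841  # tabulated critical value, 1 degree of freedom
significant = statistic > CRITICAL_5_PERCENT_DF1
```

For these counts the statistic is about 2.63, below the critical value, so on this uncorrected calculation the association would not be judged significant at the 5% level.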
6.2.2 Correlation
Like contingency tables, correlations are concerned with the strength of associations between variables. Correlations are used when both variables are at least ordinal, and usually when they consist of interval data.
An association between two interval variables can be represented visually in the form of a scattergram (see figure 7.3). In this example, we look at an imaginary data set that compares incomes (along the horizontal axis) and days spent on holiday per year (on the vertical axis)! There appears to be a strong positive association between these variables: as incomes rise, so do days spent on holiday. However it is not a perfect correlation: otherwise all the points would fall along a straight diagonal line.
Figure 7.3 Scattergram of income vs. annual holiday length
Statistical tests of correlation such as Pearson’s r assess whether this apparent association is significant statistically. They are relatively straightforward to interpret because they vary between fixed limits: from 1 for a perfect positive association (a straight diagonal line rising from left to right in the scattergram), through 0 for no association whatsoever (random scatter), to –1 for a perfect negative association (a diagonal line sloping from left down to right).
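The coefficient itself can be sketched in a few lines. The income and holiday figures are hypothetical, the function name pearson_r is an assumption, and the sketch computes only the coefficient, not its statistical significance.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical incomes (in thousands) and days of holiday per year.
income = [15, 22, 30, 38, 45, 60]
holiday = [8, 12, 14, 21, 20, 30]

r = pearson_r(income, holiday)
```

With these made-up figures r comes out close to, but below, 1: a strong positive association that is not perfect.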
You can also do correlations if both variables are ordinal data, using a non-parametric test such as Spearman’s rho or Kendall’s tau.
6.2.3 Tests of differences in means or variances
These are used to compare two or more groups of cases, subjects or samples, where the dependent variable is interval or ordinal, while the grouping variable (the independent variable) is categorical (nominal). For example, we can compare men and women (nominal) on the number of hours they spend a week on domestic tasks, childcare and so forth (interval).
This could be represented in a simple bar chart (see section 7). Tests of significance are used to assess if the differences between the means for the two groups are due to chance or due to the independent variable (gender).
The test used to assess two groups is the t-test, while for more than two groups use the F-test (also known as analysis of variance or ANOVA).
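To give a feel for what a t-test computes, here is a minimal sketch of the t statistic for two independent samples, assuming equal variances in the two groups (the pooled-variance form). The hours of domestic work are hypothetical and the function name independent_t is an assumption. In practice a statistics package would also report the significance level.

```python
import math
import statistics


def independent_t(sample_a, sample_b):
    """Student's t statistic for two independent samples (pooled variance)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    df = n_a + n_b - 2  # degrees of freedom
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical weekly hours of domestic work for two groups.
women = [20, 25, 22, 30, 28]
men = [12, 15, 18, 10, 20]

t, df = independent_t(women, men)
```

The larger the t value relative to the tabulated critical value for the given degrees of freedom, the less likely it is that the difference in means arose by chance.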
In the example of gender and hours of housework, these are independent samples. These tests can also be used for paired or matched samples. An example of this would be comparing the hours of childcare done by men who had full-time jobs with the hours they spent a month after they had lost their jobs due to redundancy. Here it is the same subjects in each of the two categories (working and non-working), so the data is paired.
There are also some tests that can be used if dependent variable data is ordinal, and these can be found in the guide to choosing tests in 6.3 below.
6.2.4 Multivariate analysis
It is possible to do more complex analyses to explore how a range of independent variables affect a dependent variable. The most common multivariate analysis is multiple regression. This is used when all independent variables are either interval data or dichotomous (two category) variables, and the dependent variable is interval data. For example, the dependent variable (outcome) could be days a year spent in paid work, and there could be a range of independent variables (causes) which we want to test to see which of them affects the outcome and to what extent. These might be variables such as gender, age, educational level, number of jobs held since 18, political attitudes, kind of diet, etc.
It is a sophisticated analytical tool (which must be done by computer) and will indicate the strengths of each independent variable in predicting the value of the dependent variable, and can rule out some variables as having little or no effect on an outcome.
There are other multivariate approaches such as logistic regression which are beyond the scope of this overview.
6.3 Choosing the right test
We have only scratched the surface of statistics here, and you should either take a specialist course or consult a statistician before deciding how to use statistics to analyse your data. The following table is a simple guide to the main statistical tests we have mentioned.
The top row of Figure 7.4 shows the type of data for the outcome (dependent) variable: nominal, ordinal or interval/ratio. Each of the columns shows the tests appropriate to that kind of data. So when describing nominal data it is appropriate to calculate a mode; likewise when describing ordinal data, it is appropriate to use both the mode and the median, and so on.
The left-hand column shows the different kinds of statistics you might wish to use, ranging from descriptive to analytical statistics. Note that when it comes to associations, you need to choose a test according to whether the independent (causal) variable is nominal, ordinal or interval in character.
Figure 7.4 Choosing an appropriate statistical test
[table id=41 /]
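The logic of choosing a test can also be sketched as a simple lookup. This sketch covers only the combinations described in the text of this unit, not the full contents of Figure 7.4, and the names TESTS and suggest_test are hypothetical.

```python
# Lookup limited to the combinations described in the text;
# keys are (independent variable scale, dependent variable scale).
TESTS = {
    ("nominal", "nominal"): "chi-squared (contingency table)",
    ("ordinal", "ordinal"): "Spearman's rho or Kendall's tau",
    ("interval", "interval"): "Pearson's r",
}


def suggest_test(independent, dependent, groups=2):
    """Suggest a test for a pair of measurement scales."""
    if groups < 2:
        raise ValueError("a comparison needs at least two groups")
    if independent == "nominal" and dependent == "interval":
        # Comparing group means: t-test for two groups, ANOVA for more.
        return "t-test" if groups == 2 else "ANOVA (F-test)"
    try:
        return TESTS[(independent, dependent)]
    except KeyError:
        return "not covered here: consult Figure 7.4 or a statistician"


example = suggest_test("nominal", "interval", groups=3)
```

A lookup like this is no substitute for statistical advice, but it makes explicit that the choice depends on the measurement scale of both variables.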
SAQ 7.3 Which test?
Using Fig 7.4, work out which kind of statistical analysis should be used with the following studies.
[table id=42 /]
7. Presenting data
So far, we have considered how to manipulate data during analysis. Once analysis is completed, we will need to present it. The main ways to present data are tables, descriptive statistics and visual representations (figures and charts).
7.1 Tables
A table provides a way to compare data easily and is useful where there are multiple categories. Any of the tables described earlier in this unit could be used in a research report to summarise data.
(For an example of a paper which should really have used tables to report an overwhelming mass of data, see Eaton, D.K. et al. (2006). Youth risk behavior surveillance – United States, 2005. Journal of School Health, 76(7), 353-372. This can be downloaded free at http://www.pactonline.org/docs/June%2006-Youth%20Risk.pdf).
7.2 Descriptive statistics
Data can be summarised in terms of simple descriptive statistics (e.g. an average such as a mean or a median value) or a range. These may be coupled with tests of significance (see section 6 above).
7.3 Visual representations: figures and charts
Figures are non-numerical representations of data (for example, a flow diagram showing how one variable affects another which in turn affects another).
Charts provide a means to visually represent numerical data such as frequencies. The chart used should be clear and should illustrate the data in a way that assists the reader to understand the findings.
7.3.1 Types of chart
Bar charts are a means to compare frequencies for discrete categories. Chart 7.1 displays the number of people in four socio-economic status (ordinal) categories.
Chart 7.1: Bar chart of proportion of meat-eaters by SES
Pie charts provide a more intuitive way to present percentages. Chart 7.2 shows the same data as chart 7.1, with different segments of the ‘pie’ representing proportions of the cases with each value.
Chart 7.2: Pie chart of proportion of meat-eaters by SES
Other more complex charts are available in statistical software packages or online.
This chapter covers a lot of ground and you may want to go over some of it a number of times. There are a number of key points with which you should now be familiar, namely:
- Quantitative data comes in three types: nominal, ordinal and interval.
- Each type of data requires a different sort of analysis and only certain types of statistical tests can be applied to particular data types.
- Data needs to be coded before it is entered into a computer, according to thought-through coding frames, and ensuring assumptions are not made about the type of data (e.g. interval rather than ordinal).
- Once analysis is complete, the findings of the analysis need to be presented in a clear way, using tables, statistical findings and visual representations, as appropriate.
Now please complete the following reflective exercise for your log book.
Reflective Exercise 7.1
[table id=43 /]
Bryman, A. and Cramer, D. (2011) Quantitative Data Analysis with IBM SPSS 17, 18 & 19: A Guide for Social Scientists. Hove: Routledge.
Kent, R.A. (2015) Analysing Quantitative Data: Variable-based and Case-based Approaches to Non-experimental Datasets. London: Sage.
Answers to SAQ 7.1
- Value (the variable is ‘family size’)
Answer to SAQ 7.2
The first thing you should have noted is that the investigator is using a confusing coding frame, in which a larger number represents ‘higher SES’, whereas conventionally, SES is described using Roman numerals with I = professional through to V = unskilled.
But the most remarkable thing about the study is the use of ‘mean SES’ scores. The investigator is treating the data as interval, whereas it is at best ordinal, and is arguably nominal, based on a set of occupational categories. (The dated term ‘lower classes’ should not be used to imply some kind of hierarchical order in these different occupational groups.) A mode may be the most appropriate summary statistic to use to describe the average social class for these groups.
The investigator is also confusing dependent and independent variables. In data analysis, the investigator should seek to examine the dependent variable (i.e. the outcome measure) by the independent variables. So in this example the investigator should display the mean meat eaten (the dependent variable) by each socio-economic status grouping (the independent variable). If samples from the four socio-economic groups were compared on the amount of meat eaten (using an Analysis of Variance; see section 6.2.3), it would be possible to discern if SES really had an effect on meat eaten.
Answer to SAQ 7.3
- Test of differences in means: independent samples, parametric data (t-test).
- Two-by-two contingency table (Chi-squared).
- Test of difference of means: related samples, non-parametric data (Wilcoxon).
- Test of differences in variance, parametric data, three independent samples (ANOVA).
- Non-parametric correlation (Spearman’s rho or Kendall’s tau).
- Correlation: parametric data (Pearson product-moment coefficient of correlation).