From SPSS to R, Rewriting Code for Survey Data Tabulation: MICS Survey Example

Filip Mitrovic
Analytics Vidhya
Published in
8 min readSep 6, 2021

--

Household surveys are among most complex statistical data collection and compilation exercises out there. They are conducted by national statistical offices, other government bodies and private research organizations world wide.

Analysis for data collected in them can be done in a variety of tools.

To me, value added in using any survey data would be a possibility to run the analysis on different data analysis platforms. This opens up possibilities for additional quality control checks, integration of parts/modules of the one survey into other surveys that are processed on different platforms, or adding additional reporting modes next to tabulations (like choropleths or complex visualizations)

For this exercise an SPSS syntax for one of many tables produced in MICS surveys is used and re-coded in R. Multiple Indicator Cluster Surveys (MICS), is the largest source of statistically sound and internationally comparable data on children and women worldwide. To collect data in MICS, interviewers administer the household questionnaire as well as individual questionnaires to women and men aged 15 to 49 years, to mothers or caretakers of all children under 5 years of age, and a randomly selected child age 5–17.

For the purpose of this exercise one of the SPSS syntax files used in tabulating a MICS survey standard table SR.6.1M was recoded in R. All of the MICS standard SPSS syntax is available online on the MICS website.

This is a rather simple table in MICS, it shows literacy rates for men age 15–49. All men who have attained education level above primary school in MICS are considered literate. Next to formal level of education, any men interviewed in the survey who does not have primary school education completed, or has only primary school education (question MWB6 in the questionnaire) is asked to read a sentence from a card given by the interviewer (question MWB14 in the questionnaire). All these questions are standard to a Questionnaire for Individual Men age 15–49 years of age and a template version of this questionnaire is also available online.

Large international social statistics survey programs like MICS and DHS usually implement specialized premium software for analysis like SPSS or Stata. Reasons are simple, like a possibility to create custom tables (‘ctables’ function in SPSS or in Stata combination of ‘collect layout’) with all relevant socio-economic background characteristics that are used to break down the indicator, like by area (urban/rural), region, ethnicity, religion, wealth quintile and others defined for each survey.

Goal of the post is to show same functionality with open source software like R using ‘expss’ package. Next to crisp outlay of the tabulation and comparison of the results with the original, using R allows for integration with other different packages, as mentioned above.

Specific dataset used for this exercise of running SR.6.1M SPSS syntax on ‘Literacy (men)
Percent distribution of men age 15–49 years by highest level of school attended and literacy, and the total percentage literate’

was from Tonga 2019 MICS survey, and dataset for it is available also online.

This particular table was published on page 64 of the Survey Findings Report.

In SPSS initial section of the syntax looks like this:

include "surveyname.sps".get file = 'mn.sav'.include "CommonVarsMN.sps".select if (MWM17 = 1).weight by mnweight.compute literate  = 2.
if (MWB6A >= 2 and MWB6A < 8) literate = 1.
if (MWB14 = 3) literate = 1.
variable labels literate " ".
value labels literate 1 "Literate" 2 "Illiterate".
compute literateP = 0.
if (literate = 1) literateP = 100.
variable labels literateP "Total percentage literate [1]".
compute layer = 0.
variable labels layer " ".
value labels layer 0 "Percent distribution of highest level attended and literacy".
* add A to welevel label.
* add value labels mwelevel 3 "Secondary or higher [A]".
* country specicif add A to welevel label.
add value labels mwelevel 3 "Upper secondary or higher [A]".
compute numMen = 1.
variable labels numMen "Number of men".
value labels numMen 1 "".
compute total = 1.
variable labels total "Total".
value labels total 1 " ".
compute tot = 100.
variable labels tot "Total".
value labels tot 100 " ".

this section of the syntax (lines 1–59) does the following: computes literacy (variable ‘literate’) for respondents who had read the card, and takes into account only those who were able to read the whole sentence, adds survey weight for men, adds percentage variable for literacy (variable ‘literateP’), computes total number of men (variable ‘numMen’) and total percentage (variable ‘tot’ )

for same code in R it will take a few more lines though:

---
title: "SPSS to R - recoding of a Survey Syntax"
author: "Filip Mitrovic"
output:
word_document: default
html_document: default
pdf_document: default
---
## Library needed to reproduce the SPSS syntax household survey table in R are below. Author of the 'expss' package notes that it should be loaded only after the 'heaven' package.
```{r setup, echo=FALSE}
library(foreign)
library (haven)
library (expss)
```
### path to *.sav file with the survey dataset.```{r, results="hide"}

path = ''
setwd(path)
data = read.spss('mn.sav', to.data.frame = T, use.value.labels = FALSE) ##data set is loaded with numerical values to save keystrokesdata %>% filter(data$MWM17 == 1) # selects only responses from completed interviews with men age 15-49 in the survey```

Next to defining the same variables: literate, literateP, tot, numMen, this syntax defines libraries needed to run spss *.sav file in R environment.

after that segment same computation as in SPSS syntax is performed: ### computing literacy as in SPSS syntax, re-written

```{r pressure, echo=TRUE}data = compute(data, {
literate = 2
})
data$literate<- ifelse((data$MWB6A >=2 & data$MWB6A <8), 1,
ifelse(data$MWB14==3,1,2))
# making data numeric in R dataframe for calculations
data$literate<-as.numeric(as.character(data$literate))
# adding value labels in R, comparable to 'value labels' function in SPSS.
val_lab(data$literate) = num_lab("
1 Literate
2 Illiterate
")
# adding variable with overall percent of literate respondents. This is comparable to 'compute' function in SPSS. All variables have identical names as in SPSS syntax for easier reference.
data = compute(data, {
literateP = 0
})
data$literateP<- ifelse(data$literate ==1,100,0)#recode(data$mwelevel) = c(0 ~ 0, 1 ~ 1, 2:3 ~ 2, 9 ~ 9, other ~ NA)# adds value labels to the column headings in the table in R. This is not in SPSS syntax as those are already pre-defined in the *sav dataset
data$mwelevel<-as.numeric(as.character(data$mwelevel))
val_lab(data$mwelevel) = num_lab("
1 Up to primary
2 Lower secondary
3 Upper Secondary or higher [A]
9 Don't know/ Missing
")
# computes additional columns as in SPSS syntax, for overall number of cases and a total percentage. 'numMen' and 'tot' are identical variable names as in SPSS syntax
data = compute(data, {
numMen = 1
})
data = compute(data, {
tot = 1
})
# adding variable labels. Labels added identical as in comparable SPSS syntax. In addition labels also added to the background characteristic variables that are shown in rows. This is not seen in SPSS syntax as they are already in the *.sav file.
data<-apply_labels(data,
literateP = "Total percentage literate [1]",
numMen = "Number of men",
HH6 = "Area", # up to variable 'mwelevel' all variables are background characteristics.
HH7 = "Region",
mdisability = "Functional difficulty",
religion = "Religion of the household head",
ethnicity = "Ethnicity of the household head",
windex5 = "Wealth quintile",
mwelevel = "Percent Distribution of highest level attended and literacy",
tot = "Total",
literate ="Literacy rate")
# adding value labels as in SPSS. next to syntax specific value labels, that are shown in the comparative document, the labels for background characteristic variables are also added.
val_lab(data$tot) = num_lab("
1 | " )
val_lab(data$HH6) = num_lab("
1 Urban
2 Rural")
val_lab(data$mdisability) = num_lab("
1 Has functinal difficulty
2 No funcitonal difficulty")
val_lab(data$windex5) = num_lab("
1 Poorest
2 Second
3 Middle
4 Fourth
5 Richest")
val_lab(data$religion) = num_lab("
1 Free Wesleyan Church
2 Latter Day Saints
3 Roman Catholic
4 Free Church of Tonga
5 Other religion
99 Don't know/missing")
val_lab(data$ethnicity) = num_lab("
1 Tongan
2 Chinese
3 Fijian
4 Other ethnicity
99 Don'tknow/missing")
val_lab(data$HH7) = num_lab("
1 Tongatapu
2 Vava'u
3 Ha'apai
4 'Eua
5 Ongo Niua")

But in addition to syntax specific variables, a set of variable names and labels is added too, as these are pre-defined in *sav file for SPSS but have to be introduced in the R code as well.

Now the tabulation syntax in SPSS is pretty straightforward and uses ‘ctables’ function to bind column level variables define din this syntax (‘literate’, ‘literateP’, ‘tot’, mwelevel’ and ‘numMen’) and tabulates it against pre-defined above mentioned background characteristic variables. Syntax looks like this:

ctables
/vlabels variables = layer mwelevel literate display = none
/table total[c]
+ hh6 [c]
+ hh7 [c]
+ $mage [c]
+ mdisability [c]
+ ethnicity [c]
+ religion [c]
+ windex5 [c]
by
layer [c] > mwelevel [c] > literate [c] [layerrowpct.validn '' f5.1] + tot [s] [mean '' f5.1] +
literateP [s] [mean '' f5.1]
+ numMen[s][sum '' f5.0]
/categories variables=all empty=exclude missing=exclude
/slabels position=column visible = no
/titles title=
"Table SR.6.1M: Literacy (men)"
"Percent distribution of men age 15-49 years by highest level of school attended and literacy, and the total percentage literate, " + surveyname
caption =
"[1] MICS indicator SR.2 - Literacy rate (age 15-24 years)"
"[A] Respondents who have attended upper secondary school or higher are considered literate and are not tested."
.

In R, tabulation would look like this:

```{r}
expss_output_viewer() # function to see the table in the viewer in R studio
data %>%
tab_total_row_position("none")%>% # suppresses the total values for each row. As SPSS syntax 'ctables' does not show totals, it is disabled in R code as well
tab_cells(tot, HH6, HH7, mdisability, ethnicity ,religion, windex5) %>% #defines rows for the table
tab_cols(literate, mwelevel, tot) %>% # defines columns to shoe percentage for with 'tab_stat_rpct'
tab_weight(weight = mnweight) %>% #adds weights from the dataset
tab_stat_rpct(total_label = NULL,total_statistic = "w_cases",)%>%
tab_cols(net(literateP, "Total percentage literate" = greater_or_equal(1), "TO_DELETE" = other))%>% #defines mean for the additional column as a different calculation method, and marks a columnt to be deleted
tab_stat_rpct(total_label = NULL,total_statistic = "w_cases",)%>%
tab_cols(total(numMen))%>% # last column and different calculation method only to count total cases
tab_stat_cases()%>%
tab_last_round(digits = get_expss_digits())%>%
tab_pivot(stat_position = "outside_columns")%>%
where(!grepl("TO_DELETE", row_labels)) %>%
drop_empty_rows()%>%
set_caption( "Table SR.6.1M: Literacy (men)
Percent distribution of men age 15-49 years by highest level of school attended and literacy, and the total percentage literate, " ) #adds caption on the top of the table

Final output of the R code looks like this in html format. Further customization can be done for word.

Full knitr file for this exercise is available here too:

--

--

Filip Mitrovic
Analytics Vidhya

Believer in the whole "Better data, better lives" thing. Wish I had a dog.