CENW Duplicate Rows for the Same ALF (ONS Census Wales 2011)

When working with CENW (ONS Census Wales 2011) data, duplicate rows for the same ALF are a known feature.

ONS response regarding this feature of the dataset:

"We have three processes for checking and resolving duplicate responses so that the main census data we end up with should simply be one record for each person:

  1. We resolve duplicates coming in for the same address (actually checks for duplicates within the same postcode) using a process called Resolve Multiple Responses (RMR). For instance, if two people both fill in a form for their whole household, or someone from a household also submits an individual response unknown to the main submission. We have rules for how we check they are duplicates, and rules for which to keep.

  2. We also do an over coverage check on a sample basis for duplicates across the rest of the country, and then factor the findings into our coverage estimation calculations. This sampling focuses on the types of population which are more likely to be duplicated (people who have indicated they have a second residence on the census; students aged 18-25, armed forces personnel, children, adults enumerated at a communal establishment, etc.) but does sample from the remaining populations too.

  3. We ask parents to fill in basic demographic information for any children who are away studying, and when they get to the question on their term-time address, if they answer that the term-time address is elsewhere, we then use that to filter those out-of-term students out of the main database. Then when that student does respond actually at their term-time address, we only include them there.

There is a risk that they could be someone that appears more than once in the data. The overcount adjustment (method 2 in my list) won't remove anyone, because we can't say which is 'right' - what it does is make the best estimate of the number of usual residents.

In 2011 the estimated level of over coverage across England & Wales was 0.6%, which is approximately 352,000 people. This paper from 2011 may help: https://webarchive.nationalarchives.gov.uk/20160108085304/http:/www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-data/2011-first-release/first-release--quality-assurance-and-methodology-papers/overcount-estimation-and-adjustment.pdf

For over coverage estimation we estimate the level of duplication and wrong location records in the census. The Census Coverage Survey (CCS) sample is matched to the full census data to be able to determine who is an overcount record from the CCS sample data, which can then be used to estimate the level of over coverage in the census.

Also, the full census is matched to itself where the level of duplication is estimated in each region within overcount groups (the ones outlined in point 2 above). Inverse sampling is used so each region and each group is sampled until 102 duplicates are found, which gives a high level of precision around the estimates. The initial sample size is 5000 for each region and group, but can be larger if 102 duplicates are not found. These estimates of the level of duplication are then used to calibrate the level found from the census to CCS matching exercise.

We would not make a duplicate match unless we were sure that it genuinely was the same person. So we would not rely on just name, dob and sex. In most cases we look for either some geography match using all of the census addresses (one year ago, alternative, workplace) or other duplicates in the same household that indicate that a whole family has moved and responded twice. We also make use of the variables ethnicity, marital status and occupation to help with the matching."

 

Advice from the SAIL User & Data Support Services team on how to handle this:

Look for all ALFs with duplicate rows and compare their ALF_STS_CD as well as their census responses for each variable you have access to (NB: this applies only to CENW, not to CENS, which uses ALF2):

If they have a mixture of ALF_STS_CD = 4 and 35, select the row with code 4 and disregard the other rows. If more than one row has code 4, apply the rules below to those remaining code-4 rows.

If they have the same ALF_STS_CD across all (or remaining) rows, retain one of those rows if all their census data matches across the entire dataset.

If they have the same ALF_STS_CD across all (or remaining) rows but their census data does not match across the entire dataset then disregard all rows and drop the ALF entirely.
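
The three rules above can be sketched in plain Python as follows. This is only an illustration, not SAIL's own implementation: the function name resolve_duplicate_alfs, the "ALF" key, and the example variables SEX and AGE are hypothetical, and real CENW column names may differ. The advice does not say what to do with other mixtures of status codes, so the sketch assumes such ALFs should be dropped and reported for manual review.

```python
from collections import defaultdict


def resolve_duplicate_alfs(rows, alf_key="ALF", status_key="ALF_STS_CD"):
    """Return (kept_rows, dropped_alfs) after applying the three rules.

    `rows` is a list of dicts, one per census row. Key names are
    hypothetical placeholders for the real column names.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[alf_key]].append(row)

    kept, dropped = [], []
    for alf, group in groups.items():
        statuses = {r[status_key] for r in group}

        # Rule 1: a mixture of codes 4 and 35 -> keep only the code-4 rows.
        if statuses == {4, 35}:
            group = [r for r in group if r[status_key] == 4]
            statuses = {4}

        # Census responses = every field except the ALF and its status code.
        responses = {
            tuple(sorted((k, v) for k, v in r.items()
                         if k not in (alf_key, status_key)))
            for r in group
        }

        if len(statuses) == 1 and len(responses) == 1:
            # Rule 2: same status code and identical responses -> keep one row.
            # ALFs that only ever had a single row also end up here.
            kept.append(group[0])
        else:
            # Rule 3: same status code but conflicting responses -> drop ALF.
            # Assumption: status-code mixtures not covered by the advice
            # are also dropped, so they can be reviewed by hand.
            dropped.append(alf)
    return kept, dropped


rows = [
    {"ALF": 1, "ALF_STS_CD": 4, "SEX": "F", "AGE": 30},
    {"ALF": 1, "ALF_STS_CD": 35, "SEX": "F", "AGE": 31},
    {"ALF": 2, "ALF_STS_CD": 4, "SEX": "M", "AGE": 40},
    {"ALF": 2, "ALF_STS_CD": 4, "SEX": "M", "AGE": 40},
    {"ALF": 3, "ALF_STS_CD": 35, "SEX": "F", "AGE": 22},
    {"ALF": 3, "ALF_STS_CD": 35, "SEX": "F", "AGE": 23},
    {"ALF": 4, "ALF_STS_CD": 4, "SEX": "M", "AGE": 55},
]
kept, dropped = resolve_duplicate_alfs(rows)
```

In this made-up example, ALF 1 keeps its code-4 row, ALF 2 keeps one of its two identical rows, ALF 3 is dropped because its rows disagree on AGE, and ALF 4 passes through unchanged because it has only one row. Returning the dropped ALFs separately makes it easy to report how many individuals were excluded from an analysis.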