Guide on ALF Usage


What is an ALF?

An ALF (Anonymised Linkage Field) is a key that joins records which are determined to be the same entity through various methods. It is essentially an ID that groups the records together.

An ALF can be produced in many ways, but the most common approaches are:

Deterministic Linkage: Rules are used to decide whether two records should be linked. A common example is an exact match on a strong ID such as an NHS Number or Social Security Number, but more complex rulesets are also possible.

Probabilistic Linkage: Records are linked based on a trained model that describes the probability that two records belong to the same individual.
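
As a rough illustration of the difference, here is a minimal sketch of a toy deterministic rule and a toy probabilistic-style score for a single record pair. This is not how SAIL's linkage is actually implemented; the field names (nhs_number, surname, dob, postcode) and the weights are illustrative assumptions only.

# Toy sketch of the two approaches for one candidate record pair.
# Field names and weights are hypothetical, not SAIL's real configuration.

FIELD_WEIGHTS = {"surname": 2.0, "dob": 3.0, "postcode": 1.0}

def deterministic_link(rec_a, rec_b):
    # Rule: link only if both records carry the same, non-missing NHS Number.
    return bool(rec_a.get("nhs_number")) and rec_a["nhs_number"] == rec_b.get("nhs_number")

def probabilistic_score(rec_a, rec_b):
    # Sum agreement weights for fields that match; in a real probabilistic
    # model these weights come from training, not hand-picked constants.
    return sum(w for f, w in FIELD_WEIGHTS.items()
               if rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f))

a = {"nhs_number": "0000000000", "surname": "JONES", "dob": "1980-01-02", "postcode": "SA2 8PP"}
b = {"nhs_number": "0000000000", "surname": "JONES", "dob": "1980-01-02", "postcode": "SA1 1AA"}
print(deterministic_link(a, b))   # True: exact NHS Number match
print(probabilistic_score(a, b))  # 5.0: link if the score clears a chosen threshold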

 


What is ALF1?

ALF1 is how linkage has historically been carried out in SAIL, utilising WDS as a spine. The presence of high-quality data, a strong person identifier field (NHS Number), and a focused scope on individuals registered at a Welsh General Practice means that linking many datasets to this spine has been possible.

An ALF1 is used to pseudo-anonymise an individual based on their NHS Number.

ALF1 therefore allows provisioning of cohorts that fit this scope, and has been used consistently within SAIL datasets for a long time. For a given dataset, we link its System ID to an ALF1; any datasets that have been linked to ALF1 can then be linked together:

Dataset_A ↔ ALF1 ↔ Dataset_B

 

ALF1 in its unencrypted form (prior to provisioning) can be used to join any datasets which have been allocated ALF1:

Dataset_A ↔ ALF1

Dataset_B ↔ ALF1

Therefore we can join the two datasets through the shared ALF1: Dataset_A ↔ ALF1 ↔ Dataset_B
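
As a minimal sketch of what that join looks like in practice (assuming two provisioned tables that each carry an unencrypted ALF1 column; the data below is made up):

# Illustrative pandas sketch: joining two datasets through a shared ALF1.
import pandas as pd

dataset_a = pd.DataFrame({"ALF1": [1, 2, 3], "gp_events": [4, 1, 7]})
dataset_b = pd.DataFrame({"ALF1": [2, 3, 5], "admissions": [0, 2, 1]})

# Dataset_A ↔ ALF1 ↔ Dataset_B: an inner join keeps only ALF1s present in both.
linked = dataset_a.merge(dataset_b, on="ALF1", how="inner")
print(linked)  # ALF1 values 2 and 3 appear in both datasets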

 

In its project-specific encrypted form, ALF1_E, we can only join across datasets if the ALF1_Es are produced using the same encryption key:

Dataset_A ↔ ALF1_E ↔ Dataset_B

These two ALF1_Es have been provisioned to the same project and should use the same encryption key. ALF1_Es from different projects will use different encryption keys, and so the join cannot be done.
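
To illustrate why, the sketch below uses keyed hashing (HMAC-SHA256) as a stand-in for the project-specific encryption step. SAIL's actual ALF1_E scheme is not described here, so treat this purely as an analogy for key-dependent pseudonymisation:

# Analogy only: the same ALF1 encrypted under different project keys gives
# values that can no longer be matched against each other.
import hashlib
import hmac

def pseudo_alf1_e(alf1: str, project_key: bytes) -> str:
    return hmac.new(project_key, alf1.encode(), hashlib.sha256).hexdigest()

alf1 = "12345"
print(pseudo_alf1_e(alf1, b"project-A-key") == pseudo_alf1_e(alf1, b"project-A-key"))  # True: same key, join works
print(pseudo_alf1_e(alf1, b"project-A-key") == pseudo_alf1_e(alf1, b"project-B-key"))  # False: different keys, no join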

 


What is ALF2?

An ALF2 behaves in the same manner as ALF1, in that it is a key which groups together records determined to belong to the same entity. The main difference is in how these links are produced.

Unlike ALF1, which is predominantly a proxy for NHS Number based on WDS, an ALF2 utilises various (sometimes differing) PII fields across (potentially numerous) datasets. These PII fields are often the usual personal identifiers such as Name, DoB, Address etc., but can also include stronger identifiers such as NHS Numbers, Case Numbers, or Account Numbers, depending on the dataset.

 

ALF2 links are run-specific and do not map across different linkage projects in the way that ALF1 does. This means you cannot use ALF2s from different linkage runs to join datasets; they must come from the same linkage run.

Say we have two linkage runs:

Dataset_A ↔ ALF2₁

Dataset_B ↔ ALF2₂

You cannot join Dataset_A to Dataset_B here, as Dataset_A ↔ ALF2₁ ↔ ALF2₂ ↔ Dataset_B does not work. You would need to perform a new linkage run to produce a common ALF2. You could also theoretically use daisy-chaining if both are linked to a common dataset (see the section on Daisy-chaining).

 

Example 1:

One clear example of how the two differ is in the ALF2 produced for WDS. This uses both NHS Number AND other PII fields in order to establish links, and so goes beyond ALF1 in identifying individuals who may have been allocated more than one NHS Number, or who have been registered (even briefly) under other people’s NHS Numbers. This can cause some confusion: when looking at ALF1 we see distinct groupings with no duplicate ALF1 IDs, but a given ALF2 ID may then show that multiple NHS Numbers are present. This is simply the nature of the data, and it is up to the end user to understand what they want to ask of the data, and what these limitations/complexities mean for their analysis. In this scenario you may choose to discard the ALF2 IDs which contain more than one ALF1, but this would mean that you miss some links. Conversely, you may choose to keep those multi-ALF1 groups, but this may mean including records where people have been erroneously added to another individual’s NHS Number.
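
A minimal pandas sketch of the stricter option is below, assuming a table that carries both an ALF2 and an ALF1 column (the data is made up); the lenient option is simply to keep the table as it is:

# Illustrative sketch: drop ALF2 groups that span more than one ALF1.
import pandas as pd

wds = pd.DataFrame({
    "ALF2": [10, 10, 11, 12, 12],
    "ALF1": [1, 1, 2, 3, 4],   # ALF2 group 12 spans two ALF1s / NHS Numbers
})

alf1_per_alf2 = wds.groupby("ALF2")["ALF1"].nunique()
single_alf1 = alf1_per_alf2[alf1_per_alf2 == 1].index

strict = wds[wds["ALF2"].isin(single_alf1)]   # misses any genuine multi-NHS-Number links
print(sorted(strict["ALF2"].unique()))        # [10, 11]; group 12 has been discarded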

 

Example 2:

Imagine we have linked 2 datasets (Dataset_A and Dataset_B).

A given ALF2 can contain records which are:

  • Records only from Dataset_A, 

  • Records only from Dataset_B, or

  • A mix of records from Dataset_A and Dataset_B 

This may mean that in a provisioned cohort you have ALF2s which do not link the two datasets together. This is a valid scenario, as your analysis may still want to consider those individuals, and again it is up to the user to know what that means for them. 

An example here may be linking the UK Census to the WDS records. We wouldn't expect all Census records to have matches in WDS, as the Census is wider than the coverage of WDS. Similarly, we also might not see matches for WDS records in Census due to data quality and linkage approach. 
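
As a rough sketch of how you might check this in a provisioned cohort, assuming a combined table with an ALF2 column and a column recording which dataset each record came from (both column names are hypothetical):

# Illustrative sketch: classify each ALF2 by the dataset(s) its records come from.
import pandas as pd

cohort = pd.DataFrame({
    "ALF2":   [1, 1, 2, 3, 3, 4],
    "SOURCE": ["Dataset_A", "Dataset_B", "Dataset_A", "Dataset_B", "Dataset_B", "Dataset_A"],
})

membership = cohort.groupby("ALF2")["SOURCE"].agg(lambda s: frozenset(s))
labels = {
    frozenset({"Dataset_A"}): "Dataset_A only",
    frozenset({"Dataset_B"}): "Dataset_B only",
    frozenset({"Dataset_A", "Dataset_B"}): "links both datasets",
}
print(membership.map(labels))  # ALF2 1 links both; 2 and 4 are A-only; 3 is B-only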

 


Daisy-chaining

It is theoretically possible to daisy-chain datasets together if their linkages contain a shared dataset. This is similar to how we are using WDS as our spine in conjunction with ALF1. 

 

E.g. if we have

Dataset_A ↔ ALF2₁ ↔ Dataset_B

Dataset_B ↔ ALF2₂ ↔ Dataset_C

Then we could in theory connect Dataset_A to Dataset_C without running a new linkage as follows: Dataset_A ↔ ALF2₁ ↔ Dataset_B ↔ ALF2₂ ↔ Dataset_C

This approach does mean that the links are bottlenecked through Dataset_B: only records with a link into Dataset_B can be carried through to Dataset_C, which may have a significant impact on the resulting cohort.
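
A minimal pandas sketch of the chain, assuming the two sets of links are held as lookup tables (all IDs below are made up):

# Illustrative sketch: daisy-chaining Dataset_A to Dataset_C via Dataset_B.
import pandas as pd

a_to_b = pd.DataFrame({"id_a": [1, 2, 3], "id_b": [10, 11, 12]})        # links from run 1
b_to_c = pd.DataFrame({"id_b": [10, 12, 13], "id_c": [100, 102, 103]})  # links from run 2

# Only Dataset_A records whose Dataset_B partner also links to Dataset_C survive,
# which is the bottleneck described above.
a_to_c = a_to_b.merge(b_to_c, on="id_b", how="inner")[["id_a", "id_c"]]
print(a_to_c)  # id_a 1 and 3 reach Dataset_C; id_a 2 is lost at Dataset_B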

 


Notes on ALF1 and ALF2:

At the moment we have no intention of getting rid of ALF1. It is a consistent and strong link that has worked for SAIL for a significant period of time. It works well with medical data within Wales and so has helped to create a rich and extensive catalogue of linkable cohorts across SAIL’s history.

ALF2 simply provides a higher level of abstraction over the ALFing process, clustering records beyond the ALF1.

 


Notes about linkage:

Please be aware that linkage will often produce false positives and false negatives; this includes deterministic linkage.

These errors can come in from various sources, including:

  • Errors at point of data collection.

  • Missingness in the data.

  • Significant overlap in data for a pair of records, making them difficult to separate.

  • Selection of linkage rules and model training.

 

Why are there no probabilities provided with the probabilistic linkages?

The probabilistic methods used are Bayesian and depend heavily on the priors produced during model training, yielding a log-likelihood-based match probability. This can often be misleading if the user is not familiar with all of the prior assumptions that go into the model. For example, a match probability of 0.9 might seem very strong at first glance, but in reality cannot be properly interpreted without that information about the model and the priors. Similarly, we may have two edges which score the same but really encode different meanings. In addition, the natural log underlying the metric can be difficult to separate out at the extremes of the scale; for example, is the difference between 0.9→0.92 the same as 0.9→0.95?
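
To make the dependence on priors concrete, the toy calculation below (not the actual linkage model) shows how the same evidence score, expressed here as a log2 Bayes factor or "match weight", gives very different posterior match probabilities under different prior assumptions about how likely a candidate pair is to be a true match:

# Toy illustration only: same evidence, different priors, different "probability".

def posterior_match_probability(prior, log2_bayes_factor):
    bayes_factor = 2 ** log2_bayes_factor
    posterior_odds = bayes_factor * prior / (1 - prior)
    return posterior_odds / (1 + posterior_odds)

match_weight = 10  # evidence ratio of 2**10 = 1024 in favour of a match
print(round(posterior_match_probability(1e-2, match_weight), 3))  # ~0.912 with a 1-in-100 prior
print(round(posterior_match_probability(1e-4, match_weight), 3))  # ~0.093 with a 1-in-10,000 prior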

To counter this, and still provide users with some ability to refine the linkage even further, we may choose to provide match scores in future ALF2 outputs. Although these still require knowledge of how the linkage was performed to interpret properly, they avoid the trap of looking like a percentage.

In a similar manner, assigning the classic "ALF_STS_CD" to a produced edge is tricky. Status codes 1 and 2 refer to NHS Number (not always present in the dataset, and not a perfect identifier), code 4 can introduce issues with multi-birth families or those from certain demographic backgrounds, and codes 35, 39, and 99 rely on a probability score which may be subject to the prior bias mentioned above.