# Singularity flaw in double-hashing scheme for privacy-preserving record linkage

Banner photo: An artist’s impression of Cygnus X-1 (Photo: NASA/CXC/M.Weiss/Wikimedia Commons)

We, the confidential computing team of CSIRO's Data61, recently released clkhash, a library for generating CLKs (Cryptographic Longterm Keys, a type of anonymous linking code) which can be used for Privacy-Preserving Record Linkage.

This starts a series of posts about things we learned and issues we came across while working on clkhash. The first post is about a flaw in the commonly used double-hashing scheme.

### What is Privacy-Preserving Record Linkage (PPRL)?

Record linkage is the process of identifying records that refer to the same entity across multiple data sources. If these entities are persons, then the records used for linkage contain personally identifiable information (PII); examples of PII include name, address, email and phone number. Such information must be treated as highly sensitive, which makes it hard or impossible to simply share between organisations for linkage purposes.

In these cases, record linkage has to be carried out in a privacy-preserving way. That is, the PII stays with the corresponding data holder and cannot be inferred by any other party participating in the linkage.

Schnell, Bachtler and Reiher proposed in 2009 the usage of Bloom filters for privacy-preserving probabilistic record linkage. A Bloom filter is a simple space-efficient randomised data structure for representing a set in order to support membership queries. The basic idea is to first partition the PII strings into $$n$$-grams, and then map those $$n$$-grams into a Bloom filter in a privacy-preserving way. The Bloom filter can then be used to compute the similarity to other Bloom filters.

The set of $$n$$-grams is produced by breaking the string records into substrings of length $$n$$. For example, the 2-grams for the name 'SMITH' are ' S', 'SM', 'MI', 'IT', 'TH', 'H ', whereas the 2-grams for the name 'SMYTH' are ' S', 'SM', 'MY', 'YT', 'TH', 'H '. A Bloom filter is a bit array of length $$l$$, initialised to $$0$$. Furthermore, $$k$$ independent hash functions $$h_1, \ldots, h_k$$ are defined, each mapping $$n$$-grams to values between $$0$$ and $$l-1$$. In order to store a set of $$n$$-grams in a Bloom filter, each element $$x_j$$ of the set is hashed using the $$k$$ hash functions, and all bits in the Bloom filter having indices $$h_i(x_j)$$, $$i \in \{1,\ldots, k\}$$, are set to $$1$$.
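The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the clkhash implementation: the $$k$$ independent hash functions are simulated here by salting SHA-1 with the function index, whereas the actual scheme discussed below uses keyed HMACs.

```python
import hashlib

def ngrams(s, n=2):
    """Split a space-padded string into its n-grams, e.g. 'SMITH' -> [' S', 'SM', ...]."""
    padded = " " + s + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def bloom_filter(grams, l=1024, k=30):
    """Encode a list of n-grams into a length-l Bloom filter with k hash functions.
    Each 'independent' hash function is a stand-in: SHA-1 salted with the index i."""
    bits = [0] * l
    for gram in grams:
        for i in range(k):
            digest = hashlib.sha1(f"{i}:{gram}".encode()).digest()
            bits[int.from_bytes(digest, "big") % l] = 1
    return bits

bf_smith = bloom_filter(ngrams("SMITH"))
bf_smyth = bloom_filter(ngrams("SMYTH"))
```

Since 'SMITH' and 'SMYTH' share four of their six bi-grams, the two filters overlap in most, but not all, of their set bits.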

### The double hashing scheme

Instead of using $$k$$ different independent hash functions, Schnell et al. proposed to use a double hashing scheme as presented by Kirsch and Mitzenmacher. They show that only two independent hash functions are necessary to implement a Bloom filter with $$k$$ hash functions without any increase in the asymptotic false positive probability. The $$k$$ different hash functions $$g_j$$ can be simulated with the help of just two hash functions $$h_1$$ and $$h_2$$ by computing

$$g_j(x_i) = h_1(x_i) + jh_2(x_i) \mod l, \text{with } j \in \{1, \ldots, k\}$$
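This simulation is straightforward to sketch. The hash choices below (plain SHA-1 and MD5 over the raw string) are illustrative stand-ins; the scheme discussed later uses keyed HMACs for $$h_1$$ and $$h_2$$.

```python
import hashlib

def h1(x):
    # Stand-in for the first base hash function.
    return int.from_bytes(hashlib.sha1(x.encode()).digest(), "big")

def h2(x):
    # Stand-in for the second base hash function.
    return int.from_bytes(hashlib.md5(x.encode()).digest(), "big")

def double_hash_indices(x, l=1024, k=30):
    """Simulate k hash functions g_j from just h1 and h2:
    g_j(x) = h1(x) + j * h2(x) mod l, for j in 1..k."""
    return [(h1(x) + j * h2(x)) % l for j in range(1, k + 1)]

indices = double_hash_indices("SM")
```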

#### Note

In the same paper, Kirsch and Mitzenmacher also proposed the so-called Enhanced Double Hashing Scheme:

$$g_j(x_i) = h_1(x_i) + jh_2(x_i) + f(j) \mod l, \text{with } j \in \{1, \ldots, k\},$$

with an arbitrary function $$f$$. Although this scheme does not suffer from the same singularity issue as the double hashing scheme, there might be other issues which reduce applicability to privacy-preserving record linkage. For the rest of this post, we will only consider the double hashing scheme, as the enhanced scheme is not adopted in the PPRL literature.
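A small sketch with made-up integer hash values (nothing here comes from clkhash, and $$f(j) = j^2$$ is just one arbitrary choice of $$f$$) shows why the extra term removes the collapse described in the next section:

```python
def enhanced_g(h1, h2, j, l=1024):
    """Enhanced double hashing: g_j = h1 + j*h2 + f(j) mod l,
    with f(j) = j*j as one arbitrary choice of f."""
    return (h1 + j * h2 + j * j) % l

# Even when h2 = 0 mod l (the singular case for plain double hashing),
# the f(j) term still spreads the k indices over distinct positions.
indices = [enhanced_g(h1=42, h2=0, j=j) for j in range(1, 31)]
```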

### The Singularity

In the double hashing scheme, the hash values for the $$k$$ hash functions $$g_j$$ are derived from a linear combination of the hash values of two hash functions $$h_1$$ and $$h_2$$.

A hashing scheme suffers from a singularity if $$g_j(x_i) = g_m(x_i)$$ for all $$j,m \in \{1, \ldots, k\}$$.

The double hashing scheme from above has a singularity at $$n$$-gram $$x_i$$ if $$h_2(x_i) = 0 \mod l$$, because then $$g_j(x_i) = h_1(x_i) \mod l,$$ irrespective of the value of $$j$$.

Or in words: the $$n$$-gram $$x_i$$ will only be encoded by one bit in the Bloom filter, irrespective of the value of $$k$$.
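The collapse is easy to see with plain numbers (the values below are illustrative, not real hash outputs):

```python
def g(h1, h2, j, l=1024):
    """Double hashing: g_j = h1 + j*h2 mod l."""
    return (h1 + j * h2) % l

k, l = 30, 1024

# Normal case: h2 not congruent to 0 mod l -> k distinct indices.
normal = {g(h1=317, h2=5, j=j) for j in range(1, k + 1)}

# Singular case: h2 = 0 mod l (here h2 = 2*l) -> every g_j collapses to h1 mod l.
singular = {g(h1=317, h2=2048, j=j) for j in range(1, k + 1)}
```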

### Probability of a Singularity

In the case of the scheme proposed by Schnell et al., the hash functions $$h_1$$ and $$h_2$$ are realised by HMAC-SHA1 and HMAC-MD5, respectively. As shown by Fouque et al., HMAC can be seen as a pseudorandom function, such that the output of an HMAC is indistinguishable from a random bit string. Thus, when using HMAC, the output of $$h_2$$ is uniformly distributed, and the probability for any $$x_i$$ to hit a singularity is $$P[h_2(x_i) \bmod l = 0] = \frac{1}{l}.$$ The expected number of $$n$$-grams that hit a singularity is then just the total number of different $$n$$-grams divided by $$l$$.
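One can search for singular bi-grams directly. The sketch below scans all bi-grams over space plus the uppercase letters (a simplification of the real $$n$$-gram universe) for values where HMAC-MD5, standing in for $$h_2$$, vanishes modulo $$l$$; the key is a hypothetical placeholder, so which bi-grams (if any) hit a singularity depends on it.

```python
import hashlib
import hmac
import string

l = 1024
key = b"some secret key"  # hypothetical key, for illustration only

# All bi-grams over space + uppercase ASCII letters: 27 * 27 = 729 candidates.
alphabet = " " + string.ascii_uppercase
bigrams = [a + b for a in alphabet for b in alphabet]

# A bi-gram hits a singularity when h2(bigram) = 0 mod l.
singular = [
    bg for bg in bigrams
    if int.from_bytes(hmac.new(key, bg.encode(), hashlib.md5).digest(), "big") % l == 0
]

# With 729 bi-grams and l = 1024, we expect 729/1024, i.e. roughly 0.7,
# singular bi-grams on average over random keys.
print(len(singular))
```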

### Example

In order to illustrate the problem, we ran an experiment with generated PII data. We used the febrl tool to generate 100,000 entity records with 3 features, and a corresponding second set of these records with the default perturbations as defined in febrl.

We then generated the corresponding CLKs for both datasets twice, using the parameters $$l = 1024, k=30$$: once with the double hashing scheme, and once with the non-singular double hashing scheme.

For this particular choice of data and parameters, we have exactly one singularity. To visualise this, we plot the histograms of the bits for the different indices of the Bloom filters. The following figure shows the difference between the double hashing scheme (orange), and a modified double hashing scheme (blue) which does not suffer from singularities.

We can see that the values for the two different schemes are almost identical, except for 29 positions. These differences are caused by the singularity of one n-gram. Whereas the double hashing scheme encodes this n-gram in one bit, the non-singular scheme encodes it with $$k$$ different bits.

Subsequently, this also affects the accuracy of the matching. We used the anonlink library to compute the linkage between the CLKs and their perturbed counterparts. Using the standard double hashing scheme, we achieved an accuracy of 0.75508, whereas with the singularity free scheme, the score improved to 0.75713.

### What are the Consequences?

In the following we will explore the effects of a singularity on matching accuracy and the security implications.

#### Matching accuracy

In case of a singularity, the $$n$$-gram gets encoded by just one bit, whereas all other $$n$$-grams are represented by $$k$$ bits in the corresponding Bloom filter. This leads to an undesirable distortion of the individual importance of each $$n$$-gram.

Going back to our toy example from above, consider the Bloom filters $$A$$ and $$B$$ for 'SMITH' and 'SMYTH'. The similarity of these two Bloom filters can be computed with the Sørensen–Dice coefficient: $$\text{sim}_{AB} = \frac{2|A \cap B|}{|A|+|B|} = \frac{2 \cdot 8}{10+11} \approx 0.762$$

Now, let's say there is a singularity for the encoding of the n-gram ' S', then the similarity changes to:

$$\text{sim}_{AB}^{\text{singular}} = \frac{2 \cdot 7}{9+10} \approx 0.737$$

This effect becomes more pronounced as $$k$$ increases.
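The arithmetic above can be checked directly from the set cardinalities. Note that the counts imply $$k = 2$$ for this toy example (6 bi-grams yielding 10 and 11 set bits), which is an assumption on our part since the post does not state $$k$$ here:

```python
def dice(n_common, size_a, size_b):
    """Sørensen-Dice coefficient computed from set cardinalities."""
    return 2 * n_common / (size_a + size_b)

# Non-singular case from the example: |A ∩ B| = 8, |A| = 10, |B| = 11.
sim = dice(8, 10, 11)

# With a singularity for ' S', the shared bi-gram contributes only one bit,
# so each filter and the intersection lose one bit each.
sim_singular = dice(7, 9, 10)
```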

#### Security

The difference between setting just one or $$k$$ bits will most likely have a detectable influence on the generated Bloom filter.

The set of all realistically possible $$n$$-grams for an identifier is often not too big. Let's assume we want to encode surnames with bi-grams, and the names are all composed of ASCII letters in title case. Then there are at most about 1400 different bi-grams. Thus, in this and similar cases, one can expect that only very few $$n$$-grams will hit a singularity.

Kroll and Steinmetzer have shown that, knowing the bit indices corresponding to the different bi-grams, combined with knowledge of the underlying distribution, one can reconstruct a large proportion of the PII.

### Possible remedies

A singularity for $$n$$-gram $$x_i$$ happens iff $$h_2(x_i) = 0 \bmod l$$. One obvious remedy is to make sure that this does not happen by changing the image of the hash function $$h_2$$ to $$\{1, \ldots, l-1\}$$. Let $$\hat{h}_2(x_i) = (h_2(x_i) \bmod (l-1)) + 1$$ then $$\hat{g}_j(x_i) = h_1(x_i) + j\hat{h}_2(x_i) \mod l, \text{with } j \in \{1, \ldots, k\}$$ describes a non-singular double hashing scheme.
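A sketch of this remedy with made-up integer values (not clkhash's exact code) shows that the previously singular input now spreads over $$k$$ distinct indices:

```python
def g_hat(h1, h2, j, l=1024):
    """Non-singular double hashing: remap h2 into {1, ..., l-1} first,
    so h2_hat can never be 0 mod l."""
    h2_hat = (h2 % (l - 1)) + 1
    return (h1 + j * h2_hat) % l

k, l = 30, 1024

# h2 = 2048 = 0 mod l would collapse plain double hashing to the single
# index h1 mod l; with the remapping we get k distinct indices instead.
indices = {g_hat(h1=317, h2=2048, j=j) for j in range(1, k + 1)}
```

Since $$\hat{h}_2(x_i) \ne 0 \bmod l$$, the differences $$\hat{g}_j(x_i) - \hat{g}_1(x_i) = (j-1)\hat{h}_2(x_i) \bmod l$$ cannot all vanish, so no input can collapse to a single bit.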

### Conclusions

Both the effects on security and on accuracy are not yet fully explored. However, there are indications of negative effects, which leads us to conclude that we should err on the side of caution and either show that these concerns are unfounded, or modify or replace the double hashing scheme.

### Try it out yourself

As stated in the introduction, we, the confidential computing team of CSIRO's Data61, recently released clkhash, a library for generating CLKs. It contains implementations of both the double hashing scheme and the singularity-free double hashing scheme.

You can install clkhash with

```
pip install clkhash --pre
```

(the `--pre` is necessary, as the 0.11 version is still a release candidate). After installing the clkhash library you should have a `clkutil` program in your path. Alternatively you can use `python -m clkhash.cli`.

This command line tool can be used to process PII data into Cryptographic Longterm Keys. The tool also has an option for generating fake PII data.

```
$ clkutil generate 1000 fake-pii-out.csv
```

The data looks like this:

```
$ head -n 4 fake-pii-out.csv
INDEX,NAME freetext,DOB YYYY/MM/DD,GENDER M or F
0,Libby Slemmer,1933/09/13,F
1,Garold Staten,1928/11/23,M
2,Yaritza Edman,1972/11/30,F
```

A schema is required to generate CLKs from this data. You can retrieve the default schema with

```
$ clkutil generate-default-schema fake-pii-schema.json
```

The default schema looks like this:

```json
{
  "version": 1,
  "clkConfig": {
    "l": 1024,
    "k": 20,
    "hash": {
      "type": "doubleHash"
    },
    ...
  },
  "features": [
    ...,
    {
      "identifier": "NAME freetext",
      "format": {
        "type": "string",
        "encoding": "utf-8",
        "case": "mixed",
        "minLength": 3
      },
      "hashing": {
        "ngram": 2,
        "weight": 0.5
      }
    },
    ...
  ]
}
```

Apart from data and a corresponding schema, we also need a pair of secret keys. When we use the keys rainbow and treasure with the default schema, we will have a singularity for the bi-gram 'nc' in the name column. We generate the CLKs by calling:

```
$ clkutil hash fake-pii-out.csv rainbow treasure fake-pii-schema.json fake-clk.json
```

In order to use the singularity free hashing scheme, we modify the "hash" section of the schema.

```json
{
  ...
  "clkConfig": {
    ...
    "hash": {
      "type": "doubleHash",
      "prevent_singularity": true
    }
  }
}
```

Then we generate the singularity-free CLKs:

```
$ clkutil hash fake-pii-out.csv rainbow treasure fake-pii-schema.json fake-clk-sf.json
```

Now let's evaluate the results. First, we have to figure out which rows in the data are affected by the singularity.

```
$ grep nc fake-pii-out.csv
79,Lawrance Sobol,1952/12/07,M
91,Chauncy Dincher,1939/06/03,M
130,Joseph Cofrancesco,1923/01/01,F
...
```

The two CLKs for row 79 are:

```
'Tj8KfwwqA8iIRaMoLObMH4AK8BBb1FaxoLhQCtIDqpTgCxyoAiDWIkmwiljyNpRYOwgDLuiRsOKDwjiPQGj8UAiNqvKQnriQ3viQForIe9qJKAIAJaInj2qCWQACczAx+5ydsxH8YmqwMIOBpiAoOlJQ7HQaDAIoGLhYMbhyyBo='

'Tj8KfwwqC8iIRaMoLObMH4AK8BBb1FaxoLhQC9IDqpTgCxyoAiDWIkmwiljyNpRYOygDLuiRsOKDwjiPQGj8UAiNqvKQnr2Q3viQForIe9qJKAIAJaInj2qCWQKiczAx+5ydsxH8YmqwMIOBpiAoOlJU7HQaDAIoGLhYMbhyyBo='
```

The first difference is at the 9th character: 'A' for the standard double hashing scheme vs. 'C' for the singularity-free scheme.
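The differences can also be located programmatically by comparing the two base64 strings from above (each decodes to the 128 bytes of a 1024-bit Bloom filter):

```python
import base64

# The two CLKs for row 79, copied from above.
clk_double = 'Tj8KfwwqA8iIRaMoLObMH4AK8BBb1FaxoLhQCtIDqpTgCxyoAiDWIkmwiljyNpRYOwgDLuiRsOKDwjiPQGj8UAiNqvKQnriQ3viQForIe9qJKAIAJaInj2qCWQACczAx+5ydsxH8YmqwMIOBpiAoOlJQ7HQaDAIoGLhYMbhyyBo='
clk_nonsingular = 'Tj8KfwwqC8iIRaMoLObMH4AK8BBb1FaxoLhQC9IDqpTgCxyoAiDWIkmwiljyNpRYOygDLuiRsOKDwjiPQGj8UAiNqvKQnr2Q3viQForIe9qJKAIAJaInj2qCWQKiczAx+5ydsxH8YmqwMIOBpiAoOlJU7HQaDAIoGLhYMbhyyBo='

# Positions where the base64 strings differ (0-based).
diff = [i for i, (a, b) in enumerate(zip(clk_double, clk_nonsingular)) if a != b]
print(diff[0])  # -> 8, i.e. the 9th character

# Count differing bits in the decoded Bloom filters.
xor = bytes(a ^ b for a, b in zip(base64.b64decode(clk_double),
                                  base64.b64decode(clk_nonsingular)))
bit_diff = sum(bin(byte).count("1") for byte in xor)
```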

### Get in touch

Join the discussion on Reddit or send an email to confidential-computing@csiro.au.