Page:On the Robustness of Topics API to a Re-Identification Attack.pdf/2

Proceedings on Privacy Enhancing Technologies YYYY(X)

Jha et al.

researchers must verify the robustness of such an approach as done by Mozilla and Google [8, 25].

In this paper, we provide an independent evaluation of the Topics API. Using a data-driven approach, we build realistic population models that we use to quantify the feasibility of a re-identification attack. We assume that the attacker i) exploits the Topics API to reconstruct the victim’s profile by accumulating her/his topics over epochs, and ii) tries to re-identify the victim in the audience of a second website, as studied by Epasto et al. [8]. If successful, such an attack would undermine the abandonment of third-party cookies, allowing platforms to still track users across websites. We approach the problem by mapping it to the probability that a user is 𝑘-anonymous among the website audience, i.e., that there are 𝑘 − 1 other users with the same reconstructed profile. Generalising the attack sketched by Thomson [25], we propose a robust denoising algorithm that aims to filter out the random topics introduced by the Topics API.
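As a minimal sketch of the 𝑘-anonymity criterion described above, the following Python snippet counts, for each user, how many users in an audience share the same reconstructed profile. The function name `anonymity_levels`, the representation of a profile as a set of topic identifiers, and the toy audience are illustrative assumptions, not part of the released code.

```python
from collections import Counter


def anonymity_levels(profiles):
    """Map each user to the number of users sharing the same profile.

    `profiles` maps a user id to an iterable of topic ids (hypothetical
    representation). A user with level k is k-anonymous: k - 1 other
    users have the same reconstructed profile.
    """
    # Freeze each profile so that it is hashable and order-independent.
    frozen = {user: frozenset(topics) for user, topics in profiles.items()}
    sizes = Counter(frozen.values())
    return {user: sizes[profile] for user, profile in frozen.items()}


# Toy audience of four users (made-up topic ids).
audience = {
    "u1": {12, 57, 103},
    "u2": {12, 57, 103},
    "u3": {12, 200},
    "u4": {7},
}
levels = anonymity_levels(audience)
# Users whose profile is unique in the audience (level 1).
unique_users = sorted(u for u, k in levels.items() if k == 1)
```

In this toy audience, `u3` and `u4` have a unique profile, so they are the users exposed to re-identification if the same profile is observed on a second website.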

We contribute three main results:

We show that the introduction of the Topics API algorithm mitigates but cannot prevent re-identification. Depending on the website’s audience size (e.g., 100,000 visitors) and population heterogeneity, a sizeable fraction (e.g., 40%) of users would still let the attacker reconstruct a denoised and unique profile that allows re-identification if matched on a second population.
We demonstrate that the replacement of actual topics with random ones is key to limiting the attack. Yet, the denoising algorithm is very efficient in removing random topics from the reconstructed profiles the attacker builds.
We show that in practice the probability of correctly re-identifying a user in a pool of 1,000 can reach 15-17%, with false positives being negligible (less than 0.2%). However, it is also important to consider that such probabilities are a function of the attacker’s observation period and that many weeks may be needed to carry out the attack in practice.

Our study highlights the need for continued research and development of privacy-preserving advertising techniques to ensure that user privacy is respected in the digital age. To foster research in this field, we release the code and data to replicate and extend our experiments.^[1]

The remainder of the paper is organized as follows: Section 2 formalizes the operation of the Topics API and the threat model. In Sections 3 and 4, we describe the dataset and the models we use to generate synthetic populations for our simulations, respectively. Section 5 illustrates the results in terms of 𝑘-anonymity, while Section 6 explores the effectiveness of a re-identification attack. Section 7 summarizes related work, and, finally, Section 8 discusses our findings and concludes the paper.

Table 1: Main terminology to model Topics API algorithm and threat model.


Symbol	Definition
$n_{topic}$	Number of topics in the taxonomy
$E$	Number of past epochs included in the profile
$p$	Probability that a random topic replaces a real topic
$N$	Epochs of observation by the attacker
$U$	User population set
$\lambda _{u,t}$	Rate of visit by user $u$ to topic $t$
${\mathcal {B}}_{u,e}$	Bag of visited websites by user $u$ at epoch $e$
${\mathcal {T}}_{u,e}$	Bag of visited topics by user $u$ at epoch $e$
${\mathcal {P}}_{u,e}$	Profile for the user $u$ at epoch $e$
${\mathcal {P}}_{u,e,w}$	Exposed Profile to website $w$ for user $u$ at epoch $e$
${\mathcal {G}}_{u,N,w}$	Global Reconstructed Profile by $w$ after $N$ epochs
${\mathcal {R}}_{u,N,w}$	Denoised Reconstructed Profile by $w$ after $N$ epochs

2 The Topics API and the Threat Model

In this section, we describe how the Topics API operates for creating a profile from the user’s browsing history. Then, we describe our threat model – i.e., the possibility that an attacker links two profiles referring to the same user as they are uniquely identifiable within a given population.

We consider a browser that a user employs to navigate the Internet.^[2] We assume time is divided into epochs of duration $\Delta T$ (one week in the current proposed Topics API operation). During each epoch $e$, the browser collects and counts the number of visits to each website and forms a bag of websites ${\mathcal {B}}_{u,e}$ for the user $u$. It keeps track only of the website hostnames the user intentionally visited, e.g., by typing its URL, or by clicking on a link in a web page or other applications. Formally, given a user $u$ and the epoch $e$, let ${\mathcal {B}}_{u,e}=\{(w_{1},f_{1,u,e}),(w_{2},f_{2,u,e}),\ldots ,(w_{n},f_{n,u,e})\}$, where $\{w_{i}\}$ represent the visited websites and $f_{i,u,e}$ the number of times $u$ visited $w_{i}$ during epoch $e$.
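To make the definition concrete, the sketch below builds a bag of websites from the list of intentional visits of one user within one epoch. The function name `bag_of_websites`, the URL-based input, and the example hostnames are assumptions made for illustration; the actual bookkeeping performed by the browser is not specified here.

```python
from collections import Counter
from urllib.parse import urlsplit


def bag_of_websites(visited_urls):
    """Return {hostname: visit count} for one user and one epoch."""
    bag = Counter()
    for url in visited_urls:
        hostname = urlsplit(url).hostname
        if hostname:  # skip entries that carry no hostname
            bag[hostname] += 1
    return dict(bag)


# Hypothetical visits of a single user during one epoch.
visits = [
    "https://news.example.com/politics",
    "https://news.example.com/sport",
    "https://shop.example.org/cart",
]
bag = bag_of_websites(visits)
```

Each key plays the role of a website $w_{i}$ and each value that of the corresponding count $f_{i,u,e}$; only the hostname is retained, not the full URL.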

2.1 The Topics API profile construction

The Topics API algorithm operates in the browser and processes the history of ${\mathcal {B}}_{u,e}$ over the past $E$ epochs to create a corresponding Exposed Profile ${\mathcal {P}}_{u,e,w}$ for the user $u$, epoch $e$ and each specific website $w$ the user visits during the current epoch. In fact, the browser builds a separate Exposed Profile for each visited website $w$ to mitigate re-identification attacks. We base the following description on the public documentation of the Topics API available online.^[3] The operation of the Topics API has the following steps.

Step 1 - From websites to topics: For each of the websites $w_{i}\in {\mathcal {B}}_{u,e}$, the browser extracts a corresponding topic $t_{i}$. To this end, the browser uses a Machine Learning (ML) classifier model that returns the topic of a website given the characters and strings that compose the website hostname. At this step, each browsing history ${\mathcal {B}}_{u,e}$ is transformed into a topic history ${\mathcal {T}}_{u,e}=\{(t_{1},f'_{1,u,e}),(t_{2},f'_{2,u,e}),\ldots ,(t_{m},f'_{m,u,e})\}$, where $t_{i}$ represents the topic the model outputs, and $f'_{i,u,e}$ counts its total occurrences. Each website is mapped to a topic, and the original frequencies $f_{i,u,e}$ are summed by topic into $f'_{j,u,e}$. There are $n_{topic}$ topics, which form a taxonomy of possible interests the users have. Such