Beyond the Request: Harnessing HTTP Response Headers for Cross-Browser Web Tracker Classification in an Imbalanced Setting

AI-generated keywords: Device fingerprinting, Web trackers, Graph analysis, Machine learning classifiers, Cross-browser performance

AI-generated Key Points

Bahrami et al. focused on device fingerprinting to detect web trackers by combining graph analysis with supervised and unsupervised learning.
Their method used historical JavaScript file data from 2010 to 2019 and achieved accuracies of up to 89.13%.
Kalavri et al. achieved over 97% accuracy through neighborhood analysis and label propagation on a bipartite graph representing connections between first-party websites and third-party services.
Castell-Uroz et al. proposed a tripartite network graph for identifying third-party tracking resources within first-party websites based on hash value popularity and dirt level correlation.
Metwalley et al. developed an unsupervised algorithm that analyzed URL queries and HTTP request headers for tracker identification among Alexa-ranked websites.
The study evaluates the cross-browser performance of machine learning classifiers trained on HTTP response headers for web tracker detection, focusing on delineating tracking entities rather than tracking activity.
Ten supervised models were trained on Chrome data and tested across Chrome, Firefox, and Brave using data obtained through the T.EX traffic monitoring browser extension. Performance was high for Chrome and Firefox but subpar for Brave, potentially due to its distinct data distribution and feature set.
While promising, real-world application testing of these classifiers is still pending, highlighting the need for further exploration into distinguishing tracker types and broader label sources in future studies.


Authors: Wolf Rieder, Philip Raschke, Thomas Cory

Proceedings on Privacy Enhancing Technologies (PoPETs) 1 (2025) 100-117

arXiv: 2402.01240v3 (cs.CR)

License: CC BY 4.0

Abstract: The World Wide Web's connectivity is greatly attributed to the HTTP protocol, with HTTP messages offering informative header fields that appeal to disciplines like web security and privacy, especially concerning web tracking. Despite existing research employing HTTP request messages to identify web trackers, HTTP response headers are often overlooked. This study endeavors to design effective machine learning classifiers for web tracker detection using binarized HTTP response headers. Data from the Chrome, Firefox, and Brave browsers, obtained through the traffic monitoring browser extension T.EX, serves as our dataset. Ten supervised models were trained on Chrome data and tested across all browsers, including a Chrome dataset from a year later. The results demonstrated high accuracy, F1-score, precision, recall, and minimal log-loss error for Chrome and Firefox, but subpar performance on Brave, potentially due to its distinct data distribution and feature set. The research suggests that these classifiers are viable for web tracker detection. However, real-world application testing remains pending, and the distinction between tracker types and broader label sources could be explored in future studies.
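The abstract mentions "binarized HTTP response headers" without spelling out the encoding. The following is a minimal sketch, assuming that binarization means marking each header name as present (1) or absent (0) against a vocabulary built from the training browser. The function names, sample headers, and the `unseen_fraction` helper are hypothetical illustrations, not the paper's code.

```python
from typing import Dict, List


def build_vocabulary(responses: List[Dict[str, str]]) -> List[str]:
    """Collect the sorted set of lower-cased header names seen in training data."""
    names = set()
    for headers in responses:
        names.update(name.lower() for name in headers)
    return sorted(names)


def binarize(headers: Dict[str, str], vocabulary: List[str]) -> List[int]:
    """Return 1 where a vocabulary header is present in the response, else 0."""
    present = {name.lower() for name in headers}
    return [1 if name in present else 0 for name in vocabulary]


def unseen_fraction(headers: Dict[str, str], vocabulary: List[str]) -> float:
    """Share of a response's header names that the training vocabulary lacks."""
    present = {name.lower() for name in headers}
    if not present:
        return 0.0
    return len(present - set(vocabulary)) / len(present)


# Hypothetical responses standing in for the training browser's traffic.
chrome_train = [
    {"Content-Type": "text/html", "Cache-Control": "max-age=3600"},
    {"Content-Type": "image/gif", "Set-Cookie": "id=abc", "P3P": "CP=NOI"},
]
vocab = build_vocabulary(chrome_train)

# Hypothetical response captured in a different browser.
other = {"content-type": "image/gif", "set-cookie": "id=xyz", "X-Custom": "1"}
vector = binarize(other, vocab)
print(vocab)
print(vector)
print(unseen_fraction(other, vocab))
```

Headers that never appeared in the training vocabulary are silently dropped by `binarize`. A high `unseen_fraction` on another browser's traffic is one plausible way a "distinct feature set" could degrade a classifier trained elsewhere.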

Submitted to arXiv on 02 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.01240v3


In recent research, Bahrami et al. focused on device fingerprinting to detect web trackers, combining graph analysis with supervised and unsupervised learning. Their method used historical JavaScript file data from 2010 to 2019 and achieved accuracies of up to 89.13% by incorporating temporal aspects and Abstract Syntax Tree-based keyword extraction. Kalavri et al. also took a graph-based approach and achieved over 97% accuracy through neighborhood analysis and label propagation on a bipartite graph representing connections between first-party websites and third-party services. Castell-Uroz et al. proposed a tripartite network graph for identifying third-party tracking resources within first-party websites and detected new trackers based on hash value popularity and dirt level correlation. Metwalley et al., by contrast, developed an unsupervised algorithm that analyzed URL queries and HTTP request headers and identified new trackers among Alexa-ranked websites.

This study addresses a gap in that work by evaluating the cross-browser performance of machine learning classifiers trained on HTTP response headers for web tracker detection. The focus is on delineating tracking entities rather than tracking activity, and the definition of a tracker depends on the ground truth labeling in each dataset. Ten supervised models were trained on Chrome data and tested across Chrome, Firefox, and Brave using data obtained through the T.EX traffic monitoring browser extension. The classifiers reached high accuracy, F1-score, precision, and recall, with minimal log-loss error, for Chrome and Firefox, but performed poorly on Brave, potentially due to its distinct data distribution and feature set.

The classifiers show promise for web tracker detection, but real-world application testing is still pending. Overall, the study advances machine learning approaches to web tracker detection and points to two directions for future work: distinguishing tracker types and drawing on broader label sources.
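The metrics named in the summary behave differently when trackers are a minority class. The sketch below computes them from scratch on invented labels and probabilities (not data from the paper) to show how accuracy can stay high while recall is low. The function name `binary_metrics` and the 0.5 threshold are assumptions for illustration.

```python
import math
from typing import Dict, List


def binary_metrics(y_true: List[int], y_prob: List[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and log-loss for a binary classifier."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Clip probabilities so that log(0) can never occur.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                    for t, p in zip(y_true, clipped)) / len(y_true)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }


# Invented, imbalanced example: 8 non-trackers (0) and 2 trackers (1).
y_true = [0] * 8 + [1] * 2
y_prob = [0.1] * 8 + [0.9, 0.4]  # the second tracker falls below the threshold
metrics = binary_metrics(y_true, y_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case one of the two trackers is missed, so recall is 0.5 even though 9 of 10 predictions are correct. This is why F1, precision, recall, and log-loss are reported alongside accuracy in an imbalanced setting.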


Summary

Researchers studied different ways to find and stop web trackers on the internet. They used special methods like graph analysis and learning techniques to detect these trackers. Some methods looked at data from JavaScript files over many years and achieved high accuracy in finding trackers. Others focused on analyzing connections between websites and services to identify tracking resources. One study even developed a new algorithm that looked at website queries and headers to spot trackers. Overall, the goal was to improve how we can identify and understand web tracking.

Definitions

- Device fingerprinting: A method of identifying devices based on unique characteristics they possess.
- Graph analysis: The study of relationships between objects represented as nodes connected by edges in a graph structure.
- Supervised learning: A type of machine learning where models are trained using labeled data.
- Unsupervised learning: A type of machine learning where models learn patterns from unlabeled data.
- Bipartite graph: A graph with two distinct sets of vertices such that no edge connects vertices within the same set.
- Hash value: A fixed-size string generated from input data using a mathematical function for indexing or comparing purposes.
- Dirt level correlation: The relationship between the presence of undesirable elements (dirt) in a system or dataset.
- HTTP request headers: Information sent by a browser when requesting content from a server, containing details about the request and client capabilities.
- Machine learning classifiers: Algorithms that learn patterns from data to make predictions or decisions.
- Tracker identification: The process of recognizing and categorizing

In today's digital age, online privacy has become a major concern for internet users. With the rise of web tracking and data collection by third-party entities, effective methods to detect and protect against these practices are essential.

In recent research, Bahrami et al. focused on device fingerprinting to detect web trackers by combining graph analysis with supervised and unsupervised learning. Their study used historical JavaScript file data from 2010 to 2019 and achieved accuracies of up to 89.13%. This was made possible by incorporating temporal aspects and Abstract Syntax Tree-based keyword extraction into their method. By analyzing the structure of JavaScript files over time, they were able to identify patterns that could be used as fingerprints for detecting web trackers.

Similarly, Kalavri et al. employed a graph-based approach and achieved an even higher accuracy of over 97%. They used neighborhood analysis and label propagation on a bipartite graph representing connections between first-party websites and third-party services. This method allowed them to identify relationships between different entities involved in web tracking, leading to more accurate detection.

Castell-Uroz et al. proposed a tripartite network graph for identifying third-party tracking resources within first-party websites. Their approach looked at the popularity of hash values associated with trackers as well as correlations between "dirt levels" (a measure of how invasive or harmful a tracker may be). Through this method, they detected new trackers that had not been previously identified.

Metwalley et al., on the other hand, developed an unsupervised algorithm that analyzed URL queries and HTTP request headers for tracker identification. Their method identified new trackers among Alexa-ranked websites without relying on pre-labeled data sets.

Building on this research, the present study addresses a gap in the literature by evaluating the cross-browser performance of machine learning classifiers trained on HTTP response headers for web tracker detection. The focus is on delineating tracking entities rather than tracking activity, with the definition of trackers contingent upon ground truth labeling in each dataset. The methodology involved training ten supervised models on Chrome data and testing them across Chrome, Firefox, and Brave using data obtained through the T.EX traffic monitoring browser extension. Results showed high accuracy, F1-score, precision, and recall for both Chrome and Firefox but subpar performance for Brave, potentially due to its distinct data distribution and feature set.

While these classifiers show promise for web tracker detection, real-world application testing is still pending. Testing in a live environment is needed to assess how accurately they detect web trackers.

In conclusion, this study advances machine learning approaches to web tracker detection while highlighting the need for further exploration into distinguishing tracker types and broader label sources. As technology and online privacy concerns continue to evolve, further research in this area will improve our understanding of how web trackers operate and support more robust solutions for protecting user privacy online.
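The article above attributes "neighborhood analysis and label propagation on a bipartite graph" to Kalavri et al. without describing the mechanics. The sketch below shows one generic way the idea can work: a majority vote among already-labeled third parties that share a first-party site. It is an assumption-laden illustration, not the authors' algorithm. All domain names, seed labels, and the function `propagate_labels` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def propagate_labels(edges: List[Tuple[str, str]],
                     seed_labels: Dict[str, str],
                     rounds: int = 5) -> Dict[str, str]:
    """Label third parties by majority vote of labeled co-embedded third parties.

    edges are (first_party_site, third_party_domain) pairs of a bipartite graph.
    """
    sites_of = defaultdict(set)    # third party -> first-party sites embedding it
    parties_on = defaultdict(set)  # first-party site -> third parties it embeds
    for site, third in edges:
        sites_of[third].add(site)
        parties_on[site].add(third)

    labels = dict(seed_labels)
    for _ in range(rounds):
        updates = {}
        for third in sorted(sites_of):
            if third in labels:
                continue
            votes = Counter()
            # Two-hop neighborhood: other third parties on the same sites.
            for site in sites_of[third]:
                for neighbor in parties_on[site]:
                    if neighbor != third and neighbor in labels:
                        votes[labels[neighbor]] += 1
            if votes:
                # Ties resolve to the alphabetically first label.
                updates[third] = max(sorted(votes), key=lambda k: votes[k])
        if not updates:
            break
        labels.update(updates)  # apply all updates of a round at once
    return labels


# Hypothetical graph: first-party sites and the third parties they load.
edges = [
    ("news.example", "ads.example"),
    ("news.example", "pixel.example"),
    ("shop.example", "ads.example"),
    ("shop.example", "pixel.example"),
    ("shop.example", "cdn.example"),
    ("blog.example", "cdn.example"),
    ("blog.example", "fonts.example"),
]
seeds = {"ads.example": "tracker", "cdn.example": "non-tracker"}
result = propagate_labels(edges, seeds)
print(result)
```

Here `pixel.example` collects two "tracker" votes and one "non-tracker" vote from the sites it shares with labeled domains, while `fonts.example` only co-occurs with a labeled non-tracker. A real system would need a more principled weighting scheme and a reliable source of seed labels.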

Created on 17 Nov. 2025


Similar papers summarized with our AI tools

- 55.3%: Machine Learning Based Intrusion Detection Systems for IoT Applications (cs.CR)
- 54.7%: That Escalated Quickly: An ML Framework for Alert Prioritization (cs.CR)
- 54.0%: Preventing the attempts of abusing cheap-hosting Web-servers for monetization… (cs.CR)
- 51.1%: SmartX Intelligent Sec: A Security Framework Based on Machine Learning and eB… (cs.CR)
- 50.7%: In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT (cs.CR)
- 50.3%: Enhancing ML-Based DoS Attack Detection Through Combinatorial Fusion Analysis (cs.CR)
- 50.1%: Survey on the Usage of Machine Learning Techniques for Malware Analysis (cs.CR)

