Recovering from Privacy-Preserving Masking with Large Language Models
AI-generated Key Points
- Model adaptation is crucial for handling the discrepancy between proxy training data and actual user data.
- Storing user data raises privacy and security concerns.
- Recent research explores replacing identifying information with generic markers to address privacy concerns.
- The authors propose using large language models (LLMs) to suggest substitutes for masked tokens to preserve privacy while maintaining model effectiveness.
- Multiple pre-trained and fine-tuned LLM-based approaches are evaluated on various datasets through empirical studies.
- Models trained on obfuscation corpora achieve comparable performance to models trained on original data without token masking.
- Adapting models on stored user text exposes that text to adversaries, which is the privacy and security risk the paper targets.
- LLM-suggested substitutes for masked tokens address this risk while keeping downstream model performance comparable.
- The proposed approaches are evaluated through empirical studies, showcasing their effectiveness across different datasets.
- This work contributes insights into privacy-preserving techniques for model adaptation in the field of NLP.
Authors: Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli
Abstract: Model adaptation is crucial to handle the discrepancy between proxy training data and the actual user data received. To effectively perform adaptation, users' textual data is typically stored on servers or on their local devices, where downstream natural language processing (NLP) models can be trained directly on such in-domain data. However, this might raise privacy and security concerns due to the extra risk of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has recently been explored. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets to compare these methods. Experimental results show that models trained on the obfuscation corpora achieve performance comparable to that of models trained on the original data without privacy-preserving token masking.
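The mask-then-substitute idea described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the `[MASK]` marker, the function names, and the sensitive-token list are all hypothetical, and a tiny bigram counter built from public text stands in for the LLM that would suggest substitutes in practice.

```python
from collections import Counter, defaultdict

# Assumption: the generic marker string; the summary does not name the one used.
MASK = "[MASK]"


def mask_tokens(tokens, sensitive):
    """Replace every token listed in `sensitive` (lowercased) with the marker."""
    return [MASK if tok.lower() in sensitive else tok for tok in tokens]


def build_bigram_suggester(public_corpus):
    """Toy stand-in for an LLM: propose the most frequent follower of the
    previous token, as counted in a small public corpus."""
    followers = defaultdict(Counter)
    unigrams = Counter()
    for sentence in public_corpus:
        toks = sentence.split()
        unigrams.update(toks)
        for prev, nxt in zip(toks, toks[1:]):
            followers[prev][nxt] += 1
    # Used when the previous token was never seen in the public corpus.
    fallback = unigrams.most_common(1)[0][0]

    def suggest(left_context):
        prev = left_context[-1] if left_context else None
        if prev in followers:
            return followers[prev].most_common(1)[0][0]
        return fallback

    return suggest


def recover(masked_tokens, suggest):
    """Fill each marker left to right, conditioning on the text built so far."""
    out = []
    for tok in masked_tokens:
        out.append(suggest(out) if tok == MASK else tok)
    return out


if __name__ == "__main__":
    public_corpus = ["call Alex tomorrow", "call Alex tonight", "meet Sam in Paris"]
    suggest = build_bigram_suggester(public_corpus)

    user_tokens = "call Maria in Berlin".split()
    masked = mask_tokens(user_tokens, sensitive={"maria", "berlin"})
    print(masked)
    print(recover(masked, suggest))
```

The recovered text contains none of the original identifying tokens, yet it is still fluent enough to serve as training data for a downstream language model, which is the property the paper evaluates. Replacing `suggest` with a call to a pre-trained or fine-tuned LLM would bring the sketch closer to the approaches the abstract describes.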