Publications tagged: #privacy

  • The Impact of Automatic Speech Transcription on Speaker Attribution

    Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often has access only to more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good as, if not better than, attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features that reveal speaker identity.

    Cristina Aggazzotti, Matthew Wiesner, Elizabeth Allyn Smith, Nicholas Andrews

    Transactions of the Association for Computational Linguistics (TACL), 2025

    PDF BibTeX

    #speech #privacy #forensics

  • Content Anonymization for Privacy in Long-form Audio

    Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which poses a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches, which perform contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning (a minimal sketch of such a pipeline appears after this entry). We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. We then show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.

    Cristina Aggazzotti, Ashi Garg, Zexin Cai, Nicholas Andrews

    arXiv preprint arXiv:2510.12780, 2025

    PDF BibTeX

    #speech #privacy #preprint
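
    The paper above describes its defense as contextual rewriting of transcripts inside an ASR-TTS pipeline. Below is a minimal Python sketch of that general shape, not the authors' implementation: the asr, rewrite, and tts functions are hypothetical stubs standing in for real models, and only the turn-by-turn pipeline structure is the point.

    ```python
    # Hypothetical sketch of a content-anonymization pipeline for long-form audio:
    # transcribe each conversational turn, rewrite it to strip speaker-specific
    # style (fillers, idiosyncratic phrasing) while keeping the meaning, then
    # resynthesize the rewritten text in a pseudo-speaker voice.
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Turn:
        speaker: str
        audio: bytes  # raw waveform for one conversational turn


    def asr(audio: bytes) -> str:
        """Stub: transcribe one turn of audio (stands in for a real ASR model)."""
        return "yeah so um I was thinking we could maybe grab lunch tomorrow"


    def rewrite(transcript: str, context: List[str]) -> str:
        """Stub: contextual paraphrase that removes stylistic cues but keeps content."""
        return "I was thinking we could have lunch tomorrow"


    def tts(text: str, pseudo_voice: str) -> bytes:
        """Stub: synthesize the rewritten text in a pseudo-speaker's voice."""
        return b"\x00" * 16000


    def anonymize_conversation(turns: List[Turn], pseudo_voice: str) -> List[bytes]:
        """Anonymize a conversation turn by turn, passing the running rewritten
        transcript as context so the paraphrase stays coherent across turns."""
        context: List[str] = []
        anonymized = []
        for turn in turns:
            transcript = asr(turn.audio)
            rewritten = rewrite(transcript, context)
            anonymized.append(tts(rewritten, pseudo_voice))
            context.append(rewritten)
        return anonymized


    if __name__ == "__main__":
        convo = [Turn(speaker="A", audio=b""), Turn(speaker="B", audio=b"")]
        out = anonymize_conversation(convo, pseudo_voice="pseudo_speaker_1")
        print(f"anonymized {len(out)} turns")
    ```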

  • Privacy Versus Emotion Preservation Trade-Offs in Emotion-Preserving Speaker Anonymization

    Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its utility, including linguistic and paralinguistic aspects. However, anonymizing speech while maintaining emotional state remains challenging. We explore this problem in the context of the VoicePrivacy 2024 challenge. Specifically, we develop various speaker anonymization pipelines and find that approaches excel either at anonymization or at preserving emotional state, but not both simultaneously. Achieving both would require an in-domain emotion recognizer. Additionally, we find that it is feasible to train a semi-effective speaker verification system using only emotion representations, demonstrating the challenge of separating these two modalities.

    Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola Garcia-Perera, Kevin Duh, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

    2024 IEEE Spoken Language Technology Workshop (SLT), 2024

    PDF BibTeX

    #speech #privacy

  • HLTCOE Submission to the 2024 Voice Privacy Challenge

    We present a number of systems for the Voice Privacy Challenge, including voice conversion-based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS)-based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system which seeks to balance the strengths and weaknesses of the two categories of systems (a minimal sketch of the admixture idea appears after this entry), achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.

    Henry Li Xinyuan, Zexin Cai, Ashi Garg, Leibny Paola Garcia-Perera, Kevin Duh, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

    Proc. 4th Symposium on Security and Privacy in Speech Communication, 2024

    Awards: Best paper

    PDF BibTeX

    #speech #privacy
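
    The random admixture idea above is simple enough to sketch. The following Python fragment is illustrative only and is not the paper's implementation: both anonymizers are hypothetical stubs, and the point is just that each utterance is routed at random between a voice-conversion system (better emotion preservation) and a TTS system (stronger anonymization).

    ```python
    # Hypothetical sketch of a random admixture anonymizer: per utterance,
    # pick the TTS-based system with probability p_tts, otherwise the
    # voice-conversion-based system, trading anonymization strength
    # against emotion preservation.
    import random
    from typing import Callable, List


    def vc_anonymize(utterance: bytes) -> bytes:
        """Stub for a voice-conversion-based anonymizer (e.g. kNN-VC style)."""
        return utterance


    def tts_anonymize(utterance: bytes) -> bytes:
        """Stub for a TTS-based anonymizer (e.g. ASR followed by synthesis)."""
        return utterance


    def admixture_anonymize(utterances: List[bytes], p_tts: float = 0.5, seed: int = 0) -> List[bytes]:
        """Route each utterance to one of the two systems at random."""
        rng = random.Random(seed)
        out: List[bytes] = []
        for utt in utterances:
            system: Callable[[bytes], bytes] = tts_anonymize if rng.random() < p_tts else vc_anonymize
            out.append(system(utt))
        return out


    if __name__ == "__main__":
        dummy = [b"utt1", b"utt2", b"utt3", b"utt4"]
        print(len(admixture_anonymize(dummy, p_tts=0.5)))
    ```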

  • Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

    Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., um, uh-huh), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulty (a minimal sketch of this trial construction appears after this entry). We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic is increasingly controlled. We present an analysis of the impact of transcription style on performance as well as the ability of fine-tuning on speech transcripts to improve performance.

    Cristina Aggazzotti, Nicholas Andrews, Elizabeth Allyn Smith

    Transactions of the Association for Computational Linguistics (TACL), 2024

    PDF BibTeX

    #speech #privacy #forensics
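
    The benchmark above controls for topic when constructing verification trials. The following Python sketch shows one way such trials of varying difficulty could be assembled from conversational data; it is illustrative only, and the field names and difficulty grading are assumptions, not the released benchmark code.

    ```python
    # Hypothetical sketch of topic-controlled verification trial construction:
    # impostor (different-speaker) pairs get harder as topical overlap grows
    # (same prompt, or even the same conversation), while genuine (same-speaker)
    # pairs get harder when the two samples discuss different topics.
    from dataclasses import dataclass
    from itertools import combinations
    from typing import List, Tuple


    @dataclass
    class Sample:
        speaker: str
        conversation: str
        prompt: str
        transcript: str


    def make_trials(samples: List[Sample]) -> List[Tuple[Sample, Sample, bool, str]]:
        """Return (sample_a, sample_b, same_speaker, difficulty) trials."""
        trials = []
        for a, b in combinations(samples, 2):
            same_speaker = a.speaker == b.speaker
            if same_speaker:
                # genuine trials: harder when there is no shared topic to lean on
                difficulty = "hard" if a.prompt != b.prompt else "easy"
            elif a.conversation == b.conversation:
                difficulty = "hard"    # same conversation, topic fully shared
            elif a.prompt == b.prompt:
                difficulty = "medium"  # same prompt, different conversation
            else:
                difficulty = "easy"    # unrelated topics
            trials.append((a, b, same_speaker, difficulty))
        return trials


    if __name__ == "__main__":
        data = [
            Sample("spk1", "conv1", "travel", "um I mostly take the train"),
            Sample("spk2", "conv1", "travel", "uh-huh yeah flying is faster though"),
            Sample("spk1", "conv2", "food", "so I cook at home most nights"),
            Sample("spk3", "conv3", "travel", "I like road trips honestly"),
        ]
        for a, b, same, diff in make_trials(data):
            print(a.speaker, b.speaker, same, diff)
    ```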

Back to all publications