Publications tagged: #speech
-
Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts
We address the challenge of detecting synthesized speech under distribution shifts—arising from unseen synthesis methods, speakers, languages, or audio conditions—relative to the training data. Few-shot learning methods are a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples, achieving up to a 32% relative EER reduction on Japanese-language deepfakes and a 20% relative reduction on the ASVspoof 2021 Deepfake dataset.
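For illustration, the sketch below shows the core idea behind prototypical-network few-shot adaptation: class prototypes are averaged from a handful of labelled in-distribution samples, and each query is assigned to the nearest prototype. This is a minimal sketch only; the paper's self-attentive prototypical network is not reproduced, and `embed` is a hypothetical placeholder for a frozen embedding extractor.

```python
# Minimal sketch of prototypical-network-style few-shot adaptation
# (illustrative; the paper adds self-attention over the support set,
# which is omitted here).
import numpy as np

def build_prototypes(support_embeddings, support_labels):
    """Average the embeddings of each class (e.g. bona fide vs. spoofed)
    in the small in-distribution support set."""
    prototypes = {}
    for label in set(support_labels):
        members = [e for e, y in zip(support_embeddings, support_labels) if y == label]
        prototypes[label] = np.mean(members, axis=0)
    return prototypes

def classify(query_embedding, prototypes):
    """Assign the query to the nearest class prototype (Euclidean distance)."""
    return min(prototypes, key=lambda label: np.linalg.norm(query_embedding - prototypes[label]))

# Usage with as few as ~10 labelled in-distribution samples:
# support_embeddings = [embed(x) for x in support_waveforms]   # `embed` is hypothetical
# prototypes = build_prototypes(support_embeddings, support_labels)
# prediction = classify(embed(test_waveform), prototypes)
```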
Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025
-
Scalable Controllable Accented TTS
We tackle the challenge of scaling accented TTS systems, expanding their capabilities to include much larger amounts of training data and a wider variety of accent labels, even for accents that are poorly represented or unlabeled in traditional TTS datasets. To achieve this, we employ two strategies: (1) accent label discovery via a speech geolocation model, which automatically infers accent labels from raw speech data without relying solely on human annotation; and (2) timbre augmentation through kNN voice conversion to increase data diversity and model robustness. These strategies are validated on CommonVoice, where we fine-tune XTTS-v2 for accented TTS with accent labels discovered or enhanced using geolocation. We demonstrate that the resulting accented TTS model outperforms not only XTTS-v2 fine-tuned on self-reported accent labels in CommonVoice, but also existing accented TTS benchmarks.
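As a rough illustration of the timbre-augmentation strategy, the sketch below implements plain kNN voice conversion over frame-level features: each source frame is replaced by the mean of its k nearest target-speaker frames, transferring timbre while keeping content. The feature extractor and vocoder (`wavlm_frames`, `vocoder`) are hypothetical placeholders, not the paper's exact pipeline.

```python
# Minimal sketch of kNN voice conversion for timbre augmentation
# (illustrative only).
import numpy as np

def knn_convert(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k nearest
    target-speaker frames under cosine similarity."""
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = src @ tgt.T                          # (T_src, T_tgt) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k nearest target frames
    return target_feats[topk].mean(axis=1)      # converted frame sequence

# converted = knn_convert(wavlm_frames(src_wav), wavlm_frames(tgt_wav))  # hypothetical extractor
# augmented_wav = vocoder(converted)                                     # hypothetical vocoder
```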
Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025
-
GenVC: Self-Supervised Zero-Shot Voice Conversion
Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.
Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025
-
The Impact of Automatic Speech Transcription on Speaker Attribution
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g., deleted) or unreliable (e.g., anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often has only the more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on the more errorful transcripts produced by ASR is as good as, if not better than, attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features that reveal speaker identity.
Cristina Aggazzotti, Matthew Wiesner, Elizabeth Allyn Smith, Nicholas Andrews
Transactions of the Association for Computational Linguistics (TACL), 2025
-
Content Anonymization for Privacy in Long-form Audio
Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which poses a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches. Our approach performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting that demonstrate the effectiveness of a content-based attack on voice-anonymized speech. We then show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.
Cristina Aggazzotti, Ashi Garg, Zexin Cai, Nicholas Andrews
arXiv preprint arXiv:2510.12780, 2025
-
ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts
The problem of synthetic speech detection has enjoyed considerable attention, with recent methods achieving low error rates across several established benchmarks. However, to what extent do low error rates on academic benchmarks translate to more realistic conditions? In practice, while the training set is fixed at one point in time, test-time conditions may exhibit distribution shifts relative to the training conditions, such as changes in speaker characteristics, emotional expressiveness, language and acoustic conditions, and the emergence of novel synthesis methods. Although some existing datasets target subsets of these distribution shifts, systematic analysis remains difficult due to inconsistencies between source data and synthesis systems across datasets. This difficulty is further exacerbated by the rapid development of new text-to-speech (TTS) and vocoder systems, which continually expand the diversity of synthetic speech. To enable systematic benchmarking of model performance under distribution shifts, we introduce ShiftySpeech, a large-scale benchmark comprising over 3,000 hours of synthetic speech across 7 source domains, 6 TTS systems, 12 vocoders, and 3 languages. ShiftySpeech is specifically designed to evaluate model generalization under controlled distribution shifts while ensuring broad coverage of modern synthetic speech generation techniques. It fills a key gap in current benchmarks by supporting fine-grained, controlled analysis of generalization robustness. All tested distribution shifts significantly degrade the performance of state-of-the-art detection approaches based on self-supervised features. Overall, our findings suggest that reliance on synthetic speech detection methods in production environments should be carefully evaluated based on anticipated distribution shifts.
Ashi Garg, Zexin Cai, Lin Zhang, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
arXiv preprint arXiv:2502.05674, 2025
-
Privacy Versus Emotion Preservation Trade-Offs in Emotion-Preserving Speaker Anonymization
Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its utility, including linguistic and paralinguistic aspects. However, anonymizing speech while maintaining emotional state remains challenging. We explore this problem in the context of the VoicePrivacy 2024 challenge. Specifically, we develop various speaker anonymization pipelines and find that approaches excel either at anonymization or at preserving emotional state, but not both simultaneously; achieving both would require an in-domain emotion recognizer. Additionally, we find that it is feasible to train a semi-effective speaker verification system using only emotion representations, demonstrating the challenge of separating these two modalities.
Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner
2024 IEEE Spoken Language Technology Workshop (SLT), 2024
-
HLTCOE Submission to the 2024 Voice Privacy Challenge
We present a number of systems for the Voice Privacy Challenge, including voice conversion-based systems, such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS)-based systems, including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system that seeks to balance the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.
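For reference, the two reported metrics can be computed from raw system outputs roughly as in the sketch below (higher EER means stronger anonymization against the attacker's verifier; higher UAR means better emotion preservation). This is an illustrative implementation, not the challenge's official scoring tooling.

```python
# Minimal sketch of the two reported metrics (illustrative only).
import numpy as np

def equal_error_rate(scores, labels):
    """EER of a verification system: the operating point where the
    false-acceptance rate equals the false-rejection rate
    (labels: 1 = same speaker, 0 = different speakers)."""
    order = np.argsort(scores)[::-1]                     # sort trials by decreasing score
    labels = np.asarray(labels)[order]
    n_tar = max(int((labels == 1).sum()), 1)
    n_non = max(int((labels == 0).sum()), 1)
    far = np.cumsum(labels == 0) / n_non                 # false acceptances accumulated so far
    frr = 1.0 - np.cumsum(labels == 1) / n_tar           # target trials still rejected
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)

def unweighted_average_recall(y_true, y_pred):
    """UAR of an emotion recognizer: mean of per-class recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))
```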
Henry Li Xinyuan, Zexin Cai, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner
Proc. 4th Symposium on Security and Privacy in Speech Communication, 2024
Awards: Best paper
-
Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?
Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., um, uh-huh), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulty. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written-text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic is increasingly controlled. We present an analysis of the impact of transcription style on performance, as well as the ability of fine-tuning on speech transcripts to improve performance.
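As a toy illustration of the verification setup, the sketch below scores a trial by cosine similarity between two transcript embeddings and thresholds the result. Both `style_encoder` and the threshold are hypothetical; the benchmark's actual baselines include a range of neural and non-neural models.

```python
# Minimal sketch of a verification-style trial scorer (illustrative only).
import numpy as np

def same_speaker(embedding_a, embedding_b, threshold=0.8):
    """Decide whether two transcript embeddings come from the same speaker
    by thresholding their cosine similarity."""
    a = embedding_a / np.linalg.norm(embedding_a)
    b = embedding_b / np.linalg.norm(embedding_b)
    return float(a @ b) >= threshold

# decision = same_speaker(style_encoder(transcript_1), style_encoder(transcript_2))  # hypothetical encoder
```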
Cristina Aggazzotti, Nicholas Andrews, Elizabeth Allyn Smith
Transactions of the Association for Computational Linguistics (TACL), 2024