Publications from 2026

  • Content Anonymization for Privacy in Long-form Audio

    Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these settings, many utterances from the same speaker are available, which poses a significantly greater privacy risk: an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose a new content anonymization approach that contextually rewrites transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We first present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech, and then show how the proposed content-based anonymization methods mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.

    Cristina Aggazzotti, Ashi Garg, Zexin Cai, Nicholas Andrews

    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026

    PDF BibTeX

    #speech #privacy
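
    A minimal sketch of the kind of content-based attack described above, under the assumption that the attacker simply compares word n-gram TF-IDF vectors of transcripts (an illustrative choice, not the attack implemented in the paper):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_candidate_speakers(query_transcript, enrollment_transcripts):
        # Fit TF-IDF over the query and all enrollment transcripts so that
        # vocabulary and recurring phrases drive the similarity scores.
        vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
        matrix = vectorizer.fit_transform([query_transcript] + enrollment_transcripts)
        similarities = cosine_similarity(matrix[0], matrix[1:]).ravel()
        # Indices of enrolled speakers, most likely match first.
        return similarities.argsort()[::-1]

    Paraphrasing the transcripts before synthesis is intended to flatten exactly the vocabulary and phrasing cues that this kind of attack relies on.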

  • Attacks on Machine-Text Detectors Retain Stylistic Fingerprints

    Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard and that stakeholders should therefore proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine one such recent claim, by Nicks et al. (2024), regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors they were not specifically optimized against. We identify a feature space, the stylistic feature space, that is robust to such optimization, and show that it can be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then ask whether stylistic detectors are inherently more robust. To study this question, we explore a new paraphrasing approach that simultaneously aims to close the gap between human and machine writing in stylistic feature space while avoiding detection by traditional features. We show that when only a single sample is available for detection, this attack is universally effective across all detectors considered, including those that use writing style. However, as the number of samples available for detection grows, the human and machine distributions become distinguishable. This observation leads us to introduce AURA, a metric that estimates the overlap between the human and machine-generated distributions by analyzing how detector performance improves as more samples become available. Overall, our findings underscore previous recommendations to avoid relying on machine-text detection.

    Rafael Rivera Soto, Barry Chen, Nicholas Andrews

    Forty-Third International Conference on Machine Learning (ICML), 2026

    PDF BibTeX

    #llm #deepfake_detection #preprint
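
    As a hedged illustration of the idea behind AURA (one reading, not the paper's exact definition), one can track how detection AUROC changes as per-document detector scores are aggregated over more and more samples; the function below assumes such per-document scores are already available:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auroc_vs_sample_count(human_scores, machine_scores, max_k=10, trials=1000, seed=0):
        # human_scores / machine_scores: 1-D arrays of per-document detector
        # scores, where higher means "more likely machine-generated".
        rng = np.random.default_rng(seed)
        curve = []
        for k in range(1, max_k + 1):
            h = rng.choice(human_scores, size=(trials, k), replace=True).mean(axis=1)
            m = rng.choice(machine_scores, size=(trials, k), replace=True).mean(axis=1)
            labels = np.concatenate([np.zeros(trials), np.ones(trials)])
            curve.append(roc_auc_score(labels, np.concatenate([h, m])))
        # A curve that rises quickly with k suggests the two distributions
        # overlap little; a curve stuck near 0.5 suggests heavy overlap.
        return curve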

  • CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

    Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

    Suarez et al. (with Cristina Aggazzotti)

    The 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026

    PDF BibTeX

    #benchmark #preprint
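
    A small sketch of the kind of per-language scoring such a benchmark supports (illustrative only; the released evaluation code may differ):

    from collections import defaultdict

    def per_language_accuracy(gold_labels, predicted_labels):
        # gold_labels / predicted_labels: parallel lists of language codes.
        correct, total = defaultdict(int), defaultdict(int)
        for gold, pred in zip(gold_labels, predicted_labels):
            total[gold] += 1
            correct[gold] += int(gold == pred)
        # Reporting accuracy per language exposes under-served languages that
        # a single aggregate number would hide.
        return {lang: correct[lang] / total[lang] for lang in total}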

  • Universal Speech Content Factorization

    We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that, as training-efficient timbre-disentangled speech features, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.

    Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

    arXiv preprint arXiv:2603.08977, 2026

    PDF BibTeX

    #speech #controllable_generation #preprint
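
    For intuition, a generic least-squares linear mapping of the sort the abstract describes can be fit in a few lines; this is only a sketch under assumed feature shapes, not the released USCF implementation:

    import numpy as np

    def fit_content_map(speech_features, content_targets):
        # speech_features: (T, d) frame-level features; content_targets: (T, k)
        # paired content representations. Solve for W minimizing ||X W - C||_F^2.
        W, *_ = np.linalg.lstsq(speech_features, content_targets, rcond=None)
        return W  # (d, k) linear speech-to-content projection

    def extract_content(new_speech_features, W):
        # Project unseen frames into the lower-rank content space; a linear
        # map of this form is straightforward to (pseudo-)invert.
        return new_speech_features @ W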

  • Can LLMs Help Localize Fake Words in Partially Fake Speech?

    Large language models (LLMs), trained on large-scale text, have recently attracted significant attention for their strong performance across many tasks. Motivated by this, we investigate whether a text-trained LLM can help localize fake words in partially fake speech, where only specific words within an utterance are edited. We build a speech LLM that performs fake word localization via next-token prediction. Experiments and analyses on AV-Deepfake1M and PartialEdit indicate that the model frequently leverages editing-style patterns learned from the training data, particularly the word-level polarity substitutions found in these two databases, as cues for localizing fake words. Although such patterns provide useful information in in-domain scenarios, avoiding over-reliance on them and improving generalization to unseen editing styles remains an open question.

    Lin Zhang, Thomas Thebaud, Zexin Cai, Sanjeev Khudanpur, Daniel Povey, Leibny Paola García-Perera, Matthew Wiesner, Nicholas Andrews

    arXiv preprint arXiv:2603.11205, 2026

    PDF BibTeX

    #speech #deepfake_detection #llm #preprint
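
    One simple way to cast word-level localization as next-token prediction is to serialize per-word real/fake labels into the target string the LLM must generate; this is an assumed formatting for illustration, not necessarily the paper's exact recipe:

    def make_localization_target(words, fake_flags):
        # words: transcript tokens; fake_flags: parallel 0/1 list marking edited words.
        assert len(words) == len(fake_flags)
        return " ".join(f"{w}|{'FAKE' if f else 'REAL'}" for w, f in zip(words, fake_flags))

    # make_localization_target(["the", "deal", "collapsed"], [0, 1, 0])
    # -> "the|REAL deal|FAKE collapsed|REAL"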

  • Integrated Spoofing-Robust Automatic Speaker Verification via a Three-Class Formulation and LLR

    Spoofing-robust automatic speaker verification (SASV) aims to integrate automatic speaker verification (ASV) and spoofing countermeasures (CM). A popular solution is the fusion of independent ASV and CM scores. To better model SASV, some frameworks integrate ASV and CM within a single network. However, these solutions are typically bi-encoder based, offer limited interpretability, and cannot be readily adapted to new evaluation parameters without retraining. Motivated by this, we propose a unified end-to-end framework based on a three-class formulation that enables log-likelihood ratio (LLR) inference from class logits, yielding a more interpretable decision pipeline. Experiments show performance comparable to existing methods on ASVspoof 5 and better results on SpoofCeleb. Visualization and analysis further show that the three-class formulation provides greater interpretability.

    Kai Tan, Lin Zhang, Ruiteng Zhang, Johan Rohdin, Leibny Paola García-Perera, Zexin Cai, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

    arXiv preprint arXiv:2603.13780, 2026

    PDF BibTeX

    #speech #deepfake_detection #preprint
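
    A hedged sketch of how an LLR can be read off three-class logits (ordered [target, nontarget, spoof]); this assumes equal class priors and is one plausible reading, not necessarily the paper's exact derivation:

    import torch
    import torch.nn.functional as F

    def sasv_llr(logits):
        # logits: (batch, 3) class scores ordered [target, nontarget, spoof].
        log_posteriors = F.log_softmax(logits, dim=-1)
        log_target = log_posteriors[:, 0]
        # The reject hypothesis is the union of "nontarget" and "spoof".
        log_reject = torch.logsumexp(log_posteriors[:, 1:], dim=-1)
        # Under equal priors, this posterior ratio equals the likelihood ratio.
        return log_target - log_reject  # higher score => accept as bona fide target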

Back to all publications