Language Identification of Encrypted VoIP Traffic: Alejandra y Roberto or Alice and Bob?
Charles V. Wright, Fabian Monrose, and Gerald M. Masson
Voice over IP (VoIP) has become a popular protocol for making phone
calls over the Internet. Due to the potential transit of sensitive
conversations over untrusted network infrastructure, it is well
understood that the contents of a VoIP session should be encrypted.
However, we demonstrate that current cryptographic techniques do not
provide adequate protection when the underlying audio is encoded
using bandwidth-saving Variable Bit Rate (VBR) coders. Specifically,
we use the lengths of encrypted VoIP packets to tackle the
challenging task of identifying the language of the conversation.
Our empirical analysis of 2,066 native speakers of 21 different
languages shows that a substantial amount of information can be
discerned from encrypted VoIP traffic. For instance, our 21-way
classifier achieves 66% accuracy, almost a 14-fold improvement over
random guessing. For 14 of the 21 languages, the accuracy is greater than 90%.
We achieve an overall binary classification (e.g., "Is this a
Spanish or English conversation?") rate of 86.6%. Our analysis
highlights what we believe to be interesting new privacy issues in VoIP.
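
The "almost 14-fold" figure can be checked by comparing the reported 21-way accuracy against a uniform random guess over 21 languages. The following is a minimal sketch of that arithmetic, assuming every language is equally likely a priori; the function name `improvement_over_chance` is illustrative and does not come from the paper.

```python
def improvement_over_chance(accuracy: float, num_classes: int) -> float:
    """Return the ratio of observed accuracy to uniform random-guess accuracy."""
    if num_classes <= 0:
        raise ValueError("num_classes must be positive")
    # Assumption: a random guesser picks each of the classes with equal probability.
    chance = 1.0 / num_classes
    return accuracy / chance


# Reported figures: 66% accuracy for the 21-way classifier.
ratio = improvement_over_chance(0.66, 21)
print(f"{ratio:.2f}")
```

Under the uniform-prior assumption, chance accuracy is 1/21 (about 4.8%), so 66% accuracy corresponds to a ratio just under 14.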