Pseudonymize Speech

A Praat script to pseudonymize speech. That is, Pseudonymize Speech tries to make it difficult to recognize the speaker while still retaining relevant (para-)linguistic features and intelligibility.

Running the script

When running the script, a form will appear with the following fields:


Note that all file paths can be relative to the place where the main script is stored.


If the Source is not a path to soundfiles, it can be a path to a .csv or .tsv table. Such a table should have the following column labels:

Source Reference Target_Phi Target_Pitch Target_Rate Target_Directory

Optional columns:


If the Reference is not a path to soundfiles, it can be a path to a .csv or .tsv table containing speaker profiles. For a single source speaker, the format is: <speaker id>@table path, e.g., F24I@Examples/SpeakerProfiles.csv. Such a table should have the following column labels:

Reference MedianPitch Phi Phi1 Phi2 Phi3 Phi4 Phi5 ArtRate Int0 Int1 Int2 Int3 Int4 Int5 Duration Corpus Gender

This table is created automatically when the main script is run with audio files as the Reference. Because these calculations can take some time, the table can be saved after the script finishes and used as the Reference input the next time the script is run for the same speaker(s). If the Reference is * or GENERATE, only this table is generated and no pseudonymized speech is produced. In that case, any non-empty Reference labels in a source table are interpreted as the reference labels of the new speaker profiles.
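The <speaker id>@table path notation and the profile table lookup can be sketched in Python as follows. This is only an illustration and not part of the Praat script: the function names are hypothetical, the delimiter is assumed to be a comma for .csv and a tab for .tsv, and the numbers in the sample table are made up.

```python
import csv
import io


def split_reference(reference):
    """Split '<speaker id>@<table path>' into (speaker_id, path).

    Returns (None, reference) when there is no '@' in the string.
    """
    speaker_id, sep, path = reference.partition("@")
    if not sep:
        return None, reference
    return speaker_id.strip(), path.strip()


def read_table(stream, path):
    """Read a table from an open text stream; the delimiter is guessed from the file name (assumption)."""
    delimiter = "\t" if path.lower().endswith(".tsv") else ","
    return list(csv.DictReader(stream, delimiter=delimiter))


def find_profile(rows, speaker_id):
    """Return the first row whose Reference column equals speaker_id, or None."""
    for row in rows:
        if row.get("Reference") == speaker_id:
            return row
    return None


# Hypothetical table content with only three of the columns, values invented for the example.
sample = "Reference,MedianPitch,Phi\nF24I,200,560\nM01,110,510\n"
speaker, table_path = split_reference("F24I@Examples/SpeakerProfiles.csv")
profile = find_profile(read_table(io.StringIO(sample), table_path), speaker)
```

In real use the stream would come from opening table_path rather than from an in-memory string.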

Target Phi, Target Pitch, and Target Rate

These fields accept a list of comma separated values, e.g.,

Target Phi: 510, 550, 585
Target Pitch: 110, 145, 175
Target Rate: 3.8, 4.0, 4.2

These values are used in turn. If several fields have lists, they are used in parallel, that is, the first of each list (510, 110, 3.8), then the second of each list (550, 145, 4.0), then the third (585, 175, 4.2), etc. If one list ends before the others, the last value is reused. A value of "0" for any of these indicates that the corresponding reference value, φ, median F0, or articulation rate, will be used, i.e., no change.
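The parallel use of the three lists can be illustrated with a minimal Python sketch. It is not taken from the Praat script; the function names are hypothetical, and treating an empty field as a single 0 (no change) is an assumption.

```python
def parse_list(field):
    """Turn a comma-separated field into a list of floats; an empty field becomes [0.0] (assumption)."""
    values = [float(item) for item in field.split(",") if item.strip()]
    return values or [0.0]


def parallel_targets(phi, pitch, rate):
    """Step through the lists in parallel, reusing the last value of any list that ends early."""
    lists = [phi, pitch, rate]
    steps = max(len(values) for values in lists)
    return [
        tuple(values[min(i, len(values) - 1)] for values in lists)
        for i in range(steps)
    ]


targets = parallel_targets(
    parse_list("510, 550, 585"),
    parse_list("110, 145"),
    parse_list("3.8"),
)
# targets holds three (phi, pitch, rate) tuples; the third one reuses pitch 145 and rate 3.8.
```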

If a label of a speaker is entered in the Target Phi field, its profile will be copied into this and the other fields, including the Randomization fields. Effectively, the system will convert the input speech using the parameters of the target speaker. The Pitch and Rate fields should be empty (or 0) in this case.

The label Random selects a random speaker from the speaker profiles, different from the one that is converted. The labels RandomFemale and RandomMale work like Random, but select only random female or male speakers, respectively. The labels RandomSgender and RandomXgender select a random target speaker profile of the same gender as the source speaker or of a different gender, respectively. If no gender information is available, these two labels work like Random. Each of these labels can be followed by a "=" and a corpus name to limit the selection to a single corpus. For example, RandomSgender=ASVspoof2019 limits the selection to speakers of the same gender from the corpus called ASVspoof2019.
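The selection rules for these labels might look like the following Python sketch. It is an illustration only and not the script's own code: the function name is hypothetical, profiles are assumed to be dictionaries keyed by the column labels of the speaker profile table, and the gender coding "F"/"M" is an assumption.

```python
import random

KNOWN_LABELS = {"Random", "RandomFemale", "RandomMale", "RandomSgender", "RandomXgender"}


def pick_random_target(label, source, profiles, rng=random):
    """Pick a random target profile according to a label such as 'RandomSgender=ASVspoof2019'."""
    kind, _, corpus = label.partition("=")
    kind, corpus = kind.strip(), corpus.strip()
    if kind not in KNOWN_LABELS:
        raise ValueError(f"unknown label: {kind}")

    # Never select the speaker that is being converted.
    candidates = [p for p in profiles if p["Reference"] != source["Reference"]]
    if corpus:
        candidates = [p for p in candidates if p.get("Corpus") == corpus]

    source_gender = source.get("Gender", "")
    if kind == "RandomFemale":
        candidates = [p for p in candidates if p.get("Gender") == "F"]
    elif kind == "RandomMale":
        candidates = [p for p in candidates if p.get("Gender") == "M"]
    elif kind == "RandomSgender" and source_gender:
        candidates = [p for p in candidates if p.get("Gender") == source_gender]
    elif kind == "RandomXgender" and source_gender:
        candidates = [p for p in candidates if p.get("Gender") not in ("", None, source_gender)]
    # Without gender information, RandomSgender and RandomXgender fall through and act like Random.

    if not candidates:
        raise ValueError("no matching speaker profile")
    return rng.choice(candidates)
```

Passing a seeded random.Random instance as rng makes the selection reproducible.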

Randomize bands and Randomize intensity

These fields accept a list of frequency bands, e.g., "F0, F3, F4, F5", for which the frequency or intensity is shifted by random amounts. Only a single such list can be entered per field. It is possible to fix the target values of these bands by entering the target φ value (neutral F1, roughly between 500 and 600 Hz) or intensity (roughly between 45 and 70 dB) to hold for that band. For instance, Randomize bands "F0 = 550, F3=510, F4 = 570, F5=540" (Hz) or Randomize intensity "F0 = 65, F3=50, F4 = 43, F5=44" (dB).

Example values for F0-F5 for a speaker could be, for the frequencies, 550, 550, 530, 530, 510, 560 ±50 Hz and, for the corresponding intensities, 64±4.5, 67±2.5, 58±4.5, 50±8, 47±10, 45±9 dB (±2SD). For randomization of frequency bands, φ±40 and φ±75 Hz are used for F0-1 and F2-5, respectively. For randomization of intensities, the above example values are used, including the 2SD values as ranges.
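As a rough Python illustration of the frequency ranges, the sketch below draws a random φ for a band. The function name is hypothetical, and the uniform distribution is an assumption: the text above gives only the ranges, not how values are drawn within them.

```python
import random


def random_band_phi(phi, band, rng=random):
    """Draw a random phi (Hz) for band 0-5: phi±40 Hz for F0-1, phi±75 Hz for F2-5.

    The uniform draw is an assumption made for this sketch.
    """
    if band not in range(6):
        raise ValueError("band must be between 0 and 5")
    half_range = 40.0 if band in (0, 1) else 75.0
    return rng.uniform(phi - half_range, phi + half_range)


rng = random.Random(2019)  # seeded only to make the example repeatable
shifted = {f"F{band}": random_band_phi(550.0, band, rng) for band in (0, 3, 4, 5)}
```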

A special option is to replace frequency bands by modulated pink noise. This option might be useful when it is desired to remove all information from certain spectral regions, e.g., for studying Automatic Speaker Identification. For example, Randomize bands "F4 = NOISE" will replace the band around F4 with modulated pink noise scaled to the original intensity. This also works for F6-F9, e.g., "F7=NOISE". Note that neither φ frequency values nor randomization is supported for F6-F9.
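The three forms a band entry can take (a bare band name, a fixed value, or NOISE) can be told apart as in this Python sketch. It only illustrates the notation; the function name and the returned dictionary layout are hypothetical and say nothing about how the Praat script parses the field.

```python
def parse_bands(field):
    """Parse a band list such as 'F0 = 550, F3=510, F4 = NOISE, F5'.

    Returns a dict mapping band name to:
      None     -> shift the band by a random amount
      float    -> hold the band at this fixed target value
      'NOISE'  -> replace the band by modulated pink noise
    """
    bands = {}
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        name, value = name.strip().upper(), value.strip()
        if not sep or not value:
            bands[name] = None
        elif value.upper() == "NOISE":
            bands[name] = "NOISE"
        else:
            bands[name] = float(value)
    return bands


bands = parse_bands("F0 = 550, F3=510, F4 = NOISE, F5")
```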

The Ignore freq bands checkbox suppresses all individual frequency band and intensity changes. This option is most useful for testing the effectiveness of frequency band and intensity changes.

Exaggerated or caricature targets

The differences between the source and the target φ's, that is, φ itself, the bands φ1-5, and the intensity bands φ0-5, can be exaggerated, or diminished, by adding a multiplication factor between []-brackets to the target. For instance, when selecting random cross-gender targets, RandomXgender[1.5]=ASVspoof2019, the effective target will be:

φ_effective = φ_target + 1.5·(φ_target - φ_source).

The exaggeration factor between the []-brackets can also be negative.
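The formula above can be worked through in a few lines of Python. This is a sketch with hypothetical function names, and it is an assumption that a target without []-brackets corresponds to a factor of 0, so that the effective target equals the target itself.

```python
import re


def exaggeration_factor(target):
    """Return the factor between []-brackets in a target string, or 0.0 when there is none (assumption)."""
    match = re.search(r"\[\s*([-+]?\d+(?:\.\d+)?)\s*\]", target)
    return float(match.group(1)) if match else 0.0


def effective_phi(phi_source, phi_target, factor):
    """Apply effective = target + factor * (target - source), as in the formula above."""
    return phi_target + factor * (phi_target - phi_source)


factor = exaggeration_factor("RandomXgender[1.5]=ASVspoof2019")
# With a source phi of 510 Hz and a target phi of 550 Hz:
# 550 + 1.5 * (550 - 510) = 610 Hz
phi = effective_phi(510.0, 550.0, factor)
```

A negative factor moves the effective target back toward the source, e.g., a factor of -0.5 with the same values gives 530 Hz.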

Note: The size of the changes in pitch and rate is capped to prevent deterioration of the sound quality. These caps can effectively be lifted by setting the maximalPitchChange and maximalRateChange variables in the script to a high value.


Two examples are available:

Convert a single speech recording:

Convert a list of speech recordings in a more complex manner:


[1] Lammert AC, Narayanan SS. On Short-Time Estimation of Vocal Tract Length from Formant Frequencies. PLOS ONE. 2015 Jul 15;10(7):e0132193.

The vocal tract length (VTL) is calculated as: VTL = 100 · 352.95 / (4 · φ) cm. Note that a different formant tracking algorithm is used in Pseudonymize Speech than was used in [1] and the estimated φ values here are used only as scale factors. They should not be interpreted as corresponding to "real" VTL values.
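As a worked example of the VTL formula, the Python sketch below converts a φ value to a length in cm. The function and constant names are hypothetical, and reading 352.95 as the speed of sound in m/s is an interpretation of the formula rather than something stated in this text.

```python
SPEED_OF_SOUND = 352.95  # constant from the formula above, presumably the speed of sound in m/s


def vocal_tract_length_cm(phi_hz):
    """VTL = 100 * 352.95 / (4 * phi), with phi in Hz and the result in cm."""
    if phi_hz <= 0:
        raise ValueError("phi must be positive")
    return 100.0 * SPEED_OF_SOUND / (4.0 * phi_hz)


# phi = 550 Hz gives 35295 / 2200, roughly 16.04 cm
vtl = vocal_tract_length_cm(550.0)
```

As noted above, such values serve only as scale factors and should not be read as real vocal tract lengths.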

© Rob van Son, October 17, 2019