A Praat script to pseudonymize speech. That is, Pseudonymize Speech tries to make it difficult to recognize the speaker while still retaining relevant (para-)linguistic features and intelligibility.
When running the script, a form will appear with the following fields:
.csv, this is interpreted as a table from which the required values are read. Such a table makes it possible to automate the pseudonymization of a large number of recordings.
.csv file with a table of speaker values. If the reference path contains a wildcard, a single profile is calculated from all the matching audio files. If Reference is GENERATE, the source is used to determine the speaker characteristics.
Note that all file paths can be relative to the place where the main script is stored.
If the Source is not a path to sound files, it can be a path to a .tsv table. Such a table should have the following column labels:
Source Reference Target_Phi Target_Pitch Target_Rate Target_Directory
Source: A path to sound files.
Reference: Either a path to sound files or a label also present in the Reference table (see below).
Target_Phi: The desired vocal tract length as the neutral F1 (φ) in Hz, see Lammert and Narayanan (2015).
Target_Pitch: The desired F0 in Hz.
Target_Rate: The desired articulation rate in syllables per second.
Target_Directory: The path to the directory where the resulting audio is written to.
Randomize_bands: A list of bands, F0-F5, which will be frequency shifted by random amounts.
Randomize_intensities: A list of bands, F0-F5, which will be intensity shifted by random amounts (in dB).
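A source table with these columns can be generated by a small helper script. The following is a minimal Python sketch, not part of Pseudonymize Speech itself. It assumes the table is tab-separated and that the two Randomize_* columns may simply be appended after the six listed labels. The function name, file paths, and values are hypothetical.

```python
import csv
import io

# Column labels as listed above. Appending the two Randomize_* columns
# is an assumption of this sketch.
COLUMNS = [
    "Source", "Reference", "Target_Phi", "Target_Pitch",
    "Target_Rate", "Target_Directory",
    "Randomize_bands", "Randomize_intensities",
]

def build_source_table(rows):
    """Return tab-separated table text with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()

# Hypothetical paths and target values, for illustration only.
rows = [
    {
        "Source": "recordings/speaker1/*.wav",
        "Reference": "recordings/speaker1/*.wav",
        "Target_Phi": "550",
        "Target_Pitch": "145",
        "Target_Rate": "4.0",
        "Target_Directory": "pseudonymized/speaker1",
        "Randomize_bands": "F0, F3, F4, F5",
        "Randomize_intensities": "F0, F3, F4, F5",
    },
]
table_text = build_source_table(rows)
print(table_text)
```

Writing the text to a file and entering that file path in the Source field would then drive the conversion of every listed recording.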
If the Reference is not a path to sound files, it can be a path to a .tsv table containing speaker profiles. For a single source speaker, the format is: <speaker id>@table path, e.g., F24I@Examples/SpeakerProfiles.csv. Such a table should have the following column labels:
Reference MedianPitch Phi Phi1 Phi2 Phi3 Phi4 Phi5 ArtRate Int0 Int1 Int2 Int3 Int4 Int5 Duration Corpus Gender
Reference: A label indicating the source audio or the speaker.
MedianPitch: Median F0 value of the speaker (audio).
Phi: The φ value calculated according to Lammert and Narayanan (2015).
Phi1: The φ value calculated from the median F1 only.
Phi2: The φ value calculated from the median F2 only.
Phi3: The φ value calculated from the median F3 only.
Phi4: The φ value calculated from the median F4 only.
Phi5: The φ value calculated from the median F5 only.
ArtRate: Articulation rate in syllables per second.
Int0-5: Intensities measured in bands 0-5. Values ≤ 0 are ignored.
Duration: Duration of the reference audio in seconds.
Corpus: A label to indicate the source corpus of speaker profiles (ignored).
Gender: The gender of the speaker, f or m. Used when selecting target profiles.
This table is automatically created when the main script is run using audio files as a reference. Because these calculations can take some time, the table can be saved after the script has finished. The next time the script is run for the same speaker(s), the saved table can be used as Reference input. If the Reference is GENERATE, only this table is generated, and no pseudonymized speech is produced. In this GENERATE case, any non-empty Reference labels in a source table will be interpreted as reference labels of the new speaker profiles.
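To illustrate the <speaker id>@table path notation, here is a minimal Python sketch that splits such a reference and looks up the matching row in a profile table. It is not taken from the script. The function names are hypothetical, the table is assumed to be tab-separated, only a subset of the columns is shown, and the numeric values are invented.

```python
import csv
import io

def split_reference(spec):
    """Split '<speaker id>@<table path>' into (speaker_id, table_path)."""
    speaker_id, sep, table_path = spec.partition("@")
    if not sep:
        # No '@' present: treat the whole string as a path without a label.
        return None, spec
    return speaker_id, table_path

def find_profile(table_text, speaker_id, delimiter="\t"):
    """Return the first row whose Reference column equals speaker_id."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    for row in reader:
        if row["Reference"] == speaker_id:
            return row
    return None

# Invented profile values; only some of the documented columns are shown.
profiles_text = (
    "Reference\tMedianPitch\tPhi\tArtRate\tGender\n"
    "F24I\t200\t560\t4.1\tf\n"
    "M30X\t120\t510\t3.9\tm\n"
)

speaker_id, table_path = split_reference("F24I@Examples/SpeakerProfiles.csv")
profile = find_profile(profiles_text, speaker_id)
print(speaker_id, table_path, profile["Phi"])
```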
These fields accept a list of comma separated values, e.g.,
Target Phi: 510, 550, 585
Target Pitch: 110, 145, 175
Target Rate: 3.8, 4.0, 4.2
These values are used in turn. If several fields have lists, they are used in parallel, that is, the first of each list (510, 110, 3.8), then the second of each list (550, 145, 4.0), then the third (585, 175, 4.2), etc. If one list ends before the others, the last value is reused. A value of "0" for any of these indicates that the corresponding reference value, φ, median F0, or articulation rate, will be used, i.e., no change.
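The parallel use of the lists, with the last value reused when a list runs out, can be sketched in Python as follows. This is an illustration of the rule described above, not the script's own code. The function names are hypothetical, and treating an empty field as "0" (no change) is an assumption of this sketch.

```python
def parse_list(field):
    """Parse a comma-separated field into a list of floats."""
    return [float(value) for value in field.split(",") if value.strip()]

def parallel_targets(phi_field, pitch_field, rate_field):
    """Combine the three fields into (phi, pitch, rate) tuples."""
    # Assumption: an empty field behaves like "0", i.e., no change.
    lists = [parse_list(f) or [0.0]
             for f in (phi_field, pitch_field, rate_field)]
    length = max(len(values) for values in lists)
    # A list that ends early keeps repeating its last value.
    padded = [values + [values[-1]] * (length - len(values))
              for values in lists]
    return list(zip(*padded))

for phi, pitch, rate in parallel_targets("510, 550, 585", "110, 145", "3.8"):
    print(phi, pitch, rate)
```

With the shortened Pitch and Rate lists in this usage example, the last pitch (145) and the single rate (3.8) are reused for the remaining φ values.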
If a label of a speaker is entered in the Target Phi field, its profile will be copied into this and the other fields, including the Randomization fields. Effectively, the system will convert the input speech using the parameters of the target speaker. The Pitch and Rate fields should be empty (or 0) in this case.
The label Random will select a random speaker from the speaker profiles, different from the one that is converted. The labels RandomFemale and RandomMale work like Random, but they select random female or male speakers. The labels RandomSgender and RandomXgender select a random target speaker profile of the same, respectively, different gender as the source speaker. If no gender information is available, these latter labels work like Random. These labels can be followed by a "=" and a corpus name to limit selection to a single corpus. For example, RandomSgender=ASVspoof2019 will limit the selection to speakers of the same gender from the corpus called ASVspoof2019.
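The selection rules for the Random labels can be summarized in a short Python sketch. It only mirrors the behavior described above and is not the script's implementation. The function name, the dictionary representation of profiles, and the example labels are hypothetical.

```python
import random

def pick_target(label, source, profiles, rng=random):
    """Pick a random target profile for a label such as 'RandomXgender=Corpus'."""
    kind, _, corpus = label.partition("=")
    # The target must differ from the speaker that is converted.
    candidates = [p for p in profiles
                  if p["Reference"] != source["Reference"]]
    if corpus:
        candidates = [p for p in candidates if p.get("Corpus") == corpus]
    gender = source.get("Gender", "")
    if kind == "RandomFemale":
        candidates = [p for p in candidates if p.get("Gender") == "f"]
    elif kind == "RandomMale":
        candidates = [p for p in candidates if p.get("Gender") == "m"]
    elif kind == "RandomSgender" and gender in ("f", "m"):
        candidates = [p for p in candidates if p.get("Gender") == gender]
    elif kind == "RandomXgender" and gender in ("f", "m"):
        candidates = [p for p in candidates
                      if p.get("Gender") in ("f", "m")
                      and p.get("Gender") != gender]
    # Without gender information, the S/X labels fall through to plain Random.
    return rng.choice(candidates) if candidates else None

# Hypothetical speaker profiles, reduced to the columns used here.
profiles = [
    {"Reference": "A", "Gender": "f", "Corpus": "CorpusX"},
    {"Reference": "B", "Gender": "m", "Corpus": "CorpusX"},
    {"Reference": "C", "Gender": "f", "Corpus": "CorpusY"},
    {"Reference": "D", "Gender": "m", "Corpus": "CorpusY"},
]
target = pick_target("RandomXgender=CorpusY", profiles[0], profiles)
print(target["Reference"])
```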
A list of frequency bands, e.g., "F0, F3, F4, F5" for which the frequency or intensity is shifted by random amounts. Only a single such list can be entered. It is possible to fix the target values of these bands by entering the target φ value (neutral F1, roughly between 500-600 Hz) or intensity (roughly between 45-70 dB) to hold for that band. For instance, Randomize bands "F0 = 550, F3=510, F4 = 570, F5=540" (Hz) or Randomize intensity "F0 = 65, F3=50, F4 = 43, F5=44" (dB).
Example frequency values for F0-F5 for a speaker could be, frequencies: 550, 550, 530, 530, 510, 560 ±50 Hz and corresponding intensities: 64±4.5, 67±2.5, 58±4.5, 50±8, 47±10, 45±9 dB (±2SD). For randomization of frequency bands, φ±40 and φ±75 Hz are used for F0-1 and F2-5, respectively. For randomization of intensities, the above example values are used, including the 2SD values as ranges.
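As a rough illustration of how a Randomize bands list could be turned into per-band φ targets, consider the Python sketch below. It uses the ranges given above (φ±40 Hz for F0-1, φ±75 Hz for F2-5) and keeps any fixed value entered with "=". Drawing from a uniform distribution is an assumption of this sketch, since the text does not state which distribution the script uses. The function names are hypothetical.

```python
import random

def parse_band_list(text):
    """Parse 'F0 = 550, F3' into {'F0': 550.0, 'F3': None}."""
    bands = {}
    for item in text.split(","):
        if not item.strip():
            continue
        name, sep, value = item.partition("=")
        # A band without '=' has no fixed target and will be randomized.
        bands[name.strip()] = float(value) if sep else None
    return bands

def band_phi_targets(text, phi, rng=random):
    """Return a phi target in Hz for every band named in the list."""
    targets = {}
    for band, fixed in parse_band_list(text).items():
        if fixed is not None:
            targets[band] = fixed
            continue
        # Ranges from the text: +/-40 Hz for F0-1, +/-75 Hz for F2-5.
        half_range = 40.0 if int(band[1:]) <= 1 else 75.0
        # Assumption: values are drawn uniformly within the range.
        targets[band] = rng.uniform(phi - half_range, phi + half_range)
    return targets

targets = band_phi_targets("F0 = 550, F3, F4 = 570, F5", 540.0)
print(targets)
```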
A special option is to replace frequency bands by modulated pink noise. This option might be useful when it is desired to remove all information from certain spectral regions, e.g., for studying Automatic Speaker Identification. For example, Randomize bands "F4 = NOISE" will replace the band around F4 with modulated pink noise scaled to the original intensity. This will also work for F6-F9, e.g., "F7=NOISE". Note, φ frequency values or randomization are not supported for F6-F9.
The Ignore freq bands checkbox is supplied to suppress all individual frequency band and intensity changes. This option is most useful for testing the effectiveness of frequency band and intensity changes.
The differences between the source and the target φ's (the overall φ, the bands φ1-φ5, and the intensity bands φ0-φ5) can be exaggerated or diminished by adding a multiplication factor between square brackets to the target. For instance, when selecting random cross-gender targets with RandomXgender[1.5]=ASVspoof2019, the effective target will be:
φeffective = φtarget + 1.5·(φtarget - φsource).
The exaggeration factor between the square brackets can also be negative.
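A worked example may help here. The Python sketch below parses a target label with an optional bracketed factor and applies the formula above. The function names and the φ values are hypothetical, and using a factor of 0 when no brackets are given is an assumption of this sketch.

```python
import re

# label, optional [factor], optional =corpus
LABEL_PATTERN = re.compile(r"^([^\[=]+)(?:\[([^\]]+)\])?(?:=(.*))?$")

def parse_target_label(text):
    """Split e.g. 'RandomXgender[1.5]=ASVspoof2019' into its three parts."""
    match = LABEL_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Cannot parse target label: {text!r}")
    label, factor, corpus = match.groups()
    # Assumption: without brackets the factor is 0 (no exaggeration).
    return label, float(factor) if factor else 0.0, corpus

def effective_phi(phi_source, phi_target, factor):
    """phi_effective = phi_target + factor * (phi_target - phi_source)."""
    return phi_target + factor * (phi_target - phi_source)

label, factor, corpus = parse_target_label("RandomXgender[1.5]=ASVspoof2019")
# Invented values: source phi 500 Hz, target phi 550 Hz.
print(effective_phi(500.0, 550.0, factor))  # 625.0
print(effective_phi(500.0, 550.0, -0.5))    # 525.0
```

In this example a positive factor pushes the effective φ beyond the target (625 Hz instead of 550 Hz), while a negative factor pulls it back towards the source.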
Note: The size of the changes in pitch and rate is capped to prevent deterioration of the sound quality. This cap can be lifted by setting the maximalPitchChange and maximalRateChange variables in the script to a high value.
Two examples are available:
Convert a single speech recording:
F0, F3, F4, F5
F0, F3, F4, F5
Convert a list of speech recordings in a more complex manner:
F0, F3, F4, F5
F0, F3, F4, F5
Lammert AC, Narayanan SS. On Short-Time Estimation of Vocal Tract Length from Formant Frequencies. PLOS ONE. 2015 Jul 15;10(7):e0132193.
The vocal tract length (VTL) is calculated as: VTL = 100 · 352.95 / (4 · φ) cm. Note that a different formant tracking algorithm is used in Pseudonymize Speech than was used in Lammert and Narayanan (2015), and the estimated φ values here are used only as scale factors. They should not be interpreted as corresponding to "real" VTL values.
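The conversion between φ and VTL is simple enough to check by hand. The Python sketch below implements the formula above and its inverse. The function names are hypothetical, and reading the constant 352.95 as a speed of sound in m/s is an assumption based on the quarter-wavelength relation F1 = c / (4 · L).

```python
# Constant from the formula above; presumably the speed of sound in m/s.
SPEED_CONSTANT = 352.95

def vtl_cm(phi_hz):
    """VTL = 100 * 352.95 / (4 * phi), in cm."""
    return 100 * SPEED_CONSTANT / (4 * phi_hz)

def phi_hz(vtl_in_cm):
    """Inverse relation: phi = 100 * 352.95 / (4 * VTL), in Hz."""
    return 100 * SPEED_CONSTANT / (4 * vtl_in_cm)

for phi in (500.0, 550.0, 600.0):
    print(f"phi = {phi:.0f} Hz -> VTL = {vtl_cm(phi):.1f} cm")
```

For φ values of 500-600 Hz this gives lengths between roughly 14.7 and 17.6 cm, which, as noted above, serve only as scale factors.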
© Rob van Son, October 17, 2019