Pseudonymize Speech

A Praat script to pseudonymize speech. That is, Pseudonymize Speech tries to make it difficult to recognize the speaker while still retaining relevant (para-)linguistic features and intelligibility.

Running the script

When running the script, a form will appear with the following fields:

Source: Path to the audio file to change. The path may contain one (and only one!) wildcard character (*) or end in a "/", in which case all the matching files will be pseudonymized. If the file name ends in .tsv or .csv, it is interpreted as a table from which the required values are read. Such a table makes it possible to automate the pseudonymization of a large number of recordings.
Reference: Path to the audio file(s) containing the reference audio from which the basic features of the speaker are measured to construct a speaker profile. A lot of audio is needed to get reliable values for each speaker, preferably hundreds of seconds. The path may contain one (and only one!) wildcard character (*), end in a "/", or point to a .tsv or .csv file with a table of speaker values. If the reference path contains a wildcard, a single profile is calculated from all the matching audio files. If Reference is -, * or GENERATE, the source is used to determine the speaker characteristics.
Target Phi: (φ) The "neutral" F₁ value corresponding to the target vocal tract length. For instance, 560 Hz for an average female or 535 Hz for an average male speaker.
Target Pitch: The desired F₀ value. For instance, 110 Hz for an average male or 175 Hz for an average female voice.
Target Rate: The desired speaking rate (articulation rate) in syllables per second, e.g., 4.2 syl/second for average read speech.
Target Directory: Path to the directory where the results are stored. Existing files are not overwritten, but a number is appended to the filename instead.
Randomize bands: A list of bands, F0-F5, which will be frequency shifted by random amounts. The F₀ band spans [0, φ/2]. The other bands are stacked above it, with upper edges at multiples of 2·φ: F₁: [φ/2, 2·φ], F₂: [2·φ, 4·φ], and so on up to F₅: [8·φ, 10·φ].
Randomize intensity: A list of bands, F0-F5, which will be intensity shifted by random amounts (in dB). The F₀ band spans [0, φ/2]. The other bands are stacked above it, with upper edges at multiples of 2·φ: F₁: [φ/2, 2·φ], F₂: [2·φ, 4·φ], and so on up to F₅: [8·φ, 10·φ].
Output format: A choice of output formats, WAV or FLAC.
Remove pauses: A check box to indicate whether pauses in the source audio are removed or not.
Ignore freq bands: A check box to indicate that frequency and intensity in the F0-F5 bands should not be changed. If this box is checked, only the vocal tract length (Target Phi), pitch, and rate are changed.
Switch F4 F5: A check box to indicate that frequency bands corresponding to F4 and F5 should be exchanged. This is a rather coarse operation to check the use of the higher frequencies.

Buttons:

Stop: Abort the script.
Help: Open this manual instead of running the script.
Continue: Run the script with the values given.

Note that all file paths can be relative to the place where the main script is stored.
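
The path rules for the Source field can be illustrated with a small Python sketch. This is not code from the Praat script; the function name expand_source and the script_dir parameter are hypothetical, and the sketch simply assumes that a trailing "/" selects every file in the directory.

```python
import glob
import os


def expand_source(source, script_dir="."):
    """Classify a Source value and list the files it refers to.

    Returns ("table", [path]) for .tsv/.csv input and ("audio", [paths])
    otherwise. Relative paths are resolved against script_dir.
    """
    if source.count("*") > 1:
        raise ValueError("the path may contain at most one wildcard (*)")
    path = source if os.path.isabs(source) else os.path.join(script_dir, source)
    if source.endswith((".tsv", ".csv")):
        return "table", [path]
    if source.endswith("/"):
        # Assumption: a trailing "/" means "all files in this directory".
        files = [p for p in glob.glob(os.path.join(path, "*")) if os.path.isfile(p)]
        return "audio", sorted(files)
    if "*" in source:
        return "audio", sorted(glob.glob(path))
    return "audio", [path]
```

For instance, expand_source("Examples/ControlPseudonymization.csv") is classified as a table, while a path such as "Examples/Audio/*.aifc" is expanded to the matching audio files.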

Source

If the Source is not a path to sound files, it can be a path to a .csv or .tsv table. Such a table should have the following column labels:

Source Reference Target_Phi Target_Pitch Target_Rate Target_Directory

Source: A path to sound files.
Reference: Either a path to sound files or a label also present in the Reference table (see below).
Target_Phi: The desired vocal tract length as the neutral F₁ (φ) in Hz, see [1].
Target_Pitch: The desired F₀ in Hz.
Target_Rate: The desired articulation rate in syllables per second.
Target_Directory: The path to the directory where the resulting audio is written to.

Optional columns:

Randomize_bands: A list of bands, F0-F5, which will be frequency shifted by random amounts.
Randomize_intensities: A list of bands, F0-F5, which will be intensity shifted by random amounts (in dB).
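
A source table with these columns could be generated with a few lines of Python. The sketch below is only an illustration: the function name write_source_table is hypothetical, and it assumes that a .tsv table is plain tab-separated text with the column labels in the first row. The optional columns are written only when at least one row uses them.

```python
import csv

REQUIRED = ["Source", "Reference", "Target_Phi", "Target_Pitch",
            "Target_Rate", "Target_Directory"]
OPTIONAL = ["Randomize_bands", "Randomize_intensities"]


def write_source_table(path, rows):
    """Write rows (a list of dicts) as a tab-separated source table."""
    for row in rows:
        missing = [name for name in REQUIRED if name not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
    # Only include an optional column when some row actually sets it.
    columns = REQUIRED + [c for c in OPTIONAL if any(c in r for r in rows)]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns,
                                delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Values taken from the single-recording example in this manual;
    # the output file name is made up for the illustration.
    write_source_table("MySources.tsv", [{
        "Source": "Examples/Audio/F24I2PS27A_fm.aifc",
        "Reference": "F24I",
        "Target_Phi": 500,
        "Target_Pitch": 120,
        "Target_Rate": 3.8,
        "Target_Directory": "Examples/Pseudonymized",
    }])
```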

Reference

If the Reference is not a path to sound files, it can be a path to a .csv or .tsv table containing speaker profiles. For a single source speaker, the format is: <speaker id>@<table path>, e.g., F24I@Examples/SpeakerProfiles.csv. Such a table should have the following column labels:

Reference MedianPitch Phi Phi1 Phi2 Phi3 Phi4 Phi5 ArtRate Int0 Int1 Int2 Int3 Int4 Int5 Duration Corpus Gender

Reference: A label indicating the source audio or the speaker.
MedianPitch: Median F0 value of the speaker (audio).
Phi: The φ value calculated according to [1].
Phi1: The φ value calculated from the median F₁ only.
Phi2: The φ value calculated from the median F₂ only.
Phi3: The φ value calculated from the median F₃ only.
Phi4: The φ value calculated from the median F₄ only.
Phi5: The φ value calculated from the median F₅ only.
ArtRate: Articulation rate in syllables per second.
Int0-5: Intensities measured in bands 0-5. Values ≤ 0 are ignored.
Duration: Duration of the reference audio in seconds.
Corpus: A label to indicate the source corpus of speaker profiles (ignored).
Gender: The gender of the speaker, f or m. Used when selecting target profiles.

This table is created automatically when the main script is run with audio files as a reference. Because these calculations can take some time, the table can be saved after the script finishes and used as Reference input the next time the script is run for the same speaker(s). If the Reference is * or GENERATE, only this table is generated and no pseudonymized speech is produced. In the GENERATE case, any non-empty Reference labels in a source table are interpreted as the reference labels of the new speaker profiles.
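
As an illustration of the <speaker id>@table path notation and of the rule that Int0-5 values ≤ 0 are ignored, here is a short Python sketch. Both function names are hypothetical, and a profile is represented as a plain dict keyed by the column labels above; the Praat script itself may handle these steps differently.

```python
def split_reference(reference):
    """Split 'F24I@Examples/SpeakerProfiles.csv' into (speaker id, table path).

    Returns (None, reference) when no speaker id is given.
    """
    if "@" in reference:
        speaker_id, table_path = reference.split("@", 1)
        return speaker_id, table_path
    return None, reference


def usable_intensities(profile):
    """Return the band intensities of a profile row, skipping values <= 0."""
    result = {}
    for band in range(6):
        value = float(profile.get(f"Int{band}", 0) or 0)
        if value > 0:
            result[f"F{band}"] = value
    return result
```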

Target Phi, Target Pitch, and Target Rate

These fields accept a list of comma separated values, e.g.,

Target Phi: 510, 550, 585

Target Pitch: 110, 145, 175

Target Rate: 3.8, 4.0, 4.2

These values are used in turn. If several fields have lists, they are used in parallel, that is, the first of each list (510, 110, 3.8), then the second of each list (550, 145, 4.0), then the third (585, 175, 4.2), etc. If one list ends before the others, the last value is reused. A value of "0" for any of these indicates that the corresponding reference value, φ, median F0, or articulation rate, will be used, i.e., no change.
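
The pairing of the three lists can be sketched in Python as follows. The function names are hypothetical, only numeric entries are handled (speaker labels are not), and treating an empty field as 0 is an assumption of this sketch.

```python
def parse_list(field):
    """Turn '510, 550, 585' into [510.0, 550.0, 585.0]."""
    return [float(item) for item in field.split(",") if item.strip()]


def target_combinations(phi_field, pitch_field, rate_field):
    """Walk the three lists in parallel, reusing the last value of short lists."""
    # Assumption: an empty field behaves like a single 0 (no change).
    lists = [parse_list(f) or [0.0] for f in (phi_field, pitch_field, rate_field)]
    longest = max(len(values) for values in lists)
    padded = [values + [values[-1]] * (longest - len(values)) for values in lists]
    return list(zip(*padded))


def resolve(target, reference_value):
    """A target of 0 means: keep the value measured for the reference."""
    return reference_value if target == 0 else target
```

With the fields "510, 550, 585", "110, 145" and "3.8", this yields three conversions: (510, 110, 3.8), (550, 145, 3.8) and (585, 145, 3.8).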

If a label of a speaker is entered in the Target Phi field, its profile will be copied into this and the other fields, including the Randomization fields. Effectively, the system will convert the input speech using the parameters of the target speaker. The Pitch and Rate fields should be empty (or 0) in this case.

The label Random will select a random speaker from the speaker profiles, different from the one that is converted. The labels RandomFemale and RandomMale work like Random, but they select random female or male speakers. The labels RandomSgender and RandomXgender select a random target speaker profile of the same gender as the source speaker or of a different gender, respectively. If no gender information is available, these latter labels work like Random. These labels can be followed by a "=" and a corpus name to limit selection to a single corpus. For example, RandomSgender=ASVspoof2019 will limit the selection to speakers of the same gender from the corpus called ASVspoof2019.
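
The selection rules for the Random labels can be summarized in a Python sketch. It is an illustration of the rules as described here, not the script's own logic: the function names are hypothetical, and profiles are assumed to be dicts with the keys Reference, Gender and Corpus from the speaker profile table.

```python
import random

KNOWN_GENDERS = {"f", "m"}


def candidate_targets(label, source, profiles):
    """Return the profiles that a Random* label may pick from."""
    name, _, corpus = label.partition("=")
    source_gender = (source.get("Gender") or "").lower()

    if name == "RandomFemale":
        wanted = {"f"}
    elif name == "RandomMale":
        wanted = {"m"}
    elif name == "RandomSgender" and source_gender in KNOWN_GENDERS:
        wanted = {source_gender}
    elif name == "RandomXgender" and source_gender in KNOWN_GENDERS:
        wanted = KNOWN_GENDERS - {source_gender}
    elif name in ("Random", "RandomSgender", "RandomXgender"):
        wanted = None  # plain Random, or no gender information available
    else:
        raise ValueError(f"not a Random label: {label!r}")

    pool = []
    for profile in profiles:
        if profile["Reference"] == source["Reference"]:
            continue  # never pick the speaker that is being converted
        if corpus and profile.get("Corpus") != corpus:
            continue  # "=corpus" limits the selection to one corpus
        if wanted is not None and (profile.get("Gender") or "").lower() not in wanted:
            continue
        pool.append(profile)
    return pool


def pick_target(label, source, profiles, rng=random):
    """Pick one target profile at random (fails if no profile qualifies)."""
    return rng.choice(candidate_targets(label, source, profiles))
```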

Randomize bands and Randomize intensity

A list of frequency bands, e.g., "F0, F3, F4, F5" for which the frequency or intensity is shifted by random amounts. Only a single such list can be entered. It is possible to fix the target values of these bands by entering the target φ value (neutral F₁, roughly between 500-600 Hz) or intensity (roughly between 45-70 dB) to hold for that band. For instance, Randomize bands "F0 = 550, F3=510, F4 = 570, F5=540" (Hz) or Randomize intensity "F0 = 65, F3=50, F4 = 43, F5=44" (dB).

Example values for the F0-F5 bands of a speaker could be frequencies of 550, 550, 530, 530, 510, 560 ±50 Hz with corresponding intensities of 64±4.5, 67±2.5, 58±4.5, 50±8, 47±10, 45±9 dB (±2SD). For randomization of frequency bands, ranges of φ±40 Hz and φ±75 Hz are used for F0-F1 and F2-F5, respectively. For randomization of intensities, the above example values are used, with the ±2SD values as ranges.
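
The band layout and the randomization ranges can be made concrete with a Python sketch. The band edges follow the definition given for the Randomize bands field, and the ranges are the ones quoted above. Drawing the random values from a uniform distribution is an assumption of this sketch, as are all function and constant names; the script may sample differently.

```python
import random

# Half-widths of the frequency randomization range around phi, in Hz.
FREQ_HALF_RANGE_HZ = {"F0": 40, "F1": 40, "F2": 75, "F3": 75, "F4": 75, "F5": 75}

# Example band intensities in dB as (centre, half-range), from the text above.
INTENSITY_DB = {"F0": (64, 4.5), "F1": (67, 2.5), "F2": (58, 4.5),
                "F3": (50, 8), "F4": (47, 10), "F5": (45, 9)}


def band_edges(phi):
    """Return the (low, high) edges in Hz of the bands F0-F5 for a given phi."""
    edges = {"F0": (0.0, phi / 2), "F1": (phi / 2, 2 * phi)}
    for n in range(2, 6):
        edges[f"F{n}"] = (2 * (n - 1) * phi, 2 * n * phi)
    return edges


def random_band_phi(phi, band, rng=random):
    """Draw a random phi for one band (uniform draw is an assumption)."""
    half = FREQ_HALF_RANGE_HZ[band]
    return rng.uniform(phi - half, phi + half)


def random_band_intensity(band, rng=random):
    """Draw a random intensity in dB for one band (uniform draw is an assumption)."""
    centre, half = INTENSITY_DB[band]
    return rng.uniform(centre - half, centre + half)
```

For φ = 500 Hz, band_edges places F0 at 0-250 Hz, F1 at 250-1000 Hz, and F5 at 4000-5000 Hz.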

A special option is to replace frequency bands by modulated pink noise. This option might be useful when it is desired to remove all information from certain spectral regions, e.g., for studying Automatic Speaker Identification. For example, Randomize bands "F4 = NOISE" will replace the band around F4 with modulated pink noise scaled to the original intensity. This will also work for F6-F9, e.g., "F7=NOISE". Note, φ frequency values or randomization are not supported for F6-F9.

The Ignore freq bands checkbox is supplied to suppress all individual frequency band and intensity changes. This option is most useful for testing the effectiveness of frequency band and intensity changes.

Exaggerated or caricature targets

The differences between the source and the target φ's, both φ, the bands φ₁-φ₅, and the intensity bands φ₀-φ₅, can be exaggerated, or diminished, by adding a multiplication factor between []-brackets to the target. For instance, when selecting random cross-gender targets, RandomXgender[1.5]=ASVspoof2019, the effective target will be:

φ_effective = φ_source + 1.5·(φ_target - φ_source).

For example, with φ_source = 500 Hz and φ_target = 560 Hz, a factor of 1.5 gives φ_effective = 500 + 1.5·60 = 590 Hz. A factor of 1 reproduces the target itself.

The exaggeration factor between the []-brackets can also be negative.
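
The target notation with an optional factor and corpus can be taken apart with a regular expression. The Python sketch below only illustrates the syntax shown in the example; the function name parse_target is hypothetical, and it returns None where no factor or corpus is given rather than assuming a default.

```python
import re

# name, optional [factor], optional =corpus, in that order
TARGET_RE = re.compile(
    r"^(?P<name>[^\[\]=]+?)"
    r"(?:\[(?P<factor>[-+]?\d*\.?\d+)\])?"
    r"(?:=(?P<corpus>.+))?$"
)


def parse_target(label):
    """Split a target label into (name, factor, corpus)."""
    match = TARGET_RE.match(label.strip())
    if match is None:
        raise ValueError(f"cannot parse target label: {label!r}")
    factor = match.group("factor")
    return (
        match.group("name"),
        float(factor) if factor is not None else None,
        match.group("corpus"),
    )
```

For instance, parse_target("RandomXgender[1.5]=ASVspoof2019") separates the label into the name RandomXgender, the factor 1.5 and the corpus ASVspoof2019.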

Note: The size of the changes in pitch and rate is capped to prevent deterioration of the sound quality. The caps can be lifted by setting the maximalPitchChange and maximalRateChange variables in the script to a high value.

Examples

Two examples are available:

Convert a single speech recording:

Source: Examples/Audio/F24I2PS27A_fm.aifc
Reference: F24I@Examples/SpeakerProfiles.csv
Target Phi: 500
Target Pitch: 120
Target Rate: 3.8
Target Directory: Examples/Pseudonymized
Randomize bands: F0, F3, F4, F5
Randomize intensity: F0, F3, F4, F5
Output format: WAV
Remove pauses: no
Ignore freq bands: no
Switch F4 F5: no

Convert a list of speech recordings in a more complex manner:

Source: Examples/ControlPseudonymization.csv
Reference: Examples/SpeakerProfiles.csv
Target Phi: 500
Target Pitch: 120
Target Rate: 3.8
Target Directory: Examples/Pseudonymized
Randomize bands: F0, F3, F4, F5
Randomize intensity: F0, F3, F4, F5
Output format: WAV
Remove pauses: no
Ignore freq bands: no
Switch F4 F5: no

References

[1] Lammert AC, Narayanan SS. On Short-Time Estimation of Vocal Tract Length from Formant Frequencies. PLOS ONE. 2015 Jul 15;10(7):e0132193.

The vocal tract length (VTL) is calculated as: VTL = 100 · 352.95 / (4 · φ) cm. Note that a different formant tracking algorithm is used in Pseudonymize Speech than was used in [1] and the estimated φ values here are used only as scale factors. They should not be interpreted as corresponding to "real" VTL values.
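
The conversion from φ to a nominal vocal tract length is a one-line calculation, sketched here in Python. The function and constant names are hypothetical, and reading the constant 352.95 as a speed of sound in m/s is an interpretation of the formula above. As noted, the result is a scale factor and not a measured VTL.

```python
SPEED_OF_SOUND_M_PER_S = 352.95  # constant from the formula above


def vtl_cm(phi_hz):
    """Nominal vocal tract length in cm for a neutral F1 (phi) in Hz."""
    return 100 * SPEED_OF_SOUND_M_PER_S / (4 * phi_hz)


if __name__ == "__main__":
    for phi in (500, 535, 560):
        print(f"phi = {phi} Hz -> VTL = {vtl_cm(phi):.2f} cm")
```

A φ of 500 Hz corresponds to a nominal VTL of about 17.65 cm; higher φ values give shorter nominal lengths.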