Welcome to the demonstration of the StyleTSE model, a text-guided target speaker extraction model trained on the TextrolMix dataset.
💡💡 The StyleTSE model takes a single text clue that describes the speaking style of the target speech. It handles text clues of various lengths.

| Mixture | Text Clue | Estimate Target | True Target |
|---|---|---|---|
| | "With excitement in her nauseated tone, she speaks energetically with a high-pitched voice." | | |
| | "The man conveys his message energetically, speaking rapidly and with a low voice. " | | |
| | "The man addresses the audience with an ordinary pitch, talking at a regular speed with normal energy." | | |

| Mixture | Text Clue | Estimate Target | True Target |
|---|---|---|---|
| | "A sad speaker in a high pitch" | | |
| | "He has a low voice. " | | |
| | "The man sounds cheerful." | | |

| Mixture | Text Clue | Estimate Target | True Target |
|---|---|---|---|
| | "Voice pitch is sharp" | | |
| | "Speaks at a quick pace." | | |
| | "British speaker." | | |

🌻🌻 The StyleTSE model can also take a reference audio clip together with a text prompt, and extract the speech that matches the reference in a given style attribute, such as emotion, accent, pitch, gender, or speaker identity.

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Emotion Class |
|---|---|---|---|---|---|
| | | "Isolate the voice that echoes the enroll's emotion." | | | Sad |
| | | "Select the voice with a similar emotional tone." | | | Angry |
| | | "Separate the speech with a similar mood to the clue." | | | Happy |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Accent Class |
|---|---|---|---|---|---|
| | | "Keep only the accent from the enroll." | | | American |
| | | "Extract the same accented speech." | | | British |
| | | "Extract speech with similar accent" | | | Scottish |
| | | "Identify same accent as the audio, should be newzealand." | | | New Zealand |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target |
|---|---|---|---|---|
| | | "Identify and enhance the same speaker." | | |
| | | "Filter out all but the identical speaker." | | |
| | | N/A | | |
| | | N/A | | |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Gender Class |
|---|---|---|---|---|---|
| | | "Select gender-consistent audio." | | | Male |
| | | "Separate by sex similarity." | | | Female |
| | | "Retain voice with same gender." | | | Female |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Pitch Class |
|---|---|---|---|---|---|
| | | "Extract similar pitched speaker to the clue." | | | High |
| | | "Extract similar pitched people to the enroll." | | | Low |