Welcome to the demonstration of StyleTSE, a text-guided target speaker extraction model trained on the TextrolMix dataset.

💡💡 StyleTSE takes a single text clue that describes the speaking style of the target speech, and it handles text clues of various lengths.

| Mixture | Text Clue | Estimate Target | True Target |
|---|---|---|---|
| | "With excitement in her nauseated tone, she speaks energetically with a high-pitched voice." | | |
| | "The man conveys his message energetically, speaking rapidly and with a low voice. " | | |
| | "The man addresses the audience with an ordinary pitch, talking at a regular speed with normal energy." | | |

| Mixture | Text Clue | Estimate Target | True Target |
|---|---|---|---|
| | "A sad speaker in a high pitch" | | |
| | "He has a low voice. " | | |
| | "The man sounds cheerful." | | |

| Mixture | Text Clue | Estimate Target | True Target |
|---|---|---|---|
| | "Voice pitch is sharp" | | |
| | "Speaks at a quick pace." | | |
| | "British speaker." | | |

🌻🌻 StyleTSE can also take a reference audio clip together with a text prompt, and extract the speech that matches the reference in the style named by the prompt: emotion, accent, pitch, gender, or speaker identity.

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Emotion Class |
|---|---|---|---|---|---|
| | | "Isolate the voice that echoes the enroll's emotion." | | | Sad |
| | | "Select the voice with a similar emotional tone." | | | Angry |
| | | "Separate the speech with a similar mood to the clue." | | | Happy |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Accent Class |
|---|---|---|---|---|---|
| | | "Keep only the accent from the enroll." | | | American |
| | | "Extract the same accented speech." | | | British |
| | | "Extract speech with similar accent" | | | Scottish |
| | | "Identify same accent as the audio, should be newzealand." | | | New Zealand |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target |
|---|---|---|---|---|
| | | "Identify and enhance the same speaker." | | |
| | | "Filter out all but the identical speaker." | | |
| | | N/A | | |
| | | N/A | | |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Gender Class |
|---|---|---|---|---|---|
| | | "Select gender-consistent audio." | | | Male |
| | | "Separate by sex similarity." | | | Female |
| | | "Retain voice with same gender." | | | Female |

| Mixture | Reference Audio | Text Prompt | Estimate Target | True Target | Pitch Class |
|---|---|---|---|---|---|
| | | "Extract similar pitched speaker to the clue." | | | High |
| | | "Extract similar pitched people to the enroll." | | | Low |