StyleTSE Demo

Welcome to the demonstration of StyleTSE, a text-guided target speaker extraction model trained on the TextrolMix dataset.

💡💡 The StyleTSE model takes a single text clue that describes the speaking style of the target speech. It handles text clues of various lengths.

Text Description - Long

Mixture | Text Clue | Estimated Target | True Target
[audio] | "With excitement in her nauseated tone, she speaks energetically with a high-pitched voice." | [audio] | [audio]
[audio] | "The man conveys his message energetically, speaking rapidly and with a low voice." | [audio] | [audio]
[audio] | "The man addresses the audience with an ordinary pitch, talking at a regular speed with normal energy." | [audio] | [audio]

Text Description - Mid

Mixture | Text Clue | Estimated Target | True Target
[audio] | "A sad speaker in a high pitch" | [audio] | [audio]
[audio] | "He has a low voice." | [audio] | [audio]
[audio] | "The man sounds cheerful." | [audio] | [audio]

Text Description - Short

Mixture | Text Clue | Estimated Target | True Target
[audio] | "Voice pitch is sharp" | [audio] | [audio]
[audio] | "Speaks at a quick pace." | [audio] | [audio]
[audio] | "British speaker." | [audio] | [audio]

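🌻🌻 The StyleTSE model can also take a reference audio clip together with a text prompt. The prompt names the attribute to match, and the model extracts the speech in the mixture that matches the reference in that attribute: emotion, accent, pitch, gender, or speaker identity.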

Emotion

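Mixture | Reference Audio | Text Prompt | Estimated Target | True Target | Emotion Class
[audio] | [audio] | "Isolate the voice that echoes the enroll's emotion." | [audio] | [audio] | Sad
[audio] | [audio] | "Select the voice with a similar emotional tone." | [audio] | [audio] | Angry
[audio] | [audio] | "Separate the speech with a similar mood to the clue." | [audio] | [audio] | Happy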

Accent

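Mixture | Reference Audio | Text Prompt | Estimated Target | True Target | Accent Class
[audio] | [audio] | "Keep only the accent from the enroll." | [audio] | [audio] | American
[audio] | [audio] | "Extract the same accented speech." | [audio] | [audio] | British
[audio] | [audio] | "Extract speech with similar accent" | [audio] | [audio] | Scottish
[audio] | [audio] | "Identify same accent as the audio, should be newzealand." | [audio] | [audio] | New Zealand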

Speaker Identity

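Mixture | Reference Audio | Text Prompt | Estimated Target | True Target
[audio] | [audio] | "Identify and enhance the same speaker." | [audio] | [audio]
[audio] | [audio] | "Filter out all but the identical speaker." | [audio] | [audio]
[audio] | [audio] | N/A | [audio] | [audio]
[audio] | [audio] | N/A | [audio] | [audio]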

Gender

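Mixture | Reference Audio | Text Prompt | Estimated Target | True Target | Gender Class
[audio] | [audio] | "Select gender-consistent audio." | [audio] | [audio] | Male
[audio] | [audio] | "Separate by sex similarity." | [audio] | [audio] | Female
[audio] | [audio] | "Retain voice with same gender." | [audio] | [audio] | Female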

Pitch

Mixture | Reference Audio | Text Prompt | Estimated Target | True Target | Pitch Class
[audio] | [audio] | "Extract similar pitched speaker to the clue." | [audio] | [audio] | High
[audio] | [audio] | "Extract similar pitched people to the enroll." | [audio] | [audio] | Low