StyleTSE Demo

Welcome to the demonstration of StyleTSE, a text-guided target speaker extraction model trained on the TextrolMix dataset.

💡💡 The StyleTSE model takes a single text clue that describes the speaking style of the target speech. It handles text inputs of various lengths.

Text Description - Long

Mixture	Text Clue	Estimate Target	True Target
	"With excitement in her nauseated tone, she speaks energetically with a high-pitched voice."
	"The man conveys his message energetically, speaking rapidly and with a low voice. "
	"The man addresses the audience with an ordinary pitch, talking at a regular speed with normal energy."

Text Description - Mid

Mixture	Text Clue	Estimate Target	True Target
	"A sad speaker in a high pitch"
	"He has a low voice. "
	"The man sounds cheerful."

Text Description - Short

Mixture	Text Clue	Estimate Target	True Target
	"Voice pitch is sharp"
	"Speaks at a quick pace."
	"British speaker."

🌻🌻 The StyleTSE model can also take a reference audio clip together with a text prompt and extract the speech whose style matches the reference. Matched styles include emotion, accent, pitch, gender, and speaker identity.

Emotion

Mixture	Reference Audio	Text Prompt	Estimate Target	True Target	Emotion Class
		"Isolate the voice that echoes the enroll's emotion."			Sad
		"Select the voice with a similar emotional tone."			Angry
		"Separate the speech with a similar mood to the clue."			Happy

Accent

Mixture	Reference Audio	Text Prompt	Estimate Target	True Target	Accent Class
		"Keep only the accent from the enroll."			American
		"Extract the same accented speech."			British
		"Extract speech with similar accent"			Scottish
		"Identify same accent as the audio, should be newzealand."			New Zealand