Clone a voice from an audio clip. The clip should be a 15 to 20 second recording of one person speaking, with little to no background noise.
The endpoint returns an embedding that can be passed directly to the text-to-speech endpoints or used to create a new voice.
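As a rough illustration of the length guidance above, a client could check a clip's duration before uploading it. This is only a sketch: it assumes the clip is an uncompressed WAV file readable by Python's standard `wave` module, and the helper names `clip_duration_seconds` and `is_recommended_length` are hypothetical, not part of the API.

```python
import wave

# Recommended clip length in seconds, taken from the description above.
MIN_SECONDS = 15.0
MAX_SECONDS = 20.0


def clip_duration_seconds(source) -> float:
    """Return the duration of an uncompressed WAV clip in seconds.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(seconds: float) -> bool:
    """Return True if the duration falls within the recommended 15 to 20 seconds."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

A clip outside this range is not necessarily rejected by the endpoint; the check only mirrors the recommendation in the description.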
Whether to enhance the clip before cloning to improve its audio quality. Useful when the clip is low quality.
A 192-dimensional vector (i.e. a list of 192 numbers) that represents the voice.
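Because the embedding is described as a list of 192 numbers, a client may want to sanity-check it before passing it on. The sketch below assumes the embedding has already been parsed into a Python sequence; `validate_embedding` is a hypothetical helper, and the finiteness check is an extra precaution rather than something the API documents.

```python
import math
from typing import Sequence

# Dimensionality stated in the field description above.
EMBEDDING_DIM = 192


def validate_embedding(embedding: Sequence[float]) -> list[float]:
    """Check that a voice embedding has 192 finite numeric entries.

    Returns the values as a list of floats, or raises ValueError.
    Hypothetical helper for illustration only.
    """
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} values, got {len(embedding)}"
        )
    values = [float(x) for x in embedding]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return values
```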