Voice Clone Has Unnatural Breathing and Pauses

Your AI voice clone speaks the words but breathes in the wrong places, takes weird mid-word pauses, or has no breath at all — usually a punctuation and pacing problem.

Your ElevenLabs (or PlayHT, or Hume) clone reads your script and the words are correct, the voice is yours, but it sounds wrong. The clone takes a breath mid-clause where no human would. Or it powers through three sentences without breathing and you can hear the model running out of air. Or it pauses awkwardly between “the” and “computer”. Modern TTS voice clones are good at timbre and emotion but their breath modeling is fragile, and it is highly sensitive to your input text. Most “unnatural breathing” complaints can be fixed by re-punctuating the script, not by switching models.

Common causes

Ordered by what most often fixes the problem.

1. Script lacks commas and short sentences

The model uses punctuation as breath cues. A script written as one long paragraph with no commas tells the model “speak this in one breath” — which is impossible — so it inserts breaths at semantically wrong points.

How to spot it: Read your script aloud. If you naturally breathe at points that have no punctuation, the model lacks a cue at those points.
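
The read-aloud check can be approximated in code. The following is a minimal sketch, not part of any TTS SDK: the function name long_unpunctuated_runs and the 12-word threshold are assumptions chosen for illustration, so tune the threshold to your own speaking style.

```python
import re

def long_unpunctuated_runs(script: str, max_words: int = 12) -> list[str]:
    """Return stretches of text with more than max_words words and no punctuation."""
    # Split on punctuation that a TTS engine is likely to treat as a pause cue.
    runs = re.split(r"[.,;:!?()]+", script)
    flagged = []
    for run in runs:
        words = run.split()
        if len(words) > max_words:
            flagged.append(" ".join(words))
    return flagged

before = ("After we finished the report we sent it to the client who asked us "
          "to come back the next day to present the findings in person.")
after = ("After we finished the report, we sent it to the client. She asked us "
         "to come back the next day, to present the findings in person.")

print(long_unpunctuated_runs(before))  # one flagged run: the whole sentence
print(long_unpunctuated_runs(after))   # []
```

Any run it returns is a candidate for a comma or a sentence split.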

2. Reference audio had no breaths (or all breaths)

Voice clone models learn breath patterns from your reference. If your reference is 30 seconds of clean, edited speech with breaths removed (typical for marketing samples), the clone learns “no breath” as the natural state — and it pushes that to comical lengths in long form.

How to spot it: Listen to your reference audio. Count the breaths. If there are zero in 30+ seconds, that is the issue.

3. SSML breaks not used (or used wrong)

Some TTS APIs accept SSML tags like <break time="500ms"/> or <break strength="medium"/>. If you are sending plain text to an SSML-aware engine you lose pacing control. If you are sending SSML to an engine that does not parse it, the tags become spoken text.

How to spot it: Check your provider’s API docs for SSML support. Test with one <break/> and listen for either a pause (good) or the model saying “break time five hundred milliseconds” (bad — SSML not supported).

4. Sentence too long for the model’s context

Many TTS models process speech in chunks of about 30-60 seconds. A single sentence that runs 90 seconds spans more than one chunk, so the model has to guess where to breathe, and it often places breaths at the chunk boundaries regardless of meaning.

How to spot it: Find the exact word where the weird breath happens. If it lands at about 30 seconds into a sentence, you hit a chunk boundary.
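
A rough way to find sentences that may cross a chunk boundary is to estimate spoken length from word count. This sketch assumes a narration pace of 150 words per minute and uses the 30-second lower bound mentioned above; both constants and the function name are illustrative assumptions, so measure your own clone's pace before relying on the numbers.

```python
import re

WORDS_PER_MINUTE = 150   # assumed narration pace; measure your own clone
CHUNK_SECONDS = 30       # lower end of the chunk range described above

def sentences_over_chunk(script: str) -> list[tuple[float, str]]:
    """Return (estimated_seconds, sentence) for sentences likely to span a chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    flagged = []
    for sentence in sentences:
        seconds = len(sentence.split()) / WORDS_PER_MINUTE * 60
        if seconds > CHUNK_SECONDS:
            flagged.append((round(seconds, 1), sentence))
    return flagged
```

Sentences it returns are the ones to split before sending the script.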

5. Stability setting is too high

In ElevenLabs and similar tools, “Stability” controls how much the voice can vary in prosody. Maxed-out stability produces flat, monotone speech with almost no breaths, because the variation that natural breathing depends on is suppressed.

How to spot it: Lower stability to 0.4-0.5 and re-generate. If breathing returns to natural, stability was too high.

6. Style exaggeration is too high

ElevenLabs’ “Style” parameter pushes the voice toward more dramatic / emotional delivery. At 0.8+, it can introduce gasps, sighs, and theatrical breaths that sound unnatural in narration.

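How to spot it: Lower Style to about 0.3 and re-generate. If the gasps and sighs disappear, Style was too high. As a rough guide, 0 produces a flat read, 0.5 produces natural emotion, and values above 0.7 produce melodramatic breathing.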

7. Script uses unusual orthography (ALL CAPS, ALL.PERIODS)

Writing “WE WILL FIGHT THEM” in caps makes the model emphasize each word, and it tends to insert micro-breaths between emphasized words. “We. will. fight. them.” with periods has the same effect.

How to spot it: Convert to sentence case with one ending period. Re-generate. If breathing normalizes, formatting was the cause.
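
For longer scripts the conversion can be automated. The sketch below is a heuristic under stated assumptions: normalize_orthography is a hypothetical helper, it only lowercases text that is entirely upper case, and it treats three or more consecutive one-word "sentences" as a single sentence. Proper nouns and the pronoun "I" need re-capitalizing by hand afterwards.

```python
import re

def normalize_orthography(text: str) -> str:
    """Convert shouted or period-separated text to plain sentence case."""
    # Join runs of three or more one-word "sentences":
    # "We. will. fight. them." -> "We will fight them."
    def join_run(match: re.Match) -> str:
        words = re.findall(r"[A-Za-z']+", match.group(0))
        return " ".join(words) + "."

    text = re.sub(r"(?:\b[A-Za-z']+\.\s+){2,}[A-Za-z']+\.", join_run, text)

    # Lowercase text that is entirely upper case, then capitalize the first letter.
    if text.isupper():
        text = text.lower()
        text = text[:1].upper() + text[1:]
    return text

print(normalize_orthography("We. will. fight. them."))  # We will fight them.
print(normalize_orthography("WE WILL FIGHT THEM"))      # We will fight them
```

The heuristic can also merge a genuine series of one-word sentences, so review the result before generating.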

Before you start

  • Save your current script and reference audio. You may need to compare.
  • Identify which TTS engine and which model version you are on. SSML support varies.
  • Decide whether the problem is “too many breaths in wrong places” or “no breaths at all”. The fixes are different.

Information to collect

  • Full script text (the exact bytes sent to the API).
  • Reference audio file used for cloning, and its duration.
  • Engine name (ElevenLabs Multilingual v2, PlayHT 3.0, etc.) and version.
  • Stability / Similarity / Style slider values.
  • Whether you used SSML and which tags.
  • The timestamp(s) in the output where the unnatural breath occurs.

Step-by-step fix

Step 1: Re-punctuate the script for natural breath cues

Add commas at every natural pause:

BEFORE: After we finished the report we sent it to the client
        who asked us to come back the next day to present the
        findings in person.

AFTER:  After we finished the report, we sent it to the client.
        She asked us to come back the next day, to present the
        findings in person.

Shorter sentences with commas give the model 3-4 breath slots in a 5-second span. The model now picks one of them rather than guessing.

Step 2: Add explicit SSML breaks where you want a beat

For SSML-supported engines:

<speak>
  After we finished the report, <break time="400ms"/>
  we sent it to the client. <break time="600ms"/>
  She asked us to come back the next day.
</speak>

400ms is a short breath; 600ms is a sentence break; 1000ms is a beat for emphasis.
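
If the script is already well punctuated, the breaks can be derived from the punctuation. This is a minimal sketch that assumes your engine accepts a <speak> wrapper and <break time="..."/> tags; to_ssml and the two constants are illustrative names, and abbreviations such as "Dr." will also receive a sentence break unless you handle them separately.

```python
import re
from xml.sax.saxutils import escape

COMMA_BREAK_MS = 400     # "short breath" from the guideline above
SENTENCE_BREAK_MS = 600  # "sentence break" from the guideline above

def to_ssml(text: str) -> str:
    """Wrap plain text in <speak> and add a <break/> after commas and sentence ends."""
    body = escape(text.strip())  # escape &, <, > so the result stays well-formed XML
    # Breaks are inserted only where more text follows, so there is no trailing break.
    body = re.sub(r",\s+", f', <break time="{COMMA_BREAK_MS}ms"/> ', body)
    body = re.sub(r"([.!?])\s+", rf'\1 <break time="{SENTENCE_BREAK_MS}ms"/> ', body)
    return f"<speak>{body}</speak>"

print(to_ssml("After we finished the report, we sent it to the client. "
              "She asked us to come back the next day."))
```

Send one short test sentence first to confirm the engine pauses rather than reading the tags aloud.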

Step 3: Re-record reference audio with natural breathing

If your clone learned “no breath” from a clean reference, record a 60-90 second sample that includes natural pauses and audible breaths. Read a paragraph of normal prose conversationally. Do not over-edit.

- 60 to 90 seconds long
- Multiple sentences
- Natural breaths between sentences left in (do not remove)
- No background music or noise
- Same mic / room as your target use case

Re-train the clone with this reference. Breath behavior usually improves noticeably.

Step 4: Lower stability to allow prosody variation

In the API or UI:

{
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0.3
}

These are good defaults for natural narration. Stability above 0.7 flattens speech; below 0.3 over-varies it.
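
Before a batch run it can help to check a preset against these ranges. The sketch below is a hypothetical helper, not part of any vendor SDK; the key names follow the JSON above and the thresholds are the ones stated in this guide.

```python
def check_voice_settings(settings: dict) -> list[str]:
    """Return warnings for values outside the ranges this guide recommends."""
    warnings = []
    stability = settings.get("stability")
    style = settings.get("style")
    if stability is not None:
        if stability > 0.7:
            warnings.append("stability above 0.7: expect flat delivery with few breaths")
        elif stability < 0.3:
            warnings.append("stability below 0.3: expect over-varied prosody")
    if style is not None and style > 0.7:
        warnings.append("style above 0.7: expect theatrical breaths")
    return warnings

print(check_voice_settings({"stability": 0.5, "similarity_boost": 0.75, "style": 0.3}))  # []
```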

Step 5: Break long passages into chunks at sentence boundaries

If your script is 5 minutes, do not send it as one API call. Split at sentence boundaries:

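import re

# Split after sentence-ending punctuation. tts and voice_id are placeholders for your engine's SDK.
chunks = re.split(r'(?<=[.!?])\s+', script.strip())
for i, chunk in enumerate(chunks):
    audio = tts.generate(text=chunk, voice=voice_id)  # add your voice settings here
    audio.export(f"chunk_{i}.mp3")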

Concatenate the resulting audio files in your DAW or with ffmpeg. The model now has fresh context for each chunk and breath placement improves.
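
For the ffmpeg route, one option is the concat demuxer, which reads a text file listing the inputs. This sketch only writes that list and builds the command; concat_command, chunks.txt, and narration.mp3 are illustrative names, and it assumes all chunks share the same codec settings and that file names contain no single quotes.

```python
from pathlib import Path

def concat_command(chunk_paths: list[str], list_file: str = "chunks.txt",
                   output: str = "narration.mp3") -> list[str]:
    """Write an ffmpeg concat list and return the command that joins the chunks."""
    # Paths are written in the order given, so pass them in generation order.
    lines = [f"file '{Path(p).as_posix()}'" for p in chunk_paths]
    Path(list_file).write_text("\n".join(lines) + "\n", encoding="utf-8")
    # -c copy joins without re-encoding; it needs chunks with identical codec settings.
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output]
```

Run the returned list with subprocess.run(cmd, check=True). Building the paths from a counter, for example [f"chunk_{i}.mp3" for i in range(n)], keeps the order correct where an alphabetical directory listing would put chunk_10 before chunk_2.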

Step 6: Use the engine’s “speech rate” instead of fighting breaths

If breathing feels rushed, the underlying issue is often speech rate. Lower rate from 1.0 to 0.92:

{"speed": 0.92}

At a slightly slower rate, the model has more time to place breaths naturally.
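
To budget for the longer output, the change in duration is simple arithmetic. This sketch assumes the engine treats the speed value as a linear multiplier on speaking rate, which may not hold exactly for every provider; duration_at_speed is an illustrative name.

```python
def duration_at_speed(seconds_at_normal: float, speed: float) -> float:
    """Estimated clip length when the engine's speed multiplier is applied."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return seconds_at_normal / speed

print(round(duration_at_speed(60.0, 0.92), 1))  # 65.2
```

Under that assumption, a one-minute read gains roughly five seconds at 0.92.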

Step 7: Post-process to clean residual artifacts

If one stubborn breath sounds wrong:

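# List pauses with timestamps (breaths usually sit in or next to them; tune the threshold)
ffmpeg -i out.mp3 -af "silencedetect=noise=-35dB:d=0.15" -f null - 2>&1 | grep silence_
# Attenuate a 200ms region starting at 12.4s, keeping the rest of the file intact
ffmpeg -i out.mp3 -af "volume=0.2:enable='between(t,12.4,12.6)'" out_fixed.mp3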

Or in a DAW: locate the breath, attenuate by 6 dB, or replace with a recorded breath sample of your own voice.
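
The ffmpeg volume filter takes a linear multiplier while DAWs show decibels, so it helps to convert between the two. The sketch below uses the standard amplitude relation, gain = 10^(dB/20); the function names are illustrative.

```python
import math

def db_to_gain(db: float) -> float:
    """Convert a decibel change to a linear amplitude multiplier."""
    return 10 ** (db / 20)

def gain_to_db(gain: float) -> float:
    """Convert a linear amplitude multiplier to decibels."""
    return 20 * math.log10(gain)

print(round(db_to_gain(-6), 3))   # 0.501
print(round(gain_to_db(0.2), 1))  # -14.0
```

So volume=0.2 is a cut of about 14 dB, noticeably stronger than a 6 dB attenuation, which corresponds to volume=0.5.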

Verify

  • Listen to a 60-second sample of the new output. Breath placement should align with sentence and clause boundaries.
  • Compare against a 60-second reference of your own natural speech. Breath count should be within ±2.
  • Loop the segment that originally had the unnatural pause. The pause should now either be gone or fall on a natural beat in the phrase.

Long-term prevention

  • Write scripts in conversational sentence structure with explicit punctuation; this is the single biggest factor.
  • Use SSML for any production-quality work; do not rely on the model to guess pacing.
  • Re-record voice clone references every few months — model versions change and your clone’s behavior shifts.
  • Keep stability around 0.5 for narration, lower (0.3) for conversational, higher (0.7) only for read-from-paper formal tone.
  • Chunk long scripts at sentence boundaries; never send a 10-minute monologue in one call.
  • Save a “known-good” reference audio + script + parameter preset and benchmark new model versions against it.
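
A known-good preset can be stored as a small JSON file. This is a minimal sketch with hypothetical field names; the SHA-256 hash is included so you can later confirm that the benchmark script is byte-for-byte the same text.

```python
import hashlib
import json

def make_preset(script: str, reference_file: str, engine: str, settings: dict) -> str:
    """Serialize a benchmark preset as JSON, with a hash that pins the exact script bytes."""
    preset = {
        "engine": engine,
        "reference_file": reference_file,
        "settings": settings,
        "script_sha256": hashlib.sha256(script.encode("utf-8")).hexdigest(),
        "script": script,
    }
    return json.dumps(preset, indent=2, sort_keys=True)
```

Write the returned string to a file next to the reference audio, and re-generate from it whenever the provider ships a new model version.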

Common pitfalls

  • Writing scripts in marketing-copy register with no commas. (“We are the trusted leader in AI productivity helping teams unlock potential delivering value every day.”)
  • Cloning from a 5-second voice sample, which is far too short for prosody learning. Use 60-90 seconds minimum.
  • Cranking style to 1.0 because it sounds more emotional in a 10-second test. Over a 5-minute narration it becomes exhausting and breathy.
  • Sending an entire chapter of an audiobook as one API call.
  • Adding SSML tags to an engine that does not parse them and producing audio that literally says “break time 400 milliseconds”.
  • Trying to remove all natural breaths in post — breath-less speech sounds robotic.

FAQ

Q: My clone never breathes. Is breath suppression a feature?

It is a side effect of clean training data. Re-record the reference with natural breaths included, or accept that this clone is best for short clips only.

Q: Why does the same script sound natural in one model and unnatural in another?

Each model’s breath inference is trained differently. Switching from ElevenLabs to PlayHT (or vice versa) can fix breathing for one script and break another. Test critical scripts on both.

Q: Can I record my own breath sounds and splice them in?

Yes — record a few of your own breaths at the same mic/room as the script and use them as inserts in a DAW. For long-form production work this is a common pro workflow.

Q: Does multilingual voice clone breathing differ across languages?

Yes. Breath placement is language-specific and the model may not transfer it perfectly. A clone trained on English and asked to speak Mandarin will often breathe at English clause boundaries that do not match Mandarin prosody. Re-train with a sample in the target language.
