Automatically Generated Captions

Status: This is an incomplete, unapproved draft. The current draft is at wai-media-guide.netlify.com/

The basics of automatic captions

Tools exist today that use speech-to-text (STT) technology to turn a program's soundtrack into a timed caption file, ready for inclusion with the corresponding video. In fact, most videos uploaded to YouTube are captioned by Google's automatic-captioning process, which many authors do not realize. Automatic captions are available in a number of languages. However, their accuracy is frequently quite low, and the resulting poor-quality captions often contain:

  • text that does not match words spoken in the audio;
  • poor timing (e.g., captions that do not appear synchronously with the audio);
  • spelling errors;
  • little or no punctuation;
  • missing capitalization;
  • occasional obscenities (swear words, for example).
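
Some of the problems listed above, such as missing punctuation and capitalization, can be detected mechanically before a human review. The following is a minimal sketch, not a definitive tool. It assumes the caption track is available as SubRip (SRT) text, and the function names `cue_texts` and `quality_flags` are hypothetical names chosen for illustration. It cannot detect wrong words or poor timing; those still need a person comparing the captions against the audio.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def cue_texts(srt):
    """Return the caption text of each cue in an SRT document (assumed format)."""
    texts = []
    # Cues in SRT are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if TIMING.search(line):
                # Everything after the timing line is caption text.
                texts.append(" ".join(lines[i + 1:]).strip())
                break
    return texts


def quality_flags(srt):
    """Return the fraction of cues that contain punctuation and capital letters."""
    texts = [t for t in cue_texts(srt) if t]
    if not texts:
        return {"punctuated": 0.0, "capitalized": 0.0}
    punctuated = sum(1 for t in texts if re.search(r"[.,?!;:]", t))
    capitalized = sum(1 for t in texts if any(c.isupper() for c in t))
    return {
        "punctuated": punctuated / len(texts),
        "capitalized": capitalized / len(texts),
    }
```

Values near zero for both fractions suggest a track that has not yet been edited. High values do not prove the captions are accurate; they only show that punctuation and capitalization are present.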

Using automatic captions responsibly

Automatically generated captions should never be used as the sole method of producing captions. However, they can serve as a first pass or rough draft in a workflow that leads to an accurate, high-quality caption track. Below is a sample workflow for using auto-captions as part of the caption-production process, using YouTube's auto-caption service as an example.

Example:
  1. Upload a video to YouTube.
  2. Generate automatic captions.
  3. Download the track.
  4. Using a caption editor, correct transcription, spelling, grammar, punctuation, capitalization and timing errors.
  5. Export the cleaned-up caption file to the appropriate caption format for YouTube.
  6. Upload the new caption file to YouTube.

Once an accurate caption track has been uploaded, disable the automatic-caption track.
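
Timing corrections in step 4 are normally made in a caption editor, but a uniform offset (every caption appearing a fixed amount too early or too late) can also be fixed with a short script. The sketch below assumes the downloaded track is in SubRip (SRT) format. The names `shift_timestamp` and `shift_srt` are hypothetical, and the approach only handles a constant offset, not drift or per-cue errors.

```python
import re

# Matches a single SRT timestamp such as "00:01:02,345".
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_timestamp(match, offset_ms):
    """Return the matched timestamp moved by offset_ms, clamped at zero."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = (h * 3600 + m * 60 + s) * 1000 + ms + offset_ms
    total = max(total, 0)  # a caption cannot start before the video does
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_srt(srt, offset_ms):
    """Shift every cue in an SRT document by offset_ms milliseconds."""
    out = []
    for line in srt.splitlines():
        # Only timing lines contain "-->"; caption text is left untouched.
        if "-->" in line:
            line = TIMESTAMP.sub(lambda mt: shift_timestamp(mt, offset_ms), line)
        out.append(line)
    return "\n".join(out)
```

For example, `shift_srt(track, 500)` delays every caption by half a second, and a negative offset makes captions appear earlier. The result should still be checked against the audio before the file is uploaded.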

These tutorials provide best-practice guidance on implementing accessibility in different situations. This page combines the following WCAG 2.0 success criteria and techniques from different conformance levels:

Success Criteria:

  • 1.2.2 Captions (Prerecorded): Captions are provided for all prerecorded audio content in synchronized media, except when the media is a media alternative for text and is clearly labeled as such. (Level A)