Dealing With Difficulties In Audio And Video Transcription


Language services have two major components: translation and interpreting. Each has several sub-branches, with specialized services for different subject areas. One sub-branch under translation is transcription, which is one of the fastest-growing jobs in the United States.

Transcription converts audio and video recordings into text. Audio files are commonly in MP3 or WAV format, while video files come in formats such as AVI, MP4, FLV, MOV, or WMV. Transcripts serve many purposes, from medical to legal to business. They can be used as evidence in a trial or as a reference for voice recordings and translation.

In transcription work, the ideal scenario is to have recordings where the audio is clear and audible and in the same language, even if there are multiple people in the recording. But that is not always the case. There will always be instances when the recordings are complicated.

Transcription is difficult to begin with, and it becomes harder when several languages are heard in one recording and people speak in a hurried, tense manner.

Complications in transcription

A good transcriptionist is trained to expect complexities in transcription work. Timestamps are required to show the exact time when something is spoken. The transcript also covers more than the words of the main speakers: almost everything within the recording is transcribed, including background noise, coughing, laughter, and anything else heard or spoken in the background.

Transcription covers every element that is captured in the recording, including the elements that make transcription difficult.

Background noise includes strong wind, other people screaming or talking, sirens, and traffic sounds. These compete with the voices of the main speakers and slow the transcription work down. In most cases, the transcriptionist marks a passage as inaudible when it is no longer possible to understand what is being said.

Speakers using other languages also make transcription difficult. In a report or interview, for example, one person may speak one language while another speaks a different one, so a translator is needed. In that case, more than one person may be needed to transcribe the recording.

Slang or a strong accent is another element that complicates the transcribing process. Even if only one language is used, the speaker’s accent or slang can pose a challenge to the transcriptionist. Ideally, the speakers would have a neutral vocabulary and accent.

The volume and speed of the conversation also affect the process of transcription. The faster the speaker talks, the slower the transcription becomes, because the transcriptionist has to rewind the recording several times to pick up all the audio. Even if the transcriber is a quick typist, it takes time to understand what’s being said and turn it into text. The volume of the speaker’s voice is also important. If it is very low and quiet, it can be hard to pick up what is being said.

Role of the transcriber

In the early days of transcription service, the transcriber had to use shorthand to write down everything heard in the recording before cleaning it up and typing it. Today, transcribers use computers, foot pedals, and professional transcribing applications.

The audio or video file can be sent online through email or other file sharing applications. These make it easier for transcriptionists to download the file, load it into the professional software, and start typing the transcription.

The transcriber will add the proper punctuation marks, new paragraphs, and full stops.

A professional transcriber who touch-types normally manages about 75 words per minute. On that basis, the industry standard for transcribing a 60-minute audio or video recording is about 4 to 5 hours of work at a minimum. Several factors affect the speed of the transcription, such as the pace of the conversation, the number of people speaking in the recording, and the clarity of both the recording and the speakers’ voices.

All of these variables add time to the transcribing work. The client must understand that a professional transcriptionist cannot type at the normal typing speed because of the need to capture all the audible sounds in the audio or video file. On average, a speaker will speak at a speed that is four to five times faster than the typing speed of a transcriber.
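
The rule of thumb above can be turned into a quick estimate. The following is only a sketch: the function name and its parameters are made up for illustration, and the default ratios simply restate the 4 to 5 hours of work per hour of audio mentioned above. Difficult recordings will take longer.

```python
def estimate_transcription_hours(audio_minutes, ratio_low=4.0, ratio_high=5.0):
    """Return a (low, high) estimate of working hours for a recording.

    The default ratios restate the article's rule of thumb: 4 to 5 hours
    of work per hour of audio. Treat the result as a minimum, not a quote.
    """
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    audio_hours = audio_minutes / 60
    return audio_hours * ratio_low, audio_hours * ratio_high


low, high = estimate_transcription_hours(60)
print(f"60-minute recording: about {low:.1f} to {high:.1f} hours")
```

For a 90-minute recording, the same call gives a range of 6 to 7.5 hours before any allowance for noise, accents, or multiple speakers.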

How to deal with transcriptions

Transcription is one of the most demanding and labor-intensive of all translation services. It requires a high level of skill from the transcriptionist, from listening to the audio or video file and researching the subject to understanding the context of the recording and typing the audio into readable text.

For a professional transcriptionist, it is important to know what the client wants. Points to settle with the client include:

  • Typing the audio exactly as spoken, including audible pauses such as ‘ers’ and ‘ums’, or removing them while keeping the rest of the audio. Clients may want the transcription to be grammatically correct or want non-native speakers of English to sound like native speakers.
  • Removing or including the full questions when working on an interview. If the interviewee says something “off the record,” the transcriber has to find out whether the client wants the response removed or included with an “off the record” mark.
  • Capturing and marking the pauses. It is also important to know how to treat the pauses made by the speaker or speakers.
  • Marking the words or sections that are not clear.
  • Adding timestamps to the document.
  • Identifying the speakers.
  • Using U.K. or U.S. spelling.
  • Applying a particular line spacing or special font.
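
One way to keep track of these choices is to record them in a small structure before work starts. This is a minimal sketch, not a standard form: the `ClientBrief` class and every field name in it are hypothetical, and the defaults are arbitrary. The one check it performs reflects the rule that verbatim transcripts do not allow grammatical corrections.

```python
from dataclasses import dataclass


@dataclass
class ClientBrief:
    """Hypothetical record of a client's transcription preferences."""
    verbatim: bool = False             # keep 'ers' and 'ums' exactly as spoken
    correct_grammar: bool = False      # tidy the grammar of the speakers
    include_questions: bool = True     # keep the interviewer's full questions
    keep_off_the_record: bool = False  # include remarks made off the record
    mark_pauses: bool = False
    mark_unclear: bool = True
    timestamps: bool = False
    identify_speakers: bool = True
    spelling: str = "US"               # "US" or "UK"
    line_spacing: float = 1.0
    font: str = ""                     # empty means no special font requested

    def __post_init__(self):
        if self.spelling not in ("US", "UK"):
            raise ValueError("spelling must be 'US' or 'UK'")
        if self.verbatim and self.correct_grammar:
            # A verbatim transcript keeps the speech as delivered.
            raise ValueError("verbatim and correct_grammar cannot both be set")


brief = ClientBrief(verbatim=True, timestamps=True, spelling="UK")
print(brief)
```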

Style standards

A translation company that offers transcription services follows a style guide. The conventions below illustrate what such a guide typically covers.

  • Brackets are often used to show sounds that interrupt the main dialogue. They enclose a short description of the sound, such as [applause], [laughter], or [phone ringing]. A notable stop in the speaker’s sentence is shown with a bracket, such as [cut off], or with an ellipsis (three consecutive dots). The transcriptionist can also enclose a description of the speaker’s tone of voice in brackets, such as [angry], [happy], or [joking].
  • The transcriber can also use brackets followed by a timestamp to show uncertainty about the spoken word, for example, [crosstalk][00:00]. The speaker’s sentence should be completed before the other speaker’s words are put in another paragraph. The term “inaudible” should be enclosed in brackets and followed by a timestamp if it is not possible to understand what’s being said. The same method applies when the transcriber is unsure of a name, title, or spelling: enclose “phonetic” in brackets and spell the word phonetically to the best of the transcriber’s ability.
  • Some transcriptions are required to have timestamps. Timestamps are also enclosed in brackets. They should be added every 30 seconds, after the name of the speaker and before the transcribed words. Timestamps should also be added when there is a change of speaker.
  • A colon should be used after the name of the speaker. The speaker’s name should be bold to distinguish it from the rest of the text. If a timestamp is needed, it should be added after the colon that is placed after the speaker’s name.
  • Timestamps should only be indicated in minutes and seconds, so an hour and fifteen minutes is written as [75:00].
  • Only the name, title, or gender are used as speaker labels, such as [Atty. Landon], [Manager], or [woman] in this order of hierarchy. It is important to use descriptive speaker labels to make them distinct. If there is a large group and many are talking, it is all right to label them as [audience] and refer to a person from the audience speaking as [audience member] instead of [man].
  • If the persistent background sound does not affect the quality of the dialogue, add a note of it in the transcription at least once, at the time that it first occurred. It is all right to remove filler words or statements. Remove conjunctions such as ‘but’ or ‘and’ that start a speaker’s sentence.
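
The timestamp and speaker-label conventions above can be sketched in a few lines. The function names here are invented for illustration, and the sketch works on plain text, so it leaves out the bold formatting of the speaker’s name. It assumes the timestamp counts minutes past 59 rather than rolling over into hours, as in the [75:00] example.

```python
def format_timestamp(total_seconds):
    """Format elapsed seconds as [MM:SS]; minutes keep counting past 59."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"[{minutes:02d}:{seconds:02d}]"


def format_line(speaker, total_seconds, text):
    """Speaker label, colon, timestamp, then the transcribed words."""
    return f"{speaker}: {format_timestamp(total_seconds)} {text}"


def format_tag(tag, total_seconds):
    """Bracketed tag such as 'inaudible' or 'crosstalk' plus its timestamp."""
    return f"[{tag}]{format_timestamp(total_seconds)}"


print(format_timestamp(75 * 60))  # an hour and fifteen minutes
print(format_line("Atty. Landon", 95, "Please state your name."))
print(format_tag("inaudible", 30))
```

The first call prints [75:00], the second prints a line that starts with the speaker label followed by [01:35], and the third prints [inaudible][00:30].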

Some clients need word-for-word transcriptions, where the verbal and nonverbal nuances are captured. When the request is for a verbatim transcription, the transcriber must include all audible sounds and record how the words are pronounced, such as ‘cos’, ‘coz’, or ‘cuz’ instead of because, along with slang, repeated phrasings, fillers, and false starts.

Accuracy is important in verbatim transcription, so punctuation must reflect the speech as delivered, and grammatical corrections are not permitted. Sounds that do not interrupt the conversation should be noted in brackets every time they occur. A hyphen marks the break when a speaker stutters mid-word or repeats certain words.

The transcriptionist should have a copy of the client’s special terminology so that the transcription is accurate.

Audio transcription is difficult because so many rules and conventions must be followed. It demands more from the transcriptionist, which makes it a specialized field. Be sure to work only with a professional transcription services provider with years of experience to ensure that your audio or video file is transcribed accurately and according to your specific needs.
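
The difference between a verbatim transcript and a cleaned-up one can be illustrated with a small text filter. This is a rough sketch under stated assumptions: the filler list and the function name are hypothetical, and real clean-up relies on the transcriber’s judgment rather than pattern matching. A verbatim transcript would simply skip this step and keep the sentence as spoken.

```python
import re

# Hypothetical filler list; a real style guide would define its own.
FILLERS = ("um", "uh", "er", "erm")

# A filler word, plus a comma and any spaces that directly follow it.
_FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*", re.IGNORECASE)
# 'And' or 'But' at the very start of a sentence.
_LEADING_CONJ_RE = re.compile(r"^(?:and|but)\b,?\s*", re.IGNORECASE)


def clean_sentence(sentence):
    """Remove fillers and a leading 'and'/'but', then re-capitalize."""
    text = _FILLER_RE.sub("", sentence).strip()
    text = _LEADING_CONJ_RE.sub("", text)
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text[:1].upper() + text[1:]


spoken = "But um I think we should er leave now."
print("verbatim:", spoken)
print("clean:   ", clean_sentence(spoken))
```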
