What are your options for converting speech to text?
  • Updated: 6 Apr 2004


John Hambley has a collection of cassette tapes that he has recorded throughout the years. They’re all of a single person speaking and include poems, sermons and other talks.

Having converted the cassette recordings to PC audio files, John wants to find a speech-to-text program that can turn the audio files into word processor documents, much as optical character recognition (OCR) software turns scanned, pre-typed documents into usable word processor files.

John would also like to make a copy suitable for listening to while driving. The cassette tapes didn’t sound noisy at home, but burning them to CD produced low-quality tracks that were very difficult to listen to in the car.

Please note: this information was current as of June 2007 but is still a useful guide to today's market.

Speech recognition

One possible solution to John’s challenge is a speech recognition package. These generally come in two forms: speech recognition software that is trained to a particular voice, and real-time speech-to-text software. John was happy to make some corrections, comparable to the amount needed after OCR.

Speech recognition software requires training so that it can recognise your voice and interpret what you're saying accurately. For most programs, this means reading a specific passage of text (say, three pages of Alice in Wonderland) out loud in the voice you want recognised. That way, the software calibrates the words it expects to hear against the speaker’s voice. The problem is that this calibration usually isn’t possible for pre-recorded speech, because the voice on the tape can’t read out the training passage.

Additionally, most packages cost more than $100, and John would prefer not to pay that much. Although there’s work on an open-source speech recognition engine, it’s not yet available.

Speech to text

Another technology, real-time speech-to-text, usually handles only a small range of words, and is better suited to computer commands and the like than to long passages of speech. What's more, the cheap and shareware versions of these programs often recognise only American accents, and would have difficulty with the noisiness of John’s recordings.

Transcribe your tapes

Another option is transcription. Many secretarial services also provide transcription, usually priced at $20–40 per hour of audio. While this is an option for small amounts of audio, it can be expensive if you have many hours of recordings.
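To see how quickly the bill adds up, here is a rough calculation in Python. It assumes the $20–40 per hour range quoted above; the function name and the 30-hour collection are hypothetical figures chosen for illustration, not details of John's tapes.

```python
def transcription_cost(hours_of_audio, low_rate=20, high_rate=40):
    """Return the (lowest, highest) cost in dollars for the given hours of audio."""
    return hours_of_audio * low_rate, hours_of_audio * high_rate


# 30 hours is a made-up collection size, used only as an example.
low, high = transcription_cost(30)
print(f"30 hours of tape: ${low}-${high}")
```

On those assumptions, 30 hours of recordings would cost between $600 and $1200 to have transcribed.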

Dictation software gives you the functionality of a dictaphone on your PC: it slows the audio down so that it can be more easily transcribed. We found a free program called Express Scribe at
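If you're comfortable with a little programming, the simplest way to slow a recording down can be sketched with Python's built-in wave module. This is only an illustration: it plays the same samples at a lower rate, which also lowers the pitch of the voice, and we are not suggesting this is how Express Scribe or any other dictation program works. The function name and the file names in the comment are hypothetical.

```python
import wave


def slow_down(src, dst, factor=0.75):
    """Copy a WAV file, declaring a lower sample rate so that it plays slower.

    A factor of 0.75 plays at three-quarters speed. The samples themselves
    are unchanged, so the pitch drops along with the speed.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        audio = reader.readframes(params.nframes)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.setframerate(int(params.framerate * factor))
        writer.writeframes(audio)


# Example use (hypothetical file names):
# slow_down("sermon.wav", "sermon_slow.wav", factor=0.75)
```

Playing a talk at three-quarters speed makes a one-hour recording last 80 minutes, which gives a typist more time to keep up.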

The current best option: clean up the audio

The final option is to clean up the current audio files so that they can be transferred to CD with sound quality that's good enough to listen to while driving.

To clean up audio, the sound file must be imported into a sound-editing software package before adjustments can be made. Some sound-editing programs allow you to remove hisses, pops and background noise, or isolate the vocals (often to remove them, so you can make your own karaoke tracks, for example). This requires a bit of fine-tuning, but it’s considerably less hassle than transcribing text or using a speech-to-text software package.

John’s progress

We tried two speech recognition packages on a poem that John sent to us, without success, so we suggested that he try some sound-editing software. We recommended the open-source package, Audacity, and a freeware program, Waverepair. Both packages feature specialised tools to help clean up audio recordings and are highly recommended on audio enthusiast sites.

To remove hiss, John provided the programs with a few minutes of hiss with no signal, so that the software knew what to remove. Then he processed his sound files to remove as much hiss as he could without affecting the audio signal he wanted to keep. Older sound-editing packages that John had tried used a filter spectrum, but these were harder to use and much less effective at eliminating hiss.
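The idea of showing the software a sample of pure hiss can be sketched in Python. This is a simplified illustration of one common approach, spectral subtraction: measure how loud the hiss is at each frequency, then subtract that amount from every short slice of the recording. It is not the method Audacity or Waverepair actually uses, and the function names, the 64-sample frame size and the synthetic test signals are all our own assumptions.

```python
import cmath
import math
import random


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spec):
    """Inverse FFT via the conjugate trick; returns only the real part."""
    n = len(spec)
    return [v.real / n for v in fft([s.conjugate() for s in spec])]


def frames(samples, size):
    """Split samples into whole, non-overlapping frames (any tail is dropped)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def noise_profile(hiss, size=64):
    """Average loudness per frequency bin over a hiss-only recording."""
    spectra = [fft(f) for f in frames(hiss, size)]
    return [sum(abs(s[k]) for s in spectra) / len(spectra) for k in range(size)]


def reduce_hiss(samples, profile, size=64):
    """Subtract the hiss profile from each frame's loudness, keeping the phase."""
    out = []
    for frame in frames(samples, size):
        cleaned = []
        for value, noise in zip(fft(frame), profile):
            mag = abs(value)
            gain = max(mag - noise, 0.0) / mag if mag > 0 else 0.0
            cleaned.append(value * gain)
        out.extend(ifft(cleaned))
    return out


def rms(xs):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


# Synthetic stand-ins for real recordings: random hiss plus a steady tone.
random.seed(0)
hiss = [random.gauss(0, 0.05) for _ in range(64 * 50)]
tone = [math.sin(2 * math.pi * 4 * i / 64) for i in range(64 * 20)]
noisy = [t + random.gauss(0, 0.05) for t in tone]

profile = noise_profile(hiss)
cleaned = reduce_hiss(noisy, profile)
print("hiss level before:", rms([n - t for n, t in zip(noisy, tone)]))
print("hiss level after: ", rms([c - t for c, t in zip(cleaned, tone)]))
```

Real packages refine this basic idea, for example by overlapping the slices and smoothing the result, which is why they need some fine-tuning to avoid dulling the voice.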

John has so far tried both sound-editing packages, and has found that they produce a sharper-sounding audio file than his original taped recording. As well as having less hiss, the processed audio is more suitable for recording to CD. He hopes that with a little more practice, the recordings will be ready to listen to while driving.



