We, the modern people, are tickled with our phones’ voice-recognition powers. We can ask questions! We can open apps with our voice!
You know who’s not so tickled? Anyone who records other people talking. Our phones are terrible at transcribing voices—converting them to text that we can edit.
I realize that most people don’t care about transcribing audio. But if you’re a reporter, producer, editor, author, YouTuber, filmmaker, student, documentarian, researcher, government agency, doctor, lawyer, or police officer, for example, you might really care. Manual transcription of audio and video files is an excruciating, tedious, soul-sucking exercise, and we’ve been doing it pretty much the same way for 50 years.
The world waits for a method that’s fast, cheap, and accurate. We want this:
But you can’t have that.
Until now, there have been only a few ways to convert a recording into text:
- Transcribe it manually. You do the typing yourself as you listen. Hit Play, Stop, Rewind, Play, Stop, Rewind, over and over. That’s accurate and cheap, but not fast. It’s a royal pain, especially if you have several long interviews to do.
- Transcribe it manually, with web assistance. This Chrome extension combines an audio player and a text editor, so at least you’re spared some of the back-and-forth between two apps as you type it out yourself. Still tedious.
- Let your phone transcribe it. Yeah, play the recording into your phone, as though you’re speaking to it. The results are terrible. There’s no punctuation, no paragraph breaks, and the result needs so much editing, you could have done the job yourself faster. Fast and cheap, but not accurate. (Same thing for the automatic transcription features of YouTube and Google Docs. The results are generally a mess.)
- Hire a web-based service to do it. Services like Rev.com, Scribie.com, Transcribeme.com, and VoiceBase.com employ human transcriptionists to type out your audio. Usually, they charge between $1 and $3 per minute of recorded audio or video—more if you want same-day turnaround, and even more if you want the transcriber to add time codes, the names of who’s speaking, and the little “ums” and false starts. Bottom line, you’re looking at $60 to $150 per hour of audio. Accurate, but not fast or cheap.
- Hire a professional service. Professional news channels hire professional transcription services like Transcript Associates, Audio Transcription Center, or Professional Transcriptions. You get incredible quality—flawless transcriptions; “ums” and “uhhs” and dashes representing pauses; time codes typed in; the speakers’ names identified. And you get it in a matter of hours. But we’re talking $220 per hour of audio, or more. Accurate and fairly fast, but not what you’d call cheap.
This is a review of a new, fifth approach: Trint.com. (The name, we’re told, is a combo of “transcript” and “interview.”) It lands on a new point in that speed-cost-accuracy continuum by (a) automating the conversion instead of hiring humans, and (b) providing a slick, easy way for you to breeze through the results and correct the errors.
“The idea is to take the very best of automated speech recognition [ASR] and push it as far as it will go, then give the user a simple tool to get those last yards,” Trint founder Jeffrey Kofman told me. “By combining a text editor with an audio/video player, we let you quickly search, verify, and correct the output of our ASR.”
The cost is $15 per hour of video—about a quarter of the cost of even the cheapest human web-based services.
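If you want to check the price math yourself, here is a back-of-the-envelope sketch in Python. It uses only the per-hour figures quoted in this article. The dictionary and function names are made up for illustration, and the professional price is treated as a flat $220 even though the real figure can run higher.

```python
# Per-hour prices quoted in this article (USD per hour of recorded audio).
# Each entry is a (low, high) range. The professional figure is really a
# floor ("$220 per hour of audio, or more"), so its high end is an assumption.
PRICES_PER_HOUR = {
    "web-based human service": (60, 150),
    "professional service": (220, 220),
    "Trint": (15, 15),
}


def cost_ratio(option: str, baseline: str = "Trint") -> tuple[float, float]:
    """Return the (low, high) price of `option` as a multiple of the baseline price."""
    low, high = PRICES_PER_HOUR[option]
    base = PRICES_PER_HOUR[baseline][0]
    return low / base, high / base


for name in PRICES_PER_HOUR:
    low_ratio, high_ratio = cost_ratio(name)
    print(f"{name}: {low_ratio:.1f}x to {high_ratio:.1f}x Trint's price")
```

By this arithmetic, the cheapest web-based human service costs four times what Trint does, which is where the "about a quarter" figure comes from.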
How it works
You sign up. You upload your audio or video file. You pay in advance: $15 for an hour of converted audio or video. (If you’re willing to commit to doing a lot of this, the cost comes down to $12 an hour.)
You wait maybe five minutes—an insanely short time—and then it’s done. You open the transcription right there in your browser, looking like this:
Already, what you get is good enough that you can search for words or highlight the good parts.
But while the system correctly detects sentences and adds periods, it adds no other punctuation. It makes no attempt to add commas, for example. So you wind up with phrases like “I went to you know the store and bought peaches plums and pickles.” No question marks, either.
Now you read through it, correcting the errors, adding punctuation and paragraph breaks, and identifying speaker names using a pop-up menu. The video above shows what this process is like.
What’s kind of wonderful is how the audio or video playback is integrated with the editing: Wherever you click your mouse, that bit plays back automatically. There’s no Play, Pause, Rewind cycle here; the system always knows what to play when. (You can turn this playback on or off with a keystroke, and also control the playback speed.)
So how long does this cleanup process take? I tried Trint on seven interviews, and the editing generally wound up equaling the length of the interview. Thirty-minute interview, 30 minutes to clean it up.
That will never fly in the professional world. CBS News won’t be using Trint any time soon. But doing the job yourself would take five to ten times the length of the original recording. And if you hired a professional service, you’d pay four to eight times as much.
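The time trade-off can be roughed out the same way. This is only a sketch built on the rules of thumb above (cleanup takes about as long as the recording; typing it yourself takes five to ten times as long). The names are hypothetical, and it ignores the few minutes Trint spends processing the upload.

```python
# Minutes of your own time per minute of recording, taken from the
# rough observations in this article. These are not measurements.
TRINT_CLEANUP_MULTIPLIER = 1.0     # cleanup took about as long as the interview
MANUAL_MULTIPLIERS = (5.0, 10.0)   # typing it yourself: five to ten times


def effort_minutes(recording_minutes: float) -> dict:
    """Estimate minutes of your own time for each do-it-yourself approach."""
    low, high = MANUAL_MULTIPLIERS
    return {
        "trint_cleanup": recording_minutes * TRINT_CLEANUP_MULTIPLIER,
        "manual_low": recording_minutes * low,
        "manual_high": recording_minutes * high,
    }


print(effort_minutes(30))
```

For a 30-minute interview, that works out to about 30 minutes of cleanup in Trint versus 150 to 300 minutes of typing it all out by hand.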
(For many people, of course, a full cleanup isn’t necessary; often, the point of transcribing an interview is just to skim it to find the good parts. That’s where Trint really shines. It’s simple to read or search for text in a transcript, highlight the juicy parts, and even play back only the highlighted portions. In that case, you can clean up only those few bits.)
How does it compare?
To compare the results, I submitted the same interview recording to Trint, to Rev.com ($1 a minute for 24-hour turnaround), and to a high-end pro service. Here’s what I got back:
It’s in beta
There are some bugs left to squash in Trint. For example, copy and paste don’t work in the text editor. (I was editing an interview that contained the word GISHWHES over and over again, a non-word that Trint never once transcribed correctly. I thought I could just paste it in over and over again, but no joy.) The company explains that if you went nuts, pasting in blobs of text, you’d throw off the software’s underlying links between the audio and the text.
Chrome extensions can trip up Trint, too. Every time I inserted a Return to break up a paragraph, some text would disappear. Turning off all my extensions fixed that.
You should also keep in mind that Trint requires clean, clear audio, in which your subject was miked. You can’t feed it the echoey recording of your kid’s school play, for example, and expect decent results. And, as you’d guess, thick accents dramatically impair the accuracy. (When you post the recording, you specify which accent the speaker has; that helps.)
The company acknowledges that it has some work to do, and says that it has big plans for Trint 2.0 this summer.
In the meantime, Trint is here now. It’s not this—
—but it’s this:
—and that’s a new spot on the time-cost-accuracy spectrum. It’s therefore a welcome new weapon in the fight against the costly, time-consuming, soul-sucking act of transcribing the human voice.
David Pogue, tech columnist for Yahoo Finance, welcomes nontoxic comments in the comments section below. On the web, he’s davidpogue.com. On Twitter, he’s @pogue. On email, he’s firstname.lastname@example.org. You can read all his articles here, or you can sign up to get his columns by email.