As part of my Ubiquitous Course, I am supposed to summarize two research papers on assistive technology every week. I will post the summaries of interesting papers here for you to read. This paper describes a crowd-sourced real-time captioning system that the authors claim performs better than a stenographer. Here’s my summary:
In this study, Lasecki and Bigham investigate whether coordinated groups of people can solve the problem of real-time captioning for deaf and hard of hearing (DHH) people. Real-time speech-to-text conversion is essential for DHH people to capture the essence of a presentation. It is even more critical for deaf students who count on these techniques to learn at schools and universities. Automatic Speech Recognition (ASR) technologies being developed by tech giants still have many problems, and even an error rate of five percent could confuse students. Stenographers, who are professional speech-to-text scribes, are very expensive, hard to find, and need to be booked days in advance. To address this issue, the authors built a system that uses multiple non-experts to caption in real time.
The back end of this system combines the outputs of its non-expert scribes on the fly, producing a real-time captioning machine that does the speech-to-text conversion in less than five seconds. It uses a multiple sequence alignment (MSA) algorithm, a standard technique from biological sequence analysis, to merge the workers’ partial transcripts so that the words of each sentence end up in the right order. Non-experts were clearly overwhelmed when asked to type every word they heard. To address this, each worker is responsible for only three seconds of video/audio at a time and then gets a 9-15 second break. The system also slows down the playback for the part a worker is responsible for and speeds it up for the rest, which the worker still needs to hear for context. Both changes allowed the workers to transcribe with improved precision while reducing their stress.
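To make the turn-taking numbers concrete, here is a minimal Python sketch of a schedule in which workers take three-second turns one after another. It is not the paper's implementation: the strict round-robin order with no overlap between turns, and the names `Turn`, `round_robin_turns`, and `break_seconds`, are my own assumptions for illustration. Under that assumption, a 9-15 second break corresponds to four to six workers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    worker: int   # hypothetical worker index, 0-based
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def round_robin_turns(num_workers: int, total_seconds: float,
                      turn_seconds: float = 3.0) -> list[Turn]:
    """Split the audio into fixed-length turns and hand them out in order.

    Assumption: strict round-robin with no overlap between turns.
    """
    if num_workers < 1 or turn_seconds <= 0:
        raise ValueError("need at least one worker and a positive turn length")
    turns = []
    start = 0.0
    index = 0
    while start < total_seconds:
        end = min(start + turn_seconds, total_seconds)
        turns.append(Turn(index % num_workers, start, end))
        start = end
        index += 1
    return turns


def break_seconds(num_workers: int, turn_seconds: float = 3.0) -> float:
    """Rest time between two turns of the same worker under round-robin."""
    return turn_seconds * (num_workers - 1)


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, "workers ->", break_seconds(n), "second break")
    for turn in round_robin_turns(num_workers=4, total_seconds=15):
        print(turn)
```

The speed-up and slow-down of playback and the MSA merge step are left out here, since the summary above does not give enough detail to sketch them faithfully.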
This study takes a vital step towards democratizing captioning. Using Scribe, people of varying skill levels can now provide real-time captioning on demand, on multiple devices, and for various purposes. It also allows workers with no certified training to earn livable wages while working on a flexible schedule. This research provides a template for building systems for tasks that current Artificial Intelligence cannot robustly automate. The data collected through this system over time could also be used to train better machine learning models for transcription.
Limitations and Future Research
Though the paper claims that multiple non-experts perform better than a professional stenographer or ASR, it provides no quantitative measurements to gauge how big that difference is. It dismisses ASR as not good enough yet, though using ASR in conjunction with a group of non-experts would be an exciting alternative to consider. Because many people caption a given section, the system could also learn each worker’s skill over time and use it to weight their contribution when workers disagree.
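The skill-weighting idea could look roughly like the Python sketch below. This is my own suggestion and not something from the paper: the skill scores, the default of 0.5 for unknown workers, the update rate, and the function names `weighted_vote` and `update_skill` are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(candidates: dict[str, str], skill: dict[str, float],
                  default_skill: float = 0.5) -> str:
    """Pick the word with the highest total skill among workers who typed it.

    candidates maps a worker id to the word that worker typed for one slot.
    Ties are broken alphabetically so the result is deterministic.
    """
    if not candidates:
        raise ValueError("no candidate words to vote on")
    totals: dict[str, float] = defaultdict(float)
    for worker, word in candidates.items():
        totals[word] += skill.get(worker, default_skill)
    return max(sorted(totals), key=lambda word: totals[word])


def update_skill(current: float, agreed: bool, rate: float = 0.1) -> float:
    """Move a skill score toward 1 if the worker agreed with the winner, else toward 0."""
    target = 1.0 if agreed else 0.0
    return current + rate * (target - current)


if __name__ == "__main__":
    typed = {"ann": "their", "bob": "there", "cat": "there"}
    skills = {"ann": 0.9, "bob": 0.3, "cat": 0.4}
    winner = weighted_vote(typed, skills)
    print("winner:", winner)
    for worker, word in typed.items():
        skills[worker] = update_skill(skills[worker], word == winner)
    print(skills)
```

With equal skill scores this reduces to a plain majority vote, so the weighting only matters once the system has seen enough disagreements to tell workers apart.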