Language models are complex. Now imagine adapting them for children with speech and hearing disabilities

Artificial intelligence has changed many industries, making them more efficient, safer, and cheaper. But there are still areas AI has not yet reached, and speech therapy is one of them. According to the National Institute on Deafness and Other Communication Disorders, about seven percent of children in the US aged 3-17 (or 1 in 12) suffer from problems related to voice, language, and speech. Today, only about 60 percent of those children receive treatment, and speech therapists feel the burden: each treats about 80-100 children at the same time and can allocate, at best, 5-10 minutes of treatment per week to each child. One of the most in-depth studies in the field, which followed approximately 7,000 participants for almost 30 years, found that people with communication disabilities also suffer in their adult lives from lower socioeconomic status, low self-esteem, and a higher risk of mental health problems. In addition, recent research from the UK found that untreated communication disorders are a significant risk factor for child development and are correlated with crime. In standard educational settings, communication deficits are not always detected by teaching staff, and the issue is sometimes attributed to the child’s laziness, low IQ, or lack of discipline.

And this is exactly where AI comes in: smart systems can take over parts of the work. For example, AI can help speech therapists with certain stages of treatment, relieving some of their burden. Treatment usually includes two main phases: learning new material and concepts, and practice. Practice, which takes quite a bit of time, is an essential part of the entire learning process, especially in speech therapy. An AI-based system can work with a student to help them practice, check their performance, and report their progress back to the clinician. Such an automatic system could help an unlimited number of students at all hours of the day, and is also significantly cheaper than hiring personnel for the same purpose.

Existing solutions are not good enough

The solution for making software accessible to those with communication disabilities seems simple: to understand what the child is saying, use a speech-to-text engine (S2T for short), such as Google’s, to convert the speech into text. The problem is that commercial S2T engines are typically trained on data from mature speakers with no impairments, such as LibriSpeech, a corpus of about 1,000 hours of audiobooks. Children with speech and language problems do not speak like book narrators, so commercial S2T engines often fail at the task.

In tests we conducted with a commercial S2T engine, for example, we discovered that it correctly recognized only about 30-40 percent of the words spoken by children with communication disabilities. The solution was clear: develop an S2T system that could understand them.

So how do you build an AI system for speech therapy?

Until the era of deep networks, building S2T systems was mainly done by huge companies and required a huge investment in collecting, cleaning, and tagging data. Hundreds or even thousands of hours of tagged speech were sometimes required to train classical models such as HMMs. But that reality has changed with the development of deep networks.

To develop S2T for children with communication disabilities, we used transfer learning. This method lets you take a network that has been trained for a similar purpose and refine it to improve performance on specific data. As our base model, we chose wav2vec 2.0, an acoustic model for speech recognition developed by Facebook several years ago. It is a Transformer-based deep network. The advantage of wav2vec is its ability to learn from unlabeled data: training is carried out in two stages, self-supervised learning on untagged data followed by fine-tuning on tagged data (speech signals paired with the corresponding text).
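The transfer-learning idea can be sketched in miniature: keep a "pretrained" part of the model frozen and fit only a small head on top of its features. The toy example below (pure Python, an invented regression task — not the actual wav2vec pipeline, which fine-tunes a large Transformer in a deep-learning framework) illustrates the principle:

```python
# Toy illustration of transfer learning: a frozen "pretrained" feature
# extractor plus a small trainable head, fit by gradient descent.
# This is a conceptual sketch only, not the real wav2vec 2.0 code.

def pretrained_features(x):
    """Stand-in for a frozen pretrained network: fixed, never updated."""
    return [x, x * x]

def predict(w, b, x):
    f = pretrained_features(x)
    return w[0] * f[0] + w[1] * f[1] + b

def fine_tune(data, lr=0.01, epochs=2000):
    """Train only the head (w, b); the feature extractor stays frozen."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, b, x) - y
            f = pretrained_features(x)
            w[0] -= lr * err * f[0]
            w[1] -= lr * err * f[1]
            b -= lr * err
    return w, b

# Invented target: y = 2x + 3x^2 + 1, expressible in the frozen feature space.
data = [(x / 10, 2 * (x / 10) + 3 * (x / 10) ** 2 + 1) for x in range(-10, 11)]
w, b = fine_tune(data)
mse = sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)
print(f"head weights: {w}, bias: {b:.3f}, mse: {mse:.6f}")
```

Because the frozen features already capture the structure of the problem, only a handful of head parameters need to be learned — the same reason fine-tuning a pretrained network needs far less labeled data than training from scratch.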

In the self-supervised stage, the network is required to reproduce a hidden part of the original signal. This is how the system learns to recognize the sounds of the language and the structure of its phonemes. In the second stage, the system learns to associate the learned phonemes with the characters of the text. One of the amazing things we discovered is that the amount of tagged data required for the second stage can be relatively small compared to classical systems: the network reaches an error of 8.2 percent on the test set with only 10 minutes of tagged data; one hour of tagged data brings the error down to 5.8 percent, and 100 hours to only four percent. A variety of wav2vec networks are publicly available and can be downloaded free of charge. We chose a network that underwent full self-supervised training on LibriSpeech and fine-tuning on 960 hours.
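The "hide part of the signal" step can be sketched as span masking. This is a simplified illustration: the real wav2vec 2.0 masks spans of latent feature frames and trains with a contrastive loss over quantized targets, while the sketch below only shows how masked spans are chosen (the `mask_prob` and `span` defaults mirror values reported for wav2vec 2.0, but treat them as assumptions here):

```python
import random

def mask_spans(num_frames, mask_prob=0.065, span=10, seed=0):
    """Pick random start frames; mask `span` consecutive frames from each.

    Simplified wav2vec-style masking: each frame is chosen as a span
    start with probability `mask_prob`, and spans may overlap.
    """
    rng = random.Random(seed)
    masked = set()
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span, num_frames)):
                masked.add(t)
    return masked

masked = mask_spans(500)
print(f"{len(masked)} of 500 frames masked "
      f"({100 * len(masked) / 500:.0f}%)")
```

During pretraining, the network only sees the unmasked frames and must identify what belongs in the masked spans — no transcript is needed, which is why this stage can run on untagged recordings.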

To train the network, we collected thousands of recordings of children with communication problems. The data was collected during computer-based treatment sessions; some of it is labeled and some is not. As we saw earlier, wav2vec allows flexibility in using tagged as well as untagged data. Labeled data improves the accuracy of the S2T system, so it is always better to label the data: as the amount of tagged data grows, so does the system’s accuracy.

After the data was collected, we recruited a team of speech therapists to label it. During labeling, the experts were required to provide the text of each recording as well as additional indications about the nature of the recording itself. In quite a few cases there are disturbances during a lesson: background noise, voices of other children in the same room, and more. Using noisy recordings can complicate the learning process.

After some of the data was tagged, we fine-tuned the wav2vec system on a few hours of it and saw a dramatic increase in accuracy in recognizing children’s speech: the WER (Word Error Rate) dropped by almost half. True, it still does not reach the performance level of commercial systems on adult speakers, but it is much better at recognizing children’s speech. The data-tagging project is still ongoing, but there is already cautious optimism about the expected results.
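For readers unfamiliar with the metric: WER is the word-level edit distance between the reference transcript and the recognizer’s output, divided by the number of reference words. A minimal sketch (the transcripts below are invented for illustration):

```python
# Word Error Rate: Levenshtein distance over words between the reference
# transcript and the recognizer's hypothesis, normalized by reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on the mat"))  # 0.0
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
```

A drop from, say, 60 percent WER to roughly 30 percent is what "dropped by almost half" means in practice: the system gets twice as many words right per error made.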

Written by Edward Roddick, Director of Core Tech at AmplioSpeech
