Article In Brief
A research team has developed a method to synthesize a person's speech using the brain signals related to the movements of their jaw, larynx, lips, and tongue. Challenges remain in translating this development into clinical practice for patients who can no longer speak.
Scientists at the University of California, San Francisco (UCSF) have identified the brain signals that coordinate movements of the jaw, lips, tongue, and larynx, and used the information to create a computer algorithm that turns these brain signals into audible speech. Ultimately, their goal is to help patients who can no longer speak acquire faster, smoother, and more natural ways to communicate.
The advance is akin to a brain-computer interface used to drive a prosthetic arm: Think about moving your paralyzed arm, and the robotic device interprets the electrical signals and drives the arm to carry out a task. The study results were published in Nature.
Right now, the technology is built on movements that drive a computer cursor to highlight letters one by one and string them into words. It is slow and frustrating; at best, patients can piece together eight to 10 words a minute, while natural speech runs about 150 words per minute.
The scientists wanted to see whether they could get a read-out of the brain areas that regulate speech production, the sensorimotor cortex, and use this information to create synthetic speech. The approach generates speech sounds from the brain in two stages: first converting brain activity into articulatory movements, and then converting those movements into audible speech, in the same way that speech is produced in the human brain.
Study Methods, Findings
The research team—UCSF neurosurgeon Edward F. Chang, MD; speech scientist Gopala Krishna Anumanchipalli, PhD; and doctoral candidate Josh Chartier—recruited five epilepsy patients who had tiny recording electrodes implanted inside their brains to monitor seizure activity.
For this study, the patients were asked to read several hundred sentences aloud while the scientists used high-density electrocorticography to record from the speech cortex.
Earlier work from their lab showed that this area of sensorimotor cortex encodes not sound but the movements used to create that sound. They wanted to determine the patterns of electrical activity as the patients read naturally from a script of English sentences. (One participant was also asked to mime the sentences so that no sound was produced.)
Image of the array of intracranial electrodes used to record brain activity in the current study.
Once the scientists had an algorithm that synthesized speech from this cortical activity, they tested whether other listeners could understand what was being said. Hundreds of listeners were given lists of about 50 words that might appear in the computer-generated sentences; this gave them some context for what they were hearing.
“If you know the context—in this case, the words likely to be used—it limits the possibility of words that you are hearing, and improves intelligibility,” said Dr. Anumanchipalli. “The rate of accuracy was better if they had a context, clues as to what the patient may be talking about.”
The listeners were asked to write down exactly what they thought they heard; they were about 70 percent accurate in deciphering the sentences.
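The scoring idea behind a figure like that can be sketched as a word-by-word comparison. This is a simplified illustration with invented sentences, not the study's stimuli or its exact scoring procedure, which used crowd-sourced transcriptions against closed word pools.

```python
# Toy sketch of scoring a listener's transcription against the spoken
# sentence, word by word. The example sentences are invented.
def word_accuracy(spoken: str, transcribed: str) -> float:
    """Fraction of spoken words the listener reproduced in the same position."""
    spoken_words = spoken.lower().split()
    heard_words = transcribed.lower().split()
    matches = sum(s == h for s, h in zip(spoken_words, heard_words))
    return matches / len(spoken_words)

print(word_accuracy("the ship sailed at dawn", "the sheep sailed at dawn"))  # 0.8
```

Restricting listeners to a known word list raises scores like this one, which is exactly the context effect Dr. Anumanchipalli describes below.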
“This biomimetic approach focuses on vocal tract movements and the sounds that they produce,” said Dr. Anumanchipalli, “and this technique has the potential for high-bandwidth communication.”
“It remains to be seen how well this will translate into a technology that will help patients who can no longer speak,” he added. It would require an invasive procedure of boring through the skull to implant electrodes. “We are now trying to improve the intelligibility of the sounds,” he added.
The researchers are designing a clinical trial to test the feasibility of the technique for patients with neurological conditions, including amyotrophic lateral sclerosis and brainstem stroke.
“We really are excited about this paper,” said James W. Gnadt, PhD, program director for the National Institute of Neurological Disorders and Stroke and the NIH's BRAIN Initiative, which was one of the funders of the research. “It is a multidisciplinary approach bringing together a neurosurgeon, neurologists, engineers, speech experts, mathematicians, and statisticians. They came up with a fundamental biological discovery and something that may be ultimately translatable to patients who can no longer produce speech.”
“It's a beautifully designed, well-executed study of how to decode speech directly from brain signals,” added Marc Slutzky, MD, PhD, associate professor of neurology, physiology, and physical medicine & rehabilitation at Northwestern University Feinberg School of Medicine.
“This is the first paper to leverage recent discoveries from the Chang lab and our own that the primary motor cortex preferentially encodes the movements of speech articulators, rather than the sounds produced. This allowed them to build ‘biomimetic’ decoders—that is, ones that mimic the way the brain normally controls speech—that translated brain signals to articulator movements.
“They used a second stage (neural network) to convert the movements to sound and showed the ability to translate brain signals into speech using this method, which they argue might simplify the translation of brain signals to speech. They also showed the method improves performance over simply decoding sounds directly (without the intermediate step).”
“I think overall this is an important step forward in using our knowledge of the brain's encoding of speech to inform the decoding of speech, and to synthesize speech,” Dr. Slutzky continued. “It also showed that the second stage decoder (translating movements into sounds) could generalize across participants, that is, that a decoder built from one participant's data could be used in another's data. This could be helpful in simplifying the translation of this algorithm to paralyzed patients, though there still is the issue of training the brain-articulator movement decoder.”
Dr. Slutzky noted there are substantial hurdles to testing this method in clinical trials, let alone adopting it into clinical practice. The biggest challenge is still how to build these decoders in paralyzed patients when no actual example speech is available to train on, he said.
“There are reasons to believe this is achievable, given the success of brain-computer interfaces in the arm areas (to control cursors or robot arms) in paralyzed patients,” Dr. Slutzky said. “Speech movements are much more complicated and higher dimensional than arm movements, so how to train them is still a substantial hurdle to be overcome. But I believe ultimately it will happen.”
Dr. Slutzky and his colleagues published a paper in the Journal of Neural Engineering in April that also decoded speech from cortical signals. That study decoded sound from single words directly from brain signals, without the intermediate decoding of movements. The performance was roughly similar to that reported in the Nature paper, even though they trained on much less data (about 10 minutes per participant), he said.
“I believe this study has a sound premise, both in terms of how they model the relationship between neural activity and speech, and also how they evaluate this model,” said Francisco Pereira, PhD, director of the machine learning team and functional magnetic resonance imaging core facility at the National Institute of Mental Health.
“They leverage the fact that the relationship between neural activity and motor articulation for speech should be identifiable in specific locations, and that there are already efficient ways to predict speech sound from those articulation kinematics, learned from audio recordings (without needing more time or effort from the subjects). The evaluation of how well this will work for general language is still relatively limited, since listeners transcribe the sentences they hear from a closed vocabulary. It is very convincing as a proof-of-concept, though, especially because they also try it in a situation where subjects silently mime speech instead of overtly vocalizing.”
“There are several reasons I am optimistic about this work ultimately being relevant to patients,” Dr. Pereira continued. “The first is that they directly target speech production, rather than having a general brain-computer interface (which requires more calibration and is much slower than speaking rate).
“The second, more important one, is that there is a clear path for improvement by ongoing research on both the hardware (improved sensors) and also, more importantly, in the decoding model. For an example of the latter, given several possible and equally likely sequences of sounds, only some of them may be viable sequences of phonemes in English. Something analogous happens for possible sequences of words, only some of which will be grammatical sentences.
“Current research on speech recognition addresses these problems, and I imagine this is why the authors focused their effort on demonstrating viability on the hardest part of the problem.”
The sources in this story disclosed no competing interests.