Noise-canceling headphones have improved significantly over the years, offering users an auditory blank slate by erasing unwanted sounds. Selectively allowing specific sounds back through that barrier, however, has remained a challenge for researchers. While the latest generation of Apple’s AirPods Pro can adjust sound levels based on environmental cues, users still lack precise control over whom they listen to and when.
A groundbreaking development from a University of Washington team promises to change this. The researchers have created an artificial intelligence system called “Target Speech Hearing” (TSH) that enables a user wearing headphones to isolate and listen to a single person’s voice in a crowded, noisy environment by simply looking at them for a few seconds. This innovative system was presented on May 14 in Honolulu at the ACM CHI Conference on Human Factors in Computing Systems.
How It Works
The TSH system leverages AI to modify auditory perception based on the user’s preferences. By wearing standard headphones equipped with microphones, a user can tap a button while looking at someone who is speaking. This action enrolls the speaker by capturing their vocal patterns. The system then cancels out all other ambient sounds and focuses solely on the enrolled speaker’s voice, even as the user and speaker move around.
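For readers who want a concrete feel for this enroll-then-extract idea, the sketch below mimics the enrollment step in plain Python with NumPy. It is a deliberately simplified stand-in, not the team’s actual model: the real system uses binaural beamforming and a learned neural speaker encoder, whereas here the “voice profile” is just an average spectrum and the beamformer is a zero-delay sum of the two ear microphones. All names and parameters are illustrative assumptions.

```python
import numpy as np

def spectral_profile(audio, n_fft=512):
    """Toy voice 'profile': the average log-magnitude spectrum of a clip.
    A real system would use a learned neural speaker encoder instead."""
    hop = n_fft // 2
    frames = [audio[i:i + n_fft] for i in range(0, len(audio) - n_fft, hop)]
    spectra = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))
    return np.log1p(spectra).mean(axis=0)

def enroll_speaker(mic_left, mic_right):
    """Enroll the speaker the wearer is facing.

    When the user looks at the target, that voice reaches both ear microphones
    at nearly the same instant, so simply summing the channels reinforces it
    relative to off-axis sounds (a crude zero-delay beamformer)."""
    beamformed = 0.5 * (mic_left + mic_right)
    return spectral_profile(beamformed)

# Example: enroll from ~3 seconds of synthetic two-channel audio.
sr = 16000
rng = np.random.default_rng(0)
left, right = rng.standard_normal(3 * sr), rng.standard_normal(3 * sr)
enrolled_profile = enroll_speaker(left, right)
print(enrolled_profile.shape)  # one fixed-length vector characterising the voice
```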
Senior author Shyam Gollakota, a professor in the Paul G. Allen School of Computer Science & Engineering at UW, explained the broader implications of this technology: “We tend to think of AI now as web-based chatbots that answer questions. But in this project, we develop AI to modify the auditory perception of anyone wearing headphones, given their preferences. With our devices, you can now hear a single speaker clearly even if you are in a noisy environment with lots of other people talking.”
Technical Details and User Experience
To use the system, the user directs their head towards the target speaker and taps a button on their headphones. The microphones on the headphones pick up the sound waves from the speaker’s voice, which are then processed by an on-board embedded computer running machine learning software. This software learns the speaker’s vocal patterns and continues to filter and play their voice exclusively to the listener.
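The snippet below continues the sketch with the playback-side loop: each incoming frame of the noisy mixture is compared against the enrolled profile and either passed through or attenuated. This is a conceptual stand-in for the on-device neural network the researchers describe, reusing the toy spectral profile from the sketch above; the similarity threshold and attenuation factor are arbitrary assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def filter_stream(mixture, enrolled_profile, frame_len=512, threshold=0.9):
    """Pass frames whose short-term spectrum resembles the enrolled profile
    and attenuate everything else, reconstructing audio by overlap-add."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    out = np.zeros_like(mixture)
    for start in range(0, len(mixture) - frame_len, hop):
        frame = mixture[start:start + frame_len] * window
        spectrum = np.log1p(np.abs(np.fft.rfft(frame)))
        gain = 1.0 if cosine(spectrum, enrolled_profile) > threshold else 0.1
        out[start:start + frame_len] += gain * frame  # overlap-add the scaled frame
    return out

# Usage, continuing the enrollment sketch: cleaned = filter_stream(mixture, enrolled_profile)
```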
The system’s precision improves with more vocal input from the speaker, enhancing the clarity of their voice over time. The technology, however, can currently enroll only one speaker at a time and works best when there is no other loud voice coming from the same direction as the target speaker.
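That gradual improvement can be pictured as the stored profile being updated whenever new speech from the target is observed. One minimal way to express it, assuming the toy NumPy profiles above, is an exponential moving average (the actual system refines a learned neural embedding rather than a spectrum):

```python
def refine_profile(enrolled_profile, new_frame_spectrum, alpha=0.05):
    """Blend newly observed target speech into the stored profile, so the
    voice model sharpens the longer the enrolled speaker keeps talking."""
    return (1.0 - alpha) * enrolled_profile + alpha * new_frame_spectrum
```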
Performance and Future Prospects
The TSH system was tested on 21 subjects, who rated the clarity of the enrolled speaker’s voice nearly twice as high as unfiltered audio. This work builds on the team’s previous research on “semantic hearing,” which allowed users to select specific sound classes—such as birds or voices—to focus on.
While the system is still in the proof-of-concept stage and not yet commercially available, the team has made the code accessible for further development. This innovation opens the door to a future where personal auditory experiences can be finely tuned in real-time, transforming how we interact in noisy environments.
Stay tuned to Brandsynario for the latest news and updates.