As a robotic ear, the REAL system needs to continuously capture a specific voice source. In Figure 1b, the system is mounted on a motorized gimbal, and a camera is used to detect and track the throat or the mask of the speaker. The detected target position is fed into the gimbal's control loop, which keeps the laser pointed at the target as it moves. A microphone is attached to the REAL system to collect the audio signal for comparison and to further augment the REAL signal by fusing the two independent modalities. Figure 1c shows a cocktail party scenario in which the gimbaled REAL system operates to ‘hear’ a specific person remotely without acoustic channel interference.
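The tracking loop described above can be illustrated with a minimal sketch. It assumes a simple proportional controller and a linear mapping from pixel offset to angle; the text does not specify the controller actually used, so the names (`GimbalState`, `track_step`, `simulate`), the image size, the field of view and the gain are all hypothetical, and the camera detector is replaced by a simulated one.

```python
from dataclasses import dataclass


@dataclass
class GimbalState:
    """Current pointing direction of the gimbal, in degrees."""
    pan_deg: float = 0.0
    tilt_deg: float = 0.0


def pixel_error_to_angle(target_px, image_size, fov_deg):
    """Convert the target's offset from the image center into an angular error.

    Assumption: a linear (small-angle) mapping between pixels and degrees,
    with positive tilt corresponding to the image's downward direction.
    """
    (x, y), (w, h), (fov_x, fov_y) = target_px, image_size, fov_deg
    err_pan = (x - w / 2) / w * fov_x
    err_tilt = (y - h / 2) / h * fov_y
    return err_pan, err_tilt


def track_step(state, target_px, image_size=(640, 480),
               fov_deg=(60.0, 45.0), gain=0.5):
    """One proportional control update: move a fraction `gain` of the error."""
    err_pan, err_tilt = pixel_error_to_angle(target_px, image_size, fov_deg)
    return GimbalState(state.pan_deg + gain * err_pan,
                       state.tilt_deg + gain * err_tilt)


def simulate(target_angle_deg, steps=20, image_size=(640, 480),
             fov_deg=(60.0, 45.0), gain=0.5):
    """Run the loop against a stationary target at a known direction."""
    state = GimbalState()
    w, h = image_size
    for _ in range(steps):
        # Simulated detector (stand-in for the camera): where the target
        # appears in the image given the current pointing direction.
        x = w / 2 + (target_angle_deg[0] - state.pan_deg) / fov_deg[0] * w
        y = h / 2 + (target_angle_deg[1] - state.tilt_deg) / fov_deg[1] * h
        state = track_step(state, (x, y), image_size, fov_deg, gain)
    return state


if __name__ == "__main__":
    print(simulate((10.0, -5.0)))
```

With a gain between 0 and 1, the pointing error shrinks geometrically at each update, so the laser settles on a stationary target and follows a slowly moving one with a small lag. A real gimbal would also need rate limits and handling of missed detections, which this sketch omits.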

Results and applications

Frequency characterization

A systematic investigation of REAL’s frequency response is needed to justify the new approach. The short-time Fourier transform (STFT) is a commonly used tool for examining both the temporal and the frequency response of audio signals.[23] In Figure 2a-b, the STFT responses of the microphone, the REAL signal on the throat and the REAL signal on the mask are obtained while the speaker pronounces ‘Ah~~’. Sharp vocal resonances can be identified in all three STFT spectrograms. In Figure 2c, the frequency cross-sections of the signals at time 1 s are plotted for a better comparison. Although the envelopes of the frequency spectra differ slightly, the peaks are prominent and accurate. From the point of view of speech synthesis, these frequency responses already provide enough information to understand the content of the speech.[24] Notably, the throat signal is reasonably good at low frequencies (< 1 kHz) but slightly lacking at higher frequencies, while the mask signal is more uniform across the vocal spectrum. This difference is related to the respective elastic and damping properties of the surface materials; in the case of the human throat, biological tissues attenuate higher-frequency content while transmitting lower-frequency information.[25] We found that many other objects can be used with the REAL system, such as plastic bags, packaging boxes and paper, provided that they are close to, and excited mainly by, the target sound source. A detailed analysis of the frequency response of various materials is presented in the Supporting Information.
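To make the STFT analysis concrete, the following minimal sketch computes a magnitude spectrogram and takes a frequency cross-section at one time frame, analogous to the cross-section at 1 s in Figure 2c. It is not the authors' processing pipeline: the synthetic harmonic tone stands in for a recorded ‘Ah~~’, and the sampling rate, frame length, hop size and function names are assumptions chosen for illustration. A naive DFT is used to keep the example dependency-free; in practice an FFT-based routine would be preferable.

```python
import cmath
import math


def synth_vowel(fs=8000, n_samples=512, f0=500.0):
    """Synthetic stand-in for a sustained vowel: a fundamental plus two
    weaker harmonics (assumed amplitudes 1.0, 0.5 and 0.25)."""
    return [
        math.sin(2 * math.pi * f0 * n / fs)
        + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
        + 0.25 * math.sin(2 * math.pi * 3 * f0 * n / fs)
        for n in range(n_samples)
    ]


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len=128, hop=64):
    """Return a list of frames, each a list of DFT magnitudes for
    bins 0 .. frame_len // 2 (the non-negative frequencies)."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def peak_frequency(spectrum, fs, frame_len):
    """Frequency (Hz) of the strongest bin in one time frame."""
    k = max(range(len(spectrum)), key=spectrum.__getitem__)
    return k * fs / frame_len


if __name__ == "__main__":
    fs, frame_len = 8000, 128
    frames = stft_magnitude(synth_vowel(fs=fs), frame_len=frame_len, hop=64)
    cross_section = frames[len(frames) // 2]  # one time slice of the spectrogram
    print(peak_frequency(cross_section, fs, frame_len))
```

Comparing such cross-sections for the microphone, throat and mask signals would show whether the resonance peaks coincide even when the spectral envelopes differ, which is the comparison the text draws from Figure 2c.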