Application of Power-saving Playback Algorithm to Speech Signals

Power consumption is an important issue for portable electronic devices. The authors have previously proposed a power-saving audio playback algorithm based on auditory masking. In this paper, the feasibility of the power-saving audio playback algorithm is examined for speech signals from the viewpoints of both speech intelligibility and power consumption. The influence of the speech conversion for power-saving playback was quantified as word intelligibility. Specifically, word intelligibility was measured in each of six phoneme categories for converted speech signals whose power consumption reduction rate was set at 20 %, 50 %, and 70 %. It is confirmed that word intelligibility does not decrease drastically for converted speech whose power consumption reduction rate was set at 50 %.


Introduction
High-fidelity audio focuses on high-resolution, distortion-free reproduction, but for casual listeners using smartphones and smart speakers, power saving during audio playback is an important factor. Power-saving audio playback has been studied in both hardware and software implementations. In hardware-based approaches, each functional circuit of the audio playback system needs to be redesigned. The efficiency of power amplifiers has been improved on the basis of class AB and class D topologies (1). A full digital audio system achieves energy saving by using digital loudspeakers with multiple coils (2) (3). Software approaches, on the other hand, are relatively easy to realize through digital signal processing.
The authors have taken a software-oriented approach to power-saving audio playback based on auditory characteristics, implemented with filterbank processing (4) and masking processing (5). The proposed method was first applied to music sources and achieved a 25 % reduction in power consumption (4). The applicability of the proposed method was then confirmed using a variety of sound sources, including speech signals, in simulated noisy conditions. The results of the listening tests suggested that the sound quality was mostly satisfactory under noisy environments (6). When speech is perceived, it is important how accurately meaningful words can be understood (7). In this paper, the relationship between word intelligibility and current consumption is investigated by carrying out the diagnostic rhyme test in Japanese.

Power-saving Audio Playback Algorithm
The overview of the proposed power-saving playback algorithm is summarized in figure 1 (6) . The method mainly consists of the auditory-oriented sub-band optimization in the filterbank processing and the reduction of inaudible components below the masking thresholds.
The reduction ratio of current consumption is expected to affect word intelligibility. It is therefore necessary to prepare several reduction ratios and investigate the relationship between the reduction ratio and word intelligibility. In this paper, mel-frequency filterbanks are employed in the filterbank analysis shown in figure 1 because they achieved more efficient power saving than other filterbanks (4). The reduction ratios of current consumption are adjusted to approximately 20 %, 50 %, and 70 % by setting the number of filterbank channels to 5, 15, and 25, respectively, with the overall frequency range of the filterbank fixed up to 20 kHz.
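As an illustration of how the channel count changes the sub-band layout, the following minimal sketch computes band edges that are equally spaced on the mel scale up to 20 kHz. The Hz-to-mel conversion formula, the lower limit of 0 Hz, and the function names are assumptions made for this example; the filter shapes and the exact mel formula used in (4) are not given here.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using a common log formula (assumed, not from the paper)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_max_hz=20000.0, f_min_hz=0.0):
    """Return n_bands + 1 edge frequencies (Hz), equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / n_bands
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 1)]


if __name__ == "__main__":
    # Channel counts used in this paper: 5, 15, and 25.
    for n in (5, 15, 25):
        edges = mel_band_edges(n)
        print(n, "channels:", [round(e) for e in edges])
```

Because the edges are equally spaced in mel, the bands are narrow at low frequencies and become progressively wider towards 20 kHz, whatever the number of channels.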
The power-saving audio playback algorithm relies on auditory masking to eliminate inaudible frequency components. Auditory masking occurs when the perception of a target sound is affected by the presence of another sound. The masking threshold is calculated using the psycho-acoustic model used in the MPEG-1 codec (8). The frequency components below the masking threshold are then eliminated before the filterbank analysis (5).
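The elimination step can be sketched as a comparison of each spectral component with a threshold. The sketch below is not the MPEG-1 psycho-acoustic model: as a loudly labelled stand-in, it uses only Terhardt's approximation of the absolute threshold of hearing in quiet, so it ignores masking by neighbouring components. The function names and the example levels are hypothetical.

```python
import math


def threshold_in_quiet_db(f_hz):
    """Approximate absolute hearing threshold in dB SPL (Terhardt's formula).

    Stand-in for a masking threshold; the MPEG-1 model in (8) is far more detailed.
    """
    f_khz = f_hz / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def eliminate_inaudible(components, threshold_fn=threshold_in_quiet_db):
    """Keep only (freq_hz, level_db) pairs whose level reaches the threshold."""
    return [(f, level) for f, level in components if level >= threshold_fn(f)]


if __name__ == "__main__":
    # Hypothetical spectral components: (frequency in Hz, level in dB SPL).
    spectrum = [(100.0, 10.0), (1000.0, 40.0), (3500.0, 5.0), (16000.0, 30.0)]
    print(eliminate_inaudible(spectrum))
```

A different threshold model can be passed in through `threshold_fn` without changing the elimination logic.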
After the filterbank analysis and synthesis, the sub-band power is increased by up to 3 dB in the auditorily sensitive frequency band from 3 kHz to 4 kHz and decreased by 3 dB for frequency components at 120 Hz and below. Finally, band-pass filtering from 40 Hz to 15 kHz is applied to cut out the excess high- and low-frequency components.
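The post-processing gains described above can be written as a simple frequency-dependent gain rule. The sketch below assumes a fixed +3 dB boost (the upper limit stated in the text), an ideal brick-wall band-pass, and sub-bands represented by a single centre frequency; these simplifications and the function names are assumptions for illustration only.

```python
def shaping_gain_db(f_hz):
    """Gain in dB applied to a component at f_hz (illustrative rule)."""
    if f_hz < 40.0 or f_hz > 15000.0:
        return float("-inf")  # outside the 40 Hz to 15 kHz pass band: removed
    if f_hz <= 120.0:
        return -3.0           # attenuate components at 120 Hz and below
    if 3000.0 <= f_hz <= 4000.0:
        return 3.0            # boost the auditorily sensitive 3 kHz to 4 kHz band
    return 0.0


def apply_shaping(band_powers):
    """Apply the gain rule to (centre_hz, linear_power) pairs."""
    shaped = []
    for f_hz, power in band_powers:
        gain_db = shaping_gain_db(f_hz)
        if gain_db == float("-inf"):
            shaped.append((f_hz, 0.0))
        else:
            # Power quantities scale by 10^(dB / 10).
            shaped.append((f_hz, power * 10.0 ** (gain_db / 10.0)))
    return shaped


if __name__ == "__main__":
    # Hypothetical sub-band powers, all set to 1.0 for easy comparison.
    bands = [(20.0, 1.0), (100.0, 1.0), (1000.0, 1.0), (3500.0, 1.0), (18000.0, 1.0)]
    for f_hz, power in apply_shaping(bands):
        print(f_hz, round(power, 3))
```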

Measurement of Power Consumption
Power consumption was measured using the electronic circuit shown in figure 2. The prepared speech signals were played back through a simplified power amplifier and a loudspeaker. The current consumption was obtained from the voltage across the resistor and averaged over five measurements. The speech signals were prepared from male utterances recorded in a soundproof room. Figure 3 shows the exact reduction ratios for the versions of the filterbank with 5, 15, and 25 channels.
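The arithmetic behind this measurement can be sketched as follows, assuming the resistor acts as a series sense resistor so that Ohm's law gives the current. The resistance value and the voltage readings are hypothetical placeholders, not measured data from this study.

```python
from statistics import mean


def current_from_voltage(v_volts, r_ohms):
    """Ohm's law: current through the sense resistor."""
    return v_volts / r_ohms


def mean_current(voltages, r_ohms):
    """Average current over repeated measurements (five in this paper)."""
    return mean(current_from_voltage(v, r_ohms) for v in voltages)


def reduction_ratio_percent(i_original, i_processed):
    """Reduction of current consumption relative to the original signal."""
    return 100.0 * (1.0 - i_processed / i_original)


if __name__ == "__main__":
    r_sense = 1.0  # ohms, hypothetical value
    v_original = [0.100, 0.102, 0.098, 0.101, 0.099]   # hypothetical readings
    v_processed = [0.050, 0.051, 0.049, 0.050, 0.050]  # hypothetical readings
    i_orig = mean_current(v_original, r_sense)
    i_proc = mean_current(v_processed, r_sense)
    print(round(reduction_ratio_percent(i_orig, i_proc), 1), "%")
```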

Diagnostic Rhyme Test
In speech communication, accurately conveying the content of utterances is more important than sound quality. In general, speech intelligibility can be measured with phonemes, words, and sentences (9). In this study, word intelligibility is measured by carrying out the Diagnostic Rhyme Test (DRT) standardized by ANSI (10). The six phoneme categories defined by Jacobson et al., shown in Table 1, were used. The Japanese DRT word pair lists shown in annex A.1 (11) were used in the experiment. In each word pair, the first phonemes differ and the remaining phonemes are the same. The subjects listened to these words and were asked to write down what they heard.
The subjects were 24 students, 6 females and 18 males, with normal hearing. The DRT was carried out in a soundproof room at Kyushu Institute of Technology. The listening tests were carried out as follows.
(1) A word was randomly selected from the ten pairs of Japanese DRT words to be presented to the subject.
(2) After listening to the presented word, the subjects wrote down what they heard on the answer sheet shown in annex A.2 within three seconds.
(3) The next word was automatically presented to the subjects after the three-second break. This continued until all the words of the ten pairs had been presented.
(4) Steps (1), (2), and (3) were repeated for each phoneme category.
Figure 4 shows the results of the DRT in each phoneme category for the original speech signals and the processed speech signals with power consumption reduction rates of 20 %, 50 %, and 70 %. Word intelligibility rates over 80 % were obtained in all phoneme categories.
In particular, the word intelligibility rates are more than 90 % for the processed speech signals with reduction rates of 20 % and 50 %. Looking at each phoneme category in more detail, word intelligibility is relatively low in the Graveness category. This might be due to the intelligibility of the recorded word utterances themselves, before any processing.
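The per-category rates reported here can be tallied from the written answers as a simple percentage of correctly written words. The sketch below assumes that an answer counts as correct only when it matches the presented word exactly; the scoring rule, the function name, and the placeholder words are assumptions, and the trial data are invented for illustration.

```python
from collections import defaultdict


def intelligibility_by_category(trials):
    """Return percent-correct per phoneme category.

    trials: iterable of (category, presented_word, answered_word).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, presented, answered in trials:
        total[category] += 1
        if presented == answered:
            correct[category] += 1
    return {c: 100.0 * correct[c] / total[c] for c in total}


if __name__ == "__main__":
    # Invented trials with placeholder words (not from the Japanese DRT list).
    trials = [
        ("Graveness", "word_a", "word_a"),
        ("Graveness", "word_b", "word_a"),  # confusion within the pair
        ("Graveness", "word_a", "word_a"),
        ("Graveness", "word_b", "word_b"),
        ("Voicing", "word_c", "word_c"),
        ("Voicing", "word_d", "word_d"),
    ]
    for category, rate in intelligibility_by_category(trials).items():
        print(category, rate)
```

Running the same tally separately for the original signals and for each reduction rate would give one set of per-category rates per condition, which is the layout described for figure 4.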

Conclusions
The power-saving audio playback algorithm was modified for speech reproduction systems. The number of mel-frequency filterbank channels was adjusted to achieve power consumption reductions of approximately 20 %, 50 %, and 70 %. The feasibility of the modified algorithm was confirmed by carrying out the Japanese Diagnostic Rhyme Test in six phoneme categories. The experimental results suggest that the word intelligibility rates reach 90 % or more in all phoneme categories for the processed speech signals with reduction rates of 20 % and 50 %. Even with a reduction rate of 70 %, the intelligibility rate exceeds 80 %. The proposed method could reduce power consumption far more for speech signals than for music sound sources. Future work includes performance tests under practical conditions.