What is ASR (automatic speech recognition) technology?

ASR stands for automatic speech recognition, a technology that converts human speech into text. Its goal is to turn the lexical content of spoken language into computer-readable input.

For a machine to hold a dialogue with a human, it must complete three steps, corresponding to the work of the “ear”, the “brain”, and the “mouth”. To understand what a person says, the machine cannot do without automatic speech recognition (ASR).

Baidu AI virtual human sign language video

To see what ASR can do, watch the video of the online contest between CCTV host Zhu Guangquan and an “AI virtual human sign language anchor”. In the video, Zhu Guangquan speaks at an astonishing speed, while the virtual sign language anchor beside him keeps pace, interpreting his words into sign language in real time and successfully completing the challenge.

For the virtual human to understand what the host is saying, Baidu uses a speech recognition (ASR) model, which helps the virtual anchor recognize speech accurately and can also recognize dialects, Chinese, English, and more.

 

ASR Speech Recognition Process

The first step is to create an acoustic model

Most mainstream systems build their acoustic models with Hidden Markov Models (HMMs). Because each person says the same word with a different pronunciation, intonation, and speed, the machine must be exposed to many speakers before it can recognize them. Building the acoustic model therefore means collecting a large volume of raw user speech, extracting features from it, and building an acoustic model database. The parameters of the acoustic model are estimated in the acoustic training step and then refined through repeated rounds of training and alignment. This is where the importance of big data shows.
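
To make the HMM idea more concrete, here is a minimal sketch in Python. It assumes a toy model with two hidden states (standing in for phonemes) and two discrete observation symbols (standing in for quantized acoustic features). The state names, symbols, and probabilities are all invented for illustration, and the Viterbi algorithm used here is a standard HMM technique rather than anything this article attributes to a particular system. Real acoustic models work on continuous features and are trained on large speech corpora.

```python
# Toy HMM: hidden states stand in for phonemes, observations for quantized
# acoustic features. Every number below is invented for illustration.
STATES = ["s", "iy"]
START = {"s": 0.6, "iy": 0.4}
TRANS = {"s": {"s": 0.7, "iy": 0.3}, "iy": {"s": 0.4, "iy": 0.6}}
EMIT = {"s": {"hiss": 0.8, "tone": 0.2}, "iy": {"hiss": 0.1, "tone": 0.9}}


def viterbi(observations):
    """Return (best_path, probability) for the most likely state sequence."""
    # best[state] = (probability of the best path ending in state, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for cur in STATES:
            # Choose the predecessor that maximizes the path probability.
            prob, path = max(
                (best[prev][0] * TRANS[prev][cur], best[prev][1])
                for prev in STATES
            )
            step[cur] = (prob * EMIT[cur][obs], path + [cur])
        best = step
    prob, path = max(best.values())
    return path, prob


path, prob = viterbi(["hiss", "hiss", "tone"])
print(path)
```

For the observation sequence hiss, hiss, tone, the most likely state path under these made-up numbers is s, s, iy. Aligning frames of audio to states in this way is one common use of HMMs during training.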

 

The second step is to build a language model

A language model is either a grammar network built from the speech commands to be recognized or a model built by statistical methods. In both cases, the language is modeled mathematically and abstractly from its objective facts, and the result is expressed as a correspondence. The language model corrects illogical words produced by the acoustic model, making the recognition result fluent and correct, which is also of great significance for the processing of natural speech.
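
As an illustration of a language model built by statistical methods, the following is a minimal sketch of a bigram model with add-one smoothing. The three-sentence corpus, the function names, and the `<s>` start marker are assumptions made for this example. A production language model would be trained on vastly more text and would usually use a more refined smoothing scheme.

```python
from collections import Counter

# Tiny invented corpus; a real language model is trained on far more text.
CORPUS = [
    "please recognize speech",
    "we recognize speech well",
    "a nice beach",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Bigram probability of a sentence, with add-one smoothing."""
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        # Add-one smoothing keeps unseen word pairs from scoring zero.
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob


unigrams, bigrams = train_bigrams(CORPUS)
likely = sentence_probability("recognize speech", unigrams, bigrams)
unlikely = sentence_probability("speech recognize", unigrams, bigrams)
print(likely > unlikely)  # the word order seen in the corpus scores higher
```

Because "recognize speech" appears in the corpus and "speech recognize" does not, the model scores the first word order higher. This is the kind of signal a recognizer can use to reject illogical word sequences.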

 

The third step is to conduct speech recognition

The first two steps are done in advance, and the resulting databases are stored locally on the device or in the cloud. This third step is the real-time speech recognition process. The user's speech input is first encoded and its features are extracted. The extracted features are then matched against the acoustic model library to obtain individual words, and those words are finally checked against the language model library to find the best match.
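
The three stages described above (feature extraction, acoustic matching, language model lookup) can be sketched end to end as follows. This is a deliberately simplified illustration, not how a production recognizer works: the feature is just the mean absolute amplitude per frame, the "acoustic model library" is a hypothetical table with one template per word, and the "language model library" is a hypothetical table of made-up bigram probabilities.

```python
import math

# Hypothetical "acoustic model library": one stored feature template per word.
# "right" and "write" sound alike, so their templates are identical.
ACOUSTIC_TEMPLATES = {
    "right": [0.1, 0.9],
    "write": [0.1, 0.9],
    "left": [0.9, 0.1],
}

# Hypothetical "language model library": P(word | previous word), invented.
BIGRAM_PROBS = {
    ("turn", "right"): 0.50,
    ("turn", "left"): 0.45,
    ("turn", "write"): 0.05,
    ("please", "write"): 0.80,
    ("please", "right"): 0.10,
    ("please", "left"): 0.10,
}


def extract_features(samples, frame_size=4):
    """Split the signal into frames; return each frame's mean absolute amplitude."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(abs(x) for x in frame) / len(frame) for frame in frames]


def acoustic_scores(features):
    """Score each word by negative squared distance to its template."""
    return {
        word: -sum((f - t) ** 2 for f, t in zip(features, template))
        for word, template in ACOUSTIC_TEMPLATES.items()
    }


def recognize(samples, previous_word):
    """Pick the word with the best combined acoustic and language model score."""
    scores = acoustic_scores(extract_features(samples))

    def combined(word):
        # Unseen word pairs get a small floor probability instead of zero.
        lm_prob = BIGRAM_PROBS.get((previous_word, word), 1e-6)
        return scores[word] + math.log(lm_prob)

    return max(scores, key=combined)


audio = [0.1, -0.1, 0.1, -0.1, 0.9, -0.9, 0.9, -0.9]  # stand-in for recorded samples
after_turn = recognize(audio, "turn")      # language model favors "right"
after_please = recognize(audio, "please")  # language model favors "write"
print(after_turn, after_please)
```

The same audio matches "right" and "write" equally well acoustically, so the language model decides between them based on the preceding word. That division of labor is the point of the sketch.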

 

Application Scenarios of ASR Speech Transcription

1. Customer service

The intelligent transcription function of an enterprise call center can record customers' questions in real time. Voice customer service robots can then query and match answers more effectively, and can handle simple, repetitive tasks efficiently.

2. Education and training institutions

In education and training institutions, speech transcription is used for applications such as oral assessment in Chinese and English.

3. Medical treatment

In the medical field, the main application is electronic medical record entry. During clinical diagnosis, doctors can convert diagnostic information into text in real time and have it entered automatically into the hospital's diagnosis and treatment system, which improves their efficiency.

4. Finance

At this stage, some banks already offer basic services such as voice navigation, voice transactions, and business handling through ASR voice transcription.

ASR Speech Recognition Application Scenarios

ASR speech recognition has become a very common technology that is often used in daily life:

1. Apple users have most likely used Siri, a typical example of speech recognition

2. WeChat has a “voice-to-text” function, which also uses speech recognition

3. The recently popular smart speakers are products built around speech recognition

4. Most newer cars offer voice control, which also relies on speech recognition

 

 
