Kinect for Windows SDK v2.0 Development Notes (7) Speech Recognition (1)

(Please credit the source when reprinting.)

SDK used: Kinect for Windows SDK v2.0 public preview

This time we discuss speech recognition. Previously I packed two topics into a single post; this time, on the contrary, I split one topic into two posts.

Speech recognition is one of the reasons for using the official SDK; otherwise one might as well use OpenNI. After all, Microsoft's work on SR (speech recognition) is quite good.

First, you need to download the speech recognition SDK, the runtime, and the runtime languages you want to support. In the language pack names, SR stands for speech recognition and TTS stands for text-to-speech; download whichever you need. I chose American English and Mainland Chinese. Annoyingly, Microsoft also provides separate Kinect-specific runtime languages, which cover fewer languages: even Japanese is supported, but Chinese is not.

Perhaps that is down to Japanese pronunciation.

If you use VS Express as I do, you also need to download WDK 7.1. The speech library needs a little bit of ATL (apparently for COM smart pointers), and annoyingly Microsoft does not ship the ATL library with Express. To get a legitimate ATL library, download WDK 7.1, which includes one. I installed it directly on drive C. Then add the directories of these three libraries to the project:

Speech Platform, WDK, Kinect

Add other libraries as you need them. It is best to put the Speech Platform first.

At present, the speech recognition engine apparently only accepts PCM-encoded audio, while the audio data Kinect delivers is floating-point. So we cannot directly use the IStream that the SDK provides by default.

Fortunately, thanks to the generality of COM components, we only need to inherit from IStream and implement our own stream class to handle the audio stream.

The class mainly needs to implement the Read method. Seek is important to the SR engine, but Kinect does not support it, so that method can simply return S_OK.

There are two ways to implement this class. One builds on the audio frames covered in the previous part: use a larger buffer (preferably a circular one), write the data into it whenever an audio frame arrives, and serve the data from it in Read, which is more troublesome. The other is the approach taken by the SDK sample: the puppet tactic.
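For the first approach, the circular buffer could be sketched roughly as follows. This is only a minimal illustration: ByteRingBuffer and its methods are hypothetical names, not part of the Kinect SDK, and the sketch is not thread-safe, so a real version shared between the audio frame handler and Read would presumably need a lock.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity byte ring buffer (not part of the Kinect SDK).
class ByteRingBuffer {
public:
    explicit ByteRingBuffer(std::size_t capacity) : m_data(capacity) {}

    // Number of bytes currently stored.
    std::size_t Size() const { return m_size; }

    // Copies up to `count` bytes in; returns how many actually fit.
    std::size_t Write(const std::uint8_t* src, std::size_t count) {
        const std::size_t n = std::min(count, m_data.size() - m_size);
        for (std::size_t i = 0; i < n; ++i) {
            m_data[(m_head + m_size + i) % m_data.size()] = src[i];
        }
        m_size += n;
        return n;
    }

    // Copies up to `count` bytes out in FIFO order; returns how many were read.
    std::size_t Read(std::uint8_t* dst, std::size_t count) {
        const std::size_t n = std::min(count, m_size);
        for (std::size_t i = 0; i < n; ++i) {
            dst[i] = m_data[(m_head + i) % m_data.size()];
        }
        m_head = m_data.empty() ? 0 : (m_head + n) % m_data.size();
        m_size -= n;
        return n;
    }

private:
    std::vector<std::uint8_t> m_data;
    std::size_t m_head = 0;  // index of the oldest stored byte
    std::size_t m_size = 0;  // number of stored bytes
};
```

The audio frame handler would call Write with each new frame, and the stream's Read would call Read on the buffer, waiting when too little data is available.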

Get the default IStream provided by the SDK, have our Read call the puppet's Read, and convert the data once it arrives. The code is as follows:

// Implementation of IStream::Read
STDMETHODIMP KinectAudioStreamWrapper::Read(void *pBuffer, ULONG cbBuffer, ULONG *pcbRead){
    // Parameter check
    if (!pBuffer || !pcbRead) return E_INVALIDARG;
    // While speech recognition is inactive, report a full read and return S_OK at once
    if (!m_SpeechActive){
        *pcbRead = cbBuffer;
        return S_OK;
    }
    HRESULT hr = S_OK;
    *pcbRead = 0;
    // The goal is to convert 32-bit float samples into 16-bit PCM samples
    INT16* const p16Buffer = reinterpret_cast<INT16*>(pBuffer);
    // Size ratio between a float sample and a 16-bit sample
    const int multiple = sizeof(float) / sizeof(INT16);
    // Number of samples the caller asked for
    const ULONG sampleCount = cbBuffer / sizeof(INT16);
    // Make sure the float buffer is large enough
    if (sampleCount > m_uFloatBuferSize){
        // Reallocate if it is too small
        m_uFloatBuferSize = sampleCount;
        delete[] m_pFloatBuffer;
        m_pFloatBuffer = new float[m_uFloatBuferSize];
    }
    // Write position in the float buffer, in bytes
    BYTE* pWriteProgress = reinterpret_cast<BYTE*>(m_pFloatBuffer);
    // Bytes read by the current call
    ULONG bytesRead = 0;
    // Bytes of float data still needed
    ULONG bytesNeed = cbBuffer * multiple;
    // Keep reading until enough data has arrived
    while (true){
        // Stop waiting if speech recognition has been deactivated
        if (!m_SpeechActive){
            *pcbRead = cbBuffer;
            hr = S_OK;
            break;
        }
        // Read from the wrapped (puppet) stream
        bytesRead = 0;
        hr = m_p32BitAudio->Read(pWriteProgress, bytesNeed, &bytesRead);
        // Give up on failure instead of looping forever
        if (FAILED(hr)) break;
        bytesNeed -= bytesRead;
        pWriteProgress += bytesRead;
        // Done once enough data has arrived
        if (!bytesNeed){
            *pcbRead = cbBuffer;
            break;
        }
        // Otherwise sleep for a while and try again
        Sleep(20);
    }
    // Convert the data: float -> 16-bit PCM
    if (!bytesNeed){
        for (ULONG i = 0; i < sampleCount; i++) {
            float sample = m_pFloatBuffer[i];
            // Clamp to [-1, 1]
            if (sample > 1.f) sample = 1.f;
            if (sample < -1.f) sample = -1.f;
            // Scale and round to the nearest integer
            float sampleScaled = sample * (float)SHRT_MAX;
            p16Buffer[i] = (sampleScaled > 0.f) ? (INT16)(sampleScaled + 0.5f) : (INT16)(sampleScaled - 0.5f);
        }
    }
    return hr;
}
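The conversion step at the end of Read can also be looked at in isolation. The following is a small portable sketch of the same clamp, scale and round logic; FloatToPcm16 and ConvertFloatToPcm16 are illustrative names of my own, and treating NaN as silence is an assumption that the code above does not make.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Converts one float sample (nominally in [-1, 1]) to signed 16-bit PCM.
// Out-of-range values are clamped; NaN is treated as silence (an assumption).
inline std::int16_t FloatToPcm16(float sample) {
    if (std::isnan(sample)) return 0;
    if (sample > 1.0f) sample = 1.0f;
    if (sample < -1.0f) sample = -1.0f;
    const float scaled = sample * 32767.0f;  // 32767 == SHRT_MAX
    // Round to nearest by adding or subtracting 0.5 before truncation.
    return static_cast<std::int16_t>(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

// Converts `count` samples; `dst` must have room for `count` values.
inline void ConvertFloatToPcm16(const float* src, std::int16_t* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = FloatToPcm16(src[i]);
    }
}
```

Note that the output needs half as many bytes as the input, which is why Read asks the wrapped stream for cbBuffer * multiple bytes.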

Note that when the speech recognition engine initializes, it needs to acquire a certain amount of audio data before it can finish. Initialization may therefore take several seconds or even longer, which is very painful.

I don't know what Microsoft was thinking. To work around this, we add a variable that indicates whether SR is active, and return fake data to fool the SR engine while it is inactive.

This eliminates the wait during initialization.
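The activity flag could be wrapped up as in the sketch below. It is a variation rather than the code above: SpeechGate is a hypothetical helper, it fills the buffer with silence (zeros) where the Read above leaves the buffer untouched, and it uses std::atomic on the assumption that the flag may be toggled from a different thread than the one calling Read.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of the Kinect SDK or its samples).
class SpeechGate {
public:
    void SetActive(bool active) { m_active.store(active); }
    bool IsActive() const { return m_active.load(); }

    // If recognition is inactive: fills the buffer with zeros (silence),
    // reports a full read through *bytesRead and returns true.
    // If recognition is active: leaves everything untouched and returns false,
    // meaning the caller should read real audio instead.
    bool FillSilenceIfInactive(void* buffer, std::size_t size, std::size_t* bytesRead) const {
        if (IsActive()) return false;
        std::memset(buffer, 0, size);
        *bytesRead = size;
        return true;
    }

private:
    std::atomic<bool> m_active{false};  // inactive until explicitly enabled
};
```

The application would call SetActive(true) once the recognizer has finished initializing and real audio should start flowing.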

Here we use speech recognition by loading a static SRGS grammar file. Grammars can also be loaded dynamically; please refer to the official documentation for details.

As for SRGS syntax, see the W3C specification or Microsoft's documentation.

OK, let's write an SRGS grammar file. SRGS is XML-based, so use .xml as the file extension. I use Chinese here (the phrases in the listings below are shown in English translation).

After all, Chinese is very compact: a few characters can express a whole sentence. In my tests, the Chinese recognizer can also recognize English, but the recognition rate…

<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="zh-CN" mode="voice" root="RootSpeech" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
    <rule id="RootSpeech" scope="public">
        <one-of>
            <item>Forward</item>
        </one-of>
        <one-of>
            <item>Back off</item>
        </one-of>
    </rule>
</grammar>

This is almost the simplest possible SRGS file. The important parts are:

The XML declaration at the top, with UTF-8 as the encoding.

The grammar element is the top-level element. Its xml:lang attribute selects the language (zh-CN for Mainland Chinese), and its root attribute names the root rule.

The SR engine starts matching phrases from the root rule; the other rules are reached through it.

Each rule is declared with a rule element. The one-of element offers a choice between alternatives, item holds a basic phrase, ruleref references another rule, and tag attaches either a string or a JavaScript-like script that marks up the data you want. There is also a token element; see the documentation for details.

Our goal is to achieve the following phrases:

1. I found ZZ AA BB in XX YY

For example, I found two (ZZ) high explosive armour-piercing bombs (BB) on my foot (XX).

XX means location. This information is very important.

YY denotes relative orientation. This information is not important.

ZZ denotes quantity. This information is important.

AA denotes the classifier (measure word). This information is almost useless.

BB stands for items. This information is very important.

<rule id="FindThings">
    <example> I found two armour-piercing bullets on my feet </example>

    <item>I am here</item>
    <ruleref uri="#place"/>
    <ruleref uri="#RelativePosition"/>
    <item>find</item>
    <item repeat="0-1">了</item>
    <ruleref uri="#Number"/>
    <ruleref uri="#Classifier"/>
    <ruleref uri="#TargetObject"/>
</rule>

<rule id="place">
    <example> foot </example>
    <example> House </example>

    <one-of>
        <item>foot</item>
        <item>House</item>
        <item>ship</item>
        <item>head</item>
    </one-of>
</rule>

<rule id="RelativePosition">
    <example> upper </example>

    <one-of>
        <item>upper</item>
        <item>Above</item>
        <item>inside</item>
        <item>Side</item>
        <item>nearby</item>
    </one-of>
</rule>

<rule id="Number">
    <example> Two </example>

    <one-of>
        <item>One</item>
        <item>Two</item>
        <item>Two</item>
        <item>Three</item>
        <item>Four</item>
        <item>Five</item>
        <item>Six</item>
        <item>Seven</item>
        <item>Eight</item>
        <item>Nine</item>
        <item>Ten</item>
    </one-of>
</rule>

<rule id="Classifier">
    <example> Mei </example>

    <one-of>
        <item>Mei</item>
        <item>individual</item>
        <item>block</item>
        <item>slice</item>
        <item>Vehicle</item>
        <item>frame</item>
        <item>second</item>
        <item>ministry</item>
        <item>platform</item>
        <item>hold</item>
    </one-of>
</rule>

<rule id="TargetObject">
    <example> Tank </example>

    <one-of>
        <item>High Explosive Armor Piercing Bomb</item>
        <item>Armor piercing shell</item>
        <item>Tank</item>
        <item>pencil</item>
        <item>Computer</item>
        <item>Apple</item>
        <item>Hammer</item>
        <item>Mobile phone</item>
        <item>Armstrong Cyclotron Accelerated Jet Armstrong Gun</item>
    </one-of>
</rule>

This part is now almost complete. Next, reference the new rule from the root rule:

<rule id="RootSpeech" scope="public">
    <one-of>
        <item>
            <ruleref uri="#FindThings"/>
        </item>
    </one-of>
</rule>

If we add another rule, say a "war situation" rule, the root rule becomes:

<rule id="RootSpeech" scope="public">
    <one-of>
        <item>
            <ruleref uri="#FindThings"/>
        </item>
        <item>
            <ruleref uri="#WarSituation"/>
        </item>
    </one-of>
</rule>


That's it.

In practice we will run into many synonyms and many alternative branches. Comparing the recognized strings one by one would be laborious, so we can use the tag element instead.

For example, in the number rule, writing out = 10; makes the rule output 10:

<rule id="Number">
    <example> Two </example>

    <one-of>
        <item>One<tag>out=1;</tag></item>
        <item>Two<tag>out=2;</tag></item>
        <item>Two<tag>out=2;</tag></item>
        <item>Three<tag>out=3;</tag></item>
        <item>Four<tag>out=4;</tag></item>
        <item>Five<tag>out=5;</tag></item>
        <item>Six<tag>out=6;</tag></item>
        <item>Seven<tag>out=7;</tag></item>
        <item>Eight<tag>out=8;</tag></item>
        <item>Nine<tag>out=9;</tag></item>
        <item>Ten<tag>out=10;</tag></item>
    </one-of>
</rule>

Strings can be used as well, but numbers are easier to process in code, aren't they?
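Writing long tagged lists by hand gets tedious, so one option is to generate such a rule from a table. The sketch below is just an illustration: BuildTaggedRule and EscapeXml are hypothetical helpers, and the rule id is assumed to already be a valid XML name.

```cpp
#include <string>
#include <utility>
#include <vector>

// Escapes the characters that are special in XML text content.
inline std::string EscapeXml(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            default:  out += c; break;
        }
    }
    return out;
}

// Builds a <rule> whose <one-of> lists every phrase with an "out=N;" tag.
// `id` is inserted as-is and is assumed to be a valid XML name.
inline std::string BuildTaggedRule(
        const std::string& id,
        const std::vector<std::pair<std::string, int>>& items) {
    std::string xml = "<rule id=\"" + id + "\">\n    <one-of>\n";
    for (const auto& item : items) {
        xml += "        <item>" + EscapeXml(item.first) + "<tag>out=" +
               std::to_string(item.second) + ";</tag></item>\n";
    }
    xml += "    </one-of>\n</rule>\n";
    return xml;
}
```

Synonyms then simply become several table entries that share the same number.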

Similarly, we update the "find things" rule to:

<rule id="FindThings">
    <example> I found two armour-piercing bullets on my feet </example>

    <item>I am here</item>
    <ruleref uri="#place"/>
    <tag> out.place = rules.place; </tag>
    <ruleref uri="#RelativePosition"/>
    <item>find</item>
    <item repeat="0-1">了</item>
    <ruleref uri="#Number"/>
    <tag> out.Number = rules.Number; </tag>
    <ruleref uri="#Classifier"/>
    <ruleref uri="#TargetObject"/>
    <tag> out.object = rules.TargetObject; </tag>
</rule>

It is just a script, so it should be easy to read. In out.A = rules.B, A and B need not be the same name, but B must match the id of a rule referenced earlier.

Now let's add another kind of phrase, the "war situation", such as

We destroyed enemy toilets

This is also very simple, so I'll just put the complete code here.

<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="zh-CN" mode="voice" root="RootSpeech" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
    <rule id="RootSpeech" scope="public">
        <one-of>
            <item>
                <ruleref uri="#FindThings"/>
                <tag> out.FindThings = rules.FindThings; </tag>
            </item>
            <item>
                <ruleref uri="#WarSituation"/>
                <tag> out.WarSituation = rules.WarSituation; </tag>
            </item>
        </one-of>
    </rule>

    <rule id="WarSituation">
        <example> We destroyed enemy toilets </example>
        <example> They pierced our armor. </example>

        <ruleref uri="#CharacterObject"/>
        <tag> out.subject = rules.CharacterObject; </tag>
        <ruleref uri="#WarVerb"/>
        <tag> out.Predicate = rules.WarVerb; </tag>
        <item repeat="0-1">了</item>
        <ruleref uri="#CharacterObject"/>
        <!-- stored as "target" so that out.object below does not overwrite it -->
        <tag> out.target = rules.CharacterObject; </tag>
        <item repeat="0-1">Of</item>
        <ruleref uri="#WarNoun"/>
        <tag> out.object = rules.WarNoun; </tag>
    </rule>

    <rule id="CharacterObject">
        <example> We </example>
        <example> they </example>

        <one-of>
            <item>We<tag>out=0;</tag></item>
            <item>they<tag>out=1;</tag></item>
            <item>We<tag>out=0;</tag></item>
            <item>enemy<tag>out=1;</tag></item>
        </one-of>
    </rule>

    <rule id="WarVerb">
        <example> Destroy </example>
        <example> beat </example>

        <one-of>
            <item>Destroy<tag>out=0;</tag></item>
            <item>beat<tag>out=1;</tag></item>
            <item>breakdown<tag>out=2;</tag></item>
        </one-of>
    </rule>

    <rule id="WarNoun">
        <example> armor </example>
        <example> Computer </example>

        <one-of>
            <item>armor<tag>out=0;</tag></item>
            <item>Toilet<tag>out=1;</tag></item>
            <item>Computer<tag>out=2;</tag></item>
            <item>Computer<tag>out=2;</tag></item>
            <item>Nuclear Silo<tag>out=3;</tag></item>
        </one-of>
    </rule>

    <rule id="FindThings">
        <example> I found two armour-piercing bullets on my feet </example>

        <item>I am here</item>
        <ruleref uri="#place"/>
        <tag> out.place = rules.place; </tag>
        <ruleref uri="#RelativePosition"/>
        <item>find</item>
        <item repeat="0-1">了</item>
        <ruleref uri="#Number"/>
        <tag> out.Number = rules.Number; </tag>
        <ruleref uri="#Classifier"/>
        <ruleref uri="#TargetObject"/>
        <tag> out.object = rules.TargetObject; </tag>
    </rule>

    <rule id="place">
        <example> foot </example>
        <example> House </example>

        <one-of>
            <item>foot<tag>out=0;</tag></item>
            <item>House<tag>out=1;</tag></item>
            <item>ship<tag>out=2;</tag></item>
            <item>head<tag>out=3;</tag></item>
        </one-of>
    </rule>

    <rule id="RelativePosition">
        <example> upper </example>

        <one-of>
            <item>upper</item>
            <item>Above</item>
            <item>inside</item>
            <item>Side</item>
            <item>nearby</item>
        </one-of>
    </rule>

    <rule id="Number">
        <example> Two </example>

        <one-of>
            <item>One<tag>out=1;</tag></item>
            <item>Two<tag>out=2;</tag></item>
            <item>Two<tag>out=2;</tag></item>
            <item>Three<tag>out=3;</tag></item>
            <item>Four<tag>out=4;</tag></item>
            <item>Five<tag>out=5;</tag></item>
            <item>Six<tag>out=6;</tag></item>
            <item>Seven<tag>out=7;</tag></item>
            <item>Eight<tag>out=8;</tag></item>
            <item>Nine<tag>out=9;</tag></item>
            <item>Ten<tag>out=10;</tag></item>
        </one-of>
    </rule>

    <rule id="Classifier">
        <example> Mei </example>

        <one-of>
            <item>Mei</item>
            <item>individual</item>
            <item>block</item>
            <item>slice</item>
            <item>Vehicle</item>
            <item>frame</item>
            <item>second</item>
            <item>ministry</item>
            <item>platform</item>
            <item>hold</item>
        </one-of>
    </rule>

    <rule id="TargetObject">
        <example> Tank </example>

        <one-of>
            <item>High Explosive Armor Piercing Bomb<tag>out=0;</tag></item>
            <item>Armor piercing shell<tag>out=1;</tag></item>
            <item>Tank<tag>out=2;</tag></item>
            <item>pencil<tag>out=3;</tag></item>
            <item>Computer<tag>out=4;</tag></item>
            <item>Apple<tag>out=5;</tag></item>
            <item>Hammer<tag>out=6;</tag></item>
            <item>Mobile phone<tag>out=7;</tag></item>
            <item>Armstrong Cyclotron Accelerated Jet Armstrong Gun<tag>out=8;</tag></item>
        </one-of>
    </rule>
</grammar>
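On the application side, the numeric tags make the "find things" result easy to handle. As a rough sketch of what that might look like once the three numbers have been pulled out of the recognition result (how to do that is not covered here), the code below mirrors the place and target object codes from the grammar. FindReport, PlaceName, ObjectName and Describe are hypothetical names, not SDK types.

```cpp
#include <cstddef>
#include <string>

// Mirrors out.place, out.Number and out.object of the "find things" rule.
struct FindReport {
    int place;
    int number;
    int object;
};

// Maps an out.place code back to its phrase; "unknown" for any other value.
inline const char* PlaceName(int code) {
    static const char* const kNames[] = {"foot", "House", "ship", "head"};
    const std::size_t count = sizeof(kNames) / sizeof(kNames[0]);
    return (code >= 0 && static_cast<std::size_t>(code) < count) ? kNames[code] : "unknown";
}

// Maps an out.object code back to its phrase; "unknown" for any other value.
inline const char* ObjectName(int code) {
    static const char* const kNames[] = {
        "High Explosive Armor Piercing Bomb", "Armor piercing shell", "Tank",
        "pencil", "Computer", "Apple", "Hammer", "Mobile phone",
        "Armstrong Cyclotron Accelerated Jet Armstrong Gun"};
    const std::size_t count = sizeof(kNames) / sizeof(kNames[0]);
    return (code >= 0 && static_cast<std::size_t>(code) < count) ? kNames[code] : "unknown";
}

// Formats a report as "<number> x <object> at <place>".
inline std::string Describe(const FindReport& report) {
    return std::to_string(report.number) + " x " + ObjectName(report.object) +
           " at " + PlaceName(report.place);
}
```

Because synonyms share one code, the application never has to compare the recognized text itself.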

That's all for this time. I hope this SRGS example helps. We will write some C++ code in the next part.

This article has been included in the following columns:
Kinect for Windows SDK v2.0 Development Notes

Posted by ntg on Mon, 31 Dec 2018 19:48:10 -0800
