Pocketsphinx API Core Ideas

The PocketSphinx API is designed to ease the use of speech recognizer functionality in your applications:

  1. It is much more likely to remain stable both in terms of source and binary compatibility, due to the use of abstract types.
  2. It is fully re-entrant, so there is no problem having multiple decoders in the same process.
  3. It has enabled a drastic reduction in code footprint and a modest but significant reduction in memory consumption.

Reference documentation for the new API is available at http://cmusphinx.sourceforge.net/api/pocketsphinx/

Basic Usage (hello world)

There are a few key things you need to know about how to use the API:

  1. Command-line parsing is done externally (in <cmd_ln.h>).
  2. Everything takes a ps_decoder_t * as the first argument.

To illustrate the new API, we will step through a simple “hello world” example. This example is somewhat specific to Unix in the locations of files and the compilation process. We will create a C source file called hello_ps.c. To compile it (on Unix), use this command:

gcc -o hello_ps hello_ps.c \
    -DMODELDIR=\"`pkg-config --variable=modeldir pocketsphinx`\" \
    `pkg-config --cflags --libs pocketsphinx sphinxbase`

Compilation errors at this point usually mean that PocketSphinx was not installed as described in the installation guide. For example, PocketSphinx must be properly installed to be visible to the pkg-config system. To check that it is, run pkg-config --cflags --libs pocketsphinx sphinxbase from the command line and confirm that the output looks like this:

-I/usr/local/include -I/usr/local/include/sphinxbase -I/usr/local/include/pocketsphinx  
-L/usr/local/lib -lpocketsphinx -lsphinxbase -lsphinxad

Initialization

The first thing we need to do is to create a configuration object, which for historical reasons is called cmd_ln_t. Along with the general boilerplate for our C program, we will do it like this:

#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
        ps_decoder_t *ps = NULL;
        cmd_ln_t *config = NULL;

        config = cmd_ln_init(NULL, ps_args(), TRUE,
                 "-hmm", MODELDIR "/en-us/en-us",
                 "-lm", MODELDIR "/en-us/en-us.lm.bin",
                 "-dict", MODELDIR "/en-us/cmudict-en-us.dict",
                 NULL);

        return 0;
}

The cmd_ln_init() function takes a variable number of null-terminated string arguments, followed by NULL. The first argument is any previous cmd_ln_t * which is to be updated. The second argument is an array of argument definitions; the standard set can be obtained by calling ps_args(). The third argument is a flag telling the argument parser to be “strict”: if this is TRUE, then duplicate arguments or unknown arguments will cause parsing to fail.
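To make the calling convention concrete, here is a toy sketch of a NULL-terminated list of name/value pairs with a strict flag. It is not the sphinxbase implementation: toy_parse and its three known option names are hypothetical, and what it does in non-strict mode (simply counting the pair) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option names for this toy; the real set comes from ps_args(). */
static const char *const toy_known[] = { "-hmm", "-lm", "-dict" };
#define TOY_NKNOWN (sizeof(toy_known) / sizeof(toy_known[0]))

/*
 * Toy stand-in (NOT the sphinxbase implementation): walk "name", "value"
 * pairs until a NULL name is reached. Returns the number of pairs read,
 * or -1 if strict is nonzero and a name is unknown or repeated.
 */
static int
toy_parse(int strict, ...)
{
    int seen[TOY_NKNOWN] = { 0 };
    int npairs = 0;
    const char *name;
    va_list ap;

    va_start(ap, strict);
    while ((name = va_arg(ap, const char *)) != NULL) {
        const char *value = va_arg(ap, const char *);
        size_t i;
        int bad;

        assert(value != NULL); /* every name must be followed by a value */

        /* Look the name up in the table of known options. */
        for (i = 0; i < TOY_NKNOWN; ++i)
            if (strcmp(name, toy_known[i]) == 0)
                break;
        bad = (i == TOY_NKNOWN) || seen[i];
        if (bad && strict) {
            va_end(ap);
            return -1;
        }
        if (i < TOY_NKNOWN)
            seen[i] = 1;
        ++npairs;
    }
    va_end(ap);
    return npairs;
}
```

The terminating NULL is what tells the callee where the list ends, which is why the call in the listing above finishes with NULL.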

The MODELDIR macro is defined on the GCC command line by using pkg-config to obtain the modeldir variable from the PocketSphinx configuration. On Windows, you can simply add a preprocessor definition to the code, such as this:

#define MODELDIR "c:/sphinx/model"

(replace this with wherever your models are installed). Now, to initialize the decoder, use ps_init:

        ps = ps_init(config);

Decoding a file stream

Because live audio input is somewhat platform-specific, we will confine ourselves to decoding audio files. The “turtle” language model recognizes a very simple “robot control” language, which recognizes phrases such as “go forward ten meters”. In fact, there is an audio file helpfully included in the PocketSphinx source code which contains this very sentence. You can find it in test/data/goforward.raw. Copy it to the current directory. If you want to create your own version of it, it needs to be a single-channel (monaural), little-endian, unheadered 16-bit signed PCM audio file sampled at 16000 Hz.
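Reading such a file straight into a 16-bit sample buffer gives correct values only when the host is itself little-endian. As a minimal sketch of what the format means byte by byte, the helper below rebuilds host-order samples from raw bytes on any host. pcm_le16_to_host is a hypothetical name and is not part of PocketSphinx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert raw little-endian 16-bit signed PCM bytes into host-order samples.
 * nbytes must be even; returns the number of samples written to out.
 * pcm_le16_to_host is a hypothetical helper, not part of PocketSphinx.
 */
static size_t
pcm_le16_to_host(const unsigned char *bytes, size_t nbytes, int16_t *out)
{
    size_t i, nsamp = nbytes / 2;

    assert(nbytes % 2 == 0);
    for (i = 0; i < nsamp; ++i) {
        /* The low byte comes first in the file, the high byte second. */
        uint16_t u = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
        /* Map the unsigned bit pattern onto the signed range portably. */
        out[i] = (u < 0x8000u) ? (int16_t)u : (int16_t)((int32_t)u - 0x10000);
    }
    return nsamp;
}
```

On a little-endian machine this produces the same values as reading the file directly into the buffer, so the conversion matters mainly for portability.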

The main PocketSphinx use case is to read audio data in blocks of memory from somewhere and feed them to the decoder. To do that, we first open the file and start decoding the utterance using ps_start_utt():

        rv = ps_start_utt(ps);

We will then read 512 samples at a time from the file, and feed them to the decoder using ps_process_raw():

        int16 buf[512];
        size_t nsamp;
        /* fread() returns 0 at end of file or on error, which ends the loop */
        while ((nsamp = fread(buf, sizeof(buf[0]), 512, fh)) > 0) {
            ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
        }
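The block-reading pattern can be tried on its own, without a decoder. The sketch below assumes a hypothetical feed_blocks helper and a callback type standing in for the call to ps_process_raw(); neither is part of PocketSphinx. It loops on the return value of fread(), so a short final block is delivered once and a read error cannot cause an endless loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SAMPLES 512

/* Hypothetical callback type standing in for a call to ps_process_raw(). */
typedef void (*block_fn)(const int16_t *samples, size_t nsamp, void *user);

/*
 * Read fh in blocks of up to BLOCK_SAMPLES samples and hand each non-empty
 * block to fn. Returns the total number of samples read.
 */
static size_t
feed_blocks(FILE *fh, block_fn fn, void *user)
{
    int16_t buf[BLOCK_SAMPLES];
    size_t nsamp, total = 0;

    assert(fh != NULL && fn != NULL);
    /* fread() returns 0 at end of file or on error, which ends the loop. */
    while ((nsamp = fread(buf, sizeof buf[0], BLOCK_SAMPLES, fh)) > 0) {
        fn(buf, nsamp, user);
        total += nsamp;
    }
    return total;
}

/* Example callback: count how many blocks were delivered. */
static void
count_block(const int16_t *samples, size_t nsamp, void *user)
{
    (void)samples;
    (void)nsamp;
    ++*(int *)user;
}
```

With a 1000-sample file, the callback would be invoked twice: once with 512 samples and once with the remaining 488.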

Then we will need to mark the end of the utterance using ps_end_utt():

        rv = ps_end_utt(ps);

Then we retrieve the hypothesis to get the recognition result:

        hyp = ps_get_hyp(ps, &score);
        printf("Recognized: %s\n", hyp);

We can also retrieve the hypothesis during recognition, in which case it contains a partial result.

Cleaning up

To clean up, simply call ps_free() on the object that was returned by ps_init(). Free the configuration object with cmd_ln_free_r().

Code listing

#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
    ps_decoder_t *ps;
    cmd_ln_t *config;
    FILE *fh;
    char const *hyp;
    int16 buf[512];
    int rv;
    int32 score;

    config = cmd_ln_init(NULL, ps_args(), TRUE,
                 "-hmm", MODELDIR "/en-us/en-us",
                 "-lm", MODELDIR "/en-us/en-us.lm.bin",
                 "-dict", MODELDIR "/en-us/cmudict-en-us.dict",
                 NULL);
    if (config == NULL) {
        fprintf(stderr, "Failed to create config object, see log for details\n");
        return -1;
    }
    
    ps = ps_init(config);
    if (ps == NULL) {
        fprintf(stderr, "Failed to create recognizer, see log for details\n");
        return -1;
    }

    fh = fopen("goforward.raw", "rb");
    if (fh == NULL) {
        fprintf(stderr, "Unable to open input file goforward.raw\n");
        return -1;
    }

    rv = ps_start_utt(ps);
    
    for (;;) {
        size_t nsamp = fread(buf, sizeof(buf[0]), 512, fh);
        if (nsamp == 0)
            break; /* end of file or read error */
        rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
    }
    
    rv = ps_end_utt(ps);
    hyp = ps_get_hyp(ps, &score);
    printf("Recognized: %s\n", hyp);

    fclose(fh);
    ps_free(ps);
    cmd_ln_free_r(config);
    
    return 0;
}

http://cmusphinx.sourceforge.net/wiki/tutorialpocketsphinx#basic_usage_hello_world

Advanced Usage

For more complicated uses of the API, please check the API reference. Some of the available features are:

  1. For word segmentations, the API provides an iterator object for walking over the sequence of words. This iterator object is an abstract type, with accessors provided to obtain timepoints, scores, and (most interestingly) posterior probabilities for each word.
  2. The confidence of the whole utterance can be obtained with the ps_get_prob function.
  3. You can access the lattice if needed.
  4. You can configure multiple searches and switch between them at runtime.
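When working with word timepoints, it helps to convert frame indices into time. The sketch below assumes that timepoints are expressed as frame indices counted at the decoder's frame rate, and that the rate is 100 frames per second, which is believed to be the default for the -frate option. Check your own configuration before relying on either assumption. frame_to_ms is a hypothetical helper.

```c
#include <assert.h>

/* Assumed frame rate in frames per second; check your -frate setting. */
#define ASSUMED_FRATE 100

/*
 * Convert a frame index into milliseconds from the start of the utterance.
 * frame_to_ms is a hypothetical helper, not part of PocketSphinx.
 */
static long
frame_to_ms(int frame, int frate)
{
    assert(frame >= 0 && frate > 0);
    return (long)frame * 1000L / frate;
}
```

Under these assumptions, a word starting at frame 150 begins 1.5 seconds into the utterance.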

Searches

A developer can configure several “search” objects with different grammars and language models and switch between them at runtime to provide an interactive experience for the user.

There are different possible search modes:

  1. keyword - efficiently looks for a keyphrase and ignores other speech. The detection threshold is configurable.
  2. grammar - recognizes speech according to a JSGF grammar. Unlike keyphrase search, grammar search does not ignore words that are not in the grammar but tries to recognize them.
  3. ngram/lm - recognizes natural speech with a language model.
  4. allphone - recognizes phonemes with a phonetic language model.

Each search has a name and can be referenced by that name; names are application-specific. The function ps_set_search activates a previously added search by its name.

To add a search, you need to point to the grammar or language model that describes it. The location of the grammar is specific to the application. If only simple recognition is required, it is sufficient to add a single search or just to configure the required mode with configuration options.

The exact design of the searches depends on your application. For example, you might want to listen for an activation keyword first and, once the keyword is recognized, switch to ngram search to recognize the actual command. Once you have recognized the command, you can switch to grammar search to recognize the confirmation and then switch back to keyword listening mode to wait for another command.
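The flow in this example can be sketched as a small state machine. Everything named here is hypothetical: the stage_t enum, the search names "wakeup", "command" and "confirm", and the rule that an empty result keeps the current stage. The sketch only decides which search should be active next; it does not call the decoder itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical listening stages for the keyword/command/confirm flow. */
typedef enum {
    STAGE_KEYWORD,  /* waiting for the activation keyphrase */
    STAGE_COMMAND,  /* ngram search for the actual command */
    STAGE_CONFIRM   /* grammar search for the confirmation */
} stage_t;

/*
 * Application-chosen search names (assumptions); in a real program these
 * would be the names used when the searches were added to the decoder.
 */
static const char *
stage_search_name(stage_t stage)
{
    switch (stage) {
    case STAGE_KEYWORD: return "wakeup";
    case STAGE_COMMAND: return "command";
    case STAGE_CONFIRM: return "confirm";
    }
    assert(!"unknown stage");
    return NULL;
}

/*
 * Decide the next stage once an utterance ends. hyp is the recognized text,
 * or NULL if nothing was recognized, in which case the stage is unchanged.
 */
static stage_t
next_stage(stage_t stage, const char *hyp)
{
    if (hyp == NULL)
        return stage;
    switch (stage) {
    case STAGE_KEYWORD: return STAGE_COMMAND;
    case STAGE_COMMAND: return STAGE_CONFIRM;
    case STAGE_CONFIRM: return STAGE_KEYWORD;
    }
    return STAGE_KEYWORD;
}
```

In an application, each stage change would be followed by a call to ps_set_search with the name returned by stage_search_name, so that the decoder listens with the matching search.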
