Wednesday 31 August 2022

Learning to Reason with Neural Module Networks

 Suppose we’re building a household robot, and want it to be able to answer questions about its surroundings. We might ask questions like these:

How can we ensure that the robot can answer these questions correctly? The standard approach in deep learning is to collect a large dataset of questions, images, and answers, and train a single neural network to map directly from questions and images to answers. If most questions look like the one on the left, we have a familiar image recognition problem, and these kinds of monolithic approaches are quite effective:

But things don’t work quite so well for questions like the one on the right:

Here the network we trained has given up and guessed the most common color in the image. What makes this question so much harder? Even though the image is cleaner, the question requires many steps of reasoning: rather than simply recognizing the main object in the image, the model must first find the blue cylinder, locate the other object with the same size, and then determine its color. This is a complicated computation, and it’s a computation specific to the question that was asked. Different questions require different sequences of steps to solve.

The dominant paradigm in deep learning is a "one size fits all" approach: for whatever problem we’re trying to solve, we write down a fixed model architecture that we hope can capture everything about the relationship between the input and output, and learn parameters for that fixed model from labeled training data.

But real-world reasoning doesn’t work this way: it involves a variety of different capabilities, combined and synthesized in new ways for every new challenge we encounter in the wild. What we need is a model that can dynamically determine how to reason about the problem in front of it—a network that can choose its own structure on the fly. In this post, we’ll talk about a new class of models we call neural module networks (NMNs), which incorporate this more flexible approach to problem-solving while preserving the expressive power that makes deep learning so effective.



Earlier, we noticed that there are three different steps involved in answering the question above: finding a blue cylinder, finding something else the same size, and determining its color. We can draw this schematically like:

A different question might involve a different series of steps. If we ask "how many things are the same size as the ball?", we might have something like:

Basic operations like "compare size" are shared between questions, but they get used in different ways. The key idea behind NMNs is to make this sharing explicit: we use two different network structures to answer the two questions above, but we share weights between pieces of networks that involve the same basic operations:

How do we learn a model like this? Rather than training a single large network on lots of input/output pairs, we actually train a huge number of different networks at the same time, while tying their parameters together where appropriate:

(Several recent deep learning frameworks, including DyNet and TensorFlow Fold, were explicitly designed with this kind of dynamic computation in mind.)

What we get at the end of the training process is not a single deep network, but rather a collection of neural "modules", each of which implements a single step of reasoning. When we want to use our trained model on a new problem instance, we can assemble these modules dynamically to produce a new network structure tailored to that problem.

One of the remarkable things about this process is that we don’t need to provide any low-level supervision for individual modules: the model never sees an isolated example of a blue object or a "left-of" relationship. Modules are learned only inside larger composed structures, with only (question, answer) pairs as supervision. But the training procedure is able to automatically infer the correct relationship between pieces of structure and the computations they’re responsible for:

This same process works for answering questions about more realistic photographs, and even other knowledge sources like databases:

The key ingredient in this whole process is a collection of high-level "reasoning blueprints" like the ones above. These blueprints tell us how the network for each question should be laid out, and how different questions relate to one another. But where do the blueprints come from?

In our initial work on these models (1, 2), we drew on a surprising connection between the problem of designing question-specific neural networks and the problem of analyzing grammatical structure. Linguists have long observed that the grammar of a question is closely related to the sequence of computational steps needed to answer it. Thanks to recent advances in natural language processing, we can use off-the-shelf tools for grammatical analysis to provide approximate versions of these blueprints automatically.

But finding exactly the right mapping from linguistic structure to network structure is still a challenging problem, and the conversion process is prone to errors. In later work, rather than relying on this kind of linguistic analysis, we instead turned to data produced by human experts who directly labeled a collection of questions with idealized reasoning blueprints (3). By learning to imitate these humans, our model was able to improve the quality of its predictions substantially. Most surprisingly, when we took a model trained to imitate experts, but allowed it to explore its own modifications to these expert predictions, it was able to find even better solutions than experts on a variety of questions.


Despite the remarkable success of deep learning methods in recent years, many problems—including few-shot learning and complex reasoning—remain a challenge. But these are exactly the sorts of problems where more structured classical techniques like semantic parsing and program induction really shine. Neural module networks give us the best of both worlds: the flexibility and data efficiency of discrete compositionality, combined with the representational power of deep networks. NMNs have already seen a number of successes for visual and textual reasoning tasks, and we’re excited to start applying them to other AI problems as well.


This post is based on the following papers:

  1. Neural Module Networks. Jacob Andreas, Marcus Rohrbach, Trevor Darrell and Dan Klein. CVPR 2016. (arXiv)

  2. Learning to Compose Neural Networks for Question Answering. Jacob Andreas, Marcus Rohrbach, Trevor Darrell and Dan Klein. NAACL 2016. (arXiv)

  3. Modeling Relationships in Referential Expressions with Compositional Modular Networks. Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell and Kate Saenko. CVPR 2017. (arXiv)

Images are from the VQA and CLEVR datasets.

Wednesday 17 August 2022

Demand for AI helps Microsoft outperform financial expectations

 Microsoft is incorporating a ChatGPT-style assistant into all of its Office apps, including Word, Teams, and Outlook, and has invested in OpenAI, the maker of the generative chatbot. Microsoft has beaten Wall Street expectations, posting better-than-expected revenues as its cloud computing and office software business grew amid increasing demand for artificial intelligence (AI).

Revenue rose to $56.2bn (£43.56bn) in the fourth quarter (the three months to the end of June), up 8% on the same period a year ago and greater than the 7% growth expected by analysts. The company chairman and chief executive, Satya Nadella, said Microsoft was "focused on leading the new AI platform shift", as tech giants compete to develop AI products and companies seek to adopt AI quickly.

"Organisations are asking not only how - but how fast - they can apply this next generation of AI to address the biggest opportunities and challenges they face - safely and responsibly," Mr Nadella said.

Investors and tech developers have been seeking a foothold in the burgeoning AI boom, and Microsoft is one of the main competitors in the AI race: it invested in OpenAI, the maker of the generative AI chatbot ChatGPT.

It has also invested in its own products. In February, it revealed a new Bing search engine powered by chatbot technology, gaining ground on rival Google.

The following month, it announced an AI "copilot for work" that can write emails and allow users to catch up on skipped meetings, and it is incorporating the ChatGPT-style assistant into all of its Office apps, including Word, Teams, and Outlook.

Microsoft's AI platform, Azure, was behind the growth in cloud computing at the tech company.

Azure and other cloud services revenue grew 26% compared to a year before.

Despite the positive results, shares fell slightly as capital expenditure rose from $7.8bn (£6bn) a year earlier to $10.7bn (£8.29bn), driven by the building of new data centres.

The stock price reached a record high last week as a $30 (£23.25) monthly subscription for generative AI features in its software was announced.

Thursday 11 August 2022

Artificial Neural Networks made easy with the FANN library

 Neural networks are typically associated with specialised applications, developed only by select groups of experts. This misconception has had a highly negative effect on their popularity. Hopefully, the FANN library will help fill this gap.

Introduction

For years, the Hollywood science fiction films such as I, Robot have portrayed artificial intelligence (AI) as a harbinger of Armageddon. Nothing, however, could be farther from the truth. While Hollywood regales us with horror stories of our imminent demise, people with no interest in the extinction of our species have been harnessing AI to make our lives easier, more productive, longer, and generally better.

The robots in the I, Robot film have an artificial brain based on a network of artificial neurons; this artificial neural network (ANN) is built to model the human brain's own neural network. The Fast Artificial Neural Network (FANN) library is an ANN library which can be used from C, C++, PHP, Python, Delphi, and Mathematica, and although it cannot create Hollywood magic, it is still a powerful tool for software developers. ANNs can be used in areas as diverse as creating more appealing game-play in computer games, identifying objects in images, and helping stockbrokers predict trends in the ever-changing stock market.

Artificial Intelligence

When is something or somebody intelligent? Is a dog intelligent? How about a newborn baby? Normally, we define intelligence as the ability to acquire and apply knowledge, reason deductively, and exhibit creativity. If we were to apply the same standards to artificial intelligence (AI), it would follow that there is currently no such thing as AI. Normally, however, AI is defined as the ability to perform functions that are typically associated with human intelligence. Therefore, AI can be used to describe all computerised efforts dealing with the learning or application of human knowledge. This definition allows the term AI to describe even the simplest chess computer or a character in a computer game.

Function approximation

ANNs apply the principle of function approximation by example, meaning that they learn a function by looking at examples of this function. One of the simplest examples is an ANN learning the XOR function, but it could just as easily be learning to determine the language of a text, or whether there is a tumour visible in an X-ray image.

If an ANN is to be able to learn a problem, it must be defined as a function with a set of input and output variables supported by examples of how this function should work. A problem like the XOR function is already defined as a function with two binary input variables and a binary output variable, and with the examples which are defined by the results of four different input patterns. However, there are more complicated problems which can be more difficult to define as functions. The input variables to the problem of finding a tumour in an X-ray image could be the pixel values of the image, but they could also be some values extracted from the image. The output could then either be a binary value or a floating-point value representing the probability of a tumour in the image. In ANNs, this floating-point value would normally be between 0 and 1, inclusive.
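The XOR function makes this "function defined by examples" idea concrete. In the training-file format the FANN library uses (a header line giving the number of patterns, inputs, and outputs, followed by alternating input and output lines, as shown later in this article), the complete XOR problem is just:

```
4 2 1
0 0
0
0 1
1
1 0
1
1 1
0
```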


The human brain

A function approximator like an ANN can be viewed as a black box, and when it comes to FANN, this is more or less all you will need to know. However, basic knowledge of how the human brain operates is needed to understand how ANNs work.

The human brain is a highly complicated system which is capable of solving very complex problems. The brain consists of many different elements, but one of its most important building blocks is the neuron, of which it contains approximately 10^11. These neurons are connected by around 10^15 connections, creating a huge neural network. Neurons send impulses to each other through these connections, and the impulses make the brain work. The neural network also receives impulses from the five senses and sends out impulses to muscles to achieve motion or speech.

The individual neuron can be seen as an input-output machine which waits for impulses from the surrounding neurons and, when it has received enough impulses, it sends out an impulse to other neurons.

Artificial Neural Networks

Artificial neurons are similar to their biological counterparts. They have input connections which are summed together to determine the strength of their output, which is the result of the sum being fed into an activation function. Though many activation functions exist, the most common is the sigmoid activation function, which outputs a number between 0 (for low input values) and 1 (for high input values). The result of this function is then passed as input to other neurons through further connections, each of which is weighted. These weights determine the behaviour of the network.

In the human brain, the neurons are connected in a seemingly random order and send impulses asynchronously. If we wanted to model a brain, this might be the way to organise an ANN, but since we primarily want to create a function approximator, ANNs are usually not organised like this.

When we create ANNs, the neurons are usually ordered in layers with connections going between the layers. The first layer contains the input neurons and the last layer contains the output neurons. These input and output neurons represent the input and output variables of the function that we want to approximate. Between the input and the output layer, a number of hidden layers exist and the connections (and weights) to and from these hidden layers determine how well the ANN performs. When an ANN is learning to approximate a function, it is shown examples of how the function works, and the internal weights in the ANN are slowly adjusted so as to produce the same output as in the examples. The hope is that when the ANN is shown a new set of input variables, it will give a correct output. Therefore, if an ANN is expected to learn to spot a tumour in an X-ray image, it will be shown many X-ray images containing tumours, and many X-ray images containing healthy tissues. After a period of training with these images, the weights in the ANN should hopefully contain information which will allow it to positively identify tumours in X-ray images that it has not seen during the training.

A FANN library tutorial

The Internet has made global communication a part of many people's lives, but it has also given rise to the problem that not everyone speaks the same language. Translation tools can help bridge this gap, but in order for such tools to work, they need to know in what language a passage of text is written. One way to determine this is by examining the frequency of letters occurring in a text. While this seems like a very naïve approach to language detection, it has proven to be very effective. For many European languages, it is enough to look at the frequencies of the letters A to Z, even though some languages use other letters as well. Easily enough, the FANN library can be used to make a small program that determines the language of a text file. The ANN used should have an input neuron for each of the 26 letters and an output neuron for each of the languages. But first, a small program must be made for measuring the frequency of the letters in a text file.

Listing 1 will generate letter frequencies for a file and output them in a format that can be used to generate a training file for the FANN library. Training files for the FANN library must consist of a line containing the input values, followed by a line containing the output values. If we wish to distinguish between three different languages (English, French, and Polish), we could choose to represent this by allocating one output variable with a value of 0 for English, 0.5 for French, and 1 for Polish. Neural networks are, however, known to perform better if an output variable is allocated for each language, set to 1 for the correct language and 0 otherwise.

Listing 1. Program that calculates the frequencies of the letters A-Z in a text file.

#include <vector>
#include <fstream>
#include <iostream>
#include <cstdlib>   // for std::exit
#include <ctype.h>

void error(const char* p, const char* p2 = "")
{
    std::cerr << p << ' ' << p2 << std::endl;
    std::exit(1);
}

void generate_frequencies(const char *filename, float *frequencies)
{
    std::ifstream infile(filename);
    if(!infile) error("Cannot open input file", filename);

    std::vector<unsigned int> letter_count(26, 0);
    unsigned int num_characters = 0;
    char c;
    while(infile.get(c)){
        c = tolower(c);
        if(c >= 'a' && c <= 'z'){
            letter_count[c - 'a']++;
            num_characters++;
        }
    }

    if(!infile.eof()) error("Something strange happened");
    for(unsigned int i = 0; i != 26; i++){
        frequencies[i] = letter_count[i]/(double)num_characters;
    }
}
int main(int argc, char* argv[])
{
    if(argc != 2) error("Remember to specify an input file");

    float frequencies[26];
    generate_frequencies(argv[1], frequencies);

    for(unsigned int i = 0; i != 26; i++){
        std::cout << frequencies[i] << ' ';
    }
    std::cout << std::endl;

    return 0;
}

With this small program at hand, a training file containing letter frequencies can be generated for texts written in the different languages. The ANN will, of course, be better at distinguishing the languages if frequencies for many different texts are available in the training file, but for this small example, 3-4 texts in each language should be enough. Listing 2 shows a pre-generated training file using four text files for each of the three languages, and Figure 2 shows a graphical representation of the frequencies in the file. A thorough inspection of this file shows clear trends: English has more H's than the other two languages, French has almost no K's, and Polish has more W's and Z's than the other languages. The training file only uses letters in the A to Z range, but since a language like Polish uses letters like Ł, Ą, and Ę which are not used in the other two languages, a more precise ANN could be made by adding input neurons for these letters as well. When comparing only three languages, however, there is no need for these extra letters, since the remaining letters contain enough information to classify the languages correctly; but if the ANN were to classify hundreds of different languages, more letters would be required.

Listing 2. The first part of the training file with character frequencies for English, French, and Polish. The first line is a header stating that there are 12 training patterns, each consisting of 26 inputs and 3 outputs.

12 26 3

0.103 0.016 0.054 0.060 0.113 0.010 0.010 0.048 0.056
0.003 0.010 0.035 0.014 0.065 0.075 0.013 0.000 0.051
0.083 0.111 0.030 0.008 0.019 0.000 0.016 0.000

1 0 0

0.076 0.010 0.022 0.039 0.151 0.013 0.009 0.009 0.081
0.001 0.000 0.058 0.024 0.074 0.061 0.030 0.011 0.069
0.100 0.074 0.059 0.015 0.000 0.009 0.003 0.003

0 1 0

0.088 0.016 0.030 0.034 0.089 0.004 0.011 0.023 0.071
0.032 0.030 0.025 0.047 0.058 0.093 0.040 0.000 0.062
0.044 0.035 0.039 0.002 0.044 0.000 0.037 0.046

0 0 1

0.078 0.013 0.043 0.043 0.113 0.024 0.023 0.041 0.068
0.000 0.005 0.045 0.024 0.069 0.095 0.020 0.001 0.061
0.080 0.090 0.029 0.015 0.014 0.000 0.008 0.000

1 0 0

0.061 0.005 0.028 0.040 0.161 0.019 0.010 0.010 0.066
0.016 0.000 0.035 0.028 0.092 0.061 0.031 0.019 0.059
0.101 0.064 0.076 0.016 0.000 0.002 0.002 0.000

0 1 0

0.092 0.016 0.038 0.025 0.083 0.000 0.015 0.009 0.087
0.030 0.040 0.032 0.033 0.063 0.085 0.033 0.000 0.049
0.053 0.033 0.025 0.000 0.053 0.000 0.038 0.067

0 0 1

...

Figure 2. A bar chart of the average frequencies in English, French, and Polish.

With a training file like this, it is very easy to create a program using FANN which can train an ANN to distinguish between the three languages. Listing 3 shows just how simply this can be done. The program uses four FANN functions: fann_create, fann_train_on_file, fann_save, and fann_destroy. The function struct fann* fann_create(float connection_rate, float learning_rate, unsigned int num_layers, ...) is used to create an ANN; the connection_rate parameter can be used to create an ANN that is not fully connected (although fully connected ANNs are normally preferred), and the learning_rate specifies how aggressive the learning algorithm should be (only relevant for some learning algorithms). The last parameters of the function define the layout of the layers in the ANN. In this case, an ANN with three layers (one input, one hidden, and one output) has been chosen. The input layer has 26 neurons (one for each letter), the output layer has three neurons (one for each language), and the hidden layer has 13 neurons.

The number of layers and the number of neurons in the hidden layer have been selected experimentally, as there is really no easy way of determining these values. It helps, however, to remember that the ANN learns by adjusting the weights, so if an ANN contains more neurons, and thereby more weights, it can learn more complicated problems. Having too many weights can also be a problem, since learning can become more difficult, and there is a chance that the ANN will learn specific features of the input variables instead of general patterns which can be extrapolated to other data sets. In order for an ANN to accurately classify data not in the training set, this ability to generalise is crucial – without it, the ANN will be unable to distinguish frequencies that it has not been trained with.

Listing 3. A program that trains an ANN to learn to distinguish between languages.

#include "fann.h"
int main()
{
    struct fann *ann = fann_create(1, 0.7, 3, 26, 13, 3);
    fann_train_on_file(ann, "frequencies.data", 200, 10, 0.0001);
    fann_save(ann, "language_classify.net");
    fann_destroy(ann);
    return 0;
}

The void fann_train_on_file(struct fann *ann, char *filename, unsigned int max_epochs, unsigned int epochs_between_reports, float desired_error) function trains the ANN. The training is done by continually adjusting the weights so that the output of the ANN matches the output in the training file. One cycle in which the weights are adjusted to match the output in the training file is called an epoch. In this example, the maximum number of epochs has been set to 200, and a status report is printed every 10 epochs. When measuring how closely an ANN matches the desired output, the mean square error is usually used: the mean value of the squared differences between the actual and the desired output of the ANN, over the individual training patterns. A small mean square error means a close match to the desired output.

When the program in Listing 3 is run, the ANN will be trained, and some status information (see Listing 4) will be printed to make it easier to monitor progress during training. After training, the ANN could be used directly to determine which language a text is written in, but it is usually desirable to keep training and execution in two separate programs, so that the more time-consuming training need be done only once. For this reason, Listing 3 simply saves the ANN to a file that can be loaded by another program.

Listing 4. Output from FANN during training.

Max epochs 200. Desired error: 0.0001000000
Epochs 1. Current error: 0.7464869022
Epochs 10. Current error: 0.7226278782
Epochs 20. Current error: 0.6682052612
Epochs 30. Current error: 0.6573708057
Epochs 40. Current error: 0.5314316154
Epochs 50. Current error: 0.0589125119
Epochs 57. Current error: 0.0000702030

The small program in Listing 5 loads the saved ANN and uses it to classify a text as English, French, or Polish. When tested with texts in the three languages found on the Internet, it can properly classify texts as short as only a few sentences. Although this method for distinguishing between languages is not bullet-proof, I was not able to find a single text that could be classified incorrectly.

Listing 5. A program classifying a text as written in one of the three languages (The program uses some functions defined in Listing 1).

#include <iostream>
#include "fann.h"

// error() and generate_frequencies() are defined in Listing 1.
int main(int argc, char* argv[])
{
    if(argc != 2) error("Remember to specify an input file");
    struct fann *ann = fann_create_from_file("language_classify.net");

    float frequencies[26];
    generate_frequencies(argv[1], frequencies);

    float *output = fann_run(ann, frequencies);
    std::cout << "English: " << output[0] << std::endl
              << "French : " << output[1] << std::endl
              << "Polish : " << output[2] << std::endl;

    return 0;
}

The FANN library: Details

The language classification example shows just how easily the FANN library can be applied to solve simple, everyday computer science problems which would be much more difficult to solve using other methods. Unfortunately, not all problems can be solved this easily, and when working with ANNs, one often finds oneself in a situation in which it is very difficult to train the ANN to give the correct results. Sometimes, this is because the problem simply cannot be solved by ANNs, but often, the training can be helped by tweaking the FANN library settings.

The most important factor when training an ANN is the size of the ANN. This can only be determined experimentally, but knowledge of the problem will often help in making good guesses. With a reasonably sized ANN, the training can be done in many different ways. The FANN library supports several different training algorithms, and the default (FANN_TRAIN_RPROP) may not always be the best suited for a specific problem; if this is the case, the fann_set_training_algorithm function can be used to change the training algorithm.

In version 1.2.0 of the FANN library, there are four different training algorithms available, all of which use some sort of back-propagation. Back-propagation algorithms change the weights by propagating the error backwards from the output layer to the input layer, adjusting the weights along the way. The back-propagated error value can either be an error calculated for a single training pattern (incremental) or a sum of errors from the entire training file (batch). FANN_TRAIN_INCREMENTAL implements an incremental training algorithm which alters the weights after each training pattern. The advantage of such an algorithm is that the weights are altered many times during each epoch, and since each training pattern alters the weights in slightly different directions, the training does not easily get stuck in local minima: states in which all small changes to the weights make the mean square error worse, even though the optimal solution may not yet have been found.

FANN_TRAIN_BATCH, FANN_TRAIN_RPROP, and FANN_TRAIN_QUICKPROP are all examples of batch-training algorithms, which alter the weights after calculating the errors for an entire training set. The advantage of these algorithms is that they can make use of global optimisation information which is not available to incremental training algorithms. This can, however, mean that some of the finer points of the individual training patterns are missed.

There is no clear answer to the question of which training algorithm is best. One of the advanced batch-training algorithms, such as rprop or quickprop, is usually the best solution. Sometimes, however, incremental training works better – especially if many training patterns are available. In the language example, the best training algorithm is the default rprop, which reached the desired mean square error after just 57 epochs. The incremental training algorithm needed 8108 epochs to reach the same result, while the batch training algorithm needed 91985. The quickprop training algorithm had more problems: at first, it failed altogether to reach the desired error value, but after tweaking its decay, it reached the desired error after 662 epochs. The decay of the quickprop algorithm is a parameter which controls how aggressive the quickprop training algorithm is, and it can be altered with the fann_set_quickprop_decay function. Other fann_set_... functions can be used to set additional parameters for the individual training algorithms, although some of these parameters can be difficult to tweak without knowledge of how the individual algorithms work.

One parameter which is independent of the training algorithm can, however, be tweaked rather easily: the steepness of the activation function. The activation function determines when the output should be close to 0 and when it should be close to 1, and its steepness determines how soft or hard the transition from 0 to 1 should be. If the steepness is set to a high value, the training algorithm will converge faster to the extreme values of 0 and 1, which will make training faster for, e.g., the language classification problem. If the steepness is set to a low value, however, it is easier to train an ANN that requires fractional output, such as an ANN trained to find the direction of a line in an image. For setting the steepness of the activation function, FANN provides two functions: fann_set_activation_steepness_hidden and fann_set_activation_steepness_output. There are two functions because it is often desirable to have different steepness for the hidden layers and for the output layer.

FANN possibilities

The language identification problem belongs to a special kind of function approximation problems known as classification problems. Classification problems have one output neuron per classification, and in each training pattern, precisely one of these outputs must be 1. A more general function approximation problem is where the outputs are fractional values. This could, e.g., be approximating the distance to an object viewed by a camera, or even the energy consumption of a house. These problems could of course be combined with classification problems, so there could be a classification problem of identifying the kind of object in an image and a problem of approximating the distance to the object. Often, this can be done by a single ANN, but sometimes it might be a good idea to keep the two problems separate, and e.g., have an ANN which classifies the object, and an ANN for each of the different objects which approximates the distance to the object.

Another kind of approximation problem is the time-series problem: approximating a function which evolves over time. A well-known time-series problem is predicting how many sunspots there will be in a given year by looking at historical data. Normal functions have an x-value as input and a y-value as output, and the sunspot problem could be defined like this too, with the year as the x-value and the number of sunspots as the y-value; this has, however, proved not to be the best way of solving such problems. Instead, time-series problems can be approximated by using a period of time as input and the next time step as output. If the period is set to 10 years, the ANN could be trained with all the 10-year periods for which historical data exists, and it could then approximate the number of sunspots in 2005 by using the numbers of sunspots in 1995–2004 as inputs. This approach means that each piece of historical data is used in several training patterns; e.g., the number of sunspots for 1980 is used in the training patterns with 1981–1990 as outputs. It also means that the number of sunspots for 2010 cannot be approximated directly without first approximating 2005–2009, which in turn means that half of the input for calculating 2010 will itself be approximated data, so the approximation for 2010 will not be as precise as the one for 2005. For this reason, time-series prediction is well suited only for predicting things in the near future.

Time-series prediction can also be used to introduce memory in controllers for robots and the like. This could, e.g., be done by giving the direction and speed from the last two time steps as input to the third time step, in addition to other inputs from sensors or cameras. The major problem with this approach, however, is that training data can be very difficult to produce, since each training pattern must also include history.

FANN tips & tricks

Lots of tricks can be used to make FANN train and execute faster and with greater precision. A simple trick which makes training faster and more precise is to use input and output values in the range -1 to 1 instead of 0 to 1. This is done by changing the values in the training file and using fann_set_activation_function_hidden and fann_set_activation_function_output to change the activation function to FANN_SIGMOID_SYMMETRIC, which has outputs in the range -1 to 1 instead of 0 to 1. The trick works because a 0 input in an ANN has the unfortunate property that, no matter what value the weight has, its contribution to the output is still 0. There are, of course, countermeasures in FANN to prevent this from becoming a big problem; nonetheless, this trick has been shown to reduce training time. fann_set_activation_function_output can also be used to change the activation function to the FANN_LINEAR activation function, which is unbounded and can therefore be used to create ANNs with arbitrary outputs.

When training an ANN, it is often difficult to know how many epochs should be used. If too few epochs are used, the ANN will not be able to classify the training data. If too many are used, the ANN becomes too specialised in the exact values of the training data and will be poor at classifying data it has not seen during training. For this reason, it is often a good idea to have two sets of data: one used during the actual training, and one used to verify the quality of the ANN by testing it on data which was not seen during training. The fann_test_data function can be used for this purpose, along with other functions for handling and manipulating training data.

Transforming a problem into a function which can easily be learned by an ANN can be a difficult task, but some general guidelines can be followed:

  • Use at least one input/output neuron for each informative unit. For the language classification system, this means one input neuron for each letter and one output neuron for each language.
  • Represent all the knowledge that you as a programmer have about the problem when choosing the input neurons. If you, e.g., know that word length is important for the language classification system, then add an input neuron for the word length (this could also be done by adding an input neuron for the frequency of spaces). Also, if you know that some letters are only used in some languages, it might be a good idea to add an extra input neuron which is 1 if the letter is present in the text and 0 if it is not. In this way, even a single Polish letter in a text can help classify it. Perhaps you know that some languages contain more vowels than others; the frequency of vowels can then be represented as an extra input neuron.
  • Simplify the problem. If you, e.g., want to use an ANN to detect some feature in an image, it might be a good idea to simplify the image first, since the raw image often contains far too much information and it will be difficult for the ANN to filter out the relevant parts. In images, simplification can be done by applying filters for smoothing, edge detection, ridge detection, grey-scaling, etc. Other problems can be simplified by preprocessing the data in other ways to remove unnecessary information. Simplification can also mean splitting one problem into several easier problems, each solved by its own ANN. In the language classification problem, one ANN could, e.g., distinguish between European and Asian languages, while two others classify the individual languages within each area.

While training the ANN is often the big time consumer, execution can be more time-critical, especially in systems where the ANN needs to be executed hundreds of times per second or when the ANN is very large. Several measures can be applied to make the FANN library execute even faster than it already does. One method is to change to a stepwise linear activation function, which is faster to execute but a bit less precise. It is also a good idea to reduce the number of hidden neurons if possible, since this reduces the execution time. Another method, only effective on embedded systems without a floating point processor, is to let the FANN library execute using integers only. The FANN library has a few auxiliary functions allowing it to be executed using only integers, and on systems which do not have a floating point processor, this can give a performance enhancement of more than 5000%.

On the Net

A tale from the open source world

When I first released the FANN library version 1.0 in November 2003, I did not really know what to expect, but I thought that everybody should have the option to use this new library that I had created. Much to my surprise, people actually started downloading and using it. As the months went by, more and more users adopted FANN, and the library evolved from being Linux-only to supporting most major compilers and operating systems (including MSVC++ and Borland C++). The functionality of the library was also considerably expanded, and many of the users started contributing. Soon the library had bindings for PHP, Python, Delphi, and Mathematica, and it was also accepted into the Debian Linux distribution. My work with FANN and with the users of the library takes up some of my spare time, but it is time that I gladly spend. FANN gives me an opportunity to give something back to the open source community, and it gives me a chance to help people while doing stuff I enjoy. I cannot say that developing open source software is something all software developers should do, but it has given me a great deal of satisfaction, so if you think it might be something for you, find an open source project that you would like to contribute to, and start contributing. Or even better, start your own open source project.

Interpreting Hand Gestures and Sign Language in the Webcam with AI using TensorFlow.js

 In this article, we will take photos of different hand gestures via webcam and use transfer learning on a pre-trained MobileNet model to build a computer vision AI that can recognize the various gestures in real time.

Here we look at: detecting hand gestures, creating our starting point and using it to detect four different categories (None, Rock, Paper, Scissors), and adding some American Sign Language (ASL) categories to explore how much harder it is for the AI to detect other gestures.

TensorFlow + JavaScript. The most popular, cutting-edge AI framework now supports the most widely used programming language on the planet, so let’s make magic happen through deep learning right in our web browser, GPU-accelerated via WebGL using TensorFlow.js!


Starting Point

To recognize multiple hand gestures, we are going to use almost-ready starter code and expand it to detect more categories of objects. Here is what the code will do:

  • Import TensorFlow.js and TensorFlow’s tf-data.js
  • Define the gesture category labels (None, Rock, Paper, Scissors)
  • Add a video element for the webcam
  • Run the model prediction every 200 ms after it’s been trained for the first time
  • Show the prediction result
  • Load a pre-trained MobileNet model and prepare for transfer learning to as many categories as there are labels
  • Train and classify a variety of custom objects in images
  • Skip disposing image and target samples in the training process to keep them for multiple training runs

Here is our starting point for this project:

JavaScript
<html>
    <head>
        <meta charset="UTF-8">
        <title>Interpreting Hand Gestures and Sign Language in the Webcam with AI using TensorFlow.js</title>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-data@2.0.0/dist/tf-data.min.js"></script>
        <style>
            img, video {
                object-fit: cover;
            }
        </style>
    </head>
    <body>
        <video autoplay playsinline muted id="webcam" width="224" height="224"></video>
        <div id="buttons">
            <button onclick="captureSample(0)">None</button>
            <button onclick="captureSample(1)">✊ (Rock)</button>
            <button onclick="captureSample(2)">🖐 (Paper)</button>
            <button onclick="captureSample(3)">✌️ (Scissors)</button>
            <button onclick="trainModel()">Train</button>
        </div>
        <h1 id="status">Loading...</h1>
        <script>
        let trainingData = [];

        const labels = [
            "None",
            "✊ (Rock)",
            "🖐 (Paper)",
            "✌️ (Scissors)",
        ];

        function setText( text ) {
            document.getElementById( "status" ).innerText = text;
        }

        async function predictImage() {
            if( !hasTrained ) { return; } // Skip prediction until trained
            const img = await getWebcamImage();
            let result = tf.tidy( () => {
                const input = img.reshape( [ 1, 224, 224, 3 ] );
                return model.predict( input );
            });
            img.dispose();
            let prediction = await result.data();
            result.dispose();
            // Get the index of the highest value in the prediction
            let id = prediction.indexOf( Math.max( ...prediction ) );
            setText( labels[ id ] );
        }

        function createTransferModel( model ) {
            // Create the truncated base model (remove the "top" layers, classification + bottleneck layers)
            const bottleneck = model.getLayer( "dropout" ); // This is the final layer before the conv_pred pre-trained classification layer
            const baseModel = tf.model({
                inputs: model.inputs,
                outputs: bottleneck.output
            });
            // Freeze the convolutional base
            for( const layer of baseModel.layers ) {
                layer.trainable = false;
            }
            // Add a classification head
            const newHead = tf.sequential();
            newHead.add( tf.layers.flatten( {
                inputShape: baseModel.outputs[ 0 ].shape.slice( 1 )
            } ) );
            newHead.add( tf.layers.dense( { units: 100, activation: 'relu' } ) );
            newHead.add( tf.layers.dense( { units: 100, activation: 'relu' } ) );
            newHead.add( tf.layers.dense( { units: 10, activation: 'relu' } ) );
            newHead.add( tf.layers.dense( {
                units: labels.length,
                kernelInitializer: 'varianceScaling',
                useBias: false,
                activation: 'softmax'
            } ) );
            // Build the new model
            const newOutput = newHead.apply( baseModel.outputs[ 0 ] );
            const newModel = tf.model( { inputs: baseModel.inputs, outputs: newOutput } );
            return newModel;
        }

        async function trainModel() {
            hasTrained = false;
            setText( "Training..." );

            // Setup training data
            const imageSamples = [];
            const targetSamples = [];
            trainingData.forEach( sample => {
                imageSamples.push( sample.image );
                let cat = [];
                for( let c = 0; c < labels.length; c++ ) {
                    cat.push( c === sample.category ? 1 : 0 );
                }
                targetSamples.push( tf.tensor1d( cat ) );
            });
            const xs = tf.stack( imageSamples );
            const ys = tf.stack( targetSamples );

            // Train the model on new image samples
            model.compile( { loss: "meanSquaredError", optimizer: "adam", metrics: [ "acc" ] } );

            await model.fit( xs, ys, {
                epochs: 30,
                shuffle: true,
                callbacks: {
                    onEpochEnd: ( epoch, logs ) => {
                        console.log( "Epoch #", epoch, logs );
                    }
                }
            });
            hasTrained = true;
        }

        // Mobilenet v1 0.25 224x224 model
        const mobilenet = "https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_0.25_224/model.json";

        let model = null;
        let hasTrained = false;

        async function setupWebcam() {
            return new Promise( ( resolve, reject ) => {
                const webcamElement = document.getElementById( "webcam" );
                const navigatorAny = navigator;
                navigator.getUserMedia = navigator.getUserMedia ||
                navigatorAny.webkitGetUserMedia || navigatorAny.mozGetUserMedia ||
                navigatorAny.msGetUserMedia;
                if( navigator.getUserMedia ) {
                    navigator.getUserMedia( { video: true },
                        stream => {
                            webcamElement.srcObject = stream;
                            webcamElement.addEventListener( "loadeddata", resolve, false );
                        },
                    error => reject());
                }
                else {
                    reject();
                }
            });
        }

        async function getWebcamImage() {
            const img = ( await webcam.capture() ).toFloat();
            const normalized = img.div( 127 ).sub( 1 );
            return normalized;
        }

        async function captureSample( category ) {
            trainingData.push( {
                image: await getWebcamImage(),
                category: category
            });
            setText( "Captured: " + labels[ category ] );
        }

        let webcam = null;

        (async () => {
            // Load the model
            model = await tf.loadLayersModel( mobilenet );
            model = createTransferModel( model );
            await setupWebcam();
            webcam = await tf.data.webcam( document.getElementById( "webcam" ) );
            // Setup prediction every 200 ms
            setInterval( predictImage, 200 );
        })();
        </script>
    </body>
</html>

Detecting Hand Gestures

The starting point is built to detect four different categories: None, Rock, Paper, Scissors. Try it with your webcam: click each category button to capture some photos (5-6 per gesture is a good starting sample) while holding that hand gesture, then click the Train button to run transfer learning on the neural network. After this, you can improve the model by taking more photos and clicking Train again.

Additional Hand Gestures and Sign Language

As you can probably imagine, the more categories you add, the harder they become for the AI to learn and the more time training takes. The results are fun, though, and the AI performs fairly well even with just a couple of photos per category. Let's try adding some American Sign Language (ASL) gestures.

To add more, include additional buttons in the buttons list, pass the corresponding index into captureSample(), and extend the labels array accordingly.

 

You can add whichever signs you would like. I tried adding four that were part of the emoji set:

  • 👌 (Letter D)
  • 👍 (Thumb Up)
  • 🖖 (Vulcan)
  • 🤟 (ILY - I Love You)

Technical Footnotes

  • If AI does not seem to recognize your hand gestures well, try taking more photos and then training the model multiple times.
  • While training the model with the various hand gestures, keep in mind that it sees the full image; it doesn’t necessarily know that the hand by itself distinguishes the categories. It may be difficult to accurately recognize different hand gestures without numerous samples from different hands.
  • Sometimes, the model learns to differentiate between left and right hands, and sometimes it does not, which could affect predictions after multiple rounds of training.

Finish Line

For your reference, here is the full code for this project:

JavaScript
<html>
    <head>
        <meta charset="UTF-8">
        <title>Interpreting Hand Gestures and Sign Language in the Webcam with AI using TensorFlow.js</title>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-data@2.0.0/dist/tf-data.min.js"></script>
        <style>
            img, video {
                object-fit: cover;
            }
        </style>
    </head>
    <body>
        <video autoplay playsinline muted id="webcam" width="224" height="224"></video>
        <div id="buttons">
            <button onclick="captureSample(0)">None</button>
            <button onclick="captureSample(1)">✊ (Rock)</button>
            <button onclick="captureSample(2)">🖐 (Paper)</button>
            <button onclick="captureSample(3)">✌️ (Scissors)</button>
            <button onclick="captureSample(4)">👌 (Letter D)</button>
            <button onclick="captureSample(5)">👍 (Thumb Up)</button>
            <button onclick="captureSample(6)">🖖 (Vulcan)</button>
            <button onclick="captureSample(7)">🤟 (ILY - I Love You)</button>
            <button onclick="trainModel()">Train</button>
        </div>
        <h1 id="status">Loading...</h1>
        <script>
        let trainingData = [];

        const labels = [
            "None",
            "✊ (Rock)",
            "🖐 (Paper)",
            "✌️ (Scissors)",
            "👌 (Letter D)",
            "👍 (Thumb Up)",
            "🖖 (Vulcan)",
            "🤟 (ILY - I Love You)"
        ];

        function setText( text ) {
            document.getElementById( "status" ).innerText = text;
        }

        async function predictImage() {
            if( !hasTrained ) { return; } // Skip prediction until trained
            const img = await getWebcamImage();
            let result = tf.tidy( () => {
                const input = img.reshape( [ 1, 224, 224, 3 ] );
                return model.predict( input );
            });
            img.dispose();
            let prediction = await result.data();
            result.dispose();
            // Get the index of the highest value in the prediction
            let id = prediction.indexOf( Math.max( ...prediction ) );
            setText( labels[ id ] );
        }

        function createTransferModel( model ) {
            // Create the truncated base model (remove the "top" layers, classification + bottleneck layers)
            const bottleneck = model.getLayer( "dropout" ); // This is the final layer before the conv_pred pre-trained classification layer
            const baseModel = tf.model({
                inputs: model.inputs,
                outputs: bottleneck.output
            });
            // Freeze the convolutional base
            for( const layer of baseModel.layers ) {
                layer.trainable = false;
            }
            // Add a classification head
            const newHead = tf.sequential();
            newHead.add( tf.layers.flatten( {
                inputShape: baseModel.outputs[ 0 ].shape.slice( 1 )
            } ) );
            newHead.add( tf.layers.dense( { units: 100, activation: 'relu' } ) );
            newHead.add( tf.layers.dense( { units: 100, activation: 'relu' } ) );
            newHead.add( tf.layers.dense( { units: 10, activation: 'relu' } ) );
            newHead.add( tf.layers.dense( {
                units: labels.length,
                kernelInitializer: 'varianceScaling',
                useBias: false,
                activation: 'softmax'
            } ) );
            // Build the new model
            const newOutput = newHead.apply( baseModel.outputs[ 0 ] );
            const newModel = tf.model( { inputs: baseModel.inputs, outputs: newOutput } );
            return newModel;
        }

        async function trainModel() {
            hasTrained = false;
            setText( "Training..." );

            // Setup training data
            const imageSamples = [];
            const targetSamples = [];
            trainingData.forEach( sample => {
                imageSamples.push( sample.image );
                let cat = [];
                for( let c = 0; c < labels.length; c++ ) {
                    cat.push( c === sample.category ? 1 : 0 );
                }
                targetSamples.push( tf.tensor1d( cat ) );
            });
            const xs = tf.stack( imageSamples );
            const ys = tf.stack( targetSamples );

            // Train the model on new image samples
            model.compile( { loss: "meanSquaredError", optimizer: "adam", metrics: [ "acc" ] } );

            await model.fit( xs, ys, {
                epochs: 30,
                shuffle: true,
                callbacks: {
                    onEpochEnd: ( epoch, logs ) => {
                        console.log( "Epoch #", epoch, logs );
                    }
                }
            });
            hasTrained = true;
        }

        // Mobilenet v1 0.25 224x224 model
        const mobilenet = "https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_0.25_224/model.json";

        let model = null;
        let hasTrained = false;

        async function setupWebcam() {
            return new Promise( ( resolve, reject ) => {
                const webcamElement = document.getElementById( "webcam" );
                const navigatorAny = navigator;
                navigator.getUserMedia = navigator.getUserMedia ||
                navigatorAny.webkitGetUserMedia || navigatorAny.mozGetUserMedia ||
                navigatorAny.msGetUserMedia;
                if( navigator.getUserMedia ) {
                    navigator.getUserMedia( { video: true },
                        stream => {
                            webcamElement.srcObject = stream;
                            webcamElement.addEventListener( "loadeddata", resolve, false );
                        },
                    error => reject());
                }
                else {
                    reject();
                }
            });
        }

        async function getWebcamImage() {
            const img = ( await webcam.capture() ).toFloat();
            const normalized = img.div( 127 ).sub( 1 );
            return normalized;
        }

        async function captureSample( category ) {
            trainingData.push( {
                image: await getWebcamImage(),
                category: category
            });
            setText( "Captured: " + labels[ category ] );
        }

        let webcam = null;

        (async () => {
            // Load the model
            model = await tf.loadLayersModel( mobilenet );
            model = createTransferModel( model );
            await setupWebcam();
            webcam = await tf.data.webcam( document.getElementById( "webcam" ) );
            // Setup prediction every 200 ms
            setInterval( predictImage, 200 );
        })();
        </script>
    </body>
</html>

What’s Next?

This project showed you how to start training your own computer vision AI to recognize potentially unlimited gestures, objects, species of animals, or even types of foods. The rest is up to you; the future of deep learning and AI might start right within your browser.

I hope you enjoyed following along with these examples. And as you experiment with more ideas, don’t forget to have fun!

Wednesday 10 August 2022

Speech Recognition And Synthesis Managed APIs In Windows Vista

 Hands-on tutorial demonstrating how to add speech recognition and synthesis functionality to a C# text pad application.

Introduction

One of the coolest features to be introduced with Windows Vista is the new built-in speech recognition facility. To be fair, it has been there in previous versions of Windows, but not in the useful form in which it is now available. Best of all, Microsoft provides a managed API with which developers can start digging into this rich technology. For a fuller explanation of the underlying technology, I highly recommend the Microsoft whitepaper. This tutorial will walk you through building a common text pad application, which we will then trick out with a speech synthesizer and a speech recognizer using the .NET managed API wrapper for SAPI 5.3. By the end of this tutorial, you will have a working application that reads your text back to you, obeys your voice commands, and takes dictation. But first, a word of caution: this code will only work with Visual Studio 2005 on Windows Vista. It does not work on XP, even with .NET 3.0 installed.

Background

Because Windows Vista has only recently been released, there are, as of this writing, several extant problems relating to developing on the platform. The biggest hurdle is that there are known compatibility problems between Visual Studio and Vista. Visual Studio .NET 2003 is not supported on Vista, and there are currently no plans to resolve any compatibility issues there. Visual Studio 2005 is supported, but in order to get it working well, you will need to install Service Pack 1 for Visual Studio 2005. After this, you will also need to install a beta update for Vista called, somewhat confusingly, "Visual Studio 2005 Service Pack 1 Update for Windows Vista Beta". Even after doing all this, you will find that the cool new assemblies that come with Vista, such as the System.Speech assembly, still do not show up in the Add References dialog in Visual Studio. To make them show up, you will finally need to add a registry entry indicating where the Vista DLLs are to be found. Open the registry editor by running regedit.exe from the Vista search bar, then add a registry key named HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\AssemblyFolders\v3.0 Assemblies with the value C:\Program Files\Reference Assemblies\Microsoft\Framework\v3.0. (You can also install it under HKEY_CURRENT_USER, if you prefer.) Now we are ready to start programming in Windows Vista.

Before working with the speech recognition and synthesis functionality, we need to prepare the ground with a decent text pad application to which we will add our cool new toys. Since this part does not involve Vista, you do not really have to follow along with this step in order to learn the speech recognition API. If you already have a good base application, you can skip ahead to the next section, Speechpad, and use the code there to trick out your app. If you do not have a suitable application at hand, but also have no interest in walking through the construction of a text pad application, you can just unzip the source code linked above and pull out the included Textpad project. The source code contains two Visual Studio 2005 projects: Textpad, which is the base application for the SR functionality, and Speechpad, which includes the final code.

All the same, for those with the time to do so, I feel there is much to gain from building an application from the ground up. The best way to learn a new technology is to use it oneself and to get one's hands dirty, as it were, since knowledge is always more than simply knowing that something is possible; it also involves knowing how to put that knowledge to work. We know by doing, or as Giambattista Vico put it, verum et factum convertuntur.

Textpad

Textpad is an MDI application containing two forms: a container, called Main.cs, and a child form, called TextDocument.cs. TextDocument.cs, in turn, contains a RichTextBox control.

Create a new project called Textpad. Add the "Main" and "TextDocument" forms to your project. Set the IsMdiContainer property of Main to true. Add a MainMenu control and an OpenFileDialog control (name it "openFileDialog1") to Main. Set the Filter property of the OpenFileDialog to "Text Files | *.txt", since we will only be working with text files in this project. Add a RichTextBox control to "TextDocument", name it "richTextBox1"; set its Dock property to "Fill" and its Modifiers property to "Internal".

Add a MenuItem control to MainMenu called "File" by clicking on the MainMenu control in Designer mode and typing "File" where the control prompts you to "type here". Set the File item's MergeType property to "MergeItems". Add a second MenuItem called "Window". Under the "File" menu item, add three more Items: "New", "Open", and "Exit". Set the MergeOrder property of the "Exit" control to 2. When we start building the "TextDocument" Form, these merge properties will allow us to insert menu items from child forms between "Open" and "Exit".

Set the MDIList property of the Window menu item to true. This automatically allows it to keep track of your various child documents during runtime.

Next, we need some operations that will be triggered by our menu commands. The NewMDIChild() function creates a new instance of the TextDocument form as a child of the Main container. OpenFile() uses the OpenFileDialog control to retrieve the path to a text file selected by the user, then uses a StreamReader to extract the text of the file (make sure you add a using directive for System.IO at the top of your form). It then calls an overloaded version of NewMDIChild() that takes the file name, displays it as the current document name, and injects the text from the source file into the RichTextBox control of the new TextDocument. The Exit() method closes our Main form. Add handlers for the File menu items (by double-clicking on them) and have each handler call the appropriate operation: NewMDIChild(), OpenFile(), or Exit(). That takes care of your Main form.

C#
#region Main File Operations

private void NewMDIChild()
{
    NewMDIChild("Untitled");
}

private void NewMDIChild(string filename)
{
    TextDocument newMDIChild = new TextDocument();
    newMDIChild.MdiParent = this;
    newMDIChild.Text = filename;
    newMDIChild.WindowState = FormWindowState.Maximized;
    newMDIChild.Show();
}

private void OpenFile()
{
    try
    {
        openFileDialog1.FileName = "";
        DialogResult dr = openFileDialog1.ShowDialog();
        if (dr == DialogResult.Cancel)
        {
            return;
        }
        string fileName = openFileDialog1.FileName;
        using (StreamReader sr = new StreamReader(fileName))
        {
            string text = sr.ReadToEnd();
            NewMDIChild(fileName, text);
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void NewMDIChild(string filename, string text)
{
    NewMDIChild(filename);
    LoadTextToActiveDocument(text);
}

private void LoadTextToActiveDocument(string text)
{
    TextDocument doc = (TextDocument)ActiveMdiChild;
    doc.richTextBox1.Text = text;
}

private void Exit()
{
    Dispose();
}

#endregion

To the TextDocument form, add a SaveFileDialog control, a MainMenu control, and a ContextMenuStrip control (set the ContextMenuStrip property of richTextBox1 to this new ContextMenuStrip). Set the SaveFileDialog's DefaultExt property to "txt" and its Filter property to "Text File | *.txt". Add "Cut", "Copy", "Paste", and "Delete" items to your ContextMenuStrip. Add a "File" menu item to your MainMenu, and then "Save", "Save As", and "Close" menu items under it. Set the MergeType for "File" to "MergeItems". Set the MergeType properties of "Save", "Save As", and "Close" to "Add", and their MergeOrder properties to 1. This creates a nice effect in which the File menu of the child MDI form merges with the parent File menu.

The following methods will be called by the handlers for each of these menu items: Save(), SaveAs(), CloseDocument(), Cut(), Copy(), Paste(), Delete(), and InsertText(). Please note that Save(), Cut(), Copy(), Paste(), Delete(), and InsertText() are scoped as internal, so they can be called by the parent form. This will be particularly important as we move on to the Speechpad project.

C#
#region Document File Operations

private void SaveAs(string fileName)
{
    try
    {
        saveFileDialog1.FileName = fileName;
        DialogResult dr = saveFileDialog1.ShowDialog();
        if (dr == DialogResult.Cancel)
        {
            return;
        }
        string saveFileName = saveFileDialog1.FileName;
        Save(saveFileName);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void SaveAs()
{
    string fileName = this.Text;
    SaveAs(fileName);
}

internal void Save()
{
    string fileName = this.Text;
    Save(fileName);
}

private void Save(string fileName)
{
    string text = this.richTextBox1.Text;
    Save(fileName, text);
}

private void Save(string fileName, string text)
{
    try
    {
        using (StreamWriter sw = new StreamWriter(fileName, false))
        {
            sw.Write(text);
            sw.Flush();
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void CloseDocument()
{
    Dispose();
}

internal void Paste()
{
    try
    {
        IDataObject data = Clipboard.GetDataObject();
        if (data.GetDataPresent(DataFormats.Text))
        {
            InsertText(data.GetData(DataFormats.Text).ToString());
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

internal void InsertText(string text)
{
    RichTextBox theBox = richTextBox1;
    theBox.SelectedText = text;
}

internal void Copy()
{
    try
    {
        RichTextBox theBox = richTextBox1;
        Clipboard.Clear();
        Clipboard.SetDataObject(theBox.SelectedText);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

internal void Cut()
{
    Copy();
    Delete();
}

internal void Delete()
{
    richTextBox1.SelectedText = string.Empty;
}

#endregion

Once you hook up your menu item event handlers to the methods listed above, you should have a rather nice text pad application. With our base prepared, we are now in a position to start building some SR features.

Speechpad

Add a reference to the System.Speech assembly to your project. You should be able to find it in C:\Program Files\Reference Assemblies\Microsoft\Framework\v3.0\. Add using declarations for System.Speech, System.Speech.Recognition, and System.Speech.Synthesis to your Main form. The top of your Main.cs file should now look something like this:

C#
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.IO;
using System.Speech;
using System.Speech.Synthesis;
using System.Speech.Recognition;

In design view, add two new menu items to the main menu in your Main form labeled "Select Voice" and "Speech". For easy reference, name the first item selectVoiceMenuItem. We will use the "Select Voice" menu to programmatically list the synthetic voices that are available for reading Speechpad documents, using the three methods in the code sample below. LoadSelectVoiceMenu() loops through all voices installed on the operating system and creates a new menu item for each. voiceMenuItem_Click() is simply a handler that passes the click event on to the SelectVoice() method. SelectVoice() handles the toggling of the voices we have added to the "Select Voice" menu: whenever a voice is selected, all others are deselected, and if all voices are deselected, we default to the first one.

Now that we have gotten this far, I should mention that all this trouble is a little silly if there is only one synthetic voice available, as there is when you first install Vista. Her name is Microsoft Anna, by the way. If you have Vista Ultimate or Vista Enterprise, you can use the Vista Updater to download an additional voice, named Microsoft Lila, which is contained in the Simplified Chinese MUI. She has a bit of an accent, but I am coming to find it rather charming. If you don't have one of the high-end flavors of Vista, however, you might consider leaving the voice selection code out of your project.

C#
private void LoadSelectVoiceMenu()
{
    foreach (InstalledVoice voice in synthesizer.GetInstalledVoices())
    {
        MenuItem voiceMenuItem = new MenuItem(voice.VoiceInfo.Name);
        voiceMenuItem.RadioCheck = true;
        voiceMenuItem.Click += new EventHandler(voiceMenuItem_Click);
        this.selectVoiceMenuItem.MenuItems.Add(voiceMenuItem);
    }
    if (this.selectVoiceMenuItem.MenuItems.Count > 0)
    {
        this.selectVoiceMenuItem.MenuItems[0].Checked = true;
        selectedVoice = this.selectVoiceMenuItem.MenuItems[0].Text;
    }
}

private void voiceMenuItem_Click(object sender, EventArgs e)
{
    SelectVoice(sender);
}

private void SelectVoice(object sender)
{
    MenuItem mi = sender as MenuItem;
    if (mi != null)
    {
        //toggle checked value
        mi.Checked = !mi.Checked;

        if (mi.Checked)
        {
            //set selectedVoice variable
            selectedVoice = mi.Text;
            //clear all other checked items
            foreach (MenuItem voiceMi in this.selectVoiceMenuItem.MenuItems)
            {
                if (!voiceMi.Equals(mi))
                {
                    voiceMi.Checked = false;
                }
            }
        }
        else
        {
            //if deselecting, make first value checked,
            //so there is always a default value
            this.selectVoiceMenuItem.MenuItems[0].Checked = true;
        }
    }
}

We have not yet declared the class-level selectedVoice variable (your Intellisense may have complained about it), so the next step is to do just that. While we are at it, we will also declare a private instance of the System.Speech.Synthesis.SpeechSynthesizer class and initialize it in the constructor, along with a call to the LoadSelectVoiceMenu() method from above:

C#
#region Local Members

private SpeechSynthesizer synthesizer = null;
private string selectedVoice = string.Empty;

#endregion

public Main()
{
    InitializeComponent();
    synthesizer = new SpeechSynthesizer();
    LoadSelectVoiceMenu();
}

To allow the user to utilize the speech synthesizer, we will add two new menu items under the "Speech" menu labeled "Read Selected Text" and "Read Document". In truth, there isn't much to using the Vista speech synthesizer: we simply pass a text string to our local SpeechSynthesizer object and let the operating system do the rest. Hook up event handlers for the click events of these two menu items to the following methods and you will be up and running with a talking application:

C#
#region Speech Synthesizer Commands

private void ReadSelectedText()
{
    TextDocument doc = ActiveMdiChild as TextDocument;
    if (doc != null)
    {
        RichTextBox textBox = doc.richTextBox1;
        if (textBox != null)
        {
            string speakText = textBox.SelectedText;
            ReadAloud(speakText);
        }
    }
}

private void ReadDocument()
{
    TextDocument doc = ActiveMdiChild as TextDocument;
    if (doc != null)
    {
        RichTextBox textBox = doc.richTextBox1;
        if (textBox != null)
        {
            string speakText = textBox.Text;
            ReadAloud(speakText);
        }
    }
}

private void ReadAloud(string speakText)
{
    try
    {
        SetVoice();
        synthesizer.Speak(speakText);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void SetVoice()
{
    try
    {
        synthesizer.SelectVoice(selectedVoice);
    }
    catch (Exception)
    {
        MessageBox.Show(selectedVoice + " is not available.");
    }
}

#endregion

Playing with the speech synthesizer is a lot of fun for about five minutes (ten if you have both Microsoft Anna and Microsoft Lila to work with) -- but after typing "Hello World" into your Speechpad document for the umpteenth time, you may want to do something a bit more challenging. If you do, then it is time to plug in your expensive microphone, since speech recognition really works best with a good expensive microphone. If you don't have one, however, then go ahead and plug in a cheap microphone. My cheap microphone seems to work fine. If you don't have a cheap microphone, either, I have heard that you can take a speaker and plug it into the mic jack of your computer, and if that doesn't cause an explosion, you can try talking into it.

While speech synthesis may be useful for certain specialized applications, voice commands, by contrast, are a feature that can enrich any current WinForms application. With the managed SR API, they are also easy to implement once you understand a few concepts, such as the Grammar class and the SpeechRecognitionEngine.

We will begin by declaring a local instance of the speech engine and initializing it.

C#
#region Local Members

private SpeechSynthesizer synthesizer = null;
private string selectedVoice = string.Empty;
private SpeechRecognitionEngine recognizer = null;

#endregion

public Main()
{
    InitializeComponent();
    synthesizer = new SpeechSynthesizer();
    LoadSelectVoiceMenu();
    recognizer = new SpeechRecognitionEngine();
    InitializeSpeechRecognitionEngine();
}

private void InitializeSpeechRecognitionEngine()
{
    recognizer.SetInputToDefaultAudioDevice();
    Grammar customGrammar = CreateCustomGrammar();
    recognizer.UnloadAllGrammars();
    recognizer.LoadGrammar(customGrammar);
    recognizer.SpeechRecognized +=
        new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
    recognizer.SpeechHypothesized +=
        new EventHandler<SpeechHypothesizedEventArgs>(recognizer_SpeechHypothesized);
}

private Grammar CreateCustomGrammar()
{
    GrammarBuilder grammarBuilder = new GrammarBuilder();
    grammarBuilder.Append(new Choices("cut", "copy", "paste", "delete"));
    return new Grammar(grammarBuilder);
}

The speech recognition engine is the main workhorse of the speech recognition functionality. At one end, we configure the input device that the engine will listen on. In this case, we use the default device (whatever you have plugged in), though we can also select other inputs, such as specific wave files. At the other end, we capture two events thrown by our speech recognition engine. As the engine attempts to interpret the incoming sound stream, it will throw various "hypotheses" about what it thinks is the correct rendering of the speech input. When it finally determines the correct value, and matches it to a value in the associated grammar objects, it throws a speech recognized event, rather than a speech hypothesized event. If the determined word or phrase does not have a match in any associated grammar, a speech recognition rejected event (which we do not use in the present project) will be thrown instead.
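Although we do not use the speech recognition rejected event in this project, hooking it up follows the same pattern as the other two events. The following is only a sketch for completeness: the subscription line would go in InitializeSpeechRecognitionEngine(), and the handler in the Main form alongside the other two.

```csharp
// A sketch, not part of the Speechpad project: handling the rejected event,
// which the engine raises when a finished phrase matches no loaded grammar.
// The += line belongs in InitializeSpeechRecognitionEngine().
recognizer.SpeechRecognitionRejected +=
    new EventHandler<SpeechRecognitionRejectedEventArgs>(recognizer_SpeechRecognitionRejected);

private void recognizer_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
{
    // e.Result.Text holds the engine's best (rejected) guess
    toolStripStatusLabel1.Text = "Not recognized: " + e.Result.Text;
}
```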

In between, we set up rules to determine which words and phrases will throw a speech recognized event by configuring a Grammar object and associating it with our instance of the speech recognition engine. In the sample code above, we configure a very simple rule which states that a speech recognized event will be thrown if any one of the words "cut", "copy", "paste", or "delete" is uttered. Note that we use a GrammarBuilder class to construct our custom grammar, and that the syntax of the GrammarBuilder class closely resembles the syntax of the StringBuilder class.
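To see how the GrammarBuilder syntax extends to multi-word phrases, here is a small sketch. It is not part of the Speechpad grammar, and the names are made up for illustration; it would match commands like "read selection" or "read document".

```csharp
// A sketch, not used in Speechpad: a two-word command grammar.
// Appending a string fixes a word; appending a Choices allows alternatives.
GrammarBuilder phraseBuilder = new GrammarBuilder();
phraseBuilder.Append("read");                                // fixed first word
phraseBuilder.Append(new Choices("selection", "document"));  // variable second word
Grammar phraseGrammar = new Grammar(phraseBuilder);
```

Loading phraseGrammar with recognizer.LoadGrammar() would then cause "read selection" and "read document" to raise the speech recognized event, while "read" alone would not.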

This is the basic code for enabling voice commands for a WinForms application. We will now enhance the Speechpad application by adding a menu item to turn speech recognition on and off, a status bar so we can watch as the speech recognition engine interprets our words, and a function that will determine what action to take if one of our key words is captured by the engine.

Add a new menu item labeled "Speech Recognition" under the "Speech" menu item, below "Read Selected Text" and "Read Document". For convenience, name it speechRecognitionMenuItem. Add a handler to the new menu item, and use the following code to turn speech recognition on and off, as well as toggle the speech recognition menu item. Besides the RecognizeAsync() method that we use here, it is also possible to start the engine synchronously or, by passing it a RecognizeMode.Single parameter, cause the engine to stop after the first phrase it recognizes. The method we use to stop the engine, RecognizeAsyncStop(), is basically a polite way to stop the engine, since it will wait for the engine to finish any phrases it is currently processing before quitting. An impolite method, RecognizeAsyncCancel(), is also available -- to be used in emergency situations, perhaps.

C#
private void speechRecognitionMenuItem_Click(object sender, EventArgs e)
{
    if (this.speechRecognitionMenuItem.Checked)
    {
        TurnSpeechRecognitionOff();
    }
    else
    {
        TurnSpeechRecognitionOn();
    }
}

private void TurnSpeechRecognitionOn()
{
    recognizer.RecognizeAsync(RecognizeMode.Multiple);
    this.speechRecognitionMenuItem.Checked = true;
}

private void TurnSpeechRecognitionOff()
{
    if (recognizer != null)
    {
        recognizer.RecognizeAsyncStop();
        this.speechRecognitionMenuItem.Checked = false;
    }
}
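For reference, the alternative ways of starting the engine mentioned above look something like this. Neither is used in Speechpad; this is just a sketch of the two options.

```csharp
// Alternatives to RecognizeAsync(RecognizeMode.Multiple), not used in Speechpad:

// 1. Synchronous recognition: blocks until a single phrase is recognized
//    (the result may be null if nothing was recognized).
RecognitionResult result = recognizer.Recognize();

// 2. Asynchronous, but the engine stops itself after the first phrase.
recognizer.RecognizeAsync(RecognizeMode.Single);
```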

We are actually going to use the RecognizeAsyncCancel() method now, since there is an emergency situation. The speech synthesizer, it turns out, cannot operate if the speech recognizer is still running. To get around this, we will need to disable the speech recognizer at the last possible moment, and then reactivate it once the synthesizer has completed its tasks. We will modify the ReadAloud() method to handle this.

C#
private void ReadAloud(string speakText)
{
    try
    {
        SetVoice();
        recognizer.RecognizeAsyncCancel();
        synthesizer.Speak(speakText);
        recognizer.RecognizeAsync(RecognizeMode.Multiple);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

The user now has the ability to turn speech recognition on and off. We can make the application more interesting by capturing the speech hypothesize event and displaying the results to a status bar on the Main form. Add a StatusStrip control to the Main form, and a ToolStripStatusLabel to the StatusStrip with its Spring property set to true. For convenience, call this label toolStripStatusLabel1. Use the following code to handle the speech hypothesized event and display the results:

C#
private void recognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
{
    GuessText(e.Result.Text);
}

private void GuessText(string guess)
{
    toolStripStatusLabel1.Text = guess;
    this.toolStripStatusLabel1.ForeColor = Color.DarkSalmon;
}

Now that we can turn speech recognition on and off, as well as capture misinterpretations of the input stream, it is time to capture the speech recognized event and do something with it. The SpeechToAction() method will evaluate the recognized text and then call the appropriate method in the child form (these methods are accessible because we scoped them internal in the Textpad code above). In addition, we display the recognized text in the status bar, just as we did with hypothesized text, but in a different color in order to distinguish the two events.

C#
private void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    string text = e.Result.Text;
    SpeechToAction(text);
}

private void SpeechToAction(string text)
{
    TextDocument document = ActiveMdiChild as TextDocument;
    if (document != null)
    {
        DetermineText(text);

        switch (text)
        {
            case "cut":
                document.Cut();
                break;
            case "copy":
                document.Copy();
                break;
            case "paste":
                document.Paste();
                break;
            case "delete":
                document.Delete();
                break;
        }
    }
}

private void DetermineText(string text)
{
    this.toolStripStatusLabel1.Text = text;
    this.toolStripStatusLabel1.ForeColor = Color.SteelBlue;
}

Now let's take Speechpad for a spin. Fire up the application and, if it compiles, create a new document. Type "Hello world." So far, so good. Turn on speech recognition by selecting the Speech Recognition item under the Speech menu. Highlight "Hello" and say the following phrase into your expensive microphone, inexpensive microphone, or speaker: delete. Now type "Save the cheerleader, save the". Not bad at all.

Voice command technology, as exemplified above, is probably the most useful and most easy to implement aspect of the Speech Recognition functionality provided by Vista. In a few days of work, any current application can be enabled to use it, and the potential for streamlining workflow and making it more efficient is truly breathtaking. The cool factor, of course, is also very high.

Having grown up watching Star Trek reruns, however, I can't help but feel that the dictation functionality is much more interesting. Computers are meant to be talked to and told what to do, not cajoled into doing tricks for us based on finger motions over a typewriter. My long-term goal is to be able to code by talking into my IDE in order to build UML diagrams and then, at a word, turn that into an application. What a brave new world that will be. Toward that end, the SR managed API provides the DictationGrammar class.

Whereas the Grammar class works as a gatekeeper, restricting the phrases that get through to the speech recognized handler down to a select set of rules, the DictationGrammar class, by default, kicks out the jams and lets all phrases through to the recognized handler.

In order to make Speechpad a dictation application, we will add the default DictationGrammar object to the list of grammars used by our speech recognition engine. We will also add a toggle menu item to turn dictation on and off. Finally, we will alter the SpeechToAction() method in order to insert any phrases that are not voice commands into the current Speechpad document as text.

Begin by creating a local instance of DictateGrammar for our Main form, and then instantiate it in the Main constructor. Your code should look like this:

C#
#region Local Members

private SpeechSynthesizer synthesizer = null;
private string selectedVoice = string.Empty;
private SpeechRecognitionEngine recognizer = null;
private DictationGrammar dictationGrammar = null;

#endregion

public Main()
{
    InitializeComponent();
    synthesizer = new SpeechSynthesizer();
    LoadSelectVoiceMenu();
    recognizer = new SpeechRecognitionEngine();
    InitializeSpeechRecognitionEngine();
    dictationGrammar = new DictationGrammar();
}

Create a new menu item under the Speech menu and label it "Take Dictation". Name it takeDictationMenuItem for convenience. Add a handler for the click event of the new menu item, and stub out TurnDictationOn() and TurnDictationOff() methods. TurnDictationOn() works by loading the local dictationGrammar object into the speech recognition engine. It also needs to turn speech recognition on if it is currently off, since dictation will not work if the speech recognition engine is disabled. TurnDictationOff() simply removes the local dictationGrammar object from the speech recognition engine's list of grammars.

C#
private void takeDictationMenuItem_Click(object sender, EventArgs e)
{
    if (this.takeDictationMenuItem.Checked)
    {
        TurnDictationOff();
    }
    else
    {
        TurnDictationOn();
    }
}

private void TurnDictationOn()
{
    if (!speechRecognitionMenuItem.Checked)
    {
        TurnSpeechRecognitionOn();
    }
    recognizer.LoadGrammar(dictationGrammar);
    takeDictationMenuItem.Checked = true;
}

private void TurnDictationOff()
{
    if (dictationGrammar != null)
    {
        recognizer.UnloadGrammar(dictationGrammar);
    }
    takeDictationMenuItem.Checked = false;
}

For an extra touch of elegance, alter the TurnSpeechRecognitionOff() method by adding a line of code to turn off dictation when speech recognition is disabled:

C#
TurnDictationOff();
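With that call added (one reasonable placement is inside the null check, after the engine is stopped), the modified method would look like this:

```csharp
// TurnSpeechRecognitionOff() with the added call, so that disabling
// speech recognition also turns dictation off.
private void TurnSpeechRecognitionOff()
{
    if (recognizer != null)
    {
        recognizer.RecognizeAsyncStop();
        TurnDictationOff();
        this.speechRecognitionMenuItem.Checked = false;
    }
}
```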

Finally, we need to update the SpeechToAction() method so it will insert any text that is not a voice command into the current Speechpad document. Use the default statement of the switch control block to call the InsertText() method of the current document.

C#
private void SpeechToAction(string text)
{
    TextDocument document = ActiveMdiChild as TextDocument;
    if (document != null)
    {
        DetermineText(text);
        switch (text)
        {
            case "cut":
                document.Cut();
                break;
            case "copy":
                document.Copy();
                break;
            case "paste":
                document.Paste();
                break;
            case "delete":
                document.Delete();
                break;
            default:
                document.InsertText(text);
                break;
        }
    }
}

With that, we complete the speech recognition functionality for Speechpad. Now try it out. Open a new Speechpad document and type "Hello World." Turn on speech recognition. Select "Hello" and say delete. Turn on dictation. Say brave new.

This tutorial has demonstrated the essential code required to use speech synthesis, voice commands, and dictation in your .NET 2.0 Vista applications. It can serve as the basis for building speech recognition tools that take advantage of default as well as custom grammar rules to build advanced application interfaces. Besides the strange compatibility issues between Vista and Visual Studio, at the moment the greatest hurdle to using the Vista managed speech recognition API is the remarkable dearth of documentation and samples. This tutorial is intended to help alleviate that problem by providing a hands-on introduction to these new tools.

History

  • Feb 28, 2007: updated to fix a conflict between the speech recognizer and text-to-speech synthesizer. 