Neural networks learn better from human speech than binary data

By Jason Matthews

Most people know that computers communicate and process information exclusively in binary. Combinations of ones and zeros are a highly efficient way of storing and transmitting information accurately. However, when binary data is used to teach neural networks to understand the "real world" accurately, ever larger and more complex datasets are required. Researchers at the Creative Machines Lab at Columbia University think they have discovered a better way for neural networks to learn about the physical world: human speech.

Image Credit: Amiak via iStock/Getty Images - Edited by Universal-Sci


Mechanical Engineering Professor Hod Lipson, PhD student Boyuan Chen, and their Columbia Engineering team have carried out a study into this new teaching technique. The study looks at the idea that artificial intelligence systems may learn better and faster, with less input from programmers, if audio files of human speech replace binary databases.

A novel approach

Lipson and Chen hypothesized that neural networks are held back from reaching their full potential by traditional binary training techniques. They wanted to see how a neural network would perform when taught to recognize objects using audio files of spoken words instead of large binary-labeled datasets. Tasking neural networks with identifying images is a standard exercise for testing new machine learning techniques.

For this controlled experiment, they created two neural networks: a control network trained with a traditional binary database and an experimental network trained with audio files of human speech. Each network had to correctly identify ten different objects (cat, dog, car, and train, for example) by learning from 50,000 training images.

The traditional neural network was trained in the usual way: each image in the vast database was paired with a row of ten columns, one for each object type, filled with ones and zeros. A one in a column indicated a match to that object; a zero indicated no match.
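This labelling scheme is standard one-hot encoding. A minimal sketch in Python of how such a label row might be built (the class list beyond the four examples named above is purely illustrative):

    import numpy as np

    # Ten object categories; only cat, dog, car, and train are named in the article,
    # the remaining six are placeholders for illustration.
    CLASSES = ["cat", "dog", "car", "train", "plane",
               "ship", "horse", "bird", "frog", "deer"]

    def one_hot(label: str) -> np.ndarray:
        # One row of ten columns: 1 in the column matching the object, 0 elsewhere.
        row = np.zeros(len(CLASSES), dtype=np.float32)
        row[CLASSES.index(label)] = 1.0
        return row

    print(one_hot("dog"))  # -> [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]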

The experimental network's dataset had only two columns per row: one with the image and the other with an audio file of a person saying the word for that object.
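The study's actual data pipeline is not reproduced here, but a rough sketch of what one such image-audio training pair might look like in Python follows (the image size and audio sample rate are assumptions, and the stand-in arrays are placeholders):

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class TrainingPair:
        # One row of the experimental dataset: the picture and the spoken word for it.
        image: np.ndarray         # the image, assumed here to be a 32x32 RGB array
        spoken_label: np.ndarray  # raw audio of a person saying the object's name

    # Stand-in data: a blank "image" and one second of silence at an assumed 16 kHz,
    # purely to show the two-column structure described above.
    example = TrainingPair(
        image=np.zeros((32, 32, 3), dtype=np.uint8),
        spoken_label=np.zeros(16000, dtype=np.float32),
    )
    print(example.image.shape, example.spoken_label.shape)  # (32, 32, 3) (16000,)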

After 15 hours of training, they tested both networks for accuracy. The team was surprised to find that both neural networks performed equally well, each scoring an accuracy of around 92%. The results held when they repeated the experiment. However, when they reran it with the number of training pictures reduced from 50,000 to 2,500, the outcome was startling.

Surprising results

As expected, the traditionally trained network performed significantly worse, scoring only around 35% accuracy. The experimental network, however, scored 70% under the same conditions, twice that of the traditional system.

The next experiment was an image ambiguity test. The images shown to the neural networks were more challenging to identify: different breeds of dogs or cats, objects at odd angles, and slightly distorted or obscured images. Again, the traditional network performed as expected, scoring only 20% accuracy. The voice-trained system, however, astounded the team with an impressive 50% accuracy rating.

This radical idea seems to fly in the face of most programmers' core beliefs. The thought that something as analog as human speech could be better than the precise ones and zeros of binary at helping AIs learn sounds absurd. However, Professor Lipson and his team's approach harks back to a ground-breaking hypothesis from "the father of information theory," Claude Shannon. Shannon suggested that the most effective communication signals are characterized by an optimal number of bits paired with an optimal amount of useful information, or "surprise." In other words, a binary label is flat and uniform, whereas speech incorporates the word itself along with tone, inflection, and context.
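Shannon's "surprise" has a precise form: the self-information of an outcome with probability p is -log2(p) bits, so rarer outcomes carry more information. A quick illustration in Python (the probabilities here are made up for the example):

    import math

    def surprise_bits(p: float) -> float:
        # Shannon self-information of an outcome with probability p, in bits.
        return -math.log2(p)

    print(surprise_bits(0.5))  # 1.0 bit   - a fair coin flip
    print(surprise_bits(0.1))  # ~3.32 bits - a 1-in-10 guess among ten classes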

The future of neural network learning

A single spoken word contains more information than just the word itself. Getting the same level of information and understanding from binary would require a vast dataset of ones and zeros. Another way to look at it is that binary equates to our ape ancestors' squeaks and grunts, while modern speech is the most highly evolved sound in the natural world. Speech has taken around 250,000 years to develop, so it makes sense that teaching neural networks to understand and use it instead of binary is the next step in AI development.

"We should think about using novel and better ways to train AI systems instead of collecting larger datasets," said Chen. "If we rethink how we present training data to the machine, we could do a better job as teachers." "One of the biggest mysteries of human evolution is how our ancestors acquired language and how children learn to speak so effortlessly," Lipson said. "If human toddlers learn best with repetitive spoken instruction, then perhaps AI systems can, too."
