The Power of Your Voice

Have you ever stopped to think about just how complicated spoken language is? The ambiguities, the homonyms, the intonations, even the differences between languages themselves? Isn’t it surprising, then, that this is the default and most convenient method of communication to emerge over the entirety of human evolution? Why is spoken language, with all of its imperfections and difficulties, the way we’ve chosen to communicate when written words, or heaven forbid, mathematical and logical notation, are so much more specific and, in some cases, much more concise? For example:
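
$$\forall X \, \big( \mathrm{prime}(X) \rightarrow \exists Y \, ( \mathrm{prime}(Y) \land Y > X ) \big)$$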

The above means (assuming I’ve built the LaTeX correctly) the following:

For any prime number X, there exists another prime number Y, such that Y is larger than X

Notice that the written words are much longer than the mathematical/logical notation that encodes the same thing. Not to mention that the notation leaves absolutely nothing to interpretation. In this particular case, the English words are also unambiguous, but that’s not always the case.

For example, take the written word ‘read’. Is it pronounced ‘reed’ and meant to be present tense, or is it pronounced ‘red’ and intended to be past tense? Of course, context is usually provided to help distinguish the two, but the point here is to illustrate the ambiguity. The same example works in reverse: if I were to say ‘reed’ out loud and give no additional context, it would be unclear whether I meant ‘reed’ or ‘read’, which shows the ambiguity in spoken language.

And yet, spoken language is the pinnacle of human evolution when it comes to communication - we’ve even figured out how to use it over phones and the internet! And the biggest advances are still to come.

I recently had the privilege of attending the AWS re:Invent conference in Las Vegas, where attendees were given a newly released product called the Echo Dot. This device is simply a smaller version of the Echo, a product Amazon announced and released two years earlier, and it was nothing spectacular in and of itself. But the purpose of giving all of these devices out was to generate interest among the developers at the conference in the AI platform built on the Echo products.

This platform is named Alexa, and it is a voice-initiated AI capable of executing various tasks based on the context and information provided at the time of the request. Granted, each of these tasks must be individually built, but the beauty is that the platform itself is free and open to any developer. Anybody can spend the time to teach Alexa a ‘skill’ - simply an application developed for the platform to serve a specific purpose.
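
To make that concrete, here is a minimal sketch of what a skill’s backend might look like as an AWS Lambda function in Python. The intent name and the speech text are invented for illustration, but the request/response envelope follows the JSON format the Alexa Skills Kit expects:

```python
# A minimal sketch of an Alexa skill backend as an AWS Lambda handler.
# The intent name ("HelloIntent") and the speech text are hypothetical;
# the request/response structure follows the Alexa Skills Kit JSON format.

def lambda_handler(event, context):
    request = event["request"]

    if request["type"] == "LaunchRequest":
        return build_response("Welcome! Ask me to say hello.")
    if request["type"] == "IntentRequest" and request["intent"]["name"] == "HelloIntent":
        return build_response("Hello from your first skill!")
    return build_response("Sorry, I didn't understand that.")


def build_response(speech_text, end_session=True):
    # Wrap plain text in the response envelope Alexa expects.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }
```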

I was quite intrigued, and decided to look into the platform; in doing so, I learned some very interesting things. First and foremost, one of the beauties of spoken language (and one of the biggest pain points for voice-based applications) is that humans have a fantastic and complicated ability to perform context switches that completely change the course of a conversation. In a conversation with another human being, this context switch usually goes unnoticed. When building a skill for a platform where the primary means of interaction is voice, this tendency to context switch needs to be taken into account, or the user’s desires and wishes won’t be fulfilled.

Along the same lines, the ability to maintain context is just as important. A user speaking to a financial application about a credit card bill might subsequently ask the platform to ‘Please pay it’. Without the previous context of which credit card and what amount were being discussed, that request cannot be fulfilled.
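
On the Alexa platform, one way a skill can hold onto this kind of context between turns is through session attributes, which the skill returns with each response and receives back on the next request. A rough sketch, with made-up intent and attribute names:

```python
# A sketch of carrying conversational context across turns using Alexa
# session attributes. The intent names ("CheckBalanceIntent", "PayBillIntent")
# and the stored values are hypothetical, invented for this example.

def handle_intent(event):
    request = event["request"]
    session = (event.get("session", {}) or {}).get("attributes") or {}
    intent = request["intent"]["name"]

    if intent == "CheckBalanceIntent":
        # Remember which card and what amount we just discussed.
        session = {"card": "travel card", "amount_due": "250 dollars"}
        return respond("Your travel card balance is 250 dollars.", session)

    if intent == "PayBillIntent":
        # 'Please pay it' only makes sense given the earlier context.
        if "card" in session:
            speech = "Paying {} on your {}.".format(
                session["amount_due"], session["card"])
            return respond(speech, session, end_session=True)
        return respond("Which card would you like to pay?", session)

    return respond("Sorry, I didn't catch that.", session)


def respond(speech_text, session, end_session=False):
    return {
        "version": "1.0",
        "sessionAttributes": session,  # echoed back on the next request
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }
```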

Combine this ability to context switch, and the need for previous context behind any given request, with the inherent fluidity of language and the many different ways an individual can phrase the same thing, and it becomes genuinely complicated to design the commands and interactions your application is able to respond to.
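
Alexa’s answer to that variety is the skill’s interaction model: the developer supplies sample utterances, and the platform maps whatever the user actually said onto a named intent. The toy matcher below is only a stand-in for that idea - the real resolution is statistical, not an exact string lookup, and the phrases here are invented:

```python
# A toy illustration of mapping many phrasings onto a single intent.
# Alexa does this through sample utterances in a skill's interaction
# model, and its matching is statistical rather than an exact lookup.

SAMPLE_UTTERANCES = {
    "PayBillIntent": [
        "pay it",
        "please pay it",
        "pay my bill",
        "pay off my credit card",
        "take care of that balance",
    ],
}

def resolve_intent(spoken_text):
    """Naive exact-match stand-in for the platform's intent resolution."""
    normalized = spoken_text.lower().strip()
    for intent, phrases in SAMPLE_UTTERANCES.items():
        if normalized in phrases:
            return intent
    return None

print(resolve_intent("Please pay it"))  # -> PayBillIntent
```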

Now imagine what it would take, and how many resources would be required, to develop a skill that can maintain a fluid conversation about any topic you might want to discuss for an extended period of time, including the many context and topic switches that happen as a result of tangents and random thought processes.

Amazon has issued a challenge to universities worldwide and their students to develop a skill capable of just that: a conversation on any topic of choice lasting at least 20 minutes. They’ve just selected their top 12 finalist teams and given each one $100,000 to help with development. The winning team will be announced this time next year and will receive $500,000. To incentivize the universities to support and aid these projects, the university to which the winning team belongs will receive $1,000,000.

This is a research field with far-reaching repercussions. Once an application like this is developed, advancements to it should be quick to follow. Combine such an advanced conversational AI with all of the practical applications of voice-based requests, and you effectively have your very own private J.A.R.V.I.S. personal assistant.

All of this complication, and yet humans are still able to use voice and language with so little effort. It really makes you think about the intricacies of the everyday conversations you have, and about how long it will be before you can do the same with an AI.

Have you had any exposure to Amazon Alexa? Or maybe you’ve got some good insight into language itself? Whether it’s one of these, or you simply enjoyed this post, I’d love to hear any thoughts you might have on the topic in the comments below.


Follow @pcockwell on Twitter