
Voice User Interface Interactive Computer Project



I got inspired by Iron Man to try to make a VUI-based Linux machine. I know it's probably a little advanced since I only really know Python, but I can roll with it and see what crazy code I can come up with. My base idea is straightforward: take a simple voice-to-text parsing program and use it to transform verbal commands into textual strings. The strings are then passed to an algorithm that takes the words and attempts to logically assign computer-friendly commands to them (possibly a PHP script), and then executes the commands/script with an audio response. Now, I can start coding the translation module immediately, but it would be nice to see what kind of output I can get from a program that parses voice to text, so I can actually see what I would have to work with. Anyone know of such software? I am also more than open to input on how to execute this project as far as logical code flow and the like, since my design may be flawed.
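To make the flow concrete, here is roughly the skeleton I have in mind in Python. The two engine functions are stubbed out; they just stand in for whatever voice-to-text and text-to-voice programs I end up finding.

```python
# Rough skeleton of the flow: voice -> text -> command -> audio response.
# listen_for_text() and speak() are stubs standing in for whatever
# voice-to-text / text-to-voice engine gets used.
import subprocess

def listen_for_text():
    # Stub: a real version would record from the mic and return a string.
    return input("(pretend this came from the mic) > ")

def speak(text):
    # Stub: a real version would hand this to a speech synthesizer.
    print("SPOKEN:", text)

def handle(command_text):
    # Map the recognized words onto a computer-friendly command.
    if "uptime" in command_text.lower():
        return subprocess.check_output(["uptime"]).decode().strip()
    return "Sorry, I don't know that command."

while True:
    speak(handle(listen_for_text()))
```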



You are basically looking for a three-part system, and you only really need to write one of the parts. Most languages have some sort of voice-to-text and text-to-voice engine, whether built in (like in .NET) or external. This section would be the IO section, handling input and output.
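In Python, for example, both directions are covered by off-the-shelf packages. A minimal sketch, assuming the third-party SpeechRecognition and pyttsx3 packages (neither is named in this thread; they are just one possible pairing):

```python
# Sketch of the IO section, assuming the third-party SpeechRecognition
# and pyttsx3 packages (pip install SpeechRecognition pyttsx3 pyaudio).
import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
tts = pyttsx3.init()

def hear():
    """Voice to text: record one phrase from the mic, return it as a string."""
    with sr.Microphone() as mic:
        audio = recognizer.listen(mic)
    return recognizer.recognize_google(audio)  # raises on unrecognized audio

def say(text):
    """Text to voice: speak a string out loud."""
    tts.say(text)
    tts.runAndWait()

if __name__ == "__main__":
    say("Say something.")
    say("You said: " + hear())
```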

The next layer is what you actually have to write to make this work at all: the translation section. It would be best to have a fixed selection of accepted commands for input, which the translation section maps onto commands it runs on the actual system (the actual system, i.e. your Linux box, is the third part). Since you would be running command-line applications from this, you would want to pipe their input and output through the translation section. I would normally suggest avoiding scripting languages for something like this, but Mono isn't very good with the most intricate parts of .NET, and the alternative that would save you time is Java. And I hate Java.
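A sketch of that translation section in Python (the command table here is made up, just to show the whitelist-then-pipe pattern):

```python
# Sketch of the translation section: match the spoken string against a
# whitelist of accepted commands, run the matching program, and pipe its
# stdout back in. The command table is made up for illustration.
import subprocess

COMMANDS = {
    "system uptime": ["uptime"],
    "disk usage": ["df", "-h"],
}

def translate_and_run(spoken):
    for phrase, argv in COMMANDS.items():
        if phrase in spoken.lower():
            result = subprocess.run(argv, capture_output=True, text=True)
            return result.stdout.strip()
    return None  # not an accepted command; say so instead of running anything
```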

Let's walk through a use case:

You say "Get System Uptime"

The first layer recognizes speech, converts it to text, and passes it to the translation layer.

The translation layer parses this string to make sure it matches one of the acceptable commands, in this case system_uptime (just as an example).

The translation layer then calls "uptime" with its stdout (standard out) piped back into the translation layer, reading "18:51:13 up 2 days, 22:53, 4 users, load average: 0.47, 0.82, 1.02".

The translation layer parses this string to get the uptime, converting it to what should be said, "The system has been running for 2 days, 22 hours and 53 minutes".

This is passed to the first layer to be spoken.
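The parsing step in the middle is only a few lines of Python. A sketch, assuming the "up N days, HH:MM" shape from the example above (real `uptime` output varies, e.g. "up 23 min", so a real version would need more patterns):

```python
# Sketch of the parsing step: raw `uptime` output -> sentence to speak.
# Assumes the "up N days, HH:MM" format shown above; real output varies
# (e.g. "up 23 min"), so more patterns would be needed.
import re
import subprocess

raw = subprocess.check_output(["uptime"]).decode()
match = re.search(r"up\s+(\d+)\s+days?,\s+(\d+):(\d+)", raw)
if match:
    days, hours, minutes = match.groups()
    print(f"The system has been running for {days} days, "
          f"{hours} hours and {minutes} minutes.")
else:
    print("Could not parse:", raw.strip())
```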

I feel this would be a good starting design for at least prototyping the concept. If you have an Android device, it would be kind of an easy project to do on that.

My biggest concern about this project is the one I have with any voice recognition system: when does it know you are talking to it? Random example: what if playing Uptown Girl makes it think you are asking for the uptime?


Before I got into security I worked as a translator. Rosetta Stone has a pretty good interface for voice analysis and comparison; it gives a nice, easy-to-read output in graph format. When we would train in accent optimization, we would have to speak into a microphone, then compare that output to that of a native speaker and work to get the two as close to each other as possible to perfect an accent.

Your base idea is very possible. Have someone say "attack", have a script that iterates over the input from the mic (you'd have to convert a graph to a textual representation or something else that can be easily iterated; not easy, but not impossible either, and that's only if Rosetta Stone were used, since there are tons of other possibilities), and then, based on those results, it executes a command accordingly, like a deauth attack. From what little I know, I'd say the tough part is going to be getting that script to recognize accents besides your own. The way the word "hello" looks on a graph when spoken by someone from California is extremely different from someone from New York, and even more different from London, etc.


@Bwall - Well, you could do something like in Iron Man: normally Tony has to say Jarvis's name, and Jarvis quickly replies with a "Yes sir" or similar. It's almost like a dialogue version of clicking the bash icon and having a command prompt open up. Then you begin the dialogue and the commands are run. You could then make a universal terminating command as well. So if you could make a universal verbal initiator like that, the problem is more or less solved.
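Something like this loop, maybe (a sketch; the listen/speak/run_command hooks are whatever the other layers end up providing):

```python
# Sketch of the "say the name first" idea: ignore everything until the
# wake word is heard, acknowledge, then treat the next phrase as a
# command, with a universal terminating word to end the session.
WAKE_WORD = "jarvis"
STOP_WORD = "goodbye"

def session(listen, speak, run_command):
    while True:
        if WAKE_WORD in listen().lower():
            speak("Yes sir?")
            command = listen()
            if STOP_WORD in command.lower():
                speak("Shutting down.")
                break
            speak(run_command(command))

if __name__ == "__main__":
    # Dry run with the keyboard and screen standing in for the mic/speaker.
    session(input, print, lambda cmd: f"(would run: {cmd})")
```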

@bobbyb - Yes, I've seen how voice analysis works (in a small way, not in depth), so it will be a difficult thing to fine-tune beyond personal use.



The engines I mentioned have their own methods of training. With .NET, you preset training words and train it at runtime; I guess you could also set the words at runtime. Granted, the cookie-cutter engines I'm mentioning are not nearly as accurate as the ones bobbyb is mentioning. The question is: is it a feature or a bug that the system would only respond properly to your voice/accent? :P
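The rough Python equivalent of presetting training words is restricting the recognizer to a fixed keyword list. A sketch, assuming the SpeechRecognition package with its offline pocketsphinx backend (the sensitivity numbers are guesses you would tune):

```python
# Sketch of restricting recognition to preset words, assuming the
# SpeechRecognition package with the offline pocketsphinx backend
# (pip install SpeechRecognition pocketsphinx). Each sensitivity is
# 0-1 and controls how eagerly that keyword matches; values are guesses.
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as mic:
    audio = r.listen(mic)

try:
    heard = r.recognize_sphinx(
        audio,
        keyword_entries=[("uptime", 0.8), ("shutdown", 0.8)],
    )
    print("Matched:", heard)
except sr.UnknownValueError:
    print("No preset word recognized.")
```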

