This is a complex topic that is difficult to answer in a couple of sentences. In essence, the algorithms learn, from millions of data samples and several thousand audio features, the acoustic patterns associated with known emotions, and are thus able to predict emotions and confidence scores automatically.
Accuracy is measured on reference databases. There are a few public ones used by researchers, but they do not reflect realistic conditions for commercial use cases. It is therefore essential to have commercial-grade corpora to tune and evaluate the algorithms.
Real-time operation is the core feature of openSMILE. All processing is carried out incrementally in order to return results with the minimum possible delay. Emotion recognition and paralinguistic speech analysis require collecting data over windows (typically 2-5 seconds), which creates an unavoidable minimum delay; this process is handled transparently by openSMILE. You can record audio from a microphone, or stream audio over a UDP socket to openSMILE, and retrieve analysis results in JSON format as soon as they are available.
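As a rough illustration of this streaming setup, the sketch below sends 16-bit PCM audio chunks over UDP and parses a JSON result message. Note that the host, port, chunk size, and JSON field names (`emotion`, `confidence`) are illustrative assumptions, not part of the openSMILE API; the actual wire format depends on your openSMILE configuration.

```python
import json
import socket
import struct

# Hypothetical endpoint -- adjust to your openSMILE configuration.
OPENSMILE_HOST, OPENSMILE_PORT = "127.0.0.1", 5005
CHUNK = 1600  # samples per datagram: 100 ms of 16 kHz mono audio (assumed)

def pcm_datagrams(samples, chunk=CHUNK):
    """Pack 16-bit PCM samples into little-endian UDP payloads."""
    for i in range(0, len(samples), chunk):
        part = samples[i:i + chunk]
        yield struct.pack("<%dh" % len(part), *part)

def parse_result(payload):
    """Extract (label, confidence) from one JSON result message
    (hypothetical schema for illustration)."""
    data = json.loads(payload)
    return data["emotion"], data["confidence"]

def stream(samples):
    """Send audio chunks over UDP; analysis results arrive
    asynchronously on a separate channel as JSON messages."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for datagram in pcm_datagrams(samples):
            sock.sendto(datagram, (OPENSMILE_HOST, OPENSMILE_PORT))
    finally:
        sock.close()
```

Because analysis windows span a few seconds, results lag the audio by that window length regardless of how fast the chunks are sent.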
There are many use cases, ranging from marketing research, telecom service centres, sales, and speech training to depression monitoring and treatment.
Our openSMILE licensing offers are generally custom offers that fit our clients' business models. You can choose between several options: using openSMILE remotely through our web APIs (i.e. we will set up a custom API for you), or licensing commercial openSMILE binaries. In the latter case the price of the license depends on scale: we offer licenses that scale either with the number of instances or with the volume of analysed audio. In any case the license is granted for your specific use case and application. Further information on commercial licensing can be found here.
For details on our pricing models, please contact us.
We have accurate predictors for the fundamental emotion dimensions (arousal, valence, power/control), from which probabilities for emotion classes (such as angry, happy, sad) can be estimated. The big challenge is to cope with degraded acoustic conditions, background noise, and speaker and cultural differences. Therefore, we constantly improve our algorithms in a semi-automatic way.
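One simple way such a dimensional-to-categorical mapping can work is to place each emotion class at a prototype point in the dimensional space and derive class probabilities from distances to those prototypes. The sketch below does this with a softmax over negative distances in a 2-D (arousal, valence) plane; the prototype coordinates and the mapping itself are illustrative assumptions, not audEERING's actual (unpublished) method.

```python
import math

# Illustrative class prototypes in (arousal, valence) space,
# roughly following the circumplex model of affect. These
# coordinates are assumptions chosen for demonstration only.
PROTOTYPES = {
    "angry":   (0.8, -0.7),
    "happy":   (0.7,  0.8),
    "sad":     (-0.6, -0.6),
    "neutral": (0.0,  0.0),
}

def class_probabilities(arousal, valence):
    """Estimate class probabilities from predicted dimension values
    via a softmax over negative Euclidean distances to prototypes."""
    scores = {label: -math.dist((arousal, valence), proto)
              for label, proto in PROTOTYPES.items()}
    m = max(scores.values())  # shift for numerical stability
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}
```

For example, a prediction with high arousal and positive valence lands nearest the "happy" prototype and receives the highest probability for that class.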
We take a very new and open approach to innovative technology development. We believe that good technology can only be created when it is based on solid academic evidence. We therefore collaborate closely with academic teams, organise and sponsor research challenges, and give the core of our audio analysis tools to the academic world for free as open-source software. Our openSMILE toolkit was the first of its kind to define standard acoustic feature sets for emotion recognition and other paralinguistic speech analytics tasks, such as detection of speaker age and gender. It has been downloaded over 50k times and cited over 1k times in academic publications. This gives the algorithms we build on top of it a solid base framework that has been tested by many users worldwide.
Most other emotion recognition providers do not openly communicate what their technology is based on. Some of them have acquired patents that are more than a decade old, but there is no evidence that these vendors are actively doing research to improve their technology. The focus of most of our competitors is on emotion, while we aim for the broader field of paralinguistic analysis – i.e. everything that is carried by the voice, except the text.
It is not better, it is fundamentally different. What you say can drastically differ from how you say it. Our algorithms detect how you say it, not what you say. Speech-to-text and sentiment analysis are therefore complementary to our analysis of the tone of voice. The tone of voice reflects our subconscious more: it is easier to control what you are saying than how you are saying it.
Our vision is to become the leading provider of paralinguistic audio analysis technology that simply works and enables machines to interface with humans in a more social and natural way. This means the technology shall be able to deal with all sorts of degraded and challenging acoustic conditions (poor microphones, background noise, etc.), and, in an iterative, smart, self-evolving way, learn to read from our voices – not only emotions, but also how healthy we sound, how old we are, or whether we are nervous or stressed out.