Some time back, Nandeep took me along to NID to help Masters students with their projects. Kalyani, one of the students, was making a haptic device to help students with hearing disabilities practice and learn the alphabet on their own. From her field visits she had learned that, at the moment, most sessions are done in person with a trainer. In these sessions the kids feel the vibrations of the trainer's throat to work out how to speak. Her project was built around replicating these vibrations on a physical device, along with an app that can "listen" to what students are saying and compare it against standard audio samples of each character.
Nandeep quickly found this script, which used Librosa to compare two audio samples. It was a good start, and it gave us some idea of how we could use these tools on our sample set. When we looked at the values returned by MFCC, we realized that each frame was a vector rather than a single number, so the distance function used by DTW had to compare vectors. With DTW we tried a few custom distance calculations, such as the average and the difference between max values, but cosine similarity gave us the best results. We put together this script, tested it with different samples and checked the performance, and Kalyani was quite satisfied with it for her prototype.
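To make the idea concrete, here is a minimal sketch of DTW with a cosine distance between frames. It is not the script we used: it is plain Python with made-up toy frames standing in for MFCC vectors, and the names `cosine_distance` and `dtw_cost` are just for illustration. With Librosa, the frames would typically come from `librosa.feature.mfcc`, and `librosa.sequence.dtw` accepts `metric="cosine"`, so most of this would collapse into a couple of calls.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat an all-zero (silent) frame as unrelated.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_cost(seq_a, seq_b, dist=cosine_distance):
    """Accumulated cost of the best DTW alignment between two sequences of frames."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy 2-dimensional "frames"; real MFCC frames usually have 13 or more coefficients.
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
# Same shape as the reference, but spoken more slowly (frames repeated).
slower = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
# A different sound altogether.
different = [[0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

same_cost = dtw_cost(reference, slower)
diff_cost = dtw_cost(reference, different)
print("reference vs slower:   ", same_cost)
print("reference vs different:", diff_cost)
```

The slower sample aligns with the reference at almost zero cost even though it has more frames, which is the property that makes DTW useful when students speak at different speeds. Cosine distance compares the direction of the frame vectors and ignores their magnitude, which may be part of why it worked better for us than the other distance calculations we tried.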
We got quite lucky that this algorithm fit our problem, since a simple online search might not have led us in this direction.