Where to start developing an acoustic model for Esperanto for CMUSphinx

Viewing 9 posts - 1 through 9 (of 9 total)

    The title says it all. I’m not sure where to begin reading. I’ve never worked on this sort of project.


    Hey, I think you are the first person besides me to use the forums. This is fairly new, so I’m still just learning my way around.

    Here is a description of the process of training a new model:

    From what I understand, Esperanto is a pretty phonetic language, so it may be fairly straightforward to build a phonetic dictionary. If each letter corresponds to only one phoneme, you may be able to write a script to generate phonetic transcriptions from a set of Esperanto words.
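    For illustration, here's a minimal sketch of what such a script could look like. The phoneme symbols in the mapping below are placeholders I made up; you'd replace them with whatever phoneme set matches the model you end up adapting:

    ```python
    # Sketch: grapheme-to-phoneme conversion for Esperanto, assuming a strict
    # one-letter-to-one-phoneme mapping. Phoneme symbols here are illustrative,
    # not taken from any real model.
    EO_PHONEMES = {
        "a": "A", "b": "B", "c": "TS", "ĉ": "CH", "d": "D", "e": "E",
        "f": "F", "g": "G", "ĝ": "JH", "h": "HH", "ĥ": "X", "i": "I",
        "j": "Y", "ĵ": "ZH", "k": "K", "l": "L", "m": "M", "n": "N",
        "o": "O", "p": "P", "r": "R", "s": "S", "ŝ": "SH", "t": "T",
        "u": "U", "ŭ": "W", "v": "V", "z": "Z",
    }

    def transcribe(word):
        """Return a space-separated phoneme string for one Esperanto word.
        Raises KeyError on characters outside the mapping, which is a
        useful sanity check when processing a word list."""
        return " ".join(EO_PHONEMES[ch] for ch in word.lower())

    def build_dict(words):
        """Yield pocketsphinx-style dictionary lines: 'word PH PH PH'."""
        for w in words:
            yield f"{w} {transcribe(w)}"

    # Example: print a tiny dictionary from a word list.
    for line in build_dict(["saluton", "dankon", "ĝis"]):
        print(line)
    ```

    You'd feed the output into your cmudict.dict file, one word per line.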

    Unfortunately, you also need to train your model using audio clips, which Naomi might be able to help with. You'll need to set Naomi to eo mode and create translation files. You can use ./update_translations --language=eo from the Naomi directory. You will need to adjust the headers of those files, but that's mostly a matter of figuring out one and copying it to the others. There are controls for the ways that plurals are designated, etc.

    As far as building an acoustic model for Pocketsphinx, I'm thinking that you can probably start by adapting a model with similar phonemes, like Spanish (https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Spanish/), to your phoneme dictionary. You should put all the HMM files into a folder at ~/.naomi/pocketsphinx/standard/eo/. You should have the files “cmudict.dict” (which you create yourself), “feat.params”, “mdef”, “means”, “mixture_weights”, “noisedict”, “sendump”, “transition_matrices” and “variances”.

    Once you have a grapheme-to-phoneme dictionary, you can use the steps in https://projectnaomi.com/dev/docs/plugins/pocketsphinx-install.html to try to generate a basic model.

    Here is the official training guide. I’m hoping you can get away with adapting a model rather than training a new model.

    With Naomi there are a bunch of settings for capturing and storing audio clips in a database. Open ~/.naomi/configs/profile.yaml and add save_audio: True or save_active_audio: True, or use the “--save-audio” flag on the command line. This creates a sqlite3 database at ~/.naomi/audiolog/audiolog.db; the actual audio clips are stored in .wav files in the same directory. When recording audio for training, I always also set the print_transcript: True option in profile.yaml, so I can see more clearly what Naomi is and isn’t getting. This may lead you to modify your phoneme mapping if Naomi is pretty consistently mishearing certain words. It doesn’t have to be the whole dictionary either. Pocketsphinx performs much better with a limited dictionary of just the words you are most likely to use.
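    For reference, the relevant part of profile.yaml might look something like this. Treat it as a sketch: the exact option nesting can vary between Naomi versions, so check your own file before copying:

    ```yaml
    # ~/.naomi/configs/profile.yaml (relevant settings only)
    save_audio: True         # record captured audio to the audiolog database
    print_transcript: True   # echo transcriptions so you can spot mishearings
    ```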

    It probably won’t understand you well at all at first, but you can train it to your specific voice, which actually works pretty well. Once you have some audio clips, you can run the NaomiSTTTrainer.py program, which will start a small webserver and launch your browser. From there you can review your own audio files and correct Naomi’s transcriptions. When you are ready, click on the “Train STT Engines” tab and select the “Adapt Pocketsphinx” button. This will use all the clips that you have corrected or verified to train the model to your clips.

    Now, having a model that understands you specifically is great, but creating a model that a lot of people can use, with a lot of different voices and inflections and whatnot, will take longer and require a few people to participate. You can send them your trained model to start with, and the data they add will help create a more general listener, which will then be a better starting point for new listeners.

    This is one of the strengths of Naomi right now, while everyone is building composite models that understand everyone poorly, you can have a specific model that understands just you less poorly.

    The database also collects word error rates, which can be used to help track how your model improves as you provide more training data.
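    If you want to poke at those error rates yourself, here's a sketch using Python's sqlite3 module. I'm guessing at the table and column names (“audiolog”, “wer”), so run .schema in the sqlite3 shell against audiolog.db first to find the real ones:

    ```python
    # Sketch: inspecting word error rates in Naomi's audiolog database.
    # Table and column names below are assumptions, not the actual schema.
    import os
    import sqlite3

    DB_PATH = os.path.expanduser("~/.naomi/audiolog/audiolog.db")

    def list_tables(conn):
        """List table names so you can locate where the error rates live."""
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
        return [r[0] for r in rows]

    def average_wer(conn, table="audiolog", wer_col="wer"):
        """Average word error rate over all rows (hypothetical schema)."""
        (avg,) = conn.execute(
            "SELECT AVG(%s) FROM %s" % (wer_col, table)).fetchone()
        return avg

    # Usage against the real database:
    #   with sqlite3.connect(DB_PATH) as conn:
    #       print(list_tables(conn))
    #       print(average_wer(conn))
    ```

    Re-running the average as you add corrected clips gives you a rough trend line for how the model is improving.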

    Don’t throw your recordings away, either. We will use them down the road for training DeepSpeech and Kaldi models, which should provide better accuracy.

    Please let me know if you need additional help.



    By the way, did you create your avatar, or was it just assigned to you? I haven’t figured out how to change the profile pic yet.


    No idea; it must have come from another WordPress website and carried over.


    Okay, thanks.


    Thank you for your help. Yeah, Esperanto is simple phonetically, but I think the biggest challenge is the many different accents, since its speakers are spread across many regions. It looks like I have a lot of work ahead!


    Yes, it will probably be quite a bit of work, unless you can find a pre-built Esperanto acoustic model. Even then, translating every utterance in Naomi will be a challenge.

    For French and German, by the way, I’ve had better luck with pico than with flite for some reason. I’d be interested to hear what the best text to speech engine is for Esperanto.


    Oh, and again, you would be building your model for you. Your voice, your accent. Then you can pass the basic model to someone with a different accent for them to continue adapting it. Hopefully we can make it really easy to share and mix language models. That’s all pretty new stuff, though.

