Problems getting streaming example to work with RNNT

Hi!

We have a custom-trained Nemo model: a Conformer with a char-encoded RNNT decoder.

At risk of repeating myself, it’s a conformer model using Char encoding (instead of BPE) and using an RNNT/Transducer (instead of CTC). The model class is EncDecRNNTModel.

I’m trying to get this working in streaming aka buffered inference mode.

There are some excellent notebooks with explanations and example code showing how to do streaming with Nemo: this, here and here.

(Yes I do realize that these notebooks are in the Nemo github, not on google per se).

I’m running into problems that might be because the examples have not been updated for the latest versions, or maybe it’s something else. Either way, I would really appreciate any help.

The short version for the problem I’m having is that I get this error when I try to use it.

AttributeError: 'EncDecRNNTModel' object has no attribute 'tokenizer'

Specifically, the LongestCommonSubsequenceBatchedFrameASRRNNT class (from nemo/collections/asr/parts/utils/) makes reference to the model.tokenizer object.

It does that on line 715:

        if hasattr(asr_model.decoder, "vocabulary"):
            self.blank_id = len(asr_model.decoder.vocabulary)
        else:
            self.blank_id = len(asr_model.joint.vocabulary)
        self.tokenizer = asr_model.tokenizer # <-- here 

The problem is that the asr_model I’m using aka the EncDecRNNTModel from Nemo 1.20 doesn’t have a tokenizer. Methods like decode_ids_to_tokens are on the model.decoding object.
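
One possible workaround, sketched here under assumptions and untested against Nemo itself, is a small adapter that exposes tokenizer-style methods on top of a character vocabulary. The class name CharTokenizerShim and the method names ids_to_tokens / ids_to_text are hypothetical; whether the streaming code calls exactly these methods is an assumption that would need checking against the installed version. The blank id follows the same convention as the snippet above (one past the end of the vocabulary).

```python
from typing import List, Sequence


class CharTokenizerShim:
    """Hypothetical stand-in for a `tokenizer` attribute on a char-based model."""

    def __init__(self, vocabulary: Sequence[str]):
        self.vocabulary = list(vocabulary)
        # Same convention as the quoted snippet: blank sits just past the vocabulary.
        self.blank_id = len(self.vocabulary)

    def ids_to_tokens(self, ids: Sequence[int]) -> List[str]:
        # Skip the blank id (and anything out of range), map the rest to characters.
        return [self.vocabulary[i] for i in ids if 0 <= i < self.blank_id]

    def ids_to_text(self, ids: Sequence[int]) -> str:
        # For a char vocabulary, text is just the characters joined together.
        return "".join(self.ids_to_tokens(ids))


# Toy vocabulary for illustration only; a real one would come from the model.
shim = CharTokenizerShim([" ", "a", "c", "t"])
text = shim.ids_to_text([2, 1, 4, 3])  # id 4 is the blank and is dropped
print(text)
```

With a real model, the idea would be to build the shim from asr_model.joint.vocabulary (or asr_model.decoder.vocabulary) and assign it to asr_model.tokenizer before constructing the streaming class. That step is a guess at a fix, not something confirmed by the notebooks.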

Maybe this stuff got moved around and I just need to make some small changes to the streaming code or maybe I’m very very confused about the whole thing.

Any help very much appreciated! Thanks in advance.

BTW I’m using …

>>> nemo.__version__

A follow up question… in the notebook it mentions that the approach used is similar to the work discussed in the paper Partially Overlapped Inference for Long-Form Speech Recognition “but operates on the notion of subword buffers rather than character tokens”.

Reading the paper, it seems like the fact that their approach is character level rather than word level is quite an important part of what they studied, so that seems odd. But perhaps by subword buffers they just mean BPE tokens… in which case, if we use this code with a char-encoded model, it would be operating at the character level, not the word level. Is that right?
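
To make the question concrete, here is a generic sketch of merging the outputs of two overlapping buffers by aligning them on their longest common contiguous run of tokens. It uses Python's difflib and is not Nemo's merge algorithm; the function name merge_overlapping is hypothetical. The point it illustrates is only that such a merge works on token sequences and does not care whether a token is a BPE piece or a single character.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def merge_overlapping(left: Sequence[str], right: Sequence[str]) -> List[str]:
    """Join two token sequences on their longest common contiguous run."""
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    match = matcher.find_longest_match(0, len(left), 0, len(right))
    if match.size == 0:
        # No shared tokens: fall back to plain concatenation.
        return list(left) + list(right)
    # Keep `left` up to the end of the match, then the rest of `right`.
    return list(left[: match.a + match.size]) + list(right[match.b + match.size:])


# Character-level tokens, as a char-encoded model would produce.
char_merged = "".join(merge_overlapping(list("hello wor"), list("lo world")))

# Subword-style tokens (made-up pieces, for illustration only).
bpe_merged = merge_overlapping(["he", "llo", "wor"], ["llo", "wor", "ld"])

print(char_merged)
print(bpe_merged)
```

In both calls the merge logic is identical; only the granularity of the tokens differs.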

Hi @utunga,

You might have better luck getting a response in the NeMo GitHub discussion area: NVIDIA/NeMo · Discussions · GitHub