Nvidia Nemo training throws PicklingError

I hope to get some help with an Nvidia Nemo training error.

Environment used:
-> python 3.6
-> cudatoolkit=10.1
-> nemo-toolkit=10.0
-> additionally installed nemo_toolkit[asr]

OS used:
-> Windows 10 Professional x64

What did I attempt to do (quick summary)?
-> Use Nvidia Nemo ASR to train a Jasper model with my own dataset.

Is there another similar topic/issue being reported here on the Nvidia forums?
-> The only topic I could find is "Nvidia Digits not saving Jobs after training the model and throwing Pickle error in log". It also involves pickle, but it does not correlate to this issue since it does not use a static member. Furthermore, that topic received no response.

I am using Nvidia Nemo to implement an online ASR system. I set everything up locally (i.e. I am not using the provided Docker image; instead I cloned the repository from GitHub into my local project and use it via an external library reference within the Python project), and inference works fine.

I tried to set up a training notebook according to the corresponding ASR training section provided in https://nvidia.github.io/NeMo/asr/tutorial.html#training

I can execute everything without any issues (i.e. creating all the necessary neural modules and linking them to form the directed acyclic execution graph) until I call the train method on the neural factory. The following error then appears:

PicklingError: Can't pickle <class 'nemo.collections.asr.parts.collections.AudioTextEntity'>: attribute lookup AudioTextEntity on nemo.collections.asr.parts.collections failed

I had a look at the part of the source code where pickling fails, which is in the Nvidia Nemo library under nemo.collections.asr.parts.collections.py. This module contains several ASR collection classes used by the framework. The class that fails to pickle is called ‘AudioText’, which has a static member named ‘OUTPUT_TYPE’. The namedtuple ‘AudioTextEntity’ assigned to this static member cannot be pickled.

Upon further investigation, it seems that Python has trouble pickling a namedtuple that is defined inside another class, because pickle does not resolve the reference correctly. The reference pickle would need is __main__.AudioText.AudioTextEntity, but pickle seems to ignore the enclosing class and looks up __main__.AudioTextEntity, which results in this error. There are threads on Stack Overflow pointing out this behavior, and it can easily be verified locally with a separate test class. The suggested solutions are either to avoid pickle under these circumstances or to move the static member out of the class (i.e. directly under the Python module) so that the pickle reference becomes valid.
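For reference, the behavior can be reproduced without NeMo. This is only a minimal sketch using the standard library: the class is a stand-in for the library's ‘AudioText’, the field names are made up for illustration, and AudioTextEntityFixed and can_pickle are hypothetical names I chose for the module-level variant and the helper.

```python
import collections
import pickle


class AudioText:
    # Stand-in for the pattern described above: the namedtuple is reachable
    # only as AudioText.OUTPUT_TYPE, while its typename is "AudioTextEntity".
    # pickle looks for a module-level attribute with that typename and fails.
    OUTPUT_TYPE = collections.namedtuple(
        "AudioTextEntity", "audio_file duration text"
    )


# Refactored variant (hypothetical name): defined at module level, with the
# variable name matching the typename, so the attribute lookup succeeds.
AudioTextEntityFixed = collections.namedtuple(
    "AudioTextEntityFixed", "audio_file duration text"
)


def can_pickle(obj):
    """Return True if obj can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


print(can_pickle(AudioText.OUTPUT_TYPE("a.wav", 1.5, "hello")))
print(can_pickle(AudioTextEntityFixed("a.wav", 1.5, "hello")))
```

When run as a script, I would expect the first call to fail and the second to succeed, which matches the refactoring suggested in the Stack Overflow threads.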

Apparently the approach used in the library (a namedtuple held as a static class member) does not work, although it may have worked in an older Python version. The same is true for all other classes within this module ‘collections.py’. I assume it must have worked at some point, since all classes use the same approach and it is required for ASR training. I attempted to refactor the static member ‘OUTPUT_TYPE’ to live outside of the class. However, my IDE (PyCharm) does not seem to reload the changes in the external library correctly (even after using Invalidate Caches and Restart). The exact same error occurs at the same spot.
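One way to check whether the edited file is actually the one being imported, sketched with the standard library only. The module name "json" is just a stand-in so the snippet runs anywhere; in this case it would be the NeMo module that was edited. The helper names module_origin and reload_module are hypothetical.

```python
import importlib
import importlib.util


def module_origin(module_name):
    """Return the file path Python would load for module_name, or None.

    For a dotted name whose parent package is missing, find_spec raises
    ModuleNotFoundError instead of returning None.
    """
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


def reload_module(module_name):
    """Import the module if needed, then re-execute its source in place."""
    module = importlib.import_module(module_name)
    return importlib.reload(module)


# Stand-in module name; replace it with the module that was edited.
print(module_origin("json"))
reloaded = reload_module("json")
print(reloaded.__file__)
```

If the printed path points to a copy in site-packages rather than the cloned repository, the edits are being made to a file that is never imported. Note that importlib.reload only re-executes that single module, and objects created before the reload keep referring to the old classes, so restarting the notebook kernel is usually the more reliable option.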

Has anyone else stumbled across this problem? How did you solve it? Did you modify the file in the external library manually to get it working and if so how did you reload those library references? Did you use another workaround?

I am sure that other people will eventually run into the same issue when attempting to use Nvidia Nemo to train ASR models. I would be very grateful for any useful hint I could try, and I am happy to provide any further information needed to clarify my issue.

I ran into the same issue while training QuartzNet. It seems to be an issue with multiprocessing on Windows. As a workaround, you can set self.num_workers = 0 in place of self.num_workers = num_workers in \torch\utils\data\dataloader.py, and it should work. I am not sure, but it might slow down training since multiprocessing will no longer be used. Could any NeMo expert please help resolve this permanently?


Thank you very much for your reply. Assigning any number other than 0 to num_workers (i.e. PyTorch will start worker processes in addition to the main process) results in this issue. I still do not understand why this solves the issue, but it works as a quick workaround (potentially with a hefty performance penalty, which I have not verified yet).

So, for anyone reading this and experiencing the same issue: open the corresponding PyTorch backend file dataloader.py, in which you will find the class DataLoader. Manually assign the value 0 to num_workers in the __init__ method (i.e. self.num_workers = 0). Maybe a developer or anyone with a deeper understanding could explain why it behaves this way, or potentially fix this issue once and for all.
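A possible explanation for why num_workers = 0 avoids the error, as far as I understand it: on Windows, Python's multiprocessing can only start worker processes with the spawn method, which pickles the objects handed to each worker (for a PyTorch DataLoader, that includes the dataset). With zero workers, no process is started and nothing is pickled. The sketch below uses only the standard library; Dataset and survives_spawn_pickling are hypothetical stand-ins for illustration, not NeMo or PyTorch code.

```python
import collections
import multiprocessing
import pickle
from multiprocessing.reduction import ForkingPickler


class Dataset:
    """Hypothetical stand-in for a dataset that stores namedtuple entities."""

    # Reachable only as Dataset.ENTITY, not under the module-level name
    # "Entity" that pickle will try to look up.
    ENTITY = collections.namedtuple("Entity", "audio_file text")

    def __init__(self):
        self.items = [self.ENTITY("a.wav", "hello")]


def survives_spawn_pickling(obj):
    """Serialize obj with the pickler multiprocessing uses for workers."""
    try:
        ForkingPickler.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# On Windows this list contains only "spawn".
print(multiprocessing.get_all_start_methods())
print(survives_spawn_pickling(Dataset()))
```

If this is indeed the cause, moving the namedtuple to module level should allow worker processes again without editing the PyTorch source.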
