Changing the tokenizer in the NLP pipeline - NVIDIA Morpheus

Dear NVIDIA,
I would like to change the tokenizer of the NLP pipeline.
I am using other models (RoBERTa and BERTIN) for Spanish NLP, and I want to change the tokenizer to match. When I run the Helm chart for nlp-pipeline, we can specify a file that contains hashes (currently it holds the BERT uncased tokenizer vocabulary).
I would like to know how I can generate this file from another model and how I can use it when deploying with Helm.
Thanks!

Hi there! We’ve taken note of your requirement, and Engineering will evaluate the level of effort for this. It turns out that this request is already being tracked in the RAPIDS cuDF project:

Morpheus currently supports only the cuDF GPU-accelerated BERT tokenizer. However, it may be possible to do this with a CPU-based BPE tokenizer like this one as a proof of concept:

Thank you!

To change the tokenizer in the NLP pipeline, you first need a tokenizer trained for the desired model (RoBERTa or BERTIN). Once you have it, export it as a tokenizer file in the format the pipeline expects (e.g. a vocabulary file for BERT models).
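
As a rough illustration of the export step, here is a minimal sketch that writes a BERT-style vocabulary file (one token per line, where the line index is the token id). The function names and the sample tokens are hypothetical. It only covers WordPiece-style vocabularies; RoBERTa-style models use byte-level BPE with a vocab.json and merges.txt pair, which this does not handle. The comment about cuDF's `hash_vocab` helper is an assumption about the follow-up step and is not executed here.

```python
import os
import tempfile
from pathlib import Path

# Special tokens conventionally found in a BERT WordPiece vocabulary.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def write_bert_vocab(tokens, path):
    """Write a BERT-style vocab file: one token per line, line index = token id.

    Special tokens go first; duplicates and empty strings are skipped.
    Returns the final ordered token list.
    """
    seen = set()
    ordered = []
    for tok in SPECIAL_TOKENS + list(tokens):
        if tok and tok not in seen:
            seen.add(tok)
            ordered.append(tok)
    Path(path).write_text("\n".join(ordered) + "\n", encoding="utf-8")
    return ordered


def read_bert_vocab(path):
    """Load the vocab file back into a token -> id mapping."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {tok: idx for idx, tok in enumerate(lines)}


if __name__ == "__main__":
    # Hypothetical tokens, only to show the file layout.
    out = os.path.join(tempfile.mkdtemp(), "vocab.txt")
    write_bert_vocab(["hola", "mundo", "##s"], out)
    print(read_bert_vocab(out))
    # Assumption: on a machine with cuDF installed, the hash file could then
    # be produced with cudf.utils.hash_vocab_utils.hash_vocab(out, "hash.txt").
```

The resulting hash file, not the plain vocabulary file, is what the cuDF tokenizer would consume, so this covers only the first half of the conversion.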

To use the new tokenizer, update the pipeline configuration so that it points to the new tokenizer file. When deploying with Helm, update values.yaml so that the relevant section contains the path to the new tokenizer file.
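
An alternative to editing values.yaml is passing the override on the Helm command line. The sketch below only builds the strings involved. It assumes the chart exposes the pipeline command through a value named `sdk.args` and that the preprocess stage accepts `--vocab_hash_file` and `--do_lower_case`, as in the Morpheus NLP examples; please check both against your chart and Morpheus version. The helper names and the file path are hypothetical.

```python
import shlex


def build_preprocess_args(vocab_hash_file, do_lower_case=False):
    """Return the preprocess-stage fragment of a Morpheus CLI command.

    Assumption: the stage takes --vocab_hash_file and --do_lower_case.
    """
    args = [
        "preprocess",
        f"--vocab_hash_file={vocab_hash_file}",
        f"--do_lower_case={do_lower_case}",
    ]
    # Quote each argument so paths with spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args)


def build_helm_set_flag(sdk_args):
    """Wrap a full pipeline command line in a Helm --set override.

    Assumption: the chart reads the command from a value called sdk.args.
    Commas in the value would need escaping for Helm; not handled here.
    """
    return f"--set sdk.args={shlex.quote(sdk_args)}"


if __name__ == "__main__":
    fragment = build_preprocess_args("/data/my-model-hash.txt")
    print(build_helm_set_flag("morpheus run pipeline-nlp " + fragment))
```

Whichever way the path is passed, the hash file has to be reachable from inside the container, for example through a mounted volume.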

Note that, depending on the model and the tokenizer format, you may also need to change the pipeline code so that it can handle the new format.
It is worth consulting the documentation of the model you are using and looking for examples of how its tokenizer is used.