Is the output list length of the open question pipeline wrong?

I ran this notebook: NeMo-Curator/tutorials/synthetic-data-hello-world/Synthetic Data Generation - Hello World Examples.ipynb at main · NVIDIA/NeMo-Curator · GitHub. Why is the length of the output list 374 instead of 20?
Here are the relevant screenshots.


Hi @ambo. When applied to a string, the Python len() function returns the number of characters in that string - hence 376. If you want to count the number of words in the string, try len(open_qa_questions.split()).
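As a quick illustration of how `len()` differs between a string and a list, here is a minimal sketch. The sample values are hypothetical stand-ins, not the notebook's actual output.

```python
# Hypothetical stand-in values; these are not the notebook's real output.
questions_as_text = "What is AI?\nHow do plants grow?"
questions_as_list = ["What is AI?", "How do plants grow?"]

# On a string, len() counts characters (including the newline).
char_count = len(questions_as_text)

# Splitting on whitespace first gives a word count instead.
word_count = len(questions_as_text.split())

# On a list, len() counts elements.
item_count = len(questions_as_list)

print(type(questions_as_text).__name__, char_count, word_count)
print(type(questions_as_list).__name__, item_count)
```

Checking `type(...)` on the variable first shows which of these cases applies.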

According to the user guide, the output is a list, so I think the len() function can be used.

Oh, my bad - sorry for not reading this fully. Let me take a look now! Thanks for your patience.

Hey!

It’s certainly not supposed to generate so many open-lines.

Would you be able to post a snippet of the output?

It looks like the model may have been overzealous in generating topics or subtopics.

Thanks,
Chris

Here are the output screenshots. They show only a snippet.



Thanks @ambo. What does open_qa_questions return?

Here are the open_qa_questions screenshots. They show only a snippet.



I believe this is an artifact of model verbosity - can you confirm you’re using Meta’s Llama 405B model?
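One way to check whether verbosity explains the count is to separate entries that look like questions from everything else. This is only a rough sketch: `split_questions` is a hypothetical helper, the sample list is made up, and "ends with a question mark" is an assumed heuristic rather than anything NeMo Curator does.

```python
def split_questions(entries):
    """Partition entries into those that look like questions and the rest.

    Assumption: an entry counts as a question if it ends with '?'.
    """
    questions, other = [], []
    for entry in entries:
        text = str(entry).strip()
        if text.endswith("?"):
            questions.append(text)
        else:
            other.append(text)
    return questions, other


# Made-up sample mimicking a verbose model response split into lines.
sample = [
    "Here are some questions:",
    "1. What causes tides?",
    "",
    "2. Why is the sky blue?",
    "I hope these help!",
]

questions, other = split_questions(sample)
print(len(questions), "question-like entries,", len(other), "other entries")
```

If most of the entries land in `other`, the extra length is probably filler text around the questions rather than extra questions.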