I am struggling to find what I am looking for within the NeMo framework; maybe what I need is not included in it. In terms of speaker tasks, NeMo supports recognition (along with verification) and diarization. Per my understanding, recognition is trained on audio recordings in which a single person speaks per recording; each recording and speaker are labelled, and NeMo learns how to recognize (and verify) the speakers it was trained on. Diarization goes a step further: it learns how to segment an audio recording in which there are multiple (known) speakers and tells us who spoke when. All of this is supervised learning, where the speakers in the training set need to be the same as those in the test set.
In a way my task is simpler (or not), as I do not have to assign labels/names to speakers. I need to be able to tell the number of speakers in a given audio recording and determine who spoke when. So there is no recognition or verification; the labels in my case are generic (e.g., speaker 0, speaker 1), since the speakers can be unknown to the model. I tried playing around with existing NeMo solutions (prebuilt speaker recognition and diarization models), but none seem to do what I am looking for. In my case, the speakers I need to detect may be ones that were never seen in training. I guess I need some kind of unsupervised learning method (clustering based on speakers' characteristics). Any advice would be greatly appreciated.
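To make it concrete, here is a toy sketch of the kind of clustering I have in mind, with random vectors standing in for real per-segment speaker embeddings (the embedding model and the segmentation are the parts I am missing). The `cluster_segments` function, the threshold value, and the greedy assignment strategy are all my own placeholders, not anything from NeMo:

```python
import numpy as np

def cluster_segments(embeddings, threshold=0.5):
    """Greedy cosine-similarity clustering: assign each segment embedding
    to the first cluster whose representative (its first member) is similar
    enough, otherwise open a new cluster, i.e. a new generic 'speaker N'."""
    reps, labels = [], []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)  # unit-normalize for cosine similarity
        sims = [float(emb @ r) for r in reps]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            reps.append(emb)          # first member represents the new cluster
            labels.append(len(reps) - 1)
    return labels

# Toy data: two well-separated "speakers" with three segments each.
rng = np.random.default_rng(0)
spk_a = rng.normal(loc=[5.0, 0.0, 0.0], scale=0.1, size=(3, 3))
spk_b = rng.normal(loc=[0.0, 5.0, 0.0], scale=0.1, size=(3, 3))
labels = cluster_segments(np.vstack([spk_a, spk_b]))
print(labels)                  # generic labels per segment
print(len(set(labels)))        # estimated number of speakers
```

With real embeddings I would of course expect to need a proper clustering method (e.g. agglomerative or spectral) rather than this greedy pass, but it illustrates the output I am after: a speaker count plus a generic label per segment.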