Problem Description
I’m encountering an error when using the Shuffle operation in NeMo Curator. I was using the data_curation notebook provided by NVIDIA itself with their example data. Inside the notebook, there was this shuffle line which is causing an error:
shuffle = nc.Shuffle(seed=42)
blended_dataset = shuffle(blended_dataset)
I get this error:
raise ValueError(f"Cannot fuse tasks with multiple outputs {leafs}")
ValueError: Cannot fuse tasks with multiple outputs {('_add_rand_col-291b58e263efcf82992ae0c66d40a50f', 6), ('getitem-fd7aab8e5b4d8eeb405c0b551747810d', 4)}
Investigation
Looking at the implementation, it appears the issue is in how the _add_rand_col function works with Dask’s map_partitions:
From NeMo Curator source code
dataset.df[self.rand_col] = dataset.df.map_partitions(self._add_rand_col)
def _add_rand_col(self, partition, partition_info=None):
if partition_info is None:
partition_info = {
"number": 0,
}
if self.seed is not None:
np.random.seed(self.seed + partition_info["number"])
rand_col = np.random.randint(0, np.iinfo("int64").max, size=len(partition))
return rand_col
The issue seems to be that _add_rand_col returns a NumPy array instead of a DataFrame/Series with matching index structure, causing Dask fusion errors.
Environment Details
nemo_curator version : 0.7.0
Python version: 3.10.12
Operating system: ubuntu
Attempted Solutions
I’ve tried upgrading NeMo Curator but encountered dependency conflicts with protobuf versions:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-metadata 1.16.1 requires protobuf<4.21,>=3.20.3; python_version < "3.11", but you have protobuf 6.30.1 which is incompatible.
grpcio-status 1.71.0 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 6.30.1 which is incompatible.
Questions
Is this a known issue with the NeMo Curator Shuffle operation in the official NVIDIA example notebooks?
Are there any workarounds to make the shuffle work without changing dependency versions?
Is there a recommended approach to handle this kind of dependency conflict?
Any help would be greatly appreciated!