Trying to Run DAPT (Continual Pretraining) on Chip-Design Data

Compute: 4x H100 GPUs
Aim: Perform continual pretraining on llama-3.1-8b-instruct, using a customized tokenizer as well

I am new to distributed model training, and to model training in general. Can anyone let me know what I am doing wrong?

The data used here is a combination of arXiv papers, Wikipedia articles, and GitHub code files in the chip-design domain: about 19 MB, roughly 5,070 lines of JSONL, where each line is a JSON object.
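To rule out corpus problems before preprocessing, here is a minimal sanity-check sketch for the JSONL files. It assumes each JSON object carries a "text" field (the default key NeMo's preprocessing script looks for); the field name is an assumption on my part, so adjust it if your corpus uses a different key.

```python
import json

def validate_jsonl(path, text_key="text"):
    """Count lines and verify each one parses as a JSON object
    containing the expected text field."""
    n_lines, n_chars = 0, 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            obj = json.loads(line)  # raises on malformed JSON
            assert isinstance(obj, dict), f"line {i} is not a JSON object"
            assert text_key in obj, f"line {i} is missing '{text_key}'"
            n_lines += 1
            n_chars += len(obj[text_key])
    return n_lines, n_chars
```

Running this over all 5,070 lines before calling the Megatron preprocessing script catches malformed records early.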

Code:
import nemo_run as run
from nemo.collections import llm
from nemo.collections.llm import Llama31Config8B

# Configure recipe to pre-train based on the default Llama-3.1-8B recipe

def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama31_8b.pretrain_recipe(
        name="llama31_8b_dapt_72_files",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    # Set parallelism and validation parameters
    strategy = recipe.trainer.strategy
    strategy.context_parallel_size = 1
    strategy.tensor_model_parallel_size = 1
    recipe.trainer.val_check_interval = 10

    return recipe

# Executor for running pretraining

def local_executor_torchrun(devices: int = 1) -> run.LocalExecutor:
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")
    return executor

import nemo.lightning as nl
from nemo.collections.common.tokenizers import AutoTokenizer
from nemo.lightning.pytorch.callbacks import ModelCheckpoint

# Define dataset configuration

data = run.Config(
    llm.PreTrainingDataModule,
    paths=["/workspace/L31_PT_72/preprocessed_data_text_document"],
    seq_length=1024,
    tokenizer=run.Config(
        AutoTokenizer,
        pretrained_model_name="/workspace/L31_PT_72/models/tokenizer/llama31/final_tokenizer/",
    ),
    micro_batch_size=1,
    global_batch_size=4,
    split="60,20,20",
)

# Instantiate the recipe

recipe = configure_recipe(nodes=1, gpus_per_node=4)

# Assign dataset configuration to the recipe

recipe.data = data

# Configure resume settings

recipe.resume = run.Config(
    nl.AutoResume,
    restore_config=run.Config(nl.RestoreConfig, path="/workspace/base_models/L31_8B_nemo/"),
)

# Ensure tokenizer is set

recipe.data.tokenizer = data.tokenizer

checkpoint_callback = run.Config(
    ModelCheckpoint,
    save_last=True,
    monitor="val_loss",
    save_top_k=2,
    every_n_train_steps=None,
)
recipe.trainer.callbacks.append(checkpoint_callback)

# Configure parallelism settings

recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1

# Configure training steps and validation intervals

recipe.trainer.max_steps = 100
recipe.trainer.max_epochs = 1
recipe.trainer.val_check_interval = 50
recipe.trainer.limit_train_batches = 100

# Set batch size settings

recipe.data.global_batch_size = data.global_batch_size
recipe.data.micro_batch_size = data.micro_batch_size

# Set checkpoint and log locations

recipe.log.log_dir = "/workspace/logs_28_07"
recipe.log.ckpt.save_optim_on_train_end = False

# Configure learning rate scheduler

recipe.optim.config.lr = 1e-5
recipe.optim.lr_scheduler.min_lr = 1e-6

# Configure data blending: paths can alternate [weight, path, weight, path, ...]

recipe.data.paths = [1, "/workspace/L31_PT_72/preprocessed_data_text_document"]

# Launch the pretraining job

executor = local_executor_torchrun(devices=recipe.trainer.devices)
experiment = run.run(recipe, executor=executor)
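To isolate whether the failure is NeMo-specific, one thing worth trying is a bare torch.distributed script that exercises the same collective that fails (broadcast_object_list). This is a minimal sketch of such a check, not part of my training code: launch it with `torchrun --nproc_per_node=4`; when no GPU is visible it falls back to the gloo backend so it can still run on CPU.

```python
import os
import torch
import torch.distributed as dist

def nccl_sanity_check():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK; default to a single local rank
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Use NCCL when GPUs are visible, otherwise fall back to gloo
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # The same collective that fails inside NeMo's profiler setup
    payload = [{"hello_from_rank": 0}] if rank == 0 else [None]
    dist.broadcast_object_list(payload, src=0)

    dist.destroy_process_group()
    return payload[0]

if __name__ == "__main__":
    print(nccl_sanity_check())
```

If this script reproduces the same ncclUnhandledCudaError on all 4 ranks, the problem sits below NeMo, in the NCCL/CUDA stack of the container.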

Key Error Details
Primary Error:
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 1 'invalid argument'
Failure Point:

Error occurs during trainer.fit(model, data) in the profiler setup phase
Specifically fails at torch.distributed.broadcast_object_list()
All 4 GPU ranks (0,1,2,3) fail with identical NCCL errors

NCCL Configuration:

NCCL version: 2.25.1+cuda12.8
CUDA Driver version: 12080
Network: Using Socket transport (no InfiniBand detected)
4 GPUs with tensor parallelism across single node

Environment

Container: NeMo Framework container (appears to be a recent version)
Hardware: 4x GPU setup on single node
Network: Ethernet-based communication
NCCL Warnings: Multiple "transport/nvls.cc" CUDA failures with 'invalid argument'
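Given that the failures come from transport/nvls.cc, one workaround I am considering is disabling NCCL's NVLS (NVLink SHARP) transport and turning on verbose NCCL logging before relaunching. That NVLS is the culprit is an assumption on my part; `NCCL_NVLS_ENABLE` and `NCCL_DEBUG` are standard NCCL environment variables.

```shell
# Verbose NCCL logging to pinpoint the failing transport
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Disable the NVLink SHARP (NVLS) transport that is failing in nvls.cc
export NCCL_NVLS_ENABLE=0

# then relaunch the training script as before
```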

Reproduction Context

Task: Domain Adaptive Pre-training using custom LLaMA-3.1 tokenizer
Framework: NeMo 2.0 with PyTorch Lightning
Configuration: Multi-GPU distributed training setup

Thank you for reporting this issue. It is currently being tracked and addressed in the NeMo GitHub repo: DAPT on Chip_data, Fails with this error · Issue #14348 · NVIDIA/NeMo

Yes @shashank.verma, I know that, since I was the one who opened that issue as well. My reason for posting here too is that the error is also CUDA-related; given that, I assumed it might be a GPU-level problem, so I opened a thread on this forum.
Thanks