Thanks for your reply! I would like to share my workload.
My workload depends on a DL framework called DGL, and its Python multi-threading usage is here. In short, every time the DataLoader is initialized, it creates a separate thread for data processing and transfer.
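(As a quick sanity check on that claim, here is a small, hypothetical sketch I could run; it reuses the graph, train_nids, sampler, and device objects from the code below, and it only shows threads that register with Python's threading module, so any thread spawned purely on the C++ side would not appear.)

import threading

def dump_threads(tag):
    # Print the Python-level threads that are currently alive.
    names = [t.name for t in threading.enumerate()]
    print(f"[{tag}] {len(names)} threads: {names}")

dump_threads("before DataLoader")
loader = dgl.dataloading.DataLoader(
    graph, train_nids, sampler, device=device,
    batch_size=1024, shuffle=True, drop_last=False, num_workers=0,
)
dump_threads("after DataLoader init")
_ = next(iter(loader))  # pull one batch in case the worker thread starts lazily
dump_threads("after first batch")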
Here is the code:
import dgl
import torch
from ogb.nodeproppred import DglNodePropPredDataset
dataset = DglNodePropPredDataset('ogbn-arxiv')
device = 'cuda:0'
graph, node_labels = dataset[0]
graph = dgl.add_reverse_edges(graph)
graph.ndata['label'] = node_labels[:, 0]
node_features = graph.ndata['feat']
num_features = node_features.shape[1]
num_classes = (node_labels.max() + 1).item()
idx_split = dataset.get_idx_split()
train_nids = idx_split['train']
valid_nids = idx_split['valid']
test_nids = idx_split['test']
sampler = dgl.dataloading.NeighborSampler([4, 4], prefetch_node_feats=['feat'], prefetch_labels=['label'])
train_dataloader = dgl.dataloading.DataLoader(
    # The following arguments are specific to DGL's DataLoader.
    graph,                # The graph
    train_nids,           # The node IDs to iterate over in minibatches
    sampler,              # The neighbor sampler
    device=device,        # Put the sampled MFGs on CPU or GPU
    # The following arguments are inherited from PyTorch DataLoader.
    batch_size=1024,      # Batch size
    shuffle=True,         # Whether to shuffle the nodes for every epoch
    drop_last=False,      # Whether to drop the last incomplete batch
    num_workers=0,        # Number of sampler processes
)
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import SAGEConv
class Model(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Model, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, aggregator_type='mean')
        self.conv2 = SAGEConv(h_feats, num_classes, aggregator_type='mean')
        self.h_feats = h_feats

    def forward(self, mfgs, x):
        h_dst = x[:mfgs[0].num_dst_nodes()]
        h = self.conv1(mfgs[0], (x, h_dst))
        h = F.relu(h)
        h_dst = h[:mfgs[1].num_dst_nodes()]
        h = self.conv2(mfgs[1], (h, h_dst))
        return h
model = Model(num_features, 128, num_classes).to(device)
opt = torch.optim.Adam(model.parameters())
def train():
    for epoch in range(4):
        model.train()
        # nvtx range covering one epoch
        torch.cuda.nvtx.range_push("epoch")
        for step, (input_nodes, output_nodes, mfgs) in enumerate(train_dataloader):
            inputs = mfgs[0].srcdata['feat']
            # labels of the nodes in this batch
            labels = mfgs[-1].dstdata['label']
            predictions = model(mfgs, inputs)
            loss = F.cross_entropy(predictions, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            # break
        # break
        torch.cuda.nvtx.range_pop()
train()
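Just to be explicit about how I read the nsys timeline: a possible refinement of the loop above (only a sketch, not what I currently run) is to push a per-step NVTX range as well, and to call torch.cuda.synchronize() before popping the epoch range, so the range boundaries line up with the asynchronous GPU work:

def train_annotated():
    for epoch in range(4):
        model.train()
        torch.cuda.nvtx.range_push(f"epoch_{epoch}")
        for step, (input_nodes, output_nodes, mfgs) in enumerate(train_dataloader):
            torch.cuda.nvtx.range_push(f"step_{step}")
            inputs = mfgs[0].srcdata['feat']
            labels = mfgs[-1].dstdata['label']
            predictions = model(mfgs, inputs)
            loss = F.cross_entropy(predictions, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            torch.cuda.nvtx.range_pop()   # end of step range
        torch.cuda.synchronize()          # wait for queued GPU work before closing the epoch range
        torch.cuda.nvtx.range_pop()       # end of epoch range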
And I think it is hard to tell whether nsys causes the issue. Maybe the issue has to do with DGL, or most likely I did something wrong myself… At least, I can now get an explainable nsys report. As I said above,
I will first enable only “collect CUDA trace” and “collect CPU IP/backtrace samples”, which may avoid a lot of strange problems, and then enable the other options one by one to see which one causes the problem.
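Since that means re-running the workload several times with different option sets, one thing I might also try (just a sketch, my assumption) is to restrict the capture to a single epoch with torch.cuda.profiler, so each re-run produces a small, comparable report; as far as I know this requires launching nsys with its capture range set to the CUDA profiler API (--capture-range=cudaProfilerApi).

def train_capture_one_epoch():
    for epoch in range(4):
        model.train()
        if epoch == 1:
            torch.cuda.profiler.start()   # nsys begins capturing here (with capture range = cudaProfilerApi)
        torch.cuda.nvtx.range_push("epoch")
        for input_nodes, output_nodes, mfgs in train_dataloader:
            inputs = mfgs[0].srcdata['feat']
            labels = mfgs[-1].dstdata['label']
            loss = F.cross_entropy(model(mfgs, inputs), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        torch.cuda.nvtx.range_pop()
        if epoch == 1:
            torch.cuda.profiler.stop()    # nsys stops capturing here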
So if the issue really can be reproduced, I’m glad to have made a small contribution.
Thanks in advance!