Fundamentals of Deep Learning - Workspace Errors

scott183 · June 26, 2024, 6:56pm

Question on Completing the Assessment for “Fundamentals of Deep Learning”
Is there any way to reset the training workspace/Jupyter files?

The error is below:
Epoch: 0
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.

RuntimeError Traceback (most recent call last)
Cell In[11], line 5
3 for epoch in range(epochs):
4 print('Epoch: {}'.format(epoch))
----> 5 utils.train(my_model, train_loader, train_N, random_trans, optimizer, loss_function)
6 utils.validate(my_model, valid_loader, valid_N, loss_function)

File /dli/task/utils.py:34, in train(model, train_loader, train_N, random_trans, optimizer, loss_function)
32 optimizer.zero_grad()
33 batch_loss = loss_function(output, y)
---> 34 batch_loss.backward()
35 optimizer.step()
37 loss += batch_loss.item()

File /usr/local/lib/python3.10/dist-packages/torch/_tensor.py:525, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
515 if has_torch_function_unary(self):
516 return handle_torch_function(
517 Tensor.backward,
518 (self,),
(…)
523 inputs=inputs,
524 )
--> 525 torch.autograd.backward(
526 self, gradient, retain_graph, create_graph, inputs=inputs
527 )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:267, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
262 retain_graph = create_graph
264 # The reason we repeat the same comment below is that
265 # some Python versions print out the first line of a multi-line function
266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
268 tensors,
269 grad_tensors,
270 retain_graph,
271 create_graph,
272 inputs,
273 allow_unreachable=True,
274 accumulate_grad=True,
275 )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744, in _engine_run_backward(t_outputs, *args, **kwargs)
742 unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
743 try:
--> 744 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
745 t_outputs, *args, **kwargs
746 ) # Calls into the C++ engine to run the backward pass
747 finally:
748 if attach_logging_hooks:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
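
The assertion `t >= 0 && t < n_classes` in the log above is raised by the NLL loss kernel, and it usually means that at least one target label lies outside the range of class indices the model's output layer covers. A minimal sketch of how that could be checked follows. The helper name `find_out_of_range_labels` and the sample labels are hypothetical and are not part of the course's `utils.py`; the actual number of classes in this assessment is not known from the post.

```python
def find_out_of_range_labels(labels, n_classes):
    """Return (position, label) pairs whose label is outside [0, n_classes).

    `labels` is any iterable of integer class indices. This is a hypothetical
    diagnostic helper, not part of the course's utils.py.
    """
    return [
        (i, int(t))
        for i, t in enumerate(labels)
        if not 0 <= int(t) < n_classes
    ]


# Hypothetical data: a model with 3 output units, but labels that include 3 and -1.
labels = [0, 2, 3, 1, -1]
bad = find_out_of_range_labels(labels, n_classes=3)
print(bad)  # [(2, 3), (4, -1)]
```

With a PyTorch batch, the labels could be passed as `y.tolist()`, and `n_classes` would be the size of the model's final output dimension (for example `output.shape[1]` for a 2D output). If the list comes back non-empty, the likely fix is to match the last layer's output size to the number of classes in the dataset, or to remap the labels so they start at 0.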
