Question on Completing the Assessment for “Fundamentals of Deep Learning”
Is there any way to reset the training workspace and Jupyter files?
The error is below:
Epoch: 0
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
RuntimeError                              Traceback (most recent call last)
Cell In[11], line 5
      3 for epoch in range(epochs):
      4     print('Epoch: {}'.format(epoch))
----> 5     utils.train(my_model, train_loader, train_N, random_trans, optimizer, loss_function)
      6     utils.validate(my_model, valid_loader, valid_N, loss_function)

File /dli/task/utils.py:34, in train(model, train_loader, train_N, random_trans, optimizer, loss_function)
     32 optimizer.zero_grad()
     33 batch_loss = loss_function(output, y)
---> 34 batch_loss.backward()
     35 optimizer.step()
     37 loss += batch_loss.item()

File /usr/local/lib/python3.10/dist-packages/torch/_tensor.py:525, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    515 if has_torch_function_unary(self):
    516     return handle_torch_function(
    517         Tensor.backward,
    518         (self,),
   (...)
    523         inputs=inputs,
    524     )
--> 525 torch.autograd.backward(
    526     self, gradient, retain_graph, create_graph, inputs=inputs
    527 )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:267, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    262     retain_graph = create_graph
    264 # The reason we repeat the same comment below is that
    265 # some Python versions print out the first line of a multi-line function
    266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
    268     tensors,
    269     grad_tensors,
    270     retain_graph,
    271     create_graph,
    272     inputs,
    273     allow_unreachable=True,
    274     accumulate_grad=True,
    275 )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744, in _engine_run_backward(t_outputs, *args, **kwargs)
    742     unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743 try:
--> 744     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745         t_outputs, *args, **kwargs
    746     )  # Calls into the C++ engine to run the backward pass
    747 finally:
    748     if attach_logging_hooks:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
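
The assertion `t >= 0 && t < n_classes` in the log above indicates that at least one target label lies outside the range the loss expects for the model's output size. A minimal sketch of running that same range check on the CPU, assuming the labels arrive as 1-D integer batches; `find_out_of_range_labels`, the sample batches, and `n_classes=6` are hypothetical names and values for illustration, not taken from the course notebook or `utils.py`:

```python
def find_out_of_range_labels(label_batches, n_classes):
    """Return the sorted unique labels that fall outside [0, n_classes)."""
    bad = set()
    for batch in label_batches:
        # Tensors and NumPy arrays expose .tolist(); plain lists pass through.
        values = batch.tolist() if hasattr(batch, "tolist") else list(batch)
        bad.update(v for v in values if not 0 <= v < n_classes)
    return sorted(bad)


# Hypothetical data: a model with 6 outputs, but labels numbered 1..6
# instead of 0..5, so the label 6 is out of range.
batches = [[1, 4, 5], [6, 2, 3]]
print(find_out_of_range_labels(batches, n_classes=6))  # prints [6]
```

If a check like this reports any labels, the likely causes are a final layer with too few outputs or labels that are not zero-based. Setting `CUDA_LAUNCH_BLOCKING=1` in the environment before CUDA is initialized, as the message suggests, should also make the traceback point at the call that actually failed.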