I recently bought an RTX 3090 Ti for my new desktop and installed the NVIDIA driver and CUDA via the runfile installer. It has started producing tons of problems related to memcpy functions, kernel synchronization, and basic math operations, for no apparent reason.
The funny thing is that some programs run clean while others fail with errors; most of the failures seem to be related to memory handling and math ops like ADD.
The most frequent case is a memory copy from host to device or vice versa. Here is an example:
CUDA error at bodysystemcuda_impl.h:408 code=700(cudaErrorIllegalAddress) "cudaMemcpy(m_deviceData[0].dVel, data, m_numBodies * 4 * sizeof(T), cudaMemcpyHostToDevice)"
This is the error message from the nbody simulation. Since the memcpy error happens first, the simulation GUI window dies right after it is instantiated. This error happens with numbodies > 768.
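For context, the samples wrap every CUDA runtime call in a check macro (checkCudaErrors from helper_cuda.h), which is what produces that message. A minimal sketch of the pattern, with CHECK_CUDA as an illustrative stand-in:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative stand-in for the samples' checkCudaErrors macro:
// prints file, line, error code/name, and the failing call, then exits.
#define CHECK_CUDA(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\"\n",     \
              __FILE__, __LINE__, (int)err_, cudaGetErrorName(err_),  \
              #call);                                                 \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

int main() {
  float h[4] = {0.f, 1.f, 2.f, 3.f};
  float *d = nullptr;
  CHECK_CUDA(cudaMalloc(&d, sizeof(h)));
  CHECK_CUDA(cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice));
  CHECK_CUDA(cudaFree(d));
  return 0;
}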
The other one is 'vectorAdd'. Here is a failing run:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
#1690: 0.160248 + 0.689566 = 0.849815 != 0.160248
Result verification failed at element 1690!
And here is a passing run of the same binary:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
I added a line to the source file to print the operands whenever the result does not match. Two things stand out: first, it fails intermittently, and only about 10% of launches pass; second, the ADD operation appears not to be performed at all, always at the same array index, #1690.
Here is my system config.
MB: ASUS ROG CROSSHAIR VIII EXTREME
VGA: Nvidia RTX 3090 Ti MSI Suprim
CPU: Ryzen 5950x
OS: Ubuntu 20.04
Nvidia Driver & CUDA: 510.47.03, 11.6.2
I am attaching the nbody log file from compute-sanitizer and nvidia-bug-report.log: nbody.log (4.0 KB), nvidia-bug-report.log (1.1 MB)
Please help me out of this misery.
Thank you for reading :)
========= Program hit cudaErrorLaunchTimeout (error 702) due to "the launch timed out and was terminated" on CUDA API call to cudaMemcpy.
Your GPU is set to service a display on Linux. When a GPU services a display, kernel duration is limited. When you increase the body count, the kernel takes longer to run. This is a very common issue (kernel timeout), and you will find many reports of it on the internet.
This may be of interest, although the Linux methods for configuring X (or the display engine) have varied over time and by distro. I probably won't be able to give you a recipe for your specific configuration. I generally avoid doing work on a display GPU when I am concerned about kernel duration.
Apart from that advisory, you may wish to make sure you are compiling without the -G switch, as this debug setting generally makes kernel code run slower. But it does not directly affect the kernel duration limit; you will eventually hit the limit if you make the body count large enough.
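If you want to confirm that the watchdog applies to your GPU, you can query the device properties; a minimal sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // assumes device 0
  // kernelExecTimeoutEnabled is nonzero when the OS watchdog
  // limits how long a kernel may run on this device
  printf("%s: kernel execution timeout %s\n", prop.name,
         prop.kernelExecTimeoutEnabled ? "enabled (display watchdog active)"
                                       : "disabled");
  return 0;
}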
Regarding the vectorAdd report, I suggest you provide a complete example of the changed code you ran.
Is the illegal address then related to the kernel timeout? Illegal-address errors on the GPU occur not only with the CUDA samples but with many CUDA-based applications like PyTorch and Caffe networks, which were built without any debug options. Although you mentioned this is a fairly common issue, I have never experienced it until now on any other machine, with or without X server interaction. Well, I am going to dig into the issue in this direction further.
I added a std::cout line at line 172 in vectorAdd.cu (inside the verification for-loop), which involves no GPU operation at all. Here is the snippet.
// Verify that the result vector is correct
// (requires #include <iostream> at the top of vectorAdd.cu)
for (int i = 0; i < numElements; ++i) {
  if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
    // Print the operands and both sums when an element fails verification
    std::cout << "#" << i << ": " << h_A[i] << " + " << h_B[i] << " = "
              << h_A[i] + h_B[i] << " != " << h_C[i] << std::endl;
    // fprintf(stderr, "Result verification failed at element %d!\n", i);
    // exit(EXIT_FAILURE);
  }
}
Yes, it can be. The kernel timeout/watchdog is a catastrophic fault, like pulling the power plug. The running code may hit errors as a result while it is shutting down/dying.
Any app is going to have a kernel duration limit/timeout in your setting. It does seem like the timeout may be a bit short on your machine; it used to be typically ~2 s of kernel duration. But I don't have your setup or a 3090 to play with. The best suggestion I have is to run on a GPU that is not driving a display and not configured by X.
Regarding vectorAdd, try running it under compute-sanitizer. Unfortunately, compute-sanitizer also typically makes GPU device code run more slowly.
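Also note that a watchdog kill is reported asynchronously: the kernel launch itself returns immediately, and the error surfaces on the next synchronizing call, which is why your cudaMemcpy is the line that reports it. A minimal sketch of separating the two kinds of error (the kernel here is a harmless stand-in):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *x) { x[threadIdx.x] *= 2.0f; }

int main() {
  float *d = nullptr;
  cudaMalloc(&d, 256 * sizeof(float));
  work<<<1, 256>>>(d);
  // cudaGetLastError() catches launch-configuration problems immediately
  cudaError_t launchErr = cudaGetLastError();
  // cudaDeviceSynchronize() surfaces asynchronous faults (watchdog
  // timeout, illegal address) that only appear while the kernel runs
  cudaError_t asyncErr = cudaDeviceSynchronize();
  printf("launch: %s, async: %s\n", cudaGetErrorName(launchErr),
         cudaGetErrorName(asyncErr));
  cudaFree(d);
  return 0;
}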
Nope! This is not about the OS, CUDA, cuDNN, the watchdog, or any other software; it is indeed a hardware problem. My brand-new 3090 Ti has a defect in the VRAM itself. I simply ran OCCT and got tons of error messages during the GPU tests, and from other third-party GPU test programs too. Is there any diagnostic toolkit available from NVIDIA officially for HW or VRAM testing? Let me know if there is one.
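For anyone curious, the basic idea behind those GPU VRAM testers is pattern write/read-back verification. A toy sketch of the technique (illustrative only, not a substitute for a real diagnostic like OCCT):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Toy VRAM check: fill a device buffer with a bit pattern, copy it
// back, and count mismatches. Real testers cycle many patterns,
// sizes, and access orders; this only shows the core idea.
__global__ void fill(unsigned int *buf, size_t n, unsigned int pattern) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) buf[i] = pattern;
}

int main() {
  const size_t n = 64u << 20;  // 64 Mi words = 256 MB
  unsigned int *d_buf = nullptr;
  if (cudaMalloc(&d_buf, n * sizeof(unsigned int)) != cudaSuccess) return 1;
  unsigned int *h_buf = (unsigned int *)malloc(n * sizeof(unsigned int));
  const unsigned int pattern = 0xA5A5A5A5u;
  fill<<<(unsigned)((n + 255) / 256), 256>>>(d_buf, n, pattern);
  cudaMemcpy(h_buf, d_buf, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
  size_t bad = 0;
  for (size_t i = 0; i < n; ++i)
    if (h_buf[i] != pattern) ++bad;
  printf("%zu mismatched words out of %zu\n", bad, n);
  cudaFree(d_buf);
  free(h_buf);
  return 0;
}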
GPU replaced. Case resolved. Thanks for sharing your time <3
ps1. My network models are running fine now, no matter whether the watchdog is on or off.
ps2. The CUDA samples also run completely clean now, without any stupid errors like the vectorAdd case.