I recently bought an RTX 3090 Ti for my new desktop and installed the NVIDIA driver and CUDA via the runfile installer. It has started producing tons of problems related to memcpy functions, kernel synchronization, and basic math operations, for no apparent reason.
The funny thing is that some programs run clean while others fail with errors; most of the failures seem to be related to memory handling and math ops like ADD.
The most frequent case is a memory copy from host to device or vice versa. Here is an example:
CUDA error at bodysystemcuda_impl.h:408 code=700(cudaErrorIllegalAddress) "cudaMemcpy(m_deviceData[0].dVel, data, m_numBodies * 4 * sizeof(T), cudaMemcpyHostToDevice)"
This is the error message from the nbody simulation. Since the memcpy error happens first, the simulation GUI window dies right after it is instantiated. This error happens with numbodies > 768.
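For context, the samples wrap every CUDA runtime call in a check macro (checkCudaErrors from helper_cuda.h), which is what produces that message. A minimal sketch of the pattern, with CHECK_CUDA as an illustrative stand-in:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative stand-in for the samples' checkCudaErrors macro:
// prints file, line, error code/name, and the failing call, then exits.
#define CHECK_CUDA(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\"\n",     \
              __FILE__, __LINE__, (int)err_, cudaGetErrorName(err_),  \
              #call);                                                 \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

int main() {
  float h[4] = {0.f, 1.f, 2.f, 3.f};
  float *d = nullptr;
  CHECK_CUDA(cudaMalloc(&d, sizeof(h)));
  CHECK_CUDA(cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice));
  CHECK_CUDA(cudaFree(d));
  return 0;
}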
The other one is 'vectorAdd'. Here is a failing run:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
#1690: 0.160248 + 0.689566 = 0.849815 != 0.160248
Result verification failed at element 1690!
And here is a passing run of the same binary:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
I added a line to the source file to print the operands whenever the result does not match. Two things stand out: first, it fails intermittently, and only about 10% of launches pass; second, the ADD operation appears not to be performed at all, always at the same array index, #1690.
Here is my system config.
MB: ASUS ROG CROSSHAIR VIII EXTREME
VGA: Nvidia RTX 3090 Ti MSI Suprim
CPU: Ryzen 5950x
OS: Ubuntu 20.04
Nvidia Driver & CUDA: 510.47.03, 11.6.2
I am attaching the nbody log file from compute-sanitizer and nvidia-bug-report.log: nbody.log (4.0 KB), nvidia-bug-report.log (1.1 MB)
Please help me out of this misery.
Thank you for reading :)
========= Program hit cudaErrorLaunchTimeout (error 702) due to "the launch timed out and was terminated" on CUDA API call to cudaMemcpy.
Your GPU is set to service a display on Linux. When a GPU services a display, kernel duration is limited. When you increase the body count, the kernel takes longer to run. This is a very common issue (kernel timeout), and you will find many reports of it on the internet.
This may be of interest, although the Linux methods for configuring X (or the display engine) have varied over time and by distro. I probably won't be able to give you a recipe for your specific configuration. I generally avoid doing work on a display GPU when I am concerned about kernel duration.
Apart from that advisory, you may wish to make sure you are compiling without the -G switch, as this debug setting generally makes kernel code run slower. But it does not directly affect the kernel duration limit; you will eventually hit the limit if you make the body count large enough.
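If you want to confirm that the watchdog applies to your GPU, you can query the device properties; a minimal sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // assumes device 0
  // kernelExecTimeoutEnabled is nonzero when the OS watchdog
  // limits how long a kernel may run on this device
  printf("%s: kernel execution timeout %s\n", prop.name,
         prop.kernelExecTimeoutEnabled ? "enabled (display watchdog active)"
                                       : "disabled");
  return 0;
}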
Regarding the vectorAdd report, I suggest you provide a complete example of the changed code you ran.
Is the illegal address then related to the kernel timeout? Illegal-address errors on the GPU occur not only with the CUDA samples but with many CUDA-based applications like PyTorch and Caffe networks, which were built without any debug options. Although you mentioned this is a fairly common issue, I have never experienced it until now on any other machine, with or without X server interaction. Well, I am going to dig into the issue in this direction further.
I added a std::cout line at line 172 in vectorAdd.cu (inside the verification for-loop), which involves no GPU operation at all. Here is the snippet.
// Verify that the result vector is correct
// (requires #include <iostream> at the top of vectorAdd.cu)
for (int i = 0; i < numElements; ++i) {
  if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
    // Print the operands and both sums when an element fails verification
    std::cout << "#" << i << ": " << h_A[i] << " + " << h_B[i] << " = "
              << h_A[i] + h_B[i] << " != " << h_C[i] << std::endl;
    // fprintf(stderr, "Result verification failed at element %d!\n", i);
    // exit(EXIT_FAILURE);
  }
}
Yes, it can be. The kernel timeout/watchdog is a catastrophic fault, like pulling the power plug. The running code may hit errors as a result while it is shutting down/dying.
Any app is going to have a kernel duration limit/timeout in your setting. It does seem like the timeout may be a bit short on your machine; it used to be typically ~2 s of kernel duration. But I don't have your setup or a 3090 to play with. The best suggestion I have is to run on a GPU that is not driving a display and not configured by X.
Regarding vectorAdd, try running it under compute-sanitizer. Unfortunately, compute-sanitizer also typically makes GPU device code run more slowly.
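Also note that a watchdog kill is reported asynchronously: the kernel launch itself returns immediately, and the error surfaces on the next synchronizing call, which is why your cudaMemcpy is the line that reports it. A minimal sketch of separating the two kinds of error (the kernel here is a harmless stand-in):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *x) { x[threadIdx.x] *= 2.0f; }

int main() {
  float *d = nullptr;
  cudaMalloc(&d, 256 * sizeof(float));
  work<<<1, 256>>>(d);
  // cudaGetLastError() catches launch-configuration problems immediately
  cudaError_t launchErr = cudaGetLastError();
  // cudaDeviceSynchronize() surfaces asynchronous faults (watchdog
  // timeout, illegal address) that only appear while the kernel runs
  cudaError_t asyncErr = cudaDeviceSynchronize();
  printf("launch: %s, async: %s\n", cudaGetErrorName(launchErr),
         cudaGetErrorName(asyncErr));
  cudaFree(d);
  return 0;
}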
Nope! This is not about the OS, CUDA, cuDNN, the watchdog, or any other software; it is indeed a hardware problem. My brand-new 3090 Ti has a defect in the VRAM itself. I simply ran OCCT and got tons of error messages during the GPU tests, and from other third-party GPU test programs too. Is there any diagnostic toolkit available from NVIDIA officially for HW or VRAM testing? Let me know if there is one.
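For anyone curious, the basic idea behind those GPU VRAM testers is pattern write/read-back verification. A toy sketch of the technique (illustrative only, not a substitute for a real diagnostic like OCCT):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Toy VRAM check: fill a device buffer with a bit pattern, copy it
// back, and count mismatches. Real testers cycle many patterns,
// sizes, and access orders; this only shows the core idea.
__global__ void fill(unsigned int *buf, size_t n, unsigned int pattern) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) buf[i] = pattern;
}

int main() {
  const size_t n = 64u << 20;  // 64 Mi words = 256 MB
  unsigned int *d_buf = nullptr;
  if (cudaMalloc(&d_buf, n * sizeof(unsigned int)) != cudaSuccess) return 1;
  unsigned int *h_buf = (unsigned int *)malloc(n * sizeof(unsigned int));
  const unsigned int pattern = 0xA5A5A5A5u;
  fill<<<(unsigned)((n + 255) / 256), 256>>>(d_buf, n, pattern);
  cudaMemcpy(h_buf, d_buf, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
  size_t bad = 0;
  for (size_t i = 0; i < n; ++i)
    if (h_buf[i] != pattern) ++bad;
  printf("%zu mismatched words out of %zu\n", bad, n);
  cudaFree(d_buf);
  free(h_buf);
  return 0;
}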
GPU replaced. Case resolved. Thanks for sharing your time <3
ps1. My network models are running fine now, no matter whether the watchdog is on or off.
ps2. The CUDA samples also run completely clean now, without any stupid errors like the vectorAdd case.