Different results on different GPUs

I have two GPUs in my system, Device 0: GTX 1050 (4 GB) and Device 1: RTX 2070 (8 GB).
My CUDA/C++ code (which uses the Thrust library) runs fine on Device 0 but fails on Device 1 with the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorLaunchOutOfResources: too many resources requested for launch
Aborted (core dumped)

I specify which device to use in the following manner:
CUDA_VISIBLE_DEVICES=0 ./my_executable
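
and, correspondingly, to run on the RTX 2070 (Device 1):
CUDA_VISIBLE_DEVICES=1 ./my_executable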

How is it possible for the GPU with larger resources to run out of them?
Am I missing something?

How is it possible for the GPU with larger resources to run out of them?

One possibility: The two GPUs are of different architectures, so the machine code generated from the same source code may require a different number of registers per thread on each. That can cause the launch to exceed the total number of registers available per thread block on one device but not the other.
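
As a quick first check, the register count the compiler actually chose for the currently visible device can be queried at run time. A minimal sketch, assuming the failing kernel is visible in the same translation unit under the hypothetical name my_kernel and launched with 256 threads per block:

#include <cstdio>
#include <cuda_runtime.h>

// hypothetical stand-in for the kernel that fails to launch
__global__ void my_kernel(float *out)
{
    out[threadIdx.x] = 0.0f;
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);   // registers/thread ptxas chose for the visible device
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);         // limits of the device selected via CUDA_VISIBLE_DEVICES
    const int threads_per_block = 256;         // placeholder: use the block size from the real launch
    printf("regs/thread: %d, regs needed/block: %d, regs available/block: %d\n",
           attr.numRegs, attr.numRegs * threads_per_block, prop.regsPerBlock);
    return 0;
}

Note that the hardware allocates registers with a certain granularity, so the occupancy calculator gives the authoritative answer; this is only a first-order check.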

You would want to check the number of registers needed by the relevant kernel by adding -Xptxas -v to the nvcc command line, note the thread block configuration used at launch, and then plug that data into the CUDA occupancy calculator:

https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
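
For example, assuming the source file is named my_code.cu, a build targeting both of your GPUs (sm_61 for the GTX 1050, sm_75 for the RTX 2070) with resource reporting enabled would look roughly like this:

nvcc -Xptxas -v -gencode arch=compute_61,code=sm_61 -gencode arch=compute_75,code=sm_75 -o my_executable my_code.cu

ptxas then reports register and shared memory usage separately for each architecture, so you can compare the two.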

With close to no information to go by, I won’t even venture a guess whether the root cause is an issue with your code (generally speaking, that is the more likely scenario) or an issue somewhere in NVIDIA’s code.

I’ve been running a couple of kernels designed for a multi-GPU Volta system on the single RTX 2070 in my laptop. I was pulling my hair out when the second launch on the 2070 started failing to produce any output. Even though I could see the memory being written in the debugger, by the time it was copied to the CPU it was all zeros. The first launch worked correctly.

Since the target is a multi-GPU Volta machine, I ran the exact same code on that and it worked just fine. It seems like cudaStreamSynchronize was failing on the 2070, so the copy occurred before the kernel had written the GPU memory?
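
For what it’s worth, a failure like that should be catchable explicitly by checking the status returned by the launch and by cudaStreamSynchronize before issuing the copy. A minimal sketch with placeholder names (my_kernel, d_out), assuming the kernel and the copy are meant to be ordered by a single stream:

#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real one
__global__ void my_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_out = nullptr, *h_out = new float[n];
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_out, n * sizeof(float));

    my_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    cudaError_t err = cudaGetLastError();          // catches launch configuration errors
    if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));

    err = cudaStreamSynchronize(stream);           // catches asynchronous execution errors
    if (err != cudaSuccess) printf("sync: %s\n", cudaGetErrorString(err));

    // only copy once the synchronization above has reported success
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaStreamDestroy(stream);
    delete[] h_out;
    return 0;
}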

@robosmith It seems like your post was intended for another forum thread?

No, I was just commenting on another instance in which the RTX 2070 is flaky, because my code worked on Volta unchanged and, as far as I can tell, should have worked on the 2070.

It seems newer generations always take time to get the bugs worked out.

I see no indication in this thread that the OP’s RTX 2070 is flaky, nor am I aware of reports of general flakiness with the RTX 2070 or any other GPU in the RTX line.

The hardware and CUDA software stack for the Turing architecture have been around for two years and are sufficiently mature at this point. Generally speaking, it is safe to assume that any potential “teething issues” have been resolved by now, and that any issues encountered most likely (not: with certainty) fall into the user-error category.