10x slowdown on simple CUDA kernels after upgrading to an RTX 2060

A trivial CUDA kernel (doing nothing: just out[i] = in[i]), operating on a moderate amount of data (512 MB),
incurs a massive (about 10x) performance penalty when:

  1. Running on an NVIDIA RTX 2060 card.
  2. Using the 432.00 driver, a.k.a. 26.21.14.3200 (pushed via Windows Update on January 4th).

Running on a 1060 card makes the problem go away;
manually downloading and installing the 441 driver from NVIDIA also solves it.

The funny thing is that using grid-stride loops with a very carefully chosen grid size (a.k.a. blockCount) of about 4096 reduces the penalty to “only” 1.2x. However, such grid strides cause slowdowns on “normal” hardware/driver configurations (for which the grid size needs to be equal to or greater than 16384).

The code:

#include <cstdint>

constexpr int BlockSize = 256;

// Grid-stride copy kernel: with enough blocks each thread copies one element,
// otherwise each thread strides over several.
__global__ void nothing(const float* __restrict__ in, float* __restrict__ out, int64_t N)
{
	// 64-bit index and stride avoid overflow for large N
	// (512 MB of floats is 134,217,728 elements).
	int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
	int64_t stride = (int64_t)blockDim.x * gridDim.x;
	for (; i < N; i += stride)
	{
		out[i] = in[i];
	}
}

cudaError_t test(const float* in, float* out, int64_t N)
{
	// Monolithic launch: one thread per element.
	auto blocksCount = (N + BlockSize - 1) / BlockSize;
	nothing<<<blocksCount, BlockSize>>>(in, out, N);
	return cudaGetLastError();
}

Tested on Windows 10 RS5 (10.0.17763), with CUDA 10.1 and MSVC 2017.

Is the 432.00 driver broken for RTX, or am I doing something wrong here?
I don’t like the idea of blacklisting the 432.00 driver in my software and forcing our clients to manually download and install a newer driver…

Take a look at this blog post on grid-stride loops: https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/.

When using grid-stride loops, the number of blocks should be a multiple of the number of streaming multiprocessors. Your current implementation limits portability. You’re looking for a sweet spot that saturates the CUDA cores without launching too few blocks or falling back to a monolithic kernel; see the sketch below.
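For example, a minimal sketch of sizing the grid that way, reusing your nothing kernel (the helper name test_sm_multiple and the 32-blocks-per-SM factor are just illustrative assumptions to tune per device, not recommendations from the blog post):

#include <cuda_runtime.h>
#include <cstdint>

cudaError_t test_sm_multiple(const float* in, float* out, int64_t N)
{
	int device = 0, smCount = 0;
	cudaGetDevice(&device);
	cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);

	// Launch a grid whose block count is a multiple of the SM count;
	// e.g. 32 blocks/SM gives 960 blocks on a 2060 (30 SMs).
	int blocksCount = smCount * 32;
	nothing<<<blocksCount, 256>>>(in, out, N);
	return cudaGetLastError();
}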

Next, profile your code with Nsight Compute: https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/. You’ll be able to compare performance between the 1060 and the 2060 if they’re in the same machine.

I’m not sure if there are any issues between RTX and R432.

First of all, I forgot to mention that the behavior described above was observed on three separate machines, so it is not a peculiarity of my setup.

As for grid-striding: I’ve done some tests, and it doesn’t solve my problem, it only mitigates it. Here are the benchmarks; I’ve also tested a Thrust-based solution (a sketch of what that looked like follows the table). I also tested some cases with the block count being a multiple of #SMs, but the difference was not measurable.

BlockCount   432.00 [ms]   441.66 [ms]
30                  1560          1520
120                 1260          1220
510                 1095          1026
512                 1090          1024
2048                 957           840
4096                 945           784
8192                1014           764
16384               1191           756
32768               1580           752
65536               2355           746
monolithic         11846           748
thrust              6351           750
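For reference, the Thrust-based variant was essentially a device-to-device copy, something like this (a simplified sketch; the actual benchmark wrapper is omitted):

#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <cstdint>

// Thrust-based equivalent of the nothing kernel: copy N floats on the device.
void nothing_thrust(const float* in, float* out, int64_t N)
{
	thrust::copy(thrust::device, in, in + N, out);
}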

So grid-striding is not a solution, but maybe such strange behavior points to the culprit.

The profiler is a different story altogether: trying to run the application under a profiler (Visual Profiler, Nsight Compute, or Nsight Systems) causes it to fail with “cudaErrorIllegalAddress”, “cudaErrorLaunchFailure”, or some other (seemingly random) error from a CUDA call.
I’ve tested CUDA 10.0 and 10.1, as 10.2 does not work with 432.

EDIT: Profiling with CUDA 10.0/10.1, R432, and a GTX 1060 works fine. Profiling with CUDA 10.2, R441, and an RTX 2060 works fine.

EDIT2: Out of curiosity, we also tested the 431.70 and 431.98 drivers, and it seems they have the same performance problem as 432.

EDIT3: On the other hand, the 436.xx drivers (we tested 436.02, 436.15, and 436.30) give proper performance, just like 441.66.

Are the columns execution time?

If you check your code with cuda-memcheck, do you receive the same errors as when you profile?

The columns are execution times, in milliseconds, for 200 kernel runs on a 512 MB buffer of floats.
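The harness is conceptually like this (a simplified sketch using cudaEvent timing; allocation, warm-up, and error checking omitted, and test is the launcher from the first post):

#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

void benchmark(const float* in, float* out, int64_t N)
{
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	for (int run = 0; run < 200; ++run)
		test(in, out, N);	// launcher from the first post
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);	// total time for all 200 runs
	printf("200 runs: %.1f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}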

cuda-memcheck (10.0.130) doesn’t report any errors.

I’ve experimented a bit more and managed to get the profilers to work with the 432.00 driver. I don’t know how; maybe there was something wrong with my setup, and some experiment caused it to “fix itself”.

Anyway, the Visual Profiler 10.1 manages to run the test app just fine, but it reports a warning:
“==21264== Warning: CDP tracing and profiling are not supported on devices with compute capability 7.0 and later.”
Also, there are no kernels visible on the GPU timeline, only memory copies. The CPU timeline shows calls like cudaDeviceSynchronize, cudaMemcpy, and cudaMalloc, but no kernel launches.

Nsight Compute 2019.5.0 shows some data. I’ve glanced over the metrics that are available, and I’m attaching a few. Should I upload the whole CSV somewhere?

A	Grid Size				
B	dram__bytes_read.sum.pct_of_peak_sustained_elapsed [%]				
C	dram__bytes_write.sum [byte]				
D	gpu__time_duration.sum [nsecond]				
E	inst_executed [inst]				
F	smsp__pcsamp_warps_issue_stalled_barrier [warp]
G	launch__grid_size	
					
A			B			C				D			E			F			G
7680,    1,    1	19,53  (-46,68%)	537 105 472   (-6,46%)		8 617 984  (+94,02%)	58 722 656   (-0,27%)	0 (-100,00%)		30
122880,    1,    1	29,19  (-20,28%)	545 572 256   (-4,98%)		5 704 160  (+28,42%)	58 758 656   (-0,21%)	5  (-66,67%)		480
261120,    1,    1	34,91   (-4,67%)	555 609 248   (-3,24%)		4 692 608   (+5,64%)	58 801 856   (-0,14%)	8  (-46,67%)		1 020
522240,    1,    1	36,62   (+0,00%)	574 186 656   (+0,00%)		4 441 888   (+0,00%)	58 883 456   (+0,00%)	15   (+0,00%)		2 040
1044480,    1,    1	37,63   (+2,76%)	611 965 376   (+6,58%)		4 452 000   (+0,23%)	59 046 656   (+0,28%)	37 (+146,67%)		4 080
2088960,    1,    1	35,64   (-2,70%)	687 132 832  (+19,67%)		4 667 456   (+5,08%)	59 373 056   (+0,83%)	76 (+406,67%)		8 160
4177920,    1,    1	30,55  (-16,57%)	837 530 880  (+45,86%)		5 576 544  (+25,54%)	60 025 856   (+1,94%)	126 (+740,00%)		16 320
8355840,    1,    1	23,14  (-36,82%)	1 137 820 064  (+98,16%)	7 111 232  (+60,09%)	61 331 456   (+4,16%)	435 (+2 800,00%)	32 640
16711680,    1,    1	16,04  (-56,20%)	1 713 416 160 (+198,41%)	10 596 192 (+138,55%)	63 942 656   (+8,59%)	2 062 (+13 646,67%)	65 280
134217728,    1,    1	3,72  (-89,84%)		8 083 191 040 (+1 307,76%)	49 686 496 (+1 018,59%)	100 663 296  (+70,95%)	34 662 (+230 980,00%)	524 288

Here are the same metrics for the 436.02 driver, which gives proper performance:

A			B	C		D	E		F	G
7680,    1,    1	19,82	536860736	8473696	50334048	0	30
122880,    1,    1	29,64	536894848	5619744	50370048	0	480
261120,    1,    1	36,95	536939136	4521600	50413248	0	1020
522240,    1,    1	39,45	536861184	4159424	50494848	0	2040
1044480,    1,    1	42,2	536818752	3891456	50658048	0	4080
2088960,    1,    1	43,26	536889728	3837888	50984448	0	8160
4177920,    1,    1	44,47	536977152	3712416	51637248	0	16320
8355840,    1,    1	45,44	536839552	3661920	52942848	0	32640
16711680,    1,    1	45,99	536782656	3633760	55554048	0	65280
134217728,    1,    1	45,88	536688512	3631104	92274688	0	524288

The metrics dram__bytes_write.sum and smsp__pcsamp_warps_issue_stalled_barrier look interesting… The kernel writes the 512 MB (536,870,912 bytes) exactly once, so dram__bytes_write.sum should stay flat at roughly that value, as it does on 436.02; on 432.00 it balloons to ~8 GB for the monolithic launch, and the barrier-stall samples explode with it.

The warning you see from NVVP is because, on devices with compute capability 7.0 and later, some metrics are only visible through Nsight Compute.

It’s quite possible there is a bug in R432. I would suggest filing a bug report via the link at the top of the forum page. That being said, if there is a bug and it is fixed in a later driver, you will likely have to use the later driver to get around the issue.

I’ll file a bug, hoping there is a workaround in the meantime. If we do end up blacklisting the broken drivers, something like the sketch below could detect them at runtime.
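A minimal sketch of reading the installed driver version via NVML (assuming the NVML header and library that ship with the driver/CUDA toolkit; the prefix check is only an illustration based on the versions tested above):

#include <nvml.h>
#include <cstring>

// Returns true if the installed driver is one of the branches that showed
// the slowdown in our tests (431.70, 431.98, 432.00); 436.xx and 441.xx were fine.
bool driverIsKnownBad()
{
	char version[NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE] = {0};
	if (nvmlInit() != NVML_SUCCESS)
		return false;	// can't tell; assume the driver is OK
	nvmlSystemGetDriverVersion(version, sizeof(version));
	nvmlShutdown();

	return strncmp(version, "431.", 4) == 0 || strncmp(version, "432.", 4) == 0;
}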

On the other hand, assuming there is a bug, is there any chance of expediting a newer driver to Windows Update?