First kernel run is slower than succeeding runs

I have a kernel that I launch 10 times in a loop.
The first launch is always much slower than the succeeding ones (reproducible).
Sometimes (rarely), the second launch is also somewhat slower (around 22000).

speed:16175.390
speed:30744.576
speed:30733.140
speed:30710.285
speed:30733.153
speed:30688.517
speed:30658.813
speed:30675.219
speed:30684.009
speed:30666.256

Without going into code details yet: is there a common mistake I might have made here?

Ref 1:

I have an RTX 2080 SUPER. Changing the default -gencode=arch=compute_52,code=\"sm_52,compute_52\" to -gencode=arch=compute_75,code=\"sm_75,compute_75\" made it worse: performance dropped by about 10%.
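For reference, the two build variants compared here would look roughly like this on an nvcc command line (source and output file names are placeholders; only the -gencode flags are taken from the actual setup):

# Default target used so far (sm_52):
nvcc -gencode=arch=compute_52,code=\"sm_52,compute_52\" -o kernel_test kernel.cu

# Native target for the RTX 2080 SUPER (Turing, sm_75):
nvcc -gencode=arch=compute_75,code=\"sm_75,compute_75\" -o kernel_test kernel.cu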

Ref 2:

Also, this doesn’t explain why the entire first run is slow. Sometimes you only launch a kernel once (not in a loop like I do). Would that mean the entire first (and only) kernel run is slow?!

There might be caching effects. That (caching effects) doesn’t indicate a mistake.

If I understood you correctly, this would mean that if the calling code (main()) runs the kernel only once (not in a loop, like mine), that entire kernel call would run slowly due to caching effects?

If that’s the case, is this documented?
Can I add/change code to avoid this phenomenon?

The existence of caches in the GPU is documented (for example, in the programming guide). Perhaps something you might want to do is learn about caches and become familiar with the concept. Neither the concept of a cache nor its general behavior is unique or specific to GPUs.

If data is in a cache, the general idea is that a processor will be able to retrieve it more efficiently than if the data is (only) resident in processor memory. It stands to reason that if retrieval of that data is relevant to performance, and the data is already in the cache, the processor may be more efficient at processing the code. More efficient might translate to better performance.

I’m not sure I can give you chapter and verse of the CUDA documentation that recites that sort of thing, but I wouldn’t be surprised if you can find something similar on Wikipedia.

I have no suggestions regarding your last question. Maybe someone else will. The way to get past the first “cost” of loading data into the cache is to load data into the cache. That is what your code might be doing on the first kernel call.

I don’t really know if this is the reason for the performance difference, it’s just speculation, in addition to the other ideas you advanced.
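To make the warm-up idea concrete, here is a minimal sketch of the usual pattern: one extra, untimed launch pays the one-time cost before the measured loop starts. This is a hypothetical frame, not the code from this thread; myKernel, its arguments, the output buffer, and the ops-per-launch constant are placeholders, and timing each launch with CUDA events is an assumption about how the speed numbers could be produced.

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel under test.
__global__ void myKernel(uint32_t *out, uint64_t x) { /* ... */ }

int main()
{
    const double OPS_PER_LAUNCH = 16384.0 * 64.0;   // assumed work per launch

    uint32_t *d_out;
    cudaMalloc(&d_out, 16384 * 64 * sizeof(uint32_t));

    // Untimed warm-up launch: pays the one-time cost (cache population and
    // other start-up effects) before any measurement begins.
    myKernel<<<16384, 64>>>(d_out, 0x0123456789abcdefULL);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // The 10 measured launches, each timed individually with CUDA events.
    for (int run = 0; run < 10; run++)
    {
        cudaEventRecord(start);
        myKernel<<<16384, 64>>>(d_out, 0x0123456789abcdefULL);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("speed:%.3f\n", OPS_PER_LAUNCH / (ms / 1000.0));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}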

Caching effects are (of course) known to me, but I didn’t expect them to compensate that much for non-optimized (CUDA) code. OK … perhaps I should know this, but I certainly have gaps in my knowledge of the big picture of CUDA and GPU hardware. I do quite a bit of searching and experimenting before posting here, but I still miss things (sorry)! That said, I dare to ask some more questions about the code snippet in question, hoping that some hints might clarify why things happen the way they do.

The following experiments brought up more questions.

I still run 10 kernel calls (grid size: 16384, block size 64).

The following a[] array initialization, according to a bit check of b, is only a small part of the entire kernel code. Hence my surprise at the unexpected overall performance impact.


Code 1:

That’s the code which generated the ops/s numbers shown in the first post. The results differ slightly because the hardware changed somewhat between the two tests.

uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
	if ((b & (uint64_t) 0x1 << i) != 0)	// cast binds before <<, so this is a full 64-bit shift
		a[i] = 0xff00ff00;

1st kernel call : 13500 ops/s
Remaining 9 kernel calls : 27330 ops/s


Code 2: If I omit the uint64_t cast on 0x1 (which is programmatically incorrect and gives wrong results, since only 32 bits are used), all 10 kernel calls perform identically and slightly faster.

uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
	if ((b & 0x1 << i) != 0)
		a[i] = 0xff00ff00;

All 10 kernel calls : 29000.000 ops/s

So it seems the uint64_t cast in Code 1 makes the entire first kernel call run about 50% slower. Robert Crovella mentions caching, but I would assume that the initial loading from global memory happens when the first thread blocks run, and that subsequent thread blocks use cached data. The results of Code 1, however, suggest that caching only takes effect after the entire first kernel call has completed.


Code 3: If I don’t shift bits at all, the code gets faster again. Is a 64-bit shift that expensive, or did I do something wrong?

uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
	if ((b & 0x1) != 0)
		a[i] = 0xff00ff00;

All 10 kernel calls : 30900.000 ops/s

AFAIK, 64-bit integer operations are emulated. I read about the funnel shift intrinsics (__funnelshift_l() and friends). Are they the correct/better way to shift bits in 64-bit variables in CUDA?
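As far as I know, the funnel shift intrinsics (__funnelshift_l(), __funnelshift_r() and their clamped _lc/_rc variants) combine two 32-bit halves and return a 32-bit result with the shift amount limited to 32, so they target rotates and multi-word shifts rather than a general 0–63-bit variable shift. One way to sidestep the variable 64-bit shift entirely is to select the 32-bit half that contains the bit and do a 32-bit shift. The sketch below only illustrates the idea (test_bit64 is a made-up helper name), and the compiler's own 64-bit shift emulation may already generate something very similar:

#include <cstdint>

// Sketch: test bit i (0..63) of a 64-bit value using only a 32-bit
// variable shift; test_bit64 is a hypothetical helper, not a CUDA API.
__device__ __forceinline__ bool test_bit64(uint64_t b, int i)
{
    uint32_t lo = (uint32_t)b;           // bits  0..31
    uint32_t hi = (uint32_t)(b >> 32);   // bits 32..63 (shift by a constant)
    uint32_t w  = (i < 32) ? lo : hi;    // pick the half that holds bit i
    return (w >> (i & 31)) & 1u;         // the variable shift is 32-bit only
}

// Usage in the loop from Code 1:
//     if (test_bit64(b, i))
//         a[i] = 0xff00ff00;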


Additionally … could I optimize the initialization of a[] (global memory)?
Nsight Compute doesn’t show any bank conflicts.
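Regarding the initialization itself, one possible rewrite (just a sketch, assuming the surrounding kernel allows it) avoids the variable-distance 64-bit shift altogether by shifting b down one bit per iteration and writing every element of a[] exactly once, instead of zero-initializing first and then overwriting the set bits:

// Sketch: fill a[] from the bits of b without a variable-distance 64-bit
// shift; the ternary typically compiles to a branchless select.
uint32_t a[64];
uint64_t t = b;
#pragma unroll
for (int i = 0; i < 64; i++)
{
	a[i] = (t & 1u) ? 0xff00ff00u : 0u;
	t >>= 1;	// shift by the constant 1 each iteration
}

With the loop fully unrolled, the indices into a[] become compile-time constants, which gives the compiler a chance to keep the array in registers rather than local memory; whether that pays off here depends on register pressure in the real kernel, so it is worth checking in Nsight Compute.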