First kernel run is slower than succeeding runs

I have a kernel which I launch 10 times in a loop.
The first launch is always much slower than the succeeding ones (reproducible).
Sometimes (rarely) the second launch is also somewhat slower (around 22000).

speed:16175.390
speed:30744.576
speed:30733.140
speed:30710.285
speed:30733.153
speed:30688.517
speed:30658.813
speed:30675.219
speed:30684.009
speed:30666.256

Without going into code details yet, is there a common mistake I might have made here?

Ref 1:

I have an RTX 2080 SUPER. Changing the default -gencode=arch=compute_52,code=\"sm_52,compute_52\" to -gencode=arch=compute_75,code=\"sm_75,compute_75\" made it worse: performance dropped by roughly 10%.

Ref 2:

Also, this doesn’t explain why the entire first run is slow. Sometimes you only launch a kernel once (not in a loop like I do). So this would mean the entire first (and only) kernel run is slow?!

There might be caching effects. That (caching effects) doesn’t indicate a mistake.

If I understood you correctly, this would mean that if the calling code (main()) is designed to run the kernel only once (not in a loop, like mine), that entire kernel call would run slowly due to caching effects?

If that’s the case, is this documented?
Can I add/change code to avoid this phenomenon?

The existence of caches in the GPU is documented (for example, in the programming guide). Perhaps something you might want to do is learn about caches and become familiar with the concept; neither the concept nor the general behavior of caches is unique or specific to GPUs.

If data is in a cache, the general idea is that a processor will generally be able to retrieve it more efficiently than if the data is (only) resident in processor memory. It stands to reason that if retrieval of that data is relevant to performance, and the data is already in the cache, the processor may execute the code more efficiently, and more efficient may translate to better performance.

I’m not sure I can give you chapter and verse of CUDA documentation that recites that sort of thing, but I wouldn’t be surprised if you can find something similar on Wikipedia.

I have no suggestions regarding your last question; maybe someone else will. The way to get past the first “cost” of loading data into the cache is to load data into the cache. That is what your code might be doing on the first kernel call.

I don’t really know if this is the reason for the performance difference; it’s just speculation, in addition to the other ideas you advanced.
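If you want to separate that first cost from steady-state performance, a minimal sketch of the usual approach is an untimed warm-up launch before the timed loop (myKernel, grid and block are placeholders for whatever your code actually uses):

// Sketch: pay any one-time costs (cache warm-up, lazy initialization, JIT)
// outside the timed region. Placeholder names, not your actual code.
myKernel<<<grid, block>>>(/* args */);   // warm-up launch, not timed
cudaDeviceSynchronize();

for (int run = 0; run < 10; run++) {
	// start timer
	myKernel<<<grid, block>>>(/* args */);
	cudaDeviceSynchronize();             // finish before reading the timer
	// stop timer, record ops/s
}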


Caching effects are (of course) known to me, but I didn’t expect them to compensate that much for non-optimized (CUDA) code. OK … I should perhaps know this, but I certainly have gaps in my knowledge regarding the big picture of CUDA and GPU hardware. I do quite a bit of searching and experimenting prior to posting here, but I do miss things (sorry)! That said, I dare ask some more questions about the code snippet in question, hoping that some hints might clarify why things happen the way they do.

The following experiments brought up more questions.

I still run 10 kernel calls (grid size: 16384, block size 64).

The following initialization of the a[] array according to a bit check of b is only a small part of the entire kernel code, hence my surprise at the unexpected overall performance impact.


Code 1:

That’s the code which generated the ops/s shown in the first post. The results differ slightly because the hardware changed somewhat between the two tests.

uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
	if ((b & (uint64_t) 0x1 << i) != 0)
		a[i] = 0xff00ff00;

1st kernel call : 13500 ops/s
Remaining 9 kernel calls : 27330 ops/s


Code 2: If I omit the uint64_t cast on 0x1 (which is programmatically incorrect and gives wrong results, since only 32 bits will be used), all 10 kernel calls are identical and slightly faster.

uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
	if ((b & 0x1 << i) != 0)
		a[i] = 0xff00ff00;

All 10 kernel calls : 29000.000 ops/s

So it seems that the uint64_t cast in Code 1 makes the entire first kernel call run about 50% slower. Robert Crovella addresses the caching, but I would have assumed that the initial loading from global memory happens when the first thread blocks run, and that subsequent thread blocks use cached data. The results of Code 1, however, suggest that caching only takes effect after the entire first kernel call has completed.


Code 3: If I don’t shift bits at all, the code gets faster again. Is a 64-bit shift that expensive, or did I do something wrong?

uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
	if ((b & 0x1) != 0)
		a[i] = 0xff00ff00;

All 10 kernel calls : 30900.000 ops/s

AFAIK, 64-bit integer operations are emulated. I read about funnelshift(). Is this the correct/better way to shift bits for 64-bit variables in CUDA?


Additionally … could I optimize the initialization of a[] (global memory)?
Nsight Compute doesn’t show any bank conflicts.
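For example, would a variant like the following be worth trying? (Just a sketch, not benchmarked: it shifts b itself once per iteration and tests the low bit, instead of building a 64-bit mask with a variable shift every time.)

// Sketch (untested): test the low bit and shift b right by one each iteration.
uint32_t a[64] = { 0 };
uint64_t bits = b;
for (int i = 0; i < 64; i++)
{
	a[i] = (bits & 1ull) ? 0xff00ff00u : 0u;
	bits >>= 1;
}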

On sm > 6.2, no more so than any other 64-bit operation: any 64-bit operation is going to run at half the throughput of its 32-bit sibling.

The funnelshift instruction is limited to 32-bit shifts, so it too would need to be run twice in order to shift over the full 64-bit range: PTX ISA :: CUDA Toolkit Documentation
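For illustration only, a variable 64-bit left shift roughly decomposes into 32-bit pieces like this (a sketch; the compiler’s actual SASS sequence will look different):

// Sketch: compose a 64-bit left shift from 32-bit operations.
// __funnelshift_l(lo, hi, s) returns the upper 32 bits of (hi:lo) << (s & 31).
__device__ unsigned long long shl64(unsigned long long x, unsigned int s)
{
	unsigned int lo = (unsigned int)x;
	unsigned int hi = (unsigned int)(x >> 32);
	unsigned int new_lo, new_hi;
	if (s < 32) {
		new_hi = __funnelshift_l(lo, hi, s);  // spills bits from lo into hi
		new_lo = lo << s;
	} else {                                  // shift amount 32..63
		new_hi = lo << (s - 32);
		new_lo = 0;
	}
	return ((unsigned long long)new_hi << 32) | new_lo;
}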

Have you used Nsight Compute to examine the SASS generated by the code line:

if ((b & (uint64_t) 0x1 << i) != 0)

in its different forms as outlined in your examples above? This may give you some clues as to the runtime variability you’re seeing.

Does the value of “b” actually change between kernel calls?

If not, I can see the compiler optimising away any changes to “a” after the first run.
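A related benchmarking pitfall: if nothing downstream consumes a[], the compiler is free to drop the whole loop. A sketch of one way to keep the work live, assuming a global output buffer out[] exists (it is not in your snippet):

// Sketch: sink a[] into a global buffer so the loop filling it cannot be
// eliminated as dead code. out[] is an assumed kernel parameter, not your code.
int tid = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = 0; i < 64; i++)
	out[(size_t)tid * 64 + i] = a[i];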

So from what I understand, I use the correct way to bit shift in CUDA, right?

(off-topic) Apart from being intrinsics, what are the advantages of the funnelshift() functions?

I compiled using -lineinfo, so I can see Source versus SASS. However, when I select the if() line (the one you asked about) on the left side, about a hundred lines get selected on the right side (SASS). I have to dig further into the Nsight Compute docs to fully understand how to read/use this.

Yes, b changes with the thread index, so its value differs for every thread.

Using --ptxas-options=--verbose, I noticed the following, based on this line:

if ((b & (uint64_t) 0x1 << i) != 0)

If i is 1, 96 registers are used.
If I use uint32_t instead of uint64_t, 128 registers are used (just for testing).
If I use uint64_t, 168 registers are used (which is the correct code).

I’m surprised to see that registers are used to store constant values. Is it desirable/optimal that nvcc decides to put these 64 masks (created by the looped bit shifts) in registers?

… later …

If I comment out the if() statement, I only use 96 registers. Apparently another confirmation that the bit shift uses 72 registers.

Yes.

32-bit output from a 64-bit input, and 32-bit rotate operations. In fact, with more recent architectures (sm >= 7.0), the standard shift instructions SHL/SHR seem to be no longer used by the compiler, in favour of SHF, despite those instructions still being shown in the instruction set documentation.

I’m not sure the documentation will give you any guidance here. SASS-related documentation is pretty scarce, so it’s really down to the individual to become familiar with the instructions: CUDA Binary Utilities :: CUDA Toolkit Documentation, which are often similar to the PTX instructions, and to try to follow what is often quite a convoluted flow of instructions, due to compiler optimisations.
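For reference, the SASS can also be dumped outside of Nsight Compute with cuobjdump (my_app stands in for your executable):

cuobjdump --dump-sass my_app    # disassemble the embedded cubin(s) to SASS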

Given that registers are the means of storage offering the fastest access, the compiler will always attempt to do this, subject to its cost/benefit analysis with regard to the overall impact on performance.

If that creates bottlenecks (too many registers), then often it comes down to finding a way to rearrange operations algorithmically to smooth things out.
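If register pressure does turn out to be the bottleneck, the usual knobs are __launch_bounds__ on the kernel definition, or the -maxrregcount compiler option; both trade registers for occupancy. A rough sketch only, with illustrative values and a placeholder signature:

// Sketch: ask the compiler to keep register usage low enough that at least
// 16 blocks of 64 threads can be resident per SM. Values are illustrative only.
__global__ void __launch_bounds__(64, 16)
myKernel(const uint64_t* in, uint32_t* out)
{
	// ... kernel body ...
}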

Hi Everyone,

  1. Global memory is indeed cached; however, modern GPUs allow you to select the split point between the L1 cache and the shared memory of a given Streaming Multiprocessor (see the sketch after this list). That is probably why the code slowed down when compiled for the Turing architecture (I suspect the partition is fifty-fifty by default, but that needs to be verified).
  2. Type conversions can be costly, but in this case the cast could cause reordering of the write instructions so that they fit better with the delays caused by cache synchronization and warp synchronization (although architectures >= Volta have schedulers that can avoid stalls on conditional instructions).
  3. From the compiler options I can infer that you generate both the binary code and the PTX assembly; the PTX is normally compiled (JIT) when you run the code for the first time, which obviously slows down the first run. To keep things short, remove the “,compute_52” or “,compute_75” from the “code” clause and see whether that evens out the times a bit (it should not have a significant impact, though).
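Regarding point 1, a rough sketch of how the L1/shared split can be hinted per kernel on Volta/Turing and newer (myKernel is a placeholder, and the attribute is only a hint to the runtime, not a guarantee):

// Sketch: request the largest possible L1 carveout for this kernel (Volta/Turing+).
// myKernel is a placeholder name; the runtime treats the value only as a hint.
cudaFuncSetAttribute(myKernel,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     cudaSharedmemCarveoutMaxL1);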