Caching effects are (of course) known to me, but I didn’t expect them to compensate that much for non-optimized (CUDA) code. OK … I should perhaps know this, but I definitely have gaps in my knowledge regarding the big picture of CUDA and GPU hardware. I did quite a bit of searching and experimenting prior to posting here, but I still miss things (sorry)! That said, I dare ask some more questions regarding the code snippet in question, hoping that hints might clarify why things happen as they do.
The following experiments brought up more questions.
I still run 10 kernel calls (grid size: 16384, block size: 64).
The following a[] array initialization, according to a bit check of b, is only a small part of the entire kernel code. Hence my surprise regarding the unexpected overall performance impact.
Code 1:
That’s the code that generated the ops/s figures shown in the first post. The results differ slightly because the hardware changed somewhat between the two tests.
uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
    if ((b & ((uint64_t) 0x1 << i)) != 0)  // 64-bit mask: every bit of b is tested
        a[i] = 0xff00ff00;
1st kernel call: 13500 ops/s
Remaining 9 kernel calls: 27330 ops/s
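As an aside, I assume the cast could equally be written with a 64-bit literal suffix; I’d expect both forms to compile to the same 64-bit shift:

for (int i = 0; i < 64; i++)
    if ((b & (1ULL << i)) != 0)  // 1ULL is unsigned long long, so the shift is 64-bit
        a[i] = 0xff00ff00;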
Code 2: If I omit the uint64_t cast for 0x1 (which is programmatically incorrect and gives wrong results, since only 32 bits are used), all 10 kernel calls perform identically and slightly faster.
uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
    if ((b & 0x1 << i) != 0)  // 0x1 is a 32-bit int: shift is undefined for i >= 32 (wrong on purpose)
        a[i] = 0xff00ff00;
All 10 kernel calls: 29000 ops/s
So it seems the uint64_t cast in Code 1 makes the entire first kernel call run roughly 50% slower. Robert Crovella addressed the caching, but I would assume that the initial load from global memory happens when the first thread block runs, and that subsequent thread blocks use cached data. The Code 1 results, however, suggest that caching only took effect after the entire first kernel call had completed.
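In case it matters, a timing harness along these lines is what I have in mind; the kernel name and arguments below are placeholders for my actual code:

// "kernel" and "args" are placeholders for my actual kernel and parameters
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
kernel<<<16384, 64>>>(args);        // 1st call, measured separately the same way
cudaDeviceSynchronize();
cudaEventRecord(start);
for (int run = 0; run < 9; run++)   // remaining 9 calls
    kernel<<<16384, 64>>>(args);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);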
Code 3: If I don’t shift bits at all, the code gets faster again. Is a 64-bit shift that expensive, or did I do something wrong?
uint32_t a[64] = { 0 };
uint64_t b = x; // might be any int calculated before
for (int i = 0; i < 64; i++)
    if ((b & 0x1) != 0)  // no shift at all: only bit 0 of b is tested (different semantics, just for timing)
        a[i] = 0xff00ff00;
All 10 kernel calls: 30900 ops/s
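One idea I had to avoid variable 64-bit shifts entirely: split b into its two 32-bit halves once and test bits with 32-bit shifts only. Untested sketch, which should be semantically equivalent to Code 1:

uint32_t a[64] = { 0 };
uint32_t b_lo = (uint32_t) b;         // lower 32 bits of b
uint32_t b_hi = (uint32_t)(b >> 32);  // upper 32 bits of b (a single 64-bit shift)
for (int i = 0; i < 32; i++) {
    if ((b_lo & (1u << i)) != 0) a[i]      = 0xff00ff00;
    if ((b_hi & (1u << i)) != 0) a[i + 32] = 0xff00ff00;
}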
AFAIK, 64-bit integer operations are emulated with 32-bit instructions. I read about the funnel-shift intrinsics (__funnelshift_l() / __funnelshift_r()). Are these the correct/better way to shift 64-bit variables in CUDA?
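From the documentation I understand that __funnelshift_r(lo, hi, s) returns the low 32 bits of the 64-bit value hi:lo shifted right by s & 31. Based on that, a 64-bit bit test might look like the following untested sketch (though the compiler may already emit something equivalent):

__device__ __forceinline__ uint32_t bit_of_u64(uint64_t b, int i)
{
    uint32_t lo = (uint32_t) b;         // lower 32 bits
    uint32_t hi = (uint32_t)(b >> 32);  // upper 32 bits
    // i < 32: the funnel shift moves bit i of b into bit 0;
    // i >= 32: a plain 32-bit shift of the high word suffices
    uint32_t w = (i < 32) ? __funnelshift_r(lo, hi, i) : (hi >> (i - 32));
    return w & 1u;
}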
Additionally … could I optimize the initialization of a[] (global memory)?
Nsight Compute doesn’t show any bank conflicts.
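For the record, one variant I considered for the a[] stores, assuming a[] really lives in global memory and is 16-byte aligned (both are assumptions): compute the values branchlessly and write them with 128-bit vector stores, which also removes the separate zero-initialization. Untested sketch:

uint4 *a4 = reinterpret_cast<uint4 *>(a);  // requires 16-byte alignment of a[]
for (int i = 0; i < 16; i++) {
    uint4 v;
    v.x = ((b >> (4 * i + 0)) & 1) ? 0xff00ff00u : 0u;
    v.y = ((b >> (4 * i + 1)) & 1) ? 0xff00ff00u : 0u;
    v.z = ((b >> (4 * i + 2)) & 1) ? 0xff00ff00u : 0u;
    v.w = ((b >> (4 * i + 3)) & 1) ? 0xff00ff00u : 0u;
    a4[i] = v;                             // one 128-bit store instead of four 32-bit stores
}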