I have some performance questions/issues regarding the following kernel code. It basically calculates values (b), which need to match one of two hardcoded 64-bit tokens (a[]). The Hits struct array is dimensioned to 20, so it can hold 20 hits per kernel call (which is sufficient).
I know that code which is never called/used is removed by nvcc at compile time.
I also know that branching should be avoided.
I know I should not use printf() in device code (it was for testing only).
There is, however, much I can't see a reason for.
struct Hits {
    uint64_t tid;
    int bid;
    uint64_t b;
};
I show some variations of the code in order to better illustrate the questions.
kernel 1:
__global__ void mantas_kernel(
        struct Hits *hits,
        int *hitsn,
        uint64_t *test) {
    uint64_t tid = ...
    uint64_t a[] = {
        0x3e24b1546a138f46,
        0xeaa667f23491cf47
    };
    uint64_t b[32];
    // calculate b (complex stuff)
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 32; j++) {
            if (a[i] != b[j])
                continue;
            hits[0].tid = tid;
        }
    }
}
performance: 33.1 Gop/s
This is basically the fastest I get, but everything is stored in hits[0], which hides any hits beyond the first. Hence it is not practical.
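A possible reason kernel 1 is this fast (an assumption on my part, not verified against the generated SASS): the store does not depend on i or j, so the compiler is free to collapse the whole hit handling into a single predicated store, roughly:

// Rough equivalent of what the compiler may turn kernel 1 into
// (assumed transformation, not taken from the actual SASS):
bool any_hit = false;
for (int i = 0; i < 2; i++)
    for (int j = 0; j < 32; j++)
        any_hit |= (a[i] == b[j]);
if (any_hit)
    hits[0].tid = tid;  // one predicated store instead of up to 64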
kernel 2:
for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 32; j++) {
        if (a[i] != b[j])
            continue;
        int hitsc = atomicAdd(hitsn, 1);
        hits[hitsc].tid = tid;
    }
}
performance: 25.5 Gop/s
hits can now hold more than one hit, but the performance dropped. I have worked with atomics before, but I have never seen such a huge performance loss. Why is that?
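One common way to reduce the cost of atomics on a rarely taken path is to let a single lane per warp perform the atomicAdd on behalf of all hitting lanes. A sketch, not from the original code; warp_aggregated_add is a hypothetical helper and assumes the sync variants of the warp intrinsics (sm_70+):

// Hypothetical helper: one atomicAdd per warp instead of one per hit.
__device__ int warp_aggregated_add(int *ctr) {
    unsigned mask = __activemask();             // lanes currently on the hit path
    int lane   = threadIdx.x % 32;
    int leader = __ffs(mask) - 1;               // lowest active lane does the atomic
    int base   = 0;
    if (lane == leader)
        base = atomicAdd(ctr, __popc(mask));    // reserve slots for all active lanes
    base = __shfl_sync(mask, base, leader);     // broadcast the reserved base index
    int rank = __popc(mask & ((1u << lane) - 1)); // this lane's offset within the warp
    return base + rank;
}

// usage in the loop body:
//   int hitsc = warp_aggregated_add(hitsn);
//   hits[hitsc].tid = tid;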
kernel 3:
This kernel has a race condition, as pointed out here by striker159.
for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 32; j++) {
        if (a[i] != b[j])
            continue;
        hits[*hitsn].tid = tid;
        atomicAdd(hitsn, 1);
    }
}
performance: 27.2 Gop/s
The intermediate variable before the atomic operation is skipped. Performance is slightly better. I also don't see why the additional int (hitsc in the previous example) slowed the previous code down by 5%.
kernel 4:
for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 32; j++) {
        if (a[i] != b[j])
            continue;
        printf("hit!\n");
    }
}
performance: 28.3 Gop/s
For testing only. However, I have never seen a 10% performance loss due to printf() before.
I can't believe that the complex calculation of b is very fast, yet a "simple" 64-bit compare and the subsequent struct store for host evaluation eats up to 25% of the performance.
Also … there are kernel calls which don't come up with any hit, so the code inside the if() block is never executed. I understand that the compiler doesn't know that at compile time (so the code must be there), but why does it slow down the entire kernel?
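One way to test whether the hit handling is really what costs the time (a hypothetical check, not from the original post): replace the store with a cheap sink that still keeps the computation of b alive. If such a variant is much faster again, part of the earlier difference came from the compiler keeping b live only because of the hit store:

// Hypothetical variant: no branches, no atomics, but b stays live,
// so nvcc cannot dead-code-eliminate the expensive computation.
// Assumes 'test' has one slot per thread.
uint64_t sink = 0;
for (int j = 0; j < 32; j++)
    sink ^= b[j];
test[tid] = sink;  // 'test' is the otherwise unused output pointer from the kernel signature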
All on RTX 2080 SUPER.
That’s how I measure the performance:
HANDLE_ERROR( cudaEventRecord(start) );
mantas_kernel <<< grids,threads >>> (hits, hitsn, test);
HANDLE_ERROR( cudaDeviceSynchronize() );
HANDLE_ERROR( cudaEventRecord(stop) );
HANDLE_ERROR( cudaEventSynchronize(stop) );
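For comparison, the usual event-timing pattern enqueues the stop event directly after the launch, so the host-side latency of cudaDeviceSynchronize() is not included in the measured interval, and then reads the elapsed time with cudaEventElapsedTime; a sketch:

HANDLE_ERROR( cudaEventRecord(start) );
mantas_kernel <<< grids,threads >>> (hits, hitsn, test);
HANDLE_ERROR( cudaEventRecord(stop) );       // enqueue stop right after the launch
HANDLE_ERROR( cudaEventSynchronize(stop) );  // wait for the kernel to finish
float ms = 0.0f;
HANDLE_ERROR( cudaEventElapsedTime(&ms, start, stop) );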
What am I missing?