Hello all,
I’m trying to CUDA-accelerate a program that consists largely of 64-bit integer arithmetic. It’s a kernel that reverses a random number generator given a pattern of its outputs. Unfortunately I’m currently seeing much slower performance on the GPU than on even 2 CPU cores, and I was wondering if I’m doing something wrong. I’ve exhausted what I know to look for: it’s compute bound, with no warp divergence. nvvp gives some warnings about tiny, non-overlapping IO, but the IO is a minuscule amount on the timeline.
Kernel:
__global__ void checkSeedBlock(const unsigned long* chunk_values, unsigned long* top_seeds, unsigned int* num_seeds, const uint64_t lot_start, const uint64_t block_size)
{
    // Each thread scans its own contiguous range of candidate seeds.
    uint64_t my_start = lot_start + block_size * threadIdx.x;
    uint64_t my_end = my_start + block_size;
    const unsigned long mask_48bit = (1L << 48) - 1L;
    unsigned int num_found = 0;
    unsigned long top_seed = 0;
    unsigned long seed;
    for (seed = my_start; seed < my_end; ++seed) {
        int hits = 0;
        int chunk;
        const unsigned long* chunk_ptr = chunk_values;
        #pragma unroll
        for (chunk = 0; chunk < NUM_CHUNKS; chunk++) {
            unsigned long random;
            int are_slimes;
            // Mix the candidate seed with this chunk's value, take one
            // multiply-and-mask step of the 48-bit generator, keep bits 17..47.
            random = (seed + *(chunk_ptr++)) ^ 0x5e434e432L;
            random = ((random * 0x5DEECE66DL) & mask_48bit) >> 17;
            are_slimes = random % 10 == 0;
            if (are_slimes) {
                hits += 1;
            }
            if (__all(chunk - hits > NUM_CHUNKS - REQUIRED_HITS - 1)) {
                // All threads in the warp have missed too many chunks to
                // possibly reach REQUIRED_HITS, so bail out early.
                break;
            }
        }
        if (hits >= REQUIRED_HITS) {
            num_found++;
            top_seed = seed;
        }
    }
    // Per-thread results; the host scans these after the kernel completes.
    num_seeds[threadIdx.x] = num_found;
    top_seeds[threadIdx.x] = top_seed;
}
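(NUM_CHUNKS and REQUIRED_HITS are compile-time constants. The values below are only placeholders so the snippet reads on its own; the real definitions are in the attached slimes.cu.)

// Placeholder values for illustration - see slimes.cu for the real definitions.
#define NUM_CHUNKS 16      // number of known chunk outputs to test against
#define REQUIRED_HITS 14   // how many of those must come up "slime" for a match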
Invoked like:
/* setup execution parameters */
dim3 grid(1, 1, 1);
dim3 threads(768, 1, 1);
cudaStream_t stream;
... allocates stream, memory, copies some constants across
uint64_t lot_start;
for (lot_start = start_seed; lot_start < end_seed; lot_start += num_threads * block_size) {
    // execute the kernel
    checkSeedBlock<<< grid, threads, 0, stream >>>(device_chunk_values, device_top_seeds, device_num_seeds, lot_start, block_size);
    // Polling wait every 2ms - the library's native wait busy-waits and consumes 100% CPU.
    // On my machine each launch takes about 9.5ms to run, so this just rounds up to 10ms.
    // With different timings it could add up to 2ms to each launch, but it reduces CPU use dramatically.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        usleep(2000);
    }
    ...
Each kernel run only receives its parameters, then writes its results into a tiny 3 KB memory area. That 3 KB is copied back to the host and checked for results (almost inevitably none), then the next run is started.
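For anyone who wants specifics, the check is roughly the following sketch (host_num_seeds and NUM_THREADS are placeholder names I'm using here, not verbatim from slimes.cu):

// Copy the per-thread counters back (~3 KB for 768 threads) and scan them.
cudaMemcpy(host_num_seeds, device_num_seeds, NUM_THREADS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
for (int t = 0; t < NUM_THREADS; t++) {
    if (host_num_seeds[t] > 0) {
        // Only on the rare hit do we bother fetching the matching seed.
        unsigned long seed;
        cudaMemcpy(&seed, device_top_seeds + t, sizeof(seed), cudaMemcpyDeviceToHost);
        printf("thread %d: %u candidate(s), last seed %lu\n", t, host_num_seeds[t], seed);
    }
}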
The usleep-based wait is there so that CPU use stays low, since I planned to run the GPU and CPU versions at the same time.
I’ve played with explicitly unrolling the loops (the #pragma unroll is ignored now that I’ve added the break), to no significant effect. Adding the loop break when everyone in the warp has failed sped it up quite a lot. Some of the optimisations in the generated assembly are really neat (“n % 10 == 0” becomes a couple of multiplies, a shift and a subtraction), but it’s still not as fast as I’d hoped.
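For the curious, that “% 10” strength reduction is roughly equivalent to the snippet below. This is my own reconstruction using 64-bit constants (the compiler almost certainly picks a 32-bit magic number, since the value only has 31 bits left after the shift), so treat it as an illustration of the technique rather than what’s actually emitted:

// Multiply by a fixed-point reciprocal of 10 and take the high 64 bits of the
// product: 0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10), so q below is floor(n / 10)
// for any 64-bit n. A subtract then recovers the remainder.
__device__ __forceinline__ int isMultipleOf10(unsigned long long n)
{
    unsigned long long q = __umul64hi(n, 0xCCCCCCCCCCCCCCCDULL) >> 3; // n / 10
    return (n - q * 10ULL) == 0;                                      // n % 10 == 0
}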
Anyone know where I’m likely going wrong?
slimes.cu (4.95 KB)