Expected performance

Hello all,

I’m trying to CUDA-accelerate a program consisting largely of 64-bit integer arithmetic. It’s a kernel to reverse a random number generator given a pattern of outputs. Unfortunately I’m seeing much slower performance on the GPU than on even 2 CPU cores at the moment, and was wondering if I’m doing something wrong. I’ve exhausted what I know to look for: it’s compute bound, with no warp divergence. nvvp gives some warnings about tiny, non-overlapping IO, but the IO is a minuscule slice of the timeline.

Kernel:

__global__ void checkSeedBlock(const unsigned long* chunk_values, unsigned long* top_seeds, unsigned int* num_seeds, const uint64_t lot_start, const uint64_t block_size)
{
  // Each thread scans its own contiguous range of seeds; indexing by
  // threadIdx.x alone assumes a single-block launch.
  uint64_t my_start = lot_start + block_size * threadIdx.x;
  uint64_t my_end = my_start + block_size;
  const unsigned long mask_48bit = (1UL << 48) - 1UL;
  
  unsigned int num_found = 0;
  unsigned long top_seed = 0;
  
  unsigned long seed;
  for (seed = my_start; seed < my_end; ++seed) {
    int hits = 0;
    int chunk;
    const unsigned long* chunk_ptr = chunk_values;
    #pragma unroll
    for (chunk = 0; chunk < NUM_CHUNKS; chunk++) {
      unsigned long random;
      int are_slimes;
      
      // Mix the seed with the per-chunk constant, multiply by the LCG
      // constant, keep the low 48 bits, and take the top 31 of those
      // (the java.util.Random pattern).
      random = (seed + *(chunk_ptr++)) ^ 0x5e434e432L;
      random = ((random * 0x5DEECE66DL) & mask_48bit) >> 17;
      are_slimes = random % 10 == 0;
      if (are_slimes) {
        hits += 1;
      }
      
      if (__all(chunk - hits > NUM_CHUNKS - REQUIRED_HITS - 1)) {
        // All in warp have missed too many to find a match
        break;
      }
    }
    
    if (hits >= REQUIRED_HITS) {
      num_found++;
      top_seed = seed;  // only the most recent matching seed is kept
    }
  }
  
  num_seeds[threadIdx.x] = num_found;
  top_seeds[threadIdx.x] = top_seed;
}

Invoked like:

/* setup execution parameters */
  dim3  grid(1, 1, 1);
  dim3  threads(768, 1, 1);

  cudaStream_t stream;

  ... allocates stream, memory, copies some constants across
  
  uint64_t lot_start;
  for (lot_start = start_seed; lot_start < end_seed; lot_start += num_threads * block_size) {
      // execute the kernel
      checkSeedBlock<<< grid, threads, 0, stream >>>(device_chunk_values, device_top_seeds, device_num_seeds, lot_start, block_size);
      
      // Polling wait every 2ms - native library uses busy-wait and consumes 100% CPU.
      // On my machine, each block takes about 9.5ms to run so this just rounds up to 10.
      // With different timings, could add up to 2ms extra to each block, but reduces CPU use dramatically.
      while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        usleep(2000);
      }
      
      ...

Each kernel run only gets its parameters, then writes results into a tiny 3 KB memory area. This 3 KB is copied back to the host and checked for results (almost inevitably none), then the next run is started.
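The per-iteration check is roughly this (host_num_seeds and host_top_seeds are pinned host buffers; the names are just for illustration):

  // Pull back the per-thread counters and best seeds, then scan on the host.
  cudaMemcpyAsync(host_num_seeds, device_num_seeds,
                  num_threads * sizeof(unsigned int),
                  cudaMemcpyDeviceToHost, stream);
  cudaMemcpyAsync(host_top_seeds, device_top_seeds,
                  num_threads * sizeof(unsigned long),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  for (unsigned int i = 0; i < num_threads; ++i) {
    if (host_num_seeds[i] > 0) {
      printf("thread %u: %u match(es), last seed %lu\n",
             i, host_num_seeds[i], host_top_seeds[i]);
    }
  }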

The usleep-based wait is there so that CPU use stays low, as I planned to run the GPU and CPU versions at the same time.

I’ve played with explicitly unrolling loops (the #pragma unroll is ignored since I added the break), to no significant effect. Adding the loop break when everyone in the warp has failed sped it up quite a lot. Some of the optimisations in the assembly are really neat (“n % 10 == 0” becomes a couple of multiplies, a shift and a subtraction), but it’s still not as fast as I’d hoped.
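For the curious, my reconstruction of that transformation for a 32-bit value (not the actual SASS, but the same technique):

  // n % 10 == 0 via a reciprocal multiply: q equals n / 10 for any
  // 32-bit unsigned n, so the modulo reduces to a multiply and subtract.
  unsigned int q = __umulhi(n, 0xCCCCCCCDu) >> 3;
  int is_multiple_of_10 = (n - q * 10u) == 0;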

Anyone know where I’m likely going wrong?
Attachment: slimes.cu (4.95 KB)

Gah, missed some important details. The GPU is a GTX 650 Ti, OC edition. Block size (my_end - my_start in the kernel) is 4096; much longer and it stalls graphics. Also tried on EC2 Cluster-GPU instances, which were even slower.

/usr/local/cuda/bin/nvcc -arch=compute_20 -code=compute_20 -O -I /usr/local/cuda-5.0/samples/common/inc/ -o slimes_cuda slimes.cu

You are only using a tiny fraction of the power of the GPU. Running a single block (grid(1, 1, 1)) severely limits what the GPU can do in parallel, and how it can hide various latencies. Unfortunately, being at work, I am unable to delve too deeply into how you would need to rewrite it to run multiple blocks (>32), each running multiple threads (>256).

The other thing that jumped out at me is that you mentioned you are copying 3 KB chunks, working on them, and copying the 3 KB back. That is far too small a dataset to get any real benefit from the GPU. You want to be copying at least 1000 times that (if you can ‘know’ where you are going, try to do at least 300 MB each time). Each copy has a fixed overhead, so thousands of tiny copies pay that overhead thousands of times, versus a single large copy paying it once.

You are only launching a single block!? That grossly underutilizes the GPU. Split the work into many blocks as well as many threads.
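Roughly, the change is to derive each thread’s seed range from both the block and thread index, so the launch can cover many blocks (a sketch, with the unchanged body elided):

__global__ void checkSeedBlock(const unsigned long* chunk_values, unsigned long* top_seeds, unsigned int* num_seeds, const uint64_t lot_start, const uint64_t block_size)
{
  // Global thread index across all blocks in the grid.
  unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  uint64_t my_start = lot_start + block_size * tid;
  uint64_t my_end = my_start + block_size;

  ... // seed-scanning loop unchanged

  num_seeds[tid] = num_found;
  top_seeds[tid] = top_seed;
}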

As the next step, use the Occupancy Calculator to find the best block size. 768 probably doesn’t give the maximum occupancy possible because of the factor of 3 involved.

Ah, that makes sense. I didn’t understand blocks, so I had bypassed them. I’ll shift threads to 256 for maximum occupancy and add block dimensions. It hadn’t clicked that there are scheduling constraints within a thread block which prevent a single block from using all the hardware.
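So the launch becomes something like this (the block count is a guess I’ll tune):

  dim3 grid(64, 1, 1);       // enough blocks to give every SM work
  dim3 threads(256, 1, 1);   // 256 threads per block for occupancy
  unsigned int num_threads = 64 * 256;  // total threads, for the lot_start stride

  checkSeedBlock<<< grid, threads, 0, stream >>>(device_chunk_values,
      device_top_seeds, device_num_seeds, lot_start, block_size);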

Thank you both - much faster. It now completes a block of 2^34 seeds in 12 seconds, versus nearly 2 minutes before, and is about 3 times faster than the 4 CPU cores working together.

I had tried this on EC2 before and worked out I needed a 10x speedup of the CUDA code for it to be more cost-effective at their spot pricing (7c/hr CPU-only vs 38c/hr with two M2050s, plus some network-latency advantages I don’t benefit from). Not quite there, but my own GPU can contribute a lot to the pool now.

(1) Please note that loop unrolling typically requires the loop body to have a single-entry, single-exit structure, so loops containing ‘break’ or ‘return’ cannot be unrolled. You may want to look into getting rid of the ‘break’ inside the loop.
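One way to keep the body single-entry, single-exit is to replace the ‘break’ with a warp-uniform flag, along these lines (untested sketch):

  int done = 0;
  #pragma unroll
  for (chunk = 0; chunk < NUM_CHUNKS; chunk++) {
    if (!done) {
      // ... existing body computing are_slimes and hits ...
      // __all returns the same value to every thread in the warp, so
      // 'done' stays warp-uniform and the branch never diverges.
      done = __all(chunk - hits > NUM_CHUNKS - REQUIRED_HITS - 1);
    }
  }

Whether that wins depends on whether the unrolled-but-predicated loop beats the early exit; it’s worth timing both.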

(2) Keep in mind that 64-bit integer arithmetic is emulated on the GPU (it has a 32-bit architecture). You would want to avoid any unnecessary 64-bit operations. For example, does ‘num_found’ have to be a 64-bit variable?

Yeah, I expected it to stop unrolling when I added that break, but the break speeds it up massively (timed). I also tried some explicit unrolling (a switch / jump table inside the outer loop), but it didn’t have a noticeable impact.

Interesting about the 64-bit. It’s all actually 48-bit arithmetic (Java’s Random implementation) discarding the top 16 bits, so I wonder if that could be done more efficiently.
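For instance, splitting the 48-bit state into a 32-bit low half and a 16-bit high half should let the multiply/mask/shift run entirely in 32-bit operations - something like this (untested sketch, names made up):

// Computes ((seed48 * 0x5DEECE66D) & mask_48bit) >> 17 from the two halves.
__device__ unsigned int lcg_step_top31(unsigned int lo32, unsigned int hi16)
{
  const unsigned int M_LO = 0xDEECE66Du;  // low 32 bits of 0x5DEECE66D
  const unsigned int M_HI = 0x5u;         // bits 32+ of 0x5DEECE66D

  // Low 32 bits of the 48-bit product.
  unsigned int p_lo = lo32 * M_LO;
  // Bits 32..47: the carry out of the low partial product plus the two
  // cross terms, with everything above bit 47 masked away.
  unsigned int p_hi = (__umulhi(lo32, M_LO) + lo32 * M_HI + hi16 * M_LO)
                      & 0xFFFFu;

  // Reassemble bits 17..47 of the product.
  return (p_lo >> 17) | (p_hi << 15);
}

The % 10 test can then run on the 32-bit result as well.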