Block size's effect on program performance, why does my program run faster at seemingly random sizes?

My Cuda program gains a significant performance boost (on average) depending on the size of the blocks & # of blocks; where the total number of “threads” remains the same. (This might not be the agreed upon terminology; where for each kernel the total number of threads is (# of blocks)*(block size)). I made some graphs to illustrate my point.

But first allow me to explain what my algorithm is first, however I’m not sure how relevant it is, because I would imagine this is something that applies to all GPGPU programs. But maybe I am wrong about that.

Basically I go across large arrays that are logically treated as 2D arrays, where each thread adds an element from the array as well as adds the square of that value to another variable and then at the end writes the value to another array, where during each read all the threads are shifted a certain way. Here is my kernel code:

__global__ void MoveoutAndStackCuda(const float* __restrict__ prestackTraces, float* __restrict__ stackTracesOut,
      float* __restrict__ powerTracesOut, const int* __restrict__ sampleShift,
      const unsigned int samplesPerT, const unsigned int readIns,
      const unsigned int readWidth, const unsigned int defaultOffset) {
      unsigned int globalId = ((blockIdx.x * blockDim.x) + threadIdx.x); // Global ID of this thread, starting from 0 to total # of threads
      unsigned int jobNum = (globalId / readWidth); // Which array within the overall program this thread works on
      unsigned int readIndex = (globalId % readWidth) + defaultOffset; // Which sample within the array this thread works on
      globalId = (jobNum * samplesPerT) + readIndex;  // Incorperate default offset (since default offset will also be the offset of
                                                      // index we will be writing to), actual globalID only needed for above two variables.
      float stackF = 0.0;
      float powerF = 0.0;
      for (unsigned int x = 0; x < readIns; x++) {
        unsigned int indexRead = x + (jobNum * readIns);
        float value = prestackTraces[readIndex + (x * samplesPerT) + sampleShift[indexRead]];
        stackF += value;
        powerF += (value * value);
      stackTracesOut[globalId] = stackF;
      powerTracesOut[globalId] = powerF;

Also I compile with:

nvcc --gpu-architecture=sm_52 -O3

Now for the meat of this post, when calling this code

MoveoutAndStackCuda<<<threadGroups, threadsPerGroup>>>(*prestackTracesCudaPtr,
        *stackTracesOutCudaPtr, *powerTracesOutCudaPtr,
        *sampleShiftCudaPtr, samplesPerT, readIns,
        readWidth, defaultOffset);

All I did was differ threadGroups and threadsPerGroup inside the <<<>>>, where (threadGroups.x * threadsPerGroup.x) remains the same. (As stated before this is a 1 dimensional problem).

I incremented the block size by 64 until I reached 1024. I expected no change, because I figured as long as block size is greater than 32, which I believe is the # of ALUs in a core, it would run as fast as possible. Take a look at this graph I made:

For this specific size the total number of threads is 5000 * 5120, so for example if the block size is 64 then there are ((5000 * 5120) / 64) blocks. For some reason there is a significant performance boost at a block size of 896, 768, and 512. Why?

I know this looks random, but each point in this graph is 50 test averaged together!

Here is another graph, this time for when the total # of threads will be (8000 * 8192). This time the boost is at 768 and 960.

Yet another example, this time for a job that is smaller than the other two problems (total threads is 2000 * 2048):

In fact here is an album I made of these graphs, with each graph representing a different size of the problem:

I am running this one a Quadro M5000:, which has 2048 Cuda Cores. I believe each Cuda Core has 32 ALUs, so I presume that total # of computations that could be happening at any given time is (2048 * 32)?

So what explains these magic numbers? I figured it might be the total # of threads divided by the # of cuda cores, or divided by (2048 * 32), but so far I have found no correlation with anything that stretches across all of the graphs in my album. Is there another test I could do to help narrow things down? I want to find out what block size to run this program at for the best results. Is there any kind of equation I could formulate from these results, so that I now what block size will run the fastest for any given job size?

Also I didn’t include it, but I also did a test where block size decreased by 1 from 32 and things got exponentially slower. This of course is not surprising.

cross posting:

In my experience, in general: NO. You may want to consider creating and using an auto-tuning framework to find the fastest combination of configuration parameters for a given platform. Make sure that you achieve steady-state performance for each configuration, otherwise dynamic clock boosting is going to distort the experiments in unpredictable ways (best to use fixed application clocks!)

The reason there is no general “formula” is that the possible interactions between block scheduler, thread scheduler, memory address streams generated, external DRAM response behavior, internal buffering mechanisms, internal resource limits, clock boosting, etc are so numerous and complex, that it is isn’t possible to capture these in a closed-form formula.

For very simple code (e.g. a memory copy) it may be possible to find an approximate formula, but even there I have found it easier to simply tune by cycling through the available parameters to find the best combination.

However, one can apply some heuristics, such as (1) GPUs need to have tens of thousands of active threads for full performance, for memory-intensive tasks on the order of hundreds of thousand is preferred (2) thread blocks should have as few threads as possible while maximizing occupancy to improve granularity of execution, between 128 and 256 thread/block is usually a good starting point (3) individual kernels should not run too long (more than a 2 seconds), as imbalanced load scenarios can develop, but not too short (less than 20 ms) as startup/shutdown overhead can impact performance.

Note that similar (but usually, less severe) challenges in predicting performance exist on modern CPUs: this is why auto-tuning frameworks (e.g. ATLAS for BLAS) were invented more than two decades ago.

Is that okay to do? Just thought it would make sense to also post it here.

Thanks, yeah that all makes sense. Such a formula would be very complicated I’m sure, for now it seems that the best thing to do is to use large blocks and perferably 512 for most cases.

Cross-posting here and on Stackoverflow is fine (the philosophies of the respective sites are very different, these forums are discussion-friendly, for example), however it is much appreciated when pointers are provided at either end so as to avoid providing redundant information.