Block size's effect on program performance, why does my program run faster at seemingly random sizes?

zadiN · January 2, 2017, 5:55pm

My Cuda program gains a significant performance boost (on average) depending on the size of the blocks & # of blocks; where the total number of “threads” remains the same. (This might not be the agreed upon terminology; where for each kernel the total number of threads is (# of blocks)*(block size)). I made some graphs to illustrate my point.

But first allow me to explain what my algorithm is first, however I’m not sure how relevant it is, because I would imagine this is something that applies to all GPGPU programs. But maybe I am wrong about that.

Basically I go across large arrays that are logically treated as 2D arrays, where each thread adds an element from the array as well as adds the square of that value to another variable and then at the end writes the value to another array, where during each read all the threads are shifted a certain way. Here is my kernel code:

__global__ void MoveoutAndStackCuda(const float* __restrict__ prestackTraces, float* __restrict__ stackTracesOut,
      float* __restrict__ powerTracesOut, const int* __restrict__ sampleShift,
      const unsigned int samplesPerT, const unsigned int readIns,
      const unsigned int readWidth, const unsigned int defaultOffset) {
    
      unsigned int globalId = ((blockIdx.x * blockDim.x) + threadIdx.x); // Global ID of this thread, starting from 0 to total # of threads
    
      unsigned int jobNum = (globalId / readWidth); // Which array within the overall program this thread works on
      unsigned int readIndex = (globalId % readWidth) + defaultOffset; // Which sample within the array this thread works on
    
      globalId = (jobNum * samplesPerT) + readIndex;  // Incorperate default offset (since default offset will also be the offset of
                                                      // index we will be writing to), actual globalID only needed for above two variables.
    
      float stackF = 0.0;
      float powerF = 0.0;
    
      for (unsigned int x = 0; x < readIns; x++) {
    
        unsigned int indexRead = x + (jobNum * readIns);
    
        float value = prestackTraces[readIndex + (x * samplesPerT) + sampleShift[indexRead]];
    
        stackF += value;
        powerF += (value * value);
      }
    
      stackTracesOut[globalId] = stackF;
      powerTracesOut[globalId] = powerF;
    }

Also I compile with:

nvcc --gpu-architecture=sm_52 -O3

Now for the meat of this post, when calling this code

MoveoutAndStackCuda<<<threadGroups, threadsPerGroup>>>(*prestackTracesCudaPtr,
        *stackTracesOutCudaPtr, *powerTracesOutCudaPtr,
        *sampleShiftCudaPtr, samplesPerT, readIns,
        readWidth, defaultOffset);

All I did was differ threadGroups and threadsPerGroup inside the <<<>>>, where (threadGroups.x * threadsPerGroup.x) remains the same. (As stated before this is a 1 dimensional problem).

I incremented the block size by 64 until I reached 1024. I expected no change, because I figured as long as block size is greater than 32, which I believe is the # of ALUs in a core, it would run as fast as possible. Take a look at this graph I made:

For this specific size the total number of threads is 5000 * 5120, so for example if the block size is 64 then there are ((5000 * 5120) / 64) blocks. For some reason there is a significant performance boost at a block size of 896, 768, and 512. Why?

I know this looks random, but each point in this graph is 50 test averaged together!

Here is another graph, this time for when the total # of threads will be (8000 * 8192). This time the boost is at 768 and 960.

Yet another example, this time for a job that is smaller than the other two problems (total threads is 2000 * 2048):

In fact here is an album I made of these graphs, with each graph representing a different size of the problem: http://imgur.com/a/LYYFf.

I am running this one a Quadro M5000: http://www.pny.com/nvidia-quadro-m5000, which has 2048 Cuda Cores. I believe each Cuda Core has 32 ALUs, so I presume that total # of computations that could be happening at any given time is (2048 * 32)?

So what explains these magic numbers? I figured it might be the total # of threads divided by the # of cuda cores, or divided by (2048 * 32), but so far I have found no correlation with anything that stretches across all of the graphs in my album. Is there another test I could do to help narrow things down? I want to find out what block size to run this program at for the best results. Is there any kind of equation I could formulate from these results, so that I now what block size will run the fastest for any given job size?

Also I didn’t include it, but I also did a test where block size decreased by 1 from 32 and things got exponentially slower. This of course is not surprising.

Robert_Crovella · January 2, 2017, 6:09pm

cross posting:

[url]cuda - GPGPU: Block size's effect on program performance, why does my program run faster at very specific sizes? - Stack Overflow

njuffa · January 2, 2017, 6:19pm

In my experience, in general: NO. You may want to consider creating and using an auto-tuning framework to find the fastest combination of configuration parameters for a given platform. Make sure that you achieve steady-state performance for each configuration, otherwise dynamic clock boosting is going to distort the experiments in unpredictable ways (best to use fixed application clocks!)

The reason there is no general “formula” is that the possible interactions between block scheduler, thread scheduler, memory address streams generated, external DRAM response behavior, internal buffering mechanisms, internal resource limits, clock boosting, etc are so numerous and complex, that it is isn’t possible to capture these in a closed-form formula.

For very simple code (e.g. a memory copy) it may be possible to find an approximate formula, but even there I have found it easier to simply tune by cycling through the available parameters to find the best combination.

However, one can apply some heuristics, such as (1) GPUs need to have tens of thousands of active threads for full performance, for memory-intensive tasks on the order of hundreds of thousand is preferred (2) thread blocks should have as few threads as possible while maximizing occupancy to improve granularity of execution, between 128 and 256 thread/block is usually a good starting point (3) individual kernels should not run too long (more than a 2 seconds), as imbalanced load scenarios can develop, but not too short (less than 20 ms) as startup/shutdown overhead can impact performance.

Note that similar (but usually, less severe) challenges in predicting performance exist on modern CPUs: this is why auto-tuning frameworks (e.g. ATLAS for BLAS) were invented more than two decades ago.

zadiN · January 2, 2017, 9:07pm

Is that okay to do? Just thought it would make sense to also post it here.

zadiN · January 2, 2017, 9:09pm

Is there any kind of equation I could formulate from these results, so that I now what block size will run the fastest for any given job size?

In my experience, in general: NO. You may want to consider creating and using an auto-tuning framework to find the fastest combination of configuration parameters for a given platform. Make sure that you achieve steady-state performance for each configuration, otherwise dynamic clock boosting is going to distort the experiments in unpredictable ways (best to use fixed application clocks!)

The reason there is no general “formula” is that the possible interactions between block scheduler, thread scheduler, memory address streams generated, external DRAM response behavior, internal buffering mechanisms, internal resource limits, clock boosting, etc are so numerous and complex, that it is isn’t possible to capture these in a closed-form formula.

For very simple code (e.g. a memory copy) it may be possible to find an approximate formula, but even there I have found it easier to simply tune by cycling through the available parameters to find the best combination.

However, one can apply some heuristics, such as (1) GPUs need to have tens of thousands of active threads for full performance, for memory-intensive tasks on the order of hundreds of thousand is preferred (2) thread blocks should have as few threads as possible while maximizing occupancy to improve granularity of execution, between 128 and 256 thread/block is usually a good starting point (3) individual kernels should not run too long (more than a 2 seconds), as imbalanced load scenarios can develop, but not too short (less than 20 ms) as startup/shutdown overhead can impact performance.

Note that similar (but usually, less severe) challenges in predicting performance exist on modern CPUs: this is why auto-tuning frameworks (e.g. ATLAS for BLAS) were invented more than two decades ago.

Thanks, yeah that all makes sense. Such a formula would be very complicated I’m sure, for now it seems that the best thing to do is to use large blocks and perferably 512 for most cases.

njuffa · January 2, 2017, 9:19pm

Cross-posting here and on Stackoverflow is fine (the philosophies of the respective sites are very different, these forums are discussion-friendly, for example), however it is much appreciated when pointers are provided at either end so as to avoid providing redundant information.

Topic		Replies	Views
Fewer threads per block = ... faster performance? CUDA Programming and Performance	9	129	December 31, 2024
Launch Parameters for Large Problems CUDA Programming and Performance cuda , kernel	13	2022	October 12, 2021
Grids and Threads question CUDA Programming and Performance	2	4422	August 7, 2007
Performance degradation as task size grows CUDA Programming and Performance	13	632	April 25, 2023
the relationship between the execution time and blocksize CUDA Programming and Performance	8	2181	July 4, 2017
need a help from employees or guys who know compiler well CUDA Programming and Performance	22	8622	December 18, 2008
well how do I know if cuda runs on the gpu CUDA Programming and Performance	20	13486	July 9, 2008
Slow Performance CUDA Programming and Performance	19	8127	November 24, 2008
CUDA perormances CUDA Programming and Performance	10	7130	January 22, 2008
increasing blokSize -> Faster or slower CUDA Programming and Performance	4	869	September 12, 2011

Block size's effect on program performance, why does my program run faster at seemingly random sizes?

Related topics