How to choose how many threads/blocks to have?

So I have a Jetson TX1, which as I understand has 2 SMs, each with 128 cores.
I read that per SM (and there are 2 of them) there can be a maximum of 16 active blocks and 64 active warps (i.e. 2048 active threads).

Now I have copied an example, which has block sizes and thread counts already chosen, but coming into it blind, how would I know how many to choose?

Do I want to reach both of these maxima for best performance, i.e. make it 128 threads per block? Or is it better to have more threads per block but then not reach the active-block maximum? Or even fewer threads per block?

Sorry - follow-up question - why are the maximums for blocks per SM and warps per SM the way they are? I know this is pretty much a “how does this work?” question, but I genuinely can’t find a decent answer in a concise form. A link would suffice.

Many thanks! :)

If performance depends on occupancy at all, the only metric that really matters is active warps. You want as close to 64 active warps as possible, all other factors being equal.

However, this does not necessarily mean that your code is somehow deficient if you do not have 64 active warps. OTOH a really low number of active warps, say fewer than 32, or fewer than 16, may be a strong indicator that occupancy (i.e. a low level of achieved occupancy) is a factor to consider in the performance of your code.

The maximums are hardware maximums. Each open block requires a certain amount of “state” to be maintained for it, so it’s not possible to create a HW design that supports an unlimited number of open blocks per SM. And it’s not desirable to burden the HW design with maintaining state for 64 blocks when 16 blocks will suffice for nearly all purposes - simply make sure to choose at least 128 threads per block for your code, if this aspect of performance/occupancy is an issue.

Therefore very small block sizes (e.g. 32 threads per block) may limit performance due to occupancy. Very large block sizes, for example 1024 threads per block, may also limit performance, if there are resource limits (e.g. registers-per-thread usage, or shared memory usage) which prevent 2 threadblocks (in this example of 1024 threads per block) from being resident on an SM.

Threadblock size choices in the range of 128 - 512 are less likely to run into the aforementioned issues. Usually there are not huge differences in performance for a code between, say, a choice of 128 threads per block and a choice of 256 threads per block. Due to warp granularity, it’s always recommended to choose a size that is a multiple of 32, and powers-of-2 threadblock size choices are also pretty common, but not necessary.

A good basic sequence of CUDA courses starts with a CUDA 101-type class, which will familiarize you with CUDA syntax, followed by an “optimization” class, which will teach the 2 most important optimization objectives:

  1. Choosing enough threads to saturate the machine and give it the best chance to hide latency
  2. Making efficient use of the memory subsystem(s)

Such classes/presentations can be readily found by searching on e.g. “gtc cuda optimization”

Many thanks for your reply. That clears up a lot. :)

I actually had a very similar issue / question. I followed a relatively detailed table collecting information on individual CUDA-enabled GPUs, available at: https://en.wikipedia.org/wiki/CUDA (mid-page). I use a 780 Ti for development work (compute capability 3.5) and have been looking for any indication of how to select optimum values for the block size and thread count for my application. At this time, I have settled (through trial and error) on 1024 threads and 64 blocks, but that only gives me ~95% execution success. Sometimes the application just crashes for no reason at all. What I am trying to do is obviously squeeze every single cycle out of the GPU for compute purposes.

Looking at the referenced Wiki page, for my GPU, I can see that the parameter “Maximum number of threads per block” is equal to 1024 (the value I use already in my application) and then maximum block sizes are listed as “Maximum x-dimension of a grid of thread blocks” = 2^31-1, and “Maximum y-, or z-dimension of a grid of thread blocks” = 65535. This makes no sense to me in any way. At the same time, “Maximum number of resident grids per device” = 32, which seems to be closer to the stable operating conditions (1024 x 32) I observe through trial and error.

To help clarify the concepts, I spent the better part of the last several days going through white papers, guides, implementation examples, etc., and still there is no single reference (apart from the Wiki page) where the information is collected in any organized fashion. If there is a better reference, please let me know. Otherwise, which of the values from the Wiki table should be taken as the maximums for thread and block count?

As a bonus, is there any way to discover these values at execution time and set them dynamically? I am doing development on a 780 Ti, but at the end of the day, execution will be done on a cluster of V100s, so I am trying to make all possible parameters discoverable dynamically at execution time (CPU type, number of cores, threads, GPU type, etc.).
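On the run-time discovery question: these limits can be queried through the CUDA runtime API with cudaGetDeviceProperties(), so the same binary can adapt itself from a 780 Ti to a V100. A minimal sketch (error handling omitted; device 0 is assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("name:              %s\n",  prop.name);
    printf("SM count:          %d\n",  prop.multiProcessorCount);
    printf("max threads/block: %d\n",  prop.maxThreadsPerBlock);
    printf("max threads/SM:    %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("max grid dims:     %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("shared mem/block:  %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

The `deviceQuery` sample shipped with the CUDA toolkit prints the full set of these properties and is a good reference for the available fields.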

Many thanks in advance and I hope this makes sense …

Well, you should probably debug that rather than looking for a special set of operating conditions to avoid the failure. If you don’t know what is causing the failure, you don’t really know if you have a fix.

A CUDA kernel launch:

mykernel<<<A,B,C,D>>>(…);

has a set of launch configuration parameters, contained in the triple-chevron <<<…>>> syntax. The first parameter (A) is the number of blocks to launch, expressed as a dim3 (3-dimensional) variable. The second parameter (B) is the number of threads per block, which can also be expressed 3-dimensionally. For a thread block, you have a limit on the total number of threads (1024) as well as a limit on each dimension. The total number of threads in a block is the product of the 3 thread block dimensions. The set of all blocks associated with a kernel launch is referred to as the grid. As already mentioned, the grid size is expressed using the first kernel launch config parameter, and it has relevant limits for each dimension, which is where the 2^31-1 and 65535 numbers are coming from.

This refers to concurrent kernels. Probably you are not dealing with concurrent kernels. There is a 1:1 correspondence between a kernel launch and its appropriate grid, so having multiple grids resident means concurrent kernels.

Take a look at the CUDA concurrent kernels sample code.

Regarding not being able to find answers about this, it is fairly basic CUDA 101 type info. Here is an introductory CUDA talk that will expose you to the difference between threads and blocks (which you seem to have not grasped):

http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf

Here’s a recent one that also covers threads, blocks and grid dimensions:

https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/

As far as I can see, multiple issues are getting conflated here.

(1) There are architecture-dependent, hardware-imposed, limits on grid and block dimensions. There are also other architecture-dependent resource limits, e.g. on shared memory size or register usage. These are documented in the CUDA Programming Guide.

(2) Within the limitations imposed by hardware, what thread and block configuration results in the highest performance for a given GPU depends on the code that is being run. Block configuration in particular interacts with other resource limits in terms of occupancy. There is no universal formula for the “best” configuration, which explains why you cannot find one online or in published articles. Some basic heuristics for reasonable performance in many use cases are: 10K+ total threads, 500+ blocks, 128-256 threads/block.

One can find the “optimal” configuration for a given code on a given GPU by experimentation, in particular an automated search of the space of possible configurations. Such an auto-tuning approach has been used by widely-used applications on CPUs for at least 20 years, e.g. ATLAS and FFTW.

(3) Supposed stability issues are most often the result of unidentified software bugs, in particular race conditions and out-of-bounds accesses. cuda-memcheck is a tool for first-line defense against these. Actual instability due to hardware does happen occasionally, most often on consumer-grade GPUs (e.g. lack of ECC), and in particular vendor-overclocked GPUs that run at frequencies higher than NVIDIA’s reference cards. The GTX 780Ti seems to be a special case, however, because it is the only GPU specifically advised against by the AMBER project because of stability issues:

http://ambermd.org/gpus/

Thank you, njuffa - in other words, the 780 Ti is good for gaming, but not really advisable for any serious dev work in CUDA, if I read that right. I just checked the vendor and it is PNY. Just my luck :(

All the stability issues I am observing might then be related to the said flaw in the GPU itself and not to any issue in the code.

I have no specific insights into the GTX 780 Ti as I have never used that GPU. I simply pointed to what looked like potentially pertinent information about an issue from people who I believe have reasonable insights into that issue.

There is general risk when using vendor-overclocked parts (regardless of the vendor): While the vendors appear to guarantee proper operation for graphics applications, I have seen no information that gives me reason to believe that compute applications are part of their qualification process. Graphics applications (and games in particular), tend to have a different usage profile relative to the various functional units in a GPU than compute applications. Also, any minor errors in a graphics application will likely last for the duration of a frame and will never be noticed, while errors may propagate in compute applications.

The amount of risk differs by the nature of the computation (e.g. Monte Carlo computations may tolerate an occasional error, as it contributes very little to the final result), and also by the aggressiveness of the vendor-provided overclocking. Some vendors appear to provide up to three levels of overclocking: mild overclocking, ambitious overclocking, and insane overclocking, usually readily distinguished by price level. From what I have seen, GPUs with only mild overclocking tend to be stable for most compute applications.

Excessive heat and especially insufficient power supply can also contribute to hardware instability (modern processors tend to have occasional power spikes of very short duration), as does operation in an environment with lots of electromagnetic noise (e.g. factory floor) or with increased radiation (e.g. extreme altitudes), which can affect the reliability of DRAM.

Overall, instances of true hardware instability seem to be much rarer than cases of latent software bugs. For example, with race conditions, software may seem to work perfectly on one GPU for months on end, only to fail once moved to a slightly different GPU model. Running cuda-memcheck provides good first-line protection against such issues, but it cannot find all bugs.

The 780 Ti was indeed the primary reason for stability issues on my system. Last night I replaced it with a 1080 (NVIDIA-manufactured; I made sure it comes with stock settings) and right now all the crashes I was experiencing before are just gone. Thank you njuffa - that was a very good pointer and I would never even have looked for a hardware-specific issue as far as the GPU goes. Lesson learned and KUDOS to you.

I am back and still struggling with memory access errors for some reason. For varGpuBlockSize = 256 * 512 = 131072 (256 blocks are used, 512 threads per block), with varDelimiterSize = 66 (length of the bit sequence) and parallelisation factor varGpuBlockMultiplier = 64, I allocate 553,648,128 bool elements (at sizeof(bool) = 1 byte, that is ~528 MB of memory).

bool * varSequenceDelimiter = NULL;
cudaMallocManaged(&varSequenceDelimiter, varGpuBlockMultiplier * varGpuBlockSize * varDelimiterSize * sizeof(bool));
subIntializeDelimiters <<<varGpuBlockCount, varGpuThreadCount>>> (varSequenceDelimiter, varDelimiterSize, varSpaceSizeReserved, varGpuBlockCount, varGpuThreadCount, varGpuBlockSize, varGpuBlockMultiplier);
cudaDeviceSynchronize();

The kernel is very simple: it just goes through the individual thread block instances and parallel combinations of each thread block and displays parameters, not even modifying memory content.

__global__ void subIntializeDelimiters(bool * varSequenceDelimiter, int varDelimiterSize, int varSpaceSizeReserved, int varGpuBlockCount, int varGpuThreadCount, int varGpuBlockSize, int varGpuBlockMultiplier)
{
	// repeat the process within each of the parallel GPU block passes
	for (int varGpuBlock = 0; varGpuBlock < varGpuBlockMultiplier; varGpuBlock++)
	{
		// calculate the relative position start for this thread
		unsigned long long int varElementNumber = varGpuThreadCount * blockIdx.x + threadIdx.x + varGpuBlockSize * varGpuBlock;
		unsigned long long int varPositionStart = varDelimiterSize * varElementNumber;

		printf("\n[B%d,T%d,BC%d] - position start: %llu for element: %llu", blockIdx.x, threadIdx.x, varGpuBlock, varPositionStart, varElementNumber);
	}
}

there are 512 * 256 = 131072 threads in total = thread number range then [0;131071], calculated as: varGpuThreadCount * blockIdx.x + threadIdx.x
thread number min => 512 * 0 + 0 = 0 (OK)
thread number max => 512 * 255 + 511 = 131071 (OK)

The varElementNumber (element number) is calculated as a function of the thread number and the parallelisation factor varGpuBlock: (varGpuThreadCount * blockIdx.x + threadIdx.x) + varGpuBlockSize * varGpuBlock. With a parallelisation factor of 64, there are 131072 * 64 = 8,388,608 elements in total to examine, ranging [0;8,388,607].

thread block 1 start => (512 * 0 + 0) + 131072 * 0 = 0 (OK)
thread block 1 end => (512 * 255 + 511) + 131072 * 0 = 131071 (OK)
thread block 2 start => (512 * 0 + 0) + 131072 * 1 = 131072 (OK)
thread block 2 end => (512 * 255 + 511) + 131072 * 1 = 262143 (OK)
thread block 3 start => (512 * 0 + 0) + 131072 * 2 = 262144 (OK)
thread block 3 end => (512 * 255 + 511) + 131072 * 2 = 393215 (OK)

thread block 64 end => (512 * 255 + 511) + 131072 * 63 = 8,388,607 (OK)

Each element is 66 bits long, so for each element in each thread block and for each parallel block: varDelimiterSize * varElementNumber
thread block 1 start => 66 * [(512 * 0 + 0) + 131072 * 0] = 0 (OK)
thread block 1 end => 66 * [(512 * 255 + 511) + 131072 * 0] = 8,650,686 (OK)
thread block 2 start => 66 * [(512 * 0 + 0) + 131072 * 1] = 8,650,752 (OK)
thread block 2 end => 66 * [(512 * 255 + 511) + 131072 * 1] = 17,301,438 (OK)
thread block 3 start => 66 * [(512 * 0 + 0) + 131072 * 2] = 17,301,504 (OK)
thread block 3 end => 66 * [(512 * 255 + 511) + 131072 * 2] = 25,952,190 (OK)

thread block 64 end => 66 * [(512 * 255 + 511) + 131072 * 63] = 553,648,062 (OK)

The math for calculating position within the 553,648,128-element block allocated to the kernel works then perfectly fine. However, when running the code with Nsight and the memory checker enabled, this is all I get:

CUDA context created : 1dd4917bea0
CUDA module loaded: 1dd5dd77b20 kernel.cu
CUDA grid launch failed: CUcontext: 2049925693088 CUmodule: 2050273803040 Function: _Z22subIntializeDelimitersPbiiiiii
CUDART error: cudaLaunch returned cudaErrorLaunchFailure

CUDART error: cudaDeviceSynchronize returned cudaErrorLaunchFailure

CUDART error: cudaGetLastError returned cudaErrorLaunchFailure

Here is the output from the cout

=========================

GPU Device 0: “GeForce GTX 1080” with compute capability 6.1

Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used

CUDA kernel launch (initialize delimiters) with 256 blocks and 512 threads per block, thread block multiplier 64

[B39,T32,BC0] - position start: 1320000 for element: 20000
[B39,T33,BC0] - position start: 1320066 for element: 20001
(… truncated, long output … )
[B26,T157,BC0] - position start: 888954 for element: 13469
[B26,T158,BC0] - position start: 889020 for element: 13470
[B26,T159,BC0] - position start: 889086 for element: 13471

The position start and element number calculations are correct, and given that there is not even direct memory access into the memory allocated to the kernel, there should be no overrun problem at all. However, the kernel still returns a launch failure for some reason.

Any ideas what the issue might be?

thanks !

compile your code with -lineinfo switch
then run your code with cuda-memcheck from a command line/command prompt

inspect the output
if necessary (ie. a kernel code issue) follow the methodology outlined here:

https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218

thank you

I followed this article (http://docs.nvidia.com/nsight-visual-studio-edition/3.2/Content/CUDA_Properties_Config.htm) and set Generate Line Number Information to Yes

but then in compiler output I get this

1>CUDACOMPILE : nvcc warning : '--device-debug (-G)' overrides '--generate-line-info (-lineinfo)'

I assume this means I need to disable --device-debug (-G) to have access to the -lineinfo option? The referenced website indicates that “If Generate GPU Debug Information is on (-G), line information (-lineinfo) is automatically generated as well.”, so it seems it is already enabled and does not need to be set manually for the project. Is that correct?

and the launch with cuda-memcheck produced this output

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuTexRefSetAddress + 0x2cae4e) [0x2d849b]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\cudart64_90.dll (cudaDeviceSynchronize + 0x10e) [0x1b22e]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (subCalculate + 0x7a3) [0x5aa3]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (main + 0x48a) [0x698a]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (invoke_main + 0x34) [0x200d4]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main_seh + 0x127) [0x1ffc7]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main + 0xe) [0x1fe8e]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (mainCRTStartup + 0x9) [0x20169]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x11fe4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6ef91]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaGetLastError.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuTexRefSetAddress + 0x2cae4e) [0x2d849b]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\cudart64_90.dll (cudaGetLastError + 0x107) [0x1cf07]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (subCalculate + 0x7a8) [0x5aa8]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (main + 0x48a) [0x698a]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (invoke_main + 0x34) [0x200d4]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main_seh + 0x127) [0x1ffc7]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (__scrt_common_main + 0xe) [0x1fe8e]
========= Host Frame:c:\Users\user\source\repos\burst-delimiter-finder-cuda\x64\Debug\burst-delimiter-finder-cuda.exe (mainCRTStartup + 0x9) [0x20169]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x11fe4]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6ef91]

========= ERROR SUMMARY: 2 errors

If you are creating a debug build, you don’t need to specify -lineinfo. The equivalent of -lineinfo is already included in a debug build.

My best guess right now would be that your program is hitting the Windows kernel timeout mechanism.

Try reducing the size of the kernel until this issue goes away (e.g. reduce the number of blocks launched.) Then profile the code to estimate kernel duration. If it is long (more than half a second, or so), there’s a good chance the larger/failing case is hitting the timeout.

Debug build kernels run slower (usually) than release builds, so switching to a release build may drop your kernel execution time under the limit, if you are hitting that issue here.

If I were to read this literally, the problem is with the cudaDeviceSynchronize function called immediately after the kernel is launched

subIntializeDelimiters <<<varGpuBlockCount, varGpuThreadCount>>> (varSequenceDelimiter, varDelimiterSize, varSpaceSizeReserved, varGpuBlockCount, varGpuThreadCount, varGpuBlockSize, varGpuBlockMultiplier);
cudaDeviceSynchronize();

and then with the following cudaGetLastError call to confirm execution completed just fine - is it possible that execution of the kernel just did not finish by the time cudaDeviceSynchronize was called, and that this is the cause of the error?

// process any CUDA kernel errors
cudaError_t varCudaError = cudaGetLastError();
if (varCudaError != cudaSuccess)
{
	std::cout << "Failed to launch subIntializeDelimiters kernel (error code: " << cudaGetErrorString(varCudaError) << ")!" << std::endl;
	exit(EXIT_FAILURE);
}

I posted something before this double-posting of yours. You may want to go back and look at it.

Sorry about the double post - for some reason, website gave me back error 502 when posting so I was not sure if the message made it through.

I managed to get no errors on launch with 256 threads per block and 16 blocks in debug mode. The release version of the same code is no better. If I move past 16 blocks, I get the very same error.

Is this trial-and-error method the best I can do to guesstimate the optimum thread block size for the calculation process? It seems a crude approach at best.

You don’t seem to have read what I posted before your double post.

It sounds to me like you’re hitting the timeout. Did you confirm that with an actual kernel duration measurement? What is the duration of the no-errors case (256 threads per block, 16 blocks)? Also, if you have things like printf in your kernel, you presumably want to get rid of that junk.

If you have confirmed that the WDDM TDR mechanism is the issue, then you’ll need to make sure your kernels are short (less than ~2 seconds), or lengthen or disable the timeout.

If you actually intend to run this code in a production environment where the code will run on a GPU that is hosting a display on Windows and is subject to the WDDM TDR timeout, then for several reasons you may want to target something like 100 ms or less for typical kernel duration. Since this is well under the ~2 s limit, you don’t need a trial-and-error method. You can run your kernel with a range of parameters that target this general area, and then use the profiler to study the actual duration.
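Besides the profiler, kernel duration can also be measured in-program with CUDA events. A minimal sketch (the empty kernel and the 256x512 launch configuration are placeholders; substitute your own):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel() {}   // placeholder: substitute the kernel under test

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mykernel<<<256, 512>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);         // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel duration: %.3f ms\n", ms);  // aim well under the ~2 s TDR limit

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```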

I did read it, but writing my very first CUDA code ever, I do not get everything right the very first time. I do apologize for trying your patience here. I am certainly not doing that on purpose.

My assumption was that to improve code performance, I need to stay within the kernel and process as much data as I possibly can before dropping back to CPU processing to decide what gets saved into the file and what does not. It seems that there is a limit to how much I can process per single kernel execution before the TDR mechanism kicks in.

I am doing development on a machine with TDR enabled (it is my primary display card as well) but target is to run the code on headless environment.

I hope it is not too much to ask to point me in the direction of how to measure kernel duration. I assume it is the CUDA profiler executed from the command line in some way?

Both nvprof and nvvp are profilers that can be run from the command line both on windows and linux. For a quick kernel duration measurement, nvprof is probably easiest. From command line:

nvprof --print-gpu-trace myapp.exe

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#abstract

The nsight VSE tool you are using also has a built-in profiler.

http://docs.nvidia.com/gameworks/index.html#developertools/desktop/nsight/analysis_tools_overview.htm%3FTocPath%3DDeveloper%2520Tools%7CDesktop%2520Developer%2520Tools%7CNVIDIA%2520Nsight%2520Visual%2520Studio%2520Edition%7CNVIDIA%2520Nsight%2520Visual%2520Studio%2520Edition%25205.4%7CAnalysis%2520Tools%7C_____0