How to choose how many threads/blocks to have?

Thank you - that helps me a lot !

Thank you, again

I went through the profiling for one longer sequence length (132 bits) and now I can clearly see the kernel times for individual calls. In this case, subIntializeDelimiters is the longer of the two - it is called only once, but takes 756.14ms to complete. The actual calculation kernel (subCalculateDelimiter) is shorter, with an average execution time of 320.03ms.

At least I have a way to isolate the source of the problem. Thank you for that !

==5592== NVPROF is profiling process 5592, command: burst-delimiter-finder-cuda.exe 132 32 4 4 512 256 64 out 100
==5592== Profiling application: burst-delimiter-finder-cuda.exe 132 32 4 4 512 256 64 out 100
==5592== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.69%  32.0029s       100  320.03ms  295.67ms  344.75ms  subCalculateDelimiter(bool*, bool*, bool*, int*, int*, int*, int, int, int, int, int, int, int, int)
                    2.31%  756.14ms         1  756.14ms  756.14ms  756.14ms  subIntializeDelimiters(bool*, int, int, int, int, int, int)
      API calls:   95.94%  32.7885s       101  324.64ms  295.97ms  756.33ms  cudaDeviceSynchronize
                    1.82%  623.28ms         6  103.88ms  3.1408ms  405.31ms  cudaMallocManaged
                    1.70%  582.45ms       101  5.7668ms  78.456us  9.2648ms  cudaLaunch
                    0.52%  176.57ms         5  35.313ms  820.75us  158.72ms  cudaFree
                    0.01%  3.1257ms         4  781.44us  751.58us  827.79us  cudaGetDeviceProperties
                    0.00%  865.27us      1407     614ns       0ns  4.8040us  cudaSetupArgument
                    0.00%  812.74us        94  8.6460us       0ns  377.55us  cuDeviceGetAttribute
                    0.00%  511.40us       101  5.0630us  1.9220us  44.511us  cudaConfigureCall
                    0.00%  249.14us         1  249.14us  249.14us  249.14us  cuDeviceGetName
                    0.00%  212.63us       101  2.1050us  1.9210us  2.8820us  cudaGetLastError
                    0.00%  17.293us         1  17.293us  17.293us  17.293us  cudaSetDevice
                    0.00%  14.731us         1  14.731us  14.731us  14.731us  cuDeviceTotalMem
                    0.00%  4.1630us         3  1.3870us     320ns  2.8820us  cuDeviceGetCount
                    0.00%  3.5220us         1  3.5220us  3.5220us  3.5220us  cudaGetDevice
                    0.00%  2.5610us         2  1.2800us     320ns  2.2410us  cudaGetDeviceCount
                    0.00%  2.2410us         2  1.1200us     640ns  1.6010us  cuDeviceGet

==5592== Unified Memory profiling result:
Device "GeForce GTX 1080 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       1  4.0000KB  4.0000KB  4.0000KB  4.000000KB  14.73100us  Host To Device
   25359  32.305KB  4.0000KB  1.0000MB  800.0352MB  631.7204ms  Device To Host

… coming back with some more questions, this time after having optimized the code to lower the kernel times to 146.61ms for subCalculateDelimiter and 98.083ms for subIntializeDelimiters. I am using 1024 threads per block and 1024 blocks on a GTX 1080, with a thread block multiplier of 8 and a 132 bit sequence. That gives me a 132MB memory block. No issues executing here.

==14332== NVPROF is profiling process 14332, command: burst-delimiter-finder-cuda.exe 132 32 4 4 1024 1024 8 out 10000
==14332== Profiling application: burst-delimiter-finder-cuda.exe 132 32 4 4 1024 1024 8 out 10000
==14332== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.95%  1.5e+03s     10000  146.61ms  133.50ms  194.08ms  subCalculateDelimiter(bool*, bool*, bool*, int*, int*, int*, int, int, int, int, int, int, int, int)
                    0.05%  784.66ms         8  98.083ms  93.781ms  102.67ms  subIntializeDelimiters(bool*, int, int, int, int, int, int)
      API calls:   95.03%  1.5e+03s     10008  146.98ms  94.175ms  196.88ms  cudaDeviceSynchronize
                    4.86%  75.2824s     10008  7.5222ms  75.573us  45.269ms  cudaLaunch
                    0.08%  1.20804s         6  201.34ms  5.9828ms  870.81ms  cudaMallocManaged
                    0.02%  273.18ms         5  54.637ms  1.2579ms  253.68ms  cudaFree
                    0.00%  69.919ms    140056     499ns       0ns  1.0282ms  cudaSetupArgument
                    0.00%  40.425ms     10008  4.0390us  2.2410us  2.2057ms  cudaConfigureCall
                    0.00%  29.990ms     10001  2.9980us     961ns  355.77us  cudaGetLastError
                    0.00%  3.0896ms         4  772.39us  757.34us  782.32us  cudaGetDeviceProperties
                    0.00%  909.45us        94  9.6740us       0ns  436.79us  cuDeviceGetAttribute
                    0.00%  578.01us         1  578.01us  578.01us  578.01us  cuDeviceGetName
                    0.00%  19.533us         1  19.533us  19.533us  19.533us  cuDeviceTotalMem
                    0.00%  16.652us         1  16.652us  16.652us  16.652us  cudaSetDevice
                    0.00%  5.1230us         3  1.7070us     640ns  2.8820us  cuDeviceGetCount
                    0.00%  3.5230us         1  3.5230us  3.5230us  3.5230us  cudaGetDevice
                    0.00%  2.8830us         2  1.4410us     321ns  2.5620us  cudaGetDeviceCount
                    0.00%  2.2410us         2  1.1200us     640ns  1.6010us  cuDeviceGet

==14332== Unified Memory profiling result:
Device "GeForce GTX 1080 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       1  4.0000KB  4.0000KB  4.0000KB  4.000000KB  13.12900us  Host To Device
 2559768  32.003KB  4.0000KB  1.0000MB  78.12531GB  88.415307s  Device To Host

I was attempting to increase the thread block multiplier from 8 to 16 (that should have increased the memory allocation from 132MB to 264MB), keeping the block count, thread count, and bit sequence length intact. However, initialization failed. I ran memcheck with debug enabled to verify what was going on, and the only thing I got back was this

CUDA kernel launch (initialize delimiters) with 1024 blocks and 1024 threads per block, thread block multiplier 16
Failed to launch subIntializeDelimiters kernel (error code: an illegal memory access was encountered)!
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

This is the call I used: cuda-memcheck --print-level info burst-delimiter-finder-cuda.exe 132 32 4 4 1024 1024 16 out 512 > result-finder.txt ---- I tried to increase the print level to info to make sure all possible output is caught, but it did not improve anything. I looked through the memcheck documentation at http://docs.nvidia.com/cuda/cuda-memcheck/index.html, but I do not see an option to extract further details on why the allocation fails. You mentioned in one message / thread that it should be possible to allocate GBs worth of memory with no issues and act on this memory block, but it seems I am bumping into some sort of limit …

Any tools I can use / pointers to what the issue might be? Thanks in advance !

this is what the memory allocation on the GPU looks like with "burst-delimiter-finder-cuda.exe 132 32 4 4 1024 1024 >>>8<<< out 10000" (note the thread block multiplier of 8)

Pointer: Consider how programmers debugged code before there were tools like valgrind or cuda-memcheck, or even gdb.

I already ran through the code step by step, displaying all calculated values, and I do not have any memory overruns. When I remove all accesses to the allocated memory (I do not perform any operations on the allocated block), it works fine. Since I do not overrun the allocated memory block, is there anything else that would prevent a larger block from being used? I noticed that my card has 106MB of hardware reserved memory, but that does not seem to coincide with the value for a successful run (132MB) either …

OK, I moved the error reporting point to after each subIntializeDelimiters execution

for (int varGpuBlockMultiplierLocal = 0; varGpuBlockMultiplierLocal < varGpuBlockMultiplier; varGpuBlockMultiplierLocal++)
{
	// call the kernel for specific thread block numbers
	printf("\n[B%d,T%d] - thread block multiplier: %d", varGpuBlockCount, varGpuThreadCount, varGpuBlockMultiplierLocal); // %d, not %llu: varGpuBlockMultiplierLocal is an int
	subIntializeDelimiters <<<varGpuBlockCount, varGpuThreadCount >>> (varSequenceDelimiter, varDelimiterSize, varSpaceSizeReserved, varGpuBlockCount, varGpuThreadCount, varGpuBlockSize, varGpuBlockMultiplierLocal);
	cudaDeviceSynchronize();
	// process any CUDA kernel errors 
	cudaError_t varCudaError = cudaGetLastError();
	if (varCudaError != cudaSuccess)
	{
		std::cout << "Failed to launch subIntializeDelimiters kernel (error code: " << cudaGetErrorString(varCudaError) << ")!" << std::endl;
		exit(EXIT_FAILURE);
	}
}

and this is what I get when I run the call: “cuda-memcheck --print-level info burst-delimiter-finder-cuda.exe 132 32 4 4 1024 1024 16 out 512 > result-finder.txt”

GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used

CUDA kernel launch (initialize delimiters) with 1024 blocks and 1024 threads per block, thread block multiplier 16

[B1024,T1024] - thread block multiplier: 0
Failed to launch subIntializeDelimiters kernel (error code: out of memory)!
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

With a 132 bit long sequence, a 1024 x 1024 thread block size, and a thread block multiplier of 16, I have a 264MB memory block allocated. The card has 8GB of memory on it, so why would it give me grief about a measly 264MB allocation …

I added memory display information (subPrintMemoryInformation) to verify what the memory condition is before the first memory block is allocated

// calculate total, used, and free GPU memory 
void subPrintMemoryInformation()
{
	// get information about the total and free memory of the GPU
	size_t varMemoryFreeByte;
	size_t varMemoryTotalByte;
	cudaError_t varGpuStatus = cudaMemGetInfo(&varMemoryFreeByte, &varMemoryTotalByte);
	if (varGpuStatus != cudaSuccess)
	{
		printf("\n<subPrintMemoryInformation> ERROR: cudaMemGetInfo fails, %s\n", cudaGetErrorString(varGpuStatus));
		exit(1);
	}

	double varMegaByte = 1024 * 1024.0;
	printf("\n<subPrintMemoryInformation> RESULT: GPU memory usage: used = %.2f MB, free = %.2f MB, total = %.2f MB\n", (varMemoryTotalByte - varMemoryFreeByte) / varMegaByte, varMemoryFreeByte / varMegaByte, varMemoryTotalByte / varMegaByte);
}

Before allocating the memory block for varSequenceDelimiter, this is the output I get

<subPrintMemoryInformation> RESULT: GPU memory usage: used = 1341.40 MB, free = 6850.60 MB, total = 8192.00 MB

Then I allocate varSequenceDelimiter

bool * varSequenceDelimiter = NULL;
cudaMallocManaged(&varSequenceDelimiter, varGpuBlockMultiplier * varGpuBlockSize * varDelimiterSize * sizeof(bool));

and then the GPU memory is

<subPrintMemoryInformation> RESULT: GPU memory usage: used = 2397.46 MB, free = 5794.54 MB, total = 8192.00 MB

This does not make much sense to me. It seems like 1GB was allocated, but my math shows a much smaller block is needed: 8 * 1024 * 1024 * 132 * sizeof(bool) = 132MB. Why then is ~1GB (2397.46 MB - 1341.40 MB = 1056.06 MB) taken out of the GPU memory space? This makes very little sense to me.

https://stackoverflow.com/questions/4897844/is-sizeofbool-defined claims that sizeof(bool) is 1 byte, and not 1 bit as I was assuming … In that case varSequenceDelimiter becomes 8 * 1024 * 1024 * 132 * sizeof(bool) = 1056 MB, which confirms the size of the memory allocation. Storing each bit in a full byte is not very efficient, so I will work on lowering that down to a more manageable size.

In the meantime, it seems like I cannot allocate 2112 MB in a single block, though. 16 * 1024 * 1024 * 132 * sizeof(bool) = 2112 MB, and this seems to flunk on

std::cout << std::endl << "CUDA kernel launch (initialize delimiters) with " << varGpuBlockCount << " blocks and " << varGpuThreadCount << " threads per block, thread block multiplier " << varGpuBlockMultiplier << std::endl;
subPrintMemoryInformation();

bool * varSequenceDelimiter = NULL;
cudaMallocManaged(&varSequenceDelimiter, varGpuBlockMultiplier * varGpuBlockSize * varDelimiterSize * sizeof(bool));
subPrintMemoryInformation();

produces

// before cudaMallocManaged call
<subPrintMemoryInformation> RESULT: GPU memory usage: used = 1341.40 MB, free = 6850.60 MB, total = 8192.00 MB
// after cudaMallocManaged call 
<subPrintMemoryInformation> RESULT: GPU memory usage: used = 1341.40 MB, free = 6850.60 MB, total = 8192.00 MB

so the memory allocation failed in this case. The largest contiguous pool I managed to allocate is 1980 MB (~2GB): >>>15<<< * 1024 * 1024 * 132 * sizeof(bool). Is there any limit on the memory block size that cudaMallocManaged can allocate? I looked through the documentation but it is silent on this point.

// before cudaMallocManaged call
<subPrintMemoryInformation> RESULT: GPU memory usage: used = 1341.40 MB, free = 6850.60 MB, total = 8192.00 MB
// after cudaMallocManaged call 
<subPrintMemoryInformation> RESULT: GPU memory usage: used = 3321.46 MB, free = 4870.54 MB, total = 8192.00 MB

Sounds like you’re running into a limitation on what an int variable can store.

You should be checking the return code from every CUDA API call. Any time you have trouble with a CUDA code, you should be doing proper CUDA error checking. Google “proper CUDA error checking” and apply it to your code.

As a defensive strategy to avoid running into integer overflow during size computations for memory allocations, it is best to start such an expression with the sizeof() part, because that returns a size_t, forcing the rest of the expression to be evaluated using size_t as well. E.g.

sizeof(bool) * varGpuBlockMultiplier * varGpuBlockSize * varDelimiterSize

Thank you - after some back and forth, I have optimized the code and now can push the GPU to close to 100% utilization when running calculations. Lots learned in the process !!!

very interesting thread.

I’m a new CUDA C programmer. I took a program that usually runs on the CPU and changed it to run on a GPU (my CUDA test environment has a Gigabyte Aorus GTX 1060 6G card).

For the test I changed a program that has about 52920 “optimization processes”. I run a “block” of 32768 “processes” first, and then the rest (20152). Configuring smaller blocks takes longer (I tested half, 16384 “processes”).

Then I tested my program with different numbers of threads in the kernel call (the number of blocks changed based on the number of threads). The threads don’t need to share anything between them, so I don’t care about the number of threads executed.

My program fills some arrays with data, then calls the kernel, and after that executes a cudaMemcpy to get the results. Profiling with nvprof tells me that 99.94% of the time is spent in the cudaMemcpy API call.

Each thread lasts less than 2 seconds. Each thread executes adds, subs, muls and divs with float32 numbers.

Time taken by the program by number of threads (CPU and GPU):

cpu i5 4xxx    4 threads   3591 sec
cpu fx8350     8 threads   2205 sec
gpu 1060 6G   64 threads   5795 sec
gpu 1060 6G   32 threads   2844 sec
gpu 1060 6G   16 threads    938 sec
gpu 1060 6G    8 threads   1325 sec
gpu 1060 6G    4 threads   2286 sec
gpu 1060 6G    2 threads   3056 sec

The fastest option for my process is executing 16 threads (somewhere I read that the number of threads should be divisible by 32, but not in this case).

When running the program, nvidia-smi tells me that the GPU is at 100% utilization (good), but the performance state is set at P2 (normal?)

Does all of this make sense? Other tests with simpler threads gave me a 20x performance gain (GPU vs CPU). Is a 2.35x gain normal? (It seems too low to me.)

Thanks in advance

If your program does relatively little processing on the GPU, performance may be limited by PCIe throughput. You should be able to identify such a scenario from the profiler output. For best performance with CUDA, data transfers between CPU and GPU should be minimized. Data should remain resident in GPU memory for as long as possible.

I assume the 1060 in question is a Pascal-family GTX 1060. I seem to recall that for some GPUs in that family, P2 is the normal full-performance state. Just to be sure you are not limited by power or thermal capping, make sure there is adequate power supply or cooling.

What kind of speedup you can expect from porting code from CPU to GPU depends very much on the specifics of the task at hand. When comparing high-end GPU to high-end CPU, typical application-level advantage for a GPU implementation is 2x to 10x, with an average of 5x. Use the profiler to guide your optimization efforts.

thanks for your reply… in my tests the GPU is at 100% during almost all of the test (the PCIe throughput in my scenario is not a bottleneck, I think: I transfer about 200MB from host to device at the beginning, and receive 16MB of results at the end).

My GTX has three fans, and the GPU temperature is about 60ºC (with fans at 37%) during the test. I don’t know about the power supply, but I suppose it is fine for this computer. Let’s suppose that P2 is my normal full-performance state…

The main doubt: how should I select the number of threads and blocks to maximize performance? In my last test I got the best performance with 16 threads… in a test I did before (with a similar process, but a simpler CUDA kernel), I got the best result with 64 threads… changing the number of threads changes a lot the time needed for the same process…

On the CPU I get the best performance by setting the number of threads to the number of threads specified in the CPU datasheet… but on the GPU I don’t understand the logic.

How should I select the number of threads? Or should I test different values empirically for each process/GPU?

Thanks in advance.

Choosing an appropriate thread / block configuration and the assignment of threads to data objects is one of the major design tasks when writing CUDA code. You may want to create scaffolding that lets you play with different configurations easily, or even explore the design space in an automated fashion, similar to what some CPU-based libraries do.

Some heuristics I use: Map threads to output data items. Default to one output data item per thread. Each thread collects the necessary input data as required. Threads may cooperate in collecting input data if there is natural overlap, by utilizing shared memory. Choose an initial thread count per block between 128 and 256, and use as many blocks as are needed to cover the data.

thanks again for your reply.

On some CPUs there are limitations on what each thread of a core can do (for example, some CPUs share one floating-point unit per core across all threads).

Is there any document that explains similar limitations on GPUs (in order to try to optimize the program)? Or does anyone have a recommendation on what to try in order to increase the number of simultaneous threads?

Hi, I’m going to reply to myself.

A good place to begin optimizing a CUDA program is the following link: https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#cuda-best-practices

Yes, there are limits on the number of various resources per GPU and per SM…

I think the main reason my program is “slow” on the GPU is that each thread reads a big array (about 200MB), and based on that array and some input parameters it computes some values… As the array is too big to fit in any memory cache, my program goes to device memory… and then it does not comply with one of the main recommendations for optimizing CUDA programs: minimize access to global memory…

I will think of another way to optimize my program, but it seems difficult.

While it is true that the fastest access to memory is the one you don’t have to make, I think this statement is a bit too strong.

If we look at the performance advantage GPUs have over CPUs, we find that the advantage in computational throughput is about a factor 5x to 10x, while the advantage in terms of memory bandwidth is about 5x. So while the advantage in computational throughput is larger, the advantage in memory throughput is still useful to achieve meaningful acceleration.

That advantage holds as long as the accesses to global memory are regular and follow a few basic rules. An easy way to exploit the memory bandwidth advantage of GPUs is to use Thrust. In other contexts, memory intensive apps have benefited from the use of OpenACC. The resulting application-level speedup achieved may only be 2x to 3x in such cases, but that is still far better than the single-digit percent annual improvement one gets from the continued use of CPUs.

Generally speaking, it is a good idea to become acquainted with the CUDA profiler early in one’s use of CUDA, and let the profiler guide the programmer’s focus as to where to optimize. It usually doesn’t hurt to think about using (possibly redundant) computation or use of shared memory to reduce accesses to global memory, in the same way that thinking about cache-friendly algorithms on CPU is frequently a good idea.

Hi! I am designing a slender matrix multiply kernel and I am trying to use even fewer threads per block. What if I choose 16 threads per block? Will the other half of the warp do nothing and wait?

Also, do you have any suggested material on slender matrix multiply? (Maybe not a paper… too hard to understand…)

Thank you!!