kernel performance and number of threads

humorstar · November 22, 2007, 5:23am

I have a kernel. If I launch it with only 32 threads, it takes 4 ms to complete. With 512 threads it takes 0.8 ms, and with 768 threads it takes 0.7 ms. Why does it get faster as the number of threads increases?

The time recorded includes memory allocation on the host and device side, data transfer between host and device, and kernel execution.
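Since the figures above mix allocation, transfers, and kernel execution, it may help to time the kernel on its own. Below is a minimal sketch using CUDA events, under some assumptions: `scaleKernel` and `timeKernelMs` are hypothetical stand-ins (the original kernel was not posted), and the calls are from the current CUDA runtime API rather than the 2007 toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the kernel from the post is not shown.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the kernel-only time in milliseconds,
// or a negative value if allocation or the launch failed.
float timeKernelMs(int threadsPerBlock, int n) {
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    cudaEventRecord(start);
    scaleKernel<<<blocks, threadsPerBlock>>>(d, n);
    cudaError_t launchErr = cudaGetLastError();  // catches a rejected launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait until the kernel is done

    float ms = -1.0f;
    if (launchErr == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    } else {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Calling `timeKernelMs(32, n)` and `timeKernelMs(512, n)` for the same `n` would compare kernel time alone. A negative return means the number is not a timing at all, which is worth ruling out before comparing runs.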

AndreiB · November 22, 2007, 7:06am

Are you changing the thread block size or the grid size? The maximum thread block size is 512 threads, and that is only reachable if each thread uses no more than 16 registers.

It looks like your launches with 512 and 768 threads are not being executed at all. Check that you are not running out of available registers or shared memory.
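One way to check the resource limits mentioned here is to ask the runtime what a given kernel needs. The sketch below assumes a newer CUDA runtime that provides `cudaFuncGetAttributes` (it was not available in 2007; compiling with `nvcc --ptxas-options=-v` also reports register usage). `scaleKernel` and `maxBlockSizeFor` are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel being investigated.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Prints the kernel's resource usage and returns the largest block size
// it can be launched with on the current device, or -1 on failure.
int maxBlockSizeFor() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    printf("registers per thread:     %d\n", attr.numRegs);
    printf("static shared memory:     %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block:    %d\n", attr.maxThreadsPerBlock);
    return attr.maxThreadsPerBlock;
}
```

If the block size requested at launch exceeds the value returned here, the launch is rejected and the kernel does not run, which would make the "faster" timings meaningless.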

MisterAnderson42 · November 22, 2007, 2:10pm

If you are using cutil, you can call CUT_CHECK_ERROR after launching the kernel. Otherwise, calling cudaThreadSynchronize, then cudaGetLastError, then cudaGetErrorString will give you any error message resulting from the kernel launch. It is always a good idea to do this for every kernel in a project; just disable the checks when NDEBUG is defined (for example, by compiling with -DNDEBUG) so they do not cost performance in a release build.
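The non-cutil sequence could look roughly like the sketch below. It makes a few assumptions: `cudaDeviceSynchronize` is used in place of `cudaThreadSynchronize`, which later toolkits deprecated; `scaleKernel`, `checkKernel`, and `checkedLaunch` are hypothetical names; and the check compiles to a no-op when `NDEBUG` is defined.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the kernel from the thread is not shown.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Synchronize, then fetch and report any error from the preceding launch.
// Compiled out (always returns cudaSuccess) when NDEBUG is defined.
cudaError_t checkKernel(const char *name) {
#ifndef NDEBUG
    cudaError_t syncErr = cudaDeviceSynchronize();  // errors during execution
    cudaError_t lastErr = cudaGetLastError();       // errors from the launch itself
    cudaError_t err = (syncErr != cudaSuccess) ? syncErr : lastErr;
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", name, cudaGetErrorString(err));
    }
    return err;
#else
    (void)name;
    return cudaSuccess;
#endif
}

// Launches the kernel with the given block size and returns the check result.
cudaError_t checkedLaunch(int threadsPerBlock, int n) {
    float *d = nullptr;
    cudaError_t err = cudaMalloc(&d, n * sizeof(float));
    if (err != cudaSuccess) return err;

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d, n);
    err = checkKernel("scaleKernel");

    cudaFree(d);
    return err;
}
```

With a block size the device cannot support, `checkedLaunch` should print an error string instead of silently returning, which is the situation suspected for the 512- and 768-thread runs above.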
