blocks vs threads and bad CUDA performance

I understand the difference between the two. I am writing a particle-constraint resolver in which each thread is responsible for relaxing one constraint, so there is no interaction between threads. If I launch more than one thread per block, the program crashes with memory errors, but if I launch one thread per block, it runs fine. In this scenario, is there any disadvantage to having only one thread per block?

Is each CUDA core capable of simultaneously running multiple threads on different data, or something along those lines that I’m missing out on here?

I’m asking because the GPU is maxed out at 100% usage, yet the overall CUDA application performance is roughly equivalent to a single-threaded CPU running the same code. :(

“if I launch more than one thread per block, my program crashes and gets memory errors, but if I launch one thread per block, it runs fine”

check your memory indices/offsets
also, simply running the program in the debugger would likely already give you a better indication of the memory error/crash source
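
the most common culprit is a global index that only works when blockDim.x == 1. a minimal sketch of the standard pattern, assuming a 1D grid over n constraints (the kernel name, the positions array, and the relaxation step are placeholders, not taken from your code):

```cpp
// Illustrative only: names and the relaxation step are made up.
__global__ void relaxConstraints(float *positions, int n)
{
    // Global index across the whole grid. Using blockIdx.x alone is
    // only valid for <<<N,1>>> launches and is a classic source of
    // bad addressing once blockDim.x > 1.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the tail: the last block may contain threads past the end
    // of the array, and they must not touch memory.
    if (i < n)
        positions[i] *= 0.5f;   // stand-in for the real relaxation step
}
```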

regardless, it comes across as if you have hardly parallelized your code
hence the comparable performance to a CPU
perhaps if you provide pseudocode / a functionality summary / a flow chart, someone can assist

Do you literally mean 1 thread per block? SMs/SMXs always run threads in simultaneous groups of 32 called warps. Thus, if you specify 1 thread per block, you have 31 cores idling for every one that is performing calculations, or a maximum of 1/32 of your GPU’s potential. Hence the CPU-like performance.

Bottom line: fix those indexing bugs! Refer to the sample applications for examples.
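
For illustration, a sketch of a host-side launch that keeps warps full, assuming the kernel sketched above, the usual includes, and a device pointer d_positions already allocated with cudaMalloc (the block size of 128 is just one reasonable choice, not the only one):

```cpp
int n = 3200;                // number of constraints, as in your launch
int threadsPerBlock = 128;   // a multiple of the warp size (32)

// Ceiling division so every constraint gets a thread even when n is
// not a multiple of the block size.
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

relaxConstraints<<<blocks, threadsPerBlock>>>(d_positions, n);

// Launch-time errors (bad configuration, etc.) surface here.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
```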

Thanks for the help. Yes, I was running one thread per block, e.g. kernel<<<3200,1>>>(params). Interestingly, I have seen sample code written this way, with one thread per block. From what you said, it sounds like bad sample code.

My program seems to run OK with 32 threads per block (i.e. kernel<<<100,32>>>), but it did have problems with 64. Adding error checks in various places has made the error go away, but I suspect you are right: there is probably a memory access error in the kernel. I will track it down eventually.

Anyway, I am now seeing a big performance advantage with the CUDA code running 32 threads in each block.

Is the warp size likely to change any time in the future? Am I hamstringing future code if I hard-code the threads per block at 32?
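
One way to avoid baking the number 32 into the code is to query it: the runtime reports it in cudaDeviceProp::warpSize (and device code has a built-in warpSize variable), so the launch configuration can be derived from the query. A minimal sketch, assuming device 0:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // warpSize is 32 on all current NVIDIA GPUs, but deriving the
    // block size from the query keeps the code honest if that
    // ever changes.
    int threadsPerBlock = 4 * prop.warpSize;   // e.g. 128
    printf("warp size: %d, threads per block: %d\n",
           prop.warpSize, threadsPerBlock);
    return 0;
}
```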