blocks vs threads and bad CUDA performance

I understand the difference between the two. I am writing a particle-constraint resolver in which each thread is responsible for relaxing one constraint, so there is no interaction between threads. If I launch more than one thread per block, the program crashes with memory errors, but if I launch one thread per block, it runs fine. In this scenario, is there any disadvantage to having only one thread per block?

Is each CUDA core capable of simultaneously running multiple threads on different data, or something along those lines that I’m missing out on here?

I’m asking because the GPU is maxed out at 100% usage, yet the overall CUDA application performance is roughly equivalent to a single-threaded CPU running the same code. :(

“if I launch more than one thread per block, my program crashes and gets memory errors, but if I launch one thread per block, it runs fine”

check your memory indices/offsets
also, simply running the program in the debugger would likely already give you a better indication of the memory error/crash source
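
the most common culprit is a global index that only works when blockDim.x == 1. a minimal sketch of the standard pattern, assuming a 1D grid over n constraints (the kernel name, the positions array, and the relaxation step are placeholders, not taken from your code):

```cpp
// Illustrative only: names and the relaxation step are made up.
__global__ void relaxConstraints(float *positions, int n)
{
    // Global index across the whole grid. Using blockIdx.x alone is
    // only valid for <<<N,1>>> launches and is a classic source of
    // bad addressing once blockDim.x > 1.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the tail: the last block may contain threads past the end
    // of the array, and they must not touch memory.
    if (i < n)
        positions[i] *= 0.5f;   // stand-in for the real relaxation step
}
```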

regardless, it comes across as if you have hardly parallelized your code
hence the comparable performance to a CPU
perhaps if you provide pseudocode / a functionality summary / a flow chart, someone can assist

Do you literally mean 1 thread per block? SMs/SMXs always run threads in simultaneous groups of 32 called warps. Thus, if you specify 1 thread per block, you have 31 cores idling for every one that is performing calculations, or a maximum of 1/32 of your GPU’s potential. Hence the CPU-like performance.

Bottom line: fix those indexing bugs! Refer to the sample applications for examples.
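
For illustration, a sketch of a host-side launch that keeps warps full, assuming the kernel sketched above, the usual includes, and a device pointer d_positions already allocated with cudaMalloc (the block size of 128 is just one reasonable choice, not the only one):

```cpp
int n = 3200;                // number of constraints, as in your launch
int threadsPerBlock = 128;   // a multiple of the warp size (32)

// Ceiling division so every constraint gets a thread even when n is
// not a multiple of the block size.
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

relaxConstraints<<<blocks, threadsPerBlock>>>(d_positions, n);

// Launch-time errors (bad configuration, etc.) surface here.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
```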

Thanks for the help. Yes, I was running one thread per block, e.g. kernel<<<3200,1>>>(params). Interestingly, I have seen sample code written this way, with one thread per block. From what you said, it sounds like bad sample code.

My program seems to run OK with 32 threads per block (i.e. kernel<<<100,32>>>), but it did have problems with 64. Adding error checks in various places has made the error go away, but I suspect you are right: there is probably a memory access error in the kernel. I will track it down eventually.

Anyway, I am now seeing a big performance advantage with the CUDA code running 32 threads in each block.

Is the warp size likely to change any time in the future? Am I hamstringing future code if I hard-code the threads per block at 32?
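
One way to avoid baking the number 32 into the code is to query it: the runtime reports it in cudaDeviceProp::warpSize (and device code has a built-in warpSize variable), so the launch configuration can be derived from the query. A minimal sketch, assuming device 0:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // warpSize is 32 on all current NVIDIA GPUs, but deriving the
    // block size from the query keeps the code honest if that
    // ever changes.
    int threadsPerBlock = 4 * prop.warpSize;   // e.g. 128
    printf("warp size: %d, threads per block: %d\n",
           prop.warpSize, threadsPerBlock);
    return 0;
}
```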