Need help understanding CUDA's structure


I'm a French student currently doing an internship.

My job is to modify an existing C++ program to make it work on CUDA.

I have a Quadro 4000, and I don't know a thing about CUDA (I tried to read the programming guide, but it's still obscure to me).

For this specific GPU, there are 256 cores, up to 1024 threads per block, and each block has 48 KB of shared memory, right?

What is the link between cores and blocks? Or between cores and multiprocessors?

Is it: 8 MPs, with 32 cores/MP = 256 cores?

And with 8 blocks per MP, I have 64 blocks of 1024 threads?

And what's the difference between threads and resident threads? Because if I can have up to 1024 threads per block, that would be 8 × 1024 = 8192 per MP, not 1536?

I need to figure out how to split 3 loops over many, many images so the program runs as fast as possible (it currently takes several hours).

Thank you, and I hope you can understand my English ^^"


Please, I need to know how many blocks there are on the Quadro 4000 before considering anything else …

You can have as many as you want: 65535 × 65535 × 65535 is the upper limit, and it is the same for every Fermi card. But I suspect that isn't really what you are trying to ask. The Quadro 4000 has 8 multiprocessors, and each can run up to 8 blocks of threads concurrently. Each block runs for its lifetime on one MP. As a block finishes on an MP, another is scheduled from the pool of unrun blocks. In this way all the blocks from a grid of any size (up to the 65535³ limit) are run and retired, and the kernel launch completes.
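To make that concrete, here is a minimal sketch (names and sizes are illustrative, not from your program) of launching far more blocks than the GPU can keep resident at once. The hardware drains the pool of blocks automatically; you never manage the scheduling yourself:

```cuda
#include <cstdio>

// Each thread handles one element; the grid covers the whole array.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)            // guard: n need not be a multiple of blockDim.x
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 22;                 // 4M elements
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // 16384 blocks -- far more
                                                // than can run concurrently
    scale<<<blocks, threads>>>(d, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```

The 16384 blocks here are retired a few dozen at a time across the 8 MPs until the whole grid has run.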

Does that make things any clearer?

I see. Thank you!
So, I can run 64 blocks at the same time, and each one can run 1024 threads?
Is the 48 KB of shared memory for one MP? So there would be 6 KB of shared memory per block?

No, it is up to 48 KB per block, and up to 8 blocks per MP. The amount of shared memory and registers each block requires determines the number of blocks that will run concurrently, not the other way around.
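For example (a sketch, with illustrative numbers): a kernel that statically declares 24 KB of shared memory can have at most 2 blocks resident per MP, because 2 × 24 KB fills the 48 KB the MP has:

```cuda
// Assumed to be launched with <= 6144 threads per block (e.g. 256),
// so the tile[threadIdx.x] indexing below stays in bounds.
__global__ void needsSharedMem(const float *in, float *out, int n)
{
    // 6144 floats = 24 KB of static shared memory per block.
    // With 48 KB per MP, at most floor(48 / 24) = 2 such blocks
    // can be resident on one MP at a time (registers permitting).
    __shared__ float tile[6144];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    tile[threadIdx.x] = in[i];
    __syncthreads();          // wait until the whole tile is loaded
    out[i] = tile[threadIdx.x];
}
```

Halve the tile to 12 KB and up to 4 blocks fit; ask for the full 48 KB and only 1 block fits per MP.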

OK! Thank you! It would have been hard for me to understand that otherwise …!

So I have 64 blocks and 1024 threads per block, and I need to process 10,000 images in the first loop, then 100 other images for each of those in the second loop, and then process regions of interest in each image (to do a correlation), let's say 1000 max.
How can I efficiently split these loops to run the GPU at full potential? I was thinking of using shared memory for 3 lines: 1 from the image of the first loop and 2 from the second loop … this way, blocks can cover the second loop (2 × 64 = 128 > 100) and threads the last loop of regions of interest (1024 > 1000, which is already the maximum).
What do you think?

(It's 3 nested loops.)
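A rough sketch of the mapping described above (every name and the correlation step are placeholders — I don't know the actual algorithm): one kernel launch per first-loop image, with `blockIdx.x` selecting the second-loop image and `threadIdx.x` selecting the region of interest:

```cuda
// Hypothetical launch structure only -- the real correlation code
// would replace the body.
__global__ void correlate(const float *refImage,      // one first-loop image
                          const float *otherImages,   // 100 images, contiguous
                          float *results,
                          int imageSize, int numRegions)
{
    int image  = blockIdx.x;     // which second-loop image (0..99)
    int region = threadIdx.x;    // which region of interest (0..999)
    if (region >= numRegions) return;

    const float *img = otherImages + image * imageSize;

    // ... stage the needed lines of refImage / img in shared memory,
    // ... compute the correlation for this (image, region) pair,
    // results[image * numRegions + region] = ...;
}

// Host side: one launch per image of the outer 10,000-image loop.
// for (int i = 0; i < 10000; ++i)
//     correlate<<<100, 1024>>>(d_ref[i], d_others, d_results, sz, 1000);
```

Note that with 100 blocks the 8 MPs are only lightly loaded, and 1024-thread blocks limit how many blocks fit per MP; a finer-grained decomposition (e.g. several blocks per image, fewer threads each) may expose more parallelism, but start with something simple that works.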

I don't know anything about image processing, so I can't really offer any advice, sorry. Start simple. Something that is slow but works can always be optimized. Something that is prematurely optimized, overly complex, and doesn't work is no help to anyone. Best of luck.

Yes, that's probably great advice!

Do you think I should start a new topic to ask about the image processing side?