How many threads can execute in parallel?

Hello everybody.

I’m trying to learn a little programming with CUDA on a GeForce 9800 GT, and I’m a little confused about parallel execution on the device. The programming guide says that the 9800 GT has 16 multiprocessors and that every multiprocessor has 8 processors, so the device can execute 16*8=128 threads in parallel. Is that correct? And what does a warp mean? A warp has 32 threads per multiprocessor, but the multiprocessor has only 8 processors, so it can only execute 8 threads in parallel. What happens with the remaining 32-8=24 threads?

I’m probably very wrong in my conclusions, so please correct me!

Thank you for your time.

I will have more questions later.

The programming guide states that a multiprocessor processes one warp in 4 cycles. This allows the shaders to be clocked higher than the instruction decoder.

You can think of it as the remaining 24 threads of each warp being in a pipeline. That means that all 32 are in flight at the same time, but only 8 of them finish every shader clock cycle.

The pipeline picture is not exactly true, but it’s a reasonable abstraction. Nvidia doesn’t really elaborate much on how this works deep in the hardware.
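To make the numbers above concrete, here is a small sketch in plain Python (not CUDA) that only checks the arithmetic. The constants are the figures quoted in this thread for the 9800 GT, and the function names are made up for illustration, so treat it as a back-of-the-envelope model rather than a description of the real hardware.

```python
# Figures quoted in this thread for the 9800 GT; treat them as assumptions
# and check the Programming Guide for your own device.
MULTIPROCESSORS = 16      # multiprocessors on the device
PROCESSORS_PER_MP = 8     # scalar processors per multiprocessor
WARP_SIZE = 32            # threads per warp


def threads_per_clock():
    """Threads that finish one instruction per shader clock, device-wide."""
    return MULTIPROCESSORS * PROCESSORS_PER_MP


def clocks_per_warp():
    """Shader clocks one multiprocessor needs to push a full warp through one instruction."""
    return WARP_SIZE // PROCESSORS_PER_MP


def threads_per_warp_pass():
    """Threads covered device-wide while every multiprocessor works through one warp."""
    return MULTIPROCESSORS * WARP_SIZE


if __name__ == "__main__":
    print(threads_per_clock())      # 16 * 8
    print(clocks_per_warp())        # 32 / 8
    print(threads_per_warp_pass())  # 16 * 32
```

So under this simplified model, 128 threads finish per shader clock, a warp takes 4 clocks, and one warp on each multiprocessor covers 512 threads.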

Thank you for your reply. So the 9800 GT can process 16*32=512 threads in 4 clock cycles. Does that mean that for maximum speed it’s better not to use more than 512 threads?

Thanks.

No, you will want to use tens of thousands of threads for best performance. There’s virtually no penalty for having “too many”, and more threads mean better saturation of all the queues and pipelines. In my current app, I have a million, and that’s not considered very many either.
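As a rough illustration of the difference in scale, the sketch below (plain Python, arithmetic only) counts how many blocks and warps a launch produces. The block size of 256 threads is an assumption picked for the example, not something stated in this thread, and the helper names are hypothetical.

```python
WARP_SIZE = 32            # threads per warp
MULTIPROCESSORS = 16      # figure quoted in this thread for the 9800 GT


def blocks_needed(total_threads, threads_per_block):
    """Number of blocks, rounded up so every thread has a slot."""
    return (total_threads + threads_per_block - 1) // threads_per_block


def total_warps(total_threads, threads_per_block):
    """Warps created by the launch (a partly filled warp still counts as one)."""
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return blocks_needed(total_threads, threads_per_block) * warps_per_block


if __name__ == "__main__":
    block_size = 256  # assumed block size, chosen only for illustration
    for n in (512, 1_000_000):
        print(n, blocks_needed(n, block_size), total_warps(n, block_size))
```

With these assumptions, 512 threads give only 2 blocks and 16 warps. Since a block runs on a single multiprocessor, at most 2 of the 16 multiprocessors would get any work. A million threads give 3907 blocks and 31256 warps, which leaves every multiprocessor with plenty of warps to switch between.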

512 threads is way too few.
Read the Programming Guide if you haven’t already; this is answered there.