Hi, sorry for the newbie questions, but none of these are addressed in the FAQ or the programming guide. I've done plenty of multithreaded programming, so I'm comfortable with code being executed concurrently and aware of the problems that can arise from that, but I'm still having trouble understanding some of the basic CUDA concepts. To me it seems that the SDK is trying to abstract away the hardware, yet there are so many limitations in place that you have to know the hardware to use the SDK. Anyway, here goes:
Why is the warp size 32? Since each multiprocessor has 8 processors, it seems to me that 8 would have been a better warp size. Is 32 threads per warp a universal truth? Should I code with that assumption? Will it change? Where did 32 come from?
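In case it helps anyone answer: rather than hard-coding 32, I've been querying the warp size at runtime, along with the other limits I ask about below. Here's a minimal sketch of what I mean, assuming cudaGetDeviceProperties is the right call for this:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query properties of device 0 instead of hard-coding the limits.
    cudaGetDeviceProperties(&prop, 0);
    printf("warp size:             %d\n", prop.warpSize);           // 32 today -- but guaranteed?
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock); // 512 per Appendix A.1
    printf("multiprocessors:       %d\n", prop.multiProcessorCount);
    return 0;
}
```

Is this the intended pattern, or is 32 safe to bake in?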
If a multiprocessor has 8 processors, I can understand why the max number of active blocks is 8 (each processor processing one block), but how can the max number of active warps be 24 and the max number of active threads be 768, as stated in Appendix A.1 of the programming guide? (Those two numbers are at least consistent with each other: 24 warps x 32 threads/warp = 768 threads.) Is each processor really capable of handling 96 operations all at once? If so, is there something below the processor level that isn't mentioned in the programming guide, like each processor having 16 ALUs each capable of handling 4 floats, or something? My understanding is: a card has 1 to 16 MPs, each MP has 8 processors, and each processor can do XXX operations in parallel. I understand that with Intel SIMD you can do 4 adds/muls/etc. per op, but 96?
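To show exactly where my mental model breaks down, here's the arithmetic spelled out (the per-MP numbers are from Appendix A.1; the conclusion on the last line is my confusion, not a claim):

```cpp
#include <cstdio>

int main() {
    const int warpsPerMP      = 24;  // max active warps per MP (Appendix A.1)
    const int threadsPerWarp  = 32;  // warp size
    const int processorsPerMP = 8;   // processors per multiprocessor

    int threadsPerMP = warpsPerMP * threadsPerWarp;            // = 768, matches the guide
    printf("active threads per MP:  %d\n", threadsPerMP);
    printf("threads per processor:  %d\n",                     // = 96 -- how is this possible?
           threadsPerMP / processorsPerMP);
    return 0;
}
```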
Does each multiprocessor have its own scheduler? If there is a scheduler, why is the max number of threads per block 512? And where does 512 come from when, 11 lines down in Appendix A.1, it says an MP can handle 768 threads concurrently? Or does "active" not mean "concurrent"?
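For concreteness, here's the kind of launch I'm trying to reason about. The kernel is just a dummy I made up; the point is the configuration: 256 threads per block stays under the 512/block limit, and three such blocks would account for the 768 "active threads" per MP from A.1, if "active" means resident on the MP rather than executing in the same cycle. Is that the right way to read it?

```cpp
#include <cuda_runtime.h>

// Placeholder kernel -- only here to make the launch configuration concrete.
__global__ void dummy(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * out[i];
}

int main() {
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    // 256 threads/block <= the 512/block limit; 3 resident blocks of 256
    // would hit the 768 active-threads-per-MP figure -- assuming "active"
    // means resident, not simultaneously executing.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    dummy<<<blocks, threadsPerBlock>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```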
Any help would be greatly appreciated. We’re trying to get some computer vision algorithms onto the GPU and really want to use this technology.
And I thought register combiners were a pain…