I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads. I am studying the architecture from a didactic point of view (a university project), so reaching peak performance is not my concern.
First of all, I would like to understand if I got these facts straight:
A. The programmer writes a kernel and organizes its execution in a grid of thread blocks.
B. Each block is assigned to a Streaming Multiprocessor (SM). Once assigned it cannot migrate to another SM.
C. Each SM splits its own blocks into warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.
D. The actual execution of a thread is performed by the CUDA Cores contained in the SM. There is no specific mapping between threads and cores.
E. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.
F. On the other hand if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.
G. If a thread starts on a core and is then stalled by a memory access or a long floating-point operation, its execution could resume on a different core.
Are they correct?
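As a sanity check on facts C and F, here is how I picture the warp bookkeeping, as a small host-side sketch (the numbers are the ones from fact F):

```cuda
#include <cstdio>

// Host-side sketch: how a 48-thread block (fact F) is split into warps.
// A block of N threads occupies ceil(N / 32) warps; the last warp may be
// only partially populated (here: 16 active threads out of 32 lanes).
int main(void)
{
    const int threadsPerBlock = 48;
    const int warpSize        = 32;  // also available as the built-in "warpSize" in device code

    int warpsPerBlock   = (threadsPerBlock + warpSize - 1) / warpSize;  // ceiling division
    int lanesInLastWarp = threadsPerBlock - (warpsPerBlock - 1) * warpSize;

    printf("%d warps, last warp has %d active threads\n", warpsPerBlock, lanesInLastWarp);
    // prints "2 warps, last warp has 16 active threads"
    return 0;
}
```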
Now, I have a GeForce 560 Ti so according to the specifications it is equipped with 8 SM, each containing 48 CUDA cores (384 cores in total).
My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than the ones available in each SM, I imagined different approaches:
I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case will the 48 threads execute in parallel in the SM (exploiting all the 48 cores available for them)?
Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs)
If I “submerge” the GPU with work (creating 1024 blocks of 1024 threads each, for example), is it reasonable to assume that all the cores in the architecture will be used at a certain point, and will perform the same computations (assuming that the threads never stall)?
Is there any way to check these situations using the profiler?
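For concreteness, the three launch configurations I have in mind would look like this (`scale` and the doubling operation are placeholders, not my actual kernel):

```cuda
#include <cstdio>

// Placeholder kernel: every thread executes the same instruction stream.
__global__ void scale(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = 2.0f * data[idx];
}

int main(void)
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * 1024 * sizeof(float));  // large enough for the biggest launch

    scale<<<8, 48>>>(d_data);       // strategy 1: 8 blocks of 48 threads (one block per SM)
    scale<<<64, 6>>>(d_data);       // strategy 2: 64 blocks of 6 threads
    scale<<<1024, 1024>>>(d_data);  // strategy 3: "submerge" the GPU (1024 threads/block needs cc >= 2.0)

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```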
Not directly to the point, but you seem to be assuming that GPU cores behave the same way as CPU cores; they do not.
If you are trying to fill the GPU you do not run 384 threads, but thousands of them. This is because of the latencies associated with memory reads.
In fact you should not worry about threads running at the same time, but rather about active warps and blocks, i.e. warps which are executing a specific instruction or waiting for memory to be fetched. If you want to fill the GPU you should submit blocks with 384 or 512 threads per block. At any given time you can have many warps active on an SM waiting for data.
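One way to put this into practice is to derive the grid size from the device itself instead of hard-coding 8 SMs. A sketch: the `blocksPerSM` multiplier is an arbitrary assumption for latency hiding, not an official figure.

```cuda
#include <cstdio>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threadsPerBlock = 256;  // a multiple of the 32-thread warp size
    const int blocksPerSM     = 8;    // assumption: several resident blocks per SM to hide latency
    const int numBlocks       = prop.multiProcessorCount * blocksPerSM;

    printf("%d SMs -> launching %d blocks of %d threads (%d threads total)\n",
           prop.multiProcessorCount, numBlocks, threadsPerBlock,
           numBlocks * threadsPerBlock);
    // myKernel<<<numBlocks, threadsPerBlock>>>(...);  // hypothetical kernel
    return 0;
}
```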
So you are saying that probably the best approach to achieve my goal is to use the third strategy (“overload” the SMs)?
I admit that I am a bit confused. In particular:
Is there any REAL relationship between a CUDA thread and a CUDA core? On traditional CPUs, once a thread starts on a core it will stay there (ok, unless it is migrated for some reason). From the documentation and from different forum posts I am getting the idea that this relationship does not hold in CUDA. So how do things actually work?
I was thinking about the dual instruction issue of the Fermi architecture. From the docs I read that Fermi cards can issue instructions from 2 warps at the same time.
In my case I have 48 cores for each SM, so if I have 2 warps of 32 threads then what is going to happen? The current instruction of the first warp can be executed in parallel; but there are not enough cores to execute all the threads in the second warp (48 cores per SM - 32 cores busy with the first warp = 16 remaining cores).
Does the current instruction from the second warp execute in two clock cycles? If not, what is the meaning of those additional 16 cores?
In any case, can you guys give me any reference for this stuff? I read the CUDA Programming Guide, I read also a small part of the PTX guide, and the chapters dedicated to the hardware in the “Programming Massively Parallel Processors” and “CUDA Application Design and Development” books; but I still cannot get a straight answer to my questions.
Yes. In many cases “overloading” the SMs will give a performance increase because it hides the latencies. Compute capability 2.1 (the 500 series) has 48 cores per SM, which means that 16 cores remain unused; in order to take advantage of those there are special techniques. If the target is high performance computing you should concentrate on the 2.0 capability (400 series and all Tesla 20xx cards), where there are 32 cores per SM. On the GPU many warps can just sit and wait; during this time other warps can issue read requests or execute instructions, depending on the resources. It is true that on compute capability 2.0 each thread in a warp will run on a core.
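If it helps, here is what the two builds look like with nvcc (the file names are placeholders); the flag only selects which machine code is generated, it does not change how the hardware schedules warps:

```shell
# Build for the 560 Ti's native compute capability (2.1)...
nvcc -arch=sm_21 kernel.cu -o kernel_21
# ...or for cc 2.0; a 2.1 card can run this binary too, but the hardware
# (48 cores per SM, dual-issue scheduling) behaves the same either way.
nvcc -arch=sm_20 kernel.cu -o kernel_20
```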
As I said, I am not really interested in performance, but I need to be sure that every core is used in the execution of a kernel (possibly every core should execute the same code, but I know it is a bit utopian).
Can you please clarify what these special techniques for using those spare cores are?
If I compile my code specifying the “sm_20” switch, will it behave as you said? (Even though my GPU is actually compute capability 2.1.)
I just want to add one last thing. For cc 2.1 the remaining 16 cores are used through instruction-level parallelism. This can be achieved when there are calculations in the kernel which are independent of each other. For example, `a = b * c;` and `d = e * f;` could be executed in parallel if d, e and f do not depend on a, b and c. I think the compiler will do this automatically.
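A kernel-level sketch of this kind of independence (the names and constants are made up for illustration): each thread below issues two multiply-adds with disjoint operands, so the scheduler is free to dual-issue them.

```cuda
// Each thread computes two results whose operands do not overlap, giving
// the scheduler two independent instruction streams to dual-issue.
// "in" is assumed to hold 2*n elements, "out" likewise.
__global__ void ilpKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[2 * i];
        float d = in[2 * i + 1];

        float x = a * 2.0f + 1.0f;  // independent of the next line
        float y = d * 3.0f + 2.0f;  // independent of the previous line

        out[2 * i]     = x;
        out[2 * i + 1] = y;
    }
}
```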