Is each thread working concurrently?

How can I know whether each thread in a thread block is working concurrently?

I defined a 4 x 4 grid of thread blocks, and each thread block has 3 x 10 threads. I thought the results would appear concurrently (say, 100 lines at a time), but my output shows one line at a time.

How can I ensure that it is working concurrently, and how can I check?

You can’t. The only assumption you can make is that threads belonging to the same warp will be executing concurrently, e.g. threads 0-31, 32-63, …, 480-511 will execute concurrently if the warp size is 32.
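If you want to see that warp-level grouping for yourself, one common trick is to have every thread record the per-multiprocessor cycle counter with clock() as it starts: threads in the same warp will report the same value (or very nearly so), while different warps report different ones. A minimal sketch (the kernel and variable names are invented for illustration, not taken from anyone’s code):

#include <cstdio>

__global__ void stamp(unsigned int *t)
{
    // Each thread stores the cycle count at which its warp issued this instruction.
    t[threadIdx.x] = (unsigned int)clock();
}

int main()
{
    const int n = 64;                          // one block of 64 threads = 2 warps
    unsigned int *d_t;
    cudaMalloc(&d_t, n * sizeof(unsigned int));
    stamp<<<1, n>>>(d_t);
    cudaDeviceSynchronize();
    unsigned int h_t[n];
    cudaMemcpy(h_t, d_t, sizeof(h_t), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)                // threads 0-31 should share one stamp,
        printf("thread %2d: %u\n", i, h_t[i]); // threads 32-63 another
    cudaFree(d_t);
    return 0;
}

Note that this only shows you what the scheduler happened to do on one run; it is not something your program logic should ever depend on.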

Can you be a bit more specific? Your question has a number of possible different answers, dependent on the level of detail/understanding you require.

In the absolute sense, the threads are obviously not executing concurrently. A thread block is basically a virtual multiprocessor, but the real multiprocessors only have 8 streaming processors. Various pipelining considerations mean that groups of 32 threads (called a ‘warp’) will always appear to the programmer to run concurrently: if you have a race condition within a warp, it’s impossible to predict which thread will ‘win’ the race. However, the hardware scheduler makes no guarantees about the order in which warps within the same block are executed unless a __syncthreads() (or similar) command is present. So unless your code uses those calls, your program should treat all threads within a block as if they were running concurrently. Even those commands only guarantee consistency at a single point in the code; which warp leaves the __syncthreads() first is not defined.
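To make the __syncthreads() point concrete, here is a minimal sketch (kernel name, sizes, and launch configuration are invented for illustration) where one warp reads a value written by the other warp; without the barrier, the read could execute before the write has happened:

#include <cstdio>

__global__ void neighbourRead(int *out)
{
    __shared__ int buf[64];
    int tid = threadIdx.x;

    buf[tid] = tid;              // every thread writes its own slot
    __syncthreads();             // barrier: all writes are visible before any thread reads

    // Each thread reads a slot written by a thread in the *other* warp.
    // Remove the barrier and this read may see uninitialised data.
    out[tid] = buf[(tid + 32) % 64];
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, 64 * sizeof(int));
    neighbourRead<<<1, 64>>>(d_out);    // one block of 64 threads = 2 warps
    int h_out[64];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);  // expect 32
    cudaFree(d_out);
    return 0;
}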

Thanks for your answer.

This is my first time coding in CUDA. My English is quite poor, sorry about that.

I’m not sure how my code is managed by warps. I defined:

dim3 threadsPerBlock(16, 4);  // 64 threads per block
dim3 threadsPerGrid(4, 4);    // 16 blocks in the grid

I don’t understand: how can I manage my program with warps, or does it automatically divide the threads into warps?

A warp of threads is the basic scheduling and execution unit inside the GPU. It isn’t something the programmer has any control over, other than being the only scale at which execution coherence is implicitly guaranteed by the execution model. You cannot know what order blocks are executed in, and you cannot know what order warps within blocks are executed in, but you can assume that the threads within a warp of 32 are executed coherently, so threads (0…31) are executed together, then (32…63), etc. Nothing else is guaranteed or predictable.
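For the 16 x 4 block in your code, that division happens automatically: threads are linearised x-fastest and grouped 32 at a time, so rows y = 0 and 1 form one warp and rows y = 2 and 3 form the other. A small sketch that prints which warp each row lands in (the kernel name is invented, and device-side printf requires a GPU of compute capability 2.0 or later):

#include <cstdio>

__global__ void printWarpId()
{
    // Threads are numbered x-fastest, then grouped into warps of warpSize threads.
    int linear = threadIdx.x + threadIdx.y * blockDim.x;
    int warp   = linear / warpSize;      // warpSize is 32 on current hardware

    if (threadIdx.x == 0)                // one line per row keeps the output short
        printf("block (%d,%d) row y=%d -> warp %d\n",
               blockIdx.x, blockIdx.y, threadIdx.y, warp);
}

int main()
{
    dim3 threadsPerBlock(16, 4);         // 64 threads = 2 warps per block
    dim3 threadsPerGrid(4, 4);           // 16 blocks (same names as your code)
    printWarpId<<<threadsPerGrid, threadsPerBlock>>>();
    cudaDeviceSynchronize();             // flush the device-side printf buffer
    return 0;
}

The order in which the lines appear will vary from run to run, which is exactly the point: the scheduling order of blocks and warps is undefined.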

Thanks for your answer.