Warps in emulation and device mode

Hi,

I’ve been trying to port an algorithm to the GPU. Whilst debugging I got unexpected results. I have read that in emulation mode the warp size is effectively set to 1, yet I still get unexpected results when running this simple kernel:

[codebox]#include <cstdio>

// launched with one block of 32 threads:
// testKernel<<<1, dim3(32, 1)>>>();

__global__ void testKernel()
{
    int tid = threadIdx.x + threadIdx.y * blockDim.x;

    printf("THREAD %i REPORTING START\n", tid);

    for (int i = 0; i < 5; ++i) {
        printf("THREAD %i WORKING ON NUMBER %i\n", tid, i);
    }
}[/codebox]
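
For completeness, I’m building with device emulation enabled, something like this (the file name is just an example):

[codebox]nvcc -deviceemu -o testKernel testKernel.cu[/codebox]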

As output I get each thread’s lines grouped together: thread 0 prints its START line and all five WORKING lines before thread 1 prints anything, and so on.

Is this normal for emulation mode? I am not sure whether emulation mode really instantiates parallel threads or whether it just runs each thread’s code sequentially. In the latter case the output makes sense.

However, can I assume that if the code were run on the device, the output would be interleaved in lockstep, something like the sketch below (assuming output from the device were possible like that)? This change in warp size makes debugging quite hard.
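
(Hypothetical output, just to illustrate the ordering I would expect:)

[codebox]THREAD 0 REPORTING START
THREAD 1 REPORTING START
...
THREAD 31 REPORTING START
THREAD 0 WORKING ON NUMBER 0
THREAD 1 WORKING ON NUMBER 0
...
THREAD 31 WORKING ON NUMBER 0
THREAD 0 WORKING ON NUMBER 1
...[/codebox]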

Cheers,

Frank

It is completely normal: in emulation mode the CPU runs each thread for a chunk of time before switching. Try executing this:

[codebox]__global__ void testKernel()
{
    int tid = threadIdx.x + threadIdx.y * blockDim.x;

    printf("THREAD %i REPORTING START\n", tid);

    __syncthreads();

    for (int i = 0; i < 5; ++i) {
        printf("THREAD %i WORKING ON NUMBER %i\n", tid, i);
    }
}[/codebox]

The __syncthreads() call will force the CPU to switch to another thread.

This does indeed give the desired output in emulation mode. However, am I right in saying the __syncthreads() call should not be necessary in device mode? Since I only have 32 threads (one warp), they should all be executing the same instruction in lockstep.

In my actual algorithm I can’t use __syncthreads() because of divergent for/if/while statements. My block dimension is 32 by 6, and there is a for-loop in my kernel which looks something like this:

[codebox]for (int i = 0; i < sharedmem[threadIdx.y]; ++i) {
    // do stuff
}[/codebox]

Using __syncthreads() inside the for-loop will crash (deadlock) the application, because the sharedmem array holds different values for different rows, so not all threads reach the barrier; if I ever did need a barrier there, I imagine I could pad every thread to the block-wide maximum trip count, as in the sketch below. Now, since the warp size is 32 and my block dimension is 32 by 6, this implementation should not affect performance, right? Since I index with threadIdx.y, every thread within a warp gets the same trip count for the for-loop.
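
Here is what I mean; this is only a sketch (ROWS and maxCount are names I made up), where every thread loops to the block-wide maximum so the barrier is reached uniformly:

[codebox]#define ROWS 6

__shared__ int sharedmem[ROWS];   // per-row trip counts, filled earlier
__shared__ int maxCount;

// one thread computes the block-wide maximum trip count
if (threadIdx.x == 0 && threadIdx.y == 0) {
    maxCount = 0;
    for (int y = 0; y < ROWS; ++y)
        maxCount = max(maxCount, sharedmem[y]);
}
__syncthreads();

for (int i = 0; i < maxCount; ++i) {
    if (i < sharedmem[threadIdx.y]) {
        // do stuff
    }
    __syncthreads();   // reached by every thread in the block
}[/codebox]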

Please correct me if I’m wrong as my whole kernel implementation is relying on this =)

On the device it will behave as you say; you were right in your first post.

In device emulation mode you sometimes need __syncthreads() to avoid race conditions, because of the way the CPU executes the threads (a chunk of instructions executed sequentially, switching threads from time to time). On the GPU the barriers are not necessary unless you are doing specific things such as sharing memory between threads. For example, here is a minimal sketch of my own (a hypothetical shiftKernel, assuming a 64-thread block) where one thread reads a value written by another, so the barrier is needed on the device as well:
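
[codebox]__global__ void shiftKernel(int *out)
{
    __shared__ int buf[64];

    buf[threadIdx.x] = threadIdx.x;   // each thread writes its own slot

    __syncthreads();                  // needed on the GPU too: the read below
                                      // touches a slot written by another thread

    out[threadIdx.x] = buf[(threadIdx.x + 1) % 64];
}[/codebox]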

Suppose the thread code has 10 instructions.

The CPU flow will be something like this:

Thread 1 Ins 1
Thread 1 Ins 2
Thread 1 Ins 3
Thread 1 Ins 4
Thread 1 Ins 5
Thread 1 Ins 6
----Change of thread
Thread 2 Ins 1
Thread 2 Ins 2
Thread 2 Ins 3
Thread 2 Ins 4
Thread 2 Ins 5
----Change of thread

It’s not very accurate, but I hope it gives you an approximate idea.
I hope this helps.

Regards.