Problem with thread indexes and emulation

Hi all,

I’m experiencing some weird behaviour in emulation mode, which I have traced to using a switch statement with threadIdx.y as the expression…

If I emulate this kernel:

__global__ void kernel1( ) {

  fprintf(stderr,"tx=%d ty=%d\n",threadIdx.x,threadIdx.y);

}

I will get an output like this (launching a (1,1) grid and (8,2) block):

tx=0 ty=0

tx=1 ty=0

tx=2 ty=0

tx=3 ty=0

tx=4 ty=0

tx=5 ty=0

tx=6 ty=0

tx=7 ty=0

tx=0 ty=1

tx=1 ty=1

tx=2 ty=1

tx=3 ty=1

tx=4 ty=1

tx=5 ty=1

tx=6 ty=1

tx=7 ty=1

However, if I add a switch statement somewhere in the kernel like this:

__global__ void kernel2( ) {

  fprintf(stderr,"tx=%d ty=%d\n",threadIdx.x,threadIdx.y);

  switch (threadIdx.y) {

    case 1: ...

    case 2: ...

  }

}

the thread indexes get corrupted and I get this output:

tx=0 ty=0

tx=1 ty=0

tx=2 ty=0

tx=3 ty=0

tx=4 ty=0

tx=5 ty=0

tx=6 ty=0

tx=7 ty=0

tx=0 ty=1

tx=1 ty=0

tx=2 ty=0

tx=3 ty=0

tx=4 ty=0

tx=5 ty=0

tx=6 ty=0

tx=7 ty=0

Any explanation or something I’m missing? Or is this a bug?

Thanks.

The first thing that comes to mind: do you have break statements after each case in the switch?

Yes. In fact, changing the switch to an equivalent ‘if / else if’ structure produces the same result, and I have double-checked that there are no out-of-place writes in the code.

I can’t tell what you want to do with the switch(), but I think it should be like this:

switch (threadIdx.y) {
case 0: …
case 1: …
}

since blockDim.y = 2, threadIdx.y will be either 0 or 1.

What the switch does shouldn’t matter much here; in any case, the printf comes before it, so the printed results should be correct.

The code given was just a quick example; the real code uses 8x8 blocks, and I need to switch in groups of 8 to perform some shared-memory loads.

Try to arrange your switches so that each case covers a full warp; that way you avoid divergent warps. If you switch on groups of 8 threads, you end up with 4 divergent paths per warp.