# Execute different instruction for each warp and synchronize

Okay, let me try to describe my problem (there might not be a solution to it). Each SM executes some number of warps, and I want each of those warps to execute a different single instruction in one area of my code, without the overhead of all the branch logic like "if (Warp==0) …, else if (Warp==1) …". Note: this part of the code would execute several hundred times, with 2 or 3 other instructions.

Maybe I can draw a couple of examples…

Ultimately I would like to do something like this:

``````
for (int i = 0; i < 200; i++) {
    x = Data[a];
    y = sin(x);   // warp 0: sin(x), warp 1: cos(x), warp 2: tan(x), ...
    Data2[b] = y;
}
``````

Or maybe like this… Each warp starts its own loop, but all the warps are synchronized. Here I use __syncthreads(), but that might not be the right command. I want each of the for loops to execute at the same time, in lock-step and synchronized together.

``````
if (WarpID == 0)
    for (int i = 0; i < 200; i++) {
        x = Data[a];
        y = sin(x);
        Data2[b] = y;
    }
else if (WarpID == 1)
    for (int i = 0; i < 200; i++) {
        x = Data[a];
        y = cos(x);
        Data2[b] = y;
    }
else if (WarpID == 2)
    for (int i = 0; i < 200; i++) {
        x = Data[a];
        y = tan(x);
        Data2[b] = y;
    }
``````
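For reference, WarpID in these sketches isn't a built-in variable; assuming a 1-D thread block, it would be derived from the thread index, something like this (a sketch, not a tested kernel):

``````
__global__ void kernel(const float *Data, float *Data2)
{
    // Each warp is warpSize (32) consecutive threads of the block.
    int WarpID = threadIdx.x / warpSize;
    int laneID = threadIdx.x % warpSize;

    // The condition below is uniform across the warp, so all 32 threads
    // of a warp take the same path and there is no intra-warp divergence.
    if (WarpID == 0) {
        /* ... warp 0's loop ... */
    }
}
``````

Because the branch condition depends only on WarpID, no warp ever splits; each warp simply skips the blocks that don't belong to it.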

I hope this makes some sense. I know I could remodel my data/code to avoid this, or even put in the logic like below, but that would not work as well for the model I am trying to use to solve my problem.

``````
for (int i = 0; i < 200; i++) {
    x = Data[a];
    if (WarpID == 0)
        y = sin(x);
    else if (WarpID == 1)
        y = cos(x);
    else if (WarpID == 2)
        y = tan(x);
    // ...
    Data2[b] = y;
}
``````

Why do you think this will be bad? Have you tested this code with the Visual Profiler? From looking at the code, I would think it is primarily bandwidth-bound, meaning the extra cycles from the branching won't really matter. Also, why do you need to synchronize the threads at all? Finally, this code segment doesn't make sense: since a and b never change, you are just overwriting the computation from the previous iteration.

I suppose the branching logic and instructions themselves are cheap; divergence is the real problem. So just check the warp ID and go.

Thank you Justin and Lev for your help. Your comments are very much appreciated. I wanted to keep the example light, so I quickly created something with just 3 if blocks, but my project would have 16 (or more), so the amount of logic code would have been way up there. Luckily, the original idea that I did not think was going to work actually did. I found that using…

``````
...
if (warpID == 0) {
    for (int i = 0; i < 128; i++) {
        output[laneID + i*2048 + 262144*warpID] = __cosf(shared[i]);
    }
}
else if (warpID == 1) {
    for (int i = 0; i < 128; i++) {
        output[laneID + i*2048 + 262144*warpID] = __sinf(shared[i]);
    }
}
else if (warpID == 2) {
    for (int i = 0; i < 128; i++) {
        output[laneID + i*2048 + 262144*warpID] = __tanf(shared[i]);
    }
}
else if (warpID == 3) {
    for (int i = 0; i < 128; i++) {
        output[laneID + i*2048 + 262144*warpID] = __logf(shared[i]);
    }
}
// ...(12 more blocks here like above)
else if (warpID == 15) {
    for (int i = 0; i < 128; i++) {
        output[laneID + i*2048 + 262144*warpID] = __expf(2*shared[i]);
    }
}
...
``````

…actually did work for what I wanted it to do. The NVIDIA Guide states that __syncthreads() should not be used in divergent code (B.6); however, it looks like you can if you are careful. I think the guide says not to do this because if it is not done right it will fail. __syncthreads() translates into a "bar.sync 0" instruction, and according to the specs of that instruction in ptx_isa_2.3.pdf it can be divided up among different warps. So in the end I found that just using __syncthreads() is all I really needed. Basically, calling __syncthreads() exactly once from each warp, even if the calls are at different locations in the code, will synchronize them so that each loop executes at the same time. In the example above this means that all the warps' "i" values would be synchronized.
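A condensed sketch of the pattern described above, with the barrier placed inside each warp's loop so the "i" values stay in step (hypothetical names; this relies on the bar.sync behavior discussed, which goes beyond what the Programming Guide guarantees):

``````
__global__ void perWarpOps(float *output)
{
    __shared__ float shared[128];
    int warpID = threadIdx.x / warpSize;
    int laneID = threadIdx.x % warpSize;

    if (warpID == 0) {
        for (int i = 0; i < 128; i++) {
            output[laneID + i*2048 + 262144*warpID] = __cosf(shared[i]);
            __syncthreads();   // warp 0 hits the barrier at this call site
        }
    }
    else if (warpID == 1) {
        for (int i = 0; i < 128; i++) {
            output[laneID + i*2048 + 262144*warpID] = __sinf(shared[i]);
            __syncthreads();   // warp 1 hits the barrier at a different call site
        }
    }
    // ... one block, and exactly one __syncthreads() per iteration,
    // for each of the remaining warps ...
}
``````

Each iteration, every warp reaches the barrier exactly once, so the block releases together and all warps advance i in lock-step.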

Just remove the __syncthreads() instruction and insert it after the if; you will see a speedup.

Are you trying to improve global memory throughput with the additional __syncthreads()? If yes, you need to reorder accesses so that all memory accesses between two __syncthreads() go to a contiguous block of memory.
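To illustrate the reordering being suggested (a sketch with hypothetical names: in, out, and f are placeholders): between two barriers, consecutive threads of the whole block should touch consecutive addresses, rather than each warp striding off on its own.

``````
// Contiguous per phase: in iteration i the whole block reads and writes
// one contiguous blockDim.x-element slice, so the accesses coalesce.
for (int i = 0; i < 128; i++) {
    int idx = threadIdx.x + i * blockDim.x;  // consecutive threads -> consecutive addresses
    out[idx] = f(in[idx]);
    __syncthreads();
}
``````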

Ultimately, I need the __syncthreads() calls at the level they are because the source and the destination for all threads would be the same place. Each warp would do something like:

(1) Grab some variables from A
(2) Do a computation on them (each warp doing something different here)
(3) Store the result to B
(4) Grab some variables from B
(5) Do a computation on them (each warp doing something different here)
(6) Store the result back to A
(7) etc.
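The ping-pong scheme in the steps above might look roughly like this in shared memory (a sketch; bufA, bufB, NSTEPS, and warpOp are placeholders for my real buffers and per-warp computations):

``````
__shared__ float bufA[512], bufB[512];

for (int step = 0; step < NSTEPS; step++) {
    // Steps (1)-(3): each warp reads from A, applies its own op, writes to B.
    bufB[threadIdx.x] = warpOp(warpID, bufA[threadIdx.x]);
    __syncthreads();   // every warp sees the completed B before reading it

    // Steps (4)-(6): same thing in the other direction, B back to A.
    bufA[threadIdx.x] = warpOp(warpID, bufB[threadIdx.x]);
    __syncthreads();   // every warp sees the completed A before the next step
}
``````

The barriers sit between the write phase and the next read phase, which is exactly where the warps have to exchange data.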

I don't really know how efficient it will be in the end, but I guess I'll find out. The code example above does do what I want, except that right now it's just saving to global memory and I need to change it so it points back to shared memory.

It's kind of a cool trick. We have 16 different warps that go off and do their own thing, then sync to share data, then go off and do their own thing again, and so on. And there is not much logic code to do the branching after it gets started. I just have to play with it and see how the performance works out.