Hello,
I am having a problem. My kernel has a loop that runs for, say, 20 iterations. In most cases I need to break out long before the 20th iteration, and hence do not need to execute the rest. But when I use a break statement, all the values loaded into shared memory become wrong, so I cannot use break.
My question is: what happens when there is a break statement in a loop in a CUDA kernel? I understand that it causes divergence, but why should that affect the final results?
Please help me understand this.
This won't work, since __syncthreads() must be encountered by all threads of the block in the same manner. [1] If some threads leave the loop with break while the others go on to reach the barrier, the behavior is undefined, which is most likely why your shared memory contents come out wrong.
Instead of using break inside the loop, you can set a flag variable and make the whole loop body, excluding the __syncthreads() calls, conditional on that variable.
[1] It appears that on all current hardware only all warps actually need to encounter it, but let's keep things simple and safe for future hardware as well.
For threads that have finished their work, just execute the __syncthreads() calls and any data transfers to/from shared memory that other threads may need, but skip the rest of the loop body:
int flag = 1;
for (int i=0; i<20; i++) {
    // load data...
    __syncthreads();
    if (flag) {
        // do work...
        if (finished) {
            // this is where the break statement would have been
            flag = 0;
        }
    }
    __syncthreads();
}
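To make the pattern above concrete, here is a minimal, self-contained sketch. It is not the original poster's kernel: the kernel name accumulate_until_limit, the helper count_mismatches, the block size of 64, the stop condition (running sum reaching a limit) and the input data are all made up for illustration. Each thread keeps loading its slot of the shared tile and hitting both barriers on every iteration, even after it has "finished"; only the work is skipped.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 64;  // threads per block (arbitrary choice for this sketch)
constexpr int kIters   = 20;  // fixed loop length, as in the question

// Hypothetical kernel: each thread sums values read from a neighbour's
// shared-memory slot and stops working once its sum reaches `limit`.
__global__ void accumulate_until_limit(const int* in, int* out, int limit)
{
    __shared__ int tile[kThreads];
    const int t = threadIdx.x;
    int  sum    = 0;
    bool active = true;  // plays the role of `flag`

    for (int i = 0; i < kIters; ++i) {
        // Every thread loads its slot, finished or not, because another
        // thread may still need to read it.
        tile[t] = in[i * kThreads + t];
        __syncthreads();  // reached by all threads on every iteration

        if (active) {
            sum += tile[(t + 1) % kThreads];
            if (sum >= limit) {
                active = false;  // where `break` would have been
            }
        }
        __syncthreads();  // nobody overwrites the tile while it is still being read
    }
    out[t] = sum;
}

// Host helper (hypothetical): runs one block and compares the result with a
// CPU reference that really does break early.
// Returns the number of mismatching threads, or -1 on a CUDA error.
int count_mismatches()
{
    const int n     = kIters * kThreads;
    const int limit = 40;
    std::vector<int> h_in(n), h_out(kThreads, -1);
    for (int k = 0; k < n; ++k) h_in[k] = (k % 7) + 1;

    int* d_in  = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, kThreads * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    accumulate_until_limit<<<1, kThreads>>>(d_in, d_out, limit);

    // This copy waits for the kernel and also reports any launch/execution error.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, kThreads * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int t = 0; t < kThreads; ++t) {
        int sum = 0;
        for (int i = 0; i < kIters; ++i) {
            sum += h_in[i * kThreads + (t + 1) % kThreads];
            if (sum >= limit) break;  // on the CPU a plain break is fine
        }
        if (sum != h_out[t]) ++mismatches;
    }
    return mismatches;
}
```

To try it, call count_mismatches() from a main() and build with nvcc; a return value of 0 means the flag version matches a CPU loop that uses a real break. Threads stop at different iterations here, so the early exit does diverge, but since every thread still reaches both barriers the shared-memory contents stay valid.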