What is the optimum way to handle these loops?

Hi,

I have written a piece of C++ code as follows:

for (int i = 0; i < m; ++i)
{
    if (/* some condition */)
    {
        for (int j = 0; j < n; ++j)
        {
            for (int k = 0; k < h; ++k)
            {
                // some body here...
            }
        }
    }
    else
    {
        for (int j = 0; j < n; ++j)
        {
            // some body here...
        }
    }
}

I want to write this code in CUDA, but loops like this lead to THREAD DIVERGENCE and degrade the performance of the CUDA program.

So, my question is: what is the optimum way of handling this code?

Thanks
Manjunath G

If m, n and h are the same for every thread, then this will limit part of the divergence.

To make it perfect, every thread within a warp should agree on which branch of the condition is taken.

The key is that every thread in a warp needs to execute the same instruction, so if your condition is warp-dependent (uniform across each warp) rather than thread-dependent, then this should be fine.

If not, then the warp will effectively be split into multiple “sub-warps” that execute one after another, which leaves some of the scalar processors idle and hurts performance.

I would first try it as is, then try it with a fake “optimal” case to see whether it is actually worth the trouble of possibly rearranging things to make the condition warp-dependent.
It could very well be that this divergence will not be the limiting factor of your algorithm.
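One common rearrangement, sketched below under the assumption that the condition can be evaluated cheaply per element up front (the predicate `takesHeavyBranch` is a hypothetical stand-in for "some condition"), is to partition the work indices by the branch they take, so that consecutive threads, and therefore whole warps, fall on the same side of the `if`:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical predicate standing in for "some condition" on element i.
bool takesHeavyBranch(int i) { return i % 3 == 0; }

// Reorder the indices 0..m-1 so that all elements taking the heavy
// (triple-loop) branch come first. Each thread then processes idx[tid]
// instead of tid directly, and at most one warp -- the one straddling
// the boundary -- still sees both branches.
std::vector<int> partitionByBranch(int m) {
    std::vector<int> idx(m);
    std::iota(idx.begin(), idx.end(), 0);           // 0, 1, ..., m-1
    std::stable_partition(idx.begin(), idx.end(), takesHeavyBranch);
    return idx;
}
```

The index list can be built on the host (or on the device with a primitive such as `thrust::partition`) and passed to the kernel alongside the data.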

My m, n and h are not the same.

So are sub-warps the solution for this?