How to avoid branching in kernels?

Hello CUDA experts,

I’m quite new to CUDA programming, coming from C++, which is why my kernel code contains many “if” and “for” branches. I wrote the following (bubble) sort code, which works fine but does not seem very efficient to me. It currently uses 30 registers.
Any suggestions on how to make it more efficient?

__constant__ int cImgWidth;
__constant__ int cImgHeight;
    
__global__ void SortKernel(  uchar* GpuNrOfElements, size_t GpuNrOfElementsPitch, float4* GpuFloatArrays, size_t GpuFloatArrPitch, float* GpuWeightArrays, size_t GpuWeightArrPitch)
{
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
        
    uchar* iNr = (uchar*)((char*)GpuNrOfElements + (y * GpuNrOfElementsPitch));
        	
    uchar iElements = iNr[x];
       
    if (iElements>=2)
    {
        for (int iLocal = (iElements-1); iLocal > 0; iLocal-- )
        { 
            float* Weight_Row = (float*)((char*)GpuWeightArrays + (y * GpuWeightArrPitch) + (cImgHeight *iLocal * GpuWeightArrPitch));
            float* Weight_RowMinOne = (float*)((char*)GpuWeightArrays + (y * GpuWeightArrPitch) + (cImgHeight *(iLocal-1) * GpuWeightArrPitch));
        			
            if( Weight_Row[x] > Weight_RowMinOne[x])
            {
                // swap weight elements
                float WeightTmp = Weight_Row[x];
                Weight_Row[x] = Weight_RowMinOne[x];
                Weight_RowMinOne[x] = WeightTmp;

                float4* _Row = (float4*)((char*)GpuFloatArrays + (y * GpuFloatArrPitch) + (cImgHeight *iLocal * GpuFloatArrPitch));
                float4* _RowMinOne = (float4*)((char*)GpuFloatArrays + (y * GpuFloatArrPitch) + (cImgHeight *(iLocal-1) * GpuFloatArrPitch));
                // swap the corresponding float4 elements too
                float4 _Rowtemp = _Row[x];
                _Row[x] = _RowMinOne[x];
                _RowMinOne[x] = _Rowtemp;
            }
        }
    }
}

It’s actually not a real bubble sort because it does only one pass, but that’s OK for my purpose.
A few words about the code to make it more readable: I have n float4 arrays (GpuFloatArrays), each the size of my image (cImgWidth x cImgHeight), and the same number of float arrays containing the corresponding weights (GpuWeightArrays). “n” varies from 0 to 6 per pixel and is stored in a uchar array (GpuNrOfElements). This means that the number of weight elements, and correspondingly the number of float4 elements, varies from thread to thread; only threads with 2 to 6 elements have anything to sort.
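To illustrate the "only one pass" point, here is a minimal host-side sketch of the same loop applied to the weights of a single pixel. The function name `SinglePass` is made up for this illustration and does not exist in my kernel; the plain array stands in for the stacked weight layers.

```cpp
#include <utility>

// Hypothetical host-side illustration (not part of the kernel):
// one backward pass over the n weights of a single pixel, using the
// same comparison as the kernel. The largest weight ends up at index 0;
// the remaining elements are not guaranteed to be sorted.
inline void SinglePass(float* w, int n)
{
    for (int i = n - 1; i > 0; --i)
    {
        if (w[i] > w[i - 1])
        {
            std::swap(w[i], w[i - 1]);
        }
    }
}
```

For example, the weights {1, 3, 2} become {3, 1, 2}: the maximum is in front, but 1 and 2 stay unsorted. As far as I can tell, the loop body never runs for n < 2, so the outer `if (iElements>=2)` in the kernel looks redundant.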
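To make the layout concrete, here is a minimal sketch of the address calculation the kernel performs, written so that it compiles both as plain C++ and under nvcc. The helper name `LayerElement` and its parameter names are hypothetical and not part of my actual code; it assumes the layers are stacked one below the other in a single pitched allocation, which is what the pointer arithmetic in the kernel implies.

```cpp
#include <cstddef>

// Let the same source compile without nvcc (assumption: host-only build).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: pointer to the element at pixel (x, y) in layer
// `layer` of a pitched array whose layers are stacked vertically.
// Layer k starts imgHeight rows after layer k-1; pitchBytes is the
// size of one row in bytes.
template <typename T>
__host__ __device__ inline T* LayerElement(T* base, std::size_t pitchBytes,
                                           int imgHeight, int layer, int x, int y)
{
    // Row index counted from the first row of layer 0.
    std::size_t row = static_cast<std::size_t>(layer) * imgHeight + y;
    char* rowStart = reinterpret_cast<char*>(base) + row * pitchBytes;
    return reinterpret_cast<T*>(rowStart) + x;
}
```

With such a helper, `Weight_Row[x]` in the kernel would correspond to `*LayerElement(GpuWeightArrays, GpuWeightArrPitch, cImgHeight, iLocal, x, y)`.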

Thank you very much in advance.
cheers Greg