Confused about for loops in kernels

Hi everyone!

I'm writing code that implements the Pixel Purity Index (PPI) algorithm on the GPU. This algorithm computes millions of dot products of pixel vectors. Anyway, let me ask my question about kernel launches in CUDA.

This is my kernel function:

__global__ void PPIKernel(const float *const d_Cube, const float *const d_skewers,
                          unsigned int *d_extremeSet, int pixels_in_one_band,
                          int num_bands, int *result)
{
    // int index = ((gridDim.y * threadIdx.x) + blockIdx.y) * num_bands;
    float pe = 0.0f, pemin = 100.0f, pemax = 0.0f;
    int imax = 0, imin = 0;   // initialized so the writes below never store garbage

    // each thread owns one skewer; index is the start of its band vector
    int index = (blockIdx.y * blockDim.x + threadIdx.x) * num_bands;

    for (int v = 0; v < pixels_in_one_band; v++) {       // LOOP 1 START

        for (int d = 0; d < num_bands; d++) {            // LOOP 2 START
            // dot product of this thread's skewer with pixel v
            // (222 is a hard-coded row stride of the cube; num_bands would be clearer)
            pe = pe + d_skewers[index + d] * d_Cube[222 * v + d];
        }                                                // LOOP 2 STOP

        // remember which pixels project to the extreme values
        if (pe > pemax) {
            pemax = pe;
            imax = v;
        }
        else if (pe < pemin) {
            pemin = pe;
            imin = v;
        }
        pe = 0.0f;
    }                                                    // LOOP 1 END

    // write this thread's extreme pixel indices to global memory
    d_extremeSet[(index / num_bands) * 2]     = imax;
    d_extremeSet[(index / num_bands) * 2 + 1] = imin;

}  // KERNEL END

This function is a little bit confusing.

I launch this kernel with a <<<64,256>>> blocks/threads configuration.
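
For reference, the launch looks roughly like this (a sketch; the host-side names such as d_result are illustrative, not my exact code):

// Sketch of the host-side launch; d_result and the scalars are illustrative names.
// Note: <<<64,256>>> creates a one-dimensional grid, gridDim = (64,1,1), so
// blockIdx.y is always 0 inside the kernel; since the indexing above uses
// blockIdx.y, a two-dimensional launch such as dim3 grid(1,64) would match it.
PPIKernel<<<64, 256>>>(d_Cube, d_skewers, d_extremeSet,
                       pixels_in_one_band, num_bands, d_result);
cudaDeviceSynchronize();   // wait for the kernel to finish before timing it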

When I feed my data into this program, each thread should perform 512x620x220 single-precision floating-point multiplications (these numbers come from my dataset). My program then computes the minimum and maximum dot products for each thread.
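
(For scale: 512 × 620 × 220 is about 69.8 million multiply-adds per thread, and with 64 × 256 = 16,384 threads that comes to roughly 1.1 × 10^12 multiply-adds overall.)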

**When I add these two lines at the end of the kernel, the program takes approximately 1.5 hours to complete all of the dot products!**

d_extremeSet[(index / num_bands) * 2]     = imax;
d_extremeSet[(index / num_bands) * 2 + 1] = imin;

**But if I don't add these two lines, the program completes in about 2 seconds.**

This behavior is really confusing me. Can anyone explain what is going on in my threads?

Thanks a lot.

d_extremeSet is the entire output of your kernel. By removing the statements that write the output of your kernel, the compiler is empowered to eliminate all computation that produces that output as “dead code”. In all likelihood this causes the entire kernel to shrink to a minimal stub. You can check this working hypothesis by dumping the generated machine code (SASS). To do so, run

cuobjdump --dump-sass [executable file name]

on the binary produced by the compiler.
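
To see the effect in isolation, here is a minimal sketch (an illustrative example, not the PPI kernel) of the same situation: a kernel whose expensive loop feeds only a single global store.

// Minimal sketch of dead-code elimination (illustrative, not the PPI kernel).
__global__ void heavy(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += in[i] * in[i];     // expensive work
    out[threadIdx.x] = acc;       // remove this store and the compiler is free
                                  // to eliminate the loop above as dead code
}

With the store present, the SASS contains the full loop; with it removed, the kernel typically compiles down to little more than an EXIT instruction, which would account for the 2-second runtime.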