Hi everyone!

I'm writing code that implements the Pixel Purity Index (PPI) algorithm on the GPU. This algorithm computes millions of dot products between pixel vectors and skewer vectors. Anyway, let me get to my question about kernel launches in CUDA.

This is my kernel function:

```
__global__ void PPIKernel(const float *const d_Cube, const float *const d_skewers,
                          unsigned int *d_extremeSet, int pixels_in_one_band,
                          int num_bands, int *result)
{
    // int index = ((gridDim.y * threadIdx.x) + blockIdx.y) * num_bands;
    float pe = 0, pemin = 100.0f, pemax = 0;
    int imax = 0, imin = 0;  // initialized so the writes below are never undefined
    int index = (blockIdx.y * blockDim.x + threadIdx.x) * num_bands;

    for (int v = 0; v < pixels_in_one_band; v++) {  // LOOP 1: over pixels
        for (int d = 0; d < num_bands; d++) {       // LOOP 2: over bands
            pe = pe + d_skewers[index + d] * d_Cube[222 * v + d];
        }
        if (pe > pemax) {       // track the pixel with the largest projection
            pemax = pe;
            imax = v;
        }
        else if (pe < pemin) {  // track the pixel with the smallest projection
            pemin = pe;
            imin = v;
        }
        pe = 0;
    } // LOOP 1 END
    d_extremeSet[(index / num_bands) * 2] = imax;
    d_extremeSet[(index / num_bands) * 2 + 1] = imin;
} // KERNEL END
```
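In case the kernel is hard to read, here is a plain serial sketch of what a single thread is supposed to compute (the function and variable names are just placeholders I made up, and `stride` stands in for the hard-coded 222 in my kernel):

```
#include <vector>

// Serial sketch of one thread's work: dot this thread's skewer against
// every pixel vector, tracking the indices of the extreme projections.
void ppi_one_thread(const std::vector<float>& cube,
                    const std::vector<float>& skewer,
                    int pixels_in_one_band, int num_bands, int stride,
                    int& imax, int& imin)
{
    float pemax = 0.0f, pemin = 100.0f;  // same initial values as the kernel
    imax = 0;
    imin = 0;
    for (int v = 0; v < pixels_in_one_band; v++) {
        float pe = 0.0f;
        for (int d = 0; d < num_bands; d++)
            pe += skewer[d] * cube[stride * v + d];  // one dot product
        if (pe > pemax) { pemax = pe; imax = v; }
        else if (pe < pemin) { pemin = pe; imin = v; }
    }
}
```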

Sorry if this function is a little confusing.

I launch this kernel with a <<<64,256>>> (blocks, threads-per-block) configuration.

With my dataset, each thread performs 512x620x220 single-precision floating-point multiplications (the cube is 512x620 pixels with 220 bands), and the program records the minimum and maximum dot product found by each thread.
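Just to double-check that arithmetic, here is how the per-thread and total multiply counts work out (a plain C++ sketch; the helper name is made up):

```
#include <cstdint>

// Number of multiplications one thread performs:
// one per band, for every pixel in the scene.
std::int64_t mults_per_thread(std::int64_t pixels_in_one_band,
                              std::int64_t num_bands) {
    return pixels_in_one_band * num_bands;
}

// With 512x620 pixels and 220 bands, that is 69,836,800 multiplies per
// thread; across the 64*256 = 16,384 threads of a <<<64,256>>> launch,
// it is about 1.14e12 multiplies in total.
```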

**When I add these two lines at the end of the kernel, the program takes approximately 1.5 hours to complete all of the dot products!**

```
d_extremeSet[index / num_bands * 2] = imax;
d_extremeSet[(index / num_bands) * 2 + 1] = imin;
```

**But if I don't add these two lines, the program completes in about 2 seconds.**

This behavior is really confusing me. Can anyone explain what is going on in my threads?

Thanks a lot.