# Problem with reduction

Hi everybody,

I am trying to compute the 2-norm of each column of a matrix and find which column gives the minimum.

It looks like this:

```cuda
__global__ void obtener_s_gpu(float *HT, int m, int n, float *c, int L,
                              float *ss, int *ids, float *normaBloque)
{
	int i, j;
	int posib, div, resto;
	float mult;

	int nTotalThreads = blockDim.x;                   // total number of threads per block
	int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

	// Shared array: one partial norm per thread of this block.
	__shared__ float norma[3];

	posib = (int)pow((float)L, m);

	if (idx < posib)
	{
		div = idx;

		// Building s
		for (i = 0; i < m; i++)
		{
			resto = div % L;
			ss[i * posib + idx] = (-L + 1.0f) / 2.0f + resto;
			div = div / L;
		}

		// Accumulate the squared 2-norm of the residual R*s - c
		norma[threadIdx.x] = 0.0f;
		for (i = m - 1; i > -1; i--)
		{
			mult = 0.0f;
			for (j = m - 1; j > i - 1; j--)
			{
				mult = mult + HT[j * n + i] * ss[j * posib + idx]; // mul = mul + R[i*m+j]*ss[j]
			}
			mult = mult - c[i];
			norma[threadIdx.x] += mult * mult;
		}
	}
	__syncthreads();

	// Find the smallest norm (tree reduction; assumes nTotalThreads is a power of two)
	while (nTotalThreads > 1)
	{
		int halfPoint = (nTotalThreads >> 1); // divide by two

		// Only the first half of the threads will be active.
		if (threadIdx.x < halfPoint)
		{
			// Get the shared value stored by another thread
			float temp = norma[threadIdx.x + halfPoint];
			if (temp < norma[threadIdx.x])
			{
				norma[threadIdx.x] = temp;
			}
		}
		__syncthreads();
		nTotalThreads = halfPoint;
	}

	// Thread 0 writes this block's minimum
	if (threadIdx.x == 0)
	{
		normaBloque[blockIdx.x] = norma[0];
	}
}
```
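The part I still need is tracking *which* candidate produced the minimum, which is where the `ids` write comes in. As far as I understand the usual pattern, that takes a second shared array carrying the candidate index, updated in lockstep with the norms. A sketch of just that reduction step (assuming a power-of-two block size; `indice` and `miNorma` are names I made up, not in my kernel above):

```cuda
// Fragment of a per-block argmin reduction (inside a __global__ kernel).
// 'miNorma' is this thread's candidate norm, 'idx' its global index.
__shared__ float norma[512];
__shared__ int   indice[512];

norma[threadIdx.x]  = miNorma;   // candidate value
indice[threadIdx.x] = idx;       // candidate index, moved together with it
__syncthreads();

for (int half = blockDim.x >> 1; half > 0; half >>= 1) {
	if (threadIdx.x < half && norma[threadIdx.x + half] < norma[threadIdx.x]) {
		norma[threadIdx.x]  = norma[threadIdx.x + half];
		indice[threadIdx.x] = indice[threadIdx.x + half];
	}
	__syncthreads();
}

if (threadIdx.x == 0) {
	normaBloque[blockIdx.x] = norma[0];   // block minimum
	ids[blockIdx.x]         = indice[0];  // index that achieved it
}
```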

Everything seems to work fine, but when I add the line " ids[blockIdx.x] = thread; " something strange happens.

Output without that line:

```
6 threads in total. 2 blocks needed
possible s vectors: 4
Matrix ss:
-0.500000  0.500000 -0.500000  0.500000
-0.500000 -0.500000  0.500000  0.500000
ids_0: 1094378115
normaBloque_0: 6.114271
ids_1: 1063137839
normaBloque_1: 8.192823
```

Output with that line:

```
6 threads in total. 2 blocks needed
possible s vectors: 4
Matrix ss:
0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000
ids_0: 0
normaBloque_0: 0.000000
ids_1: 0
normaBloque_1: 0.000000
```

How can this happen? Any ideas?

I declared ids the same way as normaBloque, and the latter looks fine, so I don't think that's the problem.

```cuda
cudaMalloc((void **) &normaBloque_d, gridSize * sizeof(float));
cudaMalloc((void **) &ids_d, gridSize * sizeof(int));
```

I pass them to the kernel and then copy them back to the host with:

```cuda
cudaMemcpy(normaBloque, normaBloque_d, gridSize * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(ids, ids_d, gridSize * sizeof(int), cudaMemcpyDeviceToHost);
```
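For debugging, I could at least check the return code of every CUDA call and synchronize after the kernel; a minimal sketch (the CHECK macro is just a convention of mine, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call failed.
#define CHECK(call)                                                   \
	do {                                                              \
		cudaError_t err = (call);                                     \
		if (err != cudaSuccess) {                                     \
			fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
			        cudaGetErrorString(err), __FILE__, __LINE__);     \
			exit(EXIT_FAILURE);                                       \
		}                                                             \
	} while (0)

// Usage with the copies above:
// CHECK(cudaMemcpy(normaBloque, normaBloque_d, gridSize * sizeof(float),
//                  cudaMemcpyDeviceToHost));
// CHECK(cudaMemcpy(ids, ids_d, gridSize * sizeof(int),
//                  cudaMemcpyDeviceToHost));
// CHECK(cudaDeviceSynchronize());  // surfaces errors from the kernel itself
```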

I am lost…

Edited: here is the full kernel code, maybe now it's clearer… or maybe not. First I build the vector, then compute the 2-norm, and then find the minimum.

Everything works except the ids vector. Also, the screen flashes a little when I run it. What am I doing wrong?

I found the problem. I was passing the host ids vector to the kernel instead of ids_d…

Silly mistake, so sorry.
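For anyone who hits the same thing: the fix is just making sure every pointer argument in the launch is a device pointer (the `_d` names and `blockSize` here follow my own conventions and are assumed):

```cuda
// Wrong: 'ids' is a host array, so the kernel writes through an invalid
// pointer (which also explains the screen flashing / driver reset):
// obtener_s_gpu<<<gridSize, blockSize>>>(HT_d, m, n, c_d, L, ss_d, ids, normaBloque_d);

// Right: pass the cudaMalloc'd device buffer instead.
obtener_s_gpu<<<gridSize, blockSize>>>(HT_d, m, n, c_d, L, ss_d, ids_d, normaBloque_d);
```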