Hi,

I wrote a kernel that should do the same as sger from BLAS (it should just compute A = a*x*y_t + A).

But it actually does nothing, and it does so without reporting any error ;)

This is the kernel:

```
/** simple sger:
 * computes A = a*x*y_t + A (y_t is the transposed vector y)
 */
__global__ void simple_sger(float *x, float *y, float *A,
                            int incX, int incY, int incA, float a)
{
    // One thread per matrix entry: i indexes x, j indexes y
    int i = threadIdx.x;
    int j = threadIdx.y;
    float result = a * x[i*incX] * y[j*incY];
    A[i + j*incA] += result;
}
```
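For comparison, here is a minimal host-side sketch of the same update, written from the kernel above and not taken from the BLAS reference. The name `reference_sger` and the `rows`/`cols` parameters are hypothetical, and it assumes that `incA` is the stride between consecutive values of `j`, as the index `i + j*incA` suggests. Something like this could be run on the CPU to see what the kernel is expected to produce.

```cuda
// Hypothetical CPU reference for simple_sger (plain host code, no device calls).
// Assumes the same indexing as the kernel: entry (i, j) lives at A[i + j*incA].
void reference_sger(const float *x, const float *y, float *A,
                    int rows, int cols,
                    int incX, int incY, int incA, float a)
{
    for (int j = 0; j < cols; ++j) {
        for (int i = 0; i < rows; ++i) {
            // Same arithmetic as one kernel thread with threadIdx = (i, j)
            A[i + j * incA] += a * x[i * incX] * y[j * incY];
        }
    }
}
```

Calling it with the parameters in the order of the signature above and comparing the result entry by entry against the matrix copied back from the device would show whether the kernel wrote anything at all.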

I launch it like this:

```
// Device buffers for the two vectors and the matrix
float *vec1;
float *vec2;
float *devM;

// Host staging buffer for the vectors, and the host matrix filled with 3
float out[32];
float M[32*32];
for (int i = 0; i < 32*32; i++) M[i] = 3;

cudaMalloc((void**)&vec1, 32*sizeof(float));
cudaMalloc((void**)&vec2, 32*sizeof(float));
cudaMalloc((void**)&devM, 32*32*sizeof(float));

// vec1 is all ones, vec2 is all fives
for (int i = 0; i < 32; i++) out[i] = 1.0f;
cudaMemcpy(vec1, out, 32*sizeof(float), cudaMemcpyHostToDevice);
for (int i = 0; i < 32; i++) out[i] = 5.0f;
cudaMemcpy(vec2, out, 32*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(devM, M, 32*32*sizeof(float), cudaMemcpyHostToDevice);

// One block of 32 x 32 threads, one thread per matrix entry
dim3 d(32, 32);
simple_sger<<<1, d>>>(vec1, vec2, devM, 7.3f, 1, 1, 32);

// Copy the matrix back and print it
cudaMemcpy(M, devM, 32*32*sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < 32; i++) {
    for (int j = 0; j < 32; j++) std::cout << M[i*32+j] << " ";
    std::cout << std::endl;
}
```

The output is always whatever I wrote into the matrix before launching the kernel (in this case 3).

I checked nearly every error code that is returned.
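A kernel launch has no return value, so its errors have to be fetched separately. Below is a minimal sketch of how that might look, assuming a toolkit recent enough to provide `cudaDeviceSynchronize`. The `CHECK_CUDA` macro and the `fill` kernel are hypothetical names used only to keep the example self-contained; they are not part of my code above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print any CUDA runtime error with its source location.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
        }                                                             \
    } while (0)

// Trivial placeholder kernel, only here so the sketch compiles on its own.
__global__ void fill(float *out, float value)
{
    int idx = threadIdx.x + threadIdx.y * blockDim.x;
    out[idx] = value;
}

int main()
{
    // Query the per-block thread limit of device 0; a launch that exceeds
    // it fails without running the kernel at all.
    cudaDeviceProp prop;
    CHECK_CUDA(cudaGetDeviceProperties(&prop, 0));
    std::printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    dim3 block(16, 16);  // 256 threads in one block
    float *dev = nullptr;
    CHECK_CUDA(cudaMalloc((void **)&dev, block.x * block.y * sizeof(float)));

    fill<<<1, block>>>(dev, 1.0f);
    CHECK_CUDA(cudaGetLastError());       // launch and configuration errors
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaFree(dev));
    return 0;
}
```

The two checks after the launch cover different cases: `cudaGetLastError` reports a launch that was rejected outright, while the synchronize call surfaces problems such as out-of-bounds accesses that only show up once the kernel has executed.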

Other kernels do work (for example, sswap and sscal work with those vectors).

I tried just writing a constant value into the matrix entries; this doesn't work either.

Writing the value of result back into the vector x or y doesn't work either.

I am desperate; I can't figure out what is wrong.

Thanks in advance,

Stefan