Strange problem with kernel: kernel does not write back to global memory

Hi,

I wrote a kernel that should do the same as sger from BLAS (it should just compute A = a * x * y^T + A).

But it actually does nothing, and it does so without reporting any error ;)

This is the kernel:

/** simple sger:
 *  computes A = a*x*y_t + A    (y_t: the transposed vector y)
 */
__global__ void simple_sger(float *x, float *y, float *A, int incX, int incY, int incA, float a)
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    float result = a * x[i * incX] * y[j * incY];
    A[i + j * incA] += result;
}

I launch it like this:

float *vec1;
float *vec2;
float out[32];
float *devM;
float M[32*32];
for (int i = 0; i < 32*32; i++) M[i] = 3;

cudaMalloc((void**)&vec1, 32*sizeof(float));
cudaMalloc((void**)&vec2, 32*sizeof(float));
cudaMalloc((void**)&devM, 32*32*sizeof(float));

for (int i = 0; i < 32; i++) out[i] = 1.0f;
cudaMemcpy(vec1, out, 32*sizeof(float), cudaMemcpyHostToDevice);

for (int i = 0; i < 32; i++) out[i] = 5.0f;
cudaMemcpy(vec2, out, 32*sizeof(float), cudaMemcpyHostToDevice);

cudaMemcpy(devM, M, 32*32*sizeof(float), cudaMemcpyHostToDevice);

dim3 d(32, 32);
simple_sger<<<1, d>>>(vec1, vec2, devM, 7.3f, 1, 1, 32);

cudaMemcpy(M, devM, 32*32*sizeof(float), cudaMemcpyDeviceToHost);

for (int i = 0; i < 32; i++) {
    for (int j = 0; j < 32; j++) std::cout << M[i*32 + j] << " ";
    std::cout << std::endl;
}

The output is always what I wrote into the matrix before launching the kernel (in this case, 3).

I checked nearly every error code that is returned.

Other kernels do work (for example, sswap and sscal work with these vectors).

I tried just writing a constant value into the matrix entries; this doesn't work either.

Writing the value of result back into the vector x or y doesn't work either.

I am desperate, I can't figure out what is wrong.

Thanks in advance,

Stefan

Use cudaGetLastError() to check the result from executing your kernel. Most likely it is failing to launch. You are trying to launch 1024 threads in one block, but the maximum allowed is 512 threads per block.
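To make the suggestion above concrete, here is a minimal sketch of the same update spread over several smaller blocks, with cudaGetLastError() checked right after the launch. The names sger_tiled and run_sger, the 16x16 block shape, and the bounds check are assumptions for illustration, not code from the original post. Note also that the posted call passes 7.3f in the incX slot and 32 in the a slot, whereas the signature expects (x, y, A, incX, incY, incA, a); the sketch assumes the intended values were incX = incY = 1, incA = 32, a = 7.3f.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same arithmetic as simple_sger, but the indices also use blockIdx,
// so the 32x32 update no longer has to fit into a single block.
__global__ void sger_tiled(const float *x, const float *y, float *A,
                           int m, int n, int incX, int incY, int incA, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index into x
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // index into y
    if (i < m && j < n)                             // guard partial blocks
        A[i + j * incA] += a * x[i * incX] * y[j * incY];
}

// Hypothetical host helper: applies A = a*x*y^T + A to an n x n matrix
// with unit strides. Returns true if the launch and copy-back succeeded.
bool run_sger(std::vector<float> &A, const std::vector<float> &x,
              const std::vector<float> &y, int n, float a)
{
    float *dx = 0, *dy = 0, *dA = 0;
    size_t vecBytes = n * sizeof(float);
    size_t matBytes = (size_t)n * n * sizeof(float);
    cudaMalloc((void**)&dx, vecBytes);
    cudaMalloc((void**)&dy, vecBytes);
    cudaMalloc((void**)&dA, matBytes);
    cudaMemcpy(dx, x.data(), vecBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), vecBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dA, A.data(), matBytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // 256 threads per block, below the 512 limit
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    sger_tiled<<<grid, block>>>(dx, dy, dA, n, n, 1, 1, n, a);

    // A rejected launch configuration is reported here, not by the kernel.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)  // the blocking copy also waits for the kernel
        err = cudaMemcpy(A.data(), dA, matBytes, cudaMemcpyDeviceToHost);

    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dA);
    return err == cudaSuccess;
}
```

With the values from the post (A filled with 3, x with 1, y with 5, a = 7.3f, n = 32), every entry should come back as roughly 3 + 7.3 * 1 * 5 = 39.5 instead of the unchanged 3.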

Thanks a lot, I forgot about this limit.

I had been checking the error returned by cudaThreadSynchronize to make sure that the kernel had really launched.

Now I know that this does not seem to catch the problem (until now it had worked).
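For anyone who runs into the same thing, a small sketch of checking the launch and the execution separately may help. The helpers try_launch and max_threads_per_block and the empty kernel noop_kernel are hypothetical names used only for this illustration. The sketch uses cudaDeviceSynchronize, the current name for what this thread calls cudaThreadSynchronize, and it assumes device 0 is the device in use.

```cuda
#include <cuda_runtime.h>

// Empty kernel, only used to exercise the launch configuration.
__global__ void noop_kernel() {}

// Hypothetical helper: launches noop_kernel with the given block shape
// and returns the first error seen, checking launch and execution separately.
cudaError_t try_launch(dim3 block)
{
    cudaGetLastError();                             // clear any earlier error
    noop_kernel<<<1, block>>>();
    cudaError_t launchErr = cudaGetLastError();     // bad configuration shows up here
    cudaError_t syncErr = cudaDeviceSynchronize();  // errors during execution show up here
    return launchErr != cudaSuccess ? launchErr : syncErr;
}

// Hypothetical helper: asks the runtime for the per-block thread limit
// of device 0 instead of hard-coding it. Returns -1 if the query fails.
int max_threads_per_block()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1;
    return prop.maxThreadsPerBlock;
}
```

Calling try_launch(dim3(16, 16)) should succeed, while a block shape whose thread count exceeds max_threads_per_block() should come back with an error (typically cudaErrorInvalidConfiguration) from the cudaGetLastError() check.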