I have been getting some very strange results from some CUDA code I wrote.
I have the following two device arrays:
float *d_p2nsums;
float *d_n2psums;
I allocate memory for them in an initialization function:
CUDA_SAFE_CALL(cudaMalloc((void**)&d_p2nsums, (1 + (*numPosVectors)) * (1 + (*numWeights)) * sizeof(float)));
CUDA_SAFE_CALL(cudaMalloc((void**)&d_n2psums, (1 + (*numNegVectors)) * (1 + (*numWeights)) * sizeof(float)));
Then, in the main program, I call separate kernel functions on them:
HSpartColSums<<< numBlocksNeg, numThreadsNeg>>>(d_posData, d_negData, d_p2n, numPosVectors, numNegVectors, numWeights, d_n2psums);
HSpartRowSums<<< numBlocksPos, numThreadsPos>>>(d_posData, d_negData, d_p2n, numPosVectors, numNegVectors, numWeights, d_p2nsums);
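Since kernel launches are asynchronous, an error from the first launch would only surface at the next synchronization point, so a failed launch is worth ruling out. A minimal check to run after each launch might look like this (checkLastKernel is a hypothetical helper, not part of the SDK; cudaThreadSynchronize was the current API at the time, cudaDeviceSynchronize in newer toolkits):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: block until the preceding kernel has finished
// and report any deferred launch or execution error.
static void checkLastKernel(const char *name)
{
    cudaError_t err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", name, cudaGetErrorString(err));
}

It would be called as checkLastKernel("HSpartColSums") immediately after the corresponding <<<...>>> launch.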
The results in d_p2nsums are then correct, whereas those in d_n2psums are wrong.
If I call the kernels in the opposite order, d_n2psums is correct and d_p2nsums is wrong.
After some testing I discovered that the two arrays overlap, so the second kernel call overwrites the results of the first.
The overlap begins after the first 256 entries of d_p2nsums; from that point on, d_p2nsums shares memory with the start of d_n2psums.
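A minimal sketch of the kind of check that reveals this is to print each allocation's start and one-past-end device address and compare the two ranges (printExtent is a hypothetical helper; rows and cols stand in for the (1 + *numPosVectors)-style extents above):

#include <cstdio>

// Hypothetical helper: print the device address range [start, end)
// that an allocation of rows * cols floats should occupy.
static void printExtent(const char *name, const float *d_ptr, size_t rows, size_t cols)
{
    printf("%s: [%p, %p)\n", name, (const void *)d_ptr,
           (const void *)(d_ptr + rows * cols));
}

Calling it for both arrays is how the 256-entry figure above can be confirmed.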
I have many other similar device memory allocations in the initialization function, and none of them show this problem. Changing the order of the allocations for d_p2nsums and d_n2psums makes no difference at all.
I can work around the problem by copying the results to host memory after each kernel call and copying them back when they are needed again.
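In sketch form, that workaround looks like this (h_p2nsums is a hypothetical host buffer of the same size; the byte count mirrors the cudaMalloc above):

size_t p2nBytes = (1 + (*numPosVectors)) * (1 + (*numWeights)) * sizeof(float);
// stash the first kernel's results on the host before the second launch
CUDA_SAFE_CALL(cudaMemcpy(h_p2nsums, d_p2nsums, p2nBytes, cudaMemcpyDeviceToHost));
// ... second kernel runs here and clobbers the overlapping region ...
// restore the saved results once they are needed again
CUDA_SAFE_CALL(cudaMemcpy(d_p2nsums, h_p2nsums, p2nBytes, cudaMemcpyHostToDevice));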
But I am still interested in finding out what the problem was. Is this a bug in CUDA? I have checked and rechecked my code and can find nothing wrong with it.