I have been getting some very strange results from some CUDA code I wrote.
I have the following two device arrays:
float *d_p2nsums;
float *d_n2psums;
I allocate memory for them in an initialization function:
CUDA_SAFE_CALL(cudaMalloc((void**)&d_p2nsums, (1 + (*numPosVectors)) * (1 + (*numWeights)) * sizeof(float)));
CUDA_SAFE_CALL(cudaMalloc((void**)&d_n2psums, (1 + (*numNegVectors)) * (1 + (*numWeights)) * sizeof(float)));
Then, in the main program, I call separate kernel functions on them:
HSpartColSums<<< numBlocksNeg, numThreadsNeg>>>(d_posData, d_negData, d_p2n, numPosVectors, numNegVectors, numWeights, d_n2psums);
HSpartRowSums<<< numBlocksPos, numThreadsPos>>>(d_posData, d_negData, d_p2n, numPosVectors, numNegVectors, numWeights, d_p2nsums);
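Since kernel launches are asynchronous, an error from the first launch would only surface at the next synchronization point, so a failed launch is worth ruling out. A minimal check to run after each launch might look like this (checkLastKernel is a hypothetical helper, not part of the SDK; cudaThreadSynchronize was the current API at the time, cudaDeviceSynchronize in newer toolkits):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: block until the preceding kernel has finished
// and report any deferred launch or execution error.
static void checkLastKernel(const char *name)
{
    cudaError_t err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", name, cudaGetErrorString(err));
}

It would be called as checkLastKernel("HSpartColSums") immediately after the corresponding <<<...>>> launch.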
The results in d_p2nsums are then correct, whereas those in d_n2psums are wrong.
If I call the kernels in the opposite order, d_n2psums is correct and d_p2nsums is wrong.
After some testing I discovered that the two arrays overlap, so the second kernel call overwrites the results of the first.
The overlap begins after the first 256 entries of d_p2nsums; from that point on, d_p2nsums shares memory with the start of d_n2psums.
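A minimal sketch of the kind of check that reveals this is to print each allocation's start and one-past-end device address and compare the two ranges (printExtent is a hypothetical helper; rows and cols stand in for the (1 + *numPosVectors)-style extents above):

#include <cstdio>

// Hypothetical helper: print the device address range [start, end)
// that an allocation of rows * cols floats should occupy.
static void printExtent(const char *name, const float *d_ptr, size_t rows, size_t cols)
{
    printf("%s: [%p, %p)\n", name, (const void *)d_ptr,
           (const void *)(d_ptr + rows * cols));
}

Calling it for both arrays is how the 256-entry figure above can be confirmed.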
I have many other similar device memory allocations in the initialization function, and none of them show this problem. Changing the order of the allocations for d_p2nsums and d_n2psums makes no difference at all.
I can work around the problem by copying the results to host memory after each kernel call and copying them back when they are needed again.
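In sketch form, that workaround looks like this (h_p2nsums is a hypothetical host buffer of the same size; the byte count mirrors the cudaMalloc above):

size_t p2nBytes = (1 + (*numPosVectors)) * (1 + (*numWeights)) * sizeof(float);
// stash the first kernel's results on the host before the second launch
CUDA_SAFE_CALL(cudaMemcpy(h_p2nsums, d_p2nsums, p2nBytes, cudaMemcpyDeviceToHost));
// ... second kernel runs here and clobbers the overlapping region ...
// restore the saved results once they are needed again
CUDA_SAFE_CALL(cudaMemcpy(d_p2nsums, h_p2nsums, p2nBytes, cudaMemcpyHostToDevice));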
But I am still interested in finding out what the problem was. Is this a bug in CUDA? I have checked and rechecked my code and can find nothing wrong with it.