Strange behaviour/bug? memory not allocated properly

I have been getting some very strange results with some CUDA code I wrote.
I have the following 2 arrays:

float *d_p2nsums;
float *d_n2psums;

I allocate memory for them in an initialization function:

CUDA_SAFE_CALL(cudaMalloc((void**)d_p2nsums, (1 + numPosVectors) * (1 + (*numWeights)) * sizeof(float)));
CUDA_SAFE_CALL(cudaMalloc((void**)d_n2psums, (1 + numNegVectors) * (1 + (*numWeights)) * sizeof(float)));

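One way to check whether the two regions really do overlap is to print the returned pointers and the requested byte counts straight after each allocation. A rough sketch of that check (simplified: it assumes the pointers and counts are visible in the same scope and treats the counts as plain ints):

// Sketch only: requires <cstdio>; counts treated as plain ints for simplicity.
size_t p2nBytes = (size_t)(1 + numPosVectors) * (1 + numWeights) * sizeof(float);
size_t n2pBytes = (size_t)(1 + numNegVectors) * (1 + numWeights) * sizeof(float);

CUDA_SAFE_CALL(cudaMalloc((void**)&d_p2nsums, p2nBytes));
CUDA_SAFE_CALL(cudaMalloc((void**)&d_n2psums, n2pBytes));

// The ranges [d_p2nsums, d_p2nsums + p2nBytes) and [d_n2psums, d_n2psums + n2pBytes)
// should not overlap if the sizes above are what I think they are.
printf("d_p2nsums: %p, %lu bytes\n", (void*)d_p2nsums, (unsigned long)p2nBytes);
printf("d_n2psums: %p, %lu bytes\n", (void*)d_n2psums, (unsigned long)n2pBytes);
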
Then, in the main program, I call separate kernel functions on them:

HSpartColSums<<< numBlocksNeg, numThreadsNeg>>>(d_posData, d_negData, d_p2n, numPosVectors, numNegVectors, numWeights, d_n2psums);
HSpartRowSums<<< numBlocksPos, numThreadsPos>>>(d_posData, d_negData, d_p2n, numPosVectors, numNegVectors, numWeights, d_p2nsums);

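For what it is worth, a quick way to rule out a failed launch (rather than an overwrite) is to check the error status after each call, along these lines:

HSpartColSums<<< numBlocksNeg, numThreadsNeg >>>(d_posData, d_negData, d_p2n, numPosVectors, numNegVectors, numWeights, d_n2psums);
cudaThreadSynchronize();                 // wait for the kernel to finish
cudaError_t err = cudaGetLastError();    // pick up any launch or execution error
if (err != cudaSuccess)
    printf("HSpartColSums failed: %s\n", cudaGetErrorString(err));
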
The results in d_p2nsums are then correct, whereas d_n2psums is wrong.
If I call the kernel functions in the opposite order, d_n2psums is correct and d_p2nsums is wrong.
After some testing I discovered what is happening: the two arrays overlap, so the second kernel call overwrites the results of the previous one.
The overlap starts after the first 256 entries of d_p2nsums; everything beyond that point falls inside the initial part of d_n2psums.
I have many other, similar device memory allocations in the initialization function, and none of them cause problems. Changing the order of the allocations for d_p2nsums and d_n2psums makes no difference at all.
I can work around the problem by copying the results to host memory after each kernel call and copying them back later when they are needed (roughly as sketched below).
But I am still interested to find out what the problem was. Is this a bug in CUDA? I have checked and rechecked my code and can find nothing wrong with it.
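
The workaround looks roughly like this, where h_n2psums is a host buffer of the same size and n2pBytes is the byte count used in the allocation:

// Save the first kernel's results to the host before launching the second kernel...
CUDA_SAFE_CALL(cudaMemcpy(h_n2psums, d_n2psums, n2pBytes, cudaMemcpyDeviceToHost));

// ...and restore them later, when they are needed again.
CUDA_SAFE_CALL(cudaMemcpy(d_n2psums, h_n2psums, n2pBytes, cudaMemcpyHostToDevice));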

It’s hard to say without seeing your complete code.

The best way to debug this kind of memory problem is to compile in emulation debug mode - have you tried this?
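
On that generation of the toolkit this typically means building with something like nvcc -deviceemu -g (or emu=1 dbg=1 if you use the SDK makefiles). The kernels then run on the CPU, so you can step through them in a host debugger and put ordinary printf calls straight into the kernel body. As a purely hypothetical sketch (idx, sumsSize and the zero write are invented placeholders), a bounds check along these lines inside HSpartRowSums would show immediately whether a thread is writing past the region you allocated:

// Hypothetical sketch only; assumes <cstdio> is included in the .cu file.
__global__ void HSpartRowSums_checked(float *d_p2nsums, int sumsSize /* ...other arguments... */)
{
    // Invented index computation; use whatever indexing the real kernel uses.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx >= sumsSize) {
        // printf from "device" code works here because emulation mode runs on the CPU.
        printf("out-of-bounds write attempt: idx=%d, allocated=%d floats\n", idx, sumsSize);
        return;
    }

    d_p2nsums[idx] = 0.0f;   // placeholder for the real row-sum computation
}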