2MB size limit cudaMemcpy

Hi,
When I try to copy a block of memory from the CUDA device (Tesla C2050) to the host (Windows 7 x64, VS2008) that is larger than 2MB, cudaMemcpy returns with an “unknown error.” Is there a hard limit on how much cudaMemcpy can handle? Smaller copies work perfectly well.

Thanks,
Oliver

2MB or 2GB?

Windows 7 WDDM can mess with multi-gigabyte support, but 2MB should be no problem at all.

It’s 2MB. When I try to copy an array of size 600 * 600 * sizeof(double), cudaMemcpy fails, whereas 500 * 500 * sizeof(double) succeeds. I’ve narrowed it down, and it looks like the failure begins right at the 2MB boundary. Frankly, I’m stumped. Is there something else I could be missing?

Can you post the code?

If you have a choice, move off Windows 7…

I had always wanted NVIDIA to expose their HPC cards as HPC cards… instead of graphics cards.
Not sure why they expose them that way…
But it sure invites a lot of problems on Windows 7.
Windows does not like anybody controlling anything…
It gets too nosy…

(I’m not at the CUDA machine right now, but…)

It’s not much more than:


double* host_a;

double* dev_a;

host_a = (double*)malloc(600 * 600 * sizeof(double));

status = cudaMalloc(dev_a, 600 * 600 * sizeof(double));

status = cudaMemcpy(host_a, dev_a, 600 * 600 * sizeof(double), cudaMemcpyDeviceToHost);


status = cudaMalloc(dev_a, 600 * 600 * sizeof(double));

should be

status = cudaMalloc((void**)&dev_a, 600 * 600 * sizeof(double));

Yes, that’s right. I was posting from memory. The code compiles and executes fine when the amount is smaller than 2MB. Can anyone else verify/reproduce this problem under Win7 x64?

Thanks.

One of my systems is Win7 x64 and I’ve never seen anything like this. There is no such limit on memory transfers. You need to post your exact code. I’m 90% sure that you’re not allocating the memory correctly, or you’re passing a bad pointer to cudaMemcpy.

Add a cudaThreadSynchronize() before and after your memcpy.

Why does this require a cudaThreadSynchronize()? Can you explain it?

As much as I learnt from the manual, device-to-device copies, and host-to-device copies smaller than 64KB, can be asynchronous in nature… But this one beats me… Can you please explain the rationale?

Random theory: he’s doing a lot of kernels before that memcpy, he’s right at the 2-second TDR limit on WDDM, and the batch takes longer than 2s as soon as he copies more than 2MB.
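One way to test that theory is to check for a launch error separately from an execution error around the kernel; a sketch (names taken from the code in this thread, the pattern itself is generic):

```cuda
// Launch the kernel as before.
distance_kernel<<<blocks, threads>>>(dev_a, dev_b, threads, Vn, Vx);

// Launch-configuration errors (bad grid/block dims, etc.) show up here:
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));

// Execution errors (including a WDDM/TDR watchdog kill) show up here:
cudaError_t execErr = cudaThreadSynchronize();
if (execErr != cudaSuccess)
    fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(execErr));
```

If the error only appears at the synchronize, it’s the kernel’s execution (e.g. a watchdog abort), not the copy itself.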

Thanks for that tip! However…

I’m finally back in the lab, where I can test the code. The procedure that calls the kernel (which spawns several thousand threads) now looks like this. (All I’m doing is computing a distance matrix on very large matrices.) cudaThreadSynchronize now fails with “unknown error” at the call right after the kernel launch, i.e. just before the device-to-host cudaMemcpy.

The getDim procedure scales all of the dimensions (units/thread, threads/block, number of blocks) so that no more than 1024 threads per block are executed at once.

Any tips would be greatly appreciated.

The calling procedure:

__declspec (dllexport) int distanceMatrix (double *in, int Vn, int Vx, double *out) {

double *dev_a;

double *dev_b;



int threadDim;

int blockDim_1D;

int nSquared;

nSquared = Vn * Vn;

cudaError_t status = cudaSuccess;



getDim(Vn, &blockDim_1D, &threadDim);

//Because the distance matrix is by definition a square, the x and y 

//dimensions are equal.

dim3 threads(threadDim, threadDim);

dim3 blocks(blockDim_1D, blockDim_1D);

// allocate CUDA memory for Input

status = cudaMalloc ( (void**) &dev_a, Vn * Vx * sizeof(double));

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaMalloc Input %s\n", cudaGetErrorString(status));

    return 1;

}

// allocate CUDA memory for Output 



status = cudaMalloc ( (void**) &dev_b, nSquared * sizeof(double));

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaMalloc Output %s\n", cudaGetErrorString(status));

    return 1;

}

// Copy data to CUDA

status = cudaMemcpy ( dev_a, in, Vn * Vx * sizeof(double), cudaMemcpyHostToDevice );

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaMemcpy Input %s\n", cudaGetErrorString(status));

    return 1;

}

// Execute the kernel

distance_kernel<<<blocks,threads>>>( dev_a, dev_b, threads, Vn, Vx);

status = cudaThreadSynchronize();

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaThreadSynchronize before %s\n", cudaGetErrorString(status));

    return 1;

}

// Copy data from CUDA

status = cudaMemcpy ( out, dev_b, nSquared * sizeof(double), cudaMemcpyDeviceToHost );

if(status != cudaSuccess)

{

	if (status == cudaErrorInvalidValue) 

		std::cout << "cudaErrorInvalidValue" << std::endl;

	else if (status == cudaErrorInvalidDevicePointer)

		std::cout << "cudaErrorInvalidDevicePointer" << std::endl;

	else if (status == cudaErrorInvalidMemcpyDirection)

		std::cout << "cudaErrorInvalidMemcpyDirection" << std::endl;

	else fprintf(stderr, "cudaMemcpy Device to Host 2 %s\n", cudaGetErrorString(status));

    return 1;

}



status = cudaThreadSynchronize();

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaThreadSynchronize after %s\n", cudaGetErrorString(status));

    return 1;

}

// Free CUDA memory

status = cudaFree (dev_b);

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaFree %s\n", cudaGetErrorString(status));

    return 1;

}

status = cudaFree (dev_a);

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaFree %s\n", cudaGetErrorString(status));

    return 1;

}

return 0;

} // distanceMatrix

So, it looks like your kernel is the problematic one…

Perhaps try it with a locally defined and malloc’ed “out” variable to see if the problem goes away or at least changes. I see that “out” is currently a parameter: __declspec (dllexport) int distanceMatrix (double *in, int Vn, int Vx, double *out).

Unknown error on WDDM platforms is usually a sign of TDR. Increase your TdrDelay reg key as described here:
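For reference, the key lives under the GraphicsDrivers node (value in seconds, created as a DWORD if it doesn’t exist; reboot afterwards — 60 seconds below is just an example value):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
```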

Thanks, I’ll have a look at that. When I left the lab today, I had noticed that when I scale the number of threads per block down to 64, things (appear to) work swimmingly (no more unknown errors).

I’ll let y’all know how it turns out.

Feature request for CUDA 4.1: could you add a new error enum for these display-runtime watchdog aborts? That would make these problems much easier to diagnose.

I actually think we just made a change to the driver a few weeks ago to change how TDR reporting is handled (get rid of unknown error).