2MB size limit cudaMemcpy

Hi,
When I try to copy a block of memory from the CUDA device (Tesla C2050) to the host (Windows 7 x64, VS2008) that is larger than 2MB, cudaMemcpy returns with an “unknown error.” Is there a hard limit on how much cudaMemcpy can handle? Smaller copies work perfectly well.

Thanks,
Oliver

2MB or 2GB?

Windows 7 WDDM can mess with multi-gigabyte support, but 2MB should be no problem at all.

It’s 2MB. When I try to copy an array of size 600 * 600 * sizeof(double), cudaMemcpy fails, whereas 500 * 500 * sizeof(double) succeeds. I’ve narrowed it down, and it looks like the failure begins right at the 2MB boundary. Frankly, I’m stumped. Is there something else I could be missing?

Can you post the code?

If you have a choice, move off Windows 7…

I had always wanted NVIDIA to expose their HPC cards as HPC cards… instead of graphics cards.
Not sure why they expose them that way…
But it sure invites a lot of problems on Windows 7.
Windows does not like anybody controlling anything…
It gets too nosy…

(I’m not at the CUDA machine right now, but…)

It’s not much more than:


double* host_a;

double* dev_a;

host_a = (double*)malloc(600 * 600 * sizeof(double));

status = cudaMalloc(dev_a, 600 * 600 * sizeof(double));

status = cudaMemcpy(host_a, dev_a, 600 * 600 * sizeof(double), cudaMemcpyDeviceToHost);


status = cudaMalloc(dev_a, 600 * 600 * sizeof(double));

should be

status = cudaMalloc((void**)&dev_a, 600 * 600 * sizeof(double));

Yes, that’s right. I was posting from memory. The code compiles and executes fine when the amount is smaller than 2MB. Can anyone else verify/reproduce this problem under Win7 x64?

Thanks.

One of my systems is Win7 x64 and I’ve never seen anything like this. There is no such limit on memory transfers. You need to post your exact code. I’m 90% sure that you’re not allocating the memory correctly, or you’re passing a bad pointer to cudaMemcpy.

Add a cudaThreadSynchronize() before and after your memcpy.

Why does this require a cudaThreadSynchronize()? Can you explain it?

As much as I learnt from the manual, device-to-device copies, and host-to-device copies smaller than 64KB, can be asynchronous in nature… But this one beats me… Can you please explain the rationale?

Random theory: he’s doing a lot of kernels before that memcpy, he’s right at the 2-second TDR limit on WDDM, and the batch takes longer than 2s as soon as he copies more than 2MB.
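One way to test that theory is to check for a launch error separately from an execution error around the kernel; a sketch (names taken from the code in this thread, the pattern itself is generic):

```cuda
// Launch the kernel as before.
distance_kernel<<<blocks, threads>>>(dev_a, dev_b, threads, Vn, Vx);

// Launch-configuration errors (bad grid/block dims, etc.) show up here:
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));

// Execution errors (including a WDDM/TDR watchdog kill) show up here:
cudaError_t execErr = cudaThreadSynchronize();
if (execErr != cudaSuccess)
    fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(execErr));
```

If the error only appears at the synchronize, it’s the kernel’s execution (e.g. a watchdog abort), not the copy itself.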

Thanks for that tip! However…

I’m finally back in the lab, where I can test the code. The procedure that calls the kernel (which spawns several thousand threads) now looks like this. (All I’m doing is computing a distance matrix on very large matrices.) cudaThreadSynchronize now fails with “unknown error” at the call right after the kernel launch, i.e. just before the device-to-host cudaMemcpy.

The getDim procedure scales all of the dimensions (units/thread, threads/block, number of blocks) so that no more than 1024 threads per block are executed at once.

Any tips would be greatly appreciated.

The calling procedure:

__declspec (dllexport) int distanceMatrix (double *in, int Vn, int Vx, double *out) {

double *dev_a;

double *dev_b;



int threadDim;

int blockDim_1D;

int nSquared;

nSquared = Vn * Vn;

cudaError_t status = cudaSuccess;



getDim(Vn, &blockDim_1D, &threadDim);

//Because the distance matrix is by definition a square, the x and y 

//dimensions are equal.

dim3 threads(threadDim, threadDim);

dim3 blocks(blockDim_1D, blockDim_1D);

// allocate CUDA memory for Input

status = cudaMalloc ( (void**) &dev_a, Vn * Vx * sizeof(double));

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaMalloc Input %s\n", cudaGetErrorString(status));

    return 1;

}

// allocate CUDA memory for Output 



status = cudaMalloc ( (void**) &dev_b, nSquared * sizeof(double));

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaMalloc Output %s\n", cudaGetErrorString(status));

    return 1;

}

// Copy data to CUDA

status = cudaMemcpy ( dev_a, in, Vn * Vx * sizeof(double), cudaMemcpyHostToDevice );

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaMemcpy Input %s\n", cudaGetErrorString(status));

    return 1;

}

// Execute the kernel

distance_kernel<<<blocks,threads>>>( dev_a, dev_b, threads, Vn, Vx);

status = cudaThreadSynchronize();

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaThreadSynchronize before %s\n", cudaGetErrorString(status));

    return 1;

}

// Copy data from CUDA

status = cudaMemcpy ( out, dev_b, nSquared * sizeof(double), cudaMemcpyDeviceToHost );

if(status != cudaSuccess)

{

	if (status == cudaErrorInvalidValue) 

		std::cout << "cudaErrorInvalidValue" << std::endl;

	else if (status == cudaErrorInvalidDevicePointer)

		std::cout << "cudaErrorInvalidDevicePointer" << std::endl;

	else if (status == cudaErrorInvalidMemcpyDirection)

		std::cout << "cudaErrorInvalidMemcpyDirection" << std::endl;

	else fprintf(stderr, "cudaMemcpy Device to Host 2 %s\n", cudaGetErrorString(status));

    return 1;

}



status = cudaThreadSynchronize();

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaThreadSynchronize after %s\n", cudaGetErrorString(status));

    return 1;

}

// Free CUDA memory

status = cudaFree (dev_b);

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaFree %s\n", cudaGetErrorString(status));

    return 1;

}

status = cudaFree (dev_a);

if(status != cudaSuccess)

{

    fprintf(stderr, "cudaFree %s\n", cudaGetErrorString(status));

    return 1;

}

return 0;

} // distanceMatrix

So, it looks like your kernel is the problematic one…

Perhaps try it with a locally defined and malloc’ed “out” variable to see if the problem goes away or at least changes. I see that “out” is currently a parameter: __declspec (dllexport) int distanceMatrix (double *in, int Vn, int Vx, double *out).

Unknown error on WDDM platforms is usually a sign of TDR. Increase your TdrDelay reg key as described here:
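For reference, the key lives under the GraphicsDrivers node (value in seconds, created as a DWORD if it doesn’t exist; reboot afterwards — 60 seconds below is just an example value):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
```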

Thanks, I’ll have a look at that. When I left the lab today, I had noticed that when I scale the number of threads per block down to 64, things (appear to) work swimmingly (no more unknown errors).

I’ll let y’all know how it turns out.

Feature request for CUDA 4.1: could you add a new error enum for these display-runtime watchdog aborts? That would make these problems much easier to diagnose.

I actually think we just made a change to the driver a few weeks ago to change how TDR reporting is handled (get rid of unknown error).