CUDA memory transfer limit

Hi!

I have a problem with a program: when I try to pass a vector with more than 1 000 000 elements to the GPU, the program doesn’t work.
The vector is of type double, so its size is 8 MB, and my GPU has 1 GB of global memory, so I don’t understand what the problem is.

I have tried splitting the vector into two parts, but the result is the same: I can’t pass more than 1 000 000 elements in total to the GPU (each part has 500 000 elements).

If the vector is of type float, I can pass 2 000 000 elements.

In all these trials the vector took up 8 MB, so I assume the problem is either the total amount of memory passed to the GPU or the time needed to transfer 8 MB.
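
The arithmetic behind this observation can be checked with a small sketch. The helper name bytesFor is hypothetical, and the numbers assume the usual 8-byte double and 4-byte float:

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes occupied by `count` elements of type T.
template <typename T>
constexpr std::size_t bytesFor(std::size_t count)
{
    return count * sizeof(T);
}
```

With these assumptions, bytesFor<double>(1000000) and bytesFor<float>(2000000) both give 8 000 000 bytes, so the limit depends on the byte count rather than on the element count.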

Related to this, I have another question: “Can I change the Timeout Detection & Recovery (TDR) setting on a Mac?”
(I have done it on Windows, but I couldn’t find any information on how to do it on a Mac.)

Here is a minimal program that reproduces the problem.
(In the code, vector C causes problems when N > 1 000 000.
The kernel is trivial and doesn’t use C at all; it is only there to check that the problem is in cudaMemcpy.)

#include <stdio.h>

const int N = 1000000;
const int K = 100;

const int BLOCK_SIZE = 1024;
const int NUM_BLOCKS = 1;

__global__ void kernel(double *C, double *out)
{
    int l = threadIdx.x;

    if (l < K)
    {
        out[l] = 2.;
    }
}

int main()
{
    double hostOut[K], hostC[N];
    double *deviceC, *deviceOut;
    int i;

    //Vector C values
    for (i = 0; i < N; i++)
    {
        hostC[i] = 1.;
    }

    cudaMalloc(&deviceOut, sizeof(double)*K);
    cudaMalloc(&deviceC, sizeof(double)*N);

    cudaMemcpy(deviceC, hostC, sizeof(double)*N, cudaMemcpyHostToDevice);

    kernel<<<NUM_BLOCKS, BLOCK_SIZE>>>(deviceC, deviceOut);

    cudaMemcpy(hostOut, deviceOut, sizeof(double)*K, cudaMemcpyDeviceToHost);

    cudaFree(deviceC);
    cudaFree(deviceOut);

    //Print to check the program goes well
    for (int j = 0; j < 5; j++)
    {
        printf("f[%i] = %f\n", j, hostOut[j]);
    }

    return 0;
}

double hostOut[K], hostC[N];

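It’s not CUDA: you can’t allocate such large arrays on the stack. The default stack is usually only a few megabytes (commonly 8 MB on macOS and Linux), and hostC alone needs 8 MB, which is why the limit shows up at the same byte size for double and float. Use the following instead: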

double *hostOut = new double[K], *hostC = new double[N];

and at the end:

delete[] hostOut;
delete[] hostC;
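
As an alternative to manual new/delete, here is a minimal sketch using std::vector, which also keeps its elements on the heap and releases them automatically. The helper names makeHostInput and byteSize are hypothetical and not part of the original program:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: builds the host-side input buffer.
// std::vector stores its elements on the heap, so n is not bounded by the stack size.
std::vector<double> makeHostInput(std::size_t n, double value)
{
    return std::vector<double>(n, value);
}

// Size of the buffer in bytes, i.e. the byte count to hand to cudaMalloc/cudaMemcpy.
std::size_t byteSize(const std::vector<double>& v)
{
    return v.size() * sizeof(double);
}
```

In main this would look like `std::vector<double> hostC = makeHostInput(N, 1.);` followed by `cudaMemcpy(deviceC, hostC.data(), byteSize(hostC), cudaMemcpyHostToDevice);`, with no delete[] needed at the end.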

And btw, there is a CODE tag; it’s the last one in the toolbox above the message.

PS: in pure C, use malloc/free instead and check the returned pointers.
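
A rough sketch of the malloc/free variant with the pointer check could look like this. The helper allocFilled is a hypothetical name, and the code is written so that it also compiles as C++ (hence the cast and the std:: prefixes, which plain C would not need):

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: allocates n doubles on the heap and sets each one to `value`.
// Returns a null pointer if malloc fails; the caller must check for that
// and release the buffer with free() when done.
double* allocFilled(std::size_t n, double value)
{
    double* p = static_cast<double*>(std::malloc(n * sizeof(double)));
    if (p == nullptr)
    {
        return nullptr;  // allocation failed, nothing to fill
    }
    for (std::size_t i = 0; i < n; i++)
    {
        p[i] = value;
    }
    return p;
}
```

The caller would test the result before using it, for example `double *hostC = allocFilled(N, 1.); if (hostC == NULL) { return 1; }`, and call free(hostC) after the last cudaMemcpy.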

Great!!

Thank you very much, it worked perfectly! :)