Hi!
I have a problem with a program: when I try to pass a vector with more than 1,000,000 elements to the GPU, the program fails.
The vector is of type double, so its size is 8 MB, and my GPU has 1 GB of global memory, so I don't understand what the problem is.
I have tried splitting the vector into two parts, but the result is the same: I can't pass more than 1,000,000 elements in total to the GPU (each part has 500,000 elements).
If the vector is of type float, I can pass 2,000,000 elements.
In all these trials the memory size of the vector was 8 MB, so I assume the problem is either the total amount of memory passed to the GPU or the time needed to transfer those 8 MB.
Related to this, I have another question: can I change the Timeout Detection & Recovery (TDR) limit on a Mac?
(I have done it on Windows, but I couldn't find any information on how to do it on macOS.)
I have written a naive program that reproduces the problem:
(In the code, vector C causes problems when N > 1,000,000.
The kernel is deliberately naive: C is never touched inside it; it exists only to check that the problem is in cudaMemcpy.)
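To check the cudaMemcpy status directly, this is a minimal error-checking sketch using the standard cudaError_t / cudaGetErrorString runtime API (the `check` helper name is just illustrative and is not part of the program below):

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative helper: print and abort on any CUDA runtime error.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
    {
        printf("%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

// Usage would be, e.g.:
//   check(cudaMemcpy(deviceC, hostC, sizeof(double) * N,
//                    cudaMemcpyHostToDevice), "cudaMemcpy H2D");
```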
#include <stdio.h>

const int N = 1000000;
const int K = 100;
const int BLOCK_SIZE = 1024;
const int NUM_BLOCKS = 1;

__global__ void kernel(double *C, double *out)
{
    int l = threadIdx.x;
    if (l < K)
    {
        out[l] = 2.;
    }
}

int main()
{
    double hostOut[K], hostC[N];
    double *deviceC, *deviceOut;
    int i;

    // Vector C values
    for (i = 0; i < N; i++)
    {
        hostC[i] = 1.;
    }

    cudaMalloc(&deviceOut, sizeof(double) * K);
    cudaMalloc(&deviceC, sizeof(double) * N);

    cudaMemcpy(deviceC, hostC, sizeof(double) * N, cudaMemcpyHostToDevice);

    kernel<<<NUM_BLOCKS, BLOCK_SIZE>>>(deviceC, deviceOut);

    cudaMemcpy(hostOut, deviceOut, sizeof(double) * K, cudaMemcpyDeviceToHost);

    cudaFree(deviceC);
    cudaFree(deviceOut);

    // Print to check the program runs correctly
    for (int j = 0; j < 5; j++)
    {
        printf("f[%i] = %f\n", j, hostOut[j]);
    }

    return 0;
}