I am trying to do something very simple: I want to change every element of an array from 5 to 1 and then print the changed values. Can you please tell me what is wrong with the following code?
__global__ void repla(int *a)
{
    int i=threadIdx.x;
    a[i]=1;
}
int main(int argc, char** argv)
{
    int a[5]={5,5,5,5,5};
    int b=0;
    cudaMalloc((void **)b,5*sizeof(int));
    cudaMemcpy(&b,a,5*sizeof(int),cudaMemcpyHostToDevice);
    int i;
    repla<<<1,5>>>(b);
    cudaMemcpy(&a,b,5*sizeof(int),cudaMemcpyDeviceToHost);
    for(i=0;i<5;i++)
    {
        printf("%d ",a[i]);
    }
    return 0;
}
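For reference, here is a minimal sketch of how the code above could be corrected, assuming the intent is simply to copy the array to the device, overwrite it, and copy it back. The main problems appear to be that b is declared as a plain int rather than an int pointer, that cudaMalloc needs the address of that pointer, and that cudaMemcpy should be given b and a themselves rather than &b and &a. The helper name replace_on_device is hypothetical and not from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread overwrites one element with 1.
__global__ void repla(int *a)
{
    int i = threadIdx.x;
    a[i] = 1;
}

// Hypothetical helper (not from the original post): copies n host ints to the
// device, runs the kernel with one block of n threads, and copies them back.
// Assumes n does not exceed the maximum block size. Returns 0 on success.
int replace_on_device(int *host, int n)
{
    int *dev = 0;                                  // a device POINTER, not an int
    size_t bytes = n * sizeof(int);

    // cudaMalloc writes the device address into dev, so it needs &dev.
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return 1;

    // Destination first, source second; pass the pointers themselves.
    cudaError_t err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        repla<<<1, n>>>(dev);
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling replace_on_device(a, 5) from main with the original array and then running the printf loop should show every element set to 1, provided a CUDA device is available.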
I don't think so. A malloc()ed pointer is a virtual address and is subject to the OS's page-out mechanism, so there is no way the device can access it directly. The address accessed by the device should be a raw physical address that is contiguous and pinned. You need to allocate that memory specially through CUDA calls, or so I would believe.
I have not worked directly with 2.2 Beta, but as a systems programmer I think my guesses are pretty close to the truth.
Yes, you mentioned an important point that I left out.
If you want to get the device pointer through cudaHostGetDevicePointer, the host memory must be pinned memory allocated with the new function cudaHostAlloc, with the cudaHostAllocMapped flag passed to it.
This is the so-called zero-copy access in CUDA 2.2 Beta.
It works like the host-memory access feature on the GeForce 6XXX.
It gives you a way to modify data on the host directly from kernel code.
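A rough sketch of the zero-copy steps described above, assuming a device that supports mapped host memory. The kernel name set_ones and the helper zero_copy_fill are made up for illustration, and cudaDeviceSynchronize is the current name for the synchronization call (toolkits of the 2.2 era used cudaThreadSynchronize).

```cuda
#include <cuda_runtime.h>

// Each thread writes 1 straight into mapped host memory.
__global__ void set_ones(int *a)
{
    a[threadIdx.x] = 1;
}

// Hypothetical helper: fills out[0..n-1] through a zero-copy mapping.
// Assumes n does not exceed the maximum block size. Returns 0 on success.
int zero_copy_fill(int *out, int n)
{
    int *h = 0;   // pinned, mapped host buffer
    int *d = 0;   // device-side alias of the same buffer

    // Ask for host-memory mapping. Older toolkits require this before the
    // context is created; the return value is deliberately not checked here.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();   // clear any error left by the call above

    if (cudaHostAlloc((void **)&h, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return 1;
    for (int i = 0; i < n; i++) h[i] = 5;

    cudaError_t err = cudaHostGetDevicePointer((void **)&d, h, 0);
    if (err == cudaSuccess) {
        set_ones<<<1, n>>>(d);             // no cudaMemcpy in either direction
        err = cudaDeviceSynchronize();     // wait until the writes have landed
    }
    if (err == cudaSuccess)
        for (int i = 0; i < n; i++) out[i] = h[i];

    cudaFreeHost(h);
    return err == cudaSuccess ? 0 : 1;
}
```

The kernel reads and writes the host buffer over the bus on every access, which is consistent with it being slower than an explicit copy for some workloads.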
But the performance was not so good in my tests.
Even if you stage the data through shared memory in the kernel, it still takes a little longer than copying from host to device, calculating, and then copying the result back to the host.
I'm looking for a way to improve the performance, but I still don't have a good method for the case where only one calculation is needed. (If you have two pieces of work to do, you can use streams to make the calculation and the copy asynchronous.)
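To illustrate the stream idea, here is a small sketch that splits the work into two chunks and gives each its own stream, so the copy for one chunk can overlap with the kernel for the other. This is only one possible arrangement. The names set_ones_n and two_stream_fill are hypothetical, and whether any overlap actually happens depends on the hardware.

```cuda
#include <cuda_runtime.h>

__global__ void set_ones_n(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 1;
}

// Hypothetical helper: processes two chunks of n ints each, one per stream,
// and writes the 2*n results to out. Returns 0 on success.
int two_stream_fill(int *out, int n)
{
    int *h = 0, *d = 0;
    cudaStream_t s[2];
    size_t bytes = n * sizeof(int);

    // Async copies need pinned host memory to overlap with kernels.
    if (cudaMallocHost((void **)&h, 2 * bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d, 2 * bytes) != cudaSuccess) { cudaFreeHost(h); return 1; }
    for (int i = 0; i < 2 * n; i++) h[i] = 5;
    for (int k = 0; k < 2; k++) cudaStreamCreate(&s[k]);

    // Queue copy-in, kernel, copy-out for each chunk in its own stream.
    for (int k = 0; k < 2; k++) {
        int *hk = h + k * n;
        int *dk = d + k * n;
        cudaMemcpyAsync(dk, hk, bytes, cudaMemcpyHostToDevice, s[k]);
        set_ones_n<<<(n + 255) / 256, 256, 0, s[k]>>>(dk, n);
        cudaMemcpyAsync(hk, dk, bytes, cudaMemcpyDeviceToHost, s[k]);
    }

    cudaError_t err = cudaDeviceSynchronize();   // wait for both streams
    if (err == cudaSuccess)
        for (int i = 0; i < 2 * n; i++) out[i] = h[i];

    for (int k = 0; k < 2; k++) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return err == cudaSuccess ? 0 : 1;
}
```

With only a single calculation there is nothing to overlap, which matches the limitation mentioned above.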