How to access kernel modified data on host

I am trying to do something very simple. I want to change every element of an array from 5 to 1 and then print the changed values. Can you please tell me what is wrong with the following code?

__global__ void repla(int *a)
{
    int i = threadIdx.x;
    a[i] = 1;
}

int main(int argc, char** argv)
{
    int a[5] = {5, 5, 5, 5, 5};
    int *b = 0;
    cudaMalloc((void **)b, 5 * sizeof(int));
    cudaMemcpy(&b, &a, 5 * sizeof(int), cudaMemcpyHostToDevice);
    repla<<<1, 5>>>(b);
    cudaMemcpy(&a, &b, 5 * sizeof(int), cudaMemcpyDeviceToHost);
    int i;
    for (i = 0; i < 5; i++)
        printf("%d ", a[i]);
    return 0;
}

I think the problem is the &s in those cudaMemcpy calls.

Both a and b are already pointers, so the & isn't necessary.

With the &, you are handing the function a pointer to a pointer instead of the buffer itself.

Correct me if I'm wrong.

Silly me…


Also Ksh*tiz Beta,
Your cudaMalloc() is missing a "&": (void **)&b should be used. Otherwise your program will just CRASH.

Quick note:
The forum does not allow me to spell your name correctly. That's why I added the *.

If you use the CUDA 2.2 beta, you can get a device pointer that maps to a host allocation, so that both pointers refer to the same data.

You can then compute on it in the kernel and access the result on the host.

Hopefully that helps you.

I don't think so. A malloc()'ed pointer is VIRTUAL and subject to the OS's page-out mechanism; the device has no way to access it. The address the device accesses must be a raw physical address that is contiguous and pinned. You need to specially allocate that memory using CUDA calls, I would believe.

I have not worked with the 2.2 beta directly, but being a systems programmer, I think my guesses are pretty close to the truth.

Yes, you mention an important point I left out.

If you want to get the device pointer through cudaHostGetDevicePointer, the host memory must be pinned memory allocated with the new function cudaHostAlloc, passing it the cudaHostAllocMapped flag.

This is the so-called zero-copy access in the CUDA 2.2 beta, much like the host-memory access feature of the GeForce 6xxx series.

It gives you a way to modify data on the host directly from kernel code.
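A minimal sketch of that zero-copy path, based on the CUDA 2.2-era API names mentioned above (error checking omitted, and it assumes a device that supports mapped host memory):

```cuda
#include <stdio.h>

__global__ void repla(int *a)
{
    a[threadIdx.x] = 1;    // kernel writes land directly in host memory
}

int main(void)
{
    int *h_a = 0, *d_a = 0;

    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, mapped host allocation.
    cudaHostAlloc((void **)&h_a, 5 * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < 5; i++) h_a[i] = 5;

    // Device-side alias of the same memory.
    cudaHostGetDevicePointer((void **)&d_a, h_a, 0);

    repla<<<1, 5>>>(d_a);
    cudaThreadSynchronize();               // CUDA 2.2-era sync call

    for (int i = 0; i < 5; i++)
        printf("%d ", h_a[i]);             // no cudaMemcpy needed
    printf("\n");

    cudaFreeHost(h_a);
    return 0;
}
```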

But in my tests the performance is not that good.

Even if you fetch it through shared memory in the kernel, the time cost is still a little higher than copying from host to device, computing, and then copying the result back to the host.

I'm looking for a way to improve the performance, but I still don't have a good method when only one computation is needed. (If you have two pieces of work, you can use streams to overlap the computation and the copies.)
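The stream overlap mentioned above might look roughly like this. It's only a sketch with a placeholder kernel; async copies require pinned host buffers, and actual overlap also requires a device with a copy engine:

```cuda
#include <stdio.h>

__global__ void work(float *d) { d[threadIdx.x] *= 2.0f; }

int main(void)
{
    const int N = 256;
    const size_t bytes = N * sizeof(float);
    float *h0, *h1, *d0, *d1;

    // Async copies require pinned host memory.
    cudaMallocHost((void **)&h0, bytes);
    cudaMallocHost((void **)&h1, bytes);
    cudaMalloc((void **)&d0, bytes);
    cudaMalloc((void **)&d1, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Chunk 0's compute in s0 can overlap chunk 1's upload in s1.
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    work<<<1, N, 0, s0>>>(d0);
    cudaMemcpyAsync(h0, d0, bytes, cudaMemcpyDeviceToHost, s0);

    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    work<<<1, N, 0, s1>>>(d1);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h0);
    cudaFreeHost(h1);
    cudaFree(d0);
    cudaFree(d1);
    return 0;
}
```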

Obviously, the PCI-E bandwidth is much lower than the intra-device bandwidth.

Test it with compute capability 1.2 cards or later, which allow 1024 active threads per multiprocessor (instead of 768). That might help hide the latencies.