Why use cudaMemcpy instead of a simple for loop?

A newbie question: why do we use cudaMemcpy to initialize a device variable instead of using a for loop? Both seem to work for me. For example,

cudaMemcpy(a_d, a_h, n*sizeof(int), cudaMemcpyHostToDevice);
for (int i = 0; i < n; i++) a_d[i] = a_h[i];


Unless you’re using device emulation, your method will not work at all: the GPU’s memory is not directly addressable by the CPU. It’ll probably segfault and crash.
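
Something like this is the usual pattern; just a rough sketch with error checking omitted, reusing the a_h/a_d/n names from your example:

#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const int n = 10;
    int *a_h = (int*)malloc(n * sizeof(int));     // host array
    for (int i = 0; i < n; i++) a_h[i] = i;

    int *a_d = NULL;
    cudaMalloc((void**)&a_d, n * sizeof(int));    // device allocation

    // The host cannot dereference a_d directly; the copy has to go
    // through the CUDA runtime.
    cudaMemcpy(a_d, a_h, n * sizeof(int), cudaMemcpyHostToDevice);

    // ... launch kernels that read/write a_d ...

    cudaMemcpy(a_h, a_d, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(a_d);
    free(a_h);
    return 0;
}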

You are right. Thanks again, tmurray!

I have another question regarding how to use shared memory. To simplify the case, say we have two arrays, float A[10] and float B[10]. Each element of B depends on the values of three consecutive elements of A, e.g. B[1] = a0*A[0] + a1*A[1] + a2*A[2], B[2] = a1*A[1] + a2*A[2] + a3*A[3], … How do I allocate appropriate shared memory for array A?


If you look at the separable convolution example in the SDK, it addresses a similar problem. One solution is to add extra threads to your thread block: each thread is responsible for reading a single value into the shared memory array, and a conditional around the summing step has the extra threads sit out for the rest of the kernel.
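
Roughly like this; only a sketch, and it assumes a fixed three-tap stencil (a0, a1, a2 passed as arguments) rather than your per-index coefficients, with OUT_TILE outputs per block and two extra halo threads:

#define OUT_TILE 8   // outputs produced per block; block size is OUT_TILE + 2

__global__ void stencil3(const float *A, float *B, int n,
                         float a0, float a1, float a2)
{
    __shared__ float sA[OUT_TILE + 2];   // tile plus one halo element on each side

    // Global index of the A element this thread loads (shifted left by one
    // so the block also covers the left halo element).
    int g = blockIdx.x * OUT_TILE + threadIdx.x - 1;

    // Every thread, including the two extra halo threads, loads one value.
    sA[threadIdx.x] = (g >= 0 && g < n) ? A[g] : 0.0f;
    __syncthreads();

    // Conditional summing step: the halo threads sit out from here on.
    int out = blockIdx.x * OUT_TILE + threadIdx.x - 1;
    if (threadIdx.x >= 1 && threadIdx.x <= OUT_TILE && out < n) {
        B[out] = a0 * sA[threadIdx.x - 1]
               + a1 * sA[threadIdx.x]
               + a2 * sA[threadIdx.x + 1];
    }
}

// Launch with blocks of OUT_TILE + 2 threads, e.g.:
//   stencil3<<<(n + OUT_TILE - 1) / OUT_TILE, OUT_TILE + 2>>>(A_d, B_d, n, a0, a1, a2);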

Thanks, jgoffeney. Let me give it a try…