I have a question about the data transfer. Please help
I am passing a complex array from a fortran program to GPU
call processfun(bindata)
cuComplex *d_data;
extern "C" void processfun_(cuComplex *data)  // Fortran passes arguments by reference, so this must be a pointer
{
printf("\n %f %f", data[0].x, data[0].y); // I am only displaying the 1st element
CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, N * sizeof(cuComplex)));
CUDA_SAFE_CALL(cudaMemcpy(d_data, data, N * sizeof(cuComplex), cudaMemcpyHostToDevice));
Thanks for the reply. I would really appreciate it if you could help me do this simple thing. What modifications do I need to make in the above code? This would help me understand how it works.
Say I want to pass a complex array of 3 elements: 1+2i, 2+3i, and 3+4i. On the GPU, I want to multiply each of them by 2 and get the result back on the CPU.
I don't understand your program. You are using host functions (cudaMemcpy, etc.), so I presume you are showing a function that runs on the host, but then you seem to be operating on device memory in the same function: "d_data[i].x = d_data[i].x * 2;", etc.
Unless I have misunderstood your post, what you need to do is move the lines that operate on device memory into a separate __global__ function (a "kernel"); the call to that kernel goes where those lines currently sit.
You'll need a kernel. It replaces the "for" loop and performs all the "iterations" concurrently.
Kernels (and device functions called by kernels) are the only places in the code where you can read and modify data allocated on device. Also, you cannot read/modify data residing in host memory from within kernels (so it goes both ways).
An example code could look like this:
// a kernel == a function declared as __global__
__global__ void multiply(cuComplex *d_data)
{
    int i = threadIdx.x;
    d_data[i].x = d_data[i].x * 2;
    d_data[i].y = d_data[i].y * 2;
}
Each thread's threadIdx.x will span from 0 to one less than the number of threads per block you launch with*.
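Putting the pieces together for your three-element case, a complete sketch could look like the one below. This is a minimal illustration, not your exact code: it assumes N = 3, uses make_cuComplex from cuComplex.h to build the test values, launches one block of N threads, and omits error checking (CUDA_SAFE_CALL) for brevity.

```cuda
#include <cstdio>
#include <cuComplex.h>
#include <cuda_runtime.h>

// The kernel from above: each thread scales one element by 2.
__global__ void multiply(cuComplex *d_data)
{
    int i = threadIdx.x;
    d_data[i].x = d_data[i].x * 2;
    d_data[i].y = d_data[i].y * 2;
}

int main()
{
    const int N = 3;
    // 1+2i, 2+3i, 3+4i
    cuComplex h_data[N] = { make_cuComplex(1, 2),
                            make_cuComplex(2, 3),
                            make_cuComplex(3, 4) };
    cuComplex *d_data;

    // allocate device memory and copy the host array over
    cudaMalloc((void**)&d_data, N * sizeof(cuComplex));
    cudaMemcpy(d_data, h_data, N * sizeof(cuComplex), cudaMemcpyHostToDevice);

    // one block of N threads, so threadIdx.x runs 0..N-1
    multiply<<<1, N>>>(d_data);

    // copy the result back to the host and release device memory
    cudaMemcpy(h_data, d_data, N * sizeof(cuComplex), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    for (int i = 0; i < N; i++)
        printf("%f + %fi\n", h_data[i].x, h_data[i].y);  // expect 2+4i, 4+6i, 6+8i
    return 0;
}
```

If you keep the Fortran entry point, the body of main (minus the array initialization) is what would go inside processfun_, with data taking the place of h_data.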