Newbie question about data transfer

Hi,

I have a question about data transfer. Please help.

I am passing a complex array from a Fortran program to the GPU:
call processfun(bindata)

cuComplex *d_data;

extern "C" void processfun_(cuComplex *data)
{
printf("\n %f %f", data[0].x, data[0].y); // I am only displaying the 1st element

CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, N*sizeof(cuComplex)));
CUDA_SAFE_CALL(cudaMemcpy(d_data, data, N*sizeof(cuComplex), cudaMemcpyHostToDevice));

printf("\n %f %f", d_data[0].x, d_data[0].y); // Gives a segmentation fault error

for(int i=0;i<N;i++)
{
	d_data[i].x = d_data[i].x * rand();
	d_data[i].y = d_data[i].y * rand();
}

CUDA_SAFE_CALL(cudaMemcpy(data, d_data, N*sizeof(cuComplex), cudaMemcpyDeviceToHost));
}

Q. Why do I get the segmentation fault? Can't I access the device data directly?

Thanks

You can’t dereference a device pointer on the CPU side; it’s only there to be used from CUDA functions.
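For example, if you want to inspect device data on the host, you have to copy it back to host memory first. A minimal sketch (the variable name `h_check` is just illustrative):

```
// WRONG: dereferencing a device pointer in host code
// printf("\n %f %f", d_data[0].x, d_data[0].y);   // segfaults

// RIGHT: copy the element back to a host variable, then print it
cuComplex h_check;
CUDA_SAFE_CALL(cudaMemcpy(&h_check, d_data, sizeof(cuComplex),
                          cudaMemcpyDeviceToHost));
printf("\n %f %f", h_check.x, h_check.y);
```

The pointer returned by cudaMalloc refers to the GPU's address space, so the host can only move data through it with cudaMemcpy, never by indexing it directly.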

Thanks for the reply. I would really appreciate it if you could help me do this simple thing. What modifications do I need to make in the above code? This would help me understand the way it works.

Say I want to pass a complex array of 3 elements: 1+2i, 2+3i, and 3+4i. On the GPU, I want to multiply them by 2 and get the result back on the CPU.

extern "C" void processfun_(cuComplex *data)
{
CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, N*sizeof(cuComplex)));
CUDA_SAFE_CALL(cudaMemcpy(d_data, data, N*sizeof(cuComplex), cudaMemcpyHostToDevice));

for(int i=0;i<N;i++)
{
d_data[i].x = d_data[i].x * 2;
d_data[i].y = d_data[i].y * 2;
}

CUDA_SAFE_CALL(cudaMemcpy(data, d_data, N*sizeof(cuComplex), cudaMemcpyDeviceToHost));
}

Once again, thanks a lot.

I don’t understand your program. You are using host functions (cudaMemcpy, etc.), so I am presuming that you are showing a function that runs on the host, but then you seem to be operating on device memory in the same function “d_data[i].x = d_data[i].x * 2;”, etc.

Unless I have misunderstood your post, what you need to do is move the lines that operate on the device memory into a separate __global__ function (a 'kernel'); the call to that kernel would go in the place where you currently have the lines operating on the device memory.

ted84,

You’ll need a kernel. It can be used instead of the “for” loop and do all the “iterations” concurrently.

Kernels (and device functions called by kernels) are the only places in the code where you can read and modify data allocated on device. Also, you cannot read/modify data residing in host memory from within kernels (so it goes both ways).

An example code could look like this:

//a kernel == a function declared as __global__
__global__ void multiply(cuComplex *d_data)
{
    int i = threadIdx.x;
    d_data[i].x = d_data[i].x * 2;
    d_data[i].y = d_data[i].y * 2;
}

Each thread's threadIdx.x will span from 0 up to one less than the number of threads per block you launch*.

The host part:

extern "C" void processfun_(cuComplex *data)
{
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, N*sizeof(cuComplex)));
    CUDA_SAFE_CALL(cudaMemcpy(d_data, data, N*sizeof(cuComplex), cudaMemcpyHostToDevice));

    //specifying launch parameters
    dim3 gridSize(1,1,1);  //one block only
    dim3 blockSize(N,1,1); //N threads in that block

    //launching
    multiply<<<gridSize,blockSize>>>(d_data);

    CUDA_SAFE_CALL(cudaMemcpy(data, d_data, N*sizeof(cuComplex), cudaMemcpyDeviceToHost));
}
* assuming N is smaller than 512, which is the maximum number of threads for a single block. If it's bigger, you'll need to use more blocks.
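For larger N, the usual pattern is to compute a global index from the block and thread indices and guard against threads that fall past the end of the array. A sketch (assuming the same multiply-by-2 task; the extra `n` parameter is an addition so the kernel knows the array length):

```
__global__ void multiply(cuComplex *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may have surplus threads
    {
        d_data[i].x = d_data[i].x * 2;
        d_data[i].y = d_data[i].y * 2;
    }
}

// host side: launch enough 256-thread blocks to cover n elements
int threads = 256;
int blocks  = (N + threads - 1) / threads;  // round up
multiply<<<blocks, threads>>>(d_data, N);
```

The round-up division means the last block may be only partially used, which is why the bounds check inside the kernel is needed.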