global to global memory transfer problem


I am implementing a fluids solver in CUDA 1.0 with Visual Studio 2005 on a Quadro FX 4600.

I first read a text file whose first line is the number of points in an airfoil (points) and whose second and third lines are the X (Foil_X) and Y (Foil_Y) coordinates of those points.

I allocate that space on the GPU with

float * dFoil_X;

	cudaMalloc((void**) &dFoil_X,points*sizeof(float));


float * dFoil_Y;

	cudaMalloc((void**) &dFoil_Y,points*sizeof(float));


I then allocate the space for the results

float* dCp;

	cudaMalloc((void**) &dCp,points*sizeof(float));

I can read back dFoil_X and dFoil_Y just fine from the card. I also ran a simple loop on the GPU that writes 1:400 to dCp and read that back to the host fine. But if I try





and read back dCp I do not get dFoil_X back. Any ideas on why this might be?

I should note that I am by no means a computer programmer. This is really my first time using pointers :o I can post the code if anybody thinks it will help.


Firstly, you’re missing a semicolon inside your loop, but I guess that’s not the problem.

Are you running this loop inside a kernel or on the host?
If you run it on the host, it cannot work. dCp and dFoil_X point to memory locations on the device, and only code running on the device can read from and write to those locations directly.
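If all you need is a straight copy from one device buffer to another, the host can ask the runtime to do it with cudaMemcpy and the cudaMemcpyDeviceToDevice flag, without ever dereferencing the device pointers itself. A minimal sketch, assuming the buffer names from the original post; roundTripViaDeviceCopy is a made-up helper name, not something from the poster's code:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: upload src, copy it device-to-device into a second
// buffer, then download that second buffer. Returns an empty vector if any
// CUDA call fails.
std::vector<float> roundTripViaDeviceCopy(const std::vector<float>& src)
{
    const size_t bytes = src.size() * sizeof(float);
    std::vector<float> out(src.size());

    float* dFoil_X = 0;
    float* dCp = 0;

    // The host only passes the device pointers to the runtime; it never
    // reads or writes through them.
    bool ok = cudaMalloc((void**)&dFoil_X, bytes) == cudaSuccess
           && cudaMalloc((void**)&dCp, bytes) == cudaSuccess
           && cudaMemcpy(dFoil_X, src.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dCp, dFoil_X, bytes, cudaMemcpyDeviceToDevice) == cudaSuccess
           && cudaMemcpy(out.data(), dCp, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dFoil_X);
    cudaFree(dCp);

    if (!ok) out.clear();
    return out;
}
```

If the returned vector matches the input element for element, the device-to-device path is working.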

I guess what you want to do or what you are already doing is copy the values over inside a kernel and then retrieve the value from the device. I’m not sure what’s wrong here. If you could post your kernel and how you call it we can help further.
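For reference, a copy done inside a kernel could look roughly like the sketch below. copyKernel and roundTripViaKernel are illustrative names, and the block size of 128 is an arbitrary assumption; each thread copies the single element at its own global index.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread copies one element. The bounds check keeps the threads of a
// partially filled last block from writing past the end of the arrays.
__global__ void copyKernel(float* dst, const float* src, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) dst[idx] = src[idx];
}

// Hypothetical host wrapper: upload gamma, copy it to a result buffer inside
// the kernel, download the result. Returns an empty vector on any CUDA error.
std::vector<float> roundTripViaKernel(const std::vector<float>& gamma)
{
    const int n = (int)gamma.size();
    const size_t bytes = n * sizeof(float);
    std::vector<float> cp(n);

    float* dGamma = 0;
    float* dCp = 0;
    bool ok = cudaMalloc((void**)&dGamma, bytes) == cudaSuccess
           && cudaMalloc((void**)&dCp, bytes) == cudaSuccess
           && cudaMemcpy(dGamma, gamma.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok && n > 0) {
        const int threads = 128;                        // assumed block size
        const int blocks = (n + threads - 1) / threads; // round up
        copyKernel<<<blocks, threads>>>(dCp, dGamma, n);
        ok = cudaGetLastError() == cudaSuccess;         // catches launch errors
    }

    // cudaMemcpy waits for the kernel to finish before copying back.
    ok = ok && cudaMemcpy(cp.data(), dCp, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dGamma);
    cudaFree(dCp);

    if (!ok) cp.clear();
    return cp;
}
```

Checking cudaGetLastError right after the launch is worthwhile, because a kernel that fails to launch otherwise just leaves the output buffer untouched.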

I think I oversimplified my problem in the explanation. I have copied data to the device from the host and allocated memory on the device for my results. As a test I just wanted to copy the data I sent to the device into the memory allocated for a result on the device, and then copy those results back to the host. In the following kernel the data being moved is Gamma, with size = 400*sizeof(float).

I send Gamma = 0.000000, -0.124347, 0.132042

and get back Cp = 0.000000, -0.140711, 0.640000



#ifndef _MATRIXMUL_KERNEL_H_

#define _MATRIXMUL_KERNEL_H_


#include <stdio.h>

#include "matrixMul.h"


//! Matrix multiplication on the device: C = A * B

//! wA is A's width and wB is B's width


__global__ void

matrixMul( float* Cp, float* Foil_X, float* Foil_Y, float* Panel_Length, float* Theta, float* Gamma, float* Field_X, float* Field_Y)


    // Block index

    int bx = blockIdx.x;

    int by = blockIdx.y;

	float PI = 3.14159265358979323846;


   // Thread index

    int tx = threadIdx.x;

    int ty = threadIdx.y;


	int p = 400;

	int fp = 2911;

	float AoA = 8*PI/180;

	int i;

	int j;

	float A, B, C, D, E, F, G, P, Q, at_x, at_y, st, sn;

	float cn2_x[2], cn1_x[2], ct2_x[2], ct1_x[2], cn2_y[2],cn1_y[2], ct2_y[2], ct1_y[2];


	float smt=0, smn=0;

	//__shared__ float sCp[BLOCK_SIZE];



  smt = 0;

  smn = 0;



  	A = -(Field_X[i]-Foil_X[j])*cosf(Theta[j])-(Field_Y[i]-Foil_Y[j])*sinf(Theta[j]);

  	B = (Field_X[i]-Foil_X[j])*(Field_X[i]-Foil_X[j])+(Field_Y[i]-Foil_Y[j])*(Field_Y[i]-Foil_Y[j]);

  	C = sinf(0.-Theta[j]);

  	D = cosf(0.-Theta[j]);

  	E =  (Field_X[i]-Foil_X[j])*sinf(Theta[j])-(Field_Y[i]-Foil_Y[j])*cosf(Theta[j]);

  	F = logf(1+(Panel_Length[j]*Panel_Length[j]+2.*A*Panel_Length[j])/B);

  	G = atan2f((E*Panel_Length[j]),(B+A*Panel_Length[j]));

  	P = (Field_X[i]-Foil_X[j])*sinf(0.-2.*Theta[j])+(Field_Y[i]-Foil_Y[j])*cosf(0.-2.*Theta[j]);

  	Q = (Field_X[i]-Foil_X[j])*cosf(0.-2.*Theta[j])-(Field_Y[i]-Foil_Y[j])*sinf(0.-2.*Theta[j]);

  	cn2_x[0] = D+0.5*Q*F/Panel_Length[j]-(A*C+D*E)*G/Panel_Length[j];

  	cn1_x[0] = 0.5*D*F+C*G-cn2_x[0];

  	ct2_x[0] = C+0.5*P*F/Panel_Length[j]+(A*D-C*E)*G/Panel_Length[j];

  	ct1_x[0] = 0.5*C*F-D*G-ct2_x[0];

 	C = sinf(PI/2.-Theta[j]);

  	D = cosf(PI/2.-Theta[j]);

  	P = (Field_X[i]-Foil_X[j])*sinf(PI/2.-2.*Theta[j])+(Field_Y[i]-Foil_Y[j])*cosf(PI/2.-2.*Theta[j]);

  	Q = (Field_X[i]-Foil_X[j])*cosf(PI/2.-2.*Theta[j])-(Field_Y[i]-Foil_Y[j])*sinf(PI/2.-2.*Theta[j]);

  	cn2_y[0] = D+0.5*Q*F/Panel_Length[j]-(A*C+D*E)*G/Panel_Length[j];

  	cn1_y[0] = 0.5*D*F+C*G-cn2_y[0];

  	ct2_y[0] = C+0.5*P*F/Panel_Length[j]+(A*D-C*E)*G/Panel_Length[j];

  	ct1_y[0] = 0.5*C*F-D*G-ct2_y[0];

 	if( j == 0)

 	{

    at_x = 0.;

    at_y = 0.;

 	}

 	else

 	{

    at_x = ct1_x[0]+ct2_x[1];

    at_y = ct1_y[0]+ct2_y[1];

 	}


 	st = at_x*Gamma[j];

  	sn = at_y*Gamma[j];




  	cn2_x[1] = cn2_x[0];

  	cn1_x[1] = cn1_x[0];

  	ct2_x[1] = ct2_x[0];

  	ct1_x[1] = ct1_x[0];

  	cn2_y[1] = cn2_y[0];

  	cn1_y[1] = cn1_y[0];

  	ct2_y[1] = ct2_y[0];

  	ct1_y[1] = ct1_y[0];








  if(ty == 0 && bx == 0) Cp[i] = Gamma[i]; //sCp[0];







#endif // #ifndef _MATRIXMUL_KERNEL_H_

Wow, I feel like an idiot. It works just fine. When I was displaying the values, I was displaying different indices for the two arrays. Everything seems to be working fine. I thought this seemed like an awfully weird problem.
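For anyone who hits the same thing: comparing the two host arrays in a single loop, so both are always read at the same index, would have caught this right away. A rough sketch of such a check; firstMismatch is a made-up helper name and the tolerance is whatever suits your data:

```cuda
#include <cmath>
#include <cstdio>

// Hypothetical helper: returns the index of the first element where sent and
// received differ by more than tol, or -1 if every element agrees. Both
// arrays are read with the same index i, so the printout cannot mix indices.
int firstMismatch(const float* sent, const float* received, int n, float tol)
{
    for (int i = 0; i < n; ++i) {
        if (std::fabs(sent[i] - received[i]) > tol) {
            printf("mismatch at index %d: sent %f, got %f\n", i, sent[i], received[i]);
            return i;
        }
    }
    return -1;
}
```

Called on the host copies of Gamma and Cp after the readback, a return value of -1 means the device-side copy did what it should.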