GPU transfer problems: GPU won't correctly read data back from device to host

Thanks for the macro, tera. After messing around with it, I’ve still run into a roadblock: the data stays the same in the system. Here’s the new main code (the integrate function is still the same):

[codebox]
for (int i = 0; i < ntime; i++) {

    cudaMemcpy(cpos, pos, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    cudaMemcpy(cxm, xm, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    cudaMemcpy(cforce, force, sizeof(float)*3*man, cudaMemcpyHostToDevice);

    calc_force(forcer, pos, force, man, box, potene);

    integrate<<<3*man, 1234567>>>(cpos, force, dt, man, cxm, kinene, i, ntime);

    cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost);
    cudaMemcpy(xm, cxm, sizeof(float)*3*man, cudaMemcpyDeviceToHost);
    cudaMemcpy(force, cforce, sizeof(float)*3*man, cudaMemcpyDeviceToHost);

    for (int j = 0; j < 3*man; j++) {
        cout << pos[j];
    }
}

__global__ void integrate(double *pos, double *force, double dt, int man,
                          double *xm, double ken, int j, int ntime)
{
    double xx, vi;
    ken = 0.0;

    for (int i = 0; i < 3*man; i++) {
        //timestep[i] = i*dt;
        xx = 2*pos[i] - xm[i] + (dt*dt*force[i]);
        vi = (xx - xm[i]) / (2*dt);
        ken += vi*vi;
        xm[i] = pos[i];
        pos[i] = xx;
        // cout << k << endl;
    }
}

I think I may have to do a device-to-device copy, but I feel that would be inefficient and wouldn’t help one bit. Can anyone offer thoughts on this?

Your kernel is never running - the execution configuration is illegal.

Your number of threads per block (1234567) is way too large, so the kernel can’t launch. The maximum supported number of threads per block is 512 (or 1024 on compute capability 2.0 devices).
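If you’re unsure what a particular card supports, you can query the limits at runtime with cudaGetDeviceProperties. A small sketch:

[code]
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max grid size (x):     %d\n", prop.maxGridSize[0]);
[/code]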

Also, it helps a lot to check the return values from CUDA calls - this would have told you that the kernel does not launch at all. This macro comes in handy:

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf(#x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

You just wrap all CUDA calls (apart from the kernel call itself - problems with the kernel will be reported on the following CUDA call instead) inside CUDA_CALL().
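For example, the copies in your loop would become something like this (just your existing calls wrapped in the macro):

[code]
CUDA_CALL(cudaMemcpy(cpos, pos, sizeof(float)*3*man, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(cxm,  xm,  sizeof(float)*3*man, cudaMemcpyHostToDevice));
// ... kernel launch here - any launch failure will be reported
// by the next wrapped call ...
CUDA_CALL(cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost));
[/code]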

Thanks for the macro. After fixing the launch configuration and messing around with it, I’ve still run into the roadblock of the data staying the same in the system. Here’s the new main code (the integrate function is still the same):

[codebox]
for (int i = 0; i < ntime; i++) {

    cudaMemcpy(cpos, pos, sizeof(float)*3*natom, cudaMemcpyHostToDevice);
    cudaMemcpy(cxm, xm, sizeof(float)*3*natom, cudaMemcpyHostToDevice);
    CUDA_CALL(cudaMemcpy(cforce, force, sizeof(float)*3*natom, cudaMemcpyHostToDevice));

    calc_force(forcer, pos, force, natom, box, potene);

    integrate<<<3*natom, 1>>>(cpos, cforce, dt, natom, cxm, kinene, i, ntime);

    cudaMemcpy(pos, cpos, sizeof(float)*3*natom, cudaMemcpyDeviceToHost);
    cudaMemcpy(xm, cxm, sizeof(float)*3*natom, cudaMemcpyDeviceToHost);
    CUDA_CALL(cudaMemcpy(force, cforce, sizeof(float)*3*natom, cudaMemcpyDeviceToHost));
}
[/codebox]

This time it may be the threading. I read that you can use threads to execute the same calculation on different memory, as if each thread were one iteration of a for loop. Should I approach the integrate function through that method, or am I thinking of a whole different topic? All I know now is that the data is definitely getting to the device, but it’s not being manipulated.

Once again, please and thank you.

After your kernel launch, do cudaGetLastError() - that’s how you know whether a kernel actually launched or not.
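For example, with the launch from your last post (a sketch reusing tera’s CUDA_CALL macro):

[code]
integrate<<<3*natom, 1>>>(cpos, cforce, dt, natom, cxm, kinene, i, ntime);
CUDA_CALL(cudaGetLastError());       // reports an illegal launch configuration immediately
CUDA_CALL(cudaThreadSynchronize());  // reports errors from the kernel execution itself
[/code]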

Thank you for the response, I’ll try that sometime later. Unfortunately, while another user’s method worked for me, I still have the same issue of data not being read from the device into host memory.

I’ll post it on the original topic.

You should definitely do this - it’s the whole point of why GPUs are faster than CPUs (for certain tasks). Take some time to study the chapter on the programming model in the Programming Guide.
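To make the mapping concrete: each iteration of a data-parallel loop becomes one thread, and the loop index is computed from the thread’s position in the grid. A schematic sketch (body() just stands in for whatever the loop does):

[code]
// CPU version: one loop, n iterations
for (int i = 0; i < n; i++)
    body(i);

// GPU version: n threads, each doing one former iteration
__global__ void kernel(int n /*, ... */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the grid is usually rounded up past n
        body(i);
}
[/code]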

How do you know this?

And please, check for errors as tmurray told you. It really is a better indicator for something going wrong than the forums are.

Thank you for the information about the checks.

I’ve done the error checks; I know the numbers are reaching the device, thanks to the macro you gave me as well as error checks from a colleague’s past code. The numbers come out, but as NaNs for the positions. The NaNs lead me to assume the data isn’t copying out correctly, since with small inputs (e.g. 10) the numbers should slowly creep towards each other and then space back out to near their original positions.

I think I found my main problem after reworking the code, but I get an error like this:

cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost) returned “unspecified launch failure”

I’m getting this error for all of my cuda variables.

[codebox]
for (int i = 0; i < ntime; i++) {

    int blockPerGrid = 448;
    int threadsPerBlock = man/blockPerGrid;

    ....

    calc_force(forcer, pos, force, man, box, potene);

    cudaMemcpy(cpos, pos, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    CUDA_CALL(cudaMemcpy(cxm, xm, sizeof(float)*3*man, cudaMemcpyHostToDevice));
    cudaMemcpy(cforce, force, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    cudaMemcpy(ckinene, kinene, sizeof(float)*ntime, cudaMemcpyHostToDevice);

    integrate<<<blockPerGrid, threadsPerBlock>>>(cpos, cforce, dt, man, cxm, ckinene, i, ntime);

    CUDA_CALL(cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost));
    cudaMemcpy(kinene, ckinene, sizeof(float)*ntime, cudaMemcpyHostToDevice);
    cudaMemcpy(xm, cxm, sizeof(float)*3*man, cudaMemcpyDeviceToHost);
    cudaMemcpy(force, cforce, sizeof(float)*3*man, cudaMemcpyDeviceToHost);

    totene = (.5*kinene[0] + potene);
    writepon(pont, i*dt, totene, potene, .5*kinene[0]);
    writeconf(posfile, pos, box, man);
}

__global__ void integrate(float *pos, float *force, float dt, int natom,
                          float *xm, float *ken, int j, int ntime)
{
    float xx, vi;
    ken[0] = 0.0;
    //timestep[i] = i*dt;

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 3*natom; i++) {
        xx = 2*pos[i] - xm[i] + (dt*dt*force[i]);
        vi = (xx - xm[i]) / (2*dt);
        ken[0] += vi*vi;
        xm[i] = pos[i];
    }

    // cout << k << endl;
}
[/codebox]

Could it be that I’m using the threads improperly, or is there too little memory for the threads and blocks?

If you had followed tmurray’s advice, you would know that your kernel still does not launch.

How large is [font=“Courier New”]man[/font]? Depending on its size, you might still have the same problem as before, because [font=“Courier New”]threadsPerBlock[/font] and [font=“Courier New”]blockPerGrid[/font] are interchanged from the usual formula in your program. Try

int threadsPerBlock = 448;

int blockPerGrid = (man + threadsPerBlock - 1) / threadsPerBlock;

Also, your kernel still looks quite strange. What’s the intention behind these lines:

int i = blockIdx.x * blockDim.x + threadIdx.x;

for (int i = 0; i < 3*natom; i++) {

? The i from the first line will never be used, because it is shadowed by the i from the second line. It looks like you should re-read the introductory material about the CUDA execution model in the Programming Guide.
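For illustration, here is roughly what integrate could look like with one thread per coordinate instead of the serial loop. This is only a sketch reusing your parameter names, not a drop-in replacement; note that the accumulation into ken[0] is done by many threads at once, so it needs an atomic (or a proper reduction):

[code]
__global__ void integrate(float *pos, float *force, float dt, int natom,
                          float *xm, float *ken, int j, int ntime)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 3*natom) return;          // grid may be rounded up past 3*natom

    float xx = 2*pos[i] - xm[i] + dt*dt*force[i];
    float vi = (xx - xm[i]) / (2*dt);

    // ken[0] must be zeroed from the host before the launch (e.g. cudaMemset).
    // atomicAdd on float needs compute capability 2.0; on older cards use a reduction.
    atomicAdd(&ken[0], vi*vi);

    xm[i]  = pos[i];
    pos[i] = xx;
}
[/code]

The launch then has to cover all 3*natom coordinates, e.g. blockPerGrid = (3*natom + threadsPerBlock - 1) / threadsPerBlock.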

Okay, I’ll do that. I ran cudaGetLastError() and got a floating point exception. As for man, it can be any number (it’s a variable). If that’s the case, would this mean that the way I’m allocating memory is not proper?

Thanks once again for being patient with me.

If man happens to be larger than 229376, you will exceed the block size limit of 512 threads. Also, you will want the block size to be a multiple of 32 (better yet, 64) to avoid partially unused warps.
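(To see where that number comes from: with blockPerGrid fixed at 448 and at most 512 threads per block, the largest man that fits is 448 × 512 = 229376; anything larger pushes threadsPerBlock = man/448 past the limit.)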

Note that I originally missed a second mistake in the grid size formula - I’ve updated the previous post accordingly.

Okay. I’ve done error checks for entering and returning from the block and used cuda-gdb to debug it. Everything comes up clear, but the data I would normally check (an output file) is not showing any new changes. Here’s the abridged code (where you see the “…”, assume it’s the variables with a “c” prefixed to the name):

[code]
int threadsPerBlock = 448;
int blockPerGrid;

if (man >= 229376)
    blockPerGrid = 512;
else
    blockPerGrid = man/threadsPerBlock;

if (man % threadsPerBlock != 0) blockPerGrid++;

for (int i = 0; i < ntime; i++) {

    calc_force(forcer, pos, force, man, box, potene);

    error = cudaMemcpy(cpos, pos, sizeof(float)*3*natom, cudaMemcpyHostToDevice);
    if (error != cudaSuccess) {
        error = cudaGetLastError();
        cerr << "cudaMemcpy couldn't copy data into cpos\n";
        cerr << cudaGetErrorString(error) << endl;
        return -1;
    }

    integrate<<<blockPerGrid, threadsPerBlock>>>(cpos, cforce, dt, man, cxm, cvel, i, ntime);

    error = cudaMemcpy(pos, cpos, sizeof(float)*3*natom, cudaMemcpyDeviceToHost);
    if (error != cudaSuccess) {
        error = cudaGetLastError();
        cerr << "cudaMemcpy couldn't copy data out of cpos\n";
        cerr << cudaGetErrorString(error) << endl;
        return -1;
    }

    ...
[/code]

You are checking for errors after a memcpy; you should do so immediately after your kernel.

Also, it would be best to post all of your code, because otherwise nobody can see the source of your problems (e.g. memory allocation).

Code removed.

Here’s the code I’m working with.

I got it to work. Please lock the thread.