GPU transfer problems: GPU won't correctly read data back from device to host

Thanks for the macro, tera. After messing around with it, I’ve still run into a roadblock: the data stays the same in the system. Here’s the new main code (the integrate function is still the same):

[codebox]
for (int i = 0; i < ntime; i++) {

    cudaMemcpy(cpos, pos, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    cudaMemcpy(cxm, xm, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    cudaMemcpy(cforce, force, sizeof(float)*3*man, cudaMemcpyHostToDevice);

    calc_force(forcer, pos, force, man, box, potene);

    integrate<<<3*man, 1234567>>>(cpos, force, dt, man, cxm, kinene, i, ntime);

    cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost);
    cudaMemcpy(xm, cxm, sizeof(float)*3*man, cudaMemcpyDeviceToHost);
    cudaMemcpy(force, cforce, sizeof(float)*3*man, cudaMemcpyDeviceToHost);

    for (int j = 0; j < 3*man; j++) {
        cout << pos[j];
    }
}

__global__ void integrate(double *pos, double *force, double dt, int man,
                          double *xm, double ken, int j, int ntime)
{
    double xx, vi;
    ken = 0.0;

    for (int i = 0; i < 3*man; i++) {
        //timestep[i] = i*dt;
        xx = 2*pos[i] - xm[i] + (dt*dt*force[i]);
        vi = (xx - xm[i]) / (2*dt);
        ken += vi*vi;
        xm[i] = pos[i];
        pos[i] = xx;
        // cout << k << endl;
    }
}

I think I may have to do a device-to-device copy, but I feel that would be inefficient and wouldn’t help one bit. Can anyone offer thoughts on this?

Your kernel is never running - the execution configuration is illegal.

Your number of threads per block (1234567) is way too large, so the kernel can’t launch. The maximum supported number of threads per block is 512 (or 1024 on compute capability 2.0 devices).
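If you’re unsure what a particular card supports, you can query the limits at runtime with cudaGetDeviceProperties. A small sketch:

[code]
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max grid size (x):     %d\n", prop.maxGridSize[0]);
[/code]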

Also, it helps a lot to check the return values from CUDA calls - this would have told you that the kernel does not launch at all. This macro comes in handy:

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf(#x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

You just wrap all CUDA calls (apart from the kernel call itself - problems with the kernel will be reported on the following CUDA call instead) inside CUDA_CALL().
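For example, the copies in your loop would become something like this (just your existing calls wrapped in the macro):

[code]
CUDA_CALL(cudaMemcpy(cpos, pos, sizeof(float)*3*man, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(cxm,  xm,  sizeof(float)*3*man, cudaMemcpyHostToDevice));
// ... kernel launch here - any launch failure will be reported
// by the next wrapped call ...
CUDA_CALL(cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost));
[/code]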

Thanks for the macro. After fixing the launch configuration and messing around with it, I’ve still run into the roadblock of the data staying the same in the system. Here’s the new main code (the integrate function is still the same):

[codebox]
for (int i = 0; i < ntime; i++) {

    cudaMemcpy(cpos, pos, sizeof(float)*3*natom, cudaMemcpyHostToDevice);
    cudaMemcpy(cxm, xm, sizeof(float)*3*natom, cudaMemcpyHostToDevice);
    CUDA_CALL(cudaMemcpy(cforce, force, sizeof(float)*3*natom, cudaMemcpyHostToDevice));

    calc_force(forcer, pos, force, natom, box, potene);

    integrate<<<3*natom, 1>>>(cpos, cforce, dt, natom, cxm, kinene, i, ntime);

    cudaMemcpy(pos, cpos, sizeof(float)*3*natom, cudaMemcpyDeviceToHost);
    cudaMemcpy(xm, cxm, sizeof(float)*3*natom, cudaMemcpyDeviceToHost);
    CUDA_CALL(cudaMemcpy(force, cforce, sizeof(float)*3*natom, cudaMemcpyDeviceToHost));
}
[/codebox]

This time it may be the threading. I read that you can use threads to execute the same calculation on different memory, as if each thread were one iteration of a for loop. Should I approach the integrate function through that method, or am I thinking of a whole different topic? All I know now is that the data is definitely getting to the device, but it’s not being manipulated.

Once again, please and thank you.

After your kernel launch, do cudaGetLastError() - that’s how you know whether a kernel actually launched or not.
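For example, with the launch from your last post (a sketch reusing tera’s CUDA_CALL macro):

[code]
integrate<<<3*natom, 1>>>(cpos, cforce, dt, natom, cxm, kinene, i, ntime);
CUDA_CALL(cudaGetLastError());       // reports an illegal launch configuration immediately
CUDA_CALL(cudaThreadSynchronize());  // reports errors from the kernel execution itself
[/code]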

Thank you for the response, I’ll try that sometime later. Unfortunately, while another user’s method worked for me, I still have the same issue of data not being read from the device into host memory.

I’ll post it on the original topic.

You should definitely do this - it’s the whole point of why GPUs are faster than CPUs (for certain tasks). Take some time to study the chapter on the programming model in the Programming Guide.
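To make the mapping concrete: each iteration of a data-parallel loop becomes one thread, and the loop index is computed from the thread’s position in the grid. A schematic sketch (body() just stands in for whatever the loop does):

[code]
// CPU version: one loop, n iterations
for (int i = 0; i < n; i++)
    body(i);

// GPU version: n threads, each doing one former iteration
__global__ void kernel(int n /*, ... */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the grid is usually rounded up past n
        body(i);
}
[/code]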

How do you know this?

And please, check for errors as tmurray told you. It really is a better indicator for something going wrong than the forums are.

Thank you for the information about the checks.

I’ve done the error checks; I know the numbers are reaching the device, thanks to the macro you gave me as well as error checks from a colleague’s past code. The numbers come out, but as NaNs for the positions. The NaNs lead me to assume the data isn’t copying out correctly, since with small inputs (e.g. 10) the numbers should slowly creep towards each other and then space back out to near their original positions.

I think I found my main problem after reworking the code, but I get an error like this:

cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost) returned “unspecified launch failure”

I’m getting this error for all of my cuda variables.

[codebox]
for (int i = 0; i < ntime; i++) {

    int blockPerGrid = 448;
    int threadsPerBlock = man/blockPerGrid;

    ....

    calc_force(forcer, pos, force, man, box, potene);

    cudaMemcpy(cpos, pos, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    CUDA_CALL(cudaMemcpy(cxm, xm, sizeof(float)*3*man, cudaMemcpyHostToDevice));
    cudaMemcpy(cforce, force, sizeof(float)*3*man, cudaMemcpyHostToDevice);
    cudaMemcpy(ckinene, kinene, sizeof(float)*ntime, cudaMemcpyHostToDevice);

    integrate<<<blockPerGrid, threadsPerBlock>>>(cpos, cforce, dt, man, cxm, ckinene, i, ntime);

    CUDA_CALL(cudaMemcpy(pos, cpos, sizeof(float)*3*man, cudaMemcpyDeviceToHost));
    cudaMemcpy(kinene, ckinene, sizeof(float)*ntime, cudaMemcpyHostToDevice);
    cudaMemcpy(xm, cxm, sizeof(float)*3*man, cudaMemcpyDeviceToHost);
    cudaMemcpy(force, cforce, sizeof(float)*3*man, cudaMemcpyDeviceToHost);

    totene = (.5*kinene[0] + potene);
    writepon(pont, i*dt, totene, potene, .5*kinene[0]);
    writeconf(posfile, pos, box, man);
}

__global__ void integrate(float *pos, float *force, float dt, int natom,
                          float *xm, float *ken, int j, int ntime)
{
    float xx, vi;
    ken[0] = 0.0;
    //timestep[i] = i*dt;

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 3*natom; i++) {
        xx = 2*pos[i] - xm[i] + (dt*dt*force[i]);
        vi = (xx - xm[i]) / (2*dt);
        ken[0] += vi*vi;
        xm[i] = pos[i];
    }

    // cout << k << endl;
}
[/codebox]

Could it be that I’m using the threads improperly, or is there too little memory for the threads and blocks?

If you had followed tmurray’s advice, you would know that your kernel still does not launch.

How large is [font=“Courier New”]man[/font]? Depending on its size, you might still have the same problem as before, because [font=“Courier New”]threadsPerBlock[/font] and [font=“Courier New”]blockPerGrid[/font] are interchanged from the usual formula in your program. Try

int threadsPerBlock = 448;

int blockPerGrid = (man + threadsPerBlock - 1) / threadsPerBlock;

Also, your kernel still looks quite strange. What’s the intention behind these lines:

int i = blockIdx.x * blockDim.x + threadIdx.x;

for (int i = 0; i < 3*natom; i++) {

? The i from the first line will never be used, because it is shadowed by the i from the second line. It looks like you should re-read the introductory material about the CUDA execution model in the Programming Guide.
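For illustration, here is roughly what integrate could look like with one thread per coordinate instead of the serial loop. This is only a sketch reusing your parameter names, not a drop-in replacement; note that the accumulation into ken[0] is done by many threads at once, so it needs an atomic (or a proper reduction):

[code]
__global__ void integrate(float *pos, float *force, float dt, int natom,
                          float *xm, float *ken, int j, int ntime)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 3*natom) return;          // grid may be rounded up past 3*natom

    float xx = 2*pos[i] - xm[i] + dt*dt*force[i];
    float vi = (xx - xm[i]) / (2*dt);

    // ken[0] must be zeroed from the host before the launch (e.g. cudaMemset).
    // atomicAdd on float needs compute capability 2.0; on older cards use a reduction.
    atomicAdd(&ken[0], vi*vi);

    xm[i]  = pos[i];
    pos[i] = xx;
}
[/code]

The launch then has to cover all 3*natom coordinates, e.g. blockPerGrid = (3*natom + threadsPerBlock - 1) / threadsPerBlock.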

Okay, I’ll do that. I ran cudaGetLastError() and got a floating point exception. As for man, it can be any number (it’s a variable). If that’s the case, would this mean that the way I’m allocating memory is not proper?

Thanks once again for being patient with me.

If man happens to be larger than 229376, you will exceed the block size limit of 512 threads. Also, you will want the block size to be a multiple of 32 (better yet, 64) to avoid partially unused warps.
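(To see where that number comes from: with blockPerGrid fixed at 448 and at most 512 threads per block, the largest man that fits is 448 × 512 = 229376; anything larger pushes threadsPerBlock = man/448 past the limit.)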

Note that I originally missed a second mistake in the grid size formula - I’ve updated the previous post accordingly.

Okay. I’ve done error checks for entering and returning from the block and used cuda-gdb to debug it. Everything comes up clear, but the data I would normally check (an output file) is not showing any new changes. Here’s the abridged code (where you see the “…”, assume it’s the variables with a “c” prefixed to the name):

[code]
int threadsPerBlock = 448;
int blockPerGrid;

if (man >= 229376)
    blockPerGrid = 512;
else
    blockPerGrid = man/threadsPerBlock;

if (man % threadsPerBlock != 0) blockPerGrid++;

for (int i = 0; i < ntime; i++) {

    calc_force(forcer, pos, force, man, box, potene);

    error = cudaMemcpy(cpos, pos, sizeof(float)*3*natom, cudaMemcpyHostToDevice);
    if (error != cudaSuccess) {
        error = cudaGetLastError();
        cerr << "cudaMemcpy couldn't copy data into cpos\n";
        cerr << cudaGetErrorString(error) << endl;
        return -1;
    }

    integrate<<<blockPerGrid, threadsPerBlock>>>(cpos, cforce, dt, man, cxm, cvel, i, ntime);

    error = cudaMemcpy(pos, cpos, sizeof(float)*3*natom, cudaMemcpyDeviceToHost);
    if (error != cudaSuccess) {
        error = cudaGetLastError();
        cerr << "cudaMemcpy couldn't copy data out of cpos\n";
        cerr << cudaGetErrorString(error) << endl;
        return -1;
    }

    ...
[/code]

You are checking for errors after a memcpy; you should do so immediately after your kernel.

Also, it would be best to post all of your code, because otherwise nobody can see the source of your problems (e.g. memory allocation).

Code removed.

Here’s the code I’m working with.

I got it to work. Please lock the thread.