CUDA Release vs. Debug

Hello everyone.

It seems I have made a classic programming blunder. For weeks now, I have been programming under the false assumption that what works in Debug mode will also work in Release mode. Of course, this is not true. Learned my lesson. In the meantime, I am still stuck with a program that works only in Debug mode but not in Release mode. Here is my program structure, and then I will explain what I have figured out:

  1. Initialize a whole bunch of 2D arrays for computation
  2. Generate as many CPU threads as I have GPU devices available (currently 2)
  3. Each CPU thread has a copy of the input data of 2D arrays to copy to its corresponding GPU
  4. Each CPU thread does cudaMemcpy from HOST to GPU
  5. Each GPU does some computation
  6. Each CPU thread copies back the updated 2D matrices back to the CPU side
  7. Threads end, combine results, finish program
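In rough code, each CPU thread in steps 2-6 does something like this (a condensed sketch, not my actual code; all names are placeholders):

```cuda
// One CPU thread's work, simplified; identifiers are hypothetical.
void gpuWorker(int device, double *h_in, double *h_out, size_t bytes)
{
    cudaSetDevice(device);               // bind this CPU thread to one GPU

    double *d_buf;
    cudaMalloc((void**)&d_buf, bytes);   // device copy of this thread's input
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);

    // each GPU computes its assigned rows
    kernel<<<numBlocks, numThreadsPerBlock>>>(d_buf /* , ... */);
    cudaThreadSynchronize();             // CUDA 3.0-era sync call

    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```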

I have done testing so far to know that yes, each CPU thread is getting the proper input. I would assume (falsely?) that since the variables are passing PERFECTLY in Debug mode, then, in Release mode, they are also being passed properly. I have verified that all the data arrives properly on the CPU side in Debug & Release.

Once I am on the GPU, which is of course difficult to debug, I can set every value in the to-be-copied-back-to-CPU matrices to some outrageous value like 58971490823.4123, but after the copy finishes I am still getting zeros (or something like 7.77639e-315), which is the default value of the host matrix that the GPU data is copied onto. Does that make sense?

There are some other parameters that get passed to the kernel, and I know those are working fine, because each GPU calculates a portion of the to-be-copied-back matrix, and when I inspect the result I can tell that the proper portions of the matrix are being touched by the GPU. However, the copy from GPU to CPU seems to be causing problems. You might say uninitialized variables are the problem, but my friend, not a single variable is uninitialized in my program. Even on the GPU side I make sure that's taken care of.

Anybody else have issues with Release mode like this? I would greatly appreciate any insight!

Thanks,
Daniel

EDIT: Ugh, I always forget to post this stuff. I’m running Windows Server 2003 Standard Edition, Quadcore 4GB ram, Visual Studio 2008 and CUDA 3.0.

It's hard to tell without the kernel code, but the main reason for debug/release issues like that is indeed the fact that debug
code runs on the CPU in a SYNCHRONIZED way, whereas the code in release of course runs multi-threaded. Thus if you have race conditions
in your code you'd see garbage results in release, while debug might show the expected results.

You can try to debug it with Nexus (Windows) or gdb (Linux), or find the race condition in the code yourself :)

hope that helps
eyal

I highly doubt that my company would enjoy me posting entire sections of code, but I’ll do enough to make a point.

[codebox]__global__ void kernel(double     *d_values,
                       double    **d_valuesPtr,
                       double     *d_ntimes,
                       double    **d_ntimesPtr,
                       TimeStamp  *d_times,
                       TimeStamp **d_timesPtr,
                       long       *d_valuesLengthRow,
                       ulong      *d_offsetArray,
                       double     *d_c,
                       double    **d_cPtr,
                       double     *d_p,
                       double    **d_pPtr,
                       double     *d_tradeFreq,
                       double     *d_exp_deltas,
                       double    **d_exp_deltasPtr,
                       double      minTF,
                       ulong       numVars,
                       int         startThread,
                       int         endThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tid += startThread;

    // startThread & endThread tell me which rows of the 2D matrix to calculate
    // (e.g. if startThread = 5 and endThread = 9, those are the rows this GPU
    // computes, because rows 0-4 are taken care of by the other GPU device)
    if ((startThread <= tid) && (tid <= endThread))
    {
        // only need to execute the following code chunk one time
        if (tid == startThread)
        {
            for (int i = 0; i < numVars; i++)
            {
                // setting the addresses of the 2D array in GPU memory correctly
                d_valuesPtr[i]     = &d_values[d_offsetArray[i]];
                d_ntimesPtr[i]     = &d_ntimes[d_offsetArray[i]];
                d_timesPtr[i]      = &d_times[d_offsetArray[i]];
                d_cPtr[i]          = &d_c[i * numVars];
                d_pPtr[i]          = &d_p[i * numVars];
                d_exp_deltasPtr[i] = &d_exp_deltas[d_offsetArray[i]];
            }
        }

        __syncthreads();

        // perform rest of matrix computation
    }
}
[/codebox]

Now, I have tested that prior to reaching the kernel invocation, the input is properly passed. The goal is to change the values in d_c and d_p and copy them back to the CPU like so:

[codebox]kernelWrapper(d_values, d_valuesPtr,
              d_ntimes, d_ntimesPtr,
              d_times, d_timesPtr,
              d_valuesLengthRow,
              d_offsetArray,
              d_c, d_cPtr,
              d_p, d_pPtr,
              d_tradeFreq,
              d_exp_deltas, d_exp_deltasPtr,
              myData->minTradeFreq,
              myData->numStocks,
              numBlocks,
              numThreadsPerBlock,
              myData->startThread,
              myData->endThread);

cudaThreadSynchronize();

// memory = width * height * sizeof(double)
cudaMemcpy(myData->c, d_c, memory, cudaMemcpyDeviceToHost);
cudaMemcpy(myData->p, d_p, memory, cudaMemcpyDeviceToHost);
[/codebox]
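Incidentally, none of the return codes in that snippet are checked, so if the launch or a copy really is failing, it fails silently. Something like this would surface it (a sketch using the same names as above):

```cuda
// Same calls as above, but with the error codes inspected.
cudaError_t err = cudaGetLastError();          // catches a failed kernel launch
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();                 // catches errors during kernel execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));

err = cudaMemcpy(myData->c, d_c, memory, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("copy of d_c failed: %s\n", cudaGetErrorString(err));
```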

…where my actual kernel invocation is:

[codebox]kernel<<<numBlocks,numThreadsPerBlock>>>(d_values, d_valuesPtr,
                                         d_ntimes, d_ntimesPtr,
                                         d_times, d_timesPtr,
                                         d_valuesLengthRow,
                                         d_offsetArray,
                                         d_c, d_cPtr,
                                         d_p, d_pPtr,
                                         d_tradeFreq,
                                         d_exp_deltas, d_exp_deltasPtr,
                                         minTradeFreq,
                                         numStocks,
                                         startThread,
                                         endThread);
[/codebox]

Remember, these things work perfectly in Debug mode. Also, in Debug mode, things are working concurrently: I have seen each GPU calculate its partial result at the same time, then the results individually written back, then summed together for a final result. That's exactly what Release mode should be doing, but it is so stupid that it optimizes to the wrong answer. Ugh… I'm getting a headache from all this.

eyalhir74, I wish I could use Nexus aka Parallel Nsight, but I am running Windows Server 2003. As far as I have seen, Nsight requires the host and target machines to be Windows Vista or Windows 7, but my host machine (Windows XP 32-bit Professional) and target machine (Windows Server 2003 Standard Edition) don’t meet those requirements. :/

When I run the code like it is above, I get the two 10x10 matrices, d_c and d_p, looking like this:

Device 0: (computing rows 0-4)

1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

Device 1: (computing rows 5-9)

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314

SO. You can see that the GPU is affecting the resulting arrays in the right places. When I change the kernel code to:

[codebox]__global__ void kernel(double     *d_values,
                       double    **d_valuesPtr,
                       double     *d_ntimes,
                       double    **d_ntimesPtr,
                       TimeStamp  *d_times,
                       TimeStamp **d_timesPtr,
                       long       *d_valuesLengthRow,
                       ulong      *d_offsetArray,
                       double     *d_c,
                       double    **d_cPtr,
                       double     *d_p,
                       double    **d_pPtr,
                       double     *d_tradeFreq,
                       double     *d_exp_deltas,
                       double    **d_exp_deltasPtr,
                       double      minTF,
                       ulong       numVars,
                       int         startThread,
                       int         endThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tid += startThread;

    // startThread & endThread tell me which rows of the 2D matrix to calculate
    // (e.g. if startThread = 5 and endThread = 9, those are the rows this GPU
    // computes, because rows 0-4 are taken care of by the other GPU device)
    if ((startThread <= tid) && (tid <= endThread))
    {
        // only need to execute the following code chunk one time
        if (tid == startThread)
        {
            for (int i = 0; i < numVars; i++)
            {
                // setting the addresses of the 2D array in GPU memory correctly
                d_valuesPtr[i]     = &d_values[d_offsetArray[i]];
                d_ntimesPtr[i]     = &d_ntimes[d_offsetArray[i]];
                d_timesPtr[i]      = &d_times[d_offsetArray[i]];
                d_cPtr[i]          = &d_c[i * numVars];
                d_pPtr[i]          = &d_p[i * numVars];
                d_exp_deltasPtr[i] = &d_exp_deltas[d_offsetArray[i]];
            }
        }

        __syncthreads();

        // perform rest of matrix computation

        __syncthreads();

        for (int i = 0; i < numVars * numVars; i++)
            d_c[i] = d_p[i] = -42.0;
    }
}
[/codebox]

…then the output of d_c and d_p looks like this:

Device 0: (computing rows 0-4)

1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314

Device 1: (computing rows 5-9)

1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314
1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314 1.60937e-314

And if I change where I set all the values to -42.0 to -4255214232.0, I get 1.7199e-314 for all those values you see up above. So, it does change something…

As you can see, it is properly “affecting” the matrices at the right places, but it will not, for the life of me, take in any kind of coherent value. It initially seems like the GPU/Visual Studio/Windows XP/Windows Server/SOMETHING in Release mode can’t handle cudaMemcpy.

I’m running on two Tesla C1060 cards at the moment.

Any thoughts?

What does the code look like that converts [font=“Courier New”]*d_cPtr[/font] and [font=“Courier New”]*d_pPtr[/font] from host to device pointers?

This is inside a function called passData(), which is invoked when each CPU thread is generated. passData() passes a pointer to a struct of information called myData.

[codebox]void passData(GPUThread *myData)
{
    cudaSetDevice(myData->device);

    double     *d_values;
    double    **d_valuesPtr;
    double     *d_ntimes;
    double    **d_ntimesPtr;
    TimeStamp  *d_times;
    TimeStamp **d_timesPtr;
    double     *d_c;
    double    **d_cPtr;
    double     *d_p;
    double    **d_pPtr;
    double     *d_exp_deltas;
    double    **d_exp_deltasPtr;
    ulong      *d_offsetArray;
    long       *d_valuesLengthRow;
    double     *d_tradeFreq;

    cudaMalloc((void**)&d_values, myData->arrayMemory);
    cudaMalloc((void**)&d_ntimes, myData->arrayMemory);
    cudaMalloc((void**)&d_times, myData->arrayMemory);
    cudaMalloc((void**)&d_c, myData->corrPredMemory);
    cudaMalloc((void**)&d_p, myData->corrPredMemory);
    cudaMalloc((void**)&d_exp_deltas, myData->arrayMemory);
    cudaMalloc((void**)&d_valuesPtr, myData->numVars * sizeof(double *));
    cudaMalloc((void**)&d_ntimesPtr, myData->numVars * sizeof(double *));
    cudaMalloc((void**)&d_timesPtr, myData->numVars * sizeof(TimeStamp *));
    cudaMalloc((void**)&d_cPtr, myData->numVars * sizeof(double *));
    cudaMalloc((void**)&d_pPtr, myData->numVars * sizeof(double *));
    cudaMalloc((void**)&d_exp_deltasPtr, myData->numVars * sizeof(double *));
    cudaMalloc((void**)&d_offsetArray, myData->numVars * sizeof(ulong));
    cudaMalloc((void**)&d_valuesLengthRow, myData->valuesLengthRowMemory);
    cudaMalloc((void**)&d_tradeFreq, myData->tradeFreqMemory);

    cudaMemcpy(d_values, myData->values, myData->arrayMemory, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ntimes, myData->ntimes, myData->arrayMemory, cudaMemcpyHostToDevice);
    cudaMemcpy(d_times, myData->times, myData->arrayMemory, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, myData->c, myData->corrPredMemory, cudaMemcpyHostToDevice);
    cudaMemcpy(d_p, myData->p, myData->corrPredMemory, cudaMemcpyHostToDevice);
    cudaMemcpy(d_exp_deltas, myData->exp_deltas, myData->numElements * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsetArray, myData->offsetArray, myData->numVars * sizeof(ulong), cudaMemcpyHostToDevice);
    cudaMemcpy(d_valuesLengthRow, myData->valuesLengthRow, myData->numStocks * sizeof(long), cudaMemcpyHostToDevice);
    cudaMemcpy(d_tradeFreq, myData->tradeFreq, myData->tradeFreqMemory, cudaMemcpyHostToDevice);

    // followed by kernel invocation, copying back, etc., as shown in my previous posts
}
[/codebox]

numVars corresponds to how long/wide the resulting matrices myData->c and myData->p are going to be.

Is this what you were asking about?

Sorry, I messed up when trying to reply to you. But I did post my reply in the thread!

Yes, and it explains why your code does not work. You cannot just copy an array of host pointers to the device and use them there as device pointers. You need to convert each of them to device pointers.

This just happens to work in emulation, since in emulation there is no difference between host and device pointers.
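To illustrate the distinction (a sketch; identifiers are made up): building the row-pointer table on the host from the *device* base pointer and then copying those device pointers over is fine, whereas copying an array of host addresses is not.

```cuda
// WRONG: an array of host addresses copied to the device is meaningless
// in device code, even though the bytes arrive intact.
//
// RIGHT: compute each row pointer from the device base pointer on the
// host side, then copy that table over (names hypothetical).
double **h_rowPtr = (double**)malloc(numVars * sizeof(double*));
for (int i = 0; i < numVars; i++)
    h_rowPtr[i] = d_base + offsetArray[i];   // d_base came from cudaMalloc
cudaMemcpy(d_rowPtr, h_rowPtr, numVars * sizeof(double*),
           cudaMemcpyHostToDevice);
free(h_rowPtr);
```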

Then I need you to explain what exactly isn’t working. I basically followed the example of the SDK Example “simpleMultiGPU” to do what I was doing. I don’t understand. What they did worked, and mine is quite similar to what they have except that I’m dealing with 1D arrays of doubles and 1D arrays of double* (and TimeStamps too).

Also, note that I did NOT cudaMemcpy an array of pointers. I copied over the actual arrays. The 6 data arrays are:

c
p
values
ntimes
times
exp_deltas

The 6 pointer arrays (I set their proper addresses, i.e. the places to point, only once inside the GPU) are:

cPtr
pPtr
valuesPtr
ntimesPtr
timesPtr
exp_deltasPtr

These are the most notable 1D arrays: contiguous regions of memory holding double and TimeStamp values (TimeStamp is a typedef for __int64).

Also, I have not been running in emulation mode. Never. I only, up until now, have been running in Debug mode. I should have tried to compile with Release mode more recently than this.

Ah, sorry. If you actually calculate the xxxPtr variables from base pointers and offsets on the device, then you are doing the right thing. That code was the one I was asking for.

[codebox]__global__ void kernel(double     *d_values,
                       double    **d_valuesPtr,
                       double     *d_ntimes,
                       double    **d_ntimesPtr,
                       TimeStamp  *d_times,
                       TimeStamp **d_timesPtr,
                       long       *d_valuesLengthRow,
                       ulong      *d_offsetArray,
                       double     *d_c,
                       double    **d_cPtr,
                       double     *d_p,
                       double    **d_pPtr,
                       double     *d_tradeFreq,
                       double     *d_exp_deltas,
                       double    **d_exp_deltasPtr,
                       double      minTF,
                       ulong       numVars,
                       int         startThread,
                       int         endThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tid += startThread;

    // startThread & endThread tell me which rows of the 2D matrix to calculate
    // (e.g. if startThread = 5 and endThread = 9, those are the rows this GPU
    // computes, because rows 0-4 are taken care of by the other GPU device)
    if ((startThread <= tid) && (tid <= endThread))
    {
        // only need to execute the following code chunk one time
        if (tid == startThread)
        {
            for (int i = 0; i < numVars; i++)
            {
                // setting the addresses of the 2D array in GPU memory correctly
                d_valuesPtr[i]     = &d_values[d_offsetArray[i]];
                d_ntimesPtr[i]     = &d_ntimes[d_offsetArray[i]];
                d_timesPtr[i]      = &d_times[d_offsetArray[i]];
                d_cPtr[i]          = &d_c[i * numVars];
                d_pPtr[i]          = &d_p[i * numVars];
                d_exp_deltasPtr[i] = &d_exp_deltas[d_offsetArray[i]];
            }
        }

        __syncthreads();

        // perform rest of matrix computation
    }
}[/codebox]

Note, this is where I set the addresses for the d_somethingPtr arrays in the code I quoted. All inside the GPU.

I really think that cudaMemcpy can’t write back correctly (and, conversely, can’t write to the GPU correctly?). I just need to know what it is that I’m doing that is messing it up, because this stuff should be working correctly…

SOLVED.

In order to use doubles, you have to change sm_10 to sm_13 in the compiler settings. The thing is, I changed that for Debug weeks ago, but I forgot to change it for Release. The hours I have wasted over this one little error…

:/
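For anyone hitting the same thing with a command-line build: the equivalent fix is passing the same -arch flag in every configuration, roughly (a sketch; your file names will differ):

```shell
# sm_10 has no native double support (nvcc demotes doubles to float with
# only a warning); sm_13 is the first target that supports them.
# Make sure BOTH the Debug and Release configurations pass this flag.
nvcc -arch=sm_13 -O2 -o myprogram kernel.cu
```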

My only lingering problem is with the set of .txt files I read from: if I swap one day's worth of data for a different day's, then the first time the program runs, the kernel invocation fails and the output is a matrix full of 0s (the default values). When I run the program a second time, it works perfectly. It always fudges up the first run through, and I can't imagine why.