Undefined problem after upgrading to 3.2 RC: app working seamlessly in 3.1, not working in 3.2 RC

Hi,

I have a strange problem with the app I’m developing: I can’t get it to work since upgrading to 3.2.

I’ve spent 2 days going through the code to explicitly synchronize and call cudaGetLastError after each CUDA call, disable streams, etc., all for nothing. The behavior is that at the end of my computation I get a vector full of #QNAN, without CUDA having returned any error code.

I’ve reverted to toolkit 3.1 (257.21 driver), recompiled the code as is, and voilà, everything works fine (the vector is filled with meaningful values).

My app uses several features: stream synchronization, context thread migration, and async pinned memory transfers. I currently run 2 threads on each device, without any specific problem on the 257.21 driver / 3.1 toolkit.

If I upgrade the driver to the 260.61 release, I get the bad behavior described above. After recompiling with the 3.2 toolkit, nothing gets better, even after disabling streams (all streams = NULL), not using pinned memory, and having only 1 running thread on the Tesla C1060.

I would be thankful for any hint on known feature changes between 3.1 and 3.2 that could explain this, or on how to try and spot the problem. The computation part of the app uses a mix of specific kernels and cuBLAS calls.

My development setup: Win7 x64, VS2005, Core i7, 1x GTX 260, 1x Tesla C1060, 6 GB RAM.

best regards

Julien

Did you try to run without streams with CUDA 3.1 and the new driver? Try to debug your code with the new 3.2 version and check where the difference comes from.

Hi Lev,

thanks for the hint, I’ll try this.

Hi,

I’ve disabled streams (setting all streams to NULL) and recompiled with 3.1; it doesn’t work with the 260.61 driver (it works with 257.21).

What I don’t understand is that for debugging, I call this simple code after each kernel launch and cuBLAS call, and I never get any error code from cudaStreamSynchronize:

[codebox]bool syncMainStream()
{
	// MainStream == 0 (the default stream) while debugging
	if (cudaSuccess != cudaStreamSynchronize(MainStream))
		return false;
	return true;
}[/codebox]
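One general CUDA error-checking caveat may be relevant here (a standard pattern, not specific to this app): kernel launches are asynchronous, so launch-configuration errors are only reported by cudaGetLastError, while execution errors surface at the next synchronizing call. A sketch checking both, with hypothetical kernel and argument names:

```cpp
// myKernel, grid, block and args are placeholders for illustration only.
myKernel<<<grid, block, 0, MainStream>>>(args);

cudaError_t err = cudaGetLastError();        // catches launch errors (bad config, etc.)
if (err == cudaSuccess)
    err = cudaStreamSynchronize(MainStream); // catches errors raised during execution
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
```

Note that neither call will flag a kernel that runs to completion but computes NaNs, which matches the symptom described above.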

I initially thought that my issue would come from either the streams or the context migration features, but it seems to me that if I had a context migration problem, I should get errors when launching kernels, shouldn’t I?

Now, what is the best way to check the content of device data? Should I copy everything back to the host to be able to peek at the data with the visual debugger, or is there a more comfortable way to do this? I can use either Visual Studio 2005 or 2010, and so far I’ve not been able to integrate the CUDA toolkit with VS2010.

Thanks for any help

Julien

“Context thread migration”, wow!! Does it mean that a CUDA context can be juggled around between CPU threads?

Can someone shed some light?

Hi,

Sorry for not having taken care of this thread for a long time (other priorities…).

This is exactly what you’re describing. If you search for cuCtxPopCurrent in both the reference manual and the programming guide, you will find (a little) information about context migration.
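For readers who haven’t seen it, the driver-API pattern looks roughly like this (a minimal sketch with error checking omitted; the hand-off mechanism between the threads is up to the application):

```cpp
#include <cuda.h>  /* CUDA driver API */

CUcontext ctx;

/* Thread A: detach the current context; it becomes "floating",
   attached to no host thread. */
cuCtxPopCurrent(&ctx);
/* ... pass ctx to thread B, e.g. via a shared variable guarded by a mutex ... */

/* Thread B: attach the floating context to this thread and use it. */
cuCtxPushCurrent(ctx);
/* ... kernel launches, cuMemcpy*, cuBLAS calls ... */
cuCtxPopCurrent(&ctx);  /* detach again when done */
```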

Back to my problem, which is now precisely identified: I had assumed that the call

cublasSscal(vectorSize, 0.0, vector, increment)

would initialize any vector to 0, and this is how my application initializes all its data structures. This worked up to the CUDA 3.1 release (with the corresponding drivers). Since the 3.2 RC release, some of the values in vectors initialized this way are #QNAN, which compromises all subsequent computations.

I’ve switched to a very simple initialization kernel, and everything is working fine now.

The remaining question is: should cublasSscal (or any other cuBLAS function that scales the result vector) force a value of 0 when a QNaN is found in the vector being scaled?

Julien

This is the correct behavior.

From IEEE Std 754-2008, section 6.2:

“Operations with NaNs: For an operation with quiet NaN inputs, other than maximum and minimum operations, if a floating-point result is to be delivered the result shall be a quiet NaN which should be one of the input NaNs.”

So, if you multiply a NaN by zero, the result will be a NaN.

Hi,

I totally agree with this regarding general-case computations; I just thought the BLAS routine could have a different spec, as I think I came across a Fortran implementation that would force the result to 0 if the corresponding scaling factor was 0, but I’m not sure of this.

As I came across BLAS quite recently, I’m not well aware of its typical uses, so can someone tell me if there is a simple way to initialize vectors using BLAS calls?

thanks in advance

Julien
