2 strange bugs depending on the type and order of class properties

SpongePatoche · June 21, 2012, 10:19am

Hi all,

I have two very strange bugs in a kernel when I change the type and order of some properties in a class.

For information, I work on a Tesla C2050 with Cuda compilation tools release 4.0 (V0.2.1221), driver version 270.41.19, and the -arch sm_20 flag.

So, let me explain: I can’t show my full code (because there are many files and classes), but I will try to show you a simplified version:

class ClassA
{
public:
    double N, D;  // bug if the type is uint
    double L;
};

class ClassB
{
public:
    // some properties
};

class ClassC
{
public:
    ClassA a;
    ClassB b;  // bug if I declare this property before 'ClassA a;'
};

So, my problem looks like this: in ClassA, if I change the type of the properties ‘N’ and ‘D’ to uint, I get a bug …

And the second: in ClassC, when I swap the declarations of ‘ClassA a;’ and ‘ClassB b;’ …

My classes are more complicated (with inheritance and virtual methods), but some other classes have the same complexity and work.

Do you think the first problem affects the second?

Thank you in advance
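To make the two changes concrete, here is a minimal host-side C++ sketch of what they do to the memory layout. The structs are hypothetical stand-ins (the real classes are not shown in this thread), and the single char member in B is an assumption standing in for "some properties". It only shows that both changes move data around in memory; it does not by itself explain the crash.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Hypothetical stand-ins for the classes in this post.
struct ADouble { double N, D; double L; };        // original member types
struct AUint   { unsigned int N, D; double L; };  // N and D changed to uint

struct B { char flag; };  // assumed placeholder for "some properties"

struct CAFirst { ADouble a; B b; };  // original declaration order
struct CBFirst { B b; ADouble a; };  // swapped declaration order

// Byte offset of member 'a' inside the enclosing class for each ordering.
inline std::size_t offsetOfA_AFirst() { return offsetof(CAFirst, a); }
inline std::size_t offsetOfA_BFirst() { return offsetof(CBFirst, a); }

// Object sizes for the two variants of the first class.
inline std::size_t sizeWithDouble() { return sizeof(ADouble); }
inline std::size_t sizeWithUint()   { return sizeof(AUint); }
```

Printing these four values for the real classes, on both the host and the device side, is one way to check that both sides agree on the layout when objects are copied to the GPU.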

njuffa · June 21, 2012, 11:31pm

Of what nature are these bugs? Is the compiler giving an error message or indicating an internal compiler error? Is the code crashing with an unspecified launch failure at run time? Does the kernel run but give incorrect results? It’s hard to tell anything from the snippets shown.

I would strongly recommend switching to the latest released CUDA version, which is 4.2. In CUDA 4.0 the compiler frontend used for sm_2x targets was replaced, resulting in some bugs creeping in, the vast majority of which are fixed in CUDA 4.2. A CUDA 5.0 preview is available to registered developers at this time; it should be considered to be of alpha release quality.

If your problem persists after switching to CUDA 4.2, I would suggest reducing your code to the smallest possible code that still reproduces the issue and filing a bug, attaching the reduced self-contained code. A link to the bug reporting form can be found on the registered developer website.

SpongePatoche · June 22, 2012, 8:26am

Sorry, I was so surprised by my bug that I forgot to tell you what happened …

As you guessed, the code crashes with an unspecified launch failure at run time (when I change the type in ClassA or the order in ClassC).
And for more information, the files /dev/nvidia0, /dev/nvidia1 and /dev/nvidiactl disappear when the crash occurs.

I will update to CUDA 4.2, try again, and report the result.

SpongePatoche · June 22, 2012, 3:57pm

I updated CUDA to the latest version (4.2) and updated the driver (295.41) and … it works!!!

(so updates can be a good thing … good job NVIDIA)

I'll use this post to ask you 2 other small things:

How can I disable this warning during the compilation step?

warning: function ... is hidden by ... -- virtual function override intended?

And it seems that the first launch of a kernel takes a lot of time (around 5 sec) and later launches are faster …

Is there a setting from NVIDIA to decrease this time?

njuffa · June 22, 2012, 6:52pm

I am not familiar with that warning. The warning is likely given for good reason (two functions with the same name, one hiding the other, maybe?) and it might be best to fix the source code to avoid the risk the compiler warns about.
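As a guess at the kind of situation that triggers such a warning, here is a minimal C++ sketch. It assumes the cause is a signature mismatch between a base-class virtual function and a same-named function in a derived class, which this thread does not confirm. All class and function names are hypothetical.

```cpp
// Base class with a virtual function that derived classes are meant to override.
struct Base {
    virtual ~Base() {}
    virtual int process(int x) const { return x; }
};

// The signature differs (the 'const' is missing), so this declares a new
// function that hides Base::process instead of overriding it.
struct Hiding : Base {
    int process(int x) { return x * 2; }
};

// The signature matches exactly, so this is a real override.
struct Overriding : Base {
    int process(int x) const { return x * 2; }
};

// Calls process() through a reference to the base class.
inline int callThroughBase(const Base& b, int x) { return b.process(x); }
```

With the Hiding class, a call through a Base reference still runs Base::process, which is the risk the warning points at. Making the derived signature match the base exactly, as in Overriding, removes the hiding and restores virtual dispatch.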

Lengthy startup times are usually indicative of running on a device without an attached display. In that case the driver unloads after CUDA shuts down, and needs to load again when CUDA is used the next time. With a display attached, the running desktop keeps the driver busy and thus keeps it loaded. Use “nvidia-smi -pm 1” to put the driver into persistence mode so it does not unload even when not in use.

Even with the driver loaded, the first call to a CUDA API always takes longer as it initializes the CUDA context and runtime. The usual technique to keep this startup overhead out of the timed portion of a task is to issue a cudaFree(0) before timing starts.

SpongePatoche · June 25, 2012, 2:33pm

Indeed, persistence mode was disabled; now it seems to be OK … thanks

Concerning the warning, I use pure virtual functions (virtual void myFunction(void) = 0;) in order to have different processing, so I need the function prototypes to be the same … so it doesn’t matter.

Thanks for everything, njuffa