Weird bug with private members

Hello,

I fought with my computer for two days, and I finally found the strangest bug I’ve ever seen.
I have a host-side C++ class with two device pointers as members, which I allocate with a standard cudaMalloc.
Let’s say the class is A, and the pointers are p1 and p2.

Just after the cudaMalloc in my class constructor, I print the address held in p1: it is always 0x100800.
But inside ANY other member function call, if I print the address again, I get 0x200000.

Where it gets scary: if I only move p1 to the public part of the class, I get the correct behavior for p1, BUT the same bug appears for p2.
I have to move both p1 and p2 to the public part of A, and then everything is OK…
(I emphasize that moving them to the public part is the only thing I changed.)

With strange bugs like this, valgrind usually warns me, but my whole program produces absolutely no valgrind errors or warnings.

I can’t figure out how to reproduce this bug on an arbitrary computer (I tried a little, but I don’t have much time for this, and the bug does not appear on all my computers).

I just want to know whether anybody here has encountered the same bug; if not, perhaps I have been the target of cosmic rays for two days.

PS: tested with CUDA 3.0 and 3.1, gcc 4.1 and gcc 4.3

Sounds like you have an out of bounds write somewhere in your app. Moving from private to public probably shifted the offset at which the pointers are stored in the class, thus the out of bounds write no longer hit them. You are correct that valgrind normally catches these bugs - but it can’t catch everything.

I’ve never tried this with gdb myself (and am not sure if it supports it), but some debuggers allow you to set a breakpoint to trigger when a value in memory changes. Set that on p1 and see where it triggers.

Thank you,

Indeed, your remarks could well explain the symptoms, and perhaps I place too much confidence in valgrind.

The debugging feature you’re talking about would definitely be useful… unfortunately I can’t find such an option in gdb.

Do you know of any Linux debuggers that allow that? The Intel one perhaps?

EDIT:

OK, the Intel one seems to offer the feature (“watch”).

I’ll try this tomorrow. Thank you for pointing this out to me!

Yeah, valgrind is generally very good at detecting these sorts of problems. It just can’t detect if you have some bogus memory write somewhere that just happens to end up still inside a valid memory region for writes, so it is possible (though not common) that it can miss out of bounds writes that cause the behavior you are seeing.

OK, I’ve just used the “watch” feature of my debugger (it turns out gdb has it too), and it was extremely helpful. Thank you again!

So, bug found, but it is **quite an annoying one**!

I can sum it up as:

sizeof(myclass) is not the same in the code compiled through nvcc and in the code compiled by gcc (or icc).

When nvcc compiles the class files, it finds a size of 656 bytes.

And gcc or icc, given the header file, finds only 648 bytes. That’s what caused the out-of-bounds write.

Note that the class header file is (obviously) the same for nvcc and gcc.

The question is now: how to solve it :/

What are the command line options you give to gcc/icc and to nvcc? Since nvcc does not compile the host code itself, but merely passes it on to gcc/icc, I’d expect the difference can only be in compiler options.

OK, you pointed me in the right direction. On the cluster where I compile, even with only CUDA 3.1 loaded, the include path points to old CUDA header files.

Since I have a cudaDeviceProp as a class member, the class layout depends on which CUDA headers get picked up, and this changes everything.

I’ll remember this one !

Thank you very much for your valuable help.

Did anyone ever mention how cool your forum name is?

You’re the first, but I must admit this is one of my best inventions :D
