I fought with my computer for two days, and I finally found the strangest bug I’ve ever seen.
I have a host-side C++ class with two device pointers as members, which I allocate with a standard cudaMalloc.
Let’s say the class is A, and the pointers are p1 and p2.
Just after the cudaMalloc in my class constructor, I print the address stored in p1: it is always 0x100800.
But inside ANY other member function call, if I print the address again, I get 0x200000.
Here is where it gets scary: if I only move p1 to the public part of the class, I get the correct behavior with p1, BUT the same bug appears for p2.
I have to move both p1 and p2 to the public part of A, and then everything is OK…
(I emphasize that moving them to the public part is the only change I made.)
For this kind of strange bug, valgrind usually warns me, but my whole program produces absolutely zero valgrind errors or warnings.
I can’t figure out how to reproduce this bug reliably (I tried a little, but I don’t have much time for this, and the bug does not appear on all my computers).
I’d just like to know whether anybody here has encountered the same bug; if not, then perhaps I’ve been in the path of cosmic rays for two days.
Sounds like you have an out-of-bounds write somewhere in your app. Moving from private to public probably shifted the offset at which the pointers are stored in the class, so the out-of-bounds write no longer hits them. You are correct that valgrind normally catches these bugs, but it can’t catch everything.
I’ve never tried this with gdb myself (and am not sure if it supports it), but some debuggers allow you to set a breakpoint to trigger when a value in memory changes. Set that on p1 and see where it triggers.
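To make the "shifted offset" idea concrete, here is a minimal sketch, not the original poster's class. It assumes three hypothetical member orders that moving p1 and p2 between sections could produce, uses std::uintptr_t as a stand-in for the device pointers, and simulates the stray write as a copy that always lands on the first word of the object. The names BothPrivate, P1Public, BothPublic, make and after_stray_write are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Three hypothetical member orders; only the order matters for this sketch,
// so access specifiers are left out. std::uintptr_t stands in for a device pointer.
struct BothPrivate { std::uintptr_t p1; std::uintptr_t p2; std::uintptr_t n;  };
struct P1Public    { std::uintptr_t p2; std::uintptr_t n;  std::uintptr_t p1; };
struct BothPublic  { std::uintptr_t n;  std::uintptr_t p1; std::uintptr_t p2; };

constexpr std::uintptr_t kGood1 = 0x100800;  // value printed in the constructor
constexpr std::uintptr_t kGood2 = 0x100900;  // assumed value for p2
constexpr std::uintptr_t kJunk  = 0x200000;  // value seen later

// Build an object with known-good contents, setting fields by name.
template <typename T>
T make() {
    T obj{};
    obj.p1 = kGood1;
    obj.p2 = kGood2;
    obj.n  = 4;
    return obj;
}

// Simulate the buggy write: it always overwrites the first word of the object,
// whichever member happens to live there in the given layout.
template <typename T>
T after_stray_write(T obj) {
    std::memcpy(&obj, &kJunk, sizeof kJunk);
    return obj;
}
```

With these assumed layouts, the same write corrupts p1 in BothPrivate, p2 in P1Public, and only the unrelated field n in BothPublic, which matches the pattern described in the question. For the data breakpoint itself, gdb does provide this through its `watch` command: stopping in the constructor right after the cudaMalloc and issuing something like `watch -l p1` should make gdb halt at the statement that next modifies that memory location.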
Yeah, valgrind is generally very good at detecting these sorts of problems. It just can’t detect if you have some bogus memory write somewhere that just happens to end up still inside a valid memory region for writes, so it is possible (though not common) that it can miss out of bounds writes that cause the behavior you are seeing.
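As a rough illustration of that blind spot, the sketch below overruns a small buffer but never leaves the heap block that holds the whole object. Record, fill_bytes and p1_after_overrun are hypothetical names, and std::uintptr_t again stands in for a device pointer. Every byte written is inside an allocated, addressable block, so as far as I understand memcheck it has nothing to complain about, even though p1 is destroyed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

struct Record {
    char name[8];        // the buffer the code means to fill
    std::uintptr_t p1;   // stand-in for a device pointer stored next to it
};

// Hypothetical helper with a length bug at the call site: it fills `count`
// bytes starting at the beginning of the record.
void fill_bytes(Record* r, unsigned char value, std::size_t count) {
    std::memset(reinterpret_cast<unsigned char*>(r), value, count);
}

// Returns the value of p1 after the overrun.
std::uintptr_t p1_after_overrun() {
    auto r = std::make_unique<Record>();
    r->p1 = 0x100800;
    // Bug: the intended length was sizeof(r->name), not sizeof(Record).
    // The write stays inside the heap block, so it is "valid" memory.
    fill_bytes(r.get(), 0xAB, sizeof(Record));
    return r->p1;
}
```

Running a small program that calls p1_after_overrun() under valgrind should come back clean, while the returned value no longer equals 0x100800.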
What are the command line options you give to gcc/icc and to nvcc? Since nvcc does not compile the host code itself, but merely passes it on to gcc/icc, I’d expect the difference can only be in compiler options.