occasional ULF caused by cudaMemcpy, no kernels, same args every call corrupt memory used by CUDA, b

Hi everyone

I'm posting after 7 man-days of debugging this issue with no insight into a solution. We get an Unspecified Launch Failure (ULF) in a program that uses only cudaMalloc and cudaMemcpy. The program runs indefinitely until a second thread interacts with the thread making the CUDA calls. After that interaction, even if the interaction does nothing, every CUDA function fails with ULF.

The args to cudaMemcpy are always the same. I print the addresses and length on every call. I have double-checked all the synchronization. After each call I call cudaThreadSynchronize() and cudaGetLastError().

valgrind and helgrind report no issues.
We’re using CUDA 2.2 and Fedora 10, with Fedora Eclipse.

Our best guess is that our process memory heap is being corrupted, and the CUDA libraries require some state in our process or thread that is damaged by the interaction with another thread, though I cannot see how this occurs. The program is basically as follows:

main thread:

  1. start cuda thread (it starts and runs successfully for thousands of iterations)
  2. sleep( 20 secs )
  3. call size() on a std::map that contains pointers to cuda-allocated memory
  4. ULFs start to occur.

cuda thread:

  1. cudaMemcpy host to device
  2. cudaMemcpy device to host
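
The two-thread structure listed above can be sketched as a host-only skeleton. This is a minimal sketch under assumptions, not the original program: round_trip is a hypothetical stand-in that uses memcpy where the real code calls cudaMemcpy, run_repro and ReproResult are illustrative names, and the 20-second sleep is passed in as a parameter.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <map>
#include <thread>
#include <vector>

// Hypothetical stand-in for the cudaMemcpy round trip; the real program
// would call cudaMemcpy in both directions here.
bool round_trip(std::vector<char>& host, std::vector<char>& device) {
    std::memcpy(device.data(), host.data(), host.size());  // host to device
    std::memcpy(host.data(), device.data(), host.size());  // device to host
    return std::memcmp(host.data(), device.data(), host.size()) == 0;
}

struct ReproResult {
    long iterations;
    long failures;
    std::size_t map_size;
};

ReproResult run_repro(std::chrono::milliseconds delay) {
    std::vector<char> host(1024, 'x');
    std::vector<char> device(1024);

    // Filled before the worker starts and only read afterwards.
    std::map<int, void*> buffers;
    buffers[0] = device.data();

    std::atomic<bool> stop(false);
    std::atomic<long> iterations(0);
    std::atomic<long> failures(0);

    // "cuda thread": copy in both directions until told to stop.
    std::thread worker([&] {
        do {
            if (!round_trip(host, device)) ++failures;
            ++iterations;
        } while (!stop.load());
    });

    // "main thread": sleep, then touch the map as in step 3.
    std::this_thread::sleep_for(delay);
    std::size_t n = buffers.size();
    std::this_thread::sleep_for(delay);

    stop = true;
    worker.join();
    return ReproResult{iterations.load(), failures.load(), n};
}
```

Comparing the failure count before and after the size() call is one way to confirm that the interaction itself is the trigger.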

We have done our best to rule out corruption of the container and to trace all the memory addresses through the program. They do not change, and they work fine in CUDA calls. We never delete the memory passed to CUDA calls.
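
One way to test the heap-corruption guess on the host side is to bracket each host buffer with guard bytes and verify them after every copy. The following is a minimal sketch under assumptions: GuardedBuffer, its members, the guard size and the 0xA5 pattern are all hypothetical and not part of CUDA or of the original program.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: a host buffer with guard bytes on both sides.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t bytes)
        : storage_(bytes + 2 * kGuardBytes, 0) {
        // Fill the leading and trailing guard regions with a known pattern.
        std::memset(storage_.data(), kPattern, kGuardBytes);
        std::memset(storage_.data() + kGuardBytes + bytes, kPattern, kGuardBytes);
    }

    // Pointer to the usable region, suitable as a host-side copy argument.
    unsigned char* data() { return storage_.data() + kGuardBytes; }
    std::size_t size() const { return storage_.size() - 2 * kGuardBytes; }

    // True while both guard regions still hold the pattern.
    bool intact() const {
        const unsigned char* front = storage_.data();
        const unsigned char* back = front + kGuardBytes + size();
        for (std::size_t i = 0; i < kGuardBytes; ++i) {
            if (front[i] != kPattern || back[i] != kPattern) return false;
        }
        return true;
    }

private:
    static constexpr std::size_t kGuardBytes = 16;
    static constexpr unsigned char kPattern = 0xA5;
    std::vector<unsigned char> storage_;
};
```

Calling intact() after each cudaMemcpy that uses buf.data() would flag a write that ran past either end of the buffer, although it cannot see corruption elsewhere on the heap.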

Any ideas what we could be missing?

regards
dave

RESOLVED

Hi all,

This problem was caused by an incompatibility between CUDA 2.2.1 and libXML 2.7.2, the latter supplied with Fedora 10. The problem is probably more with libXML than CUDA, but I’m not sure. Including the libXML headers when using the CUDA compiler causes some warnings at compile time (included below) and subtle (minor) heap corruption when libXML functions are used. We rolled back to libXML 2.6.32 and the problems went away, along with the warnings.

all the best
dave rawlinson

/usr/include/libxml2/libxml/xmlmemory.h(66): warning: attribute "alloc_size" ignored
/usr/include/libxml2/libxml/xmlmemory.h(153): warning: attribute "alloc_size" ignored
/usr/include/libxml2/libxml/xmlmemory.h(161): warning: attribute "alloc_size" ignored
/usr/include/libxml2/libxml/xmlmemory.h(165): warning: attribute "alloc_size" ignored
/usr/include/libxml2/libxml/xmlerror.h(846): warning: invalid attribute for "xmlGenericErrorFunc"
/usr/include/libxml2/libxml/valid.h(44): warning: invalid attribute for "xmlValidityErrorFunc"
/usr/include/libxml2/libxml/valid.h(59): warning: invalid attribute for "xmlValidityWarningFunc"
/usr/include/libxml2/libxml/parser.h(597): warning: invalid attribute for "warningSAXFunc"
/usr/include/libxml2/libxml/parser.h(607): warning: invalid attribute for "errorSAXFunc"
/usr/include/libxml2/libxml/parser.h(619): warning: invalid attribute for "fatalErrorSAXFunc"
/usr/include/libxml2/libxml/relaxng.h(35): warning: invalid attribute for "xmlRelaxNGValidityErrorFunc"
/usr/include/libxml2/libxml/relaxng.h(45): warning: invalid attribute for "xmlRelaxNGValidityWarningFunc"
/usr/include/libxml2/libxml/xmlschemas.h(95): warning: invalid attribute for "xmlSchemaValidityErrorFunc"
/usr/include/libxml2/libxml/xmlschemas.h(105): warning: invalid attribute for "xmlSchemaValidityWarningFunc"
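
Since the warnings only appear when the libXML headers pass through the CUDA compiler, another option that might help, besides rolling back, is to keep those headers out of CUDA-compiled files behind an opaque interface. This is an untested sketch under assumptions: XmlConfig, root_name and the file split are hypothetical, and the implementation below is a stand-in that scans for the first tag so that it compiles without libXML. A real version would include the libXML headers only in the .cpp file built by the host compiler.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// --- xml_config.h (hypothetical): safe to include from CUDA sources,
// because it pulls in no libXML headers. ---
class XmlConfig {
public:
    explicit XmlConfig(const std::string& text);
    ~XmlConfig();
    std::string root_name() const;

private:
    struct Impl;                  // defined only in the host-compiled .cpp
    std::unique_ptr<Impl> impl_;
};

// --- xml_config.cpp (hypothetical): built with the host compiler only.
// A real version would include the libXML headers here and keep the
// parsed document inside Impl. This stand-in just finds the first tag. ---
struct XmlConfig::Impl {
    std::string root;
};

XmlConfig::XmlConfig(const std::string& text) : impl_(new Impl) {
    std::size_t open = text.find('<');
    if (open == std::string::npos) return;   // no tag: root stays empty
    std::size_t end = text.find_first_of(" />", open + 1);
    if (end == std::string::npos) end = text.size();
    impl_->root = text.substr(open + 1, end - open - 1);
}

XmlConfig::~XmlConfig() {}

std::string XmlConfig::root_name() const { return impl_->root; }
```

Whether this avoids the heap corruption as well as the warnings is not something the thread establishes; it only removes the libXML headers from what the CUDA compiler sees.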