Hello,
I’m building a service that runs on a server hosting many GPU jobs and collects the core dumps they produce. Since the server’s HDDs are heavily loaded by these jobs, I’d rather not materialize the core dumps on disk. Instead, I’d prefer to write them into a pipe and stream them to remote storage.
Unfortunately, it seems that the CUDA library is unable to write core dumps into a pipe. When I set the CUDA_COREDUMP_FILE environment variable to the path of my pipe, only the first 64 bytes of the core dump are sent.
After a little investigation with strace, I found that the CUDA library calls ftell on the core dump file. This function returns -1 for pipes, and after that the program terminates. Using the LD_PRELOAD mechanism, I implemented a custom version of ftell that counts the number of bytes written into the pipe, and this allowed me to obtain an almost valid core dump. The only difference is that the first 64 bytes of the core dump end up at the end of the file; one possible explanation is that the CUDA library calls fseek back to the beginning of the core dump file when writing the ELF header.
However, this workaround seems completely unreliable. Would it be possible to fix this in the CUDA library itself?
Best regards,
Grigory Reznikov.