Is compute-sanitizer compatible with NVDec?

Unfortunately I cannot do much on my end without a reproducer. I verified locally that suppressions work as shown below. Can you please confirm you can reproduce this locally?

$ cat test.cu
#include <cassert>

#define ASSERT_RT(Code) assert((Code) == cudaSuccess)

int main()
{
    int *ptr;
    ASSERT_RT(cudaMalloc(&ptr, sizeof(int)));
    int v;
    ASSERT_RT(cudaMemcpy(&v, ptr, sizeof(int), cudaMemcpyDeviceToHost));
    ASSERT_RT(cudaDeviceSynchronize());
}
$ nvcc -o test test.cu
$ compute-sanitizer --tool initcheck --show-backtrace no --xml --save report.xml ./test
========= COMPUTE-SANITIZER
========= Host API memory access error at host access to 0x7f14eac00000 of size 4 bytes
=========     Uninitialized access at 0x7f14eac00000 on access by cudaMemcpy source
=========
========= ERROR SUMMARY: 1 error
$ compute-sanitizer --tool initcheck --show-backtrace no --suppressions report.xml ./test
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors

I'll try that, thank you. Probably tomorrow. Meanwhile, the sanitizer with leak check found some issues in my app, so that's good. ;)

Awesome thanks, feel free to reach out here anytime!

While I get your test case set up, maybe you can look at the files I uploaded: the error.xml and the command + output. I had to rename error.xml to error.txt to upload it.

error.txt (3.7 KB)
command.txt (3.2 KB)

Thank you sir!

I suspect we might have a bug with path / module matching. Can you please try the suppressions.txt file attached? I just removed all <module> and <path> entries. Thanks!
suppressions.txt (2.0 KB)

Thank you. There is no change, i.e., the error is still not suppressed.

OK, can you try removing the frames one by one, starting from the bottom, and see whether the suppression starts working at some point? Thanks!

Will do. Thank you.

I removed the frames one by one until none were left, but the error is still reported. Did you notice my earlier question about how the XML shows addresses in decimal while the error output shows them in hex? I don't know if that is significant for matching.

Thanks for trying. The hexadecimal output should not matter, but the issue could come from the size. Can you try reverting to the suppressions file I uploaded and removing the <accessSize>...</accessSize> line?

Sorry, no change.

OK I am running out of ideas unfortunately. Can you please verify whether or not this file correctly suppresses all errors?

<?xml version="1.0" encoding="utf-8"?>
<ComputeSanitizerOutput>
  <record>
    <kind>InitcheckApiError</kind>
    <level>Error</level>
    <what>
    </what>
    <hostStack>
    </hostStack>
  </record>
</ComputeSanitizerOutput>

No, it did not suppress anything. I will try with your test case posted earlier.

Do you know why I get thousands of errors when I know that the memcpy at issue is called only once? Perhaps that is relevant.

Wait, let me test again. I had changed the script.

Sorry, no change. What's with the thousands of errors?

There is one error per address, so that you can see where the uninitialized accesses are.

========= Host API memory access error at host access to 0xb17200000 of size 2,228,224 bytes
=========     Uninitialized access at 0xb17200780 on access by cudaMemcpy source

In the example above, the first uninitialized byte copied is at offset 0x780 (1920 in decimal), which can be useful information (I am assuming your image's width is 1920).
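The offset arithmetic can be sketched in a few lines of host-side C++. This is only an illustration: `uninit_offset`, `locate`, and `Position` are hypothetical helpers, not part of compute-sanitizer, and the 2048-byte row pitch is an assumption (it is consistent with the reported size, since 2048 × 1088 = 2,228,224, but the real layout of the buffer is not known from the report).

```cpp
#include <cstdint>

// Hypothetical helper: byte offset of the reported uninitialized address
// relative to the start of the range passed to cudaMemcpy.
constexpr std::uint64_t uninit_offset(std::uint64_t copy_base,
                                      std::uint64_t uninit_addr)
{
    return uninit_addr - copy_base;
}

// Row/column position of a byte offset inside a 2D buffer.
struct Position {
    std::uint64_t row;
    std::uint64_t col;
};

// Assumes every row occupies exactly `pitch_bytes` bytes (an assumption
// about the buffer layout, not something the sanitizer reports).
constexpr Position locate(std::uint64_t offset, std::uint64_t pitch_bytes)
{
    return Position{offset / pitch_bytes, offset % pitch_bytes};
}

// Addresses taken from the report quoted above.
static_assert(uninit_offset(0xb17200000ULL, 0xb17200780ULL) == 0x780,
              "offset of the first uninitialized byte");
static_assert(0x780 == 1920, "0x780 is 1920 in decimal");

// Under the assumed 2048-byte pitch, offset 1920 is still in row 0,
// right after 1920 bytes of row data.
static_assert(locate(0x780, 2048).row == 0, "row under assumed pitch");
static_assert(locate(0x780, 2048).col == 1920, "column under assumed pitch");
```

If the pitch assumption holds, the first flagged byte would be the first byte past a 1920-byte row, which could mean that the row padding is what is being copied uninitialized. That is a guess to check against the actual allocation, not a conclusion.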

Please let me know when you get a chance to try the instructions I provided, thanks!

Sorry, I do not understand. Which instructions are you referring to? As far as I know I have done everything you asked for except that I haven't tried your minimal test case.

Sorry, yes, I was referring to the minimal test case. Thanks!

OK, thank you. I will do that now.

I can't build it because a lot of symbols come up as unresolved at link time. I've only built things through VS 2019 for the driver API, so I'll need help to build it. Sorry about that.

tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol atexit referenced in function "void __cdecl __nv_cudaEntityRegisterCallback(void * *)" (?__nv_cudaEntityRegisterCallback@@YAXPEAPEAX@Z)
cudart_static.lib(cudart_cudart_global.obj) : error LNK2001: unresolved external symbol atexit
cudart_static.lib(cudart_cuoswin32.obj) : error LNK2001: unresolved external symbol atexit
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol labs referenced in function "long __cdecl abs(long)" (?abs@@YAJJ@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol llabs referenced in function "__int64 __cdecl abs(__int64)" (?abs@@YA_J_J@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol fabs referenced in function "double __cdecl abs(double)" (?abs@@YANN@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol fminf referenced in function "float __cdecl fmin(float,float)" (?fmin@@YAMMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol fmaxf referenced in function "float __cdecl fmax(float,float)" (?fmax@@YAMMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol exp2f referenced in function "float __cdecl exp2(float)" (?exp2@@YAMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol expm1f referenced in function "float __cdecl expm1(float)" (?expm1@@YAMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol log2f referenced in function "float __cdecl log2(float)" (?log2@@YAMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol log1pf referenced in function "float __cdecl log1p(float)" (?log1p@@YAMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol acoshf referenced in function "float __cdecl acosh(float)" (?acosh@@YAMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol asinhf referenced in function "float __cdecl asinh(float)" (?asinh@@YAMM@Z)
tmpxft_0000ba98_00000000-19_test.obj : error LNK2019: unresolved external symbol atanhf referenced in function "float __-- More –