How to set filter in cuda-memcheck (or compute-sanitizer)

I am developing a fairly large project (around 1200 kernels so far). One kernel appears to have a memory race, because it gives different answers on every run. I want to track it down by running cuda-memcheck on that specific kernel, so I am trying to use the --filter option together with --tool racecheck. The codebase is big, and running cuda-memcheck on all kernels, especially with racecheck enabled, would take an eternity.

The official documentation says to use key-value pairs of the form: {key1=val1}[{,key2=val2}].

I am not sure what exactly this means, and everything I have tried has resulted in an invalid-option message. I could not find an example online or in the NVIDIA cuda-samples provided with the toolkit.

So far, I have tried these (and probably all combinations of these):

cuda-memcheck --filter <kernel_name>,kns <Executable>

cuda-memcheck --filter key1=<kernel_name>, key2=kns <Executable>

cuda-memcheck --filter key1='<kernel_name>', key2='kns' <Executable>

cuda-memcheck --filter <kernel_name>,[kns] <Executable>

I am not sure exactly how to interpret the documentation. An example would be great. Thanks.

Note: I can use cuda-memcheck with other options, and my executable is compiled correctly with flags such as -Xcompiler and -lineinfo.

Suppose I have a kernel in my code defined like this:

__global__ void k(...) {...};

specifying:

compute-sanitizer --filter kernel_name=<mangled name of k> ./my_executable

will run compute-sanitizer only on kernel k.

Example:

$ cat t2086.cu
__device__ int val = 0;
__global__ void good(){

  val = 1;

}

__global__ void bad(int *d){

  *d = 1;

}

int main(){

  int *d = NULL;

  good<<<1,1>>>();
  bad<<<1,1>>>(d);
  cudaDeviceSynchronize();
}
$ nvcc -o t2086 t2086.cu -lineinfo
$ compute-sanitizer --filter kernel_name=_Z4goodv ./t2086
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$ compute-sanitizer --filter kernel_name=_Z3badPi ./t2086
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at 0x50 in /home/user2/misc/t2086.cu:10:bad(int *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x20d6ea]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x843b]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x5ea78]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3ce7]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3b66]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3b8e]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3910]
=========                in /home/user2/misc/./t2086
=========     Host Frame:__libc_start_main [0x21b15]
=========                in /lib64/libc.so.6
=========     Host Frame: [0x3751]
=========                in /home/user2/misc/./t2086
=========
========= Program hit unspecified launch failure (error 719) on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x3bd253]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x39e37]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3915]
=========                in /home/user2/misc/./t2086
=========     Host Frame:__libc_start_main [0x21b15]
=========                in /lib64/libc.so.6
=========     Host Frame: [0x3751]
=========                in /home/user2/misc/./t2086
=========
========= ERROR SUMMARY: 2 errors
$

You can get mangled kernel names with cuobjdump, and there are probably other ways as well.
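As a rough sketch of that lookup, the snippet below lists the kernels embedded in the binary and then demangles two candidate names to check them. It assumes the t2086 executable from the example, that cuobjdump from the CUDA toolkit is on your PATH, and that c++filt (part of binutils) is installed; the demangle helper is a hypothetical convenience function, not part of any tool.

```shell
#!/bin/sh
# Assumption: the binary built in the example above; adjust the path as needed.
exe=./t2086

# In SASS mode cuobjdump prints a "Function : <mangled name>" line per kernel.
# Skipped when cuobjdump or the binary is not available.
if command -v cuobjdump >/dev/null 2>&1 && [ -f "$exe" ]; then
    cuobjdump -sass "$exe" | grep 'Function :'
fi

# Hypothetical helper: demangle a candidate name to confirm which kernel it is.
demangle() {
    printf '%s\n' "$1" | c++filt
}

demangle _Z3badPi    # prints: bad(int*)
demangle _Z4goodv    # prints: good()
```

The mangled form is what goes after kernel_name= on the command line; the demangled form is only for checking that you picked the right kernel.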

Note that the docs mark --filter as deprecated and list its replacements.

Don’t want to use mangled names? If no kernel name is entirely contained within another kernel name, you can probably use the substring-matching option (kns):

$ compute-sanitizer --filter kns=bad ./t2086
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at 0x50 in /home/user2/misc/t2086.cu:10:bad(int *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x20d6ea]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x843b]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x5ea78]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3ce7]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3b66]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3b8e]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3910]
=========                in /home/user2/misc/./t2086
=========     Host Frame:__libc_start_main [0x21b15]
=========                in /lib64/libc.so.6
=========     Host Frame: [0x3751]
=========                in /home/user2/misc/./t2086
=========
========= Program hit unspecified launch failure (error 719) on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x3bd253]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x39e37]
=========                in /home/user2/misc/./t2086
=========     Host Frame: [0x3915]
=========                in /home/user2/misc/./t2086
=========     Host Frame:__libc_start_main [0x21b15]
=========                in /lib64/libc.so.6
=========     Host Frame: [0x3751]
=========                in /home/user2/misc/./t2086
=========
========= ERROR SUMMARY: 2 errors
$
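Coming back to the original goal, the same filter can presumably be combined with --tool racecheck so that only the suspect kernel is checked. The following is a minimal sketch under stated assumptions: ./my_executable and my_kernel are placeholders, and since --filter is marked deprecated, newer releases may need the replacement option from the docs instead. Keep in mind that racecheck reports shared-memory hazards, so a race on global memory would not show up here.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with your binary and a substring
# that matches only the kernel you want to check.
exe=./my_executable
kernel_substring=my_kernel

# Build the command: racecheck tool, restricted to kernels whose name
# contains the given substring (kns = kernel substring match).
cmd="compute-sanitizer --tool racecheck --filter kns=$kernel_substring $exe"
echo "$cmd"

# Run it only when the tool and the executable are actually present.
if command -v compute-sanitizer >/dev/null 2>&1 && [ -x "$exe" ]; then
    $cmd
fi
```

Restricting the tool to one kernel this way is what should keep the racecheck run time manageable on a codebase with many kernels.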

Note that these tools have their own sub-forums.