nvprof gives the following warnings:
==179614== Warning: 541 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==179614== Warning: 541 records have invalid timestamps due to insufficient semaphore pool size. You can configure the pool size using the option --profiling-semaphore-pool-size.
I tried increasing both of those values when launching nvprof, to no avail.
What is doubly confusing is that some kernels (not shown here) do get profiled successfully. The application depends on two sets of kernels: one set is compiled when the application is built, and the other is contained in dynamically loaded libraries that are compiled elsewhere. Only the kernels loaded from the libraries get profiled; the kernels implemented and built by the application itself do not.
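For reference, this is roughly how the two options named in the warnings can be raised. It is only a sketch: the values and the binary name `./my_app` are placeholders, not the exact ones from my run. As far as I understand the nvprof help text, `--device-buffer-size` is given in MB and `--profiling-semaphore-pool-size` is a number of entries.

```shell
#!/bin/sh
# Placeholder values for illustration only (assumptions, not the original run).
DEVICE_BUFFER_MB=64       # --device-buffer-size, assumed to be in MB
SEMAPHORE_POOL=262144     # --profiling-semaphore-pool-size, assumed to be a count
APP=./my_app              # hypothetical application binary

# Assemble the command line; none of the arguments contain spaces,
# so plain word splitting is safe when it is run below.
CMD="nvprof --device-buffer-size $DEVICE_BUFFER_MB --profiling-semaphore-pool-size $SEMAPHORE_POOL $APP"

# Print the command so the exact invocation can be pasted into a report.
echo "$CMD"

# Only run it when nvprof and the binary are actually available.
if command -v nvprof >/dev/null 2>&1 && [ -x "$APP" ]; then
    $CMD
fi
```

Printing the command first makes it easy to confirm that both options really reached nvprof when comparing runs with different sizes.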
The application calls cudaCheckErrors() in plenty of places, which I assume would catch any errors along the way.
Compiled with nvcc flags:
-O3 -std=c++14 -Xcompiler -Wall -D_BSD_SOURCE -g -rdc=true --generate-code arch=compute_50,code=sm_50 --generate-code arch=compute_60,code=sm_60 --generate-code arch=compute_61,code=sm_61 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_75,code=sm_75