I have written a simple demo to capture the performance of my CUDA kernels in a multi-threaded environment.
My kernels are launched by thousands of host threads, and the buffers of my CUPTI Activity API demo fill up with records almost instantly.
So I'm wondering: is there a way to limit the CUPTI activity profiler to a single worker thread, so that it only records the CUDA API calls and kernels launched by that one thread?