Is it possible to profile Thrust code using the Visual Profiler?

I tried running it, but only the non-Thrust kernel calls were profiled.

It should be possible, yes. Under the hood, the Visual Profiler uses the nvprof mechanism, and I frequently use nvprof to profile Thrust code.

Thanks, I finally got it working.
I also found that what Thrust can provide is quite limited, as the code below shows:
I end up with 992 (1 multiply + 1 reduce) Thrust calls, which is 162 kernel launches.
If I write my own kernel, only 1 kernel launch is needed.

thrust::transform(t_d_X + (idx0[i]-1)*(1+iNumPaths) + 1, t_d_X + (idx0[i]-1)*(1+iNumPaths) + iNumPaths + 1,
                  t_d_X + (idx0[j]-1)*(1+iNumPaths) + 1, t_d_cdataMulti, thrust::multiplies<double>());
ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti + iNumPaths, (double) 0, thrust::plus<double>());


Thrust has a transform_reduce function, which might cut your kernel calls in half. You can also do a segmented reduction using reduce_by_key. Finally, you can combine both by using a transform iterator with reduce_by_key, possibly getting your kernel launches with Thrust down to just a few.

Hi txbob, thanks for the suggestion. I did quite a bit of research on it.
In my case:

1) transform_reduce: will not help, as there is an index indirection through “idx0[i]”, and there are basically 2 arrays involved. The 1st one is X[idx0[i]], the 2nd one is X[idx0[j]].
2) reduce_by_key: will help. But I need to store all the interim results in one big array, and prepare a key-mapping table of the same size. I will try it out.
3) transform iterator: will not help, for the same reason as 1).

I think I can’t avoid writing my own kernel, but many thanks all the same!

Seems like you have an answer now on your cross-posting: