Is it possible to profile Thrust code using the Visual Profiler?

I tried, but only the non-Thrust kernel calls were profiled.
Thanks.

It should be possible, yes. Under the hood, the Visual Profiler uses the nvprof mechanism, and I frequently use nvprof to profile Thrust code.

Thanks, I finally got it working.
I also found that what Thrust can provide is quite limited, as the code below shows:
I end up with 9*9*2 (1 multiply + 1 reduce) Thrust calls, which is 162 kernel launches.
If I write my own kernel instead, only 1 kernel launch is needed.

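// CPU reference: ATA[i][j] is the dot product of rows idx0[i] and idx0[j] of X (1-based indexing)
for (i = 1; i <= 9; i++)
{
    for (j = i; j <= 9; j++)
    {
        ATA[i][j] = 0;
        for (m = 1; m <= 50000; m++)
            ATA[i][j] = ATA[i][j] + X[idx0[i]][m] * X[idx0[j]][m];
    }
}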
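// Thrust version: one transform (multiply) plus one reduce per (i, j) pair
for (i = 1; i <= dim0; i++)
{
    for (j = i; j <= dim0; j++)
    {
        // Element-wise product of rows idx0[i] and idx0[j]; each row holds 1+iNumPaths entries and entry 0 is skipped
        thrust::transform(t_d_X + (idx0[i]-1)*(1+iNumPaths) + 1,
                          t_d_X + (idx0[i]-1)*(1+iNumPaths) + iNumPaths + 1,
                          t_d_X + (idx0[j]-1)*(1+iNumPaths) + 1,
                          t_d_cdataMulti, thrust::multiplies<double>());
        // Sum the products into the result entry
        ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti + iNumPaths, (double) 0, thrust::plus<double>());
    }
}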

Thrust has a transform_reduce function, which might cut your kernel calls in half. You can also do a segmented reduction using reduce_by_key. Finally, you can combine both by using a transform iterator with reduce_by_key, possibly getting your Thrust kernel launches down to just a few.

Hi txbob, thanks for the suggestion. I did quite a bit of research on it.
For my case:

1) transform_reduce: will not help, as there is a pointer indirection through “idx0[i]”, and there are basically 2 arrays involved. The 1st one is X[idx0[i]], the 2nd one is X[idx0[j]].
2) reduce_by_key: will help. But I need to store all interim results in one big array and prepare a key mapping table of the same size. I will try it out.
3) transform iterator: will not help, for the same reason as 1).

I think I can’t avoid writing my own kernel, but many thanks anyway!

It seems you now have an answer on your cross-posting:

cuda - Can Thrust transform_reduce work with 2 arrays? - Stack Overflow