I tried to run it, but only the non-Thrust kernel calls got profiled.

Thanks.

It should be possible, yes. Under the hood, the Visual Profiler uses the nvprof mechanism, and I frequently use nvprof to profile Thrust code.

Thanks, I finally got it working.

I also found that what Thrust can provide is quite limited, as the code below shows:

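I end up with up to 9*9*2 Thrust calls (one multiply plus one reduce per pair), i.e. 162 kernel launches for the full matrix; the upper-triangle loops below (j starting at i) need 45*2 = 90.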

If I write my own kernel, only one kernel launch is needed.

```
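// Reference CPU version: upper triangle (j >= i) of the 9x9 product matrix.
for (i = 1; i <= 9; i++)
{
    for (j = i; j <= 9; j++)
    {
        ATA[i][j] = 0;
        // Dot product of rows idx0[i] and idx0[j] over 50000 elements.
        for (m = 1; m <= 50000; m++)
            ATA[i][j] = ATA[i][j] + X[idx0[i]][m] * X[idx0[j]][m];
    }
}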
```

```
for (i = 1; i <= dim0; i++)
{
    for (j = i; j <= dim0; j++)
    {
        // Element-wise product of rows idx0[i] and idx0[j] into a temporary buffer.
        thrust::transform(t_d_X + (idx0[i]-1)*(1+iNumPaths) + 1,
                          t_d_X + (idx0[i]-1)*(1+iNumPaths) + iNumPaths + 1,
                          t_d_X + (idx0[j]-1)*(1+iNumPaths) + 1,
                          t_d_cdataMulti, thrust::multiplies<double>());
        // Sum the temporary buffer to get one matrix entry.
        ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti + iNumPaths, (double) 0, thrust::plus<double>());
    }
}
```

Thrust has a transform_reduce function, which might cut your kernel calls in half. You can also do a segmented reduction using reduce_by_key. Finally, you can combine both by using a transform iterator with reduce_by_key, possibly getting your Thrust kernel launches down to just a few.
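For the two-array case specifically, the multiply-then-sum is an inner product, so it can be fused into one pass with no temporary product buffer. The sketch below is host-only C++ using std::inner_product so that it compiles without nvcc; thrust::inner_product (in thrust/inner_product.h) takes the same (first1, last1, first2, init) arguments, so the same offset arithmetic should carry over to device iterators. The helper names row_begin and row_dot are made up for illustration, and the row layout (1-based rows of 1 + n doubles, element 0 unused) is assumed from the t_d_X offsets in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Start of the 1-based row `row` in a flat array whose rows hold 1 + n doubles.
// Element 0 of each row is unused (assumed layout, mirroring the t_d_X offsets).
inline const double* row_begin(const std::vector<double>& X, int row, int n) {
    assert(row >= 1 && n >= 0);
    assert(static_cast<std::size_t>(row) * (1 + n) <= X.size());
    return X.data() + static_cast<std::size_t>(row - 1) * (1 + n) + 1;
}

// Multiply-and-sum of two rows in a single pass, with no temporary product array.
// rowA and rowB play the role of idx0[i] and idx0[j].
inline double row_dot(const std::vector<double>& X, int rowA, int rowB, int n) {
    const double* a = row_begin(X, rowA, n);
    const double* b = row_begin(X, rowB, n);
    return std::inner_product(a, a + n, b, 0.0);
}
```

This still issues one call per (i, j) pair, so it only removes the separate transform step; it does not by itself get down to a handful of launches.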

Hi txbob, thanks for the suggestion. I did quite a bit of research on it.

For my case:

1) transform_reduce: will not help, because there is an index indirection through “idx0[i]” and two arrays are involved: the first is X[idx0[i]], the second is X[idx0[j]].

2) reduce_by_key: will help, but I need to store all the intermediate results in one big array and prepare a key-mapping table of the same size. I will try it out.

3) transform iterator: will not help, for the same reason as 1).
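On point 2, one possible way around the big intermediate array and the key table is to compute both on the fly from a flat index k: the segment key is k / n and the element position is k % n. In Thrust that would presumably mean feeding thrust::reduce_by_key with transform iterators built over a thrust::counting_iterator. The sketch below is only a host-side C++ analogue of that indexing scheme, not device code. The names RowPair, upper_pairs and segmented_dots are hypothetical, and it assumes 0-based rows of exactly n doubles for brevity (unlike the 1-based layout used earlier in this thread).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One segment per upper-triangle pair (a, b) with b >= a, as positions into idx0.
struct RowPair { int a; int b; };

inline std::vector<RowPair> upper_pairs(int dim) {
    std::vector<RowPair> pairs;
    for (int i = 0; i < dim; ++i)
        for (int j = i; j < dim; ++j)
            pairs.push_back({i, j});
    return pairs;
}

// Segmented sum over a flat index k. The key (k / n) and the product are
// computed per element, so neither a key table nor a product array is stored.
inline std::vector<double> segmented_dots(const std::vector<double>& X,  // row-major, n per row
                                          const std::vector<int>& idx0,  // 0-based row selection
                                          int n) {
    assert(n >= 0);
    const std::vector<RowPair> pairs = upper_pairs(static_cast<int>(idx0.size()));
    std::vector<double> out(pairs.size(), 0.0);
    const std::size_t len = static_cast<std::size_t>(n);
    const std::size_t total = pairs.size() * len;
    for (std::size_t k = 0; k < total; ++k) {
        const std::size_t key = k / len;  // which (i, j) pair this element belongs to
        const std::size_t m = k % len;    // position within the row
        const std::size_t ra = static_cast<std::size_t>(idx0[pairs[key].a]);
        const std::size_t rb = static_cast<std::size_t>(idx0[pairs[key].b]);
        out[key] += X[ra * len + m] * X[rb * len + m];
    }
    return out;
}
```

For 9 selected rows and 50000 elements per row, the flat range has 45 * 50000 elements. Whether a reduce_by_key over such a range beats a hand-written kernel is not something this sketch establishes.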

I think I can’t avoid writing my own kernel. Still, many thanks!

It seems you now have an answer on your cross-posting:

http://stackoverflow.com/questions/31955505/can-thrust-transform-reduce-work-with-2-arrays