void CSR(int i, unsigned int N,
         unsigned int *xadj, unsigned int *adjncy,
         double *dataxx, double *datayy, double *datazz,
         double *Cspin,
         double *CHDemag, double *CH)
{
    if (i < N)
    {
        // Accumulate, per component, the contribution of every neighbor of node i
        double dot[3] = {0, 0, 0};
        // Unsigned loop index matches the unsigned row offsets stored in xadj
        for (unsigned int n = xadj[i]; n < xadj[i+1]; n++)
        {
            unsigned int neigh = adjncy[n];
            // Debug output
            printf("%u\n", n);
            printf("%f,%f,%f\n", dataxx[n], datayy[n], datazz[n]);
            double val[3] = {dataxx[n], datayy[n], datazz[n]};
            for (unsigned int co = 0; co < 3; co++)
            {
                dot[co] += (val[co] * Cspin[3*neigh+co]);
            }
        }
        // Add the demagnetization field to the accumulated sum
        double a = CHDemag[3*i];
        double b = CHDemag[3*i+1];
        double c = CHDemag[3*i+2];
        CH[3*i]   = a + dot[0];
        CH[3*i+1] = b + dot[1];
        CH[3*i+2] = c + dot[2];
        // CH[3*i]=CHDemag[3*i]+dot[0];
        // CH[3*i+1]=CHDemag[3*i+1]+dot[1];
        // CH[3*i+2]=CHDemag[3*i+2]+dot[2];
    }
}
Timings for the same code on the same machine (only the GPU differs):
Titan V: 490 ms
RTX 2080: 380 ms
The Titan V's double-precision capability should be better than the RTX 2080 Ti's,
but the timings don't show that.
Do I need to compile the code for double precision using some compiler argument?
[1] Compile for the correct GPU target architecture (learn about the -arch and -gencode switches of nvcc)
[2] Familiarize yourself with the CUDA profiler, profile your kernel and use the results to guide optimizations
[3] Learn about the __restrict__ modifier and how it can help the compiler generate better code
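As a rough illustration of items [1] and [3], here is one way the posted routine might look as a kernel with const and __restrict__ qualifiers. This is only a sketch under assumptions: the name csr_field_kernel, the one-thread-per-row mapping, and the helper run_tiny_csr_example are hypothetical, the debug printf calls are left out, and the qualifiers are only valid if the arrays really do not overlap (for example, CH must not alias CHDemag or Cspin). The build line in the comment targets compute capability 7.0 (Titan V) and 7.5 (Turing-based RTX 20 series).

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical kernel variant of the posted CSR routine (one thread per row).
// Example build line covering both GPUs mentioned in the thread:
//   nvcc -O3 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 csr.cu
__global__ void csr_field_kernel(unsigned int N,
                                 const unsigned int* __restrict__ xadj,
                                 const unsigned int* __restrict__ adjncy,
                                 const double* __restrict__ dataxx,
                                 const double* __restrict__ datayy,
                                 const double* __restrict__ datazz,
                                 const double* __restrict__ Cspin,
                                 const double* __restrict__ CHDemag,
                                 double* __restrict__ CH)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    double dot0 = 0.0, dot1 = 0.0, dot2 = 0.0;
    const unsigned int row_end = xadj[i + 1];
    for (unsigned int n = xadj[i]; n < row_end; ++n) {
        const unsigned int neigh = adjncy[n];   // indirect (gather) access
        dot0 += dataxx[n] * Cspin[3 * neigh];
        dot1 += datayy[n] * Cspin[3 * neigh + 1];
        dot2 += datazz[n] * Cspin[3 * neigh + 2];
    }
    CH[3 * i]     = CHDemag[3 * i]     + dot0;
    CH[3 * i + 1] = CHDemag[3 * i + 1] + dot1;
    CH[3 * i + 2] = CHDemag[3 * i + 2] + dot2;
}

// Copies a small host array to the device; clears 'ok' on any failure.
template <typename T, std::size_t K>
static T* upload(const T (&host)[K], bool& ok)
{
    T* dev = nullptr;
    ok = ok && cudaMalloc(&dev, sizeof host) == cudaSuccess
            && cudaMemcpy(dev, host, sizeof host, cudaMemcpyHostToDevice) == cudaSuccess;
    return dev;
}

// Tiny two-node example (made-up data) used only to exercise the kernel.
// Returns the 6 values of CH, or an empty vector if any CUDA call fails.
std::vector<double> run_tiny_csr_example()
{
    const unsigned int N = 2;
    const unsigned int h_xadj[3]   = {0, 1, 2};
    const unsigned int h_adjncy[2] = {1, 0};
    const double h_xx[2] = {2.0, 3.0}, h_yy[2] = {1.0, 1.0}, h_zz[2] = {0.5, 0.5};
    const double h_spin[6]  = {1, 2, 3, 4, 5, 6};
    const double h_demag[6] = {1, 1, 1, 1, 1, 1};

    bool ok = true;
    unsigned int* xadj   = upload(h_xadj, ok);
    unsigned int* adjncy = upload(h_adjncy, ok);
    double* xx    = upload(h_xx, ok);
    double* yy    = upload(h_yy, ok);
    double* zz    = upload(h_zz, ok);
    double* spin  = upload(h_spin, ok);
    double* demag = upload(h_demag, ok);
    double* ch    = nullptr;
    ok = ok && cudaMalloc(&ch, 3 * N * sizeof(double)) == cudaSuccess;

    std::vector<double> result;
    if (ok) {
        csr_field_kernel<<<1, 32>>>(N, xadj, adjncy, xx, yy, zz, spin, demag, ch);
        result.resize(3 * N);
        if (cudaGetLastError() != cudaSuccess ||
            cudaMemcpy(result.data(), ch, 3 * N * sizeof(double),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            result.clear();
    }
    cudaFree(xadj); cudaFree(adjncy); cudaFree(xx); cudaFree(yy);
    cudaFree(zz); cudaFree(spin); cudaFree(demag); cudaFree(ch);
    return result;
}
```

Whether the qualifiers actually change the generated code for this kernel is something the profiler or a look at the SASS would have to confirm; they cost nothing to try as long as the no-aliasing promise holds.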
In my experience, questions of the sort “GPU X is faster than GPU Y, why?” based on perceived notions or simplistic mental models of GPU performance are not fruitful. The output of the CUDA profiler is a much better way to zero in on the factors that are crucial to the performance of a particular kernel. If necessary the relative performance of two GPUs can then be discussed based on salient differences in profiler output.
At first glance your kernel would appear to be memory bound, with some potentially disadvantageous access patterns because of indirection caused by the use of the adjacency matrix.
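To put a rough number on "memory bound", the sketch below counts the bytes and floating-point operations the posted loop implies. It is an idealized estimate, assuming 4-byte unsigned int, no cache reuse and no over-fetch on the gathered Cspin reads; the names CsrTraffic, estimate_csr_traffic and effective_bandwidth_gbs are made up for this illustration. Per stored neighbor the loop reads one index (4 bytes), three matrix values (24 bytes) and three Cspin components (24 bytes) for 3 multiplies and 3 adds, so on the order of 0.1 FLOP per byte.

```cuda
#include <cstddef>

// Idealized traffic model for the posted CSR loop (hypothetical helper types).
struct CsrTraffic {
    double bytes;  // bytes read + written
    double flops;  // double-precision operations
};

// N = number of rows (nodes), nnz = total number of stored neighbor entries.
inline CsrTraffic estimate_csr_traffic(std::size_t N, std::size_t nnz)
{
    // Per neighbor: adjncy[n] + dataxx/yy/zz[n] + 3 gathered Cspin components.
    const double per_nnz = sizeof(unsigned int) + 3 * sizeof(double) + 3 * sizeof(double);
    // Per row: one xadj offset (amortized) + 3 CHDemag reads + 3 CH writes.
    const double per_row = sizeof(unsigned int) + 3 * sizeof(double) + 3 * sizeof(double);

    CsrTraffic t;
    t.bytes = per_nnz * static_cast<double>(nnz) + per_row * static_cast<double>(N);
    // Per neighbor: 3 multiplies + 3 adds; per row: 3 adds for CHDemag.
    t.flops = 6.0 * static_cast<double>(nnz) + 3.0 * static_cast<double>(N);
    return t;
}

// Converts a byte count and a measured kernel time into GB/s (1 GB = 1e9 bytes).
inline double effective_bandwidth_gbs(double bytes, double milliseconds)
{
    return bytes / (milliseconds * 1e-3) / 1e9;
}
```

Comparing the resulting GB/s figure against the achieved-bandwidth numbers reported by the profiler on each GPU gives a first hint of how well the access pattern is served; the model itself says nothing about which GPU should win.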
I would suggest following up on item [2] in post #9. Achieved bandwidth is also a function of access patterns; blanket statements like “titan v 's memory thoughout also powerful than rtx” are not really actionable.
There may be other issues affecting your application level performance due to code you haven’t shown, your kernel configuration(s) may be sub-optimal, your performance methodology may not be sound, etc.
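On the measurement side, a minimal sketch of one common approach is shown below: warm-up launches followed by an average over several timed launches using CUDA events. The stand-in kernel, the helper names and the launch configuration are hypothetical; the posted kernel would be substituted in the lambda. In-kernel printf calls such as the ones in the posted code should be removed before timing, since they can easily dominate the measured time.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical); replace with the kernel being measured.
__global__ void stand_in_kernel(double* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * i;
}

// Returns the average time per launch in milliseconds, or -1 on error.
// 'launch' is any callable that enqueues the kernel on the default stream.
template <typename Launch>
float average_kernel_ms(Launch launch, int warmup, int reps)
{
    if (reps <= 0) return -1.0f;
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) { cudaEventDestroy(start); return -1.0f; }

    for (int w = 0; w < warmup; ++w) launch();  // absorb one-time startup costs
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait until all timed launches finish

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    if (err == cudaSuccess) err = cudaGetLastError();  // catch launch failures
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms / reps : -1.0f;
}

// Example use with the stand-in kernel; returns -1 if a CUDA call fails.
float time_stand_in_kernel()
{
    const unsigned int n = 1u << 20;
    double* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(double)) != cudaSuccess) return -1.0f;
    const unsigned int block = 256, grid = (n + block - 1) / block;
    float ms = average_kernel_ms(
        [=] { stand_in_kernel<<<grid, block>>>(d_out, n); }, 3, 20);
    cudaFree(d_out);
    return ms;
}
```

Timing the same launch configuration on both GPUs this way, with host-device copies kept outside the timed region, makes the two numbers comparable before turning to the profiler for the reasons behind any difference.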