Why is the Titan V slower than the RTX 2080 Ti?

742820157 · July 6, 2019, 8:49am

The kernel:

void CSR(int i,unsigned int N,
	unsigned int *xadj,unsigned int *adjncy,
	double *dataxx,double *datayy,double *datazz,
	double *Cspin,
	double *CHDemag,double *CH)
{ 
	if(i < N)
	{
		double dot[3]={0,0,0};
		for(int n = xadj[i] ; n < xadj[i+1]; n++)
		{
			unsigned int neigh=adjncy[n];
			printf("%d\n",n);
			printf("%f,%f,%f\n",dataxx[n],datayy[n],datazz[n]);
			double val[3] = {dataxx[n],datayy[n],datazz[n]};
			for(unsigned int co = 0 ; co < 3 ; co++)
			{
				dot[co]+=(val[co]*Cspin[3*neigh+co]);
			}
		}
		double a=CHDemag[3*i];
		double b=CHDemag[3*i+1];
		double c=CHDemag[3*i+2];
		CH[3*i]=a+dot[0];
		CH[3*i+1]=b+dot[1];
		CH[3*i+2]=c+dot[2];
		// CH[3*i]=CHDemag[3*i]+dot[0];
		// CH[3*i+1]=CHDemag[3*i+1]+dot[1];
		// CH[3*i+2]=CHDemag[3*i+2]+dot[2];
	}
}

With the same code on the same machine (only the GPU differs):
Titan V: 490 ms
RTX 2080: 380 ms
The Titan V’s double-precision capability should be better than the RTX 2080 Ti’s,
but the result doesn’t show it.
Should I compile the code for double precision using some compiler argument?

Thank you.

742820157 · July 6, 2019, 9:47am

The code above is wrong.
The correct code is:

__global__ void CSpMV_CSR(unsigned int N,
	unsigned int *xadj,unsigned int *adjncy,
	double *dataxx,double *datayy,double *datazz,
	double *Cspin,
	double *CHDemag,double *CH)
{ 
	int i = blockDim.x*blockIdx.x + threadIdx.x;
	if(i < N)
	{
		double dot[3]={0,0,0};
		for(int n = xadj[i] ; n < xadj[i+1]; n++)
		{
			unsigned int neigh=adjncy[n];
			double val[3] = {dataxx[n],datayy[n],datazz[n]};
			for(unsigned int co = 0 ; co < 3 ; co++)
			{
				dot[co]+=(val[co]*Cspin[3*neigh+co]);
			}
		}
		CH[3*i]=CHDemag[3*i]+dot[0];
		CH[3*i+1]=CHDemag[3*i+1]+dot[1];
		CH[3*i+2]=CHDemag[3*i+2]+dot[2];
	}
}

cbuchner1 · July 6, 2019, 10:42am

Give more complete information please - how did you compile this code for the two architectures? (Compiler options)

742820157 · July 6, 2019, 10:49am

I compile my code using nvcc bigData.cu,
without any other options.

cbuchner1 · July 6, 2019, 11:00am

That seems incredibly wrong as this would target the oldest supported GPU architectures such as sm_20

742820157 · July 6, 2019, 11:03am

What should I do?
I don’t know how to improve performance on the Titan V.

742820157 · July 6, 2019, 11:05am

The Titan V is more expensive than the RTX 2080 Ti,
so why is the performance the other way around?

742820157 · July 6, 2019, 11:32am

Can anybody help me?

njuffa · July 6, 2019, 2:15pm

[1] Compile for the correct GPU target architecture (learn about the -arch and -gencode switches of nvcc)
[2] Familiarize yourself with the CUDA profiler, profile your kernel and use the results to guide optimizations
[3] Learn about the __restrict__ modifier and how it can help the compiler generate better code

In my experience, questions of the sort “GPU X is faster than GPU Y, why?” based on perceived notions or simplistic mental models of GPU performance are not fruitful. The output of the CUDA profiler is a much better way to zero in on the factors that are crucial to the performance of a particular kernel. If necessary the relative performance of two GPUs can then be discussed based on salient differences in profiler output.

At first glance your kernel would appear to be memory bound, with some potentially disadvantageous access patterns because of indirection caused by the use of the adjacency matrix.
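
As a rough illustration of items [1] and [3], here is a sketch of the corrected kernel from the thread with const and __restrict__ added to the pointer parameters. It assumes that none of the arrays overlap in memory, which the thread does not state. The names CSpMV_CSR_restrict, managed_copy and run_restrict_example, and the tiny two-row matrix, are made up for this example. The compile lines in the comment assume the Titan V is compute capability 7.0 and the RTX 2080 Ti is 7.5.

```cuda
// Possible compile lines (file name taken from the thread):
//   nvcc -O3 -arch=sm_70 bigData.cu    (Titan V, Volta)
//   nvcc -O3 -arch=sm_75 bigData.cu    (RTX 2080 Ti, Turing)
//   nvcc -O3 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 bigData.cu
#include <algorithm>
#include <initializer_list>
#include <cuda_runtime.h>

// Same arithmetic as the kernel in the thread. const + __restrict__ tell the
// compiler that the inputs are read-only and not aliased by the output CH,
// which may let it schedule loads more freely.
__global__ void CSpMV_CSR_restrict(unsigned int N,
    const unsigned int* __restrict__ xadj,
    const unsigned int* __restrict__ adjncy,
    const double* __restrict__ dataxx,
    const double* __restrict__ datayy,
    const double* __restrict__ datazz,
    const double* __restrict__ Cspin,
    const double* __restrict__ CHDemag,
    double* __restrict__ CH)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        double dx = 0.0, dy = 0.0, dz = 0.0;
        for (unsigned int n = xadj[i]; n < xadj[i + 1]; n++) {
            unsigned int neigh = adjncy[n];   // indirect (gather) access into Cspin
            dx += dataxx[n] * Cspin[3 * neigh];
            dy += datayy[n] * Cspin[3 * neigh + 1];
            dz += datazz[n] * Cspin[3 * neigh + 2];
        }
        CH[3 * i]     = CHDemag[3 * i]     + dx;
        CH[3 * i + 1] = CHDemag[3 * i + 1] + dy;
        CH[3 * i + 2] = CHDemag[3 * i + 2] + dz;
    }
}

// Hypothetical helper: allocate managed memory and fill it from a list.
template <typename T>
T* managed_copy(std::initializer_list<T> values)
{
    T* p = nullptr;
    if (cudaMallocManaged(&p, values.size() * sizeof(T)) != cudaSuccess) return nullptr;
    std::copy(values.begin(), values.end(), p);
    return p;
}

// Hypothetical two-row check: row 0 has neighbours {0, 1}, row 1 has {0}.
// Returns true if the GPU result matches values worked out by hand.
bool run_restrict_example()
{
    const unsigned int N = 2;
    unsigned int* xadj   = managed_copy<unsigned int>({0u, 2u, 3u});
    unsigned int* adjncy = managed_copy<unsigned int>({0u, 1u, 0u});
    double* xx    = managed_copy<double>({1.0, 2.0, 3.0});
    double* yy    = managed_copy<double>({1.0, 1.0, 1.0});
    double* zz    = managed_copy<double>({0.0, 0.0, 2.0});
    double* spin  = managed_copy<double>({1.0, 2.0, 3.0, 4.0, 5.0, 6.0});
    double* demag = managed_copy<double>({10.0, 10.0, 10.0, 20.0, 20.0, 20.0});
    double* ch    = managed_copy<double>({0.0, 0.0, 0.0, 0.0, 0.0, 0.0});
    bool ok = xadj && adjncy && xx && yy && zz && spin && demag && ch;

    if (ok) {
        CSpMV_CSR_restrict<<<1, 32>>>(N, xadj, adjncy, xx, yy, zz, spin, demag, ch);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
    }
    const double expected[6] = {19.0, 17.0, 10.0, 23.0, 22.0, 26.0};
    for (int k = 0; ok && k < 6; k++) ok = (ch[k] == expected[k]);

    // cudaFree accepts null pointers, so this is safe after a failed allocation.
    cudaFree(xadj); cudaFree(adjncy); cudaFree(xx); cudaFree(yy);
    cudaFree(zz); cudaFree(spin); cudaFree(demag); cudaFree(ch);
    return ok;
}
```

Calling run_restrict_example() from main is a quick way to confirm the modified kernel still produces the same numbers. Whether the qualifiers change the run time at all is something only the profiler can tell you.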

742820157 · July 6, 2019, 4:41pm

What should I do?

The -arch and -gencode switches did not help. The Titan V’s double-precision capability may be 10 times that of the RTX 2080 Ti, but the measured run time does not reflect it.

njuffa · July 6, 2019, 4:45pm

If the performance of the code is bound by memory throughput (as I think it is), the computational throughput is largely irrelevant.
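
A back-of-the-envelope estimate makes this concrete. It assumes no cache reuse at all, and it uses approximate published peak figures that do not come from this thread and were not measured: roughly 7.4 TFLOP/s FP64 and 650 GB/s for the Titan V, and roughly 0.42 TFLOP/s FP64 and 620 GB/s for the RTX 2080 Ti. Each inner-loop iteration of the kernel reads one 4-byte index, three 8-byte matrix values and three 8-byte Cspin values, and performs three multiply-adds:

```latex
\[
I = \frac{3 \times 2\ \text{flop}}{(4 + 3 \times 8 + 3 \times 8)\ \text{B}}
  = \frac{6}{52} \approx 0.12\ \text{flop/B}
\]
\[
\left.\frac{P_{\text{FP64}}}{BW}\right|_{\text{Titan V}}
  \approx \frac{7.4 \times 10^{12}}{6.5 \times 10^{11}} \approx 11\ \text{flop/B},
\qquad
\left.\frac{P_{\text{FP64}}}{BW}\right|_{\text{RTX 2080 Ti}}
  \approx \frac{0.42 \times 10^{12}}{6.2 \times 10^{11}} \approx 0.7\ \text{flop/B}
\]
```

Under these assumptions the kernel's intensity is well below the flop-per-byte balance of both GPUs, so both would spend most of their time waiting on memory, and the large difference in FP64 peak would not show up in the run time.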

742820157 · July 6, 2019, 5:15pm

What should I do?

But the Titan V’s memory throughput is also higher than the RTX card’s.

742820157 · July 6, 2019, 5:17pm

But the Titan V’s memory throughput is also higher than the RTX card’s.

njuffa · July 6, 2019, 5:39pm

I would suggest following up on item [2] in post #9. Achieved bandwidth is also a function of access patterns; blanket statements like “titan v 's memory thoughout also powerful than rtx” are not really actionable.

There may be other issues affecting your application level performance due to code you haven’t shown, your kernel configuration(s) may be sub-optimal, your performance methodology may not be sound, etc.
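
As one way to check the timing methodology, here is a minimal sketch of timing a kernel with CUDA events: one warm-up launch, then the mean over several launches, then an effective-bandwidth figure. The kernel scale is only a stand-in, and the names mean_kernel_ms and effective_gbps are hypothetical. To measure the real kernel, substitute it for scale and adjust the byte count to the data it actually moves.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical); replace with the kernel under test.
__global__ void scale(unsigned int n,
                      const double* __restrict__ in,
                      double* __restrict__ out)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * in[i];
}

// Returns the mean time per launch in milliseconds, or -1.0f on a CUDA error.
float mean_kernel_ms(unsigned int n, int reps)
{
    double *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(double)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&out, n * sizeof(double)) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    cudaMemset(in, 0, n * sizeof(double));

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;  // enough blocks to cover n

    // Warm-up launch so one-time startup costs are not included in the timing.
    scale<<<blocks, threads>>>(n, in, out);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; r++)
        scale<<<blocks, threads>>>(n, in, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ok ? ms / reps : -1.0f;
}

// Effective bandwidth in GB/s for the stand-in kernel, which moves one
// 8-byte read and one 8-byte write per element.
double effective_gbps(unsigned int n, float ms)
{
    return (2.0 * sizeof(double) * n) / (ms * 1.0e6);
}
```

Calling mean_kernel_ms(1u << 24, 20) on each GPU and passing the result to effective_gbps gives a number that can be compared with the card's nominal bandwidth. A large gap between the two would point at access patterns or launch configuration rather than at the hardware.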

742820157 · July 7, 2019, 12:55am

Thank you very much, it works! May I ask how to improve it further?
