Basic performance question from a non-CS noob

nitin.life · July 1, 2009, 2:15pm

I have a doubt that has always been at the back of my mind…

THIS MAY REALLY SEEM A NOOBISH QUESTION to some of you :"> , but I am not a CS guy, so I don’t understand how all this actually works at its core, and I want to learn…

Say we have situation A:

 __shared__ double a[128]; // one slot per thread in the block

 // b is some global double array

 // read from b into a

 a[tid] = b[indx]; // assume the access is coalesced

 // compute (no sync between threads required)

 a[tid] = a[tid] * 2.0;

 // write back out to b

 b[indx] = a[tid];

and we have situation B:

 // b is some global double array

 // assume the access to device memory is coalesced

 b[indx] = b[indx] * 2.0;

Say we launch 32768 threads with 128 threads per block…
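For reference, the two situations can be written out as complete kernels. This is only a sketch under some assumptions: the names `scale_via_shared`, `scale_direct` and `run_both` are made up here, and `indx` is assumed to be the usual `blockIdx.x * blockDim.x + threadIdx.x`, since the post does not say how it is computed. With 128 threads per block, 32768 threads works out to 256 blocks.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128

// Situation A: stage each element through shared memory.
__global__ void scale_via_shared(double *b, int n)
{
    __shared__ double a[THREADS_PER_BLOCK];
    int tid  = threadIdx.x;
    int indx = blockIdx.x * blockDim.x + tid;  // assumed indexing (coalesced)
    if (indx < n) {
        a[tid] = b[indx];       // global -> shared
        a[tid] = a[tid] * 2.0;  // each thread touches only its own slot,
                                // so no __syncthreads() is needed
        b[indx] = a[tid];       // shared -> global
    }
}

// Situation B: operate on the global array directly.
__global__ void scale_direct(double *b, int n)
{
    int indx = blockIdx.x * blockDim.x + threadIdx.x;  // assumed indexing
    if (indx < n)
        b[indx] = b[indx] * 2.0;
}

// Hypothetical helper: runs both kernels on the same input and returns
// true if both produce exactly 2 * input.
bool run_both(int n)
{
    std::vector<double> host(n), out_a(n), out_b(n);
    for (int i = 0; i < n; ++i) host[i] = 0.5 * i;

    double *dev = nullptr;
    size_t bytes = n * sizeof(double);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    scale_via_shared<<<blocks, THREADS_PER_BLOCK>>>(dev, n);
    cudaMemcpy(out_a.data(), dev, bytes, cudaMemcpyDeviceToHost);

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    scale_direct<<<blocks, THREADS_PER_BLOCK>>>(dev, n);
    cudaMemcpy(out_b.data(), dev, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(dev);

    for (int i = 0; i < n && ok; ++i)
        ok = (out_a[i] == 2.0 * host[i]) && (out_b[i] == out_a[i]);
    return ok;
}
```

Calling `run_both(32768)` from `main` launches each kernel with 256 blocks of 128 threads. The sketch only checks that both variants give the same result; it does not time them.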

Would situation A be faster than B? If so, why? In both cases we have one read from and one write to global memory, and situation A also has the small overhead of copying from b to a and back…

Does doing 1 flop require multiple reads from memory (intuitively it should be just one read)?

Or is it because multiplying a global variable by some constant/variable is slower than multiplying a shared variable?

I am not sure what the exact answer to the above question is…

Thanks

NA

Nico · July 1, 2009, 2:22pm

IMHO both situations are comparable in terms of speed. Note that, even in the second situation, the value gets loaded from DDR (device) memory into a register on chip. It is then multiplied by 2 in the register, and afterwards the value of the register is stored back to global memory.
So the only difference between situations A and B is that you’re using shared memory in A and a register in B. And because accessing shared memory is as fast as accessing a register (assuming no bank conflicts), the performance will be approximately the same, with a small overhead for situation A.
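To make the register point concrete, situation B can be written with the temporary spelled out. A minimal sketch: the names `scale_explicit_register` and `check_explicit_register` are made up for illustration, and `indx` is again assumed to be the global thread index. Whether a local variable really lives in a register is the compiler's decision, though for a single scalar like this it normally does.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Situation B with the load / multiply / store steps written separately.
__global__ void scale_explicit_register(double *b, int n)
{
    int indx = blockIdx.x * blockDim.x + threadIdx.x;  // assumed indexing
    if (indx < n) {
        double r = b[indx];  // load from global memory into a per-thread temporary
        r = r * 2.0;         // the arithmetic works on the temporary
        b[indx] = r;         // store the temporary back to global memory
    }
}

// Hypothetical helper: returns true if every element was doubled.
bool check_explicit_register(int n)
{
    std::vector<double> host(n), out(n);
    for (int i = 0; i < n; ++i) host[i] = 0.5 * i;

    double *dev = nullptr;
    size_t bytes = n * sizeof(double);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    const int threads = 128;
    const int blocks  = (n + threads - 1) / threads;

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    scale_explicit_register<<<blocks, threads>>>(dev, n);
    cudaMemcpy(out.data(), dev, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(dev);

    for (int i = 0; i < n && ok; ++i)
        ok = (out[i] == 2.0 * host[i]);
    return ok;
}
```

Situation A does the same three steps, except that the temporary is a slot in a `__shared__` array instead of `r`.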

N.

nitin.life · July 1, 2009, 2:31pm

Didn’t know this!!

Thanks… for the input :) …

NA

Sarnath · July 2, 2009, 5:10am

Shared memory should be used like a cache for staging data.

And a cache thrives on “locality of reference”, i.e. the same data being accessed again and again in a small piece of code.
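As an illustration of that kind of reuse (not something from the original question), here is a sketch of a kernel where each loaded element is read by up to three threads, so staging it in shared memory can replace repeated global reads. The kernel `sum3_shared`, the helper `check_sum3`, and the 3-point neighbour sum itself are assumptions chosen only to show the pattern; the kernel also assumes it is launched with exactly `TILE` threads per block.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 128  // must equal the number of threads per block

// out[i] = in[i-1] + in[i] + in[i+1], treating out-of-range neighbours as 0.
__global__ void sum3_shared(const double *in, double *out, int n)
{
    __shared__ double tile[TILE + 2];  // one extra (halo) cell on each side
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Every thread loads its own element once.
    tile[tid + 1] = (i < n) ? in[i] : 0.0;
    // The first and last threads of the block also load the halo cells.
    if (tid == 0)
        tile[0] = (i > 0 && i - 1 < n) ? in[i - 1] : 0.0;
    if (tid == blockDim.x - 1)
        tile[tid + 2] = (i + 1 < n) ? in[i + 1] : 0.0;

    __syncthreads();  // threads read each other's loads, so a barrier is required

    if (i < n)
        out[i] = tile[tid] + tile[tid + 1] + tile[tid + 2];
}

// Hypothetical helper: compares the kernel against a plain CPU loop.
bool check_sum3(int n)
{
    std::vector<double> host(n), result(n);
    for (int i = 0; i < n; ++i) host[i] = double(i % 17);  // small exact values

    double *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(double);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    int blocks = (n + TILE - 1) / TILE;
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);
    sum3_shared<<<blocks, TILE>>>(d_in, d_out, n);
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n && ok; ++i) {
        double left  = (i > 0)     ? host[i - 1] : 0.0;
        double right = (i + 1 < n) ? host[i + 1] : 0.0;
        ok = (result[i] == left + host[i] + right);
    }
    return ok;
}
```

Unlike the multiply-by-two case, threads here read values loaded by their neighbours, which is why the `__syncthreads()` barrier is needed and why the shared array is doing useful work.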
