How to know which variables are placed in local memory?

ispla · January 26, 2010, 3:53pm

Where in this code is local memory used? http://openpaste.org/en/18965/

Are function-local arrays guaranteed to be placed in local memory, or not?

Are function-local structs guaranteed to be placed in local memory, or not?

ptxas info	: Compiling entry function '_Z6matrixP10calcDataInP11calcDataOut'

ptxas info	: Used 16 registers, 170+0 bytes lmem, 4312+16 bytes smem, 116 bytes cmem[1], 4 bytes cmem[14]

Cygnus_X1 · January 26, 2010, 8:40pm

Variables that are put in local memory:
a) Arrays that are dynamically indexed, i.e. array[index] instead of array[23].
b) Any variable, when the number of required registers exceeds the maximum. You cannot know which one or when, but the compiler tries to minimise the number of loads and stores, so the most likely candidates are variables that you access infrequently but whose values must be kept alive for a large part of the kernel.
c) Apparently, a struct that you load from global memory into registers all at once (that is my current problem, by the way). Try loading it element by element instead.

Variables that are not put in local memory must conform to these constraints:
a) There must be enough registers to store the data.
b) Which element is accessed must be known precisely at compile time.
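
A minimal CUDA sketch of the distinction in a) above, assuming two hypothetical kernels (`static_index`, `dynamic_index`) and a hypothetical host helper `run_demo`, none of which come from the code linked in the thread. Whether the second array really ends up in local memory depends on the compiler version and the target architecture, so treat it as something to check rather than a guarantee.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernels for illustration only.

// Every index is a compile-time constant once the loop is unrolled, so the
// compiler is free to keep tmp[] entirely in registers.
__global__ void static_index(const float *in, float *out)
{
    float tmp[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        tmp[i] = in[threadIdx.x * 4 + i];
    out[threadIdx.x] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// k is only known at run time, so tmp[] has to be addressable and becomes a
// candidate for local memory.
__global__ void dynamic_index(const float *in, const int *sel, float *out)
{
    float tmp[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        tmp[i] = in[threadIdx.x * 4 + i];
    const int k = sel[threadIdx.x] & 3;   // run-time index in 0..3
    out[threadIdx.x] = tmp[k];
}

// Hypothetical host helper: runs both kernels on one block of 32 threads and
// compares the results with a CPU reference.
bool run_demo()
{
    const int n = 32;
    std::vector<float> in(n * 4), sum(n), pick(n);
    std::vector<int> sel(n);
    for (int i = 0; i < n * 4; ++i) in[i] = float(i);
    for (int t = 0; t < n; ++t) sel[t] = t;

    float *d_in, *d_sum, *d_pick;
    int *d_sel;
    cudaMalloc((void **)&d_in, n * 4 * sizeof(float));
    cudaMalloc((void **)&d_sum, n * sizeof(float));
    cudaMalloc((void **)&d_pick, n * sizeof(float));
    cudaMalloc((void **)&d_sel, n * sizeof(int));
    cudaMemcpy(d_in, in.data(), n * 4 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sel, sel.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    static_index<<<1, n>>>(d_in, d_sum);
    dynamic_index<<<1, n>>>(d_in, d_sel, d_pick);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    cudaMemcpy(sum.data(), d_sum, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(pick.data(), d_pick, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_sum); cudaFree(d_pick); cudaFree(d_sel);

    bool ok = true;
    for (int t = 0; t < n; ++t) {
        const float ref_sum = in[t * 4] + in[t * 4 + 1] + in[t * 4 + 2] + in[t * 4 + 3];
        const float ref_pick = in[t * 4 + (sel[t] & 3)];
        ok = ok && sum[t] == ref_sum && pick[t] == ref_pick;
    }
    return ok;
}
```

Compiling with `nvcc -Xptxas -v` prints one `ptxas info` line per kernel, like the one in the first post, so the `lmem` figures of the two kernels can be compared directly. `nvcc -ptx` emits the PTX, where arrays the front end placed in local memory appear in the `.local` state space; register spills are decided later by ptxas and only show up in the `ptxas info` line.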

ispla · January 26, 2010, 9:52pm

I need some fast memory with indirect addressing, not only registers. Or add indirect addressing instructions for registers.

16384 bytes of shared memory is only 21 bytes per thread,

and constant memory is slower than shared memory.

And give me a fast custom bit-permutation unit for awesome cryptographic power.

–

Make it for me in the next version,

and make something faster than the GTS240 on 40nm technology.

For now, ATI is the best choice.

ispla · January 26, 2010, 9:58pm

And how do I know which variables are placed in local memory?

The condition “a) Arrays that are dynamically indexed, i.e. array[index], instead of array[23]” is sometimes hard to check manually.

seibert · January 26, 2010, 11:43pm

You should keep in mind that spilling to local memory happens at the register level, not at the source code level. A variable might not be assigned to local memory, but the intermediate value in a complex expression could be put into local memory. Variables in the source code do not have a one-to-one correlation with registers.

ispla · January 27, 2010, 4:40am

Make me a debugging tool.

I need to know where my program uses slow local memory.

In particular, I need to know where an optimization was not done because of indirect addressing.

It is hard to monitor this manually.

Or:

I need some fast memory with indirect addressing, not only registers. Or add indirect addressing instructions for registers.

16384 bytes of shared memory is only 21 bytes per thread.

eldadk · January 27, 2010, 7:17am

I’m not sure this solves your problem, but the CUDA Visual Profiler shows how many registers each kernel function uses when you perform an “occupancy analysis”.

eldad.

Cygnus_X1 · January 27, 2010, 9:42am

This is actually the easy part! When you address an array by the value of another variable, it is indirect addressing. Loop unrolling may help you, but not necessarily.

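On devices of compute capability 1.2 or higher, up to 1024 threads can be resident on a multiprocessor, which is 16 bytes of shared memory per thread. On devices of capability 1.1 or lower the limit is 768 threads, which is about 21.3 bytes per thread. Shared memory is 16384 bytes per multiprocessor on all 1.x devices; 8192 is the number of registers on 1.0 and 1.1, not the shared memory size.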
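
The per-thread figures are plain division, sketched here as host code. The function name `smem_per_thread` is hypothetical, and the 16384-byte total is the per-multiprocessor shared memory of compute capability 1.x devices quoted earlier in the thread.

```cuda
// Bytes of shared memory available to each thread when `resident_threads`
// threads share `smem_bytes` on one multiprocessor.
// Hypothetical helper for illustration; plain host code.
double smem_per_thread(int smem_bytes, int resident_threads)
{
    return static_cast<double>(smem_bytes) / resident_threads;
}

// smem_per_thread(16384, 1024) is exactly 16 bytes.
// smem_per_thread(16384, 768) is about 21.3 bytes, the "21 bytes per thread"
// figure mentioned earlier in the thread.
```

This only holds at full occupancy. A kernel that runs fewer resident threads per multiprocessor leaves proportionally more shared memory for each of them.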

ATI’s current best card is probably better than NVIDIA’s current best card, but ATI’s baby is fairly new, while NVIDIA’s architecture is over a year old. Fermi is on the way, but you need to wait a few more months.

Apart from the newest ATI which can handle DirectX11, programming them was somewhat more complicated…

ispla, you are quite demanding. Make me this, make me that… :P

ispla · January 27, 2010, 10:36am

Constant memory is too slow to be useful.

Convert the constant memory cache into shared memory,

and add per-thread shared memory mapping, and let the compiler automatically optimize local arrays and other data into per-thread mapped shared memory.
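
Until a compiler does this automatically, the mapping can be written by hand. A minimal sketch, assuming a hypothetical kernel `per_thread_smem`, a hypothetical host helper `run_smem_demo`, and made-up sizes `THREADS` and `SLOTS`: each thread gets a private, dynamically indexable array carved out of shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 64   // block size (hypothetical); a multiple of the bank count
#define SLOTS   8    // elements in each thread's private array (hypothetical)

// Element i of thread t lives at buf[i * THREADS + t]. Threads of the same
// (half-)warp then hit different banks whatever run-time index each one uses.
// The kernel must be launched with exactly THREADS threads per block.
__global__ void per_thread_smem(const int *sel, float *out)
{
    __shared__ float buf[SLOTS * THREADS];   // 2048 bytes per block
    const int t = threadIdx.x;
    for (int i = 0; i < SLOTS; ++i)
        buf[i * THREADS + t] = float(i * 100 + t);   // made-up contents
    const int k = sel[t] & (SLOTS - 1);              // run-time index in 0..7
    // No __syncthreads() needed: each thread only touches its own slots.
    out[t] = buf[k * THREADS + t];
}

// Hypothetical host helper: launches one block and checks every result.
bool run_smem_demo()
{
    std::vector<int> sel(THREADS);
    std::vector<float> out(THREADS);
    for (int t = 0; t < THREADS; ++t) sel[t] = (t * 3) % SLOTS;

    int *d_sel;
    float *d_out;
    cudaMalloc((void **)&d_sel, THREADS * sizeof(int));
    cudaMalloc((void **)&d_out, THREADS * sizeof(float));
    cudaMemcpy(d_sel, sel.data(), THREADS * sizeof(int), cudaMemcpyHostToDevice);

    per_thread_smem<<<1, THREADS>>>(d_sel, d_out);

    cudaMemcpy(out.data(), d_out, THREADS * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_sel);
    cudaFree(d_out);

    bool ok = true;
    for (int t = 0; t < THREADS; ++t)
        ok = ok && out[t] == float(sel[t] * 100 + t);
    return ok;
}
```

The cost is the budget already complained about: 8 floats are 32 bytes per thread, which is more than 21, so a 16384-byte multiprocessor can hold at most about 512 such threads.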

ispla · January 29, 2010, 2:49pm

This:

http://openpaste.org/en/18965/

on a GTS8800 640MB, with my optimisations, does about 3E6 passes per second:

mem_size_in  = 82944

mem_size_out = 9437184

hostMem	  = 18957312

devMem	   = 9520128

Processing time dev copy in : 0.061460 (ms)

Processing time dev : 188.712250 (ms)

Processing time dev copy out: 7.153702 (ms)

pass/ms= 3010.420993

Processing time host: 6755.754883 (ms)

host/dev= 34.480907

That is 34X faster than a CELERON CORE DUO 2333.

A trivial task is 116X faster.

Who can do a better optimization?
