How to know which variables are placed in local memory?

Which part of this code, http://openpaste.org/en/18965/, is using local memory?

How can I tell which variables are placed in local memory?

Are function-local arrays guaranteed to be placed in local memory, or not?

Are function-local structs guaranteed to be placed in local memory, or not?

ptxas info	: Compiling entry function '_Z6matrixP10calcDataInP11calcDataOut'

ptxas info	: Used 16 registers, 170+0 bytes lmem, 4312+16 bytes smem, 116 bytes cmem[1], 4 bytes cmem[14]

Variables that are put in local memory:
a) Arrays that are dynamically indexed, i.e. array[index] rather than array[23].
b) Any variable, when the number of required registers exceeds the maximum. You cannot know which ones or when, but the compiler tries to minimise the number of loads and stores, so variables that you access infrequently but whose values stay live for a large part of your kernel are the most likely candidates.
c) Apparently, a struct that you load from global memory into registers all at once (that is my current problem, by the way). Try loading it element by element instead.

Variables that are not put in local memory must conform to these constraints:

  • There must be enough registers to store the data
  • Which element is accessed must be known precisely at compile time
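A minimal sketch of cases a) and c) above. All names here (`constantIndex`, `dynamicIndex`, `structByField`, `Params`, `demoDynamicIndex`) are hypothetical and not taken from the pasted code. Compiling it with `nvcc --ptxas-options=-v` prints one `ptxas info` line per kernel, so the `lmem` figures can be compared kernel by kernel; what the compiler actually does depends on its version and the target architecture.

```cuda
#include <cuda_runtime.h>

// Hypothetical struct, only for illustration.
struct Params { int a; int b; int c; int d; };

// Every index is a literal, so the compiler is free to keep t[] in registers.
__global__ void constantIndex(const int *in, int *out)
{
    int t[4];
    t[0] = in[threadIdx.x * 4 + 0];
    t[1] = in[threadIdx.x * 4 + 1];
    t[2] = in[threadIdx.x * 4 + 2];
    t[3] = in[threadIdx.x * 4 + 3];
    out[threadIdx.x] = t[0] + t[3];
}

// The index k is only known at run time, so t[] typically has to live in
// addressable (local) memory instead of registers.
__global__ void dynamicIndex(const int *in, const int *idx, int *out)
{
    int t[4];
    t[0] = in[threadIdx.x * 4 + 0];
    t[1] = in[threadIdx.x * 4 + 1];
    t[2] = in[threadIdx.x * 4 + 2];
    t[3] = in[threadIdx.x * 4 + 3];
    int k = idx[threadIdx.x] & 3;   // run-time index, kept in range 0..3
    out[threadIdx.x] = t[k];
}

// Case c): read the struct member by member rather than as one whole copy.
__global__ void structByField(const Params *p, int *out)
{
    int a = p[threadIdx.x].a;
    int d = p[threadIdx.x].d;
    out[threadIdx.x] = a + d;
}

// Host helper (hypothetical): launches dynamicIndex with 4 threads and
// returns out[3]. Error checking is omitted for brevity.
int demoDynamicIndex()
{
    int hIn[16], hIdx[4] = {0, 1, 2, 3}, hOut[4] = {0, 0, 0, 0};
    for (int i = 0; i < 16; ++i) hIn[i] = i;

    int *dIn, *dIdx, *dOut;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dIdx, sizeof(hIdx));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx, sizeof(hIdx), cudaMemcpyHostToDevice);

    dynamicIndex<<<1, 4>>>(dIn, dIdx, dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    cudaFree(dIn); cudaFree(dIdx); cudaFree(dOut);
    return hOut[3];
}
```

The kernels compute ordinary results either way; only the `lmem` column of the `ptxas info` output is expected to differ between `constantIndex` and `dynamicIndex`.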

I need some fast memory with indirect addressing, not only registers. Or add indirect addressing instructions for registers.

16384 bytes of shared memory is only 21 bytes per thread,

and constant memory is slower than shared memory.

And give me a fast custom bit-permutation unit for awesome cryptographic power.

Make it for me in the next version.

And make something faster than the GTS240 on 40nm technology.

Right now ATI is the best choice.

AND how do I know which variables are placed in local memory?

Condition “a) Arrays that are dynamically indexed, i.e. array[index], instead of array[23]” is sometimes hard to check manually.

You should keep in mind that spilling to local memory happens at the register level, not at the source code level. A variable might not be assigned to local memory, but the intermediate value in a complex expression could be put into local memory. Variables in the source code do not have a one-to-one correlation with registers.

Make me a debugging tool.

I need to know where my program is using slow local memory.

I especially need to know where an optimization was not applied because of indirect addressing.

It is hard to monitor this manually.


OR:

I need some fast memory with indirect addressing, not only registers. Or add indirect addressing instructions for registers.

16384 bytes of shared memory is only 21 bytes per thread.

I’m not sure this solves your problem, but the CUDA Visual Profiler shows you how many registers each kernel function uses when you perform an “occupancy analysis”.

eldad.
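Besides the profiler, the runtime can report per-kernel resource usage. The sketch below assumes a reasonably recent CUDA toolkit and uses a hypothetical kernel `probeKernel` and helper `reportKernelResources`. It only gives totals per kernel (the same numbers as the `ptxas info` line), so it shows whether local memory is used, not which variable caused it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a run-time indexed per-thread array.
__global__ void probeKernel(const int *idx, int *out)
{
    int t[8];
    for (int i = 0; i < 8; ++i)
        t[i] = i * (int)threadIdx.x;
    out[threadIdx.x] = t[idx[threadIdx.x] & 7];
}

// Prints the per-thread resource usage recorded for probeKernel.
// Returns the cudaError_t from the query so the caller can check it.
cudaError_t reportKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, probeKernel);
    if (err != cudaSuccess)
        return err;

    printf("registers per thread    : %d\n", attr.numRegs);
    printf("local memory per thread : %zu bytes\n", attr.localSizeBytes);
    printf("static shared memory    : %zu bytes\n", attr.sharedSizeBytes);
    return cudaSuccess;
}
```

Calling `reportKernelResources()` once at start-up, for each kernel of interest, makes it easy to notice when a code change pushes `localSizeBytes` above zero.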

This is actually the easy part! When you address an array by the value of another variable, that is indirect addressing. Loop unrolling may help you, but not necessarily.
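A sketch of the loop-unrolling case, with hypothetical names (`sumUnrolled`, `demoSumUnrolled`, `N`). The loop counter looks like a variable index in the source, but with a compile-time trip count and `#pragma unroll` each `v[i]` can become a fixed `v[0]`, `v[1]`, and so on. This only helps when the trip count and every index can be resolved at compile time.

```cuda
#include <cuda_runtime.h>

#define N 4   // hypothetical per-thread element count (compile-time constant)

// After full unrolling, every access to v[] uses a constant index,
// so the array can be mapped to registers.
__global__ void sumUnrolled(const float *in, float *out)
{
    float v[N];
    #pragma unroll
    for (int i = 0; i < N; ++i)
        v[i] = in[threadIdx.x * N + i];

    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; ++i)
        s += v[i] * v[N - 1 - i];
    out[threadIdx.x] = s;
}

// Host helper (hypothetical): runs one thread on the input {1, 2, 3, 4}
// and returns its result. Error checking is omitted for brevity.
float demoSumUnrolled()
{
    float hIn[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hOut = 0.0f;

    float *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    sumUnrolled<<<1, 1>>>(dIn, dOut);
    cudaMemcpy(&hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    cudaFree(dIn); cudaFree(dOut);
    return hOut;
}
```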

On devices of compute capability 1.2 or higher, up to 1024 threads can be resident on a multiprocessor, which is 16 bytes of shared memory per thread. On devices of 1.1 or lower the limit is 768 resident threads; shared memory is 16384 bytes per multiprocessor there as well (8192 is the register count, not the shared memory size), so that works out to about 21 bytes per thread.
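The per-thread figure is just shared memory divided by the number of resident threads. A small sketch that computes it, and optionally reads both numbers from the installed device; the helper names `sharedBytesPerThread` and `printSharedBudget` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Shared memory available per resident thread when a multiprocessor is
// fully occupied: shared memory size divided by the resident-thread limit.
double sharedBytesPerThread(size_t sharedBytes, int residentThreads)
{
    return (double)sharedBytes / residentThreads;
}

// Queries device 0 and prints the figure for the hardware actually installed.
cudaError_t printSharedBudget()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
        return err;

    printf("%s (compute %d.%d): %.1f bytes of shared memory per resident thread\n",
           prop.name, prop.major, prop.minor,
           sharedBytesPerThread(prop.sharedMemPerMultiprocessor,
                                prop.maxThreadsPerMultiProcessor));
    return cudaSuccess;
}
```

For example, `sharedBytesPerThread(16384, 1024)` gives 16 bytes per thread.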

The current best ATI card is probably better than NVIDIA’s current best card, but ATI’s baby is fairly new, while NVIDIA’s architecture is over a year old. Fermi is on the way, but you will need to wait a few more months.

Apart from the newest ATI cards, which can handle DirectX11, programming them has been somewhat more complicated…

ispla, you are quite demanding. Make me this, make me that… :P

Constant memory is too slow, useless.

Convert the constant memory cache into shared memory,

and add per-thread shared memory mapping, and let the compiler automatically optimize local arrays and the like into per-thread mapped shared memory.
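Something close to per-thread shared memory mapping can already be done by hand. This is a sketch with assumed sizes and hypothetical names (`perThreadShared`, `demoPerThreadShared`, `BLOCK`, `SLOTS`): each thread owns one column of a `__shared__` array and indexes it at run time, so no local memory is needed for the array. The cost is `SLOTS * BLOCK * 4` bytes of shared memory per block.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64   // threads per block (assumed; a multiple of the bank count)
#define SLOTS 8    // 32-bit words of scratch per thread (assumed)

// Each thread uses column threadIdx.x of scratch[][] as its private array.
// Element i of thread t lives at word i * BLOCK + t, so its bank depends
// only on t, and run-time indices do not cause bank conflicts.
__global__ void perThreadShared(const int *idx, int *out)
{
    __shared__ int scratch[SLOTS][BLOCK];
    const int t = threadIdx.x;

    for (int i = 0; i < SLOTS; ++i)
        scratch[i][t] = i * 100 + t;

    int k = idx[blockIdx.x * BLOCK + t] & (SLOTS - 1);  // run-time index 0..7
    out[blockIdx.x * BLOCK + t] = scratch[k][t];
}

// Host helper (hypothetical): runs one block with idx[t] = t and returns
// out[5]. Error checking is omitted for brevity.
int demoPerThreadShared()
{
    int hIdx[BLOCK], hOut[BLOCK];
    for (int i = 0; i < BLOCK; ++i) { hIdx[i] = i; hOut[i] = 0; }

    int *dIdx, *dOut;
    cudaMalloc(&dIdx, sizeof(hIdx));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIdx, hIdx, sizeof(hIdx), cudaMemcpyHostToDevice);

    perThreadShared<<<1, BLOCK>>>(dIdx, dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    cudaFree(dIdx); cudaFree(dOut);
    return hOut[5];
}
```

No `__syncthreads()` is needed here because each thread reads and writes only its own column.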

this:

http://openpaste.org/en/18965/

on a GTS8800 640MB, with my optimisation, it does 3E6 passes per second:

mem_size_in  = 82944

mem_size_out = 9437184

hostMem	  = 18957312

devMem	   = 9520128

Processing time dev copy in : 0.061460 (ms)

Processing time dev : 188.712250 (ms)

Processing time dev copy out: 7.153702 (ms)

pass/ms= 3010.420993

Processing time host: 6755.754883 (ms)

host/dev= 34.480907

That is 34X faster than a CELERON CORE DUO 2333.

A trivial task is 116X faster.


Who can do a better optimization?