Memory optimization question

Is there any guide or example on improving shared memory access? To start with, I use a lot of shared memory on each SM.
Currently the Visual Profiler points to these things:

Register pressure, plus some memory load/store alignment and access pattern issues.

If there is a guide or example showing the best way to access/use shared memory, that'll be great.

I mean, I guess just avoid bank conflicts and you’ll be good.
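One common way to avoid bank conflicts is to pad a 2D shared-memory tile by one word per row. Below is a minimal host-side sketch (plain C++, not device code) of why that helps. It assumes a simplified model: 4-byte words, 32 banks, word w sitting in bank w % 32. The function name `columnReadConflicts` and its parameters are made up for illustration, and real hardware has extra cases this model ignores.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified model (an assumption): 4-byte words, word w lives in bank (w % banks).
// Returns how many lanes of one warp hit the busiest bank when lane i reads
// tile[i][col] from a row-major tile with `rowWidth` words per row.
// 1 means conflict-free, `warpSize` means fully serialized.
int columnReadConflicts(int rowWidth, int col, int banks = 32, int warpSize = 32) {
    assert(rowWidth > col && col >= 0 && banks > 0);
    std::vector<int> hits(banks, 0);
    for (int lane = 0; lane < warpSize; ++lane) {
        int word = lane * rowWidth + col;  // every lane reads a different word
        ++hits[word % banks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model `columnReadConflicts(32, 0)` gives 32 (every lane lands in the same bank), while `columnReadConflicts(33, 0)` gives 1. In device code that corresponds to declaring the tile as `__shared__ float tile[32][33]` instead of `[32][32]`, at the cost of 32 extra floats of shared memory per tile.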

Is there any other way to increase occupancy when it is limited by heavy shared memory use per block? For example, could I use constant memory or another type of storage for the input data to reduce shared memory usage? Device memory seems too slow, so maybe constant memory? But I'm not broadcasting: each thread.idx needs to access a different element in memory.

As for bank conflicts: on Kepler, with 32 banks, if I have array[2048] and each thread accesses elements of the array in one of the two ways below, which is faster?

float
f1=array[thread.idx];
f2=array[thread.idx*256];
f3=array[thread.idx*256*2];
f4=array[thread.idx*256*3];

or

float
f1=array[thread.idx];
f2=array[thread.idx+1];
f3=array[thread.idx+2];
f4=array[thread.idx+3];

Would the latter be faster, since the former keeps hitting the same bank? Would the latter take 4 cycles? How many cycles would the former take (8)?

Yes, the second is faster since it has no bank conflicts. The cycle count is a much longer story. First, all GPUs can perform ALU and memory operations simultaneously, even in the same cycle. Second, you can perform 4 ALU operations per 1 memory operation (i.e. there are 4 ALUs per memory unit). Third, the data will be ready in the register much later, because execution is pipelined; I've seen reports of 50 cycles even with no bank conflicts. So you should either run enough threads to hide the delay, or not use f1…f4 for up to a few dozen ALU operations after you have read them. Actually, ALUs have similar delays (9 cycles on Kepler), so unless you are optimizing to death, you can just rely on other threads hiding your delays.
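To put rough numbers on the two access patterns, here is a small host-side sketch in plain C++ (not device code). It assumes the first snippet strides by multiples of 256 elements per thread, and it uses a simplified bank model: 4-byte words, 32 banks, a 32-lane warp, word w in bank w % 32. Under those assumptions a warp-wide read of array[lane * stride + offset] has a conflict degree of gcd(stride, 32). The names `gcdInt` and `stridedConflictDegree` are hypothetical, and Kepler's real rules are somewhat more lenient than this model in a few cases.

```cpp
#include <cassert>

// Euclid's algorithm, written out to avoid depending on C++17 std::gcd.
int gcdInt(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Simplified model (an assumption): word w lives in bank (w % banks) and the
// warp has as many lanes as there are banks. Lane i reads array[i * stride + offset];
// the offset shifts all lanes equally, so it does not change the result.
// Returns the number of serialized passes: 1 = conflict-free.
int stridedConflictDegree(int stride, int banks = 32) {
    assert(stride >= 0 && banks > 0);
    if (stride == 0) return 1;      // all lanes read the same word: broadcast
    return gcdInt(stride, banks);   // lanes per bank for distinct addresses
}
```

In this model, strides of 256, 512 and 768 all give a degree of 32, so three of the four loads in the first snippet are fully serialized, while every load in the second snippet has stride 1 and a degree of 1. As the post above notes, the actual cycle counts also depend on pipelining and on how well other threads hide the latency.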

Constant memory is just device memory behind a small cache, optimized for same-index access. If your access is not uniform, avoid it. There is also texture memory, available on sm 3.5+ via __ldg() / const __restrict__ pointers. It's a 48 KB read-only cache, which is a nice addition to the main L1 cache (16 KB when shared memory is configured at 48 KB).

The overall strategy, rather than increasing occupancy, is to increase ILP instead. I.e. it's enough to have 4 blocks per SM on Kepler if you have maximized ILP to the point that each thread executes enough independent operations to hide the delay before using the result of an ALU/memory operation. In particular, look at the famous http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
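The trade-off between occupancy and ILP can be sketched with some back-of-the-envelope arithmetic. This is a rough host-side model in plain C++, not an official occupancy calculator. The default limits (48 KB shared memory, 2048 threads and 16 blocks per SM) are assumed Kepler-class values, the function names are hypothetical, and register limits are ignored even though they matter in practice.

```cpp
#include <algorithm>
#include <cassert>

// Resident blocks per SM when only shared memory, thread count and the
// block-count cap are considered (register limits are ignored here).
int blocksPerSM(int smemPerBlockBytes, int threadsPerBlock,
                int smemPerSMBytes = 48 * 1024,  // assumed Kepler-class limit
                int maxThreadsPerSM = 2048,      // assumed Kepler-class limit
                int maxBlocksPerSM = 16) {       // assumed Kepler-class limit
    assert(threadsPerBlock > 0 && smemPerBlockBytes >= 0);
    int bySmem = smemPerBlockBytes == 0 ? maxBlocksPerSM
                                        : smemPerSMBytes / smemPerBlockBytes;
    int byThreads = maxThreadsPerSM / threadsPerBlock;
    return std::min({bySmem, byThreads, maxBlocksPerSM});
}

// Little's law: operations that must be in flight = latency * throughput.
// If each warp supplies `ilp` independent operations, fewer warps are needed.
int warpsNeeded(int latencyCycles, int issuePerCycle, int ilp) {
    assert(latencyCycles >= 0 && issuePerCycle >= 0 && ilp > 0);
    int inFlight = latencyCycles * issuePerCycle;
    return (inFlight + ilp - 1) / ilp;  // ceiling division
}
```

For example, a block that uses 12 KB of shared memory with 256 threads gives `blocksPerSM(12 * 1024, 256) == 4`. With a 9-cycle latency and a hypothetical issue rate of 4 warp instructions per cycle, `warpsNeeded(9, 4, 1)` is 36, but `warpsNeeded(9, 4, 4)` is only 9, which is the sense in which more ILP can stand in for more occupancy.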

There are lots of other resources with arch details:
http://www.stuffedcow.net/files/gpuarch-ispass2010.pdf
http://asg.ict.ac.cn/dgemm/microbenchs.tar.gz
http://repository.lib.ncsu.edu/ir/bitstream/1840.16/9585/1/etd.pdf
https://hal.inria.fr/file/index/docid/789958/filename/112_Lai.pdf
http://hgpu.org/?p=14541 Dissecting GPU Memory Hierarchy through Microbenchmarking

Talks:
http://on-demand-gtc.gputechconf.com/gtc-quicklink/9BNvqKX
http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf
http://on-demand.gputechconf.com/gtc/2016/presentation/s6807-angerer-dynamic-parallelism.pdf

Books:
http://www.cudahandbook.com/
Shane Cook “CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs”
Rob Farber “CUDA Application Design and Development”
David Kirk, Wen-mei Hwu “Programming Massively Parallel Processors”