Memory optimization question

s002wjh · June 30, 2016, 6:44pm

Is there any guide or example on improving shared memory access? To start with, I use a lot of shared memory per SM.
Currently the Visual Profiler points to these things:

Register pressure, plus some memory load/store alignment and access pattern issues.

If there is a guide or example showing the best way to access/use shared memory, that'd be great.

MutantJohn · July 3, 2016, 7:17pm

I mean, I guess just avoid bank conflicts and you’ll be good.

s002wjh · July 6, 2016, 4:48pm

Is there any other way to increase occupancy when it is limited by a lot of shared memory per block? For example, could I use constant memory or some other type of storage for the input data to reduce shared memory usage? Device memory seems too slow, so maybe constant memory? But I'm not broadcasting: each thread index needs to access a different element in memory.

s002wjh · July 6, 2016, 6:00pm

As for bank conflicts: on Kepler, with 32 banks, if I have array[2048] and each thread accesses elements of the array in one of the two ways below, which is faster?

float f1 = array[threadIdx.x];
float f2 = array[threadIdx.x * 256];
float f3 = array[threadIdx.x * 256 * 2];
float f4 = array[threadIdx.x * 256 * 3];

or

float f1 = array[threadIdx.x];
float f2 = array[threadIdx.x + 1];
float f3 = array[threadIdx.x + 2];
float f4 = array[threadIdx.x + 3];

The latter would be faster, since the former keeps accessing the same bank? Will the latter take 4 cycles? And how many cycles will the former take (8)?

BulatZiganshin · July 7, 2016, 12:02pm

Yes, the second is faster since it has no bank conflicts. The cycle count is a much longer story. First, all GPUs can perform ALU and memory operations simultaneously, even in the same cycle. Second, you can perform 4 ALU operations per memory operation (i.e. there are 4 ALUs per memory unit). Third, the data will be ready in the register much later (execution is pipelined); I've seen reports of 50 cycles even with no bank conflicts. So you should either run enough threads to hide the delay, or not touch f1…f4 for up to a few dozen ALU operations after you have read them. Actually, ALUs have similar delays (9 cycles on Kepler), so unless you are optimizing to death, you can just rely on other threads hiding your delays.
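To make the difference between the two access patterns concrete, here is a minimal host-side C++ sketch (plain C++, not CUDA, and not from the original posts). It assumes the first snippet multiplies the thread index by 256, and it uses a simplified bank model: 32 banks of 4 bytes, with 32-bit word i living in bank i % 32. The name `conflictDegree` is made up for illustration, and the model ignores array bounds and the finer points of Kepler's 8-byte bank mode.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <functional>
#include <set>

// Assumed, simplified model: 32 banks, 4-byte bank width,
// 32-bit word i is stored in bank (i % 32).
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// For one warp-wide load, returns the largest number of *distinct* words
// that fall into a single bank. 1 means conflict-free; N means the load is
// serialized into roughly N passes under this model.
// wordIndex maps a lane id (0..31) to the array index that lane reads.
int conflictDegree(const std::function<std::size_t(int)>& wordIndex) {
    std::array<std::set<std::size_t>, kBanks> wordsPerBank;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t word = wordIndex(lane);
        // Lanes reading the very same word are served by a broadcast,
        // so only distinct words per bank are counted.
        wordsPerBank[word % kBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

Under this model, `conflictDegree([](int lane) { return std::size_t(lane) * 256; })` gives 32, because every lane lands in bank 0, while `conflictDegree([](int lane) { return std::size_t(lane) + 3; })` gives 1, because consecutive words fall into 32 different banks. The result is a serialization factor per load, not a cycle count.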

Constant memory is just a small cache for device memory, optimized for same-index access. If your access is not uniform, avoid it. There is also texture memory, available on sm 3.5+ via __ldg() or const __restrict__ pointers. It's a 48 KB L1 cache, which is a nice addition to the 16 KB main L1 cache.

The overall alternative to increasing occupancy is to increase ILP instead, i.e. it's enough to have 4 blocks per SM on Kepler if you have maximized ILP to the point where threads execute enough independent operations to hide the delay before using the result of an ALU/memory operation. In particular, look at the famous http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
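For a rough feel of how shared memory per block caps the number of resident blocks, here is a small host-side C++ sketch (again plain C++, not part of the original posts). The limits in `SmLimits` are assumptions for a Kepler-class SM with the 48 KB shared / 16 KB L1 split; the names `SmLimits`, `residentBlocks` and `occupancy` are made up for illustration, and registers and allocation granularity are ignored.

```cpp
#include <algorithm>
#include <cstddef>

// Assumed per-SM limits for a Kepler-class part (compute capability 3.x).
// Check your own device with cudaGetDeviceProperties before relying on them.
struct SmLimits {
    std::size_t sharedBytes = 48 * 1024;  // shared memory per SM
    int maxThreads = 2048;                // resident threads per SM
    int maxBlocks = 16;                   // resident blocks per SM
};

// Upper bound on resident blocks per SM, considering only shared memory,
// thread count and the block limit (registers are deliberately ignored).
int residentBlocks(std::size_t sharedPerBlock, int threadsPerBlock,
                   const SmLimits& sm = SmLimits{}) {
    const int byThreads = sm.maxThreads / threadsPerBlock;
    const int byShared = sharedPerBlock == 0
        ? sm.maxBlocks
        : static_cast<int>(sm.sharedBytes / sharedPerBlock);
    return std::min({byThreads, byShared, sm.maxBlocks});
}

// Occupancy as resident threads divided by the SM's maximum resident threads.
double occupancy(std::size_t sharedPerBlock, int threadsPerBlock,
                 const SmLimits& sm = SmLimits{}) {
    const int blocks = residentBlocks(sharedPerBlock, threadsPerBlock, sm);
    return static_cast<double>(blocks * threadsPerBlock) / sm.maxThreads;
}
```

Under these assumed limits, 16 KB of shared memory per 256-thread block allows 3 resident blocks (37.5% occupancy), and trimming that to 12 KB allows 4 (50%). For real numbers, the runtime's cudaOccupancyMaxActiveBlocksPerMultiprocessor also accounts for registers.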

There are lots of other resources with arch details:

http://asg.ict.ac.cn/dgemm/microbenchs.tar.gz
http://repository.lib.ncsu.edu/ir/bitstream/1840.16/9585/1/etd.pdf

Dissecting GPU Memory Hierarchy through Microbenchmarking (hgpu.org)

Talks:
http://on-demand-gtc.gputechconf.com/gtc-quicklink/9BNvqKX

Books:
http://www.cudahandbook.com/
Shane Cook “CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs”
Rob Farber “CUDA Application Design and Development”
David Kirk, Wen-mei Hwu “Programming Massively Parallel Processors”
