Memory optimization: coalescing

Dear all,
I am trying to optimize my CUDA implementation. After investigating with the CUDA Visual Profiler, I found that my problem is uncoalesced memory access.

Here is my problem:
I have, let's say, 2000 blocks with 128 threads each.
The threads of each block share some common data (a data structure of 13 integer variables).
How can I coalesce the reads from global memory when I first load this data into shared memory?

I read this thread: http://forums.nvidia.com/index.php?showtop…st=#entry452234
The problem is that I am confused by the way he is reading the data (int and struct hello) and by what he means by smemAOS.

Thanks for your reply.

Willer

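The coalescing issue can be avoided by binding that memory to a texture on the GPU, because texture fetches go through a cache and do not have to follow the coalescing rules. Note that texture memory is read-only from within the kernel.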
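
For the shared-memory route asked about in the question, here is a minimal sketch, not a definitive implementation. It assumes each block's 13 integers are stored contiguously in global memory and that each record is padded to 16 ints (64 bytes) so that every block's record starts on an aligned address. The first 13 threads of a block each read one consecutive int, so the hardware can combine the reads, and then all 128 threads use the copy in shared memory. The names `useBlockParams`, `run_demo`, `kStride` and the dummy "sum of the fields" computation are made up for illustration; they do not come from the linked thread.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kFields  = 13;    // integers in the per-block structure (from the question)
constexpr int kStride  = 16;    // assumption: each record padded to 16 ints for alignment
constexpr int kThreads = 128;   // threads per block (from the question)
constexpr int kBlocks  = 2000;  // number of blocks (from the question)

// Hypothetical kernel: stage the block's record in shared memory, then use it.
__global__ void useBlockParams(const int* params, int* out)
{
    __shared__ int s_params[kFields];

    // Threads 0..12 read 13 consecutive ints: neighbouring threads touch
    // neighbouring addresses, which is the access pattern coalescing needs.
    if (threadIdx.x < kFields) {
        s_params[threadIdx.x] = params[blockIdx.x * kStride + threadIdx.x];
    }
    __syncthreads();  // make the staged data visible to every thread in the block

    // Dummy work: every thread reads all 13 fields from shared memory.
    int sum = 0;
    for (int i = 0; i < kFields; ++i) {
        sum += s_params[i];
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum + threadIdx.x;
}

// Runs the kernel and returns the number of wrong results (-1 on a CUDA error).
int run_demo()
{
    // Field i of block b holds b + i; the padding ints stay 0.
    std::vector<int> h_params(kBlocks * kStride, 0);
    for (int b = 0; b < kBlocks; ++b)
        for (int i = 0; i < kFields; ++i)
            h_params[b * kStride + i] = b + i;

    const size_t paramBytes = h_params.size() * sizeof(int);
    const size_t outBytes   = size_t(kBlocks) * kThreads * sizeof(int);

    int* d_params = nullptr;
    int* d_out    = nullptr;
    if (cudaMalloc(&d_params, paramBytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, outBytes) != cudaSuccess) { cudaFree(d_params); return -1; }
    cudaMemcpy(d_params, h_params.data(), paramBytes, cudaMemcpyHostToDevice);

    useBlockParams<<<kBlocks, kThreads>>>(d_params, d_out);

    std::vector<int> h_out(size_t(kBlocks) * kThreads);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_params);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Sum of (b + i) for i = 0..12 is 13*b + 78; each thread adds its own index.
    int mismatches = 0;
    for (int b = 0; b < kBlocks; ++b)
        for (int t = 0; t < kThreads; ++t)
            if (h_out[size_t(b) * kThreads + t] != kFields * b + 78 + t) ++mismatches;
    return mismatches;
}
```

To try it, call `run_demo()` from a `main` function and compile with `nvcc`; it returns 0 when every thread saw the right data. The padding to 16 ints mostly matters on older GPUs with strict alignment rules for coalescing; on newer ones a stride of 13 should also work, at the cost of some records straddling two memory segments.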