I am trying to optimize my CUDA implementation. After investigating with the CUDA Visual Profiler, I found that my problem is uncoalesced memory access.
Here is my problem:
I have, let's say, 2000 blocks with 128 threads each.
The threads of each block share some common data (a data structure of 13 integer variables).
How can I coalesce the reads from global memory so that I first load the data into shared memory?
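To make the question concrete, here is a minimal sketch of the kind of load I have in mind. I am not sure it is the right approach. It assumes the per-block data is stored as a flat int array with 13 consecutive ints per block; the names sumParams, runDemo, PARAMS_PER_BLOCK and THREADS_PER_BLOCK are made up for illustration, and the "work" done by the kernel is a dummy sum. The first 13 threads of each block read one int each, so consecutive threads touch consecutive addresses, then every thread reads the values from shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define PARAMS_PER_BLOCK  13   // 13 ints shared by all threads of one block
#define THREADS_PER_BLOCK 128

// Hypothetical kernel: params holds 13 consecutive ints for each block.
__global__ void sumParams(const int *params, int *out)
{
    __shared__ int s_params[PARAMS_PER_BLOCK];

    // Threads 0..12 each load one int; thread k reads word k of this
    // block's data, so neighbouring threads read neighbouring words.
    if (threadIdx.x < PARAMS_PER_BLOCK)
        s_params[threadIdx.x] = params[blockIdx.x * PARAMS_PER_BLOCK + threadIdx.x];

    // Make the loaded values visible to every thread of the block.
    __syncthreads();

    // Dummy work: all threads read the same shared word at each step.
    int sum = 0;
    for (int i = 0; i < PARAMS_PER_BLOCK; ++i)
        sum += s_params[i];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum + threadIdx.x;
}

// Host helper (hypothetical): returns 0 if the GPU result matches the CPU
// result, 1 on a mismatch, -1 on a CUDA error.
int runDemo(int numBlocks)
{
    const size_t nParams = (size_t)numBlocks * PARAMS_PER_BLOCK;
    const size_t nOut    = (size_t)numBlocks * THREADS_PER_BLOCK;

    std::vector<int> h_params(nParams), h_out(nOut);
    for (size_t i = 0; i < nParams; ++i)
        h_params[i] = (int)(i % 7);            // arbitrary test data

    int *d_params = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_params, nParams * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_out, nOut * sizeof(int)) != cudaSuccess) {
        cudaFree(d_params);
        return -1;
    }

    cudaError_t err = cudaMemcpy(d_params, h_params.data(),
                                 nParams * sizeof(int), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        sumParams<<<numBlocks, THREADS_PER_BLOCK>>>(d_params, d_out);
        // This copy waits for the kernel launched above to finish.
        err = cudaMemcpy(h_out.data(), d_out,
                         nOut * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_params);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return -1;

    // Compare against the same computation done on the host.
    for (int b = 0; b < numBlocks; ++b) {
        int sum = 0;
        for (int i = 0; i < PARAMS_PER_BLOCK; ++i)
            sum += h_params[(size_t)b * PARAMS_PER_BLOCK + i];
        for (int t = 0; t < THREADS_PER_BLOCK; ++t)
            if (h_out[(size_t)b * THREADS_PER_BLOCK + t] != sum + t)
                return 1;
    }
    return 0;
}
```

Calling runDemo(2000) from main would match my configuration. One thing I am unsure about: 13 ints is 52 bytes, so each block's data does not start on a 64-byte boundary, and on older hardware that alone may keep the load from being fully coalesced. Padding each block's data to 16 ints might be needed there.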
I read this thread: http://forums.nvidia.com/index.php?showtop…st=#entry452234
The problem is that I am confused by the way he is reading the data (int and struct hello) and by what he means by smemAOS.
Thanks for your reply.