Hello,
I am trying to optimize my CUDA implementation. After investigating with the CUDA Visual Profiler, I found that my problem is uncoalesced memory access.
Here is my problem:
I have, let's say, 2000 blocks with 128 threads each.
The threads of each block share some common data (a data structure of 13 integer variables).
How can I coalesce the reads from global memory when I first load this data into shared memory?
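To make the question concrete, here is a minimal sketch of what I think I am aiming for. The names (BlockParams, useParams, runDemo) and the computation in the kernel are made up for illustration and are not my real code. The idea is that the first 13 threads of each block each read one consecutive int of that block's struct into shared memory, and after a barrier every thread works from the shared copy. I am not sure whether the 52-byte struct size is a problem for alignment on older hardware; padding the struct to 16 ints might be needed there.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define PARAMS_PER_BLOCK 13     // 13 ints shared by all threads of a block
#define THREADS_PER_BLOCK 128
#define NUM_BLOCKS 2000

// Hypothetical layout: one struct of 13 ints per block, stored contiguously.
struct BlockParams { int v[PARAMS_PER_BLOCK]; };

__global__ void useParams(const BlockParams* params, int* out)
{
    __shared__ int s[PARAMS_PER_BLOCK];

    // Threads 0..12 read 13 consecutive ints of this block's struct,
    // so neighbouring threads touch neighbouring addresses.
    if (threadIdx.x < PARAMS_PER_BLOCK)
        s[threadIdx.x] = params[blockIdx.x].v[threadIdx.x];
    __syncthreads();            // shared copy is complete after this barrier

    // Placeholder work: every thread reads all 13 values from shared memory.
    int sum = 0;
    for (int i = 0; i < PARAMS_PER_BLOCK; ++i)
        sum += s[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum + threadIdx.x;
}

// Runs the kernel and returns the number of mismatches against a host check
// (or -1 if a CUDA call failed).
int runDemo()
{
    const int n = NUM_BLOCKS * THREADS_PER_BLOCK;
    std::vector<BlockParams> hParams(NUM_BLOCKS);
    for (int b = 0; b < NUM_BLOCKS; ++b)
        for (int i = 0; i < PARAMS_PER_BLOCK; ++i)
            hParams[b].v[i] = b + i;                 // arbitrary test data

    BlockParams* dParams = 0;
    int* dOut = 0;
    if (cudaMalloc((void**)&dParams, NUM_BLOCKS * sizeof(BlockParams)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&dOut, n * sizeof(int)) != cudaSuccess) { cudaFree(dParams); return -1; }
    cudaMemcpy(dParams, hParams.data(), NUM_BLOCKS * sizeof(BlockParams),
               cudaMemcpyHostToDevice);

    useParams<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(dParams, dOut);

    std::vector<int> hOut(n);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dParams);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int b = 0; b < NUM_BLOCKS; ++b) {
        int sum = 0;
        for (int i = 0; i < PARAMS_PER_BLOCK; ++i) sum += b + i;
        for (int t = 0; t < THREADS_PER_BLOCK; ++t)
            if (hOut[b * THREADS_PER_BLOCK + t] != sum + t) ++mismatches;
    }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", runDemo());
    return 0;
}
```

Is this the right pattern, or does the struct need to be laid out differently in global memory for the loads to be coalesced?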
I read this thread: http://forums.nvidia.com/index.php?showtopic=79620&pid=452234&mode=threaded&show=&st=#entry452234
The problem is that I am confused by the way he is reading the data (int and struct hello) and by what he means by smemAOS.
Thanks for your reply.
Willer