Using shared memory

Hi everyone, I'm a complete newbie to CUDA and I'd like some help with using shared memory in my program.

I have a large array of, say, 6400x6400 elements. I want to divide the array into chunks of 64x64, store each chunk in shared memory, and process it in a block; that is, I want to give a 64x64 chunk to a single block, not to a thread. My doubt is: how can I transfer the elements from global memory to shared memory and then process them in the block?

Check the matrix transpose example in the SDK.

Hello rohitkrishna, in cases like this the best place to look for an answer is the programming guide NVIDIA provides with the CUDA installation. For CUDA 4.1, the programming guide covers exactly this, with a very nice example starting at page 22 (PDF page 34). There is also more about shared memory in the guide than just that section.

The short answer is that you need to declare an array with the __shared__ qualifier, so if your elements are ints, then:

__shared__ int shared_elements[64*64];

Then you have each thread pick an element from global memory and store it in shared memory. In fact, each thread has to pick more than one element, because the largest possible block has 1024 threads and 64*64 = 4096, so each thread will have to copy 4 elements. A __syncthreads() is necessary after the copy code to make sure all threads have finished their job of filling the shared memory.
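Putting that together, here is a minimal sketch of what such a kernel could look like. The kernel name, the launch configuration, and the "double every element" processing step are just placeholders for illustration; substitute whatever computation you actually need on the tile.

```cuda
#include <cuda_runtime.h>

#define N    6400   // full matrix dimension (from the question)
#define TILE 64     // tile edge: one 64x64 tile per block

__global__ void process_tile(const int *g_in, int *g_out)
{
    // 64*64 ints = 16 KB of shared memory, fine on Fermi-class (48 KB) GPUs.
    __shared__ int tile[TILE * TILE];

    // Block (blockIdx.x, blockIdx.y) owns the tile whose top-left corner
    // is at row blockIdx.y*64, column blockIdx.x*64 of the big matrix.
    int tile_row = blockIdx.y * TILE;
    int tile_col = blockIdx.x * TILE;

    // Flatten the thread index; with a 32x32 block there are 1024 threads,
    // so the strided loop below makes each thread copy 4096/1024 = 4 elements.
    int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;

    for (int i = tid; i < TILE * TILE; i += nthreads) {
        int r = i / TILE;
        int c = i % TILE;
        tile[i] = g_in[(tile_row + r) * N + (tile_col + c)];
    }
    __syncthreads();   // wait until every thread has filled its share of the tile

    // ...process the tile here; as a stand-in, write each element back doubled...
    for (int i = tid; i < TILE * TILE; i += nthreads) {
        int r = i / TILE;
        int c = i % TILE;
        g_out[(tile_row + r) * N + (tile_col + c)] = 2 * tile[i];
    }
}

// Launch: one block per 64x64 tile, 1024 threads per block.
// dim3 grid(N / TILE, N / TILE);   // 100 x 100 blocks
// dim3 block(32, 32);              // 1024 threads
// process_tile<<<grid, block>>>(d_in, d_out);
```

The strided-loop pattern (start at tid, step by the block size) is handy because it keeps working unchanged if you later pick a block with fewer threads; each thread just copies more elements.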