I have a very basic question. I have an array in global memory, each element of which is 32 bytes. Which of the following is the better way to load this data into shared memory, and, if possible, why?
- Each thread loads one element of the array, i.e. 32 bytes, from global to shared memory. In this case only a few threads do the loading, so most of the threads remain idle (my array has, say, 100 elements in total).
- Each thread loads 4 bytes of data from global to shared memory. In this case fewer threads are idle.
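To make the two options concrete, here is a minimal sketch of what I have in mind. The `Element` struct (eight floats), the kernel and helper names, and the single-block launch with 128 threads are placeholders I made up for illustration, not my real code. Each kernel also sums the words of every element so that the two loads can be checked against each other.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical 32-byte element; the real layout is not part of the question.
struct Element { float v[8]; };
static_assert(sizeof(Element) == 32, "Element is assumed to be 32 bytes");

constexpr int kNumElements = 100;
constexpr int kWordsPerElement = sizeof(Element) / sizeof(float);  // 8
constexpr int kTotalWords = kNumElements * kWordsPerElement;       // 800

// Sums the words of each element from shared memory so the load can be verified.
__device__ void writeSums(const Element* tile, float* sums) {
    for (int e = threadIdx.x; e < kNumElements; e += blockDim.x) {
        float s = 0.0f;
        for (int w = 0; w < kWordsPerElement; ++w) s += tile[e].v[w];
        sums[e] = s;
    }
}

// Option 1: each participating thread copies one whole 32-byte element.
// With 128 threads and 100 elements, threads 100..127 copy nothing.
__global__ void loadPerElement(const Element* in, float* sums) {
    __shared__ Element tile[kNumElements];
    for (int e = threadIdx.x; e < kNumElements; e += blockDim.x) tile[e] = in[e];
    __syncthreads();
    writeSums(tile, sums);
}

// Option 2: each thread copies 4-byte words; consecutive threads touch
// consecutive words, and every thread loops until all 800 words are copied.
__global__ void loadPerWord(const Element* in, float* sums) {
    __shared__ Element tile[kNumElements];
    const float* src = reinterpret_cast<const float*>(in);
    float* dst = reinterpret_cast<float*>(tile);
    for (int i = threadIdx.x; i < kTotalWords; i += blockDim.x) dst[i] = src[i];
    __syncthreads();
    writeSums(tile, sums);
}

static void check(cudaError_t err) { assert(err == cudaSuccess); (void)err; }

// Host helper (hypothetical): runs both kernels and compares their results.
bool bothOptionsAgree() {
    Element host[kNumElements];
    for (int e = 0; e < kNumElements; ++e)
        for (int w = 0; w < kWordsPerElement; ++w)
            host[e].v[w] = static_cast<float>(e * kWordsPerElement + w);

    Element* dIn = nullptr;
    float* dSums = nullptr;
    check(cudaMalloc(&dIn, sizeof(host)));
    check(cudaMalloc(&dSums, 2 * kNumElements * sizeof(float)));
    check(cudaMemcpy(dIn, host, sizeof(host), cudaMemcpyHostToDevice));

    loadPerElement<<<1, 128>>>(dIn, dSums);
    loadPerWord<<<1, 128>>>(dIn, dSums + kNumElements);

    float sums[2 * kNumElements];
    check(cudaMemcpy(sums, dSums, sizeof(sums), cudaMemcpyDeviceToHost));
    check(cudaFree(dIn));
    check(cudaFree(dSums));

    bool ok = true;
    for (int e = 0; e < kNumElements; ++e) {
        // Words of element e are 8e .. 8e+7, so their sum is 64e + 28.
        float expected = 64.0f * e + 28.0f;
        ok = ok && sums[e] == expected && sums[kNumElements + e] == expected;
    }
    return ok;
}
```

Both kernels leave the same data in shared memory; they differ only in how the copy is divided among the threads, which is what my question is about.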
So which of these two approaches is better?