The documentation for Matrix Transpose in the CUDA SDK mentions that on 8-series and 10-series GPUs each memory partition is 256 bytes wide. Does this mean that the DRAM row buffer can be at most 256 bytes? Or is the row buffer much larger, but holding data from non-contiguous locations?
Thanks,
nagesh