Correct copying to local (shared) memory?

Hi. I need to copy a block of global memory into shared memory. If I do it with a for loop, is that correct?

  1. I have a GeForce GTX 1080 Ti with a memory bus width of 352 bits = 44 bytes = 11 floats.
  2. 512 threads per block (only 1 block in the calculation).
  3. I have a global buffer of 5632 floats (11 floats (bus width) * 512 threads = 5632) that I want to copy to shared memory. Every thread should copy 11 floats (exactly one bus width). When I do this with a for loop, will my read transactions be combined according to the memory bus width, or will a separate transaction be performed for each float (one float = one transaction, even though the bus width is 11 floats)?

Or is there a way to do a bulk memory copy, something like memcpy but inside the kernel?

#define COPY_BLOCK_EL_NUM 11 //11 floats  (very strange membus width = 352bits = 44 bytes = 11 float :-) )
#define THREADS_NUM 512 //threads num
#define SAMPLE_BLOCK_SIZE 5632 //11 floats * 512 threads

__global__ void processAL1(float* samples, uchar* out, int samplesNum)
{
	int tidx = threadIdx.x;	

	__shared__ float sampleBlock[SAMPLE_BLOCK_SIZE];

	int smcStart = COPY_BLOCK_EL_NUM*tidx;
	int smcEnd = smcStart + COPY_BLOCK_EL_NUM;
	for (int i = smcStart; i < smcEnd; i++)
		sampleBlock[i] = samples[i]; //samples have same size == SAMPLE_BLOCK_SIZE == 5632
	__syncthreads();

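It’s generally preferable to arrange for adjacent threads to load and store adjacent elements, both to/from global memory and to/from shared memory. Global loads are coalesced per warp, based on the addresses the 32 threads of the warp access in the same instruction, not on the width of the memory bus. In your loop, adjacent threads read addresses that are 11 floats (44 bytes) apart on every iteration, so each warp-wide access is strided rather than contiguous. The methodology for the copy would be very similar to a grid-stride loop: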

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

Here’s an example that should be efficient:

https://stackoverflow.com/questions/15468059/copy-to-the-shared-memory-in-cuda

The only change there would be to replace warpSize with the thread block size (blockDim.x).
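To make the suggestion concrete for the kernel in the question, here is a minimal sketch of a block-stride copy, not a definitive implementation. It reuses THREADS_NUM and SAMPLE_BLOCK_SIZE from the question. The kernel name copyViaShared, the float output buffer, the reversed write-back, and the host helper runCopyCheck are all hypothetical, added only so the copy can be verified. It assumes a single block, as in the question.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS_NUM 512          // threads per block, as in the question
#define SAMPLE_BLOCK_SIZE 5632   // 11 floats * 512 threads, as in the question

// Block-stride copy: on iteration k, thread t handles element t + k * blockDim.x,
// so adjacent threads touch adjacent floats in both global and shared memory.
// Assumes a single thread block.
__global__ void copyViaShared(const float* samples, float* out, int samplesNum)
{
    __shared__ float sampleBlock[SAMPLE_BLOCK_SIZE];

    // Never copy more than the shared tile can hold.
    const int n = samplesNum < SAMPLE_BLOCK_SIZE ? samplesNum : SAMPLE_BLOCK_SIZE;

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sampleBlock[i] = samples[i];
    __syncthreads();  // make the whole tile visible to every thread in the block

    // Hypothetical consumer, present only so the host can verify the copy:
    // write the tile back in reverse order, which reads elements that were
    // loaded by other threads.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = sampleBlock[n - 1 - i];
}

// Hypothetical host-side check: returns the number of mismatched elements,
// or -1 if a CUDA call fails.
int runCopyCheck()
{
    const int n = SAMPLE_BLOCK_SIZE;
    std::vector<float> hostIn(n), hostOut(n, 0.0f);
    for (int i = 0; i < n; ++i) hostIn[i] = static_cast<float>(i);

    float* devIn = nullptr;
    float* devOut = nullptr;
    if (cudaMalloc(&devIn, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&devOut, n * sizeof(float)) != cudaSuccess) {
        cudaFree(devIn);
        return -1;
    }
    cudaMemcpy(devIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    copyViaShared<<<1, THREADS_NUM>>>(devIn, devOut, n);

    // This blocking copy also reports any error raised by the kernel launch.
    cudaError_t err = cudaMemcpy(hostOut.data(), devOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hostOut[i] != hostIn[n - 1 - i]) ++mismatches;
    return mismatches;
}
```

Calling runCopyCheck() from main on a machine with a working CUDA device should return 0. Each thread still copies 11 floats in total, but on every loop iteration the 32 threads of a warp read 32 consecutive floats (128 contiguous bytes), which is the pattern the hardware can coalesce.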