Hi. I need to copy a block of global memory into shared memory. When I do it with a for loop, is this correct?
- I have a GeForce GTX 1080 Ti with bus width = 352 bits = 44 bytes = 11 floats.
- 512 threads per block (only 1 block in the calculation).
- I have a global buffer of 5632 floats (11 floats (bus width) * 512 threads = 5632) that I want to copy to shared memory; every thread should copy 11 floats (exactly one bus width). When I do it with a for loop, will my read transactions be combined according to the memory bus width, or will a single transaction be performed per float (one float = one transaction, even though the bus width is 11 floats)?
Or maybe there is a way to do a bulk memory copy? Something like memcpy, but inside a kernel?
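For comparison with the per-thread loop in my kernel below, here is a minimal sketch of a block-strided copy. As far as I understand, global loads are coalesced per warp (32 threads) based on the addresses those threads touch, not on the DRAM bus width, so this version lets consecutive threads read consecutive floats on each pass. The names copyStrided, check and countCopyMismatches are hypothetical and only serve the illustration; the write-back to check exists purely to verify the copy.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS_NUM 512          // threads per block, as in my kernel
#define SAMPLE_BLOCK_SIZE 5632   // 11 floats * 512 threads

// Block-strided copy: on each pass, thread t reads element (pass * blockDim.x + t),
// so the 32 threads of a warp touch 32 consecutive floats (128 contiguous bytes).
__global__ void copyStrided(const float* samples, float* check, int samplesNum)
{
    __shared__ float sampleBlock[SAMPLE_BLOCK_SIZE];
    int n = samplesNum < SAMPLE_BLOCK_SIZE ? samplesNum : SAMPLE_BLOCK_SIZE;

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sampleBlock[i] = samples[i];
    __syncthreads();  // all threads reach this; the shared copy is complete afterwards

    // Hypothetical check step: write the shared copy back out for verification.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        check[i] = sampleBlock[i];
}

// Hypothetical host helper: returns the number of mismatched elements,
// or -1 if a CUDA call fails.
int countCopyMismatches()
{
    std::vector<float> host(SAMPLE_BLOCK_SIZE), back(SAMPLE_BLOCK_SIZE, -1.0f);
    for (int i = 0; i < SAMPLE_BLOCK_SIZE; i++) host[i] = (float)i;

    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = SAMPLE_BLOCK_SIZE * sizeof(float);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1; }
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    copyStrided<<<1, THREADS_NUM>>>(dIn, dOut, SAMPLE_BLOCK_SIZE);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(back.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < SAMPLE_BLOCK_SIZE; i++)
        if (back[i] != host[i]) bad++;
    return bad;
}
```

Calling countCopyMismatches() from main should return 0 on a working device. With 512 threads and 5632 floats, each thread still copies 11 floats, only interleaved across the block instead of in one contiguous run per thread.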
#define COPY_BLOCK_EL_NUM 11 // floats per thread (memory bus width: 352 bits = 44 bytes = 11 floats)
#define THREADS_NUM 512 // threads per block
#define SAMPLE_BLOCK_SIZE 5632 // 11 floats * 512 threads
__global__ void processAL1(float* samples, uchar* out, int samplesNum)
{
    int tidx = threadIdx.x;
    __shared__ float sampleBlock[SAMPLE_BLOCK_SIZE];
    int smcStart = COPY_BLOCK_EL_NUM * tidx;
    int smcEnd = smcStart + COPY_BLOCK_EL_NUM;
    for (int i = smcStart; i < smcEnd; i++)
        sampleBlock[i] = samples[i]; // samples has the same size: SAMPLE_BLOCK_SIZE == 5632
    __syncthreads();