Processing several elements per thread

Hi,

I read somewhere that processing several elements per thread has the following two benefits:

1- Multiple loads get pipelined

2- Indexing calculations can often be re-used

I agree with the first point, as this will hide the access latency. Can anybody give an example in support of point 2 above?

Thanks

gpuguy

I think you can try a data copy via a kernel.
You can utilize both benefits that way.

Thanks!

But sorry… how do we do a data copy via a kernel? I used to think that before we call a kernel we need to copy data using cudaMemcpy(). Could you please explain how we can copy data via a kernel?

I very often opt for such solutions, especially where the problems are reasonably well defined and I can use some templates.

It is then often possible to place several data elements in registers, which brings an additional advantage over using shared memory. Smem can often be quite slow compared to just using registers (the programming guide will tell you it's the same latency, but it's not).

One example is an algorithm I did recently where one had to read in data and construct a vector V (length(V) = 64, for example) and then use that to construct the matrix M = VV', and then in the next iteration read in and construct a new V and do M += VV'. At first I stored the matrix M in shared memory, which gave me an overall performance of 80-100 GFLOPs.

But later I realized that all the columns of M could be stored in the registers of 64 threads, since there is no need for communication between the threads in this particular implementation. This got my overall performance up to about 240 GFLOPs.

You can give this problem a try yourself, just make sure you don’t spill over into lmem!
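Roughly, the idea is something like the sketch below. This is just a rough reconstruction from memory, so treat the names (the kernel, Vs, numIters) and the column-major layout of M as my own made-up conventions — the point is only that each of the 64 threads keeps one column of M in a register array:

// Rough sketch: accumulate M += V*V' over several iterations, with column 'tid'
// of the 64x64 matrix M kept in the registers of thread 'tid' (blockDim.x == 64).
#define LEN 64

__global__ void outerProductAccumulate(const float* Vs, float* M, int numIters)
{
    int tid = threadIdx.x;            // one thread per column of M
    float col[LEN] = {0.0f};          // this thread's column of M, kept in registers
    __shared__ float V[LEN];          // the current vector, visible to all threads

    for (int it = 0; it < numIters; ++it)
    {
        V[tid] = Vs[it * LEN + tid];  // each thread reads one element of V
        __syncthreads();

        float vj = V[tid];            // V[tid] scales this thread's column
#pragma unroll
        for (int i = 0; i < LEN; ++i)
            col[i] += V[i] * vj;      // M(i, tid) += V(i) * V(tid)
        __syncthreads();              // don't overwrite V before everyone is done
    }

#pragma unroll
    for (int i = 0; i < LEN; ++i)     // write the finished column back (column-major M)
        M[tid * LEN + i] = col[i];
}

Since LEN is a compile-time constant, the unrolled loops index col[] with constants and it can stay in registers. Whether 64 floats per thread actually fit without spilling depends on the register budget of your particular GPU, so keep an eye on the lmem usage reported by the compiler.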

Thank you very much. This makes sense to me. :rolleyes:

But being somewhat new to CUDA, I am unable to understand how exactly you used registers to store your data. Your data must start out in global memory, so how does it end up in GPU registers? I know how to bring data from global memory into shared memory, but I don't understand how to transfer data from global memory to registers. A two-line code example would help me understand this.

Sorry for multiple posts…it was an error

Not much to it really…

// Normally you'd maybe write something like:

__global__ void kernel(float* globalArray, int n)
{
    // A normal declaration like this loads the global value into 'register_val',
    // which will be placed in a register:
    float register_val = globalArray[threadIdx.x + blockIdx.x*blockDim.x];

    // But doing this will cause issues ('n' would have to be a compile-time
    // constant, and indexing with a runtime loop variable pushes the array
    // into local memory):
    float register_array[n];

    for(int i = 0; i < n; i++)
        register_array[i] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + i*n];  // <------ not in registers, will be placed in local memory

    // This however is OK:
    register_array[0] = globalArray[threadIdx.x + blockIdx.x*blockDim.x];
    register_array[1] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + 1*n];
    register_array[2] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + 2*n];
    register_array[3] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + 3*n];

    // Accessing with constant indexes is doable, but not very practical
    // => use template parameters and unrolling.
}

// This will be easier to handle:

template<int N>
__global__ void kernel(float* globalArray)
{
    float register_array[N];

#pragma unroll
    for(int i = 0; i < N; i++)
        register_array[i] = globalArray[threadIdx.x + blockIdx.x*blockDim.x + i*N];

    // do something with the values on-chip
    // ...
}

But as I said, you need to know your N at compile time, otherwise your performance will be severely affected by this.
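If the element count per thread is only known at runtime, a common workaround (just a sketch — the launch() wrapper and its parameter names are made up here) is to instantiate the template for a handful of fixed values and pick between them on the host:

// Hypothetical host-side dispatch: N must be a compile-time constant, so a few
// fixed instantiations are compiled and one is selected at runtime.
void launch(float* d_data, int elemsPerThread, int numBlocks, int threadsPerBlock)
{
    switch (elemsPerThread)
    {
        case 2:  kernel<2><<<numBlocks, threadsPerBlock>>>(d_data);  break;
        case 4:  kernel<4><<<numBlocks, threadsPerBlock>>>(d_data);  break;
        case 8:  kernel<8><<<numBlocks, threadsPerBlock>>>(d_data);  break;
        default: /* unsupported count: fall back to a generic kernel or report an error */ break;
    }
}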

You can read the “vector addition” example in section 2.1 of the Programming Guide 2.3:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation (A, B and C are device pointers allocated beforehand,
    // and N is the vector length, as in the guide's surrounding text)
    VecAdd<<<1, N>>>(A, B, C);
}

All you have to do is replace “C[i] = A[i] + B[i];” with “C[i] = A[i];” and you have a data copy done by a kernel.
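And to tie this back to the original question: the copy kernel can then be extended so that each thread copies several elements, which is exactly where both benefits show up — the base index is computed once and reused with a cheap offset per element, and the independent loads can all be in flight at the same time. A hypothetical sketch (CopyMulti, ELEMS and the grid-stride layout are my own choices, not from the guide):

// Hypothetical copy kernel: each thread copies ELEMS elements.
// The base index is computed once and only an offset is added per element;
// the ELEMS independent loads can overlap, hiding part of the access latency.
#define ELEMS 4

__global__ void CopyMulti(const float* A, float* C, int n)
{
    int base   = blockIdx.x * blockDim.x + threadIdx.x;  // computed once, reused below
    int stride = gridDim.x * blockDim.x;

#pragma unroll
    for (int e = 0; e < ELEMS; ++e)
    {
        int i = base + e * stride;                        // index reuse: base + cheap offset
        if (i < n)
            C[i] = A[i];
    }
}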

How will it hide the access latency? I thought the ratio of arithmetic to global loads and higher occupancy is what hides the access latency.