Hello all. I am a novice in CUDA programming; I have been following the official guide and have written a few programs. I need some concepts cleared up. My problem and questions follow.
I have a number of vectors, each of length N, stacked together. I want to compare each such layer (a vector of length N) to a constant vector, also of length N,
and then write the results to a separate stack of layers, each of length 1 (say, true/false values).
As illustrated below.
          |<--- N --->|                  |<--- N --->|             |<- 1 ->|
sample1   +++++++++++++   const. vector  +++++++++++++   result1       *
sample2   +++++++++++++                                  result2       *
sample3   +++++++++++++                                  result3       *
sample4   +++++++++++++                                  result4       *
I decided to treat the stacked sample vectors as a 2D matrix of dimension 4 * N, the constant vector as a matrix of dimension 1 * N, and the result as a 2D matrix of dimension 4 * 1.
In my execution configuration, I define a grid of dimension dimGrid(1, 4), i.e. a grid of 4 blocks arranged vertically in 1D. The block size is defined as dimBlock(1, N), i.e. a 1D array of N threads in every block, arranged horizontally.
Ques1: Can I do an operation where I take a 2D matrix and compare it to a 1D matrix, row-wise? I know that a simple strategy is to pass a single row of the 2D sample matrix to the kernel and do this inside a loop, i.e.
for (int i = 0; i < 4; i++)
{
    // One kernel launch per sample row
    Compare<<<dimGrid, dimBlock>>>(Sample[i], ConstVector, Result[i]);
    // One device-to-host copy per row
    cudaMemcpy(host_result, Result[i], sizeof(int), cudaMemcpyDeviceToHost);
}
But I want to avoid this, since there is a transfer time associated with each iteration, as shown above.
I was thinking of looping inside the kernel instead, but I am not sure whether I will have to use __syncthreads() or not.
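To make Ques1 concrete, here is a minimal sketch of one way to handle all rows in a single launch with a single copy back. It assumes the samples are stored row-major in one contiguous device buffer and that each row produces one int result; the names CompareRows and compareAllRows and the block size of 256 are made up for illustration, and error checking is omitted. Each block takes one row, its threads stride over the N columns, and __syncthreads_and() (a barrier that also ANDs a predicate across the block, available on compute capability 2.0 and later) combines the per-thread answers.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One block per sample row; the block's threads stride over the columns.
__global__ void CompareRows(const float* samples, const float* constVector,
                            int* result, int vectorLength)
{
    const float* row = samples + (size_t)blockIdx.x * vectorLength;

    int allEqual = 1;
    for (int col = threadIdx.x; col < vectorLength; col += blockDim.x)
    {
        if (row[col] != constVector[col])
            allEqual = 0;
    }

    // Barrier plus AND-reduction: nonzero only if every thread saw a match.
    int rowMatches = __syncthreads_and(allEqual);

    // A single thread writes the one result for this row.
    if (threadIdx.x == 0)
        result[blockIdx.x] = rowMatches ? 1 : 0;
}

// Hypothetical host helper: one launch and one device-to-host copy in total.
// Assumes constVector is non-empty and samples.size() is a multiple of it.
std::vector<int> compareAllRows(const std::vector<float>& samples,
                                const std::vector<float>& constVector)
{
    int n = (int)constVector.size();
    int numRows = (int)(samples.size() / n);

    float* dSamples = nullptr;
    float* dConst = nullptr;
    int* dResult = nullptr;
    cudaMalloc(&dSamples, samples.size() * sizeof(float));
    cudaMalloc(&dConst, n * sizeof(float));
    cudaMalloc(&dResult, numRows * sizeof(int));

    cudaMemcpy(dSamples, samples.data(), samples.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dConst, constVector.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);

    CompareRows<<<numRows, 256>>>(dSamples, dConst, dResult, n);

    // This copy waits for the kernel to finish before reading the results.
    std::vector<int> hostResult(numRows);
    cudaMemcpy(hostResult.data(), dResult, numRows * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(dSamples);
    cudaFree(dConst);
    cudaFree(dResult);
    return hostResult;
}
```

In this sketch a barrier is needed only because several threads cooperate on one row and their partial answers must be combined before the write. Calling compareAllRows with a flattened 4 * N sample buffer and the constant vector returns 4 values, one per row.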
Ques2: Now suppose I want each thread to process a whole array instead of a single array element, in order to use fewer threads and increase the load on each thread. How would I change my execution configuration to achieve this (if it is possible at all)?
Ques3: For a thread processing a single array element, I have seen examples such as:
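__global__ void Compare(float* A, float* B, float* Result, int VectorLength)
{
    // Global index of this thread across all blocks
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < VectorLength)
    {
        // Keep the difference as a float so it is not truncated to an integer
        float diff = A[idx] - B[idx];
        Result[idx] = (diff == 0.0f) ? 1.0f : 0.0f;
    }
}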
Can I change the above so that a single thread compares all elements of a vector in one go and then writes to Result?
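For Ques2 and Ques3, a rough sketch of what "one thread per whole vector" might look like is below. It assumes the same row-major contiguous layout and an int result per row; CompareRowPerThread, compareRowPerThread and the 128 threads per block are hypothetical choices, and error checking is left out. The execution configuration becomes 1D over the number of rows rather than over N, and each thread loops over its own row.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per sample row: the thread loops over all columns of its row.
__global__ void CompareRowPerThread(const float* samples,
                                    const float* constVector,
                                    int* result, int numRows, int vectorLength)
{
    int rowIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (rowIdx >= numRows)
        return;  // surplus threads in the last block do nothing

    const float* row = samples + (size_t)rowIdx * vectorLength;
    int allEqual = 1;
    for (int col = 0; col < vectorLength; ++col)
    {
        if (row[col] != constVector[col])
        {
            allEqual = 0;
            break;  // first mismatch settles the answer for this row
        }
    }
    result[rowIdx] = allEqual;
}

// Hypothetical host helper showing the changed execution configuration.
// Assumes constVector is non-empty and samples.size() is a multiple of it.
std::vector<int> compareRowPerThread(const std::vector<float>& samples,
                                     const std::vector<float>& constVector)
{
    int n = (int)constVector.size();
    int numRows = (int)(samples.size() / n);

    float* dSamples = nullptr;
    float* dConst = nullptr;
    int* dResult = nullptr;
    cudaMalloc(&dSamples, samples.size() * sizeof(float));
    cudaMalloc(&dConst, n * sizeof(float));
    cudaMalloc(&dResult, numRows * sizeof(int));

    cudaMemcpy(dSamples, samples.data(), samples.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dConst, constVector.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);

    // Threads are counted per row now, so the grid is sized from numRows.
    int threadsPerBlock = 128;
    int blocks = (numRows + threadsPerBlock - 1) / threadsPerBlock;
    CompareRowPerThread<<<blocks, threadsPerBlock>>>(dSamples, dConst, dResult,
                                                     numRows, n);

    std::vector<int> hostResult(numRows);
    cudaMemcpy(hostResult.data(), dResult, numRows * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(dSamples);
    cudaFree(dConst);
    cudaFree(dResult);
    return hostResult;
}
```

No __syncthreads() appears here because no two threads share data: each thread reads its own row and writes its own result. The trade-off is that with only 4 rows just 4 threads do any work, and neighbouring threads read addresses N floats apart, so this layout is likely to pay off only when there are many rows.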
Guidance appreciated.