Single thread processing a vector of elements: general concept

Hello all. I am a novice in CUDA programming; I have been following the official guide and have written a few scripts. I need some concepts cleared up. My questions and problem follow.

I have a number of vectors of length N stacked together. The function I want to run compares each such layer (a vector of length N) to a constant vector, also of length N.

It then writes the results to a separate stack of layers, each of length 1 (say, true/false values).

As illustrated below.

         |<------- N ------->|                  |<------- N ------->|            |<- 1 ->|
sample1  +++++++++++++          const. vector   +++++++++++++          result1       *
sample2  +++++++++++++                                                 result2       *
sample3  +++++++++++++                                                 result3       *
sample4  +++++++++++++                                                 result4       *

I decided to treat the stacked sample vectors as a 2D matrix of dimension 4 * N, the constant vector as 1 * N, and the result as a 2D matrix of dimension 4 * 1.

In my execution configuration, I define a grid of dimension dimGrid(1, 4), i.e. a grid of 4 blocks arranged vertically in 1D fashion. The block size is defined as dimBlock(1, N), i.e. a 1D array of N threads in every block, arranged horizontally.

Ques1: Can I do an operation where I take a 2D matrix and compare it to a 1D matrix, row-wise? I know that a simple strategy is to pass a single row of the 2D sample matrix to the kernel function and do this inside a loop, i.e.

for (i = 0; i < 4; i++)
{
	Compare<<<dimGrid, dimBlock>>>(Sample[i], ConstVector, Result[i]);
	/* one device-to-host copy per iteration */
	cudaMemcpy(host_result, Result[i], sizeof(int), cudaMemcpyDeviceToHost);
}

But I want to avoid this, as there is a transfer cost in each iteration, as shown above.

What I was thinking of is looping inside the kernel, but I am not sure whether I will have to use __syncthreads() or not.

Ques2: Now I would like each thread to process a whole array instead of a single array element, in order to use fewer threads and increase the load on each thread. How would I change my execution configuration to achieve this? [If it is possible to do so.]

Ques3: For a thread processing a single array element, I have seen examples as :

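__global__ void Compare(float* A, float* B, float* Result, int VectorLength)
{
	/* global index of the element handled by this thread */
	int idx = threadIdx.x + blockIdx.x * blockDim.x;

	if (idx < VectorLength)
	{
		/* compare as floats; storing the difference in an int would truncate it */
		float diff = A[idx] - B[idx];
		Result[idx] = (diff == 0.0f) ? 1 : 0;
	}
}
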

Can I change the above so that a single thread compares all elements in one go and then writes to Result?

Guidance appreciated.

Any clue… help!!

Transferring all data in one go is a good strategy, and yes, you can loop within the kernel. If you have no more than 32 rows, you could in principle do it like this:

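/* a: this thread's element of the constant vector; index walks down column idx of B */
int row, r = 0, a = A[idx], index = idx;

for (row = 0; row < rowMax; ++row, index += rowLength)
	r |= (a == B[index]) << row;	/* set bit 'row' if the elements match */

Result[idx] = r;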
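For context, here is a minimal sketch of how that loop might sit inside a complete kernel with a single transfer in each direction. The names CompareRowsPacked and comparePackedOnDevice, the parameter list, and the launch shape of 128 threads per block are my assumptions and are not part of the snippet above. As in the snippet, A is taken to be the constant vector and B the row-major sample matrix, the data is assumed to be int, and rowMax must not exceed 32. The sketch packs into an unsigned int rather than an int so that shifting into bit 31 is well defined.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// One thread per column idx. Bit `row` of Result[idx] is set when
// A[idx] equals B[row][idx]. Assumes rowMax <= 32.
__global__ void CompareRowsPacked(const int* A, const int* B,
                                  unsigned int* Result,
                                  int rowLength, int rowMax)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= rowLength) return;          // guard the last, partial block

    int a = A[idx];
    int index = idx;                       // walks down column idx of B
    unsigned int r = 0;
    for (int row = 0; row < rowMax; ++row, index += rowLength)
        r |= (unsigned int)(a == B[index]) << row;
    Result[idx] = r;
}

// Hypothetical host wrapper: copies all inputs once, launches once,
// copies the packed result back once. Returns 0 on success, 1 on error.
int comparePackedOnDevice(const int* hA, const int* hB,
                          unsigned int* hResult,
                          int rowLength, int rowMax)
{
    assert(rowLength > 0 && rowMax >= 1 && rowMax <= 32);

    int* dA = nullptr;
    int* dB = nullptr;
    unsigned int* dResult = nullptr;
    size_t rowBytes = (size_t)rowLength * sizeof(int);
    size_t resBytes = (size_t)rowLength * sizeof(unsigned int);

    cudaError_t err = cudaMalloc(&dA, rowBytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, rowBytes * rowMax);
    if (err == cudaSuccess) err = cudaMalloc(&dResult, resBytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dA, hA, rowBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dB, hB, rowBytes * rowMax, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;                                 // assumed block size
        int blocks = (rowLength + threads - 1) / threads;  // round up
        CompareRowsPacked<<<blocks, threads>>>(dA, dB, dResult,
                                               rowLength, rowMax);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)   // this blocking copy also waits for the kernel
        err = cudaMemcpy(hResult, dResult, resBytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dResult);
    return err == cudaSuccess ? 0 : 1;
}
```

No __syncthreads() is needed in this sketch, since each thread reads and writes only its own column and the threads never share intermediate data.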

Thank you very much for answering me. Where does the restriction of 32 come from?

In the actual problem I have 4000 such samples/rows and each row has length 361. Will it work?

Thank you for your time.

You are taking a crash course, learning C as well as CUDA simultaneously, yes? :)

The restriction with 32 is that my snippet sets one bit in ‘r’ for each row and there are only 32 bits in an integer.
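To make the bit packing concrete, here is a small sketch of how the host (or another kernel) might read the result back out of 'r'. The helper names rowMatched and unpackRows are hypothetical and only for illustration. Rows 0 to 31 are the only ones a 32-bit word can hold, which is exactly the restriction.

```cuda
#include <cassert>

// Hypothetical helper: returns 1 if the bit for `row` is set in a packed
// result word, else 0. Only valid for 0 <= row < 32.
__host__ __device__ inline int rowMatched(unsigned int packed, int row)
{
    return (int)((packed >> row) & 1u);
}

// Hypothetical host helper: expands one packed word into rowMax separate
// 0/1 chars, which may be easier to inspect in host code.
inline void unpackRows(unsigned int packed, int rowMax, char* out)
{
    assert(rowMax >= 0 && rowMax <= 32);   // a word holds at most 32 rows
    for (int row = 0; row < rowMax; ++row)
        out[row] = (char)rowMatched(packed, row);
}
```

For example, a packed value of 5 (binary 101) means rows 0 and 2 matched and row 1 did not.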

Your data set is one array A[361] which you wish to compare to another (2D) array B[4000][361] ?

  • Round the row length up to the nearest multiple of the warp size (A[384], B[4000][384], I think?), and just ignore the excess unused elements

  • Your result vector must be 2D as well: Result[4000/32][384]

  • Transfer all data to device.

  • Loop the kernel from the host, advancing 32 rows each time, or do an outer loop in the kernel that does the same.

  • Transfer back the result.
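The steps above could be sketched roughly as follows, using the "outer loop in the kernel" variant. Everything named here (roundUp, CompareAllRows, compareAllOnDevice, the 128-thread block size) is my assumption for illustration. The sketch assumes int data, pads on the device with zeros via cudaMemset and cudaMemcpy2D, and lays the result out as Result[numGroups][paddedLen] with one 32-bit word per group of 32 rows.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Rounds n up to the next multiple of m, e.g. 361 -> 384 for m = 32.
__host__ __device__ inline int roundUp(int n, int m)
{
    return ((n + m - 1) / m) * m;
}

// Assumed layout: A[paddedLen], B[numRows][paddedLen],
// Result[numGroups][paddedLen], numGroups = ceil(numRows / 32).
__global__ void CompareAllRows(const int* A, const int* B,
                               unsigned int* Result,
                               int paddedLen, int numRows)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= paddedLen) return;

    int a = A[idx];
    int numGroups = (numRows + 31) / 32;
    for (int g = 0; g < numGroups; ++g) {        // outer loop: 32 rows a time
        unsigned int r = 0;
        for (int bit = 0; bit < 32; ++bit) {
            int row = g * 32 + bit;
            if (row >= numRows) break;           // last group may be partial
            r |= (unsigned int)(a == B[(size_t)row * paddedLen + idx]) << bit;
        }
        Result[(size_t)g * paddedLen + idx] = r;
    }
}

// Hypothetical host wrapper. hA has len ints, hB has numRows * len ints
// (unpadded, row-major), hResult needs numGroups * paddedLen words.
// Returns 0 on success, 1 on error.
int compareAllOnDevice(const int* hA, const int* hB, unsigned int* hResult,
                       int len, int numRows)
{
    assert(len > 0 && numRows > 0);
    int paddedLen = roundUp(len, 32);
    int numGroups = (numRows + 31) / 32;
    size_t srcBytes = (size_t)len * sizeof(int);
    size_t padBytes = (size_t)paddedLen * sizeof(int);
    size_t resBytes = (size_t)numGroups * paddedLen * sizeof(unsigned int);

    int* dA = nullptr;
    int* dB = nullptr;
    unsigned int* dR = nullptr;
    cudaError_t err = cudaMalloc(&dA, padBytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, padBytes * numRows);
    if (err == cudaSuccess) err = cudaMalloc(&dR, resBytes);
    // Zero the buffers so the padding columns hold defined values.
    if (err == cudaSuccess) err = cudaMemset(dA, 0, padBytes);
    if (err == cudaSuccess) err = cudaMemset(dB, 0, padBytes * numRows);
    if (err == cudaSuccess)
        err = cudaMemcpy(dA, hA, srcBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)   // copy each row into its padded slot
        err = cudaMemcpy2D(dB, padBytes, hB, srcBytes, srcBytes, numRows,
                           cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;                       // assumed block size
        CompareAllRows<<<(paddedLen + threads - 1) / threads, threads>>>(
            dA, dB, dR, paddedLen, numRows);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hResult, dR, resBytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dR);
    return err == cudaSuccess ? 0 : 1;
}
```

With 4000 rows of length 361 this would give paddedLen = 384 and numGroups = 125. The padding columns 361 to 383 compare zero against zero, so their bits come back set and the host should simply ignore them.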

I see one problem though, and that is that you would probably like to have the result back in a format that can be more easily evaluated by host code - with less overhead on the poor programmer - like say an array of chars?

The restriction with 32 still applies. This is because in order to have efficient coalesced memory writes, the data must be 32 bits wide. An array of char4 would fit that description.

char4 r;

for ...
{
	r.x = a == B[index];
	r.y = a == B[index += rowLength];
	r.z = a == B[index += rowLength];
	r.w = a == B[index += rowLength];
	Result[idx] = r;
}

Homework: Figure out what the index of Result[idx] is supposed to be if you have more than 4 rows.

:D … Not in C but yes in CUDA.

:D :D … Yes, I surely will. Thank you, teacher!

I did a dry run of your snippet. I understood it but was wondering why 32. Initially I thought it had something to do with the warp size, and I overlooked the integer size (4 bytes).

Thank you very much for this. I shall, as your student, prove to be a good programmer :D