char * comparison

nachovall · February 16, 2010, 11:28am

Hi all,

I’m writing a CUDA program in which one kind of operation has to run many times. My current implementation works fine, but I wonder whether there is a faster way of doing it. Here is my case:

I have two char * buffers (‘a’ and ‘b’), each of size SCREEN_RES*numOfChars, so every pixel has numOfChars chars. I need to test each pixel of ‘a’ against all the pixels of ‘b’. By ‘test’ I mean a bitwise check that the chars of a pixel in ‘a’ are a subgroup of the chars of a pixel in ‘b’, i.e. every bit set in ‘a’ is also set in ‘b’. Here’s an example for one pixel:

numOfChars=2

a={‘1’,‘2’}

b={‘3’,‘7’}

In binary:

a={‘00000001’,‘00000010’}

b={‘00000011’,‘00000111’}

This should return true because ‘a’ is a subgroup of ‘b’. This is how I’m doing it:

__device__ bool isSubGroup(char * a, char * b, int aPos, int bPos)
{
	bool equals=true;
	uint i=0;
	while(equals && i<NUM_OF_CHARS)
	{
		equals=a[aPos+i]==(a[aPos+i]&b[bPos+i]);
		i++;
	}
	return equals;
}

Both a and b are in global memory. I can’t use shared memory (it’s a bit complicated to explain, but it is not an option).

I go through all the chars, ANDing each pair. My question is: is there a faster way of doing this? For instance, comparing all chars at once instead of using a while loop? Or maybe some other operation that is faster than this one?

I hope this is clear. Thank you very much.

kbam · February 17, 2010, 1:02am

If you can assign a different thread to each comparison " equals=a[aPos+i]==(a[aPos+i]&b[bPos+i]);" then all threads for a pixel would fetch data from a and b in parallel, and if numOfChars >= 16, some of the loads might be coalesced.

I think the __any or __all warp vote functions could be used to combine the results.

Alternatively, is it possible to make a and b char2 (or char3 or char4), so that you get 2 or more bytes transferred from global memory at a time?
" equals=a[aPos+i].x==(a[aPos+i].x&b[bPos+i].x);" and another expression for testing the .y …
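
A minimal sketch of the wider-load idea above, not code from this thread. It uses the fact that a==(a&b) is the same test as (a & ~b)==0, and checks four bytes per step instead of one. The name isSubGroupWide and the parameter n are hypothetical. The macro guard only lets the same source build with a plain host compiler as well as nvcc. Whether the copy really becomes a single 32-bit load depends on alignment and on the compiler, so this may or may not be faster in practice; declaring the buffers as aligned char4 or 32-bit words would make the wide load explicit.

```cpp
#include <stdint.h>
#include <string.h>

// Let the same source build with nvcc or with a plain host C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper (an assumption, not from the thread): returns true
// when every bit set in a[0..n) is also set in b[0..n).
__host__ __device__ inline bool isSubGroupWide(const unsigned char* a,
                                               const unsigned char* b,
                                               int n)
{
    int i = 0;
    // Main loop: compare four bytes at a time.
    for (; i + 4 <= n; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, 4);  // alignment-safe 4-byte read
        memcpy(&wb, b + i, 4);
        if (wa & ~wb) return false;  // a has a bit that b lacks
    }
    // Tail: remaining bytes when n is not a multiple of 4.
    for (; i < n; ++i) {
        if (a[i] & ~b[i]) return false;
    }
    return true;
}
```

With the arrays from the question it would be called as isSubGroupWide(a + aPos, b + bPos, NUM_OF_CHARS), after casting the char pointers to const unsigned char *.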

nachovall · February 17, 2010, 9:15am

The problem is that I then need a lot of threads per block, and that is bounded to 512. I used a 3-dimensional block (1,1,numOfChars) so that every thread in z ANDs a and b. It is not actually working yet, because I have to take care of the race condition between threads, but I think I can make it work. It is much faster!!! Thank you very much.

Although it is highly recommended, I do not respect the warp size when doing the same operations, so that’s not an option.

I’ve already tried it but it wasn’t fast enough.

Thank you very much for your help.
