char * comparison

Hi all,

I’m writing a program in CUDA and there is one kind of operation that I have to perform a lot of times. Right now my implementation works fine, but I wonder if there is a faster way of doing it. Here is my case:

I have 2 char * buffers (‘a’ and ‘b’), each of size SCREEN_RES*numOfChars, so every pixel has numOfChars chars. I need to test each pixel of ‘a’ against all the pixels of ‘b’. By ‘test’ I mean a logical operation that checks whether the chars of a pixel in ‘a’ are a subgroup of the chars of a pixel in ‘b’, i.e. every bit set in ‘a’ is also set in ‘b’. Here’s an example for 1 pixel:

numOfChars=2

a={‘1’,‘2’}

b={‘3’,‘7’}

In binary:

a={‘00000001’,‘00000010’}

b={‘00000011’,‘00000111’}

This should return true because ‘a’ is a subgroup of ‘b’. The way I’m doing this is:

__device__ bool isSubGroup(char * a, char * b, int aPos, int bPos)
{
	bool equals=true;
	uint i=0;
	// Stop at the first char of a that has a bit not set in b.
	while(equals && i<NUM_OF_CHARS)
	{
		equals=a[aPos+i]==(a[aPos+i]&b[bPos+i]);
		i++;
	}
	return equals;
}

Here char * a and b are both in global memory. I can’t use shared memory. (It’s a bit complicated to explain, but it is not an option.)

I go through all the chars, ANDing each pair. My question is: is there a faster way of doing this? For instance, comparing all the chars at once instead of using a while loop? Or maybe some other operation that is faster than this one?

I hope this is clear. Thank you very much.

If you can assign a different thread to each comparison " equals=a[aPos+i]==(a[aPos+i]&b[bPos+i]);" then all the threads for a pixel would fetch their data from a and b in parallel, and if numOfChars >= 16 some of those loads might be coalesced.

I think the __any or __all warp vote functions could be used to combine the results.

Alternatively, is it possible to make a and b char2 (or char3 or char4) so that you get 2 or more bytes transferred from global memory at a time?
" equals=a[aPos+i].x==(a[aPos+i].x&b[bPos+i].x);" and another expression for testing the .y …
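Taking the "several bytes at a time" idea one step further, here is a rough sketch that tests 4 chars per operation by treating each pixel as 32-bit words. It is not the original poster's code: the name isSubGroupWords, the HD macro and the wordsPerPixel parameter are made up for illustration, and it assumes numOfChars is a multiple of 4 and that the buffers are allocated as (4-byte aligned) uint32_t arrays. The check a == (a & b) is equivalent to (a & ~b) == 0, so the leftover bits of all words can be ORed together and tested once at the end.

```cpp
#include <cassert>
#include <cstdint>

// Lets the same function compile as plain C++ or as CUDA device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical word-wise variant of isSubGroup.
// a, b          : pixel data stored as 32-bit words (4 chars per word)
// aPixel, bPixel: pixel indices (not char offsets)
// wordsPerPixel : numOfChars / 4, assumed to be a whole number
HD inline bool isSubGroupWords(const uint32_t* a, const uint32_t* b,
                               int aPixel, int bPixel, int wordsPerPixel)
{
    assert(wordsPerPixel > 0);

    const uint32_t* pa = a + aPixel * wordsPerPixel;
    const uint32_t* pb = b + bPixel * wordsPerPixel;

    // Collect every bit that is set in a but missing in b.
    uint32_t extra = 0;
    for (int i = 0; i < wordsPerPixel; ++i)
        extra |= pa[i] & ~pb[i];

    // a is a subgroup of b only if no such bit exists.
    return extra == 0;
}
```

This version has no early exit, so it always reads the whole pixel; whether that beats the original loop depends on how often the test fails early, which would have to be measured.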

The problem is that I would then need a lot of threads per block, and that is bounded at 512. I tried a 3-dimensional block (1,1,numOfChars) so that each thread along z ANDs one char of a and b. It isn't actually working yet, because I still have to take care of a race condition between the threads, but I think I can make it work. It is much faster!!! Thank you very much.

Although it is highly recommended, I don't keep the threads of a warp doing the same operations, so that’s not an option.

I’ve already tried it but it wasn’t fast enough.

Thank you very much for your help.