Correct CUDA kernel invocation

Hello all,

I have recently started developing with CUDA and would appreciate some advice on the following subject:

STEP 1:

In host memory I have this array (the actual elements are unimportant):

const char a[] =
{
    'h','e','l','l','o','\0',
    'h','o','w','\0',
    'A','R','E','\0',
    'Y','O','U','\0'
};

The array above holds four strings (hello, how, ARE, YOU), each terminated by '\0'.

STEP 2:

The array is copied to CUDA memory

STEP 3:

The array is processed the following way:

Thread 1:

functionA(['h','e','l','l','o'])

functionA processes a part of the array, so each thread needs to operate on a chunk of array elements rather than on a single element.

Thread 2:

functionA(['h','o','w'])

Thread 3:

functionA(['A','R','E'])

…and so on.

So in general each CUDA thread runs functionA which operates on a small chunk of the array at a time.

My QUESTION is: how should I invoke the CUDA kernel? Should I use a block for every input that is processed by functionA?

Any help/ideas will be really appreciated.

Thank you all in advance.

Hi,

Let's say that the array of chars is in global memory, so each thread has access to every array element. The main problem, as I see it, is the varying length of the strings in the array. I can think of two scenarios that solve the problem:

  1. Each thread starts scanning from the beginning of the array. The string that a thread actually processes (with 'functionA') depends on its ID: it starts at the first non-'\0' char after ID occurrences of '\0'. This works, but it is not very efficient, because every thread has to rescan the array from the start.

  2. Pad the strings on the right with '\0' (see below) up to the length of the longest one, so that they all have the same length. Each thread then starts at the offset given by its thread ID times the string length (assuming thread IDs begin at 0). 'functionA' would then process, I suppose, until the first occurrence of '\0'.
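The first scenario could be sketched roughly as follows. This is only an illustration under some assumptions: the strings are non-empty and separated by exactly one '\0', the kernel name processByScanning and the host helper runByScanning are made up for this example, and functionA is a hypothetical stand-in that just returns the string length.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            std::fprintf(stderr, "CUDA error: %s\n",              \
                         cudaGetErrorString(err_));               \
            std::exit(EXIT_FAILURE);                              \
        }                                                         \
    } while (0)

// Hypothetical stand-in for the real functionA: returns the number of
// characters before the first '\0' (reading at most maxLen chars).
__device__ int functionA(const char *s, int maxLen)
{
    int n = 0;
    while (n < maxLen && s[n] != '\0') ++n;
    return n;
}

// Scenario 1: thread `id` skips `id` terminators, then processes the
// string that starts right after them.
__global__ void processByScanning(const char *data, int totalLen,
                                  int numStrings, int *out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= numStrings) return;

    int pos = 0, seen = 0;
    while (pos < totalLen && seen < id) {
        if (data[pos] == '\0') ++seen;
        ++pos;
    }
    out[id] = functionA(data + pos, totalLen - pos);
}

// Hypothetical host helper: copies the packed array to the device,
// launches one thread per string and returns the per-string results.
std::vector<int> runByScanning(const char *hostData, int totalLen,
                               int numStrings)
{
    assert(totalLen > 0 && numStrings > 0);
    char *dData = nullptr;
    int *dOut = nullptr;
    CUDA_CHECK(cudaMalloc(&dData, totalLen));
    CUDA_CHECK(cudaMalloc(&dOut, numStrings * sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dData, hostData, totalLen,
                          cudaMemcpyHostToDevice));

    const int threads = 128;
    const int blocks = (numStrings + threads - 1) / threads;
    processByScanning<<<blocks, threads>>>(dData, totalLen, numStrings, dOut);
    CUDA_CHECK(cudaGetLastError());

    std::vector<int> out(numStrings);
    CUDA_CHECK(cudaMemcpy(out.data(), dOut, numStrings * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dData));
    CUDA_CHECK(cudaFree(dOut));
    return out;
}
```

For the 18-char array from the question, runByScanning(a, 18, 4) would return the lengths 5, 3, 3, 3. The padded layout for the second scenario follows.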

const char a[] = {
    'h','e','l','l','o','\0',
    'h','o','w','\0','\0','\0',
    'A','R','E','\0','\0','\0',
    'Y','O','U','\0','\0','\0'
};

Of course, the length mentioned above needs to include the first '\0' (here 6 chars).

To sum up, when you call the kernel, pass that length to it as an additional parameter. In this case the number of threads equals the number of strings. For an example like this, one block would do:

functionA<<<1, 4>>>(6, a);
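A slightly fuller sketch of the padded approach might look like the code below. It is an illustration rather than a definitive implementation: here functionA is a hypothetical __device__ stand-in that returns the string length, it is wrapped in a kernel called processPadded (a made-up name), and the host helper runPadded is also invented for the example. Note that the pointer passed to the kernel has to be the device copy of the array, not the host array itself.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            std::fprintf(stderr, "CUDA error: %s\n",              \
                         cudaGetErrorString(err_));               \
            std::exit(EXIT_FAILURE);                              \
        }                                                         \
    } while (0)

// Hypothetical stand-in for the real functionA: returns the number of
// characters before the first '\0' (reading at most maxLen chars).
__device__ int functionA(const char *s, int maxLen)
{
    int n = 0;
    while (n < maxLen && s[n] != '\0') ++n;
    return n;
}

// Scenario 2: every string occupies `stride` chars (padding included),
// so thread `id` starts at offset id * stride.
__global__ void processPadded(int stride, const char *data,
                              int numStrings, int *out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < numStrings)
        out[id] = functionA(data + id * stride, stride);
}

// Hypothetical host helper: copies the padded array to the device,
// launches one thread per string and returns the per-string results.
std::vector<int> runPadded(const char *hostData, int stride, int numStrings)
{
    assert(stride > 0 && numStrings > 0);
    const size_t bytes = static_cast<size_t>(stride) * numStrings;
    char *dData = nullptr;
    int *dOut = nullptr;
    CUDA_CHECK(cudaMalloc(&dData, bytes));
    CUDA_CHECK(cudaMalloc(&dOut, numStrings * sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dData, hostData, bytes, cudaMemcpyHostToDevice));

    // Round the block count up so that larger inputs are covered too;
    // the bounds check in the kernel makes the surplus threads idle.
    const int threads = 128;
    const int blocks = (numStrings + threads - 1) / threads;
    processPadded<<<blocks, threads>>>(stride, dData, numStrings, dOut);
    CUDA_CHECK(cudaGetLastError());

    std::vector<int> out(numStrings);
    CUDA_CHECK(cudaMemcpy(out.data(), dOut, numStrings * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dData));
    CUDA_CHECK(cudaFree(dOut));
    return out;
}
```

With the padded 24-char array above, runPadded(a, 6, 4) would return the lengths 5, 3, 3, 3. For just four strings a <<<1, 4>>> launch is enough; the rounded-up block count only matters once there are more strings than fit comfortably in one block.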

Hope it helps,

MK

It has been really helpful, thanks!