Setting up grid/block layout

Hello. I'm just about ready to turn my C program into a CUDA program, but I'm not entirely sure what the layout should be.

Basically, the program is given a large array of states (each 16 bytes, and the smallest unit ever worked on) that are all independent and need the same function run on them. So if the warp size were 16 and I only had 16 states, I'd be done in one pass.

So I’m guessing that means I want a total of ‘warp size’ blocks and only 1 grid.

Sound about right?
Thanks

Since I need read-write access to the states, I think that leaves me with storing all the states in global memory. I'm guessing each block would copy a state from global memory into its own shared memory, work on it, and then copy it back.

Instead of copying a single state, each block might take several (I'm not sure about any sizes yet). That would mean each thread could take on a state.

Sounds like a plan or am I missing something here?

Thanks
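
The plan above (states in global memory, one thread per state, copy in, work, copy back) could be sketched roughly as follows. This is only a minimal sketch under assumptions: the `State` struct, the `updateState` function, and the kernel name `processStates` are hypothetical stand-ins, since the thread does not show the real state layout or the function applied to it. Because each state is touched by a single thread here, the sketch uses a thread-local copy rather than shared memory; shared memory would mainly help if several threads cooperated on one state.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical 16-byte state; the real layout is not given in the thread.
struct State {
    unsigned int w[4];
};

// Placeholder for the per-state function (an assumption, not the poster's code).
__host__ __device__ inline void updateState(State &s) {
    for (int i = 0; i < 4; ++i) s.w[i] = s.w[i] * 2u + 1u;
}

// One thread per state: read it from global memory into a thread-local copy,
// work on the copy, then write it back.
__global__ void processStates(State *states, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {               // guard: the last block may be only partly used
        State s = states[i];   // global memory -> thread-local copy
        updateState(s);
        states[i] = s;         // thread-local copy -> global memory
    }
}

int main() {
    const int n = 1000;        // arbitrary number of states for the sketch
    State *h = (State *)malloc(n * sizeof(State));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < 4; ++j) h[i].w[j] = (unsigned int)(i + j);

    State *d = nullptr;
    cudaMalloc(&d, n * sizeof(State));
    cudaMemcpy(d, h, n * sizeof(State), cudaMemcpyHostToDevice);

    const int threadsPerBlock = 128;  // assumed starting point, tune by benchmarking
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    processStates<<<blocks, threadsPerBlock>>>(d, n);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch failed: %s\n", cudaGetErrorString(err));

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h, d, n * sizeof(State), cudaMemcpyDeviceToHost);
    printf("state[0].w[0] = %u\n", h[0].w[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```

Note that there is always exactly one grid per kernel launch; what you choose is the number of blocks in it and the number of threads in each block.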

If you have a large array of ‘states’ and each ‘state’ can be processed independently, then you probably want as many threads running as practical (taking into account the roughly 5-second watchdog limit on kernel run time that applies when the GPU also drives a display, etc.).

First, you need to determine the block size, which depends on your kernel's resource usage (registers and shared memory). After that, set the grid size to something not too small and not too big; 256, 512, or 1024 blocks may be a good start, depending on the running time of your kernel. And benchmark your test runs, since a simple change in execution configuration may change overall performance considerably.
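
To make that advice concrete, here is a rough sketch of trying a few grid sizes and timing each with CUDA events. Everything specific in it is an assumption for illustration: the kernel `touch` is a placeholder, `blocksFor` is a hypothetical helper, and the block size of 256 is just a common starting value. The kernel uses a grid-stride loop so that any grid size still covers all `n` elements.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: blocks needed so that every one of n items gets its own thread.
__host__ __device__ inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// Placeholder kernel. The grid-stride loop lets a fixed-size grid cover all n items.
__global__ void touch(unsigned int *data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        data[i] += 1u;
    }
}

int main() {
    const int n = 1 << 20;            // arbitrary problem size for the sketch
    unsigned int *d = nullptr;
    cudaMalloc(&d, n * sizeof(unsigned int));
    cudaMemset(d, 0, n * sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threadsPerBlock = 256;  // assumed; depends on the kernel's resource usage
    const int gridSizes[] = {256, 512, 1024, blocksFor(n, threadsPerBlock)};

    for (int g : gridSizes) {
        cudaEventRecord(start);
        touch<<<g, threadsPerBlock>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("grid=%d block=%d: %.3f ms\n", g, threadsPerBlock, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

The first timed launch usually includes some one-time startup cost, so it is worth repeating each configuration a few times before comparing numbers.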

It all depends on what kind of work you do with these “states”. You could assign 1 state to 1 block and have 16 blocks process each state… That would be the best thing to do in my opinion…

Much depends on “What needs to be done to each state and how complex it is and how easily it lends itself to data-parallel computation”

btw,
The first step in converting a C program to a CUDA program is to rename it as a “.cu” file… he he he hee…