Hello. I'm just about ready to turn my C program into a CUDA program, but I'm not entirely sure what the layout should be.
Basically, the program is given a large array of states (each is 16 bytes and is the smallest unit ever worked on) that are all independent and need the same function run on them. So if the warp size were 16 and I only had 16 states, I'd be done in one pass.
So I’m guessing that means I want a total of ‘warp size’ blocks and only 1 grid.
Sound about right?
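
To make the question concrete, here is a minimal sketch of the kind of layout I have in mind: one thread per state, threads grouped into blocks, and a single grid covering the whole array. Everything named here is a placeholder and not my real code: the `State` struct is just a stand-in for my 16-byte state, `process_state` stands in for the real per-state function, `run_on_gpu` is a made-up host helper, and 256 threads per block is an arbitrary guess that would need tuning.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical 16-byte state; the real layout is whatever the C program uses.
struct State {
    uint32_t w[4];
};

// Placeholder for the real per-state function.
// Assumption: it reads and writes only its own state.
__device__ void process_state(State *s) {
    for (int i = 0; i < 4; ++i) {
        s->w[i] += 1u;
    }
}

// One thread handles one state. The bounds check covers the case where
// the last block has more threads than there are remaining states.
__global__ void process_all(State *states, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        process_state(&states[idx]);
    }
}

// Hypothetical host helper: copy the states to the GPU, launch one grid,
// copy the results back. Returns 0 on success, -1 on any CUDA error.
int run_on_gpu(State *host_states, int n) {
    if (n <= 0) {
        return 0;  // nothing to do; a zero-block launch would be an error
    }

    State *dev_states = NULL;
    size_t bytes = (size_t)n * sizeof(State);
    if (cudaMalloc((void **)&dev_states, bytes) != cudaSuccess) {
        return -1;
    }

    int status = -1;
    if (cudaMemcpy(dev_states, host_states, bytes,
                   cudaMemcpyHostToDevice) == cudaSuccess) {
        int threads_per_block = 256;  // arbitrary guess, a multiple of the warp size
        int blocks = (n + threads_per_block - 1) / threads_per_block;  // round up

        process_all<<<blocks, threads_per_block>>>(dev_states, n);

        // The device-to-host copy waits for the kernel to finish.
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(host_states, dev_states, bytes,
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            status = 0;
        }
    }

    cudaFree(dev_states);
    return status;
}
```

With this sketch I would call `run_on_gpu(states, n)` once from the existing C code, and the number of blocks would follow from the number of states rather than from the warp size. Is that the right way to think about it?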