I am exploring task parallelism with CUDA. I want to run 32 independent tasks in parallel on the GPU, mapping each task to one thread block. Each task runs about 20 iterations and has its own memory space.
In my program, a memory (work) space is a class holding several int array pointers. The CUDA kernel therefore needs to see 32 class pointers, one per task's memory space.
I understand that the host has to pass the 32 class pointers to the kernel. But when I try to pass an array of 32 class pointers (a pointer to pointers) into the kernel, many issues come up.
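For reference, here is a minimal sketch of the pattern I have in mind (the `WorkSpace` class name, array members, and sizes are placeholders for my actual code). Because the class contains device pointers, each object is staged on the host with `cudaMalloc`-ed members, copied to the device, and then the array of device object pointers is itself copied over:

```cuda
#include <cuda_runtime.h>

// Placeholder per-task work space: two int arrays of some fixed size.
struct WorkSpace {
    int *a;   // device pointer
    int *b;   // device pointer
    int  n;   // element count of each array
};

__global__ void taskKernel(WorkSpace **spaces) {
    WorkSpace *ws = spaces[blockIdx.x];    // one work space per block/task
    for (int i = threadIdx.x; i < ws->n; i += blockDim.x)
        ws->a[i] = ws->b[i] + blockIdx.x;  // stand-in for real per-task work
}

int main() {
    const int numTasks = 32, n = 1024;

    WorkSpace *h_spaces[numTasks];         // host array of device object pointers
    for (int t = 0; t < numTasks; ++t) {
        WorkSpace h_ws;                    // staging copy on the host
        h_ws.n = n;
        cudaMalloc(&h_ws.a, n * sizeof(int));
        cudaMalloc(&h_ws.b, n * sizeof(int));
        cudaMemset(h_ws.b, 0, n * sizeof(int));

        // Copy the object (which holds device pointers) to the device.
        cudaMalloc(&h_spaces[t], sizeof(WorkSpace));
        cudaMemcpy(h_spaces[t], &h_ws, sizeof(WorkSpace),
                   cudaMemcpyHostToDevice);
    }

    WorkSpace **d_spaces;                  // device array of device pointers
    cudaMalloc(&d_spaces, numTasks * sizeof(WorkSpace *));
    cudaMemcpy(d_spaces, h_spaces, numTasks * sizeof(WorkSpace *),
               cudaMemcpyHostToDevice);

    taskKernel<<<numTasks, 128>>>(d_spaces);
    cudaDeviceSynchronize();

    // Cleanup of the per-task allocations omitted for brevity.
    return 0;
}
```

The part that trips me up is the double indirection: every pointer the kernel dereferences (`spaces`, each `spaces[i]`, and the arrays inside) has to be a device pointer, so the objects cannot simply be copied from host structs that still hold host pointers.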
I am not sure whether this is a reasonable way to exploit task parallelism in CUDA. Does anyone have related experience or comments? Thanks.