I currently have a for loop that runs 1 million iterations, each performing a few simple operations, and I am moving it to a CUDA kernel. My launch configuration looks like this:
int N = 1000000;
int threadsPerBlock = 512;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
and then I call the kernel function like this:
Kernel<<<blocksPerGrid, threadsPerBlock>>>(/* function parameters */);
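To make the numbers concrete, here is a small host-only sketch of the launch arithmetic (plain C++, no GPU needed). The helper names `blocksFor` and `surplusThreads` are my own placeholders, not part of CUDA. As far as I understand, the rounding up means slightly more threads are launched than there are iterations, so the kernel would need a bounds check on the thread index.

```cpp
#include <cstdio>

// Hypothetical helper (not a CUDA API): number of blocks needed so that
// blocks * threadsPerBlock >= n, using integer ceiling division.
constexpr int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: threads launched beyond n because of the rounding up.
// These threads have no iteration to work on and should return early.
constexpr int surplusThreads(int n, int threadsPerBlock) {
    return blocksFor(n, threadsPerBlock) * threadsPerBlock - n;
}

// Prints the launch configuration for a given problem size.
inline void printLaunchConfig(int n, int threadsPerBlock) {
    std::printf("blocks = %d, surplus threads = %d\n",
                blocksFor(n, threadsPerBlock),
                surplusThreads(n, threadsPerBlock));
}
```

Calling `printLaunchConfig(1000000, 512)` shows the exact block count for my case, which is where my "around 2000 blocks" figure comes from.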
Is there a better way to do this? I just learned that blocks are not all run in parallel, so as I understand it the roughly 2000 blocks will be executed sequentially inside the GPU. What I want is for all 1 million iterations to happen in parallel, so that the running time is effectively O(1). Can this be done? I can restructure my code if needed.
I am a beginner in CUDA :-)