CUDA: Is It Possible to Increase Speed?


Currently I am running a for loop from 1 to 1 million and inside that I am performing simple operations. So I have something like this:

int threadsPerBlock = 512;

int blocksPerGrid = (1000000 + threadsPerBlock - 1) / threadsPerBlock;

and then I call the kernel function like this:

Kernel<<<blocksPerGrid, threadsPerBlock>>>(function parameters);
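For reference, the launch above might be fleshed out as in the following minimal sketch. The kernel body, the `data` and `n` parameters, and the helper names `blocksFor` and `launchKernel` are assumptions made for illustration; the original post does not show them. The bounds check matters because the rounded-up grid contains more threads than elements.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the "simple operations" done in each loop iteration.
__global__ void Kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // the last block is only partly used
        data[i] = 2.0f * data[i] + 1.0f;
}

// Ceiling division: the smallest number of blocks that covers n elements.
inline int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Launches one thread per element; d_data must point to n floats in device memory.
cudaError_t launchKernel(float *d_data, int n)
{
    const int threadsPerBlock = 512;
    Kernel<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_data, n);
    return cudaGetLastError();  // reports an invalid launch configuration, if any
}
```

With n = 1000000 and 512 threads per block this gives 1954 blocks, so the last block has 448 threads that fail the `i < n` test and do nothing.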

I just wanted to know if there is a better way to do this. I just learned that blocks are not all run in parallel, so there will be around 2000 blocks and they will be executed sequentially inside the GPU. But I want all 1 million executions to happen in parallel, so that the complexity will be O(1). Can it be done? I can also change the structure of my code to something else if needed.

I am a beginner in CUDA :-)



Buy a cluster of 2000 CUDA cards ;)

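Each multiprocessor can process up to 8 blocks in parallel, and you have up to 30 multiprocessors in your GPU, which also work in parallel => up to 240 blocks run concurrently. The actual number depends on how many threads, registers, and how much shared memory each block uses.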
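The limits for a particular card and kernel can be queried at run time rather than assumed. The sketch below is an illustration only: the kernel body and the helper names `residentBlocks`, `waves`, and `queryResidentBlocks` are hypothetical, and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` requires a reasonably recent CUDA toolkit (6.5 or later).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; the occupancy query needs a concrete kernel to inspect.
__global__ void Kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = 2.0f * data[i] + 1.0f;
}

// Blocks that can be resident on the whole GPU at the same time.
inline int residentBlocks(int blocksPerSM, int smCount)
{
    return blocksPerSM * smCount;
}

// Rough number of "waves" needed to work through totalBlocks blocks.
inline int waves(int totalBlocks, int resident)
{
    return (totalBlocks + resident - 1) / resident;
}

// Asks the runtime how many blocks of Kernel fit on device 0 at once.
// Returns -1 if either query fails.
int queryResidentBlocks(int threadsPerBlock)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, Kernel, threadsPerBlock, 0) != cudaSuccess)
        return -1;

    return residentBlocks(blocksPerSM, prop.multiProcessorCount);
}
```

If the upper bound of 240 concurrent blocks were actually reached, a grid of 1954 blocks would take roughly `waves(1954, 240)`, that is about 9 rounds, rather than 1954 sequential steps. The hardware scheduler starts a new block as soon as an old one finishes, so "waves" is only an approximation.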

So what you are doing is correct in principle. However, there is an optimization for very large grids, called persistent threads, which might(!) improve performance in your case: …09hpg_paper.pdf
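The linked paper is not reproduced here. A simpler pattern in the same spirit is the grid-stride loop: launch only about as many blocks as the GPU can keep resident, and let each thread loop over several elements. The sketch below assumes the same hypothetical per-element operation as before; `KernelStrided`, `gridSize`, and `launchStrided` are illustrative names, and this is not necessarily the scheme the paper describes.

```cuda
#include <cuda_runtime.h>

// Each thread handles elements i, i + stride, i + 2 * stride, ... until it runs past n.
__global__ void KernelStrided(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = 2.0f * data[i] + 1.0f;  // hypothetical per-element work
}

// Grid size: enough blocks to fill the GPU, but never more than n elements need.
inline int gridSize(int smCount, int blocksPerSM, int n, int threadsPerBlock)
{
    int fill = smCount * blocksPerSM;
    int need = (n + threadsPerBlock - 1) / threadsPerBlock;
    int grid = fill < need ? fill : need;
    return grid > 0 ? grid : 1;
}

// Sizes the grid from the current device's properties and launches the kernel.
cudaError_t launchStrided(float *d_data, int n, int threadsPerBlock)
{
    int device = 0, smCount = 0, blocksPerSM = 0;

    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    err = cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    if (err != cudaSuccess) return err;

    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, KernelStrided, threadsPerBlock, 0);
    if (err != cudaSuccess) return err;

    int grid = gridSize(smCount, blocksPerSM, n, threadsPerBlock);
    KernelStrided<<<grid, threadsPerBlock>>>(d_data, n);
    return cudaGetLastError();
}
```

Whether this beats the one-thread-per-element launch depends on the kernel and the card, so it is worth timing both versions.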