How to run with a large number of blocks?

Hi guys.

I’m trying to launch a kernel with a large number of blocks,

e.g. 3628800 (=10!) blocks.

I know the maximum number of blocks is 65535.

I assumed that if a launch exceeds this limit, the remaining blocks would simply run later.

Is that wrong?

I also wrote some test code (using NVIDIA Parallel Nsight):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

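// Each block tests one number: px[tid] is prime if it has exactly two divisors.
__global__ void isPrimeNumber(int *px, int *py)
{
	int tid = blockIdx.x;	// one block per element, one thread per block
	int i, divisor = 0;
	int num = px[tid];

	// Count the divisors of num by brute force.
	for(i=1; i<=num; i++){
		if(num%i == 0) divisor++;
	}

	if(divisor == 2){
		py[tid] = 1; // prime number
	}else{
		py[tid] = 0; // not a prime number
	}
}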

cudaError_t getPrimeCount(int num, int count){
	int i;
	int *xs, *ys;
	xs = (int *)malloc(sizeof(int)*num);
	ys = (int *)malloc(sizeof(int)*num);

	// Start from NULL so the cleanup code is safe if an allocation fails.
	int *d_xs = NULL, *d_ys = NULL;
	cudaError_t cudaStatus = cudaMalloc((void**)&d_xs, sizeof(int)*num);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!\n");
		goto Error;
	}
	cudaStatus = cudaMalloc((void**)&d_ys, sizeof(int)*num);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!\n");
		goto Error;
	}

	for(i=0; i<num; i++){
		xs[i] = i;
		ys[i] = 0;
	}

	// CPU -> GPU
	cudaStatus = cudaMemcpy(d_xs, xs, sizeof(int)*num, cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!\n");
		goto Error;
	}

	// One block per number, one thread per block
	isPrimeNumber<<<num, 1>>>(d_xs, d_ys);

	// Report a failed launch (e.g. an invalid configuration) instead of ignoring it.
	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "isPrimeNumber launch failed: %s\n", cudaGetErrorString(cudaStatus));
		goto Error;
	}

	// GPU -> CPU
	cudaStatus = cudaMemcpy(ys, d_ys, sizeof(int)*num, cudaMemcpyDeviceToHost);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!\n");
		goto Error;
	}

	for(i=0; i<num; i++){
		if(ys[i] == 1){ // xs[i] is a prime number
			count++;
		}
	}
	printf("Count of Prime Number(<%d) = %d\n", num, count);

Error:
	// Release host and device memory on both the success and the error path.
	free(xs);
	free(ys);
	cudaFree(d_xs);
	cudaFree(d_ys);
	return cudaStatus;
}

int main()
{
	int num = 10000; // <- runs
	//int num = 70000; // does not run
	int count = 0;

	cudaError_t cudaStatus = getPrimeCount(num, count);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "getPrimeCount failed!\n");
		return 1;
	}

	cudaStatus = cudaDeviceReset();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceReset failed!\n");
		return 1;
	}

	return 0;
}

This test code counts the prime numbers below “num”.

When the variable “num” in the main function is 10000, the code runs.

But when “num” is 70000, it does not run.

I don’t know what is wrong :-(

Thanks for your time, any help will be highly appreciated.

//CPU -> GPU

    isPrimeNumber<<<num, 1>>>(d_xs, d_ys);

num is an int, but the launch configuration between the <<< >>> accepts a dim3 for the first two parameters,

if you pass a dim3 variable you can have up to 65535*65535 blocks

so replace

int num = 10000;

with

dim3 num(10000,100);

num.x will now be 10000, and num.y will be 100

NB: the maximum value you can set num.x and num.y to is 64k-1 (65535).
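A minimal sketch of how a 2D grid could be used in practice. The names makeGrid2D, markElement and countVisited are hypothetical and not from the original code. Two details need care: the kernel has to flatten blockIdx.x and blockIdx.y into one index and skip surplus blocks, and the element count should stay a plain int separate from the dim3, because it is also used for the allocation sizes. The kernel here only marks each element as visited; the prime test would go in its place.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical helper: choose a 2D grid with at least n blocks,
// keeping each dimension within the 65535 limit discussed above.
static dim3 makeGrid2D(int n)
{
    const int maxDim = 65535;
    int gx = (n < maxDim) ? n : maxDim;
    if (gx < 1) gx = 1;
    int gy = (n + gx - 1) / gx;   // ceiling division
    if (gy < 1) gy = 1;
    return dim3(gx, gy);
}

// One block per element, as in the original code, but indexed in 2D.
__global__ void markElement(int *flags, int n)
{
    unsigned int tid = blockIdx.y * gridDim.x + blockIdx.x;  // flattened block index
    if (tid < (unsigned int)n) flags[tid] = 1;               // surplus blocks do nothing
}

// Hypothetical driver: returns how many of the n elements were visited,
// or -1 on a CUDA error.
int countVisited(int n)
{
    size_t bytes = sizeof(int) * n;
    int *d_flags = NULL;
    if (cudaMalloc((void **)&d_flags, bytes) != cudaSuccess) return -1;
    if (cudaMemset(d_flags, 0, bytes) != cudaSuccess) { cudaFree(d_flags); return -1; }

    markElement<<<makeGrid2D(n), 1>>>(d_flags, n);
    if (cudaGetLastError() != cudaSuccess) { cudaFree(d_flags); return -1; }

    int *flags = (int *)malloc(bytes);
    int visited = -1;
    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(flags, d_flags, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
        visited = 0;
        for (int i = 0; i < n; i++) visited += flags[i];
    }
    free(flags);
    cudaFree(d_flags);
    return visited;
}
```

For n = 3628800 this picks a 65535 x 56 grid, so every element gets its own block. As far as I know, devices of compute capability 3.0 and later accept up to 2^31-1 blocks in the x dimension, so a 1D grid may already be enough there.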

=======

However, you are also creating blocks that have only 1 thread each. Ideally, to make the best use of a GPU, you should have N*32 threads per block (where N is an integer).
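The limits mentioned here (warp size, threads per block, grid dimensions) differ between devices and can be read at run time with cudaGetDeviceProperties. The sketch below assumes device 0; roundUpToWarp and printLaunchLimits are hypothetical helper names.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: round a requested thread count up to a multiple
// of the warp size, so no warp is only partly filled.
int roundUpToWarp(int threads, int warpSize)
{
    return ((threads + warpSize - 1) / warpSize) * warpSize;
}

// Hypothetical helper: print the launch limits of device 0.
cudaError_t printLaunchLimits(void)
{
    cudaDeviceProp prop;
    cudaError_t status = cudaGetDeviceProperties(&prop, 0);
    if (status != cudaSuccess) return status;

    printf("warp size:             %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return cudaSuccess;
}
```

Calling printLaunchLimits() once at startup shows whether the 65535 limit applies to the x dimension on a given card.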

Hi kbam

Thank you for your reply.

dim3 num(10000,100);
NB maximum value you can set num.x and num.y to is 64k-1

I see.
If num is over 64k-1, how should I code it?

Um…
I’m Japanese, so I can’t express myself very well…

Do you understand what I mean?

I want to run a large number of threads,
e.g. 3628800 (=10!) threads.

So, how should I code in this case?

Very respectfully yours.

It doesn’t sound like a good idea to start with so many threads. Actually, it’s a very bad idea that was wrongly popularized a few years ago when GPGPU just started.

Advice: let a single thread do the work of perhaps 1000 threads. Or just do multiple launches of the same kernel, each time asking the threads (blocks) to process a different set of data.
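One common way to let each thread handle many elements is a grid-stride loop. The sketch below keeps the brute-force divisor test from the original kernel and adds an element count n. The names isPrimeStrided and countPrimesStrided, and the launch size of 64 blocks of 256 threads, are assumptions chosen only for illustration.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Same brute-force test as the original kernel, but each thread walks over
// many elements with a grid-stride loop instead of handling just one.
__global__ void isPrimeStrided(const int *px, int *py, int n)
{
    int stride = blockDim.x * gridDim.x;   // total number of threads launched
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n; idx += stride) {
        int num = px[idx];
        int divisor = 0;
        for (int i = 1; i <= num; i++) {
            if (num % i == 0) divisor++;
        }
        py[idx] = (divisor == 2) ? 1 : 0;
    }
}

// Hypothetical driver: counts the primes among 0..n-1,
// or returns -1 on a CUDA error.
int countPrimesStrided(int n)
{
    size_t bytes = sizeof(int) * n;
    int *xs = (int *)malloc(bytes);
    int *ys = (int *)malloc(bytes);
    int *d_xs = NULL, *d_ys = NULL;
    int count = -1;

    for (int i = 0; i < n; i++) xs[i] = i;

    if (cudaMalloc((void **)&d_xs, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_ys, bytes) == cudaSuccess &&
        cudaMemcpy(d_xs, xs, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        // Fixed launch size (an arbitrary choice); it does not grow with n.
        isPrimeStrided<<<64, 256>>>(d_xs, d_ys, n);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(ys, d_ys, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            count = 0;
            for (int i = 0; i < n; i++) count += ys[i];
        }
    }

    cudaFree(d_xs);
    cudaFree(d_ys);
    free(xs);
    free(ys);
    return count;
}
```

Because the grid size no longer depends on n, the same launch works for 10000 elements and for 3628800.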

You are using

int tid =  blockIdx.x;

with only 1 thread per block. Use the following instead:

int tid = blockIdx.x * blockDim.x + threadIdx.x;

you can now replace

isPrimeNumber<<<num, 1>>>(d_xs, d_ys);

with say

isPrimeNumber<<<14175, 256>>>(d_xs, d_ys);
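14175 * 256 is exactly 3628800, so that launch has no surplus threads. For an element count that is not a multiple of the block size, the usual pattern is to round the block count up and add a bounds check in the kernel. A sketch, with hypothetical names blocksFor, isPrimeChecked and launchIsPrime, and an extra parameter n that the original kernel does not have:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
}

// The original brute-force test, one element per thread, with a bounds check.
__global__ void isPrimeChecked(const int *px, int *py, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;   // threads past the end of the arrays do nothing

    int num = px[tid];
    int divisor = 0;
    for (int i = 1; i <= num; i++) {
        if (num % i == 0) divisor++;
    }
    py[tid] = (divisor == 2) ? 1 : 0;
}

// Hypothetical wrapper: d_xs and d_ys are device arrays of n ints.
cudaError_t launchIsPrime(const int *d_xs, int *d_ys, int n)
{
    const int threadsPerBlock = 256;
    isPrimeChecked<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_xs, d_ys, n);
    return cudaGetLastError();   // reports e.g. an invalid launch configuration
}
```

With 256 threads per block, 70000 elements need only 274 blocks, well under the 65535 limit.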

You will find more examples in the “CUDA C Programming Guide” and other materials you can download from the NVIDIA site.

NB: There are a few other properties of prime numbers that you could use to speed up your code.
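Two such properties, as an illustration: a composite number always has a divisor no larger than its square root, and no even number above 2 is prime. The sketch below uses both; the names isPrime and isPrimeFast and the parameter n are hypothetical, and this is only one of several possible speed-ups.

```cuda
#include <cuda_runtime.h>

// Trial division that stops at sqrt(n) and skips even divisors.
// Callable from both host and device code.
__host__ __device__ int isPrime(int n)
{
    if (n < 2) return 0;
    if (n < 4) return 1;            // 2 and 3
    if (n % 2 == 0) return 0;
    // long long avoids overflow in d * d for large n.
    for (long long d = 3; d * d <= n; d += 2) {
        if (n % d == 0) return 0;
    }
    return 1;
}

// One element per thread, with a bounds check on the element count n.
__global__ void isPrimeFast(const int *px, int *py, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) py[tid] = isPrime(px[tid]);
}
```

For a number near 3628800 this needs roughly 950 divisions instead of about 3.6 million.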

Hi hyqneuron.
Thank you for your help.
And sorry for the late reply.

Oh…
I have done multiple launches of the same kernel.
I will study CUDA more :-)
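For reference, a minimal sketch of the multiple-launch approach, assuming one block per element as in the original code and at most 65535 blocks per launch. The names markChunk and visitInChunks are hypothetical, and the kernel only marks elements as visited; the real work would go in its place.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// One block per element, as in the original kernel; offset selects the chunk.
__global__ void markChunk(int *flags, int offset)
{
    flags[offset + blockIdx.x] = 1;
}

// Hypothetical driver: covers n elements with several launches of at most
// 65535 blocks each. Returns the number of elements visited, or -1 on error.
int visitInChunks(int n)
{
    const int maxBlocks = 65535;
    size_t bytes = sizeof(int) * n;
    int *d_flags = NULL;
    if (cudaMalloc((void **)&d_flags, bytes) != cudaSuccess) return -1;
    if (cudaMemset(d_flags, 0, bytes) != cudaSuccess) { cudaFree(d_flags); return -1; }

    for (int offset = 0; offset < n; offset += maxBlocks) {
        int remaining = n - offset;
        int blocks = (remaining < maxBlocks) ? remaining : maxBlocks;
        markChunk<<<blocks, 1>>>(d_flags, offset);
        if (cudaGetLastError() != cudaSuccess) { cudaFree(d_flags); return -1; }
    }

    int *flags = (int *)malloc(bytes);
    int visited = -1;
    // cudaMemcpy waits for all queued launches to finish before copying.
    if (cudaMemcpy(flags, d_flags, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
        visited = 0;
        for (int i = 0; i < n; i++) visited += flags[i];
    }
    free(flags);
    cudaFree(d_flags);
    return visited;
}
```

Each launch only ever sees a valid grid size, so no bounds check is needed inside the kernel.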

Hi kbam.
I’m so sorry for the late reply.

Your comment was quite an education for me :)
Thank you!