How to run with a large number of blocks?

goemonpiano · May 26, 2011, 5:35pm

Hi guys.

I’m trying to launch a kernel with a large number of blocks,

e.g. 3628800 (= 10!) blocks.

I know the maximum number of blocks is 65535.

I assumed that if a launch exceeds this limit, the excess blocks are simply run later.

Is that wrong?

Also,

I wrote some test code (using NVIDIA Parallel Nsight):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// One block per number: block tid tests whether px[tid] is prime.
__global__ void isPrimeNumber(int *px, int *py)
{
    int tid = blockIdx.x;
    int i, divisor = 0;
    int num = px[tid];

    // Count the divisors of num; a prime has exactly two (1 and itself).
    for(i=1; i<=num; i++){
        if(num%i == 0) divisor++;
    }
    if(divisor == 2){
        py[tid] = 1; //prime number
    }else{
        py[tid] = 0; //not prime number
    }
}

cudaError_t getPrimeCount(int num, int count){
    int *xs, *ys;
    xs = (int *)malloc(sizeof(int)*num);
    ys = (int *)malloc(sizeof(int)*num);

    // Start from NULL so the cleanup code is safe on every error path.
    int *d_xs = NULL, *d_ys = NULL;
    cudaError_t cudaStatus = cudaMalloc((void**)&d_xs, sizeof(int)*num);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }
    cudaStatus = cudaMalloc((void**)&d_ys, sizeof(int)*num);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    int i;
    for(i=0; i<num; i++){
        xs[i] = i;
        ys[i] = 0;
    }

    //CPU -> GPU
    cudaStatus = cudaMemcpy (d_xs, xs, sizeof(int)*num, cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // num blocks, 1 thread per block
    isPrimeNumber<<<num, 1>>>(d_xs, d_ys);

    //GPU -> CPU
    cudaStatus = cudaMemcpy (ys , d_ys , sizeof(int)*num, cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    for(i=0; i<num; i++){
        if(ys[i] == 1){ //xs[i] is prime number
            count++;
        }
    }
    printf("Count of Prime Number(<%d) = %d\n",num ,count);

Error:
    // Release host and device buffers on both the success and error paths.
    free(xs);
    free(ys);
    cudaFree(d_xs);
    cudaFree(d_ys);
    return cudaStatus;
}

int main()
{
    int num = 10000; // <- runs
    //int num = 70000; // does not run
    int count = 0;

    cudaError_t cudaStatus = getPrimeCount(num, count);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "getPrimeCount failed!");
        return 1;
    }

    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    return 0;
}

This test code counts the prime numbers below “num”.

When the variable “num” in the main function is 10000, the code runs.

But when “num” is 70000, the code does not run.

I don’t know what is wrong :-(

Thanks for your time, any help will be highly appreciated.

kbam · May 26, 2011, 11:16pm

//CPU -> GPU

    isPrimeNumber<<<num, 1>>>(d_xs, d_ys);

num is an int, but the launch configuration between <<< >>> accepts a dim3 for each of its first two parameters.

If you pass a dim3 variable as the grid size, you can have up to 65535*65535 blocks,

so replace

int num = 10000;

with

dim3 num(10000,100);

num.x will now be 10000 and num.y will be 100, i.e. 10000*100 = 1000000 blocks in total.

NB: the maximum value you can set num.x and num.y to is 64K-1 (65535).
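
A minimal sketch of how a kernel might use such a 2D grid. It assumes the element count is kept in a separate int (called n here) instead of being replaced by the dim3, and that the kernel combines blockIdx.x and blockIdx.y into one index. The names fillIndex2D, gridX and gridY and the 10000 x 363 grid shape are illustrative choices, not something from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per block, blocks laid out in a 2D grid.
__global__ void fillIndex2D(int *out, int n)
{
    // Flatten (blockIdx.x, blockIdx.y) into a single linear index.
    int tid = blockIdx.y * gridDim.x + blockIdx.x;
    if (tid < n) {          // the grid may hold slightly more blocks than n
        out[tid] = tid;
    }
}

int main()
{
    const int n = 3628800;                      // 10! elements
    const int gridX = 10000;                    // stays below the 65535 limit
    const int gridY = (n + gridX - 1) / gridX;  // round up so gridX * gridY >= n
    dim3 grid(gridX, gridY);

    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, sizeof(int) * n) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fillIndex2D<<<grid, 1>>>(d_out, n);

    // A rejected launch configuration shows up here, not at the launch line.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Copy back only the last element as a quick sanity check.
    int last = -1;
    cudaMemcpy(&last, d_out + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("last element = %d\n", last);

    cudaFree(d_out);
    return 0;
}
```

The bounds check matters because the rounded-up grid contains more blocks than there are elements. Checking cudaGetLastError() right after a launch is also a way to see whether a launch such as <<<70000, 1>>> was rejected on a device with a 65535 limit.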

=======

However, you are also creating blocks that have only 1 thread each. Ideally, to make best use of a GPU, you should have N*32 threads per block (where N is an integer), since threads are executed in warps of 32.

goemonpiano · May 28, 2011, 4:31pm

Hi kbam

Thank you for your reply.

dim3 num(10000,100);
NB maximum value you can set num.x and num.y to is 64k-1

I see.
If num is over 64K-1, how should I write the code?

Um…
I’m Japanese, so I can’t explain this very well…

Do you understand what I mean?

I want to run the code with a large number of threads,
e.g. 3628800 (= 10!) threads.

So how should I write the code in this case?

Very respectfully yours.

hyqneuron · May 28, 2011, 4:58pm

It doesn’t sound like a good idea to start so many threads. Actually, it’s a very bad idea that was wrongly popularized a few years ago, when GPGPU was just getting started.

Advice: let a single thread do the work of perhaps 1000 threads. Or do multiple launches of the same kernel, each time asking the threads (blocks) to process a different set of data.
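
One common way to let each thread handle many elements is a grid-stride loop. The sketch below is only an illustration of that idea, not necessarily what the poster had in mind. It reuses the divisor-counting test from the original kernel; the kernel name isPrimeNumberStrided, the extra parameter n, the problem size 100000 and the <<<64, 256>>> launch are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same divisor-counting test as the original kernel, but each thread
// walks over several elements instead of exactly one.
__global__ void isPrimeNumberStrided(const int *px, int *py, int n)
{
    int stride = gridDim.x * blockDim.x;   // total number of threads launched
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n; idx += stride) {
        int num = px[idx];
        int divisor = 0;
        for (int i = 1; i <= num; i++) {
            if (num % i == 0) divisor++;
        }
        py[idx] = (divisor == 2) ? 1 : 0;
    }
}

int main()
{
    const int n = 100000;   // illustrative size: more elements than threads
    const size_t bytes = sizeof(int) * n;

    int *xs = (int *)malloc(bytes);
    int *ys = (int *)malloc(bytes);
    if (xs == NULL || ys == NULL) return 1;
    for (int i = 0; i < n; i++) xs[i] = i;

    int *d_xs = NULL, *d_ys = NULL;
    if (cudaMalloc((void **)&d_xs, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_ys, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_xs, xs, bytes, cudaMemcpyHostToDevice);

    // 64 blocks of 256 threads = 16384 threads covering 100000 elements.
    isPrimeNumberStrided<<<64, 256>>>(d_xs, d_ys, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(ys, d_ys, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int count = 0;
    for (int i = 0; i < n; i++) count += ys[i];
    printf("primes below %d: %d\n", n, count);

    cudaFree(d_xs);
    cudaFree(d_ys);
    free(xs);
    free(ys);
    return 0;
}
```

With this pattern the grid size no longer has to match the data size, so the same launch configuration works for any n that fits in memory.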

kbam · May 30, 2011, 12:46am

You are using

int tid =  blockIdx.x;

with only 1 thread per block. Use the following instead:

int tid = blockIdx.x * blockDim.x + threadIdx.x;

You can now replace

isPrimeNumber<<<num, 1>>>(d_xs, d_ys);

with, say, 14175 blocks of 256 threads (14175*256 = 3628800 threads in total):

isPrimeNumber<<<14175, 256>>>(d_xs, d_ys);
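
When the element count is not an exact multiple of the block size, the grid has to be rounded up and the kernel needs a bounds check. The following is a minimal sketch under a few assumptions: the kernel name markPrimes and the parameter n are hypothetical, the kernel tests the index itself (the original fills px[i] = i, so the input array is dropped here as a simplification), and the device limit is queried from device 0 instead of being hard-coded.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simplified kernel: thread tid tests whether tid itself is prime.
__global__ void markPrimes(int *py, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;            // the last block may have spare threads

    int divisor = 0;
    for (int i = 1; i <= tid; i++) {
        if (tid % i == 0) divisor++;
    }
    py[tid] = (divisor == 2) ? 1 : 0;
}

int main()
{
    const int n = 70000;                             // the size that failed with <<<num, 1>>>
    const int threads = 256;                         // a multiple of 32
    const int blocks = (n + threads - 1) / threads;  // ceiling division

    // Ask the device for its grid limit in x instead of assuming 65535.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    if (blocks > prop.maxGridSize[0]) {
        fprintf(stderr, "grid too large: %d blocks, limit %d\n",
                blocks, prop.maxGridSize[0]);
        return 1;
    }

    int *ys = (int *)malloc(sizeof(int) * n);
    int *d_ys = NULL;
    if (ys == NULL || cudaMalloc((void **)&d_ys, sizeof(int) * n) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    markPrimes<<<blocks, threads>>>(d_ys, n);
    cudaError_t err = cudaGetLastError();            // catches a rejected launch
    if (err == cudaSuccess)
        err = cudaMemcpy(ys, d_ys, sizeof(int) * n, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int count = 0;
    for (int i = 0; i < n; i++) count += ys[i];
    printf("primes below %d: %d\n", n, count);

    cudaFree(d_ys);
    free(ys);
    return 0;
}
```

For n = 70000 and 256 threads per block this launches 274 blocks, well below the 64K-1 limit discussed in this thread.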

You will find more examples in the “CUDA C Programming Guide” and other materials you can download from the NVIDIA site.

NB: There are a few other properties of prime numbers that you could use to speed up your code.
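
The post does not say which properties are meant. One widely used possibility is that a composite number always has a divisor no larger than its square root, and that 2 is the only even prime. The sketch below assumes that reading; the names isPrimeFast and markPrimesFast are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trial division that stops at sqrt(n) and skips even divisors.
__host__ __device__ bool isPrimeFast(int n)
{
    if (n < 2) return false;
    if (n < 4) return true;                  // 2 and 3
    if (n % 2 == 0) return false;
    for (int d = 3; d <= n / d; d += 2) {    // d <= n / d avoids overflow in d * d
        if (n % d == 0) return false;
    }
    return true;
}

// Thread tid tests whether tid itself is prime.
__global__ void markPrimesFast(int *py, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) py[tid] = isPrimeFast(tid) ? 1 : 0;
}

int main()
{
    const int n = 100;
    int ys[n];
    int *d_ys = NULL;
    if (cudaMalloc((void **)&d_ys, sizeof(ys)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    markPrimesFast<<<1, 128>>>(d_ys, n);     // 128 threads cover the 100 elements
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(ys, d_ys, sizeof(ys), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_ys);
        return 1;
    }

    int count = 0;
    for (int i = 0; i < n; i++) count += ys[i];
    printf("primes below %d: %d\n", n, count);   // there are 25 primes below 100

    cudaFree(d_ys);
    return 0;
}
```

Compared with counting every divisor from 1 to num, the inner loop here runs roughly sqrt(num)/2 times at most, which makes a large difference for numbers in the millions.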

goemonpiano · June 1, 2011, 7:02pm

Hi hyqneuron.
Thank you for your help.
And sorry for the late reply.

Oh…
I have done multiple launches of the same kernel.
I will study CUDA more :-)

goemonpiano · June 1, 2011, 7:04pm

Hi kbam.
I’m so sorry for the late reply.

Your comment was quite an education for me :)
Thank you!
