Thread/block allocation as a function of data size


I have a one-dimensional array of size N, and I would like every element to be handled by its own thread. I am working on Tesla1.

  1. How can I allocate blocks/threads independently of the data size?
    kernel_function<<<1, N>>>(); ==> whatever N is, this runs 512 threads max.

Anything better than the following?

if (N <= 1024) {
	block = 1;
	threads = N;
} else {
	block = N/512;
	threads = 512;
}
  2. On an array of size N, I would like to work on elements 1… N-1, so in total I need N-1 threads.
    With N = 1024 and block = 2, what should “threads” be in the following:
    kernel_function<<<block, threads>>>();
  • If threads = 512, then I have array[N] that I don’t want to touch.
  • If threads = 511, then array[512] is skipped.


That is because there is a limit of 512 threads per block on all current hardware, including your “Tesla1”, whatever that might be…

How about:

unsigned int blocksize = 512;
unsigned int gridsize = ((N % blocksize == 0) ? 0 : 1) + (N / blocksize);
mykernel<<<gridsize, blocksize>>>();

How about just launching N threads and then having the kernel threads outside 1…N-1 just not do anything? I am sure you can see how to do that.
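A minimal sketch of that suggestion, in case it helps. The kernel name scale_kernel, the helper blocks_for and the wrapper scale_on_gpu are made up for illustration and are not from the poster's code; the doubling of each element is just a stand-in for the real work. The grid size is rounded up so every element gets a thread, and the surplus threads in the last block return immediately. Error checking on the CUDA calls is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: doubles every element of data[0..n-1].
// Threads whose global index falls past the end do nothing.
__global__ void scale_kernel(float *data, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;          // guard for the surplus threads in the last block
    data[i] *= 2.0f;
}

// Number of blocks needed to cover n elements (integer division, rounded up).
unsigned int blocks_for(unsigned int n, unsigned int blocksize)
{
    return (n + blocksize - 1) / blocksize;
}

// Hypothetical host wrapper: copies the data to the device, launches one
// thread per element, and copies the result back. No error checking (sketch).
std::vector<float> scale_on_gpu(std::vector<float> host, unsigned int blocksize = 512)
{
    unsigned int n = static_cast<unsigned int>(host.size());
    if (n == 0) return host;     // nothing to launch

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_kernel<<<blocks_for(n, blocksize), blocksize>>>(dev, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

To skip element 0 as well (the 1… N-1 case), the guard could become "if (i == 0 || i >= n) return;". Only the warps containing the first and last elements see any divergence.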


Thanks for the reply.

I will start with the second problem. In fact, there is a mistake in what I wrote: I would like to touch only elements from 0…N-1 (handling 1…N-1 is easier). Currently, I am doing “if (myid==N) return;” but unfortunately this type of program is not the best: every thread evaluates that if statement, even though it only matters for a single thread.

I know that there is a limit of 512 threads/block. The question was more like: is it possible to find a universal solution with CUDA? I would like to be able to write a program without getting stuck on the relation between data size and thread/block numbers. I knew that the CUDA programming model was limited, but I was not expecting that limitation. This means that programs are valid only for one specific architecture. Before sending this question, I thought I was missing something.

Thanks again!

I really don’t understand what restrictions you are talking about. CUDA treats hardware in the most abstract of ways, and it is possible to write code that deals with anything from 1 to 5126533565335 threads without requiring any modification whatsoever. The sort of branching you are worrying about has minimal effect on kernel throughput because there is negligible divergence. As long as every thread in a warp follows the same code path, instruction throughput remains very high and the overall effect on performance is minimal.

A single comparison once per thread isn’t really going to have a performance impact worth worrying about. It’ll be insignificant.

One way to make your algorithm independent of the number of threads or blocks is to have a loop inside each thread. For example, you could set it to always launch 30 blocks of 256 threads. This method is often faster than simply using one thread per element, although a dynamic number of blocks is usually favourable (and not incompatible with multiple elements per thread).
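Here is a rough sketch of the loop-inside-each-thread idea, using the fixed 30 x 256 launch from the example above. The names increment_strided and increment_on_gpu are hypothetical, and adding 1 to each element is only a placeholder for the real per-element work. Each thread starts at its global index and steps forward by the total number of launched threads, so the same launch covers any N. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: adds 1 to every element of data[0..n-1].
// Each thread handles elements i, i + stride, i + 2*stride, ...
__global__ void increment_strided(int *data, unsigned int n)
{
    unsigned int stride = blockDim.x * gridDim.x;   // total threads in the grid
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1;
}

// Hypothetical host wrapper: the launch configuration is fixed and does not
// depend on the data size. No error checking (sketch only).
std::vector<int> increment_on_gpu(std::vector<int> host)
{
    unsigned int n = static_cast<unsigned int>(host.size());
    if (n == 0) return host;     // nothing to do

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    increment_strided<<<30, 256>>>(dev, n);         // always 7680 threads

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With N smaller than 7680, the extra threads fail the loop condition on the first test and exit; with larger N, each thread simply makes more passes. The 30 and 256 are taken from the example in the post and would normally be tuned to the device.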

Thanks for the replies.

I am not that worried about the performance cost of that small if statement; it just seemed like a weird solution. I am not saying that the thread count or index is limited either. I only claim that the programming model is very limited because it is based on threads per block (and blocks per grid), so at the very beginning you have to partition your data accordingly.