Timing problems with a big grid

Hello, my name is Asaf and I'm new to CUDA.
I have a GeForce GTX 1080, I'm using Visual Studio, and I'm trying to run this code:

__global__ void test(int *x) { }


typedef unsigned long long ull;


  • Host code
    int main()

    int *x;

    dim3 dimBlock(8, 32,4);
    dim3 dimGrid(8000, 3, 51615);


    int *y = (int *)malloc(sizeof(int));



The code does nothing special and is only for testing.
As you can see, the kernel is empty, and yet the compiling takes too much time (around 12 sec).
When I changed the grid to “dim3 dimGrid(1,1,1)” it ran very fast.

I would like to know a few things:

  1. As far as I can see, I'm within the grid and block limits (1024 threads per block, and the grid can have
    around 65,000 blocks in each dimension), so what is the problem? Is there any difference between grid(1,1,1) or
  2. How do I calculate the right limits for best performance?
  3. If more than one GPU is needed, how do I calculate how many GPUs I need for best performance?
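
On the question of limits, one way to see what a particular card allows is to ask the runtime. The following is only a minimal sketch: it assumes the GTX 1080 is device 0, and `max_resident_threads` is a hypothetical helper added here for illustration, not something from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads resident on the whole GPU at once.
long long max_resident_threads(int sm_count, int threads_per_sm) {
    return (long long)sm_count * threads_per_sm;
}

int main() {
    cudaDeviceProp prop;
    // Assumption: the card of interest is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("device: %s\n", prop.name);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max block dims: %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("max grid dims: %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("SMs: %d, max resident threads per SM: %d\n",
                prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
    std::printf("max resident threads on the GPU: %lld\n",
                max_resident_threads(prop.multiProcessorCount,
                                     prop.maxThreadsPerMultiProcessor));
    return 0;
}
```

These numbers are hard limits, not performance targets: a launch can be legal and still take a long time simply because of how many threads it asks for.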

Thanks a lot,
Asaf Anter

the kernel is empty, and yet the compiling takes too much time (around 12 sec).
When I changed the grid to “dim3 dimGrid(1,1,1)” it ran very fast.

If you mean execution, not compiling, please compute how much time the GPU spends executing each individual thread, i.e. 12 sec / (dimBlock*dimGrid). In this time the GPU creates a new thread, executes its empty body and cleans up the execution context. For comparison, try to do the same on a CPU, although I recommend you limit yourself to 1 million threads and run it overnight :D

You are right, I meant execution. Isn't it supposed to be very fast?

If nanoseconds aren't fast enough for you, you need to buy GPUs with THz frequencies.

12 seconds, not nanoseconds!

Now I see the problem. Do you know how many threads are created by the following code?

dim3 dimBlock(8, 32,4);
dim3 dimGrid(8000, 3, 51615);

Yes, something around 1*10^12. Like I said, I'm new to CUDA, and the online courses and manuals
say that I can create up to grid(65535,65535,65535) * 1024 threads.
I didn't know there is a price for it, so what is the correct way to do this?

Yes, it can. And by your own measurement, the GPU spends only about 10 picoseconds on average (12 s / ~1.27*10^12 threads) to create and then destroy each thread. That's very cheap compared to CPUs, but still not absolutely free.
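
To double-check the arithmetic, here is a small host-only sketch using the launch configuration from the first post and the reported 12 seconds. `total_threads` is a hypothetical helper written for this illustration; the result comes out at roughly 1.27*10^12 threads and about 9.5 ps per thread on average.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef unsigned long long ull;

// Hypothetical helper: total number of threads a launch configuration creates.
ull total_threads(dim3 grid, dim3 block) {
    ull blocks    = (ull)grid.x * grid.y * grid.z;
    ull per_block = (ull)block.x * block.y * block.z;
    return blocks * per_block;
}

int main() {
    dim3 dimBlock(8, 32, 4);         // 1024 threads per block
    dim3 dimGrid(8000, 3, 51615);    // about 1.24 * 10^9 blocks
    const double seconds = 12.0;     // run time reported in the thread

    ull n = total_threads(dimGrid, dimBlock);
    std::printf("threads: %llu\n", n);
    std::printf("average time per thread: %.2f ps\n", seconds / (double)n * 1e12);
    return 0;
}
```

This is an amortized figure (total time divided by thread count), not the latency of any single thread, since many threads are in flight at once.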

When you have a very short kernel, you can optimize it by performing more work per thread, i.e. instead of making 10^12 threads each performing a single operation, you will be fine making only 10^9 threads each performing 10^3 operations. Overall, modern GPUs can execute up to ~10^5 threads simultaneously, so a few million threads per kernel should be enough to fill the entire GPU and deal with the tail effect.

OK, I got you.
Let's say I have a very large 2-D matrix, and assume it is large enough that I get the numbers I mentioned.
How would you optimize the performance?
I mean, if I divide it into pieces, I need to do a lot of reads and writes from host to device and back.
So where is the limit between creating a lot of threads with few reads and writes, or the opposite?

And I'll be very glad if you explain what the “tail effect” is :)

Tail effect: when you have 1000 cores and run 1000 big jobs on them, the total time (the time when ALL jobs have finished processing) is determined by the slowest core. For various reasons, it may be much larger than the average time. That's the tail effect.

If you create 10^5 smaller jobs instead, each next job starts as soon as some core has finished a previous one, so the work is distributed in a more dynamic way and the tail effect becomes much smaller (it's proportional to job size, so with 100x more jobs, the effect is 100x less).

As I said, it's optimal to create a few million jobs. I don't understand why that should change the number of reads/writes.

For example, instead of running 10^9 jobs performing a[i]+=b[i], it's better to run 10^6 jobs performing the loop

base = job_number*1000;
for (i=0; i<1000; i++) a[base+i] += b[base+i];
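
As a complete program, the loop above might look like the sketch below. The float element type, the array size, the block size of 256 and the names `add_chunked`, `add_grid_stride` and `div_up` are all assumptions made for illustration. The second kernel is a grid-stride loop, a commonly used alternative (not something proposed in this thread) in which neighbouring threads touch neighbouring elements, which tends to give better memory coalescing than one contiguous chunk per thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    std::printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

// One contiguous chunk of per_thread elements per thread, as in the loop above.
__global__ void add_chunked(float *a, const float *b, size_t n, size_t per_thread) {
    size_t job  = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t base = job * per_thread;
    for (size_t i = 0; i < per_thread && base + i < n; i++)
        a[base + i] += b[base + i];
}

// Grid-stride variant: a fixed number of threads walks over the whole array.
__global__ void add_grid_stride(float *a, const float *b, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        a[i] += b[i];
}

// Ceiling division, used to size the grid.
size_t div_up(size_t x, size_t y) { return (x + y - 1) / y; }

int main() {
    const size_t n = 1 << 20, per_thread = 1000;   // demo sizes (assumed)
    const int block = 256;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f);

    float *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc(&a, n * sizeof(float)));
    CHECK(cudaMalloc(&b, n * sizeof(float)));
    CHECK(cudaMemcpy(a, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(b, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    size_t jobs = div_up(n, per_thread);           // threads needed
    add_chunked<<<(unsigned)div_up(jobs, block), block>>>(a, b, n, per_thread);
    CHECK(cudaGetLastError());
    add_grid_stride<<<64, block>>>(a, b, n);
    CHECK(cudaGetLastError());

    CHECK(cudaMemcpy(ha.data(), a, n * sizeof(float), cudaMemcpyDeviceToHost));
    bool ok = true;                                // each pass adds 2: 1 + 2 + 2 = 5
    for (size_t i = 0; i < n; i++) ok = ok && (ha[i] == 5.0f);
    std::printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(a);
    cudaFree(b);
    return ok ? 0 : 1;
}
```

Both kernels copy the data to the device once and back once, so doing more work per thread does not by itself add any host/device transfers.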

OK, I want to map an area with some device into a 3-D matrix with dimensions 50,000*60,000*100,
and I need to do some action on each cell.

I used cudaMalloc to allocate memory on the device and cudaMemcpy to copy from host to device.

I tried to allocate a thread for each pixel, but it takes too long…

Any suggestions on how to do it?

Process more than one pixel per thread. For example, run 50*60,000*100 threads and process 1000 pixels in each thread.
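
A rough sketch of that suggestion follows, with some loudly stated assumptions: one byte per cell, a placeholder "add 1" standing in for the real per-cell action, and small stand-in dimensions so the demo runs anywhere. The names `process_cells` and `threads_needed` are hypothetical. Note that even at one byte per cell, the full 50,000*60,000*100 volume is 300 GB, far more than the 8 GB on a GTX 1080, so the sketch streams the volume through the device slab by slab.

```cuda
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    std::printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

// Placeholder per-cell action (assumed): each thread updates per_thread cells.
__global__ void process_cells(unsigned char *cells, size_t n, size_t per_thread) {
    size_t base = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * per_thread;
    for (size_t i = 0; i < per_thread && base + i < n; i++)
        cells[base + i] += 1;
}

// Number of threads needed when each thread handles per_thread cells.
size_t threads_needed(size_t cells, size_t per_thread) {
    return (cells + per_thread - 1) / per_thread;
}

int main() {
    // Stand-in dimensions; the real volume in the thread is 50,000*60,000*100.
    const size_t X = 500, Y = 600, Z = 100;
    const size_t total = X * Y * Z;
    const size_t slab = 100 * Y * Z;   // cells per slab, sized to fit device memory
    const size_t per_thread = 1000;
    const int block = 256;

    std::vector<unsigned char> host(total, 0);
    unsigned char *dev = nullptr;
    CHECK(cudaMalloc(&dev, slab));

    for (size_t off = 0; off < total; off += slab) {
        size_t n = std::min(slab, total - off);
        CHECK(cudaMemcpy(dev, host.data() + off, n, cudaMemcpyHostToDevice));
        size_t threads = threads_needed(n, per_thread);
        unsigned blocks = (unsigned)((threads + block - 1) / block);
        process_cells<<<blocks, block>>>(dev, n, per_thread);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpy(host.data() + off, dev, n, cudaMemcpyDeviceToHost));
    }

    bool ok = std::all_of(host.begin(), host.end(),
                          [](unsigned char c) { return c == 1; });
    std::printf("%s\n", ok ? "ok" : "mismatch");
    cudaFree(dev);
    return ok ? 0 : 1;
}
```

Each cell still crosses the bus once in each direction, so the number of reads and writes is set by the data size, not by how many threads are launched.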