int *y = (int*)malloc(sizeof(int));
cudaMemcpy(y, x, sizeof(int), cudaMemcpyDeviceToHost);
printf("%d\n", *y);
cudaFree(x);
free(y);
}
the code does nothing special and is only for testing.
as you can see, the kernel is empty, and yet the compiling takes too much time (around 12 sec).
when I changed the grid to "dim3 dimGrid(1,1,1)" it ran very fast.
I would like to know a few things:
I can see that I'm within the limits of the grid and block (1024 threads per block, and the grid can be
around 65,000 blocks in each dimension), so what is the problem? Is there any difference between grid(1,1,1) and
grid(50000,50000,50000)?
How do I calculate the right bounds for best performance?
And if needed, how do I calculate how many GPUs I need for best performance?
the kernel is empty and yet the compiling takes too much time (around 12 sec).
when I changed the grid to "dim3 dimGrid(1,1,1)" it ran very fast.
If you mean execution, not compiling, please compute how much time the GPU spends executing each individual thread, i.e. 12 sec / (dimBlock*dimGrid). In this time the GPU creates a new thread, executes its empty body, and cleans up the execution context. For comparison, try to do the same on the CPU, although I recommend you limit yourself to 1 million threads and run it overnight :D
Yes, something around 1*10^12. Like I said, I'm new to CUDA, and the online courses and manuals
say that I can create up to grid(65535,65535,65535) * 1024 threads.
I didn't know there was a price for it, so what is the correct way to do this?
Yes, it can. And as you measured, it spends only about 8 picoseconds to create and then destroy each thread. It's very cheap compared to CPUs, but still not absolutely free.
When you have a very short kernel, you can optimize it by performing more work per thread, i.e. instead of making 10^12 threads each performing a single operation, you will be fine making only 10^9 threads each performing 10^3 operations. Overall, modern GPUs can execute up to ~10^5 threads simultaneously, so a few million threads per kernel should be enough to fill the entire GPU and deal with the tail effect.
OK, I got you.
Let's say I have a very large 2-D matrix, and assume it's large enough that I'll get the numbers I mentioned.
How would you optimize the performance?
I mean, if I divide it into pieces, I need to do a lot of reads and writes from host to device and back.
So where is the line between creating a lot of threads with few reads and writes, and the opposite?
Tail effect: when you have 1000 cores and run 1000 big jobs on them, the total time (defined by the moment when ALL jobs have finished processing) is determined by the slowest core. Due to various reasons, it may be much larger than the average time. That's the tail effect.
If you create 10^5 smaller jobs instead, each next job starts once some core has finished a previous one, so the work is distributed in a more dynamic way and the tail effect becomes much smaller (it's proportional to the job size, so with 100x more jobs, the effect is 100x smaller).
As I said, it's optimal to create a few million jobs. I don't understand why it should change the number of reads/writes.
E.g. instead of running 10^9 jobs each performing a[i] += b[i], it's better to run 10^6 jobs each performing the loop
base = job_number * 1000;
for (i = 0; i < 1000; i++) a[base+i] += b[base+i];