If you use Visual Studio 2008 or 2010 Professional (or above; the Express editions are not supported), you should install Parallel Nsight 2.0. After that, go to New->Project and select the CUDA 4.0 or CUDA 3.2 template; at this point you have your .cu and .cuh files. Now you can add your code above in the .cu file and use the .cuh file like a C++ .h header file.
Alternatively you can rename your main.cpp file to main.cu; this way the CUDA compiler knows that kernelName<<<grid, block, shared, cudaStream>>>(...) is a kernel call ;-). From a .cpp file you cannot call your kernel directly.
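For reference, a minimal main.cu might look roughly like this (the kernel name and the grid/block sizes here are just placeholders):

```cuda
#include <cstdio>

// Trivial kernel: each thread writes its global index into the array.
__global__ void kernelName(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main()
{
    const int n = 256;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    // The <<<grid, block>>> launch syntax only compiles in a .cu file.
    kernelName<<<n / 64, 64>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```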
You should always avoid branches in your kernel (if, switch and so on…) if you want to achieve the best performance in CUDA: all threads of the same warp should run the same instructions!! If a branch diverges, each branch is executed in sequence and only the threads of the same branch are executed in parallel; this can be a big performance issue.
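For instance, in a made-up kernel like this, both sides of the if are taken inside every warp, so the hardware runs the two paths one after the other:

```cuda
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Even and odd threads sit in the same warp, so the warp
    // executes the A path and then the B path sequentially.
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // path A
    else
        data[i] = data[i] + 1.0f;   // path B
}
```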
I know it’s a crucial part whenever we use loops, and especially IF statements with ELSE cases.
For example, I know that if I have a LOOP in my kernel implementation, my program would stop on the last line of the } (end of block)
until all threads finish executing that specific loop code.
This is wrong: an instance of a kernel is merely a thread, so if you have a loop with an if statement, only that thread runs the loop and the if statement. If you want to synchronize the threads in your kernel you should use __syncthreads(); this way you can force all threads of a block to be suspended until all of them reach the end of the loop.
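A sketch of what I mean, assuming you really need every thread of the block to finish the loop before continuing (the step count here is arbitrary):

```cuda
__global__ void loopThenContinue(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int step = 0; step < 10; ++step)
        data[i] += 1.0f;            // each thread runs its own loop

    // Without this barrier a fast thread would continue immediately;
    // with it, every thread of the block waits here. Note it only
    // synchronizes threads of the same block, not the whole grid.
    __syncthreads();

    // ... code that may read values written by other threads of the block ...
}
```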
And if I have an ‘if () a else b’,
then first my program would execute the ‘a’ part, then the ‘b’ part, and only then will it continue executing the rest of the code.
I read some tutorials and watched the “STANFORD” live videos regarding that issue but I didn’t understand it; something about working on wraps only, or something like that.
Program?? Do you mean thread?? About warps (I think you mean warps, not wraps…): a warp is a collection of 32 threads of the same kernel instance, each applied to a different element of your data set. I do not know about the Stanford live videos, but for performance optimization I can give you these suggestions:
Identify the data parallelism in your problem, so the GPU can work on it;
Use shared memory in your kernel; use texture memory (via cudaMallocPitch, cudaMemcpy2D and cudaBindTexture2D) for image processing; use constant memory; whenever possible avoid plain malloc for host memory, because pinned memory allocated with cudaMallocHost gives a better transfer rate than classic malloc;
Avoid branches in your kernel;
Create thread blocks whose size is a multiple of one warp (32 threads);
Use multiple CUDA streams to overlap memory transfers and kernel execution (to hide latency);
Use int in for loops (like for (int i = 0; i < 10; i++)) instead of unsigned int;
Reduce memory copying between host and device and vice versa.
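Points 3 and 6 together might look roughly like this (a sketch with two streams and two chunks; the chunk and block sizes are arbitrary, and `process` is a hypothetical kernel):

```cuda
// Overlapping host<->device copies and kernel execution using
// pinned memory and two CUDA streams.
__global__ void process(float *d, int n);   // hypothetical kernel

void run(int n)   // assume n is even and divisible into two chunks
{
    float *h_buf;                            // pinned host memory
    cudaMallocHost(&h_buf, n * sizeof(float));

    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int half = n / 2;
    for (int k = 0; k < 2; ++k) {
        float *h = h_buf + k * half;
        float *d = d_buf + k * half;
        // Copy and kernel are queued in the same stream; the two
        // streams can overlap each other's transfers and execution.
        cudaMemcpyAsync(d, h, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        process<<<half / 128, 128, 0, s[k]>>>(d, half);
        cudaMemcpyAsync(h, d, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}
```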
I didn’t quite get the part about “all threads of the same warp should run the same instruction” - most probably I’m mixing things up,
but the importance of warps is most crucial when we involve branches. I mean, if for example we execute an IF statement, then
the threads which chose to go through path A would be grouped and executed, and the others would be grouped as well; the difference is
that the execution time would be 2 times greater, because we first have to wait for the first pack to finish and then for the second one.
I know that in order to make them run the same instruction I need to sync the threads - is that what you meant?
Do you have any suggestions about how one can avoid branches? For example:
if ( threadIdx.x % 2 == 0)
How can I optimize it?
From my C# threading experience, I just initialize the threads and let them go; I don’t have to worry about them getting into branches or any of the stuff that NVIDIA makes me worry about. Why is that? (Even if GPU cores are weaker than modern CPU cores.) Why do the threads have to work simultaneously (same code)?
“This is wrong: an instance of a kernel is merely a thread, so if you have a loop with an if statement, only that thread runs the loop and the if statement. If you want to synchronize the threads in your kernel you should use __syncthreads(); this way you can force all threads of a block to be suspended until all of them reach the end of the loop.”
Doesn’t that conflict with the first thing you wrote me about branches and serial execution?
Thanks for your help! And I’m sorry for the bother.
In the GF110 chip there are two warp schedulers in each Streaming Multiprocessor (SM); each warp scheduler selects a half warp as input… this half warp (16 threads) is issued among 16 CUDA cores of the same SM (one thread per CUDA core). If each thread runs the same instruction, then your application uses all 16 CUDA cores concurrently; if one or more threads of the same warp run different instructions, then the branches diverge and all branches are handled sequentially. In the worst scenario, all 16 threads of the same warp run 16 different instructions; in this case the application uses only one CUDA core of the SM at a time… and you have lost a lot of performance. The same applies to the second warp scheduler. My “all threads of the same warp should run the same instruction” is only a piece of advice to achieve the best performance; of course you can run all your threads on different instructions, but it is not best practice when programming a GPU.
if ( threadIdx.x % 2 == 0)
You should always think about how you can compute each element of your data set independently; this is data parallelism… in your problem above you are writing code as you would on a CPU, but the right way is to always think in terms of the data set, not in terms of “if statements”. To optimize your code you should first tell me what A and B should do…?
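For example, assuming A and B are simple independent per-element operations (opA/opB below are just placeholders), you can regroup the work so that one half of the block handles the even elements and the other half the odd ones; if the block size is a multiple of 64, every warp then takes only one path:

```cuda
__device__ float opA(float x) { return x * 2.0f; }  // placeholder for A
__device__ float opB(float x) { return x + 1.0f; }  // placeholder for B

// Assumes blockDim.x is a multiple of 64, so each half of the
// block is made of whole warps and no warp diverges on the split.
__global__ void evenOddCoherent(float *data, int n)
{
    int half = blockDim.x / 2;
    if (threadIdx.x < half) {
        // First half of the block: even elements only.
        int i = 2 * (blockIdx.x * half + threadIdx.x);
        if (i < n) data[i] = opA(data[i]);
    } else {
        // Second half of the block: odd elements only.
        int i = 2 * (blockIdx.x * half + (threadIdx.x - half)) + 1;
        if (i < n) data[i] = opB(data[i]);
    }
}
```

The branch is still there, but it now splits between warps instead of inside a warp, so no warp has to serialize both paths.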
Yep, I am a .NET programmer too!! :D The C# way to manage threads is amazing… but it works only on a CPU with a few cores; on a GPU you have many cores!! The GF110 has 512 CUDA cores!! The hardware architectures of CPU and GPU are very different: multicore CPUs are optimized for functional parallelism (a few threads, each managing a different block of instructions) and GPUs are optimized for data parallelism (many threads, each running the same instruction but on a different element of the same data set; this is the SIMT architecture, which is like SIMD without SIMD’s restrictions). The stuff about branches, shared memory, pinned memory, texture cache and so on matters only if you want to achieve maximum performance from your NVIDIA GPU… you can always skip these “best practices”.
I mean that each thread runs the same loop, but when one thread reaches the end of the loop it does not wait for the other threads. I had misunderstood what you meant by “program”; it was only a suggestion in case you want to synchronize the threads of the same block at the end of the loop; this of course decreases performance…
If you are a .NET developer, go to http://www.hybriddsp.com/Home.aspx - with this project you can write your CUDA application and CUDA kernels in C#!! This is amazing :D. At the moment I have no benchmarks about performance… but I think it will be great.