Multithreaded app / parallel programming newbie >.<


I have already written several sequential apps, but in order to use Tesla cards more efficiently I'd like to write a multithreaded app…

I took a look at the programming guide that comes with CUDA, but it seems that the threads are created by the compiler…

Would it be possible to create threads on the device with the pthread library?
Any idea about what to do ?

Thanks in advance for any help !!!


Electro, multithreading is different from hardware threading…

Multithreading libraries like pthreads are software threading libraries that run on the CPU.

CUDA threads are hardware elements. Your program runs inside your graphics card, and your code is the only thing that executes on the GPU (no OS, no libraries, etc.).

Got it Mr.Electro??

So if I understood correctly, the fact that the GPU uses more than one of its processors to run what I wrote is transparent to me?

Your question is NOT transparent to me :-)

You execute your EXE on the CPU. This EXE copies the GPU kernel part of your program to the GPU device and starts execution. The processors inside the GPU execute your code and return the results to the CPU, from where you print them… Does this sound clear to you?
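As a minimal sketch of that flow (the kernel name and data here are purely illustrative, not code from this thread):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Trivial kernel: each GPU thread squares one element of the array.
__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * data[i];
}

int main(void)
{
    const int n = 256;
    float host[n], *dev;
    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    cudaMalloc(&dev, n * sizeof(float));                              // allocate on the GPU
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice); // CPU -> GPU
    square<<<n / 64, 64>>>(dev, n);                                   // launch 4 blocks of 64 threads
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost); // GPU -> CPU
    cudaFree(dev);

    printf("host[3] = %f\n", host[3]); // 3 squared = 9.0
    return 0;
}
```

The CPU side only allocates, copies, and launches; all the actual computation happens in the kernel on the GPU.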

It does !!!

What I want to do is create an app that runs the same corner turn, matrix calculation, or FFT 128 times. Having written apps that do each of these operations only once, I'd like to know how to modify/rewrite them to reach that goal.

How to do that seems unclear to me.

That was what i meant…
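One common way to get the "128 copies of the same operation" behaviour is to launch one block per problem instance, so they all run in parallel. A hedged sketch, where the kernel name, the matrix size, and the corner-turn-as-transpose placeholder are all assumptions for illustration:

```cuda
#include <cuda_runtime.h>

#define N_INSTANCES 128   // 128 independent copies of the same computation
#define MAT_DIM     16    // size of each small matrix (illustrative)

// Each block handles one matrix; each thread handles one element of it.
// The per-instance work here is just a transpose as a stand-in operation.
__global__ void cornerTurnAll(const float *in, float *out)
{
    const float *src = in  + blockIdx.x * MAT_DIM * MAT_DIM; // this block's matrix
    float       *dst = out + blockIdx.x * MAT_DIM * MAT_DIM;
    int r = threadIdx.y, c = threadIdx.x;
    dst[c * MAT_DIM + r] = src[r * MAT_DIM + c];             // transpose one element
}

void runAll(const float *d_in, float *d_out)
{
    dim3 threads(MAT_DIM, MAT_DIM);                 // one thread per matrix element
    cornerTurnAll<<<N_INSTANCES, threads>>>(d_in, d_out);
}
```

`blockIdx.x` selects which of the 128 instances a block works on, so a kernel that did the operation once becomes one that does all of them in a single launch.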

Hi Electro,

There is an FFT library (cuFFT) inside CUDA; maybe you can take a look at that one. And from what I have experienced, it is very hard to rewrite sequential CPU code in a way that is also fast on the GPU.
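cuFFT also covers the 128-at-once case directly: its 1-D plan takes a batch count, so a single call transforms many signals. A sketch, where the transform length and the device-buffer layout are assumptions for illustration:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Run 128 independent 1024-point complex FFTs with one batched plan.
// d_signals is a device buffer holding 128 * 1024 contiguous elements.
void fft128(cufftComplex *d_signals)
{
    cufftHandle plan;
    cufftPlan1d(&plan, 1024, CUFFT_C2C, 128);                // batch = 128 transforms
    cufftExecC2C(plan, d_signals, d_signals, CUFFT_FORWARD); // in-place forward FFT
    cufftDestroy(plan);
}
```

Batching lets cuFFT schedule all 128 transforms across the GPU itself, instead of you calling the library 128 times.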

That is why when I first started I completely began from scratch.

Good luck

You have to rewrite things completely for CUDA.

Meet Mr.CUDA! Forget your existing code :-)

The apps I wrote are already done with CUDA, using the cuFFT and cuBLAS libraries…

Forgetting those codes is not a problem, but my problem remains: I still have no idea how to write CUDA applications so that they run on the number of GPU processors I want…


I forgot to add that the apps I wrote are running on a CUDA-capable device (I set that device to be the Tesla card I have, not the graphics card).

You have no control over how many multiprocessors are used to run your code. They are always all used, as long as your kernel requests enough blocks.

So, in order to keep GPU processors from sitting idle, I should create more blocks. I see, thanks for the answer!

Yes, and you should use lots of them. Having more blocks than multiprocessors allows the block scheduler to hide the effects of memory latency.
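A typical way to size the grid so there are many more blocks than multiprocessors; the kernel name and thread count here are just assumptions for illustration:

```cuda
#include <cuda_runtime.h>

__global__ void work(float *data, int n); // some kernel defined elsewhere

void launch(float *d_data, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // one thread per element
    // With large n this yields hundreds of blocks, far more than the
    // handful of multiprocessors, so the block scheduler can switch to
    // another block whenever one stalls on a memory access.
    work<<<blocks, threadsPerBlock>>>(d_data, n);
}
```

Deriving the block count from the data size, rather than from the hardware, is what keeps the grid oversubscribed and the latency hidden.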