A few thoughts about CUDA

I’m a C/C++ programmer who has recently become interested in Nvidia’s CUDA. It seemed like an interesting and powerful technology, but the reality is disappointing. In a week I haven’t managed to parallelize a simple program of fewer than 20 short lines. The quality of the CUDA utilities (especially the compiler) and documentation is quite poor. If I didn’t know it was a final product, I’d say it was in beta. Here are a few examples of what happened:

  1. device global variables == catastrophe
    In my first attempt to use device memory I made a few variables global so I wouldn’t have to pass them as arguments to my kernel. I know global variables are ugly, but I just wanted to write a short test. The compiler didn’t complain; the only hint that something was wrong was “Advisory: Cannot tell what pointer points to, assuming global memory space”, which referred to lines where I assigned things to shared memory (a local variable). The program didn’t work, of course. I spent many hours figuring out what the problem was. Google makes me think I’m the only one who has tried this. I suggest improving the compiler messages and/or the documentation. (A reconstruction of the pattern follows this list.)

  2. Unstable drivers
    Just a few days were enough to write a program that crashed my computer.
    I’ve never written a CPU program that could crash my computer (more than 4 years of coding, tens of thousands of LOC).

  3. Interesting side-effect when synchronizing threads.
    Try something like this: make a shared variable, write to it in one thread, and read it in another. An easy task? Is the volatile modifier or __threadfence_block() enough? No. The first option doesn’t seem to work at all, and the second works only if the writing thread is still running during synchronization. I think synchronization should also cover threads that have already returned (or there should be a warning that it won’t; I didn’t notice one in the programming guide).
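For reference, here is roughly the pattern from point 1. This is a reconstruction, not my original code, and all the names are made up. On the 2.x-era compiler, a pointer that might refer to either shared or global memory gets the advisory and is compiled as a global access, which silently breaks writes intended for shared memory:

[codebox]// Reconstruction of the pattern in point 1; all names are made up.

__device__ int g_result[64];              // device global variable

__global__ void kernel(int *gIn)
{
    __shared__ int sBuf[64];

    // p may point into shared OR global memory depending on the thread,
    // so the compiler cannot resolve its address space:
    int *p = (threadIdx.x & 1) ? sBuf : gIn;

    p[threadIdx.x] = threadIdx.x;         // "Advisory: Cannot tell what pointer
                                          //  points to, assuming global memory
                                          //  space": the store meant for sBuf
                                          //  is emitted as a global access

    __syncthreads();
    g_result[threadIdx.x] = sBuf[threadIdx.x];
}[/codebox]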

If I encounter something else, I’ll post it here. I think I’ll give CUDA a few more days.

I use CUDA 2.2 toolkit.

Sorry that you’ve had so much trouble with CUDA… in its defense, though:

  1. I agree that some of the compiler/driver error messages could be improved (and there are some small missing bits in the documentation). However, I’ve always found this CUDA forum to be an invaluable resource for learning about CUDA. If you’re really interested in learning about CUDA, you should make an effort to check this forum every day and read over some of the threads…even if they don’t have anything to do with what you are working on, you’ll learn some excellent tips from the other developers here.

  2. That’s a side effect of GPU computing: you’re dealing with hardware drivers at a very low level. If you run a kernel that has bugs in it, you might end up hanging the device, which will eventually cause the driver to stop responding or crash. Are you on Windows or Linux? If you have access to a Linux machine, Valgrind has proven to be a great tool for rooting out hidden bugs.

  3. Have you tried __syncthreads(), or any of the atomic functions? Synchronizing threads can sometimes be a pretty hairy task if you’re working on a really complicated kernel. (A tiny sketch of an atomic function follows this list.)
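By way of illustration, here is a made-up example of what an atomic function buys you; the kernel and names are mine, and atomicAdd() on global memory needs compute capability 1.1 or later:

[codebox]// Hypothetical example: many threads bump one counter without a race.

__global__ void countPositives(const int *in, int *count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n && in[i] > 0)
        atomicAdd(count, 1);   // hardware-serialized read-modify-write
}[/codebox]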

In any case, don’t give up on CUDA. There are lots of developers having great success with it, but it does take a fair amount of time to really get the hang of it, because most of us aren’t trained to think in SIMD! Like I said in #1, don’t hesitate to post on the programming forum (or search the forums) if you run into a problem… people are usually quick to respond with a solution.

EDIT: If you can, upgrade to the CUDA 2.3 toolkit/SDK and play around with some of the SDK examples. Modify them, re-compile, etc. so you get used to how everything works together.

Yes, I tried __syncthreads as well. I’m fairly sure you have to write

[codebox]if (condition)
{
    var[threadIdx.x] = 5;
}

__syncthreads();   // every thread is still running when it hits the barrier

if (condition)
{
    return;
}

something = var[threadIdx.x+1];[/codebox]

instead of

[codebox]if (condition)
{
    var[threadIdx.x] = 5;
    return;        // these threads never reach the barrier below
}

__syncthreads();   // barrier inside divergent flow: behavior is undefined

something = var[threadIdx.x+1];[/codebox]

to be sure that var contains appropriate values. I think it’s strange.

BTW, here is an even stranger example:

[codebox]if (threadIdx.x > 0)
{
    while (shThreadState[threadIdx.x-1] < 2*a);
    return;
}[/codebox]

executes smoothly while

[codebox]if (threadIdx.x > 0)
{
    while (shThreadState[threadIdx.x-1] < 2*a);
}

if (threadIdx.x > 0)
{
    return;
}[/codebox]

hangs my application (I know this code doesn’t make sense but it was useful for debugging purposes).

EDIT:

I’m using 32-bit Windows XP with Visual C++ 2008 Express Edition for coding.

It’s kinda hard to know, since you didn’t post the whole code, but… the first piece of code, the one that runs smoothly, might actually be optimized out by the compiler, since it doesn’t really do anything. The second piece, on the other hand, might not be optimized out (although the two pieces look identical to the human eye) and maybe causes a deadlock; hence the application hangs.

I must say that profquail is 101% correct. The forums are amazing, the documents are good, and CUDA is AMAZING!!! I’ve been using the forums (and CUDA) for the last 18 months, and I think there have only been a few times where CUDA or the GPU was to blame. It’s mostly user error or a misunderstanding of how things work…

eyal

Trying to implement really sneaky synchronization primitives is generally a bad idea.

Also, you should upgrade your drivers and toolkit; a lot of the issues you complain about have been improved or fixed.

Tow Dragon:

Yes, CUDA is rather frustrating, especially in those first days when you’re trying to get used to it. It’s so similar to the coding you’re used to (it’s just C!), but that similarity is what throws you astray, because now not all your assumptions hold any longer.
It’s really common to get angry at this “dumb new CUDA thing which is badly designed”, but what happens is that after you dig deeper, get used to it, and learn, you’ll see that the “dumb design” actually supports the unique behavior of GPU parallelism quite well. Your very first issue, memory races and synchronization, is perhaps a great example of exactly the kind of change the hardware needs your mental model to adjust to.

For actual details on your issues, profquail’s post is a great place to start.

Thanks for the responses. I’ll update my drivers. Do you know where I could find an example of a simple buffer that one thread writes to and another one reads from (I want it to behave like a FIFO queue)? It’s the most complicated part of the program I want to write, and (for now) I don’t have time to dig through all these sample projects.

I have had similar troubles… check the CUDA programming guide for __threadfence(), __threadfence_block(), and __syncthreads(). (In my opinion, __syncthreads should be called __syncthreads_block() to be consistent, since it only syncs threads within a given thread block… but the CUDA folks can chime in on this.) Also search for “volatile” in that guide. You may need to declare device variables (or “variables in global device memory”) as volatile, or else the optimizer will replace repeated memory reads with a single read if nothing in the current function appears to modify that variable; volatile implies that another thread may change the variable at any time. Tricky! It took me some headbanging to find that. I don’t think there is any advantage (and maybe some trouble) in using device vars instead of kernel args for CPU->GPU data passing… I think you would have to do a cudaMemcpyToSymbol to each of the device vars before calling the kernel to transfer the data from CPU memory to GPU memory… but correct me if I’m wrong, CUDA people… I don’t think those device vars can be directly accessed on the CPU side in any manner except getting their address (and I don’t even know if you can do that on the CPU side), but I haven’t tried it. Kernel args: a good idea.
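To Tow Dragon’s FIFO question: here is a minimal, untested sketch of the idea, with every name in it made up. One producer thread and one consumer thread in the same block share a volatile ring buffer, and each item is published with __threadfence_block(). The two threads are deliberately placed in different warps, because spinning on a thread in your own warp can deadlock under SIMT divergence on current hardware:

[codebox]// Minimal single-producer/single-consumer FIFO sketch; all names made up.

#define QSIZE 16

__global__ void fifoDemo(int *out, int n)
{
    __shared__ volatile int shBuf[QSIZE];
    __shared__ volatile int shHead;             // total items written so far
    __shared__ volatile int shTail;             // total items read so far

    if (threadIdx.x == 0) { shHead = 0; shTail = 0; }
    __syncthreads();

    if (threadIdx.x == 0)                       // producer (warp 0)
    {
        for (int i = 0; i < n; ++i)
        {
            while (shHead - shTail == QSIZE);   // spin while the queue is full
            shBuf[shHead % QSIZE] = i * i;      // write the item first...
            __threadfence_block();              // ...make the write visible...
            shHead = shHead + 1;                // ...then publish it
        }
    }
    else if (threadIdx.x == 32)                 // consumer (warp 1, NOT warp 0)
    {
        for (int i = 0; i < n; ++i)
        {
            while (shHead == shTail);           // spin while the queue is empty
            out[i] = shBuf[shTail % QSIZE];     // read the item first...
            __threadfence_block();              // ...then free the slot
            shTail = shTail + 1;
        }
    }
}[/codebox]

Launched as, say, fifoDemo<<<1, 64>>>(d_out, 100); the remaining 62 threads simply idle. The volatile qualifiers are what keep the compiler from hoisting the shHead/shTail reads out of the spin loops.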

He is friendly, and smart as a whip. We need more like him…