How to write efficiently from local to global memory: write-time problems

Hi all,
I’m a newcomer to CUDA programming. I’m planning to use it to speed up my CT reconstruction program. I’ve completed the program and tested it on a GeForce 8600GT. However, I found one factor that dominates the overall computation time.
In my program, I use global memory to store the reconstructed image (rim).
In the kernel, I regularly update values in rim, and this is what hurts the computation time.
With the update operation, it takes 200 ms to complete one iteration. With the same computation but the update operation excluded, it takes only 1 ms.
Since the reconstructed image requires more than 16 KB of memory, I must use global memory (I mean there is no way to keep the intermediate result in shared memory).
When I tried to estimate the write throughput (storing a value from local memory to a position in global memory), I found a transfer rate of only about 32 MB/sec: in my test, 128 x 128 x 100 float writes (about 6.5 MB) take 200 ms.
In my test I split the job into 64 concurrent threads (64 threads per block). If I split it further, access conflicts occur (more than one thread reads, computes, and writes the same position in global memory at the same time).
With the above configuration, the CPU takes 125 ms versus 200 ms on the GPU.
Finally, I’d like to know: is there any way to make the write operation faster?
Thanks in advance.

There are two possibilities here:

  1. You are not making coalesced reads or writes. That’s probably the #1 reason for bad performance for new (or less new :) ) CUDA programmers. Make sure whenever possible that consecutive threads write to consecutive memory locations.
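To make the coalescing point concrete, here is a minimal sketch (kernel and parameter names are made up for illustration) contrasting a strided write pattern with a contiguous one:

```cuda
// Hypothetical example: each thread updates one float of the
// reconstructed image (rim) in global memory.

// Uncoalesced: consecutive threads write strided addresses.
// Thread 0 writes rim[0], thread 1 writes rim[pitch], ... so the
// hardware cannot merge the half-warp's writes into one transaction.
__global__ void update_strided(float *rim, int pitch)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    rim[tid * pitch] += 1.0f;
}

// Coalesced: consecutive threads write consecutive addresses.
// Threads 0..15 of a half-warp hit one aligned segment, which
// compute capability 1.x hardware services as a single transaction.
__global__ void update_contiguous(float *rim)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    rim[tid] += 1.0f;
}
```

On an 8600GT the difference between these two patterns alone can easily account for an order-of-magnitude gap in effective bandwidth.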

  2. When you comment out the update operation, the compiler might optimize away a bunch of code, since it has no effect. Check the ptx output (use the --keep option) in both cases to make sure it hasn’t just removed all your calculation. That could cause huge timing differences.
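A common trick when benchmarking is to keep the store in the kernel but guard it with a runtime value the compiler cannot evaluate, so the arithmetic is not eliminated. A sketch, assuming a placeholder computation and a made-up flag name:

```cuda
// 'compute_value' stands in for whatever your kernel actually computes.
__device__ float compute_value(int tid)
{
    return tid * 0.5f;  // placeholder arithmetic
}

__global__ void bench_kernel(float *rim, int dummy_flag)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = compute_value(tid);

    // If the store is deleted outright, nvcc sees 'v' as dead and can
    // remove the whole computation. Guarding it with a flag passed in
    // at runtime (always 0 in practice) keeps the code alive while
    // almost never performing the write.
    if (dummy_flag)
        rim[tid] = v;
}
```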

I’d also suggest taking a look at the CUDA profiler in 1.1. It has a lot of nice options, like the ability to count uncoalesced reads.

  1. How do you perform timings?
  2. If you comment out the write to global memory, nvcc is likely to remove all the code from your kernel, because it is dead code (your kernel computes something but writes no results to global or shared memory). This is why 200 ms drops to 1 ms: you’re running an empty kernel.
  3. 32 MB/sec seems very odd. You’re doing something terribly wrong, either in the timing or in the writes to memory. Global memory has a throughput of about 70 GiB/sec, so there’s some room for improvement.
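On the timing question: kernel launches are asynchronous, so a CPU timer that doesn’t synchronize measures only the launch overhead, not the kernel itself. A hedged sketch using CUDA events (kernel name and arguments are placeholders):

```cuda
#include <cstdio>

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_rim);   // placeholder kernel and arguments
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Alternatively, call cudaThreadSynchronize() before stopping a host-side timer; without one of these, the numbers are meaningless.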

All these issues have been discussed here before. Try searching the forum for answers; they are there.

Thanks for your replies,
I’m looking through my program to find out what’s slowing down the GPU version.