Writes in the same memory location: can't add numbers from different threads?

Yes, I've seen this parameter in the nvcc documentation, but I've already tried it and it doesn't seem to work for me. Is it possible? Or should it work every time?

It should work. Please check your project files: it's set in the same place where you already set the -keep option. By default (without the -po maxrregcount option) the compiler doesn't try to optimize register utilization as much as it can, since the optimal register count differs for different thread block sizes. For example, for 256 threads per block there's no difference between 11, 12, 13, 14, 15, or 16 registers in terms of warp occupancy (all of these counts result in 66% occupancy, which is a good ratio), but the thread block size is unknown at compile time.
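(For concreteness, assuming a G80-class GPU with 8192 registers and at most 768 resident threads per multiprocessor: at 16 registers per thread, 8192 / 16 = 512 threads fit, i.e. two 256-thread blocks, and 512 / 768 is about 66%. Anywhere from 11 to 16 registers still only fits those same two blocks, since a third block would need 768 * 11 = 8448 registers, so nothing is gained in that range.)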

With -po maxrregcount=N the compiler is forced to stay within a budget of N registers. Beware that too low an N can cause spills to local memory in the generated code, which in many cases is the reason for a slowdown in spite of the increased warp occupancy. You can always check hardware resource utilization in the produced .cubin file.
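By the way, the same numbers can be printed at compile time by asking ptxas to be verbose. A hypothetical build line (the file name and the budget of 32 are placeholders, and flag spellings vary between toolkit versions, so check nvcc --help on your install):

nvcc --ptxas-options=-v -maxrregcount=32 -keep mykernel.cu

With -v, ptxas reports the registers, lmem and smem used by each kernel as it compiles, which saves digging through the .cubin by hand.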

Each thread block in the histogram64 CUDA SDK sample computes per-thread subhistograms in the first stage and merges them (to form per-block subhistograms) in the second stage. __syncthreads() is required only between the two stages.
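For reference, here is a minimal sketch of that two-stage structure (simplified to plain unsigned int counters and an assumed 32-thread block; it is not the actual SDK source):

#define BIN_COUNT 64
#define THREAD_N  32   // assumed block size for this sketch (launch with blockDim.x == THREAD_N)

__global__ void histogram64_sketch(unsigned int *d_Result, const unsigned char *d_Data, int dataN)
{
    // One private sub-histogram per thread in shared memory (32 * 64 * 4 bytes = 8 KB).
    __shared__ unsigned int s_Hist[THREAD_N * BIN_COUNT];
    const int tid = threadIdx.x;

    // Stage 1: each thread touches only its own sub-histogram, so no synchronization is needed.
    for (int i = 0; i < BIN_COUNT; i++)
        s_Hist[tid * BIN_COUNT + i] = 0;
    for (int pos = blockIdx.x * blockDim.x + tid; pos < dataN; pos += blockDim.x * gridDim.x)
        s_Hist[tid * BIN_COUNT + (d_Data[pos] >> 2)]++;   // 256 byte values mapped to 64 bins

    // The only barrier: every per-thread sub-histogram must be complete before merging starts.
    __syncthreads();

    // Stage 2: merge the per-thread sub-histograms into this block's sub-histogram.
    for (int bin = tid; bin < BIN_COUNT; bin += blockDim.x) {
        unsigned int sum = 0;
        for (int t = 0; t < THREAD_N; t++)
            sum += s_Hist[t * BIN_COUNT + bin];
        d_Result[blockIdx.x * BIN_COUNT + bin] = sum;
    }
}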

Ok, it worked. I don't know what I did when I tried last time, but it's working now. However, as you said, the compiler is using local memory instead, and it's slow. I've spent my day trying to minimize the number of registers, but I couldn't. I guess there are too many texture fetches and too many calculations depending on each other. I've managed to decrease the registers in my second kernel to an acceptable level, but for my first kernel I simply can't, and it's slower than my Cg implementation. How is that possible? I mean, in Cg, without taking any precautions about the different kinds of memory, it just works, and quickly…

How many registers does the Cg implementation consume? What is the minimum register count without local memory spills in the CUDA counterpart? How different are the Cg and CUDA sources? In principle, porting Cg shaders to CUDA is a rather simple and straightforward task, so can you make them look as similar as possible?

I don't know how to check the number of registers used by the Cg implementation; I've never had to look at this before. As for the CUDA implementation, my first kernel uses 40 registers. With the -po maxrregcount=N option, I can force the compiler down to 36 with lmem = 4, and the smallest register count I can get is 20 with lmem = 244. If I go below that, the compiler produces a segmentation fault.

And I've tried to reformulate my algorithm in many different ways, but I couldn't reduce the register usage.

And yes, my implementation is as close as possible to the Cg one. The major difference I can see is that in the Cg implementation most of the calculations were done directly on float3 structures, whereas I'm forced to use floats in small arrays since there are no native vector operations in CUDA. I minimized the number of temporary structures (float, float4, etc.) by re-using each of them throughout my code. And where in Cg I was doing 3 texture fetches of float4 and storing everything in a float4x3 structure, it's now stored in 3 float4s in CUDA.

As an example, instead of doing this in Cg:

// All vectors are float3

Result.x = dot(vector1_a, vector2_a) + dot(vector1_b, vector2_b);    

vector*_a and vector*_b are the two halves of a 6-element vector.

I’m doing something like this in CUDA:

float vector1[6];
float vector2[6];
...
Result[tid] = vector1[0]*vector2[0] + vector1[1]*vector2[1] + ... + vector1[5]*vector2[5];

I have approximately the same timings in CUDA as in Cg now, and I actually haven't touched the registers. My main problem was due to Memcpy. Following the documentation, I was using a cudaArray because texture fetching is optimized for it. And since there's no way to write directly to a cudaArray from within a kernel (confirmed by NVIDIA in several posts), I was using a buffer allocated with cudaMalloc: I passed this buffer to my kernel for writing, and at the end I copied it back to the cudaArray (which was bound to a texture).
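In sketch form, this is the round trip I was doing (hypothetical names throughout: tex, writer_kernel, W, H; it is not my actual code):

#include <cuda_runtime.h>

texture<float4, 2, cudaReadModeElementType> tex;          // read side, bound to the cudaArray

__global__ void writer_kernel(float4 *out, int W, int H)  // stand-in for my real kernel
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H)
        out[y * W + x] = tex2D(tex, x + 0.5f, y + 0.5f);   // read via texture, write to linear memory
}

void run(int W, int H, int passes)
{
    // Texture fetches need a cudaArray; kernel writes need a separate linear buffer.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray *cu_array;
    cudaMallocArray(&cu_array, &desc, W, H);
    cudaBindTextureToArray(tex, cu_array, desc);

    float4 *d_buf;
    cudaMalloc((void **)&d_buf, W * H * sizeof(float4));

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    for (int p = 0; p < passes; p++) {
        writer_kernel<<<grid, block>>>(d_buf, W, H);
        // The extra device-to-device copy, needed only because a cudaArray cannot be
        // written from inside a kernel. This is where my time was going.
        cudaMemcpy2DToArray(cu_array, 0, 0, d_buf, W * sizeof(float4),
                            W * sizeof(float4), H, cudaMemcpyDeviceToDevice);
    }

    cudaFree(d_buf);
    cudaFreeArray(cu_array);
}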

Now I bind a texture directly to the buffer (a 1D texture), and I get timings comparable to my Cg implementation. The register problem may still remain; reducing them might improve my timings further (if the kernel isn't too memory-latency bound). The whole point of using CUDA for me was to run my kernel faster than in Cg, and I don't think I can achieve that with my application.
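The change that made the difference, again as a hedged sketch with made-up names (tex_buf, reader_kernel, d_in, d_out): a texture reference bound directly to linear memory with cudaBindTexture and fetched with tex1Dfetch and an integer index, so the cudaArray and the extra Memcpy disappear.

#include <cuda_runtime.h>

texture<float4, 1, cudaReadModeElementType> tex_buf;   // bound straight to a cudaMalloc'ed buffer

__global__ void reader_kernel(float4 *out, int n)      // stand-in for my real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex_buf, i);                // cached read from linear memory
}

void run(float4 *d_in, float4 *d_out, int n)
{
    // No cudaArray and no copy: bind the texture to the input buffer and launch.
    cudaBindTexture(0, tex_buf, d_in, n * sizeof(float4));
    reader_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(tex_buf);
}

If one pass needs the results of the previous one, reading through tex_buf from one buffer while writing to a second buffer (and swapping the two between passes) avoids reading and writing the same memory within a single launch.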