Well, looking at your code segments, you increased the number of arithmetic operations inside the loop by 50% (3 vs. the previous 2). Since the loop itself has some arithmetic, plus the constant memory access, the time increases by less than 50%.
What’s the total number of threads you’re launching?
It’s hard to tell from only a small piece of a large kernel, but there are a couple of possibilities:
A change in occupancy due to increased register use. What block size are you running, and how many blocks? What are the register usage values reported in the cubin (nvcc -cubin)? A quick way to check is sketched after the list of possibilities.
Dead code optimization. If you don’t use ks1 anywhere, all the code related to computing it will be optimized away.
ks2 = materials[nearest_hit.material_id].ks; ← this line looks suspiciously like an uncoalesced read. It should drop your performance significantly, but perhaps the dead code optimization of the no-longer-used ks1 makes up for it, so you still get 30 ms… just guessing here.
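For reference, one way to check those numbers is roughly this (raytrace.cu stands for whatever your kernel file is called; the exact fields in the cubin may vary by toolkit version, so treat this as a sketch):

nvcc -cubin raytrace.cu    # in the 1.x toolkits the .cubin is a plain text
                           # file; look for the per-kernel "reg = ..." and
                           # "lmem = ..." entries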
While the template approach is better than a plain for loop, it still uses many pointer casts, which I don’t think nvcc can handle. Maybe hardcoding them, e.g. a.x *= b.x; a.y *= b.y; a.z *= b.z;, would result in further improvement. Your local memory size is really suspicious; it looks like some vectors are still being put there.
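Something along these lines, written as a plain device helper instead of a templated operator (a minimal sketch; the names are illustrative, not from the original kernel):

// Hardcoded component-wise multiply: float3 in, float3 out, everything by
// value, so there are no templates and no pointer casts for nvcc to trip over.
__device__ float3 cmul(float3 a, float3 b)
{
    a.x *= b.x;
    a.y *= b.y;
    a.z *= b.z;
    return a;
}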
I think the lag was because I need to keep the ks1 value around for a long time: the whole loop, every iteration. That is, I must preserve the ks1 value between loop iterations.
And the code inside the loop is not small.
But I don’t really understand why that causes the lag, and I would be glad to get an explanation.
I have used shared memory and everything works fine now!
// One ks slot per thread, kept in shared memory instead of local memory.
__shared__ float ks[BLOCK_SIZE][BLOCK_SIZE];
ks[threadIdx.x][threadIdx.y] = 1.0f;
float *pKs = &ks[threadIdx.x][threadIdx.y];   // shared memory is addressable,
                                              // so taking a pointer is fine
for (int i = 1; i < 4; i++)
{
    ...
    *pKs *= materials[nearest_hit.material_id].ks;   // ok, 30 ms per frame
    final_color += (*pKs) * real_color;
}
I think the problem is exactly that ks is put in local memory.
You have to remove ALL templated operator overloading to prevent it. Merely rewriting one statement is not likely to work. Generic programming at that level won’t work well with nvcc. Most likely, you have to implement vector math exactly like nvidia’s header to prevent local memory.
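For the record, that header style is roughly this (a paraphrase of the cutil_math.h approach, not FROL’s code): plain float3 passed and returned by value, no templates, no references:

__device__ inline float3 operator*(float3 a, float3 b)
{
    return make_float3(a.x * b.x, a.y * b.y, a.z * b.z);
}
__device__ inline float3 operator*(float3 a, float b)
{
    return make_float3(a.x * b, a.y * b, a.z * b);
}
__device__ inline float3 operator+(float3 a, float3 b)
{
    return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
}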
Since shared memory is addressable, the local memory problem also gets solved by putting the variable in shared memory.
The ptx would indeed be helpful when investigating local memory problems.
That is OK. Note that the operator returns float instead of float&; if it returned float&, it would still result in local memory. That’s possibly due to a pointer to the middle of a struct being stored in a temporary variable or something. It’s not difficult to handle in a compiler, but nvcc just doesn’t handle it.
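To illustrate the point with a toy accessor (my own example, not the operator from the thread):

struct vec3 { float x, y, z; };

// Returning the component by value is fine: the result lives in a register.
__device__ float get_x(vec3 v) { return v.x; }

// Returning float& means a pointer into the middle of the struct, and per
// the discussion here that reportedly pushes the whole vector to local memory.
__device__ float& get_x_ref(vec3 &v) { return v.x; }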
So, you can use template metaprogramming in nvcc, as long as you don’t cast pointers and don’t use references that would result in a pointer cast. However, I found that too annoying and ended up writing a macro processor instead. Also, it may be even easier to just write the shading in nvcc without any vector math. I did a Ward and a Phong that way.
Finally, if you really like template metaprogramming, why do you insist on using nvcc? After all, ptx is better specified than nvcc’s local memory handling. You can just DIY…
No, even in the 1.0 era the reason was not pointer casting but variable indexing. (Registers cannot be dynamically indexed, as you know.) I got a templated approach quite similar to what FROL suggested working. The only reason template metaprogramming was needed for the loops was that NVCC 1.0 didn’t unroll even short, fully determined loops.
Now, with 1.1 the case is different, and e.g. this code results in no use of local memory. Go and test it yourself:
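(The original snippet isn’t quoted above; the idea is along these lines, as a minimal sketch rather than the exact code: a small fixed-size array of a plain struct, indexed only in short, fully determined loops.)

struct my_float3
{
    float x, y, z;
};

__global__ void test(float *out)
{
    my_float3 v[4];                    // small fixed-size array
    for (int i = 0; i < 4; ++i)        // short, fully determined loop: 1.1
    {                                  // unrolls it, so every index becomes a
        v[i].x = i * 1.0f;             // compile-time constant and the array
        v[i].y = i * 2.0f;             // stays in registers (lmem = 0)
        v[i].z = i * 3.0f;
    }
    float s = 0.0f;
    for (int i = 0; i < 4; ++i)
        s += v[i].x + v[i].y + v[i].z;
    out[threadIdx.x] = s;
}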
Well… I only passed constants to that operator; otherwise I wouldn’t be talking about this in the first place.
In 1.0, a simple test kernel didn’t run into any problem either. It took a more complicated kernel to end up in local memory (possibly due to a bug), and that made me rewrite a lot of code. That memory is very unpleasant, so maybe I was a bit too harsh on nvcc.
1.1 indeed unrolls the loops and handles pointers much better. However, I guess FROL is using 1.0.
Yeah, FROL was probably using CUDA 1.0, but 1.1 is now also publicly available.
True. But the problem now lies in declaring an array of structures that have a user-defined constructor:
my_float3 c[5];
The same code still compiles if you change the above array into distinct objects and adjust the array indexing accordingly:
my_float3 c0, c1, c2, c3, c4;
I isolated the bug in the following example. It seems you cannot make an array of objects with a user-defined constructor. The compiler tries to make the array constructor call non-inline no matter what you tell it to:
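The shape of the failing case is roughly this (a sketch of the kind of test.cu involved, not the exact file):

struct my_float3
{
    float x, y, z;
    __device__ my_float3() : x(0.0f), y(0.0f), z(0.0f) {}
    __device__ my_float3(float a, float b, float c) : x(a), y(b), z(c) {}
};

__global__ void test(float *out)
{
    my_float3 c[5];               // array of objects with a user-defined
                                  // constructor: nvcc emits a non-inlined
                                  // call to __cxa_vec_ctor here
    out[threadIdx.x] = c[0].x;
}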
"test.cu", line 7: Error: External calls are not supported (found non-inlined call to __cxa_vec_ctor)
(This is, of course, only relevant if we need an array of structs with a specialized constructor. Given a specialized constructor, we also need to define a no-argument constructor to be able to allocate an array of the type. By the way, C++0x will have the my_struct() = default; notation to get around this case completely.)
Also, thanks for your effort, pyrtsa :)
That may explain my local memory case in 1.0… it could have been due to similar reasons, just without the compiler reporting it.