Suddenly performance lost

Hello. I have a very strange performance lost.

float ks[4] = {0.75,0.75*0.75,...};

vec4f real_color(0,0,0,0),final_color(0,0,0,0);  // where vec4f is some thing like float4

for(int i=0;i<4;i++)

{

...

    final_color += (ks[i])*real_color; // OK 30ms per frame

}

// materials placed in constant memory

for(int i=0;i<4;i++)

{

...

   ks2 = materials[nearest_hit.material_id].ks;    // OK 30ms per frame

   final_color += (ks2)*real_color;                      // OK 30ms per frame

}

// BUT !

for(int i=0;i<4;i++)

{

...

   ks2 = materials[nearest_hit.material_id].ks;    

   ks1 = ks1*ks2;                                               // WOW! 40 ms per frame

   final_color += (ks1)*real_color;                      

}

this is very strange performance lost because of just one floating point multiplication.

why its can happend? - some reasons?

forward thanks for reply.

Well, looking at your code segments, you increased the number of arithmetic operations insided the loop by 50% (3 vs previous 2). Since loop itself has some arithmetic, plus the constant mem access, the time increases by less than 50%.

What’s the total number of threads you’re launching?

Paulius

the total number of threads are 640*480 = 307200 threads.

the loop has a lot of arithmetic. a lot of ray-sphere intersections and

the some outher computations.

very strange that

for(int i=0;i<4;i++)

{

...

  ks2 = materials[nearest_hit.material_id].ks;    // OK 30ms per frame

  final_color += (ks2)*real_color;                      // OK 30ms per frame

}

work normally, but if i make ks1 = ks1*ks2; - i have performance lost.

i dont understand it.

There are a couple possibilities: It’s hard to tell from only a small piece of a large kernel.

  1. Change in occupancy due to increased use of registers. What block size are your running and how many blocks? What are the register usage values reported in the cubin (nvcc -cubin).
  2. Dead code optimization. If you don’t use ks1 anywhere, all the code related to computing it will be optimized away.

ks2 = materials[nearest_hit.material_id].ks; ← this line looks suspiciously like it is an uncoalesced read. It should drop your performance significantly. But perhaps the dead code optimization of the no longer used ks1 makes up for it so you still have 30ms… just guessing here.

the block size is 8x8 = 64

the number of blocks is (640/8)*(480/8) = 4800

register usage in main kernel: (i make it with -keep)

code  {

	name = doPixel

	lmem = 692

	smem = 28

	reg = 40

	bar = 0

	bincode  { ....

so 8840 = 2560 - less then the total number of registers on one multiprocessor (8192).

but i use it! i could write

ks1 = materials[nearest_hit.material_id].ks;    // OK 30ms per frame

final_color += (ks1)*real_color;                      // OK 30ms per frame

and all be ok. problems only will be if i write

ks1 *= materials[nearest_hit.material_id].ks

may be some more information you needed? like ptx code?

and what do you mean by uncoalesced read ? how can i make it coalesced?

How did you implement operator*? Seems like you used a for and things are put in local memory.

#define universe_call __device__ __host__

inline universe_call VECTOR<n,T> operator*(const T rhs) const

	{ 

   VECTOR<n,T> v;

   MulVec<n,T>::exec(v.M,M,rhs);

   return v;

	}

// where MulVec is inline template function to be shure in unrolling loop

template <int n,class T> 

struct MulVec

{

   inline static universe_call void exec(T* c,const T b)

   {

    *c *= b;

    MulVec<n-1,T>::exec(c+1,b);

   }

};

template<class T>

struct MulVec<0,T>: public _MGML_VEC::ZeroIter<T> { };

and inline-functions unrolling really need.

if i do

inline universe_call VECTOR<n,T> operator*(const T rhs) const

	{ 

   VECTOR<n,T> v;

    for(int i=0;i<n;i++)

    v.M[i] = M[i]*rhs;

   return v;

	}

i loose about 10 ms per frame in my program.

While the template approach is better than a for, it still uses many pointer casts which I don’t think nvcc can handle. Maybe hardcoding them using a.x*=b.x;a.y*=b.y;a.z*=b.z would result in further improvement. Your local memory size is really suspicious. It looks like some vectors are still being put in there.

Indeed, try to use structs instead of arrays, and hardcode the indices as field names. Internal registers are not indexable.

do you mean operator* and operator+= ?

final_color.x += ks1*real_color.x;

final_color.y += ks1*real_color.y;

final_color.z += ks1*real_color.z;

nothing changes.

if i place the .ptx code here, could somebody help me?

i have found that

ks2 = materials[nearest_hit.material_id].ks;  // constant memory

ks2 = ks1*ks2;	

final_color += (ks2)*real_color;

works fine, but i need to store ks1*ks2 in to multilpy it on next iteration on ks2 again.

if i write ks1=ks2; i have lag

does it mean that ks1 lies in local memory?

omg. i have solved my problem!

i think the laag was because i need to store ks1 value for a long time - all cycle. each iteration. i.e. i must save ks1 value between cycle iterations.

and code in cycle is not small.

but i dont really understand why that is the cause of laags and i will be glag to some explanation

i have used shared memory and all works fine now!

__shared__ float ks[BLOCK_SIZE][BLOCK_SIZE];

ks[threadIdx.x][threadIdx.y]=1;

float* pKs = &ks[threadIdx.x][threadIdx.y];

for(i=1;i<4;i++)

{

...

*pKs *= materials[nearest_hit.material_id].ks;  // ok 30 ms per frame

  final_color += (*pKs)*real_color;

}

I think the problem is exactly that ks is put in local memory.
You have to remove ALL templated operator overloading to prevent it. Merely rewriting one statement is not likely to work. Generic programming at that level won’t work well with nvcc. Most likely, you have to implement vector math exactly like nvidia’s header to prevent local memory.
Since shared memory is addressable, the local memory problem also gets solved by putting variable in shared memory.
The ptx would indeed be helpful when investigating local memory problems.

ok. thanks all for help.

but now some questions about generic programming.

  1. if i rewrite some crytical code in not generic style nothing changes. Even if i remove generic code ks put in local memory anyway and i have laags.

According to CUDA_Programming_Guide_1.0 the device == inline.

so the static recursion like

template <int n,class T> 

struct MulVec

{

  inline static universe_call void exec(T* c,const T b)

  {

   *c *= b;

   MulVec<n-1,T>::exec(c+1,b);

  }

};

MUST be unrolled in to linear code. In outher case compiler can’t make code because there are no recursion supported.

so what exactly features of template metaprogramming can make problems(what exactly problems) with nvcc and why?

and template metaprogramming is great! Why shouldn’t i use it?

Well, the thing is in pointer handling, not in template itself.

It seems that you can’t just cast pointers and expect nvcc to handle it correctly. In 1.0 era, this:

struct myfloat3{

float x,y,z;

__device__ inline float& operator[](int x){return (float*this)[x];}

};

results in local memory.

However, this:

struct myfloat3{

float a[3];

__device__ inline float operator[](int x){return a[x];}

};

is ok. Note that operator returns float instead of float&, if it returns float&, it still results in local memory. That’s possibly due to a pointer to the middle of a struct being stored in a temp variable or something. It’s not difficult to handle in a compiler, but nvcc just doesn’t handle it.

So, you can use template metaprogramming in nvcc, as long as you don’t cast pointers and don’t use reference that would result in a pointer cast. However, I found that too annoying and ended up writing a macro processor instead. Also, it may be even easier to just write the shading in nvcc without any vector math. I did a Ward and a Phong that way.

Finally, if you really like template metaprogramming, why do you insist on using nvcc? After all, ptx is better specified than nvcc’s local memory handling. You can just DIY…

No, even in the 1.0 era the reason was not in casting pointers but in indexing variables. (The registers cannot be dynamically indexed as you know.) I got working quite a similar templated approach as FROL suggested. The only reason to have to use template metaprogramming for the loops was that NVCC 1.0 didn’t unroll even short determined loops.

Now, with 1.1 the case is different and eg. this code results in no use of local memory. Go and test yourself:

struct my_float3 {

    float x, y, z;

    __device__ my_float3() {}

    __device__ my_float3(float x, float y, float z) : x(x), y(y), z(z) {}

    __device__ explicit my_float3(float f) {

        for (unsigned i=0; i<3; i++) (*this)[i] = f;

    }

    __device__ float & operator[](unsigned x) {

        return reinterpret_cast<float *>(this)[x];

    }

    __device__ float operator[](unsigned x) const {

        return reinterpret_cast<float const *>(this)[x];

    }

    __device__ my_float3 & operator+=(my_float3 const & r) {

        for (unsigned i=0; i<3; i++) (*this)[i] += r[i];

        return *this;

    }

    __device__ my_float3 & operator*=(my_float3 const & r) {

        for (unsigned i=0; i<3; i++) (*this)[i] *= r[i];

        return *this;

    }

};

__device__ my_float3 operator+(my_float3 const & a, my_float3 const & b) {

    return my_float3(a) += b;

}

__device__ my_float3 operator*(my_float3 const & a, my_float3 const & b) {

    return my_float3(a) *= b;

}

__global__ void test(float * p) {

    const unsigned thread_count = blockDim.x * gridDim.x;

    const unsigned thread_index = threadIdx.x + blockDim.x * blockIdx.x;

    my_float3 a;

    for (unsigned i=0; i<3; i++) a[i] = p[i * thread_count + thread_index];

    my_float3 b(1.0f, 2.1f, 4.3f);

    a += a * b;

    for (unsigned i=0; i<3; i++) p[i * thread_count + thread_index] = a[i];

}

Well… I only passed constant to that operator, otherwise I won’t be talking about this in the first place.
In 1.0, a simple test kernel didn’t meet any problem either. It took a more complicated kernel to result in local memory (possibly due to a bug), and made me rewrite a lot of code. That memory was very unpleasant, so maybe I went a bit too harsh on nvcc.
1.1 indeed unrolled the loops and handles pointers much better. However, I guess FROL is using 1.0.

Also, some more complicated (but non the less doable) stuff still crashes 1.1:

struct my_float3 {

   float x, y, z;

   __device__ my_float3() {}

   __device__ my_float3(float x, float y, float z) : x(x), y(y), z(z) {}

   __device__ explicit my_float3(float f) {

       for (unsigned i=0; i<3; i++) (*this)[i] = f;

   }

   __device__ float & operator[](unsigned x) {

       return reinterpret_cast<float *>(this)[x];

   }

   __device__ float operator[](unsigned x) const {

       return reinterpret_cast<float const *>(this)[x];

   }

   __device__ my_float3 & operator+=(my_float3 const & r) {

       for (int i=0; i<3; i++) (*this)[i] += r[i];

       return *this;

   }

   __device__ my_float3 & operator*=(my_float3 const & r) {

       for (int i=0; i<3; i++) (*this)[i] *= r[i];

       return *this;

   }

};

__device__ my_float3 operator+(my_float3 const & a, my_float3 const & b) {

   return my_float3(a) += b;

}

__device__ my_float3 operator*(my_float3 const & a, my_float3 const & b) {

   return my_float3(a) *= b;

}

__global__ void test(float * p) {

   const unsigned thread_count = blockDim.x * gridDim.x;

   const unsigned thread_index = threadIdx.x + blockDim.x * blockIdx.x;

   my_float3 a;

   a=*(my_float3*)&p[thread_index];

   my_float3 b(1.0f, 2.1f, 4.3f);

   my_float3 c[5];

   a += a * b;

   ((int*)&a)[2]+=333;

   c[0]=a;c[4]=b;

   for(int i=0;i<p[0];i++){

   b*=*(my_float3*)&p[i];

   if(p[1]<33.f){

    a+=a*b;

    c[4]=a;

   }else{

     if(p[2]<33.f){

       b+=b*a;

       a+=c[0];

      }else{

       b+=a*a+c[4];

       a+=b*b;

       c[0]+=a;

    }}

    *(my_float3*)&p[i]=a;

	}

}

Yeah, FROL was probably using CUDA 1.0, but 1.1 is now also publicly available.

True. But the problem lies now in declaring an array of structures having a user-defined constructor:

  my_float3 c[5];

The same code still compiles if you modify the above array to distinct objects and the array indexing respectively.

   my_float3 c0, c1, c2, c3, c4;

I isolated the bug in the following example. It seems you cannot make an array of objects with a user-defined constructor. The compiler tries to make the array constructor call non-inline no matter what you tell it to:

struct my_struct {

    float a, b;

    __device__ __inline__ my_struct() {} // <- doesn't help, it'll be non-inline anyway

    __device__ __inline__ my_struct(float f) : a(f), b(f+1) {}

};

__global__ void test(float * x) {

    my_struct list[4]; // tries to invoke non-inlined calls to my_struct constructor!

    list[0] = my_struct(x[0]);

    x[1] = list[0].a;

    x[2] = list[0].b;

}

nvcc -ptx test.cu yields error:

"test.cu", line 7: Error: External calls are not supported (found non-inlined call to __cxa_vec_ctor)

(This is, of course, only relevant if we need to have an array of structs with a specialized constructor. Given a specialized constructor we need to define a constructor with no arguments to be able to allocate an array of it. By the way, C++0x will have the my_struct()=default; notation to completely get around this case.)

Thanks for your help, asadafag! :-)

Also thanks for your effort, pyrtsa:)
That may have explained my local memory case in 1.0… it may be due to similar reasons, but the compiler hasn’t detected it.