So I’m wondering, with Fermi’s cache memory architecture, which kernel would be more efficient?

Arrays:

[codebox]__global__ void test1a(float *g1, float *g2, float *g3, float *g4, float *sum)
{
    int i = blockIdx.x*512+threadIdx.x;
    g1[i]=g1[i]*0.99f;
    g2[i]=g2[i]*0.99f;
    g3[i]=g3[i]*0.99f;
    g4[i]=g4[i]*0.99f;
    sum[i]=g1[i]+g2[i]+g3[i]+g4[i];
    g1[i]=g1[i]+(g1[i]<=0.5f)*(0.7f-g1[i]);
    g2[i]=g2[i]+(g2[i]<=0.5f)*(0.7f-g2[i]);
    g3[i]=g3[i]+(g3[i]<=0.5f)*(0.7f-g3[i]);
    g4[i]=g4[i]+(g4[i]<=0.5f)*(0.7f-g4[i]);
}[/codebox]

Also, what if I break up the above kernel into multiple kernels each handling a single operation? Would that be more efficient?

Or an array of structs:

[codebox]struct TestStruct
{
    float g1, g2, g3, g4, sum;
};

__global__ void test1b(TestStruct *structArr)
{
    int i = blockIdx.x*512+threadIdx.x;
    structArr[i].g1=structArr[i].g1*0.99f;
    structArr[i].g2=structArr[i].g2*0.99f;
    structArr[i].g3=structArr[i].g3*0.99f;
    structArr[i].g4=structArr[i].g4*0.99f;
    structArr[i].sum=structArr[i].g1+structArr[i].g2+structArr[i].g3+structArr[i].g4;
    structArr[i].g1=structArr[i].g1+(structArr[i].g1<=0.5f)*(0.7f-structArr[i].g1);
    structArr[i].g2=structArr[i].g2+(structArr[i].g2<=0.5f)*(0.7f-structArr[i].g2);
    structArr[i].g3=structArr[i].g3+(structArr[i].g3<=0.5f)*(0.7f-structArr[i].g3);
    structArr[i].g4=structArr[i].g4+(structArr[i].g4<=0.5f)*(0.7f-structArr[i].g4);
}[/codebox]

I can pad the struct with dummy fields if that makes the memory access more efficient.
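To make the padding idea concrete, here is a rough sketch of what I have in mind. The names `TestStructPadded` and `pad` are made up for illustration, and the 16-byte alignment and 32-byte size are my own assumptions about what might suit the hardware. I have not measured whether this helps on Fermi.

```cpp
#include <cstddef>

// Hypothetical padded variant of TestStruct (not measured, just a sketch).
// The five floats occupy 20 bytes; three dummy floats round each element
// up to 32 bytes, and alignas(16) keeps every element 16-byte aligned.
struct alignas(16) TestStructPadded
{
    float g1, g2, g3, g4;  // first 16-byte chunk
    float sum;             // starts the second 16-byte chunk
    float pad[3];          // dummy fields, never read or written
};

// Compile-time checks that the layout is what the comments claim.
static_assert(sizeof(TestStructPadded) == 32, "element should be 32 bytes");
static_assert(alignof(TestStructPadded) == 16, "element should be 16-byte aligned");
static_assert(offsetof(TestStructPadded, sum) == 16, "sum should start the second chunk");
```

The cost is that each element grows from 20 to 32 bytes, so the padded array moves 60% more memory than the unpadded one.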

I’ve searched all over for this but still can’t find a clear answer, and the NVIDIA forum search isn’t working for me right now, so apologies if this has been discussed before.

Thanks,

hammer