__global__ void access(float *pfIn, unsigned long ulCount, unsigned long ulValues, unsigned long ulStride, unsigned long ulOffset)
{
    unsigned long lulI = blockIdx.x * blockDim.x + threadIdx.x;
    float lfGesamt = 0;
    if (lulI < ulCount)
    {
        unsigned long pos = (lulI * ulStride + ulOffset) % ulCount;
        for (unsigned long lulJ = 0; lulJ < ulValues; lulJ++)
        {
            lfGesamt += pfIn[pos]; // note: lfGesamt is never written back to memory
        }
    }
}
To make the code do what it says, I turned off optimization by passing --opencc-options -O0, which has worked well so far.
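An alternative to disabling optimization would be to make the result observable, so the compiler cannot eliminate the loads as dead code. A minimal sketch (the pfOut output parameter is my own addition, not part of the original kernel):

```cuda
// Variant that stores the accumulated sum, keeping the global reads live
// even at full optimization. pfOut must point to ulCount floats.
__global__ void access(const float *pfIn, float *pfOut,
                       unsigned long ulCount, unsigned long ulValues,
                       unsigned long ulStride, unsigned long ulOffset)
{
    unsigned long lulI = blockIdx.x * blockDim.x + threadIdx.x;
    float lfGesamt = 0;
    if (lulI < ulCount)
    {
        unsigned long pos = (lulI * ulStride + ulOffset) % ulCount;
        for (unsigned long lulJ = 0; lulJ < ulValues; lulJ++)
        {
            lfGesamt += pfIn[pos];
        }
        pfOut[lulI] = lfGesamt; // side effect the compiler must preserve
    }
}
```

The extra store adds one write per thread to the measured traffic, which can be accounted for when computing the bandwidth.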
The problem: I can't see any effect of either the "stride" or the "offset" parameter; the bandwidth is always the same. I also don't see any effect of the L1 and L2 caches of my GT 540M: the calculated bandwidth is the same whether I read a value once or ten times.
Optimization seems to be fine when I generate GPU debug information (at least I hope so).
When I don't generate the debug info, I measure a bandwidth of 200 GB/s on a 9800 GT, while the specification says it has about 58 GB/s…
Either I am doing something incredibly wrong, or the code is still being optimized away.
The other issue is the lack of caching and the identical bandwidth regardless of stride and offset.
But the stride effect probably evens out, since every block only reads its own block of memory, and the offset should cost at most one extra coalesced access on compute capability 2.1.
So the main remaining problem is the identical bandwidth between 1 and 10 reads, which should be cached.
Your kernel is accessing the same position in global memory over and over, meaning you are probably getting cache hits, which would give you a higher effective bandwidth. You should run this through the Visual Profiler and see what your cache hit rate is.
No, it's intermediate code for a virtual machine with an infinite number of registers. It is optimized when compiled ("assembled") for the real architecture. Use cuobjdump -sass on the .cubin file to see the (disassembled) machine code.