In the previous version I had 0.83 occupancy; now I get 0.25, but the new version runs faster (0.31 sec now vs 0.40 sec before, for 100 runs). So the next improvement is to get higher occupancy. How is that possible here? I am limited by smem; that's the main problem.
Higher occupancy doesn't necessarily mean more performance. You can add the following to your compilation
command line; maybe there is something to do with it:
--ptxas-options="-v -mem"
Also, I think 0.40 vs 0.31 is not a good test case to see which algorithm is better. You probably need a much
bigger dataset to see if the boost is real, and whether it stays at only 0.09 sec, or 10%, or anything higher.
You need to stress-test your algorithm in order to see if you indeed gained performance or not.
eyal
8192 x 1024 is big enough, I suppose, but I should probably also test on smaller data sets.
–ptxas-options="-v -mem - what is it for? i got the same time when tried this flag
What is "local store" in the profiler? In the cubin file I saw the string lmem = 0, so why does "local store" appear here?
Add --ptxas-options="-v -mem" to the compilation command line of the .cu file.
That should yield a result like this:
1>ptxas info : Compiling entry function '_Z29CalculatePhaseCILj3EEvjjjjjjjP23GGPUGenericSearchParamsPfS2_S2_S2_S2_S2_PiS3_'
1>ptxas info : Used 17 registers, 3272+16 bytes smem, 27328 bytes cmem[0], 40 bytes cmem[1]
This is how you can know what resources your kernel uses…
Well, is it possible that better occupancy brings worse results?
I launched this kernel 1000 times on an 8192x1024 data size and got the following results:
1.374 sec, BLOCK_SIZE = 32, 0.188 occupancy, smem usage 2208, registers per thread 11
1.659 sec, BLOCK_SIZE = 34, 0.375 occupancy, smem usage 2344, registers per thread 11
Block size must be a multiple of 32 - all you are doing by increasing it to 34 is introducing a second warp into the block (so there are really 64 threads in total), of which 30 of the 32 extra threads you requested are doing absolutely nothing. You might want to read the section of the programming guide that covers the execution model before you go too much further.
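For illustration, a minimal host-side sketch of keeping the block size warp-aligned (desired is just an example value):
// Round a requested block size up to a multiple of the warp size (32).
// With desired = 34 this gives 64, i.e. two full warps - the extra 30
// threads are still scheduled, they just do no useful work.
int desired   = 34;                           // hypothetical requested size
int blockSize = ((desired + 31) / 32) * 32;   // = 64
dim3 block(blockSize);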
Oh, I see, thank you :) I chose 32. Occupancy is still very low, mostly because of the big smem size. Maybe it would be better to calculate more than one column per thread and get higher occupancy?
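For example, one way to read "more than one column per thread" is a strided loop over columns - just a sketch, with names borrowed from the snippets below and the per-column body left trivial:
// Hypothetical sketch: each thread strides over several sample columns,
// so a smaller grid (fewer blocks in flight) covers all NumSamples columns.
__global__ void PerColumnStride(const float2* signal, float* amplitude,
                                int NumSamples)
{
    for (int sample = blockIdx.x * blockDim.x + threadIdx.x;
         sample < NumSamples;
         sample += gridDim.x * blockDim.x)
    {
        float2 v = signal[sample];                  // first sweep of this column
        amplitude[sample] = v.x * v.x + v.y * v.y;  // squared amplitude
    }
}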
There is something that looks wrong to me in your code. In the first loop you load the data into shared memory; however, in the second loop you read again from gmem (as opposed to reading from shared memory) and then read again from shared memory.
// second loop:
for (int sweep = window_size; sweep < NumSweeps; sweep++)
{
    float2 fValueNew = signal[sweep * NumSamples + sample]; // This should not have been called, I guess....
    float amplitude = (fValueNew.x * fValueNew.x + fValueNew.y * fValueNew.y);
    Di += amplitude;
    Di -= windowmemory[threadIdx.x][windowmemory_level];
    windowmemory[threadIdx.x][windowmemory_level] = amplitude;
    windowmemory_level++;
    if (windowmemory_level == WINDOW_SIZE) windowmemory_level = 0;
    di_signal[(sweep - window_size + 1) * NumSamples + sample] = Di;
}
The whole idea of using shared memory in the first loop was that in the second loop you won't need to read the data from gmem in an uncoalesced way. Now you're reading twice from gmem.
As for further optimization - maybe you can add a window-size for-loop and accumulate the stuff you need, thus reducing the shared memory needed from:
__shared__ float windowmemory[NUM_THREADS_PER_BLOCK][WINDOW_SIZE + 1];
// To this:
__shared__ float windowmemory[NUM_THREADS_PER_BLOCK];
thus making the code look something like this:
__shared__ float windowmemory[NUM_THREADS_PER_BLOCK];
...
for (int iWindow = 0; iWindow < WINDOW_SIZE; iWindow++)
{
    __syncthreads();
    windowmemory[threadIdx.x] = signal[...];
    __syncthreads();
    ....
}
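To make that fragment concrete, here is a minimal sketch of one way the reduced-smem idea could look. The kernel name and indexing are my assumptions, and this minimal version skips the smem staging entirely and simply reloads from gmem (coalesced), trading extra reads for the smem savings:
#ifndef WINDOW_SIZE
#define WINDOW_SIZE 64   // hypothetical value, stands in for the macro above
#endif

// For each output row, re-accumulate the WINDOW_SIZE amplitudes directly.
// Loads stay coalesced because consecutive threads read consecutive samples.
__global__ void WindowSumRescan(const float2* signal, float* di_signal,
                                int NumSweeps, int NumSamples)
{
    int sample = blockIdx.x * blockDim.x + threadIdx.x;
    if (sample >= NumSamples) return;

    for (int outRow = 0; outRow + WINDOW_SIZE <= NumSweeps; outRow++)
    {
        float Di = 0.0f;
        for (int iWindow = 0; iWindow < WINDOW_SIZE; iWindow++)
        {
            float2 v = signal[(outRow + iWindow) * NumSamples + sample];
            Di += v.x * v.x + v.y * v.y;   // accumulate the window sum
        }
        di_signal[outRow * NumSamples + sample] = Di;
    }
}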
Does this make sense?
eyal
Actually I am reading each element from gmem only once (when adding it to the sum); before, I read it twice (once when adding and once when removing).
Your model assumes that many cores process one column - how would you do that? I mean, in my model I load each element once from gmem and store it in smem to eliminate the second read, and the number of arithmetic operations seems to be the minimum needed to find the middles. My model has low occupancy; your model will surely have better occupancy, but will it give less time? Also, it is not obvious to me how to code your model. You can move along a column and load data to smem - that seems optimal - but how do you calculate Middle with no overhead in the number of arithmetic operations? Maybe a parallel reduction building a prefix sum, and then calculating Middle at position j as Middle[j] = S[j + window_size/2] - S[j - window_size/2], where S is the array of partial sums of the column built by the parallel reduction. Probably better, but I'm not sure.
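Something like this is what I have in mind - just a sketch, where the serial per-column scan stands in for the parallel reduction and the indexing mirrors the second loop quoted earlier:
// S[j] holds the inclusive prefix sum of squared amplitudes down one column,
// so any window sum is a difference of two prefix sums. A real version would
// build S with a parallel scan instead of this serial per-column loop.
__global__ void WindowSumViaPrefix(const float2* signal, float* di_signal,
                                   float* S,   // scratch: NumSweeps x NumSamples
                                   int NumSweeps, int NumSamples,
                                   int window_size)
{
    int sample = blockIdx.x * blockDim.x + threadIdx.x;
    if (sample >= NumSamples) return;

    float running = 0.0f;
    for (int sweep = 0; sweep < NumSweeps; sweep++)
    {
        float2 v = signal[sweep * NumSamples + sample];
        running += v.x * v.x + v.y * v.y;
        S[sweep * NumSamples + sample] = running;   // inclusive prefix sum
    }

    // Sum over sweeps (j - window_size + 1 .. j) equals S[j] - S[j - window_size];
    // the centered Middle[j] = S[j + w/2] - S[j - w/2] is the same difference
    // shifted by window_size / 2.
    for (int j = window_size; j < NumSweeps; j++)
    {
        float Di = S[j * NumSamples + sample]
                 - S[(j - window_size) * NumSamples + sample];
        di_signal[(j - window_size + 1) * NumSamples + sample] = Di;
    }
}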