TLP and ILP optimization

int tx = blockIdx.x*blockDim.x + threadIdx.x; // dimGrid(577, 256)
int ty = blockIdx.y*blockDim.y + threadIdx.y; // dimBlock(32, 32)

int index = tx*samples + ty; // 8064

if (tx > records || ty > samples-1)
    return; // skip out-of-range threads

result[tx*wsamples+ty].x = (float) buffer[2*index] - 128;
result[tx*wsamples+ty].y = (float) buffer[2*index+1] - 128;

Can anyone help me to do ILP and TLP in this code …

By “do” I assume you meant “increase”.

From what I can tell based on the extremely limited and fragmentary information presented, your kernel has sufficient parallelism to cover relevant latencies. One thing I would recommend trying is to make each thread block smaller (say 16x16) and then run more of them. This approach often allows using internal resources more fully in the presence of granularity constraints, but it is impossible to predict whether this results in a measurable performance advantage in a particular use case.
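As a rough sketch of the smaller-block suggestion: the helper below computes how many blocks are needed to cover the data for a given block shape. The name `gridFor` is hypothetical, and `records` and `samples` are assumed to be the two extents the posted kernel checks against.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover a records x samples
// index space with the given block shape, rounding up in each dimension.
inline dim3 gridFor(int records, int samples, dim3 block)
{
    return dim3((records + block.x - 1) / block.x,
                (samples + block.y - 1) / block.y);
}
```

With `dim3 block(16, 16);` the launch would use `gridFor(records, samples, block)` as the grid. Because the grid is rounded up, the kernel still needs its bounds check. Whether 16x16 beats 32x32 here can only be settled by measuring.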

Your code seems to be memory bandwidth limited, assuming “result” and “buffer” are in global memory. So you want to make sure to get the best possible memory interface utilization by paying attention to access patterns, and would want to consider moving load access to the texture path.
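To illustrate the access-pattern point: in the posted snippet `index = tx*samples + ty` with `tx` derived from `threadIdx.x`, so neighboring threads of a warp touch addresses that are `samples` elements apart. The sketch below swaps the roles so that `threadIdx.x` runs along a record, and reads through the read-only data cache with `__ldg`. It assumes `buffer` holds interleaved unsigned 8-bit pairs and `result` is `float2`; the kernel and helper names are made up for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: `buffer` holds interleaved 8-bit pairs, `result` is float2,
// rows are `samples` wide on input and `wsamples` wide on output.
__global__ void convertCoalesced(const uchar2* __restrict__ buffer,
                                 float2* __restrict__ result,
                                 int records, int samples, int wsamples)
{
    // threadIdx.x walks along a record, so a warp touches consecutive addresses.
    int s = blockIdx.x * blockDim.x + threadIdx.x;  // sample within record
    int r = blockIdx.y * blockDim.y + threadIdx.y;  // record
    if (r >= records || s >= samples) return;

    // One 2-byte load through the read-only data cache (needs sm_35 or newer).
    uchar2 v = __ldg(&buffer[(size_t)r * samples + s]);
    result[(size_t)r * wsamples + s] =
        make_float2((float)v.x - 128.0f, (float)v.y - 128.0f);
}

// Hypothetical self-check on a tiny data set; returns true if all outputs match.
bool checkConvertCoalesced()
{
    const int records = 3, samples = 40, wsamples = 64;
    std::vector<unsigned char> in(2 * records * samples);
    for (size_t i = 0; i < in.size(); ++i) in[i] = (unsigned char)(i * 7);

    uchar2* dIn = nullptr;
    float2* dOut = nullptr;
    std::vector<float2> out(records * wsamples);
    size_t outBytes = out.size() * sizeof(float2);
    cudaMalloc(&dIn, in.size());
    cudaMalloc(&dOut, outBytes);
    cudaMemcpy(dIn, in.data(), in.size(), cudaMemcpyHostToDevice);

    dim3 block(32, 8);  // x covers samples, y covers records
    dim3 grid((samples + block.x - 1) / block.x,
              (records + block.y - 1) / block.y);
    convertCoalesced<<<grid, block>>>(dIn, dOut, records, samples, wsamples);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(out.data(), dOut, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < records; ++r)
        for (int s = 0; s < samples; ++s) {
            int i = r * samples + s;
            float2 o = out[r * wsamples + s];
            if (o.x != (float)in[2 * i] - 128.0f ||
                o.y != (float)in[2 * i + 1] - 128.0f)
                return false;
        }
    return true;
}
```

Note that the grid dimensions swap along with the indices: x now covers samples and y covers records. If the real `buffer` is a signed or wider type, the `uchar2` load would need to change accordingly.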

The profiler can tell you quite a bit about the performance characteristics of your kernel and guide your optimization process. If you have not done so yet, I would suggest familiarizing yourself with this important tool.


Actually, can I give more work to a single thread, and will it help in this case?

If I want to assign more work to a single thread, how do I do that?

int ty = blockIdx.y*blockDim.y + threadIdx.y * 2; // dimBlock(32,32)

result[tx*wsamples+ty].x = (float) buffer[2*index] - 128;
result[tx*wsamples+ty].y = (float) buffer[2*index+1] - 128;
result[tx*wsamples+ty].x = (float) buffer[2*index+2] - 128;
result[tx*wsamples+ty].y = (float) buffer[2*index+3] - 128;

Will this be fine?

One more thing:

Computing a cuFFT of the data takes only 40 ms,

whereas this kernel takes 126 ms, which I want to reduce …

If the code is memory bandwidth limited, as I suspect (use the profiler to confirm or refute this working hypothesis), giving more work to each thread is unlikely to improve performance, as it does not change the ratio of computation to memory traffic, i.e. FLOPs per byte transferred.
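For anyone who still wants to experiment, here is one possible way to give each thread two elements. It is a sketch under the same assumptions as before (interleaved unsigned 8-bit input, `float2` output, hypothetical names). Each thread handles two samples one grid-width apart, so warps still touch consecutive addresses, and both loads are issued before either result is used. The bytes moved per element are the same as with one element per thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void convertTwoPerThread(const uchar2* __restrict__ buffer,
                                    float2* __restrict__ result,
                                    int records, int samples, int wsamples)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // record
    if (r >= records) return;

    int stride = gridDim.x * blockDim.x;             // threads along a record
    int s0 = blockIdx.x * blockDim.x + threadIdx.x;  // first sample
    int s1 = s0 + stride;                            // second sample
    const uchar2* in = buffer + (size_t)r * samples;
    float2* out = result + (size_t)r * wsamples;

    // Issue both independent loads before either value is consumed.
    uchar2 a = make_uchar2(0, 0), b = make_uchar2(0, 0);
    if (s0 < samples) a = in[s0];
    if (s1 < samples) b = in[s1];
    if (s0 < samples) out[s0] = make_float2((float)a.x - 128.0f, (float)a.y - 128.0f);
    if (s1 < samples) out[s1] = make_float2((float)b.x - 128.0f, (float)b.y - 128.0f);
}

// Hypothetical self-check on a tiny data set; returns true if all outputs match.
bool checkConvertTwoPerThread()
{
    const int records = 3, samples = 100, wsamples = 128;
    std::vector<unsigned char> in(2 * records * samples);
    for (size_t i = 0; i < in.size(); ++i) in[i] = (unsigned char)(i * 7);

    uchar2* dIn = nullptr;
    float2* dOut = nullptr;
    std::vector<float2> out(records * wsamples);
    size_t outBytes = out.size() * sizeof(float2);
    cudaMalloc(&dIn, in.size());
    cudaMalloc(&dOut, outBytes);
    cudaMemcpy(dIn, in.data(), in.size(), cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    // Half as many threads along x, since each thread covers two samples.
    dim3 grid((samples + 2 * block.x - 1) / (2 * block.x),
              (records + block.y - 1) / block.y);
    convertTwoPerThread<<<grid, block>>>(dIn, dOut, records, samples, wsamples);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(out.data(), dOut, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < records; ++r)
        for (int s = 0; s < samples; ++s) {
            int i = r * samples + s;
            float2 o = out[r * wsamples + s];
            if (o.x != (float)in[2 * i] - 128.0f ||
                o.y != (float)in[2 * i + 1] - 128.0f)
                return false;
        }
    return true;
}
```

Timing this against the one-element-per-thread version in the profiler is the only reliable way to tell whether the extra instruction-level parallelism buys anything.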