Why is it taking so much time?

Hi All,

In my kernel


__global__ void getValue( unsigned char *Value, long xValue, long yValue )
{
	long limit = __mul24( xValue, yValue );		// Here xValue = 2000 & yValue = 1500.
	long idx   = __mul24( blockIdx.x, blockDim.x ) + threadIdx.x;

	if ( idx < limit )
	{
		int y  = idx / xValue;
		int y_ = __mul24( y, 8 );
		int dt = __mul24( y, idx );

		for ( int i = 0; i < 4; i++ )
		{
			for ( int l = 0; l < 8; l++ )
			{
				int dl  = tex1Dfetch( hDY, y_ + l ) + dt + i;
				int ind = __mul24( idx, 256 ) + __mul24( i, 64 ) + l * 8;

				Value[ind  ] = tex1Dfetch( hTexture, (0 << 2) + dl );
				Value[ind+1] = tex1Dfetch( hTexture, (1 << 2) + dl );
				Value[ind+2] = tex1Dfetch( hTexture, (2 << 2) + dl );
				Value[ind+3] = tex1Dfetch( hTexture, (3 << 2) + dl );
				Value[ind+4] = tex1Dfetch( hTexture, (4 << 2) + dl );
				Value[ind+5] = tex1Dfetch( hTexture, (5 << 2) + dl );
				Value[ind+6] = tex1Dfetch( hTexture, (6 << 2) + dl );
				Value[ind+7] = tex1Dfetch( hTexture, (7 << 2) + dl );
			}
		}
	}
}





This is called using 256 threads per block.

My problem is:

This function's execution time is 93 ms. I don't understand why it takes so much time.

How did you measure that 93ms?


OK, those functions are on host side. Could you call your kernel twice (in a single call to your host program) and measure the difference between 2 calls?
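For what it's worth, CUDA events give a device-side measurement that excludes most host overhead. A minimal sketch, assuming a launch configuration like the one in the first post (`dValue` is a placeholder for the device pointer in the original host code):

```cuda
// Sketch: timing a kernel with CUDA events (device-side timing).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
getValue<<<1875, 256>>>(dValue, 2000, 1500);   // launch as in the first post
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                    // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed GPU time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```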

If I call this kernel twice, it takes 173 ms in total.

Have you run it through the profiler? I get the feeling you’re getting a lot of cache misses with those textures, especially with that weird addressing scheme.

Yes, I have run it through the Visual Profiler and got the following statistics for this kernel:

Timestamp:		72978.5
Method:			getValue
GPU Time:		75569
CPU Time:		75587
grid size X:		1875
block size X:		256
static shared:		40
registers/thread:	12
Occupancy:		1
gld coalesced:		0
gst coalesced:		30309376
local load:		0
local store:		0
branch:			31824
divergent branch:	28
instructions:		886147
warp serialize:		58332
cta launched:		234
gld 32b:		0
gld 64b:		0
gld 128b:		0
gst 32b:		15154688
gst 64b:		0
gst 128b:		0

If you have any idea please reply.

I guess it depends on the type of GPU you’re working with.

You’re running 1875 blocks of 256 threads, where each thread performs 256 texture accesses which are not necessarily localized (cache misses).

So I’m not sure 93 ms is as bad as you think.

By the way, where does the value xValue in

int y = ((idx) / (xValue));

come from? Is it in global memory?

It’s also a little strange that the profiler indicates that all stores are coalesced, because I get the impression that is not the case when browsing through your code.

Maybe it’s optimized by the compiler…


I am using a Quadro CX with the 2.2 SDK and toolkit.

If I need to reduce the execution time, what should I do?

Ah, I noticed that

warp serialize : 58332

I’m guessing that’s the problem.

Other than that, the occupancy is 1, virtually no shared memory is used, the register count is low, you’re running enough threads per block and enough blocks, and stores are coalesced. Not much to improve here.
Maybe you can do some high-level optimizations to reduce the operation count, which is always a win.
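For example (a sketch only, reusing the names from the kernel above): the `tex1Dfetch(hDY, y_+l)` value does not depend on `i`, so the eight hDY fetches can be hoisted out of the outer loop and done once per thread instead of four times:

```cuda
// Sketch: the hDY fetch is loop-invariant in i, so do the 8 fetches once.
// With #pragma unroll the small array stays in registers, not local memory.
int dy[8];
#pragma unroll
for ( int l = 0; l < 8; l++ )
    dy[l] = tex1Dfetch( hDY, y_ + l ) + dt;

for ( int i = 0; i < 4; i++ )
    for ( int l = 0; l < 8; l++ )
    {
        int dl  = dy[l] + i;
        int ind = __mul24( idx, 256 ) + __mul24( i, 64 ) + l * 8;
        // ... the 8 stores as before, using dl ...
    }
```

This cuts the per-thread hDY fetches from 32 to 8; whether it helps in practice depends on how well the texture cache was already absorbing the repeats.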


In my code given at the beginning of this thread, each thread produces 4 * 8 * 8 (the 4 × 8 loop runs 32 times per thread, and within the loop there are 8 statements) = 256 "Value" values. Then using shared memory for Value

requires 32 * 256 bytes of shared memory (< 16 KB) per block (if I use a block size of 32 threads). But it does not give a performance improvement.

Could you explain your point about how to implement my kernel in some more detail?

Using only 32 threads per block is definitely not a good idea. You can only have 3 active blocks running simultaneously on 1 SM, so you would only have 96 active threads, which is far too low.
You should probably try to re-order the instructions to get something like

Value[ind + threadIdx.x] = tex…

So that all memory writes are coalesced.

EDIT: I noticed right now that your value is an unsigned char array which makes it a bit more difficult to achieve coalescing. I typically write to uchar4 registers in the kernel before writing them to global memory.


Could you explain in some more detail how uchar4 would be used in my code to achieve that performance?

Here’s an example (for devices with Compute Capability 1.0 and 1.1):


uchar4 tmp;

for (int i = 0; i < iLimit; ++i) {

    tmp.x = tex...;

    tmp.y = tex...;

    tmp.z = tex...;

    tmp.w = tex...;

    Value[i * blockDim.x + threadIdx.x] = tmp;   // Value is a uchar4 * here

}

Should it work?

Does the use of uchar4 alone give the performance improvement? Please explain some more; I have the same problem.

Take a look at p. 80 of the 2.2 manual. It states that:

Coalescing on Devices with Compute Capability 1.0 and 1.1

The global memory access by all threads of a half-warp is coalesced into one or two memory transactions if it satisfies the following three conditions:

Threads must access

Either <b>32-bit</b> words, resulting in one 64-byte memory transaction,

Or <b>64-bit</b> words, resulting in one 128-byte memory transaction,

Or <b>128-bit</b> words, resulting in two 128-byte memory transactions;

All 16 words must lie in the same segment of size equal to the memory transaction size (or twice the memory transaction size when accessing 128-bit words);

Threads must access the words in sequence: the kth thread in the half-warp must access the kth word.

An unsigned char is an 8-bit word, so you can never achieve memory coalescing on devices of compute capability <= 1.1 using unsigned chars.

So one solution is to keep a uchar4 (a 32-bit word) in a register while calculating the 4 components, and once all 4 values have been calculated, store it to global memory with coalesced memory accesses.


But my card, a Quadro CX, has compute capability 1.3, and I think it provides 8-bit memory coalescing.

Even so, if you expand your code, you will see it is something like

Value[base+…+256*threadIdx.x] = …;

So there’s a 256 byte stride between neighbouring threads which greatly affects performance.


True, a compute 1.3 device can coalesce 8-bit reads. In benchmarks, however, coalesced 8-bit reads are still painfully slow compared to coalesced 32/64/128 bit reads.

In my code, Value actually has size 256 * xValue * yValue.

What I am doing with ind is to serialize the index into Value. My Value accesses are sequential, but one bad thing is that they go to global memory.

So, if there is a stride between neighbouring threads, what should I do here?
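One possible restructuring, as a sketch only (not tested; `getValue4` is a hypothetical name, `hDY` and `hTexture` are your texture references, and `Value` is assumed to be 4-byte aligned so it can be treated as a `uchar4 *`): each original work item `idx` owns 64 uchar4s, so launch one thread per uchar4 and decode (idx, i, l, half) from the thread index. Consecutive threads then store consecutive 32-bit words, which coalesces; a grid-stride loop keeps the grid within the 65535-block limit:

```cuda
__global__ void getValue4( uchar4 *Value4, long xValue, long yValue )
{
    long n    = __mul24( xValue, yValue ) * 64L;   // 64 uchar4s per work item
    long step = __mul24( gridDim.x, blockDim.x );

    for ( long t = __mul24( blockIdx.x, blockDim.x ) + threadIdx.x;
          t < n; t += step )
    {
        long idx = t >> 6;            // original work item (t / 64)
        int  r   = (int)(t & 63);     // which of its 64 uchar4s
        int  i   = r >> 4;            // 0..3
        int  l   = (r >> 1) & 7;      // 0..7
        int  h   = (r & 1) * 4;       // first or second half of the run of 8

        int y  = (int)(idx / xValue);
        int dl = tex1Dfetch( hDY, __mul24( y, 8 ) + l )
               + __mul24( y, (int)idx ) + i;

        uchar4 tmp;                    // build 4 bytes in a register...
        tmp.x = tex1Dfetch( hTexture, ((h + 0) << 2) + dl );
        tmp.y = tex1Dfetch( hTexture, ((h + 1) << 2) + dl );
        tmp.z = tex1Dfetch( hTexture, ((h + 2) << 2) + dl );
        tmp.w = tex1Dfetch( hTexture, ((h + 3) << 2) + dl );
        Value4[t] = tmp;               // ...then one coalesced 32-bit store
    }
}
```

The byte layout written is the same as the original (`t * 4 = idx * 256 + i * 64 + l * 8 + h`), only the thread-to-byte mapping changes; the Visual Profiler should then report the stores as 64-byte or 128-byte transactions instead of many small ones.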