OK, those functions are on the host side. Could you call your kernel twice (in a single run of your host program) and measure the difference between the two calls?
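Something like this is what I mean (just a sketch: the kernel, its argument and the output buffer are placeholders; only the launch configuration of 1875 blocks of 256 threads is taken from your description):

#include <cstdio>

// Placeholder kernel so the sketch compiles; substitute your real one.
__global__ void myKernel(unsigned char *out) { /* ... */ }

int main()
{
    unsigned char *d_out;
    cudaMalloc((void **)&d_out, 1875 * 256 * sizeof(unsigned char));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms[2];

    // Launch the same kernel twice in one run; the first launch also pays
    // one-time initialization costs, the second shows the "real" kernel time.
    for (int i = 0; i < 2; ++i) {
        cudaEventRecord(start, 0);
        myKernel<<<1875, 256>>>(d_out);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms[i], start, stop);
    }
    printf("first launch: %.3f ms, second launch: %.3f ms\n", ms[0], ms[1]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}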
Have you run it through the profiler? I get the feeling you’re getting a lot of cache misses with those textures, especially with that weird addressing scheme.
I guess it depends on the type of GPU you’re working with.
You’re running 1875 blocks of 256 threads, where each thread performs 256 texture accesses which are not necessarily localized (cache misses).
So I’m not sure 92 ms is as bad as you think.
By the way, where does the value x in
int y = ((idx) / (x));
come from? Is it in global memory?
It’s also a little strange that the profiler indicates all stores are coalesced, because browsing through your code I get the impression that is not the case.
Other than that, the occupancy is 1, virtually no shared memory is used, the register count is low, you’re running enough threads per block and enough blocks, and stores are coalesced. Not much to improve here.
Maybe you can do some high level optimizations to reduce the operation count which is always a win.
In my code, given at the beginning of this thread, each thread requires 4*8*8 (the loop runs 4*8 = 32 times per thread, and within the loop there are 8 statements) = 256 “Value” values. Storing Value in shared memory then requires 32*256 bytes of shared memory (< 16 KB) per block (if I use a block size of 32 threads). But it does not give a performance improvement.
Could you explain your point in a bit more detail, so I can implement it in my kernel?
Using only 32 threads per block is definitely not a good idea. You can only have 3 active blocks running simultaneously on 1 SM, so you would only have 96 active threads, which is far too low.
You should probably try to re-order the instructions to get something like
Value[ind+threadIdx] = tex…
so that all memory writes are coalesced.
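As a rough sketch (the texture reference, the fetch index and the per-thread value count N are placeholders, not your actual code), the write pattern I mean is:

// Sketch only: consecutive threads write consecutive elements each iteration,
// so every half-warp writes one contiguous, aligned range.
int base = blockIdx.x * blockDim.x * N;      // start of this block's output
for (int i = 0; i < N; ++i) {
    int ind = base + i * blockDim.x;         // same ind for the whole block
    Value[ind + threadIdx.x] = tex1Dfetch(tex, someFetchIndex);  // placeholder fetch
}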
EDIT: I just noticed that your Value is an unsigned char array, which makes it a bit more difficult to achieve coalescing. I typically write to uchar4 registers in the kernel before writing them to global memory.
Take a look at p.80 of the 2.2 manual. It states that:
Coalescing on Devices with Compute Capability 1.0 and 1.1
The global memory access by all threads of a half-warp is coalesced into one or two
memory transactions if it satisfies the following three conditions:
Threads must access
Either 32-bit words, resulting in one 64-byte memory transaction,
Or 64-bit words, resulting in one 128-byte memory transaction,
Or 128-bit words, resulting in two 128-byte memory transactions;
All 16 words must lie in the same segment of size equal to the memory
transaction size (or twice the memory transaction size when accessing 128-bit
words);
Threads must access the words in sequence: The kth thread in the half-warp must
access the kth word.
An unsigned char is an 8-bit word, so you can never achieve memory coalescing on devices of compute capability <= 1.1 using unsigned chars.
So one solution is to keep a uchar4 (32-bit word) in a register while calculating the 4 components, and once all 4 values have been calculated, you can store it in global memory with coalesced memory accesses.
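In kernel code the idea would look roughly like this (a sketch only; the texture reference, the fetch indices and valuesPerThread are placeholders for whatever your kernel really computes):

// Sketch: build each group of 4 unsigned char results in a uchar4 register,
// then store it with a single 32-bit write per thread.
texture<unsigned char, 1, cudaReadModeElementType> tex;   // placeholder texture

__global__ void packedStoreSketch(uchar4 *out, int valuesPerThread)
{
    // Each block writes its own contiguous region of uchar4 elements.
    int base = blockIdx.x * blockDim.x * (valuesPerThread / 4);
    int src  = (blockIdx.x * blockDim.x + threadIdx.x) * valuesPerThread; // placeholder indexing

    for (int i = 0; i < valuesPerThread; i += 4) {
        uchar4 v;                               // stays in registers
        v.x = tex1Dfetch(tex, src + i + 0);
        v.y = tex1Dfetch(tex, src + i + 1);
        v.z = tex1Dfetch(tex, src + i + 2);
        v.w = tex1Dfetch(tex, src + i + 3);
        // One 32-bit store; the kth thread of the half-warp writes the kth
        // uchar4 of a contiguous segment, so the stores coalesce as quoted above.
        out[base + (i / 4) * blockDim.x + threadIdx.x] = v;
    }
}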
True, a compute 1.3 device can coalesce 8-bit reads. In benchmarks, however, coalesced 8-bit reads are still painfully slow compared to coalesced 32/64/128-bit reads.