Simple caching kernel yields low performance

I’m trying to build a simple kernel that will cache reads and than issue writes. I’m using GTX 650 Ti (cc_30), CUDA 7 on Windows 7 64 bit. I’m caching into registers (in Debug mode nvcc writes onto local, instead), obviously using unroll. From debugging I see the input address is aligned to 256 bytes.
with cache size of 1 and 32 threads the kernel runs in ~48ms (2M items).
cache of 3 -> ~18ms.
Bigger caches do not give the expected boost, however. cache of 30 is about 7ms.
Moving from 32 threads to 1024 should have given me x32 in performance, but it does not. The best configuration of 1024 threads and 30 cache size yields 2ms kernel. It should have been few 100s microseconds, instead.

Here’s the quite generic kernel (dev_data is an array of pointers):

#define CACHE_SIZE 30

template <int unused>
__global__ void caching_kernel(int numElements, float *output)
	float *input=dev_data[0];
	int i=threadIdx.x;
	float cache[CACHE_SIZE];
	int step=blockDim.x*CACHE_SIZE;
	clock_t start,time=0,wait=0;
	int loops=0;


	while (i<numElements) {
#pragma unroll
		for (int j=0 ; j<CACHE_SIZE ; j++)
		if (loops<10)

#pragma unroll
		for (int j=0 ; j<CACHE_SIZE ; j++) {
			if (j*step+i<numElements) {
	if (threadIdx.x==0)
		printf("issue:%d block:%d registers:%d waited:%d\n",time/10,blockDim.x,CACHE_SIZE,wait/loops);

Few things I’ve seen in the resulting PTX which seem off:

  1. The unrolled loops are using more and more registers to compute the address of the input. Why isn’t the compiler reusing the same register?
BB3_18:	%p10, %r13, %r20;
@%p10 bra BB3_20; %r65, %r21;
shl.b32 %r66, %r13, 2;
add.s32 %r67, %r65, %r66; [%r67], %f8;

BB3_20:	%p11, %r14, %r20;
@%p11 bra BB3_22; %r68, %r21;
shl.b32 %r69, %r14, 2;
add.s32 %r70, %r68, %r69; [%r70], %f9;
  1. The compiler is using more registers that are supported in the architecture (sm_30, 63 registers). See above access to %r70 etc. The PTX recognizes the arch just fine. What is going on?
Fatbin elf code:
arch = sm_30
code version = [1,7]
producer = cuda
host = windows
compile_size = 32bit

Fatbin ptx code:
arch = sm_30
code version = [4,2]
producer = cuda
host = windows
compile_size = 32bit
ptxasOptions = -v 

.version 4.2
.target sm_30
.address_size 32
  1. Increasing CACHE_SIZE from 1 to 5 sees gradual small increases from 160 cycles, ~174 cycles for cache of 2, ~210 for cache of 4 and a huge jump to ~440 cycles for a cache of 5. Any idea what might cause such an increase?

  2. There are a lot of instructions for loading and storing. I will compile the “step” variable to be fixed (instead of blockDim.x*CACHE_SIZE) and see if those instructions are few and smaller. I’d expect the loads to look like:

ld.f32 %f0, [%r34+0];
ld.f32 %f1, [%r34+128];
ld.f32 %f2, [%r34+256];

instead of the following for just one item:

mad.lo.s32 %r8, %r23, 20, %r3;
add.s32 %r34, %r33, %r32;
ld.f32 %f3, [%r34];

Do you think the load with offset will yield better performance?

I appreciate any input.

There are many performance-limiting factors with any complex processor, you cannot expect performance to simply scale linearly with the number of threads. Your code could easily be limited by memory bandwidth, for example, and running more threads cannot remove that limit. It can help to exploit the existing bandwidth more efficiently, but that is a diminishing effect.

The Visual Profiler is a powerful tool that can indicate the bottlenecks in your CUDA code. If you have not had a chance to use it, I would suggest giving it a try. To maximize memory bandwidth, you would want to follow the recommendations in the Best Practices Guide (see chapter 9).

Don’t analyze PTX. It’s not the code the machine actually executes.

The thing that you are calling a “cache” is a local memory construct. The compiler will aggressively optimize this, but as you make it larger, it’s unlikely to do what you expect. Performance from a “small cache size” where the compiler can aggressively optimize into registers, is not likely to extrapolate to larger cache sizes, where the compiler may not be able to as easily optimize usage into registers.

Again, drawing conclusions from PTX is rarely useful.