__local slower than __global?

I’m taking my first steps in GPGPU computing and decided to do so in OpenCL.

I have written a very simple and naive brute-force raytracer app.

It works like this: every work-item corresponds to one pixel. The kernel is launched with an array of spheres; each work-item loops over these spheres in __global memory, calculating the closest intersection for its pixel. The resulting color is written to a pixels array, also in __global memory, which the host reads back at the end and plots to the screen.
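Stripped of the intersection math, the structure of that first (all-__global) version is roughly this (the kernel name and trimmed argument list here are just for illustration; the full __local variant is listed further down):

__kernel void rayTraceGlobal(__global float8* spheres,
                             int numberOfSpheres,
                             __global char4* pixels){
    int iGID = get_global_id(0);
    //Every work-item walks the whole sphere list in __global memory
    for(int i = 0; i < numberOfSpheres; i++){
        float8 thisSphere = spheres[i];
        //Intersection calculations here, keep the closest hit
    }
    //...shading, then write the color to pixels[iGID]
}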

This works fine, and I’m getting about 50 fps (only counting the kernel runtime) for 100 spheres.

But when I copy the spheres array to a __local array first and run the for loop over the local array, the frame rate drops to about 40 fps.

Instead of every work-item having to read all 100 spheres from __global memory, each sphere is read from __global memory only once per work group and then accessed through __local memory by all work-items.

To my understanding, this massively reduces the number of global memory reads and should be a lot faster, but instead it is slower!

The kernel:

__kernel void rayTrace(__global float8* spheres,
                       int numberOfSpheres,
                       __global char4* pixels,
                       int width,
                       int height,
                       __local float8* locSph){

    int iGID  = get_global_id(0);
    int tId   = get_local_id(0);
    int lSize = get_local_size(0);

    //Cooperatively copy the spheres from __global to __local memory
    for(int k = tId; k < numberOfSpheres; k += lSize){
        locSph[k] = spheres[k];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    //Also tried this instead, was even a little bit slower:
    //event_t copyEvent = async_work_group_copy(locSph, spheres,
    //                                          numberOfSpheres, 0);
    //wait_group_events(1, &copyEvent);

    //Some irrelevant private initializations here
    //(among them: int closestSphere = -1; float shadingFactor;)
    float8 hitSphere;

    for(int i = 0; i < numberOfSpheres; i++){
        float8 thisSphere = locSph[i];
        //Intersection calculations here; update closestSphere and
        //hitSphere when a closer hit is found
    }

    if(closestSphere == -1){ //if no hit, paint black
        pixels[iGID] = (char4)(0, 0, 0, 255);
        return;
    }

    //calculate shadingFactor here

    pixels[iGID] = (char4)(shadingFactor * hitSphere.s4,
                           shadingFactor * hitSphere.s5,
                           shadingFactor * hitSphere.s6, 255);
}

The really strange thing is that even when I simply comment out the “copy from global memory” part (and thus get undefined results), the version with global accesses in the for loop is still faster…

The global work size is the resolution (960x720), and the local work size is 256. I’ve tried other values, with no significant relative difference between the two versions.
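For completeness, the host side reserves the __local buffer and launches roughly like this (a sketch with assumed variable names, error checking omitted):

size_t globalSize = 960 * 720; //one work-item per pixel
size_t localSize  = 256;

//argument 5 is the __local buffer: pass NULL, only the size matters
clSetKernelArg(kernel, 5, numberOfSpheres * sizeof(cl_float8), NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &globalSize, &localSize, 0, NULL, NULL);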

I’m running this on a GTX 260, Windows 7 64-bit, NVIDIA driver 263.06, with the MSVC compiler.

If anyone has an explanation for this, I would be very grateful. I’ve spent two full days searching, with no success. Thank you for your time!

I suspect your problem could be (under)utilization. In order to hide memory latencies, the GPU must keep multiple work groups “in flight” simultaneously, and the number of work groups it can keep active at the same time depends on the amount of shared memory used by each work group. For details, see the section on “Occupancy” in the NVIDIA OpenCL Best Practices Guide (I don’t have the link handy, sorry).

I’ve found that, as a rule of thumb, on the GTX 260 you’ll want to be able to have about 8 work groups of 256 work-items each active simultaneously. 100 spheres of 8 floats each take ~3.2 KB, so only five such allocations fit into the GTX 260’s 16 KB of shared memory per multiprocessor. Since the registers used by each work-item also limit how many work groups can be resident, you’ll probably only be able to have two or three work groups active simultaneously in practice.

One idea that pops into mind is to split the spheres into multiple sets, then copy each set into __local memory and do the hit calculations before moving on to the next set.
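Here’s a sketch of that tiling idea (TILE is a hypothetical compile-time constant; with TILE = 32 the __local buffer shrinks to 32 * sizeof(float8) = 1 KB per work group, and the host only needs to reserve that much for the __local argument):

#define TILE 32 //spheres staged per batch

__kernel void rayTraceTiled(__global float8* spheres,
                            int numberOfSpheres,
                            __global char4* pixels,
                            __local float8* locSph){
    int iGID  = get_global_id(0);
    int tId   = get_local_id(0);
    int lSize = get_local_size(0);

    //private hit state survives across all tiles
    int closestSphere = -1;
    float8 hitSphere;

    for(int base = 0; base < numberOfSpheres; base += TILE){
        int count = min(TILE, numberOfSpheres - base);

        //cooperatively stage one tile of spheres into __local memory
        for(int k = tId; k < count; k += lSize)
            locSph[k] = spheres[base + k];
        barrier(CLK_LOCAL_MEM_FENCE);

        for(int i = 0; i < count; i++){
            float8 thisSphere = locSph[i];
            //Intersection calculations here; update closestSphere/hitSphere
        }

        //don't overwrite the tile while other work-items still read it
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    //...shading and write-out as in your original kernel
}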

Run the NVIDIA profiler on your code and check whether occupancy is less than 1; if it isn’t, you can disregard everything I said above.
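If you want a rough number without firing up the profiler, you can also query the kernel’s resource usage from the host (standard OpenCL 1.0 queries; call them after clSetKernelArg so the __local buffer argument is counted):

cl_ulong localMem = 0;
size_t   maxWG    = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(localMem), &localMem, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxWG), &maxWG, NULL);
printf("local mem: %lu bytes, max work-group size: %lu\n",
       (unsigned long)localMem, (unsigned long)maxWG);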

P.S.: Perhaps you could put the sphere data into __constant memory? I think the GTX 260 can cache __constant memory accesses, so that could also speed up your code.
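A minimal sketch of that variant (assuming the sphere buffer fits within the device’s constant buffer, which you can check against CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE; the __local argument and the copy loop simply disappear):

__kernel void rayTrace(__constant float8* spheres,
                       int numberOfSpheres,
                       __global char4* pixels,
                       int width,
                       int height){
    int iGID = get_global_id(0);
    //no staging, no barrier: read straight through the constant cache
    for(int i = 0; i < numberOfSpheres; i++){
        float8 thisSphere = spheres[i];
        //Intersection calculations here, as before
    }
    //...shading and write to pixels[iGID]
}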