I’m taking my first steps in GPGPU computing and decided to do so in OpenCL.
I have written a very simple and naive brute-force raytracer app.
It works like this: Every work-item corresponds to 1 pixel. The kernel gets launched with an array of spheres, it goes through these spheres in __global memory, with a for loop, calculating the closest one for that pixel. The resulting color is written out to a pixels array, also in __global memory, and in the end read out by the host to plot to the screen.
This works fine and I’m getting ±50 fps(only counting the kernel runtime) for 100 spheres.
But when I try to copy the spheres array to a __local array first and do the for loop through the local array, the fps drops to ± 40 fps.
Instead of every kernel having to go through all 100 spheres in __global memory, every sphere is only read once per workgroup in __global memory and then accessed through __local memory for all work-items.
To my understanding, this massively reduces the amount of global memory reads and should be a lot faster, but it is slower!
The kernel:
__kernel void rayTrace(__global float8* spheres,
int numberOfSpheres,
__global char4* pixels,
int width,
int heigth,
__local float8* locSph){
int iGID = get_global_id(0);
int tId = get_local_id(0);
int lSize = get_local_size(0);
for(int k=tId;k<numberOfSpheres;k+=lSize){
locSph[k] = spheres[k];
}
barrier(CLK_LOCAL_MEM_FENCE);
//Also tried this instead, was even a little bit slower:
//event_t copyEvent;
//async_work_group_copy (locSph, spheres, numberOfSpheres, copyEvent);
//wait_group_events(1, ©Event);
//Some irrelevant private initializations here
float8 hitSphere;
for(int i=0;i<numberOfSpheres;i++){
float8 thisSphere = locSph[i];
//Intersection calculations here, update hitSphere when closer hit
}
if (closestSphere == -1){ //if no hit, paint black
pixels[iGID] = (char4)(0,0,0,255);
return;
}
//calculate shadingFactor here
pixels[iGID] = (char4)(shadingFactor * hitSphere.s4, shadingFactor * hitSphere.s5, shadingFactor * hitSphere.s6, 255);
}
The really strange thing is, that even when I simply comment out the “copy from global memory” part(and thus getting undefined results), the version with global accesses in the for loop is still faster…
Global worksize is the resolution (960x720), local worksize is 256. I’ve tried other values, with no significant relative differences between both versions.
I’m running this on a GTX 260, windows 7 64bit, nvidia driver 263.06, MSVC compiler.
If there’s anyone with an explaination for this, I would be very gratefull. I’ve spent 2 full days searching, with no success. Thank you for your time!