Increasing GPU utilization of OpenCL kernel

I have a rather long OpenCL kernel that runs effectively in parallel across my CPU cores, and I see it scale as I add cores. On my GPU, though, I only get roughly the performance of a single CPU core. I ran the Nsight profiler, and it claims the kernel only utilizes 25% of the GPU. For completeness, I am not overlapping the transfer of results from the write buffers back to host memory with kernel execution. If I read the profile right, those transfers account for up to another 12% of GPU utilization. In total, my kernel and the two memory copy routines account for at most 37% of GPU runtime. Nothing else shows up in the profile.

I am going to start interleaving these memory copies with computation anyway, since I know they are getting in the way. However, I am trying to understand how I can increase GPU utilization overall. I assume it comes down to scheduling, but I'm not sure.

The kernel is a loop over a long routine that calculates values for four different buffers. To make this parallel, I created one larger buffer containing these sub-buffers, and the kernel picks which type of calculation to perform based on the global ID. This way, I can schedule many work items to get parallelism. Currently, I enqueue jobs of 1024 work items each. I've tried a work-group size of just 1, and I have also left the local range NULL, which delegates that decision to the implementation. Either way, I don't see much of a difference. If I run the work items 1 at a time instead of 1024 at a time, I do get a huge slowdown, so I know that batching them has made a huge difference.

I had already tried scheduling multiple 1024 work item chunks at a time and then waiting on all of them to complete, but that didn’t make a noticeable impact. Is there something else I can do in the scheduling to try to get the GPU more involved?

Just to be certain, I had wondered whether the 25% simply reflected the fraction of the program's runtime during which the GPU section was running at all, which would have been roughly a quarter. However, I ran a build where the GPU section was the only thing involved, and it was still 25% utilized. I could always have screwed that up, though.

Some more details:
I’m using a GeForce GTX 660 Ti
I’m using the Intel OpenCL SDK here
I’m running on Windows 8, generating code with Visual Studio 2010