Thread utilization

I need to calculate a few hundred 6x6 symmetric positive definite matrices and then invert them. A 6x6 symmetric matrix has 21 unique elements.
I am looking for advice on how best to utilize threads.
It seems natural to map one work group to one matrix. Given that each work group will only produce 21 outputs (i.e. not a multiple of the warp size), what is the best way to allocate threads within a work group? Is full utilization even a concern? What if my matrices grow to 8x8 (36 unique elements)? In that case the thread count is slightly over 32, so the scheduler would allocate two warps (64 threads), meaning 64 - 36 = 28 are wasted.
Any advice?

Maybe your problem is not a good target for GPU OpenCL: to make good use of a modern NVIDIA GPU (or an ATI one), you have to be able to launch a thousand threads at a minimum, and preferably tens of thousands!

How about mapping one work item to one matrix? You have several hundred, so you could hopefully get a large enough work group to make this worthwhile. These matrices are pretty tiny, so that should give you a pretty good max workgroup size.

I should give you more information. Each element in the output matrix is an accumulation over some other data source (think least squares). In my current example, each sum has ~200 terms.

So I have 100 6x6 matrices, each of which requires 21 sums of ~200 terms. The logical thing seems to be assigning one workgroup to one matrix. Within a workgroup I could assign 21 threads to each compute their independent sum and store the result. To optimize further, I could create even more threads that partition the sums into smaller pieces, keeping the number of threads in a workgroup a multiple of 21, as in the sketch below.
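A minimal sketch of that scheme, assuming the ~200 terms for matrix m, element e sit contiguously in a buffer (the kernel name, buffer layout, and the K=4 split are all invented for illustration):

[codebox]#define NUM_ELEMS 21   // unique elements in a 6x6 symmetric matrix
#define NUM_TERMS 200  // terms in each sum
#define K 4            // threads cooperating on each sum; workgroup = 21*4 = 84

// Enqueue with a local size of NUM_ELEMS*K and one group per matrix.
kernel void accumulate(__global const float* terms,  // hypothetical layout, see below
                       __global float* results)      // NUM_ELEMS floats per matrix
{
    int matrix = get_group_id(0);
    int lid    = get_local_id(0);
    int elem   = lid % NUM_ELEMS;  // which of the 21 sums this thread works on
    int slice  = lid / NUM_ELEMS;  // which chunk of that sum it handles

    // Assumed layout: terms[(matrix*NUM_ELEMS + elem)*NUM_TERMS + t].
    __global const float* src = terms + (matrix * NUM_ELEMS + elem) * NUM_TERMS;
    float sum = 0.0f;
    for (int t = slice; t < NUM_TERMS; t += K)
        sum += src[t];

    // Combine the K partial sums per element through local memory.
    __local float partial[NUM_ELEMS * K];
    partial[lid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (slice == 0) {
        for (int s = 1; s < K; s++)
            sum += partial[elem + s * NUM_ELEMS];
        results[matrix * NUM_ELEMS + elem] = sum;
    }
}[/codebox]

Note that 84 is a multiple of 21 but not of 32, so the last warp of each group is still partially idle; that is exactly the utilization trade-off in the original question.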

The main trouble I see with one work item per matrix is how to get memory coalescing to work. In this case, would assigning threads to different matrices help load individual elements from different matrices?

Your problem might be a good application for image access (the data is read-only and difficult to coalesce). Image reads work well even from a single thread, since the hardware issues larger, fewer transactions and relies on the implementation to hide the latency. Maybe something like this pseudocode:

[codebox]const sampler_t sam = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;

kernel void myKernel(…, __read_only image2d_t myData)
{
    int myRow = (int)get_global_id(0);  // one work item per matrix
    float output[21];                   // the 21 unique elements
    int2 addr;
    addr.y = myRow;                     // one image row per matrix
    float4 texel;

    for (int i = 0; i < 21; i++) {
        float sum = 0.0f;
        for (int j = 0; j < 200 / 4; j++) {
            // assumes the terms for element i start at texel i*(200/4) in the row
            addr.x = i * (200 / 4) + j;
            texel = read_imagef(myData, sam, addr);
            sum += texel.s0 + texel.s1 + texel.s2 + texel.s3;
        }
        output[i] = sum;
    }

    // write output
}[/codebox]
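On the host side, the image for a kernel like that might be created along these lines (a sketch; hostData and the counts are illustrative, assuming each matrix's 21 sums of 200 terms are packed four per texel along one image row):

[codebox]cl_image_format fmt;
fmt.image_channel_order     = CL_RGBA;   // read_imagef then yields 4 terms per texel
fmt.image_channel_data_type = CL_FLOAT;

cl_int err;
cl_mem img = clCreateImage2D(context,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             &fmt,
                             21 * (200 / 4),  // width in texels: 21 sums x 50 texels
                             100,             // height: one row per matrix
                             0,               // row pitch: let the runtime choose
                             hostData,        // tightly packed host floats
                             &err);[/codebox]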

I agree that exploiting textures could help with the memory-read bottleneck. But what about output? If there is one thread per matrix, what is the best way to write the 21 matrix elements out to global memory? To get coalescing, consecutive threads need to write to consecutive global memory addresses.
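For illustration, one standard way to satisfy that rule is an element-major ("structure of arrays") result layout, so that at each store the work items of a warp hit adjacent addresses. A sketch with invented names:

[codebox]// Work item m stores element i at results[i * numMatrices + m], so threads
// m, m+1, ... write consecutive addresses on every iteration.
kernel void writeResults(__global float* results,  // 21 * numMatrices floats
                         int numMatrices)
{
    int m = get_global_id(0);  // one work item per matrix
    float output[21];
    for (int i = 0; i < 21; i++)
        output[i] = 0.0f;      // placeholder; the real sums go here

    for (int i = 0; i < 21; i++)
        results[i * numMatrices + m] = output[i];  // coalesced across the warp
}[/codebox]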