Best practice for permutations and combinations


What is the best practice for performing permutation- and combination-style calculations on a GPU?

For example:

int array1[] = {a, b, c, d};
int array2[] = {u, v, x, y};

Now, to multiply each element of array1 with each element of array2, I would have to run a loop like this on the CPU:

for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
        int mult = array1[i] * array2[j];
    }
}

To do this on the GPU, I will have to launch a kernel with 16 threads.
One way is to pass both 4-element arrays to the kernel and let each thread pick the two elements it needs, so the accesses to the arrays are scattered. But won't that lead to a memory coalescing problem?
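
To make the first option concrete, this is roughly what I have in mind, written as plain C rather than a real kernel. The function names are made up for illustration, and t stands for the flat thread index, which in CUDA would normally be computed as blockIdx.x * blockDim.x + threadIdx.x. It is only a sketch of the index arithmetic, not a tuned implementation.

```c
#include <stddef.h>

/* Plain C stand-in for the work of one GPU thread (hypothetical name).
   t is the flat thread index, n is the length of each input array. */
static int pair_product(const int *array1, const int *array2, size_t n, size_t t)
{
    size_t i = t / n;  /* index into array1: the same for n consecutive values of t */
    size_t j = t % n;  /* index into array2: consecutive values of t read consecutive elements */
    return array1[i] * array2[j];
}

/* Emulates launching n * n threads by looping over the flat index.
   out must have room for n * n ints. */
static void all_pair_products(const int *array1, const int *array2, size_t n, int *out)
{
    for (size_t t = 0; t < n * n; ++t)
        out[t] = pair_product(array1, array2, n, t);
}
```

With n = 4 this produces the same 16 products as the CPU loop, in the same order, and it only needs the two original 4-element arrays. The access pattern of i and j here is the part I am unsure about with respect to coalescing.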

Another way is to expand array1 and array2 to 16 elements each, like this:
int array1[] = {a, a, a, a, b, b, b, b, c, c, c, c, d, d, d, d};
int array2[] = {u, v, x, y, u, v, x, y, u, v, x, y, u, v, x, y};

Now thread t can simply read element t of each array, so there is no coalescing problem.
But this will lead to a huge allocation of global memory.
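
For comparison, the expanded arrays of the second option could be built on the host roughly as follows. Again this is plain C with made-up helper names, and it is only meant to show where the extra memory comes from.

```c
#include <stddef.h>

/* Hypothetical helper: builds the two expanded arrays described above.
   expanded1 and expanded2 must each have room for n * n ints. */
static void expand_inputs(const int *array1, const int *array2, size_t n,
                          int *expanded1, int *expanded2)
{
    for (size_t t = 0; t < n * n; ++t) {
        expanded1[t] = array1[t / n];  /* a,a,a,a,b,b,b,b,... */
        expanded2[t] = array2[t % n];  /* u,v,x,y,u,v,x,y,... */
    }
}

/* Number of extra ints stored compared with keeping only the two
   original n-element arrays: 2*n*n - 2*n. */
static size_t extra_ints(size_t n)
{
    return 2 * n * n - 2 * n;
}
```

For n = 4 that is 24 extra ints, which is harmless, but the cost grows with the square of n, which is why I am worried about global memory for real input sizes.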

Please suggest the best way to launch this kind of kernel.

Please pardon my English.