Hi all, I have a small problem that I currently run inside one big if(threadIdx.x==0) block because it is not really parallelizable. It is a very small step in my algorithm, but it costs by far the most time… :(
Anyway, here is what I need to do:
d_input is an int array of size N that contains zeros and other values, which together sum to N. It basically tells me how often I have to replicate a value: if the 20th value is 1000, I have to copy the 20th value of another array 1000 times. I solve that by building an index array that contains the value 20 a thousand times and using a texture to 'sample' the other array through those indices (while making sure that the 20th value in my index array is 20, so I get no races).
So, for example:
d_input   = [0 0 2 0 6 0 5 0 0 1 0 0 0 0]
d_output1 = [2 2 4 4 4 4 4 4 6 6 6 6 6 9]
d_output2 = [2 4 2 4 4 4 6 4 4 9 6 6 6 6]
d_input needs to be converted into d_output1, and to prevent races I then have to convert that into d_output2, where every value that occurs in the array is also placed at the index equal to that value (so d_output2[2] is 2, d_output2[4] is 4, and so on).
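One way this conversion might be restated so that every output slot can be computed independently is a prefix sum followed by a binary search. The following is only a CPU-side C++ sketch under that assumption, not tested CUDA code: the function names (owner_of, expand_sorted, expand_pinned) are made up for illustration, std::partial_sum stands in for a parallel scan, and each loop iteration stands in for one GPU thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Index of the first element of an inclusive scan that is greater than k.
// Each thread could run this binary search for its own k.
static int owner_of(const std::vector<int>& scan, int k) {
    return static_cast<int>(std::upper_bound(scan.begin(), scan.end(), k) - scan.begin());
}

// Builds d_output1: index i repeated counts[i] times, in ascending order.
std::vector<int> expand_sorted(const std::vector<int>& counts) {
    const int n = static_cast<int>(counts.size());
    std::vector<int> scan(n);
    std::partial_sum(counts.begin(), counts.end(), scan.begin());  // inclusive scan
    assert(n == 0 || scan[n - 1] == n);  // assumption from the post: counts sum to N
    std::vector<int> out(n);
    for (int j = 0; j < n; ++j) {        // iterations are independent of each other
        out[j] = owner_of(scan, j);
    }
    return out;
}

// Builds d_output2: every index i with counts[i] > 0 is pinned at slot i, and
// the remaining counts[i] - 1 copies are spread over the slots whose count is 0.
std::vector<int> expand_pinned(const std::vector<int>& counts) {
    const int n = static_cast<int>(counts.size());
    std::vector<int> extra(n), extra_scan(n), free_slots;
    for (int i = 0; i < n; ++i) {
        extra[i] = std::max(counts[i] - 1, 0);  // copies beyond the pinned one
    }
    std::partial_sum(extra.begin(), extra.end(), extra_scan.begin());
    std::vector<int> out(n);
    for (int i = 0; i < n; ++i) {
        if (counts[i] > 0) {
            out[i] = i;               // pinned slot
        } else {
            free_slots.push_back(i);  // on a GPU: stream compaction via another scan
        }
    }
    // When the counts sum to N, free slots and extra copies match up exactly.
    assert(free_slots.size() == static_cast<std::size_t>(n == 0 ? 0 : extra_scan[n - 1]));
    for (std::size_t k = 0; k < free_slots.size(); ++k) {
        out[free_slots[k]] = owner_of(extra_scan, static_cast<int>(k));
    }
    return out;
}
```

For the example input above, expand_sorted gives d_output1 and expand_pinned gives d_output2. The point of the design is that no thread has to know what another thread wrote; everything is derived from the scans.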
Does anybody see a way to do this in a more parallel fashion? Can I use atomics to calculate which index of the output array is the next to be filled? Can I lock the whole d_output array for writing? Or only one element of it?
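To make the atomics question concrete, here is a rough sketch of what I have in mind: every nonzero entry reserves a contiguous range of the output with a single atomic add on a shared cursor and then fills that range. It is written as plain C++, with std::atomic's fetch_add standing in for CUDA's atomicAdd and the outer loop standing in for the threads; the function name expand_with_atomic_cursor is made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Each entry with counts[i] > 0 reserves counts[i] consecutive output slots
// by bumping a shared cursor once, then writes its own index into them.
std::vector<int> expand_with_atomic_cursor(const std::vector<int>& counts) {
    const int n = static_cast<int>(counts.size());
    std::vector<int> out(n, -1);
    std::atomic<int> cursor{0};      // stand-in for a counter in global memory
    for (int i = 0; i < n; ++i) {    // one iteration = one hypothetical thread
        if (counts[i] == 0) continue;
        const int start = cursor.fetch_add(counts[i]);  // reserve [start, start + counts[i])
        assert(start + counts[i] <= n);                 // holds when counts sum to N
        for (int c = 0; c < counts[i]; ++c) {
            out[start + c] = i;
        }
    }
    return out;
}
```

Run sequentially like this, the result is the same as d_output1. With real threads the reservations would happen in arbitrary order, so the blocks of equal values would end up in a different order from run to run, and a thread with a large count (like the 1000 above) would still write all of its copies by itself.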
I hope I have explained my problem clearly enough for people to understand what I need to do.
And I hope somebody has a good idea. I can always copy the d_input array to the host, do this step on the CPU, and copy the index array back, but that involves copying a lot of memory and also creates CPU-scheduling challenges, since the CPU is already very busy keeping the GPU busy.