Unparallizable problem solvable with atomic ops?

Hi all, I have a small problem that I have now implemented within a big if(threadIdx.x==0) because it is not really parallizable (is that valid english???) It is a very small step in my algorithm, but costs the most time by far… :(

Anyway here is what I need to do:

d_input is an int array of size N, that contains zeros and other values that sum up to N. It tells me basically how often I have to recreate a value, so if the 20th value is 1000, I have to copy the 20th value of another array 1000 times. I was solving that by creating an array with indices that contains 1000 times the value 20 and using a texture to ‘sample’ the other array using the indices. (while making sure that the 20th value in my indices array is 20, so I get no races)

So for example :

d_input   = [0 0 2 0 6 0 5 0 0 1 0 0 0 0]

d_output1 = [2 2 4 4 4 4 4 4 6 6 6 6 6 9]

d_output2 = [2 4 2 4 4 4 6 4 4 9 6 6 6 6]

d_input needs to be converted into d_output1 and to prevent races I have to convert that into d_output2 (each value that is an element of this array is placed at that index)

Does anybody see a way to do this more parallel? Can I use atomics to calculate which index into the output array is next to be filled? Can I block the whole d_output array to be written to? Or only one element of the output array?

I hope I have explained my problem so people understand what I need to do.

And I hope people have a good idea. I can always copy the d_input array to host and do it on CPU, copying back the indices array, but this will involve copying lots of memory and also give lots of CPU-scheduling challenges, since I will have a very busy CPU to keep the GPU busy.

This is the code I have come up with, I will have access to a compute 1.1 capable GPU this weekend, so will test it then. Is there anybody that has done some atomic ops that can shoot holes in my implementation? Or someone that has not done atomic ops may also shoot holes in it offcourse ;)

__device__ function generate_indices_atomic(int *arr_index, int *indices, int *input, int num_el)

{ //indices has to be initialized with negative values, arr_index[0] = 0; at the start

 int index = threadIdx.x + __mul24(blockIdx.x, blockDim.x)

  if (index < num_el) {

   int num = input[index];

    if (num>0) { //value index has to come back (once or more times)

     // first we put value index at array_index index;

      int ind = index;

      int prev = AtomicExch(indices[ind],index);

     if (prev>=0) { //there was already a value placed there, so we have to put it at a free place

        ind = AtomicInc(arr_index[0], num_el);

        int prev = AtomicExch(indices[ind], prev);

       while (prev>=0) { // the place was not free...

          prev = AtomicExch(indices[ind], prev); // place the value that was already there back

          ind = AtomicInc(arr_index[0], num_el);

          prev = AtomicExch(indices[ind], prev); // place it at an (hopefully) free place

        }

      }

     // then we put the rest of the values in free places in the array

      for (int k = 1; k < num; k++) {

        ind = AtomicInc(arr_index[0], num_el);

        prev = AtomicExch(indices[ind], index);

        while (prev>=0) { //there was already a value placed there, so place it back and put at another place

          int should_be_index = AtomicExch(indices[ind], prev);

          ind = AtomicInc(arr_index[0], num_el);

          prev = AtomicExch(indices[ind], index);

        }

      }

    }

  }

}

This looks like a perfect fit for prefix-sum/scan algorithm, which is probably a lot more efficient than using atomic ops all over the place (which require special hardware, and are quite slow).

Hmm, I use scan in the part that generates how many copies of each input have to be made, but I cannot see how I can use that to generate this list of indices.

I have a float array of weights[2048], from that I generate (using scan) an array with how many times each weight has to be copied (keeping the total amount 2048).

Often one of the weights turns into 1024 copies, and the rest is spread all over with 4 or 5 copies each.

I really do not see how to use scan to generate my list of indices. I will check prefix-sum tomorrow at work to see if that can be used, but I must admit that at this time I see no way how to do it ;) Or maybe generate an array of 2048*2048 big and remove the non-valid indices till I have 2048 elements left.

Hmm, I remember there was a thread about removing zero values from an array not so long ago, I will search for it to see what was advised there.

Just to give an update. As said I have 2048 values and luckily I could only copy a single value 1024 times maximum. So I had to generate 20481024 values and scan over that. As it turns out, the scan is very fast (although I do not get the right values yet after converting scanLargeArray from float to int) but generating the 20481024 values is taking so much extra time that I can better use my very crude kernel.

So now I will try to see how much gain I can get from using atomic ops.

Since when do you have a compute capability 1.1 device?

well, I have a notebook at home with a 8600M GT :D

And if it turns out I can gain a lot in the kernel I might be buying 2x 9800X2 in the future since my current benchmarks on a single 8800GTX indicate I should be able to get real-time performance with 4 GPU’s. And from what I saw up to now both core & memory seem to be higher clocked on 9800X2 than on 8800GTX.