setting bits in shared memory

thegallier · May 10, 2007, 2:10am

If one has to have multiple threads writing to different bits but in the same byte in shared memory (I suspect device memory has the same problem), am I correct that there will be a conflict (using the traditional mask write method) and you can not assume that the bit is actually set ? Any method around this (besides the read and verify method mark mentioned in another forum)?

Thanks

osiris1 · May 14, 2007, 4:50am

Guess you answered your own question (have not seen that post … link?)
According to my understanding (still awaiting response on Good programming topic) this should work in shared memory:
do {mask |= 1<<n;} while ((mask & ( 1<<n)) == 0);
if you cannot afford 1 byte per (as bytes are addressable) it will diverge and go round the necessary number of times to set all bits. Keep mask to byte array, as any bank conflict is cheaper than looping for larger types.
Don’t try this in device memory or your app will just about stop. It still should get the right answer. Device memory is virtually free so you can afford individually addressable units (bytes). It will still be slow.
Eric

ed: if you have a big enough array (and at least 8 warps) adjust your indexing so that 1 warp does not affect more than 1 bit in any one byte and then you don’t need the loop. This does assume that accessing shared memory does not cause a stall and warp context switch which is implied but not explicitly stated… was a good idea however register access can switch you (again implied) so this is not such a good idea as no doubt the data has to go though registers. Implementation dependent so should not be done in any case.

thegallier · May 14, 2007, 7:05am

link

[url=“The Official NVIDIA Forums | NVIDIA”]The Official NVIDIA Forums | NVIDIA

relevant code

// one histogram array per 32-thread warp:
shared unsigned int histogram[NHIST][NBIN];

unsigned short bin = pixelValue(); // load a pixel from memory

// do this until update is not overwritten:
do
{
u32 myVal = histogram[myHist][bin] & 0x7FFFFFF; // read the current bin val
myVal = ((tid & 0x1F) << 27) | (myVal + 1); // tag my updated val
histogram[myHist][bin] = myVal; // attempt to write the bin
} while (histogram[myHist][bin] != myVal);

I think that what makes the problem unique, is that setting random access bits, does not seem to be available in cuda and now you have to deal with conflicts of writing bits in the same byte (moving larger chunks of memory in one move creates more conflict problems… :wacko: ).

Thanks for the help.

osiris1 · May 14, 2007, 9:18am

My suggestion has the same restrictions as the histogram example - the target bit array has to be separate for each warp (still could be useful). Totally general random bit setting by concurrent threads in shared memory is not possible (IMHO) with the current architecture. A sync with slightly different semantics would allow it.
Eric

ed: as usual there is an update - if you read/sync/write if not set/sync 8 times in a row you must succeed when operating on bytes. Not very efficient!

thegallier · May 15, 2007, 6:59am

Maybe I misunderstand your posting: If random byte setting by concurrent threads is allowed in shared memory, why would a repeated write of a byte until the bit is set not be working? I do accept that all threads will have to wait until all bits have been set (not particularly efficient, but not a killer in my case).

Thanks

osiris1 · May 15, 2007, 7:50am

You do understand my post - the spec has been recently updated to say that if there are concurrent writes to the same location then one must succeed. You need to write 8 times as if you were unlucky enough that 8 threads wanted to set different bits in the same byte then each one would succeed on one of the tries. Use a for loop here as the compiler will unroll it.
Eric

ed: no change - changed my mind about a mod

thegallier · May 15, 2007, 9:54am

I agree. In the worst case (with very low probability in my case) it might take 8 writes (and lots of threads taking a break :blink: ) to set a bit.

Thanks for your help.

Mark_Harris · May 22, 2007, 12:22pm

If it’s not random threads setting bits, you could use a parallel reduction within each warp to update each 32-bit word. This reduction could be done in 5 lines of code with no syncthreads().

BTW, currently the CUDA compiler does not automatically unroll loops.

Mark

thegallier · May 22, 2007, 3:57pm

Its unfortunately, threads setting/reading random bits. Any way that we can get bit wise access to shared memory (its the different bits within the same byte conflict that bothers me)? Shared memory seems like a very valuable (and that might be an understatement) resource in cuda, and I understand that most applications will want bigger bandwidth per read access, but for some applications bit wise gather/scatter is necessary.

BTW, thanks for the compiler reminder. I did forget, and unlooping improved my speed quite a bit. :)

osiris1 · May 24, 2007, 2:23am

Mark,

The parallel reduction method would use 128 bytes to set 32 bits - not useful if you are trying to conserve shared memory.

This does bring up another issue - I have not seen a health warning about using the emulator with any code that relies upon synchronous operation within a warp, such as parallel reduction methods… I have an example here:

#include <stdio.h>

#include <cutil.h>

#define WARPZ   32

__shared__ int  a[WARPZ];

__global__ void

testKernel(int* data)

{

    const int   tid = threadIdx.x;

   a[tid] = data[tid];

    __syncthreads();

   a[tid] += a[tid + WARPZ / 2];

    a[tid] += a[tid + WARPZ / 4];

    a[tid] += a[tid + WARPZ / 8];

    a[tid] += a[tid + WARPZ / 16];

    a[tid] += a[tid + WARPZ / 32];

    if (tid == 0)

    {

        data[0] = a[0];

    }

}

int

main(int argc, char** argv)

{

    int*        p;

    int         data[WARPZ];

    int*        d_data;

   for (p = data + WARPZ; p-- != data; )

    {

        p[0] = 1;

    }

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, sizeof(int) * WARPZ));

    CUDA_SAFE_CALL(cudaMemcpy(d_data, data, sizeof(int) * WARPZ, cudaMemcpyHostToDevice));

   dim3  grid;

    dim3  threads(WARPZ);

    testKernel<<<grid, threads>>>(d_data);

    CUT_CHECK_ERROR("Kernel execution failed");

   CUDA_SAFE_CALL(cudaMemcpy(data, d_data, sizeof(int) * WARPZ, cudaMemcpyDeviceToHost));

    printf("%8d\n", data[0]);

    return 0;

}

That gets the answer 6 under emulation and 32 on the hardware.

Have I missed something?

Thanks, Eric

ed: my error loop unrolling - misinterpretation when I first started that I never checked up on

osiris1 · May 24, 2007, 2:43am

You need to do something like this to get it to work both ways:

#include <stdio.h>

#include <cutil.h>

#define WARPZ   32

#ifdef __DEVICE_EMULATION__

#define warpsync()      __syncthreads()

#else

#define warpsync()

#endif

__shared__ int  a[WARPZ];

__global__ void

testKernel(int* data)

{

    const int   tid = threadIdx.x;

   a[tid] = data[tid];

    warpsync();

   a[tid] += a[tid + WARPZ / 2];

    warpsync();

    a[tid] += a[tid + WARPZ / 4];

    warpsync();

    a[tid] += a[tid + WARPZ / 8];

    warpsync();

    a[tid] += a[tid + WARPZ / 16];

    warpsync();

    a[tid] += a[tid + WARPZ / 32];

   data[tid] = a[tid];

}

int

main(int argc, char** argv)

{

    int*        p;

    int         data[WARPZ];

    int*        d_data;

   for (p = data + WARPZ; p-- != data; )

    {

        p[0] = 1;

    }

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, sizeof(int) * WARPZ));

    CUDA_SAFE_CALL(cudaMemcpy(d_data, data, sizeof(int) * WARPZ, cudaMemcpyHostToDevice));

   dim3  grid;

    dim3  threads(WARPZ);

    testKernel<<<grid, threads>>>(d_data);

    CUT_CHECK_ERROR("Kernel execution failed");

   CUDA_SAFE_CALL(cudaMemcpy(data, d_data, sizeof(int) * WARPZ, cudaMemcpyDeviceToHost));

    printf("%8d\n", data[0]);

    return 0;

}

ed: and then one is up the proverbial creek without a paddle if you are using parallel reduction within a section of code that not all threads reach (as I have)!

thegallier · May 24, 2007, 9:41am

I agree, you need the sync within the reduction. I also agree (as was mentioned in a previous thread) that a sync per warp would be very useful (assuming that “non active” warps use less compute time than “active” ones, something I have not quantified yet).

osiris1 · May 24, 2007, 11:14pm

Actually warps are already instruction synchronous - like that all the time. Problem is that emulation does not emulate this property at this time. It is possible to solve this problem fully (including divergent warps) in an emulator without having to go to a virtual machine. We still have all the nice debugging tools. It is quite a bit of work to do transparently and so I guess it is on someone’s todo list (if not it needs a bug report). I wrote my own emulator, and I have done it with callout macros for the moment.

Eric

osiris1 · June 5, 2007, 10:58pm

One last suggestion in this area - IMHO the current __syncthreads() is a misnomer and perhaps it should be changed to __syncWarps() as this emphasises the difference between threads and warps. Sure it would help newcomers. To me it seems like __syncthreads() is a carry over from the days when there was no hardware, only the emulator, which had a warp size of 1 (and still does).
Eric

thegallier · June 5, 2007, 11:08pm

I agree with you on the __sync and on the previous observation (“Actually warps are already instruction synchronous”). I now think of it as simd within a warp.

Also, I don’t know if you have looked at 0.9 but atomic operations were added (compute 1.1 so not 8800) for global memory, which I think is great for the stuff I am doing. Which leaves me with two questions:

There is no atomic not;
No atomic operation on shared memory.

Is there a way around both or is patience the solution?

seibert · June 6, 2007, 12:25am

You can simulate an atomic bitwise not by using the atomic XOR with 0xFFFFFFFF as the argument.

thegallier · June 6, 2007, 1:21am

Great. thx.