Undocumented memory pitfalls On correctness, not performance

  1. Coalesced write vs non-coalesced write in another block

One warp does a coalesced write. For example, to address range A~A+128, but only A~A+124 is actually written (a predicated write, or some threads returned before the write)

If another warp writes slot A+124 at the same time, the write would FAIL.

An example:

__global__ void ker0(int *a){


a[blockIdx.x*254+threadIdx.x]=threadIdx.x;//the second block may FAIL



  1. Predicated write to the same slot

If one warp does a predicated write to the same memory slot, but only one actually writes, the write may FAIL.

An example:

extern int __shared__ shi[];

int w;


if(threadIdx.x==0)shi[0]=w;//MAY FAIL

Are you sure of that?
Can anyone from NVIDIA confirm it or contradict?

First, I apologize for intentionally lying in forum.

I posted this to test nVidia’s desire of responding to questions.
1 is made up, and I’m mostly sure of 2 (I tested it).

Well, seems nVidia isn’t going to answer such questions.

The secret shopper of forums!

So, what are the actual questions?

  1. I get no failure as the result is as expected.

  2. Again, the result is as expected. If all the threads of a block are writing shi[0] to some global memory location, you have to make sure to call __syncthreads() between the setting of a shared memory location, and the read from it by other threads.

It’s also good to know that you post bogus questions to waste people’s time for fun. How do you think that affects the credibility and priority of your future questions? You’re probably smart enough to figure that out (I would think).


I’m very sorry to waste your time this way (which I estimate to be ~1.5 days), considering you took first priority in answering my question. I was being angry while working on Sunday to do the 5th complete rewrite of my program (to use more local memory), and made a selfish decision to let you share some of it. I apologize for this.
Truly, 1 is bogus. 2 indeed happened to me once in a big kernel, but now I can’t reproduce it in small kernels. It won’t be of much help to you then. This post turns out to be entirely my fault.

The thing is, mostly when something goes wrong, nobody knows whose fault it is. I’m working for EG08, the deadline is near, and now I barely have time for paper and demo.
Something like “kernel claims to use 31 registers, runs with no error at 256 threads, and returns bogus”, would happen once or twice every time I do a complete rewrite. Often, it’s my own fault. But at the point where “inserting debug code breaks the kernel seemingly at random”, or “kernel runs 2x slower after inserting a useless statement”. I don’t have much choice:

  1. Send my kernels to you. This leaks IP, and my boss would get angry at me. If it ends up still being my fault, then everyone, including myself, would waste time and get angry.
  2. Try to work around blindly. This is frustrating, and I get angry at you.
  3. Post my guess of problem here, it’ll likely end up like this one, and you get angry at me.
  4. Try to make a reproduce case. It won’t be of much use to me if the next release isn’t released before the EG08 deadline (9.26). Also, I have to go through 2 before this. It would end up with an angry me, and a satisfying, or equally angry (in case of a bogus repro case), you.
    Since I’m now at the demo&paper writing stage, I’m much less angry on a daily basis, and could sit down and talk calmly. If, just, more technical specs (e.g. warp divergence handling, ptxas source code, cubin binary code spec) could be made available, all of us would get much less angry.