Undocumented memory pitfalls: On correctness, not performance

asadafag · August 20, 2007, 10:08am

1. Coalesced write vs non-coalesced write in another block

One warp does a coalesced write to, for example, the address range A~A+128, but only A~A+124 is actually written (because of a predicated write, or because some threads returned before the write).

If another warp writes slot A+124 at the same time, the write would FAIL.

An example:

__global__ void ker0(int *a){
    if(threadIdx.x>=254)return;
    a[blockIdx.x*254+threadIdx.x]=threadIdx.x;//the second block may FAIL
}
//blah
ker0<<<2,256,0>>>(blah);

2. Predicated write to the same slot

If the threads of one warp do a predicated write to the same memory slot, but only one thread actually writes, the write may FAIL.

An example:

extern __shared__ int shi[];
int w;
//blah
if(threadIdx.x==0)shi[0]=w;//MAY FAIL

weigo · August 21, 2007, 6:32am

Are you sure of that?
Can anyone from NVIDIA confirm it or contradict?

asadafag · August 21, 2007, 11:34am

First, I apologize for intentionally lying on the forum.

I posted this to test nVidia’s willingness to respond to questions.
1 is made up, and I’m mostly sure of 2 (I tested it).

Well, it seems nVidia isn’t going to answer such questions.

cmorrison · August 21, 2007, 12:02pm

The secret shopper of forums!

paulius · August 27, 2007, 10:16pm

So, what are the actual questions?

For 1, I get no failure; the result is as expected.
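A check along these lines can be sketched as follows. It is only a minimal sketch, assuming a single CUDA device and the two-block launch from the first post; the helper name `count_ker0_mismatches` and the -1 sentinel fill are made up for illustration and are not from the original posts.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel from the first post: threads 254 and 255 of each block exit early,
// so the last warp of each block issues a partially populated store.
__global__ void ker0(int *a)
{
    if (threadIdx.x >= 254) return;
    a[blockIdx.x * 254 + threadIdx.x] = threadIdx.x;
}

// Hypothetical helper: runs ker0<<<2,256>>> and returns the number of
// elements that differ from the expected value, or -1 on a CUDA error.
int count_ker0_mismatches()
{
    const int blocks = 2, live = 254, n = blocks * live;
    int *d_a = nullptr;
    if (cudaMalloc(&d_a, n * sizeof(int)) != cudaSuccess) return -1;
    // Fill with 0xFF bytes so every int starts as -1, a value ker0 never writes.
    cudaMemset(d_a, 0xFF, n * sizeof(int));

    ker0<<<blocks, 256>>>(d_a);

    std::vector<int> h_a(n);
    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_a.data(), d_a, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    if (err != cudaSuccess) return -1;

    // Block b, thread t writes t to a[b*254 + t], so a[i] should equal i % 254.
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != i % live) ++bad;
    return bad;
}
```

Calling `count_ker0_mismatches()` from `main` and comparing the return value with 0 is enough to see whether any write from the second block went missing.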
For 2, again, the result is as expected. If all the threads of a block then write shi[0] to some global memory location, you have to call __syncthreads() between the write to the shared memory location and the reads of it by other threads.
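A minimal sketch of that pattern, assuming the shared array is dynamically sized at launch as in the first post. The kernel name `broadcast`, the helper `count_broadcast_mismatches`, and the launch dimensions are made up for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Thread 0 publishes w through shared memory; every thread then copies it out.
__global__ void broadcast(int *out, int w)
{
    extern __shared__ int shi[];
    if (threadIdx.x == 0) shi[0] = w;
    __syncthreads();  // barrier: the write above is visible to all reads below
    out[blockIdx.x * blockDim.x + threadIdx.x] = shi[0];
}

// Hypothetical helper: returns how many outputs differ from w, or -1 on error.
int count_broadcast_mismatches(int w)
{
    const int blocks = 2, threads = 256, n = blocks * threads;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    // Third launch parameter: bytes of dynamic shared memory backing shi[].
    broadcast<<<blocks, threads, sizeof(int)>>>(d_out, w);

    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != w) ++bad;
    return bad;
}
```

Without the barrier, warps other than the one containing thread 0 may read shi[0] before it has been set, which can look like a write that "failed".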

It’s also good to know that you post bogus questions to waste people’s time for fun. How do you think that affects the credibility and priority of your future questions? You’re probably smart enough to figure that out (I would think).

Paulius

asadafag · August 28, 2007, 4:34am

I’m very sorry to have wasted your time this way (which I estimate to be ~1.5 days), considering you gave first priority to answering my question. I was angry while working on a Sunday on the 5th complete rewrite of my program (to use more local memory), and made a selfish decision to let you share some of it. I apologize for this.
Truly, 1 is bogus. 2 indeed happened to me once in a big kernel, but now I can’t reproduce it in small kernels. It won’t be of much help to you then. This post turns out to be entirely my fault.

The thing is, most of the time when something goes wrong, nobody knows whose fault it is. I’m working toward EG08, the deadline is near, and I now barely have time for the paper and demo.
Something like “kernel claims to use 31 registers, runs with no error at 256 threads, and returns bogus results” happens once or twice every time I do a complete rewrite. Often, it’s my own fault. But at the point where “inserting debug code breaks the kernel seemingly at random” or “kernel runs 2x slower after inserting a useless statement”, I don’t have many choices:

1. Send my kernels to you. This leaks IP, and my boss would get angry at me. If it ends up still being my fault, then everyone, including myself, would waste time and get angry.
2. Try to work around it blindly. This is frustrating, and I get angry at you.
3. Post my guess at the problem here. It’ll likely end up like this one, and you get angry at me.
4. Try to make a repro case. It won’t be of much use to me if the next release doesn’t come out before the EG08 deadline (9.26). Also, I have to go through 2 before this. It would end up with an angry me, and a satisfied, or equally angry (in case of a bogus repro case), you.
Since I’m now at the demo and paper writing stage, I’m much less angry on a daily basis, and can sit down and talk calmly. If only more technical specs (e.g. warp divergence handling, ptxas source code, cubin binary code spec) could be made available, all of us would get much less angry.
