strange behaviour of free() in __global__ function

I have a kernel that has the (0,0,0) thread (a 3D thread block obviously) call malloc twice (i.e. two global memory arrays per block), the addresses are shared with other threads in the block via shared mem, there is a __syncthreads(), all threads write to both malloced arrays, there is another __syncthreads() again, then thread (0,0,0) free’s one of the arrays, all threads do a tonne of stuff, then thread (0,0,0) free’s the other array:

gloabal kernel() {
…stuff
if (thread 0) {
A = malloc
B = malloc
}
__syncthreads()
functionThatWritesTo(A, B)
__syncthreads()
if (thread 0) {
free(A)
}
… loads of stuff that never uses A
if (thread 0) {
free(B)
}
}

I’ve checked all the addresses and indexes, everything under the sun and all the code is correct but I’m getting a very few random bogus values in A.

Now if I move the free(A) to where the second free is at the very end of the kernel after thousands of operations and a few more __syncthreads(), the artifact goes away.

Also if I malloc a single global mem space per block and use a second pointer to the point in the block where the second section starts the artifact also goes away which seems further evidence that the indexing and addresses are all correct.

My only conclusion is that the compiler is moving code around and the free(A) is occurring before the first __syncthreads().

Is there something I don’t understand about what syncthreads() guarantees that would make this possible? Or any other suggestion anybody has to explain what is going on, what i’m doing wrong etc.

Yes I know I can inspect the PTX etc but thats not something I have any experience with so before I go down that road I wanted to throw this out there to hopefully save myself the pain.

The description suggests a race condition of some sort. Have you tried running the code under control of cuda-memcheck to see whether it reports any race conditions? Note that the tool can find only some, not all kinds of race conditions.

In the absence of a description of your understanding of __syncthreads() semantics I don’t see how this question can be answered.

If you suspect a code generation issue (unlikely but not impossible) the best code to look at is the machine code (SASS). PTX is merely an intermediate representation providing an abstract ISA. It is processed by an optimizing compiler (PTXAS) into machine code. Does the issue disappear when you compile with -Xpxtas -O1? If so, you definitely want to look at SASS, not PTX.

10,000ft question: What would be wrong with freeing both A and B at the very end of the kernel? What advantage does freeing A early provide?

Other than what I believe is good coding practice to release resources where logically they should i.e. when no longer needed, and ensuring that I don’t run out of heap… no advantage. But simply freeing it at the end doesn’t fix the problem and should not be required. I can’t see why an array that is free’d after a syncthreads and never used subsequently would affect the code in any way by being moved further down in the code. Plus I don’t like having time bombs in my code I’m not going to be happy just moving the free.

As for my understanding of synchthreads, the main points being that all writes to global and shared mem by all threads before the synch are visible to all other threads after the sync i.e. no cached values that have not been written, and also that all threads must reach that point before any thread moves past it.

Thanks for the tip on the SASS vs PTX. Looks like I’ll have to go down that road and you’ve likely saved me a lot of hours pointless effort.