Hi everyone,

I’m working on a real-time finite element solver. I’m using two kernels so far: the first loops over elements to compute the nodal forces, and the second re-uses those forces in a loop over nodes to compute the nodal displacements. This means that in my first kernel I need to write into a node-sized array even though the kernel runs once per element. Here is what my code looks like:

```
__global__ void Kernel1(float* Force_array)
{
    /// Thread index (one thread per element), also used as the texture index
    int th = blockIdx.x*blockDim.x + threadIdx.x;

    /// Node list for the current element (the four node numbers of this element)
    int4 ElNodes = texfetch(ElNodeInd_ref, th);

    /// Per-element nodal force contributions, filled by the calculations below
    float Force_temp[12];

    /// ... nice calculations ... and at the end:

    /**
     * Writes the result to global memory (scatter: one element updates four nodes)
     */
    // First node
    Force_array[4*ElNodes.x]   += Force_temp[0];
    Force_array[4*ElNodes.x+1] += Force_temp[1];
    Force_array[4*ElNodes.x+2] += Force_temp[2];
    // Second node
    Force_array[4*ElNodes.y]   += Force_temp[3];
    Force_array[4*ElNodes.y+1] += Force_temp[4];
    Force_array[4*ElNodes.y+2] += Force_temp[5];
    // Third node
    Force_array[4*ElNodes.z]   += Force_temp[6];
    Force_array[4*ElNodes.z+1] += Force_temp[7];
    Force_array[4*ElNodes.z+2] += Force_temp[8];
    // Fourth node
    Force_array[4*ElNodes.w]   += Force_temp[9];
    Force_array[4*ElNodes.w+1] += Force_temp[10];
    Force_array[4*ElNodes.w+2] += Force_temp[11];
}
```

Force_array is an array allocated with cudaMalloc from the host, and Force_temp is a temporary array created inside my kernel.

I’m fully aware that different threads might try to write to the same address at the same time. Indeed, two threads (two elements) may share a node, and they will both try to write their result to the memory address representing that node at the same time. But in the Programming Guide Version 0.8.2, section 3.2, we can read:

The order is not defined, but since I’m summing the numbers I don’t care about the order. It’s true that I’m not sure every write will occur, but in my understanding that shouldn’t cause a crash at least.
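For what it’s worth, one way I could imagine serializing the conflicting writes would be atomic adds instead of the plain `+=` scatter. This is only a sketch: atomicAdd on float requires hardware and toolkit support for floating-point atomics, which my setup may well not have, and the kernel body is abbreviated.

```
// Hypothetical variant of Kernel1 using atomic accumulation, so that
// concurrent updates to a shared node are serialized instead of racing.
// Assumes a GPU/toolkit that supports atomicAdd on float.
__global__ void Kernel1_atomic(float* Force_array)
{
    int th = blockIdx.x*blockDim.x + threadIdx.x;
    int4 ElNodes = texfetch(ElNodeInd_ref, th);

    float Force_temp[12];
    // ... element force computation fills Force_temp ...

    // First node: atomic read-modify-write instead of a racy +=
    atomicAdd(&Force_array[4*ElNodes.x],   Force_temp[0]);
    atomicAdd(&Force_array[4*ElNodes.x+1], Force_temp[1]);
    atomicAdd(&Force_array[4*ElNodes.x+2], Force_temp[2]);
    // ... and the same for the second, third and fourth nodes ...
}
```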

But when I launch this kernel, I randomly get this error back from cudaGetErrorString(cudaGetLastError()): *unspecified driver error*. And I do mean randomly: sometimes I don’t get any error at all, and I’m not in the situation where the first call works fine and the next ones don’t. If I keep trying, after 3 or 4 calls for instance, it will work again. If I don’t do any OpenGL rendering of my results (basically, I send the nodal displacements I computed and move one object), I just get that error and nothing more. But if I do render, I barely have time to see the error messages before my computer has frozen and I have to reboot. Everything works fine if I’m not using scatter operations.
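For reference, here is roughly how I check for the error on the host side (a sketch with illustrative names). I’ve read that kernel launches are asynchronous, so without a synchronization point a failure inside the kernel may only surface at some later, unrelated API call, which might explain why the error looks "random":

```
// Sketch of the host-side launch and error check (grid/block names are illustrative).
Kernel1<<<numBlocks, threadsPerBlock>>>(d_Force_array);

// Force the kernel to complete so that any launch/execution error is
// reported here rather than at some later runtime call.
cudaThreadSynchronize();

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Kernel1 failed: %s\n", cudaGetErrorString(err));
```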

When I searched for that error message on the forums, I found people saying they got it when trying to run the CUDA examples on a 64-bit processor. I have a Dell Precision 690 with a dual-core Xeon processor (64-bit), a GeForce 9800 GTX, and I’m running Kubuntu.

So my question is: why do I get this error? Is it because of the undefined behaviour of the scatter operations? Or is it because I’m using a 64-bit processor with a 32-bit driver and it doesn’t always work properly? Or perhaps a little bit of both?

Any help would be much appreciated. Thanks for your time.