After some testing, the problem comes down to the following:
The particle array (an array of structures) has a lot of members. If I access the array and change e.g. particle.velocity.x in two different kernels, the CUDA part breaks down after some time. But this only happens when one kernel accesses the particle array on a per-thread basis (one particle per thread) and the other one loops over particles (the particles are sorted into cells, and each thread processes one cell by looping over the particles in it). This leaves me confused; there must be some access problem? But the kernels do their work fine individually, they just don't like each other. What the hell…?
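For reference, the two access patterns look roughly like this (a minimal sketch; `Particle`, the field names, and the `cellStart`/`cellCount` arrays are placeholders for my actual structures, which assume particles are sorted by cell):

```cuda
// Kernel A: one thread per particle (works fine on its own)
__global__ void PerParticleKernel(Particle *particles, int numParticles)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < numParticles)
        particles[id].velocity.x += 0.1f;
}

// Kernel B: one thread per cell, looping over that cell's particles
// (cellStart/cellCount index into the sorted particle array)
__global__ void PerCellKernel(Particle *particles, const int *cellStart,
                              const int *cellCount, int numCells)
{
    int cell = blockDim.x * blockIdx.x + threadIdx.x;
    if (cell < numCells)
    {
        for (int i = 0; i < cellCount[cell]; ++i)
        {
            int id = cellStart[cell] + i;
            particles[id].velocity.x += 0.1f;
        }
    }
}
```

One thing worth checking in this pattern: if the cell index arrays go stale (e.g. they were built before a kernel moved particles between cells, or counts and offsets disagree), the loop in the second kernel can index past the end of the particle array, which would look exactly like an unspecified launch failure.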
Edit: Also, if I reduce the number of particles accessed in the kernel to below 2000, it works. Otherwise, the CUDA_SAFE_CALL around the next memcpy reports an unspecified launch failure, so I suppose it's a memory access problem (which can't be, because of the if statement, see below).
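Note that kernel launches are asynchronous, so an execution error in the kernel often only surfaces at the next CUDA runtime call, e.g. the memcpy wrapped in CUDA_SAFE_CALL. To pin down which kernel actually fails, the error can be checked right after each launch (a sketch; the launch configuration names are placeholders):

```cuda
StreamingStep<<<numBlocks, threadsPerBlock>>>(d_particles);

// Catch launch-configuration errors (bad grid/block size etc.) immediately
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

// Force execution errors to surface here instead of at the next memcpy
err = cudaThreadSynchronize(); // cudaDeviceSynchronize() in newer toolkits
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));
```

With this in place after every launch, the "unspecified launch failure" gets attributed to the kernel that caused it rather than to an innocent memcpy.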
The kernel causing problems (reduced to the min while still yielding the error) looks like this:
__global__ void StreamingStep(Particle *particles)
{
    int ParticleID = blockDim.x * blockIdx.x + threadIdx.x;
    if (ParticleID < 5000) // more than 100000 particles are defined, 5000 is just for testing
        particles[ParticleID].position.x += 0.1f;
}
EDIT: After restarting the GPU PC, the problem doesn't occur any more, so it might have been a memory problem. It would be nice to know what happened, though.