Hi all,
I’m currently working on a kernel that deals with a lot of data. The data are stored in arrays allocated through cudaMalloc().
To improve the speed, and since some of the data has to be accessed many times, I copy the data used by each block into shared memory. The whole computation is then done in shared memory, and at the end of the kernel I write the results back to global memory.
The kernel takes less than a second to run, which is the expected time. But sometimes, when I recompile the file without changing anything, the same code takes more than 5 minutes to execute. I don’t understand why this happens.
I also noticed that if I don’t change the values stored in shared memory, this never happens: the program is always fast. This may not be relevant, though, as the compiler may simplify many things if it can see that I never change the values contained in shared memory.
Here is my kernel:
__device__ inline void evolutionComputation(int p, int nbParticlesInBlock, float* evolution)
{
    for (int n = 0; n < nbParticlesInBlock; ++n)
    {
        evolution[p] += 1.f; // If I comment out this line, the program is always fast
    }
}
__global__ void evolutionKernel(float* p_evolution, int* p_firstParticleArray)
{
    __shared__ int lastParticleIndex, firstParticleIndex, nbParticles, cellIndex;
    extern __shared__ float s_data[]; // Array in shared memory used to store the information of the particles of this block
    int threadIndex = threadIdx.x; // Index of the thread inside the block
    // Thread 0 computes common data
    if (threadIndex == 0)
    {
        // Compute information about the block
        cellIndex = blockIdx.x + blockIdx.y * gridDim.x; // Index of the current block
        lastParticleIndex = p_firstParticleArray[cellIndex + 1]; // Index of the last particle of this block
        firstParticleIndex = p_firstParticleArray[cellIndex]; // Index of the first particle of this block
        nbParticles = lastParticleIndex - firstParticleIndex; // Number of particles in the cell
    }
    // Synchronize the threads to ensure that all this information is visible to all threads
    __syncthreads();
    // Compute the global index of the particle
    int particleIndex = firstParticleIndex + threadIndex;
    // Check if the particle is in this block
    if (particleIndex < lastParticleIndex)
    {
        // Create sub-arrays
        float* s_evolution = s_data;
        s_evolution[threadIndex] = p_evolution[particleIndex];
        // Wait until all threads are done with the copy operations
        __syncthreads();
        // Set the evolution to 0
        s_evolution[threadIndex] = 0.f;
        // Compute forces between particles of the same block
        evolutionComputation(threadIndex, nbParticles, s_evolution);
        // Write the result back to global memory
        p_evolution[particleIndex] = s_evolution[threadIndex];
    }
}
The two arrays received as arguments of the kernel are declared outside any function as follows:
float* g_evolution;
int* g_firstParticleIndex;
and allocated through cudaMalloc() as usual.
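For completeness, my host-side setup looks roughly like this (a sketch: nbParticles, nbCellsX, nbCellsY, and threadsPerBlock are placeholders for my actual values):

// Sketch of the host-side allocation and launch
void runEvolution(int nbParticles, int nbCellsX, int nbCellsY, int threadsPerBlock)
{
    int nbCells = nbCellsX * nbCellsY;
    // One evolution value per particle
    cudaMalloc((void**)&g_evolution, nbParticles * sizeof(float));
    // One first-particle index per cell, plus one sentinel entry so that
    // p_firstParticleArray[cellIndex + 1] is always valid
    cudaMalloc((void**)&g_firstParticleIndex, (nbCells + 1) * sizeof(int));

    // ... copy the particle data and the index array to the device ...

    // One block per cell; the third launch parameter is the size of the
    // dynamic shared memory array s_data (one float per thread)
    dim3 grid(nbCellsX, nbCellsY);
    evolutionKernel<<<grid, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(
        g_evolution, g_firstParticleIndex);
    cudaDeviceSynchronize();
}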
Notice that there is a write to shared memory in the global function as well, but that one doesn’t slow down the computation. Only the one in the device function does.
I suppose that I’m doing something wrong somewhere.
Thanks for your help, I’m getting really confused by this non-reproducible behaviour of my code.