Kernel and wait events

Hi everybody,

First of all, I am new to this forum, so please excuse my lack of knowledge!

I am trying to use asynchronous copies together with wait events, and I am having a lot of trouble understanding the behavior of my code.

I allocate pinned, mapped memory using:

float *Neurons_device, *Neurons_host;

cudaSetDeviceFlags(cudaDeviceMapHost);

// Allocate the memory on the host
cudaHostAlloc((void **)&Neurons_host, sizeNETWORK, cudaHostAllocMapped);

// Get the corresponding device pointer
cudaHostGetDevicePointer((void **)&Neurons_device, Neurons_host, 0);
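As a side note, my understanding is that mapped (zero-copy) memory requires device support, and that cudaSetDeviceFlags must be called before any CUDA context is created. A minimal check (a sketch, assuming device 0) would look like:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    // Verify that the device can map host memory before relying on
    // cudaHostAllocMapped / cudaHostGetDevicePointer (sketch, device 0 assumed).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        fprintf(stderr, "Device 0 cannot map host memory\n");
        return EXIT_FAILURE;
    }
    // Must be set before the context is created (i.e. before any other CUDA call
    // that touches the device).
    cudaSetDeviceFlags(cudaDeviceMapHost);
    return EXIT_SUCCESS;
}
```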

Then, I use the following code to launch my kernels:

NetNeurons<<<1,NNEUR,0,streams[0]>>>(Neurons_device, d_OUT, 0);
cudaMemcpyAsync(OUT_TOT, d_OUT, bytes, cudaMemcpyDeviceToHost, streams[0]);

for (i = 1; i < NBLOCKS; i++) {
    NetNeurons<<<1,NNEUR,0,streams[i]>>>(Neurons_device, d_OUT + (i*NNEUR*TOTSTEP), i);
    cudaEventRecord(kernelEvent[i], streams[i]);
    cudaStreamWaitEvent(streams[i], kernelEvent[i-1], 0);
}

for (i = 1; i < NBLOCKS; i++)
    cudaMemcpyAsync(OUT_TOT + (i*NNEUR*TOTSTEP), d_OUT + (i*NNEUR*TOTSTEP), bytes, cudaMemcpyDeviceToHost, streams[i]);

cudaEventSynchronize(kernelEvent[NBLOCKS-1]);

In my kernel (NetNeurons), the variable Neurons_device is loaded, I do some computations, and I update Neurons_device so it holds the initial conditions for the next execution of the same kernel. I expected the pair cudaEventRecord(kernelEvent[i], streams[i]); and cudaStreamWaitEvent(streams[i], kernelEvent[i-1], 0); to serialize the kernel launches (streams[1], then streams[2], then …). I also believed (but apparently this is not the case!) that the next kernel would only read its inputs once the previous one had recorded its event. What I observe is that the next kernel does not use the updated version of Neurons_device, i.e. the initial conditions it needs for its call.
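To make the dependency chain I am trying to build explicit: block i must not start before block i-1 has finished writing Neurons_device. My current understanding (which may be wrong, hence this post) is that cudaStreamWaitEvent only affects work enqueued into the stream *after* the wait call, so the wait would have to be issued before the dependent kernel launch, and the first event would have to be recorded too. A sketch of that ordering, reusing my variable names (untested):

```cuda
// Sketch: serialize NetNeurons launches across streams with events (untested).
// The wait is issued BEFORE the dependent kernel is enqueued; a wait issued
// after the launch does not apply to it.
NetNeurons<<<1, NNEUR, 0, streams[0]>>>(Neurons_device, d_OUT, 0);
cudaEventRecord(kernelEvent[0], streams[0]);            // mark completion of block 0

for (i = 1; i < NBLOCKS; i++) {
    cudaStreamWaitEvent(streams[i], kernelEvent[i-1], 0);   // wait for block i-1
    NetNeurons<<<1, NNEUR, 0, streams[i]>>>(Neurons_device,
                                            d_OUT + (i*NNEUR*TOTSTEP), i);
    cudaEventRecord(kernelEvent[i], streams[i]);             // mark completion of block i
}
```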

Here is the code of the kernel. The __syncthreads() calls are not useful for the moment, but they will be in the near future!

__global__ void NetNeurons(float *IN, float *OUT, int iBlock)
{
	int idx = threadIdx.x;
	int i, j;
	float yout[NVarNEUR];
	float yin[NVarNEUR];
	float inputs[2];

	inputs[0] = 0.0f;
	inputs[1] = 0.0f;

	//__shared__ float Vtmp[NNEUR];
	//if (idx==0) for (i=0;i<NNEUR;i++) Vtmp[i]=IN.elements[i*NVarNEUR];

	// Load this thread's state from the mapped input buffer.
	for (i = 0; i < NVarNEUR; i++) yout[i] = IN[idx*NVarNEUR + i];
	OUT[idx] = IN[idx*NVarNEUR];
	__syncthreads();

	for (j = 1; j < TOTSTEP; j++) {
		if (((j + iBlock*TOTSTEP) < ((int) 500/DT)) && ((j + iBlock*TOTSTEP) >= ((int) 100/DT))) //<<<<<<--------
			inputs[0] = 0.05f * yout[0];
		else
			inputs[0] = 0.0f;

		for (i = 0; i < NVarNEUR; i++) yin[i] = yout[i];
		simpr(yin, yout, inputs);
		__syncthreads();

		//Vtmp[idx]=yout[0];
		OUT[idx + j*NNEUR] = yout[0];
		__syncthreads();
	}

	// Write the final state back so the next kernel call starts from it.
	for (i = 0; i < NVarNEUR; i++) IN[idx*NVarNEUR + i] = yout[i];
	__syncthreads();
}

I hope I am being clear… If someone has an idea, it would be most welcome!

Thank you,

Pierre