Help with kernel synchronization?

I am implementing parallel prefix sum based on the paper "Efficient Parallel Scan Algorithms for GPUs" by Shubhabrata Sengupta, Mark Harris, and Michael Garland.

The problem is really CUDA-specific. I have a CUDA kernel that I call with two device pointers: d_ptr, the array on which a parallel multi-block scan is performed, and block_result, which collects the partial result from the last thread of every block. The host code appears to continue directly after the kernel invocation (which is expected according to the programming guide: kernel launches are asynchronous). My host code needs the device pointers d_ptr and block_result, especially block_result, because it makes a recursive call with it to compute the prefix sum of block_result (when the number of elements spans several blocks). The reason for using a separate kernel for the multi-block scan is that thread blocks are only synchronized once a kernel finishes. The problem is that after calling this kernel I call another host function, which launches another kernel with block_result, and by then d_ptr and block_result seem to contain only garbage. Debugging with printfs I can see that my first kernel has not finished executing when the host function is called with the pointers. Shouldn't CUDA see that a host function uses a pointer that a kernel also uses, and wait for the kernel to finish before calling the host function?

An example to clarify:

//inside some host function

T *d_ptr;

T *block_result;

//call multiblock scan kernel

scan_multi_block<<< ..... >>>(d_ptr, block_result);

//blocks are synchronized when the kernel finishes; the launch itself is asynchronous

//call host function to process block_result

foo(block_result);

__host__ void foo(T *block_result)

{

//call another kernel with block_result

another_kernel<<< ... >>>(block_result);

...

}

In the code above, it seems scan_multi_block does not finish and write its data to d_ptr and block_result before foo is called.

What am I doing wrong? How can I achieve what I described?

No, kernels are launched asynchronously. This makes sense because while the kernel is being executed by the GPU you may want to use your CPU for other work. Call cudaThreadSynchronize() to block the host thread until all CUDA work has finished, or use cudaStreamSynchronize() or cudaEventSynchronize() if you don't want to wait for everything.

Edit: Ah, sorry, I misread the last part of your second paragraph. Tim is right. Of course the kernels should normally be executed one after the other.

You’re not checking your errors and you’re probably getting an unspecified launch failure. We can go into a much deeper discussion about CUDA ordering semantics if you want, but for now it’s pretty likely that it’s just an error.

I think I get what you mean, but to clarify: kernel launches are asynchronous by default.

I have looked into the code of CUDPP, and it seems that they do not add any synchronization at all in their host scan function (scanArrayRecursive) before calling the vector addition kernel (link: http://code.google.com/p/cudpp/source/brow…pp/scan_app.cu).

Why is that? I mean, doesn't there have to be such a synchronization? Otherwise the data is not guaranteed to be correct.

How do I check for kernel errors? Do you mean calling cudaThreadSynchronize() and checking the returned code?

cudaThreadSynchronize()/cudaGetLastError()

Kernels in the same stream are executed one after the other on the device, so cudaThreadSynchronize() will only slow your performance (although you do need it when error checking is enabled, as tmurray mentioned).

So normal execution without streams runs on one stream, am I correct? Is that what you mean? In that case there must be some error in my code that I have to look for, because in my example above everything is on one stream, so each kernel should finish its execution before the next kernel is called.

You are right. "Without" streams you use stream 0, the default one. It is the last argument inside the <<< >>>.

OK, thanks for clarifying.

I have checked the returned error codes. It seems I get cudaErrorUnknown from cudaGetLastError(), meaning the error is unknown. No good :(

Seriously I cannot see what is wrong with my code.

Below is the code snippet of my implementation:

The host scan function:

void Scan(T *ptr, const unsigned long int N)

{

	const unsigned long int threadsPerBlock	= 32;

	unsigned long int blocksPerGrid			= (N + threadsPerBlock - 1) / threadsPerBlock;

	unsigned long int totThreads = blocksPerGrid*threadsPerBlock;

	size_t size = totThreads*sizeof(T);

	T *d_ptr;

	cudaMalloc(&d_ptr, size);

	//copy ptr's data to d_ptr

	//d_ptr is allocated to fit threadsPerBlock*blocksPerGrid elements

	//this does not affect the scan

	size = sizeof(T)*N;

	cudaMemcpy(d_ptr, ptr, size, cudaMemcpyHostToDevice);

	//if(DEBUG) DEBUG_COPY_PRINT(d_ptr, totThreads);

	//TODO: extract this into a separate function

	T *block_result;

	size = PadSize<T>(blocksPerGrid, threadsPerBlock);

	cudaMalloc(&block_result, size); //pointer will have blocksPerGrid positions

	

	//calculate the amount of shared memory per block

	unsigned long int sMemSize = sizeof(T)*threadsPerBlock;

	ERROR_CHECK();	//No error received

	Do_scan_block<scanProp><<<blocksPerGrid, threadsPerBlock, sMemSize>>>(d_ptr, block_result);

	cudaThreadSynchronize();

	ERROR_CHECK();	//cudaErrorUnknown received

		.......

My Do_scan_block kernel:

template<class scanProp, class T>

__global__ void Do_scan_block(volatile T *ptr, T* block_result)

{

	extern __shared__ T s_ptr[];		//shared memory allocated on invocation

	unsigned long int tid = blockIdx.x * blockDim.x + threadIdx.x;

	

	//copy from global memory to shared memory

	//TODO: think about shared memory bank conflicts

	s_ptr[tid] = ptr[tid];

	//__syncthreads();

	//do per block scan

	T val = Scan_block<scanProp>(s_ptr, tid);

	//__syncthreads();

	//copy back from shared memory to global memory

	ptr[tid] = s_ptr[tid];

	//__syncthreads();

	if(threadIdx.x == blockDim.x - 1) block_result[blockIdx.x] = val;

	__syncthreads();

}

Less relevant code (included for completeness; it is from the paper, so it should most likely be correct):

template<class scanProp, class T>

__device__ T Scan_block(volatile T *ptr, const unsigned int idx = threadIdx.x)

{

	const unsigned int lane		= idx & 31; //thread id within warp

	const unsigned int warpid	= idx >> 5; //divide with warpsize (32) to get warpid

	//Step 1: Intra-warp scan on each warp

	T val = Scan_warp<scanProp>(ptr, idx);

	__syncthreads(); //synchronize with threads not belonging to same warp

	//Step 2: Collect per-warp partial sums

	if(lane == 31)	ptr[warpid] = ptr[idx];

	__syncthreads();

	//Step 3: Use first warp to inclusive scan per-warp results

	if(warpid == 0)	Scan_warp<scanProp>(ptr, idx);

	__syncthreads();

	//Step 4: Accumulate results from step 1 and 3

	if(warpid > 0)	val = scanProp::Apply(ptr[warpid-1], val);

	__syncthreads();

	//Step 5: Write and return the final result

	ptr[idx] = val;

	__syncthreads();

	return val;

}

//only does inclusive scans

template<class scanProp, class T>

__device__ T Scan_warp(volatile T *ptr, const unsigned int idx = threadIdx.x)

{

	const unsigned int lane = idx & 31; //index of thread in warp (0..31)

	if(lane >= 1)	ptr[idx] = scanProp::Apply(ptr[idx - 1],	ptr[idx]);

	if(lane >= 2)	ptr[idx] = scanProp::Apply(ptr[idx - 2],	ptr[idx]);

	if(lane >= 4)	ptr[idx] = scanProp::Apply(ptr[idx - 4],	ptr[idx]);

	if(lane >= 8)	ptr[idx] = scanProp::Apply(ptr[idx - 8],	ptr[idx]);

	if(lane >= 16)	ptr[idx] = scanProp::Apply(ptr[idx - 16],	ptr[idx]);

	return ptr[idx];

}

I can scan input arrays of 32 (== threadsPerBlock) elements, but not beyond that (> 32 elements), i.e. more than one block per grid.

I guess something is wrong with Do_scan_block but I cannot see what.

EDIT: I have a Fermi card, if it matters at all.

No time to look at your code in detail, but triple-check that you aren't accessing shared memory out of bounds; Fermi will report an unspecified launch failure if you do. Running your kernel through Ocelot's PTX emulator is another way to test this.