Incomprehensible behaviour: limitations on kernel calls from a host function?

Hi everyone,

Recently I tried to optimize my working code. Instead of calling:

void func(pType *p){
...
	pType *p1;
	cudaMalloc((void**)&p1, MN*sizeof(pType));

	for (int i = 0; i < JACOBI_ITERATIONS; i++)
	{
		jacobi_shared3<<<grid, block>>>(p1, p, rhs, dx, dy, alpha, beta, pitch);   // stores results to first argument
		cudaThreadSynchronize();
		cudaMemcpy(p, p1, sizeof(pType)*dx*dy, cudaMemcpyDeviceToDevice);
	}

	cudaFree(p1);
}

I tried something like this:

void func(pType *p){
...
	pType *p1;
	cudaMalloc((void**)&p1, MN*sizeof(pType));

	for (int i = 0; i < JACOBI_ITERATIONS/2; i++)
	{
		jacobi_shared3<<<grid, block>>>(p1, p, rhs, dx, dy, alpha, beta, pitch);   // stores results to first argument
		cudaThreadSynchronize();
		jacobi_shared3<<<grid, block>>>(p, p1, rhs, dx, dy, alpha, beta, pitch);
		cudaThreadSynchronize();
	}

	cudaFree(p1);
}

The problem is, I get cudaError=30 (checked with cudaGetLastError()) when calling the second cudaThreadSynchronize().

Why is that so? What am I doing wrong?

And will this change actually improve performance?

I have a GTX275 with CUDA 3.2 installed.

Any help will be appreciated.

(The func parameter p is declared at global scope and allocated virtually the same way as p1, with the same size.)

Hello,

Not knowing what the kernel does, I suspect that in the second call you read data from p1 that has not been updated. Have you tried copying p into p1 before the loop to ensure both buffers start out identical?

Hope this helps.

Regards!

Hi!

Check the error with cudaGetErrorString():

printf("Error: %s\n", cudaGetErrorString(cudaGetLastError()));

I tried to copy p to p1 before the loop, but it makes no difference.

If it may help, this is what the kernel does:

__global__
void jacobi_shared3(pType *p, pType *p_prev, rhsType *rhs, int dx, int dy, float alpha, float beta, size_t pitch)
{
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int gtidx = blockIdx.x * (BLOCK_SIZE-2);
	int gtidy = blockIdx.y * (BLOCK_SIZE-2);
	int x = tx + gtidx;
	int y = ty + gtidy;

	__shared__ pType u_sh[BLOCK_SIZE][BLOCK_SIZE];

	u_sh[tx][ty] = p_prev[y * dx + x];
	__syncthreads();

	if (tx < BLOCK_SIZE - 1 && tx > 0 && ty < BLOCK_SIZE - 1 && ty > 0 && y < dx-1 && x < dy-1)
	{
		rhsType rhstemp = rhs[y*dy+x];
		p[y * dx + x] = beta * (u_sh[tx + 1][ty] + u_sh[tx - 1][ty] + u_sh[tx][ty + 1] + u_sh[tx][ty - 1] + alpha * rhstemp);
	}
}

@insmvb00

As I wrote in the first post, it gives cudaError=30 ("unknown error"):

...   // first call checked in the same manner, with no error

jacobi_shared3<<<grid, block>>>(p, p1, rhs, dx, dy, alpha, beta, pitch);

printf("Error: %s\n", cudaGetErrorString(cudaGetLastError())); //no error reported here

cudaThreadSynchronize();

printf("Error: %s\n", cudaGetErrorString(cudaGetLastError())); //"unknown error"

...

Is the following code not out of array bounds?

u_sh[tx][ty] = p_prev[y * dx + x];

replace above code by

u_sh[tx][ty] = tx + ty ;

and check if error occurs.

@LSChien

Well, it's kind of embarrassing but I have to admit, it is out of bounds. I can't check this on the GPU right now, but it clearly is:

dx: 512
tx,ty: 0…15
BLOCK_SIZE: 16
blockIdx.x,blockIdx.y: 0…36

so taking it to the limits:
((36*14)+15)*512 + ((36*14)+15) = 266 247,
meanwhile:
512*512 = 262 144

I'll fix this and inform you about the results.
It's a bit weird that it didn't report any error with the first version…

Hello,

Probably the reason is that in the first version the out-of-bounds read landed in memory allocated for another variable, i.e. p1, whereas in the new version the read past the end of p1 'really' goes out of bounds and the kernel fails.

There is a tool named cuda-memcheck that you can use for that purpose.
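For reference, the basic invocation is just to prepend the tool to the application command (./myapp is a placeholder for your binary):

```
cuda-memcheck ./myapp
```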

Regards!

And an FYI - you only need those cudaThreadSynchronize calls when you are checking for errors. Otherwise, they just slow your program down.

On the new Fermi architecture you are supposedly able to run concurrent kernels (I have not tried it and don't know how to do it :), so the cudaThreadSynchronize calls may be needed for those graphics cards.

Regards!

No, you don’t need cudaThreadSynchronize on Fermi between kernels. Kernels in the same stream are always run one after the other. Kernels are only launched concurrently when you specify them in separate streams.

Thanks for your feedback! :)

@LSChien

Your suggestion worked. Thank you very much.
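For completeness, the usual fix here (my sketch, not something posted in the thread) is to guard the shared-memory load so edge blocks don't read past the end of p_prev; zero-filling is one possible policy for the out-of-range cells:

```cuda
// Sketch only: x, y, tx, ty, dx, dy, u_sh, p_prev as in jacobi_shared3 above.
if (x < dx && y < dy)
    u_sh[tx][ty] = p_prev[y * dx + x];   // in-range load, as before
else
    u_sh[tx][ty] = 0;                    // or clamp to the nearest valid cell
__syncthreads();
```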

@pQB

I tried the memcheck tool, but I didn't get reasonable results. I suppose it's because of the deprecated OpenGL interoperability I'm using, since the tool also returned errors for the NVIDIA samples I tested (fluidsGL, simpleGL).

@DrAnderson42

Your hint is extremely helpful, since those cudaThreadSynchronize calls were a serious drawback.
But is there any other reason one should call cudaThreadSynchronize besides error checking?

Yeah, the cuda-memcheck tool is often hit or miss. Usually the trouble I have with it is that it doesn't always report errors that I know are there (i.e. I run it on one machine and it reports the error in only 5/10 runs, and never reports an error for the same code running on another machine).

The only times you need cuda*Synchronize() (Thread, Event, Stream) are when you are purposefully running asynchronous events in separate streams. For a single stream of kernels with no async memcpy calls, no synchronization is ever needed by the programmer (cudaMemcpy(device->host or host->device) will automatically sync as necessary).

Examples of cases where you need an explicit sync:

You use cudaMemcpyAsync to copy data from device->host. Need to ensure that the kernel producing the data is complete via a cudaEventSynchronize before calling the memcpy.

You use cudaMemcpyAsync to copy data from the host->device. Need to ensure that any previously queued kernels that may have been reading from that data pointer are complete before calling the copy (cudaEventSynchronize is the best here, too).

You submit kernels in multiple streams. One of the kernels in stream 2 depends on the results output by a previous kernel in stream 1. Since CUDA assumes that streams are completely independent, you need to manually insert a sync to ensure that the data is available. I believe these sync events are just inserted into the queue and performed on the GPU, so the host doesn't actually sync (I think I remember Tim mentioning something like that for CUDA 4.0 - it's not a feature I've used yet).

Similar to the last point, but you submit kernels onto multiple GPUs. A kernel on GPU 1 needs peer to peer data access to data produced by a kernel on GPU 0. Need a sync to make sure that data is available.

Lastly, any time you use host-mapped memory (or peer-to-peer transfers where the GPU reads/writes host memory), you need to manually sync.

There may be other esoteric examples, but I think this covers the basics.
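The first case above (async device-to-host copy after a producing kernel) might look like the sketch below; the names kernel, done, copyStream, d_out, h_out, and bytes are illustrative, and the event and stream are assumed to have been created with cudaEventCreate/cudaStreamCreate beforehand:

```cuda
kernel<<<grid, block>>>(d_out, d_in);       // produces d_out
cudaEventRecord(done, 0);                   // mark the point after the kernel
cudaEventSynchronize(done);                 // host blocks until the kernel is finished
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, copyStream);
cudaStreamSynchronize(copyStream);          // wait for the copy itself before reading h_out
```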