Incomprehensible behaviour: limitations on kernel calls from a host function?

Hi everyone,

Recently I tried to optimize my working code. Instead of calling:

void func(pType *p){
...
	pType *p1;
	cudaMalloc((void**)&p1, MN*sizeof(pType));

	for (int i = 0; i < JACOBI_ITERATIONS; i++)
	{
		jacobi_shared3<<<grid, block>>>(p1, p, rhs, dx, dy, alpha, beta, pitch);   // stores results to first argument
		cudaThreadSynchronize();
		cudaMemcpy(p, p1, sizeof(pType)*dx*dy, cudaMemcpyDeviceToDevice);
	}

	cudaFree(p1);
}

I tried something like this:

void func(pType *p){
...
	pType *p1;
	cudaMalloc((void**)&p1, MN*sizeof(pType));

	for (int i = 0; i < JACOBI_ITERATIONS/2; i++)
	{
		jacobi_shared3<<<grid, block>>>(p1, p, rhs, dx, dy, alpha, beta, pitch);   // stores results to first argument
		cudaThreadSynchronize();
		jacobi_shared3<<<grid, block>>>(p, p1, rhs, dx, dy, alpha, beta, pitch);
		cudaThreadSynchronize();
	}

	cudaFree(p1);
}

The problem is, I get cudaError=30 (checked with cudaGetLastError()) when calling the second cudaThreadSynchronize().

Why is that so? What am I doing wrong?

And will this change actually improve performance?

I have a GTX275 with CUDA 3.2 installed.

Any help will be appreciated.

(The func parameter p is declared at global scope and allocated virtually the same way as p1, with the same size.)

Hello,

Not knowing what the kernel does, I suspect that in the second call you read data from p1 that has not been updated. Have you tried copying p into p1 before the loop to ensure both buffers start out identical?

Hope this helps.

Regards!

Hi!

Check the error with cudaGetErrorString():

printf("Error: %s\n", cudaGetErrorString(cudaGetLastError()));

I tried to copy p to p1 before the loop, but it makes no difference.

If it may help, this is what the kernel does:

__global__
void jacobi_shared3(pType *p, pType *p_prev, rhsType *rhs, int dx, int dy, float alpha, float beta, size_t pitch)
{
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int gtidx = blockIdx.x * (BLOCK_SIZE-2);
	int gtidy = blockIdx.y * (BLOCK_SIZE-2);
	int x = tx + gtidx;
	int y = ty + gtidy;

	__shared__ pType u_sh[BLOCK_SIZE][BLOCK_SIZE];

	u_sh[tx][ty] = p_prev[y * dx + x];
	__syncthreads();

	if (tx < BLOCK_SIZE - 1 && tx > 0 && ty < BLOCK_SIZE - 1 && ty > 0 && y < dx-1 && x < dy-1)
	{
		rhsType rhstemp = rhs[y*dy+x];
		p[y * dx + x] = beta * (u_sh[tx + 1][ty] + u_sh[tx - 1][ty] + u_sh[tx][ty + 1] + u_sh[tx][ty - 1] + alpha * rhstemp);
	}
}

@insmvb00

As I wrote in the first post, it gives cudaError=30 ("unknown error"):

...   // first call checked in the same manner, with no error

jacobi_shared3<<<grid, block>>>(p, p1, rhs, dx, dy, alpha, beta, pitch);

printf("Error: %s\n", cudaGetErrorString(cudaGetLastError())); //no error reported here

cudaThreadSynchronize();

printf("Error: %s\n", cudaGetErrorString(cudaGetLastError())); //"unknown error"

...

Is the following code not out of array bounds?

u_sh[tx][ty] = p_prev[y * dx + x];

replace above code by

u_sh[tx][ty] = tx + ty ;

and check if error occurs.

@LSChien

Well, it's kind of embarrassing but I have to admit, it is out of bounds. I can't check this on the GPU right now, but it clearly is:

dx: 512
tx,ty: 0…15
BLOCK_SIZE: 16
blockIdx.x,blockIdx.y: 0…36

so taking it to the limits:
((36*14)+15)*512 + ((36*14)+15) = 266 247,
meanwhile:
512*512 = 262 144

I'll fix this and inform you about the results.
It's a bit weird that it didn't report any error with the first version…

Hello,

Probably the reason is that in the first version the out-of-bounds read landed in memory allocated for another variable, i.e. p1, whereas in the new version the read past the end of p1 'really' goes out of bounds and the kernel fails.

There is a tool named cuda-memcheck that you can use for that purpose.
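For reference, the basic invocation is just to prepend the tool to the application command (./myapp is a placeholder for your binary):

```
cuda-memcheck ./myapp
```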

Regards!

And an FYI - you only need those cudaThreadSynchronize calls when you are checking for errors. Otherwise, they just slow your program down.

On the new Fermi architecture you are supposedly able to run concurrent kernels (I have not tried it and don't know how to do it :), so the cudaThreadSynchronize calls may be needed for those graphics cards.

Regards!

No, you don’t need cudaThreadSynchronize on Fermi between kernels. Kernels in the same stream are always run one after the other. Kernels are only launched concurrently when you specify them in separate streams.

Thanks for your feedback! :)

@LSChien

Your suggestion worked. Thank you very much.
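For completeness, the usual fix here (my sketch, not something posted in the thread) is to guard the shared-memory load so edge blocks don't read past the end of p_prev; zero-filling is one possible policy for the out-of-range cells:

```cuda
// Sketch only: x, y, tx, ty, dx, dy, u_sh, p_prev as in jacobi_shared3 above.
if (x < dx && y < dy)
    u_sh[tx][ty] = p_prev[y * dx + x];   // in-range load, as before
else
    u_sh[tx][ty] = 0;                    // or clamp to the nearest valid cell
__syncthreads();
```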

@pQB

I tried the memcheck tool, but I didn't get reasonable results. I suppose it's because of the deprecated OpenGL interoperability I'm using, since the tool also returned errors for the NVIDIA samples I tested (fluidsGL, simpleGL).

@DrAnderson42

Your hint is extremely helpful, since those cudaThreadSynchronize calls were a serious drawback.
But is there any other reason one should call cudaThreadSynchronize besides error checking?

Yeah, the cuda-memcheck tool is often hit or miss. Usually the trouble I have with it is that it doesn't always report errors that I know are there (i.e. I run it on one machine and it reports the error in only 5/10 runs, and never reports an error for the same code running on another machine).

The only times you need cuda*Synchronize() (Thread, Event, Stream) are when you are purposefully running asynchronous events in separate streams. For a single stream of kernels with no async memcpy calls, no synchronization is ever needed by the programmer (cudaMemcpy(device->host or host->device) will automatically sync as necessary).

Examples of cases where you need an explicit sync:

You use cudaMemcpyAsync to copy data from device->host. Need to ensure that the kernel producing the data is complete via a cudaEventSynchronize before calling the memcpy.

You use cudaMemcpyAsync to copy data from the host->device. Need to ensure that any previously queued kernels that may have been reading from that data pointer are complete before calling the copy (cudaEventSynchronize is the best here, too).

You submit kernels in multiple streams. One of the kernels in stream 2 depends on the results output by a previous kernel in stream 1. Since CUDA assumes that streams are completely independent, you need to manually insert a sync to ensure that the data is available. I believe these sync events are just inserted into the queue and performed on the GPU, so the host doesn't actually sync (I think I remember Tim mentioning something like that for CUDA 4.0 - it's not a feature I've used yet).

Similar to the last point, but you submit kernels onto multiple GPUs. A kernel on GPU 1 needs peer to peer data access to data produced by a kernel on GPU 0. Need a sync to make sure that data is available.

Lastly, any time you use host-mapped memory (or peer-to-peer transfers where the GPU reads/writes host memory), you need to manually sync.

There may be other esoteric examples, but I think this covers the basics.
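The first case above (async device-to-host copy after a producing kernel) might look like the sketch below; the names kernel, done, copyStream, d_out, h_out, and bytes are illustrative, and the event and stream are assumed to have been created with cudaEventCreate/cudaStreamCreate beforehand:

```cuda
kernel<<<grid, block>>>(d_out, d_in);       // produces d_out
cudaEventRecord(done, 0);                   // mark the point after the kernel
cudaEventSynchronize(done);                 // host blocks until the kernel is finished
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, copyStream);
cudaStreamSynchronize(copyStream);          // wait for the copy itself before reading h_out
```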