Error in Texture

I got this error:

Cuda error: …Unbinding Texture… : in file <OnTheFly.cu>, line 201 : the launch timed out and was terminated.
Line 201: UnbindTex1D_rmetFP(texX);

Cuda error: …Freeing GPU memory… : in file <OnTheFly.cu>, line 205 : the launch timed out and was terminated.
Line 205: Free_GPU_memory(d_NN_X);

Does anyone know what the problem could be?
Amir

Google for "watchdog" in nVidia's forum… The kernel has taken too much time to complete and was terminated by the watchdog. This is not related to the texture unbinding itself but to a previous error.

Make sure you check the CUDA error code right after your kernel invocation.
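For example, something like this right after the launch (myKernel and its arguments are just placeholders):

myKernel<<< dimGrid, dimBlock >>>( d_data );
cudaError_t err = cudaGetLastError();     // catches launch / configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();            // waits for the kernel; this is where the watchdog timeout shows up
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));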

eyal

OK, now I am getting this:

cutilCheckMsg cudaThreadSynchronize error: Kernel execution failed in file <OnTheFly.cu>, line 193 : the launch timed out and was terminated.

What can the problem be?

Thanks

Amir

The problem is just what the error states :)

The kernel ran for too long and was therefore terminated.

There is a watchdog mechanism that prevents the display card (GPU) from running too long and freezing the entire system.

On Windows it's 5 seconds, so after 5 seconds the kernel will be terminated.

The reason is either that you have a deadlock in your kernel, or that your kernel is simply doing a lot of calculations, which takes too much time.

Deadlock - fix it.

Too much work - break the kernel into multiple kernels so that each one runs for less time. Or use Linux without an X server :)
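A very rough sketch of the "multiple kernels" idea - the names (myKernel, CHUNK, d_out, N) are just placeholders, the point is that each launch finishes well under the watchdog limit:

// process the outputs in chunks, one launch per chunk
for (int offset = 0; offset < N; offset += CHUNK) {
    myKernel<<< CHUNK / THREADS, THREADS >>>( d_out, offset );   // this launch only handles [offset, offset + CHUNK)
    cudaThreadSynchronize();                                     // wait (and check errors) between launches
}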

eyal

Interesting!!!

But how can you explain that sometimes the kernel runs successfully for about 9 s and sometimes it doesn't?

Thanks

Amir

What OS are you using?

With the code and grid configuration it might be easier to say…

eyal

Linux x86_64

Card: Quadro FX 1700

NN=131072

THREADS=128

int blocks = (NN + THREADS - 1) / THREADS;

dim3 dimBlock(THREADS, 1, 1);

dim3 dimGrid(blocks, 1, 1);

reduce0<<< dimGrid, dimBlock >>> (d_C);

__global__ void reduce0(float *g_odata) {

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    float rX, rY, rZ, distSqr;
    float basisX = tex1Dfetch(texX, i);
    float basisY = tex1Dfetch(texY, i);
    float basisZ = tex1Dfetch(texZ, i);
    float sum = 0;

    for (long int j = 0; j < NN; j++) {
        rX = basisX - tex1Dfetch(texX, j);
        rY = basisY - tex1Dfetch(texY, j);
        rZ = basisZ - tex1Dfetch(texZ, j);
        distSqr = rX * rX + rY * rY + rZ * rZ;
        sum += K * sqrt(distSqr) * tex1Dfetch(texVecB, j);
    }

    g_odata[i] = sum;
}

Thanks

Amir

What if you reduce NN to 1000, for example? Will it always run? Are you sure that when it runs for 9 s the kernel ends without errors?

Maybe it's something to do with textures… In any case, I think you should be able to test something like this, as the texture usage here doesn't seem too reasonable to me:

__global__ void reduce0(float *pInputX, float *pInputY, float *pInputZ, float *g_odata) {

    __shared__ float smX[ THREADS ];
    __shared__ float smY[ THREADS ];
    __shared__ float smZ[ THREADS ];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    float rX, rY, rZ, distSqr;
    float basisX = pInputX[ i ];
    float basisY = pInputY[ i ];
    float basisZ = pInputZ[ i ];
    float sum = 0;

    // NTHREADS == blockDim.x (THREADS in your launch code); assumes NN % NTHREADS == 0
    for (long int j = 0; j < NN / NTHREADS; j++) {
        smX[ threadIdx.x ] = pInputX[ j * NTHREADS + threadIdx.x ];
        smY[ threadIdx.x ] = pInputY[ j * NTHREADS + threadIdx.x ];
        smZ[ threadIdx.x ] = pInputZ[ j * NTHREADS + threadIdx.x ];
        __syncthreads();

        for (int k = 0; k < NTHREADS; k++) {
            rX = basisX - smX[ k ];
            rY = basisY - smY[ k ];
            rZ = basisZ - smZ[ k ];
            distSqr = rX * rX + rY * rY + rZ * rZ;
            sum += K * sqrt(distSqr) * tex1Dfetch(texVecB, j * NTHREADS + k);   // same with shared mem for texVecB
        }
        __syncthreads();
    }

    g_odata[ i ] = sum;
}

Hope that helps a bit…

Wonderful !!!

Now, I think that I understand the difference between Texture and Shared mem.

The shared mem helps to get more parallelism.

Your algorithm runs about 1.5x faster than mine.

It looks like this:

__global__ void reduce2(float *pInputX, float *pInputY, float *pInputZ, float *InputVecB, float *g_odata) {

    __shared__ float smX[ THREADS ];
    __shared__ float smY[ THREADS ];
    __shared__ float smZ[ THREADS ];
    __shared__ float vecB[ THREADS ];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    float rX, rY, rZ, distSqr;
    float basisX = pInputX[ i ];
    float basisY = pInputY[ i ];
    float basisZ = pInputZ[ i ];
    float sum = 0;

    for (long int j = 0; j < NN / THREADS; j++) {
        unsigned int j_THREADS = j * THREADS;

        smX[ threadIdx.x ]  = pInputX[ j_THREADS + threadIdx.x ];
        smY[ threadIdx.x ]  = pInputY[ j_THREADS + threadIdx.x ];
        smZ[ threadIdx.x ]  = pInputZ[ j_THREADS + threadIdx.x ];
        vecB[ threadIdx.x ] = InputVecB[ j_THREADS + threadIdx.x ];
        __syncthreads();

        for (int k = 0; k < THREADS; k++) {
            rX = basisX - smX[ k ];
            rY = basisY - smY[ k ];
            rZ = basisZ - smZ[ k ];
            distSqr = rX * rX + rY * rY + rZ * rZ;
            sum += K * sqrt(distSqr) * vecB[ k ];
        }
        __syncthreads();
    }

    g_odata[ i ] = sum;
}

BTW - the original problem wasn't solved (the launch timed out and was terminated).

However, I need to get a new computer with a GTX 295 plus a simple display card, so the GTX 295 can work on CUDA alone; I hope that will solve this problem.

Do you have any other ideas for getting more performance?

Thanks (again)

Amir

Amir,

Mind you, that was pseudo code :) You should verify that it indeed works correctly. For example, if NN % THREADS != 0 the code will not work.
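If you ever do need that case, one way (just a sketch, not tested) is to guard the tile loads and zero-pad the weight array, so out-of-range elements contribute nothing to the sum:

int idx = j_THREADS + threadIdx.x;
smX[ threadIdx.x ]  = (idx < NN) ? pInputX[ idx ]   : 0.0f;
smY[ threadIdx.x ]  = (idx < NN) ? pInputY[ idx ]   : 0.0f;
smZ[ threadIdx.x ]  = (idx < NN) ? pInputZ[ idx ]   : 0.0f;
vecB[ threadIdx.x ] = (idx < NN) ? InputVecB[ idx ] : 0.0f;   // zero weight => that k iteration adds 0
__syncthreads();
// ...and run the outer loop (NN + THREADS - 1) / THREADS times, and guard the
// pInputX[ i ]/pInputY[ i ]/pInputZ[ i ] reads and the final g_odata[ i ] write for i >= NN.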

The timeout still bothers me - it might be because of faulty code. Are you sure it's not because of a deadlock?

As for further performance - can you post the register/smem usage for the kernel? Did you run it via the profiler?

What if you change the k loop to run only NTHREADS / 2 iterations - how does this affect the kernel's run time?

eyal

Another thing that's worth checking is whether there is a relationship between your input data arrays. For example, if pInputB = pInputA * pInputA (or some other relation), it might be worth simply re-calculating pInputB rather than reading it from memory.

eyal

It indeed works well - I checked against the CPU. For the moment I don't care about NN % THREADS != 0 (maybe later).

My system guy needs to check the timeout. Maybe it happens because the card serves both the graphics and CUDA. As I said, I need to get a new computer with two cards so we can check this issue on that system.

How can I get the register/smem usage for the kernel? I will be happy to post this information for you.

I did not run it via the profiler. What can I get from the profiler and how do I run it?

If I change the k loop to run only NTHREADS / 2, I will get about half the time, but of course the output will be incorrect. What is the idea behind the NTHREADS / 2 run?

There is no relationship between my input data arrays. However, as you can see, when we calculate the distances between the points we compute both (point1-point2)^2 and (point2-point1)^2. So if we build the matrix from all the rows that we create, we get, for example:

0 A B C

A 0 D E

B D 0 F

C E F 0

Maybe we can use it.

What do you think?

Thanks a lot for your help,

Amir

Add --ptxas-options="-v -mem" to your .cu compilation line (in release mode) and you should see something like this:

1>ptxas info : Compiling entry function '_Z15CalculatePhaseAILj0EL19ECalculatedDataType1EEvjPfS1_PbS1_S1_S1_S1_S1_'

1>ptxas info : Used 15 registers, 2788+16 bytes smem, 80 bytes cmem[0], 24 bytes cmem[1]

This will tell you how many resources you're using in each of your kernels - you can then put this info into the occupancy calculator to get an idea of your kernel's occupancy.
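For reference, the whole compile line might look something like this (the other flags are only an example, adjust them for your setup):

nvcc -O2 -arch=sm_11 --ptxas-options=-v -c OnTheFly.cu -o OnTheFly.o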

Well, that is a quick test. If you cut k in half and you get half the time, it means that your kernel is probably compute bound and not bandwidth bound. The k loop already has all the data in shared memory, so it doesn't access global memory at all - it just does computations.

You might want to try 256 threads per block, and try replacing sqrt with the intrinsic version (look for "intrinsics" in the programming manual). Also make sure that sqrt is not the double-precision version (if it is, try sqrtf).
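For example, the inner-loop line would simply become (and nvcc's -use_fast_math switch makes the math functions use their faster, less accurate versions - see the programming manual):

sum += K * sqrtf(distSqr) * vecB[ k ];   // sqrtf = single precision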

Well, you can probably cut the calculations (read: kernel time) in half by only calculating what's above the diagonal of zeros and then mirroring the results.

So just calculate:

0 A B C
  0 D E
    0 F
      0

and then run a mirroring kernel that just copies the data (instead of re-calculating it). What do you think?

eyal

Hi eyal,

Do you know if a single program can use all 480 processor cores of the GTX 295?

Thanks

Amir

The two halves of the GTX295 are considered as two independent GPUs. You can't share data between them (maybe only via the new memory features from 2.2 onwards).

You can either use only the second GPU (presumably the watchdog is not relevant for it), or you can halve the work and give each GPU half of it.

Each GPU will require a CPU thread - you can either look at the simpleMultiGPU sample in the SDK

or google for MisterAndreson’s GPUWorker solution here in the newsgroups.

Check out the deviceQuery SDK sample as well to see how the GTX295 is seen by CUDA.
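The basic pattern is just one CUDA device per host thread - something like this (sketch; cudaSetDevice must be called before any other CUDA call in that thread):

// host thread A:
cudaSetDevice(0);    // first half of the GTX 295
// ...allocate, copy, launch on device 0...

// host thread B:
cudaSetDevice(1);    // second half of the GTX 295
// ...allocate, copy, launch on device 1...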

eyal

So as I see it, for "our" algorithm it is preferable to use a GTX 295 rather than a GTX 285. I think we can split the problem between the GPUs. Am I right?

Thanks a lots,

Amir

Why do you think this? Because of the freezes? First, on Linux without an X server you wouldn't have any problem with that. In a "GUI" environment you might be able to change the code so it runs for less time (by breaking the kernels into smaller chunks) and thus not "upset" the watchdog.

Yes, I guess you can.

Instead of this, for example:

for (long int j = 0; j < NN/THREADS; j++)

you might want to run NN/THREADS/2 iterations on the first GPU and the other half on the second GPU, and then merge the results on the CPU. Of course you'll have to play with the code to see what's best for your performance…
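Roughly like this (jStart/jEnd are new parameters, not in your code; it assumes both are multiples of THREADS and that each GPU has its own copy of the inputs):

__global__ void reduce2_part(float *pInputX, float *pInputY, float *pInputZ,
                             float *InputVecB, float *g_odata,
                             int jStart, int jEnd) {
    __shared__ float smX[ THREADS ];
    __shared__ float smY[ THREADS ];
    __shared__ float smZ[ THREADS ];
    __shared__ float smB[ THREADS ];

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    float basisX = pInputX[ i ];
    float basisY = pInputY[ i ];
    float basisZ = pInputZ[ i ];
    float sum = 0.0f;

    for (int j = jStart; j < jEnd; j += THREADS) {   // only this GPU's share of the points
        smX[ threadIdx.x ] = pInputX[ j + threadIdx.x ];
        smY[ threadIdx.x ] = pInputY[ j + threadIdx.x ];
        smZ[ threadIdx.x ] = pInputZ[ j + threadIdx.x ];
        smB[ threadIdx.x ] = InputVecB[ j + threadIdx.x ];
        __syncthreads();

        for (int k = 0; k < THREADS; k++) {
            float rX = basisX - smX[ k ];
            float rY = basisY - smY[ k ];
            float rZ = basisZ - smZ[ k ];
            sum += K * sqrtf(rX * rX + rY * rY + rZ * rZ) * smB[ k ];
        }
        __syncthreads();
    }

    g_odata[ i ] = sum;   // partial sum over [jStart, jEnd)
}

// GPU 0 (in its own CPU thread): reduce2_part<<< dimGrid, dimBlock >>>(..., 0, NN/2);
// GPU 1 (in its own CPU thread): reduce2_part<<< dimGrid, dimBlock >>>(..., NN/2, NN);
// then on the host: result[i] = out_gpu0[i] + out_gpu1[i];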

eyal

Not because of the freezes - I think we can get x2.

Now I want to finish the algorithm before trying to get more performance (by checking registers, thread counts…).

I need to work with complex numbers.

Instead of:

sum+=K*sqrt(distSqr)*vecB[ k ];

I need to write something like this:

sum += exp(j*K*sqrt(distSqr)) * vecB[ k ];

exp(j*K*sqrt(distSqr)) = cos(K*sqrt(distSqr)) + j*sin(K*sqrt(distSqr))

where j is the imaginary unit and vecB holds complex numbers.
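Maybe something like this inside the k loop (just a sketch - I would split vecB and the output into real and imaginary arrays; sincosf gives sin and cos in one call):

// declared before the j loop, instead of the single float sum:
float sumRe = 0.0f, sumIm = 0.0f;

// inside the k loop, instead of sum += K*sqrt(distSqr)*vecB[k]:
float s, c;
sincosf(K * sqrtf(distSqr), &s, &c);           // exp(j*K*r) = c + j*s
float bRe = vecBRe[ k ], bIm = vecBIm[ k ];    // vecB split into two real shared arrays
sumRe += c * bRe - s * bIm;                    // real part of exp(j*K*r) * vecB[k]
sumIm += c * bIm + s * bRe;                    // imaginary part

// and at the end:
g_odataRe[ i ] = sumRe;
g_odataIm[ i ] = sumIm;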

Amir