Error in Texture

I got this error:

Cuda error: …Unbinding Texture… : in file <OnTheFly.cu>, line 201 : the launch timed out and was terminated.
Line 201: UnbindTex1D_rmetFP(texX);

Cuda error: …Freeing GPU memory… : in file <OnTheFly.cu>, line 205 : the launch timed out and was terminated.
Line 205: Free_GPU_memory(d_NN_X);

Does anyone know what the problem could be?
Amir

Google for "watchdog" in nVidia's forum… The kernel has taken too much time to complete and was terminated by the watchdog. This is not related to the texture unbinding itself but to a previous error.

Make sure you check the CUDA error code right after your kernel invocation.
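For example, something like this right after the launch (myKernel and its arguments are just placeholders):

myKernel<<< dimGrid, dimBlock >>>( d_data );
cudaError_t err = cudaGetLastError();     // catches launch / configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();            // waits for the kernel; this is where the watchdog timeout shows up
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));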

eyal

OK, now I am getting this:

cutilCheckMsg cudaThreadSynchronize error: Kernel execution failed in file <OnTheFly.cu>, line 193 : the launch timed out and was terminated.

What can the problem be?

Thanks

Amir

The problem is just what the error states :)

The kernel ran for too long and was therefore terminated.

There is a watchdog mechanism that prevents the display card (GPU) from running too long and freezing the entire system.

On Windows it's 5 seconds, so after 5 seconds the kernel will be terminated.

The reason is either that you have a deadlock in your kernel, or that your kernel is simply doing a lot of calculations, which takes too much time.

Deadlock - fix it.

Too much work - break the kernel into multiple kernels so that each one runs for less time. Or use Linux without an X server :)
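A very rough sketch of the "multiple kernels" idea - the names (myKernel, CHUNK, d_out, N) are just placeholders, the point is that each launch finishes well under the watchdog limit:

// process the outputs in chunks, one launch per chunk
for (int offset = 0; offset < N; offset += CHUNK) {
    myKernel<<< CHUNK / THREADS, THREADS >>>( d_out, offset );   // this launch only handles [offset, offset + CHUNK)
    cudaThreadSynchronize();                                     // wait (and check errors) between launches
}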

eyal

Interesting!!!

But how can you explain that sometimes the kernel runs successfully for about 9 s and sometimes it doesn't?

Thanks

Amir

What OS are you using?

With the code and grid configuration it might be easier to say…

eyal

Linux x86_64

Card: Quadro FX 1700

NN=131072

THREADS=128

int blocks = (NN + THREADS - 1) / THREADS;

dim3 dimBlock(THREADS, 1, 1);

dim3 dimGrid(blocks, 1, 1);

reduce0<<< dimGrid, dimBlock >>> (d_C);

__global__ void reduce0(float *g_odata) {

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    float rX, rY, rZ, distSqr;
    float basisX = tex1Dfetch(texX, i);
    float basisY = tex1Dfetch(texY, i);
    float basisZ = tex1Dfetch(texZ, i);
    float sum = 0;

    for (long int j = 0; j < NN; j++) {
        rX = basisX - tex1Dfetch(texX, j);
        rY = basisY - tex1Dfetch(texY, j);
        rZ = basisZ - tex1Dfetch(texZ, j);
        distSqr = rX * rX + rY * rY + rZ * rZ;
        sum += K * sqrt(distSqr) * tex1Dfetch(texVecB, j);
    }

    g_odata[i] = sum;
}

Thanks

Amir

What if you reduce NN to 1000, for example? Will it always run? Are you sure that when it runs for 9 s the kernel ends without errors?

Maybe it's something to do with textures… In any case, I think you should be able to test something like this, as the texture usage here doesn't seem too reasonable to me:

__global__ void reduce0(float *pInputX, float *pInputY, float *pInputZ, float *g_odata) {

    __shared__ float smX[ THREADS ];
    __shared__ float smY[ THREADS ];
    __shared__ float smZ[ THREADS ];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    float rX, rY, rZ, distSqr;
    float basisX = pInputX[ i ];
    float basisY = pInputY[ i ];
    float basisZ = pInputZ[ i ];
    float sum = 0;

    // NTHREADS == blockDim.x (THREADS in your launch code); assumes NN % NTHREADS == 0
    for (long int j = 0; j < NN / NTHREADS; j++) {
        smX[ threadIdx.x ] = pInputX[ j * NTHREADS + threadIdx.x ];
        smY[ threadIdx.x ] = pInputY[ j * NTHREADS + threadIdx.x ];
        smZ[ threadIdx.x ] = pInputZ[ j * NTHREADS + threadIdx.x ];
        __syncthreads();

        for (int k = 0; k < NTHREADS; k++) {
            rX = basisX - smX[ k ];
            rY = basisY - smY[ k ];
            rZ = basisZ - smZ[ k ];
            distSqr = rX * rX + rY * rY + rZ * rZ;
            sum += K * sqrt(distSqr) * tex1Dfetch(texVecB, j * NTHREADS + k);   // same with shared mem for texVecB
        }
        __syncthreads();
    }

    g_odata[ i ] = sum;
}

Hope that helps a bit…

Wonderful !!!

Now, I think that I understand the difference between Texture and Shared mem.

The shared mem helps to get more parallelism.

Your algorithm runs about 1.5x faster than mine.

It looks like this:

__global__ void reduce2(float *pInputX, float *pInputY, float *pInputZ, float *InputVecB, float *g_odata) {

    __shared__ float smX[ THREADS ];
    __shared__ float smY[ THREADS ];
    __shared__ float smZ[ THREADS ];
    __shared__ float vecB[ THREADS ];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    float rX, rY, rZ, distSqr;
    float basisX = pInputX[ i ];
    float basisY = pInputY[ i ];
    float basisZ = pInputZ[ i ];
    float sum = 0;

    for (long int j = 0; j < NN / THREADS; j++) {
        unsigned int j_THREADS = j * THREADS;

        smX[ threadIdx.x ]  = pInputX[ j_THREADS + threadIdx.x ];
        smY[ threadIdx.x ]  = pInputY[ j_THREADS + threadIdx.x ];
        smZ[ threadIdx.x ]  = pInputZ[ j_THREADS + threadIdx.x ];
        vecB[ threadIdx.x ] = InputVecB[ j_THREADS + threadIdx.x ];
        __syncthreads();

        for (int k = 0; k < THREADS; k++) {
            rX = basisX - smX[ k ];
            rY = basisY - smY[ k ];
            rZ = basisZ - smZ[ k ];
            distSqr = rX * rX + rY * rY + rZ * rZ;
            sum += K * sqrt(distSqr) * vecB[ k ];
        }
        __syncthreads();
    }

    g_odata[ i ] = sum;
}

BTW - the original problem wasn't solved (the launch timed out and was terminated).

However, I need to get a new computer with a GTX 295 plus a simple display card, so the GTX 295 can work on CUDA alone; I hope that will solve this problem.

Do you have any other ideas for getting more performance?

Thanks (again)

Amir

Amir,

Mind you, that was pseudo code :) You should verify that it indeed works correctly. For example, if NN % THREADS != 0 the code will not work.
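If you ever do need that case, one way (just a sketch, not tested) is to guard the tile loads and zero-pad the weight array, so out-of-range elements contribute nothing to the sum:

int idx = j_THREADS + threadIdx.x;
smX[ threadIdx.x ]  = (idx < NN) ? pInputX[ idx ]   : 0.0f;
smY[ threadIdx.x ]  = (idx < NN) ? pInputY[ idx ]   : 0.0f;
smZ[ threadIdx.x ]  = (idx < NN) ? pInputZ[ idx ]   : 0.0f;
vecB[ threadIdx.x ] = (idx < NN) ? InputVecB[ idx ] : 0.0f;   // zero weight => that k iteration adds 0
__syncthreads();
// ...and run the outer loop (NN + THREADS - 1) / THREADS times, and guard the
// pInputX[ i ]/pInputY[ i ]/pInputZ[ i ] reads and the final g_odata[ i ] write for i >= NN.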

The timeout still bothers me - it might be because of faulty code. Are you sure it's not because of a deadlock?

As for further performance - can you post the register/smem usage for the kernel? Did you run it via the profiler?

What if you change the k loop to run only NTHREADS / 2 iterations - how does this affect the kernel's run time?

eyal

Another thing that's worth checking is whether there is a relationship between your input data arrays. For example, if pInputB = pInputA * pInputA (or some other relation), it might be worth simply re-calculating pInputB rather than reading it from memory.

eyal

It indeed works well - I checked against the CPU. For the moment I don't care about NN % THREADS != 0 (maybe later).

My system guy needs to check the timeout. Maybe it happens because the card serves both the graphics and CUDA. As I said, I need to get a new computer with two cards so we can check this issue on that system.

How can I get the register/smem usage for the kernel? I will be happy to post this information for you.

I did not run it via the profiler. What can I get from the profiler and how do I run it?

If I change the k loop to run only NTHREADS / 2, I will get about half the time, but of course the output will be incorrect. What is the idea behind the NTHREADS / 2 run?

There is no relationship between my input data arrays. However, as you can see, when we calculate the distances between the points we compute both (point1-point2)^2 and (point2-point1)^2. So if we build the matrix from all the rows that we create, we get, for example:

0 A B C

A 0 D E

B D 0 F

C E F 0

Maybe we can use it.

What do you think?

Thanks a lot for your help,

Amir

Add --ptxas-options="-v -mem" to your .cu compilation line (in release mode) and you should see something like this:

1>ptxas info : Compiling entry function '_Z15CalculatePhaseAILj0EL19ECalculatedDataType1EEvjPfS1_PbS1_S1_S1_S1_S1_'

1>ptxas info : Used 15 registers, 2788+16 bytes smem, 80 bytes cmem[0], 24 bytes cmem[1]

This will tell you how many resources you're using in each of your kernels - you can then put this info into the occupancy calculator to get an idea of your kernel's occupancy.
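For reference, the whole compile line might look something like this (the other flags are only an example, adjust them for your setup):

nvcc -O2 -arch=sm_11 --ptxas-options=-v -c OnTheFly.cu -o OnTheFly.o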

Well, that is a quick test. If you cut k in half and you get half the time, it means that your kernel is probably compute bound and not bandwidth bound. The k loop already has all the data in shared memory, so it doesn't access global memory at all - it just does computations.

You might want to try 256 threads per block, and try replacing sqrt with the intrinsic version (look for "intrinsics" in the programming manual). Also make sure that sqrt is not the double-precision version (if it is, try sqrtf).
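For example, the inner-loop line would simply become (and nvcc's -use_fast_math switch makes the math functions use their faster, less accurate versions - see the programming manual):

sum += K * sqrtf(distSqr) * vecB[ k ];   // sqrtf = single precision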

Well, you can probably cut the calculations (read: kernel time) in half by only calculating what's above the diagonal of zeros and then mirroring the results.

So just calculate:

0 A B C
  0 D E
    0 F
      0

and then run a mirroring kernel that just copies the data (instead of re-calculating it). What do you think?

eyal

Hi eyal,

Do you know if a single program can use all 480 processor cores of the GTX 295?

Thanks

Amir

The two halves of the GTX295 are considered as two independent GPUs. You can't share data between them (maybe only via the new memory features from 2.2 onwards).

You can either use only the second GPU (presumably the watchdog is not relevant for it), or you can halve the work and give each GPU half of it.

Each GPU will require a CPU thread - you can either look at the simpleMultiGPU sample in the SDK

or google for MisterAndreson’s GPUWorker solution here in the newsgroups.

Check out the deviceQuery SDK sample as well to see how the GTX295 is seen by CUDA.
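The basic pattern is just one CUDA device per host thread - something like this (sketch; cudaSetDevice must be called before any other CUDA call in that thread):

// host thread A:
cudaSetDevice(0);    // first half of the GTX 295
// ...allocate, copy, launch on device 0...

// host thread B:
cudaSetDevice(1);    // second half of the GTX 295
// ...allocate, copy, launch on device 1...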

eyal

So as I see it, for "our" algorithm it is preferable to use a GTX 295 rather than a GTX 285. I think we can split the problem between the GPUs. Am I right?

Thanks a lots,

Amir

Why do you think this? Because of the freezes? First, on Linux without an X server you wouldn't have any problem with that. In a "GUI" environment you might be able to change the code so it runs for less time (by breaking the kernels into smaller chunks) and thus not "upset" the watchdog.

Yes, I guess you can.

Instead of this, for example:

for (long int j = 0; j < NN/THREADS; j++)

you might want to run NN/THREADS/2 iterations on the first GPU and the other half on the second GPU, and then merge the results on the CPU. Of course you'll have to play with the code to see what's best for your performance…
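Roughly like this (jStart/jEnd are new parameters, not in your code; it assumes both are multiples of THREADS and that each GPU has its own copy of the inputs):

__global__ void reduce2_part(float *pInputX, float *pInputY, float *pInputZ,
                             float *InputVecB, float *g_odata,
                             int jStart, int jEnd) {
    __shared__ float smX[ THREADS ];
    __shared__ float smY[ THREADS ];
    __shared__ float smZ[ THREADS ];
    __shared__ float smB[ THREADS ];

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    float basisX = pInputX[ i ];
    float basisY = pInputY[ i ];
    float basisZ = pInputZ[ i ];
    float sum = 0.0f;

    for (int j = jStart; j < jEnd; j += THREADS) {   // only this GPU's share of the points
        smX[ threadIdx.x ] = pInputX[ j + threadIdx.x ];
        smY[ threadIdx.x ] = pInputY[ j + threadIdx.x ];
        smZ[ threadIdx.x ] = pInputZ[ j + threadIdx.x ];
        smB[ threadIdx.x ] = InputVecB[ j + threadIdx.x ];
        __syncthreads();

        for (int k = 0; k < THREADS; k++) {
            float rX = basisX - smX[ k ];
            float rY = basisY - smY[ k ];
            float rZ = basisZ - smZ[ k ];
            sum += K * sqrtf(rX * rX + rY * rY + rZ * rZ) * smB[ k ];
        }
        __syncthreads();
    }

    g_odata[ i ] = sum;   // partial sum over [jStart, jEnd)
}

// GPU 0 (in its own CPU thread): reduce2_part<<< dimGrid, dimBlock >>>(..., 0, NN/2);
// GPU 1 (in its own CPU thread): reduce2_part<<< dimGrid, dimBlock >>>(..., NN/2, NN);
// then on the host: result[i] = out_gpu0[i] + out_gpu1[i];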

eyal

Not because of the freezes - I think we can get x2.

Now I want to finish the algorithm before trying to get more performance (by checking registers, thread counts…).

I need to work with complex numbers.

Instead of:

sum+=K*sqrt(distSqr)*vecB[ k ];

I need to write something like this:

sum += exp(j*K*sqrt(distSqr)) * vecB[ k ];

exp(j*K*sqrt(distSqr)) = cos(K*sqrt(distSqr)) + j*sin(K*sqrt(distSqr))

where j is the imaginary unit and vecB holds complex numbers.
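Maybe something like this inside the k loop (just a sketch - I would split vecB and the output into real and imaginary arrays; sincosf gives sin and cos in one call):

// declared before the j loop, instead of the single float sum:
float sumRe = 0.0f, sumIm = 0.0f;

// inside the k loop, instead of sum += K*sqrt(distSqr)*vecB[k]:
float s, c;
sincosf(K * sqrtf(distSqr), &s, &c);           // exp(j*K*r) = c + j*s
float bRe = vecBRe[ k ], bIm = vecBIm[ k ];    // vecB split into two real shared arrays
sumRe += c * bRe - s * bIm;                    // real part of exp(j*K*r) * vecB[k]
sumIm += c * bIm + s * bRe;                    // imaginary part

// and at the end:
g_odataRe[ i ] = sumRe;
g_odataIm[ i ] = sumIm;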

Amir