I got this error:
Cuda error: …Unbinding Texture… : in file <OnTheFly.cu>, line 201 : the launch timed out and was terminated.
Line 201: UnbindTex1D_rmetFP(texX);
Cuda error: …Freeing GPU memory… : in file <OnTheFly.cu>, line 205 : the launch timed out and was terminated.
Line 205: Free_GPU_memory(d_NN_X);
Does anyone know what the problem might be?
Amir
Google for "watchdog" in NVIDIA's forums… The kernel took too long to complete and was terminated
by the watchdog. This is not related to the texture unbinding itself but to a previous error.
Make sure you check the CUDA error code right after your kernel invocation.
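A minimal sketch of that check (the kernel name and launch configuration here are placeholders, not from Amir's code):

```cuda
// Hypothetical launch; substitute your own kernel and configuration.
myKernel<<<dimGrid, dimBlock>>>(d_data);

// The launch is asynchronous, so check in two steps:
cudaError_t err = cudaGetLastError();   // catches bad launch configurations etc.
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();          // waits for the kernel; catches faults and timeouts
if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));
```

(cudaThreadSynchronize is the era-appropriate call; later CUDA versions rename it cudaDeviceSynchronize.)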
eyal
Ok. now I am getting this:
cutilCheckMsg cudaThreadSynchronize error: Kernel execution failed in file <OnTheFly.cu>, line 193 : the launch timed out and was terminated.
What can be the problem?
Thanks
Amir
The problem is just what the error states :)
The kernel ran for too long and was therefore terminated.
There is a watchdog mechanism that prevents the display card (GPU) from running too long and freezing the entire system;
on Windows it is 5 seconds, so after 5 seconds the kernel will be terminated.
The reason is either that you have a deadlock in your kernel, or that your kernel simply does a lot of calculations
that take too much time.
Deadlock - fix it.
Too much work - break the kernel into multiple kernels so each launch runs shorter. Or use Linux without an X server :)
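One way to break the work into shorter launches, sketched under the assumption that each output element is independent (the names CHUNK and myKernel are illustrative, not from the original code):

```cuda
// Hypothetical: process NN elements in CHUNK-sized pieces, one launch per piece,
// so no single launch runs long enough to trigger the watchdog.
const int CHUNK = 16384;                  // tune so one launch stays well under 5 s
for (int offset = 0; offset < NN; offset += CHUNK) {
    int n = min(CHUNK, NN - offset);
    int blocks = (n + THREADS - 1) / THREADS;
    myKernel<<<blocks, THREADS>>>(d_out, offset, n);  // kernel indexes from 'offset'
    cudaThreadSynchronize();              // let the driver breathe between launches
}
```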
eyal
Interesting!!!
But how do you explain that sometimes it succeeds in running the kernel for about 9 s and sometimes not?
Thanks
Amir
What OS are you using?
With the code and grid configuration it might be easier to say…
eyal
linux x86_64
card: Quadro FX 1700
NN=131072
THREADS=128
int blocks = (NN + THREADS - 1) / THREADS;
dim3 dimBlock(THREADS, 1, 1);
dim3 dimGrid(blocks, 1, 1);
reduce0<<< dimGrid, dimBlock >>> (d_C);
__global__ void reduce0(float *g_odata) {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    float rX; float rY; float rZ; float distSqr;
    float basisX = tex1Dfetch(texX, i);
    float basisY = tex1Dfetch(texY, i);
    float basisZ = tex1Dfetch(texZ, i);
    float sum = 0;
    for (long int j = 0; j < NN; j++) {
        rX = basisX - tex1Dfetch(texX, j);
        rY = basisY - tex1Dfetch(texY, j);
        rZ = basisZ - tex1Dfetch(texZ, j);
        distSqr = rX * rX + rY * rY + rZ * rZ;
        sum += K * sqrt(distSqr) * tex1Dfetch(texVecB, j);
    }
    g_odata[i] = sum;
}
Thanks
Amir
What if you reduce NN to 1000, for example? Will it always run? Are you sure that when it runs for 9 s the kernel ends without errors?
Maybe it's something to do with textures… In any case I think you should be able to test something like this, as the texture usage here doesn't seem too reasonable to me:
__global__ void reduce0(float *pInputX, float *pInputY, float *pInputZ, float *g_odata) {
    __shared__ float smX[THREADS];
    __shared__ float smY[THREADS];
    __shared__ float smZ[THREADS];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    float rX; float rY; float rZ; float distSqr;
    float basisX = pInputX[i];
    float basisY = pInputY[i];
    float basisZ = pInputZ[i];
    float sum = 0;
    for (long int j = 0; j < NN / THREADS; j++) {  // assumes NN % THREADS == 0
        smX[threadIdx.x] = pInputX[j * THREADS + threadIdx.x];
        smY[threadIdx.x] = pInputY[j * THREADS + threadIdx.x];
        smZ[threadIdx.x] = pInputZ[j * THREADS + threadIdx.x];
        __syncthreads();
        for (int k = 0; k < THREADS; k++)
        {
            rX = basisX - smX[k];
            rY = basisY - smY[k];
            rZ = basisZ - smZ[k];
            distSqr = rX * rX + rY * rY + rZ * rZ;
            sum += K * sqrt(distSqr) * tex1Dfetch(texVecB, j * THREADS + k); // same with shared mem for texVecB
        }
        __syncthreads();
    }
    g_odata[i] = sum;
}
Hope that helps a bit…
Wonderful!!!
Now I think I understand the difference between texture and shared memory.
The shared memory helps to get more parallelism.
Your version runs about 1.5x faster than mine.
It looks like this:
__global__ void reduce2(float *pInputX, float *pInputY, float *pInputZ, float *InputVecB, float *g_odata) {
    __shared__ float smX[THREADS];
    __shared__ float smY[THREADS];
    __shared__ float smZ[THREADS];
    __shared__ float vecB[THREADS];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    float rX; float rY; float rZ; float distSqr;
    float basisX = pInputX[i]; float basisY = pInputY[i]; float basisZ = pInputZ[i];
    float sum = 0;
    for (long int j = 0; j < NN / THREADS; j++) {
        unsigned int j_THREADS = j * THREADS;
        smX[threadIdx.x] = pInputX[j_THREADS + threadIdx.x];
        smY[threadIdx.x] = pInputY[j_THREADS + threadIdx.x];
        smZ[threadIdx.x] = pInputZ[j_THREADS + threadIdx.x];
        vecB[threadIdx.x] = InputVecB[j_THREADS + threadIdx.x];
        __syncthreads();
        for (int k = 0; k < THREADS; k++)
        {
            rX = basisX - smX[k]; rY = basisY - smY[k]; rZ = basisZ - smZ[k];
            distSqr = rX * rX + rY * rY + rZ * rZ;
            sum += K * sqrt(distSqr) * vecB[k];
        }
        __syncthreads();
    }
    g_odata[i] = sum;
}
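To measure speedups like the 1.5x reliably, the kernel can be timed with CUDA events; a sketch, with the device pointers assumed from earlier in the thread:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
reduce2<<<dimGrid, dimBlock>>>(d_X, d_Y, d_Z, d_B, d_C);  // pointer names assumed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                 // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```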
BTW - the original problem wasn't solved (the launch timed out and was terminated).
However, I'm getting a new computer with a GTX 295 plus a simple display card, so the GTX 295 can work alone, and I hope that will solve the problem.
Do you have another idea to get more performance?
Thanks (again)
Amir
Amir,
Mind you, that was pseudo-code :) You should verify that it indeed works correctly. For example, if NN % THREADS != 0 the code will not work.
The timeout still bothers me - it might be caused by faulty code. Are you sure it's not a deadlock?
As for further performance - can you post the register/smem usage for the kernel? Did you run it via the profiler?
What if you change the k loop to run only THREADS / 2 iterations - how does that affect the kernel time?
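If NN % THREADS != 0, one way to keep the same tiled structure is to round the tile count up and guard the loads and the inner loop; a sketch of just the changed parts (untested, same names as the kernel above):

```cuda
// Round the number of tiles up instead of truncating:
for (long int j = 0; j < (NN + THREADS - 1) / THREADS; j++) {
    unsigned int idx = j * THREADS + threadIdx.x;
    smX[threadIdx.x] = (idx < NN) ? pInputX[idx] : 0.0f;  // pad the tail tile with zeros
    smY[threadIdx.x] = (idx < NN) ? pInputY[idx] : 0.0f;
    smZ[threadIdx.x] = (idx < NN) ? pInputZ[idx] : 0.0f;
    __syncthreads();
    int valid = (int)min((long int)THREADS, NN - j * THREADS);  // real elements in this tile
    for (int k = 0; k < valid; k++)
    {
        /* ... same loop body as before ... */
    }
    __syncthreads();
}
```

The final write g_odata[i] = sum would similarly need an if (i < NN) guard.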
eyal
Another thing that's worth checking is whether there is a relationship between your input data arrays.
For example, if pInputB = pInputA * pInputA (or some other relation), it might be cheaper
to simply re-calculate pInputB than to read it from memory.
eyal
It indeed works correctly - I checked against the CPU. For the moment I don't care about NN % THREADS != 0 (maybe later).
My system guy needs to check the timeout. Maybe it happens because the card serves both graphics and CUDA. As I told you, I'm getting a new computer with two cards, so we can check this issue on that system.
How can I get the register/smem usage for the kernel? I'll be happy to post that information for you.
I did not run it via the profiler. What can I get with the profiler, and how do I run it?
If I change the k loop to run only THREADS / 2 iterations, I get time/2, but of course the output is incorrect. What was your idea behind the THREADS / 2 run?
There is no relationship between my input data arrays. However, as you can see, when we calculate the distances between the points we compute both (point1-point2)^2 and (point2-point1)^2. So if we build the matrix from all the lines we create, we get, for example:
0 A B C
A 0 D E
B D 0 F
C E F 0
Maybe we can use it.
What do you think?
Thanks a lot for your help,
Amir
Add --ptxas-options="-v -mem" to your .cu compilation line (in release mode); you should see something like this:
1>ptxas info : Compiling entry function '_Z15CalculatePhaseAILj0EL19ECalculatedDataType1EEvjPfS1_PbS1_S1_S1_S1_S1_'
1>ptxas info : Used 15 registers, 2788+16 bytes smem, 80 bytes cmem[0], 24 bytes cmem[1]
This tells you how many resources each of your kernels uses - you can then put this info into the occupancy calculator
to get a grip on the occupancy of your kernel.
The k-loop change is a quick test: if you cut k by half and get half the time, it means your kernel is probably compute-bound, not bandwidth-bound. The k loop
already has all its data in shared memory, so it doesn't access global memory, just does computations.
You might want to try 256 threads per block, and try replacing the sqrt with the intrinsic version (look for "intrinsic" in the programming manual).
Also make sure that sqrt is not the double-precision version (if it is, try sqrtf instead).
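A minimal sketch of the sqrt variants for that one line (all operating on a float; __fsqrt_rn is the single-precision round-to-nearest intrinsic from the programming manual):

```cuda
sum += K * sqrt(distSqr)      * vecB[k];  // may resolve to the double-precision overload
sum += K * sqrtf(distSqr)     * vecB[k];  // explicit single-precision sqrt
sum += K * __fsqrt_rn(distSqr) * vecB[k]; // single-precision intrinsic
```

Compiling with --use_fast_math also maps sqrtf to the faster, less accurate hardware version, if the precision loss is acceptable.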
Well, you can probably cut the calculations (read: kernel time) in half by calculating only what's above the diagonal of zeros and then mirroring the results.
So just calculate:
0 A B C
  0 D E
    0 F
      0
and then run a mirroring kernel that just copies the data (instead of re-calculating it). What do you think?
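One possible shape for such a mirroring kernel, assuming the distances are materialized in a full NN x NN row-major matrix d_M (a hypothetical layout - the thread's kernels above keep only per-row sums, so this only applies if the full matrix is stored):

```cuda
// Copy the upper triangle into the lower: M[row][col] = M[col][row] for row > col.
__global__ void mirror(float *d_M, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < row)              // strictly below the diagonal
        d_M[row * n + col] = d_M[col * n + row];
}
```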
eyal
Hi eyal,
Do you know whether a single program can use all 480 processor cores of the GTX 295?
Thanks
Amir
The two halves of the GTX 295 are seen as two independent GPUs. You can't share data between
them (maybe only via the new memory features from CUDA 2.2 onwards).
You can either use only the second GPU (presumably the watchdog is not relevant for it),
or split the work and give each GPU half of it.
Each GPU requires its own CPU thread - you can look at the simpleMultiGPU sample in the SDK,
or google for MisterAndreson's GPUWorker solution here in the newsgroups.
Check out the deviceQuery SDK sample as well to see how the GTX 295 is seen by CUDA.
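A rough sketch of the per-thread shape (CUDA of this era needs one CPU thread per GPU, as in simpleMultiGPU; the names here are hypothetical and the input copies are elided):

```cuda
// Run inside each worker thread, with 'dev' = 0 or 1 (the two halves of the GTX 295).
void worker(int dev, float *h_out) {
    cudaSetDevice(dev);                          // bind this thread's context to one GPU
    float *d_out;
    cudaMalloc((void**)&d_out, (NN / 2) * sizeof(float));
    // ... copy this GPU's half of the inputs, launch the kernel on NN/2 elements ...
    cudaMemcpy(h_out + dev * (NN / 2), d_out, (NN / 2) * sizeof(float),
               cudaMemcpyDeviceToHost);          // each half lands in its slice of the result
    cudaFree(d_out);
}
```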
eyal
So as I see it, for "our" algorithm it's preferable to use a GTX 295 rather than a GTX 285.
I think we can split the problem between the GPUs. Am I right?
Thanks a lot,
Amir
Why do you think so? Because of the freezes? First, on a Linux box without an X server you wouldn't have any problem with
that. In a "GUI" environment you might be able to change the code so it runs for less time (by breaking the kernels
into smaller chunks) and thus not "upset" the watchdog.
Yes, I guess you can.
Instead of this, for example:
for (long int j = 0; j < NN/THREADS; j++)
you might run NN/THREADS/2 iterations on the first GPU and the other half on the second GPU,
and then merge the results on the CPU, for example. Of course you'll have to play with the code
to see what's best for your performance…
eyal
Not because of the freezes - I think we can get a 2x speedup.
Now I want to finish the algorithm before trying to get more performance (by checking registers, thread counts…).
I need to work with complex numbers.
Instead of:
sum += K * sqrt(distSqr) * vecB[k];
I need to write something like this:
sum += exp(j * K * sqrt(distSqr)) * vecB[k];
where exp(j * K * sqrt(distSqr)) = cos(K * sqrt(distSqr)) + j * sin(K * sqrt(distSqr)),
j is the imaginary unit, and vecB is complex.
Amir