Multi-GPU + exclusive mode + MATLAB: can't run two processes, kernel crashes

I’ve got a computer with one half of a Tesla S1070 installed (one interface card, two GPUs) and a long-running (about 20 minutes) single-GPU MATLAB code that I’m trying to run twice.
The system is set up in compute exclusive mode, cudaThreadExit is called when the MATLAB code exits, and the code never asks for a specific device (which is the whole idea of exclusive mode).
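
In outline, the mex side does something like this (a minimal sketch, not the real wrapper; the names are illustrative):

[codebox]
// Minimal sketch of the setup described above (illustrative, not the real
// wrapper): no cudaSetDevice() anywhere, so the runtime binds to whichever
// exclusive-mode GPU is free, and cudaThreadExit() runs when MATLAB unloads
// the mex file.
#include "mex.h"
#include <cuda_runtime.h>

static void cleanup_gpu(void)
{
    cudaThreadExit();    // tear down the context (pre-CUDA 4.0 call)
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mexAtExit(cleanup_gpu);   // registered once; MATLAB calls it on clear/exit

    /* ... cudaMalloc, cudaMemcpy, kernel launch, cudaMemcpy back ... */

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        mexErrMsgTxt(cudaGetErrorString(err));
}
[/codebox]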

A single MATLAB instance runs through fine.

If I run a second instance (or in fact ANY GPU code), the kernel crashes with an unspecified launch failure (error 4).

The crash happens when the other GPU code is started, not at launch time (the first job can run for 15 minutes; the moment I call radixSort from the SDK, the MATLAB code crashes with the unspecified launch failure).

Any ideas on the problem and how to solve it?

Thanks

Are you certain both GPUs are really in compute exclusive mode? And are you sure your MATLAB code is doing what you think it should be doing?

I say that because my development box has a pair of GT200s, both kept in exclusive mode, and it has never done anything inexplicable. I can run multithreaded apps with two threads, or two single-threaded apps, or an MPI app with two members of the communicator, and it all works. If both cards have contexts attached, anything else I try to run returns a “no device available” error. I have had no problems with device management since about CUDA 2.2 on multi-GPU systems with properly written code in Linux (by which I mean no hardcoded device IDs anywhere and proper testing of the compute mode, so that no attempt is made to use compute prohibited GPUs, etc.).
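
By “proper testing of the compute mode” I mean something along these lines (just a sketch, error checking omitted):

[codebox]
// Sketch only: report each GPU's compute mode and never hardcode a device ID.
// In exclusive mode the first call that needs a context binds to a free GPU
// by itself, or fails with "no CUDA-capable device is available".
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d (%s): computeMode = %d\n", d, prop.name, prop.computeMode);
        if (prop.computeMode == cudaComputeModeProhibited)
            printf("  skipping: compute prohibited\n");
    }
    return 0;
}
[/codebox]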

deviceQuery reports both GPUs are in compute exclusive mode. Also, choosing the device explicitly for two jobs fails with “no device available”.

When running a single job, it finishes with output which is close enough to the MATLAB (pure CPU) simulation (floating point, and float vs. double, so I don’t expect an exact match).

I don’t think that there is an access violation, but it’s hard to check. Everything seems to run fine until I run the second job.

It seems to be specific to this job, since I ran another long-running job and couldn’t crash it, but I don’t know where to start looking.

Did some more tests:

I’ve got a kernel that performs the same computation on different data sets inside a loop.

Doing the computation once inside the kernel is OK. Doing it a second time causes the problem (a stripped-down kernel is attached below, with a comment marking the part that causes the problem).

What I see:

A machine with one Tesla S1070 interface card, connected to two of the S1070’s GPUs.

Running the full computation on one GPU finishes OK, and the result is close enough to the MATLAB code (presumably float vs. double).

Running the full computation on one GPU in exclusive mode and then running anything else on the same GPU causes the computation to crash with an unspecified launch failure (it doesn’t matter at what stage of the computation).

With only one of the six computations left in, or with the GPU switched back to the default (non-exclusive) mode, everything runs fine.

So this has something to do with the combination of this computation and exclusive mode.

I tried to clean up the sample as much as possible:

[codebox]

template <typename T> __device__ inline T pow2(T a) { return a*a; }

__global__ void loop_det_cu_kernel(

IN int x1, IN int wx, IN int y1, IN int wy, IN int m, IN int n, IN float res_det,

IN float a, IN float *xlens_mat, IN float *ylens_mat, IN float *part_integr, IN float *part_integi,

IN float *E2x_sr, IN float *E2x_si, IN float *E2y_sr, IN float *E2y_si, IN float *E2z_sr, IN float *E2z_si,

IN float *E2x_ir, IN float *E2x_ii, IN float *E2y_ir, IN float *E2y_ii, IN float *E2z_ir, IN float *E2z_ii,

OUT float *E_det_x_sr, OUT float *E_det_y_sr, OUT float *E_det_z_sr,

OUT float *E_det_x_ir, OUT float *E_det_y_ir, OUT float *E_det_z_ir,

OUT float *E_det_x_si, OUT float *E_det_y_si, OUT float *E_det_z_si,

OUT float *E_det_x_ii, OUT float *E_det_y_ii, OUT float *E_det_z_ii)

{

// The current working pixel

int x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

int y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

if (x >= wx || y >= wy)

	return;

float xx = __int2float_rn(x)*res_det - x1;

float yy = __int2float_rn(y)*res_det - y1;

float t_E_det_x_sr = 0;

float t_E_det_x_si = 0;

float t_E_det_y_sr = 0;

float t_E_det_y_si = 0;

for (int nn = 0; nn < n ; nn++)

{

	for (int mm = 0; mm < m ; mm++)

	{

		float r_sqr = pow2(xx - *(xlens_mat++)) + pow2(yy - *(ylens_mat++));

		float exp_r = cosf(a * r_sqr);

		float exp_i = sinf(a * r_sqr);

		float integ_Vr = *part_integr * exp_r - *part_integi * exp_i;

		float integ_Vi = *part_integr * exp_i + *part_integi * exp_r;

		part_integr++;

		part_integi++;

		float Vr;

		float Vi;

		float tr;

		float ti;

		

		/* --------------------*/

		/* This first one works */

		/* --------------------*/

		tr = *(E2x_sr++);

		ti = *(E2x_si++);

		

		Vr = tr*integ_Vr - ti*integ_Vi;

		Vi = tr*integ_Vi + ti*integ_Vr;

		t_E_det_x_sr += Vr; /* eqs 31 & 30 */

		t_E_det_x_si += Vi; /* eqs 31 & 30 */

		/* ------------------------------------------------*/

		/* Adding a second computation causes the crash */

		/* ------------------------------------------------*/

		

		tr = *(E2y_sr++);

		ti = *(E2y_si++);

		

		Vr = tr*integ_Vr - ti*integ_Vi;

		Vi = tr*integ_Vi + ti*integ_Vr;

		t_E_det_y_sr += Vr; /* eqs 31 & 30 */

		t_E_det_y_si += Vi; /* eqs 31 & 30 */

	}

}

*(E_det_x_sr + x*wy + y) = t_E_det_x_sr;

*(E_det_x_si + x*wy + y) = t_E_det_x_si;

*(E_det_y_sr + x*wy + y) = t_E_det_y_sr;

*(E_det_y_si + x*wy + y) = t_E_det_y_si;

}

[/codebox]
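
For anyone trying to reproduce this, a check of roughly this shape after each launch pins the failure to the offending kernel instead of a later, unrelated call (just a sketch; the message id is made up):

[codebox]
// Sketch: check the launch and then synchronize, so an execution-time
// "unspecified launch failure" is reported at the kernel that caused it
// rather than at some later, unrelated CUDA call.
#include "mex.h"
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t e = (call);                                            \
        if (e != cudaSuccess)                                              \
            mexErrMsgIdAndTxt("loop_det_cu:cuda", "%s at %s:%d",           \
                              cudaGetErrorString(e), __FILE__, __LINE__);  \
    } while (0)

// Typical use right after a launch:
//   loop_det_cu_kernel<<<grid, block>>>( /* all the arguments above */ );
//   CUDA_CHECK(cudaGetLastError());        // launch/configuration errors
//   CUDA_CHECK(cudaThreadSynchronize());   // errors raised during execution
[/codebox]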

I have a similar problem!

I also use 2 GPUs of a Tesla S1070 system, with driver version 195.36.24 and both GPUs (really) in exclusive compute mode.

When I start 2 arbitrary jobs, e.g. first “nbody -benchmark -n=300000 -device=0” from the SDK and then a task which only selects GPU 0 with cudaSetDevice and allocates memory, BOTH jobs crash every time! The second one gives the expected error message that there is no CUDA-capable device available, but the first always gets an “unspecified launch failure”.

I don’t know what to do. I tried every combination on the GPUs and even different jobs (e.g. 2 nbody simulations on the same GPU)… I always get this error! Even if I do not set a device explicitly with cudaSetDevice, the first job crashes if it’s running on device 0.

Any ideas here?

As I want to establish a batch system for these GPUs, this is a really big problem!

This has got to be specific to the S1070 or something. To give an example of “just works”:

My development machine is a white box with a pair of GTX275s in it, running Ubuntu 9.04 with 195.36.15 drivers. I put one card into compute prohibited and the second into compute exclusive:

avidday@cuda:~/code/Jacobi$ uname -a

Linux cuda 2.6.28-18-generic #60-Ubuntu SMP Fri Mar 12 04:26:47 UTC 2010 x86_64 GNU/Linux

avidday@cuda:~/code/Jacobi$ nvidia-smi -s

COMPUTE mode rules for GPU 0: 2

COMPUTE mode rules for GPU 1: 1

I then have a runtime API application which I hacked to keep the GPU busy for about 20 seconds (it is a Jacobi solver that I make sure never converges):

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:09 EEST 2010

JOR failed to converge

for N=2500

JOR   perfomance: 25000 calls, 13697.462891 ms elapsed, 22.869198 GFLOPS

sgemv perfomance: 5000 calls, 2723.711914 ms elapsed, 22.955805 GFLOPS

real	0m17.232s

user	0m17.057s

sys	0m0.168s

Tue May 25 13:16:26 EEST 2010

While that is happening, I try running a second instance repeatedly:

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:15 EEST 2010

FAILURE no CUDA-capable device is available in jacobi.cu, line 213

real	0m0.359s

user	0m0.144s

sys	0m0.204s

Tue May 25 13:16:16 EEST 2010

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:17 EEST 2010

FAILURE no CUDA-capable device is available in jacobi.cu, line 213

real	0m0.346s

user	0m0.140s

sys	0m0.204s

Tue May 25 13:16:18 EEST 2010

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:19 EEST 2010

FAILURE no CUDA-capable device is available in jacobi.cu, line 213

real	0m0.313s

user	0m0.100s

sys	0m0.200s

Tue May 25 13:16:20 EEST 2010

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:21 EEST 2010

FAILURE no CUDA-capable device is available in jacobi.cu, line 213

real	0m0.335s

user	0m0.136s

sys	0m0.196s

Tue May 25 13:16:22 EEST 2010

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:23 EEST 2010

FAILURE no CUDA-capable device is available in jacobi.cu, line 213

real	0m0.367s

user	0m0.152s

sys	0m0.204s

Tue May 25 13:16:24 EEST 2010

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:25 EEST 2010

FAILURE no CUDA-capable device is available in jacobi.cu, line 213

real	0m0.335s

user	0m0.128s

sys	0m0.208s

Tue May 25 13:16:26 EEST 2010

avidday@cuda:~/code/Jacobi$ date; time ./jacobi; date

Tue May 25 13:16:27 EEST 2010

JOR failed to converge

for N=2500

JOR   perfomance: 25000 calls, 13691.371094 ms elapsed, 22.879374 GFLOPS

sgemv perfomance: 5000 calls, 2722.701660 ms elapsed, 22.964323 GFLOPS

real	0m17.059s

user	0m16.889s

sys	0m0.168s

Tue May 25 13:16:44 EEST 2010

If you look at the timestamps, you can clearly see that the first job runs to completion, while the second fails until the first is done, after which it also runs correctly. I have a production cluster with GT200s set up the same way; we use Sun Grid Engine to schedule jobs and keep the cards in compute exclusive mode at all times, and it “just works”. The only difference I can see is the hardware. You are using S1070s, I am not.

I’m guessing from my current tests that this is related to resource usage.

I changed the code (as described previously) to use an index instead of pointer arithmetic, which dropped register usage from 56 to 34 IIRC and the problem went away.

I will try to create a more minimal test, but my guess is that checking whether the GPU is free takes some resources away, which leaves the running kernel sitting high and dry.
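
For reference, the change was along these lines: the same inner loop written with one running index instead of a dozen incrementing pointers (sketch of the loop body only):

[codebox]
// Sketch: indexed version of the inner loop from the kernel above.
int idx = 0;
for (int nn = 0; nn < n; nn++)
{
    for (int mm = 0; mm < m; mm++, idx++)
    {
        float r_sqr = pow2(xx - xlens_mat[idx]) + pow2(yy - ylens_mat[idx]);
        float exp_r = cosf(a * r_sqr);
        float exp_i = sinf(a * r_sqr);

        float integ_Vr = part_integr[idx] * exp_r - part_integi[idx] * exp_i;
        float integ_Vi = part_integr[idx] * exp_i + part_integi[idx] * exp_r;

        float tr = E2x_sr[idx];
        float ti = E2x_si[idx];
        t_E_det_x_sr += tr * integ_Vr - ti * integ_Vi;   /* eqs 31 & 30 */
        t_E_det_x_si += tr * integ_Vi + ti * integ_Vr;

        tr = E2y_sr[idx];
        ti = E2y_si[idx];
        t_E_det_y_sr += tr * integ_Vr - ti * integ_Vi;
        t_E_det_y_si += tr * integ_Vi + ti * integ_Vr;
    }
}
[/codebox]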

I don’t mean to be repetitive, but I am going to ask the question again: are you really sure the GPUs are in exclusive mode? In exclusive mode there should never be any resource contention between jobs, because the driver should refuse to let a new thread even establish a context on an exclusive device which is in use [you can see that in the error message of my post]. Nothing you do with resources or anything else within a job should have anything to do with the behaviour.

Can you do a test like the one I posted? (I can post the code for the example program I used if you like).
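
In the meantime, something as simple as the following makes a usable stand-in for my test (a sketch, not the actual Jacobi code): start one copy, and while it is busy start a second one.

[codebox]
// Sketch of a minimal exclusive-mode test: grab a context and keep the GPU
// busy for a while. A second copy started while the first is running should
// fail cleanly with "no CUDA-capable device is available" and leave the
// first copy untouched.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(float *x, int iters)
{
    float v = x[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = sinf(v) * cosf(v) + 1.0f;    // pointless work to burn time
    x[threadIdx.x] = v;
}

int main(void)
{
    float *d = 0;
    cudaError_t e = cudaMalloc((void **)&d, 256 * sizeof(float)); // creates the context
    if (e != cudaSuccess) {
        fprintf(stderr, "FAILURE %s\n", cudaGetErrorString(e));
        return 1;
    }
    for (int i = 0; i < 200; ++i)          // adjust counts to keep the GPU busy long enough
        spin<<<1, 256>>>(d, 1 << 20);
    e = cudaThreadSynchronize();
    printf("done: %s\n", cudaGetErrorString(e));
    cudaFree(d);
    return 0;
}
[/codebox]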

Yes, 100% sure that both GPUs are in exclusive mode (tested with low-resource jobs, deviceQuery and nvidia-smi; all agree that the device is in exclusive mode).

Besides, it shouldn’t matter: whatever the mode, two kernels can’t run at the same time.

In regular mode I get no problems.

In exclusive mode I get a crash with 64 registers per thread, while with 32 registers per thread everything is OK, so it’s definitely a register usage issue (I didn’t test further yet to find the exact limit, though).

If you post your code I can test it as well, or try compiling with --maxrregcount 64 (make sure that it actually uses 64 registers though) and see if it crashes.

Compiler output for problematic job (64 registers):

nvcc --ptxas-options -v -O2 -c --maxrregcount 64 loop_det_cu_kernel.cu --compiler-options -fPIC -arch sm_13

ptxas info : Compiling entry function '_Z18loop_det_cu_kerneliiiiiiffPKfS0_S0_S0_S0_S0_S0_S0_S0

S0_S0_S0_S0_S0_S0_S0_PfS1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1’ for ‘sm_13’

ptxas info : Used 64 registers, 68+0 bytes lmem, 256+16 bytes smem, 24 bytes cmem[0], 100 bytes cmem[1]

mex -O loop_det_cu.cpp loop_det_cu_kernel.o -L/usr/local/cuda/lib64/ -lcudart -I/usr/local/cuda/include/

Starting the second job crashes the first one

Compiler output for --maxrregcount 32

nvcc --ptxas-options -v -O2 -c --maxrregcount 32 loop_det_cu_kernel.cu --compiler-options -fPIC -arch sm_13

ptxas info : Compiling entry function '_Z18loop_det_cu_kerneliiiiiiffPKfS0_S0_S0_S0_S0_S0_S0_S0

S0_S0_S0_S0_S0_S0_S0_PfS1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1’ for ‘sm_13’

ptxas info : Used 32 registers, 224+0 bytes lmem, 256+16 bytes smem, 24 bytes cmem[0], 100 bytes cmem[1]

mex -O loop_det_cu.cpp loop_det_cu_kernel.o -L/usr/local/cuda/lib64/ -lcudart -I/usr/local/cuda/include/

First job runs fine, no crash

The second job returns “matrixMul.cu(131) : cudaSafeCall() Runtime API error : no CUDA-capable device is available.” in both cases, as it should.
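
If I read the docs correctly, newer toolkits can also cap the register budget per kernel with __launch_bounds__ rather than --maxrregcount for the whole file (untested sketch with a placeholder kernel; 256 threads per block and 2 resident blocks per SM works out to 16384/512 = 32 registers per thread on sm_13):

[codebox]
// Untested sketch: __launch_bounds__ caps registers for one kernel only
// (here 256 threads/block and at least 2 resident blocks per SM, which on
// sm_13 works out to 16384/512 = 32 registers per thread). The kernel body
// is a trivial placeholder.
__global__ void __launch_bounds__(256, 2) capped_kernel(float *out)
{
    out[threadIdx.x] = 0.0f;
}
[/codebox]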

That isn’t strictly correct - a GPU can have more than one context simultaneously (unless it is in exclusive mode), and those contexts, the memory they allocate, and the kernels they launch will compete for resources (and hose one another if something in the code is bad). And that is most of my point too - registers, resources, memory, all of it should be completely irrelevant when the GPU is truly in exclusive mode.

What you are seeing really sounds like two contexts are winding up on the same GPU, compute exclusive or not. It might well be a bug in the driver or something specific to your S1070 setup, but a second thread should never be able to even establish a context on an occupied, exclusive GPU, and the process of trying should have no effect on the existing context.

I will put up the code I used as an attachment in another post in a couple of minutes.

That isn’t strictly correct - a GPU can have more than one context simultaneously (unless it is in exclusive mode) and those context, the memory they allocate, and kernels they launch will compete for resources (and hose one other if something in the code is bad). And that is most of my point too - registers, resources, memory, all of it should be completely irrelevant when the GPU is truly in exclusive mode.

What you are seeing really sounds like two contexts are winding up on the same GPU, compute exclusive or not. It might well be a bug in the driver or something specific to your S1070 setup, but a second thread should never even be able to even establish a context on an occupied, exclusive GPU and the process of trying should have no effect on the existing context.

I will put up the code I used as an attachment in another post in a couple of minutes.

Contexts share global memory and possibly constant memory (I’m not sure how that is implemented in hardware, as there is very little information).

Registers and shared memory are allocated per block at launch time, not per context. Textures are implemented as a cache, but there is a binding issue, so contexts may compete on that.

There is also some sort of basic memory protection (it doesn’t always work), so you shouldn’t be able to touch memory outside what you allocated.

The kernels themselves don’t run in parallel but are queued one after the other. One piece of evidence for that is the watchdog timer: it wouldn’t be needed if kernels could be time-sliced.

The interesting thing is that I see problems only in exclusive mode. The machines I have with GeForces run Windows, so exclusive mode is a problem there, but I’ll see if I can test something (possibly put the GeForces in the servers on Thursday). I’ll see if I can make your sample test crash, and then we’ll have something to compare.