System just halts for simple CUDA program

Hi there, I have posted this on the Windows XP section of this forum but no one ever replied. Please help out.


Hi there, the following code brings my Dell Precision 7500 Box ( with a Tesla C2050 ) to a complete halt. The only thing left I can do is to press to power button to turn off the system. I cannot see what is wrong with this code. Despite the fact it’s the worst implementation of a simple convolution algorithm. It only runs one thread total. Here is the code:

[codebox]#include

#include

#include

#include <cuda.h>

const unsigned int SRC_N = 10000;

const unsigned int DST_N = 2 * SRC_N - 1;

global void convolve_gpu_simple( float* src_1

                               , float* src_2

                               , float* dst

                               )

{

for( int i = 0; i < SRC_N; ++i )

{

    for( int j = 0; j < SRC_N; ++j )

    {

        dst[ i + j ] += src_1[i] * src_2[j];

    }

}

}

void check_for_error( cudaError_t err )

{

if( err != cudaSuccess )

{

    const char* error_msg = cudaGetErrorString( err );

    exit( 1 );

}

}

void convolve_cpu( float* a, float* b, float* c )

{

for( int i = 0; i < SRC_N; ++i )

{

    for( int j = 0; j < SRC_N; ++j )

    {

        c[ i + j ] += a[i] * b[j];

    }

}

}

bool compare_results( float* a, float* b )

{

unsigned int m = 0;

for( int i = 0; i < DST_N; ++i )

{

    int v_a = *(( unsigned int* )&a[i]);

    int v_b = *(( unsigned int* )&b[i]);

int ulp = abs( v_a - v_b );

    m = max( m, ulp );

if( ulp > 5 )

        return false;

}

return true;

}

using namespace std;

int main(int argc, char** argv)

{

srand( 0 );

float *src_1, *src_2, *dst_cpu, *dst_gpu;

float *dev_src_1, *dev_src_2, *dev_dst;

src_1 = new float[ SRC_N ];

src_2   = new float[ SRC_N ];

dst_cpu = new float[ DST_N ];

dst_gpu = new float[ DST_N ];

check_for_error( cudaMalloc( &dev_src_1, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_src_2, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_dst  , DST_N * sizeof( float )));

fill( src_1, src_1 + SRC_N, 1.0f );

fill( src_2, src_2 + SRC_N, 1.0f );

fill( dst_cpu, dst_cpu + DST_N, 0.0f );

fill( dst_gpu, dst_gpu + DST_N, 0.0f );

convolve_cpu( src_1, src_2, dst_cpu );

check_for_error( cudaMemcpy( dev_src_1, src_1, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

check_for_error( cudaMemcpy( dev_src_2, src_2, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

convolve_gpu_simple<<< 1, 1 >>>( dev_src_1, dev_src_2, dev_dst );

check_for_error( cudaGetLastError() );

check_for_error( cudaMemcpy( dst_gpu, dev_dst, DST_N * sizeof( float ), cudaMemcpyDeviceToHost ));

assert( compare_results( dst_cpu, dst_gpu ));

cudaFree( dev_src_1 );

cudaFree( dev_src_2 );

cudaFree( dev_dst );

delete src_1;

delete[] src_2;

delete[] dst_cpu;

delete[] dst_gpu;

return 0;

}[/codebox]

Any help is very much appreciated. I’m using VS2008 on a Windows 7 x64 box. The NVidia driver is 258.96 and I use CUDA 3.1.

Thanks,

Christian

Your SRC_N and DST_N vars are on the CPU, they should (also) be on the GPU.

Put device in front of them, and it will work.

EDIT: since you need them also on the CPU, you could declare other constants, but better, pass them as parameters to the convolve_gpu_simple() routine.

The compiler is (probably) clever enough to put them in constant memory, if you declare the parameter vars as const.

Your SRC_N and DST_N vars are on the CPU, they should (also) be on the GPU.

Put device in front of them, and it will work.

EDIT: since you need them also on the CPU, you could declare other constants, but better, pass them as parameters to the convolve_gpu_simple() routine.

The compiler is (probably) clever enough to put them in constant memory, if you declare the parameter vars as const.

Hi there, thanks a lot of your answer. Please see below.

As you suggested I’ve added the number of points as an another parameter to my convolve_gpu_simple() kernel function. It still halts my system and I have to reboot. Just to be clear when SRC_N is 1000 everything works as expected. When SRC_N 10,000 the system freezes. Did you run the program?

Here is my updated code:

[codebox]#include

#include

#include

#include

#include <cuda.h>

const unsigned int SRC_N = 1000;

const unsigned int TMP_N = SRC_N * SRC_N;

const unsigned int DST_N = 2 * SRC_N - 1;

global void convolve_gpu_simple( float* src_1

                               , float* src_2

                               , float* dst

                               , unsigned int num

                               )

{

for( int i = 0; i < num; ++i )

{

    for( int j = 0; j < num; ++j )

    {

        dst[ i + j ] += src_1[i] * src_2[j];

    }

}

}

void check_for_error( cudaError_t err )

{

if( err != cudaSuccess )

{

    const char* error_msg = cudaGetErrorString( err );

    exit( 1 );

}

}

float get()

{

//return float( rand() ) / RAND_MAX;

return 1.f / SRC_N;

}

using namespace std;

int main(int argc, char** argv)

{

srand( 0 );

float *src_1, *src_2, *dst_gpu;

float *dev_src_1, *dev_src_2, *dev_dst;

src_1 = new float[ SRC_N ];

src_2   = new float[ SRC_N ];

dst_gpu = new float[ DST_N ];

check_for_error( cudaMalloc( &dev_src_1, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_src_2, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_dst  , DST_N * sizeof( float )));

generate_n( src_1, SRC_N, get );

generate_n( src_2, SRC_N, get );

fill( dst_gpu, dst_gpu + DST_N, 0.0f );

check_for_error( cudaMemcpy( dev_src_1, src_1, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

check_for_error( cudaMemcpy( dev_src_2, src_2, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

convolve_gpu_simple<<< 1, 1 >>>( dev_src_1, dev_src_2, dev_dst, SRC_N );

check_for_error( cudaGetLastError() );

check_for_error( cudaMemcpy( dst_gpu, dev_dst, DST_N * sizeof( float ), cudaMemcpyDeviceToHost ));

cudaFree( dev_src_1 );

cudaFree( dev_src_2 );

cudaFree( dev_dst   );

delete src_1;

delete[] src_2;

delete[] dst_gpu;

return 0;

}

[/codebox]

Hi there, thanks a lot of your answer. Please see below.

As you suggested I’ve added the number of points as an another parameter to my convolve_gpu_simple() kernel function. It still halts my system and I have to reboot. Just to be clear when SRC_N is 1000 everything works as expected. When SRC_N 10,000 the system freezes. Did you run the program?

Here is my updated code:

[codebox]#include

#include

#include

#include

#include <cuda.h>

const unsigned int SRC_N = 1000;

const unsigned int TMP_N = SRC_N * SRC_N;

const unsigned int DST_N = 2 * SRC_N - 1;

global void convolve_gpu_simple( float* src_1

                               , float* src_2

                               , float* dst

                               , unsigned int num

                               )

{

for( int i = 0; i < num; ++i )

{

    for( int j = 0; j < num; ++j )

    {

        dst[ i + j ] += src_1[i] * src_2[j];

    }

}

}

void check_for_error( cudaError_t err )

{

if( err != cudaSuccess )

{

    const char* error_msg = cudaGetErrorString( err );

    exit( 1 );

}

}

float get()

{

//return float( rand() ) / RAND_MAX;

return 1.f / SRC_N;

}

using namespace std;

int main(int argc, char** argv)

{

srand( 0 );

float *src_1, *src_2, *dst_gpu;

float *dev_src_1, *dev_src_2, *dev_dst;

src_1 = new float[ SRC_N ];

src_2   = new float[ SRC_N ];

dst_gpu = new float[ DST_N ];

check_for_error( cudaMalloc( &dev_src_1, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_src_2, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_dst  , DST_N * sizeof( float )));

generate_n( src_1, SRC_N, get );

generate_n( src_2, SRC_N, get );

fill( dst_gpu, dst_gpu + DST_N, 0.0f );

check_for_error( cudaMemcpy( dev_src_1, src_1, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

check_for_error( cudaMemcpy( dev_src_2, src_2, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

convolve_gpu_simple<<< 1, 1 >>>( dev_src_1, dev_src_2, dev_dst, SRC_N );

check_for_error( cudaGetLastError() );

check_for_error( cudaMemcpy( dst_gpu, dev_dst, DST_N * sizeof( float ), cudaMemcpyDeviceToHost ));

cudaFree( dev_src_1 );

cudaFree( dev_src_2 );

cudaFree( dev_dst   );

delete src_1;

delete[] src_2;

delete[] dst_gpu;

return 0;

}

[/codebox]

Without analyzing your code, the mere fact that this is a simple O(n^2) algorithm means your example with SRC_N=10,000 will take 100times longer than the 1,000 case.
Are you sure your machine is truly halted? Maybe it’s just still computing.

This is especially possible if you’re using the same GPU for display and compute… the display will completely freeze for as long as the compute is going.

Without analyzing your code, the mere fact that this is a simple O(n^2) algorithm means your example with SRC_N=10,000 will take 100times longer than the 1,000 case.
Are you sure your machine is truly halted? Maybe it’s just still computing.

This is especially possible if you’re using the same GPU for display and compute… the display will completely freeze for as long as the compute is going.

Yes, I ran the program. On GTX275 it takes about a minute. See SPWorley’s remarks, they are quite correct.

BTW I ended up by just #defining the constants.

EDIT1: a minute is true for release (no debug info for gpu). For debug, it takes at least twice as long.

EDIT2: you really should use more thread(block)s before going to the higher numbers

Yes, I ran the program. On GTX275 it takes about a minute. See SPWorley’s remarks, they are quite correct.

BTW I ended up by just #defining the constants.

EDIT1: a minute is true for release (no debug info for gpu). For debug, it takes at least twice as long.

EDIT2: you really should use more thread(block)s before going to the higher numbers

Thanks for your answer. I have two graphics cards in my machine. One Tesla C2050 and a Quadro NVS 295. I believe the Quadro card drives the monitor. To make sure I use the Tesla card I have expanded my code use cudaSetDevice. I also have lowered my SRC_N to 3000. Still the same problem. I can run when SRC_N is 2000 which takes a bit longer than SRC_N = 1000.

Here is the new code:

[codebox]#include

#include

#include

#include

#include <cuda.h>

const unsigned int SRC_N = 3000;

const unsigned int TMP_N = SRC_N * SRC_N;

const unsigned int DST_N = 2 * SRC_N - 1;

global void convolve_gpu_simple( float* src_1

                               , float* src_2

                               , float* dst

                               , unsigned int num

                               )

{

for( int i = 0; i < num; ++i )

{

    for( int j = 0; j < num; ++j )

    {

        dst[ i + j ] += src_1[i] * src_2[j];

    }

}

}

void check_for_error( cudaError_t err )

{

if( err != cudaSuccess )

{

    const char* error_msg = cudaGetErrorString( err );

    exit( 1 );

}

}

float get()

{

//return float( rand() ) / RAND_MAX;

return 1.f / SRC_N;

}

using namespace std;

bool select_tesla_device()

{

int deviceCount = 0;

check_for_error( cudaGetDeviceCount( &deviceCount ));

bool device_found = false;

for( int i = 0; i < deviceCount; ++i )

{

    cudaDeviceProp deviceProp;

    check_for_error( cudaGetDeviceProperties(&deviceProp, i ));

if( strcmp( deviceProp.name, “Tesla C2050” ) == 0 )

    {

        check_for_error( cudaSetDevice( i ));

        device_found = true;

break;

    }

}

if( !device_found )

{

    std::cout << "No Tesla GPU found" << std::endl;

return false;

}

return true;

}

int main(int argc, char** argv)

{

if( select_tesla_device() == false )

{

    return 1;

}

srand( 0 );

float *src_1, *src_2, *dst_gpu;

float *dev_src_1, *dev_src_2, *dev_dst;

src_1 = new float[ SRC_N ];

src_2   = new float[ SRC_N ];

dst_gpu = new float[ DST_N ];

check_for_error( cudaMalloc( &dev_src_1, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_src_2, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_dst  , DST_N * sizeof( float )));

generate_n( src_1, SRC_N, get );

generate_n( src_2, SRC_N, get );

fill( dst_gpu, dst_gpu + DST_N, 0.0f );

check_for_error( cudaMemcpy( dev_src_1, src_1, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

check_for_error( cudaMemcpy( dev_src_2, src_2, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

convolve_gpu_simple<<< 1, 1 >>>( dev_src_1, dev_src_2, dev_dst, SRC_N );

check_for_error( cudaGetLastError() );

check_for_error( cudaMemcpy( dst_gpu, dev_dst, DST_N * sizeof( float ), cudaMemcpyDeviceToHost ));

cudaFree( dev_src_1 );

cudaFree( dev_src_2 );

cudaFree( dev_dst   );

delete src_1;

delete[] src_2;

delete[] dst_gpu;

return 0;

}[/codebox]

Thanks for all your answers.

Christian

Thanks for your answer. I have two graphics cards in my machine. One Tesla C2050 and a Quadro NVS 295. I believe the Quadro card drives the monitor. To make sure I use the Tesla card I have expanded my code use cudaSetDevice. I also have lowered my SRC_N to 3000. Still the same problem. I can run when SRC_N is 2000 which takes a bit longer than SRC_N = 1000.

Here is the new code:

[codebox]#include

#include

#include

#include

#include <cuda.h>

const unsigned int SRC_N = 3000;

const unsigned int TMP_N = SRC_N * SRC_N;

const unsigned int DST_N = 2 * SRC_N - 1;

global void convolve_gpu_simple( float* src_1

                               , float* src_2

                               , float* dst

                               , unsigned int num

                               )

{

for( int i = 0; i < num; ++i )

{

    for( int j = 0; j < num; ++j )

    {

        dst[ i + j ] += src_1[i] * src_2[j];

    }

}

}

void check_for_error( cudaError_t err )

{

if( err != cudaSuccess )

{

    const char* error_msg = cudaGetErrorString( err );

    exit( 1 );

}

}

float get()

{

//return float( rand() ) / RAND_MAX;

return 1.f / SRC_N;

}

using namespace std;

bool select_tesla_device()

{

int deviceCount = 0;

check_for_error( cudaGetDeviceCount( &deviceCount ));

bool device_found = false;

for( int i = 0; i < deviceCount; ++i )

{

    cudaDeviceProp deviceProp;

    check_for_error( cudaGetDeviceProperties(&deviceProp, i ));

if( strcmp( deviceProp.name, “Tesla C2050” ) == 0 )

    {

        check_for_error( cudaSetDevice( i ));

        device_found = true;

break;

    }

}

if( !device_found )

{

    std::cout << "No Tesla GPU found" << std::endl;

return false;

}

return true;

}

int main(int argc, char** argv)

{

if( select_tesla_device() == false )

{

    return 1;

}

srand( 0 );

float *src_1, *src_2, *dst_gpu;

float *dev_src_1, *dev_src_2, *dev_dst;

src_1 = new float[ SRC_N ];

src_2   = new float[ SRC_N ];

dst_gpu = new float[ DST_N ];

check_for_error( cudaMalloc( &dev_src_1, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_src_2, SRC_N * sizeof( float )));

check_for_error( cudaMalloc( &dev_dst  , DST_N * sizeof( float )));

generate_n( src_1, SRC_N, get );

generate_n( src_2, SRC_N, get );

fill( dst_gpu, dst_gpu + DST_N, 0.0f );

check_for_error( cudaMemcpy( dev_src_1, src_1, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

check_for_error( cudaMemcpy( dev_src_2, src_2, SRC_N * sizeof( float ), cudaMemcpyHostToDevice ));

convolve_gpu_simple<<< 1, 1 >>>( dev_src_1, dev_src_2, dev_dst, SRC_N );

check_for_error( cudaGetLastError() );

check_for_error( cudaMemcpy( dst_gpu, dev_dst, DST_N * sizeof( float ), cudaMemcpyDeviceToHost ));

cudaFree( dev_src_1 );

cudaFree( dev_src_2 );

cudaFree( dev_dst   );

delete src_1;

delete[] src_2;

delete[] dst_gpu;

return 0;

}[/codebox]

Thanks for all your answers.

Christian

I did one more try with SRC_N = 2500. No problem. It took about 2 secs. But with SRC_N = 3000 I get the same problem. Running the program I can see that system stays responsive for a couple of seconds. I can move around windows and the cursor in my console is blinking. Than everything just stops and only the mouse pointer still moves. No keyboard responses. Only the power button helps.

I did run the app for 5min and it did not return.

Christian

I did one more try with SRC_N = 2500. No problem. It took about 2 secs. But with SRC_N = 3000 I get the same problem. Running the program I can see that system stays responsive for a couple of seconds. I can move around windows and the cursor in my console is blinking. Than everything just stops and only the mouse pointer still moves. No keyboard responses. Only the power button helps.

I did run the app for 5min and it did not return.

Christian

Something wrong with your installation, maybe.

I copied your last code, commented the tesla bit out (I’m not that expensive), compile and run (VS208 x64 release), just a few seconds.

EDIT1: checked versions. All identical to your specs except the box itself. Driver 258.96, cuda tools 3.1, win7 64. compile for x64, release, SM_13 (gtx 275).

EDIT2: ran the test with 20000. Took a few minutes, but returned normally.

EDIT3: For some reason (one core of) the cpu is also fully occupied all the time, even although in this version the convolution is not done on cpu.

Jan

Something wrong with your installation, maybe.

I copied your last code, commented the tesla bit out (I’m not that expensive), compile and run (VS208 x64 release), just a few seconds.

EDIT1: checked versions. All identical to your specs except the box itself. Driver 258.96, cuda tools 3.1, win7 64. compile for x64, release, SM_13 (gtx 275).

EDIT2: ran the test with 20000. Took a few minutes, but returned normally.

EDIT3: For some reason (one core of) the cpu is also fully occupied all the time, even although in this version the convolution is not done on cpu.

Jan

I know. I have a parallel version for the algorithm. Here I can go to SRC_N = 8000 with 32 threads and 1 block with no problem. The system halts when trying SRC_N = 10000. After the next reboot ( quite a frustrating exercise ) I run with SRC_N = 10000 and 128 threads and it works.

It seems I keep hitting some upper bounds that crashes or halts something in the CUDA runtime.

Regards,

Christian

I know. I have a parallel version for the algorithm. Here I can go to SRC_N = 8000 with 32 threads and 1 block with no problem. The system halts when trying SRC_N = 10000. After the next reboot ( quite a frustrating exercise ) I run with SRC_N = 10000 and 128 threads and it works.

It seems I keep hitting some upper bounds that crashes or halts something in the CUDA runtime.

Regards,

Christian

Interesting that you can run without any problems. The only major difference seems to be that I run a Tesla C2050 and you do it on a GTX 275 board. I don’t think it’s an installation problem. My box is brand new. I have had it delivered about a week ago and just installed CUDA and CUDA SDK. All samples from the SDK run without any problems.

Is there anything more I could do to get rid of my problem, besides improving the algorithm? I’m working on that. I rather would have an error issued by the CUDA runtime than a complete system halt.

Interesting that you can run without any problems. The only major difference seems to be that I run a Tesla C2050 and you do it on a GTX 275 board. I don’t think it’s an installation problem. My box is brand new. I have had it delivered about a week ago and just installed CUDA and CUDA SDK. All samples from the SDK run without any problems.

Is there anything more I could do to get rid of my problem, besides improving the algorithm? I’m working on that. I rather would have an error issued by the CUDA runtime than a complete system halt.

If I can think of anything, I’ll let you know, of course, but preventing lockups generally is not my ballgame.

Could you try to compile for and run the code on your 295?

If it runs on the 295, I should think that a further test could be to compile only for SM_13, and try to run that on the tesla.

If that one were to run, the problem would be narrowed down to SM_20 compilation or to your particular tesla, the latter being improbable since everything else runs fine.

That could give some results which you can use in a support-ticket.

And, maybe someone else with a tesla (2050) can try the code and see if the problem can be reproduced.