CL_INVALID_COMMAND_QUEUE error on clFinish command - a lot of operations in each kernel driver crash

Hi,

I’m new to OpenCL and have a problem with the porting of an existing inverse-DCT program into OpenCL. As I’m trying not change the whole program, I’m not working with any opencl image types. The informations about the image to perform my calculation on is an array of one dimension.

My implementation works fine with the intel sdk on the CPU but the nvidia sdk returns -5 (CL_OUT_OF_RESOURCES) as error value when I’m calling clEnqueueReadBuffer to get my results.

As a result I get a driver crash: NVIDIA Windows Kernel Driver, Version 267.21

The problem seems to be the amount of calculations to be done in each kernel.

To reproduce this error I wrote a small test. I didn’t reproduce the CL_OUT_OF_RESOURCES error because I don’t read from a buffer. The following code produces -36 ([i]CL_INVALID_COMMAND_QUEUE[i]) as error.

Host code:

size_t globalThreads[1]={32767};

size_t localThreads[1]={1};

ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, NULL

// wait for all commands to end

ret = clFlush(command_queue);

ret |= clFinish(command_queue);

if (ret) {

	debugOut("Error: Failed to wait for the executed commands!\n");

	return false;

}

Kernel code:

__kernel void inverseDct() {

	int indexX = get_global_id(0);

	float Cm = 0.0f;

	float Cn = 0.0f;

	float c1 = 0.0f;

	float c2 = 0.0f;

	float sum = 0.0f;

	// y is not used in the following lines, remove the loop and it works fine.

	for (unsigned int y=0; y<8; y++) {

		for (unsigned int x=0; x<8; x++) {

			for (unsigned int m=0; m<8; m++) {

				Cm = (float) (m ? 1 : (1 / sqrt(2.0)));

				for (unsigned int n=0; n<8; n++) {

					Cn = (float)(n ? 1 : (1 / sqrt(2.0)));

					c1 = (float)half_cos(((2 * x + 1) * M_PI * m) / 16);

					c2 = (float)half_cos(((2 * 1 + 1) * M_PI * n) / 16);

					sum += Cn * Cm * indexX * c1 * c2;

				}

			}	

		}

	}

}

Some informations about my system:

OpenCL SW Info:

CL_PLATFORM_NAME:      NVIDIA CUDA

 CL_PLATFORM_VERSION:   OpenCL 1.0 CUDA 3.2.1

 OpenCL SDK Revision:   7027912

OpenCL Device Info:

1 devices found supporting OpenCL:

---------------------------------

 Device NVS 3100M

 ---------------------------------

  CL_DEVICE_NAME:                       NVS 3100M

  CL_DEVICE_VENDOR:                     NVIDIA Corporation

  CL_DRIVER_VERSION:                    267.21

  CL_DEVICE_VERSION:                    OpenCL 1.0 CUDA

  CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_GPU

  CL_DEVICE_MAX_COMPUTE_UNITS:          2

  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3

  CL_DEVICE_MAX_WORK_ITEM_SIZES:        512 / 512 / 64

  CL_DEVICE_MAX_WORK_GROUP_SIZE:        512

  CL_DEVICE_MAX_CLOCK_FREQUENCY:        1468 MHz

  CL_DEVICE_ADDRESS_BITS:               32

  CL_DEVICE_MAX_MEM_ALLOC_SIZE:         128 MByte

  CL_DEVICE_GLOBAL_MEM_SIZE:            218 MByte

  CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no

  CL_DEVICE_LOCAL_MEM_TYPE:             local

  CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte

  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte

  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

  CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE

  CL_DEVICE_IMAGE_SUPPORT:              1

  CL_DEVICE_MAX_READ_IMAGE_ARGS:        128

  CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8

  CL_DEVICE_SINGLE_FP_CONFIG:           INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

CL_DEVICE_IMAGE <dim>                 2D_MAX_WIDTH     4096

                                        2D_MAX_HEIGHT    32768

                                        3D_MAX_WIDTH     2048

                                        3D_MAX_HEIGHT    2048

                                        3D_MAX_DEPTH     2048

CL_DEVICE_EXTENSIONS:                 cl_khr_byte_addressable_store

                                        cl_khr_icd

                                        cl_khr_gl_sharing

                                        cl_nv_d3d9_sharing

                                        cl_nv_d3d10_sharing

                                        cl_khr_d3d10_sharing

                                        cl_nv_d3d11_sharing

                                        cl_nv_compiler_options

                                        cl_nv_device_attribute_query

                                        cl_nv_pragma_unroll

                                        cl_khr_global_int32_base_atomics

                                        cl_khr_global_int32_extended_atomics

                                        cl_khr_local_int32_base_atomics

                                        cl_khr_local_int32_extended_atomics

CL_DEVICE_COMPUTE_CAPABILITY_NV:      1.2

  NUMBER OF MULTIPROCESSORS:            2

  NUMBER OF CUDA CORES:                 16

  CL_DEVICE_REGISTERS_PER_BLOCK_NV:     16384

  CL_DEVICE_WARP_SIZE_NV:               32

  CL_DEVICE_GPU_OVERLAP_NV:             CL_TRUE

  CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:     CL_TRUE

  CL_DEVICE_INTEGRATED_MEMORY_NV:       CL_FALSE

  CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0

---------------------------------

  2D Image Formats Supported (71)

  ---------------------------------

  #     Channel Order   Channel Type

1     CL_R            CL_FLOAT

  2     CL_R            CL_HALF_FLOAT

  3     CL_R            CL_UNORM_INT8

  4     CL_R            CL_UNORM_INT16

  5     CL_R            CL_SNORM_INT16

  6     CL_R            CL_SIGNED_INT8

  7     CL_R            CL_SIGNED_INT16

  8     CL_R            CL_SIGNED_INT32

  9     CL_R            CL_UNSIGNED_INT8

  10    CL_R            CL_UNSIGNED_INT16

  11    CL_R            CL_UNSIGNED_INT32

  12    CL_A            CL_FLOAT

  13    CL_A            CL_HALF_FLOAT

  14    CL_A            CL_UNORM_INT8

  15    CL_A            CL_UNORM_INT16

  16    CL_A            CL_SNORM_INT16

  17    CL_A            CL_SIGNED_INT8

  18    CL_A            CL_SIGNED_INT16

  19    CL_A            CL_SIGNED_INT32

  20    CL_A            CL_UNSIGNED_INT8

  21    CL_A            CL_UNSIGNED_INT16

  22    CL_A            CL_UNSIGNED_INT32

  23    CL_RG           CL_FLOAT

  24    CL_RG           CL_HALF_FLOAT

  25    CL_RG           CL_UNORM_INT8

  26    CL_RG           CL_UNORM_INT16

  27    CL_RG           CL_SNORM_INT16

  28    CL_RG           CL_SIGNED_INT8

  29    CL_RG           CL_SIGNED_INT16

  30    CL_RG           CL_SIGNED_INT32

  31    CL_RG           CL_UNSIGNED_INT8

  32    CL_RG           CL_UNSIGNED_INT16

  33    CL_RG           CL_UNSIGNED_INT32

  34    CL_RA           CL_FLOAT

  35    CL_RA           CL_HALF_FLOAT

  36    CL_RA           CL_UNORM_INT8

  37    CL_RA           CL_UNORM_INT16

  38    CL_RA           CL_SNORM_INT16

  39    CL_RA           CL_SIGNED_INT8

  40    CL_RA           CL_SIGNED_INT16

  41    CL_RA           CL_SIGNED_INT32

  42    CL_RA           CL_UNSIGNED_INT8

  43    CL_RA           CL_UNSIGNED_INT16

  44    CL_RA           CL_UNSIGNED_INT32

  45    CL_RGBA         CL_FLOAT

  46    CL_RGBA         CL_HALF_FLOAT

  47    CL_RGBA         CL_UNORM_INT8

  48    CL_RGBA         CL_UNORM_INT16

  49    CL_RGBA         CL_SNORM_INT16

  50    CL_RGBA         CL_SIGNED_INT8

  51    CL_RGBA         CL_SIGNED_INT16

  52    CL_RGBA         CL_SIGNED_INT32

  53    CL_RGBA         CL_UNSIGNED_INT8

  54    CL_RGBA         CL_UNSIGNED_INT16

  55    CL_RGBA         CL_UNSIGNED_INT32

  56    CL_BGRA         CL_UNORM_INT8

  57    CL_BGRA         CL_SIGNED_INT8

  58    CL_BGRA         CL_UNSIGNED_INT8

  59    CL_ARGB         CL_UNORM_INT8

  60    CL_ARGB         CL_SIGNED_INT8

  61    CL_ARGB         CL_UNSIGNED_INT8

  62    CL_INTENSITY    CL_FLOAT

  63    CL_INTENSITY    CL_HALF_FLOAT

  64    CL_INTENSITY    CL_UNORM_INT8

  65    CL_INTENSITY    CL_UNORM_INT16

  66    CL_INTENSITY    CL_SNORM_INT16

  67    CL_LUMINANCE    CL_FLOAT

  68    CL_LUMINANCE    CL_HALF_FLOAT

  69    CL_LUMINANCE    CL_UNORM_INT8

  70    CL_LUMINANCE    CL_UNORM_INT16

  71    CL_LUMINANCE    CL_SNORM_INT16

---------------------------------

  3D Image Formats Supported (71)

  ---------------------------------

  #     Channel Order   Channel Type

1     CL_R            CL_FLOAT

  2     CL_R            CL_HALF_FLOAT

  3     CL_R            CL_UNORM_INT8

  4     CL_R            CL_UNORM_INT16

  5     CL_R            CL_SNORM_INT16

  6     CL_R            CL_SIGNED_INT8

  7     CL_R            CL_SIGNED_INT16

  8     CL_R            CL_SIGNED_INT32

  9     CL_R            CL_UNSIGNED_INT8

  10    CL_R            CL_UNSIGNED_INT16

  11    CL_R            CL_UNSIGNED_INT32

  12    CL_A            CL_FLOAT

  13    CL_A            CL_HALF_FLOAT

  14    CL_A            CL_UNORM_INT8

  15    CL_A            CL_UNORM_INT16

  16    CL_A            CL_SNORM_INT16

  17    CL_A            CL_SIGNED_INT8

  18    CL_A            CL_SIGNED_INT16

  19    CL_A            CL_SIGNED_INT32

  20    CL_A            CL_UNSIGNED_INT8

  21    CL_A            CL_UNSIGNED_INT16

  22    CL_A            CL_UNSIGNED_INT32

  23    CL_RG           CL_FLOAT

  24    CL_RG           CL_HALF_FLOAT

  25    CL_RG           CL_UNORM_INT8

  26    CL_RG           CL_UNORM_INT16

  27    CL_RG           CL_SNORM_INT16

  28    CL_RG           CL_SIGNED_INT8

  29    CL_RG           CL_SIGNED_INT16

  30    CL_RG           CL_SIGNED_INT32

  31    CL_RG           CL_UNSIGNED_INT8

  32    CL_RG           CL_UNSIGNED_INT16

  33    CL_RG           CL_UNSIGNED_INT32

  34    CL_RA           CL_FLOAT

  35    CL_RA           CL_HALF_FLOAT

  36    CL_RA           CL_UNORM_INT8

  37    CL_RA           CL_UNORM_INT16

  38    CL_RA           CL_SNORM_INT16

  39    CL_RA           CL_SIGNED_INT8

  40    CL_RA           CL_SIGNED_INT16

  41    CL_RA           CL_SIGNED_INT32

  42    CL_RA           CL_UNSIGNED_INT8

  43    CL_RA           CL_UNSIGNED_INT16

  44    CL_RA           CL_UNSIGNED_INT32

  45    CL_RGBA         CL_FLOAT

  46    CL_RGBA         CL_HALF_FLOAT

  47    CL_RGBA         CL_UNORM_INT8

  48    CL_RGBA         CL_UNORM_INT16

  49    CL_RGBA         CL_SNORM_INT16

  50    CL_RGBA         CL_SIGNED_INT8

  51    CL_RGBA         CL_SIGNED_INT16

  52    CL_RGBA         CL_SIGNED_INT32

  53    CL_RGBA         CL_UNSIGNED_INT8

  54    CL_RGBA         CL_UNSIGNED_INT16

  55    CL_RGBA         CL_UNSIGNED_INT32

  56    CL_BGRA         CL_UNORM_INT8

  57    CL_BGRA         CL_SIGNED_INT8

  58    CL_BGRA         CL_UNSIGNED_INT8

  59    CL_ARGB         CL_UNORM_INT8

  60    CL_ARGB         CL_SIGNED_INT8

  61    CL_ARGB         CL_UNSIGNED_INT8

  62    CL_INTENSITY    CL_FLOAT

  63    CL_INTENSITY    CL_HALF_FLOAT

  64    CL_INTENSITY    CL_UNORM_INT8

  65    CL_INTENSITY    CL_UNORM_INT16

  66    CL_INTENSITY    CL_SNORM_INT16

  67    CL_LUMINANCE    CL_FLOAT

  68    CL_LUMINANCE    CL_HALF_FLOAT

  69    CL_LUMINANCE    CL_UNORM_INT8

  70    CL_LUMINANCE    CL_UNORM_INT16

  71    CL_LUMINANCE    CL_SNORM_INT16

oclDeviceQuery, Platform Name = NVIDIA CUDA, Platform Version = OpenCL 1.0 CUDA 3.2.1, SDK Revision = 7027912, NumDevs = 1, Device = NVS 3100M

I used different setups to test:

global threads: ~32k; with y-loop! -> crash!

global threads: ~16k; with y-loop! -> works

global threads: ~32k; with y-loop! -> works

global threads: ~65k; with y-loop! -> works

global threads: ~130k with y-loop! -> works

global threads: ~260 with y-loop! -> crash!

Is there a connection between the global thread size and the amount of calculations to be done in the kernels?

My assumption is that I’m running out of local or private memory. But how can that happen if I’m just looping and doing some calculations?

I hope my explanation is understandable. Thanks in advance, appreciate any help.

My guess is that the kernel execution for all work items takes so much time (by the way, local worksize 1 is really bad one) that the system resets video driver.

Thanks for the reply!

Do you know if there is a way to tell my system not to reset the driver?

In the actual program I’m using a dynamic local worksize < 512, depending on the global worksize.
One example of an images that crashes:
Global worksize: 214668
Local worksize: 402

For everyone having similar problems:
I solved it by setting TdrDelay and TdrDdiDelay as discribed in here: http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx

Thanks for the hint.