2nd context creation fails on Tesla C2050

I have a problem that seems to be seen only on a Tesla C2050.

Our application creates multiple contexts for the same device.
This works fine on most cards.
But on our Tesla C2050, creating a 2nd context while a context is already active fails with the dreaded “unknown error”. (We need to get better error reporting; is there a log file somewhere that can tell you what actually went wrong?)
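
A note on decoding these errors: newer CUDA toolkits (6.0 and later) provide cuGetErrorName/cuGetErrorString in the driver API, but on the 3.x toolkit used here you have to map the CUresult codes yourself. A minimal sketch of such a helper (decodeCuError is my own name, not an SDK function; error 999 is CUDA_ERROR_UNKNOWN):

/* Map a few common CUresult codes to readable names (CUDA 3.x driver API). */
static const char *decodeCuError(CUresult res)
{
    switch (res) {
    case CUDA_SUCCESS:               return "CUDA_SUCCESS";
    case CUDA_ERROR_INVALID_VALUE:   return "CUDA_ERROR_INVALID_VALUE";
    case CUDA_ERROR_OUT_OF_MEMORY:   return "CUDA_ERROR_OUT_OF_MEMORY";
    case CUDA_ERROR_NOT_INITIALIZED: return "CUDA_ERROR_NOT_INITIALIZED";
    case CUDA_ERROR_INVALID_DEVICE:  return "CUDA_ERROR_INVALID_DEVICE";
    case CUDA_ERROR_UNKNOWN:         return "CUDA_ERROR_UNKNOWN"; /* 999 */
    default:                         return "unrecognized CUresult";
    }
}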

I have modified one of the SDK samples to show the problem. The source is attached.

When run on my development system, which has 2 GTX 480s and 1 C2050, it reports this:

CUDA Device Query (Driver API) statically linked version
There are 3 devices supporting CUDA

Device 0: “GeForce GTX 480”
CUDA Driver Version: 3.10
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1576468480 bytes
created first context 0000000002241C10
created second context 0000000002999250

Device 1: “GeForce GTX 480”
CUDA Driver Version: 3.10
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1576599552 bytes
created first context 0000000002241C10
created second context 000000000293C2E0

Device 2: “Tesla C2050”
CUDA Driver Version: 3.10
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 3181969408 bytes
created first context 0000000002241C10
FAILURE: could not create 2nd context (error=999)

FAILURES

You can see it works for the GTX 480s but not the Tesla.
The same result is obtained on a system with a Quadro FX5800 and a Tesla C2050 (the FX5800 works, the Tesla does not).

This was tested with both the 258.96 and 259.03 server drivers.

I will also file a bug report.
-Derek Ney

I am not sure the attachment facility is working, so here is the code (it is not long):

/*
 * Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 */

/* This sample queries the properties of the CUDA devices present in the system. */

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <cuda.h>

#include <cutil.h>

// utilities and system includes
#include <shrUtils.h>

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
    CUdevice dev;
    int major = 0, minor = 0;
    int deviceCount = 0;
    char deviceName[256];
    bool failed = false;
    int options = CU_CTX_SCHED_YIELD;

    // note your project will need to link with cuda.lib on Windows
    printf("CUDA Device Query (Driver API) statically linked version \n");

    CUresult err = cuInit(0);
    if (err != CUDA_SUCCESS) {
        printf("FAILURE: cuInit failed (error=%d)\n", err);
        return 1;
    }

    CU_SAFE_CALL_NO_SYNC(cuDeviceGetCount(&deviceCount));
    // This function call returns 0 if there are no CUDA capable devices.
    if (deviceCount == 0) {
        printf("There is no device supporting CUDA\n");
    }

    for (dev = 0; dev < deviceCount; ++dev) {
        CU_SAFE_CALL_NO_SYNC( cuDeviceComputeCapability(&major, &minor, dev) );

        if (dev == 0) {
            // This function call returns 9999 for both major & minor fields if no CUDA capable devices are present
            if (major == 9999 && minor == 9999)
                printf("There is no device supporting CUDA.\n");
            else if (deviceCount == 1)
                printf("There is 1 device supporting CUDA\n");
            else
                printf("There are %d devices supporting CUDA\n", deviceCount);
        }

        CU_SAFE_CALL_NO_SYNC( cuDeviceGetName(deviceName, 256, dev) );
        printf("\nDevice %d: \"%s\"\n", dev, deviceName);

        int driverVersion = 0;
        cuDriverGetVersion(&driverVersion);
        printf("  CUDA Driver Version:                           %d.%d\n", driverVersion/1000, driverVersion%100);
        printf("  CUDA Capability Major revision number:         %d\n", major);
        printf("  CUDA Capability Minor revision number:         %d\n", minor);

        unsigned int totalGlobalMem;
        CU_SAFE_CALL_NO_SYNC( cuDeviceTotalMem(&totalGlobalMem, dev) );
        printf("  Total amount of global memory:                 %u bytes\n", totalGlobalMem);

        CUresult res;
        CUcontext ctx;
        CUcontext ctx2;

        // Create the first context on this device.
        if ((res = cuCtxCreate(&ctx, options, dev)) != CUDA_SUCCESS)
        {
            printf("FAILURE: could not create initial context (error=%d)\n", res);
            failed = true;
            continue;
        }

        printf("  created first context %p\n", ctx);

        // Create a second context on the same device; this is what fails on the C2050.
        if ((res = cuCtxCreate(&ctx2, options, dev)) != CUDA_SUCCESS)
        {
            printf("FAILURE: could not create 2nd context (error=%d)\n", res);
            failed = true;
            continue;
        }

        printf("  created second context %p\n", ctx2);

        if ((res = cuCtxDetach(ctx2)) != CUDA_SUCCESS)
        {
            printf("FAILURE: could not detach from 2nd context (error=%d)\n", res);
            failed = true;
            continue;
        }

        if ((res = cuCtxDetach(ctx)) != CUDA_SUCCESS)
        {
            printf("FAILURE: could not detach from initial context (error=%d)\n", res);
            failed = true;
            continue;
        }
    }

    if (failed)
        printf("\nFAILURES\n");
    else
        printf("\nPASSED\n");

    CUT_EXIT(argc, argv);

    return 0;
}
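
For anyone who wants to build this: it compiles against the CUDA 3.1 SDK. The command line below is a sketch assuming the standard 64-bit Linux SDK layout (the $CUDA_SDK paths and the cutil library name are assumptions; on Windows, link with cuda.lib and the SDK's cutil library as noted in the source):

nvcc -I$CUDA_SDK/C/common/inc -I$CUDA_SDK/shared/inc deviceQueryDrv.cpp -o deviceQueryDrv -lcuda -L$CUDA_SDK/C/lib -lcutil_x86_64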

NVIDIA helped me with this problem. Here is the information they gave me:

It turns out there is a configuration setting, the “compute mode”, that is set restrictively by default on Tesla C2050 cards. It has three values:

0 = normal
1 = exclusive
2 = prohibited

It can be manipulated with the nvidia-smi program which is normally installed here: C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe.

nvidia-smi -L -a will list GPUs,
nvidia-smi -g (GPU ID) -s will show the GPU’s current compute mode,
nvidia-smi -g (GPU ID) -c (compute mode) will change the compute mode.

Indeed, the -s command showed that my C2050 was in mode 1. In mode 1 only one context can be created per card.
I am not sure why this is the default. I changed it to mode 0 with this command: nvidia-smi -g 2 -c 0. After that my test program worked.
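
For reference, the compute mode can also be queried programmatically through the driver API with cuDeviceGetAttribute and CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, so a program can report why a second cuCtxCreate is going to fail instead of just getting error 999. A small sketch against the CUDA 3.x headers (later toolkits rename the exclusive-mode constant):

#include <cuda.h>
#include <stdio.h>

int main(void)
{
    int deviceCount = 0;
    CUdevice dev;

    cuInit(0);
    cuDeviceGetCount(&deviceCount);

    for (dev = 0; dev < deviceCount; ++dev) {
        int mode = 0;
        /* Returns CU_COMPUTEMODE_DEFAULT (0), CU_COMPUTEMODE_EXCLUSIVE (1),
           or CU_COMPUTEMODE_PROHIBITED (2); in exclusive mode only one
           context can exist on the device at a time. */
        cuDeviceGetAttribute(&mode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev);
        printf("device %d compute mode: %d\n", (int)dev, mode);
    }
    return 0;
}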
