Only 300 MB of free memory on Tesla S2050 GPUs

I have a Tesla S2050 with driver version 260.19.26 installed. Each of the GPUs is supposed to have 3GB of global memory and that’s what deviceQuery tells me too.

However, when I run a simple program that calls cudaMemGetInfo (see code snippet below), I find that only 307757056 (~300 MB) of the 2817982464 bytes are “free”. I get the same result for all 4 GPUs. No other jobs are running on my GPUs.

I also run into this memory issue when I try to run the SDK’s FDTD3d example:


fdtdGPU…
cudaGetDeviceCount
cudaSetDevice (device 0)
cudaMalloc bufferOut
cudaMalloc bufferIn

!!! Error # 0 at cudaMalloc ‘out of memory’.
GPU FDTD loop

fdtdGPU complete

FAILED

!!! Error # 0 at line 44 , in file src/FDTD3d.cpp !!!

Exiting…

However, when I run the corresponding example in OpenCL, the code runs just fine. So is this a driver problem?

Thanks in advance,
Mark

==============NVSMI LOG==============

Timestamp : Mon Jan 10 14:06:30 2011

Driver Version : 260.19.26

Unit 0:
Product Name : NVIDIA Tesla S2050
Product ID : 920-20804-0020
Serial Number : 0383110000030
Firmware Ver : 6.2
Intake Temperature : 26 C
GPU 0:
Product Name : Tesla S2050
Serial : 0322810019679
PCI Device/Vendor ID : 6de10de
PCI Location ID : 1:7:0
Bridge Port : 0
Temperature : 47 C
Utilization :
GPU : 0%
Memory : 0%
Volatile ECC errors:
Single Bit :
FB : 0
RF : 0
L1 : 0
L2 : 0
Total : 0
Double Bit :
FB : 0
RF : 0
L1 : 0
L2 : 0
Total : 0
Aggregate ECC errors:
Single Bit :
Total : 0
Double Bit :
Total : 510203

… same for the other 3 GPUs…

----- Mem check code -------
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    size_t freeMem, totalMem;
    cudaError_t err;

    cudaSetDevice(0);
    /* Query free and total global memory on device 0 */
    err = cudaMemGetInfo(&freeMem, &totalMem);
    printf("Free = %lu, Total = %lu, error = %i\n", (unsigned long)freeMem, (unsigned long)totalMem, (int)err);
    return 0;
}

It works fine for me with 260.19.21.

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 260.19.21 Thu Nov 4 21:16:27 PDT 2010

GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)

Could you try with the following code?

// Compile with: gcc check_mem.c -I/usr/local/cuda/include -lcuda

#include <cuda.h>
#include <stdio.h>

static unsigned long inKB(unsigned long bytes)
   { return bytes/1024; }

static unsigned long inMB(unsigned long bytes)
   { return bytes/(1024*1024); }

static void printStats(unsigned long free, unsigned long total)
   {
      printf("^^^^ Free : %lu bytes (%lu KB) (%lu MB)\n", free, inKB(free), inMB(free));
      printf("^^^^ Total: %lu bytes (%lu KB) (%lu MB)\n", total, inKB(total), inMB(total));
      printf("^^^^ %f%% free, %f%% used\n", 100.0*free/(double)total, 100.0*(total - free)/(double)total);
   }

int main(int argc, char **argv)
   {
      size_t free, total;
      int gpuCount, i;
      CUresult res;
      CUdevice dev;
      CUcontext ctx;
      char name[100];

      /* The driver API must be initialized before any other call */
      res = cuInit(0);
      if(res != CUDA_SUCCESS) { printf("!!!! cuInit failed! (status = %x)\n", res); return 1; }

      cuDeviceGetCount(&gpuCount);
      printf("Detected %d GPU\n",gpuCount);

      for (i=0; i<gpuCount; i++)
      {
         cuDeviceGet(&dev,i);
         cuDeviceGetName(name,100,dev);
         /* cuMemGetInfo reports on the current context, so create one per device */
         cuCtxCreate(&ctx, 0, dev);
         res = cuMemGetInfo(&free, &total);
         if(res != CUDA_SUCCESS) printf("!!!! cuMemGetInfo failed! (status = %x)\n", res);
         printf("^^^^ Device: %d %s\n",i,name);
         printStats(free, total);
         cuCtxDetach(ctx);
      }

      return 0;
   }

and I am getting the correct output:

This is on a node with 1 M2050 (ECC OFF) and 1 S2050 (ECC ON)

Detected 5 GPU
^^^^ Device: 0 Tesla M2050
^^^^ Free : 2937323520 bytes (2868480 KB) (2801 MB)
^^^^ Total: 3220897792 bytes (3145408 KB) (3071 MB)
^^^^ 91.195800% free, 8.804200% used
^^^^ Device: 1 Tesla S2050
^^^^ Free : 2534932480 bytes (2475520 KB) (2417 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 89.955580% free, 10.044420% used
^^^^ Device: 2 Tesla S2050
^^^^ Free : 2534932480 bytes (2475520 KB) (2417 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 89.955580% free, 10.044420% used
^^^^ Device: 3 Tesla S2050
^^^^ Free : 2534932480 bytes (2475520 KB) (2417 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 89.955580% free, 10.044420% used
^^^^ Device: 4 Tesla S2050
^^^^ Free : 2534932480 bytes (2475520 KB) (2417 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 89.955580% free, 10.044420% used

Actually, with 260.19.21 I had the exact same problem. Anyway, I ran your code and got this:

Detected 4 GPU
^^^^ Device: 0 Tesla S2050
^^^^ Free : 307757056 bytes (300544 KB) (293 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 10.921184% free, 89.078816% used
^^^^ Device: 1 Tesla S2050
^^^^ Free : 307757056 bytes (300544 KB) (293 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 10.921184% free, 89.078816% used
^^^^ Device: 2 Tesla S2050
^^^^ Free : 307757056 bytes (300544 KB) (293 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 10.921184% free, 89.078816% used
^^^^ Device: 3 Tesla S2050
^^^^ Free : 307757056 bytes (300544 KB) (293 MB)
^^^^ Total: 2817982464 bytes (2751936 KB) (2687 MB)
^^^^ 10.921184% free, 89.078816% used

Is there any command that I should use to configure the GPUs before running SDK examples or any CUDA code for that matter?

Thanks for looking into this.

Cheers,
Mark

You should not need any additional commands; loading the driver and creating the /dev/nvidia* files as documented in the release notes should be enough.

Which OS are you running?

This is the OS installed on the machine.

SUSE Linux Enterprise Server 11 (x86_64)

VERSION = 11

PATCHLEVEL = 1

output from “uname -a”:

Linux ultra 2.6.32.19-0.3.1.1982.2.PTF-default #1 SMP 2010-09-17 20:28:21 +0200 x86_64 x86_64 x86_64 GNU/Linux

Thanks,

Mark

What is the output of cat /proc/driver/nvidia/version?

I don’t have a SLES machine.
If you are a registered developer, please file a bug. If you are not, please apply to become one.

I just filed a bug report (incident #779166).

Cheers,

Mark