cudaMalloc(long*, size) failure on Tesla C2070 device

I have a Tesla C2070 in a HPC system.

I am trying to get cudaMalloc(&longpointer, size) to run, but I am getting the runtime error shown below:

---------- ERROR REPORT ----------------
user@computer:/sata2/NVIDIA_GPU_Computing_SDK/C/src/MagicSquares$ ../../bin/linux/release/magic
[magic] starting…

Using CUDA device [0]: Tesla C2070
Device 0: “Tesla C2070” with Compute 2.0 capability
magic: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
Aborted


I carefully followed Chapter 3 (Programming Interface), page 20, of the NVIDIA CUDA C Programming Guide v4.0, and added some support code around the example so that the module would compile.

Why is cudaMalloc() failing? I even followed the other code examples by wrapping it inside a cutilSafeCall() and using a (void**) cast to change the pointer type passed to cudaMalloc(), but I still got the same failure as above.

Program_Listing:

---------------------- test.cu -----------------------------------
#include <stdio.h>
#include <shrUtils.h>
#include <shrQATest.h>
#include <cutil_inline.h>

__global__ void myfunction()
{
}

int main(int argc, char **argv)
{
int devID;
cudaDeviceProp props;

shrQAStart(argc, argv);

// Check which GPU is used
cutilChooseCudaDevice(argc, argv);

// Get GPU information

cutilSafeCall(cudaGetDevice(&devID));
cutilSafeCall(cudaGetDeviceProperties(&props, devID));
printf("Device %d: \"%s\" with Compute %d.%d capability\n", devID, props.name, props.major, props.minor);

size_t size = 1000;

// Allocate Host Memory
long* host_mem = (long*) malloc(size);

// Zeroize the Answers
int i;
for(i=0;i<size;i++) { host_mem[i] = 0; }

// Allocate Device Memory
int* dev_mem = 0;
cudaMalloc(&dev_mem, size);

// Copy (zeroized) answers from host memory to device memory
cudaMemcpy(dev_mem, host_mem, size, cudaMemcpyHostToDevice);

// Invoke kernel

myfunction<<<1,1>>>();
cutilDeviceSynchronize();

// Copy result from device memory to host memory

cudaMemcpy(host_mem, dev_mem, size, cudaMemcpyDeviceToHost);

// Check Results;
// … (to be filled in)

// Free device memory
cudaFree(dev_mem);
// Free host memory
free(host_mem);

cutilDeviceReset();
shrQAFinishExit(argc, (const char **)argv, QA_PASSED);
}

---------------- setup lines in Makefile ---------------------

# Add source files here

EXECUTABLE := magic

# Cuda source files (compiled with cudacc)

CUFILES := test.cu
CUDEPS :=

# C/C++ source files (compiled with gcc / c++)

CCFILES :=

# add command line parameters so we can target multiple architectures

GENCODE_ARCH := -gencode=arch=compute_20,code=\"sm_20,compute_20\"

Your code mixes lengths in bytes, lengths in units of sizeof(int), and lengths in units of sizeof(long). malloc(), cudaMalloc(), and cudaMemcpy() take a size in bytes and should thus be called like

cudaMalloc((void*)&dev_mem, len*sizeof(*dev_mem))

Also, the host pointer and the device pointer should always be of the same type.

Thank you for your reply. I am not sure how my code posted with the mismatch of the pointers, so I apologize for that.

Here (below) is the actual test.cu code again, so we can be sure you have a copy identical to mine.

I did play with the local Makefile for each project inside the SDK and found out some interesting things which were not obvious at the start.

First of all, it appears that ptxas was compiling for the sm_10 architecture, not the sm_20 of my Tesla C2070 (device 0, by the way).

I found that the architecture could be forced either by renaming CUFILES to CUFILES_sm_20 inside the local Makefile or by adding -arch=sm_20 to NVCCFLAGS. I discovered this the hard way, by moving code in from some of the other projects and experimenting with it; finding that doubles were rejected was a strong clue as to which architecture was actually being targeted when the executable was built.
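For reference, the two workarounds described above can be sketched as Makefile fragments. This assumes the CUDA 4.0 SDK build system (common.mk); variable names may differ in other SDK versions:

```make
# Option 1: rename the source-file variable so common.mk targets sm_20
CUFILES_sm_20 := test.cu

# Option 2: pass the architecture flag to nvcc directly
NVCCFLAGS += -arch=sm_20
```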

I also took a .cu code unit from another project that contains the cudaMalloc() function (transpose.cu from the transpose project) and compiled it successfully using doubles, so cudaMalloc() is apparently working fine in that code unit.

One thing I also had to experiment with was the "make clean" command and its true effects. It scrubs libraries such as cutil_x86_64, without which make will not run; this confused me until I figured out why /bin/ld could not find that file. I initially thought that running "make clean" would only remove my executable, but it does a lot more than that, forcing one to go to the top level, run make from there, and then descend back down to the particular project folder under …/C/src. It would be nice if "make clean" in the local project area removed only the executable and object files, not the libraries necessary for compiling.

I do understand that malloc(), cudaMalloc(), and cudaMemcpy() take a size in bytes, and I think the code below is correct in that regard.

One other thing: most of the code that I examined in the SDK has (void**) in front of the address of the memory chunk, not (void*).

Finally, I would appreciate it if someone would actually run the code below and see whether it runs. That would be a big help in determining what is wrong.

Thanks for any help.

  • Randall

-------------- test.cu --------------

#include <stdio.h>

__global__ void myfunction()
{
    // do nothing yet
}

int main(int argc, char **argv)
{
    int devID = 0;
    cudaDeviceProp props;

    // Set Device
    cudaSetDevice(devID);

    // Get GPU information
    devID = cudaGetDevice(&devID);
    cudaGetDeviceProperties(&props, devID);
    printf("Device %d: \"%s\" with Compute %d.%d capability\n", devID, props.name, props.major, props.minor);

    // Specify 1000 longs
    int longwords = 1000;
    size_t mem_size = longwords * sizeof(long);

    // Allocate Host Memory
    long* host_mem = (long*) malloc(mem_size);

    // Zeroize the Host Memory
    unsigned int i;
    for(i=0;i<mem_size;i++) { host_mem[i] = 0; }

    // Allocate Device Memory
    long* dev_mem;
    cudaMalloc(&dev_mem, mem_size);

    // Copy (zeroized) host memory to device memory
    cudaMemcpy(dev_mem, host_mem, mem_size, cudaMemcpyHostToDevice);

    // Invoke kernel
    myfunction<<<1,1>>>();

    // Synchronize threads
    cudaDeviceSynchronize();

    // Copy result from device memory to host memory
    cudaMemcpy(host_mem, dev_mem, mem_size, cudaMemcpyDeviceToHost);

    // Check Results
    // … (to be filled in)

    // Free device memory
    cudaFree(dev_mem);
    // Free host memory
    free(host_mem);

    // Reset device
    cudaDeviceReset();
}


You are still confusing the number of elements and the size in bytes:

for(i=0;i<mem_size;i++) { host_mem[i] = 0; }

should be

for(i=0;i<longwords;i++) { host_mem[i] = 0; }

The right syntax for your cudaMalloc is:

cudaMalloc((void **) &dev_mem, mem_size);

You don't need the cudaDeviceSynchronize() before the cudaMemcpy.

A final suggestion: write your own Makefile; do not use the one coming with the SDK.

Your first comment is correct; it exposed the fact that indexing a long pointer advances 8 bytes per element, so looping up to mem_size overran the allocation.

Secondly, the section of code that was supposed to clear out the host memory was actually the line that caused all the problems and blew up the cudaMalloc() call: it corrupted the heap, and glibc only detected the corruption later, inside the next allocation.

I actually do use (void**) in the cudaMalloc() in the working code.

I finally have everything working as I wished including the correct compile to arch=sm_20 level and all 4 cards are now being heavily utilized.

Thanks for your comments.