Gmp & cuda

Hi all

I would like to use the GMP library (GNU Multiple Precision Arithmetic Library) with CUDA. So far I have written a CUDA program which currently uses double precision floating point and a seperate CPU (non CUDA) test program which performs multiple precision GMP floating point calculations. Both programs work well, but so far I have not been able to integrate the GMP library into CUDA.

Does anybody know how to compile/link CUDA with GMP? All code in one single .cu file.

At the moment I am working on XP but I would also be happy with a linux solution.

There is a GMP implementation for CUDA from Tsukuba University.

You can download the software at:
http://www.hpcs.cs.tsukuba.ac.jp/~nakayama/cump/index.php?The%20CUDA%20Multiple%20Precision%20Arithmetic%20Library

Thanks for the Link but I found it myself some time ago. I have tried to use the library but I have had no success. Unfortunately there is no documentation available. Also this library is based on the GMP so I still have to find out how to compile CUDA with GMP. I would need some kind of step by step instruction for the installation and compiling process.

I have added the gmp.h and libgmp.a to the include and lib folder in my CUDA installation folder. For compiling I use following command:

nvcc --compiler-bindir “C:\Programme\Microsoft Visual Studio 10.0\VC\bin” program.cu -arch=sm_21

This works properly as long as I do not use GMP commands. Then I get this error message:

error LNK2019: unresolved external symbol “__gmpf_set_default_prec” referenced in function “_main”

I keep getting a “500” error from the Tsukuba server, so I had a go at it myself with the “mini-gmp” found in the sources. I renamed mini-gmp.cpp mini-gmp.cu, decorated (almost) everything with “host device” and rewrote the realloc routine, as there’s no realloc in cuda. The only compiler warning I get is “#20011-D: calling a host function(“abort”) from a host device function(“gmp_die”) is not allowed”. Don’t know how to get around that on (asm(“trap;”); works only on the device and gives me a compile error) - ignoring for now.

Here’s my little test programm obviously using unified memory (sm_86):

#include “mini-gmp_.cu”

device managed mpz_t t;

global void test() {

printf(“tst: 0x%lx\n”, t);
printf(“tst: %d\n”, t->_mp_alloc);
printf(“tst: %d\n”, t->_mp_size);
printf(“tst: 0x%lx\n”, t->_mp_d);
printf(“tst: %ld\n”, t->_mp_d[0]);
}

int main() {

cudaError err;

mpz_init_set_ui(t, 10);
printf(“t: %ld, 0x%lx\n”, mpz_get_ui(t), t);
printf(“_mp_d: 0x%lx\n”, t->_mp_d);
test<<<1, 1>>>();
err = cudaDeviceSynchronize();
printf(“sync: %s\n”, cudaGetErrorString(err));
cudaFree(t);
return 0;
}

mpz_t is defined so in mini-gmp.h:

typedef struct
{
int _mp_alloc; /* Number of limbs allocated and pointed
to by the _mp_d field. /
int _mp_size; /
abs(_mp_size) is the number of limbs the
last field points to. If _mp_size is
negative this is a negative number. */
mp_limb_t _mp_d; / Pointer to the limbs. */
} __mpz_struct;

typedef __mpz_struct mpz_t[1];

“mp_limb_t” is unsigned long.

It gives me this output:

t: 10, 0x7f3916000000
_mp_d: 0x211ca70
tst: 0x7f3916000000
tst: 1
tst: 1
tst: 0x211ca70
sync: an illegal memory access was encountered

The exact addresses will vary between runs, of course. And compute-sanitizer:

========= COMPUTE-SANITIZER
t: 10, 0x7f3916000000
_mp_d: 0x211ca70
tst: 0x7f3916000000
tst: 1
tst: 1
tst: 0x211ca70
========= Invalid global read of size 8 bytes
========= at 0xa40 in /home/fiddler/gmp-cu/test.cu:11:test()
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x211ca70 is out of bounds
========= and is 8.653.780.368 bytes before the nearest allocation at 0x205e
00000 of size 8.388.864 bytes

So on the CPU everything is in order, the mpz_t struct gets copied to the GPU as well, including the pointer to the array (*_mp_d), but not the array itself: t->_mp_d is correct, but t->_mp_d[0] is OOB on the GPU. What’s going on here?

Please format code properly when posting on this forum. You can do this by editing your question, selecting the code, then pressing the </> button at the top of the edit window, then save your changes.

I don’t have any of the mini-gmp files, however it seems evident that the structure is in managed memory, and the structure contains a pointer (_mp_d), and the pointer does not point to a valid device allocation. Presumably this is set up in mpz_init_set_ui(). So you would need to look carefully at how the allocation is created, and also how the pointer is set.

Sorry for the messy formatting, I didn’t realize </> was necessary until too late.
(Meanwhile I found CUMP has moved here: https://github.com/skystar0227/CUMP, but is nowhere near complete.)

Anyway, here’s a simpler test case:

#include "stdio.h"

typedef struct {
  int a;
  int *b;
} test_t;

__managed__ test_t t;

__global__ void kernel() {

  printf("kernel a=%d\n", t.a);
  printf("kernel b[0]=%d\n", t.b[0]);
  t.b[0]++;
}

int main() {

  t.b = (int*)malloc(5*sizeof(int));

  t.a = 1;
  t.b[0] = 2;

  kernel<<<1, 1>>>();
  cudaDeviceSynchronize();

  printf("a=%d\n", t.a);
  printf("b[0]=%d\n", t.b[0]);
}

Output is
kernel a=1
a=1
b[0]=2

and
========= COMPUTE-SANITIZER
kernel a=1
========= Invalid global read of size 4 bytes
========= at 0x520 in /home/fiddler/t/t.cu:13:kernel()
========= by thread (0,0,0) in block (0,0,0)
========= Address 0xe496e0 is out of bounds
========= and is 139.848.247.372.064 bytes before the nearest allocation at 0x7f30f6000000 of size 16 bytes

So b is again OOB in the kernel. When I replace malloc with cudaMallocManaged it works - but I’m confused because I thought in unified memory __managed__ variables were transparently visible to both host and device and calls to cudaMallocManaged obsolete. Is this not so?
Furthermore, in gmp I can’t use cudaMallocManaged because the functions that use it are called from both host and device code.

malloc in host code allocates from host memory, and the returned pointer points to host memory.

In CUDA you can’t use (i.e. dereference) such a pointer in device code.

The __managed__ variables are “transparently visible” to both host and device code. That includes the struct and all its members. But the thing the pointer points to is not part of that object/struct.