Memory Management

Hello,

First-time poster, so here goes.

I can’t figure out how to do mallocs, frees, and device-to-device memcpys from inside device code.

__device__ unsigned short int modMP(unsigned long int * x, unsigned long int sizeL, unsigned short int p)
{
	unsigned long int size = sizeL * mp_bits_per_limb;
	unsigned long int * x2_Original;
	unsigned long int * x2;

	cudaMalloc( (void **) &x2, size);
	x2_Original = x2;
	cudaMemcpy( x2, x, size, cudaMemcpyDeviceToDevice );
	memcpy(x2, x, size);

	unsigned long int * limit = x2 - 32 + size;
	while(x2 <= limit)
	{
		*x2 = *x2 % p;
		x2 += 16;
	}
	unsigned short int r = *((unsigned short int *)(x2-16));

	cudaFree(x2_Original);
	return r;
}

x is a pointer to the limb data of a GMP integer, and sizeL is the number of limbs that integer has. As you can see, I need to create a copy of x, called x2, which I then use to compute the mod. I have tested this function on the host and it works. The error I get says that I can’t call cudaMalloc, cudaFree, and cudaMemcpy from the device. So I was wondering whether there are functions that work on the device and accomplish the same goal.

Thanks in advance…

Short answer: you can’t do memory management from the device.

Long answer: you can’t do memory management from the device because it is inherently not parallel and 99% of the time people would try to use it for the wrong thing.

Read the programming guide thoroughly if you haven’t done so already. Looking at the rest of your code, it seems like you don’t have a very thorough understanding of how CUDA works.
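
To make the usual workaround concrete: the host allocates one buffer up front, and each thread works in its own slice of it, so nothing is allocated inside the kernel. The sketch below is only an illustration under assumptions. The names privateCopy, copyPerThread, and runCopies are hypothetical, and limbs are assumed to be 64-bit words. (Later CUDA releases added in-kernel malloc and free on devices of compute capability 2.0 and up, but host-side allocation remains the common pattern.) The qualifier macros let the same file build as plain C++ when nvcc is not used.

```cpp
#include <cstddef>
#include <cstdint>

// Let the same source build as plain C++ when nvcc is not in use.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: returns the slice of a pre-allocated scratch buffer
// that belongs to thread `tid`, filled with a private copy of x.
// Nothing is allocated here; `scratch` must hold (tid + 1) * limbs words.
__host__ __device__ inline std::uint64_t* privateCopy(
    const std::uint64_t* x, std::size_t limbs,
    std::uint64_t* scratch, std::size_t tid)
{
    std::uint64_t* mine = scratch + tid * limbs;
    for (std::size_t i = 0; i < limbs; ++i)
        mine[i] = x[i];
    return mine;
}

#ifdef __CUDACC__
// One thread per prime; each thread only touches its own slice.
__global__ void copyPerThread(const std::uint64_t* x, std::size_t limbs,
                              std::uint64_t* scratch, std::size_t count)
{
    std::size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count)
        privateCopy(x, limbs, scratch, tid);
}

// Host side: allocate once, launch, free once. d_x is already on the device.
inline cudaError_t runCopies(const std::uint64_t* d_x,
                             std::size_t limbs, std::size_t count)
{
    std::uint64_t* d_scratch = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_scratch,
                                 count * limbs * sizeof(std::uint64_t));
    if (err != cudaSuccess) return err;

    unsigned threads = 256;
    unsigned blocks = (unsigned)((count + threads - 1) / threads);
    copyPerThread<<<blocks, threads>>>(d_x, limbs, d_scratch, count);

    err = cudaDeviceSynchronize();   // wait for the kernel and pick up errors
    cudaFree(d_scratch);
    return err;
}
#endif
```

The cost of this layout is count * limbs words of device memory, which is what the scratch allocation in runCopies requests.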

Well, here is what I need to do. I have an array of primes, and I need to reduce the same MP integer mod p for each p in primes. I figured that CUDA would be good for this, as it’s a really large list of primes for the Chinese Remainder Theorem.

Could you be a little more specific about how the rest of my code is not good for CUDA?

The only way I can see to do this, then, is to have the host create an array of primes.length copies of the MP integer on the device, one for each device thread to work with. But that seems like it would quickly take up too much memory (I’m dealing with really large input integers, on the scale of 512 bits).
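
For scale, a 512-bit integer is 64 bytes, so even per-thread copies are small. A copy may not be needed at all, though: the residue of a multi-limb integer modulo a small p can be computed by reading the limbs once with Horner’s rule, so every thread can share a single read-only copy. The sketch below is a swapped-in technique, not the method in the posted function. The names modSmall and residues are hypothetical, and it assumes 64-bit limbs stored least significant first, a nonzero 16-bit p, and one thread per prime.

```cpp
#include <cstddef>
#include <cstdint>

// Let the same source build as plain C++ when nvcc is not in use.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Residue of a multi-limb integer modulo a nonzero 16-bit p.
// x is only read, never modified, so all threads can share it.
// Assumption: limbs are 64-bit and stored least significant first.
__host__ __device__ inline std::uint16_t modSmall(
    const std::uint64_t* x, std::size_t limbs, std::uint16_t p)
{
    std::uint32_t r = 0;                         // invariant: r < p
    for (std::size_t i = limbs; i-- > 0; ) {     // most significant limb first
        for (int shift = 48; shift >= 0; shift -= 16) {
            std::uint32_t chunk =
                (std::uint32_t)((x[i] >> shift) & 0xFFFFu);
            // (r * 2^16 + chunk) mod p; r < 2^16, so this fits in 32 bits.
            r = ((r << 16) | chunk) % p;
        }
    }
    return (std::uint16_t)r;
}

#ifdef __CUDACC__
// One thread per prime: out[tid] = x mod primes[tid].
__global__ void residues(const std::uint64_t* x, std::size_t limbs,
                         const std::uint16_t* primes, std::uint16_t* out,
                         std::size_t count)
{
    std::size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count)
        out[tid] = modSmall(x, limbs, primes[tid]);
}
#endif
```

With this layout the device holds one copy of the integer, the list of primes, and one output residue per prime. Whether it outperforms a host loop depends on how many primes there are and how often the data has to be transferred.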

I appreciate your help and advice.