Is it recommended to create variables in heap memory?

Hi,
I am working on an algorithm and was using cudaMalloc to allocate memory. But I have to make multiple kernel calls, which meant transferring the data every time. To avoid that transfer, I declared a __device__ pointer and allocated memory for it inside a kernel using malloc, freeing it later with free().
Apparently these variables are created on the device heap. The heap also has limited memory, but it can be extended at the start of the application.
My question is whether this would speed up the processing, or whether I should try something else. Please guide me toward a better understanding.
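For reference, the heap extension mentioned above is done with cudaDeviceSetLimit, and it must happen before the first kernel launch. A minimal host-side sketch (the 64 MB figure is only an example):

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // In-kernel malloc/free draws from a dedicated device heap
    // (8 MB by default). The limit can only be raised before any
    // kernel has launched on the device.
    size_t requested = 64 * 1024 * 1024;  // example size: 64 MB
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, requested);

    size_t granted = 0;
    cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize);
    printf("device malloc heap size: %zu bytes\n", granted);
    assert(granted >= requested);
    return 0;
}
```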
Here is my code for understanding (any suggestions related to the code would also be appreciated, thanks):

__device__ int *device_population;
__device__ int *device_populationOffSpring;
__device__ int *device_candidatesCount;
__device__ int *device_problemSize;
__device__ int *device_threadsCount;
__global__ void init_CudaRandom(unsigned int seed, curandState_t* states, int numberOfElements) {
	/* Initialize one curand state per element. */
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	if (i < numberOfElements)
	{
		curand_init(seed,   /* the seed can be the same for every thread */
			i,              /* the sequence number must differ per thread - use the thread id */
			0,              /* the offset into the sequence, can be 0 */
			&states[i]);
	}
}
__global__ void setBits(curandState_t* states, int totalElements)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	if (i < totalElements)
	{
		/* Each thread must use its own state: sharing states[0] across
		   threads is a data race and yields correlated values. */
		device_population[i] = curand(&states[i]) % 2;
	}
}
__global__ void initializePopulation(int numberOfCandidates, int problemSize)
{
	device_problemSize = (int*)malloc(sizeof(int));
	*device_problemSize = problemSize;
	device_candidatesCount = (int*)malloc(sizeof(int));
	*device_candidatesCount = numberOfCandidates;
	int totalElements = numberOfCandidates * problemSize;
	device_population = (int*)malloc(sizeof(int) * totalElements);
	device_populationOffSpring = (int*)malloc(sizeof(int) * totalElements);
	device_threadsCount = (int*)malloc(sizeof(int));
	// Blocks are capped at 1024 threads.
	*device_threadsCount = (numberOfCandidates < 1024) ? numberOfCandidates : 1024;
	// Now initialize the population with random data.
	curandState_t* states = (curandState_t*)malloc(sizeof(curandState_t) * totalElements);
	if (states == NULL)
	{
		printf("malloc of curand states failed - raise cudaLimitMallocHeapSize\n");
		return;
	}
	// Round up so partially filled blocks still cover every element.
	int gridSize = (totalElements + *device_threadsCount - 1) / *device_threadsCount;
	init_CudaRandom<<<gridSize, *device_threadsCount>>>(1234, states, totalElements);
	cudaDeviceSynchronize();
	setBits<<<gridSize, *device_threadsCount>>>(states, totalElements);
	cudaDeviceSynchronize();
	free(states);
}
__global__ void deletePopulation()
{
	free(device_population);
	free(device_populationOffSpring);
	free(device_candidatesCount);
	free(device_problemSize);
	free(device_threadsCount);
}

Can you describe what you are trying to accomplish?

I am working on a genetic algorithm, where I have to initialize a population and then process it. I have to do this multiple times on the same population.
So I was thinking that I could initialize the population on the GPU itself and do the processing there, to avoid the data-transfer cost.
The code above is just for initializing the population and, at the end, freeing the memory.

It looks like you’re combining Dynamic Parallelism and dynamic global memory allocation. You might want to read the following links.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/

“But I had to make multiple calls to kernel which means I would have to transfer data every time.”

Are you talking about transfers from host to device memory, or about transfers from global memory to on-chip memory? Allocations on the heap live in global memory, and each time you call a kernel, data must be loaded from global memory into on-chip resources such as registers and shared memory.

Are you talking about transfers from host to device memory?
Yes.
I didn’t know that moving data from global memory to on-chip memory in the GPU takes time. So it means that with more data in global variables, processing will slow down.
Thanks for the links, I’ll read them.
I appreciate the help so far.

If you must transfer data from host memory to global memory, and in turn from global to on-chip memory, at each time step, please make sure the reads (global to on-chip) are coalesced.

https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/
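To illustrate the coalescing point, here are two illustrative kernels (not taken from the code above): in the first, consecutive threads in a warp touch consecutive addresses, so the hardware combines their loads into a few wide transactions; in the second, a large stride forces separate transactions per thread.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// consecutive addresses and combine into a few memory transactions.
__global__ void copyCoalesced(const int *in, int *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride; with stride >= 32 each
// load in a warp hits a different 128-byte segment, multiplying the
// number of transactions and wasting bandwidth.
__global__ void copyStrided(const int *in, int *out, int n, int stride) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[(long long)i * stride % n];
}

int main() {
    const int n = 1 << 20;
    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    copyCoalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copyStrided<<<(n + 255) / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    // Timing these (cudaEvent or a profiler) shows the strided version
    // is markedly slower for the same amount of data moved.
    assert(cudaGetLastError() == cudaSuccess);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```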

Also, depending on the size of the transfers from host to global memory, check out pinned memory. If I were you, I’d try to see whether there is a way to keep all the processing on the GPU. If you can’t, overlapping transfers with compute may help performance.

https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
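Those two suggestions can be combined in one sketch (hypothetical kernel and sizes): pinned host memory from cudaMallocHost enables asynchronous copies, and two streams let one half’s transfer overlap the other half’s compute.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: increments each element.
__global__ void addOne(int *d, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

int main() {
    const int n = 1 << 20, half = n / 2;

    int *h;  // pinned (page-locked) host memory enables async copies
    cudaMallocHost(&h, n * sizeof(int));
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d;
    cudaMalloc(&d, n * sizeof(int));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Each stream copies its half in, runs the kernel on it, and copies
    // it back; stream 1's transfer can overlap stream 0's compute.
    for (int k = 0; k < 2; ++k) {
        int off = k * half;
        cudaMemcpyAsync(d + off, h + off, half * sizeof(int),
                        cudaMemcpyHostToDevice, s[k]);
        addOne<<<(half + 255) / 256, 256, 0, s[k]>>>(d + off, half);
        cudaMemcpyAsync(h + off, d + off, half * sizeof(int),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0]=%d h[n-1]=%d\n", h[0], h[n - 1]);
    assert(h[0] == 1 && h[n - 1] == n);

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```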

Yes, I am trying to do exactly that: keep all the processing on the GPU, including initializing and deleting the variables.
But the only way I know is global variables declared with the __device__ keyword, which are allocated on the device heap.
Now I am thinking about creating the arrays with cudaMalloc on the host, doing all the processing on the device, and then only copying the results back and deleting the variables once control returns to the host.
Let me know if you have a better suggestion.
I have read some of the links that you have shared and they are really helpful.
Thanks
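A minimal sketch of that cudaMalloc-based flow (names, sizes, and the kernel body are placeholders, not the real GA): allocate once on the host, launch every generation’s kernels against the same resident device buffers, and copy back only the final result.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real GA step: here it just flips each gene.
__global__ void evolveStep(int *population, int totalElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < totalElements) population[i] ^= 1;
}

int main() {
    const int candidates = 256, problemSize = 64;      // example sizes
    const int total = candidates * problemSize;

    int *d_population;                                 // lives in global memory
    cudaMalloc(&d_population, total * sizeof(int));
    cudaMemset(d_population, 0, total * sizeof(int));  // stand-in for random init

    // The population stays resident on the GPU across all generations;
    // nothing crosses the PCIe bus inside this loop.
    for (int gen = 0; gen < 100; ++gen)
        evolveStep<<<(total + 255) / 256, 256>>>(d_population, total);

    // Only the final result comes back to the host.
    int *h_result = new int[total];
    cudaMemcpy(h_result, d_population, total * sizeof(int),
               cudaMemcpyDeviceToHost);
    printf("first gene after 100 generations: %d\n", h_result[0]);
    assert(h_result[0] == 0);  // 100 flips of 0 is an even count

    delete[] h_result;
    cudaFree(d_population);
    return 0;
}
```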

You can certainly have separate kernels to initialize and destroy variables, but if you can do the work in one kernel efficiently, that will probably be best for performance. Note that arrays created on the device with cudaMalloc are also stored in global memory. If you are loading multiple values into a single thread, you should check out https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/.
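The vectorized-access idea from that last link, sketched for int arrays like the population buffers (this assumes the element count is a multiple of 4; cudaMalloc already returns suitably aligned pointers):

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread moves one int4, i.e. four ints in a single 128-bit
// load and store, quartering the number of memory instructions.
__global__ void copyVec4(const int4 *in, int4 *out, int n4) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

int main() {
    const int n = 1 << 20;  // must be a multiple of 4 for this sketch
    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    copyVec4<<<(n / 4 + 255) / 256, 256>>>(
        reinterpret_cast<int4 *>(in), reinterpret_cast<int4 *>(out), n / 4);
    cudaDeviceSynchronize();
    assert(cudaGetLastError() == cudaSuccess);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```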

I forgot to mention that you should definitely profile your code with Nsight Compute to help you find bottlenecks and issues.

https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/

Thanks a lot @mnicely.
The resources that you have shared are very helpful.
I very much appreciate your help and time.