Is it recommended to create variables in heap memory?

Hi,
I am working on an algorithm and was using cudaMalloc to allocate memory. But I have to make multiple kernel calls, which meant transferring the data every time. To avoid that transfer, I declared a __device__ pointer and allocated memory for it inside a kernel using malloc, freeing it later with free().
Apparently these variables are created on the device heap. The heap also has limited memory, but it can be extended at the start of the application.
My question is whether this would speed up the processing, or whether I should try something else. Please guide me toward a better understanding.
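For reference, the heap extension mentioned above is done with cudaDeviceSetLimit, and it must happen before the first kernel launch. A minimal host-side sketch (the 64 MB figure is only an example):

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // In-kernel malloc/free draws from a dedicated device heap
    // (8 MB by default). The limit can only be raised before any
    // kernel has launched on the device.
    size_t requested = 64 * 1024 * 1024;  // example size: 64 MB
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, requested);

    size_t granted = 0;
    cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize);
    printf("device malloc heap size: %zu bytes\n", granted);
    assert(granted >= requested);
    return 0;
}
```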
Here is my code for understanding (any suggestions related to the code would also be appreciated, thanks):

__device__ int *device_population;
__device__ int *device_populationOffSpring;
__device__ int *device_candidatesCount;
__device__ int *device_problemSize;
__device__ int *device_threadsCount;
__global__ void init_CudaRandom(unsigned int seed, curandState_t* states, int numberOfElements) {
	/* Initialize one curand state per element. */
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	if (i < numberOfElements)
	{
		curand_init(seed,   /* the seed can be the same for every thread */
			i,              /* the sequence number must differ per thread - use the thread id */
			0,              /* the offset into the sequence, can be 0 */
			&states[i]);
	}
}
__global__ void setBits(curandState_t* states, int totalElements)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	if (i < totalElements)
	{
		/* Each thread must use its own state: sharing states[0] across
		   threads is a data race and yields correlated values. */
		device_population[i] = curand(&states[i]) % 2;
	}
}
__global__ void initializePopulation(int numberOfCandidates, int problemSize)
{
	device_problemSize = (int*)malloc(sizeof(int));
	*device_problemSize = problemSize;
	device_candidatesCount = (int*)malloc(sizeof(int));
	*device_candidatesCount = numberOfCandidates;
	int totalElements = numberOfCandidates * problemSize;
	device_population = (int*)malloc(sizeof(int) * totalElements);
	device_populationOffSpring = (int*)malloc(sizeof(int) * totalElements);
	device_threadsCount = (int*)malloc(sizeof(int));
	// Blocks are capped at 1024 threads.
	*device_threadsCount = (numberOfCandidates < 1024) ? numberOfCandidates : 1024;
	// Now initialize the population with random data.
	curandState_t* states = (curandState_t*)malloc(sizeof(curandState_t) * totalElements);
	if (states == NULL)
	{
		printf("malloc of curand states failed - raise cudaLimitMallocHeapSize\n");
		return;
	}
	// Round up so partially filled blocks still cover every element.
	int gridSize = (totalElements + *device_threadsCount - 1) / *device_threadsCount;
	init_CudaRandom<<<gridSize, *device_threadsCount>>>(1234, states, totalElements);
	cudaDeviceSynchronize();
	setBits<<<gridSize, *device_threadsCount>>>(states, totalElements);
	cudaDeviceSynchronize();
	free(states);
}
__global__ void deletePopulation()
{
	free(device_population);
	free(device_populationOffSpring);
	free(device_candidatesCount);
	free(device_problemSize);
	free(device_threadsCount);
}

Can you describe what you are trying to accomplish?

I am working on a genetic algorithm, where I have to initialize a population and then process it. I have to do this multiple times on the same population.
So I was thinking that I could initialize the population on the GPU itself and do the processing there, to avoid the data-transfer cost.
The code above is just for initializing the population and, at the end, freeing the memory.

It looks like you’re combining Dynamic Parallelism and dynamic global memory allocation. You might want to read the following links.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/

“But I had to make multiple calls to kernel which means I would have to transfer data every time.”

Are you talking about transfers from host to device memory, or about transfers from global memory to on-chip memory? Allocations on the heap live in global memory, and each time you call a kernel, data must be loaded from global memory into on-chip resources such as registers and shared memory.

Are you talking about transfers from host to device memory?
Yes.
I didn’t know that moving data from global memory to on-chip memory in the GPU takes time. So it means that with more data in global variables, processing will slow down.
Thanks for the links, I’ll read them.
I appreciate the help so far.

If you must transfer data from host memory to global memory, and in turn from global to on-chip memory, at each time step, please make sure the reads (global to on-chip) are coalesced.

https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/
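To illustrate the coalescing point, here are two illustrative kernels (not taken from the code above): in the first, consecutive threads in a warp touch consecutive addresses, so the hardware combines their loads into a few wide transactions; in the second, a large stride forces separate transactions per thread.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// consecutive addresses and combine into a few memory transactions.
__global__ void copyCoalesced(const int *in, int *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride; with stride >= 32 each
// load in a warp hits a different 128-byte segment, multiplying the
// number of transactions and wasting bandwidth.
__global__ void copyStrided(const int *in, int *out, int n, int stride) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[(long long)i * stride % n];
}

int main() {
    const int n = 1 << 20;
    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    copyCoalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copyStrided<<<(n + 255) / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    // Timing these (cudaEvent or a profiler) shows the strided version
    // is markedly slower for the same amount of data moved.
    assert(cudaGetLastError() == cudaSuccess);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```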

Also, depending on the size of the transfers from host to global memory, check out pinned memory. If I were you, I’d try to see whether there is a way to keep all the processing on the GPU. If you can’t, overlapping transfers with compute may help performance.

https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
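Those two suggestions can be combined in one sketch (hypothetical kernel and sizes): pinned host memory from cudaMallocHost enables asynchronous copies, and two streams let one half’s transfer overlap the other half’s compute.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: increments each element.
__global__ void addOne(int *d, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

int main() {
    const int n = 1 << 20, half = n / 2;

    int *h;  // pinned (page-locked) host memory enables async copies
    cudaMallocHost(&h, n * sizeof(int));
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d;
    cudaMalloc(&d, n * sizeof(int));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Each stream copies its half in, runs the kernel on it, and copies
    // it back; stream 1's transfer can overlap stream 0's compute.
    for (int k = 0; k < 2; ++k) {
        int off = k * half;
        cudaMemcpyAsync(d + off, h + off, half * sizeof(int),
                        cudaMemcpyHostToDevice, s[k]);
        addOne<<<(half + 255) / 256, 256, 0, s[k]>>>(d + off, half);
        cudaMemcpyAsync(h + off, d + off, half * sizeof(int),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0]=%d h[n-1]=%d\n", h[0], h[n - 1]);
    assert(h[0] == 1 && h[n - 1] == n);

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```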

Yes, I am trying to do exactly that: keep all the processing on the GPU, including initializing and deleting the variables.
But the only way I know is global variables declared with the __device__ keyword, which are allocated on the device heap.
Now I am thinking about creating the arrays with cudaMalloc on the host, doing all the processing on the device, and then only copying the results back and deleting the variables once control returns to the host.
Let me know if you have a better suggestion.
I have read some of the links that you have shared and they are really helpful.
Thanks
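A minimal sketch of that cudaMalloc-based flow (names, sizes, and the kernel body are placeholders, not the real GA): allocate once on the host, launch every generation’s kernels against the same resident device buffers, and copy back only the final result.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real GA step: here it just flips each gene.
__global__ void evolveStep(int *population, int totalElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < totalElements) population[i] ^= 1;
}

int main() {
    const int candidates = 256, problemSize = 64;      // example sizes
    const int total = candidates * problemSize;

    int *d_population;                                 // lives in global memory
    cudaMalloc(&d_population, total * sizeof(int));
    cudaMemset(d_population, 0, total * sizeof(int));  // stand-in for random init

    // The population stays resident on the GPU across all generations;
    // nothing crosses the PCIe bus inside this loop.
    for (int gen = 0; gen < 100; ++gen)
        evolveStep<<<(total + 255) / 256, 256>>>(d_population, total);

    // Only the final result comes back to the host.
    int *h_result = new int[total];
    cudaMemcpy(h_result, d_population, total * sizeof(int),
               cudaMemcpyDeviceToHost);
    printf("first gene after 100 generations: %d\n", h_result[0]);
    assert(h_result[0] == 0);  // 100 flips of 0 is an even count

    delete[] h_result;
    cudaFree(d_population);
    return 0;
}
```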

You can certainly have separate kernels to initialize and destroy variables, but if you can do the work in one kernel efficiently, that will probably be best for performance. Note that arrays created on the device with cudaMalloc are also stored in global memory. If you are loading multiple values into a single thread, you should check out https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/.
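The vectorized-access idea from that last link, sketched for int arrays like the population buffers (this assumes the element count is a multiple of 4; cudaMalloc already returns suitably aligned pointers):

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread moves one int4, i.e. four ints in a single 128-bit
// load and store, quartering the number of memory instructions.
__global__ void copyVec4(const int4 *in, int4 *out, int n4) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

int main() {
    const int n = 1 << 20;  // must be a multiple of 4 for this sketch
    int *in, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    copyVec4<<<(n / 4 + 255) / 256, 256>>>(
        reinterpret_cast<int4 *>(in), reinterpret_cast<int4 *>(out), n / 4);
    cudaDeviceSynchronize();
    assert(cudaGetLastError() == cudaSuccess);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```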

I forgot to mention that you should definitely profile your code with Nsight Compute to help you find bottlenecks and issues.

https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/

Thanks a lot @mnicely.
The resources that you have shared are very helpful.
I very much appreciate your help and time.