How to get memory allocated on the device back to the host

Hi guys,
I’m having trouble getting memory allocated on the device back to the host. On the host I don’t know in advance how much memory is needed, so I use a struct to carry the sizes back to the host, but it does not work.

//~ nvcc -g -G -arch=sm_35 -o allocinsidekernel -lcudadevrt -rdc=true
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>
#include <curand.h>
#include <curand_kernel.h>

#define _THREADSFX 32
#define _THREADSTOTAL 64

#define cudaCheckErrors(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline __host__ __device__ void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
	if (code != cudaSuccess) {
		#ifdef __CUDA_ARCH__
			// device-side: report and optionally trap
			printf("GPUassert: Error(%d) %s %s in line %d\n", (int)code, cudaGetErrorString(code), file, line);
			if (abort) assert(0);
		#else
			// host-side: report and optionally exit
			fprintf(stderr,"GPUassert: Error(%d) %s %s in line %d\n", (int)code, cudaGetErrorString(code), file, line);
			if (abort) exit(code);
		#endif
	}
}

struct myrows{
	int size;
	int *matrix;
};

__host__ __device__ unsigned int nextPow2(unsigned int x){
	--x;
	x |= x >> 1;
	x |= x >> 2;
	x |= x >> 4;
	x |= x >> 8;
	x |= x >> 16;
	return ++x;
}

__host__ __device__ void getNumBlocksAndThreads(int n, int maxThreads, int &blocks, int &threads){
	if(n == 0){
		n = 1;
	}
	threads = (n < maxThreads) ? nextPow2(n) : maxThreads;
	blocks = (n + threads - 1) / threads;
}

__global__ void kernel(struct myrows *data, int maxsize){
	int tid = (blockIdx.x  + gridDim.x  * blockIdx.y) * (blockDim.x * blockDim.y) + (threadIdx.x + blockDim.x * threadIdx.y);
	int i;
	if(tid < maxsize){
		data[tid].size = 100;
		data[tid].matrix = (int*)malloc(100*sizeof(int));

		// initialize the generator once per thread, not once per element
		curandState state;
		curand_init(1, tid, 0, &state);
		for(i=0; i<100; i++){
			// curand_uniform() returns a float in (0,1]; scale before storing as int
			data[tid].matrix[i] = (int)(curand_uniform(&state) * 100.0f);
			printf("%d, ", data[tid].matrix[i]);
		}
	}
}

int main(){
	struct myrows *host_data, *device_data;
	int i, j;
	cudaCheckErrors(cudaMalloc((void**)&device_data, _THREADSTOTAL*sizeof(struct myrows)));
	for(i=0; i<_THREADSTOTAL; i++){
		cudaCheckErrors(cudaMalloc((void**)&device_data[i].matrix, 2*sizeof(int)));
	}
	int blocos = 0, threads = 0;
	getNumBlocksAndThreads(_THREADSTOTAL, _THREADSFX, blocos, threads);
	kernel<<<blocos, threads>>>(device_data, _THREADSTOTAL);
	cudaCheckErrors(cudaDeviceSynchronize());
	host_data = (struct myrows*)malloc(_THREADSTOTAL*sizeof(struct myrows));
	for(i=0; i<_THREADSTOTAL; i++){
		cudaCheckErrors(cudaMemcpy(&host_data[i].size, &device_data[i].size, sizeof(int), cudaMemcpyDeviceToHost));
		host_data[i].matrix = (int*)malloc(host_data[i].size*sizeof(int));
		cudaCheckErrors(cudaMemcpy(host_data[i].matrix, device_data[i].matrix, host_data[i].size*sizeof(int), cudaMemcpyDeviceToHost));
	}
	//~ cudaCheckErrors(cudaMemcpy(h_output, d_output, _QNT_MATRIX*sizeof(int), cudaMemcpyDeviceToHost));
	for(i=0; i<_THREADSTOTAL; i++){
		for(j=0; j<host_data[i].size; j++){
			printf("%d ", host_data[i].matrix[j]);
		}
	}
	for(i=0; i<_THREADSTOTAL; i++){
		free(host_data[i].matrix);
	}
	free(host_data);
	return 0;
}

I allocate the device memory in the first loop in main(), launch the kernel, and read the results back to the host in the loop after the kernel call.

You probably need to explain what the problem is that you’re having, but the first thing I don’t understand from your code is why you’re allocating 2*sizeof(int) on the device for your matrix, but then copying back host_data[i].size*sizeof(int). If host_data[i].size is larger than 2, that will be an invalid memory access.

I re-allocate the matrix inside the kernel. I need to copy the matrix (allocated on the device) back to the host, but the host does not know its size beforehand; that is what I am trying to do. In other words, I allocate memory on the device and need to copy it back to the host without knowing the size of the matrix a priori (I will only know the size after the kernel finishes).

If you’re asking for help, instead of saying “but it does not work”, it’s better if you describe exactly why you think it is not working. For example, saying “it hits a seg fault” would be a much more useful description. Even better if you identify the line of code that is causing the segfault. This is a basic debugging technique and is not specific to CUDA, and you should not need to come to a CUDA programming forum to learn how to localize a seg fault to a particular line of code that is causing it. In CUDA, seg faults always originate from host code.

Anyway, when I compile and run your code, I get a seg fault. That seg fault is occurring on this line:

cudaCheckErrors(cudaMalloc((void**)&device_data[i].matrix, 2*sizeof(int)));

With such a short code, you can discover this easily enough just by sprinkling printf statements in your code, or else you can use the backtrace/dump/reporting facility in any modern debugger (e.g. gdb).

You are not allowed to pass a device address (location) for the target of a cudaMalloc operation. This operation:

cudaCheckErrors(cudaMalloc((void**)&device_data, _THREADSTOTAL*sizeof(struct myrows)));

creates a device allocation, and stores the address of that device allocation in the device_data pointer. The device_data variable is located in host memory, so this operation is perfectly legal. cudaMalloc can access device_data so as to store the allocation pointer there.

This operation:

cudaCheckErrors(cudaMalloc((void**)&device_data[i].matrix, 2*sizeof(int)));

is asking cudaMalloc to create an allocation of size 2*sizeof(int), and store the address of that allocation in the location device_data[i].matrix.

But device_data[i].matrix is a location in device memory. You cannot access device memory from host code EVEN with cudaMalloc, and so this operation is illegal and causes a seg fault - the inevitable result of attempting to dereference a device pointer in host code. cudaMalloc cannot directly access device_data[i].matrix (nor can any host code).

This is in the general category of pointer-to-pointer allocations, which require a “deep copy”, and they require extra programming effort.
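For completeness, a sketch of the legal way to build such a pointer-to-pointer allocation from the host: cudaMalloc the child buffer into a host-side temporary, then copy the pointer *value* into the device-resident struct member. (Error checks are omitted, and `N`/`ROWSIZE` are illustrative names, not from the original code.)

```cuda
#include <cstdio>

struct myrows { int size; int *matrix; };
#define N 64
#define ROWSIZE 100

int main(){
	struct myrows *device_data;
	cudaMalloc(&device_data, N * sizeof(struct myrows));
	for (int i = 0; i < N; i++) {
		int *child;                              // host variable: a legal target for cudaMalloc
		cudaMalloc(&child, ROWSIZE * sizeof(int));
		// copy the pointer VALUE into the device-resident struct member; note that we
		// only *compute* the address &device_data[i].matrix here (pointer arithmetic),
		// we never dereference device_data on the host
		cudaMemcpy(&device_data[i].matrix, &child, sizeof(int *),
		           cudaMemcpyHostToDevice);
	}
	return 0;
}
```

The same pointer-copy trick in reverse (cudaMemcpy the pointer value back into a host temporary first) is what makes the later device-to-host copies legal.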

If you want to use this approach (it should not be necessary), then you should google and study appropriate treatments of the topic. It certainly is not necessary to go to this level of complication simply to get the quantity of device memory allocated on the device; passing that quantity back to the host should be trivial. For those who are struggling with deep copy operations, the usual advice is to “flatten” your data structures, so that all data can be accessed as offsets from a single pointer. This has a variety of advantages anyway, both for code complexity and for actual performance.

As a final caveat, if you intend to allocate memory via in-kernel malloc (or in-kernel new, or in-kernel cudaMalloc), and then copy directly from that memory to the host, using the pointer returned by malloc, it will not be possible. Host-API cudaMemcpy operations cannot access pointers (the space they point to) when those pointers are returned by in-kernel malloc.

If you want to follow this paradigm, it will be necessary to:

  1. allocate space using host cudaMalloc
  2. allocate space using in-kernel malloc
  3. in the kernel, copy the data from the in-kernel malloc space to the host cudaMalloc space
  4. after completion of the kernel, copy the data from device to host using the previously allocated host cudaMalloc space
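The four steps above can be sketched like this, assuming a fixed per-thread upper bound (`NTHREADS` and `MAX_PER_THREAD` are illustrative names, and error checks are omitted):

```cuda
#define NTHREADS 64
#define MAX_PER_THREAD 100

__global__ void kernel(int *staging, int *sizes) {
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	// step 2: allocate scratch space with in-kernel malloc
	int n = MAX_PER_THREAD;                    // stand-in for a data-dependent size
	int *scratch = (int *)malloc(n * sizeof(int));
	for (int i = 0; i < n; i++) scratch[i] = tid * 1000 + i;
	// step 3: copy from the in-kernel malloc space to the cudaMalloc'ed staging area
	for (int i = 0; i < n; i++) staging[tid * MAX_PER_THREAD + i] = scratch[i];
	sizes[tid] = n;
	free(scratch);
}

// host side:
// step 1: cudaMalloc(&staging, NTHREADS * MAX_PER_THREAD * sizeof(int));
// step 4: after the kernel completes, cudaMemcpy staging (and sizes) back to the host
```

Note that step 1 requires knowing an upper bound on the per-thread allocation, which is exactly the sizing problem discussed below.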

Sorry for not saying exactly what the error was; when I tested, no error was reported saying the cudaMemcpy failed because of an invalid dereference of Unified Memory…

So if I re-allocate device_data[i].matrix on the GPU and try to copy it back, it will not work even though I re-allocate before copying the data back?
And if I decide to do a deep copy, do I need to pre-allocate a safe amount of memory for device_data[i].matrix and then copy it back to the host?

I think that is correct, as best as I can interpret your questions.

A pointer returned by in-kernel malloc cannot participate in a host cudaMemcpy* call.

If you decide to pre-allocate using host cudaMalloc operations, then you will need to know how much to preallocate.

A 3rd alternative (rather tedious) might be:

  1. run your kernel with in-kernel malloc, while computing the total amount allocated on the fly.
  2. return the total amount allocated after that kernel completes
  3. on the host, allocate the necessary amount
  4. run another kernel to perform the device-to-device copy operation, into the region allocated in step 3
  5. run your cudaMemcpy operation from the region allocated in step 3, to copy the data back to the host
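A sketch of that alternative (illustrative names throughout: `NTHREADS`, `g_rows`, `g_sizes`; the per-thread `offsets` would be an exclusive scan over the sizes, computed on the host between the two launches):

```cuda
#define NTHREADS 64

__device__ int *g_rows[NTHREADS];   // per-thread pointers from pass 1's in-kernel malloc
__device__ int  g_sizes[NTHREADS];  // per-thread element counts

// step 1: allocate with in-kernel malloc, accumulating the total on the fly
__global__ void pass1(int *total) {
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int n = (tid % 4) + 1;                    // stand-in for a data-dependent size
	g_rows[tid] = (int *)malloc(n * sizeof(int));
	for (int i = 0; i < n; i++) g_rows[tid][i] = tid;
	g_sizes[tid] = n;
	atomicAdd(total, n);
}

// step 4: device-to-device copy into the region cudaMalloc'ed in step 3
__global__ void pass2(int *out, const int *offsets) {
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	for (int i = 0; i < g_sizes[tid]; i++)
		out[offsets[tid] + i] = g_rows[tid][i];
	free(g_rows[tid]);    // device-heap allocations persist across kernel launches
}

// steps 2-3 (host): read *total back, cudaMalloc 'out' of that size, build 'offsets';
// step 5 (host): cudaMemcpy 'out' back to the host
```

This works because device-heap allocations made by in-kernel malloc persist between kernel launches until they are freed.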

Yes, I must know how much to allocate… I will need to run 2 kernels and go back to the host to allocate the necessary amount. Thanks for clarifying.

There is precedent in established libraries for such an approach. For example, many LAPACK routines require one or several work areas (buffers) for internal processing, with the caller having the responsibility of allocating the buffers.

But it is not always clear how large those buffers should be. One way of finding out is to perform a so-called “LWORK query”, by calling the library function in question with its LWORK argument set to -1. The library function will then return relevant sizing information. A caller can then set up the required buffers and call the same function again, but this time in “execution mode”, rather than “query mode”.

Obviously this kind of mechanism requires a deterministic relationship between the data passed to the function and the required buffer sizes. Since the OP seems to be using a PRNG output as part of the kernel processing, it is not clear to me that this condition is met here. At minimum it would appear to require that both calls to the kernel (i.e., “query” and “execution”) use the exact same initial PRNG state.

So an “LWORK query” would do the same thing that @txbob says (run one kernel to get the size, return to the host, allocate, and run another kernel to get/calculate the data)?

It could be the same kernel (with a query or dry-run parameter set appropriately), or it could be a different kernel. LAPACK actually uses both approaches, depending on what kind of sizing information needs to be conveyed.

There is a function called ILAENV to which you can pass a LAPACK API function name (and possibly some other info), and it will return relevant sizing information for that function. That is equivalent to a two-kernel approach.

In other cases (or alternatively) the LAPACK API function itself supports an LWORK query, which means you call the actual LAPACK API function, but with an argument set to indicate you are calling it in “query mode”. It then returns relevant sizing information. That is equivalent to a one-kernel approach.

Which of the two approaches is appropriate in this case I can’t say, not knowing the code in question.

Does anyone know if there are plans for CUDA to allow copying of memory allocated on the device to the host? I.e., if you allocate with cudaMalloc on the device, you cannot use the host-API cudaMemcpy to copy it to the host. This is pretty inconvenient.