zero-copy not working on tx1

Hello All,

I am using zero copy for an iterative solver which transfers an integer from device to host each iteration. The kernel set the value for the device integer. 
Before zero copy, the device integer is copied back to the host and it works.
After I used zero copy to map the device pointer to the host pointer, without calling cudaMemcpy, I read directly the host integer, and  it failed. After I add a cudaDeviceSynchronize, it worked again!

Any ideas?


Are you sure the operation was running on the Jetson? Something like a separate video card won’t share main memory with GPU…if you are doing remote display to a desktop then it is possible all GPU computation offloaded to the host instead of the Jetson. If this is not the case you might post some of the source code around the copy and exact error.

Yes. It is on tx1.
I have copied a test example.
Sometimes, it failes!

#include <stdio.h>
#include <stdlib.h>     /* srand, rand */
#include <time.h>       /* time */

#include <iostream>

#include <cuda_runtime.h>                                                       
#include <helper_cuda.h> 
#include <helper_functions.h>   

#define FLT_SIZE sizeof(float)
#define INT_SIZE sizeof(int)

void test();

void init_rand(float *array, int len)
	for(int i=0; i<len; i++) {                                                 
		array[i] = (float)rand()/RAND_MAX;                                        

void print_1d(float *data, int len)
	for(int i=0; i<len; i++) {                                                 
		printf("%12.6f ", data[i]);

inline int BLK(int number, int blksize)                                         
    return (number + blksize - 1) / blksize;                                    

// constant memory
//__constant__ float const_mem[16000];

__global__ void kernel(const int len,
		const float* __restrict__ A,
		int* __restrict__ const result)
	int gx = threadIdx.x + __mul24(blockIdx.x, blockDim.x);
	if(gx < len) {
		//printf("%f\n", d_A[gx]);
		if(A[gx] > 0.5f) {
			result[gx] = 1;	
		}else {
			result[gx] = 0;	

void test()
	srand (time(NULL));

	cudaEvent_t startEvent, stopEvent;
	checkCudaErrors( cudaEventCreate(&startEvent) );
	checkCudaErrors( cudaEventCreate(&stopEvent) );

	int len = 10;

	// host
	float *A;
	checkCudaErrors(cudaMallocHost((void **)&A, 	len * FLT_SIZE));

	int *result;
	checkCudaErrors(cudaMallocHost((void **)&result, 	len * INT_SIZE));

	// device
	float *d_A;
	checkCudaErrors(cudaHostGetDevicePointer((void **)&d_A, (void *)A, 0));
	int *d_result;
	checkCudaErrors(cudaHostGetDevicePointer((void **)&d_result, (void *)result, 0));

	// init
	init_rand(A, len);
	print_1d(A, len);

	// kernel
    dim3 Blk_config = dim3(128, 1, 1);                                           
    dim3 Grd_config = dim3(BLK(len, 128), 1, 1);

	kernel<<< Grd_config, Blk_config>>>(len, d_A, d_result);


	// check result
	for(int i=0; i<len; i++)
		printf("%12d ", result[i]);

	for(int i=0; i<len; i++) {
		int value;
		if(A[i] > 0.5f) {
			value = 1;	
		} else {
			value = 0;	

		if(value != result[i]) {
			fprintf(stderr, "wrong result!\n");	


	// release
	if (A != NULL)				checkCudaErrors(cudaFreeHost(A));
	if (result != NULL)				checkCudaErrors(cudaFreeHost(result));

int main(int argc, char **argv) {

	cudaDeviceProp prop;
	checkCudaErrors( cudaGetDeviceProperties(&prop, 0) );
	printf("Device: %s\n",;

	// Set flag to enable zero copy access



With cudaDeviceSynchronize(), the results are correct.

I don’t think the jetson tx1 is a fine-grained unified memory system.

The explicit device synchronization has to be forced in order to make sure the memory consistency.

I can’t actually test CUDA 8 code at the moment, but how were you logged in to the Jetson, and where was the output viewed (host versus Jetson)?