How to read an element of an array from a __global__ function

Hello. I'm wondering whether it's possible, inside a __global__ function, to read an array element whose index does not depend on blockIdx.x, blockDim.x or threadIdx.x. Here is an example that doesn't work:

int N = 1024;
int blocksize = 16;

float *bitmap, *gbitmap;
float *anything, *ganything;

__global__ void gRunIt(float* odata, float* idata)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x < N && y < N)
	{
		int i = x + y*N;
		int j = 3; //some special process to compute index
		odata[i] = idata[j]; //doesn't work :`(
	}
}

void anyvoid()
{
	bitmap = new float[N*N];
	cudaMalloc((void**)&gbitmap, N*N*sizeof(float));

	anything = new float[10];
	for (int i = 0; i < 10; i++) anything[i] = (float)i; //some special process to fill array
	cudaMalloc((void**)&ganything, 10*sizeof(float));
	cudaMemcpy(ganything, anything, 10*sizeof(float), cudaMemcpyHostToDevice);

	dim3 dimBlock = dim3(blocksize, blocksize);
	dim3 dimGrid = dim3(N/dimBlock.x, N/dimBlock.y);
	gRunIt<<<dimGrid, dimBlock>>>(gbitmap, ganything);

	cudaMemcpy(bitmap, gbitmap, N*N*sizeof(float), cudaMemcpyDeviceToHost);
}

It definitely looks like it SHOULD work. Can you try running in emulation mode and printing every array read/write index that is less than 0 or greater than or equal to the array size?
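The same bounds check can also be done on the GPU itself: device emulation was removed in CUDA 3.1, and device-side printf (compute capability 2.0 and later) covers this kind of debugging. A minimal sketch, assuming the question's kernel shape; checkIndex, gRunItChecked and countBadAccesses are hypothetical names, not from the thread. Out-of-range accesses are printed, skipped and counted rather than performed.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Prints and rejects any index outside [0, size). Device printf needs sm_20+.
__device__ bool checkIndex(const char* name, int idx, int size)
{
    if (idx < 0 || idx >= size) {
        printf("%s[%d] out of range [0, %d): block (%d,%d) thread (%d,%d)\n",
               name, idx, size, (int)blockIdx.x, (int)blockIdx.y,
               (int)threadIdx.x, (int)threadIdx.y);
        return false;
    }
    return true;
}

// The question's kernel with every access checked (hypothetical debug version):
// bad accesses are skipped and counted in *badCount instead of being performed.
__global__ void gRunItChecked(float* odata, const float* idata,
                              int n, int inSize, int* badCount)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n) {
        int i = x + y * n;
        int j = 3;  // stand-in for the "special process" that computes the index
        bool okOut = checkIndex("odata", i, n * n);
        bool okIn = checkIndex("idata", j, inSize);
        if (okOut && okIn)
            odata[i] = idata[j];
        else
            atomicAdd(badCount, 1);
    }
}

// Runs the checked kernel on an n x n bitmap with an inSize-element input table.
// Returns how many threads hit a bad index, or -1 if a CUDA call failed.
int countBadAccesses(int n, int blocksize, int inSize)
{
    assert(n > 0 && blocksize > 0 && inSize > 0);
    float *odata = nullptr, *idata = nullptr;
    int *dBad = nullptr;
    int bad = -1;
    bool ok = cudaMalloc((void**)&odata, (size_t)n * n * sizeof(float)) == cudaSuccess
           && cudaMalloc((void**)&idata, inSize * sizeof(float)) == cudaSuccess
           && cudaMalloc((void**)&dBad, sizeof(int)) == cudaSuccess
           && cudaMemset(idata, 0, inSize * sizeof(float)) == cudaSuccess
           && cudaMemset(dBad, 0, sizeof(int)) == cudaSuccess;
    if (ok) {
        dim3 block(blocksize, blocksize);
        dim3 grid((n + blocksize - 1) / blocksize, (n + blocksize - 1) / blocksize);
        gRunItChecked<<<grid, block>>>(odata, idata, n, inSize, dBad);
        if (cudaGetLastError() != cudaSuccess ||
            cudaMemcpy(&bad, dBad, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
            bad = -1;
    }
    cudaFree(dBad);   // cudaFree(nullptr) is a no-op
    cudaFree(idata);
    cudaFree(odata);
    return bad;
}
```

For example, countBadAccesses(16, 16, 2) makes every one of the 256 threads report idata[3] as out of range for a 2-element table.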

In emulation mode it works perfectly.
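"Works in emulation mode" is a strong hint. One plausible cause, offered as an assumption rather than a diagnosis confirmed in this thread: N is an ordinary host global, and a __global__ function cannot read host variables. In emulation mode the kernel runs on the CPU and therefore sees the host value of N; on the GPU it does not, and current nvcc versions reject such a reference at compile time. Two common fixes are declaring it `const int N = 1024;` (a const integral variable with a constant initializer is usable in device code) or passing it to the kernel as an argument. The sketch below does the latter; runIt is a hypothetical wrapper, and the error checks are additions.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Same kernel as in the question, except that the bitmap width is passed by
// value as the parameter n instead of being read from the host global N.
__global__ void gRunIt(float* odata, const float* idata, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n) {
        int i = x + y * n;
        int j = 3;            // index independent of the thread: every thread reads idata[3]
        odata[i] = idata[j];  // many threads reading the same element is fine
    }
}

// Hypothetical host wrapper: fills a 10-element table with 0..9 as anyvoid()
// does, runs the kernel on an n x n bitmap and copies the result into out.
cudaError_t runIt(int n, int blocksize, std::vector<float>& out)
{
    assert(n > 0 && blocksize > 0 && n % blocksize == 0);  // grid below assumes an exact fit
    const int tableSize = 10;
    float table[tableSize];
    for (int k = 0; k < tableSize; ++k) table[k] = (float)k;

    const size_t bitmapBytes = (size_t)n * n * sizeof(float);
    float *gbitmap = nullptr, *ganything = nullptr;
    cudaError_t err = cudaMalloc((void**)&gbitmap, bitmapBytes);
    if (err == cudaSuccess)
        err = cudaMalloc((void**)&ganything, tableSize * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(ganything, table, tableSize * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 dimBlock(blocksize, blocksize);
        dim3 dimGrid(n / dimBlock.x, n / dimBlock.y);
        gRunIt<<<dimGrid, dimBlock>>>(gbitmap, ganything, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess) {
        out.resize((size_t)n * n);
        // Blocking copy on the default stream: starts only after the kernel has finished.
        err = cudaMemcpy(out.data(), gbitmap, bitmapBytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(ganything);  // cudaFree(nullptr) is a no-op
    cudaFree(gbitmap);
    return err;
}
```

With the question's values (N = 1024, blocksize = 16) this would be called as runIt(1024, 16, out); every element of out should then hold 3.0f.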

You should call cudaThreadSynchronize() to wait for the kernel to complete before copying the results out.
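A sketch of that advice for a current toolkit, where cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize(). A plain cudaMemcpy on the default stream already waits for the preceding kernel, so the explicit synchronize is mainly useful for surfacing errors that occur while the kernel runs. fillKernel and fillAndCheck are hypothetical names used only for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Trivial 1D kernel used only to demonstrate the error checks.
__global__ void fillKernel(float* out, float value, int count)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < count) out[k] = value;
}

// Launches fillKernel and reports both launch errors (cudaGetLastError) and
// errors raised during execution (cudaDeviceSynchronize).
cudaError_t fillAndCheck(float* dOut, float value, int count, int threadsPerBlock)
{
    assert(count > 0 && threadsPerBlock > 0);
    int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;
    fillKernel<<<blocks, threadsPerBlock>>>(dOut, value, count);
    cudaError_t err = cudaGetLastError();  // e.g. too many threads per block
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    err = cudaDeviceSynchronize();         // waits for completion, returns execution errors
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Requesting more threads per block than the device allows (more than 1024 on current GPUs) makes fillAndCheck return an error from cudaGetLastError instead of silently leaving the output untouched.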