But I am not using cudaMalloc3DArray at all. Look at the code. It is as you suggested: "a 1D linear memory space allocated with cudaMalloc, indexed using a 3D index", but I cannot access more than an index of 29791 (_DSIZE max = 31).
So what was the reference to cudaMalloc3DArray in your previous post for? The last piece of code you posted has several errors that mean that if it works, it is only by accident.
I think I found one reason for the confusion. There is a difference, again, between emulation and release code. I compiled the last code with and without emulation: the emulation build would not run above _DSIZE=31, while the release build ran up to _DSIZE=445. I don’t think this is a memory issue because I still have 1 GB of host memory free.
That code works and I hope it is not by accident!
[codebox]
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>
// 3D to linear array conversion
#define FTNREF3D(i_index,j_index,k_index,i_size,j_size,i_lb,j_lb,k_lb) \
    ((i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+i_index-i_lb)
#ifndef gpuAssert
#include <stdio.h>
#define gpuAssert( condition ) {if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %d in %s, line %d\n", condition, __FILE__, __LINE__ ); exit( 1 );}}
#endif
#define _DSIZE 445
__device__ float *ad;
__global__ void testkernel2(float *d)
{
unsigned int idx = threadIdx.x + blockDim.x*blockIdx.x + 1 ;
ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)] = 23.0*idx;
d[idx] =ad[FTNREF3D(idx,idx,idx,_DSIZE,_DSIZE,1,1,1)] ;
}
int main()
{
int device = 0;
cudaSetDevice(device);
float *d, *_d;
float *_a;
assert( !(( d = (float *)malloc(_DSIZE * sizeof(float)) ) == NULL) );
gpuAssert( cudaMalloc( (void**)&_a, _DSIZE*_DSIZE*_DSIZE* sizeof(float) ) );
gpuAssert( cudaMalloc( (void**)&_d, _DSIZE * sizeof(float) ) );
gpuAssert ( cudaMemcpy(_d, d, _DSIZE * sizeof(float), cudaMemcpyHostToDevice) );
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float )));
testkernel2 <<< 1, _DSIZE >>> (_d);
( cudaThreadSynchronize() );
( cudaMemcpy(d, _d, _DSIZE * sizeof(float), cudaMemcpyDeviceToHost) );
for(int i = 0; i < _DSIZE; i++) {
fprintf(stdout, "%2d %6.1f\n", i, d[i]);
}
cudaFree(_a);
cudaFree(_d);
free(d);
return cudaThreadExit();
}
[/codebox]
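As a sanity check on the index arithmetic (just the FTNREF3D macro above evaluated on the host, assuming the unit lower bounds used in the kernel): a cube of side _DSIZE has _DSIZE^3 elements, so 31 gives 29,791 — the same ceiling I reported earlier — and 445 gives a largest linear index of 445^3 − 1 = 88,121,124, which still fits inside the cudaMalloc’d buffer (roughly 350 MB of device memory).
[codebox]
// Host-side check of the FTNREF3D arithmetic (unit lower bounds assumed).
#include <stdio.h>
#define FTNREF3D(i_index,j_index,k_index,i_size,j_size,i_lb,j_lb,k_lb) \
    ((i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+i_index-i_lb)

int main()
{
    long n = 445;
    long last = FTNREF3D(n, n, n, n, n, 1, 1, 1);   /* 445^3 - 1 = 88,121,124 */
    printf("elements = %ld, last index = %ld, bytes = %ld\n",
           n * n * n, last, n * n * n * (long)sizeof(float));   /* ~352 MB */
    printf("31^3 = %d\n", 31 * 31 * 31);            /* 29,791 */
    return 0;
}
[/codebox]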
I am just about done with this, I am sorry, but this:
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float )));
shouldn’t even compile according to the documentation, and it has no right to work, even if it does compile. You are copying a pointer to a float, not a float. On 64-bit machines, they are not the same size; on 32-bit machines, they are. Whichever architecture you are working on, if it works, it is accidental. It should be something like this:
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float *),(size_t)0, cudaMemcpyHostToDevice));
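To make the intent concrete, here is the pattern in context (a minimal sketch, reusing the ad symbol, _a pointer and _DSIZE from your code): the thing being copied into the symbol is the pointer value itself, so the size argument has to be the size of a pointer.
[codebox]
// Minimal sketch: store a device pointer in a __device__ symbol
// (ad, _a and _DSIZE as in the code above).
__device__ float *ad;

float *_a;
cudaMalloc((void **)&_a, _DSIZE * _DSIZE * _DSIZE * sizeof(float));

// Copy the pointer value itself: sizeof(float *) bytes (8 on a 64-bit host).
// Using sizeof(float) (4 bytes) would copy only half of the address.
cudaMemcpyToSymbol(ad, &_a, sizeof(float *), (size_t)0, cudaMemcpyHostToDevice);
[/codebox]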
Ok. I’ve got it. I am on a 64-bit machine:
sizeof(float) = 4, sizeof(float *) = 8
gpuAssert( cudaMemcpyToSymbol( ad, &_a, sizeof(float *)));
is OK for both emulation and release code.
I had not noticed that I had changed sizeof(float *) to sizeof(float).
From page 97 of the Programming Guide:
“Intermediate data structures may be created in device
memory, operated on by the device, and destroyed without ever being mapped by
the host or copied to host memory”
How do you create/destroy an array inside the kernel? Can you do cudaMalloc/cudaFree inside the kernel?
That doesn’t mean what you think it means. It simply means that you can allocate and manipulate device memory directly through the API, without requiring any analogous host memory allocation or any mapping of device allocations to host memory (i.e. cudaMalloc doesn’t need any host memory allocation or mapping in order to work). This is in contrast to older programmable shader languages, which required that each device buffer be bound to a host buffer.
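In code, the passage is describing something like this (a minimal sketch; stage1 and stage2 are made-up kernel names, n a made-up element count): the intermediate buffer exists only in device memory and is never copied to or mapped by the host.
[codebox]
// Sketch: an intermediate buffer created, used and destroyed entirely in
// device memory -- no host allocation, mapping, or cudaMemcpy involved.
// stage1/stage2 are hypothetical kernels; input/output are device pointers.
float *intermediate;
cudaMalloc((void **)&intermediate, n * sizeof(float));

stage1<<<blocks, threads>>>(input, intermediate);   // device writes it
stage2<<<blocks, threads>>>(intermediate, output);  // device reads it back

cudaFree(intermediate);   // destroyed without ever touching host memory
[/codebox]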
You can’t, and that text you are quoting doesn’t say that you can.