Pointer arithmetic: is this allowed?

e.ping · May 27, 2008, 11:30pm

Hi,

The following gives funky results, so I was wondering if this is legal:

cudaMalloc((void**)&data, size*sizeof(float));

someKernel<<<grid, block>>>(data+someOffset, someMoreParams);

This works fine from plain C, so I assume it is OK. But as soon as templates and C++ join the show (basically an extension of the ‘reduction’ SDK example for non-power-of-two (NPOT) arrays), I get reproducible freezes of X on Linux. So the question is: can I do the above pointer arithmetic on the host side?

Thanks for insights,

dom

Mu-Chi_Sung · May 27, 2008, 11:49pm

Yes, for sure. I do this very often. But I can't tell what causes the freeze or crash unless you post your code. My guess is that the kernel is reading or writing global memory that is not allocated or not initialized.
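
If the culprit is an out-of-range access, a host-side guard before the launch can catch it early. The following is only a sketch: rangeFits and offsetPointer are hypothetical helpers, not part of CUDA or the SDK, and the number of elements the kernel touches is assumed to be known on the host. The pointer is only offset, never dereferenced, so the same code applies to a pointer returned by cudaMalloc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when elements [offset, offset + count) lie inside
// an allocation of allocElems elements. Written to avoid size_t overflow.
inline bool rangeFits(std::size_t allocElems, std::size_t offset,
                      std::size_t count) {
    return offset <= allocElems && count <= allocElems - offset;
}

// Hypothetical helper: forms base + offset only after the range check passes.
// The pointer is never dereferenced here, so it may be a device pointer.
inline float* offsetPointer(float* base, std::size_t allocElems,
                            std::size_t offset, std::size_t count) {
    assert(rangeFits(allocElems, offset, count));
    return base + offset;
}
```

With the names from the original post, the launch would become someKernel<<<grid, block>>>(offsetPointer(data, size, someOffset, n), someMoreParams), where n (an assumed name) is the number of floats the kernel reads or writes.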

Skribtsov · May 28, 2008, 4:31am

One must be careful with offsets when working with 2D arrays whose “width” is measured in bytes. To calculate the address of an element:

1. convert the pointer to type char *
2. offset by “width” bytes * row_number
3. convert the pointer back to the original type
4. offset by column_number
5. only then access

Skipping step 1 or 3 usually causes unpredictable behavior when dereferencing the pointer. The SDK documentation points out this potential bug explicitly.
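
The steps above can be sketched as a small helper. This is an illustration under assumptions, not code from the SDK: pitchedElement and its parameter names are hypothetical, and pitchBytes stands for the byte “width” (the value CUDA calls the pitch, as returned by cudaMallocPitch). It is written as plain C++; inside a kernel the same function would need the __device__ qualifier.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: address of element (row, col) in a 2D array whose rows
// are pitchBytes apart, where pitchBytes is measured in bytes, not elements.
template <typename T>
inline T* pitchedElement(T* base, std::size_t pitchBytes,
                         std::size_t row, std::size_t col) {
    char* bytes = reinterpret_cast<char*>(base);   // step 1: view as bytes
    char* rowStart = bytes + row * pitchBytes;     // step 2: skip whole rows
    T* typedRow = reinterpret_cast<T*>(rowStart);  // step 3: back to T*
    return typedRow + col;                         // step 4: skip columns
}
```

Writing base + row * pitchBytes + col instead (skipping steps 1 and 3) would advance by pitchBytes elements per row rather than pitchBytes bytes, which for float lands four times too far into memory.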

As for simple, non-strided arrays, I have had no problems.
I only noticed an issue once, when I was trying to read 3 bytes with a single “long int” read (from a non-aligned address) and then apply a mask and shift to break the long into the three bytes I needed. For some reason it did not behave the same as three successive byte reads, although I verified the approach on the CPU and it worked well. If anybody else has encountered such an issue, let me know.
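
One possible explanation, offered as an assumption rather than a diagnosis of the code above: CUDA expects global memory accesses to be naturally aligned, so a 4-byte read from an address that is not a multiple of 4 is not guaranteed to return the bytes a CPU would. A sketch of the byte-wise alternative follows; load3Bytes is a hypothetical name, and the little-endian packing order is a choice made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reads 3 consecutive bytes starting at p (any alignment)
// and packs them little-endian into the low 24 bits of a 32-bit value.
inline std::uint32_t load3Bytes(const unsigned char* p) {
    assert(p != nullptr);
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16);
}
```

Each read here is a single byte, so no alignment requirement applies, at the cost of three loads instead of one.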