2D Array using OpenCL Arrays vs Images & How-to

Hello all,

I am trying to get rolling on using OpenCL for a CFD Code I am writing for my thesis. I am, unfortunately, getting stuck on what should be a fairly simple operation. I am hoping that someone will be able to help me out here.

I have a file that gives me the points on a grid (X, Y, and Z). The grid has 181 points in the X-direction (imax), 65 in the Y-direction (jmax), and 1 in the Z-direction. I want to build a 2-D array of these points, since this is essentially a 2-D problem (there are no calculations involving Z, but the mesh is given to me with all of the coordinates anyway).

I can read the file into a 2-D Array:

[codebox]cl_float4** cl2DGridPoints;

// One pointer per i, each pointing at a separately allocated run of jmax points
cl2DGridPoints = new cl_float4*[imax];
for(i=0;i<imax;i++){
    cl2DGridPoints[i] = new cl_float4[jmax];
}

// The file is read with i varying fastest
for(j=0;j<jmax;j++){
    for(i=0;i<imax;i++){
        MeshInput >> XX >> YY >> ZZ;
        cl2DGridPoints[i][j].s[0] = XX;
        cl2DGridPoints[i][j].s[1] = YY;
        cl2DGridPoints[i][j].s[2] = ZZ;
        cl2DGridPoints[i][j].s[3] = 0.0;
    }
}[/codebox]

I can then access the data on the C++ side by the following:

[codebox]MyMesh[i][j].s[0],
MyMesh[i][j].s[1],
MyMesh[i][j].s[2],
0[/codebox]

This allows me to use the float4 in OpenCL. However, I am stuck on how to create a buffer / give the array to an OpenCL kernel.

I learned that OpenCL supports 2-D and 3-D images (which appear to have a width, height, and 4 values (RGBA) at each point), but I do not know how to use them.

Would someone be willing to assist me on this matter? I can do the 1-D arrays all day, but I would like to learn how to use the 2-D method since I will need to further extend this later.

Thank you in advance!

-Kevin Kennedy

First, and quite importantly, you should check whether your device supports images: query CL_DEVICE_IMAGE_SUPPORT with clGetDeviceInfo in your C++ host code, and test the __IMAGE_SUPPORT__ macro with #ifdef in your CL code. The image feature works (mostly) fine on nVidia, but ATI’s CL compiler chokes on it and will crash your application.
Then, instead of creating a buffer, you’ll have to create either a 2D or a 3D image. The data you pass to it is simply all rows stored one after another.
In addition to that, you’ll need a sampler to read values from the image. It can be created in the C++ host code, or declared directly as a constant in the CL code.

On another note, though: images are most appropriate when you are sampling values that are not on integer indices, which is why you need a sampler to use them.
For simple integer access to a 2D array, you should just use a 1D buffer and calculate the indices manually (i = x + (y * width), y = i / width, x = i - (y * width)). That approach works pretty much everywhere.
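The index arithmetic above can be sketched in plain C++ as follows. The names flatten, unflatten and Index2D are made up for illustration, and the same arithmetic would apply inside a kernel; treat it as a minimal sketch rather than anyone's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type holding a 2D position.
struct Index2D {
    std::size_t x;
    std::size_t y;
};

// Row-major flattening: x varies fastest, y selects the row.
inline std::size_t flatten(std::size_t x, std::size_t y, std::size_t width) {
    assert(x < width);
    return x + y * width;
}

// Inverse mapping: recover (x, y) from a flat index.
inline Index2D unflatten(std::size_t i, std::size_t width) {
    assert(width > 0);
    Index2D p;
    p.y = i / width;
    p.x = i - p.y * width;  // equivalent to i % width
    return p;
}
```

With width = 181 (imax from the original post), the point x = 10, y = 3 lands at flat index 10 + 3 * 181 = 553, and unflatten(553, 181) gives back x = 10, y = 3.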

One way to make a completely 2D array (although not contiguous) is to make two 2D arrays, one that is [host][device] pointers, and the other [device][device]. So it would be something like this in pseudo-code. I can post actual code later if it is needed, but this should give you an idea of how to do it.

cl_mem* host_and_device; // this is actually 2 dimensional, cl_mem is declared as _cl_mem* or something, this is off the top of my head
host_and_device = new cl_mem[100]; // cl_mem is already a pointer type, so just allocate more cl_mem's
for(int i = 0; i < 100; i++) {
    host_and_device[i] = clCreateBuffer(blah blah blah); // dont know the exact command args, can make it any size you want
}

cl_mem* device_only;
device_only = clCreateBuffer(blah); // make whatever size you want, just for the sake of the 'for' loop below, pretend it's 100

// now copy the data from the host_and_device to device_only
for(int i = 0; i < 100; i++) {
    clCopyBuffer(device_only[i], host_and_device[i], sizeof(whateveryouwant) ); // again, don't know the args
}

Since your array is relatively small, see if it is possible to store it as float4 in a buffer and pass it to the kernel as constant memory (check CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE to see whether it fits). That will usually be the fastest way to access it, faster than texture memory (or image memory, as you are calling it). If it’s just too big, you can put it into global memory, and I would make sure to use coalesced access to it. Depending on whether the values will be reused, it may be useful to load them from global memory into shared memory (called local memory in OpenCL) and then operate on that copy. Look in the SDK for examples.
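Whether the grid fits in constant memory is quick to check by hand. The sketch below uses a stand-in Float4 struct (an assumption, so that it compiles without the OpenCL headers; cl_float4 is also four floats) and hypothetical helper names. The 64 KB figure used in the note is, as far as I recall, the minimum the OpenCL spec guarantees for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE; the real limit should be queried from the device.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for cl_float4 (assumption: four packed 32-bit floats, 16 bytes).
struct Float4 {
    float s[4];
};
static_assert(sizeof(Float4) == 16, "expected a 16-byte float4");

// Bytes needed to store an imax x jmax grid of float4 points.
inline std::size_t gridBytes(std::size_t imax, std::size_t jmax) {
    return imax * jmax * sizeof(Float4);
}

// True if the grid fits within a given constant-buffer limit (in bytes).
inline bool fitsInConstant(std::size_t bytes, std::size_t limitBytes) {
    assert(limitBytes > 0);
    return bytes <= limitBytes;
}
```

For the 181 x 65 grid from the original post this gives 181 * 65 * 16 = 188,240 bytes, which is more than 64 KB, so on a device that only offers the minimum the grid would have to live in global memory.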

Hey scwizzo,

Why do we need to copy the buffers to device_only memory? Can’t we use the host_and_device memory directly?

Thank you

My initial assumption (or better yet, the way I went about it), that OpenCL can support 2D arrays, was wrong (it still probably can, I just haven’t found a way of doing it that I like yet). The way I posted does not work since cl_mem does not have a size associated with it, and the compiler doesn’t like having to return a struct with no size, pointer or not.

As for your question, vandop: no, you can’t use host_and_device directly. These are two separate locations in memory. The first [being pointers to pointers] is usable on the host side. The second [being pointers to objects] is usable on the device side. This means that if you want to use the 2D array in its entirety, on one side or the other, there will be conflicts over where the memory resides. The method for 2D arrays I posted works for CUDA [but not OpenCL] because CUDA can have pointers to defined objects, which have defined sizes, like float or int. OpenCL, on the other hand, does not have pointers to objects directly.

I don’t quite understand the original question. But who needs 2D arrays anyway? Just create one big buffer on the GPU, say:

cl_int SizeX = …;
cl_int SizeY = …;
cl_float4* CPU_y = new cl_float4[SizeX * SizeY];

… init CPU_y …

cl_mem y = clCreateBuffer(context, CL_MEM_READ_WRITE|CL_MEM_COPY_HOST_PTR, sizeof(cl_float4) * SizeX * SizeY, (void*)CPU_y, &ciErr);

Then pass the arguments to the kernel as

ciErr = clSetKernelArg(my_kernel, 0, sizeof(cl_mem), (void*)&y);
ciErr = clSetKernelArg(my_kernel, 1, sizeof(cl_int), (void*)&SizeX);
ciErr = clSetKernelArg(my_kernel, 2, sizeof(cl_int), (void*)&SizeY);

Of course, you only need one of SizeX and SizeY to index your “array”.
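For the grid in the original post, the separately allocated rows first have to be packed into one contiguous block before they can be handed to clCreateBuffer. A minimal sketch, assuming a stand-in Float4 struct in place of cl_float4 and a hypothetical helper name flattenGrid:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for cl_float4 so the sketch compiles without OpenCL headers.
struct Float4 {
    float s[4];
};

// Copy grid[i][j] (i < imax, j < jmax) into one contiguous array with
// i varying fastest, so element (i, j) ends up at index i + j * imax.
std::vector<Float4> flattenGrid(Float4* const* grid,
                                std::size_t imax, std::size_t jmax) {
    assert(grid != nullptr);
    std::vector<Float4> flat(imax * jmax);
    for (std::size_t j = 0; j < jmax; ++j) {
        for (std::size_t i = 0; i < imax; ++i) {
            flat[i + j * imax] = grid[i][j];
        }
    }
    return flat;
}
```

The result's data() pointer and size() * sizeof(Float4) are then what would go into the host pointer and size arguments of clCreateBuffer. Reading the mesh file straight into such a vector would avoid the intermediate pointer-to-pointer array altogether.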

You could also use images, but a kernel cannot read from and write to the same image (at least in OpenCL 1.x). I use images in my CFD code for accessing 2D thermodynamic data tables. The image sampler gives me linear interpolation “for free”.

Hope it helps,

Madsen

Sorry scwizzo, I’m just a newbie in GPGPU, and I’m having some trouble fully understanding these memory allocation questions. For instance, if I want to use a char[1024][1024] on the GPU side for output and get it back to the CPU to printf the data that comes from the GPU, how should I allocate it?

Thank you:)

Sorry, I hadn’t seen this answer before my previous question. It solves my buffer[1024][1024] problem, but what about 2D arrays whose rows have different sizes, like this in CPU code:

char** buffer = new char*[1024];
buffer[0] = new char[2];
buffer[1] = new char[10];

and so on

How can I allocate this kind of buffer to use in GPU?

You can pass one big buffer and then a vector of indices:

char bbuffer[1024];
int index[3];
index[0]=0;
index[1]=2;
index[2]=2+10;

so buffer[2][5] (row 2, element 5) would be bbuffer[index[2]+5]. Then pass the linear arrays as usual.
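The offset table can be built from the row lengths with a running sum. A minimal sketch with hypothetical helper names (buildOffsets, jaggedIndex); the row lengths 2 and 10 come from the question, while the third length in the usage note is made up:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// offsets[r] is where row r starts in the flat buffer;
// offsets.back() is the total number of elements.
std::vector<std::size_t> buildOffsets(const std::vector<std::size_t>& rowLengths) {
    std::vector<std::size_t> offsets(rowLengths.size() + 1, 0);
    for (std::size_t r = 0; r < rowLengths.size(); ++r) {
        offsets[r + 1] = offsets[r] + rowLengths[r];
    }
    return offsets;
}

// Flat position of element (row, col) of the jagged array.
inline std::size_t jaggedIndex(const std::vector<std::size_t>& offsets,
                               std::size_t row, std::size_t col) {
    assert(row + 1 < offsets.size());
    assert(col < offsets[row + 1] - offsets[row]);
    return offsets[row] + col;
}
```

With row lengths {2, 10, 8} the offsets come out as {0, 2, 12, 20}, so element [2][5] sits at flat position 12 + 5 = 17. Keeping the extra final entry lets a kernel work out each row's length as offsets[r + 1] - offsets[r].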

Madsen

Hum, I see! Thank you for your help:)

The reason for wanting a 2D array is that the existing Fortran code I am converting uses 2D arrays (but is inconsistent about 0- and 1-based indexing). I can’t seem to get the 1-D arrays to behave correctly when calculating a solution (purely in C++). I would like to get the 1D array to work, but it isn’t working right now. Thanks for the advice!
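One way to keep the 0-based and 1-based cases straight is to route every access through a single helper that takes the lower bound explicitly. A minimal sketch, assuming the Fortran arrays are stored column-major (first index fastest, as Fortran does) and share one lower bound in both dimensions; the name fortranIndex is hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Map a Fortran-style A(i, j), whose indices start at `base` (0 or 1),
// onto a 0-based flat index with i varying fastest.
inline std::size_t fortranIndex(int i, int j, int base,
                                std::size_t imax, std::size_t jmax) {
    assert(i >= base && j >= base);
    const std::size_t i0 = static_cast<std::size_t>(i - base);
    const std::size_t j0 = static_cast<std::size_t>(j - base);
    assert(i0 < imax && j0 < jmax);  // catches off-by-one errors in debug builds
    return i0 + j0 * imax;
}
```

For a 1-based 181 x 65 array, A(1, 1) maps to 0 and A(181, 65) maps to 11764, the last element. The asserts are there to flag an out-of-range index early instead of letting it silently read a neighbouring row.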
