Reading data

global void ReadKernel(uint32_t** pcreTbl, uint32_t* pcreTblSize, uchar* packet, PacketInfo* pkInfo, char* pcreRes)
{
unsigned char ch = 0;
uint32_t matched = false;
uint32_t size = 0;
uchar *base = 0;

base = packet + pkInfo[threadIdx.x].base;
size = pkInfo[threadIdx.x].size;

/* TODO : LOOP unrolling */
for (uint32_t i = 0; i < size; i++) {
ch = base[i];
matched ^= ch;
}
pcreRes[threadIdx.x + blockIdx.x * blockDim.x] = matched;
}

This kernel is too slow.
Do you tell me why it is? :(

Bytewise reading is inefficient, at least on compute capability 1.x devices. Try reading a complete [font=“Courier New”]uchar4[/font] at a time and then process each of the members.

Uhm, it is not coalesced?

Bytewise memory access is uncoalesced on compute capability 1.0 and 1.1.
It is coalesced on compute capability 1.2 and 1.3, however the same data is read from memory twice as the number of bytes read is smaller than the minimum transaction size.

Instead of reading a [font=“Courier New”]uchar4[/font], you could also use a texture so that read accesses are cached.

Actually I am using GTX 580 whose compute capability is 2.0.

I think in that version, global memory has its own cache. So I thought that it has no need to use texture memory.

Am I wrong?

The L1 cache and texture cache in an sm_2x (Fermi) device have different properties, so depending on the access pattern using one or the other may lead to higher performance. In general, if accesses are properly coalesced, the L1 cache is a good choice and requires no deviation from classical C programming idioms. If accesses are not coalesced, but still have good spatial and temporal locality, the texture cache can be a better choice.

However there are tradeoffs. Since the texture cache is not coherent within a single kernel call, accesses via the texture cache are pretty much limited to read-only data. Use of the texture cache requires using textures, and the binding and unbinding of the textures can cause some amount of overhead if it has to happen on the fly. In addition, textures are limited in size, so large data structures may require mapping via multiple textures, which complicates the code somewhat.

Note that in code operating on multiple data objects, the best strategy may be to read some via the L1 cache and others via the texture cache, as this effectively gives the benefit of the combined cache sizes. Once I even managed to simultaneously use the constant cache, the texture cache, and the L1 cache in the same piece of code, for the maximum amount of caching.

What does it mean: “texture cache is not coherent within a single kernel call” ?

Do you mean if kernels write to texture memory that the texture cache will be different ?

Or do you mean the texture cache has some sort of problem when doing reads only ? ← Sounds weird.

If a write occurs to the global memory mapped by the texture, the texture cache may or may not contain the updated data on a subsequent read access to the location written. Writes to global memory in a previous kernel are however reflected by a texture access in a subsequent kernel because the driver flushes the texture cache prior to launching a kernel. See section 3.2.10.4 of the Programming Guide.

I’ve been wondering about that… I’ve been wondering if it’s possible to read and write from and to texture memory.

I have seen some slides and presentations and tables which show tables claiming “texture” is read only.

Some sections of the guide mention reading and writing is possible but it produces undefined results (in combination with the cache ?)

However section “3.3.13 Graphics Interoperability” does mention the possibility of mapping and unmapping resources.

These are Driver API calls though (host side) and they could have high overhead.

The guide isn’t completely clear to me, also I am not much of a graphics programmer when it comes to OpenGL/DirectX, my interest in that is low. I am more interested in general computation.

But if textures can somehow be used to read and write reliably I might be interested in it, especially if the texture caching effect can lead to higher performance.

But so far reading and writing seems unreliable (at least in combination with the cache) ???

Also I started using cuda to get away from textures and such because it’s more weird/more complex/more akward to use for general computation External Image :)

Otherwise I could just go back to dx9/opengl External Image :) and my now obsolete 7900 gtx 512 MB with 70 GB/sec (?!) =D (<- but I know it’s probably obsolete, to little write storage space for render to multi texture. just 64 bytes per pixel if I remember correctly External Image).

Also biggest problem with opengl/directx before cuda existed is probably iteration count… It’s like calling a kernel millions of times… that overhead is just to large… that problem is now much less, since cuda can read and write to global memory…

… from within a kernel I should add.

Search for “surface” in the Programming Guide. These are writeable textures.

Ok, this “coherency issues” seems to apply to both, textures and surfaces:

"
Read/Write Coherency
The texture and surface memory is cached (see Section 5.3.2.5) and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.
"

Therefore it doesn’t seem to be of much use for a kernel which want to do a lot of reading and writing in a single kernel call.