2D char array of random characters represented as a 1D char array: trying to make each CUDA thread generate its own word

Hello.

I’m a CUDA newbie so I apologize if this is a silly question.

I’m trying to generate, say, 1000 random strings using the letters of the alphabet. Each “word” has to be 10 characters long.

I don’t want to deal with pitch and 2D arrays, since these are nothing more than 1D arrays on the device anyway. So here is the pseudocode for what I was thinking I’d do:

char word_array[ 1000 * 10 ]; // every block of sizeof(char)*10 memory holds 1 word
char alphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // my alphabet for picking random letters

// initialize word_array to be all '\0'
// initialize random number generator on GPU
// allocate GPU memory for word_array
// allocate GPU memory for alphabet array
// copy word_array to device
// copy alphabet to device
// execute kernel
// copy word_array back to host
// print word_array contents (each 10 characters and then a new line)
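To make that concrete, here is roughly the host side I have in mind (just a sketch, untested; setupRNG is a placeholder init kernel I’d write for curand, and genRandomPep is the kernel shown further down):

#include <stdio.h>
#include <curand_kernel.h>

#define NUM_WORDS 1000
#define WORD_LEN  10

// placeholder init kernel: one curand state per word/thread
__global__ void setupRNG(curandState *state, unsigned long long seed) {
	int tid = threadIdx.x + (blockIdx.x * blockDim.x);
	if (tid < NUM_WORDS)
		curand_init(seed, tid, 0, &state[tid]);
}

// genRandomPep would be defined here (see below)

int main(void) {
	char word_array[NUM_WORDS * WORD_LEN] = {0}; // all '\0'
	char alphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

	char *dev_words, *dev_alphabet;
	curandState *dev_state;
	cudaMalloc(&dev_words, sizeof(word_array));
	cudaMalloc(&dev_alphabet, sizeof(alphabet));
	cudaMalloc(&dev_state, NUM_WORDS * sizeof(curandState));

	cudaMemcpy(dev_words, word_array, sizeof(word_array), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_alphabet, alphabet, sizeof(alphabet), cudaMemcpyHostToDevice);

	int tpb = 256, bpg = (NUM_WORDS + tpb - 1) / tpb;
	setupRNG<<<bpg, tpb>>>(dev_state, 1234ULL);
	genRandomPep<<<bpg, tpb>>>(dev_state, dev_alphabet, dev_words);

	cudaMemcpy(word_array, dev_words, sizeof(word_array), cudaMemcpyDeviceToHost);

	for (int w = 0; w < NUM_WORDS; w++)
		printf("%.*s\n", WORD_LEN, &word_array[w * WORD_LEN]);

	cudaFree(dev_words); cudaFree(dev_alphabet); cudaFree(dev_state);
	return 0;
}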

My goal is to have each thread of the GPU generate its own random word and store it in one of the “word banks” in the word_array.

I’ve got the code to generate 1 random word on 1 thread, but I can’t scale this up. I’m guessing I’ve screwed up the thread indexing somehow. Can anyone point me in the right direction?

Here is my kernel that generates the random word:

// The main kernel
__global__ void genRandomPep(curandState *dev_state, char *dev_alphabet, char *dev_words) {
	int tid = threadIdx.x + (blockIdx.x * blockDim.x); // this is wrong for what I'm trying to do, I think
	int i, j;
	curandState localState = dev_state[tid];
	for (i = tid; i < 11; i++) { // iterate over a block of memory covering 10 chars
		j = curand( &localState ) % N; // get random number between 0 and 25
		dev_words[i] = dev_alphabet[j];
	}
	__syncthreads(); // do I need this?
}

I’m working on RHEL 5.0 (64-bit) with a GeForce GTS 250 (1GB), if that matters.

Thanks in advance for any and all help.


for (i = tid; i < 11; i++) { // iterate over a block of memory covering 10 chars
	j = curand( &localState ) % N; // get random number between 0 and 25
	dev_words[i] = dev_alphabet[j];
}

Shouldn’t this loop run i from 0 through 9, not from tid to 10?

Edit: You also only need __syncthreads() if you need synchronization across all threads in a particular block. This is most commonly used with shared memory.
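Something like this is probably closer to what you’re after (just a sketch, untested; I’m assuming N is 26 and adding a num_words parameter so extra threads can bail out):

__global__ void genRandomPep(curandState *dev_state, char *dev_alphabet,
                             char *dev_words, int num_words) {
	int tid = threadIdx.x + (blockIdx.x * blockDim.x);
	if (tid >= num_words) return; // extra threads do nothing

	curandState localState = dev_state[tid];
	// each thread owns the 10-char slot starting at tid * 10
	for (int i = 0; i < 10; i++) {
		int j = curand(&localState) % 26; // index 0..25 into the alphabet
		dev_words[tid * 10 + i] = dev_alphabet[j];
	}
	dev_state[tid] = localState; // save the RNG state for any later launch
	// no __syncthreads() needed: no thread reads another thread's slot
}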

I agree with MutantJohn.
Additionally, the ASCII values of the 26 letters of the alphabet are contiguous. You could directly generate a random number between the ASCII value of ‘a’ and the ASCII value of ‘z’ and avoid a costly random access to a lookup table in global memory (i.e., dev_alphabet); see the one-liner below.
Finally, writing your generated words this way into dev_words will yield non-coalesced accesses. You could maybe use shared memory to improve those accesses.
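For instance, a minimal sketch of the idea (uppercase, to match the OP’s alphabet; localState is the thread’s curand state as in the OP’s kernel):

	// 'A'..'Z' are contiguous in ASCII, so no lookup table is needed
	char c = 'A' + (curand(&localState) % 26); // random letter in 'A'..'Z'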

Right, writing into the 1D array the way the OP has it results in non-coalesced access.

I literally just learned about this like 2 days ago so I’m still new, but I think the efficient CUDA way would be to write all this and treat it like a 2D array.

Basically, the string “the cat is fat” would be stored in data as:

t  c  i  f
h  a  s  a
e  t -1  t

This is a weird way to access data, I think, but it should allow fully coalesced access. Global memory is read in contiguous blocks, and this layout lets each warp fully use the data it reads in.

I played with this kind of access yesterday and came up with something like this:

#include <thrust/host_vector.h>

const int row_width = ...; // num_insertions(aa.nominated, aa.size);
const int row_depth = 4;   // maximum 4 child tetrahedra

// host-side array of device row pointers
thrust::host_vector<int*> buff(row_depth);

// allocate each row on the device
for (int i = 0; i < row_depth; ++i)
    cudaMalloc(&(buff[i]), row_width * sizeof(*(buff[i])));

// device-side storage for the row addresses
int **fl;
cudaMalloc(&fl, row_depth * sizeof(*fl));
cudaMemcpy(fl, buff.data(), row_depth * sizeof(*fl), cudaMemcpyHostToDevice);

write_fract_locations<<<bpg, tpb>>>(row_width, row_depth, fl);
cudaDeviceSynchronize();

for (int i = 0; i < row_depth; ++i)
    cudaFree(buff[i]);
cudaFree(fl);

__global__
void write_fract_locations
(
const int row_width,
const int row_depth,
int **fl
)
{
    const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;
    const int grid_size = blockDim.x * gridDim.x; // total threads in the grid

    // 1 thread for every element in a row; grid-stride loop
    for (int tid = thread_id; tid < row_width; tid += grid_size)
    {
        // iterate by row...
        for (int i = 0; i < row_depth; ++i)
        {
            // row address
            int *ra = fl[i];
            ra[tid] = -1;
        }
    }
}

And for me, it was working, though I’m not sure whether it’s bugged. But the idea is: write your words as columns instead of rows, which is super unnatural. If you store everything as columns, though, you can get coalesced access.

But I think it’s got okay access patterns: each thread in a warp reads the same location in fl, so that’s a guaranteed hit for all of them. Then the efficiency is determined by word length, so I guess 10 is kind of bad. Hmm…

Maybe this post is a bad one.

Would shared memory have been better?

Shared memory would not have been better than reorganizing the data the way MutantJohn mentioned, since there is no data reuse.
If for some reason you’d rather keep the data the way it is, you can always reorganize it in shared memory instead.
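Applied to the original problem, that reorganization could look like this (a sketch, untested; character i of word w lives at dev_words[i * num_words + w]):

__global__ void genRandomPepColMajor(curandState *dev_state, char *dev_words,
                                     int num_words) {
	int tid = threadIdx.x + (blockIdx.x * blockDim.x);
	if (tid >= num_words) return;

	curandState localState = dev_state[tid];
	// consecutive threads (consecutive words) write consecutive addresses,
	// so each of the 10 row-writes is coalesced across the warp
	for (int i = 0; i < 10; i++)
		dev_words[i * num_words + tid] = 'A' + (curand(&localState) % 26);
	dev_state[tid] = localState;
}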

This makes no sense… coalesced memory access is mostly about reading data sequentially/linearly…
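Concretely, for one warp (hypothetical indexing):

	data[tid]          // consecutive threads touch consecutive addresses: coalesced
	data[tid * 10 + i] // consecutive threads sit 10 bytes apart: strided, not coalesced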

Oh man, I’m so dumb. I’m sorry, I gotta start reading things more closely.

I think what I was trying to get at was storing things in column-major order. That’s probably what my point was.