Annoying problems with memory and/or syntax

I'm having major problems trying to perform bitwise operations on two arrays that are initialised as device arrays.

I'm trying to do the following simple steps:

  1. Initialise two device arrays

  2. Copy a large amount of text from host memory to device memory

  3. Take a chunk of the text and put it in an array

  4. XOR each element of the array holding the text chunk with the corresponding element of the other array

  5. Return this result to device memory, and then copy it back to host memory

I'm getting a really annoying problem. It's probably something really simple. What am I doing wrong?

I've proved my code correctly copies from host to device and back again, and also proved that the arrays are being filled correctly, so why won't it let me XOR the data? It appears as though the code is just ignored?! PLEASE HELP! :'(

Can you post your kernel, or the important parts of your program?

In general, be sure that you actually use the calculated values, otherwise the compiler will optimize the calculations out.

I've attached my kernel for you. I'm basically trying to implement the AES encryption code I have already written in C. That code has been proved to work and just needs to be converted. I thought this was going to be fairly straightforward, but so far I've struggled to convert even my add-initial-round-key phase! Not good :(

Just change the extension from .c to .cu. Probably didn't need to mention that, but just in case ;)

Meh, I don't think the attachment worked anyway, so I've just copied and pasted it, lol!

#include <stdio.h>
#include <cutil.h>

#define INPUT_FILE_NAME "input.txt"
#define OUTPUT_FILE_NAME "output.txt"

//------------------------------------------------------------------------------Variable Definitions

int length;

__device__ unsigned char current_state[16];

//----------------------------------------------------------------------------Function Preprocessors

__global__ void encrypt(unsigned char* deviceBuffer, int length);

__device__ void AddInitRoundKey(unsigned char current_state[], unsigned char cipherkey[]);

//------------------------------------------------------------------------------Resource Definitions

__device__ unsigned char cipherkey[16] = {'d','f','h','l','p','y','%','^','5',')','*','j','v','1','@','#'};

__device__ unsigned char s_box[256] = {
	0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
	0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
	0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15,
	0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75,
	0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84,
	0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf,
	0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8,
	0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2,
	0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73,
	0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb,
	0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79,
	0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08,
	0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
	0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e,
	0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf,
	0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16
};


//----------------------------------------------------------------------------------Functions Listing

__device__ void AddInitRoundKey(unsigned char current_state[], unsigned char cipherkey[]){

	int temp[16];
	int i;
	for (i = 0; i <= 15; i++){
		temp[i] = current_state[i];
	}

	int j;
	for (j = 0; j <= 15; j++){
		current_state[j] = temp[j]^cipherkey2[j];
	}
}

__global__ void encrypt(unsigned char* deviceBuffer, int length){

	for (int i = 0; i <= length; i += 16){

		for (int j = i; j <= i+15; j++){
			current_state[j] = deviceBuffer[j];
		}

		AddInitRoundKey(current_state, cipherkey);

		for (int j = i; j <= i+15; j++){
			deviceBuffer[j] = current_state[j];
		}
	}
}

void fillHostBuffer(unsigned char* hostBuffer, FILE* inputfile){

	int next = 0xFF;
	int pos = 0;

	while (next != EOF){
		next = fgetc(inputfile);
		hostBuffer[pos] = next;
		pos++;
	}
	//	printf("Finished reading inputfile to Host Memory...\n");
	//	printf("%d; ", pos);
}

int main (){

	// perform CUDA device initialization

	// display CUDA device info
	int deviceCount;
	CUDA_SAFE_CALL(cudaGetDeviceCount(&deviceCount));

	for (int dev = 0; dev < deviceCount; ++dev){
		cudaDeviceProp deviceProp;
		CUDA_SAFE_CALL(cudaGetDeviceProperties(&deviceProp, dev));
		printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);
		printf("  Major revision number:                         %d\n", deviceProp.major);
		printf("  Minor revision number:                         %d\n", deviceProp.minor);
		printf("  Total amount of global memory:                 %d bytes\n", deviceProp.totalGlobalMem);
		printf("  Clock rate:                                    %d kilohertz\n", deviceProp.clockRate);
	}

	unsigned char* deviceBuffer;
	unsigned char* hostBuffer;
	int length;

	FILE* inputfile = fopen(INPUT_FILE_NAME, "r");
	FILE* outputfile = fopen(OUTPUT_FILE_NAME, "w");

	if (inputfile != NULL){
		fseek(inputfile, 0, SEEK_END);
		length = ftell(inputfile);
		fseek(inputfile, 0, SEEK_SET);
		printf("Number of Characters in the file to be Encrypted: %d\n", length);
		printf("This consumes %d byte(s) of memory...\n", sizeof(char)*length);

		hostBuffer = (unsigned char*) malloc(length);
		CUDA_SAFE_CALL(cudaMalloc((void**)&deviceBuffer, length));
		fillHostBuffer(hostBuffer, inputfile);
		CUDA_SAFE_CALL(cudaMemcpy(deviceBuffer, hostBuffer, length, cudaMemcpyHostToDevice));
		printf("Copied Host Memory Buffer to Device Memory Buffer...\n");

		printf("Encryption Process has begun...\n");
		encrypt<<<64,64>>>(deviceBuffer, length);
		printf("Encryption Process Complete...\n");

		CUDA_SAFE_CALL(cudaMemcpy(hostBuffer, deviceBuffer, length, cudaMemcpyDeviceToHost));
		printf("Copied Memory from Device to Host...\n");

		for (int i = 0; i <= length; i++){
			fprintf(outputfile, "%c; ", hostBuffer[i]);
			//printf("%x; ", hostBuffer[i]);
		}

		fclose(inputfile);
		fclose(outputfile);
	}
	else {
		printf("ERROR: Input file loading failed\n");
	}

	return 0;
}


So, first of all, you don't distribute this work between the threads. Every thread executes the same code over all array elements, so you are doing redundant work.
There are several examples in the SDK of how to decompose vectors or matrices across threads and blocks.

In your device function

__device__ void AddInitRoundKey(unsigned char current_state[], unsigned char cipherkey[]){

where do you get cipherkey2 from?

In my opinion this should never compile. And if it compiles, it certainly does nothing: you first copy deviceBuffer to current_state and then straight back again.

However, if it does compile, there should at least be something different in the deviceBuffer variable.

OK cool, I'll have a look at that.

Oops, that was a typo; I was playing with the code before I posted it :S

You're right that the code at the moment does nothing, but it should at least execute this line correctly, no?

current_state [j] = temp[j]^cipherkey[j];

This appears to be ignored? Why is that?

Well, you have 64*64 threads all reading/writing deviceBuffer[j], causing lots of race conditions.

You really only want one thread to handle one element. Something a little like this:

__global__ void encrypt(unsigned int* deviceBuffer, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < length)
        deviceBuffer[idx] = deviceBuffer[idx] ^ cipherKey[idx % 16];
}

  1. Note the change to int: this is for coalesced reads/writes (it will boost performance over char by roughly 20x).

  2. I'm not sure exactly how cipherKey works; you will need to adjust the indexing of that read accordingly. Also, for performance, don't put cipherKey in device memory: use __constant__ memory, which is optimized for many threads accessing the same element at once.

  3. You need to launch this kernel with a given block size (a multiple of 32 for coalescing) and a grid size large enough that every element in the buffer is covered.

Thanks so much for your reply. Do you have any suggestions for page numbers in the SDK documentation that would be of particular use? I've had a flick through it and there are loads of examples, but I'm not sure which ones are applicable.

Also, I'm not really sure I understand this line…

int idx = blockDim.x * blockIdx.x + threadIdx.x;


In Programming Guide 1.1, pages 1-38 and 47-77 should all be read. Every single section, and in order. The introduction sections especially are very well written and explain everything you need to know about the architecture. This is a must, because the data-parallel computing model in CUDA is very different from your everyday CPU program, even a multithreaded one.

This just computes an index for each thread. Within the blocked model of CUDA, this line produces consecutively increasing indices for each thread specifying which element of the array the thread is to work on.

It’s all in the programming guide. Specifically in section 2.2. I will refrain from explaining more at this stage. I’ll be happy to answer any specific questions you have regarding confusing sections of the programming guide.

OK, so I have done what you recommended and read the guide in a bit more detail. It cleared a few things up. On further inspection, as I am only actually processing 16 values at any one time, I would only need one grid and one block with 16 threads? Based on the thinking that with multiple grids I could look at different states being computed at the same time?

Yep. Though, due to the parallel nature of the GPU, you could process 100s or 1000s of elements in the same time it takes to process 16.

I’m not sure I understand this question. All blocks in a single grid are run “at the same time”, so they can’t communicate. If you make a 2nd kernel call (e.g. a 2nd grid), it is run sequentially after the first kernel.

Right, I think I now understand the programming model. It was quite different from what I initially thought it looked like! Shows what reading can do for you :P

I now understand conceptually how I need to modify my code to fit the programming model. However, I'm probably going to struggle with the code, as it looks a bit complex :wacko:

However, one thing that is still troubling me is how the kernel will be able to look ahead at the different states (by this I mean the current_state arrays) and process them. They are all independent of each other, so I guess it is possible. Would I need to allocate a block for each current_state array and then 16 threads in that block to compute the modifications on its elements?

PS thanks for your help so far, much appreciated!

Hmm, I'm still not sure what you mean by needing to look ahead in the current_state array. The code you posted just copies the memory value at index idx into current_state, XORs it, and then copies current_state back to device memory at index idx: I just pulled all the extra copies out in the example kernel I gave.

If you do need threads to look ahead at other values, you can use shared memory. Just load a block of data into shared memory; then all the threads in the block can read ahead to the values the other threads in the block are working on. Here, you need __syncthreads() to prevent race conditions.

Hmm, I'm not so sure that I really understand now! I am now trying to implement blocks and threads within this kernel. I don't really understand how to write, in coding terms, what to execute on the GPU. I understand what I want to do, but I don't see a consistent way in which to define the number of blocks or threads a function is allocated.

Is there a standard way of doing this? I have been looking at the code examples from the SDK for a long time and not seeing a great deal of consistency. Is there a way in which blocks and threads should be consistently defined in a kernel? Does anyone have a crystal-clear example kernel?

The way blocks and threads are assigned depends entirely on how you break down your problem into threads. There are a lot of options, as blocks have a 2D index and threads have a 3D one. Hence the lack of consistency: different problems lead to different indexing schemes.

However, I would argue that the most commonly used (and most suitable to your problem) is to use the indexing scheme I gave above in the example: Each thread is responsible for calculating the resulting element idx, where idx = blockDim.x * blockIdx.x + threadIdx.x. Then, as you correctly identified above, you need BLOCK_SIZE (variable) threads in each block and ceil(N_elements/BLOCK_SIZE) blocks.

idx = blockDim.x * blockIdx.x + threadIdx.x

With this line, and then using the idx values as the iterator for my current_state array… I don't see how that will work? How is idx recognised as the identifier for threads and not just another integer variable?

e.g. how is:

int idx = blockDim.x * blockIdx.x + threadIdx.x

different from

int a, b,c;

a = b*c;


Because blockDim, blockIdx, threadIdx, and gridDim (not used above) are special built-in variables that are assigned values only when run on the device. Each block gets a different blockIdx, and each thread within a single block gets a different threadIdx.x (CUDA Programming Guide pages 8 and 21).

What you have to understand about CUDA is that you write one kernel "function" which is replicated and executed for each thread. blockIdx/threadIdx tell each thread its ID, so each thread can work on different data. This type of threading model is called data-parallel.

My visualisation of CUDA was slightly out when reading those pages the first time. So when I call a function like:

encrypt<<<64,64>>>(deviceBuffer, length);

this calls the function, assigning a certain number of blocks and threads to the GPU execution.

Then, in each device function you tell each thread and block what to do…like in my instance, deviceBuffer[idx] = deviceBuffer[idx] ^ cipherKey[idx % 16];

(BTW what does the % 16 do exactly?)

So then, instead of using lots of different variables to iterate through the arrays, I can just use idx to tell the threads what to do? Meaning that if I were to make another loop in that function, would idx still be used as the iterator for the loop?

I think it would be smart to check the SDK examples. There you have a LOT of examples with accompanying normal C code. Start with the reduction example, then go to the scan example, and end with the matrixMul example. When you understand those, you are a long way towards being able to write CUDA code.


A call like functionA<<<1000, 256>>>(…) will indeed make functionA run 256000 times, and to know which of those 256000 executions you are in at the time, you use
int which_thread_is_this = blockDim.x * blockIdx.x + threadIdx.x (in the above case, blockDim.x would be equal to 256).

% 16 is a standard C operator; if I am not mistaken, it is the remainder after dividing by 16.

In your original sample code, you were only using 16 elements in the cipher key. So, in my one-line sample, I tried to imitate that by using mod 16 to keep all cipherKey accesses within the first 16 elements. I really have no idea what you were trying to accomplish with your original code; I just tried to emulate it in CUDA.

Regarding loops: treating the thread index as a "loop" index is one simple way to think about it, though you aren't limited to just that. The reduction sample in particular shows a simple case where the thread index is not just a loop index.

And you can loop within a device function as much as you want. It is just C code, after all. I can't tell you what you can or can't loop over, because it is entirely dependent on what your algorithm requires. For instance, I have one kernel where each thread calculates the force on one particle: in that kernel there is a loop over all neighbours of that particle.

The key thing to keep in mind with CUDA is that, because of its data-parallel nature, you can't have threads writing to the same global memory that other threads are reading from: that causes race conditions, and your results will be incorrect depending on the order the threads are run in. Except for this limitation, the sky is the limit as to what you can do in a kernel function.