newbie in Cuda needs help with 2D arrays

bokibo · March 4, 2018, 4:11pm

Hello everyone. I am newbie in CUDA and I need your help to do my little project I have.
I’m working on implementing AES algorithm in CUDA. So far I did this algorithm in C/C++ using serial execution, and it is working very well.
In that project I used this struct

struct Block {
       char item[4][4];
};

I divided my input (plaintext) into these blocks 4x4, because it was more easier for me to work with it. And it was like this

Block *plaintext;
Block *keys;
Block *ciphertext

But things are not same in CUDA, or maybe they are but I don’t know how to do the same.
My question is. Can I use these Block arrays in CUDA too? And how?
If Block arrays are not possible to use, can I use 2D array somehow?
For example. Instead of using array of 5 Blocks can I use 4x20 array? Like this char[4][20]?
I saw that I can use 1D array, linear array, but it would be much harder to work with.
Hope you could understand me.

njuffa · March 4, 2018, 4:26pm

CUDA is a dialect of C++ (both a subset and a superset), so you can use exactly the same data structures you can use in other C++ code. Whether it is wise to do so from a performance perspective is a different story.

bokibo · March 4, 2018, 4:34pm

What would be better option? And at same time not to hard to implement.
To use this struct or to use 2d char array?

njuffa · March 4, 2018, 4:52pm

Generally speaking, GPUs can’t achieve best possible performance when dealing with ‘char’ data. The native word size of the GPU is 32 bits, and it can also deal efficiently with 64-bit and 128-bit data. You might want to look into using the built-in type ‘char4’ instead of ‘char’, which means ‘Block’ would turn into a 1D-array of four ‘char4’ values. Whether that is easily possible will depend on context.

BTW, I assume that you are using the ‘char’ type here to represent bytes, not text. If so, it would be better to use ‘unsigned char’ (or ‘uchar4’ in CUDA), since the signedness of ‘char’ is implementation-defined in C++, which can lead to nasty surprises when moving code between multiple programming environments.

bokibo · March 4, 2018, 6:01pm

I saw that CUDA has lots of built-in types like uchar4 but I did not think of using it. It would be hard to use that in my case, at least I think. Because there are vertical and horizontal operation in those blocks 4x4.
And one more aggravating thing is that in the end I need do decryption as well. I found lots of examples on internet of AES in CUDA, and OpenCL. But none of them have decryption implemented. And I did decryption in my C/C++ project.
What would be best choice of these structs?

struct Block {
           int item[4][4];
    };
    struct Block {
           int item[16];
    };
    struct Block {
           uchar item[4][4];
    };
    struct Block {
           uchar item[16];
    };

Or I could use uchar16 instead of 4 x uchar4? That would make things easier. I think.

njuffa · March 4, 2018, 7:15pm

To my knowledge there is no ‘uchar16’ type: the available vector types are {char|short|int|long|float|double}{1|2|4}. Take a look at the documentation and/or try to build a miniature test app to double check.

It has been a long time since I last looked at AES, but isn’t the only difference between encryption and decryption in the use of the S-BOX, which is a byte-wise substitution by table look-up?

bokibo · March 4, 2018, 9:51pm

Then uchar16 is in available in OpenCL. I thought it is also in CUDA.
There is more than just S-BOX. There is also inverse Shift rows, inverse MixColumns.

thiagovs23 · March 7, 2018, 3:19pm

Although CUDA is a dialect of C++, the device is only able to recognize one dimension vectors (there are several ways to “fake” a 2D vectors inside the device code, for example with cudaMallocPitch). My sugestion is that you make a vector of one dimension and inside the for loop use the following strategy for the index:

for (int i = 0; i < number_of_rows; i++) {
for (int j = 0; j < number_of_columns; j++) {
index = j + (i * number_of_columns);
printf(“%c”, item[index]);
}
}

using this strategy you have an abstraction of a 2D matrix but it still 1D

Robert_Crovella · March 7, 2018, 3:31pm

Not sure what you mean by that. It looks like a false statement to me.

It’s entirely possible to set up doubly-subscripted or even multiply-subscripted access of data on the device.

bokibo · March 9, 2018, 6:06pm

Can someone help me with this code? I have problem in kernel “testKernel”.
when I call function “subbytes” for 2nd time I get error about copying from device to host.
But when i comment lines after syncthread in kernel, everything is ok. What is the problem? I dont have a clue.
First code is kernel.cu, second is consts.h If some is using visual studio with cuda toolkit.

#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cuda.h>
#include <device_functions.h>
#include <cuda_runtime_api.h>
#include <cstdlib>
#include <fstream>
#include <stdio.h>
#include "consts.h"

using namespace std;
typedef struct {
	unsigned int item[4][4];
} Block;

__constant__ unsigned int Sbox_dev[256] =
{
	0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5, 0x30, 0x01, 0x67, 0x2B, 0xFE, 0xD7, 0xAB, 0x76 ,
	0xCA, 0x82, 0xC9, 0x7D, 0xFA, 0x59, 0x47, 0xF0, 0xAD, 0xD4, 0xA2, 0xAF, 0x9C, 0xA4, 0x72, 0xC0 ,
	0xB7, 0xFD, 0x93, 0x26, 0x36, 0x3F, 0xF7, 0xCC, 0x34, 0xA5, 0xE5, 0xF1, 0x71, 0xD8, 0x31, 0x15 ,
	0x04, 0xC7, 0x23, 0xC3, 0x18, 0x96, 0x05, 0x9A, 0x07, 0x12, 0x80, 0xE2, 0xEB, 0x27, 0xB2, 0x75 ,
	0x09, 0x83, 0x2C, 0x1A, 0x1B, 0x6E, 0x5A, 0xA0, 0x52, 0x3B, 0xD6, 0xB3, 0x29, 0xE3, 0x2F, 0x84 ,
	0x53, 0xD1, 0x00, 0xED, 0x20, 0xFC, 0xB1, 0x5B, 0x6A, 0xCB, 0xBE, 0x39, 0x4A, 0x4C, 0x58, 0xCF ,
	0xD0, 0xEF, 0xAA, 0xFB, 0x43, 0x4D, 0x33, 0x85, 0x45, 0xF9, 0x02, 0x7F, 0x50, 0x3C, 0x9F, 0xA8 ,
	0x51, 0xA3, 0x40, 0x8F, 0x92, 0x9D, 0x38, 0xF5, 0xBC, 0xB6, 0xDA, 0x21, 0x10, 0xFF, 0xF3, 0xD2 ,
	0xCD, 0x0C, 0x13, 0xEC, 0x5F, 0x97, 0x44, 0x17, 0xC4, 0xA7, 0x7E, 0x3D, 0x64, 0x5D, 0x19, 0x73 ,
	0x60, 0x81, 0x4F, 0xDC, 0x22, 0x2A, 0x90, 0x88, 0x46, 0xEE, 0xB8, 0x14, 0xDE, 0x5E, 0x0B, 0xDB ,
	0xE0, 0x32, 0x3A, 0x0A, 0x49, 0x06, 0x24, 0x5C, 0xC2, 0xD3, 0xAC, 0x62, 0x91, 0x95, 0xE4, 0x79 ,
	0xE7, 0xC8, 0x37, 0x6D, 0x8D, 0xD5, 0x4E, 0xA9, 0x6C, 0x56, 0xF4, 0xEA, 0x65, 0x7A, 0xAE, 0x08 ,
	0xBA, 0x78, 0x25, 0x2E, 0x1C, 0xA6, 0xB4, 0xC6, 0xE8, 0xDD, 0x74, 0x1F, 0x4B, 0xBD, 0x8B, 0x8A ,
	0x70, 0x3E, 0xB5, 0x66, 0x48, 0x03, 0xF6, 0x0E, 0x61, 0x35, 0x57, 0xB9, 0x86, 0xC1, 0x1D, 0x9E ,
	0xE1, 0xF8, 0x98, 0x11, 0x69, 0xD9, 0x8E, 0x94, 0x9B, 0x1E, 0x87, 0xE9, 0xCE, 0x55, 0x28, 0xDF ,
	0x8C, 0xA1, 0x89, 0x0D, 0xBF, 0xE6, 0x42, 0x68, 0x41, 0x99, 0x2D, 0x0F, 0xB0, 0x54, 0xBB, 0x16
};

__constant__ unsigned int MixCol_dev[4][4] =
{
	{ 0x02,0x03,0x01,0x01 },
	{ 0x01,0x02,0x03,0x01 },
	{ 0x01,0x01,0x02,0x03 },
	{ 0x03,0x01,0x01,0x02 }
};

__host__ long file_length(const char* filename) {
	FILE * f = fopen(filename, "r");
	long length;
	if (f)
	{
		fseek(f, 0, SEEK_END);
		length = ftell(f);
		fclose(f);
		return length;
	}
	else
		return 0;
}

__host__ Block * key_scheduling() {
	Block *keys = new Block[11];
	//initial key
	unsigned int key[4][4] = {
		{ 0x54, 0x73, 0x20, 0x67 },
		{ 0x68, 0x20, 0x4b, 0x20 },
		{ 0x61, 0x6d, 0x75, 0x46 },
		{ 0x74, 0x79, 0x6e, 0x75 } };
	/*
	char key[4][4] = {
	{ 'k', 'l', 'j', 'u' },
	{ 'c', ' ', 'z', 'a' },
	{ ' ', 'a', 'e', 's' },
	{ ' ', '1', '2', '8' } };
	*/
	for (int i = 0; i < 4; i++)
	{
		for (int j = 0; j < 4; j++)
		{
			keys[0].item[i][j] = key[i][j];
		}
	}

	// key scheduling algorithm
	for (int k = 1; k <= 10; k++) {
		Block tempNew;
		Block tempOld = keys[k - 1];
		unsigned int temp[4] = { tempOld.item[0][3], tempOld.item[1][3], tempOld.item[2][3], tempOld.item[3][3] }; //last column of first key
		//ROTWORD
		unsigned int t = temp[0];
		temp[0] = temp[1];
		temp[1] = temp[2];
		temp[2] = temp[3];
		temp[3] = t;
		//SUBBYTES
		//cout << endl << "3." << endl;
		for (int i = 0; i < 4; i++) {
			int x = (temp[i] >> 4) & 0xf;
			int y = temp[i] & 0xf;
			temp[i] = Sbox[x][y];
		}
		unsigned int temp2[4] = { tempOld.item[0][0], tempOld.item[1][0], tempOld.item[2][0], tempOld.item[3][0] }; //first column of first key
		//xor second column and temp and Rcon 1st round
		for (int i = 0; i < 4; i++) {
			temp2[i] = temp[i] ^ tempOld.item[i][0];
			temp2[i] = temp2[i] ^ Rcon[i][k - 1];
		}
		for (int i = 0; i < 4; i++)  //first column of 2nd key
			tempNew.item[i][0] = temp2[i];

		for (int j = 1; j < 4; j++) {
			for (int i = 0; i < 4; i++)
			{
				tempNew.item[i][j] = (tempNew.item[i][j - 1] ^ tempOld.item[i][j]);
			}
		}
		keys[k] = tempNew;
	} //end of key scheduling
	return keys;
}

__host__ Block* plaintext_initialization(unsigned int* source, long source_length, int num_of_blocks) {
	Block * plaintext = new Block[num_of_blocks];
	int t = 0;
	int i = 0;
	int j = 0;
	int k = 0;
	Block temp;
	while (i < source_length) {
		temp.item[j][k] = source[i];
			j++;
			i++;
			if (j == 4) {
				k++;
				j = 0;
			}
			if (i % 16 == 0) {
				plaintext[t] = temp;
				t++;
				j = 0;
				k = 0;
			}
	}
	return plaintext;
}


__host__ unsigned int* readFileAsInt(const char* filename)
{
	unsigned int* source;
	char c;
	long length;
	FILE * f = fopen(filename, "r");
	if (f)
	{
		fseek(f, 0, SEEK_END);
		length = ftell(f);
		fseek(f, 0, SEEK_SET);
		source = (unsigned int*)calloc(length, sizeof(unsigned int));
		int i = 0;
		do {
			c = fgetc(f);
			source[i] = c;
			i++;
		} while (c != EOF);
		fclose(f);
	}
	return source;
}

__device__ Block addRound(Block key, Block plaintext) {
	//Block result;
	for (size_t j = 0; j < 4; j++)
	{
		for (size_t k = 0; k < 4; k++)
		{
			key.item[j][k] = plaintext.item[j][k] ^ key.item[j][k];
		}
	}
	return key;
}

__device__ Block subBytes(Block in) {
	//Block out;
	for (size_t j = 0; j < 4; j++)
	{
		for (size_t k = 0; k < 4; k++)
		{
			in.item[j][k] = Sbox_dev[in.item[j][k]];
		}
	}
	return in;
}

__device__ Block shiftRows(Block in) {
	char temp;
	// Rotate first row 1 columns to left
	temp = in.item[1][0];
	in.item[1][0] = in.item[1][1];
	in.item[1][1] = in.item[1][2];
	in.item[1][2] = in.item[1][3];
	in.item[1][3] = temp;
	// Rotate second row 2 columns to left
	temp = in.item[2][0];
	in.item[2][0] = in.item[2][2];
	in.item[2][2] = temp;
	temp = in.item[2][1];
	in.item[2][1] = in.item[2][3];
	in.item[2][3] = temp;
	// Rotate third row 3 columns to left
	temp = in.item[3][0];
	in.item[3][0] = in.item[3][3];
	in.item[3][3] = in.item[3][2];
	in.item[3][2] = in.item[3][1];
	in.item[3][1] = temp;
	return in;
}

__device__ Block mixColumns(Block in)
{
	char nn[4][4];
	for (int i = 0; i < 4; ++i)
	{
		for (int j = 0; j < 4; ++j)
		{
			char tt[4];
			for (int k = 0; k < 4; ++k)
			{
				char mix_temp;
				if ((char)0x01 == MixCol_dev[i][k])
				{
					mix_temp = in.item[k][j];
				}
				else if ((char)0x02 == MixCol_dev[i][k])
				{
					mix_temp = in.item[k][j] << 1;
					int msb = ((in.item[k][j] & 0x80) >> 7) & 0x01;
					if (msb == 1)
					{
						mix_temp ^= 0x1b;
					}
				}
				else if ((char)0x03 == MixCol_dev[i][k])
				{
					mix_temp = in.item[k][j] << 1;
					int msb = ((in.item[k][j] & 0x80) >> 7) & 0x01;
					if (msb == 1)
					{
						mix_temp ^= 0x1b;
					}
					mix_temp ^= in.item[k][j];
				}
				tt[k] = mix_temp;
			}
			char tempc = tt[0] ^ tt[1] ^ tt[2] ^ tt[3];
			nn[i][j] = tempc;
		}
	}
	for (int i = 0; i < 4; i++)
	{
		for (int j = 0; j < 4; j++)
		{
			in.item[i][j] = nn[i][j];
		}

	}
	return in;
}

__device__ void printBlock(Block b) {
	for (size_t i = 0; i < 4; i++)
	{
		for (size_t j = 0; j < 4; j++)
		{
			printf(" %0x ", b.item[i][j]);
		}
		printf("\n");
	}
}

__global__ void testKernel(Block *keys, Block *plaintext, Block  *ciphertext, unsigned int num_of_blocks)
{
	int i = threadIdx.x;
	//initial round
	ciphertext[i] = addRound(keys[0], plaintext[i]);
	//1st round
	ciphertext[i] = subBytes(ciphertext[i]);
	ciphertext[i] = shiftRows(ciphertext[i]);
	ciphertext[i] = mixColumns(ciphertext[i]);
	ciphertext[i] = addRound(keys[1], ciphertext[i]);
	
	//__syncthreads();
	
	ciphertext[i] = subBytes(ciphertext[i]); //PROBLEM IS HERE
	ciphertext[i] = shiftRows(ciphertext[i]);
	ciphertext[i] = mixColumns(ciphertext[i]);
	ciphertext[i] = addRound(keys[2], ciphertext[i]);
}

int main()
{
	Block *keys = key_scheduling();
	long plaintext_length = file_length("test.txt");
	int num_of_blocks = (plaintext_length % 16 == 0) ? plaintext_length / 16 : plaintext_length / 16 + 1;
	unsigned int* source = readFileAsInt("test.txt");
	Block *plaintext = plaintext_initialization(source, plaintext_length, num_of_blocks);

	cudaError_t cudaStatus;
	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
	}

	Block *keys_dev;
	cudaStatus = cudaMalloc((void**)&keys_dev, 11 * sizeof(Block));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}

	Block *plaintext_dev;
	cudaStatus = cudaMalloc((void**)&plaintext_dev, num_of_blocks * sizeof(Block));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}

	Block *ciphertext_dev;
	cudaStatus = cudaMalloc((void**)&ciphertext_dev, num_of_blocks * sizeof(Block));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}

	cudaStatus = cudaMemcpy(keys_dev, keys, 11 * sizeof(Block), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "keys cudaMemcpy failed!");
		goto Error;
	}
	cudaStatus = cudaMemcpy(plaintext_dev, plaintext, num_of_blocks * sizeof(Block), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "plaintext cudaMemcpy failed!");
		goto Error;
	}
	Block* ciphertext = new Block[num_of_blocks];

	cudaStatus = cudaMemcpy(ciphertext_dev, ciphertext, num_of_blocks * sizeof(Block), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "ciphertext cudaMemcpy failed!");
		goto Error;
	}
	testKernel << < 1, num_of_blocks >> > (keys_dev, plaintext_dev, ciphertext_dev, num_of_blocks);
	cudaStatus = cudaMemcpy(ciphertext, ciphertext_dev, num_of_blocks * sizeof(Block), cudaMemcpyDeviceToHost);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "ciphertext to host cudaMemcpy failed!");
		goto Error;
	}
	printf("\n---------------------\n");
	for (size_t j = 0; j < 4; j++)
	{
		for (size_t k = 0; k < 4; k++)
		{
			printf(" %0x ", ciphertext[0].item[j][k] & 0xff);
		}
	}

Error:
	cudaFree(keys_dev);
	cudaFree(plaintext_dev);
	cudaFree(ciphertext_dev);
	getchar();
	return 0;
}

#pragma once

const unsigned int Sbox[16][16] =
{
	{ 0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5, 0x30, 0x01, 0x67, 0x2B, 0xFE, 0xD7, 0xAB, 0x76 },
	{ 0xCA, 0x82, 0xC9, 0x7D, 0xFA, 0x59, 0x47, 0xF0, 0xAD, 0xD4, 0xA2, 0xAF, 0x9C, 0xA4, 0x72, 0xC0 },
	{ 0xB7, 0xFD, 0x93, 0x26, 0x36, 0x3F, 0xF7, 0xCC, 0x34, 0xA5, 0xE5, 0xF1, 0x71, 0xD8, 0x31, 0x15 },
	{ 0x04, 0xC7, 0x23, 0xC3, 0x18, 0x96, 0x05, 0x9A, 0x07, 0x12, 0x80, 0xE2, 0xEB, 0x27, 0xB2, 0x75 },
	{ 0x09, 0x83, 0x2C, 0x1A, 0x1B, 0x6E, 0x5A, 0xA0, 0x52, 0x3B, 0xD6, 0xB3, 0x29, 0xE3, 0x2F, 0x84 },
	{ 0x53, 0xD1, 0x00, 0xED, 0x20, 0xFC, 0xB1, 0x5B, 0x6A, 0xCB, 0xBE, 0x39, 0x4A, 0x4C, 0x58, 0xCF },
	{ 0xD0, 0xEF, 0xAA, 0xFB, 0x43, 0x4D, 0x33, 0x85, 0x45, 0xF9, 0x02, 0x7F, 0x50, 0x3C, 0x9F, 0xA8 },
	{ 0x51, 0xA3, 0x40, 0x8F, 0x92, 0x9D, 0x38, 0xF5, 0xBC, 0xB6, 0xDA, 0x21, 0x10, 0xFF, 0xF3, 0xD2 },
	{ 0xCD, 0x0C, 0x13, 0xEC, 0x5F, 0x97, 0x44, 0x17, 0xC4, 0xA7, 0x7E, 0x3D, 0x64, 0x5D, 0x19, 0x73 },
	{ 0x60, 0x81, 0x4F, 0xDC, 0x22, 0x2A, 0x90, 0x88, 0x46, 0xEE, 0xB8, 0x14, 0xDE, 0x5E, 0x0B, 0xDB },
	{ 0xE0, 0x32, 0x3A, 0x0A, 0x49, 0x06, 0x24, 0x5C, 0xC2, 0xD3, 0xAC, 0x62, 0x91, 0x95, 0xE4, 0x79 },
	{ 0xE7, 0xC8, 0x37, 0x6D, 0x8D, 0xD5, 0x4E, 0xA9, 0x6C, 0x56, 0xF4, 0xEA, 0x65, 0x7A, 0xAE, 0x08 },
	{ 0xBA, 0x78, 0x25, 0x2E, 0x1C, 0xA6, 0xB4, 0xC6, 0xE8, 0xDD, 0x74, 0x1F, 0x4B, 0xBD, 0x8B, 0x8A },
	{ 0x70, 0x3E, 0xB5, 0x66, 0x48, 0x03, 0xF6, 0x0E, 0x61, 0x35, 0x57, 0xB9, 0x86, 0xC1, 0x1D, 0x9E },
	{ 0xE1, 0xF8, 0x98, 0x11, 0x69, 0xD9, 0x8E, 0x94, 0x9B, 0x1E, 0x87, 0xE9, 0xCE, 0x55, 0x28, 0xDF },
	{ 0x8C, 0xA1, 0x89, 0x0D, 0xBF, 0xE6, 0x42, 0x68, 0x41, 0x99, 0x2D, 0x0F, 0xB0, 0x54, 0xBB, 0x16 }
};


const unsigned int InvSbox[16][16] =
{
	{ 0x52, 0x09, 0x6A, 0xD5, 0x30, 0x36, 0xA5, 0x38, 0xBF, 0x40, 0xA3, 0x9E, 0x81, 0xF3, 0xD7, 0xFB },
	{ 0x7C, 0xE3, 0x39, 0x82, 0x9B, 0x2F, 0xFF, 0x87, 0x34, 0x8E, 0x43, 0x44, 0xC4, 0xDE, 0xE9, 0xCB },
	{ 0x54, 0x7B, 0x94, 0x32, 0xA6, 0xC2, 0x23, 0x3D, 0xEE, 0x4C, 0x95, 0x0B, 0x42, 0xFA, 0xC3, 0x4E },
	{ 0x08, 0x2E, 0xA1, 0x66, 0x28, 0xD9, 0x24, 0xB2, 0x76, 0x5B, 0xA2, 0x49, 0x6D, 0x8B, 0xD1, 0x25 },
	{ 0x72, 0xF8, 0xF6, 0x64, 0x86, 0x68, 0x98, 0x16, 0xD4, 0xA4, 0x5C, 0xCC, 0x5D, 0x65, 0xB6, 0x92 },
	{ 0x6C, 0x70, 0x48, 0x50, 0xFD, 0xED, 0xB9, 0xDA, 0x5E, 0x15, 0x46, 0x57, 0xA7, 0x8D, 0x9D, 0x84 },
	{ 0x90, 0xD8, 0xAB, 0x00, 0x8C, 0xBC, 0xD3, 0x0A, 0xF7, 0xE4, 0x58, 0x05, 0xB8, 0xB3, 0x45, 0x06 },
	{ 0xD0, 0x2C, 0x1E, 0x8F, 0xCA, 0x3F, 0x0F, 0x02, 0xC1, 0xAF, 0xBD, 0x03, 0x01, 0x13, 0x8A, 0x6B },
	{ 0x3A, 0x91, 0x11, 0x41, 0x4F, 0x67, 0xDC, 0xEA, 0x97, 0xF2, 0xCF, 0xCE, 0xF0, 0xB4, 0xE6, 0x73 },
	{ 0x96, 0xAC, 0x74, 0x22, 0xE7, 0xAD, 0x35, 0x85, 0xE2, 0xF9, 0x37, 0xE8, 0x1C, 0x75, 0xDF, 0x6E },
	{ 0x47, 0xF1, 0x1A, 0x71, 0x1D, 0x29, 0xC5, 0x89, 0x6F, 0xB7, 0x62, 0x0E, 0xAA, 0x18, 0xBE, 0x1B },
	{ 0xFC, 0x56, 0x3E, 0x4B, 0xC6, 0xD2, 0x79, 0x20, 0x9A, 0xDB, 0xC0, 0xFE, 0x78, 0xCD, 0x5A, 0xF4 },
	{ 0x1F, 0xDD, 0xA8, 0x33, 0x88, 0x07, 0xC7, 0x31, 0xB1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xEC, 0x5F },
	{ 0x60, 0x51, 0x7F, 0xA9, 0x19, 0xB5, 0x4A, 0x0D, 0x2D, 0xE5, 0x7A, 0x9F, 0x93, 0xC9, 0x9C, 0xEF },
	{ 0xA0, 0xE0, 0x3B, 0x4D, 0xAE, 0x2A, 0xF5, 0xB0, 0xC8, 0xEB, 0xBB, 0x3C, 0x83, 0x53, 0x99, 0x61 },
	{ 0x17, 0x2B, 0x04, 0x7E, 0xBA, 0x77, 0xD6, 0x26, 0xE1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0C, 0x7D }
};

const unsigned int Rcon[4][10] =
{
	{ 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36 },
	{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
	{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
	{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }
};

Topic		Replies	Views
Annoying problems with memory and/or syntax CUDA Programming and Performance	19	4765	April 8, 2008
multi dimension array CUDA Programming and Performance	26	32756	February 12, 2010
Help a newbie with CUDA development! CUDA Programming and Performance	16	14547	May 1, 2008
An Easy Introduction to CUDA C and C++ Technical Blog	48	1095	July 19, 2018
How to modify the Cuda output CUDA Programming and Performance	0	444	May 21, 2018
Inconsistent Cuda Errors CUDA Programming and Performance	10	1050	July 7, 2017
Simple/1st CUDA program: Reverse bits in byte Why is it faster on the CPU? CUDA Programming and Performance	11	7107	December 6, 2007
problem compiling cuda code HELP CUDA Programming and Performance	5	3811	July 21, 2009
Cuda program taking more time. CUDA Programming and Performance	15	7054	November 21, 2010
Cuda code performance CUDA Programming and Performance	14	3066	December 16, 2014

newbie in Cuda needs help with 2D arrays

Related topics