Texture Upload speed optimization on Tegra 3 in OpenGL ES

I am hoping that someone has some ideas on how to improve texture upload speeds on Tegra 3 h/w. Compared to other chipsets from the same period (2012-2013), the Tegra 3 doesn’t seem to perform very well.

Below are the results of tests on an Asus Transformer (with a Tegra 3 GPU) and a Samsung Tab 2 (PowerVR SGX540), which appears to run better. The results show performance both standalone and under load (in game).

#####################################################
Asus TF300T 1GB RAM [1280 x 760] Quad Core
GPU: Nvidia Tegra 3 T30L, CPU: ARM v7
#####################################################
[b]
                      Ave FPS   Ave (ms/frame)   Ave Upload (ms)   Min Upload (ms)
[/b]
3xGL_ALPHA, STANDALONE
[512x512 texture]     46.7      10.78            17.23             5.86

3xGL_ALPHA, in GAME
[512x512 texture]     11.2      72.13            49.01             22.29
[256x256 texture]     16.6      42.41            40.38             23.16

GL_RGBA, in GAME
[512x512 texture]     10.6      68.08            42.73             15.05
[256x256 texture]     11.9      50.57            40.95             23.00
							

#####################################################
Samsung Tab 2 P3100 1GB RAM [1024 x 600] Dual Core
GPU: PowerVR SGX540, CPU: ARM v7
#####################################################
[b]
                      Ave FPS   Ave (ms/frame)   Ave Upload (ms)   Min Upload (ms)
[/b]
3xGL_ALPHA, STANDALONE
[512x512 texture]     54.9      5.59             7.73              1.86

3xGL_ALPHA, in GAME
[512x512 texture]     35.8      7.16             6.63              3.75
[256x256 texture]     34.9      4.47             4.42              1.4

GL_RGBA, in GAME
[512x512 texture]     16.1      42.37            12.65             8.97
[256x256 texture]     31.8      10.43            4.72              3.20
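
For a rough sense of scale, the upload times can be turned into an effective bandwidth figure. The sketch below is illustrative only: it assumes each GL_ALPHA plane is full size at one byte per texel and that the "Ave Upload" figure covers all three uploads, neither of which the tables confirm. `uploadMiBPerSec` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts an upload of `bytes` bytes that took
// `uploadMs` milliseconds into an effective rate in MiB per second.
double uploadMiBPerSec(std::size_t bytes, double uploadMs) {
    assert(uploadMs > 0.0);  // a zero or negative duration is meaningless here
    const double mib = static_cast<double>(bytes) / (1024.0 * 1024.0);
    return mib / (uploadMs / 1000.0);
}
```

Under those assumptions, `uploadMiBPerSec(3 * 512 * 512, 17.23)` for the Tegra 3 standalone case works out to roughly 43.5 MiB/s, and `uploadMiBPerSec(3 * 512 * 512, 7.73)` for the Tab 2 to roughly 97 MiB/s.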

I am uploading 3 textures, each using the following OpenGL call:

glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, textures->width, textures->height, GL_ALPHA, GL_UNSIGNED_BYTE, static_cast<char *>(bp_src->YPlane.Buffer));

As you can see, packing the three textures into a single RGBA texture (there is room for an alpha channel but it’s not used yet) helps upload speed a bit, but not enough to offset having to copy the data from 3 separate buffers into one block of memory for a single glTexSubImage2D call.
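
For reference, the packing step looks roughly like the sketch below. It is a minimal version, assuming three equally sized 8-bit planes; the function name `packPlanesToRGBA`, the parameter names, and the constant alpha of 255 are illustrative rather than taken from the actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: interleaves three equally sized 8-bit planes into one
// RGBA buffer (plane0 -> R, plane1 -> G, plane2 -> B, A fixed at 255).
// `out` is passed in so the same allocation can be reused every frame.
void packPlanesToRGBA(const std::uint8_t* plane0,
                      const std::uint8_t* plane1,
                      const std::uint8_t* plane2,
                      std::size_t pixelCount,
                      std::vector<std::uint8_t>& out) {
    assert(plane0 != nullptr && plane1 != nullptr && plane2 != nullptr);
    out.resize(pixelCount * 4);  // 4 bytes per RGBA texel
    std::uint8_t* dst = out.data();
    for (std::size_t i = 0; i < pixelCount; ++i) {
        dst[0] = plane0[i];
        dst[1] = plane1[i];
        dst[2] = plane2[i];
        dst[3] = 255;  // alpha channel is unused for now
        dst += 4;
    }
}
```

The packed buffer (`out.data()`) then goes to a single glTexSubImage2D call with GL_RGBA and GL_UNSIGNED_BYTE. This per-texel copy on the CPU is the extra cost that eats into the gain from making one upload call instead of three.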

The main point is that these texture upload speeds are horribly slow. The Samsung Tab 2 speeds are bearable but still slow compared to newer h/w. Is there something I can do to optimize this, or some workaround for devices with such slow uploads?