gpu - OpenCL matrix transposition with memory coalescing

May 15, 2010

I'm trying to transpose a matrix in OpenCL using memory coalescing.

I've already transposed the matrix in the "simple" way, and that worked fine. When I tried to do the same thing with memory coalescing, hoping to see a little improvement in execution time, my implementation turned out slower than the simple implementation (the implementation is correct, it's just not efficient). I think I haven't understood how to ensure that horizontally neighboring work-items write to horizontally neighboring addresses.

Here is the kernel for my coalesced implementation:

    __kernel void matrixtranspose(__global const float* matrix,
                                  __global float* matrixtransposed,
                                  uint width, uint height,
                                  __local float* block)
    {
        int2 globalid;
        globalid.x = get_global_id(0);
        globalid.y = get_global_id(1);

        int2 localid;
        localid.x = get_local_id(0);
        localid.y = get_local_id(1);

        // Stage one element per work-item in local memory.
        block[localid.y*get_local_size(0) + localid.x] = matrix[globalid.y*width + globalid.x];

        // Wait until the whole work-group has filled the block.
        barrier(CLK_LOCAL_MEM_FENCE);

        int2 groupid;
        groupid.x = get_group_id(0);
        groupid.y = get_group_id(1);
        int2 localsize;
        localsize.x = get_local_size(0);
        localsize.y = get_local_size(1);

        // Write the staged element to its target position in global memory.
        matrixtransposed[height*(localid.x + groupid.x*localsize.x) + height - (localid.y + groupid.y*localsize.y) - 1] = block[localid.y*localsize.x + localid.x];
    }

I hope someone can give me some advice. Thank you :)

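Unfortunately, you are going to be bound by the global read and write speed of the device. Normally you transpose a matrix in order to do some calculation on it, and that calculation helps hide the latency. In your example you are reading into local memory, waiting at the barrier, and writing back to global memory. This only adds an extra step and the complexity of using local memory.

You should do something with the data while it is in local memory if you want to hide the global memory latency.

If all you want to do is transpose the matrix, just read from global memory and write to the target location in global memory directly. Maybe look into async_work_group_copy if you still want to try using local memory.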
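A minimal sketch of the direct global-to-global approach might look like the following. It assumes row-major storage and a global size of at least (width, height) with one work item per element. The kernel name transposedirect is made up for illustration, and it performs a plain transpose without the extra `height - ... - 1` mirroring from the question's kernel.

```opencl
// Sketch only. Assumptions: row-major storage, one work item per element,
// global size >= (width, height). The kernel name is hypothetical.
__kernel void transposedirect(__global const float* matrix,
                              __global float* matrixtransposed,
                              uint width, uint height)
{
    const uint x = get_global_id(0);  // column in the source matrix
    const uint y = get_global_id(1);  // row in the source matrix

    // Guard in case the global size was rounded up past the matrix edge.
    if (x < width && y < height) {
        // Element (y, x) of the source becomes element (x, y) of the
        // result, whose rows are `height` elements long.
        matrixtransposed[x * height + y] = matrix[y * width + x];
    }
}
```

With this indexing, work items that are neighbors along dimension 0 read neighboring addresses, while their writes are `height` elements apart.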

Now for my answer.

Try making each work item responsible for more than a single float. If you read a 4x4 region with a work item, you can transpose it in private memory. This would not only skip local memory, but also eliminate the need for a barrier and reduce the number of work items you need by a factor of 16.

Steps:

Calculate the src and dest global memory addresses
Load 4 float4 values from global memory
Transpose the 4x4 floats by swapping their w, x, y, z values accordingly
Store the 4 float4 values at the new location in global memory
Handle the edge regions of the matrix in a separate kernel, or in the host program, for matrices whose dimensions are not multiples of four (or pad your input matrix to make it a multiple of 4)
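The first four steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tuned implementation: storage is row-major, width and height are both multiples of 4 (so the edge handling from the last step is left out), and the global size is (width / 4, height / 4). The kernel name transpose4x4 and the variable names are hypothetical, and the kernel performs a plain transpose without the mirroring term from the question's kernel.

```opencl
// Sketch only. Assumptions: row-major storage, width and height are
// multiples of 4, global size is (width / 4, height / 4).
// The kernel name transpose4x4 is hypothetical.
__kernel void transpose4x4(__global const float* matrix,
                           __global float* matrixtransposed,
                           uint width, uint height)
{
    // Step 1: source and destination addresses of this work item's tile.
    const uint tilex = get_global_id(0);  // tile column in the source
    const uint tiley = get_global_id(1);  // tile row in the source
    __global const float* src = matrix + (tiley * 4) * width + tilex * 4;
    __global float* dst = matrixtransposed + (tilex * 4) * height + tiley * 4;

    // Step 2: load the four source rows of the tile as float4 values.
    const float4 r0 = vload4(0, src);
    const float4 r1 = vload4(0, src + width);
    const float4 r2 = vload4(0, src + 2 * width);
    const float4 r3 = vload4(0, src + 3 * width);

    // Step 3: transpose in private memory; each output row collects one
    // component (x, y, z or w) from all four input rows.
    const float4 c0 = (float4)(r0.x, r1.x, r2.x, r3.x);
    const float4 c1 = (float4)(r0.y, r1.y, r2.y, r3.y);
    const float4 c2 = (float4)(r0.z, r1.z, r2.z, r3.z);
    const float4 c3 = (float4)(r0.w, r1.w, r2.w, r3.w);

    // Step 4: store the four rows; rows of the result are `height` long.
    vstore4(c0, 0, dst);
    vstore4(c1, 0, dst + height);
    vstore4(c2, 0, dst + 2 * height);
    vstore4(c3, 0, dst + 3 * height);
}
```

Since each work item covers 16 elements, the host would enqueue this with a global size of width / 4 by height / 4, and no local memory argument or barrier is involved.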
