gpu - OpenCL matrix transposition with memory coalescing
I'm trying to transpose a matrix in OpenCL using memory coalescing.
I've transposed the matrix in a "simple" way, and that worked fine. When I tried to do the same thing with memory coalescing, hoping to see a little improvement in execution time, my implementation turned out slower than the simple one (the implementation is correct, it's just not efficient). I think I haven't understood how to ensure that horizontally neighboring work-items write to horizontally neighboring addresses.
Here is the kernel of my coalesced implementation:
    __kernel void matrixtranspose(__global const float* matrix,
                                  __global float* matrixtransposed,
                                  uint width, uint height,
                                  __local float* block)
    {
        int2 globalid;
        globalid.x = get_global_id(0);
        globalid.y = get_global_id(1);

        int2 localid;
        localid.x = get_local_id(0);
        localid.y = get_local_id(1);

        /* stage one element per work-item in local memory */
        block[localid.y*get_local_size(0) + localid.x] = matrix[globalid.y*width + globalid.x];

        /* wait until the whole work-group has filled the block */
        barrier(CLK_LOCAL_MEM_FENCE);

        int2 groupid;
        groupid.x = get_group_id(0);
        groupid.y = get_group_id(1);

        int2 localsize;
        localsize.x = get_local_size(0);
        localsize.y = get_local_size(1);

        /* write the staged element to its target position in global memory */
        matrixtransposed[height*(localid.x + groupid.x*localsize.x) + height - (localid.y + groupid.y*localsize.y) - 1] = block[localid.y*localsize.x + localid.x];
    }

I hope you can give me some advice, thank you :)
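Unfortunately, you are going to be bound by the global read and write speed of the device. Normally you transpose a matrix in order to do some calculation on it, and that calculation helps hide the latency. In your example you are reading into local memory, waiting for the barrier, and writing back to global. This only adds the extra step and complexity of using local memory.
You should do something with the data while it is in local memory if you want to hide the global memory latency.
If all you want to do is transpose the matrix, just read from global and write to the target location in global directly. Maybe look into async_work_group_copy if you still want to try using local memory.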
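To make the "read from global, write to global directly" idea concrete, here is a minimal sketch in plain C that emulates it on the host so the index arithmetic can be checked without a device. The names `transpose_item` and `transpose_direct` are made up for illustration. In a real kernel the `x` and `y` parameters would come from `get_global_id(0)` and `get_global_id(1)`, and the two loops would be replaced by the NDRange. Note that this is a plain transpose and does not reproduce the extra `height - y - 1` mirroring from the kernel in the question.

```c
#include <stddef.h>

/* Body of one hypothetical work-item. In an OpenCL kernel, x and y would
   come from get_global_id(0) and get_global_id(1) instead of parameters. */
static void transpose_item(const float *src, float *dst,
                           size_t width, size_t height,
                           size_t x, size_t y)
{
    /* Guard against work-items that fall outside the matrix. */
    if (x < width && y < height) {
        /* src has height rows of width floats;
           dst has width rows of height floats. */
        dst[x * height + y] = src[y * width + x];
    }
}

/* Host-side stand-in for enqueueing a width x height NDRange. */
void transpose_direct(const float *src, float *dst,
                      size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            transpose_item(src, dst, width, height, x, y);
}
```

No local memory and no barrier are involved: each work-item performs one global read and one global write.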
Now for my answer.
Try making a work item responsible for more than a single float. If you read a 4x4 region with one work item, you can transpose it in private memory. This would not only skip the local memory, but also eliminate the need for a barrier and reduce the number of work items you need by a factor of 16.
Steps:
- Calculate the src and dest global memory addresses
- Load 4 float4 values from global
- Transpose the 4x4 floats by swapping their w,x,y,z values accordingly
- Store the 4 float4 values at the new location in global memory
- Handle the edge regions of the matrix in a separate kernel, or in the host program, for matrices whose dimensions are not a multiple of four (or pad your input matrix to make them a multiple of 4)
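The steps above can be sketched as follows. This is again plain C running on the host, not a tuned kernel, and the names `transpose_tile4` and `transpose_4x4` are hypothetical. It assumes `width` and `height` are multiples of 4, so the edge handling from the last step is left out. In an OpenCL kernel, `tx` and `ty` would be `get_global_id(0)` and `get_global_id(1)`, and each row of the small array `t` would be one `float4`, loaded and stored with something like `vload4` and `vstore4`.

```c
#include <stddef.h>

/* Body of one hypothetical work-item that owns a 4x4 tile.
   tx and ty are tile coordinates (assumption: width and height
   are multiples of 4, so every tile is complete). */
static void transpose_tile4(const float *src, float *dst,
                            size_t width, size_t height,
                            size_t tx, size_t ty)
{
    /* Step 1: addresses of the tile's top-left corner in src and dst. */
    const float *in = src + (ty * 4) * width + tx * 4;
    float *out = dst + (tx * 4) * height + ty * 4;

    /* Step 2: load four rows of four floats into private storage. */
    float t[4][4];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            t[r][c] = in[r * width + c];

    /* Step 3: transpose in place by swapping across the diagonal. */
    for (int r = 0; r < 4; ++r)
        for (int c = r + 1; c < 4; ++c) {
            float tmp = t[r][c];
            t[r][c] = t[c][r];
            t[c][r] = tmp;
        }

    /* Step 4: store four rows of four floats at the transposed position. */
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r * height + c] = t[r][c];
}

/* Host-side stand-in for a (width/4) x (height/4) NDRange. */
void transpose_4x4(const float *src, float *dst,
                   size_t width, size_t height)
{
    for (size_t ty = 0; ty < height / 4; ++ty)
        for (size_t tx = 0; tx < width / 4; ++tx)
            transpose_tile4(src, dst, width, height, tx, ty);
}
```

Each emulated work-item touches 16 floats, so the launch needs one sixteenth of the work items, and nothing is shared between them, which is why no barrier appears.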