c++ - The cost of atomic counters and spinlocks on x86(_64)
Preface

I came across some synchronization problems, which led me to spinlocks and atomic counters. Searching a bit more into how these work, I found std::memory_order and memory barriers (mfence, lfence, sfence).

So now it seems I should use acquire/release for the spinlocks and relaxed for the counters.
Some reference

x86 MFENCE - memory fence
x86 LOCK - assert LOCK# signal
Question

What is the machine code (edit: see below) for those 3 operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with the default (seq_cst) memory order and with acquire/release/relaxed (in that order for those 3 operations)? What is the difference (which memory barriers go where) and the cost (how many CPU cycles)?
Purpose

I am wondering how bad my old code is (where no memory order is specified, so seq_cst is used) and whether I should create a class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a spinlock with acquire/release instead of mutexes in some places... or use the Boost library - I have avoided Boost so far).
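For illustration, a minimal sketch of what such a class could look like (my own, hypothetical implementation of the idea above, not code from any existing library):

#include <atomic>

// Hypothetical atomic_counter: derived from std::atomic<int> as suggested above,
// with the increment operators forced to relaxed memory ordering.
// (std::atomic is not really designed as a base class, so a member would also work.)
class atomic_counter : public std::atomic<int>
{
public:
    atomic_counter(int v = 0) : std::atomic<int>(v) {}
    int operator++()    { return fetch_add(1, std::memory_order_relaxed) + 1; } // prefix
    int operator++(int) { return fetch_add(1, std::memory_order_relaxed); }     // postfix
};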
My knowledge

So far I understand that spinlocks protect more than just themselves (some shared resource/memory as well), so there must be something that makes the memory view coherent for multiple threads/cores (that would be the acquire/release and the memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it - it is informative and can be a few cycles old, no problem). There is the lock prefix, and some instructions like xchg have it implicitly. Here my knowledge ends; I don't know how the cache and the buses really work and what goes on behind them (but I know that modern CPUs can reorder instructions, execute them in parallel, and use memory caches and some synchronization). Thank you for any explanation.
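To make that distinction concrete, here is a small sketch of my own (not part of the original question; the names worker and shared_data are made up): a relaxed counter that only counts, next to a spinlock whose acquire/release ordering is what publishes the writes to the data it protects:

#include <atomic>

std::atomic_flag spinlock = ATOMIC_FLAG_INIT;
std::atomic<int> counter(0);
int shared_data = 0;                    // ordinary memory protected by the spinlock

void worker()
{
    counter.fetch_add(1, std::memory_order_relaxed);    // the counter lives by itself

    while (spinlock.test_and_set(std::memory_order_acquire))
        ;                               // acquire: see the previous owner's writes
    ++shared_data;                      // safely modify the protected memory
    spinlock.clear(std::memory_order_release);           // release: publish our writes
}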
P.S.: I only have an old 32-bit PC right now, where I can see lock addl and a simple xchg and nothing else - all versions look the same (except the unlock), so memory_order makes no difference on my old PC (except for the unlock, where release uses a mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on the unlock saves a few cycles, that's all.)
The code:

#include <atomic>
using namespace std;

atomic_flag spinlock = ATOMIC_FLAG_INIT;   // must be explicitly initialized before C++20
atomic<int> counter;

void inc1()    { counter++; }
void inc2()    { counter.fetch_add(1, memory_order_relaxed); }
void lock1()   { while (spinlock.test_and_set()) ; }
void lock2()   { while (spinlock.test_and_set(memory_order_acquire)) ; }
void unlock1() { spinlock.clear(); }
void unlock2() { spinlock.clear(memory_order_release); }

int main()
{
    inc1(); inc2();
    lock1(); unlock1();
    lock2(); unlock2();
}
g++ -std=c++11 -O1 -S (32-bit Cygwin, shortened output):
__Z4inc1v:
__Z4inc2v:
    lock addl $1, _counter      ; both seq_cst and relaxed
    ret
__Z5lock1v:
__Z5lock2v:
    movl  $1, %edx
L5:
    movl  %edx, %eax
    xchgb _spinlock, %al        ; both seq_cst and acquire
    testb %al, %al
    jne   L5
    rep ret
__Z7unlock1v:
    movl  $0, %eax
    xchgb _spinlock, %al        ; seq_cst
    ret
__Z7unlock2v:
    movb  $0, _spinlock         ; release
    ret
Update for x86_64 (note the mfence in unlock1):

_Z4inc1v:
_Z4inc2v:
    lock addl $1, counter(%rip)   ; both seq_cst and relaxed
    ret
_Z5lock1v:
_Z5lock2v:
    movl  $1, %edx
.L5:
    movl  %edx, %eax
    xchgb spinlock(%rip), %al     ; both seq_cst and acquire
    testb %al, %al
    jne   .L5
    ret
_Z7unlock1v:
    movb  $0, spinlock(%rip)
    mfence                        ; seq_cst
    ret
_Z7unlock2v:
    movb  $0, spinlock(%rip)      ; release
    ret
x86 has a mostly strong memory model: all the usual stores/loads have release/acquire semantics implicitly. The exception is SSE non-temporal store operations, which require an sfence to be ordered as usual. All read-modify-write (RMW) instructions with the lock prefix imply a full memory barrier, i.e. seq_cst.
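As an aside, a minimal sketch of that non-temporal-store exception (assuming SSE2 intrinsics from <emmintrin.h>; the buffer and function names are made up for the example):

#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_sfence

alignas(16) int buffer[4];

void publish(__m128i value)
{
    _mm_stream_si128(reinterpret_cast<__m128i*>(buffer), value); // non-temporal store
    _mm_sfence();   // needed so the streaming store is ordered like an ordinary store
}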
Thus on x86, we have:

- test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg, which implies the lock). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with release/acquire fences alone (at the very least you would need a standalone memory barrier, mfence, anyway).
- clear can be coded with lock and (for the bit-wise case) or lock xchg, though more efficient implementations use a plain write (mov) instead of a locked instruction.
- fetch_add is coded with lock add.
Removing the lock prefix does not turn these RMW operations into cheaper relaxed versions - it makes them non-atomic, which is why the relaxed fetch_add above still compiles to lock add. Only plain release/relaxed stores and loads get away with an ordinary mov (see the sketch below).
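A sketch of how that maps to C++ (my own example, assuming GCC/Clang on x86; the exact instructions naturally depend on the compiler and flags):

#include <atomic>

std::atomic_flag flag = ATOMIC_FLAG_INIT;

bool try_lock()
{
    // Still a locked RMW: typically compiles to xchg (which implies lock).
    return !flag.test_and_set(std::memory_order_acquire);
}

void lock()
{
    while (flag.test_and_set(std::memory_order_acquire))  // xchg in a spin loop
        ;
}

void unlock()
{
    flag.clear(std::memory_order_release);   // plain mov store, no locked instruction
}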
In our experience, it does not really matter which exact RMW atomic operation is performed; they all take roughly the same number of cycles to execute (and an mfence costs about 0.5x of a locked operation). You can therefore estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).
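If you want to sanity-check such estimates on your own machine, a rough single-threaded micro-benchmark is enough (a sketch only; absolute numbers vary by CPU, and contention from other cores changes the picture completely):

#include <atomic>
#include <chrono>
#include <cstdio>

int main()
{
    std::atomic<int> c(0);
    const int N = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) c.fetch_add(1);                             // seq_cst
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) c.fetch_add(1, std::memory_order_relaxed);  // relaxed
    auto t2 = std::chrono::steady_clock::now();

    using namespace std::chrono;
    std::printf("seq_cst: %lld ms, relaxed: %lld ms (value %d)\n",
                (long long)duration_cast<milliseconds>(t1 - t0).count(),
                (long long)duration_cast<milliseconds>(t2 - t1).count(),
                c.load());
    return 0;
}

On a typical x86 machine both loops should take about the same time, since both variants compile to the same lock add, which is exactly the point made above.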