c++ - The cost of atomic counters and spinlocks on x86(_64)
Preface

I came across some synchronization problems, which led me to spinlocks and atomic counters. Searching a bit more into how these work, I found std::memory_order and memory barriers (mfence, lfence, sfence).

So now it seems I should use acquire/release for the spinlocks and relaxed for the counters.
Some reference

x86 MFENCE - memory fence
x86 LOCK - assert LOCK# signal
Question

What is the machine code (edit: see below) for those 3 operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with the default (seq_cst) memory order and with acquire/release/relaxed (in that order for those 3 operations)? What is the difference (which memory barriers go where) and the cost (how many CPU cycles)?
Purpose

I am wondering how bad my old code is (where no memory order is specified, so seq_cst is used) and whether I should create a class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a spinlock with acquire/release instead of mutexes in some places... or use the Boost library - I have avoided Boost so far).
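For illustration, a minimal sketch of what such a class could look like (my own, hypothetical implementation of the idea above, not code from any existing library):

#include <atomic>

// Hypothetical atomic_counter: derived from std::atomic<int> as suggested above,
// with the increment operators forced to relaxed memory ordering.
// (std::atomic is not really designed as a base class, so a member would also work.)
class atomic_counter : public std::atomic<int>
{
public:
    atomic_counter(int v = 0) : std::atomic<int>(v) {}
    int operator++()    { return fetch_add(1, std::memory_order_relaxed) + 1; } // prefix
    int operator++(int) { return fetch_add(1, std::memory_order_relaxed); }     // postfix
};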
My knowledge

So far I understand that spinlocks protect more than just themselves (some shared resource/memory as well), so there must be something that makes the memory view coherent for multiple threads/cores (that would be the acquire/release and the memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it - it is informative and can be a few cycles old, no problem). There is the lock prefix, and some instructions like xchg have it implicitly. Here my knowledge ends; I don't know how the cache and the buses really work and what goes on behind them (but I know that modern CPUs can reorder instructions, execute them in parallel, and use memory caches and some synchronization). Thank you for any explanation.
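To make that distinction concrete, here is a small sketch of my own (not part of the original question; the names worker and shared_data are made up): a relaxed counter that only counts, next to a spinlock whose acquire/release ordering is what publishes the writes to the data it protects:

#include <atomic>

std::atomic_flag spinlock = ATOMIC_FLAG_INIT;
std::atomic<int> counter(0);
int shared_data = 0;                    // ordinary memory protected by the spinlock

void worker()
{
    counter.fetch_add(1, std::memory_order_relaxed);    // the counter lives by itself

    while (spinlock.test_and_set(std::memory_order_acquire))
        ;                               // acquire: see the previous owner's writes
    ++shared_data;                      // safely modify the protected memory
    spinlock.clear(std::memory_order_release);           // release: publish our writes
}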
P.S.: I only have an old 32-bit PC right now, where I can see lock addl and a simple xchg and nothing else - all versions look the same (except the unlock), so memory_order makes no difference on my old PC (except for the unlock, where release uses a mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on the unlock saves a few cycles, that's all.)
The code:

#include <atomic>
using namespace std;

atomic_flag spinlock = ATOMIC_FLAG_INIT;   // must be explicitly initialized before C++20
atomic<int> counter;

void inc1()    { counter++; }
void inc2()    { counter.fetch_add(1, memory_order_relaxed); }
void lock1()   { while (spinlock.test_and_set()) ; }
void lock2()   { while (spinlock.test_and_set(memory_order_acquire)) ; }
void unlock1() { spinlock.clear(); }
void unlock2() { spinlock.clear(memory_order_release); }

int main()
{
    inc1(); inc2();
    lock1(); unlock1();
    lock2(); unlock2();
}
g++ -std=c++11 -O1 -S (32-bit Cygwin, shortened output):
__Z4inc1v:
__Z4inc2v:
    lock addl $1, _counter      ; both seq_cst and relaxed
    ret
__Z5lock1v:
__Z5lock2v:
    movl  $1, %edx
L5:
    movl  %edx, %eax
    xchgb _spinlock, %al        ; both seq_cst and acquire
    testb %al, %al
    jne   L5
    rep ret
__Z7unlock1v:
    movl  $0, %eax
    xchgb _spinlock, %al        ; seq_cst
    ret
__Z7unlock2v:
    movb  $0, _spinlock         ; release
    ret
Update for x86_64 (note the mfence in unlock1):

_Z4inc1v:
_Z4inc2v:
    lock addl $1, counter(%rip)   ; both seq_cst and relaxed
    ret
_Z5lock1v:
_Z5lock2v:
    movl  $1, %edx
.L5:
    movl  %edx, %eax
    xchgb spinlock(%rip), %al     ; both seq_cst and acquire
    testb %al, %al
    jne   .L5
    ret
_Z7unlock1v:
    movb  $0, spinlock(%rip)
    mfence                        ; seq_cst
    ret
_Z7unlock2v:
    movb  $0, spinlock(%rip)      ; release
    ret
x86 has a mostly strong memory model: all the usual stores/loads have release/acquire semantics implicitly. The exception is SSE non-temporal store operations, which require an sfence to be ordered as usual. All read-modify-write (RMW) instructions with the lock prefix imply a full memory barrier, i.e. seq_cst.
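As an aside, a minimal sketch of that non-temporal-store exception (assuming SSE2 intrinsics from <emmintrin.h>; the buffer and function names are made up for the example):

#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_sfence

alignas(16) int buffer[4];

void publish(__m128i value)
{
    _mm_stream_si128(reinterpret_cast<__m128i*>(buffer), value); // non-temporal store
    _mm_sfence();   // needed so the streaming store is ordered like an ordinary store
}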
Thus on x86, we have:

- test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg, which implies the lock). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with release/acquire fences alone (at the very least you would need a standalone memory barrier, mfence, anyway).
- clear can be coded with lock and (for the bit-wise case) or lock xchg, though more efficient implementations use a plain write (mov) instead of a locked instruction.
- fetch_add is coded with lock add.
Removing the lock prefix does not turn these RMW operations into cheaper relaxed versions - it makes them non-atomic, which is why the relaxed fetch_add above still compiles to lock add. Only plain release/relaxed stores and loads get away with an ordinary mov (see the sketch below).
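A sketch of how that maps to C++ (my own example, assuming GCC/Clang on x86; the exact instructions naturally depend on the compiler and flags):

#include <atomic>

std::atomic_flag flag = ATOMIC_FLAG_INIT;

bool try_lock()
{
    // Still a locked RMW: typically compiles to xchg (which implies lock).
    return !flag.test_and_set(std::memory_order_acquire);
}

void lock()
{
    while (flag.test_and_set(std::memory_order_acquire))  // xchg in a spin loop
        ;
}

void unlock()
{
    flag.clear(std::memory_order_release);   // plain mov store, no locked instruction
}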
In our experience, it does not really matter which exact RMW atomic operation is performed; they all take roughly the same number of cycles to execute (and an mfence costs about 0.5x of a locked operation). You can therefore estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).
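If you want to sanity-check such estimates on your own machine, a rough single-threaded micro-benchmark is enough (a sketch only; absolute numbers vary by CPU, and contention from other cores changes the picture completely):

#include <atomic>
#include <chrono>
#include <cstdio>

int main()
{
    std::atomic<int> c(0);
    const int N = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) c.fetch_add(1);                             // seq_cst
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) c.fetch_add(1, std::memory_order_relaxed);  // relaxed
    auto t2 = std::chrono::steady_clock::now();

    using namespace std::chrono;
    std::printf("seq_cst: %lld ms, relaxed: %lld ms (value %d)\n",
                (long long)duration_cast<milliseconds>(t1 - t0).count(),
                (long long)duration_cast<milliseconds>(t2 - t1).count(),
                c.load());
    return 0;
}

On a typical x86 machine both loops should take about the same time, since both variants compile to the same lock add, which is exactly the point made above.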