C++ - Confusing Memory Reordering Behavior


I am trying to run a simple task (obtaining the x2APIC ID of the current processor) on every available hardware thread. I wrote the following code to do this, which works on the machines I tested it on (see here for the complete MWE, compilable on Linux as C++11 code).

    // cpuid() and check() are defined in the complete MWE.
    void print_x2apic_id() {
        uint32_t r1, r2, r3, r4;
        std::tie(r1, r2, r3, r4) = cpuid(11, 0);
        std::cout << r4 << std::endl;
    }

    int main() {
        const auto _ = std::ignore;
        auto nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
        auto set = ::cpu_set_t{};
        std::cout << "processors online: " << nprocs << std::endl;

        for (auto i = 0; i != nprocs; ++i) {
            // Pin the thread to CPU i, then query that CPU.
            CPU_SET(i, &set);
            check(::sched_setaffinity(0, sizeof(::cpu_set_t), &set));
            CPU_CLR(i, &set);
            print_x2apic_id();
        }
    }

Output on one machine (when compiled with g++, version 4.9.0):

0 2 4 6 32 34 36 38 

Each iteration printed a different x2APIC ID, so things are working as expected. This is where the problems start. I replaced the call to print_x2apic_id with the following code:

    uint32_t r4;
    std::tie(_, _, _, r4) = cpuid(11, 0);
    std::cout << r4 << std::endl;

This causes the same ID to be printed on each iteration of the loop:

36 36 36 36 36 36 36 36 

My guess as to what happened is that the compiler decided that the call to cpuid does not depend on the loop iteration (even though it does). The compiler then "optimized" the code by hoisting the call to cpuid outside the loop. To try to fix this, I converted r4 into an atomic:

    std::atomic<uint32_t> r4;
    std::tie(_, _, _, r4) = cpuid(11, 0);
    std::cout << r4 << std::endl;

This failed to fix the problem. Surprisingly, the following does fix the problem:

    std::atomic<uint32_t> r1;
    uint32_t r2, r3, r4;
    std::tie(r1, r2, r3, r4) = cpuid(11, 0);
    std::cout << r4 << std::endl;

... OK, now I'm confused.

Edit: Replacing the asm statement in the cpuid function with asm volatile also fixes the issue, but I don't see how this should be necessary.

My questions

  1. Shouldn't inserting an acquire fence before the call to cpuid, and a release fence after the call to cpuid, be sufficient to prevent the compiler from performing memory reordering?
  2. Why didn't converting r4 to std::atomic<uint32_t> work? And why did storing the first three outputs into r1, r2, and r3 instead of ignoring them cause the program to work?
  3. How can I correctly write the loop, using the least amount of synchronization necessary?

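I've reproduced the problem with optimization enabled (-O). You are right to suspect a compiler optimization. The cpuid instruction itself serves as a full memory barrier (for the processor); the issue is that the compiler generates code that does not call the cpuid function inside the loop, because it treats it as a constant function: a plain asm statement is assumed to depend only on its declared inputs, and those inputs (11 and 0) are the same on every iteration. asm volatile prevents the compiler from performing this optimization by telling it that the statement has side effects.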
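The fix can be sketched as follows. This is a minimal illustration, assuming GCC or Clang on x86-64 Linux. The `cpuid` wrapper is a hypothetical stand-in, since the MWE's own version is not shown in this post, and `collect_x2apic_ids` is an illustrative name that does not appear in the original code.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t, CPU_* macros
#include <unistd.h>  // sysconf

#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the MWE's cpuid(); x86 with GCC or Clang only.
// `volatile` tells the compiler that the statement has side effects, so it
// may not treat the asm as a pure function of (leaf, subleaf) and hoist it
// out of a loop or merge repeated calls.
inline std::tuple<std::uint32_t, std::uint32_t, std::uint32_t, std::uint32_t>
cpuid(std::uint32_t leaf, std::uint32_t subleaf) {
    std::uint32_t eax, ebx, ecx, edx;
    asm volatile("cpuid"
                 : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                 : "a"(leaf), "c"(subleaf));
    return std::make_tuple(eax, ebx, ecx, edx);
}

// Pins the calling thread to each online CPU in turn and records EDX of
// leaf 11 (the x2APIC ID on processors that support that leaf).
// CPUs the process is not allowed to run on are skipped.
inline std::vector<std::uint32_t> collect_x2apic_ids() {
    std::vector<std::uint32_t> ids;

    // Remember the current affinity mask so it can be restored afterwards.
    cpu_set_t original;
    CPU_ZERO(&original);
    const bool have_original =
        ::sched_getaffinity(0, sizeof(original), &original) == 0;

    const long nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
    for (long i = 0; i < nprocs; ++i) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(static_cast<int>(i), &set);
        if (::sched_setaffinity(0, sizeof(set), &set) != 0) {
            continue;  // for example, a CPU excluded by a cgroup cpuset
        }
        std::uint32_t r4 = 0;
        // Ignoring the first three outputs is fine once the asm is volatile.
        std::tie(std::ignore, std::ignore, std::ignore, r4) = cpuid(11, 0);
        ids.push_back(r4);
    }

    if (have_original) {
        ::sched_setaffinity(0, sizeof(original), &original);
    }
    return ids;
}
```

Calling `collect_x2apic_ids()` and printing each element reproduces the loop from the question. No fences or atomics are involved: the only change that matters is the `volatile` qualifier on the asm statement.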

See this answer for details: https://stackoverflow.com/a/14449998/2527797

