polish docs/en/atomic_instructions.md

gejun, 6 years ago · commit f89afd389f

1 changed file with 11 additions and 11 deletions: docs/en/atomic_instructions.md

@@ -33,9 +33,9 @@ A related programming trap is false sharing: Accesses to infrequently updated or
 
 Just atomic counting cannot synchronize accesses to resources; simple structures that seem correct, such as a [spinlock](https://en.wikipedia.org/wiki/Spinlock) or [reference counting](https://en.wikipedia.org/wiki/Reference_counting), may crash as well. The key is **instruction reordering**, which may change the order of reads/writes and move later instructions ahead of earlier ones when there are no dependencies between them. Both the [compiler](http://preshing.com/20120625/memory-ordering-at-compile-time/) and the [CPU](https://en.wikipedia.org/wiki/Out-of-order_execution) may reorder.
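
To see why ordering matters for such structures, here is a minimal spinlock sketch (an illustration, not code from brpc): `test_and_set` needs acquire semantics and `clear` needs release semantics, otherwise accesses inside the critical section could be reordered outside of it.

```c++
#include <atomic>

// Minimal spinlock sketch. Acquiring with memory_order_acquire keeps the
// critical section from floating above the lock; releasing with
// memory_order_release keeps it from sinking below the unlock. With relaxed
// ordering in both places, the protected accesses could leak out of the lock.
class SpinLock {
public:
    void lock() {
        while (_locked.test_and_set(std::memory_order_acquire)) {
            // spin until the previous holder clears the flag
        }
    }
    void unlock() {
        _locked.clear(std::memory_order_release);
    }
private:
    std::atomic_flag _locked = ATOMIC_FLAG_INIT;
};
```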
 
-The motivation is natural: CPU wants to fill each cycle with instructions and execute as many as possible instructions within given time. As above section says, an instruction for loading memory may cost hundreds of nanoseconds for synchronizing the cacheline. A efficient solution to synchronize multiple cachelines is to move them simultaneously rather than one-by-one. Thus modifications to multiple variables by a thread may be visible to another thread in a different order. On the other hand, different threads need different data, synchronizing on-demand is reasonable and may also change order between cachelines.
+The motivation is natural: the CPU wants to fill every cycle with instructions and execute as many instructions as possible within a given time. As mentioned in the above section, an instruction loading memory may cost hundreds of nanoseconds due to cacheline synchronization. An efficient way to synchronize multiple cachelines is to move them simultaneously rather than one by one, thus modifications to multiple variables made by one thread may become visible to another thread in a different order. On the other hand, different threads need different data, so synchronizing on demand is reasonable and may also change the order between cachelines.
 
-If the first variable plays the role of switch, controlling accesses to following variables. When these variables are synchronized to other CPU cores, new values may be visible in a different order, and the first variable may not be the first one updated, which causes other threads to think that the following variables are still valid, which are actually deleted, causing undefined behavior. For example:
+For example: the first variable plays the role of a switch, controlling accesses to the following variables. When these variables are synchronized to other CPU cores, the new values may become visible in a different order, and the switch variable may not be the first one updated, so other threads may think that the following variables are still valid while they are actually not, causing undefined behavior. Check the code snippet below:
 
 ```c++
 // Thread 1
@@ -50,14 +50,14 @@ if (ready) {
     p.bar();
 }
 ```
-From a human perspective, the code is correct because thread2 only accesses `p` when `ready` is true which means p is initialized according to logic in thread1. But the code may not run as expected on multi-core machines:
+From a human perspective, the code is correct because thread2 only accesses `p` when `ready` is true, which means `p` has been initialized according to the logic in thread1. But the code may not run as expected on multi-core machines:
 
 - `ready = true` in thread1 may be reordered before `p.init()` by the compiler or the CPU, making thread2 see an uninitialized `p` when `ready` is true. The same reordering may happen in thread2 as well: some instructions in `p.bar()` may be reordered to before the check of `ready`.
-- Even if the above reordering did not happen, cachelines of `ready` and `p` may be synchronized independently to the CPU core that thread2 runs, making thread2 see unitialized `p` when `ready` is true.
+- Even if the above reorderings do not happen, the cachelines of `ready` and `p` may be synchronized independently to the CPU core that thread2 runs on, making thread2 see an uninitialized `p` when `ready` is true.
 
 Note: On x86/x64, `load` has acquire semantics and `store` has release semantics by default, so the code above may run correctly provided that reordering by the compiler is turned off.
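
A sketch of what "compiler reordering turned off" could look like in practice (GCC/Clang idiom; `Foo` is a hypothetical stand-in for `p`'s type, not from the original snippet):

```c++
#include <atomic>

struct Foo { void init() {} void bar() {} };  // hypothetical stand-in for p's type
Foo p;
bool ready = false;

void thread1() {
    p.init();
    // Compiler barrier (GCC/Clang): forbids the compiler from moving the store
    // to `ready` above p.init(). It emits no CPU fence and relies on x86/x64
    // plain stores already having release semantics.
    asm volatile("" ::: "memory");
    // C++11 alternative: std::atomic_signal_fence(std::memory_order_release);
    ready = true;
}
```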
 
-With this simple example, you may get a glimpse of the complexity of programming using atomic instructions. In order to solve the reordering issue, CPU and compiler offer [memory fences](http://en.wikipedia.org/wiki/Memory_barrier) to let programmers decide the visibility order between instructions. boost and C++11 conclude memory fence into following types:
+With this simple example, you may get a glimpse of the complexity of atomic instructions. To solve the reordering issue, CPUs and compilers offer [memory fences](http://en.wikipedia.org/wiki/Memory_barrier) to let programmers decide the visibility order between instructions. boost and C++11 classify memory fences into the following types:
 
 | memory order         | Description                              |
 | -------------------- | ---------------------------------------- |
@@ -86,22 +86,22 @@ if (ready.load(std::memory_order_acquire)) {
 
 The acquire fence in thread2 matches the release fence in thread1, making thread2 see all memory operations before the release fence in thread1 when thread2 sees `ready` being set to true.
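
The fenced snippet referred to here is elided by the diff; a minimal reconstruction of the release/acquire pairing (names assumed to match the earlier snippet) looks like:

```c++
#include <atomic>

struct Foo { void init() {} void bar() {} };  // hypothetical stand-in
Foo p;
std::atomic<bool> ready(false);

// Thread1
void producer() {
    p.init();
    // Release: all writes before this store become visible to any thread that
    // later reads `ready` as true with an acquire load.
    ready.store(true, std::memory_order_release);
}

// Thread2
void consumer() {
    // Acquire: pairs with the release store above.
    if (ready.load(std::memory_order_acquire)) {
        p.bar();
    }
}
```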
 
-Note that memory fence does not guarantee visibility. Even if thread2 reads `ready` just after thread1 sets it to true, thread2 is not guaranteed to see the new value, because cache synchronization takes time. Memory fence guarantees order of visibilities: "If I see the new value of a, I should see the new value of b as well".
+Note that a memory fence does not guarantee visibility: even if thread2 reads `ready` just after thread1 sets it to true, thread2 is not guaranteed to see the new value, because cache synchronization takes time. What a memory fence guarantees is the ordering between visibilities: "if I see the new value of a, then I should see the new value of b as well".
 
-A related problem is that: How to know whether a value is new or old? Two cases in generally:
+A related problem: how to know whether a value is updated or not? There are generally two cases:
 
 - The value is special. In the above example, `ready=true` is a special value: once `ready` is true, `p` is ready. Whether the special value is observed or not is meaningful in itself.
 - Increasing-only values. Some situations do not have special values, but we can use instructions like `fetch_add` to increase a variable monotonically. As long as the value range is large enough, new values differ from old ones for a long period of time, so we can distinguish them from each other (see the sketch after this list).
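
A hedged sketch of the "increasing-only value" case (names are made up for illustration): a 64-bit version is bumped with `fetch_add` on every update, and a reader compares against the version it last saw; the huge value range makes new values distinguishable from old ones for a very long time.

```c++
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> g_version(0);

// Writer: bump the version after modifying the data it guards.
void publish_update() {
    // ... modify the data ...
    g_version.fetch_add(1, std::memory_order_release);
}

// Reader: a version different from the last one it remembered means the data
// has been republished since then.
bool changed_since(uint64_t last_seen) {
    return g_version.load(std::memory_order_acquire) != last_seen;
}
```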
 
-More examples can be found in [boost.atomic](http://www.boost.org/doc/libs/1_56_0/doc/html/atomic/usage_examples.html). Official descriptions of atomic can be found [here](http://en.cppreference.com/w/cpp/atomic/atomic).
+More examples can be found in [boost.atomic](http://www.boost.org/doc/libs/1_56_0/doc/html/atomic/usage_examples.html). Official descriptions of atomics can be found [here](http://en.cppreference.com/w/cpp/atomic/atomic).
 
 # wait-free & lock-free
 
-Atomic instructions provide two important properties: [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) and [lock-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom). Wait-free means no matter how OS schedules, all threads are doing useful jobs; lock-free, which is weaker than wait-free, means no matter how OS schedules, at least one thread is doing useful jobs. If locks are used, the thread holding the lock might be swapped out by OS, in which case all threads trying to hold the lock are blocked. So code using locks are neither lock-free nor wait-free. To make tasks done within given time, critical paths in real-time OS is at least lock-free. Miscellaneous online services inside Baidu also pose serious restrictions to running time. If the critical path in brpc is wait-free or lock-free, many services are benefited by better and stable QoS. Actually, both read(in the sense of even dispatching) and write in brpc are wait-free, check [IO](io.md) for more.
+Atomic instructions provide two important properties: [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) and [lock-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom). Wait-free means that no matter how the OS schedules threads, all threads are doing useful jobs; lock-free, which is weaker than wait-free, means that no matter how the OS schedules threads, at least one thread is doing useful jobs. If locks are used, the thread holding the lock might be paused by the OS, in which case all threads trying to acquire the lock are blocked, so code using locks is neither lock-free nor wait-free. To finish tasks within a given time, critical paths in real-time OSes are at least lock-free. Various online services inside Baidu also impose serious restrictions on running time. If the critical path in brpc is wait-free or lock-free, many services benefit from better and more stable QoS. Actually, both read (in the sense of event dispatching) and write in brpc are wait-free; check [IO](io.md) for details.
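
A rough illustration of the difference (not from the original doc): a counter bumped with `fetch_add` is wait-free, while a CAS retry loop is only lock-free, since an individual thread may retry indefinitely even though some thread always succeeds.

```c++
#include <atomic>

std::atomic<int> counter(0);

// Wait-free: completes in a bounded number of steps for every thread,
// regardless of how the OS schedules the others.
void add_wait_free(int n) {
    counter.fetch_add(n, std::memory_order_relaxed);
}

// Lock-free but not wait-free: each round of the loop is won by some thread,
// but a particular thread may keep retrying if others always win the race.
void add_lock_free(int n) {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + n,
                                          std::memory_order_relaxed)) {
        // on failure, `expected` is reloaded with the current value; retry
    }
}
```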
 
 Note that it is common to think that wait-free or lock-free algorithms are faster, which may not be true, because:
 
 - More complex race conditions and ABA problems must be handled in lock-free and wait-free algorithms, which means the code is often much more complicated than the one using locks. More code, more running time.
-- Mutex solves contentions by backoff, which means that when contention happens, another way is chosen to avoid the contention temporarily. Threads failed to lock a mutex are put into sleep, making the thread holding the mutex complete the task or even following several tasks exclusively, which may increase the overall throughput.
+- Mutexes solve contention by backoff: when contention happens, another branch is entered to avoid the contention temporarily. Threads that fail to lock a mutex are put to sleep, letting the thread holding the mutex complete the current task, or even several follow-up tasks, exclusively, which may increase the overall throughput.
 
-Low performance caused by mutex is either because of too large critical sections (which limit the concurrency), or too heavy contentions (overhead of context switches becomes dominating). The real value of lock-free/wait-free algorithms is that they guarantee progress of one thread or all threads, rather than absolutely high performance. Of course lock-free/wait-free algorithms perform better in some situations: if an algorithm is implemented by just one or two atomic instructions, it's probably faster than the one using mutex which is alos implemented by atomic instructions.
+Low performance caused by mutexes comes either from overly large critical sections (which limit concurrency) or from too heavy contention (where the overhead of context switches becomes dominant). The real value of lock-free/wait-free algorithms is that they guarantee progress of the system rather than absolutely higher performance. Of course, lock-free/wait-free algorithms may perform better in some situations: if an algorithm is implemented with just one or two atomic instructions, it is probably faster than the counterpart using a mutex, which needs more atomic instructions in total.