Designing efficient synchronization becomes more complex as the number of physical and logical cores on today's machines continues to grow. Opinion is divided on which synchronization strategy is more powerful, with established mechanisms, such as locks and atomic primitives, set against emerging technologies, such as transactional memory. We perform an extensive scalability study on many-core systems, evaluating the most widely used synchronization mechanisms in terms of application throughput and operation latency. We show that, from a performance perspective, current best-effort implementations of hardware transactional memory (HTM) are comparable to well-established locking and lock-free mechanisms, and that HTM implementations scale better with the number of threads. We then showcase the ease of use of HTM in real-life applications. Finally, we analyze the impact of simultaneous multithreading (SMT) on HTM performance and propose a new cache replacement strategy that takes into account the transactional state of each cache line and aims to mitigate SMT-induced transactional overflow aborts.
