March 07, 22
Implementation of an RDMA-based Database High-availability Feature for a Persistent-memory-native Database System
Persistent-memory-native Database High-availability Feature
March 7, 2022
Shohei Matsuura
©2022 Yahoo Japan Corporation. All rights reserved.
1. Database high-availability & continuation of business operations
• In case of a database node failure, fail over to another node and continue operations.
[Diagram: a Business Application, labeled "Unable to Continue Business Operations", connects to a Database Source Node (Leo; PMEM holding DB and Log) that suffers a machine or PMEM failure. A Database Replica Node (Leo; PMEM holding DB and Log) is kept up to date through data synchronization with the source node.]
* PMEM: Persistent Memory
* Leo: Persistent-memory-native MySQL storage engine under development by Yahoo Japan Corp.
2. Leo High-availability Feature
• Traditional Source-Replica Architecture
• RDMA-based Data Synchronization/Replication
[Diagram: a Read-Write Application and an optional Read-only Application reach the database through a Load-balancer. The Database Source Node and the Database Replica Node each run Leo with PMEM holding DB and Log, and the two nodes are linked by RDMA-based data sync/replication.]
3. Conventional vs Leo's RDMA Data Synchronization
Conventional (Binlog-based)
• Logical replication
• Syncs data by re-executing SQL statements on replica nodes
• Overhead (1): binlog send/receive
• Overhead (2): re-executing the SQL statements
[Diagram: SQL arrives at the source MySQL thread, which updates the binlog. The dump thread sends the SQL statements over TCP/IP to the replica's IO thread, and the applier thread re-executes them to update the replica. Both nodes hold DB and Log in PMEM.]
Leo RDMA
• Physical replication
• Syncs data without re-executing SQL statements on replica nodes
• (1) Remote memory access through a direct RPMEM write
• (2) No need for re-execution
[Diagram: SQL arrives at the source MySQL thread, which updates data and redo in its own PMEM and writes them directly into the replica's PMEM. The replica applies them to its DB and Log.]
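The difference between logical and physical replication can be illustrated with a toy model. This is only a sketch under stated assumptions: the dictionary "database", the `Statement` and `RedoRecord` tuples, and every function name here are hypothetical and do not reflect Leo's or MySQL's actual data structures. It shows that the logical path makes the replica execute each statement again, while the physical path only copies the resulting after-image.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins: a statement increments a key, a redo record is an after-image.
Statement = Tuple[str, int]    # (key, delta), like "UPDATE ... SET v = v + delta"
RedoRecord = Tuple[str, int]   # (key, new value)


def execute(db: Dict[str, int], stmt: Statement) -> RedoRecord:
    """Run one statement against db and return the physical change it produced."""
    key, delta = stmt
    db[key] = db.get(key, 0) + delta
    return (key, db[key])


def logical_replication(stmts: List[Statement]):
    """Ship statements; the replica re-executes every one of them."""
    source: Dict[str, int] = {}
    replica: Dict[str, int] = {}
    replica_executions = 0
    for stmt in stmts:
        execute(source, stmt)
        execute(replica, stmt)          # re-execution on the replica
        replica_executions += 1
    return source, replica, replica_executions


def physical_replication(stmts: List[Statement]):
    """Ship after-images; the replica only stores the resulting values."""
    source: Dict[str, int] = {}
    replica: Dict[str, int] = {}
    for stmt in stmts:
        key, value = execute(source, stmt)
        replica[key] = value            # apply the redo record, no re-execution
    return source, replica, 0


workload = [("a", 1), ("b", 5), ("a", 2)]
```

Both functions leave the replica identical to the source for this workload; only the amount of statement execution on the replica differs.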
4. Design Ideas Considered for RDMA Data Synchronization
Idea 1: RDMA-Write (one-sided operation) [approach chosen]
• Overview: the source thread writes remotely into the replica's DRAM/PMEM.
• Pros: saves CPU resources on the replica nodes when shipping redo logs from the source node to the replica nodes.
• Cons: consumes more CPU resources on the source node when shipping redo logs to the replica nodes.
Idea 2: RDMA-Read (one-sided operation)
• Overview: the replica thread reads remotely from the source's DRAM/PMEM.
• Pros: saves CPU resources on the source node in redo log shipping.
• Cons: consumes more CPU resources on the replica nodes when reading redo log records from the source node.
Idea 3: RDMA Send-Receive (two-sided operation)
• Overview: the source and replica threads exchange data between their DRAM/PMEM with send-recv.
• Pros: makes event-driven flows easier to implement.
• Cons: adds performance overhead due to send-receive synchronization.
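The CPU trade-off between the three ideas can be made concrete with a small accounting model. This is not RDMA code: the `Node` class, the `cpu_ops` counter, and the three shipping functions are hypothetical, and the counts only record which side has to issue an operation for each redo record, not measured CPU cost.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    log: List[str] = field(default_factory=list)  # stands in for the DRAM/PMEM redo-log area
    cpu_ops: int = 0                              # shipping operations issued by this node's CPU


def rdma_write(source: Node, replica: Node, record: str) -> None:
    """One-sided write: only the source issues an operation."""
    source.cpu_ops += 1
    replica.log.append(record)


def rdma_read(source: Node, replica: Node, record: str) -> None:
    """One-sided read: only the replica issues an operation, pulling from the source log."""
    replica.cpu_ops += 1
    replica.log.append(source.log[-1])


def send_recv(source: Node, replica: Node, record: str) -> None:
    """Two-sided: the source posts a send and the replica posts a matching receive."""
    source.cpu_ops += 1
    replica.cpu_ops += 1
    replica.log.append(record)


def ship_all(method: Callable[[Node, Node, str], None],
             records: List[str]) -> Tuple[int, int, List[str]]:
    """Ship every record with the given method; return (source ops, replica ops, replica log)."""
    source, replica = Node(), Node()
    for record in records:
        source.log.append(record)   # local redo write, not counted as shipping work
        method(source, replica, record)
    return source.cpu_ops, replica.cpu_ops, replica.log


records = ["redo-1", "redo-2", "redo-3"]
```

In this model `ship_all(rdma_write, records)` charges all shipping work to the source, `ship_all(rdma_read, records)` charges it to the replica, and `ship_all(send_recv, records)` charges both sides, which mirrors the pros and cons listed for the three ideas.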
5. Performance Gains by the RDMA Data Synchronization
• Workload: sysbench oltp_write_only (1, 2, 4, 8, 16, 32, 64 threads)
• Environment: 104 CPU cores (2-socket), DRAM: 192GB, PMEM (DB/Log): 1.5TB, RDMA-capable 25G NIC
[Figure: Data Synchronization Method and Transaction Throughput. Series: Conventional (Binary-log Based) and RDMA. x-axis: # Threads (1 to 64); y-axis: Transaction Throughput (Normalized), 0 to 6.]
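A thread sweep like the one in the workload line could be driven by a small helper that builds one sysbench invocation per thread count. This is a sketch, not the benchmark setup used for the slide: the host name `db-source.example` and the 60-second duration are assumptions, and a real run would also need connection credentials and a prepared schema.

```python
from typing import List

THREAD_COUNTS = [1, 2, 4, 8, 16, 32, 64]  # thread counts listed in the workload


def sysbench_command(threads: int,
                     host: str = "db-source.example",  # assumed host name
                     seconds: int = 60) -> List[str]:  # assumed run duration
    """Build the argument list for one oltp_write_only run."""
    return [
        "sysbench", "oltp_write_only",
        f"--threads={threads}",
        f"--mysql-host={host}",
        f"--time={seconds}",
        "run",
    ]


commands = [sysbench_command(n) for n in THREAD_COUNTS]
```

Each entry in `commands` can be passed to `subprocess.run` on a machine where sysbench is installed.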
6. Failover Management with Orchestrator*
Leo-Orchestrator Integration
• Orchestrator (MySQL HA monitoring OSS) handles health checks, topology management, and endpoint (EP) management for the Leo source and the Leo replica.
• Commit flow: (1) COMMIT on the Leo source, (2) redo-log shipping, (3) COMMIT on the Leo replica.
[Diagram: Orchestrator health-checks the Leo source and Leo replica nodes. Each node runs a Leo thread over PMEM that holds DB (data) and Log (redo); redo-log shipping runs from the source to the replica.]
* MySQL HA Monitoring & Failover Management OSS: https://github.com/openark/orchestrator
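The commit flow and the failover step can be sketched together in a few lines. This is an illustrative model only: `LeoNode`, `replicated_commit`, and `FailoverMonitor` are hypothetical names, the monitor does not use Orchestrator's actual API, and the numbered steps are assumed to run in the order shown on the slide.

```python
from typing import Dict, List, Optional, Tuple


class LeoNode:
    """Toy node: a redo log plus the data it describes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.redo: List[Tuple[str, int]] = []
        self.data: Dict[str, int] = {}
        self.healthy = True

    def commit(self, key: str, value: int) -> None:
        self.redo.append((key, value))
        self.data[key] = value


def replicated_commit(source: LeoNode, replica: LeoNode, key: str, value: int) -> None:
    source.commit(key, value)    # (1) COMMIT on the source
    shipped = source.redo[-1]    # (2) redo-log shipping
    replica.commit(*shipped)     # (3) COMMIT on the replica


class FailoverMonitor:
    """Hypothetical stand-in for the monitoring role: health check plus endpoint switch."""

    def __init__(self, source: LeoNode, replica: Optional[LeoNode]) -> None:
        self.source = source
        self.replica = replica
        self.endpoint = source.name

    def health_check(self) -> str:
        """Promote the replica and repoint the endpoint if the source is down."""
        if not self.source.healthy and self.replica is not None:
            self.source, self.replica = self.replica, None
            self.endpoint = self.source.name
        return self.endpoint
```

Because the replica has already committed every shipped redo record, the promoted node serves the same data once the endpoint is switched.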