90-分布式技术原理与算法解析 聂鹏程
https://time.geekbang.org/column/intro/100036401
Keywords
分布式互斥(集中式算法,分布式算法P2P,令牌环算法,两层结构的分布式令牌环算法)
分布式选举(Buddy,Raft(2014),ZAB(2011)) crash-recovery atomic broadcast algorithm
分布式共识(PoW 工作量证明,PoS 权益证明,DPos 委托权益证明)
分布式事务(2PC,3PC,MQ) 刚性事务 ACID 柔性事务 BASE
ACID:2pc,2pc Base:MQ
分布式锁(ZK,DB,Redis) 分布式锁羊群效应
分布式体系结构:集中式(Borg(2015),Mesos(2013),K8S(2014)) Mesos+Marathon(Docker,cgroups)
分布式体系结构:非集中式(Akka Cluster(可容错,去中心化集群管理), Redis Cluster, Cassandra Cluster) akka,redis是去中心化,但是需要选主,而cassandra是p2p,不用选主
分布式调度
单体调度:Bord Scheduler,Hadoop,Spark,K8S
两层调度:Mesos(MMF-最大最小公平算法,DRF-主导资源公平算法),Yarn
共享状态调度:Google Omega,Microsoft Apollo,Hashicorp Nomad
分布式计算 MapReduce(Fork-Join) Stream:Storm,Flink,Spark,yahoo S4,Facebook Puma,Baidu Dstream,Taobao银河流数据处理平台,InfoSphere Stream Actor:Erlang/OTP, Akka,Quasar 流水线:tersorflow
分布式通信 RPC,RMI,Dubbo 消息(P2P,PubSub),Kafka 消息队列 RocketMQ
CAP CA:mysql,oracle CP:Redis,HBase,ZK AP:Eureka,CoachDB,Cassandra,DynamoDB
分布式存储 分布式数据库(结构化):MySQL sharding,Microsoft SQL Azure, Spanner, OceanBase 分布式键值对(半结构化):Redis,Memcache 分布式存储(非结构化):Ceph,GFS,HDFS,Swift(OpenStack)
数据分布(Sharding) Hash 一致性Hash(Cassandra) 带有限负载的一致性Hash(Google Cloud Pub/Sub, HAProxy) 带有虚拟节点的一致性Hash(Memcache)
数据复制 同步复制(MySQL) 异步复制(MySQL默认,Redis Cluster) 半同步复制(ZK CP,Microsoft SQL Azure Cloud SQL Server, ETCD)
MySQL 全同步复制 半同步复制 异步复制 Oracle 最大保护模式 最大性能模式 最大可用性模式
故障检测 固定心跳检测策略 (K,T) 历史心跳信息预测故障策略:phi φ值检测:akka,hz
网络分区处理策略(产生很多分区后,选择一个分区对外服务) 集中式拓扑与非集中式拓扑 Static Quorum Keep Majority(保留子集群节点数w>n/2的集群,如果分区过多,很难从很多分区找出一个>n/2集群) 设置仲裁机制 基于共享资源方式处理网络分区
Akka network partion 发现 stable-after 处理策略 Static Quorum Keep Majority Keep Oldest Keep Referee Down All Lease Majority
Hazelcast network partion Bron–Kerbosch algorithm
OOP vs Aoctor
E-PVM算法
Akka Cluster:面向应用程序平台的分布式集群管理
Spark,Hadoop,Marathon
Docker,Rocket
Memcache Ketama算法
Repcache实现Memchche的复制功能
Notes
吞吐量 qps tps bps
akka虽然有Leader和非Leader节点,但是不影响非集中式结构的节点的平等关系。节点有角色,但是没有控制关系。
去中心化和选主没有必然关系。一般master/slave是集中结构,而leader/非leader是非集中结构。也不一定。
Acotr=状态+行为+消息
Omega是Borg的下一代。
Question
为什么需要分布式系统?
一致性和共识区别?
Wiki
Person
Jeff Dean
Sage Weil: https://en.wikipedia.org/wiki/Sage_Weil
Sanjay Ghemawat
Paper
PacificA: Replication in Log-Based Distributed Storage Systems 2008
Large-scale cluster management at Google with Borg
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing
Omega: flexible, scalable schedulers for large compute clusters
Consistent Hashing with Bounded Loads Google 2017
The φ Accrual Failure Detector
Paper Recommend
理论基础
Time, Clocks, and the Ordering of Events in a Distributed System 1978 Lamport
The byzantine generals problem 1982 LAMPORT
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web 2002 Seth Gilbert Nancy Lynch
CAP Twelve Years Later: How the "Rules" Have Changed 2012 Eric Brewer
Base: An Acid Alternative 2008 Dan Pritchett, Ebay
A simple totally ordered broadcast protocol 2008
Virtual Time and Global States of Distributed Systems 2002
分布式一致性算法
A brief history of Consensus, 2PC and Transaction Commit. 2007
Paxos Made Simple 2001 Lamport
Paxos Made Practical 2007 David Mazieres
Paxos Made Live - An Engineering Perspective 2006 Tushar Deepak Chandra Robert Griesemer Joshua Redstone
In Search of an Understandable Consensus Algorithm(Extended Version) 2014 Diego Ongaro and John Ousterhout
ZooKeeper: Wait-free coordination for Internet-scale systems 2010
Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore 2011
Impossibility of Distributed Consensus with One Faulty Process 1985
Consensus in the Presence of Partial Synchrony 1988
分布式数据结构
Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications 2003
Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems 2001
Kademlia: A Peer-to-Peer Information System Based on the XOR Metric 2002
A Scalable Content-Addressable Network 2001
Ceph: A Scalable, High-Performance Distributed File System 2006 OSDI Sage Weil
The Log-Structured Merge-Tree (LSM-Tree) 1996
HBase: A NoSQL Database 2017 Hiren Patel
Tango: Distributed Data Structures over a Shared Log 2013
分布式系统实战
The Google File System 2003 Sanjay Ghemawat
Bigtable: A Distributed Storage System for Structured Data 2006
The Chubby lock service for loosely-coupled distributed systems 2006
Finding a needle in Haystack: Facebook's photo storage 2010
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency 2011
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing 2012
Scaling Distributed Machine Learning with the Parameter Server 2014
Dremel: Interactive Analysis of Web-Scale Datasets 2010
Pregel: A System for Large-Scale Graph Processing 2013
Spanner: Google's Globally-Distributed Database 2012
Dynamo: Amazon’s Highly Available Key-value Store 2007
S4: Distributed Stream Computing Platform 2010
Storm @Twitter 2015 https://cs.brown.edu/courses/cs227/archives/2015/papers/ss-storm.pdf
Large-scale cluster management at Google with Borg 2015
F1: A Distributed SQL Database That Scales 2013
Cassandra - A Decentralized Structured Storage System 2009
Megastore: Providing Scalable, Highly Available Storage for Interactive Services 2011
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure 2010
Kafka : a Distributed Messaging System for Log Processing 2011
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases 2017
Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes 2018
Product
mongodb bully: https://docs.mongodb.com/manual/core/replica-set-elections/
es bully: https://www.elastic.co/blog/found-leader-election-in-general
es pacificA:https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-replication.html
kafka pacificA: https://kafka.apache.org/documentation/#design_replicatedlog
k8s:https://kubernetes.io/blog/2016/01/simple-leader-election-with-kubernetes/
zk leader selection:https://zookeeper.apache.org/doc/r3.2.2/zookeeperInternals.html
etcd: https://etcd.io/docs/v3.5/learning/why/
Bookeeper https://zookeeper.apache.org/doc/r3.2.2/bookkeeperStarted.html
https://mesos.apache.org/documentation/latest/high-availability/
https://doc.akka.io/docs/akka/current/typed/cluster-concepts.html (Gossip+ Phi Accrual Failure Detector)
https://redis.io/topics/cluster-spec
https://redis.io/topics/sentinel
https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archTOC.html
https://docs.hazelcast.com/imdg/4.2/network-partitioning/partial-network-partitions
Book
Site
https://hazelcast.com/glossary/cap-theorem/
https://docs.hazelcast.com/imdg/4.2/consistency-and-replication/consistency
https://ai.googleblog.com/2017/04/consistent-hashing-with-bounded-loads.html
https://www.metabrew.com/article/libketama-consistent-hashing-algo-memcached-clients
https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html