High-Availability Cluster Principles in Detail
Resource stickiness:
Resource constraints (Constraint):
Colocation constraint (colocation):
whether resources may run on the same node
score:
positive value: the resources may run together
negative value: the resources must not run together
Location constraint (location), score:
positive value: the resource prefers this node
negative value: the resource prefers to stay away from this node
Order constraint (order):
defines the order in which resources are started or stopped
e.g. vip, ipvs
ipvs –> vip
-inf: negative infinity
inf: positive infinity
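As an illustration, the three constraint types can be written in pacemaker's crm shell roughly like this (a minimal sketch; the resource names webip and webserver are hypothetical):
crm configure location webserver-prefers-node1 webserver 100: node1.magedu.com
crm configure colocation webserver-with-webip inf: webserver webip
crm configure order webip-before-webserver inf: webip webserver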
Resource fencing:
node level: STONITH
resource level:
e.g. an FC SAN switch can deny a node's access at the storage-resource level
STONITH:
split-brain: arises when cluster nodes can no longer reliably obtain the status of the other nodes
one possible consequence: both sides seize the shared storage
active/active: high availability
HA Cluster Principles: Shared Storage
IDE (ATA): 133MB/s
SATA: 600MB/s
7200 rpm
IOPS: ~100
SCSI: 320MB/s
SAS:
15000 rpm
IOPS: ~200
USB 3.0: ~400MB/s
Mechanical (HDD):
random read/write
sequential read/write
Solid state (SSD):
IDE, SCSI: parallel interfaces
SATA, SAS, USB: serial interfaces
DAS:
Direct Attached Storage
attached directly to the mainboard bus (BUS)
block-level access: the host's own filesystem maps files to blocks
NAS:
Network Attached Storage
a file server: file-level access
SAN:
Storage Area network
a dedicated storage-area network
FC SAN
IP SAN: iSCSI
SCSI: Small Computer System Interface
HA Cluster Principles: Multi-Node Clusters
crm: brings high availability to services that are not themselves HA-aware; a resource agent (RA) is itself just a script.
Resource stickiness: how strongly a resource prefers a given node, defined via score
Resource constraints:
location: a resource's degree of preference for a node
colocation: dependencies between resources
order: the sequence in which actions are taken on resources
Heartbeat v1's built-in resource manager:
haresources
Heartbeat v2's built-in resource managers:
haresources
crm
Heartbeat v3: the crm resource manager grew into an independent project, pacemaker
Resource Type:
primitive: a primary resource; runs on only one node at any given time
clone: may run on multiple nodes at once
group: gathers several primitives into a group; generally contains only primitives
master/slave: e.g. drbd, which runs on exactly two nodes
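A sketch of the four types in pacemaker's crm shell (the resource and agent names webip, webserver, dlm, and res-drbd are illustrative, assumed to be defined elsewhere):
crm configure primitive webip ocf:heartbeat:IPaddr params ip=172.16.100.1
crm configure group webservice webip webserver
crm configure clone cl-dlm dlm
crm configure ms ms-drbd res-drbd meta master-max=1 clone-max=2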
RA: Resource Agent
RA Classes:
Legacy heartbeat v1 RA
LSB (/etc/rc.d/init.d/) Linux Standard Base
OCF (Open Cluster Framework)
pacemaker
linbit (drbd)
STONITH: manages hardware STONITH devices
Fencing levels:
node level
STONITH
resource level
FC SAN switch
STONITH devices
1. Power Distribution Units (PDU)
Power Distribution Units are an essential element in managing power capacity and functionality for critical network, server and data center equipment. They can provide remote load monitoring of connected equipment and individual outlet power control for remote power recycling.
2. Uninterruptible Power Supplies (UPS)
A stable power supply provides emergency power to connected equipment by supplying power from a separate source in the event of utility power failure.
3. Blade Power Control Devices
If you are running a cluster on a set of blades, then the power control device in the blade enclosure is the only candidate for fencing. Of course, this device must be capable of managing single blade computers.
4. Lights-out Devices
Lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular and may even become standard in off-the-shelf computers. However, they are inferior to UPS devices, because they share a power supply with their host (a cluster node). If a node stays without power, the device supposed to control it would be just as useless. In that case, the CRM would continue its attempts to fence the node indefinitely while all other resource operations would wait for the fencing/STONITH operation to complete.
5. Testing Devices
Testing devices are used exclusively for testing purposes. They are usually more gentle on the hardware. Once the cluster goes into production, they must be replaced with real fencing devices.
stonithd
stonithd is a daemon which can be accessed by local processes or over the network. It accepts the commands which correspond to fencing operations: reset, power-off, and power-on. It can also check the status of the fencing device.
The stonithd daemon runs on every node in the CRM HA cluster. The stonithd instance running on the DC node receives a fencing request from the CRM. It is up to this and other stonithd programs to carry out the desired fencing operation.
STONITH Plug-ins
For every supported fencing device there is a STONITH plug-in which is capable of controlling said device. A STONITH plug-in is the interface to the fencing device.
On each node, all STONITH plug-ins reside in /usr/lib/stonith/plugins (or in /usr/lib64/stonith/plugins for 64-bit architectures). All STONITH plug-ins look the same to stonithd, but are quite different on the other side reflecting the nature of the fencing device.
Some plug-ins support more than one device. A typical example is ipmilan (or external/ipmi) which implements the IPMI protocol and can control any device which supports this protocol.
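The heartbeat-stonith package also ships a stonith(8) command-line tool that can be used to explore the installed plug-ins, e.g. (required parameters vary per plug-in):
stonith -L                    # list all available STONITH plug-in types
stonith -t external/ipmi -n   # show the configuration names the ipmi plug-in expects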
Heartbeat: UDP port 694
HA Clusters: Installing and Configuring heartbeat
Two nodes: 172.16.100.6, 172.16.100.7
vip: 172.16.100.1
1. The two nodes must be able to communicate with each other
2. Configure the hostname: hostname node1.magedu.com; verify with uname -n
To make it permanent: vim /etc/sysconfig/network → HOSTNAME=node1.magedu.com
3. Set up SSH mutual trust between the two nodes (a sketch follows):
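A minimal sketch, run on node1 and then repeated from node2 toward node1:
ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2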
4. Configure hostname resolution:
vim /etc/hosts
172.16.100.6 node1.magedu.com node1
172.16.100.7 node2.magedu.com node2
Disable iptables
5. Time synchronization
ntpdate 172.16.0.1
service ntpd stop
chkconfig ntpd off
To keep the clocks in sync going forward: crontab -e
*/5 * * * * /sbin/ntpdate 172.16.0.1 &> /dev/null
scp /var/spool/cron/root node2:/var/spool/cron/
Packages (from EPEL):
heartbeat – Heartbeat subsystem for High-Availability Linux
heartbeat-devel – Heartbeat development package
heartbeat-gui – Provides a gui interface to manage heartbeat clusters
heartbeat-ldirectord – Monitor daemon for maintaining high availability resources; for HA ipvs it auto-generates the ipvs rules and health-checks the backend real servers
heartbeat-pils – Provides a general plugin and interface loading library
heartbeat-stonith – Provides an interface to Shoot The Other Node In The Head
http://dl.fedoraproject.org/pub/epel/5/i386/repoview/letter_h.group.html
Three configuration files:
1. the authentication key file, mode 600: authkeys
2. the heartbeat service configuration: ha.cf
3. the resource management configuration file:
haresources
vim authkeys
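A minimal authkeys sketch (the key string is a placeholder; generate your own, e.g. with openssl rand -hex 8):
auth 2
2 sha1 0c8bd534e8f1ab32
then enforce the required permissions: chmod 600 /etc/ha.d/authkeys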
vim ha.cf
logfacility local0
keepalive 1
node node1.magedu.com
node node2.magedu.com
ping 172.16.0.1
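Note that ha.cf must also declare at least one heartbeat medium (bcast, mcast, or ucast, as listed later in these notes), e.g.:
bcast eth0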
vim haresources — make sure the httpd service is not running, and chkconfig httpd off
node1.magedu.com IPaddr::172.16.100.1/16/eth0 httpd
Test: browse to 172.16.100.1
Simulate a failure of 172.16.100.6; browsing to 172.16.100.1 now returns the page from node2.magedu.com
mkdir /web/htdocs -pv
vim /etc/exports
/web/htdocs 172.16.0.0/16(ro)
service nfs restart
showmount -e 172.16.100.10 — the output is as expected
Stop the service, then test-mount: mount 172.16.100.10:/web/htdocs /mnt
ls /mnt → index.html
umount /mnt
Edit haresources: node1.magedu.com IPaddr::172.16.100.1/16/eth0 Filesystem::172.16.100.10:/web/htdocs::/var/www/html::nfs httpd
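For reference, a haresources line is: the preferred node name, then resources in start order, with :: separating a resource agent from its arguments; here Filesystem takes device::mountpoint::fstype:
node-name resource1[::arg1[::arg2...]] resource2 ...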
scp haresources node2:/etc/ha.d/
tail -f /var/log/messages
HA Clusters: Resource Management with heartbeat's CRM
RA classes:
OCF
pacemaker
linbit
LSB
Legacy Heartbeat V1
STONITH
RA: Resource Agent
manages resources on behalf of the cluster manager
LRM: Local Resource Manager
DC: TE (Transition Engine), PE (Policy Engine)
CRM: Cluster Resource Manager
haresource (heartbeat v1)
crm, haresource (heartbeat v2)
pacemaker (heartbeat v3)
rgmanager (RHCS)
provides the foundation that non-HA-aware applications call into for high availability;
crmd: provides the management API, consumed by GUI and CLI tools
a web service as an example (three resources): vip, httpd, filesystem
Resource Type:
primitive(native)
group
clone
STONITH
Cluster Filesystem: dlm (Distributed Lock Manager)
master/slave:drbd
Resource stickiness: whether a resource prefers to stay on its current node
positive value: willing to stay
negative value: prefers to leave
Resource constraints:
location: position constraint
colocation: placement constraint
order: ordering constraint
heartbeat:
authkeys
ha.cf
node
bcast, mcast, ucast
haresource
HA prerequisites:
1. time synchronization; 2. SSH mutual trust; 3. the hostname must match the output of uname -n and be resolvable via /etc/hosts;
CIB: Cluster Information Base
XML format
crm –> pacemaker
crm respawn|on (the ha.cf directive that enables the CRM)
mcast eth0 225.0.100.19 694 1 0
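The fields of the mcast directive are, in order: interface, multicast group address, UDP port, TTL, and loop (whether the sender receives its own packets); so the line above means heartbeat over eth0 via group 225.0.100.19 on port 694, TTL 1, loopback off.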
Multicast basics
Multicast packets carry a class D IP address as their destination, in the range 224.0.0.0 through 239.255.255.255. A class D address may never appear in the source-address field of an IP packet. In unicast transmission, a packet travels along one path from the source address to the destination address, forwarded hop-by-hop through the IP network. In IP multicast, by contrast, the destination is not a single host but a set of hosts identified by a group address. All receivers join the group, and once they have joined, data sent toward the group address immediately begins flowing to them; every member of the group receives the packets. Membership in a multicast group is dynamic: hosts may join and leave at any time.
Multicast group types
Multicast groups can be permanent or temporary. Some multicast group addresses are officially assigned; these are known as permanent multicast groups. What stays fixed in a permanent group is its IP address; the membership of the group can change. A permanent group may have any number of members, even zero. The IP multicast addresses not reserved for permanent groups are available for use by temporary groups.
224.0.0.0–224.0.0.255: reserved multicast addresses (permanent group addresses); 224.0.0.0 itself is reserved and never assigned, the others are used by routing protocols;
224.0.1.0–224.0.1.255: public multicast addresses, usable on the Internet;
224.0.2.0–238.255.255.255: user-available multicast addresses (temporary group addresses), valid network-wide;
239.0.0.0–239.255.255.255: administratively scoped multicast addresses, valid only within a specific local scope.
Commonly used reserved multicast addresses:
224.0.0.0   base address (reserved)
224.0.0.1   all hosts (including all routers)
224.0.0.2   all multicast routers
224.0.0.3   unassigned
224.0.0.4   DVMRP routers
224.0.0.5   OSPF routers
224.0.0.6   OSPF designated routers
224.0.0.7   ST routers
224.0.0.8   ST hosts
224.0.0.9   RIP-2 routers
224.0.0.10  EIGRP routers
224.0.0.11  mobile agents
224.0.0.12  DHCP servers / relay agents
224.0.0.13  all PIM routers
224.0.0.14  RSVP encapsulation
224.0.0.15  all CBT routers
224.0.0.16  designated SBM
224.0.0.17  all SBMs
224.0.0.18  VRRP
When Ethernet carries a unicast IP packet, the destination MAC address is the receiver's MAC address. When it carries a multicast packet, the destination is no longer one specific receiver but a group with indeterminate membership, so a multicast MAC address is used. Multicast MAC addresses are derived from multicast IP addresses: IANA specifies that the high 24 bits of a multicast MAC address are 0x01005e, the 25th bit is 0, and the low 23 bits are the low 23 bits of the multicast IP address.
Since only 23 of the low 28 bits of an IP multicast address are mapped into the MAC address, 2^5 = 32 distinct IP multicast addresses map onto each multicast MAC address.
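A worked example of the mapping, as a small bash sketch (using the group address from the mcast line above; the low 23 bits of 225.0.100.19 become the MAC suffix):
ip=225.0.100.19; IFS=. read a b c d <<< "$ip"
printf '01:00:5e:%02x:%02x:%02x\n' $((b & 0x7f)) $c $d    # prints 01:00:5e:00:64:13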
HA Clusters: Highly Available MySQL with heartbeat and NFS