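First, what is multipath? Some background: before multipath existed, a host's disks were attached directly to a single bus (PCI), and the relationship between path and device was one to one: each path led to exactly one disk or storage device. That one-to-one relationship is simple for the operating system to handle, but it offers no redundancy. With Fibre Channel networks (commonly called SANs), or IP SANs built on iSCSI, the host and the storage are connected through Fibre Channel switches or through several NICs and IP addresses. This creates many-to-many I/O channels, which means there are multiple paths between one host and one storage device. When these paths are all active at once, how should I/O traffic be distributed and scheduled, how should it be load-balanced, and how should active/standby be handled? Multipath software was created to answer these questions.
The main job of multipath is to work together with the storage device to provide:
1. Failover and failback
2. I/O load balancing
3. Disk virtualization
On Linux, the 2.6 kernels shipped by Red Hat and SUSE include a free multipath package, and ESX also includes multipath at no extra cost. On Windows, by contrast, you need to purchase a license for software called MPIO before you can use multipath. Multipath on Windows and ESX is configured through a fairly simple graphical interface, so it is not covered here. This article describes how to configure multipath on Linux.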
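I. multipath-related tools and parameters on Linux:
1. device-mapper-multipath: that is, multipath-tools. It mainly provides tools such as multipathd and multipath, and configuration files such as multipath.conf. These tools create and configure multipath devices through the device mapper ioctl interface; the resulting multipath device mappings appear under /dev/mapper.
2. device-mapper: consists of two main parts, a kernel part and a user-space part. The kernel part is made up of the device mapper core (dm.ko) and a number of target drivers (such as dm-multipath.ko). The core performs the device mapping, while each target handles the I/O coming down from the mapped device according to the mapping and its own characteristics. The core also exposes an interface through which user space talks to the kernel via ioctl to direct the driver's behaviour, for example how to create a mapped device and which attributes it has. The user-space part of Linux device mapper is mainly the device-mapper package, which includes the dmsetup tool and libraries that help create and configure mapped devices. These libraries abstract and wrap the ioctl interface so that mapped devices are easier to create and configure; multipath-tools calls into them.
3. dm-multipath.ko and dm.ko: dm.ko is the device mapper driver and the foundation on which multipath is built. dm-multipath is a target driver of dm.
4. scsi_id: shipped in the udev package. It can be configured in multipath.conf as the program used to obtain a SCSI device's identifier. By comparing identifiers, multipath can tell that several paths lead to the same device, which is the key to implementing multipath. scsi_id works through the sg driver, sending an INQUIRY command for EVPD page 0x80 or 0x83 to the device to query its identifier. Some devices do not support the EVPD INQUIRY, so they cannot be used to build a multipath device as they are. It is possible, however, to modify scsi_id so that it generates an identifier for such a device and prints it to standard output. When creating a multipath device, the multipath program calls scsi_id and reads the device's SCSI ID from its standard output. A modified scsi_id must return 0, because multipath checks the return value to decide whether the SCSI ID was obtained successfully.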
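To make the identifier-matching idea concrete, here is a minimal sketch that groups disks by identifier. It is not how multipath itself is implemented; the function name group_by_wwid and its "DEVICE WWID" input format are assumptions made for illustration. On a real host the pairs could be produced by running scsi_id against each disk (for example /lib/udev/scsi_id --whitelisted --device=/dev/sdb on RHEL 6).

```shell
#!/bin/sh
# group_by_wwid: read "DEVICE WWID" pairs on stdin and print one line per
# WWID that is reachable through more than one device (that is, multiple paths).
group_by_wwid() {
    awk '
        {
            count[$2]++
            paths[$2] = (paths[$2] == "" ? $1 : paths[$2] "," $1)
            if (count[$2] == 1) order[++n] = $2   # remember first-seen order
        }
        END {
            for (i = 1; i <= n; i++)
                if (count[order[i]] > 1)
                    print order[i], paths[order[i]]
        }'
}

# Illustrative input: LOCALDISK0001 is a made-up identifier for the local disk;
# the other WWID is the one that appears in the multipath -ll output in this article.
printf '%s\n' \
    'sda LOCALDISK0001' \
    'sdb 360a9800064665072443469563477396c' \
    'sde 360a9800064665072443469563477396c' | group_by_wwid
```

Only identifiers seen on at least two devices are printed, which mirrors the rule that several paths with the same ID belong to one multipath device.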
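II. Basic multipath configuration on Red Hat 6.2:
1. Run lsmod |grep dm_multipath to check whether the module is present and loaded. If there is no output, the module is not loaded; if the packages are missing, install them with yum: yum -y install device-mapper device-mapper-multipath
Then run multipath -ll to view the multipath status and confirm that the module is loaded:
[root@liujing ~]# multipath -ll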
Mar 10 19:18:28 | /etc/multipath.conf does not exist, blacklisting all devices.
Mar 10 19:18:28 | A sample multipath.conf file is located at
Mar 10 19:18:28 | /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf
Mar 10 19:18:28 | You can run /sbin/mpathconf to create or modify /etc/multipath.conf
Mar 10 19:18:28 | DM multipath kernel driver not loaded    <-- the DM module is not loaded
If the module is not loaded, use the following commands to initialize and start DM for the first time, or reboot the system:
# modprobe dm-multipath
# modprobe dm-round-robin
# service multipathd start
# multipath -v2
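As a small aid for this step, the sketch below checks whether the two modules are loaded by reading the kernel's module list. The helper name module_loaded is hypothetical, and reading /proc/modules is simply one way to do what lsmod does; treat it as an illustration rather than part of the official procedure.

```shell
#!/bin/sh
# module_loaded MODULE [FILE]
# Succeeds if MODULE is listed in FILE (default: /proc/modules).
# modprobe accepts "dm-multipath", but the kernel lists it as "dm_multipath",
# so dashes are normalised to underscores before comparing.
module_loaded() {
    name=$(printf '%s' "$1" | sed 's/-/_/g')
    file=${2:-/proc/modules}
    [ -r "$file" ] || return 1
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

for mod in dm-multipath dm-round-robin; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: not loaded (try: modprobe $mod)"
    fi
done
```

Matching on the first column only avoids false positives from the "used by" column, where dm_round_robin can appear on the dm_multipath line.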
After initialization, run multipath -ll again to check that the module is loaded:
[root@liujing ~]# multipath -ll
Mar 10 19:21:14 | /etc/multipath.conf does not exist, blacklisting all devices.
Mar 10 19:21:14 | A sample multipath.conf file is located at
Mar 10 19:21:14 | /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf
Mar 10 19:21:14 | You can run /sbin/mpathconf to create or modify /etc/multipath.conf
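(The "DM multipath kernel driver not loaded" message no longer appears, which shows that the DM module has loaded successfully.)
As the output shows, the DM module is loaded, but there is no multipath.conf under /etc/. The next step is to create that file.
2. Configuring multipath:
Use vi to create the multipath configuration file at /etc/multipath.conf, and add the minimal configuration that multipath needs in order to work: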
vi /etc/multipath.conf
blacklist {
    devnode "^sda"
}
defaults {
    user_friendly_names yes
    path_grouping_policy multibus
    failback immediate
    no_path_retry fail
}
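One detail of the blacklist worth checking: "^sda" is a regular expression anchored only at the start, so on a host with many disks it would also match names such as sdaa. The sketch below previews which names a pattern matches. It assumes that multipath's devnode matching behaves like a POSIX extended regular expression for simple patterns of this kind (which is what grep -E implements); the helper name preview_blacklist and the sample device names are hypothetical.

```shell
#!/bin/sh
# preview_blacklist PATTERN: print the device names from stdin that PATTERN matches.
preview_blacklist() {
    grep -E -- "$1"
}

names='sda sda1 sdaa sdb sde'   # sample names, not read from a real host

echo 'pattern ^sda matches:'
printf '%s\n' $names | preview_blacklist '^sda'

echo 'pattern ^sda[0-9]*$ matches:'
printf '%s\n' $names | preview_blacklist '^sda[0-9]*$'
```

The first pattern matches sda, sda1 and sdaa; the stricter second pattern matches only sda and sda1, which may be closer to the intent of excluding just the local system disk.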
After editing, save the configuration and start the multipathd service:
# /etc/init.d/multipathd start
If the service fails to start, that is, no OK is printed, as in:
[root@liujing mapper]# service multipathd start
Starting multipathd daemon:    <-- no OK is printed
stopping the service and starting it again resolves the problem:
[root@liujing mapper]# /etc/init.d/multipathd stop
Stopping multipathd daemon: [ OK ]
[root@localhost mapper]# /etc/init.d/multipathd start
Starting multipathd daemon: [ OK ]    <-- OK is printed; the service started normally
Check the result with:
[root@liujing mapper]# multipath -ll
mpatha (360a9800064665072443469563477396c) dm-0 NETAPP,LUN    <-- one multipath device was created for the LUN
size=3.5G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=4 status=active
  |- 1:0:0:0 sdb 8:16 active ready running    <-- the two underlying paths are sdb and sde
  `- 2:0:0:0 sde 8:64 active ready running
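When there are many LUNs, reading this output by eye gets tedious. The following sketch counts the healthy paths per map. The function name count_active_paths is hypothetical, and the parsing assumes the "alias (wwid) dm-N vendor,product" header layout and the "active ready" state text shown above; other multipath-tools versions may print a different layout.

```shell
#!/bin/sh
# count_active_paths: read `multipath -ll` output on stdin and print
# "MAPNAME ACTIVE_PATH_COUNT" for every map found.
count_active_paths() {
    awk '
        $2 ~ /^\(.*\)$/ && $3 ~ /^dm-[0-9]+$/ {      # map header line
            map = $1
            if (!(map in active)) { active[map] = 0; order[++n] = map }
            next
        }
        /active ready/ {                               # candidate path line
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) { active[map]++; break }
        }
        END { for (i = 1; i <= n; i++) print order[i], active[order[i]] }'
}

# Fed with the output shown above (annotations removed):
count_active_paths <<'EOF'
mpatha (360a9800064665072443469563477396c) dm-0 NETAPP,LUN
size=3.5G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=4 status=active
  |- 1:0:0:0 sdb 8:16 active ready running
  `- 2:0:0:0 sde 8:64 active ready running
EOF
```

On a live system the same function can be used as: multipath -ll | count_active_paths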
Two new entries, mpatha and mpathap1, now appear under /dev/mapper/:
[root@liujing mapper]# cd /dev/mapper/
[root@liujing mapper]# ls
control mpatha mpathap1
The output of fdisk -l also shows two additional devices.
Before multipath was configured:
[root@liujing~]# fdisk -l
Disk /dev/sda: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000a6cdd
Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 287 2097152 82 Linux swap / Solaris
Partition 2 does not end on cylinder boundary.
/dev/sda3 287 17850 141071360 83 Linux
Disk /dev/sdb: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sdb1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/sde: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sde1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Both CNA adapters see the same LUN, which shows up as two devices:
/dev/sde and /dev/sdb.
After configuration, /dev/mapper/mpatha and /dev/mapper/mpathap1 are added:
[root@localhost mapper]# fdisk -l
Disk /dev/sda: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000a6cdd
Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 287 2097152 82 Linux swap / Solaris
Partition 2 does not end on cylinder boundary.
/dev/sda3 287 17850 141071360 83 Linux
Disk /dev/sdb: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sdb1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/sde: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sde1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/mapper/mpatha: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/mapper/mpathap1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/mapper/mpathap1: 3773 MB, 3773441024 bytes
255 heads, 63 sectors/track, 458 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Alignment offset: 1024 bytes
Disk identifier: 0x00000000
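Disk /dev/mapper/mpathap1 doesn't contain a valid partition table
# multipath -F     # flush the existing multipath maps; the two new devices disappear
# multipath -v2    # rescan and rebuild the maps; the devices reappear
3. Basic operations on a multipath disk
To work with a disk created by multipath, operate directly on the device under /dev/mapper/.
Before partitioning a disk created by multipath, it is suggested to run pvcreate first (strictly speaking, this only matters if the device will be used as an LVM physical volume):
# pvcreate /dev/mapper/mpatha
# fdisk /dev/mapper/mpatha    # use the /dev/mapper/mpatha path when partitioning
When fdisk saves the partition table of a multipath disk it reports an error; this error can be ignored.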
# ls -l /dev/mapper/
[root@liujing mnt]# ls -l /dev/mapper/
total 0
crw-rw----. 1 root root 10, 58 Mar 10 19:10 control
lrwxrwxrwx. 1 root root 7 Mar 10 20:28 mpatha -> ../dm-0
lrwxrwxrwx. 1 root root 7 Mar 10 20:33 mpathap1 -> ../dm-1
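Here mpathap1 is the partition we created on the multipath disk.
# mkfs.ext4 /dev/mapper/mpathap1    # format the mpathap1 partition as ext4
# mount /dev/mapper/mpathap1 /mnt/    # mount the mpathap1 partition
Use /dev/mapper/mpathap1 for both formatting and mounting.
4. Partitioning the disk:
As mentioned above, use the /dev/mapper/mpatha path when partitioning:
[root@liujing~]# fdisk /dev/mapper/mpatha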
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xac956c3a.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): n    <-- create a new partition
Command action
e extended
p primary partition (1-4)
p    <-- primary partition
Partition number (1-4): 1
First cylinder (1-1016, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-1016, default 1016):
Using default value 1016
Command (m for help): w    <-- write the partition table, which saves the changes
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
Note: if two nodes of the same system attach the same LUN, on the other node you only need to run w again; there is no need to repeat n.
5. Formatting:
[root@liujing ~]# mkfs.ext4 /dev/mapper/mpathap1
mke2fs 1.41.12 (17-May-2010)
/dev/sdd1 alignment is offset by 1024 bytes.
This may result in very poor performance, (re)-partitioning suggested.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=16 blocks
230608 inodes, 921250 blocks
46062 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=943718400
29 block groups
32768 blocks per group, 32768 fragments per group
7952 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
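6. Mounting /dev/mapper/mpathap1 on /mnt
[root@liujing ~]# mount /dev/mapper/mpathap1 /mnt
III. Advanced multipath configuration
Everything so far has used multipath's default settings, for example for the names of the mapped devices and for the load-balancing method. Can multipath be configured the way we define it ourselves? Yes, it can.
1. Configuring the multipath.conf file
The next task is to edit the /etc/multipath.conf configuration file.
multipath.conf consists mainly of three sections: blacklist, multipaths and devices.
blacklist section: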
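blacklist {
    devnode "^sda"
}
multipaths section:
multipaths {
    multipath {
        wwid **************** # this value is shown by multipath -v3
        alias iscsi-dm0 # alias of the mapped device; any name will do
        path_grouping_policy multibus # path grouping policy
        path_checker tur # method used to determine the path state
        path_selector "round-robin 0" # algorithm that picks the path for the next I/O operation
    }
}
devices section:
devices {
    device {
        vendor "iSCSI-Enterprise" # vendor name
        product "Virtual disk" # product model
        path_grouping_policy multibus # default path grouping policy
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n" # default program used to obtain the unique device ID
        prio_callout "/sbin/acs_prio_alua %d" # default program used to obtain the priority value
        path_checker readsector0 # method used to determine the path state
        path_selector "round-robin 0" # algorithm that picks the path for the next I/O operation
        failback immediate # failback mode
        no_path_retry queue # keep queueing I/O until a path is restored; a number would instead set how many retries happen before queueing is disabled
        rr_min_io 100 # number of I/O requests sent down a path before switching to the next path in the current path group
    }
}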
The standard documentation describes the relevant parameters as follows:
Attribute
    Description
wwid
    Specifies the WWID of the multipath device to which the multipath attributes apply. This parameter is mandatory for this section of the multipath.conf file.
alias
    Specifies the symbolic name for the multipath device to which the multipath attributes apply. If you are using user_friendly_names, do not set this value to mpathn; this may conflict with an automatically assigned user friendly name and give you incorrect device node names.
path_grouping_policy
    Specifies the default path grouping policy to apply to unspecified multipaths. Possible values include:
    failover = 1 path per priority group
    multibus = all valid paths in 1 priority group
    group_by_serial = 1 priority group per detected serial number
    group_by_prio = 1 priority group per path priority value
    group_by_node_name = 1 priority group per target node name
path_selector
    Specifies the default algorithm to use in determining what path to use for the next I/O operation. Possible values include:
    round-robin 0: Loop through every path in the path group, sending the same amount of I/O to each.
    queue-length 0: Send the next bunch of I/O down the path with the least number of outstanding I/O requests.
    service-time 0: Send the next bunch of I/O down the path with the shortest estimated service time, which is determined by dividing the total size of the outstanding I/O to each path by its relative throughput.
failback
    Manages path group failback.
    A value of immediate specifies immediate failback to the highest priority path group that contains active paths.
    A value of manual specifies that there should not be immediate failback but that failback can happen only with operator intervention.
    A value of followover specifies that automatic failback should be performed when the first path of a path group becomes active. This keeps a node from automatically failing back when another node requested the failover.
    A numeric value greater than zero specifies deferred failback, expressed in seconds.
prio
    Specifies the default function to call to obtain a path priority value. For example, the ALUA bits in SPC-3 provide an exploitable prio value. Possible values include:
    const: Set a priority of 1 to all paths.
    emc: Generate the path priority for EMC arrays.
    alua: Generate the path priority based on the SCSI-3 ALUA settings.
    tpg_pref: Generate the path priority based on the SCSI-3 ALUA settings, using the preferred port bit.
    ontap: Generate the path priority for NetApp arrays.
    rdac: Generate the path priority for LSI/Engenio RDAC controller.
    hp_sw: Generate the path priority for Compaq/HP controller in active/standby mode.
    hds: Generate the path priority for Hitachi HDS Modular storage arrays.
no_path_retry
    A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.
    A value of fail indicates immediate failure, without queueing.
    A value of queue indicates that queueing should not stop until the path is fixed.
rr_min_io
    Specifies the number of I/O requests to route to a path before switching to the next path in the current path group. This setting is only for systems running kernels older than 2.6.31. Newer systems should use rr_min_io_rq. The default value is 1000.
rr_min_io_rq
    Specifies the number of I/O requests to route to a path before switching to the next path in the current path group, using request-based device-mapper-multipath. This setting should be used on systems running current kernels. On systems running kernels older than 2.6.31, use rr_min_io. The default value is 1.
rr_weight
    If set to priorities, then instead of sending rr_min_io requests to a path before calling path_selector to choose the next path, the number of requests to send is determined by rr_min_io times the path's priority, as determined by the prio function. If set to uniform, all path weights are equal.
flush_on_last_del
    If set to yes, then multipath will disable queueing when the last path to a device has been deleted.
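The rr_weight rule is simple arithmetic, and a tiny worked example may help. The sketch below uses rr_min_io 100 from the devices example above and prio=4 from the earlier multipath -ll output; the helper name requests_per_turn is hypothetical and only restates the documented formula.

```shell
#!/bin/sh
# requests_per_turn RR_MIN_IO PRIORITY RR_WEIGHT
# Number of requests sent to a path before the selector picks the next path.
requests_per_turn() {
    case "$3" in
        priorities) echo $(( $1 * $2 )) ;;   # rr_min_io times the path priority
        uniform)    echo "$1" ;;             # every path gets rr_min_io requests
        *)          echo "unknown rr_weight: $3" >&2; return 1 ;;
    esac
}

requests_per_turn 100 4 priorities   # prints 400
requests_per_turn 100 4 uniform      # prints 100
```

With rr_weight set to priorities, a path of priority 4 therefore receives four times as many requests per turn as it would under uniform.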
A complete advanced configuration from my own environment looks like this:
[root@liujing ~]# vi /etc/multipath.conf
blacklist {
    devnode "^sda"
}
multipaths {
    multipath {
        wwid 360a98000646650724434697454546156
        alias mpathb_fcoe
        path_grouping_policy multibus
        #path_checker "directio"
        prio "random"
        path_selector "round-robin 0"
    }
}
devices {
    device {
        vendor "NETAPP"
        product "LUN"
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        #path_checker "directio"
        #path_selector "round-robin 0"
        failback immediate
        no_path_retry fail
    }
}
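The wwid, vendor, product and getuid_callout values can be obtained from the output of multipath -v3. If an alias is set for a WWID in /etc/multipath.conf, the alias takes precedence over the automatically generated name.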
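As a complement to multipath -v3, the vendor and product strings of a SCSI disk can also be read from sysfs. The sketch below prints a device stanza skeleton from those files; the helper name device_stanza and its optional sysfs-root argument are assumptions made for illustration, and the generated stanza still needs the remaining attributes added by hand.

```shell
#!/bin/sh
# device_stanza DEV [SYSFS_ROOT]
# Print a device { } skeleton for DEV using the vendor and model strings that
# the kernel exposes in sysfs. SYSFS_ROOT defaults to /sys.
device_stanza() {
    dir="${2:-/sys}/block/$1/device"
    [ -r "$dir/vendor" ] && [ -r "$dir/model" ] || {
        echo "no SCSI identity found for $1" >&2
        return 1
    }
    # sysfs pads these fields with trailing spaces; strip them.
    vendor=$(sed 's/[[:space:]]*$//' "$dir/vendor")
    product=$(sed 's/[[:space:]]*$//' "$dir/model")
    printf 'device {\n    vendor "%s"\n    product "%s"\n}\n' "$vendor" "$product"
}

# Usage on a real host (sdb is one of the paths from the example above):
#   device_stanza sdb
```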
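IV. Load-balancing test:
You can use dd to generate I/O on the device while watching the I/O status with iostat to see which path the traffic takes:
dd command: dd if=/dev/zero of=/mnt/1Gfile bs=8k count=131072. The disk is already mounted on /mnt, so it is enough to read and write files under /mnt.
To repeat the write several times, use a loop such as:
[root@liujing ~]# for ((i=1;i<=5;i++));do dd if=/dev/zero of=/mnt/1Gfile bs=8k count=131072 2>&1|grep MB;done;    # repeats the write 5 times; adjust the count to suit your test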
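For readers adapting the test size, the amount of data each dd run writes is just bs times count. The short sketch below works this out for the values used above; the helper name dd_total_bytes is hypothetical.

```shell
#!/bin/sh
# dd_total_bytes BLOCK_SIZE_KIB COUNT
# Bytes written by: dd bs=<BLOCK_SIZE_KIB>k count=<COUNT>
dd_total_bytes() {
    echo $(( $1 * 1024 * $2 ))
}

bytes=$(dd_total_bytes 8 131072)
echo "$bytes bytes = $(( bytes / 1024 / 1024 )) MiB"   # prints 1073741824 bytes = 1024 MiB
```

So bs=8k with count=131072 writes exactly 1 GiB per iteration, which matches the 1Gfile name used in the command.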
In another console, run iostat 2 10 to watch the I/O status:
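The output shows that sdc and sdd are the two paths of the multipath device and that the traffic is spread evenly across them, so load balancing works.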
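To put a number on "evenly", one option is to compare the cumulative write counters of the two paths. The sketch below reads the "sectors written" column (field 10) of /proc/diskstats and prints each path's share. The helper name write_share is hypothetical, and because the counters are cumulative since boot, the result is only meaningful if nearly all writes to these devices came from the test (otherwise take readings before and after and compare the differences).

```shell
#!/bin/sh
# write_share DEV_A DEV_B [FILE]
# Print each device's percentage of the sectors written by the two together,
# based on field 10 (sectors written) of FILE (default: /proc/diskstats).
write_share() {
    awk -v a="$1" -v b="$2" '
        $3 == a { wa = $10 }
        $3 == b { wb = $10 }
        END {
            total = wa + wb
            if (total == 0) { print "no writes recorded"; exit 1 }
            printf "%s %.1f%%\n%s %.1f%%\n", a, 100 * wa / total, b, 100 * wb / total
        }' "${3:-/proc/diskstats}"
}

# Usage on the test host, with the two path devices named in the text:
#   write_share sdc sdd
```

With the multibus policy and round-robin selector, shares close to 50% each would be consistent with the balanced behaviour described above.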
V. Path redundancy (failover) test
Take down the port of one of the paths; all traffic switches directly to the other path.
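During this test it helps to see which path multipath considers unhealthy. The sketch below lists every path in multipath -ll output whose state is not "active ready". The function name unhealthy_paths is hypothetical, and the "failed faulty" line in the sample input is an illustrative assumption about how a downed path is reported, not output captured from the author's system.

```shell
#!/bin/sh
# unhealthy_paths: read `multipath -ll` output on stdin and print the device
# name of every path line (one containing an H:C:T:L address) that is not
# in the "active ready" state.
unhealthy_paths() {
    awk '
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) {
                    if ($0 !~ /active ready/) print $(i + 1)
                    break
                }
        }'
}

# Illustrative input with one path down:
unhealthy_paths <<'EOF'
mpatha (360a9800064665072443469563477396c) dm-0 NETAPP,LUN
size=3.5G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=4 status=active
  |- 1:0:0:0 sdb 8:16 active ready running
  `- 2:0:0:0 sde 8:64 failed faulty running
EOF
```

On a live system, multipath -ll | unhealthy_paths prints nothing while all paths are healthy and should list the affected device after the port is taken down.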