NAS:VMWare SRP Guide
Revision as of 09:22, 27 January 2017
Goal:
This wiki is meant to help people get Infiniband SRP Target working under RedHat/CentOS 7.3 to VMWare ESXi 6.0x SRP Initiators via Mellanox Infiniband HCAs. (although the process should work under any RHEL / CentOS 7.x build)
This guide should cover all of the steps from compiling/installing/configuring the Mellanox OFED drivers, to compiling/installing SCST, to adding ZFS on Linux (ZoL), and finally configuring SCST with all of the above. It also covers the few ESX6 steps required to remove all of the 'inband' drivers, install the OFED drivers, and become an active SRP initiator.
Not all of these steps are required for everyone, but I'm sure *someone* will appreciate them all together in one place :)
For the purposes of this guide, the syntax assumes you are always logged in as 'root'
Last but not least...
This guide comes with no warranty or support of any kind :) I'm not an expert in this science, nor do I claim to be. I'm not even particularly experienced in Linux. I am a storage engineer/consultant by trade, and have a great understanding of the storage stack/layers, so the way I explain stuff below is "as I understand it". Any corrections to my 'math' and 'syntax' are always welcome, as I like to learn as much as I like to teach :)
Credit, where credit is due...
The knowledge and process this guide is based on were only possible thanks to the assistance I was given by mpogr. He was gracious enough to help me troubleshoot and understand all the (many) failures I was having. It took a rapid-fire 68 private messages on https://forums.servethehome.com , which ultimately led to one of those "hey mate, here's my cell#, just call me" moments, plus a long night of determination, to produce the insight required to document this process. I am in his debt, and very much respect his willingness to go above and beyond for a complete stranger across the planet who shared one common goal. Without his help, I would never have accomplished something worthy of documenting.
What is SRP?
The SCSI RDMA Protocol (SRP) is a protocol that allows one computer to access SCSI devices attached to another computer via remote direct memory access (RDMA). The SRP protocol is also known as the SCSI Remote Protocol. The use of RDMA makes higher throughput and lower latency possible than with e.g. the TCP/IP communication protocol. RDMA is only possible with network adapters that support RDMA in hardware. Examples of such network adapters are InfiniBand HCAs and 10 GbE network adapters with iWARP support.
What is SCST?
SCST is a GPL licensed SCSI target software stack. The design goals of this software stack are high performance, high reliability, strict conformance to existing SCSI standards, being easy to extend and easy to use. SCST does not only support multiple SCSI protocols (iSCSI, FC, SRP) but also supports multiple local storage interfaces (SCSI pass-through, block I/O and file I/O) and also storage drivers implemented in user-space via the scst_user driver. In order to reach maximum performance SCST has been implemented as a set of kernel drivers.
VMware ESX 6.0.x SRP Initiator Setup
I started with a fresh ESX 6.0 update 2 installation. ESX6 has 'inband' drivers which have shifted toward the Ethernet only world. Let's 'fix' that.
Download the Mellanox OFED v1.8.2.5 Drivers with SRP Initiator from: http://www.mellanox.com/downloads/Software/MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585.zip
This driver was developed for ESX5.x, isn't 'supported' in ESXi v6.0, and doesn't work in ESXi v6.5.
Step 1: Remove the 'inband' ESXi v6.0 Infiniband/Mellanox Drivers
You will need SSH enabled on your ESX host via vSphere Client > ESX6 Host > Configuration > Security Profile > Services > Properties
SSH > Start
ESXi Shell > Start
*Optional* Configure these to start and stop with the host
Start by SSH'ing in to your ESXi host and search to see which mlx4 'inband' drivers already exist:
    [root@esx01:~]$ esxcli software vib list | grep mlx4
My output:
    net-mlx4-core  1.9.7.0-1vmw.600.0.0.2494585  VMware  VMwareCertified  2017-01-24
    net-mlx4-en    1.9.7.0-1vmw.600.0.0.2494585  VMware  VMwareCertified  2017-01-24
    nmlx4-core     3.0.0.0-1vmw.600.0.0.2494585  VMware  VMwareCertified  2017-01-24
    nmlx4-en       3.0.0.0-1vmw.600.0.0.2494585  VMware  VMwareCertified  2017-01-24
    nmlx4-rdma     3.0.0.0-1vmw.600.0.0.2494585  VMware  VMwareCertified  2017-01-24
Remove all of the 'inband' drivers:
    [root@esx01:~]$ esxcli software vib remove -f -n net-mlx4-core -n net-mlx4-en -n nmlx4-core -n nmlx4-en -n nmlx4-rdma
Output:
    Removal Result
       Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
       Reboot Required: true
       VIBs Installed:
       VIBs Removed: VMware_bootbank_net-mlx4-core_1.9.7.0-1vmw.600.0.0.2494585, VMware_bootbank_net-mlx4-en_1.9.7.0-1vmw.600.0.0.2494585, VMware_bootbank_nmlx4-core_3.0.0.0-1vmw.600.0.0.2494585, VMware_bootbank_nmlx4-en_3.0.0.0-1vmw.600.0.0.2494585, VMware_bootbank_nmlx4-rdma_3.0.0.0-1vmw.600.0.0.2494585
       VIBs Skipped:
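The VIB names handed to the remove command do not have to be typed by hand. As a rough sketch (the helper name mlx4_remove_args is made up for this example, and it assumes awk is available in your ESXi shell), the name column of the listing can be turned into the -n arguments. Only use it before Step 2, because the Mellanox OFED package installs its own net-mlx4-core VIB that you want to keep.

```shell
# Read `esxcli software vib list` output on stdin and print
# "-n <name> -n <name> ..." for every VIB whose name contains "mlx4".
# mlx4_remove_args is a hypothetical helper, not part of esxcli.
mlx4_remove_args() {
    awk '$1 ~ /mlx4/ { printf "%s-n %s", sep, $1; sep = " " } END { if (sep) print "" }'
}

# Possible use on the ESXi host (review the printed list first):
#   esxcli software vib list | mlx4_remove_args
#   esxcli software vib remove -f $(esxcli software vib list | mlx4_remove_args)
```

Printing the arguments first and pasting them into the remove command keeps a human in the loop, which seems sensible for a forced (-f) removal.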
Reboot.
    [root@esx01:~]$ reboot
Step 2: Install v1.8.2.5 of the Mellanox OFED Drivers on your ESXi v6.0x Host
Use SCP to copy the file to the /tmp directory of your ESXi 6.0 host, or, if your ESX host has internet access, download it directly:
    [root@esx01:~]$ cd /tmp
    [root@esx01:~]$ wget http://www.mellanox.com/downloads/Software/MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585.zip
Allow the 'non-VMware' driver to be installed via esxcli
    [root@esx01:~]$ esxcli software acceptance set --level=CommunitySupported
Output:
    Host acceptance level changed to 'CommunitySupported'.
Install the Mellanox OFED v1.8.2.5 package:
    [root@esx01:~]$ esxcli software vib install -d /tmp/MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585.zip --no-sig-check
Output:
    Installation Result
       Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
       Reboot Required: true
       VIBs Installed: MEL_bootbank_net-ib-cm_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-ib-core_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-ib-ipoib_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-ib-mad_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-ib-sa_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-ib-umad_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-memtrack_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-mlx4-core_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_net-mlx4-ib_1.8.2.5-1OEM.600.0.0.2494585, MEL_bootbank_scsi-ib-srp_1.8.2.5-1OEM.600.0.0.2494585
       VIBs Removed:
       VIBs Skipped:
Reboot:
    [root@esx01:~]$ reboot
Step 3: Validate and Optimize
You should see the following under vSphere Client > ESX6 Host > Configuration > Storage Adapters:
    MT26428 [ConnectX VPI - 10GigE / IB QDR, PCIe 2.0 5GT/s]
    vmhba32    SCSI
    vmhba33    SCSI
Your actual make/model# may differ. Your vmhba#s may differ.
As long as they are there, your SRP initiators are ready & waiting.
*Optimize*
Set your ESX6 host to use the 4096 byte MTU size:
    [root@esx01:~]$ esxcli system module parameters set -m mlx4_core -p='mtu_4k=1 msi_x=1'
Reboot:
    [root@esx01:~]$ reboot
That's about it, for tuning on the ESX6 side.
RedHat/CentOS 7.3 SRP Target Setup
These instructions are meant to be used with SCST 3.2.x (latest stable branch) as well as Mellanox OFED Drivers 3.4.2 (latest) [at the time of writing].
They should be viable for Mellanox ConnectX-2/3/4 Adapters, with or without an Infiniband Switch.
NOTE: All Infiniband connectivity requires 'a' subnet manager functioning 'somewhere' in the 'fabric'. I will cover the very basics of this shortly, but the gist of it is: you want (1) subnet manager configured and running. On this subnet manager you need to configure at least one 'partition'. This acts like an Ethernet VLAN, except that Infiniband won't play nice without one. For the purpose of this guide you won't need more than one. But if you are on top of managing your subnet manager and partitions already, consider the pros/cons of creating one specifically for SRP traffic, segmenting it from IPoIB, and binding all of your SRP-only interfaces to that partition.
The basic order you want to do things in is: install the base OS (RedHat or CentOS) and update it to current. A Minimal OS installation is recommended. During OS installation it is highly recommended that you do NOT add any of the Infiniband or iSCSI packages that come with the OS; I can't guarantee they won't get in the way somewhere down the line. Some development-type packages may show up as missing/required when making/installing; add them manually and retry the step.
The Mellanox OFED Drivers
Mellanox OFED Driver Download Page http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers
Mellanox OFED Linux User Manual v3.40 http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v3.40.pdf
My Driver Direct Download Link for RHEL/CentOS 7.3 x64 (Latest w/ OFED) http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-3.4-2.0.0.0&mname=MLNX_OFED_LINUX-3.4-2.0.0.0-rhel7.3-x86_64.tgz
Step 1: Install the prerequisite packages required by the Mellanox OFED Driver package
    [root@NAS01 ~]$ yum install tcl tk -y
Step 2: Download the Mellanox OFED drivers (.tgz version for this guide), and put them in /tmp
    [root@NAS01 ~]$ cd /tmp
    [root@NAS01 ~]$ tar xvf MLNX_OFED_LINUX-3.4-2.0.0.0-rhel7.3-x86_64.tgz
    [root@NAS01 ~]$ cd MLNX_OFED_LINUX-3.4-2.0.0.0-rhel7.3-x86_64
NOTE: If you just run the ./mlnxofedinstall script with the latest & greatest RedHat/CentOS kernel, you will fail 'later down the road' in the SCST installation, specifically in the ib_srpt module, which is required for this exercise.
Step 3: Initially run the mlnxofedinstall script with the --add-kernel-support flag (REQUIRED)
    [root@NAS01 ~]$ ./mlnxofedinstall --add-kernel-support --without-fw-update
NOTE: This will actually take the installation package, and use it to rebuild an entirely new installation package, customized for your specific Linux kernel. Note the name and location of the new .tgz package it creates. The --without-fw-update flag isn't required, but is useful if you 'don't even want to go there' on seeing if the driver package wants to auto-flash your HCA firmware. (Use your own best judgement.)
Step 4: Extract the new package that was just created, customized for your Linux kernel.
    [root@NAS01 ~]$ cd /tmp/MLNX_OFED_LINUX-3.4-2.0.0.0-3.10.0-514.el7.x86_64/
    [root@NAS01 ~]$ tar xvf MLNX_OFED_LINUX-3.4-2.0.0.0-rhel7.3-ext.tgz
    [root@NAS01 ~]$ cd MLNX_OFED_LINUX-3.4-2.0.0.0-rhel7.3-ext
NOTE: In my example I'm using the RedHat 7.3 OFED Driver, so my file names may differ from yours. Look for the -ext suffix before the .tgz extension.
Step 5: Now we can run the Mellanox OFED installation script
    [root@NAS01 ~]$ ./mlnxofedinstall
Step 6: Validate the new Mellanox Drivers can Stop/Start
    [root@NAS01 ~]$ /etc/init.d/openibd restart
    Unloading HCA driver:                   [ OK ]
    Loading HCA driver and Access Layer:    [ OK ]
Look good? Move on! Nothing to see here....
NOTE: If you get an error here about iSCSI or SRP being 'used', and the service doesn't automagically stop and start, then you likely have a conflict with an 'inband' driver. You should try to resolve that conflict before you move forward.
I don't have the definitive how-to guide for every possible scenario, but a good rule of thumb is to run the uninstaller that came with the Mellanox OFED package to try and 'clean' up everything.
/usr/sbin/ofed_uninstall.sh
Additionally, look for 'how to remove inband srp/infiniband/scst/iscsi' for whatever looks like it's conflicting for you. Repeat the OFED & SCST(If you had one previously) installations per this guide and validate again.
Step 7: Validate the new Mellanox Drivers using the supplied Self Test script
    [root@NAS01 ~]$ hca_self_test.ofed
For me, the validation output looks like:
    ---- Performing Adapter Device Self Test ----
    Number of CAs Detected ................. 1
    PCI Device Check ....................... PASS
    Kernel Arch ............................ x86_64
    Host Driver Version .................... MLNX_OFED_LINUX-3.4-2.0.0.0 (OFED-3.4-2.0.0): modules
    Host Driver RPM Check .................. PASS
    Firmware on CA #0 HCA .................. v2.10.0720
    Host Driver Initialization ............. PASS
    Number of CA Ports Active .............. 0
    Error Counter Check on CA #0 (HCA)...... PASS
    Kernel Syslog Check .................... PASS
    Node GUID on CA #0 (HCA) ............... NA
    ------------------ DONE ---------------------
I'm currently using the Mellanox ConnectX-2 Dual Port 40Gb QDR Adapter. My firmware is v2.10.0720. I believe the minimum version you need for this guide to be successful is v2.9.1200 or v2.9.1000. If you allow the OFED driver package to push firmware in my hardware scenario, it only includes v2.9.1000. For ConnectX-3/4 based adapters, please google-fu your way to the recommended firmware for SRP support.
Validate your Infiniband device is being picked up by the driver, and what its device name is.
    [root@nas01]# ibv_devices
Output should look like:
    device          node GUID
    ------          ----------------
    mlx4_0          0002c903002805a4
My device name is mlx4_0, and the GUID (like a MAC address or WWN) of the adapter itself is 0002c903002805a4. This acts like a 'Node Name' of the HCA itself, not of the interfaces/ports it has. Those will always be similar, usually Port 1 = Node address +1, Port 2 = Node address +2.
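The '+1 / +2' rule of thumb is plain hexadecimal arithmetic on the node GUID. A small sketch (guid_plus is a hypothetical helper name, and the offsets are only the usual pattern described above, not something every adapter guarantees):

```shell
# Add a small offset to a 16-digit hex node GUID and print the result
# in the same zero-padded, lower-case form.
# guid_plus is a made-up helper for illustration only.
guid_plus() {
    printf '%016x\n' "$(( 0x$1 + $2 ))"
}

guid_plus 0002c903002805a4 1   # prints 0002c903002805a5 (expected port 1 GUID)
guid_plus 0002c903002805a4 2   # prints 0002c903002805a6 (expected port 2 GUID)
```

Always confirm the real port GUIDs on your own hardware rather than trusting the arithmetic.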
Validate the Device Info for your HCA
    [root@nas01]# ibv_devinfo
Output should look like:
    hca_id: mlx4_0
        transport:          InfiniBand (0)
        fw_ver:             2.10.720
        node_guid:          0002:c903:0028:05a4
        sys_image_guid:     0002:c903:0028:05a7
        vendor_id:          0x02c9
        vendor_part_id:     26428
        hw_ver:             0xB0
        board_id:           MT_0D80110009
        phys_port_cnt:      2
        Device ports:
            port: 1
                state:          PORT_ACTIVE (4)
                max_mtu:        4096 (5)
                active_mtu:     4096 (5)
                sm_lid:         1
                port_lid:       2
                port_lmc:       0x00
                link_layer:     InfiniBand
            port: 2
                state:          PORT_ACTIVE (4)
                max_mtu:        4096 (5)
                active_mtu:     4096 (5)
                sm_lid:         1
                port_lid:       3
                port_lmc:       0x00
                link_layer:     InfiniBand
DON'T WORRY if your ports aren't ACTIVE yet. It's likely due to a bad or missing Infiniband partition configuration.
Step 8a: Configuring InfiniBand Partitions
*NOTE* You only need (1) active/working subnet manager on your Infiniband fabric. That is where you need to define the partitions. If you have an IB switch and wish to use the hardware subnet manager, that's where your configuration will be. If you do not have an IB switch, then you need to use the software OpenSM daemon, and configure the partitions there. In crazy multi-computer setups, all direct attached in a loop between HCAs, you will need multiple subnet managers running, one for each isolated wing of the fabric.
Here's a sample 'partitions.conf' that should work for 'everyone'.
    #
    # For reference:
    # IPv4 IANA reserved multicast addresses:
    #   http://www.iana.org/assignments/multicast-addresses/multicast-addresses.txt
    # IPv6 IANA reserved multicast addresses:
    #   http://www.iana.org/assignments/ipv6-multicast-addresses/ipv6-multicast-addresses.xml
    #
    # mtu =
    #   1 = 256
    #   2 = 512
    #   3 = 1024
    #   4 = 2048
    #   5 = 4096
    #
    # rate =
    #   2  = 2.5 GBit/s
    #   3  = 10 GBit/s
    #   4  = 30 GBit/s
    #   5  = 5 GBit/s
    #   6  = 20 GBit/s
    #   7  = 40 GBit/s
    #   8  = 60 GBit/s
    #   9  = 80 GBit/s
    #   10 = 120 GBit/s
    Default=0x7fff, ipoib, mtu=5, ALL=full;
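The mtu= value in partitions.conf is a code, not a byte count. As a quick sanity check, the mapping in the comment table can be reproduced with a shift (both helper names here are invented for this sketch, and only the codes 1 to 5 listed in the table are handled):

```shell
# Extract the numeric mtu= code from a partition definition line on stdin.
# Comment lines such as "# mtu =" do not match because of the space.
partition_mtu_code() {
    sed -n 's/.*mtu=\([0-9][0-9]*\).*/\1/p'
}

# Translate an mtu code (1..5) into bytes: 1 -> 256, 2 -> 512, ... 5 -> 4096.
ib_mtu_bytes() {
    case "$1" in
        1|2|3|4|5) echo $(( 128 << $1 )) ;;
        *) echo "unknown mtu code: $1" >&2; return 1 ;;
    esac
}

# Example with the sample partition line from this guide:
ib_mtu_bytes "$(echo 'Default=0x7fff, ipoib, mtu=5, ALL=full;' | partition_mtu_code)"   # prints 4096
```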
Mellanox IB Switch Instructions
If you have a Mellanox Infiniband Switch running the Subnet Manager, you need to edit/validate the fabric configuration.
*BACKUP* your existing configuration file located in /usr/voltaire/config/partitions.conf, before making any changes.
Use the sample above, edit the partitions.conf file on the switch, and then reboot it. (easy way)
OpenSM Instructions
<To be Continued...work in progress...>
Installing OpenSM (only if required)
    [root@NAS01 ~]$ yum install opensm
The opensm program keeps its master configuration file in /etc/rdma/opensm.conf.
By default, opensm.conf looks for the file /etc/rdma/partitions.conf to get a list of partitions to create on the fabric.
    [root@NAS01 ~]$ vi /etc/rdma/partitions.conf
Use the sample above, edit the partitions.conf file on the opensm host, and then restart the service.
Enabling OpenSM
    [root@NAS01 ~]$ systemctl enable opensm
Restarting OpenSM
    [root@NAS01 ~]$ systemctl restart opensm
OK... Let's assume you have properly configured & matching partitions.conf files by now...
Step 8b: (Optional) Validating InfiniBand Partitions & Subnet Manager connectivity
    [root@NAS01 ~]$ osmtest -f c
Output should be similar to:
    Command Line Arguments
    Done with args
    Flow = Create Inventory
    Jan 23 22:15:20 337670 [5260E740] 0x7f -> Setting log level to: 0x03
    Jan 23 22:15:20 338142 [5260E740] 0x02 -> osm_vendor_init: 1000 pending umads specified
    Jan 23 22:15:20 372978 [5260E740] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x2c903002805a5
    Jan 23 22:15:20 419733 [5260E740] 0x02 -> osmtest_validate_sa_class_port_info:
    -----------------------------
    SA Class Port Info:
     base_ver:1
     class_ver:2
     cap_mask:0x2602
     cap_mask2:0x0
     resp_time_val:0x10
    -----------------------------
    OSMTEST: TEST "Create Inventory" PASS
Looking for that PASS
    [root@NAS01 ~]$ osmtest -f a
Ignore any errors like:
    0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0300
All we care about is:
    OSMTEST: TEST "All Validations" PASS
Finally
    [root@NAS01 ~]$ osmtest -v
Ignore any errors... All we care about is:
    OSMTEST: TEST "All Validations" PASS
Additionally, you can check the relationship between Infiniband Devices and Network Devices by:
    [root@NAS01 ~]$ ibdev2netdev
Which in my case looks like:
    mlx4_0 port 1 ==> ib0 (Up)
    mlx4_0 port 2 ==> ib1 (Up)
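If you want to script this check, here is a small sketch that flags any port not reported as (Up). The helper name ib_ports_all_up is made up, and it assumes the ibdev2netdev line format shown above, with the state as the last field.

```shell
# Read ibdev2netdev output on stdin, print every line whose last field
# is not "(Up)", and fail if any such line exists or if there is no input.
# ib_ports_all_up is a hypothetical helper name.
ib_ports_all_up() {
    awk 'NF { n++ } NF && $NF != "(Up)" { print; bad = 1 } END { exit (bad || n == 0) }'
}

# Possible use on the target host:
#   ibdev2netdev | ib_ports_all_up && echo "all IB ports are up"
```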
Congrats, you have working InfiniBand :) Now let's work on getting that SRP protocol bound on top of it, with at least one of the interfaces of your HCA as a 'target'
The SCST Package
Step 9: Prepare to install the SCST Package
    [root@NAS01 ~]$ yum install svn
    [root@NAS01 ~]$ cd /tmp
    [root@NAS01 ~]$ svn checkout svn://svn.code.sf.net/p/scst/svn/branches/3.2.x/ scst-svn
Step 10: Install the SCST Package
The folder name depends on the checkout above (scst-svn in this guide). Note the make 2perf, rather than make 2release or make 2anything else.
    [root@NAS01 ~]$ cd /tmp/scst-svn/
    [root@NAS01 ~]$ make 2perf
    [root@NAS01 ~]$ cd scst
    [root@NAS01 ~]$ make install
    [root@NAS01 ~]$ cd ../scstadmin
    [root@NAS01 ~]$ make install
    [root@NAS01 ~]$ cd ../srpt
    [root@NAS01 ~]$ make install
    [root@NAS01 ~]$ cd ../iscsi-scst
    [root@NAS01 ~]$ make install
You can combine these into a one-liner, but for me it was easier to see the issues I was having in SRPT by performing the make install steps one by one.
Now if the correct Mellanox OFED drivers are loaded with kernel support, and there are no conflicting 'inband' drivers getting in the way, then the SRPT install above should have created a module called ib_srpt. The 'whole' trick for me in getting this set up was understanding that, to get this 'right', the SCST make depends on the OFED make having been done with kernel support.
Step 11: Validate the correct ib_srpt.ko file is loaded for the module ib_srpt
This step tripped me up for a while....
    [root@nas01]# modinfo ib_srpt
Output should look like:
    filename:       /lib/modules/3.10.0-514.el7.x86_64/extra/ib_srpt.ko
    license:        Dual BSD/GPL
    description:    InfiniBand SCSI RDMA Protocol target v3.2.x#MOFED ((not yet released))
    author:         Vu Pham and Bart Van Assche
    rhelversion:    7.3
    srcversion:     D993FDBF1BE83A3622BF4CC
    depends:        rdma_cm,ib_core,scst,mlx_compat,ib_cm,ib_mad
    vermagic:       3.10.0-514.el7.x86_64 SMP mod_unload modversions
    parm:           rdma_cm_port:Port number RDMA/CM will bind to. (short)
    parm:           srp_max_rdma_size:Maximum size of SRP RDMA transfers for new connections. (int)
    parm:           srp_max_req_size:Maximum size of SRP request messages in bytes. (int)
    parm:           srp_max_rsp_size:Maximum size of SRP response messages in bytes. (int)
    parm:           use_srq:Whether or not to use SRQ (bool)
    parm:           srpt_srq_size:Shared receive queue (SRQ) size. (int)
    parm:           srpt_sq_size:Per-channel send queue (SQ) size. (int)
    parm:           use_port_guid_in_session_name:Use target port ID in the session name such that redundant paths between multiport systems can be masked. (bool)
    parm:           use_node_guid_in_target_name:Use HCA node GUID as SCST target name. (bool)
    parm:           srpt_service_guid:Using this value for ioc_guid, id_ext, and cm_listen_id instead of using the node_guid of the first HCA.
    parm:           max_sge_delta:Number to subtract from max_sge. (uint)
i.e. the filename should be:
    /lib/modules/`uname -r`/extra/ib_srpt.ko
where `uname -r` is whatever comes up for you. In my case 3.10.0-514.el7.x86_64
If it looks different, like the example below, you have a problem with an 'inband' driver conflict.
    filename:       /lib/modules/3.10.0-514.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/ulp/srpt/ib_srpt.ko
    version:        0.1
    license:        Dual BSD/GPL
    description:    ib_srpt dummy kernel module
    author:         Alaa Hleihel
    rhelversion:    7.3
    srcversion:     646BEB37C9062B1D74593ED
    depends:        mlx_compat
    vermagic:       3.10.0-514.el7.x86_64 SMP mod_unload modversions
If you have this problem, please perform the following steps: manually remove the ib_srpt.ko file from whatever location it is in, reboot, re-make the SCST package per the above instructions, and then re-check with modinfo ib_srpt.
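The two modinfo outputs above differ in the filename path and in the word "dummy" in the description, so the comparison can be automated. A minimal sketch, assuming the modinfo field layout shown in this guide (srpt_module_ok is a hypothetical helper name):

```shell
# Read `modinfo ib_srpt` output on stdin. Succeed only if the filename is
# /lib/modules/<kernel>/extra/ib_srpt.ko and the description does not
# mention "dummy". The kernel release is passed as the first argument.
srpt_module_ok() {
    kernel=$1
    awk -v want="/lib/modules/$kernel/extra/ib_srpt.ko" '
        $1 == "filename:"                { path_ok = ($2 == want) }
        $1 == "description:" && /dummy/  { dummy = 1 }
        END { exit !(path_ok && !dummy) }
    '
}

# Possible use on the target host:
#   modinfo ib_srpt | srpt_module_ok "$(uname -r)" && echo "SCST ib_srpt is in place"
```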
If you don't have this problem, you are getting close :) The foundations are ready; we just need to find some information and set up /etc/scst.conf.
Step 12: The SCST Configuration File
*NOTE* In order to create/edit the SCST configuration file with the right information, you will need at least:
1) To discover the SRP addresses of your Infiniband HCA interfaces, and have them handy.
2) Create (or) have the location of the Volume you want to present via SCST to the SRP initiator(s). i.e. Creating the 'Physical' Volume, that we want to share as a 'Virtual' Volume.
Step 12a: Finding your SRP Addresses
Run the following command to find the SRP addresses we need:
    [root@NAS01 ~]$ ibsrpdm -c
My output looks like:
    id_ext=0002c903002805a4,ioc_guid=0002c903002805a4,dgid=fe800000000000000002c903002805a5,pkey=ffff,service_id=0002c903002805a4
    id_ext=0002c903002805a4,ioc_guid=0002c903002805a4,dgid=fe800000000000000002c903002805a6,pkey=ffff,service_id=0002c903002805a4
This shows me (2) Infiniband interfaces logged in to partition ffff (pkey), being serviced by the HCA with a node address of 0002c903002805a4.
What we're looking for is the dgid= address, i.e.
    fe800000000000000002c903002805a5
    fe800000000000000002c903002805a6
We need to convert those numbers into a slightly different format:
    fe80:0000:0000:0000:0002:c903:0028:05a5
    fe80:0000:0000:0000:0002:c903:0028:05a6
This is done by adding a colon after every 4 hexadecimal digits. Copy these to notepad for use later in /etc/scst.conf.
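The conversion is mechanical: take each dgid value and place a colon after every 4 hexadecimal digits. A small sed-based sketch (both function names are invented here, and it assumes the ibsrpdm -c line format shown above):

```shell
# Pull the 32-digit dgid= value out of each `ibsrpdm -c` line on stdin.
srp_dgids() {
    sed -n 's/.*dgid=\([0-9a-fA-F]\{32\}\).*/\1/p'
}

# Insert a colon after every 4 characters, then drop the trailing colon.
gid_with_colons() {
    sed -e 's/..../&:/g' -e 's/:$//'
}

# Example with one line of the output from this guide:
echo 'id_ext=0002c903002805a4,ioc_guid=0002c903002805a4,dgid=fe800000000000000002c903002805a5,pkey=ffff,service_id=0002c903002805a4' \
    | srp_dgids | gid_with_colons   # prints fe80:0000:0000:0000:0002:c903:0028:05a5
```

On the target host the same pipeline would be fed from the live command, e.g. ibsrpdm -c | srp_dgids | gid_with_colons.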
Step 12b: The Volume you want to share
NOTE: If you are using ZFS like I am, the SCST virtual disk you want to present over SRP needs to be a ZFS volume (zvol). You can't specify a ZFS file system as the target for an SRP block device. (This may be obvious to some.)
I have a ZFS pool called 'Hitachi'.
I wanted my volume (zvol) to be called 'iSCSIvol1'.
I wanted compression enabled on the device.
I wanted the block size to be 32KB (vs the 8KB default in some places).
And finally, I wanted it to be 2TB.
Create my ZFS Volume
    [root@NAS01 ~]$ zfs create -o compression=lz4 -o volblocksize=32k -V 2048G Hitachi/iSCSIvol1
Validate the Volume was created
    [root@NAS01 ~]$ zfs list
My Output:
    NAME                USED  AVAIL  REFER  MOUNTPOINT
    Hitachi            4.13T  23.9T   160K  /Hitachi
    Hitachi/iSCSIvol1  2.06T  26.0T  25.8G  -
Validate the 'physical' path of the Volume, used in the SCST configuration file.
    [root@NAS01 ~]$ ls -l /dev/zvol/Hitachi/
My Output:
    lrwxrwxrwx. 1 root root 9 Jan 23 18:34 iSCSIvol1 -> ../../zd0
So for me the 'physical' path I'm going to use shortly is:
    /dev/zvol/Hitachi/iSCSIvol1
Step 12c: Last, but not least... create/edit the /etc/scst.conf file
Sample from data used in this guide:
    HANDLER vdisk_blockio {
        DEVICE vdisk0 {
            filename /dev/zvol/Hitachi/iSCSIvol1
            nv_cache 1
            rotational 0
            write_through 0
            threads_num 4
        }
    }

    TARGET_DRIVER ib_srpt {
        TARGET fe80:0000:0000:0000:0002:c903:0028:05a5 {
            enabled 1
            rel_tgt_id 1
            LUN 0 vdisk0
        }
        TARGET fe80:0000:0000:0000:0002:c903:0028:05a6 {
            enabled 1
            rel_tgt_id 2
            LUN 0 vdisk0
        }
    }
In the above configuration, we have the vdisk_blockio handler and its devices. We have one device called 'vdisk0' bound to a 'filename' of our /dev/zvol/Hitachi/iSCSIvol1 volume. This is where it can be confusing: creating a flat file via dd if=/dev/zero and trying to present it as a target through this handler would fail. Again, this needs to be a RAW volume.
I'm using nv_cache 1, which effectively disables write sync on the volume, unless the ZFS pool has sync=always.
I'm using rotational 0, which effectively tells the client initiator that this is an SSD device. If you prefer it to be recognized as a traditional HDD, switch this to 1.
I'm using threads_num 4, which is how you can adjust the vdisk queue depth. (of sorts)
In the TARGET_DRIVER section bound to the ib_srpt module, we see (2) SRP target addresses being defined.
They are each enabled. (You can disable one, or not define it, if you prefer to use IPoIB (only) on Port1 and SRP (only) on Port2, for example.)
Each target needs a relative target ID (rel_tgt_id), and they need to be unique if you have more than one. Start with 1.
Finally you define the LUN as: LUN <#> <devicename>.
This is the final layer of 'mapping & masking' a virtual device, as a Disk ID, down a Target ID, ready for your Initiator Controller ID to be scanned.
Each device being shared on the same targets needs unique LUN #s. Start with #0.
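Since every TARGET block follows the same pattern, the TARGET_DRIVER section can be generated from the list of colon-formatted addresses instead of being typed by hand. The following is only a sketch (srpt_target_section is a made-up helper, it maps a single device as LUN 0 on every target, and it numbers rel_tgt_id from 1 as described above); review the result before putting it in /etc/scst.conf.

```shell
# Print a TARGET_DRIVER ib_srpt section.
#   $1    = SCST device name to map as LUN 0 (e.g. vdisk0)
#   stdin = one colon-formatted target address per line
# srpt_target_section is a hypothetical helper, not part of SCST.
srpt_target_section() {
    dev=$1
    awk -v dev="$dev" '
        BEGIN { print "TARGET_DRIVER ib_srpt {" }
        NF {
            n++                                   # rel_tgt_id: 1, 2, 3, ...
            printf "    TARGET %s {\n", $1
            printf "        enabled 1\n"
            printf "        rel_tgt_id %d\n", n
            printf "        LUN 0 %s\n", dev
            printf "    }\n"
        }
        END { print "}" }
    '
}

# Example with the two addresses used in this guide:
printf '%s\n' \
    fe80:0000:0000:0000:0002:c903:0028:05a5 \
    fe80:0000:0000:0000:0002:c903:0028:05a6 | srpt_target_section vdisk0
```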
Final Notes:
Managing the SCST Service/Configuration:
Check the status, stop, start, restart the service.
    [root@NAS01 ~]$ service scst status
    [root@NAS01 ~]$ service scst stop
    [root@NAS01 ~]$ service scst start
    [root@NAS01 ~]$ service scst restart
You need to restart the SCST service after making changes to /etc/scst.conf if you want those changes to be active.
But be aware that when restarting the service, there will be a temporary disruption in all SCST presented traffic (iSCSI/SRP): all SCSI virtual disks may disappear/reappear for all initiators. This may take up to 60 seconds or so, while the SRP discovery/login process happens. So if you have critical VMs/databases running that have strict time-outs, you want to plan accordingly.
I don't want to overstate the speed/flexibility that normally occurs during a quick 'add a LUN mapping' and restart. I've tested it via ESX6 with a Windows 2016 Server VM with Resource Monitor open, and you can see it hang for a bit, and then quickly it's back and life is good. If that works for you, then don't worry about it.
However, if you make a mistake in your /etc/scst.conf, for example, and it takes you longer to get back up than you planned, well then, there's that...
Using the SCST Administration tool
Backup your SCST configuration to a flat file
    /usr/local/sbin/scstadmin -write_config ./SCST_backup.cfg
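Because a mistake in /etc/scst.conf can keep your storage offline longer than planned, it may also help to keep a timestamped copy of the file itself before every edit. A minimal sketch (backup_conf is a hypothetical helper, and the ".bak" naming is just a convention chosen for this example):

```shell
# Copy a file to "<file>.<YYYYmmdd-HHMMSS>.bak", preserving mode and
# timestamps, and print the path of the backup that was written.
backup_conf() {
    src=$1
    dst="$src.$(date +%Y%m%d-%H%M%S).bak"
    cp -p "$src" "$dst" && echo "$dst"
}

# Possible use before editing:
#   backup_conf /etc/scst.conf
```

Rolling back is then a matter of copying the chosen backup over /etc/scst.conf and restarting the service.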
Adding non-Infiniband 1/10GbE copper redundancy
To be continued... again, a placeholder for future development...
If you use only SRP traffic for your storage presentation, then because SRP discovery/negotiation/logins on the Infiniband fabric take time, and because servers don't particularly care for SCSI disks 'disappearing' for long periods of time,
it might make sense to add an additional connection via TCP/iSCSI down a separate physical network, as a working path to your storage between Linux & ESX. If you don't make changes to your SCST configuration often, or can manage the changes during safe planned outage events, then this may not be for you....
This gets a 'little complex' because you have to configure the software iSCSI initiator on the ESX host, bind a vmkernel IP and a VMNIC to a vSwitch, and bind that to the iSCSI software initiator. You also have to set up the iSCSI target on your Ethernet NIC on the Linux server, and add iSCSI as 'another path' along with the SRP paths, down an entirely separate channel/protocol. Finally, to do it right, you need to adjust your ESX multipathing policy to Fixed for each and every LUN you present, and set the preferred path to one of the vmhba#s that is SRP based.
In this scenario, you can make faster/less disruptive changes to /etc/scst.conf, because the Ethernet iSCSI path will pick up quicker than the SRP path; although it has far less bandwidth, it's faster at restoring connectivity. It also works as a good fail-safe in case you are making changes on your Infiniband fabric/network etc.
List SCST Targets Addresses
    [root@NAS01 ~]$ /usr/local/sbin/scstadmin -list_target
The output should look like:
    Collecting current configuration: done.
        Driver        Target
    -----------------------------------------------------------------
        copy_manager  copy_manager_tgt
        ib_srpt       fe80:0000:0000:0000:0002:c903:0028:05a5
                      fe80:0000:0000:0000:0002:c903:0028:05a6
Troubleshooting IPoIB / vSwitch / vmnic MTU:
Changes in the MTU size for an IPoIB pkey can be tricky. The first device connected will set the MTU size for the broadcast domain, i.e. IPoIB interfaces that connect afterwards will not be allowed to set an MTU size larger than that one.
1st, if you believe your switch partition's MTU specification is correct, and you see other hosts successfully accepting your MTU size, but for whatever reason one of the ESX hosts just refuses to play nicely and negotiates a 2044 MTU where everyone else is at 4092, for example: put the ESX host into maintenance mode, disconnect all IB interfaces, wait a second, reconnect the 1st IB interface, and validate that you can edit the vSwitch and vmnic to 4092. If successful, plug in the rest of the IB adapters, and validate that they too negotiate the MTU correctly.
2nd, only if not resolved by the above, double check the partitions.conf file (on your switch or OpenSM configuration) and validate you are specifying the correct mtu=# for the partition these interfaces are joining. You can't set anything in software higher than the physical boundary specified in the partition.
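The 2044 and 4092 figures come from the Infiniband MTU of the partition (2048 or 4096 bytes) minus the 4-byte IPoIB encapsulation header. A small sketch of that arithmetic (the helper names are invented for this example):

```shell
# IPoIB MTU = Infiniband MTU of the partition minus the 4-byte IPoIB header.
ipoib_mtu() {
    echo $(( $1 - 4 ))
}

# Succeed when a requested IPoIB MTU ($1) fits inside a partition whose
# Infiniband MTU is $2 bytes.
ipoib_mtu_fits() {
    [ "$1" -le "$(ipoib_mtu "$2")" ]
}

ipoib_mtu 4096   # prints 4092
ipoib_mtu 2048   # prints 2044
```

So a vSwitch/vmnic MTU of 4092 can only be accepted when the partition was created with mtu=5 (4096 bytes); with mtu=4 the ceiling is 2044.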
ADDITIONAL REFERENCES:
SCSI Target Software Stack (SCST) https://en.wikipedia.org/wiki/SCST http://scst.sourceforge.net/
ZFS CLI Cheat Sheet reference (for the simple stuff) http://thegeekdiary.com/solaris-zfs-command-line-reference-cheat-sheet/
RedHat/Centos Subnet Manager notes (If you run the software based OpenSM package) https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_the_Subnet_Manager.html
RedHat/CentOS Infiniband RDMA notes: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Testing_Early_InfiniBand_RDMA_operation.html
Interesting Mellanox thread about the potential future of RDMA and VMware 6.5, as it pertains to SRP/Infiniband. https://community.mellanox.com/thread/3379