Adding Physical Disks to Windows Server 2012

Since I have a terrible memory for such things, I have to keep notes for myself on How-Tos. A video is at the bottom that illustrates the process (if I can’t read it, at least I’ll see it) 🙂

This blog will walk through the process flow of adding physical disks to Windows Server 2012. I have already provisioned three disks via VMware (we won’t be covering this).

• Initialize new disks
• Create storage spaces, disks, and volumes with Server Manager
• Create volumes with the Disk Management snap-in

You use two different tools to bring three new disks online and initialize them in preparation for creating storage volumes.

Use the File and Storage Services submenu in Server Manager

1. In Server Manager, in the File and Storage Services submenu, click Volumes.
2. When the Volumes home page appears, choose Disks. The Disks page appears; in our example, it shows one online disk and three offline disks.

3. Right-click the offline disk number 1 and, from the context menu, select Bring Online. A message box appears, warning you not to bring the disk online if it is already online and connected to another server.
4. When you click Yes, the disk’s status changes to Online.
5. Right-click the same disk (now online) and, from the context menu, select Initialize. A message box appears, warning you that any data on the disk will be erased.
6. Click Yes. The disk is initialized and ready for volume creation.
7. In Server Manager, click Tools > Computer Management. The Computer Management console appears.
8. In the left pane, click Disk Management. The Disk Management snap-in appears.

Using the Disk Management snap-in

9. Right-click the Disk 2 tile and, from the context menu, select Online.
10. Right-click the Disk 2 tile a second time and, from the context menu, select Initialize Disk. The Initialize Disk dialog box appears.
11. Select the GPT (GUID Partition Table) option and click OK. The Disk 2 status changes to Online.
12. Repeat steps 9 to 11 to initialize Disk 3.

You can create simple volumes with either Server Manager or the Disk Management snap-in; both provide wizards with similar capabilities.

1. In Server Manager, in the File and Storage Services submenu, click Volumes. The Volumes home page appears.
2. Click Tasks > New Volume. The New Volume Wizard appears, displaying the Before you begin page.
3. Click Next. The Select the server and disk page appears.
4. Select Disk 1 and click Next. The Specify the size of the volume page appears.
5. In the Volume size text box, type 10 and click Next. The Assign to a drive letter or folder page appears.
6. Click Next. The Select file system settings page appears.
7. Click Next. The Confirm selections page appears.
8. Click Create. The Completion page appears.
9. Click Close. The new volume appears in the Volumes pane.
10. Switch to the Computer Management console. The new volume you just created appears in the Disk 1 pane of the Disk Management snap-in.
11. Right-click the unallocated space on Disk 2 and, from the context menu, select New Simple Volume. The New Simple Volume Wizard appears, displaying the Welcome page.
12. Click Next. The Specify Volume Size page appears.
13. In the Simple volume size in MB spin box, type 10000 and click Next. The Assign Drive Letter or Path page appears.
14. Click Next. The Format Partition page appears.
15. Click Next. The Completing the New Simple Volume Wizard page appears.
16. Click Finish. The wizard creates the volume, and it appears in the Disk 2 pane.
17. Create a 10 GB simple volume on disk 3 with the drive letter G: using Windows PowerShell.
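For that last step, a minimal PowerShell sketch might look like the following (the disk number and drive letter follow the lab layout above; the volume label is just an example):

# Confirm disk 3 is online and initialized (done earlier in Disk Management)
Get-Disk -Number 3

# Create a 10 GB partition with drive letter G: and format it NTFS
New-Partition -DiskNumber 3 -Size 10GB -DriveLetter G |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "Data3" -Confirm:$false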

Creating a Storage Pool

Storage pools are a new feature in Windows Server 2012 that lets you create a flexible storage subsystem with various types of fault tolerance.

Use the Server Manager console to create a storage pool, which consists of space from multiple physical disks.

The Storage Pools home page
1. In Server Manager, on the File and Storage Services submenu, click Storage Pools. The Storage Pools home page appears.
2. In the Storage Pools tile, click Tasks > New Storage Pool. The New Storage Pool Wizard appears, displaying the Before you begin page.
3. Click Next. The Specify a storage pool name and subsystem page appears.
4. In the Name text box, type Pool1 and click Next. The Select physical disks for the storage pool page appears.
5. Select the check boxes for PhysicalDisk1 and PhysicalDisk2 in the list and click Next. The Confirm selections page appears.
6. Click Create. The wizard creates the storage pool.
7. Click Close. The new pool appears in the Storage Pools tile.
8. Select Pool1.
9. In the Virtual Disks tile, click Tasks > New Virtual Disk. The New Virtual Disk Wizard appears, displaying the Before you begin page.
10. Click Next. The Select the storage pool page appears.
11. Click Next. The Specify the virtual disk name page appears.
12. In the name text box, type Data1 and click Next. The Select the storage layout page appears.
13. In the layout list, select Parity and click Next. A warning appears, stating that the storage pool does not contain a sufficient number of physical disks to support the Parity layout.

14. In the layout list, select Mirror and click Next. The Specify the provisioning type page appears.
15. Leave the default Fixed option selected and click Next. The Specify the size of the virtual disk page appears.
16. In the Virtual disk size text box, type 10 and click Next. The Confirm selections page appears.
17. Click Create. The wizard creates the virtual disk and the View results page appears. Deselect the Create a volume when this wizard closes option.
18. Click Close. The virtual disk appears on the Storage Pools page.
19. In the Virtual Disks tile, right-click the Data1 disk you just created and, from the context menu, select New Volume. The New Volume Wizard appears.
20. Using the wizard, create a volume on Disk 4 (Data1) using all of the available space, the NTFS file system, and the drive letter J:

AddPhyDisk_2_Pool_WinSrv2012.mp4

What’s with MGMTDB anyways?

Those who have either upgraded to or fresh-installed the 12.1 (12c) Grid Infrastructure stack will notice a new database instance (-MGMTDB) that was provisioned automagically. So what is this MGMTDB, and why do we need the overhead?

So let’s recap what this database is and what it does…
The Management Database (MGMTDB) is the Grid Infrastructure Management Repository: the central repository for the data collected by Cluster Health Monitor.

MGMTDB is a container database (CDB) with one pluggable database (PDB) running, and it runs out of the Grid Infrastructure home.
MGMTDB is a RAC One Node database; i.e., it runs on one node at a time, but because it is a clustered resource, it can be started on or failed over to any node in the cluster. MGMTDB is a non-critical component of the GI stack (with no “real” hard dependencies), which means that if MGMTDB fails or becomes unavailable, Grid Infrastructure continues running.

MGMTDB is configured (subject to change) with a 750 MB SGA, a 325 MB PGA, and a 5 GB database size. Note that, because of this small footprint, MGMTDB’s SGA is not configured for hugepages. Since this database is created dynamically at install time, the OUI installer has no prior knowledge of the databases that are configured on, or will be migrated to, this cluster; to avoid any database name conflicts, the name “-MGMTDB” was chosen (notice the leading “-“). Also note that bypassing the MGMTDB installation is only allowed for upgrades to 12.1.0.2; new 12.1.0.2 installations or upgrades to future releases will require MGMTDB to be installed. If MGMTDB is not selected during an upgrade, all features that depend on it (Cluster Health Monitor (CHM/OS), etc.) will be disabled.

If you are wondering where the datafiles and other structures for this database are stored: they are placed in the same disk group as the OCR and voting files. These database files can, however, be migrated to a different ASM disk group post-install.
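A quick way to see this for yourself is to ask the clusterware where MGMTDB is running and where the CHM repository lives (a hedged sketch; run from the GI home, and expect the output to vary by cluster):

# Where is the -MGMTDB instance currently running?
$GRID_HOME/bin/srvctl status mgmtdb

# Show the CHM/GIMR repository path (i.e., the MGMTDB datafile location) and size
$GRID_HOME/bin/oclumon manage -get reppath
$GRID_HOME/bin/oclumon manage -get repsize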

MGMTDB stores a subset of Operating System (OS) performance data for the longer term, to provide diagnostic information and to support intelligent workload management. The performance data (OS metrics similar to, but a subset of, Exawatcher) collected by Cluster Health Monitor (CHM) is also stored on local disk, so when MGMTDB is not in use, CHM data can still be obtained from local disk, but intelligent workload management (QoS) will be disabled. Longer term, MGMTDB will become a key component of the Grid Infrastructure and provide services to important components; because of this, MGMTDB will eventually become a mandatory component in future upgrades to releases on Exadata.

See document 1568402.1 for more details.

OVCA Network specs – discussion with Oracle

Notes from meeting w/ Oracle on OVCA:

The most current version of the Oracle Virtual Compute Appliance (VCA or OVCA) is the X4-2.
VCA consists of the following:
• InfiniBand (IB) QDR links operating at 40 Gb/s bandwidth (raw fabric bit rate); the effective maximum network throughput possible is on the order of 32 Gb/s for data, due in part to 8b/10b encoding (40 Gb/s × 8/10 = 32 Gb/s).
• PCIe card limits (20-25 Gb/s per server, depending on server model)

VCA supports VLANs and can accept VLAN tags.

Virtual machine (VM) network throughput is limited by various factors. Expected network throughput between virtual machines is as follows (from Oracle specs):

• Between VMs running on the same compute node: approx. 15.5 Gb/s
• Between two pairs of VMs communicating on the same compute node: 25.5 Gb/s aggregate (12-13 Gb/s per client+server pair)

• Between two VMs running on different compute nodes: approx. 7.8-8 Gb/s
• Between two pairs of VMs communicating on different compute nodes: no reduction in speed

Performance is affected by software layers and hypervisor/virtualization overhead, such as:
• Xsigo drivers, EoIB, and IPoIB add path length and latency.
• Virtual machine access adds further latency for the virtual front-end and back-end I/O devices.
• Private Virtual Interconnect (PVI) peer-to-peer communication vs. external traffic via the Xsigo backplane.

blktrace basics

Life of an I/O

Once a user issues an I/O request, the I/O enters the block layer… then the magic begins…

1. The I/O request is remapped atop the underlying logical/aggregated device (MD, DM). Depending on alignment, size, etc., the request may be split into two separate I/Os.
2. Requests added to the request queue
3. Or the request is merged with a previous entry on the queue (either way, all I/Os end up on a request queue at some point)
4. The I/O is issued to a device driver, and submitted to a device
5. Later, the I/O is completed by the device, and posted by its driver

btt is a Linux utility that provides an analysis of the amount of time the I/O spent in the different areas of the I/O stack.

btt requires that you run blktrace first. Invoke blktrace, specifying whatever devices and other parameters you want. You must save the traces to disk in this step;
in its current state, btt does not work in live mode.

After tracing completes, run blkrawverify, specifying all of the devices that were traced (or at least all of the devices that you will use btt with).

If blkrawverify finds errors in the saved trace streams, it is best to recapture the data.

Run blkparse with the -d option specifying a file to store the combined binary stream. (e.g.: blkparse -d bp.bin …).

blktrace produces a series of binary files containing parallel trace streams – one file per CPU per device. blkparse provides the ability to combine all the files into one time-ordered stream of traces for all devices.
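Putting the steps above together, a minimal end-to-end pass might look like the following (the device /dev/sdb, the 60-second capture window, and the file names are just examples):

# 1. Capture traces to disk for 60 seconds (one trace.blktrace.<cpu> file per CPU)
blktrace -d /dev/sdb -o trace -w 60

# 2. Verify the saved per-CPU trace streams before spending time on analysis
blkrawverify trace

# 3. Combine the per-CPU streams into one time-ordered binary stream for btt
blkparse -i trace -d bp.bin > /dev/null

# 4. Feed the combined stream to btt for the Q2I/I2D/D2C breakdown
btt -i bp.bin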

Here are some guidelines on the key trace actions that appear in the btt output:

Q — A block I/O is Queued
G — Get Request: a newly queued block I/O was not a candidate for merging with any existing request, so a new block layer request is allocated.

M — A block I/O is Merged with an existing request.
I — A request is Inserted into the device’s queue.
D — A request is issued to the Device.
C — A request is Completed by the driver.
P — The block device queue is Plugged, to allow the aggregation of requests.
U — The device queue is Unplugged, allowing the aggregated requests to be issued to the device.

Metrics of an I/O
Q2I – time it takes to process an I/O prior to it being inserted or merged onto a request queue

Includes split and remap time
I2D – time the I/O is “idle” on the request queue

D2C – time the I/O is “active” in the driver and on the device

Q2I + I2D + D2C = Q2C
Q2C: Total processing time of the I/O

The latency data files which can be optionally produced by btt provide per-IO latency information, one for total IO time (Q2C) and one for latencies induced by lower layer drivers and devices (D2C).

In both cases, the first column (X values) represent runtime (seconds), while the second column (Y values) shows the actual latency for a command at that time (either Q2C or D2C).
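For example, the latency files can be requested directly from btt, reusing the combined bp.bin stream from the sketch above (the q2c/d2c prefixes are arbitrary; btt writes one .dat file per traced device using these prefixes):

# -q emits per-IO Q2C latencies, -l emits per-IO D2C latencies
btt -i bp.bin -q q2c -l d2c

# Each .dat file has two columns: runtime (seconds) and latency (seconds)
head q2c*.dat d2c*.dat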

Exadata Monitoring and Agents – EM Plugin

To those who attended our Exadata Monitoring and Agents session: here are some answers and follow-up from the chat room.

The primary goal of the Exadata plugin is to digest the schematic file and validate the database.xml and catalog.xml files. If the pre-check runs without failure, then Discovery can be executed.

The agent only runs on compute nodes and monitors all components remotely; i.e., no additional scripts/code are installed on the peripheral components. Agents either pull component metrics and vitals using ssh commands (relying on user equivalence) or subscribe to SNMP traps.

Note that there are always two agents deployed: the master does the majority of the work, and a slave agent kicks in if the master fails. Agents should be installed on all compute nodes.

Initially, the guided discovery wizard runs ASM kfod to get disk names and reads cellip.ora.
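For context, these are inputs you can also inspect by hand (a hedged sketch; the cellip.ora path below is the usual default and may differ in your environment):

# Disk names as kfod reports them (run from the Grid Infrastructure home)
$GRID_HOME/bin/kfod disks=all

# Cell IP addresses that the discovery wizard reads
cat /etc/oracle/cell/network-config/cellip.ora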

The components monitored via the Exadata-EM plugin include the following:
• Storage Cells

• Infiniband Switches (IB switches)
The EM agent runs remote ssh calls to collect switch metrics; the IB switch sends SNMP traps (PUSH) for all alerts. This collection requires ssh equivalence for nm2user and includes various sensor data (fan, voltage, temperature) as well as port metrics.
The plugin does the following:
ssh nm2user@<ib-switch> ibnetdiscover

It then reads the names of the components connected to the IB switch and matches the compute node hostnames to the hostnames used to install the agent.

• Cisco Switch
The EM agent runs remote SNMP get calls to gather metric data; this includes port status and switch vitals (e.g., CPU, memory, power, and temperature). In addition, performance metrics are also collected (e.g., ingress and egress throughput rates).

• PDU and KVM
Both active and passive PDUs are monitored. The agent runs SNMP get calls against each PDU; metric collection includes power, temperature, and fan status. The same steps and metrics apply to the KVM.

• ILOM targets
EM Agent executes remote ipmitool calls to each compute node’s ILOM target. This execution requires oemuser credentials to run ipmitool. Agent collects sensor data as well as configuration data (firmware version and serial number)

In EM 12.1.0.4, the key enhancements introduced include IB performance gathering, on-demand schematic refresh, cell performance monitoring, guided resolution for cell alerts, and automated SNMP notification setup for Exadata Storage Servers and InfiniBand switches.

The agent discovers IB switches and compute nodes from the ibnetdiscover output. KVM, PDU, Cisco, and ILOM discovery is performed via the schematic file on the compute node, and finally the agent subscribes to SNMP traps for the cells and IB switches; note that SNMP has to be manually set up and enabled on the peripheral components for SNMP push of cell alerts. The EM agent runs cellcli via ssh to obtain storage metrics, which does require ssh equivalence with the agent user.
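As a hedged illustration of the kind of remote call the agent makes (the cell host name and monitoring account below are placeholders; your environment may use a different user):

# Pull current cell metrics the way the agent does: cellcli over ssh,
# relying on ssh user equivalence (no password prompt)
ssh cellmonitor@exa01cel01 "cellcli -e list metriccurrent where objectType='CELL'"

# Grab the cell's overall status and vitals
ssh cellmonitor@exa01cel01 "cellcli -e list cell detail"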

The latest version (as of this writing, 12.1.0.6) introduced a number of key visualization and metrics enhancements. For example:

• CDB-level I/O Workload Summary with PDB-level details breakdown.
• I/O Resource Management for Oracle Database 12c.
• Exadata Database Machine-level physical visualization of I/O Utilization for CDB and PDB on each Exadata Storage Server. There is also a critical integration link to Database Resource Management UI.
• Additional InfiniBand Switch Sensor fault detection, including power supply unit sensors and fan presence sensors.
• Automatically push Exadata plug-in to agent during discovery.

Use fully qualified names with the agent; using shortened names will cause issues. If there are any issues with metric gathering or the agent, the EMDiag kit should be used to triage them. The EMDiag kit includes scripts that can be used to diagnose EM issues; specifically, it includes repvfy, agtvfy, and omsvfy, which can be used to diagnose issues with the OEM repository, the EM agents, and the Oracle Management Service (OMS).
To obtain the EMDiag Kit, download the zip file for the version that you need, per Oracle Support Note: MOS ID# 421053.1

export EMDIAG_HOME=/u01/app/oracle/product/emdiag
$EMDIAG_HOME/bin/repvfy install
$EMDIAG_HOME/bin/repvfy verify Exadata -level 9 -details

ASM Check script

Here's a little script from @racdba that does an ASM check when we go onsite.

#!/bin/ksh
HOST=`hostname`
ASM_OS_DEV_NM=/tmp/asmdevicenames.log
ASMVOTEDSK=/tmp/asm_votingdisks.log
GRID_HOME=`cat /etc/oratab |grep "+ASM" |awk -F ":" '{print $2}'`
ORACLE_HOME=$GRID_HOME
PATH=$ORACLE_HOME/bin:$PATH:
export GAWK=/bin/gawk

#
#
do_pipe ()
{
SQLP="$GRID_HOME/bin/sqlplus -s / as sysdba";
$SQLP |& # Open a pipe (co-process) to SQL*Plus
print -p -- 'set feed off pause off pages 0 head off veri off line 500';
print -p -- 'set term off time off';
print -p -- "set sqlprompt ''";

print -p -- 'select sysdate from dual;';
read -p SYSDATE;

print -p -- "select version from v\$instance;";
read -p ASM_VERSION;

print -p -- "select value from v\$parameter where name='processes';";
read -p ASM_PROCESS;

print -p -- "select value/1024/1024 from v\$parameter where name='memory_target';";
read -p ASM_MEMORY;

print -p -- "quit;";
sleep 5;
}
#
function get_asminfo {
for LUNS in `ls /dev/oracleasm/disks/*`
do
echo "ASMLIB disk: $LUNS"
asmdisk=`kfed read $LUNS | grep dskname | tr -s ' '| cut -f2 -d' '`
echo "ASM disk: $asmdisk"
majorminor=`ls -l $LUNS | tr -s ' ' | cut -f5,6 -d' '`
dev=`ls -l /dev | tr -s ' ' | grep "$majorminor" | cut -f10 -d' '`
echo "Device path: /dev/$dev"
echo "----"
done

echo ""
echo "# ---------------------------------------------------------------------------------------------------- #";
/usr/sbin/oracleasm-discover;
}

function get_mem_info {
# Report physical memory and swap in GB (rounded), plus the HugePages count
MEM=`free | $GAWK '/^Mem:/{ print int( ($2 / 1024 / 1024 + 4) / 4 ) * 4 }'`
SWAP=`free | $GAWK '/^Swap:/{ print int ( $2 / 1024 / 1024 + 0.5 ) }'`
HUGEPAGES=`grep HugePages_Total /proc/meminfo | $GAWK '{print $2}'`

echo "Physical Memory: ${MEM} GB | Swap: ${SWAP} GB"
echo "HugePages: $HUGEPAGES"
}

export ORACLE_SID=`cat /etc/oratab |grep "+ASM" |awk -F ":" '{print $1}'`
CHKPMON=`ps -ef|grep -v grep|grep pmon_${ORACLE_SID}|awk '{print $8}'`
# Only query the ASM instance if its pmon process is running
if [ -n "$CHKPMON" ]; then
do_pipe $ORACLE_SID
echo "# ---------------------------------------------------------------------------------------------------- #";
echo "HOSTNAME: ${HOST}"
echo "GRID HOME: ${GRID_HOME}"
echo "ASM VERSION: ${ASM_VERSION}"
echo "ASM PROCESSES: ${ASM_PROCESS}"
echo "ASM MEMORY: ${ASM_MEMORY} MB"
echo "# ---------------------------------------------------------------------------------------------------- #";
get_mem_info
echo "# ---------------------------------------------------------------------------------------------------- #";
else
echo "${ORACLE_SID} is not running."
fi

echo "# ---------------------------------------------------------------------------------------------------- #";
echo "LINUX VERSION INFORMATION:"
echo " "
[ -f "/etc/redhat-release" ] && cat /etc/redhat-release
[ -f "/etc/oracle-release" ] && cat /etc/oracle-release
uname -a
echo "# ---------------------------------------------------------------------------------------------------- #";

##SQLP="sqlplus -s / as sysdba";
##$SQLP <<! > $ASM_OS_DEV_NM
##set feed off pause off head on veri off line 500;
##set term off time off numwidth 15;
##set sqlprompt '';
##col label for a25
##col path for a55
##--select label,path,os_mb from v\$asm_disk;
##select label,os_mb from v\$asm_disk;
##exit;
##!

echo "ASM OS DEVICE INFORMATION:"
##cat $ASM_OS_DEV_NM
## Check for ASMLib
ASMLIBCHK=`rpm -qa |grep oracleasmlib`
if [[ -n $ASMLIBCHK ]]
then
echo "# ---------------------------------------------------------------------------------------------------- #";
echo "ASMLIB RPM: ${ASMLIBCHK}"
echo " "
##echo "ASM OS DEVICE INFORMATION:"
##echo " "
get_asminfo
else
echo "ASMLIB is NOT installed."
fi

echo "# ---------------------------------------------------------------------------------------------------- #";

## Check OCR/Voting disks
OCR=`$GRID_HOME/bin/ocrcheck |grep "Device/File Name" |awk '{print $4}'`
##echo " "
##echo "GRID HOME is located at ${GRID_HOME}."
echo "OCR LOCATION: ${OCR}"
echo "# ---------------------------------------------------------------------------------------------------- #";
echo " "

## Voting disk
$GRID_HOME/bin/crsctl query css votedisk > $ASMVOTEDSK

echo "VOTING DISK INFORMATION:"
echo " "
cat $ASMVOTEDSK
echo "# ---------------------------------------------------------------------------------------------------- #";

## Cleanup
if [[ -f $ASM_OS_DEV_NM ]]
then
rm $ASM_OS_DEV_NM
fi

if [[ -f $ASMVOTEDSK ]]
then
rm $ASMVOTEDSK
fi

iSCSI vs FCoE – a basic review and load test for Database Workloads

We were invited to visit with a client to help them migrate their 11.2 database to newer hardware. Since they were re-platforming, this was a good time for them to re-think their storage platform too.
From a storage perspective, they already had iSCSI implemented in their environment. When they asked what I thought about FCoE, the journey immediately turned into a “well, let’s go see for ourselves”. The platform and storage vendor are not important or relevant here, so we’ll keep the innocent parties innocent.

I’m not going to bore you with FCoE and iSCSI essentials, but I’ll start with the basic info that we discussed. Here’s a bit of that dialogue before we started testing
(I’ll touch on the key discussion points):

FCoE
Fibre Channel over Ethernet (FCoE) maps Fibre Channel onto layer 2 Ethernet, encapsulating Fibre Channel frames (and their SCSI payload) at the data link layer.

FCoE requires a converged network adapter (CNA). A CNA (typically a 10GbE interface) supports both the LAN (Ethernet) and Fibre Channel stacks, effectively allowing LAN and SAN traffic to be combined onto a converged (unified) network link.

If running FCoE end-to-end (all the way to the storage ports), then FCoE requires Data Center Bridging (DCB) at layer 2, with the target and initiator on the same layer 2 segment. DCB enablement is via DCB-capable cards and switches.

In our test scenarios, we leveraged the existing native FC storage array with native FC ports, i.e., no end-to-end FCoE, thus no DCB was required.
As a side note [for those curious], DCB, a set of IEEE 802.1 standard enhancements, allows predictable latency and lets the Ethernet behavior of dropping packets upon congestion co-exist with the SAN requirement of no frame loss.
FCoE networks also support Ethernet pass-through switches, whereby a pass-through-capable switch forwards Fibre Channel frames as an upper-layer protocol. These switches are lossless but have no knowledge of the FC stack.

iSCSI
iSCSI requires only a network interface card and encapsulates SCSI in TCP/IP (layer 3).

iSCSI is implemented in hardware (an iSCSI HBA, with or without a TCP offload engine, TOE) or in software via drivers.

Unlike FCoE, which requires the target and initiator to be on the same layer 2 segment, iSCSI endpoints can be on different subnets (this may not always be a good thing, but it provides flexibility). Additionally, iSCSI does not require DCB end-to-end.
iSCSI can also support longer distances.

Load tests
In our tests, both FCoE and iSCSI traffic was kept within the layer 2 network, so there were no additional hops for iSCSI. MTU was set to 9000 (jumbo frames).
We used standard NICs with no offload, so this was not a completely apples-to-apples comparison between FCoE and iSCSI from a CPU utilization/overhead perspective; we kept an eye on this.
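As a side note, here are a couple of quick checks one might run on the iSCSI initiator side to confirm this kind of setup (the interface name below is a placeholder):

# Confirm jumbo frames are actually in effect on the storage-facing interface
ip link show eth2 | grep mtu

# Dump iSCSI session details (negotiated parameters, connection state)
iscsiadm -m session -P 3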

The client used Swingbench and their own database workload for the load tests.

Summary

  • FCoE and iSCSI had similar throughput and latency numbers at low thread/user session counts; however, as thread/user session counts increased, FCoE latency remained relatively low and its throughput was higher.
  • iSCSI always had higher CPU utilization as thread count increased, likely because of the absence of a TOE.

My conclusion
FCoE has traditionally been associated with enterprise-grade systems, whereas iSCSI has been relegated to SMB and less critical apps. However, that is not necessarily accurate thinking; there are many optimizations that can make iSCSI highly performant, and I have seen many high-end, heavy ERP workloads run atop iSCSI configurations. Because of the extra layers of encapsulation, iSCSI can be perceived as having additional latency; add in layer 3 hops, and you do have more latency.

Based on these tests and tests we have done in the past, for very heavy workloads or applications with low-latency requirements FCoE seems the better choice. The storage engineer quoted [from a SNIA report] that for anything in the 700 MB/sec to 900 MB/sec per-port range, either iSCSI or FCoE will work, but anything more demanding may require FCoE.

Interestingly enough, even with the performance and CPU utilization differences, the client decided to stay with iSCSI for two key reasons: cost and cost. That is, the cost of the infrastructure to support the load, and the cost of skill-set acquisition for supporting FC (especially FCoE) stacks.