Exadata Cloud – Post Provisioning Exadata Configuration – Part 1

Post Provisioning Exadata Configuration – Part 1

After an Exadata is provisioned, there are several post-provisioning steps that need to be executed to allow system automation such as patching, backups, and infrastructure updates. This document describes these steps.

All the traffic in an Exadata DB System is, by default, routed through the client network. To route backup traffic to the backup interface (BONDETH1), a static route needs to be created on each of the compute nodes in the cluster.

First identify the gateway configured for the BONDETH1 interface.

grep GATEWAY /etc/sysconfig/network-scripts/ifcfg-bondeth1 | awk -F"=" '{print $2}'

10.232.35.1

Review the current contents of /etc/sysconfig/network-scripts/route-bondeth1:

cat /etc/sysconfig/network-scripts/route-bondeth1

10.232.35.0/24 dev bondeth1 table 211

default via 10.232.35.1 dev bondeth1 table 211

Create a new static route for BONDETH1 by updating route-bondeth1 with the following entries for your Cloud region, using the gateway identified above:

Phoenix (PHX) region:

ADDRESS0=129.146.0.0

NETMASK0=255.255.0.0

GATEWAY0=10.232.35.1

Ashburn (IAD) region:

ADDRESS0=129.213.0.0

NETMASK0=255.255.0.0

GATEWAY0=10.232.35.1
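The per-region entries above can be generated rather than typed by hand. The following is a minimal sketch, not an official tool: the function names object_storage_network and build_backup_route are hypothetical, and the only networks it knows are the two quoted in this post. It prints the entries so they can be reviewed before being added to route-bondeth1 as root.

```shell
#!/bin/sh
# Sketch only: object_storage_network and build_backup_route are
# hypothetical helper names, not part of any Oracle tooling.

# Map a region key to the Object Storage network quoted in this post.
object_storage_network() {
    case "$1" in
        PHX) echo "129.146.0.0" ;;
        IAD) echo "129.213.0.0" ;;
        *)   echo "unknown region: $1" >&2; return 1 ;;
    esac
}

# Print the ADDRESS0/NETMASK0/GATEWAY0 lines for a region and gateway.
build_backup_route() {
    local region="$1" gateway="$2" network
    network="$(object_storage_network "$region")" || return 1
    printf 'ADDRESS0=%s\nNETMASK0=255.255.0.0\nGATEWAY0=%s\n' "$network" "$gateway"
}

# Example: entries for Phoenix with the gateway found on bondeth1.
build_backup_route PHX 10.232.35.1
```

Printing instead of writing the file directly keeps the sketch safe to run anywhere; on a real compute node the output would be reviewed and then added to /etc/sysconfig/network-scripts/route-bondeth1.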

Restart the interface.

[root@dbsys ~]# ifdown bondeth1; ifup bondeth1; 


Once this change is done, you should see a new entry in the route table:

[root@~ network-scripts]# netstat -rn

Kernel IP routing table

Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface

0.0.0.0         10.232.34.1     0.0.0.0         UG        0 0          0 bondeth0

10.232.34.0     0.0.0.0         255.255.255.0   U         0 0          0 bondeth0

10.232.35.0     0.0.0.0         255.255.255.0   U         0 0          0 bondeth1

129.146.0.0     10.232.35.1     255.255.0.0     UG        0 0          0 bondeth1

169.254.200.0   0.0.0.0         255.255.255.252 U         0 0          0 eth0

192.168.132.0   0.0.0.0         255.255.252.0   U         0 0          0 clib1

192.168.132.0   0.0.0.0         255.255.252.0   U         0 0          0 clib0

192.168.136.0   0.0.0.0         255.255.248.0   U         0 0          0 stib0

192.168.136.0   0.0.0.0         255.255.248.0   U         0 0          0 stib1
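Checking the routing table by eye works for one node, but the same check has to be repeated on every compute node. A small filter can do it, sketched here under the assumption that the input has the column layout of the netstat -rn output above (destination first, gateway second, interface last). The function name has_route is hypothetical.

```shell
#!/bin/sh
# has_route DEST GATEWAY IFACE
# Reads `netstat -rn` style output on stdin and succeeds only if a row
# with that destination, gateway, and interface is present.
has_route() {
    awk -v dest="$1" -v gw="$2" -v ifc="$3" '
        $1 == dest && $2 == gw && $NF == ifc { found = 1 }
        END { exit !found }
    '
}

# Demonstration against the Phoenix row from the table above.
sample="129.146.0.0     10.232.35.1     255.255.0.0     UG        0 0          0 bondeth1"
if printf '%s\n' "$sample" | has_route 129.146.0.0 10.232.35.1 bondeth1; then
    echo "backup route present"
fi
```

On a compute node the live table would be piped in instead of the sample line: netstat -rn | has_route 129.146.0.0 10.232.35.1 bondeth1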

 

Exadata Cloud – Post Provisioning View of the system

Review of Exadata Deployment

Once the Exadata provisioning process completes (which takes around 4-5 hours for a half rack), we can explore what gets deployed:

$ cat /etc/oratab

OCITEST:/u02/app/oracle/product/12.2.0/dbhome_2:Y

+ASM1:/u01/app/12.2.0.1/grid:N       # line added by Agent

 

[grid@phxdbm-o3eja1 ~]$ olsnodes -n

phxdbm-o3eja1 1

phxdbm-o3eja2 2

phxdbm-o3eja3 3

phxdbm-o3eja4 4

 

[grid@phxdbm-o3eja1 ~]$ cat /var/opt/oracle/creg/OCITEST.ini | grep nodelist

nodelist=phxdbm-o3eja1 phxdbm-o3eja2 phxdbm-o3eja3 phxdbm-o3eja4

 

[grid@phxdbm-o3eja1 ~]$ crsctl stat res -t

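--------------------------------------------------------------------------------

Name           Target  State        Server                   State details

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------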

ora.ACFSC1_DG1.C1_DG11V.advm

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.ACFSC1_DG1.C1_DG12V.advm

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.ACFSC1_DG1.dg

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.ACFSC1_DG2.C1_DG2V.advm

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.ACFSC1_DG2.dg

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.ASMNET1LSNR_ASM.lsnr

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.DATAC1.dg

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.DBFS_DG.dg

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.LISTENER.lsnr

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.RECOC1.dg

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.acfsc1_dg1.c1_dg11v.acfs

ONLINE  ONLINE       phxdbm-o3eja1            mounted on /scratch/acfsc1_dg1,STABLE

ONLINE  ONLINE       phxdbm-o3eja2            mounted on /scratch/acfsc1_dg1,STABLE

ONLINE  ONLINE       phxdbm-o3eja3            mounted on /scratch/acfsc1_dg1,STABLE

ONLINE  ONLINE       phxdbm-o3eja4            mounted on /scratch/acfsc1_dg1,STABLE

ora.acfsc1_dg1.c1_dg12v.acfs

ONLINE  ONLINE       phxdbm-o3eja1            mounted on /u02/app_acfs,STABLE

ONLINE  ONLINE       phxdbm-o3eja2            mounted on /u02/app_acfs,STABLE

ONLINE  ONLINE       phxdbm-o3eja3            mounted on /u02/app_acfs,STABLE

ONLINE  ONLINE       phxdbm-o3eja4            mounted on /u02/app_acfs,STABLE

ora.acfsc1_dg2.c1_dg2v.acfs

ONLINE  ONLINE       phxdbm-o3eja1            mounted on /var/opt/oracle/dbaas_acfs,STABLE

ONLINE  ONLINE       phxdbm-o3eja2            mounted on /var/opt/oracle/dbaas_acfs,STABLE

ONLINE  ONLINE       phxdbm-o3eja3            mounted on /var/opt/oracle/dbaas_acfs,STABLE

ONLINE  ONLINE       phxdbm-o3eja4            mounted on /var/opt/oracle/dbaas_acfs,STABLE

ora.net1.network

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.ons

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.proxy_advm

ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ONLINE  ONLINE       phxdbm-o3eja4            STABLE

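--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------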

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ora.LISTENER_SCAN2.lsnr

1        ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ora.LISTENER_SCAN3.lsnr

1        ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ora.asm

1        ONLINE  ONLINE       phxdbm-o3eja1            Started,STABLE

2        ONLINE  ONLINE       phxdbm-o3eja2            Started,STABLE

3        ONLINE  ONLINE       phxdbm-o3eja3            Started,STABLE

4        ONLINE  ONLINE       phxdbm-o3eja4            Started,STABLE

ora.cvu

1        ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ora.ocitest.db

1        ONLINE  ONLINE       phxdbm-o3eja1            Open,HOME=/u02/app/oracle/product/12.2.0/dbhome_2,STABLE

2        ONLINE  ONLINE       phxdbm-o3eja2            Open,HOME=/u02/app/oracle/product/12.2.0/dbhome_2,STABLE

3        ONLINE  ONLINE       phxdbm-o3eja3            Open,HOME=/u02/app/oracle/product/12.2.0/dbhome_2,STABLE

4        ONLINE  ONLINE       phxdbm-o3eja4            Open,HOME=/u02/app/oracle/product/12.2.0/dbhome_2,STABLE

ora.phxdbm-o3eja1.vip

1        ONLINE  ONLINE       phxdbm-o3eja1            STABLE

ora.phxdbm-o3eja2.vip

1        ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ora.phxdbm-o3eja3.vip

1        ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ora.phxdbm-o3eja4.vip

1        ONLINE  ONLINE       phxdbm-o3eja4            STABLE

ora.qosmserver

1        OFFLINE OFFLINE                               STABLE

ora.scan1.vip

1        ONLINE  ONLINE       phxdbm-o3eja2            STABLE

ora.scan2.vip

1        ONLINE  ONLINE       phxdbm-o3eja3            STABLE

ora.scan3.vip

1        ONLINE  ONLINE       phxdbm-o3eja1            STABLE

--------------------------------------------------------------------------------

[grid@phxdbm-o3eja1 ~]$ asmcmd lsct

DB_Name  Status     Software_Version  Compatible_version  Instance_Name   Disk_Group

+APX     CONNECTED        12.2.0.1.0          12.2.0.1.0  +APX1   ACFSC1_DG1

+APX     CONNECTED        12.2.0.1.0          12.2.0.1.0  +APX1   ACFSC1_DG2

+ASM     CONNECTED        12.2.0.1.0          12.2.0.1.0  +ASM1   DATAC1

+ASM     CONNECTED        12.2.0.1.0          12.2.0.1.0  +ASM1    DBFS_DG

OCITEST  CONNECTED        12.2.0.1.0          12.2.0.0.0  OCITEST1 DATAC1

OCITEST  CONNECTED        12.2.0.1.0          12.2.0.0.0  OCITEST1  RECOC1

_OCR     CONNECTED         –                  phxdbm-o3eja1.client.phxexadata.oraclevcn.com  DBFS_DG

yoda     CONNECTED        12.2.0.1.0          12.2.0.0.0  yoda1    DATAC1

yoda     CONNECTED        12.2.0.1.0          12.2.0.0.0  yoda1    RECOC1

 

[root@phxdbm-o3eja1 ~]# df -k

Filesystem           1K-blocks     Used Available Use% Mounted on

/dev/mapper/VGExaDb-LVDbSys1

24639868  3878788  19486408  17% /

tmpfs                742619136  2465792 740153344   1% /dev/shm

/dev/xvda1              499656    26360    447084   6% /boot

/dev/mapper/VGExaDb-LVDbOra1

20511356   719324  18727072   4% /u01

/dev/xvdb             51475068  9757380  39079864  20% /u01/app/12.2.0.1/grid

/dev/xvdc             51475068  9302820  39534424  20% /u01/app/oracle/product/12.1.0.2/dbhome_1

/dev/xvdd             51475068  8173956  40663288  17% /u01/app/oracle/product/12.2.0.1/dbhome_1

/dev/xvde             51475068  6002756  42834488  13% /u01/app/oracle/product/11.2.0.4/dbhome_1

/dev/xvdg            206293688 19751360 176040184  11% /u02

/dev/asm/c1_dg12v-186

459276288  1067008 458209280   1% /u02/app_acfs

/dev/asm/c1_dg11v-186

229638144   611488 229026656   1% /scratch/acfsc1_dg1

/dev/asm/c1_dg2v-341 228589568 26597644 201991924  12% /var/opt/oracle/dbaas_acfs

 

Oracle Homes are created and mounted, though for IQN we will only be using 12.2, 12.1.0.2, and 11.2.0.4 [interim].

The following are Exadata-specific filesystems and their use cases:

/scratch/acfsc1_dg1 – Exadata staging area

/u02/app_acfs – User filesystem for applications (currently empty)

/var/opt/oracle/dbaas_acfs – Binary and image repository for all Exadata patching and enablement

Exadata Cloud Deployment and Considerations

I recently did a presentation and whiteboard session on Exadata Cloud deployment. As part of that engagement, I did a small write-up on this topic. This series of blogs reflects the presentation:

Cloud Exadata Network and Platform Configuration

 Exadata DB Systems are offered in quarter rack, half rack or full rack configurations, and each configuration consists of compute nodes and storage servers. The compute nodes are each configured as a Virtual Machine (VM).

Key Operational characteristics of Exadata Cloud

  • Admins have root privileges for the compute node VMs, so third-party software can be installed. However, only supported Oracle DB versions and rpms should be implemented.

 

  • Admins do not have administrative access to the Exadata infrastructure components, including the physical compute node hardware, network switches, power distribution units (PDUs), integrated lights-out management (ILOM) interfaces, or the Exadata Storage Servers, which are all administered by Oracle.

 

  • Admins have full administrative privileges for their databases. However, application users should connect to databases via Oracle Net Services.

 

  • Admins are responsible for database administration tasks such as creating tablespaces and managing database users.

 

  • Admins should define how ssh keys will be managed for users that need compute node access.

Provisioning Exadata Pre-reqs

The following are network pre-reqs for provisioning Cloud Exadata DB Systems

Subnets

  • Two separate VCN subnets are required: a client subnet for user data traffic and a backup subnet for backup traffic.
  • Define both the client subnet and the backup subnet as public subnets. Exadata requires a public subnet to support backup of the database to the Object Store.
  • Do not use a subnet that overlaps with 192.168.128.0/20. This restriction applies to both the client subnet and backup subnet.
  • Oracle requires that you use a VCN Resolver for DNS name resolution for the client subnet. It automatically resolves the Swift endpoints required for backing up databases, patching, and updating the cloud tooling on an Exadata DB System.
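The restriction on 192.168.128.0/20 is easy to get wrong when subnets are planned by hand, so a quick overlap check can help. This is a minimal sketch in plain shell arithmetic: two IPv4 CIDR blocks overlap exactly when they agree on the shorter of the two prefixes. The function names ip_to_int and cidr_overlaps, and the candidate subnets in the loop, are hypothetical.

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# cidr_overlaps CIDR1 CIDR2: succeed if the two ranges share any address.
cidr_overlaps() {
    local ip1="${1%/*}" len1="${1#*/}" ip2="${2%/*}" len2="${2#*/}"
    local min mask
    # Compare both addresses under the shorter (less specific) prefix.
    min=$(( len1 < len2 ? len1 : len2 ))
    mask=$(( min == 0 ? 0 : (0xFFFFFFFF << (32 - min)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip1") & mask )) -eq $(( $(ip_to_int "$ip2") & mask )) ]
}

# Check two hypothetical candidate subnets against the reserved range.
for subnet in 10.0.1.0/24 192.168.130.0/24; do
    if cidr_overlaps "$subnet" 192.168.128.0/20; then
        echo "$subnet overlaps 192.168.128.0/20: pick another range"
    else
        echo "$subnet is clear of 192.168.128.0/20"
    fi
done
```

The same check applies to both the client subnet and the backup subnet, since the restriction covers both.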

At the completion of the provisioning, you should have the following configured:

Security Lists and Routing

  • Each VCN subnet has a default security list that contains a rule to allow TCP traffic on destination port 22 (SSH) from source 0.0.0.0/0 and any source port. Properly configure the security list ingress and egress rules.
  • The OneCommand configuration enables TCP and ICMP traffic between all nodes and all ports in the respective subnet for the client and backup subnets.
  • The Exadata DB System’s cloud network (VCN) must be configured with an internet gateway. Add a route table rule to open access to the Object Storage Service Swift endpoint on CIDR 0.0.0.0/0.
  • Update the backup subnet’s security list to disallow any access from outside the subnet and allow egress traffic for TCP port 443 (https) on CIDR ranges 129.146.0.0/16 (Phoenix region) and 129.213.0.0/16 (Ashburn region).

Enable a route table with an entry that includes an Internet Gateway. This enables remote ssh access to the Exadata nodes.

Provisioning Exadata

Service Console – Provision Exadata

Below are screenshot views that illustrate the provisioning of Exadata

Cloud Exadata Storage Configuration

Exadata Storage Servers use the following ASM disk groups:

DATA diskgroup – for the storage of Oracle Database datafiles.

RECO diskgroup – primarily used for storing files related to backup and recovery, such as RMAN backups and archived redo log files.

The split between the two depends on whether admins choose to provision for backups on Exadata storage:

  • With backups on Exadata storage, approximately 40% of the available storage space is allocated to the DATA disk group and approximately 60% is allocated to the RECO disk group.
  • Without backups on Exadata storage, approximately 80% of the available storage space is allocated to the DATA disk group and approximately 20% is allocated to the RECO disk group.
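To make the two allocation ratios concrete, here is a small arithmetic sketch. It assumes the 40/60 split applies when backups are provisioned on Exadata storage and the 80/20 split when they are not. The function name split_storage and the 100 TB figure are hypothetical; real usable capacity depends on the rack size, and the percentages are approximate.

```shell
#!/bin/sh
# split_storage USABLE_TB BACKUPS_ON_EXADATA
# BACKUPS_ON_EXADATA is "yes" or "no". Prints approximate DATA and RECO
# sizes in the same unit as USABLE_TB, using integer arithmetic.
split_storage() {
    local usable="$1" data_pct
    case "$2" in
        yes) data_pct=40 ;;   # backups kept on Exadata storage
        no)  data_pct=80 ;;   # backups kept elsewhere
        *)   echo "second argument must be yes or no" >&2; return 1 ;;
    esac
    echo "DATA=$(( usable * data_pct / 100 )) RECO=$(( usable * (100 - data_pct) / 100 ))"
}

# Hypothetical 100 TB of usable space, both ways.
split_storage 100 yes
split_storage 100 no
```

With the hypothetical 100 TB input this prints DATA=40 RECO=60 and then DATA=80 RECO=20, which mirrors the percentages in the text.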

DBFS and ACFS diskgroups are system diskgroups that support various operational purposes. The DBFS disk group is primarily used to store the shared Clusterware files (Oracle Cluster Registry and voting disks), while the ACFS disk groups are primarily used to store Oracle Database binaries, staging directories and metadata.

 

Grid Infrastructure and RAC 12.2 New Features – a Recap

The following list covers the new features in 12.2 Oracle RAC and Grid Infrastructure. This is a personal list of the features I believe to be the most interesting. I apologize to the RAC Dev team if I left out any features.

Streamlined Grid Infrastructure Installation

12.2 Grid Infrastructure software is available as an image file for download and installation. The key objective of this feature was to enable a simpler and quicker installation of Grid Infrastructure. Administrators simply prep the system by creating a new Grid home directory, appropriate users, permissions and kernel settings. Once completed, admins extract the image file into the newly created Grid home and execute the gridSetup.sh script to invoke the setup wizard, which registers the Oracle Grid Infrastructure stack with the Oracle inventory. This installation approach can be used for Oracle Grid Infrastructure for Cluster and Standalone Server configurations. The new installation method improves large-scale deployment automation as well as deployment of customized images, Patch Set Updates (PSUs) and patches.

Real Application Clusters Reader Nodes

In 12.2, Oracle extended the capability of Flex Clusters by introducing Reader nodes. Reader nodes are Leaf nodes (in a Flex Cluster) that run read-only RAC database instances. The Reader nodes are not affected by RAC reconfigurations caused by node evictions or other cluster node membership changes, as long as the Hub Node to which they are connected is part of the cluster. Reader nodes allow users to create huge reader farms (up to 64 reader nodes per Hub Node), thus enabling massive parallel processing. In this architecture, updates to the read/write instances (running on Hub nodes) are immediately propagated to the read-only instances on the Leaf Nodes, where they can be used for online reporting or instantaneous queries. Users can create services to direct queries to read-only instances running on reader nodes.

Service-Oriented Buffer Cache Access

RAC Services, which are used to allocate and distribute workloads across RAC instances, are the cornerstone of RAC workload management. There is a strong relationship between a RAC Service, a specific workload, and the database objects it accesses. With 12.2 RAC, a Service-oriented buffer cache feature was introduced to improve scale and performance by optimizing instance and node buffer cache affinity. This is done by caching, or pre-warming, data blocks for the objects a service accesses on the instances where that service is expected to run.

Twelve Days of 12.2

Server Weight-Based Node Eviction

When there is a split-brain, or when a node eviction decision must be made, the decision was traditionally based on the age, or duration, of the nodes in the cluster; i.e., nodes with a longer uptime in the cluster survive. In 12.2 RAC, Server Weight-Based Node Eviction uses a more intelligent tie-breaker mechanism to evict a particular node or a group of nodes from a cluster. The feature introspects the current load on those servers as part of the decision. Two principal mechanisms, a system-inherent automatic mechanism and a user input-based mechanism, are used to provide guidance.

Load-Aware Resource Placement

Load-aware resource placement prevents overloading a server with more database instances than the server is capable of running. The decision on whether an application can be started on a given server is based on the expected resource consumption of the application, as well as the capacity of the server in terms of CPU and memory. Administrators can define database resources such as CPU (cpu_count) and memory (memory_target) to Clusterware. Clusterware uses this information to place database instances only on servers that have a sufficient number of CPUs, amount of memory, or both. For example:

srvctl modify database -db testdb -cpucount 8 -memorytarget 64g

Hang Manager

The Hang Manager feature first became available in 11gR1. In this initial version, Hang Manager evaluated and identified system hangs, then dumped the relevant information, the “wait-for graph,” into a trace file. In 12.2, Hang Manager takes action and attempts to resolve the system hang. An ORA-32701 error message is logged in the alert log to reflect the hang resolution. Hang Manager also runs in both single-instance and Oracle RAC database instances. Hang Manager is constantly aware of processes running in reader node instances, and checks whether any of these processes are blocking progress on Hub Nodes so that it can take action, if possible.

Separation of Duty for Administering RAC Clusters

12.2 RAC introduces a new administrative privilege called SYSRAC. This privilege is used by the Clusterware agent, and removes the need to use the SYSDBA privilege for RAC administrative tasks, thus reducing the reliance on SYSDBA on production systems. Note that SYSRAC is the default mode used by the Clusterware agent to connect to the database; e.g., when executing RAC utilities such as SRVCTL.

Rapid Home Provisioning of Oracle Software

Rapid Home Provisioning enables you to create clusters, and to provision, patch, and upgrade Oracle Grid Infrastructure and Oracle Database homes. It can also provision 11.2 clusters, applications, and middleware.

Extended Clusters

In 12.2, GI administrators can create an extended RAC cluster across two or more geographically separate sites. Each site includes a set of servers with its own storage. If a site fails, the other site acts as an active standby. 12.2 Extended Clusters can be built at initial installation or converted from an existing (non-Flex ASM) cluster using the ConvertToExtended script.

De-support of OCR and Voting Files on Shared Filesystem

In Grid Infrastructure 12.2, the placement of Oracle Clusterware files (the Oracle Cluster Registry (OCR) and the Voting Files) directly on a shared file system is desupported. Only ASM or NFS is supported. If you need to use a supported shared file system, either a Network File System or a shared cluster file system, instead of native disk devices, then you must create Oracle ASM disks on the supported network file systems that you plan to use for hosting Oracle Clusterware files before installing Oracle Grid Infrastructure. You can then use the Oracle ASM disks in an Oracle ASM disk group to manage Oracle Clusterware files. If your Oracle Database files are stored on a shared file system, then you can continue to use shared file system storage for database files, instead of moving them to Oracle ASM storage.

ACFS 12.2 New Features – a Recap

Oracle Automatic Storage Management Cluster File System (ACFS) made its debut with Oracle 11.2. Many DBAs are not aware of the vast set of features available with ACFS. With each release and update to Oracle, significant enhancements have been made. With Oracle Database 12c Release 2, new features and functionality were added to ACFS.

Snapshot Enhancements

In Oracle 12.2, Oracle extends ACFS snapshot functionality and further simplifies file system snapshot operations. The following are a few of the key new features with snapshots:

Admins can now, if needed, impose quotas on snapshots to limit the amount of write operations that can be done on a snapshot. Quotas can be set at the snapshot level. Oracle also provides the capability to rename an existing ACFS snapshot, to allow more user-friendly names.

When we delete a snapshot with the “acfsutil snap delete snapshot mount_point” command, we can force a delete, even if there are open files.

There are several new capabilities for snapshot remastering and duplication. The new ACFS snapshot remaster capability allows a snapshot in the snapshot registry to become the primary file system. ACFS snapshot duplication features are also introduced: the “acfsutil snap duplicate create” command can be used to duplicate a snapshot from an existing snapshot to a standby target file system.

The “apply” option to the “acfsutil snap duplicate” command, allows us to apply deltas to the target ACFS file system or snapshot. If this is the initial apply, the target file system must be empty. If the target had been applied before, then the apply process becomes an incremental update. Before the incremental update occurs, the contents of the target file system must match the content of the older snapshot, since the last incremental update. Also, the contents of the target snapshot cannot be modified while the apply is happening.

Additionally, ACFS snapshot-based replication now uses SSH protocols to transmit data streams.

4k Sectors and Metadata

When admins create an ACFS file system, they have the option to create the file system with the 4096-byte metadata structure. When issuing the mkfs command, you can specify the metadata block size with the -i option; the two valid values are 512 bytes and 4096 bytes. The 4096-byte metadata structure is made up of multiple 512-byte logical sectors.

If the COMPATIBLE.ADVM ASM Diskgroup attribute is set to 12.2 or greater, then the metadata block is 4096 bytes by default. If COMPATIBLE.ADVM attribute is set to less than 12.2, then the block size is set to 512 bytes. When the ADVM volume of the ACFS file system is set with 4K logical disk sector size, Direct I/O requests should be aligned on the 4K offset and be a multiple of 4k size for optimal performance.
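The two rules in this section (default metadata block size by COMPATIBLE.ADVM, and 4K alignment for Direct I/O) reduce to simple comparisons. The sketch below encodes them in plain shell so they can be checked quickly. The function names metadata_block_size and is_4k_aligned are hypothetical, and only the first two components of the version string are compared, which is an assumption about how versions such as 12.1.0.2 should be read.

```shell
#!/bin/sh
# metadata_block_size COMPATIBLE_ADVM
# Print the default ACFS metadata block size in bytes: 4096 when the
# attribute is 12.2 or greater, otherwise 512.
metadata_block_size() {
    local major minor rest
    IFS=. read -r major minor rest <<EOF
$1
EOF
    minor="${minor:-0}"
    if [ "$major" -gt 12 ] || { [ "$major" -eq 12 ] && [ "$minor" -ge 2 ]; }; then
        echo 4096
    else
        echo 512
    fi
}

# is_4k_aligned OFFSET SIZE
# Succeed when a Direct I/O request starts on a 4K boundary and its
# length is a multiple of 4K (both values in bytes).
is_4k_aligned() {
    [ $(( $1 % 4096 )) -eq 0 ] && [ $(( $2 % 4096 )) -eq 0 ]
}

echo "COMPATIBLE.ADVM 12.2     -> $(metadata_block_size 12.2) bytes"
echo "COMPATIBLE.ADVM 12.1.0.2 -> $(metadata_block_size 12.1.0.2) bytes"
is_4k_aligned 8192 4096 && echo "offset 8192, size 4096: aligned"
```

A request at offset 512 with size 4096 fails the alignment check, because the offset is not on a 4K boundary even though the length is a multiple of 4K.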

Defragger

You will very rarely need the defragmentation tool, because ACFS has its own algorithms for allocation and coalescing of free space. However, for those rare situations where a file system does become fragmented, under heavy workloads or with compressed files, Oracle provides the defrag option to the acfsutil command. We can now issue the “acfsutil defrag dir” or “acfsutil defrag file” commands for on-demand defragmentation.

ACFS performs all defrag operations in the background. With the -r option of the “acfsutil defrag dir” command, you can recursively defrag subdirectories.

Compression Enhancements

ACFS compression can significantly reduce disk storage requirements for customers running databases on ACFS. Databases running on ACFS, must be of versions 11.2.0.4 or higher. ACFS compression can be enabled for specific ACFS file systems for database files, RMAN backup files, archivelogs, data pump extract files, and general purpose files. Oracle does not support redo log/flashback logs/control file compression.

When enabling ACFS compression for a file system, only new incoming files will be compressed. All existing files on the file system will remain un-compressed. Likewise, if you decide to uncompress a file system, Oracle will not de-compress files. Oracle will simply disable compression for newly created files.

To compress and uncompress ACFS file systems, execute the acfsutil compress on or acfsutil compress off commands. To view compression state and space consumption information, you can execute the “acfsutil compress info” command. The commands “acfsutil info fs” and “acfsutil info file” now support ACFS compression status.

At this time, databases with 2K or 4K block sizes are not supported for ACFS compression. ACFS compression is supported on Linux and AIX. ACFS compression is also supported with ACFS snapshot-based replication.

Loopback Devices

ACFS now supports loopback devices on the Linux operating system. With ACFS loopback device support, we can now take OVM images, templates, and virtual disks and present them as a block device. Files can be sparse or non-sparse. ACFS also supports Direct I/O on sparse images.

Metadata Collector

The metadata collector copies metadata structures from an Oracle ACFS file system to a separate output file that can be ingested for analysis and diagnostics. The metadata collector reads the contents of the file system, and all metadata is written out to a specified output file. The metadata collector can read the ACFS file system online without requiring an outage. Note that this tool is not a replacement for the file system checker command (fsck), but a supplement for additional diagnosis and support. Even though the metadata collector can read the file system while it is online, for best results, unmount the file system prior to metadata collection. The size of the output file is directly correlated to the size of the file system being collected. To collect metadata for a file system, invoke the “acfsutil meta” command.

The auto-resize feature allows us to “autoextend” a file system when it is about to run out of space. Just like an Oracle datafile that has the autoextend option enabled, the ACFS file system can now grow automatically by a fixed increment. With the -a option to the “acfsutil size” command, we can specify the increment size.

We can also specify the maximum size, or quota, that the ACFS file system may “autoextend” to, to guard against runaway space consumption. To set the maximum size for an ACFS file system, execute the “acfsutil size” command with the -x option.
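As an illustration of how the two options combine, the sketch below only builds and prints an “acfsutil size” command line; it does not run it. The function name autoresize_cmd is hypothetical, the 1G and 50G values are made-up examples, and the size suffix format is an assumption that should be checked against the acfsutil help on your release. The mount point is the /u02/app_acfs file system shown earlier.

```shell
#!/bin/sh
# autoresize_cmd INCREMENT MAX MOUNT_POINT
# Print (do not execute) an acfsutil command that sets the autoextend
# increment with -a and the maximum size with -x.
autoresize_cmd() {
    local increment="$1" max="$2" mount_point="$3"
    printf 'acfsutil size -a %s -x %s %s\n' "$increment" "$max" "$mount_point"
}

# Hypothetical values: grow in 1G steps, never beyond 50G.
autoresize_cmd 1G 50G /u02/app_acfs
```

Printing the command first makes it easy to review the increment and the cap before running it as a privileged user on the node that mounts the file system.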

Uh Oh, I didn’t set my Exadata core count correctly, now what?

Changing Capacity On-Demand Core Count in Exadata

We recently implemented an Exadata X6 at one of our client sites (yes, we don’t use Oracle ACS, we do it ourselves). However, the client failed to tell us that they had licensed only a subset of the cores per compute node until after we had implemented the system and *migrated* production databases onto the X6. So how do we set the core count correctly after implementation (post-OEDA run)? We had heard horror stories from other folks saying they needed to re-image to set the core count. To be specific, it’s easy to increase cores, but decreasing them is nasty business.

The steps below are ones we used to decrease the core count:

1. Gracefully stop all databases running on all compute nodes.

2. Log in to the compute nodes as root and run the “dbmcli” utility.

3. Display the current core count using the following command:

LIST DBSERVER attributes coreCount

4. Change the core count to the desired count using this command (this needs to be done on all compute nodes):

ALTER DBSERVER pendingCoreCount = 14

NOTE: Since we are decreasing the number of cores after installation of the system, the FORCE option needs to be used:

ALTER DBSERVER pendingCoreCount = 14 FORCE

5. Reboot

6. Verify the change was correct by using the “LIST” command in step 3.
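The only decision in the steps above is whether FORCE is needed, which depends on whether the target count is lower than the current one. Here is a minimal sketch of that decision. The function name core_count_cmd is hypothetical, and the current count of 44 is a made-up example; use the value reported by the LIST command in step 3.

```shell
#!/bin/sh
# core_count_cmd CURRENT TARGET
# Print the dbmcli statement for the change, adding FORCE only when the
# core count is being decreased.
core_count_cmd() {
    local current="$1" target="$2"
    if [ "$target" -lt "$current" ]; then
        echo "ALTER DBSERVER pendingCoreCount = $target FORCE"
    else
        echo "ALTER DBSERVER pendingCoreCount = $target"
    fi
}

# Hypothetical example: reduce from 44 cores to the licensed 14.
core_count_cmd 44 14
```

On each compute node the printed statement could then be passed to dbmcli as root, for example with dbmcli -e "$(core_count_cmd 44 14)", followed by the reboot and verification steps.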

 

Just FYI… Troubleshooting

If there is an issue with the MS service starting up, it could be because of the Java being used on the system.

For Exadata release 12.1.2.3.1.160411, the version of Java was 1.8.0.66 and was flagged by a security audit as a vulnerability and was removed from the system.  When the system rebooted, the MS service couldn’t start back up because Java was removed. Follow these steps to reinstall Java and get the MS service restarted on the compute nodes.

1. Download the latest JDK from the Oracle site. NOTE: The RPM download was used.

2. Install the JDK package on the system:

rpm -ivh jdk-8u102-linux-x64.rpm

3. Redeploy the MS service application:

/opt/oracle/dbserver/dbms/deploy/scripts/unix/setup_dynamicDeploy DB -D

4. Restart the MS service:

ALTER DBSERVER RESTART SERVICES MS

 

What’s with MGMTDB anyways

Those who have either upgraded to or fresh-installed the 12.1 (12c) Grid Infrastructure stack will notice a new database instance (-MGMTDB) that was provisioned automagically. So what is this MGMTDB, and why do I need this overhead?

So let’s recap what the DB is and what it does…
The Management Database is the central repository for Cluster Health Monitor data; it is the Grid Infrastructure Management Repository.

The MGMT database is a container database (CDB) with one pluggable database (PDB) running. However, this database runs out of the Grid Infrastructure home.
The MGMTDB is a RAC One Node database; i.e., it runs on one node at a time, but because it is a clustered resource, it can be started on or failed over to any node in the cluster. MGMTDB is a non-critical component of the GI stack (with no “real” hard dependencies). This means that if MGMTDB fails or becomes unavailable, Grid Infrastructure continues running.

MGMTDB is configured (subject to change) with a 750 MB SGA, a 325 MB PGA, and a 5 GB database size. Note that, due to its footprint, MGMTDB’s SGA is not configured for hugepages. Since this database is dynamically created at install time, the OUI installer has no prior knowledge of the databases that are configured on, or will be migrated to, this cluster; to avoid any database name conflict, the name “-MGMTDB” was chosen (notice the “-”). Note that bypassing the MGMTDB installation is only allowed for upgrades to 12.1.0.2. New 12.1.0.2 installations or upgrades to future releases will require MGMTDB to be installed. If MGMTDB is not selected during an upgrade, all features that depend on it (Cluster Health Monitor (CHM/OS), etc.) will be disabled.

If you are wondering where the datafiles and other structures for this database are stored: they are placed in the same diskgroup as the OCR and voting files. However, these database files can be migrated to a different ASM diskgroup post install.

MGMTDB stores a subset of Operating System (OS) performance data for the longer term, to provide diagnostic information and support intelligent workload management. Performance data (OS metrics similar to, but a subset of, Exawatcher) collected by the ‘Cluster Health Monitor’ (CHM) is also stored on local disk, so when MGMTDB is not used, CHM data can still be obtained from local disk, but intelligent workload management (QoS) will be disabled. Longer term, MGMTDB will become a key component of the Grid Infrastructure and provide services for important components; because of this, MGMTDB will eventually become mandatory in future upgrades to releases on Exadata.

See document 1568402.1 for more details.

Setting Jumbo Frames – Portrait of a Large MTU size

There are cases where we need to ensure that large packet “address-ability” exists. This is needed to verify the configuration for non-standard packet sizes, i.e., an MTU of 9000; for example, if we are deploying a NAS or backup server across the network.

Setting the MTU can be done by editing the configuration script for the relevant interface in /etc/sysconfig/network-scripts/. In our example, we will use the eth1 interface, thus the file to edit would be ifcfg-eth1.

Add a line to specify the MTU, for example:
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.20.2
NETMASK=255.255.255.0
MTU=9000

Once the MTU is set in the file, just do an ifdown eth1 followed by an ifup eth1.
An ifconfig eth1 will tell you if it’s set correctly:

eth1 Link encap:Ethernet HWaddr 00:0F:EA:94:xx:xx
inet addr:192.168.20.2 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::20f:eaff:fe91:407/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:141567 errors:0 dropped:0 overruns:0 frame:0
TX packets:141306 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:101087512 (96.4 MiB) TX bytes:32695783 (31.1 MiB)
Interrupt:18 Base address:0xc000

To validate end-2-end MTU 9000 packet management

Execute the following on Linux systems:

ping -M do -s 8972 [destinationIP]
For example: ping -M do -s 8972 datadomain.viscosityna.com

The reason for 8972 is that, on Linux/Unix systems, the size given to ping is the ICMP payload only; it does not include the 28 bytes of headers: ICMP (8) + IP (20). Therefore, take 9000 and subtract 28 to get 8972.
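The arithmetic above can be sketched as a small shell helper. This is only an illustration, assuming an IPv4 path with no IP options (20-byte IP header) plus the 8-byte ICMP header; the function name icmp_payload_for_mtu is hypothetical.

```shell
# Hypothetical helper: largest ping payload (-s value) that fits in one
# unfragmented IPv4 packet for a given MTU.
# Assumes a 20-byte IPv4 header (no options) and an 8-byte ICMP header.
icmp_payload_for_mtu() {
    mtu=$1
    echo $(( mtu - 20 - 8 ))
}

icmp_payload_for_mtu 9000   # prints 8972
icmp_payload_for_mtu 1500   # prints 1472
```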

[root@racnode01]# ping -s 8972 -M do datadomain.viscosityna.com
PING datadomain.viscosityna.com. (192.168.20.32) 8972(9000) bytes of data.
8980 bytes from racnode1.viscosityna.com. (192.168.20.2): icmp_seq=0 ttl=64 time=0.914 ms
5 packets transmitted, 5 received, 0% packet loss, time 4003ms
rtt min/avg/max/mdev = 0.859/0.955/1.167/0.109 ms, pipe 2

To illustrate what happens when the packet does not fit within the MTU, I can set a larger packet size in the ping (8993). The packet would need to be fragmented, but because the DF bit is set you will see “Frag needed and DF set”. In this example, the ping command uses “-s” to set the packet size, and “-M do” sets the Do Not Fragment (DF) bit.

[root@racnode01]# ping -s 8993 -M do datadomain.viscosityna.com
PING datadomain.viscosityna.com. (192.168.20.32) 8993(9001) bytes of data.
From racnode1.viscosityna.com. (192.168.20.2) icmp_seq=0 Frag needed and DF set (mtu = 9000)

By adjusting the packet size, you can figure out what the MTU for the link is. This will be the lowest MTU allowed by any device in the path, e.g., the switch, the source or target node, or anything else in between.
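The "adjust the packet size" approach can be automated with a binary search. The following is a rough sketch, not a definitive tool: it assumes the Linux iputils ping options (-M do, -c, -W, -s), an IPv4 path, and that a probe either clearly passes or clearly fails (a lost reply is treated the same as "too big"). TARGET, probe and find_path_mtu are hypothetical names introduced for illustration.

```shell
# Hypothetical sketch: binary-search the largest ICMP payload that passes
# with the Do Not Fragment bit set, then add 28 bytes of headers.
# TARGET is a placeholder for the host being tested.
TARGET=${TARGET:-datadomain.viscosityna.com}

# probe <payload>: succeeds when one DF ping of that payload size is answered.
probe() {
    ping -M do -c 1 -W 1 -s "$1" "$TARGET" >/dev/null 2>&1
}

# find_path_mtu <upper_bound_payload>: prints the discovered path MTU.
find_path_mtu() {
    lo=0
    hi=$1
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid"; then
            lo=$mid            # this size fits; try larger
        else
            hi=$(( mid - 1 ))  # too big; try smaller
        fi
    done
    echo $(( lo + 28 ))
}
```

For example, find_path_mtu 9100 searches payloads from 0 to 9100 bytes. If the host is unreachable the function simply reports 28, so check basic connectivity first.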

Finally, another way to verify the MTU size is the command ‘netstat -a -i -n’; the MTU column should show 9000 when you are testing jumbo frames.
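The same check can be scripted. A minimal sketch, assuming a Linux host where the kernel exposes each interface's MTU under /sys/class/net/<interface>/mtu; the function name check_mtu and the SYSFS_NET variable are hypothetical and exist only to make the example testable.

```shell
# Hypothetical helper: compare an interface's configured MTU with an
# expected value. Reads <SYSFS_NET>/<iface>/mtu; SYSFS_NET defaults to
# /sys/class/net and can be overridden for testing.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

check_mtu() {
    iface=$1
    expected=$2
    # Return 2 when the interface (or its mtu file) does not exist.
    actual=$(cat "$SYSFS_NET/$iface/mtu" 2>/dev/null) || return 2
    if [ "$actual" -eq "$expected" ]; then
        echo "$iface: MTU $actual OK"
    else
        echo "$iface: MTU $actual, expected $expected"
        return 1
    fi
}
```

Usage: check_mtu eth1 9000 returns 0 when the interface is set for jumbo frames, 1 on a mismatch, and 2 when the interface cannot be read.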

High Level Overview of 11204 ASM Rebalance in Async ARB0

High Level look at 11204 Rebalance with Plan Optimization and Async ARB0

 

Drop disk

 

SQL> alter diskgroup reco drop disk 'ASM_NORM_DATA4' rebalance power 12

here we issue the rebalance

NOTE: requesting all-instance membership refresh for group=2

GMON querying group 2 at 120 for pid 19, osid 19030

GMON updating for reconfiguration, group 2 at 121 for pid 19, osid 19030

NOTE: group 2 PST updated.

NOTE: membership refresh pending for group 2/0x89b87754 (RECO)

GMON querying group 2 at 122 for pid 13, osid 4000

SUCCESS: refreshed membership for 2/0x89b87754 (RECO)

NOTE: starting rebalance of group 2/0x89b87754 (RECO) at power 12   <-- rebalance internally started

Starting background process ARB0   <-- ARB0 gets started for this rebalance

SUCCESS: alter diskgroup reco drop disk ‘ASM_NORM_DATA4’ rebalance power 12

Wed Sep 19 23:54:10 2012

ARB0 started with pid=21, OS id=19526

NOTE: assigning ARB0 to group 2/0x89b87754 (RECO) with 12 parallel I/Os   <-- ARB0 assigned to this diskgroup rebalance. Note that it states 12 parallel I/Os

NOTE: Attempting voting file refresh on diskgroup RECO

Wed Sep 19 23:54:38 2012

NOTE: requesting all-instance membership refresh for group=2   <-- first indications that rebalance is completing

GMON updating for reconfiguration, group 2 at 123 for pid 22, osid 19609

NOTE: group 2 PST updated.

SUCCESS: grp 2 disk ASM_NORM_DATA4 emptied   <-- Once the rebalance relocation phase is complete, the disk is emptied

NOTE: erasing header on grp 2 disk ASM_NORM_DATA4   <-- The emptied disk’s header is erased and set to FORMER

NOTE: process _x000_+asm (19609) initiating offline of disk 3.3915941808 (ASM_NORM_DATA4) with mask 0x7e in group 2   <-- The dropped disk is offlined

NOTE: initiating PST update: grp = 2, dsk = 3/0xe96887b0, mask = 0x6a, op = clear

GMON updating disk modes for group 2 at 124 for pid 22, osid 19609

NOTE: PST update grp = 2 completed successfully

NOTE: initiating PST update: grp = 2, dsk = 3/0xe96887b0, mask = 0x7e, op = clear

GMON updating disk modes for group 2 at 125 for pid 22, osid 19609

NOTE: cache closing disk 3 of grp 2: ASM_NORM_DATA4

NOTE: PST update grp = 2 completed successfully

GMON updating for reconfiguration, group 2 at 126 for pid 22, osid 19609

NOTE: cache closing disk 3 of grp 2: (not open) ASM_NORM_DATA4

NOTE: group 2 PST updated.

Wed Sep 19 23:54:42 2012

NOTE: membership refresh pending for group 2/0x89b87754 (RECO)

GMON querying group 2 at 127 for pid 13, osid 4000

GMON querying group 2 at 128 for pid 13, osid 4000

NOTE: Disk in mode 0x8 marked for de-assignment

SUCCESS: refreshed membership for 2/0x89b87754 (RECO)

NOTE: Attempting voting file refresh on diskgroup RECO

Wed Sep 19 23:56:45 2012

NOTE: stopping process ARB0   <-- All phases of rebalance are completed and ARB0 is shut down

SUCCESS: rebalance completed for group 2/0x89b87754 (RECO)   <-- Rebalance marked as complete
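The timestamps in the excerpt above bracket the life of ARB0, so the elapsed time of a rebalance can be pulled out of the alert log. A minimal sketch, assuming GNU date (for date -d) and the layout shown above, where each timestamp sits on its own line ahead of the messages it covers; the function name rebalance_seconds is hypothetical.

```shell
# Hypothetical helper: seconds between "ARB0 started" and "stopping process
# ARB0" in an ASM alert log. Assumes each timestamp is on its own line in
# the form "Wed Sep 19 23:54:10 2012" and that GNU date is available.
rebalance_seconds() {
    awk '
        # Remember the most recent timestamp line.
        /^[A-Z][a-z][a-z] [A-Z][a-z][a-z] +[0-9]+ [0-9:]+ [0-9][0-9][0-9][0-9] *$/ { ts = $0; next }
        /ARB0 started with pid/ { start = ts }
        /stopping process ARB0/ { stop = ts }
        END { if (start != "" && stop != "") print start "|" stop }
    ' "$1" | {
        if IFS='|' read -r start stop; then
            echo $(( $(date -d "$stop" +%s) - $(date -d "$start" +%s) ))
        fi
    }
}
```

Run against the drop-disk excerpt above (ARB0 started at 23:54:10, stopped at 23:56:45), this works out to 155 seconds. If the log contains several rebalances, only the last start/stop pair is reported.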

 

 

Add disk

Starting background process ARB0

SUCCESS: alter diskgroup reco add disk ‘ORCL:ASM_NORM_DATA4’ rebalance power 16

Thu Sep 20 23:08:22 2012

ARB0 started with pid=22, OS id=19415

NOTE: assigning ARB0 to group 2/0x89b87754 (RECO) with 16 parallel I/Os

Thu Sep 20 23:08:31 2012

NOTE: Attempting voting file refresh on diskgroup RECO

Thu Sep 20 23:08:46 2012

NOTE: requesting all-instance membership refresh for group=2

Thu Sep 20 23:08:49 2012

NOTE: F1X0 copy 1 relocating from 0:2 to 0:459 for diskgroup 2 (RECO)

Thu Sep 20 23:08:50 2012

GMON updating for reconfiguration, group 2 at 134 for pid 27, osid 19492

NOTE: group 2 PST updated.

Thu Sep 20 23:08:50 2012

NOTE: membership refresh pending for group 2/0x89b87754 (RECO)

NOTE: F1X0 copy 2 relocating from 1:2 to 1:500 for diskgroup 2 (RECO)

NOTE: F1X0 copy 3 relocating from 2:2 to 2:548 for diskgroup 2 (RECO)

GMON querying group 2 at 135 for pid 13, osid 4000

SUCCESS: refreshed membership for 2/0x89b87754 (RECO)

Thu Sep 20 23:09:06 2012

NOTE: Attempting voting file refresh on diskgroup RECO

Thu Sep 20 23:09:57 2012

NOTE: stopping process ARB0

SUCCESS: rebalance completed for group 2/0x89b87754 (RECO)

SQL> select NUMBER_KFGMG, OP_KFGMG, ACTUAL_KFGMG, REBALST_KFGMG from X$KFGMG;

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           2          1            0             2
           2         32            0             2

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           2          1           16             1

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           2          1           16             2
           2         32           16             2

NUMBER_KFGMG   OP_KFGMG ACTUAL_KFGMG REBALST_KFGMG
------------ ---------- ------------ -------------
           2          1           16             2

Exadata Monitoring and Agents – EM Plugin

To those who attended our Exadata Monitoring and Agents session: here are some answers and follow-up from the chat room.

The primary goal of the Exadata Plugin is to digest the schematic file and validate the database.xml and catalog.xml files. If the pre-check runs without failure, then Discovery can be executed.

The Agent runs only on compute nodes and monitors all components remotely; i.e., no additional scripts/code is installed on the peripheral components. Agents pull component metrics and vitals either by running ssh commands (based on user equivalence) or by subscribing to SNMP traps.

Note that there are always two agents deployed: a master agent, which does the majority of the work, and a slave agent, which “kicks in” if the master fails. Agents should be installed on all compute nodes.

Initially, the guided discovery wizard runs ASM kfod to get disk names and reads cellip.ora.

The components monitored via the Exadata-EM plugin include the following:
• Storage Cells

• Infiniband Switches (IB switches)
EM agent runs remote ssh calls to collect switch metrics; the IB switch sends SNMP traps (PUSH) for all alerts. This collection does require ssh equivalence for nm2user. It includes various sensor data (fan, voltage, temperature) as well as port metrics.
The plugin does the following:
ssh nm2user@ ibnetdiscover

It reads the names of the components connected to the IB switch and matches the compute node hostnames to the hostnames used to install the agent.

• Cisco Switch
EM agent runs remote SNMP get calls to gather metric data; this includes port status and switch vitals, e.g., CPU, memory, power, and temperature. In addition, performance metrics are also collected, e.g., ingress and egress throughput rates.

• PDU and KVM
For the PDU, both active and passive PDUs are monitored. The Agent runs SNMP get calls against each PDU; metric collection includes power, temperature, and fan status. The same steps and metrics are gathered for the KVM.

• ILOM targets
EM Agent executes remote ipmitool calls to each compute node’s ILOM target. This execution requires oemuser credentials to run ipmitool. Agent collects sensor data as well as configuration data (firmware version and serial number)

In EM 12.1.0.4, the key enhancements introduced include gathering IB performance, on-demand schematic refresh, and Cell performance monitoring, as well as a guided resolution for cell alerts and automated SNMP notification setup for Exadata Storage Server and InfiniBand switches.

The Agent discovers IB switches and compute nodes from the output of ibnetdiscover. The KVM, PDU, Cisco and ILOM discovery is performed via the schematic file on the compute node. Finally, the Agent subscribes to SNMP for cells and IB switches; note that SNMP has to be manually set up and enabled on peripheral components for SNMP push of cell alerts. EM agent runs cellcli via ssh to obtain storage metrics; this does require ssh equivalence with the Agent user.

In the latest version (as of this writing, 12.1.0.6), there were a number of key visualization and metrics enhancements. For example:

• CDB-level I/O Workload Summary with PDB-level details breakdown.
• I/O Resource Management for Oracle Database 12c.
• Exadata Database Machine-level physical visualization of I/O Utilization for CDB and PDB on each Exadata Storage Server. There is also a critical integration link to Database Resource Management UI.
• Additional InfiniBand Switch Sensor fault detection, including power supply unit sensors and fan presence sensors.
• Automatically push Exadata plug-in to agent during discovery.

Use fully qualified names with the Agent; using shortened names will cause issues. If there are any issues with metrics gathering or the agent, the EMDiag Kit should be used to triage them. The EMDiag Kit includes scripts that can be used to diagnose EM issues. Specifically, the kit includes repvfy, agtvfy, and omsvfy. These tools can be used to diagnose issues with the OEM Repository, EM Agents, and management services, respectively.
To obtain the EMDiag Kit, download the zip file for the version that you need, per Oracle Support Note: MOS ID# 421053.1

export EMDIAG_HOME=/u01/app/oracle/product/emdiag
$EMDIAG_HOME/bin/repvfy install
$EMDIAG_HOME/bin/repvfy verify Exadata -level 9 -details
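Since short hostnames are a common source of Agent trouble, a quick pre-check on each compute node can help. The sketch below is only an illustration: it treats any name containing a dot as fully qualified, which is a simplification, and is_fqdn is a hypothetical function name.

```shell
# Hypothetical helper: succeed when the given name contains at least one
# dot, i.e. it carries a domain part and is not just a short hostname.
is_fqdn() {
    case "$1" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example: warn if this host reports a short name.
if ! is_fqdn "$(hostname -f 2>/dev/null)"; then
    echo "warning: hostname -f did not return a fully qualified name" >&2
fi
```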

These questions keep coming up, so for quick reference, here’s a terminology re-cap!!

ZFS Snapshots and Clones
Snapshots

A snapshot is a point-in-time copy of a filesystem or LUN. Snapshots can be created manually or by setting up an automatic schedule. Snapshots initially consume no additional space, but as the active share changes, previously unreferenced blocks will be kept as part of the last snapshot. Over time, the last snapshot will take up additional space, with a maximum equivalent to the size of the filesystem at the time the snapshot was taken.

Filesystem snapshots can be accessed over the standard protocols in the .zfs/snapshot directory at the root of the filesystem. This directory is hidden by default, and can only be accessed by explicitly changing to the .zfs directory. This behavior can be changed in the Snapshot view, but may cause backup software to back up snapshots in addition to live data. LUN snapshots cannot be accessed directly, though they can be used as a rollback target or as the source of a clone. Project snapshots are the equivalent of snapshotting all shares within the project, and snapshots are identified by name. If a share snapshot that is part of a larger project snapshot is renamed, it will no longer be considered part of the same snapshot, and if any snapshot is renamed to have the same name as a snapshot in the parent project, it will be treated as part of the project snapshot.

Shares support the ability to rollback to previous snapshots. When a rollback occurs, any newer snapshots (and clones of newer snapshots) will be destroyed, and the active data will be reverted to the state when the snapshot was taken. Snapshots only include data, not properties, so any property settings changed since the snapshot was taken will remain.

Clones

A clone is a writable copy of a share snapshot, and is treated as an independent share for administrative purposes. Like snapshots, a clone will initially take up no extra space, but as new data is written to the clone, the space required for the new changes will be associated with the clone. Clones of projects are not supported. Because space is shared between snapshots and clones, and a snapshot can have multiple clones, a snapshot cannot be destroyed without also destroying any active clones.

Shadow Data Migration

A common task for administrators is to move data from one location to another. In the most abstract sense, this problem encompasses a large number of use cases, from replicating data between servers to keeping user data on laptops in sync with servers. There are many external tools available to do this, but the Sun Storage 7000 series of appliances has two integrated solutions for migrating data that address the most common use cases. The first, remote replication, is intended for replicating data between one or more appliances, and is covered separately. The second, shadow migration, is described here.

Shadow migration is a process for migrating data from external NAS sources with the intent of replacing or decommissioning the original once the migration is complete. This is most often used when introducing a Sun Storage 7000 appliance into an existing environment in order to take over file sharing duties of another server, but a number of other novel uses are possible, outlined below.

ZFS Storage Pools and Projects

Storage Pools

The appliance is based on the ZFS filesystem. ZFS groups underlying storage devices into pools, and filesystems and LUNs allocate from this storage as needed. Before creating filesystems or LUNs, you must first configure storage on the appliance. Once a storage pool is configured, there is no need to statically size filesystems, though this behavior can be achieved by using quotas and reservations.

While multiple storage pools are supported, this type of configuration is generally discouraged because it has significant drawbacks, as described in the storage configuration section. Multiple pools should only be used where the performance or reliability characteristics of two different profiles are drastically different, such as a mirrored pool for databases and a RAID-Z pool for streaming workloads.

When multiple pools are active on a single host, the BUI will display a drop-down list in the menu bar that can be used to switch between pools. In the CLI, the name of the current pool will be displayed in parentheses, and can be changed by setting the ‘pool’ property. If there is only a single pool configured, then these controls will be hidden. When multiple pools are configured, the default pool chosen by the UI is arbitrary, so any scripted operation should be sure to set the pool name explicitly before manipulating any shares.

Projects

All filesystems and LUNs are grouped into projects. A project defines a common administrative control point for managing shares. All shares within a project can share common settings, and quotas can be enforced at the project level in addition to the share level. Projects can also be used solely for grouping logically related shares together, so their common attributes (such as accumulated space) can be accessed from a single point.

By default, the appliance creates a single default project when a storage pool is first configured. It is possible to create all shares within this default project, although for reasonably sized environments creating additional projects is strongly recommended, if only for organizational purposes.