PREREQ FOR FAST START FAIL OVER
Maximum Performance Mode (Oracle Database 11g Rel 1 and later)
Oracle Database 11g FSFO adds support for Maximum Performance mode (async redo transfer), providing the flexibility to trade durability for performance. Commit latency is not affected by redo transfer, but committed transactions whose redo has not been received by the standby will be lost during failover. FSFO configurations in Maximum Performance mode may limit potential data loss by specifying the maximum allowable age of transactions that are lost during a failover. For example, if the limit specified is 30 seconds (the default), FSFO guarantees that all transactions that committed prior to 30 seconds ago are preserved during failover. The minimum allowable limit is 10 seconds.
Data Guard Broker
Broker is a Data Guard management utility that maintains state information about a primary and its standby databases. It automatically sets Data Guard related database initialization parameters on instance start and role transitions, starts apply services for standbys, and automates many of the administrative tasks associated with maintaining a Data Guard configuration. FSFO is a feature of Broker which records information about the failover target, how long to wait after a failure before triggering a failover, and other FSFO specific properties.Flashback Database
Flashback Database is a continuous data protection (CDP) solution integrated with the Oracle Database. It provides a way to quickly restore a database to a previous point in time or SCN using on-disk data structures called flashback logs. Flashing back a database is much faster and more seamless (one simple DDL statement) than traditional point-in-time or SCN-based recovery. FSFO uses Flashback Database as part of the process of reinstating a failed primary as a standby.Problems with automatic reinstatement are frequently due to misconfiguration, so let's look at this in a bit more detail.
Flashback Database records the before-image of changed blocks. To avoid the overhead of recording every change to every block, Flashback Database takes a "fuzzy" snapshot every 30 minutes and only records the before-image block upon its first change since the last snapshot. Subsequent changes to the same block during the same snapshot are not recorded.
Flashing back a database occurs in two stages:
- Restore - Flashback Database restores the datafiles to the closest snapshot prior to the specified SCN. This can be compared to performing an RMAN restore of the datafiles from a backup taken prior to the specified SCN, but is much faster.
- Media Recovery - Once the restore is complete, recovery proceeds as a typical media recovery, applying redo from archived and online redologs and rolling back uncommitted changes with undo. This means that in order for a flashback database operation to succeed,Flashback Database requires all archive redo logs generated between the snapshot time and restore SCN (typically the past 30 minutes of redo). Use the V$RECOVERY_PROGRESS view to monitor recovery status.
For FSFO environments, set db_flashback_retention_target = 60 or higher to provide sufficient Flashback Database history for automatic standby reinstatement. Metadata for the fuzzy snapshot is stored in the flashback log itself. If that metadata is pushed out, Oracle can no longer find a fuzzy snapshot so it will not be able to flash back. To avoid problems due to timing variations, values less than 60 minutes are not recommended and values of 30 or less virtually guarantee Flashback Database failure.
Flashback Database stores its logs in the Flash Recovery Area (FRA), so the FRA must be large enough to store at least 60 minutes of Flashback Database history. The total storage requirement is proportional to the number of distinct blocks changed during snapshots - e.g. 1,000,000 block changes on a small set of blocks generates less Flashback Database history than 1,000,000 changes on a larger set of blocks. A good method to determine Flashback Database storage requirements is to enable Flashback Database and observe the amount of storage it uses during several peak loads. There is little risk in enabling Flashback Database to determine its storage requirements - it can be disabled while the primary is open if necessary. However, re-enabling Flashback Database will require a bounce since the database must be mounted and not open.
FSFO observer
Note: the FSFO observer version must match the database version. Oracle Database 11g observers are incompatible with 10g databases and vice-versa.
Defination. While Oracle 11g’s Data Guard definitely protects a database when the entire production site is lost via its failover capabilities, it’s still necessary for an Oracle DBA to intervene to complete the failover process. This article – the seventh in this ongoing series – shows how to set up and control automatic failover using the Fast-Start Failover Observer so that DBA intervention is no longer required during a disaster recovery scenario.
Its failover capability of oracle database in the event of failure of primary database, where
manual intervention is not needed,
Table 7-1. Fast-Start Failover: Triggering Events
| |
Event Class | Triggering Event |
Connectivity Loss | Network connectivity is lost simultaneously between the primary database and:
|
Database Health Check Failure | A database health check detects any of the following optionally-configured failures:
|
Instance Crash (Single Instance) | The primary database’s instance has crashed |
Instances Crash (RAC) | All instances for a RAC primary database have crashed |
Shutdown Abort on Primary | The primary database is shut down with the SHUTDOWN ABORT command |
Fast-Start Failover detects one of these failover situations through the Fast-Start Failover Observer (FSFO
all of which are completely configurable using the Data Guard Broker Manager utility (DGMGRL).
Fast-Start Failover: Basic Configuration. Since it’s certainly possible that more than one physical standby database could exist in a Data Guard configuration, the first thing that I’ll need to establish is which physical standby database should be paired with the primary database in case a fast-start failover is initiated. I’ll do that by setting a value for the FastStartFailoverTarget parameter via the DGMGRL utility. Note that I’ve chosen the primary database as the fast-start failover target for the selected physical standby database as well:
ENABLING FAST START FAILOVER
ENABLING FAST START FAILOVER
DGMGRL> EDIT DATABASE primary SET PROPERTY FastStartFailoverTarget = 'standby';
DGMGRL> EDIT DATABASE standby SET PROPERTY FastStartFailoverTarget = 'primary';
SETTING FAST START FAIL OVER THRESHOLD VALUE
SETTING FAST START FAIL OVER THRESHOLD VALUE
DGMGRL>EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = '180';
DGMGRL>EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit =60;
Next, I’ll establish how long the Fast-Start Failover Observer should wait until it decides that the primary database is unreachable by setting a value of 180 seconds for the FastStartFailoverThreshold parameter:
Now that the basic fast-start failover configuration is completed, I can confirm its status with the SHOW FAST_START FAILOVER command:
DGMGRL> show fast_start failover
Fast-Start Failover: DISABLED
Threshold: 90 seconds
Target: (none)
Observer: orcl_stdby1
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES
Oracle Error Conditions:
(none)
DGMGRL> show database verbose primary;
Database
Name: primary
Role: PRIMARY
Enabled: YES
Intended State: TRANSPORT-ON
Instance(s):
orcl_primary
Properties:
DGConnectIdentifier = 'primary'
ObserverConnectIdentifier = ''
LogXptMode = 'ASYNC'
DelayMins = '0'
Binding = 'OPTIONAL'
MaxFailure = '0'
MaxConnections = '1'
ReopenSecs = '300'
NetTimeout = '30'
RedoCompression = 'DISABLE'
LogShipping = 'ON'
PreferredApplyInstance = ''
ApplyInstanceTimeout = '0'
ApplyParallel = 'AUTO'
StandbyFileManagement = 'AUTO'
ArchiveLagTarget = '0'
LogArchiveMaxProcesses = '4'
LogArchiveMinSucceedDest = '1'
DbFileNameConvert = ''
LogFileNameConvert = ''
FastStartFailoverTarget = 'standby'
StatusReport = '(monitor)'
InconsistentProperties = '(monitor)'
InconsistentLogXptProps = '(monitor)'
SendQEntries = '(monitor)'
LogXptStatus = '(monitor)'
RecvQEntries = '(monitor)'
HostName = 'primary'
SidName = 'primary'
StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/stanby/'
AlternateLocation = ''
LogArchiveTrace = '0'
LogArchiveFormat = 'log_%s_%t_%r.arc'
LatestLog = '(monitor)'
TopWaitEvents = '(monitor)'
Current status for "primary":
SUCCESS
Current status for "orcl_primary":
SUCCESS
DGMGRL> show database verbose standby
Database
Name: orcl_stdby1
Role: PHYSICAL STANDBY
Enabled: YES
Intended State: APPLY-ON
Instance(s):
orcl_stdby1
Properties:
DGConnectIdentifier = 'orcl_stdby1'
ObserverConnectIdentifier = ''
LogXptMode = 'ASYNC'
DelayMins = '0'
Binding = 'OPTIONAL'
MaxFailure = '0'
MaxConnections = '1'
ReopenSecs = '300'
NetTimeout = '30'
RedoCompression = 'DISABLE'
LogShipping = 'ON'
PreferredApplyInstance = ''
ApplyInstanceTimeout = '0'
ApplyParallel = 'AUTO'
StandbyFileManagement = 'AUTO'
ArchiveLagTarget = '0'
LogArchiveMaxProcesses = '4'
LogArchiveMinSucceedDest = '1'
DbFileNameConvert = ''
LogFileNameConvert = '/u01/app/oracle/oradata/orcl/, /u01/app/oracle/oradata/stdby/'
FastStartFailoverTarget = 'orcl_primary'
StatusReport = '(monitor)'
InconsistentProperties = '(monitor)'
InconsistentLogXptProps = '(monitor)'
SendQEntries = '(monitor)'
LogXptStatus = '(monitor)'
RecvQEntries = '(monitor)'
HostName = '11gStdby'
SidName = 'orcl_stdby1'
StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/STDBY/'
AlternateLocation = ''
LogArchiveTrace = '0'
LogArchiveFormat = 'log_%s_%t_%r.arc'
LatestLog = '(monitor)'
TopWaitEvents = '(monitor)'
Current status for "orcl_stdby1":
SUCCESS
Configuring Detection of Advanced Failover Conditions
Which physical standby database is the target for Fast-Start Failover and how long to wait until initiating Fast-Start Failover are obviously the most important reactions I want to control in a failover situation. However, Oracle 11g Data Guard Broker also offers the ability to adjust just how much lost redo data is permitted before declaring a failover and what it should do with the primary database after the failover operation. Here’s a complete list of the configurable thresholds, triggering events, and post-failover behaviors in a Fast-Start Failover situation:
Table 7-2. Fast-Start Failover: Configurable Reactions
| ||
FSFO Parameter | Default Value | Triggering Event |
FastStartFailoverTarget | NULL | Indicates which database the FSFO will choose as its failover target should a Fast-Start Failover occur |
FastStartFailoverThreshold | 90 | Determines how long the FSFO will wait until it triggers a Fast-Start Failover. Note that both the target physical standby and the FSFO must have lost network connectivity to the primary database for this time limit (in seconds) before FSF is triggered |
FastStartFailoverLagLimit | 30 | When the primary database is operating inMaximum Performance mode, this sets a limit for the amount of lost data in seconds before the FSFO should trigger a Fast-Start Failover |
FastStartFailoverPmyShutdown | TRUE | Determines whether the FSFO willautomatically shut down the primary database after a Fast-Start Failover occurs |
FastStartFailoverAutoReinstate | TRUE | Determines whether the FSFO will automatically attempt to reinstantiate the designated primary database as the physical standby as soon as the FSFO detects the primary site is once again available |
Advanced Failover Options. In addition to these conditional triggering events and controls, FSFO also offers exquisitely detailed control over which failures of the primary database’s infrastructure would trigger a Fast-Start Failover:
Table 7-3. Fast-Start Failover: Advanced Failover Triggers
| ||
Trigger | Active By Default? | Explanation |
Datafile Offline | Yes | Tells FSFO to perform a failover whenever a datafile is offline on the primary database |
Corrupted Controlfile | Yes | Tells FSFO to perform a failover whenever a corrupted control file is detected on the primary database |
Corrupted Dictionary | Yes | Requests FSFO to perform a failover whenevercorruption is detected within the primary database’s data dictionary |
Inaccessible Logfile | No | Forces FSFO to perform a failover whenever any one online redo log member cannot be accessed on the primary database |
Stuck Archiver | No | Tells FSFO to perform a failover whenever archived redo logging cannot proceed on the primary database because there is no more space to write out archived redo logs on that server |
These two sets of Fast-Start Failover conditions offer me extremely fine-grained control over when FSFO should trigger a failover automatically. And even if these controls are insufficient for my situation, there’s one other option: I can configure FSFO to trigger a Fast-Start Failover upon receipt of a specific Oracle 11g error message code. For example, if I wanted to force a failover whenever a critical lack of space in a Flashback Data Archive occurred, I could configure FSFO to initiate automatic failover whenever the ORA-55623 error is detected on the primary database with the following command:
DGMGRL> ENABLE FAST_START FAILOVER CONDITION 55623;
Activating the Fast-Start Failover Observer
Now that the configuration of FSFO is complete, all I need to do is enable the configuration via DGMGRL as shown below. Note that I’m also enabling logging of Data Guard Broker activity for the command-line utility so that I can track any unexpected issues related to the FSFO’s performance or configuration:
[oracle@11gStdby ~]$ dgmgrl -logfile 11gStdby1_observer.log DGMGRL for Linux: Version 11.1.0.6.0 - Production Copyright (c) 2000, 2005, Oracle. All rights reserved. Welcome to DGMGRL, type "help" for information. DGMGRL> connect sys/oracle Connected. DGMGRL> ENABLE FAST_START FAILOVER; Enabled.
Finally, it’s time to start up FSFO. Once again, I’ll use DGMGRL to start the Fast-Start Failover Observer process:
DGMGRL> START OBSERVER;
Once the FSFO is started, I can confirm that it’s been activated properly with the SHOW CONFIGURATION and SHOW DATABASE commands:
DGMGRL> show configuration verbose Configuration Name: MAA_orcl Enabled: YES Protection Mode: MaxPerformance Databases: orcl_primary - Primary database orcl_stdby1 - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Threshold: 180 seconds Target: orcl_stdby1 Observer: 11gStdby Lag Limit: 30 seconds Shutdown Primary: TRUE Auto-reinstate: TRUE Current status for "MAA_orcl": Warning: ORA-16608: one or more databases have warnings DGMGRL> show database orcl_primary Database Name: orcl_primary Role: PRIMARY Enabled: YES Intended State: TRANSPORT-ON Instance(s): orcl_primary Current status for "orcl_primary": SUCCESS DGMGRL> show database orcl_stdby1 Database Name: orcl_stdby1 Role: PHYSICAL STANDBY Enabled: YES Intended State: APPLY-ON Instance(s): orcl_stdby1 Current status for "orcl_stdby1": SUCCESS DGMGRL> show fast_start failover Fast-Start Failover: ENABLED Threshold: 180 seconds Target: orcl_stdby1 Observer: 11gStdby Lag Limit: 30 seconds Shutdown Primary: TRUE Auto-reinstate: TRUE Configurable Failover Conditions Health Conditions: Corrupted Controlfile YES Corrupted Dictionary YES Inaccessible Logfile NO Stuck Archiver NO Datafile Offline YES Oracle Error Conditions: (none)
Automatic Detection of Failover Conditions: An Example
Now that FSFO is fully configured and is ready to detect a failover situation, I’ll use the same technique I used in the prior article about Data Guard failover to simulate a failure of the primary database: I’ll simply issue the kill -9 <pid> command against its Server Monitor (SMON) background process. Once again, the death of the primary database is almost immediately recorded in its alert log:
. . . Tue Aug 25 18:54:10 2009 Errors in file /u01/app/oracle/diag/rdbms/orcl_primary/orcl_primary/trace/orcl_primary_pmon_6166.trc: ORA-00474: SMON process terminated with error PMON (ospid: 6166): terminating the instance due to error 474 Instance terminated by PMON, pid = 6166 . . .
Just as before, the loss of connectivity to the primary database is reflected within the alert log of the corresponding physical standby databases by its Remote File Server (RFS) background process:
. . . Tue Aug 25 18:54:49 2009 RFS[2]: Possible network disconnect with primary database Tue Aug 25 18:54:49 2009 RFS[1]: Possible network disconnect with primary database Tue Aug 25 18:55:49 2009 . . .
This time, however, there’s a dramatic difference! After approximately three minutes have elapsed, there’s a sudden flurry of activity at the physical standby site as the FSFO automatically detects the failure of the primary database. In Listing 7.1, I’ve captured the alert logs of both databases as well as the Data Guard Broker log entries to show all of the actions that Oracle 11g initiates during a Fast-Start Failover. After the automatic failover is complete, the Data Guard configuration fully reflects the successful actions of the FSFO:
DGMGRL> show configuration verbose Configuration Name: MAA_orcl Enabled: YES Protection Mode: MaxPerformance Databases: orcl_stdby1 - Primary database orcl_primary - Physical standby database (disabled) - Fast-Start Failover target Fast-Start Failover: ENABLED Threshold: 180 seconds Target: orcl_primary Observer: 11gStdby Lag Limit: 30 seconds Shutdown Primary: TRUE Auto-reinstate: TRUE Current status for "MAA_orcl": Warning: ORA-16608: one or more databases have warnings DGMGRL> show database verbose orcl_stdby1 Database Name: orcl_stdby1 OEM Name: orcl_11gStdby1 Role: PRIMARY Enabled: YES Intended State: TRANSPORT-ON Instance(s): orcl_stdby1 Properties: DGConnectIdentifier = 'orcl_stdby1' ObserverConnectIdentifier = '' LogXptMode = 'ASYNC' DelayMins = '0' Binding = 'OPTIONAL' MaxFailure = '0' MaxConnections = '1' ReopenSecs = '300' NetTimeout = '30' RedoCompression = 'DISABLE' LogShipping = 'ON' PreferredApplyInstance = '' ApplyInstanceTimeout = '0' ApplyParallel = 'AUTO' StandbyFileManagement = 'AUTO' ArchiveLagTarget = '0' LogArchiveMaxProcesses = '4' LogArchiveMinSucceedDest = '1' DbFileNameConvert = '' LogFileNameConvert = '/u01/app/oracle/oradata/orcl/, /u01/app/oracle/oradata/stdby/' FastStartFailoverTarget = 'orcl_primary' StatusReport = '(monitor)' InconsistentProperties = '(monitor)' InconsistentLogXptProps = '(monitor)' SendQEntries = '(monitor)' LogXptStatus = '(monitor)' RecvQEntries = '(monitor)' HostName = '11gStdby' SidName = 'orcl_stdby1' StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/STDBY/' AlternateLocation = '' LogArchiveTrace = '0' LogArchiveFormat = 'log_%s_%t_%r.arc' LatestLog = '(monitor)' TopWaitEvents = '(monitor)' Current status for "orcl_stdby1": Warning: ORA-16829: fast-start failover configuration is lagging DGMGRL> show database verbose orcl_primary Database Name: orcl_primary OEM Name: orcl_11gPrimary Role: PHYSICAL STANDBY Enabled: NO Intended State: APPLY-ON Instance(s): orcl_primary Properties: DGConnectIdentifier = 'orcl_primary' ObserverConnectIdentifier = '' LogXptMode = 'ASYNC' DelayMins = '0' Binding = 'OPTIONAL' MaxFailure = '0' MaxConnections = '1' ReopenSecs = '300' NetTimeout = '30' RedoCompression = 'DISABLE' LogShipping = 'ON' PreferredApplyInstance = '' ApplyInstanceTimeout = '0' ApplyParallel = 'AUTO' StandbyFileManagement = 'AUTO' ArchiveLagTarget = '0' LogArchiveMaxProcesses = '4' LogArchiveMinSucceedDest = '1' DbFileNameConvert = '' LogFileNameConvert = '' FastStartFailoverTarget = 'orcl_stdby1' StatusReport = '(monitor)' InconsistentProperties = '(monitor)' InconsistentLogXptProps = '(monitor)' SendQEntries = '(monitor)' LogXptStatus = '(monitor)' RecvQEntries = '(monitor)' HostName = '11gPrimary' SidName = 'orcl_primary' StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/ORCL/' AlternateLocation = '' LogArchiveTrace = '0' LogArchiveFormat = 'log_%s_%t_%r.arc' LatestLog = '(monitor)' TopWaitEvents = '(monitor)' Current status for "orcl_primary": Error: ORA-16661: the standby database needs to be reinstated
Reinstating the Original Primary Database
My previous example of initiating Fast-Start Failover brings to light an interesting situation: What if the primary database was actually completely healthy at the time that FSFO acknowledged the conditions for Fast-Start Failover? Here’s where the brilliance of enabling Flashback Logging on both the primary and physical standby databases really shines through: With a single command, it’s a simple matter to reinstate the original primary database as a physical standby database.
To illustrate, I’ll issue the REINSTATE DATABASE command from a DGMGRL session connected to the new primary database, ORCL_STDBY1, and I’ll designate the original primary database,ORCL_PRIMARY, as the target of the reinstatement:
DGMGRL> reinstate database orcl_primary
Once again, there’s a flurry of activity on the original primary database as Data Guard Broker successfully attempts the reinstatement. I’ve captured the pertinent alert log entries from theORCL_PRIMARY database in Listing 7.2, and DGMGRL reflects the appropriate Data Guard configuration once the reinstatement has completed:
DGMGRL> show configuration verbose Configuration Name: MAA_orcl Enabled: YES Protection Mode: MaxPerformance Databases: orcl_stdby1 - Primary database orcl_primary - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Threshold: 180 seconds Target: orcl_primary Observer: 11gStdby Lag Limit: 30 seconds Shutdown Primary: TRUE Auto-reinstate: TRUE Current status for "MAA_orcl": SUCCESS
Switching Back. Since the original primary database is now successfully restored as part of the Data Guard environment, I’ll request the original primary and physical standby databases to switch roles with the SWITCHOVER command:
DGMGRL> switchover to orcl_primary; Performing switchover NOW, please wait... New primary database "orcl_primary" is opening... Operation requires shutdown of instance "orcl_stdby1" on database "orcl_stdby1" Shutting down instance "orcl_stdby1"... ORA-01109: database not open Database dismounted. ORACLE instance shut down. Operation requires startup of instance "orcl_stdby1" on database "orcl_stdby1" Starting instance "orcl_stdby1"... ORACLE instance started. Database mounted. Switchover succeeded, new primary is "orcl_primary" DGMGRL> show configuration verbose; Configuration Name: MAA_orcl Enabled: YES Protection Mode: MaxPerformance Databases: orcl_primary - Primary database orcl_stdby1 - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Threshold: 180 seconds Target: orcl_stdby1 Observer: 11gStdby Lag Limit: 30 seconds Shutdown Primary: TRUE Auto-reinstate: TRUE Current status for "MAA_orcl": SUCCESS
Deactivating Fast-Start Failover
To deactivate Fast-Start Failover, all I need to do is issue the DISABLE FAST_START FAILOVER command from within a DGMGRL session:
DGMGRL> DISABLE FAST_START FAILOVER; Disabled.
Note that this only disables the possibility of future Fast-Start Failovers until I re-enable the Fast-Start Failover configuration with the ENABLE FAST_START FAILOVER command; all of the Fast-Start Failover configuration details I’ve so carefully constructed are still intact.
Next Steps
The next article in this series will explore how to construct and maintain a Logical Standby database in Oracle 11g, focusing on their usefulness in data warehouse and data mart environments.
References and Additional Reading
While I’m hopeful that I’ve given you a thorough grounding in the technical aspects of the features I’ve discussed in this article, I’m also sure that there may be better documentation available since it’s been published. I therefore strongly suggest that you take a close look at the corresponding Oracle documentation on these features to obtain crystal-clear understanding before attempting to implement them in a production environment. Please note that I’ve drawn upon the following Oracle Database 11g documentation for the deeper technical details of this article:
No comments:
Post a Comment