Thursday, 3 August 2017

Access your database sooner after a system failure -- introducing DB2_ONLINERECOVERY!

- David Sciaraffa, Software Development Manager – IBM Db2 Availability


System failures happen. Modern enterprise IT stacks are complex feats of software and hardware engineering, encompassing billions of physical transistors and billions of lines of logical code comprising an almost infinite number computational permutations. These stacks run in data centers with physical attributes such as electrical utility and temperature conditions. Lest we forget the most vulnerable component of the IT stack - human error. Thus, system failures happen.



An enterprise database must be extremely resilient to system failures. Data loss or corruption are impermissible. Db2 has a distinguished history of resilience to system failures. It is truly best of breed. I'm proud to manage a team of talented developers focused on database recovery and availability.

Db2 recovery from a system failure begins when the database is restarted, and utilizes a process called 'crash recovery'. During crash recovery, any database changes which were previously committed by the application but not yet persisted to disk, are re-applied (we refer to this as the Forward-phase of crash recovery). Next, any database changes which were persisted to disk but not yet committed by the application, are undone (we refer to this as the Backward-phase of crash recovery). The Db2 transaction logs are processed to facilitate all this work.

Crash recovery processing is not isolated to system failures. In a database configured for disaster recovery via HADR, a TAKEOVER HADR operation on the Standby database must also perform crash recovery in order to undo any changes which were persisted to disk but not yet committed to the application.

While a database is performing crash recovery, it is not connectable by any application.

While the process of crash recovery provides database atomicity and durability in the event of a system failure, the process can take time to complete. For well tuned  applications crash recovery usually completes in a reasonable amount of time (ranging from near instantaneous to minutes) because the quantity of transaction log data to be processed is small. For poorly tuned applications, the quantity of log data to be processed can be large, and thus crash recovery can take a significant amount of time.


Well tuned applications and databases have some common themes:

- After changing data (like inserting, updating, or deleting) these applications perform a transaction commit operation expediently. By not leaving transactions open for a long time, this has the effect of reducing the total quantity of log data that needs to be processed if a crash recovery is required.

- Batch data changes are constrained to small manageable sizes. For example, if 1 million records need to be inserted into the database during a nightly batch, the application will perform this in smaller units of work of say 1 thousand records and then perform a transaction commit operation after each unit of work. By not generating single huge units of work, we reduce the quantity of transaction log data that needs to be processed if a crash recovery is required.

- The database is tuned to flush data changes to disk frequently. This is accomplished by configuring the PAGE_AGE_TRGT_MCR configuration parameter to a reasonable size in minutes. Since flushing data changes too frequent may overwhelm the system of i/o for heaving workloads, tuning this parameter will often require monitoring and adjustment. The DB2_USE_ALTERNATE_PAGE_CLEANING registry variable is a popular method of increasing the frequency of flushing data changes.

- Strict limitations are enforced on the size of transactions using the NUM_LOG_SPAN and MAX_LOG configuration parameters.

- The database is monitored proactively to identify any poorly tuned applications. The log_to_redo_for_recovery value of the MON_GET_TRANSACTION_LOG() table function can help to estimate the quantity of transaction log data required to process during crash recovery. The log span of uncommitted transaction can be monitoring using the db2pd-transactions, as described in the 'Determine the Size and Span of Uncommitted Transactions' section of this technote: http://thinkingdb2.blogspot.ca/2015/07/how-to-estimate-duration-of-takeover.html


While system failures seldom happen, most customers expect crash recovery to complete quickly. Since a database administrator can never fully prevent poorly tuned applications, a common customer request is for quicker accessibility to the database after a system failure, during crash recovery.

Our development team decided upon a solution which promised increase availability, without risking the distinguished resilience that Db2 has achieved over the decades.

The DB2_ONLINERECOVERY feature was rolled out in Db2 Version 11.1.2.2.

The feature allows for the database to become connectable once the forward phase of crash recovery has completed, while the backward phase of crash recovery is progressing. Tables which contain data changes that need to be undone during the backward phase of crash recovery will not be accessible to queries until the last of their data changes is undone.  All other tables are fully accessible to queries throughout the backward phase of crash recovery.

For more detailed information about this cool new feature, please see:

Database accessibility during backward phase of crash recovery or HADR takeover
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.ha.doc/doc/c0070281.html



I hope this information is useful. If you have any thoughts or questions, feel free to submit a comment below.

No comments:

Post a Comment

Translate