Thursday, 2 July 2015

How to Estimate the duration of a TAKEOVER HADR operation in DB2

- David Sciaraffa, IBM IM/DB2 Accelerated Value Leader (Premium Support Manager).

A TAKEOVER HADR operation, which instructs a Standby database to take over as the new Primary database, is normally a quick operation. However,
a TAKEOVER HADR operation (when not executed with the emergency WITH FORCE option) must first rollback any uncommitted transactions on the Primary database before the Standby database can take over. This can elongate the HADR TAKEOVER operation and it may appear to hang. This blog entry describes how to identify any large or old uncommitted transactions that would require a lengthy rollback.


When large or old uncommitted transactions exist, the rollback of these transactions can take a long time, causing the TAKEOVER HADR operation to appear to hang.
For large uncommitted transactions which consume a lot of LogSpace, the rollback must replay (undo) many log-records associated with the transaction. 
For old uncommitted transactions which do not consume a lot of LogSpace, the rollback must still read and parse through many log-records to eventually find those associated with the transaction to replay (undo).

Determine the Size and Span of Uncommitted Transactions

To determine if there are any uncommitted transactions that will require a lengthy rollback, you can query db2pd-transactions to list transactions with the largest log span (in bytes).  This represents the approximate number of bytes of log file data that will have to be parsed through in order to process the log-records required to rollback the transaction.

$ db2pd -db [dbname] -transactions > trans.out; 
$ cat trans.out | awk '{diff = $10-$9; printf "%.0f=log_span, firstLSN=%s, lastLSN=%s, AppHndl=%s \n",diff,$9,$10,$2 }' | sort -n -r | less

8648404033=log_span, firstLSN=0x000055D26CD7EAB6, lastLSN=0x000055D4705416F7, AppHndl=25943
1512356109=log_span, firstLSN=0x000055D4162F5C21, lastLSN=0x000055D47054152E, AppHndl=29146
488762756=log_span, firstLSN=0x000055D4533228EB, lastLSN=0x000055D47054166F, AppHndl=25629
55009676=log_span, firstLSN=0x000055D46D0CB109, lastLSN=0x000055D470541295, AppHndl=29424
53676373=log_span, firstLSN=0x000055D46D20AC09, lastLSN=0x000055D47053B55E, AppHndl=29436
51957579=log_span, firstLSN=0x000055D46D3B3180, lastLSN=0x000055D4705400CB, AppHndl=29465
2422301=log_span, firstLSN=0x000055D4702F0366, lastLSN=0x000055D47053F983, AppHndl=29442
152148=log_span, firstLSN=0x000055D47050A4D9, lastLSN=0x000055D47052F72D, AppHndl=30172
136276=log_span, firstLSN=0x000055D47050F536, lastLSN=0x000055D47053098A, AppHndl=30485
6765=log_span, firstLSN=0x000055D47053CC52, lastLSN=0x000055D47053E6BF, AppHndl=29666
6423=log_span, firstLSN=0x000055D4705305A2, lastLSN=0x000055D470531EB9, AppHndl=30774
2561=log_span, firstLSN=0x000055D47052C938, lastLSN=0x000055D47052D339, AppHndl=25326
1779=log_span, firstLSN=0x000055D47053CFC0, lastLSN=0x000055D47053D6B3, AppHndl=25378
1275=log_span, firstLSN=0x000055D470537F7E, lastLSN=0x000055D470538479, AppHndl=25367
526=log_span, firstLSN=0x000055D4705331A7, lastLSN=0x000055D4705333B5, AppHndl=29608
289=log_span, firstLSN=0x000055D470526016, lastLSN=0x000055D470526137, AppHndl=22962
…etc…

If we see one or more transactions with a log span of 500MB or more,  we would generally expect it to take some *noticeable* amount of time for the transaction to rollback.  But depending on the performance of your system, this threshold could be lower.

To find out more about these particular transactions, you can find their AppHndl's in "db2pd –db [dbname] -apinfo" output.

 

Estimate the Progress of a Rollback while TAKEOVER HADR is running

Once the TAKEOVER operation is issued, you can see which applications are still rolling back their uncommitted transections by using db2pd-applications:

$ db2pd –db [dbname] –applications | grep "RollbackActive"

Applications:
Address            AppHandl [nod-index] NumAgents  CoorEDUID  Status            …  Appid   …
0x07800000012F0080 9        [000-00009] 1          14394      RollbackActive    …  *LOCAL.dsciaraf.150703020130

 

As I mentioned above, you can determine the original LogSpace and log span values for these application’s transactions using the db2pd-transactions and awk command that I provided at top. 

$ db2pd –db [dbname] –transactions > trans.out
$ cat trans.out | awk '{diff = $10-$9; printf "%.0f=log_span, firstLSN=%s, lastLSN=%s, AppHndl=%s \n",diff,$9,$10,$2 }' | sort -n -r | less

30000269=log_span, firstLSN=0x00000192863F6B14, lastLSN=0x00000222863F883C, AppHndl=9
0=log_span, firstLSN=0x0000000000000000, lastLSN=0x0000000000000000, AppHndl=11
0=log_span, firstLSN=0x0000000000000000, lastLSN=0x0000000000000000, AppHndl=12
0=log_span, firstLSN=0x0000000000000000, lastLSN=0x0000000000000000, AppHndl=13
0=log_span, firstLSN=0x0000000000000000, lastLSN=0x0000000000000000, AppHndl=14

But how about monitoring the actual progress of the Rollback operation?
Unfortunately, since the Takeover operation has forced off all existing database connections and blocked new ones, we cannot use the normal methods for determining the progress of the rollback operations (such as monitoring the ‘Total Work’ and ‘Committed Work’ values of an application snapshot).

However, we can monitor the SpaceReserved value in db2pd-transactions. 
(The SpaceReserved value represents the amount of log file space that was reserved for this transaction in case a rollback operation was needed. For every log-record that is replayed (undone) while a transaction is being rolled back, a special log-record is added directly into this reserved log file space. If the rollback were to fail and be restarted again, these special log-records allow DB2 to avoid processing log-records which have already been processed.)

As the rollback operation proceeds, the SpaceReserved value diminishes.  This is easily observable for large transactions where many log-records are being replayed (undone). It is not easily observable for old transactions which are not large, since many log-records are parsed but seldom are replayed (undone) as I described earlier.
Here in my example, I simply watch the SpaceReserved value go down.  You can extrapolate an estimated time to completion by watching the rate at which the value diminishes (however, since even large transactions can insert log-records into the log file stream in periodic clumps, you may see pauses in the rate at which this value diminishes while you are observing it).

$ db2pd –db [dbname] –transactions | grep ABORT

Address            AppHandl  TranHdl  … State  … Firstlsn          Lastlsn            SpaceReserved   LogSpace …
0x070000004006CF00 7         2        … ABORT   0x00000192863F6B14 0x00000222863F883C 1151342         2312191  …

$ db2pd –db [dbname] –transactions | grep ABORT

Address            AppHandl  TranHdl  … State  … Firstlsn          Lastlsn            SpaceReserved   LogSpace …
0x070000004006CF00 7         2          ABORT   0x00000192863F6B14 0x00000222863F883C 473890          2312191  …

$ db2pd –db [dbname] –transactions | grep ABORT

Address            AppHandl  TranHdl  … State  … Firstlsn          Lastlsn            SpaceReserved   LogSpace …
0x070000004006CF00 7         2          ABORT   0x00000192863F6B14 0x00000222863F883C 304246          2312191  …


I hope this information is useful. If you have any thoughts or questions, feel free to submit a comment below. 

Translate