Saturday 28 April 2018

Monitoring the replay-only-window on a Db2 HADR standby database

- David Sciaraffa, Software Development Manager – IBM Db2


In Db2 systems configured for high availability disaster recovery (HADR), all data changes which are made on the primary database are also transmitted and replayed on one or more secondary ‘standby’ databases.  If a system failure occurs on the primary database server, applications can be rerouted to the standby database almost instantaneously (and usually done so through automation).
Db2 HADR is simple to configure, and mostly invisible to the applications accessing the database. There are many excellent resources online for getting started with HADR:
Step by Step Procedure to set up HADR replication between DB2 databases
IBM Developerworks Db2 HADR Wiki (Tutorial, Getting-Started, Performance Monitoring and Tuning, more…)


Performing SQL queries on a standby database (“Reads-on-Standby”) --
It is possible to allow applications to connect to the Db2 standby database and perform SQL read-only queries.  We refer to standby databases which are enabled for this as a “Reads-on-Standby” database.
Reads-on-Standby databases are simple to enable. There are numerous resources available to help get started:
Get the most from the DB2 HADR standby database: Using the reads on standby capability
HADR reads on standby feature
Reads on standby restrictions


Limitation of HADR Reads-on-Standby databases – the “replay-only-window”

A current limitation of Reads-on-Standby databases is the “replay-only-window”. When an application on the primary database performs a ‘DDL’ operation (such as creating or altering a regular or temporary table, view, or other database object) all application connections on the standby database will be momentary forced off (any currently executing queries will be interrupted). After the unit-of-work containing the DDL operation is fully replayed on the Standby database, application connections and queries on the standby database are permitted again.  The list of DDL and utility operations which trigger the replay-only-window are documented in Replay-only window on the active standby database.


What does a replay-only-window look like?

First, let’s establish a connection on the standby database, and start running some basic read-only queries:

[standby db server]

$ db2pd -db hadr105 -hadr | grep HADR_ROLE
                             HADR_ROLE = STANDBY

$ db2 connect to hadr105

   Database Connection Information

   Database server        = DB2/LINUXX8664 10.5.9
   SQL authorization ID   = DB2INST1
   Local database alias   = HADR105

$ while [[ 1 ]]
   do
   db2 -v "select FIRSTNME from EMPLOYEE with UR";
   done

Let’s invoke a simple DDL statement on the primary database, to trigger a replay-only-window on the standby database:

[primary db server]

$ db2pd -db hadr105 -hadr | grep HADR_ROLE
                            HADR_ROLE = PRIMARY

$ db2 create table table1 '(c1 int)'
DB20000I  The SQL command completed successfully.

When the replay-only-window is triggered on the standby database, all running queries on the standby database will be interrupted and their associated database connections will be forced off resulting in an SQL1224N error:

[standby db server]

$ db2 "select FIRSTNME from EMPLOYEE"
SQL1224N  The database manager is not able to accept new requests, has
terminated all requests in progress, or has terminated the specified request
because of an error or a forced interrupt.  SQLSTATE=55032

Also, if an attempt is made to establish a connection to the standby database while a replay-only-window is active, an SQL1776N reason 4 is returned:

[standby db server]

$ db2 connect to hadr1113
SQL1776N  The command cannot be issued on an HADR database. Reason code = "4".


How can I monitor the replay-only-window?

In all Db2 versions:
The STANDBY_REPLAY_ONLY_WINDOW_ACTIVE value of ‘db2pd –hadr’ or the MON_GET_HADR() table function will have a value of ‘Y’ on the standby database for the duration of the replay-only-window:

[standby db server]

$ db2pd –db hadr105 –hadr

                            HADR_ROLE = STANDBY
                           REPLAY_TYPE = PHYSICAL
                         HADR_SYNCMODE = NEARSYNC
                            STANDBY_ID = 0
                         LOG_STREAM_ID = 0
                            HADR_STATE = PEER
                            HADR_FLAGS = TCP_PROTOCOL
                   PRIMARY_MEMBER_HOST = hotellnx108
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                   STANDBY_MEMBER_HOST = hotellnx111
                      STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                   HADR_CONNECT_STATUS = CONNECTED
              HADR_CONNECT_STATUS_TIME = 04/28/2018 18:55:38.089002 (1524956138)
           HEARTBEAT_INTERVAL(seconds) = 30
                      HEARTBEAT_MISSED = 0
                    HEARTBEAT_EXPECTED = 283
                 HADR_TIMEOUT(seconds) = 120
         TIME_SINCE_LAST_RECV(seconds) = 0
              PEER_WAIT_LIMIT(seconds) = 0
            LOG_HADR_WAIT_CUR(seconds) = 0.000
     LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000308
    LOG_HADR_WAIT_ACCUMULATED(seconds) = 1.909
                   LOG_HADR_WAIT_COUNT = 986
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 16384
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 87380
             PRIMARY_LOG_FILE,PAGE,POS = S0000002.LOG, 725, 55945845
             STANDBY_LOG_FILE,PAGE,POS = S0000002.LOG, 725, 55945845
                   HADR_LOG_GAP(bytes) = 0
      STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000002.LOG, 725, 55945789
        STANDBY_RECV_REPLAY_GAP(bytes) = 2
                      PRIMARY_LOG_TIME = 04/28/2018 21:17:06.000000 (1524964626)
                      STANDBY_LOG_TIME = 04/28/2018 21:17:06.000000 (1524964626)
               STANDBY_REPLAY_LOG_TIME = 04/28/2018 21:17:06.000000 (1524964626)
          STANDBY_RECV_BUF_SIZE(pages) = 512
              STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 13000
                 STANDBY_SPOOL_PERCENT = 0
                    STANDBY_ERROR_TIME = NULL
                  PEER_WINDOW(seconds) = 0
              READS_ON_STANDBY_ENABLED = Y
     STANDBY_REPLAY_ONLY_WINDOW_ACTIVE = Y
      STANDBY_REPLAY_ONLY_WINDOW_START = 04/28/2018 21:17:32.000000 (1524964652)
STANDBY_REPLAY_ONLY_WINDOW_TRAN_COUNT = 1


In Db2 version 11.1.2.2 and previous releases:
The db2 diagnostic log file (db2diag.log) on the standby database will contain the following messages:

[standby db server]

$ less /home/db2inst1/sqllib/db2dump/db2diag.log

2018-04-28-18.36.18.723755-240 E41445875E502         LEVEL: Warning
PID     : 483                  TID : 140327683876608 PROC : db2sysc
INSTANCE: dsciaraf             NODE : 000            DB   : HADR105
APPHDL  : 0-12                 APPID: *LOCAL.DB2.180428222707
HOSTNAME: hotellnx111
EDUID   : 685                  EDUNAME: db2redom (HADR105)
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::HdrForceAppsInReplayOnlyWindow
, probe:100
DATA #1 : String, 28 bytes
Replay only window is active

2018-04-28-18.36.18.724354-240 I41446378E490         LEVEL: Info
PID     : 483                  TID : 140327683876608 PROC : db2sysc
INSTANCE: db2inst1             NODE : 000            DB   : HADR105
APPHDL  : 0-12                 APPID: *LOCAL.DB2.180428222707
HOSTNAME: hotellnx111
EDUID   : 685                  EDUNAME: db2redom (HADR105)
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::HdrKickPooledAgents, probe:10
DATA #1 : <preformatted>
Interrupting agent with eduid:688

2018-04-28-18.36.18.724607-240 I41446869E500         LEVEL: Info
PID     : 483                  TID : 140327683876608 PROC : db2sysc
INSTANCE: db2inst1             NODE : 000            DB   : HADR105
APPHDL  : 0-12                 APPID: *LOCAL.DB2.180428222707
HOSTNAME: hotellnx111
EDUID   : 685                  EDUNAME: db2redom (HADR105)
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::HdrKickPooledAgents, probe:20
DATA #1 : <preformatted>
Waiting for last disassociation from the db

When the replay-only-windows is completed, and applications allowed to reconnect to the standby database, the following db2diag.log message is displayed:

2018-04-28-18.36.18.973225-240 E41447370E539         LEVEL: Warning
PID     : 483                  TID : 140327683876608 PROC : db2sysc
INSTANCE: db2inst1             NODE : 000            DB   : HADR105
APPHDL  : 0-12                 APPID: *LOCAL.DB2.180428222707
HOSTNAME: hotellnx111
EDUID   : 685                  EDUNAME: db2redom (HADR105)
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::HdrEndReplayOnlyWindow, probe:
210
DATA #1 : String, 73 bytes
Replay only window is inactive, connections to Active Standby are allowed

In addition to these db2diag.log entries, it is possible to parse db2 recovery log files to determine which log-records (if any) will cause a replay-only-window, by using the db2fmtlog_replayonlywindow tool, which can be downloaded from the Db2 HADR wiki here: (link). Since db2 recovery log files are mostly identical on both primary and standby databases, this tool can be run on log files resided on either database (or the log archive path):

[primary or standby db server]

$ db2 get db config for hadr105 | grep "Path to log files"
  Path to log files = /home/db2inst1/db2inst1/NODE0000/SQL00001/LOGSTREAM0000/

$ cd /home/db2inst1/db2inst1/NODE0000/SQL00001/LOGSTREAM0000/

$ db2fmtlog_replayonlywindow_linux 0-2

  |------|------------------------------------------------------------------------
   | LREC |  5171  000408D1  00000000029D
   |------|------------------------------------------------------------------------
   | LREC |                     Record LSO = 45190368
   |      |                     Record TID = 00000000029D
   |      |                         Action = DDL
   |------|------------------------------------------------------------------------
   | LREC |  5180  00040921  00000000037E
   |------|------------------------------------------------------------------------
   | LREC |                     Record LSO = 45197733
   |      |                     Record TID = 00000000037E
   |      |                         Action = DDL

Also, when the LOG_DDL_STMTS database configuration parameter is set to yes on the Primary database (at the time the DDL operation was performed and it’s log records were generated within the recovery log file), then the DDL statement itself will also be displayed:

  |------|------------------------------------------------------------------------
  | LREC |  5171  000408D1  00000000029D
  |------|------------------------------------------------------------------------
  | LREC |                     Record LSO = 45190368
  |      |                     Record TID = 00000000029D
  |      |                         Action = DDL
  |------|------------------------------------------------------------------------
  | LREC |  5171  000408D2  00000000029D
  |------|------------------------------------------------------------------------
  | LREC |                     Record LSO = 45190796
  |      |                     Record TID = 00000000029D
  |      |                  DDL Statement = create table t3 (c1 int)
  |------|------------------------------------------------------------------------
  | LREC |  5180  00040921  00000000037E
  |------|------------------------------------------------------------------------
  | LREC |                     Record LSO = 45197733
  |      |                     Record TID = 00000000037E
  |      |                         Action = DDL
  |------|------------------------------------------------------------------------
  | LREC |  5180  00040922  00000000037E
  |------|------------------------------------------------------------------------
  | LREC |                     Record LSO = 45198161
  |      |                     Record TID = 00000000037E
  |      |                 DDL Statement = create table t4 (c1 int)


In Db2 version 11.1.3.3 and newer releases:
The db2 diagnostic log file (db2diag.log) on the standby database has been enhanced to provide additional details when a replay-only-window occurs.  The frequency of these diagnostic messages can be configured using the DB2_HADR_REPLAY_ONLY_WINDOW_DIAGLEVEL registry variable.  A value of 0 meaning no additional diagnostic messages, a value of 1 meaning only the first log-record/DDL statement within the unit-of-work will be displayed, and a value of 2 means all log-records/DDL statements within the unit-of-work will be displayed.

[standby db server]

$ db2pd -db hadr1113 | grep HADR_ROLE
                             HADR_ROLE = STANDBY

$ less /home/db2inst1/sqllib/db2dump/db2diag.log

2018-04-28-19.31.28.182828-240 I41576483E762         LEVEL: Info
PID     : 8721                 TID : 140697919284992 PROC : db2sysc
INSTANCE: db2inst1             NODE : 000            DB   : HADR1113
APPHDL  : 0-11                 APPID: *LOCAL.DB2.180428225527
HOSTNAME: hotellnx111
EDUID   : 397                  EDUNAME: db2redom (HADR1113)
FUNCTION: DB2 UDB, recovery manager, SQLP_REPLAY_ONLY_WINDOW_STAT::sqlpStartHadrReplayOnlyWindow,
probe:9140
MESSAGE : Replay only window is triggered by this log record: LogStreamId / TID
           / LSO / action
DATA #1 : db2LogStreamIDType, PD_TYPE_DB2_LOG_STREAM_ID, 2 bytes
0
DATA #2 : SQLP_TID, PD_TYPE_SQLP_TID, 6 bytes
00000000029D
DATA #3 : unsigned integer, 8 bytes
45190368
DATA #4 : String, 3 bytes
DDL

2018-04-28-19.31.28.183183-240 E41577246E504         LEVEL: Warning
PID     : 8721                 TID : 140697919284992 PROC : db2sysc
INSTANCE: dsciaraf             NODE : 000            DB   : HADR1113
APPHDL  : 0-11                 APPID: *LOCAL.DB2.180428225527
HOSTNAME: hotellnx111
EDUID   : 397                  EDUNAME: db2redom (HADR1113)
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::HdrForceAppsInReplayOnlyWindow, probe:100
DATA #1 : String, 28 bytes
Replay only window is active

2018-04-28-19.31.28.183344-240 I41577751E502         LEVEL: Info
PID     : 8721                 TID : 140697919284992 PROC : db2sysc
INSTANCE: dsciaraf             NODE : 000            DB   : HADR1113
APPHDL  : 0-11                 APPID: *LOCAL.DB2.180428225527
HOSTNAME: hotellnx111
EDUID   : 397                  EDUNAME: db2redom (HADR1113)
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::HdrKickPooledAgents, probe:20
DATA #1 : <preformatted>
Waiting for last disassociation from the db

And same as previous releases, when the replay-only-windows is completed, and applications allowed to reconnect to the standby database, the following db2diag.log message is displayed:

2018-04-28-19.31.28.375434-240 E41579289E541         LEVEL: Warning
PID     : 8721                 TID : 140697919284992 PROC : db2sysc
INSTANCE: dsciaraf             NODE : 000            DB   : HADR1113
APPHDL  : 0-11                 APPID: *LOCAL.DB2.180428225527
HOSTNAME: hotellnx111
EDUID   : 397                  EDUNAME: db2redom (HADR1113)
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::HdrEndReplayOnlyWindow, probe:210
DATA #1 : String, 73 bytes
Replay only window is inactive, connections to Active Standby are allowed

Also, when the LOG_DDL_STMTS database configuration parameter is set to yes on the Primary database (at the time the DDL operation was performed and it’s log records were generated within the recovery log file and transmitted to the standby database), then the DDL statement will also be printed into the db2diag.log starting in v11.1.3.3 and new releases:

2018-04-28-19.31.28.183524-240 I41578254E544         LEVEL: Info
PID     : 8721                 TID : 140697919284992 PROC : db2sysc
INSTANCE: dsciaraf             NODE : 000            DB   : HADR1113
APPHDL  : 0-11                 APPID: *LOCAL.DB2.180428225527
HOSTNAME: hotellnx111
EDUID   : 397                  EDUNAME: db2redom (HADR1113)
FUNCTION: DB2 UDB, recovery manager, SQLP_REPLAY_ONLY_WINDOW_STAT::sqlpSetDDLStmtForHadrReplayOnly
Window, probe:9150
MESSAGE : DDL statement text
DATA #1 : String, 24 bytes
create table table3 (c1 int)

In addition, available in v11.1.3.3 and newer releases, the db2fmtlog tool is available in the sqllib/bin/ folder, and it’s REPLAYONLYWINDOW option can be used to display all log records (if any) that will cause a replay-only-window.  If the LOG_DDL_STMTS database configuration parameter is set to yes on the Primary database (at the time the DDL operation was performed and it’s log records were generated), then the statement itself is displayed as well:

[primary or standby db server]

$ db2 get db config for hadr105 | grep "Path to log files"
  Path to log files = /home/db2inst1/db2inst1/NODE0000/SQL00001/LOGSTREAM0000/

$ cd /home/db2inst1/db2inst1/NODE0000/SQL00001/LOGSTREAM0000/

$ ~/sqllib/bin/db2fmtlog -replayonlywindow 0-10

|------|------------------------------------------------------------------------
| LREC |                     Record LSO = 45190796
|      |                     Record TID = 00000000029D
|      |                  DDL Statement = create table t3 (c1 int)
|------|------------------------------------------------------------------------
| LREC |  5180  00040921  00000000037E
|------|------------------------------------------------------------------------
| LREC |                     Record LSO = 45197733
|      |                     Record TID = 00000000037E
|      |                         Action = DDL
|------|------------------------------------------------------------------------
| LREC |  5180  00040922  00000000037E
|------|------------------------------------------------------------------------
| LREC |                     Record LSO = 45198161
|      |                     Record TID = 00000000037E
|      |                  DDL Statement = create table t4 (c1 int)


How can the replay-only-window be avoided?

In Db2 version 11.1.3.3 and prior releases, the database administrator should implement monitoring for replay-only-window’s. Should the replay-only-windows occur frequently enough to hinder application productivity (or increase in frequency over time), then the DDL statements causing replay-only-windows should be determined (as described above), and applications teams should be engaged to determine if these DDL statements can be reduced in frequency.

Application’s which perform queries against the standby database should plan to expect SQL1224N errors on occasion (due to the replay-only-window), and implement handling to automatically re-connect to the database and re-submit and interrupted query.


Stayed tuned for a drastic improvement which eliminates the replay-only-window in a future Db2 version.



I hope this information is useful. If you have any thoughts or questions, feel free to submit a comment below.












No comments:

Post a Comment

Translate