Friday 4 October 2019

Db2 Persistent Diagnostic Data - Use Case: Isolating high CPU operations

- David Sciaraffa, Software Engineering Manager – IBM Db2


The Db2 Persistent Diagnostic Data scripts (available here) collect various Db2 and operating-system diagnostic information and retain it for a period of time, allowing basic triage of many types of issues. Information is collected roughly every minute (by a script I often call the 'minutely script'), with additional information collected roughly every hour (by the 'hourly script').

The diagnostic information is sometimes raw in nature, and thus problem triage often requires some scraping and correlation of the data.


In this particular customer scenario, which I was recently engaged in, CPU utilization on the database server host spiked for a moderate period of time. Below, we narrow down the cause of the CPU spike using the information collected by the Db2 Persistent Diagnostic Data scripts.


Examining the vmstat data (collected by the minutely script), we can see the user-cpu spike start at approximately 18:15, with user-cpu usage jumping from about 25% to over 80%:

$ grep "" OS_vmstat*
OS_vmstat.[hostname].20190917.181501.txt:procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
OS_vmstat.[hostname].20190917.181501.txt: r b swpd free inact active si so bi bo in cs us sy id wa st
OS_vmstat.[hostname].20190917.181501.txt:27 0 0 484832 41316092 22476632 0 0 144 48 1 1 4 3 92 0 0
OS_vmstat.[hostname].20190917.181501.txt:22 1 0 477076 41319924 22478360 0 0 1048 948 38641 59758 80 9 10 0 0
OS_vmstat.[hostname].20190917.181501.txt:22 0 0 510836 41284108 22482420 0 0 968 1072 36293 51284 82 7 11 0 0
OS_vmstat.[hostname].20190917.181501.txt:25 0 0 491156 41298248 22482316 0 0 1668 1196 36069 45501 83 8 8 0 0
OS_vmstat.[hostname].20190917.181501.txt:23 0 0 616992 41242172 22410940 0 0 1928 1468 35699 37067 84 9 7 0 0
OS_vmstat.[hostname].20190917.181501.txt:23 0 0 598428 41254200 22412968 0 0 2196 14028 32377 39835 84 8 8 0 0
OS_vmstat.[hostname].20190917.181501.txt:20 0 0 686944 41206804 22371564 0 0 1918 1572 35247 42502 80 9 11 1 0
OS_vmstat.[hostname].20190917.181501.txt:28 0 0 684656 41212132 22367500 0 0 1952 1664 32867 42681 78 9 13 1 0
OS_vmstat.[hostname].20190917.181501.txt:31 0 0 680500 41216784 22366940 0 0 2648 2688 38645 45578 81 9 9 1 0
OS_vmstat.[hostname].20190917.181501.txt:27 0 0 672308 41223868 22366852 0 0 9688 3940 39261 48969 83 9 7 1 0
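Rather than reading each file, a quick loop can pull just the user-cpu column out of every minutely vmstat file. A minimal sketch, assuming the OS_vmstat.[hostname].YYYYMMDD.HHMMSS.txt naming shown above and the standard vmstat layout (us is field 13):

```shell
# Print the user-cpu ("us", field 13) of each vmstat sample, tagged with the
# source filename (which encodes the collection time), skipping the two
# vmstat header lines.
show_user_cpu() {
  for f in "$@"; do
    awk -v src="$f" 'NR > 2 { print src ": us=" $13 }' "$f"
  done
}
# Example: show_user_cpu OS_vmstat.*.txt
```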

Next, I confirmed that the CPU spike was associated with the db2sysc database server process by comparing top data from the two time periods:
$ less OS_top.[hostname].20190917.180001.txt
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8621 ipportd1  19  -1 38.4g  13g 7.6g S 118.8 21.0  19294:30 db2sysc
13443 root      RT   0  637m  65m  48m R 22.3  0.1   2786:46 corosync
12122 root      15  -5  422m  46m 1136 S 14.8  0.1   2067:30 tesvc
12564 root      20   0  328m 290m 2252 S 14.8  0.5   1912:53 rtvscand
 5962 root      20   0  6184 1020  336 S 13.0  0.0   1551:54 symcfgd
38663 ipportd1  20   0 15684 1912  884 R  7.4  0.0   0:00.06 top
38766 caapm     20   0  122m 3508 1932 S  5.6  0.0   0:00.03 perl
11857 root      20   0 20.1g 760m  10m S  3.7  1.2 738:29.34 java
$ less OS_top.[hostname].20190917.181502.txt
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8621 ipportd1  19  -1 38.3g  15g 9.8g S 1853.9 24.3  19363:32 db2sysc
38942 ipportd1  20   0  339m  51m  13m R 32.4  0.1   0:00.26 db2bp
39244 ipportd1  20   0  250m  15m  10m R 25.2  0.0   0:00.14 db2
13443 root      RT   0  638m  66m  49m S 18.0  0.1   2788:40 corosync
39285 root      19  -1 36580 1748  904 R 16.2  0.0   0:00.09 clulog
12122 root      15  -5  422m  46m 1136 S 12.6  0.1   2068:48 tesvc
12564 root      20   0  328m 290m 2252 S 10.8  0.5   1914:09 rtvscand
39190 ipportd1  20   0 15684 1924  884 R 10.8  0.0   0:00.10 top
 5962 root      20   0  6184 1020  336 S  9.0  0.0   1552:56 symcfgd
  358 root      20   0     0    0    0 S  3.6  0.0   6:12.15 kswapd1
11857 root      20   0 20.1g 760m  10m S  3.6  1.2 738:53.95 java
12135 ipporta1  19  -1 52.0g  24g  21g S  3.6 39.8   1227:27 db2sysc
  228 root      20   0     0    0    0 S  1.8  0.0   2:36.82 kblockd/2
  357 root      20   0     0    0    0 S  1.8  0.0   6:29.98 kswapd0
12223 caapm     20   0  187m  54m 1596 S  1.8  0.1 169:17.21 sysedge
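The same kind of scan works for the top snapshots. A sketch that tracks db2sysc's %CPU (field 9 in the layout above) for a given PID across every OS_top file, so the start and end of the spike are easy to see; the PID and field position come from this system's output and would need adjusting elsewhere:

```shell
# Print db2sysc's %CPU from each top snapshot, tagged with the filename.
db2sysc_cpu() {   # usage: db2sysc_cpu <pid> <files...>
  pid="$1"; shift
  for f in "$@"; do
    awk -v src="$f" -v pid="$pid" \
        '$1 == pid && $NF == "db2sysc" { print src, $9 }' "$f"
  done
}
# Example: db2sysc_cpu 8621 OS_top.*.txt
```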



Next, I examined the cumulative user-cpu values of all the Db2 threads in the db2pd_edus output, between a time frame just before the cpu spike (18:00) and during the cpu spike (18:15).
However, I did not find any EDU with a very sharp increase in cumulative user-cpu during this time. Mostly just small increases between the two time-frames:
$ diff db2pd_edus.inst.[hostname].20190917.180002.txt db2pd_edus.inst.[hostname].20190917.181504.txt  | less
  EDU ID    TID                  Kernel TID           EDU Name                               USR (s)         SYS (s)
  
  2770      140698158360320      30841                db2agent (WPJCR) 0                   525.340000    24.820000
  2769      140698162554624      29837                db2agent (SGROUPDB) 0                 10.320000     2.510000
  2768      140698166748928      21556                db2agent (WPJCR) 0                   360.730000    20.310000
  2767      140698170943232      21553                db2agntdp (WPREL   ) 0                 9.300000     2.270000
---
  2770      140698158360320      30841                db2agent (WPJCR) 0                   531.230000    25.330000
  2769      140698162554624      29837                db2agent (SGROUPDB) 0                 10.840000     2.630000
  2768      140698166748928      21556                db2agent (WPJCR) 0                   398.970000    21.970000
  2767      140698170943232      21553                db2agent (WPJCR) 0                    10.080000     2.320000
...etc...
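Rather than eyeballing the raw diff, the two snapshots can be joined on EDU ID so the per-EDU delta of the cumulative USR (s) column is computed directly. A sketch, assuming the db2pd -edus layout shown above (EDU ID in field 1, USR in the second-to-last field):

```shell
# Join two db2pd -edus snapshots on EDU ID and print the USR-seconds delta
# for each EDU, largest first.
edu_usr_delta() {   # usage: edu_usr_delta <before-file> <after-file>
  awk 'NR == FNR { if ($1 ~ /^[0-9]+$/) usr[$1] = $(NF-1); next }
       $1 in usr { printf "%s %s %.2f\n", $1, $4, $(NF-1) - usr[$1] }' \
      "$1" "$2" | sort -k3,3nr
}
# Example:
# edu_usr_delta db2pd_edus.inst.[hostname].20190917.180002.txt \
#               db2pd_edus.inst.[hostname].20190917.181504.txt | head
```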
Next, I examined the db2pd-utilities output and confirmed that there were no utilities (such as REORGs or BACKUPs) running during this time frame:
$ ls -l db2pd_utilities*

-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:00 db2pd_utilities.[hostname].20190917.180004.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:03 db2pd_utilities.[hostname].20190917.180305.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:06 db2pd_utilities.[hostname].20190917.180605.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:09 db2pd_utilities.[hostname].20190917.180904.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:12 db2pd_utilities.[hostname].20190917.181204.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:15 db2pd_utilities.[hostname].20190917.181507.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:18 db2pd_utilities.[hostname].20190917.181812.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:21 db2pd_utilities.[hostname].20190917.182113.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:24 db2pd_utilities.[hostname].20190917.182414.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:27 db2pd_utilities.[hostname].20190917.182707.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:30 db2pd_utilities.[hostname].20190917.183006.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:33 db2pd_utilities.[hostname].20190917.183306.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:36 db2pd_utilities.[hostname].20190917.183605.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:39 db2pd_utilities.[hostname].20190917.183904.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:42 db2pd_utilities.[hostname].20190917.184205.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:45 db2pd_utilities.[hostname].20190917.184505.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:48 db2pd_utilities.[hostname].20190917.184804.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:51 db2pd_utilities.[hostname].20190917.185105.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:54 db2pd_utilities.[hostname].20190917.185405.txt
-rwxrwxr-x    1 ecuunpck swsupt          410 Sep 18 00:57 db2pd_utilities.[hostname].20190917.185705.txt
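All twenty files above are exactly 410 bytes, which already hints that each contains only the header. A quick grep across the whole set makes the check explicit (assuming the utility type names would appear verbatim in the db2pd -utilities output if a utility were running); no filenames in the output means nothing was running:

```shell
# List any db2pd_utilities snapshot that mentions a running utility type.
grep -liE 'reorg|backup|runstats|load' db2pd_utilities.*.txt 2>/dev/null \
  || echo "no utilities found"
```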

Next, I compared the db2pd_agents output between the same two time frames. I can see a small increase in the number of active agents (i.e., an increase in the number of database connections):
$ diff  db2pd_agents.inst.[hostname1].20190917.180003.txt db2pd_agents.inst.[hostname1].20190917.181807.txt | less
< Active coord agents: 797
< Active agents total: 797
< Pooled coord agents: 78
< Pooled agents total: 78
---
> Active coord agents: 828
> Active agents total: 828
> Pooled coord agents: 47
> Pooled agents total: 47
I also see some cases where the cumulative rows-read values increased sharply, such as this example, where AppHandl 3698 read about 31 million rows between the two time frames:

$ diff  db2pd_agents.inst.[hostname1].20190917.180003.txt db2pd_agents.inst.[hostname1].20190917.181807.txt | less
Address              AppHandl [nod-index] AgentEDUID Priority ... Rowsread  Rowswrtn ...
0x00007FFCEC3F6900   3695     [000-03695] 1123       0 Coord  ... 11560     0        ...
0x00007FFCEBBEB4C0   3696     [000-03696] 1086       0 Coord  ... 54716     0        ...
0x00007FFCEB8985C0   3698     [000-03698] 1081       0 Coord  ... 200138526 312789   ...
---
Address              AppHandl [nod-index] AgentEDUID Priority ... Rowsread  Rowswrtn ...
0x00007FFCEC3F6900   3695     [000-03695] 1123       0 Coord  ... 0         0        ...
0x00007FFCEBBEB4C0   3696     [000-03696] 1086       0 Coord  ... 0         0        ...
0x00007FFCEB8985C0   3698     [000-03698] 1081       0 Coord  ... 231755161 334148   ...
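The same join-and-delta trick identifies the busiest readers across all agents at once. A sketch keyed on AppHandl (field 2); since the listing above is trimmed with "...", the Rowsread field index is passed in as a parameter and must be matched to the full db2pd -agents layout on your system:

```shell
# Join two db2pd -agents snapshots on AppHandl and print the rows-read delta
# for each agent that did any reading, largest first.
rowsread_delta() {   # usage: rowsread_delta <before> <after> <rowsread-field>
  awk -v col="$3" 'NR == FNR { if ($2 ~ /^[0-9]+$/) rr[$2] = $col; next }
                   $2 in rr { d = $col - rr[$2]
                              if (d > 0) printf "%s %d\n", $2, d }' \
      "$1" "$2" | sort -k2,2nr
}
```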

Next, using this agent's AgentEDUID value of 1081, I examined the db2pd-apinfo data to see what it was executing:
$ less db2pd_apinfo.[hostname1].20190917.181507.txt 

Application :  
Address :                0x00007FFCEBD00080  
AppHandl [nod-index] :   3698     [000-03698]  
TranHdl :                25  
Application PID :        0  
Application Node Name :  [ipaddr]
IP Address:              [ipaddr]
Connection Start Time :  (1567261562)Sat Aug 31 10:26:02 2019  
Client User ID :         n/a  
System Auth ID :         APPWPS  
Coordinator EDU ID :     1081 ...  
Last executed statements :    
Package cache ID :        0x0000021900000002    
Anchor ID :               537    
Statement UID :           2    
SQL Type :               Dynamic    
Statement Type :          DML, Insert/Update/Delete    
Statement :               DELETE FROM JCR.ICMSTJCRREMOVEHLP WHERE WSID = ? AND LID = ?
So we might consider this a suspect query... but let's keep looking....
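When several agents look suspicious, extracting each one's block from the db2pd_apinfo file beats paging through it. A rough sketch: blocks are assumed to start at the "Application :" marker as in the excerpt above, and the handle is matched loosely (surrounded by spaces), which is good enough for triage:

```shell
# Print the application block(s) whose text contains " <apphandl> ".
app_block() {   # usage: app_block <apphandl> <apinfo-file>
  awk -v h="$1" '
    /^Application :/ { if (index(blk, " " h " ")) printf "%s", blk; blk = "" }
    { blk = blk $0 "\n" }
    END { if (index(blk, " " h " ")) printf "%s", blk }' "$2"
}
# Example: app_block 3698 db2pd_apinfo.[hostname1].20190917.181507.txt
```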

Next, I loaded the MON_GET_ACTIVITIES output from the 18:15 CPU-spike time frame into a spreadsheet. However, it did not reflect the activities of the 800+ agents in the database: it contained only 4 records, none of which show high total_cpu_time values, large rows-read or rows-written values, or large query cost estimates.


So I suspect that whatever is causing the CPU increase is not a single long-running execution of a large query captured in the minutely data collections, but rather many successive executions of a single query (or a small set of queries) that may not have been in flight at the exact moments the minutely data was collected.

Next, I examined the mon_get_pkg_cache_stmt output from the hourly data collection script.
I crafted the following awk command to calculate the average CPU time per query execution from the mon_get_pkg_cache_stmt() output.
I do see some relatively expensive individual queries which are compounded by many executions:

$ awk '{avg=0; if($9+0 != 0 && $13+0 != 0){ avg=$13/$9 }; print "pkg_sch:" $3 ",pkg_nam:" $4 ",pkg_ver:" $5 ",sec_num:" $6 ",num_execs:" $9 ",total_cpu_time:" $13 ",average_cpu_time:" avg;}' qry3.out | less 
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:155337498,total_cpu_time:78672618581,average_cpu_time:506.463
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:2447636,total_cpu_time:25020306619,average_cpu_time:10222.2
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:2853649,total_cpu_time:14116987481,average_cpu_time:4947
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:25452,total_cpu_time:7238807540,average_cpu_time:284410
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:109301657,total_cpu_time:4889890787,average_cpu_time:44.7376
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:41021,total_cpu_time:4850725299,average_cpu_time:118250
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:38909250,total_cpu_time:3761535928,average_cpu_time:96.6746
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:58972,total_cpu_time:3707165379,average_cpu_time:62863.1
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:11536,total_cpu_time:3488732997,average_cpu_time:302421
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:17642026,total_cpu_time:3051013622,average_cpu_time:172.94
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:17226,total_cpu_time:2038784887,average_cpu_time:118355
...
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:73,total_cpu_time:120223152,average_cpu_time:1.64689e+06
...
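The awk output above isn't ordered. If it is saved to a file (qry3_avgs.out is a hypothetical name), pulling the average out in front and sorting with sort -g, which unlike -n understands the 1.64689e+06 scientific notation, ranks the statements by per-execution cost:

```shell
# Rank package-cache statements by average cpu time per execution, by
# extracting the value after the last colon of each line and sorting on it.
rank_by_avg() {
  awk -F',' '{ split($NF, a, ":"); print a[2], $0 }' "$1" | sort -gr
}
# Example: rank_by_avg qry3_avgs.out | head
```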

Finding the entry with this total_cpu_time value within the mon_get_pkg_cache_stmt data files, we see that this particular query is:

DELETE FROM jcr.WCM_USER_SHORTCUT t0 WHERE (t0.VPID = ? AND t0.LOCATION_DATA LIKE ? ESCAPE '\') AND t0.KIND = ?

At this point, I'd consider these two DELETE statements to be suspects, and recommend using the db2batch benchmarking tool against them to determine their execution metrics (rows read, updated, written, etc.) and whether any tuning is required.
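A possible starting point for that db2batch run is sketched below. db2batch cannot bind parameter markers, so the '?' placeholders must first be replaced with representative literal values; the literal values, the suspects.sql file name, and the WPJCR database name here are all illustrative assumptions, not taken from the customer system:

```shell
# Build an input file for db2batch containing the two suspect DELETEs, with
# parameter markers replaced by placeholder literals (adjust to real values).
cat > suspects.sql <<'EOF'
DELETE FROM JCR.ICMSTJCRREMOVEHLP WHERE WSID = 1 AND LID = 1;
DELETE FROM jcr.WCM_USER_SHORTCUT t0 WHERE (t0.VPID = 1 AND t0.LOCATION_DATA LIKE 'x%' ESCAPE '\') AND t0.KIND = 1;
EOF
# Then, from an instance-owner session (verify flags against your db2batch version):
# db2batch -d WPJCR -f suspects.sql -r suspects.out
```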

