ThinkingDb2!: Db2 Persistent Diagnostic Data - Use Case: Isolating high CPU operations

- David Sciaraffa, Software Engineering Manager – IBM Db2

The Db2 Persistent Diagnostic Data scripts (available here) collect various Db2 and Operating System diagnostic information, and retain this info for a period of time, allowing for basic triage of many types of issues. Information is collected about every minute (by a script I often call the 'minutely script'), and additional info about every hour (by a script I often call the 'hourly script').

The diagnostic information is sometimes raw in nature, so problem triage often requires scraping and combining the data.

In a customer scenario that I was recently engaged in, CPU utilization on the database server host spiked for a moderate period of time. Here we try to narrow down the cause of the CPU spike using the information collected by the Db2 Persistent Diagnostic Data scripts.

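Examining the vmstat data (collected by the minutely script), we can see the user-cpu spike start at approximately 18:15, with user-cpu usage jumping from about 25% to over 80%. (The first data row of each vmstat run reports averages since boot, so the spike shows from the second row onward.)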

$ grep "" OS_vmstat*

OS_vmstat.[hostname].20190917.181501.txt:procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
OS_vmstat.[hostname].20190917.181501.txt: r b swpd free inact active si so bi bo in cs us sy id wa st
OS_vmstat.[hostname].20190917.181501.txt:27 0 0 484832 41316092 22476632 0 0 144 48 1 1 4 3 92 0 0
OS_vmstat.[hostname].20190917.181501.txt:22 1 0 477076 41319924 22478360 0 0 1048 948 38641 59758 80 9 10 0 0
OS_vmstat.[hostname].20190917.181501.txt:22 0 0 510836 41284108 22482420 0 0 968 1072 36293 51284 82 7 11 0 0
OS_vmstat.[hostname].20190917.181501.txt:25 0 0 491156 41298248 22482316 0 0 1668 1196 36069 45501 83 8 8 0 0
OS_vmstat.[hostname].20190917.181501.txt:23 0 0 616992 41242172 22410940 0 0 1928 1468 35699 37067 84 9 7 0 0
OS_vmstat.[hostname].20190917.181501.txt:23 0 0 598428 41254200 22412968 0 0 2196 14028 32377 39835 84 8 8 0 0
OS_vmstat.[hostname].20190917.181501.txt:20 0 0 686944 41206804 22371564 0 0 1918 1572 35247 42502 80 9 11 1 0
OS_vmstat.[hostname].20190917.181501.txt:28 0 0 684656 41212132 22367500 0 0 1952 1664 32867 42681 78 9 13 1 0
OS_vmstat.[hostname].20190917.181501.txt:31 0 0 680500 41216784 22366940 0 0 2648 2688 38645 45578 81 9 9 1 0
OS_vmstat.[hostname].20190917.181501.txt:27 0 0 672308 41223868 22366852 0 0 9688 3940 39261 48969 83 9 7 1 0
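
When many vmstat files need to be scanned, the same check can be scripted. The following is a minimal Python sketch, not part of the Persistent Diagnostic Data scripts. It assumes input in the `grep "" OS_vmstat*` form shown above (file name, a colon, then the vmstat row); the function names and the 70% threshold are hypothetical choices for illustration.

```python
# Sketch: flag high user-CPU samples in `grep "" OS_vmstat*` style output.
# Assumption: each line looks like "<filename>:<vmstat row>".

SAMPLE = """\
OS_vmstat.[hostname].20190917.181501.txt: r b swpd free inact active si so bi bo in cs us sy id wa st
OS_vmstat.[hostname].20190917.181501.txt:27 0 0 484832 41316092 22476632 0 0 144 48 1 1 4 3 92 0 0
OS_vmstat.[hostname].20190917.181501.txt:22 1 0 477076 41319924 22478360 0 0 1048 948 38641 59758 80 9 10 0 0
OS_vmstat.[hostname].20190917.181501.txt:22 0 0 510836 41284108 22482420 0 0 968 1072 36293 51284 82 7 11 0 0
""".splitlines()


def parse_vmstat(lines):
    """Yield (filename, {column: value}) for each numeric vmstat row."""
    columns = None
    for line in lines:
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        # The column header is the row that names the "us" and "id" columns.
        if "us" in tokens and "id" in tokens:
            columns = tokens
            continue
        if columns and len(tokens) == len(columns) and all(t.isdigit() for t in tokens):
            yield name, dict(zip(columns, map(int, tokens)))


def high_user_cpu(lines, threshold=70):
    """Return (filename, us) for samples at or above the threshold.

    The first numeric row of each file is skipped because vmstat reports
    averages since boot there, not the current interval.
    """
    seen = set()
    hits = []
    for name, sample in parse_vmstat(lines):
        if name not in seen:
            seen.add(name)
            continue
        if sample["us"] >= threshold:
            hits.append((name, sample["us"]))
    return hits


if __name__ == "__main__":
    for name, us in high_user_cpu(SAMPLE):
        print(name, us)
```

Locating columns by header name rather than by fixed position keeps the sketch tolerant of vmstat variants that print a different set of columns.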

Next, I confirmed that the cpu spike was associated with the db2sysc database server process by comparing top data from the two time periods:

$ less OS_top.[hostname].20190917.180001.txt

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8621 ipportd1 19 -1 38.4g 13g 7.6g S 118.8 21.0 19294:30 db2sysc
13443 root RT 0 637m 65m 48m R 22.3 0.1 2786:46 corosync
12122 root 15 -5 422m 46m 1136 S 14.8 0.1 2067:30 tesvc
12564 root 20 0 328m 290m 2252 S 14.8 0.5 1912:53 rtvscand
5962 root 20 0 6184 1020 336 S 13.0 0.0 1551:54 symcfgd
38663 ipportd1 20 0 15684 1912 884 R 7.4 0.0 0:00.06 top
38766 caapm 20 0 122m 3508 1932 S 5.6 0.0 0:00.03 perl
11857 root 20 0 20.1g 760m 10m S 3.7 1.2 738:29.34 java

$ less OS_top.[hostname].20190917.181502.txt

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8621 ipportd1 19 -1 38.3g 15g 9.8g S 1853.9 24.3 19363:32 db2sysc
38942 ipportd1 20 0 339m 51m 13m R 32.4 0.1 0:00.26 db2bp
39244 ipportd1 20 0 250m 15m 10m R 25.2 0.0 0:00.14 db2
13443 root RT 0 638m 66m 49m S 18.0 0.1 2788:40 corosync
39285 root 19 -1 36580 1748 904 R 16.2 0.0 0:00.09 clulog
12122 root 15 -5 422m 46m 1136 S 12.6 0.1 2068:48 tesvc
12564 root 20 0 328m 290m 2252 S 10.8 0.5 1914:09 rtvscand
39190 ipportd1 20 0 15684 1924 884 R 10.8 0.0 0:00.10 top
5962 root 20 0 6184 1020 336 S 9.0 0.0 1552:56 symcfgd
358 root 20 0 0 0 0 S 3.6 0.0 6:12.15 kswapd1
11857 root 20 0 20.1g 760m 10m S 3.6 1.2 738:53.95 java
12135 ipporta1 19 -1 52.0g 24g 21g S 3.6 39.8 1227:27 db2sysc
228 root 20 0 0 0 0 S 1.8 0.0 2:36.82 kblockd/2
357 root 20 0 0 0 0 S 1.8 0.0 6:29.98 kswapd0
12223 caapm 20 0 187m 54m 1596 S 1.8 0.1 169:17.21 sysedge
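
The comparison of two top snapshots can also be done programmatically. This is a rough Python sketch under the assumption that each process row has the twelve whitespace-separated columns shown above, with %CPU in the ninth and COMMAND in the twelfth; `cpu_by_pid` and `cpu_changes` are hypothetical helper names. Note that in top's default mode 100% corresponds to one fully busy core, so 1853.9% means roughly 18.5 cores' worth of CPU.

```python
# Sketch: rank processes by %CPU increase between two top snapshots.
# Assumption: rows follow the 12-column layout shown in the article.

BEFORE = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8621 ipportd1 19 -1 38.4g 13g 7.6g S 118.8 21.0 19294:30 db2sysc
13443 root RT 0 637m 65m 48m R 22.3 0.1 2786:46 corosync
"""

AFTER = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8621 ipportd1 19 -1 38.3g 15g 9.8g S 1853.9 24.3 19363:32 db2sysc
38942 ipportd1 20 0 339m 51m 13m R 32.4 0.1 0:00.26 db2bp
13443 root RT 0 638m 66m 49m S 18.0 0.1 2788:40 corosync
"""


def cpu_by_pid(text):
    """Map PID -> (command, %CPU) for each process row in a top snapshot."""
    result = {}
    for line in text.splitlines():
        tokens = line.split()
        # Process rows start with a numeric PID; the header row does not.
        if len(tokens) >= 12 and tokens[0].isdigit():
            result[int(tokens[0])] = (tokens[11], float(tokens[8]))
    return result


def cpu_changes(before, after):
    """Return (pid, command, cpu_before, cpu_after), largest increase first.

    Processes that appear only in the later snapshot are treated as
    having used 0.0% CPU before.
    """
    old, new = cpu_by_pid(before), cpu_by_pid(after)
    rows = []
    for pid, (command, cpu_after) in new.items():
        cpu_before = old.get(pid, (command, 0.0))[1]
        rows.append((pid, command, cpu_before, cpu_after))
    return sorted(rows, key=lambda r: r[3] - r[2], reverse=True)


if __name__ == "__main__":
    for row in cpu_changes(BEFORE, AFTER):
        print(row)
```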

Next, I compared the cumulative user-cpu values of all the Db2 threads (EDUs) in the db2pd_edus output between a time frame just before the cpu spike (18:00) and one during the cpu spike (18:15).
However, I did not find any EDU with a very sharp increase in cumulative user-cpu during this time, mostly just small increases between the two time frames:

$ diff db2pd_edus.inst.[hostname].20190917.180002.txt db2pd_edus.inst.[hostname].20190917.181504.txt | less
EDU ID TID Kernel TID EDU Name USR (s) SYS (s)

2770 140698158360320 30841 db2agent (WPJCR) 0 525.340000 24.820000
2769 140698162554624 29837 db2agent (SGROUPDB) 0 10.320000 2.510000
2768 140698166748928 21556 db2agent (WPJCR) 0 360.730000 20.310000
2767 140698170943232 21553 db2agntdp (WPREL ) 0 9.300000 2.270000
---
2770 140698158360320 30841 db2agent (WPJCR) 0 531.230000 25.330000
2769 140698162554624 29837 db2agent (SGROUPDB) 0 10.840000 2.630000
2768 140698166748928 21556 db2agent (WPJCR) 0 398.970000 21.970000
2767 140698170943232 21553 db2agent (WPJCR) 0 10.080000 2.320000
...etc...
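
Reading deltas out of a raw diff is tedious when there are thousands of EDUs. Below is a minimal Python sketch of the same comparison. It assumes each data row starts with a numeric EDU ID and ends with the USR and SYS columns, as in the excerpt above; `parse_edus` and `usr_cpu_deltas` are hypothetical names. For scale, the two snapshots are roughly 900 seconds apart, so a thread that kept one core busy for the whole interval would show a delta near 900.

```python
# Sketch: per-EDU increase in cumulative USR seconds between two
# db2pd -edus snapshots. Row layout is assumed from the article's excerpt.
from decimal import Decimal, InvalidOperation

BEFORE = """\
EDU ID TID Kernel TID EDU Name USR (s) SYS (s)
2770 140698158360320 30841 db2agent (WPJCR) 0 525.340000 24.820000
2769 140698162554624 29837 db2agent (SGROUPDB) 0 10.320000 2.510000
2768 140698166748928 21556 db2agent (WPJCR) 0 360.730000 20.310000
2767 140698170943232 21553 db2agntdp (WPREL ) 0 9.300000 2.270000
"""

AFTER = """\
EDU ID TID Kernel TID EDU Name USR (s) SYS (s)
2770 140698158360320 30841 db2agent (WPJCR) 0 531.230000 25.330000
2769 140698162554624 29837 db2agent (SGROUPDB) 0 10.840000 2.630000
2768 140698166748928 21556 db2agent (WPJCR) 0 398.970000 21.970000
2767 140698170943232 21553 db2agent (WPJCR) 0 10.080000 2.320000
"""


def parse_edus(text):
    """Map EDU ID -> (EDU name, cumulative USR seconds)."""
    edus = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 6 or not tokens[0].isdigit():
            continue  # header or unrelated line
        try:
            usr = Decimal(tokens[-2])  # second-to-last column is USR (s)
        except InvalidOperation:
            continue
        # The EDU name may contain spaces, so take everything in between.
        edus[int(tokens[0])] = (" ".join(tokens[3:-2]), usr)
    return edus


def usr_cpu_deltas(before, after):
    """Return (edu_id, name, usr_delta) sorted by largest delta first."""
    old, new = parse_edus(before), parse_edus(after)
    deltas = [
        (edu_id, name, usr - old[edu_id][1])
        for edu_id, (name, usr) in new.items()
        if edu_id in old
    ]
    return sorted(deltas, key=lambda d: d[2], reverse=True)


if __name__ == "__main__":
    for edu_id, name, delta in usr_cpu_deltas(BEFORE, AFTER):
        print(edu_id, name, delta)
```

`Decimal` is used instead of `float` so that the subtraction of the six-decimal values is exact.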

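Next, I examined the db2pd-utilities output and confirmed that no utilities (such as REORGs or BACKUPs) were running during this time frame. Every snapshot is the same small size (410 bytes), which is consistent with header-only output and no utility entries: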

$ ls -l db2pd_utilities*

-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:00 db2pd_utilities.[hostname].20190917.180004.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:03 db2pd_utilities.[hostname].20190917.180305.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:06 db2pd_utilities.[hostname].20190917.180605.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:09 db2pd_utilities.[hostname].20190917.180904.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:12 db2pd_utilities.[hostname].20190917.181204.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:15 db2pd_utilities.[hostname].20190917.181507.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:18 db2pd_utilities.[hostname].20190917.181812.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:21 db2pd_utilities.[hostname].20190917.182113.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:24 db2pd_utilities.[hostname].20190917.182414.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:27 db2pd_utilities.[hostname].20190917.182707.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:30 db2pd_utilities.[hostname].20190917.183006.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:33 db2pd_utilities.[hostname].20190917.183306.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:36 db2pd_utilities.[hostname].20190917.183605.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:39 db2pd_utilities.[hostname].20190917.183904.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:42 db2pd_utilities.[hostname].20190917.184205.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:45 db2pd_utilities.[hostname].20190917.184505.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:48 db2pd_utilities.[hostname].20190917.184804.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:51 db2pd_utilities.[hostname].20190917.185105.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:54 db2pd_utilities.[hostname].20190917.185405.txt
-rwxrwxr-x 1 ecuunpck swsupt 410 Sep 18 00:57 db2pd_utilities.[hostname].20190917.185705.txt

Next, I compared the db2pd_agents output between the same two time frames. I can see a small increase in the number of active agents (i.e. an increase in the number of database connections).

$ diff db2pd_agents.inst.[hostname1].20190917.180003.txt db2pd_agents.inst.[hostname1].20190917.181807.txt | less
< Active coord agents: 797
< Active agents total: 797
< Pooled coord agents: 78
< Pooled agents total: 78
---
> Active coord agents: 828
> Active agents total: 828
> Pooled coord agents: 47
> Pooled agents total: 47

I also see some cases where the cumulative rows-read value increases sharply, such as this example, where AppHandl 3698 read about 31M rows between the two time frames:

$ diff db2pd_agents.inst.[hostname1].20190917.180003.txt db2pd_agents.inst.[hostname1].20190917.181807.txt | less

Address AppHandl [nod-index] AgentEDUID Priority ... Rowsread Rowswrtn ...
0x00007FFCEC3F6900 3695 [000-03695] 1123 0 Coord ... 11560 0 ...
0x00007FFCEBBEB4C0 3696 [000-03696] 1086 0 Coord ... 54716 0 ...
0x00007FFCEB8985C0 3698 [000-03698] 1081 0 Coord ... 200138526 312789 ...
---
Address AppHandl [nod-index] AgentEDUID Priority ... Rowsread Rowswrtn ...
0x00007FFCEC3F6900 3695 [000-03695] 1123 0 Coord ... 0 0 ...
0x00007FFCEBBEB4C0 3696 [000-03696] 1086 0 Coord ... 0 0 ...
0x00007FFCEB8985C0 3698 [000-03698] 1081 0 Coord ... 231755161 334148 ...
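
The rows-read comparison can be sketched in Python as follows. This is only an illustration: the column positions (`handle_col`, `rowsread_col`) are hypothetical defaults that match the abbreviated excerpt above, where several columns are elided with "...", and they would need adjusting for full db2pd -agents output. A counter that goes down between snapshots (as for AppHandl 3695 and 3696) is treated as a reset and skipped, since the difference is not meaningful.

```python
# Sketch: per-application increase in cumulative Rowsread between two
# db2pd -agents snapshots.
# Assumption: column positions match the article's abbreviated excerpt,
# NOT the full db2pd output.

BEFORE = """\
Address AppHandl [nod-index] AgentEDUID Priority ... Rowsread Rowswrtn ...
0x00007FFCEC3F6900 3695 [000-03695] 1123 0 Coord ... 11560 0 ...
0x00007FFCEBBEB4C0 3696 [000-03696] 1086 0 Coord ... 54716 0 ...
0x00007FFCEB8985C0 3698 [000-03698] 1081 0 Coord ... 200138526 312789 ...
"""

AFTER = """\
Address AppHandl [nod-index] AgentEDUID Priority ... Rowsread Rowswrtn ...
0x00007FFCEC3F6900 3695 [000-03695] 1123 0 Coord ... 0 0 ...
0x00007FFCEBBEB4C0 3696 [000-03696] 1086 0 Coord ... 0 0 ...
0x00007FFCEB8985C0 3698 [000-03698] 1081 0 Coord ... 231755161 334148 ...
"""


def rows_read_by_handle(text, handle_col=1, rowsread_col=-3):
    """Map AppHandl -> cumulative Rowsread for each agent row."""
    counts = {}
    for line in text.splitlines():
        tokens = line.split()
        # Agent rows begin with a hexadecimal address; the header does not.
        if tokens and tokens[0].startswith("0x"):
            counts[int(tokens[handle_col])] = int(tokens[rowsread_col])
    return counts


def rows_read_deltas(before, after):
    """Return (apphandl, delta) sorted by largest increase first.

    Handles whose counter decreased are skipped as likely resets.
    """
    old, new = rows_read_by_handle(before), rows_read_by_handle(after)
    deltas = [
        (handle, count - old[handle])
        for handle, count in new.items()
        if handle in old and count >= old[handle]
    ]
    return sorted(deltas, key=lambda d: d[1], reverse=True)


if __name__ == "__main__":
    print(rows_read_deltas(BEFORE, AFTER))
```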

Next, using the AgentEDUID value of 1081 for this agent, I examined the db2pd-apinfo data to see what this agent was executing:

$ less db2pd_apinfo.[hostname1].20190917.181507.txt

Application :
Address : 0x00007FFCEBD00080
AppHandl [nod-index] : 3698 [000-03698]
TranHdl : 25
Application PID : 0
Application Node Name : [ipaddr]
IP Address: [ipaddr]
Connection Start Time : (1567261562)Sat Aug 31 10:26:02 2019
Client User ID : n/a
System Auth ID : APPWPS
Coordinator EDU ID : 1081 ...
Last executed statements :
Package cache ID : 0x0000021900000002
Anchor ID : 537
Statement UID : 2
SQL Type : Dynamic
Statement Type : DML, Insert/Update/Delete
Statement : DELETE FROM JCR.ICMSTJCRREMOVEHLP WHERE WSID = ? AND LID = ?

So we might consider this a suspect query... but let's keep looking....

Next, I loaded the MON_GET_ACTIVITY output from the 18:15 cpu spike time frame into a spreadsheet. However, it did not reflect the activities of the 800+ agents in the database. It contained only 4 records, none of which showed high total_cpu_time values, large rows-read or rows-written values, or high query cost estimates.

So I suspect that whatever is causing the CPU increase is not a single long execution of a large query, which would have been captured in the minutely data collections, but rather many successive executions of a single query or a small set of queries, each of which may not have been executing at the moment the minutely data was collected.

Next, I examined the mon_get_pkg_cache_stmt output from the hourly data collection script.

I crafted the following awk command to calculate the average cpu time per query execution (total_cpu_time divided by num_execs) from the mon_get_pkg_cache_stmt() output.
I do see some relatively expensive individual queries whose cost is compounded by many executions:

$ awk -F' ' '{avg=0; if($9+0 != 0 && $13+0 != 0){ avg=$13/$9 }; print "pkg_sch:" $3 ",pkg_nam:" $4 ",pkg_ver:" $5 ",sec_num:" $6 ",num_execs:" $9 ",total_cpu_time:" $13 ",average_cpu_time:" avg;}' qry3.out | less

pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:155337498,total_cpu_time:78672618581,average_cpu_time:506.463
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:2447636,total_cpu_time:25020306619,average_cpu_time:10222.2
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:2853649,total_cpu_time:14116987481,average_cpu_time:4947
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:25452,total_cpu_time:7238807540,average_cpu_time:284410
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:109301657,total_cpu_time:4889890787,average_cpu_time:44.7376
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:41021,total_cpu_time:4850725299,average_cpu_time:118250
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:38909250,total_cpu_time:3761535928,average_cpu_time:96.6746
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:58972,total_cpu_time:3707165379,average_cpu_time:62863.1
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:11536,total_cpu_time:3488732997,average_cpu_time:302421
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:17642026,total_cpu_time:3051013622,average_cpu_time:172.94
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:17226,total_cpu_time:2038784887,average_cpu_time:118355
...
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:73,total_cpu_time:120223152,average_cpu_time:1.64689e+06

...

Searching the mon_get_pkg_cache_stmt data files for the entry with this total_cpu_time value, we see that this particular query is:

DELETE FROM jcr.WCM_USER_SHORTCUT t0 WHERE (t0.VPID = ? AND t0.LOCATION_DATA LIKE ? ESCAPE '\') AND t0.KIND = ?
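
To pick out such statements without scanning the list by eye, the summary lines produced by the awk command can be re-ranked by average cpu time. The Python sketch below is one hypothetical way to do that; it parses the `key:value,...` lines exactly as printed above, and the function names are illustrative. Assuming total_cpu_time is in microseconds, as documented for MON_GET_PKG_CACHE_STMT, the top entry averages about 1.6 seconds of CPU per execution.

```python
# Sketch: rank package cache entries by average CPU time per execution,
# using the "key:value,..." summary lines printed by the awk command.

SAMPLE = """\
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:155337498,total_cpu_time:78672618581,average_cpu_time:506.463
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:25452,total_cpu_time:7238807540,average_cpu_time:284410
pkg_sch:-,pkg_nam:-,pkg_ver:-,sec_num:-,num_execs:73,total_cpu_time:120223152,average_cpu_time:1.64689e+06
""".splitlines()


def parse_summary(line):
    """Return (num_execs, total_cpu_time) from one summary line."""
    fields = dict(item.split(":", 1) for item in line.strip().split(","))
    return int(fields["num_execs"]), int(fields["total_cpu_time"])


def rank_by_average(lines):
    """Return (num_execs, total_cpu_time, average), highest average first."""
    ranked = []
    for line in lines:
        if not line.strip():
            continue
        num_execs, total_cpu = parse_summary(line)
        if num_execs == 0:
            continue  # avoid dividing by zero for never-executed entries
        ranked.append((num_execs, total_cpu, total_cpu / num_execs))
    return sorted(ranked, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    for num_execs, total_cpu, average in rank_by_average(SAMPLE):
        print(num_execs, total_cpu, round(average, 1))
```

Sorting by average surfaces statements that are expensive per execution, while sorting by total_cpu_time (the order shown in the awk output) surfaces statements that are cheap individually but run very often. Both views are worth checking.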

At this point, I'd consider these two DELETE statements to be suspects, and I'd recommend running them through the db2batch benchmarking tool to determine their execution metrics (rows read, updated, written, etc.) and whether any tuning is required.

Friday, 4 October 2019
