The company where I work run a large web infrastructure with many different Java based applications and servers. Most of these application servers connect to the database using a connection pool to manage database connections and reduce the cost of starting/destroying database sessions. Over the years we have spent a lot of time trying to get the right balance to keep session usage as smooth as possible.
- maximum connections in the pool set too low? - Can lead to the requests queuing to get a connection or running out all together during peak spikes (leading to application failure).
- maximum connections set too high? - Can lead to “logon storms”. A sudden surge in concurrent user activity can cause a huge demand on new sessions to be created on the database server. Because the database is slowed by the surge, you often get a “feed back loop” effect. Slow database response means even more connection requests. “Connection Storms” can also be caused by small problems which start in the database or network (unexpected slow network pause or database wait event…. then a sudden logon storm hits the database host as increasing connections slow the database even more (seen this lots in the past)
- minimum connections in the pool set too low? – Can lead to the added cost of always having to restart new database sessions just at the critical time you need them.
- idle out time set too low? - Once you have taken the hit of creating a new session in the database, you let this session reach the idle time too soon and the session is destroyed… only to be needed again a few minutes later.
- idle out time set too high? Lets say you have suffered a minor “logon storm”, if the idle time set too high you could have all these extra sessions hanging around for a very long time (There have been a number of good presentations by the Oracle Real World Performance group talking about benefits of reducing connections on a database server)
- minimum and maximum the same value? There is a lot to be said for running with a fixed number of sessions supporting your average usage and your peaks (The trick is finding the correct number to support peaks)
A very simple session monitoring script
Before I actually talk about the script itself, I thought I would give an example of why I wrote it in the first place and how it was used, then I will show the script and why we still use it today.
The problem
When I first joined my company, I could never get my head around the number of sessions on the database compared to the number of “ACTIVE” sessions seen in v$session. We seemed to have many more connections from the application servers then sessions doing work. Every once in a while the company would review the peak connection pool settings (per box) and adjust them. The kind of conversations we would have around the office were:
”We are using the maximum 30 connections per box, let’s make it 35 per box for growth”
The next year we would say…
“We are using all the 35 per box, let’s make it 40 per box….. for growth”
All the time I kept thinking maybe v$session status=”ACTIVE” was mis-leading, or maybe a problem with the connection pools (We had 30 minute idle out time but we always ran at the maximum connections but very few were actually seen to be used.
Back to the script
client_info.sql
set trimspool on
set pagesize 1000
set linesize 190
column since format a20
column logon format a20
column event format a25 truncate
column machine format a10 truncate
column status format a10 truncate
column username format a10 truncate
column n format 9999.99
break on machine on report skip 1
compute sum of sess_count on machine
compute sum of sess_count on report
set time on
spool client.txt append
select 'CFROM' tag,
to_char(sysdate,'hh24:mi:ss') when,
machine,
event,
seconds_in_wait,
sql_id,
prev_sql_id,
count(*) sess_count,
to_char(sysdate - (SECONDS_IN_WAIT / (60 * 60 * 24)),'dd-mon-yy hh24:mi:ss') since,
-- next two lines useful if trying to predict a concurrent spike. 1200 being 20 minutes
-- Left over from site specific issue but could be useful.
-- next is last active time + 20 minutes
-- n is a count down to next predicted spike
-- to_char(sysdate - ( (seconds_in_wait ) / (60 * 60 * 24)) + (1200 / (60 * 60 * 24)),'dd-mon-yy hh24:mi:ss') NEXT ,
-- ((sysdate - (sysdate - ( (seconds_in_wait ) / (60 * 60 * 24)) + (1200 / (60 * 60 * 24)) ) ) * (60 * 60 * 24) ) / 60 n,
username,
status,
state
from v$session
group by machine,
event,
seconds_in_wait,
sql_id,
prev_sql_id,
username,
status,
state
order by machine,
username,
event,
seconds_in_wait,
sql_id,
prev_sql_id,
status,
state
/
One day I wrote the above script and all became very clear. Something sent a very fast (150 micro seconds) sql statement concurrently from each application server to every session in the connection pool. The request rate could easily have been serviced by a handful of sessions, but because they came concurrently at the exact same time every session was used.
note: This output has been faked, Not able to find real example of the original issue we had (few years back)
note: All host names and user names are also faked...... so I can keep my job!
Columns Description
----------------- --------------------------
TAG tag used so can grep info from log file
WHEN current system time when data collected
MACHINE host where the session originates
EVENT Current database event
SECONDS_IN_WAIT Seconds in Wait
SQL_ID Current SQL statement
PREV_SQL_ID Previous SQL statement last run on the session
SESS_COUNT Total number of sessions within group (sql_id,event,seconds in wait etc)
SINCE What time the session has been idle since (sysdate - seconds wait)
USERNAME User name of sessions
STATUS Session status (Active or Inactive)
STATE Session Wait State
TAG WHEN MACHINE EVENT SECONDS_IN_WAIT SQL_ID PREV_SQL_ID SESS_COUNT SINCE USERNAME STATUS STATE
----- -------- ---------- ------------------------- --------------- ------------- ------------- ---------- -------------------- ---------- ---------- -------------------
CFROM 13:55:17 abcabc01.i SQL*Net message from clie 0 2j7pff3tfuzzz 2j7pff3tfuzzz 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 qw4bv6jwup5ab 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 5tarshstnypzv 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 1 8zv7177vuc8dt 3 14-may-10 13:55:16 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 5 4616pfpak8akh 1 14-may-10 13:55:12 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 22 amsmuu1pp1w74 1 14-may-10 13:54:55 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 185 3q4bv6jx8wup5 30 14-may-10 13:52:12 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 433 f3tg1gz4zdadm 1 14-may-10 13:48:04 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 873 cmh3vh4pjs7q7 1 14-may-10 13:40:44 WEBUSERABC INACTIVE WAITING
********** ----------
sum 40
CFROM 13:55:17 abcabc02.i SQL*Net message from clie 0 zzz0ahx447fpr zzz0ahx447fpr 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 3q4bv6jx8wup5 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 5tarshstnypzv 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 f3tg1gz4zdadm 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 6 cmh3vh4pjs7q7 1 14-may-10 13:55:11 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 8 8zv7177vuc8dt 1 14-may-10 13:55:09 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 11 cmh3vh4pjs7q7 1 14-may-10 13:55:06 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 12 f3tg1gz4zdadm 1 14-may-10 13:55:05 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 80 8zv7177vuc8dt 1 14-may-10 13:53:57 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 80 f3tg1gz4zdadm 1 14-may-10 13:53:57 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 138 3q4bv6jx8wup5 28 14-may-10 13:52:59 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 695 cmh3vh4pjs7q7 1 14-may-10 13:43:42 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 1394 cmh3vh4pjs7q7 1 14-may-10 13:32:03 WEBUSERABC INACTIVE WAITING
********** ----------
sum 40
CFROM 13:55:17 abcabc03.i SQL*Net message from clie 0 zzz0ahx447fpr zzz0ahx447fpr 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 5tarshstnypzv 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 cmh3vh4pjs7q7 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 2 5tarshstnypzv 1 14-may-10 13:55:15 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 7 8zv7177vuc8dt 2 14-may-10 13:55:10 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 14 3q4bv6jx8wup5 1 14-may-10 13:55:03 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 14 4616pfpak8akh 1 14-may-10 13:55:03 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 61 8zv7177vuc8dt 1 14-may-10 13:54:16 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 159 3q4bv6jx8wup5 31 14-may-10 13:52:38 WEBUSERABC INACTIVE WAITING
********** ----------
sum 40
CFROM 13:55:17 abcabc04.i SQL*Net message from clie 0 3q4bv6jx8wup5 39 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 5tarshstnypzv 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
********** ----------
sum 40
CFROM 13:55:17 abcabc05.i SQL*Net message from clie 0 zzz0ahx447fpr zzz0ahx447fpr 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 33hsynd62ka4k 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 0 cmh3vh4pjs7q7 1 14-may-10 13:55:17 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 1 zzz0ahx447fpr zzz0ahx447fpr 2 14-may-10 13:55:16 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 10 cmh3vh4pjs7q7 1 14-may-10 13:55:07 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 31 8zv7177vuc8dt 1 14-may-10 13:54:46 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 157 amsmuu1pp1w74 1 14-may-10 13:52:40 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 318 3q4bv6jx8wup5 29 14-may-10 13:49:59 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 874 cmh3vh4pjs7q7 2 14-may-10 13:40:43 WEBUSERABC INACTIVE WAITING
CFROM 13:55:17 SQL*Net message from clie 1040 8zv7177vuc8dt 1 14-may-10 13:37:57 WEBUSERABC INACTIVE WAITING
********** ----------
sum 40
Whenever I ran the script, I could see that we had spikes of sessions running the same sql statement in the same second on any given host. Because the statement groups by “seconds_in_wait” we might see that say 25 sessions all were active 8 minutes ago with the same statement and have since done no work (waiting on “SQL*Net message from client”). When we looked at all the application hosts, each would spike at a 20 minute interval (although each host had its own time at which it occurred).
My company use a number of different application caching techniques in our middle-tier application, one of these was a concurrent “read- ahead” cache, this would allow a read from cache to detect its data was soon to expire and asynchronously request a load of the caches in parallel. Our “read-ahead” code had been set to allow 50 threads to get data, but we only had 40 connections in the database connection pool (each box). We had a 30 minute keep alive on the connections, but every 20 minutes exactly, we would touch every connection and keep them alive. The solution was to reduce the number of threads on the “read-ahead” and not even change the connection pool configuration. This fix automatically reduced the number of connections on the database by around 1000.
There have been so many examples where the script has helped, we leave it collecting every 5 minutes to a log file on the more sensitive databases. It can help identify cases where:
-
Sudden concurrent requests by one or more severs or applications. We can get clues in what was the last sql statement run, who by and when
-
Statements taking a long time on specific hosts (Look for “ACTIVE” sessions for long time on same sql)
-
Idle connections which never get returned to the connection pool correctly within the application hosts
-
user behaviour driving sudden spikes in concurrent requests from specific hosts
-
sessions which are getting stuck on database waits (eg db links due to networking issues, row locking etc) and not returning to the pool
-
What time an application host last sent any requests to the database
The script attempts to provide a fast summary of the connections on the database, how long they have been idle, what they last did, which current wait events are used. Although the script is very simple, I have found it a really good way to summarise what our hosts (using connection pools) are doing.
Good script! It really useful to me.
Comment by Vega Lin — 20/05/2010 @ 2:08 am
Nice catch. One of the things I want to try in Maria DB is a backport they did from MySQL 6, to have an event based connection pool, so that connections only cost a few kb of ram and concurrency is rather controlled by a worker pool.
Comment by George — 01/11/2010 @ 9:23 pm