Bug – DCD (Dead Client Detection) doesn’t work as expected

For quite some time we observed dead client connections on some of our production servers (in all cases: 32-bit Oracle10g 10.2.0.2, Windows 2003 SP1). We turned dead client detection ON by setting up SQLNET.EXPIRE_TIME in sqlnet.ora on the server side, trying different values (5, 10, 15 minutes). Despite of that, we noticed that DCD doesn’t work for no apparent reason. We confirmed that by tracing the session according to Metalink Note 1019019.102 “How to Check if Dead Connection Detection (DCD) is Enabled”. It was during this search on Metalink that we found out about Bug #5573896 “DCD IS NOT WORKING IF SET MORE THAN 1 MINUTE”. After setting SQLNET.EXPIRE_TIME=1 the problem with dead client connections disappeared – well, better said the problem was not as severe, as before we changed EXPIRE_TIME to 1 minute. Occasionally, we still find dead connections that Oracle doesn’t recognize as dead connections. According to Metalink note it’s a platform specific bug – it’s not known yet when’ll be fixed.

So far we found out that:
– this is the problem with Oracle 10.2.0.2/10.2.0.3 on Windows 2003 / XP
– we could not reproduce the problem running Oracle 10.2.0.2 on Windows 2000 (SP4)
it doesn’t seems to affect Windows 2003 EMT64*

[Comment added, 2 November, 2007]
* Actually, it does affect Windows 2003 EMT64 as well. Last week we put in production Oracle 10.2.0.3 (+Patch 12) on 64-bit Windows 2003 EE with SP2. After a week in a production we noticed dead client connections as described above.

Regards,
Ales

Posted on 13.08.2007, in Oracle and tagged . Bookmark the permalink. 5 Comments.

  1. A real bummer…
    We found out that the same problem appear on Windows 2000 (SP4) as well. Not only is this bug wasting system resources but it causes instance crashes as well.

    Additional testing showed that:

    – if we kill dead client connections with:


    alter system kill session 'sid,serial#' immediate;

    the instance crash is followed immediately after we issue the statement. We can reproduce this almost at will. If we leave out the immediate part of the clause, then instance will not crash, the status in V$SESSION will be ‘KILLED’ and stayed listed in this view until we schedule shutdown – and at that (shutdown) time, instance crash happens 9 out of 10 times!

    – if we try to shutdown the instance, instead of killing sessions then during shutdown immediate, the shutdown procedure will hang for a couple of minutes…then dreaded ORA-3113. Yap, instance crash.

    – using orakill to kill the dead client does…absolutely nothing. Sessions are still there with status INACTIVE. Orakill is useless in this case.

    Our next step will probably be upgrade to 10.2.0.3, but we do not believe it’ll solve the problem (however opening SR should be easier ;-).
    Oracle apparently changed the way processes are terminated in 9.2.0.8 and 10.2.0.1, according to short Metlink note 5057695.8 (fixed in 10.2.0.4).

    I’ll keep updating this thread…

  2. Workaround being tested…
    I think, we found the root cause for occasional instance crash when limit of the number of allowed processes (spfile: processes=n) is reached, even thought, the number is set very generous compared to the actual user base (e.g. allowing 200 processes on instance with 10-15 regular users).

    Reference Bug Notes:

    - BugNo 5607984 "ORACLE DOES NOT CLOSE TCP CONNECTIONS. REMAIN IN CLOSE_WAIT STATE".
    - BugNo 5939756 "SAME EFFECTS AS BUG 5607984 WITH PATCH APPLIED"
    - BugNo 5672200 "BLR BACKPORT OF BUG 5607984 ON TOP OF VERSION 10.2.0.3.0"

    Affected versions are 10.2.0.1, 10.2.0.2, 10.2.0.3 (even with pacth 10.2.0.3.1) on 32-bit Windows XP/2000/2003.
    We expecting this bug to be fixed in forthcoming patchset 10.2.0.4.

    Some quick tests that you can run to determine if you’re dealing with the bug:

    Run queries and compare the result:
    
    sql> select count(*) from v$session;
    
    sql> select count(*) from v$process;
    
    If you see big gap between the numbers (for example 20 sessions, 150 processes) you can be pretty sure you're dealing with bug 5607984.
    You can make additional tests on OS level:
    
    cmd> netstat
    
    You'll see extremely large number of TCP connections with status CLOSE_WAIT.
    
    Additional test: open Task Manager and observe the number of Threads per Oracle process. You'll see at least the number of threads reported by v$process, often you'll see twice that number (don't ask me why, because I don't know).
    
    Alternatively, you can use pslist tool to query remote server:
    
    cmd> pslist -e oracle \\oraserver
    

    Until Oracle doesn’t fix this bug, we’re testing workaround mentioned in BugNo 5607984:
    – we disabled Oracle DCD (Dead Client Detection) by commenting out the line in sqlnet.ora on our server:
    sqlnet.expire_time=1
    – we turned on TCP/IP KeepAlive packets by adding parameter KeepAliveTime=180000 to registry:

    HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\TCPIP\Parameters\KeepAliveTime (of type REG_DWORD, with decimal value 180000).

    We rebooted the server.

    References:
    – Microsoft KB Article 314053 “TCP/IP and NBT configuration parameters for Windows XP”
    – Microsoft KB Article 324270 “How to harden the TCP/IP stack against denial of service attacks in Windows Server 2003”
    – Oracle Metalink Note 226202.1 “How to set Keepalive on Windows NT/2000”

    I’ll shortly update this thread with the test results from workaround implementation.

  3. twelve hours later…
    it seems that “workaround” is really working. No more dead client connections on production servers. In short, DCD doesn’t work on 32-bit Oracle 10g for Windows 2000/2003/XP, even proposed workaround, to set SQLNET.EXPIRE_TIME=1 in SQLNET.ORA doesn’t work at all. The only workable solution at the time of this writing (latest release 10.2.0.3) is to turn completely off Oracle DCD and use Windows KeepAliveTime parameter in registry.

  4. …and the workaround doesn’t work on 64-bit 10.2.0.3 Patch 12
    We removed SQLNET.EXPIRE_TIME entirely from all our sqlnet.ora configuration files due to Oracle bug 5573896 (keep in mind that in our case, even with expire_time=1, DCD didn’t work).
    We removed SQLNET.EXPIRE_TIME on our brand new 64-bit Oracle 10.2.0.3 Patch 12 database running on Windows 2003 x64 because we noticed the same problem as on 10.2.0.2 – Oracle didn’t prune dead client connections. So, the first thought was, ok, let’s try a known workaround – and we set TcpKeepAlive to 180000 (3 minutes).

    The workaround that worked on 10.2.0.2, doesn’t work on 10.2.0.3 Patch 12 (x64). Well, the bug 5573896 is listed as fixed in Patch 6 for 10.2.0.3. Perhaps that’s the reason. We’ll try to use Oracle technique for dead client detection once again, with SQLNET.EXPIRE_TIME=5. I’ll be back shortly with the report.

  5. …so far, so good
    Test scenario:

    1. entered SQLNET.EXPIRE_TIME=5 in SQLNET.ORA on our server (10.2.0.3 + Patch 12) and restarted the listener.
    2. connected to the instance from my test workstation
    3. pulled out the network cable from test workstation, simulating network outage
    4. after a few minutes I checked if Oracle server detected the dead client connection – and it did! Great.

    The bottom line is that DCD bug was apparently fixed in Patch 6 for 10.2.0.3 (since Patches are cumulative you should be fine with anything beyond Patch 6).

    Regards,
    Ales