SNMP4j over TLS stuck at "Key is writable"

The latest SNMP4J 3.6.4-SNAPSHOT contains a probably fix for busy “key is writable” loop described above. That issue could occur when the TLS session needs to process a task but returns OK on wrap:
https://snmp.app/dist/snapshot/org/snmp4j/snmp4j/3.6.4-SNAPSHOT/snmp4j-3.6.4-20211213.015414-3-distribution.tar.gz

The new change is basically in TLSTM.ServerThread.writeMessage where runDelegatedTasks is executed even if wrap returns OK but handshake status is NEED_TASK:

                if (result.getStatus() == SSLEngineResult.Status.OK) {
                    if (result.bytesProduced() > 0) {
                        writeNetBuffer(entry, sc);
                    }
                    else {
                        if (logger.isDebugEnabled()) {
                            logger.debug("SSL Engine status of wrap is ok, but 0 NET bytes produced, handshake status is "+result.getHandshakeStatus());
                        }
                    }
                }
                if ((result.getHandshakeStatus() == SSLEngineResult.HandshakeStatus.NEED_TASK) &&
                        runDelegatedTasks(result, entry)) {
                    logger.debug("SSL session OK");
                }

Hi @AGENTPP
We started using SNMP 2.8.8 and still notice that key is readable is stuck in loop. Please find snmp4j logs below.


After these logs, last 4 lines are continously repeated.

Any suggestion on what could be causing this?

Thanks,
Anjali

What jdk version are you using now?

Tested with latest available java jdk1.8.0_321, but issue is still seen

Same as stated earlier. Do you recommend to try with version >= jdk 1.8.0_291 ?

OK, thanks for confirming that the issue is still present with the latest jdk1.8.0_321 build.
The source of the issue seems to be some weird state of the SSLEngine (status CLOSED but handshake status NEED_WRAP). I have im proved the SNMP4J 2.8.x code to cover this case too in TLSTM.ServerThread.runDelegatedTasks. The following new if-statement ensures that the socket is registered with the outQueue to write out-data:

    case CLOSED:
      if (status == SSLEngineResult.HandshakeStatus.NOT_HANDSHAKING) {
        return false;
      }
      break;

Please try this latest 2.8.9-SNPSHOT from today https://snmp.app/dist/snapshot/org/snmp4j/snmp4j/2.8.9-SNAPSHOT/snmp4j-2.8.9-20220215.112730-2.jar if it fixes the issue. I am not sure, because the SSLEngine does not seem to produce any out-data in your case, which is unexpected (at least from my understanding).

Thank you @AGENTPP
I used the latest 2.8.9 jar mentioned and Status = CLOSED HandshakeStatus = NEED_WRAP is actually closing the session and firing the same event.

But we are seeing data recieved at socket but not processed at snmp4j many a times and hence causing SNMP transaction timeout. We are looking into the details to share but any suggestions around this area is greatly appreciated :slight_smile:

Thanks,
Anjali

Stuck in Key is writable loop now :frowning:
Our agent side idle connection timeout is 5min. Once reconnected I see below issue some times

Any other info needed shall be shared.

Is there any chance you may be encountering 8274524?

Hi norrisjeremy,
Thanks for pointing us to this bug. This could be the root cause, indeed. As expected my fix in the 2.8.9 SNAPSHOT did not really solve it, because then instead looping in the read key (without data coming in) it looped on the write key without having any data to send. As a result, the SSLEngine is a dead lock situation. In SNMP4J 3.6.x I closed the session and the TCP connection in this case. This could solve the issue for SNMP4J 2.x too.
Best regards,
Frank

Thanks @norrisjeremy and @AGENTPP
Can you please share work around code or a snapshot jar with changes in 2.8.x for handling this until fix is released in Java 1.8 load.
This would help use test further on SNMP over TLS.

It would also really help if this could be backported to 2.8.x and delivered.

I have changed the handling of the CLOSED SSL engine status once again in the latest 2.8.9 SNAPSHOT. Now the underlying TCP connection will be closed in any case to fully reset the connection.
Maybe that helps.
Of course, you can try newer JDKs too. The SSL engine was improved a lot from 8 to 17 - I am not sure if everything has been back ported to 8.

Hi @AGENTPP
We are communicating with Oracle to get the fix backported but we are unsure of the rootcause.
This issue was in JKS keystore and occationally seen.
In BCFKS keystore we are frequently seeing this issue when we are establishing connection itself. Took the thread dump and noticed a dead lock.

Can you please take a look and let us know the cause.

cc: @tvikasv

This seems to be a bug. Currently unclear why it does not occur on SNMP4J 3.x too, but that might be related to using not Hashtable but a Collections.synchronizedMap there for Snmp.pendingRequests.
I am going to provide a fix ASAP.

This deadlock can happen only during (D)TLS contextEngineID discovery based on RFC 5343.

To fix it, in the Snmp.PendingRequest.resendRequest method the synchronized block needs to be removed like this:

protected <A extends Address> boolean resendRequest(PendingRequest<A> request, PDU response) {
    if (request.useNextPDU()) {
        request.responseReceived = false;
        pendingRequests.remove(request.key);
        PduHandle holdKeyUntilResendDone = request.key;
        request.key = null;
        handleInternalResponse(response, request.pdu, request.target.getAddress());
        try {
            sendMessage(request.pdu, request.target, request.transport, request);
        } catch (IOException e) {
            logger.error("IOException while resending request after RFC 5343 context engine ID discovery: " +
                    e.getMessage(), e);
        }
        // now the previous retry can be released
        if (logger.isDebugEnabled()) {
            logger.debug("Releasing PDU handle " + holdKeyUntilResendDone);
        }
        holdKeyUntilResendDone = null;
        return true;
    }
    return false;
}

Hi Frank ,

We don’t see the below exception when key is in writable loop .

at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:48)
at java.base/sun.nio.ch.NioSocketImpl.tryRead(NioSocketImpl.java:261)
at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:312)
at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:350)
at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:803)
at java.base/java.net.Socket$SocketInputStream.read(Socket.java:981)
at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
at java.base/sun.security.ssl.SSLSocketInputRecord.readFully(SSLSocketInputRecord.java:461)
at java.base/sun.security.ssl.SSLSocketInputRecord.decodeInputRecord(SSLSocketInputRecord.java:243)
at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:181)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1491)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1397)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:444)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:415)

SSLSocket.close() also reads from socket after JDK-8268965:
at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:48)
at java.base/sun.nio.ch.NioSocketImpl.tryRead(NioSocketImpl.java:261)
at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:312)
at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:350)
at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:803)
at java.base/java.net.Socket$SocketInputStream.read(Socket.java:981)
at java.base/java.io.InputStream.skip(InputStream.java:544)
at java.base/sun.security.ssl.SSLSocketInputRecord.deplete(SSLSocketInputRecord.java:504)
at java.base/sun.security.ssl.SSLSocketImpl.closeSocket(SSLSocketImpl.java:1762)
at java.base/sun.security.ssl.SSLSocketImpl.shutdown(SSLSocketImpl.java:1739)
at java.base/sun.security.ssl.SSLSocketImpl.bruteForceCloseInput(SSLSocketImpl.java:773)
at java.base/sun.security.ssl.SSLSocketImpl.duplexCloseOutput(SSLSocketImpl.java:638)
at java.base/sun.security.ssl.SSLSocketImpl.close(SSLSocketImpl.java:570)

Can you please let us know how can we justify to oracle that the looping issue is due to https://bugs.openjdk.java.net/browse/JDK-8274524 as we are not seeing ssl exception

Regards,
Vikas

I was just guessing that the JDK bug is related to the issue you observe. I have no proof for that, except that this behavior was not observed with recent JDKs and closing connections is part of your issue.
Does not SNMP4J 2.8.9 fix your problem? It has been released today.

Hi Frank,
We used https://snmp.app/dist/snapshot/org/snmp4j/snmp4j/2.9.0-SNAPSHOT/snmp4j-2.9.0-20220302.102246-2.jar
It has the looping issue .
Regards,
Vikas

Hi Vikas,
OK, then I have no further ideas.
Best regards,
Frank

Hi Frank,
We are thankful for the so far support.
We took java 11 openjdk-11.0.10.0.9-8 with latest snmp4j jar snmp4j-3.6.5-20220302.101842-5.jar

But we are seeing Key Is Writable issue post connections established as well.

Can you please share some pointers as how we can debug the issue on snmp4j side. As I had requested, how can we decrypt SNMP over TLS data? This might help us understand the cause. We have tried TLS decryption and was successful for HTTP over TLS but not SNMP. In your tests, how are you decrypting data for tests? Is there any other memory analyser tools or code used for decryption?

The agent side is confirming the response is sent. Java side issue pointed earlier also couldn’t be seen as we contacted Oracle. Looks like issue is irrespective of java version.