SNMP4J stuck in a loop when agent doesn't follow lexicographic ordering

We are querying a table on an agent that returns its rows in a loop, i.e. not in lexicographic order.
According to the documentation, this behavior was fixed in the 2.5.10 release.
From the Changelog:

[2018-01-05] Version 2.5.10:

  • Fixed [SFJ-161]: TableUtils does not check for lexicographic ordering in SNMP4J 2.5.9 which could
    cause endless looping with incorrectly implemented agents. The ordering checking can be now disabled
    but is enabled by default.

But this is not working: SNMP4J stays stuck forever in this scenario.
Can you help us understand whether this scenario should be supported by SNMP4J?

Example Response:
Oid order

1.2.0
1.2.1

1.2.6
1.2.0
1.2.1

1.2.6

Name/OID: lldpLocPortIdSubtype.0; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.1; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.2; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.3; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.4; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.5; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.6; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.0; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.1; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.2; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.3; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.4; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.5; Value (Integer): interfaceName (5)
Name/OID: lldpLocPortIdSubtype.6; Value (Integer): interfaceName (5)

Hi,
Which SNMP4J version are you using?
In 2.5.11, the default changed so that retrieval stops only after three consecutive lexicographic ordering errors. In your example, the agent returns only a single error.

[2018-01-05] Version 2.5.11:

  • Improved [SFJ-162]: TableUtils now waits until three (3) primary lexicographic ordering errors occurred and returns all rows until then. Rows that contain cell values based on incorrectly order data will be returned now with status TableEvent.STATUS_WRONG_ORDER. That state will be also set in the finishing TableEvent then.

You may need to set TableUtils.setIgnoreMaxLexicographicRowOrderingErrors to 0 in order to exit the loop as early as possible.
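To illustrate what such an error limit does, here is a minimal stdlib-only sketch, not SNMP4J's actual implementation. The class `OrderingErrorSketch`, the method `walk`, and the parameter `maxIgnoredErrors` are hypothetical names, row indexes are modeled as plain `int[]` arrays, and the exact way SNMP4J counts errors may differ from the "stop once the count exceeds the limit" rule assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OrderingErrorSketch {
    /**
     * Walks row indexes in arrival order and stops as soon as the number of
     * ordering errors exceeds maxIgnoredErrors (hypothetical parameter).
     * Returns the indexes accepted up to that point.
     */
    static List<int[]> walk(List<int[]> responses, int maxIgnoredErrors) {
        List<int[]> accepted = new ArrayList<>();
        int[] previous = null;
        int errors = 0;
        for (int[] index : responses) {
            // An index that is not strictly greater than its predecessor
            // violates lexicographic ordering.
            if (previous != null && Arrays.compare(index, previous) <= 0) {
                errors++;
                if (errors > maxIgnoredErrors) {
                    break;
                }
            }
            accepted.add(index);
            previous = index;
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Simulated agent that wraps around after index 1.2.6, three laps in total.
        List<int[]> responses = new ArrayList<>();
        for (int lap = 0; lap < 3; lap++) {
            for (int i = 0; i <= 6; i++) {
                responses.add(new int[] {1, 2, i});
            }
        }
        // With a limit of 0 the walk ends at the first wrap-around.
        System.out.println(walk(responses, 0).size());
        // With a limit of 3 the two wrap-arounds in this input are tolerated.
        System.out.println(walk(responses, 3).size());
    }
}
```

In this model a limit of 0 ends the walk on the first out-of-order index, which is the effect the setting above is meant to achieve.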

We are using SNMP4J 2.8.5. The TableUtils.getTable(Target target, OID[] columnOIDs, OID lowerBoundIndex, OID upperBoundIndex) method is stuck in wait().

On further debugging, I found that the following condition never evaluates to true for the above scenario.

(rowCache.getFirst()).getRowIndex().compareTo(lastMinIndex) < 0

The first index is always 1.2.0 (the responses are sorted and added to the rows) and lastMinIndex is also 1.2.0. compareTo therefore returns 0, so the "< 0" check is always false.
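The comparison can be reproduced without SNMP4J. The following is only a sketch: `OidCompareSketch`, `compareOids` and `mayReleaseFirstRow` are hypothetical names, and `Arrays.compare` on `int[]` stands in for `OID.compareTo` (it ignores that real sub-identifiers are unsigned 32-bit values).

```java
import java.util.Arrays;

final class OidCompareSketch {
    // Hypothetical stand-in for OID.compareTo: lexicographic comparison
    // of the sub-identifiers.
    static int compareOids(int[] a, int[] b) {
        return Arrays.compare(a, b);
    }

    // Mirrors the shape of the clause quoted above:
    // (lastMinIndex == null) || (firstRowIndex.compareTo(lastMinIndex) < 0)
    static boolean mayReleaseFirstRow(int[] firstRowIndex, int[] lastMinIndex) {
        return lastMinIndex == null || compareOids(firstRowIndex, lastMinIndex) < 0;
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 0};
        int[] lastMin = {1, 2, 0};
        // Equal indexes compare as 0 ...
        System.out.println(compareOids(first, lastMin));
        // ... so the strict "< 0" clause evaluates to false.
        System.out.println(mayReleaseFirstRow(first, lastMin));
    }
}
```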

The error handling code sits inside the while loop, whose body is never entered.
TableUtils Line 658:
while ((rowCache.size() > 0) &&
       ((rowCache.getFirst()).getNumComplete() ==
        columnOIDs.length) &&
       // make sure, row is not prematurely deemed complete
       (receivedInOrder) &&
       ((lastMinIndex == null) ||
        ((rowCache.getFirst()).getRowIndex().compareTo(
            lastMinIndex) < 0))) {

Yes, that is indeed a bug. It occurs when the lexicographic loop hits the first row still in the row cache and that row itself starts the loop.

To fix that, use the following while loop condition instead:

while (((firstCacheRow = (rowCache.isEmpty()) ? null : rowCache.getFirst()) != null) &&
       (firstCacheRow.getNumComplete() == columnOIDs.length) &&
       // make sure, row is not prematurely deemed complete
       (receivedInOrder) &&
       ((lastMinIndex == null) || firstCacheRow.orderError ||
        (firstCacheRow.getRowIndex().compareTo(lastMinIndex) < 0))) {

The next SNMP4J versions 2.8.7 and 3.4.5 will have this fix included.

Thank you for the solution.

I also see that TableUtils.getTable() waits forever, so in unforeseen circumstances like these the thread is stuck forever. I think it would be good to perform a time-bound wait and throw a timeout error if the response is not received within the timeout period.

Let me know your thoughts on this.

Sure, the wait is potentially risky, but so is life :wink:

The problem is that the wait has to cover an arbitrary number of subsequent requests. The maximum time to wait cannot be derived from the Target.getTimeout() value, so a separate timeout is necessary.
I will provide an optional timeout parameter for the next update to be able to control that limit if needed.

To avoid the wait, you could use the asynchronous getTable call too.
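Until such a parameter exists, a caller-side bound is one possible workaround. The sketch below uses only `java.util.concurrent`. `BoundedWaitSketch` and `callWithTimeout` are hypothetical names, and the `Callable` is a stand-in for a blocking call such as the synchronous getTable. Whether interrupting the worker thread actually unblocks the SNMP4J call is an assumption that has not been verified here. Either way the caller regains control after the timeout, and the worker is a daemon thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWaitSketch {
    /**
     * Runs a blocking call on a worker thread and waits at most
     * timeoutMillis for its result.
     */
    static <T> T callWithTimeout(Callable<T> blockingCall, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-wait");
            t.setDaemon(true); // a stuck call must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(blockingCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupts the worker if it is still running; no effect if it is done.
            future.cancel(true);
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a table retrieval that completes quickly.
        System.out.println(callWithTimeout(() -> "rows", 1000));
        // Stand-in for a retrieval that never returns.
        try {
            callWithTimeout(() -> {
                Thread.sleep(Long.MAX_VALUE);
                return "never";
            }, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

The timeout chosen here is independent of Target.getTimeout(), in line with the point above that the overall limit cannot be derived from the per-request timeout.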