Advice for periodic "lost" traps

Hi. First, I do not suspect an SNMP4J bug, but I was hoping someone might have some insight into a problem. A busy server is receiving about 1,500 traps/second. Sometimes, random traps are not processed by the software that uses SNMP4J. On one occasion, a “missing” trap was seen in a tcpdump capture on the same machine that runs the SNMP4J application. So we know the traps are being sent, and that they are arriving at the network interface. We also know the traps are “valid,” because the same traps are sometimes processed. And we track any packets rejected by SNMP4J, so we would know if traps were being lost to the occasional malformed packet. The application also has a long, good track record, so we don’t think it’s a coding bug in SNMP4J or in the application.

Instead, my suspicion/guess is that there is a packet buffer in the OS (Linux), or maybe in the interface card, that can hand the trap to tcpdump but can’t wait for the software running SNMP4J when there is any lag, e.g. during Java garbage collection. Is that a plausible explanation? Any advice for avoiding the data loss, or for detecting it more easily, would be appreciated.

One more clue: when extra RAM was provided for the Java process, the losses seemed to become less frequent. I’m assuming that could be due to a less frequent need for garbage collection. The JVM (GraalVM) is using its default garbage collector, which I believe is optimized for short garbage-collection pauses (versus deep/slow collections). So other than further offloading this server, I don’t know what to try.

Thanks,
Chris

Are you talking about UDP, TCP, DTLS, or TLS?
Which SNMP4J version are you using?
Which JDK version and platform (OS and version) are you using?

UDP. (Sorry, I should have noted that!)
SNMP4J v3.4.2
Linux version 3.10.0-1160.88.1.el7.x86_64
OpenJDK 64-Bit Server VM GraalVM CE 21.3.0 (build 17.0.1+12-jvmci-21.3-b05, mixed mode, sharing)
JDK args: -server -DpreferIPv6Addresses=0 -Dnetworkaddress.cache.negative.ttl=0

With UDP there is a buffer in the operating system. To reduce the packet loss, you therefore need to increase the operating system buffer, or make sure that Java processes the pending packets in the buffer fast enough.
The latter can be done using the MultiThreadedMessageDispatcher, but also with the standard MessageDispatcherImpl. When using the MessageDispatcherImpl, it is important to decouple the processMessage method from I/O waits or long-running computations.
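For illustration, here is a minimal sketch of the multi-threaded variant. The class name, pool size, and listen address are example values, and depending on your SNMP4J version you may need to register the message processing models yourself when supplying your own dispatcher (SNMPv3 would additionally need MPv3/USM setup):

import java.io.IOException;

import org.snmp4j.MessageDispatcherImpl;
import org.snmp4j.Snmp;
import org.snmp4j.mp.MPv1;
import org.snmp4j.mp.MPv2c;
import org.snmp4j.smi.UdpAddress;
import org.snmp4j.transport.DefaultUdpTransportMapping;
import org.snmp4j.util.MultiThreadedMessageDispatcher;
import org.snmp4j.util.ThreadPool;

public class MultiThreadedTrapReceiver {
    public static void main(String[] args) throws IOException {
        // Decode/dispatch incoming messages on a worker pool instead of the single listen thread.
        ThreadPool threadPool = ThreadPool.create("TrapDispatcher", 4); // pool size is an example value
        MultiThreadedMessageDispatcher dispatcher =
                new MultiThreadedMessageDispatcher(threadPool, new MessageDispatcherImpl());
        // Register the message processing models for the SNMP versions you expect.
        dispatcher.addMessageProcessingModel(new MPv1());
        dispatcher.addMessageProcessingModel(new MPv2c());

        DefaultUdpTransportMapping transport =
                new DefaultUdpTransportMapping(new UdpAddress("0.0.0.0/162"));
        Snmp snmp = new Snmp(dispatcher, transport);
        // snmp.addCommandResponder(...) goes here, before listen()
        snmp.listen();
    }
}

The worker pool only helps if processMessage (and your CommandResponder) returns quickly, so the decoupling mentioned above still applies.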

That is very useful, thank you! I will ensure that processMessage (specifically CommandResponder.processPdu(), I think) is further decoupled. That is probably the main bottleneck in this situation… I did not consider that. Thanks.

Regarding the operating system, it looks like the Linux socket option SO_RCVBUF is the important one. SNMP4J sets it in DefaultUdpTransportMapping.ListenThread.run() via Java’s DatagramSocket.setReceiveBufferSize(). By default SNMP4J sets it to 64 kB, the same as the maximum packet size. But the Javadoc for setReceiveBufferSize() hints that a larger SO_RCVBUF could help buffer more packets:

Increasing SO_RCVBUF may allow the network implementation to buffer multiple packets when packets arrive faster than are being received using receive(DatagramPacket).

If that’s true, then maybe a larger SO_RCVBUF can also help ensure that fewer packets are lost, e.g. during JVM garbage collection. It looks like that can be safely achieved via DefaultUdpTransportMapping.setReceiveBufferSize(). That method’s Javadoc says the maximum is getMaxInboundMessageSize(), which is just the 64 kB, but it doesn’t look like that limit is enforced. So I think it’s OK to set a larger value there. Any thoughts about that?

Thanks again,
Chris

Hi Chris,

Yes, increasing the UDP socket’s receive buffer size might help, provided that Java is currently using a lower value than the maximum allowed by the operating system. See UDP Receive Buffer Size · quic-go/quic-go Wiki · GitHub for hints/links on that topic.
SNMP4J’s DefaultUdpTransportMapping prints the currently set receive buffer size at DEBUG log level with the message “UDP receive buffer size for socket is set to: ”.

On a macOS Ventura 13.3.1 system, the default Java receive buffer size is 786896 bytes (which is a little more than 768 KB).
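A quick way to check the default on your own platform is to ask a plain, unconfigured socket, e.g. in jshell:

// Returns the platform/JVM default SO_RCVBUF for a new UDP socket...
jshell> new DatagramSocket().getReceiveBufferSize()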

Your own setting will only be applied if you call DefaultUdpTransportMapping.setReceiveBufferSize before its listen method. SNMP4J only ensures that the configured buffer size is at least the maximum inbound packet size, so that at least one packet fits in the buffer. Thus, the statement in your last posting (that getMaxInboundMessageSize() is the maximum allowed value) is not quite correct. The opposite is true: the receive buffer size must be at least the maximum inbound message size, but should of course be larger than that.
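For example, building on the sketch in my earlier reply (the 8 MiB request is just an example; Linux silently caps it at the operating system maximum, net.core.rmem_max):

// Request a larger SO_RCVBUF *before* listen(); the socket is created
// (and the buffer size applied) when listen() is called.
DefaultUdpTransportMapping transport =
        new DefaultUdpTransportMapping(new UdpAddress("0.0.0.0/162"));
transport.setReceiveBufferSize(8 * 1024 * 1024); // example value; capped by the OS maximum
Snmp snmp = new Snmp(transport);
// add your CommandResponder(s) here ...
snmp.listen();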

In most cases, the best effect can be achieved if you increase the operating system’s limits.
Here is another link with some useful info on that topic:

Best regards,
Frank

Hi Frank. Thanks for those details. They have helped a lot! The following info is mostly to document the final results for myself or others.

It is weird how undocumented these kernel settings are… I have found wildly different setting recommendations online, but nothing seems official. Well, I did find this official-looking page, but those UDP settings seem to affect the memory available to the whole UDP protocol on the machine, not individual sockets, so just FYI. Anyway, testing verified that if the Linux setting net.core.rmem_max is increased to allow larger UDP buffers, then calling DefaultUdpTransportMapping’s setReceiveBufferSize() (before calling listen()) will allocate that increased buffering.
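As a sanity check that is independent of SNMP4J, you can also ask a plain socket how much buffer the kernel actually granted. Note that Linux reports roughly double the granted value (for bookkeeping overhead) and silently clamps the request to net.core.rmem_max:

jshell> DatagramSocket probe = new DatagramSocket()
// Ask for 8 MB (arbitrary test value)...
jshell> probe.setReceiveBufferSize(8 * 1024 * 1024)
// Returns roughly 2 * net.core.rmem_max unless that limit was raised to cover the request...
jshell> probe.getReceiveBufferSize()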

In addition, the application added a queue to decouple the packet processing from SNMP4J’s work, as you suggested. A nice side benefit of that queue is that an overflow can now be detected in the application first, because the app’s queue will overflow before the Linux buffer even begins to fill, unless there is such a big storm of packets that they can’t be read fast enough by even just the SNMP4J thread. Previously the app was unaware that packets were lost, and only some Linux reports would reveal the situation. Also just FYI, the app was writing packet data to disk as part of its processing… The disk was too slow, and that was the main bottleneck. Now, in addition to the packet queue already mentioned, the disk I/O is queued too, so it can safely overflow (lose logs) without affecting other, more important work that completes quickly.
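For anyone finding this thread later, the hand-off looks roughly like the sketch below. This is a simplified illustration rather than the actual application code: the class name, queue capacity, and the raw (non-generic) CommandResponder usage are placeholders that may need adjusting for your SNMP4J version.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

import org.snmp4j.CommandResponder;
import org.snmp4j.CommandResponderEvent;
import org.snmp4j.PDU;

// Hands each PDU off to a bounded in-memory queue so the SNMP4J dispatcher thread
// never blocks on disk I/O or other slow work. A full queue is counted, so the
// overflow is visible to the application instead of packets silently piling up
// (and eventually being dropped) in the kernel's socket buffer.
public class QueueingTrapResponder implements CommandResponder {

    private final BlockingQueue<PDU> queue = new ArrayBlockingQueue<>(100_000); // example capacity
    private final AtomicLong overflowCount = new AtomicLong();

    @Override
    public void processPdu(CommandResponderEvent event) {
        PDU pdu = event.getPDU();
        if (pdu != null && !queue.offer((PDU) pdu.clone())) { // offer() never blocks; clone is a defensive copy
            overflowCount.incrementAndGet(); // drop detected here, in the application
        }
        event.setProcessed(true); // mark the event as handled
    }

    public BlockingQueue<PDU> getQueue() { return queue; }

    public long getOverflowCount() { return overflowCount.get(); }
}

Worker threads then take() from the queue and do the slow work (parsing, disk logging, etc.), and the disk writes themselves sit behind a second bounded queue in the same way.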

The Linux report command that was most useful was: ss -lump "sport = :162"
But the output is cryptic… The dropped packet count is in the number after a ‘d’ prefix. Here’s an example output. Note the d3419 at the end of the skmem (socket memory) data, indicating 3,419 lost packets.

$ ss -lump "sport = :162"
State                         Recv-Q                        Send-Q                                               Local Address:Port                                                     Peer Address:Port                        Process                        
UNCONN                        0                             0                                                                *:snmptrap                                                            *:*                         
         skmem:(r0,rb212992,t0,tb212992,f4096,w0,o0,bl0,d3419)

The other useful command was netstat -su. Its overflow field is named “receive buffer errors.” Strangely, it had a different starting number than the ss command, but it incremented perfectly during the testing, so it was reliable too. Note that the ifconfig command output has a field for “overruns” which is supposed to track lost packets. That always stayed at zero, even when I could deliberately increment the lost-packet counts shown by the other commands. Possibly a Linux bug? (Unfortunately I was using this report first, and wasted time thinking my test was broken instead of the report!)

FYI, for testing I just opened a UDP port for reading, but didn’t read from it. Then I sent fixed-size packets (a demo trap) to that port. After enough packets to fill the buffer, the overflow counts on the ss and netstat reports would increment. Next I would read a few packets (emptying the buffer a little), and then send that same number of packets successfully back into the buffer before the buffer would overflow again. Here are the individual JShell commands used:

// Open the UDP socket...
jshell> DatagramSocket s = new DatagramSocket(1162);
// Get a packet ready, as required for reading...
jshell> DatagramPacket packet = new DatagramPacket(new byte[4000],4000)
// Read a packet from the buffer...
jshell> s.receive(packet)
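And for completeness, the sending side of the test was essentially the following (the payload here is a placeholder; the real test sent a fixed-size demo trap):

jshell> DatagramSocket sender = new DatagramSocket()
// Send one fixed-size packet to the non-reading socket opened above...
jshell> sender.send(new DatagramPacket(new byte[1000], 1000, InetAddress.getLoopbackAddress(), 1162))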

Thanks again for great support. Hopefully this thread is useful to someone in the future.
Chris


Hi Chris,

Many thanks for your detailed report! I think it is very helpful for the many users dealing with high trap rates. The Linux kernel’s UDP buffer handling indeed looks rather ancient but, as you wrote, simplicity is fast, and its full capabilities are very rarely needed or used. In most cases, the consuming parts of an application are slower than expected, or slower than the trap rate.

Best regards,
Frank