Hi Frank. Thanks for those details. They have helped a lot! The following info is mostly to document the final results for myself or others.
It is weird how undocumented these kernel settings are… I found wildly different setting recommendations online, but nothing official. Well, I did find this official-looking page, but just FYI, those UDP settings seem to affect the memory available to the whole UDP protocol on the machine, not individual sockets. Anyway, testing verified that if the Linux setting net.core.rmem_max is increased to allow larger UDP buffers, then calling DefaultUdpTransportMapping’s setReceiveBufferSize() (before calling listen()) will allocate that increased buffering.
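For reference, here is a minimal sketch of that call order (a sketch only; the bind address and buffer size are placeholders, not the app’s real values):

import java.io.IOException;
import org.snmp4j.smi.UdpAddress;
import org.snmp4j.transport.DefaultUdpTransportMapping;

public class TrapListener {
    public static void main(String[] args) throws IOException {
        // Bind to the trap port ("0.0.0.0/162" is a placeholder address).
        DefaultUdpTransportMapping transport =
                new DefaultUdpTransportMapping(new UdpAddress("0.0.0.0/162"));

        // Request a bigger socket buffer BEFORE listen(). Linux silently
        // caps the request at net.core.rmem_max, so raise that first.
        transport.setReceiveBufferSize(8 * 1024 * 1024); // e.g. 8 MiB

        // The enlarged buffer is allocated when the socket opens here.
        transport.listen();
    }
}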
In addition, the application added a queue to decouple packet processing from SNMP4J’s work, as you suggested. A nice side benefit of that queue is that an overflow can now be detected in the application first, because the app’s queue will overflow before the Linux buffer even begins to fill (unless the storm of packets is so big that they can’t be read fast enough by even just the SNMP4J thread). Previously the app was unaware that packets were lost, and only some Linux reports would reveal the situation. Also just FYI, the app was writing packet data to disk as part of its processing… The disk was too slow, and that was the main bottleneck. Now, in addition to the packet queue already mentioned, the disk I/O is queued too, so it can safely overflow (lose logs) without affecting other, more important work that completes quickly.
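For illustration, here is a minimal sketch of that hand-off (all names and the queue capacity are made up for this example):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical pipeline: the SNMP4J listener thread only enqueues, and
// worker threads do the slow processing (including the disk I/O).
public class PacketPipeline {
    private final BlockingQueue<byte[]> packets = new ArrayBlockingQueue<>(100_000);
    private final AtomicLong droppedInApp = new AtomicLong();

    // Called from the SNMP4J listener thread. offer() never blocks, so a
    // full queue is detected immediately, inside the application.
    public void onPacket(byte[] payload) {
        if (!packets.offer(payload)) {
            droppedInApp.incrementAndGet();
        }
    }

    // Called by a worker thread; blocks until a packet is available.
    public byte[] nextPacket() throws InterruptedException {
        return packets.take();
    }

    public long dropped() {
        return droppedInApp.get();
    }
}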
The Linux report command that was most useful was:
ss -lump "sport = :162"
But the output is cryptic… The dropped packet count is the number after a ‘d’ prefix at the end of the skmem (socket memory) data. Here’s an example output; note the d3419, indicating 3,419 lost packets.
$ ss -lump "sport = :162"
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port   Process
UNCONN   0        0        *:snmptrap           *:*
         skmem:(…,d3419)
The other useful command was netstat -su. Its overflow field is named “receive buffer errors.” Strangely, it started from a different number than the ss command’s count, but it incremented in lockstep with it during testing, so it was reliable too. Note that the ifconfig command output has an “overruns” field which is supposed to track lost packets; it always stayed zero, even while I could deliberately drive up the lost-packet counts in the other commands. Possibly a Linux bug? (Unfortunately, I looked at that report first and wasted time thinking my test was broken instead of the report!)
FYI, for testing I just opened a UDP port for reading but didn’t read from it. Then I sent fixed-size packets (a demo trap) to that port. After enough packets to fill the buffer, the overflow counts in the ss and netstat reports would increment. Next I would read a few packets (emptying the buffer a little), and then I could send that same number of packets back into the buffer before it would overflow again. Here are the individual JShell commands used (a sketch of the send side follows them):
// Open the UDP socket...
jshell> DatagramSocket s = new DatagramSocket(1162);
// Get a packet ready, as required for reading...
jshell> DatagramPacket packet = new DatagramPacket(new byte[4000], 4000)
// Read a packet from the buffer...
jshell> s.receive(packet)
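For completeness, the send side of the test looked something like this (these commands weren’t in my notes, so the details are approximate; the packet size is a stand-in for the demo trap):

// Send one fixed-size packet into the unread socket’s buffer...
jshell> DatagramSocket sender = new DatagramSocket()
jshell> byte[] trap = new byte[100]
jshell> sender.send(new DatagramPacket(trap, trap.length, InetAddress.getLoopbackAddress(), 1162))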
Thanks again for the great support. Hopefully this thread is useful to someone in the future.