Segfault in snmp++

Hi, we are running on a Linux 4.14.139 kernel and sometimes get a segfault. The backtrace shows:

#0  0x00007f6611571724 in Snmp_pp::Vb::free_vb() () from /opt/Platform/ThirdParty/lib/libsnmp++.so.33
#1  0x00007f66117dee9c in Snmp_pp::Vb::~Vb() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#2  0x00007f6611566ff5 in Snmp_pp::Pdu::set_vb(Snmp_pp::Vb const&, int) () from /opt/Platform/ThirdParty/lib/libsnmp++.so.33
#3  0x00007f661180f267 in Agentpp::Request::finish(int, Agentpp::Vbx const&) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#4  0x00007f66117f72d6 in Agentpp::MibLeaf::get_request(Agentpp::Request*, int) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#5  0x00007f66118048e4 in Agentpp::Mib::process_request(Agentpp::Request*, int) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#6  0x00007f661130e1dc in Agentpp::SubAgentXMib::do_process_request(Agentpp::Request*) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#7  0x00007f661182976c in Agentpp::MibTask::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#8  0x00007f66118288e4 in Agentpp::TaskManager::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#9  0x00007f6611827c1e in Agentpp::thread_starter(void*) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#10 0x00007f660eee0bc7 in start_thread () from /usr/lib/libpthread.so.0
#11 0x00007f660ec16fbf in clone () from /usr/lib/libc.so.6

Any pointers would be appreciated!

Hi,

are you using the latest versions of snmp++, agent++ and agentX++? Does it happen with a high load on the agent with many concurrent requests? Can you reproduce the crash with logging and debugging code enabled?

Regards,
Jochen

Most likely, some instrumentation code does not correctly implement the required lock order to protect the request object and its VBs. See the forum topic "What is the lock order for Mib objects to protect them in a multi-threaded agent?" for more details.

We’re using snmp++-3.3.10, agent++-4.10 and agentx++-2.1.0.
Thanks, will enable logging - which log level would be best to use?
And may I also ask, how do I enable the “debugging code”?

Is locking required if the MIB is only being polled (i.e. no writes/deletes to any MIB entries)?

I wanted to suggest building with the compiler option “-g” (produce debugging information), as your stack trace did not contain file name and line number information.

Logging level: Just set every log type to level 15.

Locking: I would strongly recommend implementing the locking as required by agent++. Even if it works at the moment, Murphy will hit you at the worst possible point in time.

Thanks - will rebuild with -ggdb and increase logging level. I forgot to mention that we use AgenPro to generate C++ code from our MIB ASN files.

Whilst I’m here…on another project, we’re also seeing a deadlock when we poll our MIB(s) from 2 sources… The mutex that is being held belongs to this thread:

#0 0xb71fb6a1 in write () from /lib/libpthread.so.0
#1 0xb73638c0 in Agentpp::AgentX::send_agentx(int, Agentpp::AgentXPdu const&) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#2 0xb738ef4b in Agentpp::MasterAgentXMib::send_pending_ax(unsigned long) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#3 0xb738f1e8 in Agentpp::MasterAgentXMib::finalize(Agentpp::Request*) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#4 0xb75bd24d in Agentpp::Mib::do_process_request(Agentpp::Request*) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#5 0xb75dee7a in Agentpp::MibTask::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#6 0xb75de02d in Agentpp::TaskManager::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#7 0xb75dd487 in Agentpp::thread_starter(void*) () from /opt/Platform/ThirdParty/lib/libagent++.so.41

Please note that AgenPro does not generate any locking because that needs to be done in the instrumentation code.

The deadlock is probably an indicator that there is a memory problem elsewhere, i.e. locks were not unlocked because the corresponding memory became inaccessible.

Deadlocks with AgentX-protocol-based SET requests are avoided by the lock queues and by a thread pool that can handle the AgentX packet exchange in addition to the SNMP-based PDU processing.

Hope this helps.

Hi, thanks for the link - is it OK to use the following to get the MibTable (AgenPro-generated)?

MibTable * table = myTableEntry::instance;

or

MibTable * table = (MibTable*) mib->get("", myTableEntry::instance->getOid());

Yes, as long as there is only one instance of the MibTable in the process.
Otherwise, you need to look it up from the Mib as you suggested.

Thanks (again!). A few more questions, sorry!..

Currently we do implement the mib->lock() / table->start_synch() / table->end_synch() / mib->unlock() sequence whenever a table is cleared, or rows are added, removed or updated. We currently do not have any protection inside the AgenPro-generated get_request() methods.

Is it required to perform the mib->lock() / start_synch() / mib->unlock() if updating scalars via set_state() (AgenPro generated) ?

Is performing mib->lock() / mib->unlock() within the RequestList processing loop a guaranteed thread-safe approach?

In this loop, if reqList->receive(-1) is used, will this block forever, or can it be cancelled (e.g. when closing down the application) ?

The current loop looks like this - is there anything wrong here?..

      while (agent_running && (!dynamic_cast<SubAgentXMib *>(mib)->get_agentx()->quit()))
      {
         req = reqList->receive(1);
         if (req)
         {
            mib->process_request(req);
         }
         else
         {
            mib->cleanup();
         }
      }

In a subagent I’m seeing an error logged by the agent every 30s… I note that the (id) is incrementing each time:

20210923.08:27:51: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (79), (1), (0)
20210923.08:28:11: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (90), (1), (0)
20210923.08:28:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (137), (1), (0)
20210923.08:29:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (188), (1), (0)
20210923.08:29:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (239), (1), (0)
20210923.08:30:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (290), (1), (0)
20210923.08:30:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (341), (1), (0)
20210923.08:31:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (392), (1), (0)
20210923.08:31:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (443), (1), (0)
20210923.08:32:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (494), (1), (0)
20210923.08:32:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (545), (1), (0)
20210923.08:33:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (596), (1), (0)
20210923.08:33:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (647), (1), (0)
20210923.08:34:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (698), (1), (0)
20210923.08:34:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (749), (1), (0)
20210923.08:35:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (800), (1), (0)
20210923.08:35:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (851), (1), (0)
20210923.08:36:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (902), (1), (0)
20210923.08:36:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (953), (1), (0)
20210923.08:37:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1004), (1), (0)
20210923.08:37:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1055), (1), (0)
20210923.08:38:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1106), (1), (0)
20210923.08:38:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1157), (1), (0)
20210923.08:39:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1208), (1), (0)
20210923.08:39:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1259), (1), (0)
20210923.08:40:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1310), (1), (0)
20210923.08:40:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1361), (1), (0)
20210923.08:41:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1412), (1), (0)
20210923.08:41:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1463), (1), (0)
20210923.08:42:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1514), (1), (0)
20210923.08:42:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1565), (1), (0)
20210923.08:43:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1616), (1), (0)
20210923.08:43:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1667), (1), (0)
20210923.08:44:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1718), (1), (0)
20210923.08:44:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1769), (1), (0)

Looks like an error in the synchronisation. Some of the mutexes are unlocked more than once. This error should be fixed, but it is not critical in most cases.

Thanks again. I couldn’t find a case where mib->unlock() or table->end_synch() was being called without a matching lock() / start_synch() call.

:+1:
Excellent, fixing it avoids a possible race condition.

Sorry, what I meant to write was that I couldn’t find a case where the lock() / unlock() or start_synch() / end_synch() didn’t match up. So I’m unable to find the root cause of the error being logged.

Ok, then it could be some compiler switch concerning the mutex strategy.
If I remember correctly, without

#define NO_FAST_MUTEXES

this kind of unlock failure occurs with some threading implementations. Do you have pthread available on your system?

Hi, yes, our system has pthread. I’ll add some more instrumentation to be 100% sure our use of lock()/unlock() etc. is correct.