Segfault in snmp++

Hi, we are running on a Linux 4.14.139 kernel and sometimes get a segfault. The backtrace shows:

#0  0x00007f6611571724 in Snmp_pp::Vb::free_vb() () from /opt/Platform/ThirdParty/lib/libsnmp++.so.33
#1  0x00007f66117dee9c in Snmp_pp::Vb::~Vb() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#2  0x00007f6611566ff5 in Snmp_pp::Pdu::set_vb(Snmp_pp::Vb const&, int) () from /opt/Platform/ThirdParty/lib/libsnmp++.so.33
#3  0x00007f661180f267 in Agentpp::Request::finish(int, Agentpp::Vbx const&) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#4  0x00007f66117f72d6 in Agentpp::MibLeaf::get_request(Agentpp::Request*, int) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#5  0x00007f66118048e4 in Agentpp::Mib::process_request(Agentpp::Request*, int) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#6  0x00007f661130e1dc in Agentpp::SubAgentXMib::do_process_request(Agentpp::Request*) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#7  0x00007f661182976c in Agentpp::MibTask::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#8  0x00007f66118288e4 in Agentpp::TaskManager::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#9  0x00007f6611827c1e in Agentpp::thread_starter(void*) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#10 0x00007f660eee0bc7 in start_thread () from /usr/lib/libpthread.so.0
#11 0x00007f660ec16fbf in clone () from /usr/lib/libc.so.6

Any pointers would be appreciated!

Hi,

are you using the latest versions of snmp++, agent++ and agentX++? Does it happen with a high load on the agent with many concurrent requests? Can you reproduce the crash with logging and debugging code enabled?

Regards,
Jochen

Most likely, some instrumentation code does not correctly implement the required lock order to protect the request object and its VBs. See the forum topic "What is the lock order for Mib objects to protect them in a multi-threaded agent?" for more details.

We’re using snmp++-3.3.10, agent++-4.10 and agentx++-2.1.0.
Thanks, will enable logging - which log level would be best to use?
And may I also ask, how do I enable the “debugging code”?

Is locking required if the MIB is only being polled (i.e. no writes/deletes to any MIB entries)?

I wanted to suggest building with the compiler option “-g” (produce debugging information), as your stack trace did not contain file name and line number information.

Logging level: Just set every log type to level 15.

Locking: I would strongly recommend implementing the locking as required by agent++. Even if it works at the moment, Murphy will hit you at the worst possible point in time.

Thanks - will rebuild with -ggdb and increase logging level. I forgot to mention that we use AgenPro to generate C++ code from our MIB ASN files.

Whilst I’m here…on another project, we’re also seeing a deadlock when we poll our MIB(s) from 2 sources… The mutex that is being held belongs to this thread:

#0 0xb71fb6a1 in write () from /lib/libpthread.so.0
#1 0xb73638c0 in Agentpp::AgentX::send_agentx(int, Agentpp::AgentXPdu const&) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#2 0xb738ef4b in Agentpp::MasterAgentXMib::send_pending_ax(unsigned long) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#3 0xb738f1e8 in Agentpp::MasterAgentXMib::finalize(Agentpp::Request*) () from /opt/Platform/ThirdParty/lib/libagentx++.so.21
#4 0xb75bd24d in Agentpp::Mib::do_process_request(Agentpp::Request*) () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#5 0xb75dee7a in Agentpp::MibTask::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#6 0xb75de02d in Agentpp::TaskManager::run() () from /opt/Platform/ThirdParty/lib/libagent++.so.41
#7 0xb75dd487 in Agentpp::thread_starter(void*) () from /opt/Platform/ThirdParty/lib/libagent++.so.41

Please note that AgenPro does not generate any locking because that needs to be done in the instrumentation code.

The deadlock is probably an indicator that there is a memory problem elsewhere, i.e. locks were not unlocked because the corresponding memory became inaccessible.

Deadlocks with AgentX-protocol-based SET requests are avoided by the lock queues and by a thread pool that can handle the AgentX packet exchange in addition to the SNMP-based PDU processing.

Hope this helps.

Hi, thanks for the link - is it OK to use the following to get the MibTable (AgenPro-generated)?

MibTable * table = myTableEntry::instance;

or

MibTable * table = (MibTable*) mib->get("", myTableEntry::instance->getOid());

Yes, as long as there is only one instance of the MibTable in the process.
Otherwise, you need to look it up from the Mib as you suggested.

Thanks (again!). A few more questions, sorry!..

Currently we do implement the mib->lock() / table->start_synch() / table->end_synch() / mib->unlock() sequence whenever a table is cleared, or rows are added, removed or updated. We currently do not have any protection inside the AgenPro-generated get_request() methods.

Is it required to perform the mib->lock() / start_synch() / mib->unlock() if updating scalars via set_state() (AgenPro generated) ?

Is performing mib->lock() / mib->unlock() within the RequestList processing loop a guaranteed thread-safe approach?

In this loop, if reqList->receive(-1) is used, will this block forever, or can it be cancelled (e.g. when closing down the application) ?

The current loop looks like this - is there anything wrong here?..

      while (agent_running && (!dynamic_cast<SubAgentXMib *>(mib)->get_agentx()->quit()))
      {
         req = reqList->receive(1);
         if (req)
         {
            mib->process_request(req);
         }
         else
         {
            mib->cleanup();
         }
      }

In a subagent I’m seeing an error logged by the agent every 30s… I note that the (id) is incrementing each time:

20210923.08:27:51: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (79), (1), (0)
20210923.08:28:11: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (90), (1), (0)
20210923.08:28:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (137), (1), (0)
20210923.08:29:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (188), (1), (0)
20210923.08:29:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (239), (1), (0)
20210923.08:30:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (290), (1), (0)
20210923.08:30:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (341), (1), (0)
20210923.08:31:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (392), (1), (0)
20210923.08:31:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (443), (1), (0)
20210923.08:32:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (494), (1), (0)
20210923.08:32:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (545), (1), (0)
20210923.08:33:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (596), (1), (0)
20210923.08:33:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (647), (1), (0)
20210923.08:34:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (698), (1), (0)
20210923.08:34:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (749), (1), (0)
20210923.08:35:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (800), (1), (0)
20210923.08:35:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (851), (1), (0)
20210923.08:36:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (902), (1), (0)
20210923.08:36:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (953), (1), (0)
20210923.08:37:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1004), (1), (0)
20210923.08:37:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1055), (1), (0)
20210923.08:38:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1106), (1), (0)
20210923.08:38:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1157), (1), (0)
20210923.08:39:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1208), (1), (0)
20210923.08:39:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1259), (1), (0)
20210923.08:40:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1310), (1), (0)
20210923.08:40:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1361), (1), (0)
20210923.08:41:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1412), (1), (0)
20210923.08:41:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1463), (1), (0)
20210923.08:42:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1514), (1), (0)
20210923.08:42:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1565), (1), (0)
20210923.08:43:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1616), (1), (0)
20210923.08:43:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1667), (1), (0)
20210923.08:44:02: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1718), (1), (0)
20210923.08:44:32: -1291642048: (1)WARNING: Synchronized: unlock failed (id)(error)(wasLocked): (1769), (1), (0)

Looks like an error in the synchronisation. Some of the mutexes are unlocked more than once. This error should be fixed, but it is not critical in most cases.

Thanks again. I couldn’t find a case where mib->unlock() or table->end_synch() was being called without a matching lock() / start_synch() call.

:+1:
Excellent, fixing it avoids a possible race condition.

Sorry, what I meant to write was that I couldn’t find a case where the lock() / unlock() or start_synch() / end_synch() didn’t match up. So I’m unable to find the root cause of the error being logged.

Ok, then it could be some compiler switch concerning the mutex strategy.
If I remember correctly, without

#define NO_FAST_MUTEXES

this kind of unlock failure occurs with some threading implementations. Do you have pthread available on your system?

Hi, yes, our system has pthread. I’ll add some more instrumentation to be 100% sure our use of lock()/unlock() etc. is correct.