#3755 new defect

leak in realloc

Reported by: Jeffrey Hill Owned by:
Priority: normal Milestone:
Component: score Version: 4.11
Severity: normal Keywords: realloc leak
Cc: Blocked By:
Blocking:

Description

We are running Lua in an RTEMS system which, by default, uses realloc for memory allocation. We observe a memory leak on RTEMS unless we replace the default Lua allocator as shown below. The net result is that all calls to realloc are avoided; Lua memory allocation instead relies solely on malloc and free.

nios2-rtems4.11-gcc (GCC) 4.9.3 20150626 (RTEMS 4.11, RSB no-repo, Newlib 2.2.0.20150423)

configure --target=nios2-rtems4.11 --prefix=/ade/rtems/install/rtems-4-11 --disable-itron --disable-tests --enable-posix --enable-cxx --enable-rtemsbsp=altera-sys-config-S43X-TDAQ-dev --enable-networking

This is occurring in 4.11.1.99. We definitely need to sync up; however, a cursory search of the bug database indicated that this might need to be reported.

#if defined ( CAS_LUA_DONT_USE_REALLOC )
#pragma message("installing RTEMS workaround for realloc bugs")
void * RspThr :: LuaHandle :: m_alloc ( void * const pPriv, void * const pOld, 
                                    const size_t osize, const size_t nsize ) 
{
    void * pNew;
    if ( nsize > 0u ) {
        if ( pOld ) {
            if ( osize == nsize ) {
                pNew = pOld;
            }
            else {
                pNew = malloc ( nsize );
                if ( pNew ) {
                    memcpy ( pNew, pOld, std :: min ( osize, nsize ) );
                    free ( pOld );
                }
            }
        }
        else {
            pNew = malloc ( nsize );
        }
    }
    else {
        free ( pOld );
        pNew = NULL;
    }
    return pNew;
}
#else
void * RspThr :: LuaHandle :: m_alloc ( void * const pPriv, void * const pOld, 
                                const size_t osize, const size_t nsize ) 
{
    if ( nsize == 0 ) {
        free ( pOld );
        return NULL;
    }
    else {
        void * const pNew = realloc ( pOld, nsize );
        return pNew;
    }
}
#endif /* ifdef CAS_LUA_DONT_USE_REALLOC */

Change History (18)

comment:1 Changed on 06/05/19 at 16:27:21 by Jeffrey Hill

Component: admin → tool/newlib

comment:2 Changed on 06/06/19 at 07:04:24 by Sebastian Huber

Component: tool/newlib → score

realloc() is implemented in RTEMS (cpukit/libcsupport/src/realloc.c). Do you have a self-contained test case for this resource leak? Could it also be a memory fragmentation issue?

comment:3 Changed on 06/06/19 at 12:47:22 by Joel Sherrill

I am teaching an RTEMS class this week and, strangely, someone mentioned this problem. They have a low-memory target, so we suspect that compounds the issue. We had a few ideas. The first was to set a breakpoint at every exit path from realloc() that returns NULL and ensure Lua handles that correctly. Returning NULL means the memory was not resized but still exists at the same size. This could be handled incorrectly by Lua and result in a leak.

Another idea was to use the resource_snapshot or just the workspace command/method to check that all memory from Lua is actually being freed after each command invocation.

https://git.rtems.org/rtems/tree/cpukit/libcsupport/src/realloc.c

If turning on Lua's own memory pool management addresses the issue, then the cause is either increased fragmentation in RTEMS, or Lua may simply be holding on to previous allocations more efficiently and not losing memory.
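For the second idea, here is a minimal sketch of such a check, assuming the rtems_resource_snapshot API from <rtems/libcsupport.h> and a plain Lua state create/run/destroy cycle; the script text is only a placeholder:

#include <assert.h>
#include <rtems/libcsupport.h>

#include <lua.hpp> /* luaL_newstate(), luaL_openlibs(), luaL_dostring(), lua_close() */

/* Run one Lua context create/destroy cycle and verify that all heap,
 * workspace, and object resources are returned afterwards; a leak shows
 * up as a failed check. */
static void checkLuaCycleLeaksNothing ( void )
{
    rtems_resource_snapshot snapshot;
    rtems_resource_snapshot_take ( & snapshot );

    lua_State * const L = luaL_newstate ();
    assert ( L != NULL );
    luaL_openlibs ( L );
    ( void ) luaL_dostring ( L, "local t = {} for i = 1, 1000 do t[i] = i end" );
    lua_close ( L );

    assert ( rtems_resource_snapshot_check ( & snapshot ) );
}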

comment:4 Changed on 06/06/19 at 13:25:36 by Sebastian Huber

We had some trouble with the first fit allocator of RTEMS in applications which use a lot of malloc/free. We switched to TLSF and the fragmentation issues disappeared.

Since the CAS_LUA_DONT_USE_REALLOC workaround fixes the problem, there could be an error in _Heap_Resize_block().

comment:5 in reply to:  4 Changed on 06/07/19 at 02:16:01 by Chris Johns

Replying to Sebastian Huber:

We had some trouble with the first fit allocator of RTEMS in applications which use a lot of malloc/free. We switched to TLSF and the fragmentation issues disappeared.

What is involved in switching to TLSF? Can this code be merged into RTEMS so users can make a suitable selection?

comment:6 Changed on 06/07/19 at 04:59:44 by Sebastian Huber

I started to work on that in #3582, but I am now busy with other things. A proper integration is quite a bit of work. For our applications we just use an adapted

https://github.com/mattconte/tlsf/blob/master/tlsf.c

and link it to the application. No RTEMS patch is necessary.
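For reference, a minimal sketch of what such an application-level integration might look like, assuming the tlsf_create_with_pool()/tlsf_realloc()/tlsf_free() API from the tlsf.c linked above; the pool size and all names below are placeholders, not the adapted version mentioned here:

#include <stddef.h>

#include "tlsf.h" /* from mattconte/tlsf, compiled and linked with the application */

/* Hypothetical application-owned pool; the size is a placeholder. */
static char tlsfPoolMemory [ 8u * 1024u * 1024u ];
static tlsf_t tlsfPool;

static void tlsfPoolInit ( void )
{
    tlsfPool = tlsf_create_with_pool ( tlsfPoolMemory, sizeof ( tlsfPoolMemory ) );
}

/* Lua-style allocator (pPriv, pOld, osize, nsize) backed by the TLSF pool
 * instead of the RTEMS first-fit heap. */
static void * tlsfLuaAlloc ( void * const pPriv, void * const pOld,
                             const size_t osize, const size_t nsize )
{
    ( void ) pPriv;
    ( void ) osize;
    if ( nsize == 0u ) {
        tlsf_free ( tlsfPool, pOld ); /* no-op for a NULL pointer */
        return NULL;
    }
    /* tlsf_realloc() behaves like malloc() when pOld is NULL */
    return tlsf_realloc ( tlsfPool, pOld, nsize );
}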

comment:7 in reply to:  6 ; Changed on 06/07/19 at 05:22:50 by Chris Johns

Replying to Sebastian Huber:

I started to work on that in #3582, but I am now busy with other things. A proper integration is quite a bit of work. For our applications we just use an adapted

Is the adapted version publicly available?

https://github.com/mattconte/tlsf/blob/master/tlsf.c

and link it to the application. No RTEMS patch is necessary.

We should write this up somewhere.

comment:8 in reply to:  7 ; Changed on 06/07/19 at 11:50:23 by Sebastian Huber

Replying to Chris Johns:

Replying to Sebastian Huber:

I started to work on that in #3582, but I am now busy with other things. A proper integration is quite a bit of work. For our applications we just use an adapted

Is the adapted version publicly available?

No, I had no time to clean it up. One problem is that the file has no license header. This makes it a bit difficult to distribute separately.

https://github.com/mattconte/tlsf/blob/master/tlsf.c

and link it to the application. No RTEMS patch is necessary.

We should write this up somewhere.

It is not a high priority item for me.

comment:9 Changed on 06/08/19 at 00:52:11 by Jeffrey Hill

I should add that we see a linear 5 kB leak per Lua context create/destroy cycle. When a Lua context runs for an extended period there isn't a leak. Maybe it doesn't look like a fragmentation issue because of the linear 5 kB per cycle. Admittedly, our heapSpace command line diagnostic may not be showing information about fragmentation. We create a fresh Lua context when a client connects to the EPICS server, and destroy it when the client disconnects.

I did log all of the Lua allocations and frees by pointer value a few weeks back and verified that they were balanced using a map in a Python script. I don't recall now whether there were nil pointers in that mix. It's a good idea, and something to look for; I will have a look when spare time arrives :-).

Also, I suppose that a standalone test playing back the various allocation requests, shrinking and enlarging existing blocks to simulate the Lua calls to realloc, but driven only by input from my log files, might reproduce the issue standalone?
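A minimal sketch of such a replay harness, assuming a hypothetical log format of one event per line: "a <id> <size>" for an allocation, "r <id> <size>" for a resize, and "f <id>" for a free, where <id> is a stable identifier assigned by the logging code rather than the raw pointer value. It reports the same malloc_info() figures discussed later in this ticket (header location assumed):

#include <cstdio>
#include <cstdlib>
#include <map>

#include <rtems/malloc.h> /* assumed header for malloc_info() / Heap_Information_block */

/* Replay a recorded allocation trace against the RTEMS heap and report
 * the free space statistics afterwards. */
static void replayAllocationLog ( std::FILE * const pLog )
{
    std::map < unsigned long, void * > live;
    char op = 0;
    unsigned long id = 0;
    unsigned long size = 0;
    while ( std::fscanf ( pLog, " %c %lu %lu", & op, & id, & size ) >= 2 ) {
        if ( op == 'a' || op == 'r' ) {
            /* realloc ( NULL, size ) behaves like malloc for a fresh id */
            live [ id ] = std::realloc ( live [ id ], size );
        }
        else if ( op == 'f' ) {
            std::free ( live [ id ] );
            live.erase ( id );
        }
    }
    Heap_Information_block info;
    malloc_info ( & info );
    std::printf ( "free total %lu, largest free block %lu\n",
                  (unsigned long) info.Free.total,
                  (unsigned long) info.Free.largest );
}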

comment:10 in reply to:  8 Changed on 06/08/19 at 01:36:51 by Chris Johns

Replying to Sebastian Huber:

One problem is that the file has no license header. This makes it a bit difficult to distribute separately.

I have raised an issue on GitHub asking if the BSD license can be added to the source.

It is not a high priority item for me.

Sure. Posting the code as is somewhere would help and would not take long to do. :)

Others can clean it up and bring it into the kernel. There is clearly a need; the current state of the NFS client in libbsd, which uses the heap, is one example. I am sure there are others.

comment:11 in reply to:  9 Changed on 06/08/19 at 01:38:18 by Chris Johns

Replying to Jeffrey Hill:

Also, I suppose that a standalone test playing back the various allocation requests, shrinking and enlarging existing blocks to simulate the Lua calls to realloc, but driven only by input from my log files, might reproduce the issue standalone?

This is a great idea and it would make a good testsuite test if it exposes an issue.

comment:12 Changed on 06/08/19 at 08:40:02 by Sebastian Huber

I added a ticket for the alternative allocator: #3757.

comment:13 Changed on 07/31/19 at 00:10:57 by Jeffrey Hill

I noticed last week that the EPICS shell function for displaying "heapSpace" is implemented as follows. Furthermore, peeking in the score area, I can see that the protected heap methods perhaps do not update lifetime_allocated and lifetime_freed when realloc is called. So perhaps this issue was a false positive, due to weaknesses in our instrumentation. Whether the instrumentation issue is in EPICS or RTEMS is a matter for further discussion :-). We have isolated another leak which is now known to be within the EPICS save/restore facility, or more likely in the RTEMS NFS client. When I get that one knocked out I will try restoring the original Lua allocator, calling realloc, and determine whether this issue was a false positive due to instrumentation error.

    Heap_Information_block info;
    malloc_info( &info );
    x = info.Stats.size - (unsigned long)(info.Stats.lifetime_allocated - info.Stats.lifetime_freed);
Last edited on 07/31/19 at 00:19:51 by Jeffrey Hill

comment:14 Changed on 08/01/19 at 21:47:16 by Jeffrey Hill

After some additional testing today, it does appear that heavy use of realloc by the default Lua allocator results in somewhat greater heap fragmentation on RTEMS. I have not compared this with other operating systems, but the result is positive enough on RTEMS that I have left the alternative allocator, sans realloc, conditionally built into our production systems for RTEMS builds only.

Furthermore, it does not appear that there is a leak when the default, realloc-based Lua allocator is installed; we are basing our conclusions on results from a new "heapSpace" diagnostic based on the code below. We see only somewhat more fragmentation in that situation, judging by the largest free block.

     malloc_info( &info );
-    x = info.Stats.size - (unsigned long)(info.Stats.lifetime_allocated - info.Stats.lifetime_freed);
+    x = info.Free.total;

comment:15 Changed on 08/01/19 at 22:28:43 by Joel Sherrill

Am I reading this correctly that there isn't a leak problem but there is a change in fragmentation between the two Lua allocator choices? This doesn't surprise me.

If so, what's the difference between the two?

Also is there a bug to be fixed in the protected heap instrumentation code?

comment:16 Changed on 08/02/19 at 00:47:20 by Jeffrey Hill

If so, what's the difference between the two?

A very short experiment today, returning to the default Lua allocator based on realloc, resulted in a much smaller largest free block. I didn't try this experiment of returning to the default Lua allocator more than once. During this test the system's free memory appeared to drop to a plateau of about 95 MB free out of 125 MB of RAM with the default Lua allocator. We see about the same result with our replacement Lua allocator.

We continue to produce a robust system using a replacement Lua allocator that uses a 64-byte free list for small blocks and falls back to ordinary RTEMS malloc and free for the larger blocks. I have heard that other code bases use the Loki small-block allocator for this purpose, but we haven't gone there. Our largest free block is stable at 9797512 bytes out of 125 MB of RAM during testing today on the replacement Lua allocator, which isn't optimal, but the systems in question are solid. Our largest free block was stable at about 3 MB out of 125 MB of RAM using the default Lua allocator, which is definitely scary.
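For illustration only, a rough sketch of that kind of replacement allocator, assuming a single 64-byte block size, a simple singly linked free list, and no locking; all names are made up, and this is not the production code:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Small-block Lua allocator sketch: requests of 64 bytes or less are served
 * from a free list of fixed-size nodes that are never returned to the heap;
 * larger requests fall back to ordinary malloc/free. */
class SmallBlockLuaAlloc {
public:
    static void * alloc ( void * pPriv, void * pOld, size_t osize, size_t nsize );
private:
    enum { smallSize = 64 };
    union FreeNode { FreeNode * pNext; char storage [ smallSize ]; };
    static FreeNode * m_pFreeList;
    static void * allocBlock ( size_t size );
    static void freeBlock ( void * p, size_t size );
};

SmallBlockLuaAlloc :: FreeNode * SmallBlockLuaAlloc :: m_pFreeList = NULL;

void * SmallBlockLuaAlloc :: allocBlock ( const size_t size )
{
    if ( size <= smallSize ) {
        if ( m_pFreeList ) {
            FreeNode * const pNode = m_pFreeList;
            m_pFreeList = pNode->pNext;
            return pNode;
        }
        return malloc ( sizeof ( FreeNode ) );
    }
    return malloc ( size );
}

void SmallBlockLuaAlloc :: freeBlock ( void * const p, const size_t size )
{
    if ( size <= smallSize ) {
        FreeNode * const pNode = static_cast < FreeNode * > ( p );
        pNode->pNext = m_pFreeList;
        m_pFreeList = pNode;
    }
    else {
        free ( p );
    }
}

void * SmallBlockLuaAlloc :: alloc ( void * const, void * const pOld,
                                     const size_t osize, const size_t nsize )
{
    if ( nsize == 0u ) {
        if ( pOld ) {
            freeBlock ( pOld, osize );
        }
        return NULL;
    }
    if ( pOld && osize <= smallSize && nsize <= smallSize ) {
        return pOld; /* the underlying 64-byte node already holds nsize */
    }
    void * const pNew = allocBlock ( nsize );
    if ( pNew && pOld ) {
        memcpy ( pNew, pOld, osize < nsize ? osize : nsize );
        freeBlock ( pOld, osize );
    }
    return pNew;
}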

We isolated and fixed several leaks in the EPICS save/restore component, and this appears to have been the root problem. This was somewhat of a surprise because the EPICS save/restore component is used ubiquitously in our community. Memory leaks can be hard to find, especially in modal systems. It turns out that on this particular system a PID loop parameter was inadvertently included in the variable list to save, so the leak rate was dramatically increased, and we didn't initially see the issue on the host development platforms with Valgrind / Intel tools because of the modalities. I am still running long-term tests to see if any leaks remain.

Also is there a bug to be fixed in the protected heap instrumentation code?

Initially there was a strong signature of a leak when we were using the following code to answer the "how much memory is left" question, and I recall confirming that there was a behavior change before and after switching the allocator when I submitted this entry. I have not tried returning to this computation; instead we are now using simply "Free.total". I was peeking at score, and a very quick check indicated that the score version of realloc maybe isn't recalculating the lifetime_allocated and lifetime_freed statistics. Probably this happens only indirectly when the score version of realloc calls the score versions of malloc and free. But I didn't look at this for long and could be wrong. So ... I don't know for certain that there is an issue with lifetime_allocated and lifetime_freed when making heavy use of realloc, but I have some suspicions.

info.Stats.size - (unsigned long)(info.Stats.lifetime_allocated - info.Stats.lifetime_freed)

comment:17 Changed on 08/05/19 at 16:58:40 by Gedare Bloom

It looks to me like _Heap_Resize_block_checked() is not increasing lifetime_allocated and probably should be increasing it by new_size - old_size before returning success.

comment:18 Changed on 08/06/20 at 19:47:32 by Jeffrey Hill

I feel obligated to update this issue with information about an interesting coincidence.

The coincidence is that recently I have discovered, based on experimental evidence, that removing calls to GCC's de-mangler appears to eliminate some rare pool corruption incidents. Now, admittedly, this function has a very complicated API and the possibility of pilot error is high. However, I repeatedly re-coded the call to this function in multiple ways, following examples on the web, and the end result is always the same if I wait long enough: a corrupted pool. I suspect some sort of re-entrant function state bug in a multi-threaded context.

Below is my call to the GCC de-mangler. The pointer to the ThreadPrivate class below is obtained from a thread-private variable.

Ok, so here is the coincidence: I haven't looked inside the de-mangler, but one could make a pretty good guess that it might internally make some calls to realloc. I also looked at our production builds, and we currently have Lua configured to not base its memory allocator on realloc when it's an RTEMS build.

string ThreadPrivate :: demangle ( const char * const pMangledName )
{
    /*
     * on nios2 with gcc 4.8 the implementation of
     * __cxa_demangle appears to corrupt pool no
     * matter if it is called to reuse a thread
     * private storage or not
     */
#if __GNUC__ >= 6
    int status = -1000;
    char * const pBufResult =
        abi :: __cxa_demangle ( pMangledName, m_pBuf,
                                    & m_bufLen, & status );
    if ( pBufResult ) {
        m_pBuf = pBufResult;
        if ( status == 0 ) {
            return string ( pBufResult );
        }
    }
#endif
    return string ( pMangledName );
}