Tracing Leaks in Python: Find the Nearest Root

By Chad Austin


Garbage Collection Doesn’t Mean You Can Ignore Memory Altogether…

Garbage collection removes a great deal of burden from programming. In fact, garbage collection is a critical language feature for all languages where abstractions such as functional closures or coroutines are common, as they frequently create reference cycles.

IMVU is a mix of C++ and Python. The C++ code generally consists of small, cohesive objects with a clear ownership chain. An Avatar SceneObject owns a ModelInstance which owns a set of Meshes which own Materials which own Textures and so on… Since there are no cycles in this object graph, reference-counting with shared_ptr suffices.

The Python code, however, is full of messy object cycles. An asynchronous operation may hold a reference to a Room, while the Room may be holding a reference to the asynchronous operation. Often two related objects will be listening for events from the other. While Python’s garbage collector will happily take care of cycles, it’s still possible to leak objects.

Imagine these scenarios:

  • a leaked or living C++ object has a strong reference to a Python object.
  • a global cache has a reference to an instance’s bound method, which implicitly contains a reference to the instance.
  • two objects with __del__ methods participate in a cycle with each other, and Python refuses to decide which should destruct first

To detect these types of memory leaks, we use a LifeTimeMonitor utility:

a = SomeObject()
lm = LifeTimeMonitor(a)
del a
lm.assertDead() # succeeds

b = SomeObject()
lm = LifeTimeMonitor(b)
lm.assertDead() # raises ObjectNotDead

We use LifeTimeMonitor’s assertDead facility at key events, such as when a user closes a dialog box or 3D window. Take 3D windows as an example. Since they’re the root of an entire object subgraph, we would hate to inadvertently leak them. LifeTimeMonitor’s assertDead prevents us from introducing an object leak.

It’s good to know that an object leaked, but how can you determine why it can’t be collected?

Python’s Garbage Collection Algorithm

Let’s go over the basics of automatic garbage collection. In a garbage-collected system there are objects and objects can reference each other. Some objects are roots; that is, if an object is referenced by a root, it cannot be collected. Example roots are the stacks of live threads and the global module list. The graph formed by objects and their references is the object graph.

In SpiderMonkey, Mozilla’s JavaScript engine, the root set is explicitly-managed. SpiderMonkey’s GC traverses the object graph from the root set. If the GC does not reach an object, that object is destroyed. If C code creates a root object but fails to add it to the root set, it risks the GC deallocating the object while it’s still in use.

In Python however, the root set is implicit. All Python objects are ref-counted, and any that can refer to other objects — and potentially participate in an object cycle — are added to a global list upon construction. Each GC-tracked object can be queried for its referents. Python’s root set is implicit because anyone can create a root simply by incrementing an object’s refcount.

Since Python’s root set is implicit, its garbage collection algorithm differs slightly from SpiderMonkey’s. Python begins by setting GCRefs(o) to CurrentRefCount(o) for each GC-tracked PyObject o. Then it traverses all referents r of all GC-tracked PyObjects and subtracts 1 from GCRefs(r). Then, if GCRefs(o) is nonzero, o is an unknown reference, and thus a root. Python traverses the now-known root set and increments GCRefs(o) for any traversed objects. If any object o remains where GCRefs(o) == 0, that object is unreachable and thus collectible.

Finding a Path From the Nearest Root to the Leaked Object

Now that we know how Python’s garbage collector works, we can ask it for its set of roots by calculating GCRefs(o) for all objects o in gc.get_objects(). Then we perform a breadth-first-search from the root set to the leaked object. If the root set directly or indirectly refers to the leaked object, we return the path our search took.

Sounds simple, but there’s a catch! Imagine that the search function has signature:

PyObject* findPathToNearestRoot(PyObject* leakedObject);

leakedObject is a reference (incremented within Python’s function-call machinery itself) to the leaked object, making leakedObject a root!

To work around this, change findPathToNearestRoot so it accepts a singleton list containing a reference to the leaked object. findPathToNearestRoot can borrow that reference and clear the list, ensuring that leakedObject has no untracked references.

findPathToNearestRoot will find paths to expected Python roots like thread entry points and module objects. But, since it directly mirrors the behavior of Python’s GC, it will also find paths to leaked C references! Obviously, it can’t directly point you to the C code that leaked the reference, but the reference path should be enough of a clue to figure it out.

The Code

template<typename ArgType>
void traverse(PyObject* o, int (*visit)(PyObject* visitee, ArgType* arg), ArgType* arg) {
    if (Py_TYPE(o)->tp_traverse) {
        Py_TYPE(o)->tp_traverse(o, (visitproc)visit, arg);
    }
}

typedef std::map<PyObject*, int> GCRefs;

static int subtractKnownReferences(PyObject* visitee, GCRefs* gcrefs) {
    if (gcrefs->count(visitee)) {
        Assert(PyObject_IS_GC(visitee));
        --(*gcrefs)[visitee];
    }
    return 0;
}

typedef int Backlink; // -1 = none

typedef std::vector< std::pair<Backlink, PyObject*> > ReferenceList;
struct Referents {
    std::set<PyObject*>& seen;
    Backlink backlink;
    ReferenceList& referenceList;
};

static int addReferents(PyObject* visitee, Referents* referents) {
    if (!referents->seen.count(visitee) && PyObject_IS_GC(visitee)) {
        referents->referenceList.push_back(std::make_pair(referents->backlink, visitee));
    }
    return 0;
}

static Backlink findNextLevel(
    std::vector<PyObject*>& chain,
    const ReferenceList& roots,
    PyObject* goal,
    std::set<PyObject*>& seen
) {
    if (roots.empty()) {
        return -1;
    }

    for (size_t i = 0; i < roots.size(); ++i) {
        if (roots[i].first != -1) {
            if (goal == roots[i].second) {
                chain.push_back(goal);
                return roots[i].first;
            }
            seen.insert(roots[i].second);
        }
    }

    ReferenceList nextLevel;
    for (size_t i = 0; i < roots.size(); ++i) {
        Referents referents = {seen, i, nextLevel};
        traverse(roots[i].second, &addReferents, &referents);
    }

    Backlink backlink = findNextLevel(chain, nextLevel, goal, seen);
    if (backlink == -1) {
        return -1;
    }

    chain.push_back(roots[backlink].second);
    return roots[backlink].first;
}

static std::vector<PyObject*> findReferenceChain(
    const std::vector<PyObject*>& roots,
    PyObject* goal
) {
    std::set<PyObject*> seen;
    ReferenceList unknownReferrer;
    for (size_t i = 0; i < roots.size(); ++i) {
        unknownReferrer.push_back(std::make_pair<Backlink>(-1, roots[i]));
    }
    std::vector<PyObject*> rv;
    // going to return -1 no matter what: no backlink from roots
    findNextLevel(rv, unknownReferrer, goal, seen);
    return rv;
}

static object findPathToNearestRoot(const object& o) {
    if (!PyList_Check(o.ptr()) || PyList_GET_SIZE(o.ptr()) != 1) {
        PyErr_SetString(PyExc_TypeError, "findNearestRoot must take a list of length 1");
        throw_error_already_set();
    }

    // target = o.pop()
    object target(handle<>(borrowed(PyList_GET_ITEM(o.ptr(), 0))));
    if (-1 == PyList_SetSlice(o.ptr(), 0, 1, 0)) {
        throw_error_already_set();
    }

    object gc_module(handle<>(PyImport_ImportModule("gc")));
    object tracked_objects_list = gc_module.attr("get_objects")();
    // allocating the returned list may have run a GC, but tracked_objects won't be in the list

    std::vector<PyObject*> tracked_objects(len(tracked_objects_list));
    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        object to = tracked_objects_list[i];
        tracked_objects[i] = to.ptr();
    }
    tracked_objects_list = object();

    GCRefs gcrefs;

    // TODO: store allocation/gc count per generation

    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        gcrefs[tracked_objects[i]] = tracked_objects[i]->ob_refcnt;
    }

    for (size_t i = 0; i < tracked_objects.size(); ++i) {
        traverse(tracked_objects[i], subtractKnownReferences, &gcrefs);
    }

    // BFS time

    std::vector<PyObject*> roots;
    for (GCRefs::const_iterator i = gcrefs.begin(); i != gcrefs.end(); ++i) {
        if (i->second && i->first != target.ptr()) { // Don't count the target as a root.
            roots.push_back(i->first);
        }
    }
    std::vector<PyObject*> chain = findReferenceChain(roots, target.ptr());

    // TODO: assert that allocation/gc count per generation didn't change

    list rv;
    for (size_t i = 0; i < chain.size(); ++i) {
        rv.append(object(handle<>(borrowed(chain[i]))));
    }

    return rv;
}

How to Write an Interactive, 60 Hz Desktop Application

By Chad Austin

IMVU’s client application doesn’t fit neatly into a single development paradigm:

  • IMVU is a Windows desktop application. Mouse clicks, window resizes, and dialog boxes must all respond with imperceptible latency. Running IMVU should not significantly affect laptop battery life.
  • IMVU is an interactive 3D game. The 3D scene must be simulated and drawn at smooth, interactive frame rates, 60 Hz if possible.
  • IMVU is a networked application. Sending and receiving network packets must happen quickly and the UI should never have to wait for I/O.

Thus, let us clarify some specific requirements:

  • Minimal CPU usage (and thus battery consumption) when the application is minimized or obscured.
  • Minimal CPU usage in low-complexity scenes. Unlike most games, IMVU must never unnecessarily consume battery life while waiting in spin loops.
  • Animation must continue while modal dialog boxes and menus are visible. You don’t have control over these modal event loops, but it looks terrible if animation pauses while menus and dialogs are visible.
  • Animation must be accurate and precise. It looks much better if every frame takes 22 milliseconds (45 Hz) than if some frames take 30 milliseconds and some take 15 milliseconds (averaging 45 Hz).
  • Animation must degrade gracefully. In a really complex room with a dozen avatars, IMVU can easily spend all of a core’s CPU trying to animate the scene. In this case, the frame rate should gradually drop while the application remains responsive to mouse clicks and other input events.
  • Support for Windows XP, Vista, and 7.

Naive Approach #1

Windows applications typically have a main loop that looks something like:

MSG msg;
while (GetMessage(&msg, 0, 0, 0) > 0) {
    TranslateMessage(&msg);
    DispatchMessage(&msg);
}

What went wrong

Using SetTimer/WM_TIMER sounds like a good idea for simulation and painting, but it’s way too imprecise for interactive applications.

Naive Approach #2

Games typically have a main loop that looks something like the following:

while (running) {
    // process input events
    MSG msg;
    while (PeekMessage(&msg, 0, 0, 0, PM_REMOVE)) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }

    if (frame_interval_has_elapsed) {
        simulate_world();
        paint();
    }
}

What went wrong

The above loop never sleeps, draining the user’s battery and burning her legs.

Clever Approach #1: Standard Event Loop + timeSetEvent

void runMainLoop() {
    MSG msg;
    while (GetMessage(&msg, 0, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
}

void customWindowProc(...) {
    if (message == timerMessage) {
        simulate();
        // schedules paint with InvalidateRect
    }
}

void CALLBACK TimerProc(UINT, UINT, DWORD, DWORD, DWORD) {
    if (0 == InterlockedExchange(&inFlight, 1)) {
        PostMessage(frameTimerWindow, timerMessage, 0, 0);
    }
}

void startFrameTimer() {
    RegisterClass(customWindowProc, ...);
    frameTimerWindow = CreateWindow(...);
    timeSetEvent(FRAME_INTERVAL, 0, &TimerProc, 0, TIME_PERIODIC);
}

What went wrong

The main loop’s GetMessage call always returns messages in a priority order. Slightly oversimplified, posted messages come first, then WM_PAINT messages, then WM_TIMER. Since timerMessage is a normal message, it will preempt any scheduled paints. This would be fine for us, since simulations are cheap, but the dealbreaker is that if we fail to maintain frame rate, WM_TIMER messages are entirely starved. This violates our graceful degradation requirement. When frame rate begins to degrade, code dependent on WM_TIMER shouldn’t stop entirely.

Even worse, the modal dialog loop has a freaky historic detail. It waits for the message queue to be empty before displaying modal dialogs. When painting can’t keep up, modal dialogs simply don’t appear.

We tried a bunch of variations, setting flags when stepping or painting, but they all had critical flaws. Some continued to starve timers and dialog boxes and some degraded by ping-ponging between 30 Hz and 15 Hz, which looked terrible.

Clever Approach #2: PostThreadMessage + WM_ENTERIDLE

A standard message loop didn’t seem to be getting us anywhere, so we changed our timeSetEvent callback to PostThreadMessage a custom message to the main loop, who knew how to handle it. Messages sent via PostThreadMessage don’t go to a window, so the event loop needs to process them directly. Since DialogBox and TrackPopupMenu modal loops won’t understand this custom message, we will fall back on a different mechanism.

DialogBox and TrackPopupMenu send WM_ENTERIDLE to their owning windows. Any window in IMVU that can host a dialog box or popup menu handles WM_ENTERIDLE by notifying a global idle handler, which can decide to schedule a new frame immediately or in N milliseconds, depending on how much time has elapsed.

What Went Wrong

So close! In our testing under realistic workloads, timeSetEvent had horrible pauses and jitter. Sometimes the multimedia thread would go 250 ms between notifications. Otherwise, the custom event loop + WM_ENTERIDLE approach seemed sound. I tried timeSetEvent with several flags, but they all had accuracy and precision problems.

What Finally Worked

Finally, we settled on MsgWaitForMultipleObjects with a calculated timeout.

Assuming the existence of a FrameTimeoutCalculator object which returns the number of milliseconds until the next frame:

int runApp() {
    FrameTimeoutCalculator ftc;

    for (;;) {
        const DWORD timeout = ftc.getTimeout();
        DWORD result = (timeout
            ? MsgWaitForMultipleObjects(0, 0, TRUE, timeout, QS_ALLEVENTS)
            : WAIT_TIMEOUT);
        if (result == WAIT_TIMEOUT) {
            simulate();
            ftc.step();
        }

        MSG msg;
        while (PeekMessage(&msg, 0, 0, 0, PM_REMOVE)) {
            if (msg.message == WM_QUIT) {
                return msg.wParam;
            }

            TranslateMessage(&msg);
            DispatchMessage(msg);
        }
    }
}

Well, what about modal dialogs?

Since we rely on a custom message loop to animate 3D scenes, how do we handle standard message loops such as the modal DialogBox and TrackPopupMenu calls? Fortunately, DialogBox and TrackPopupMenu provide us with the hook required to implement frame updates: WM_ENTERIDLE.

When the standard DialogBox and TrackPopupMenu modal message loops go idle, they send their parent window a WM_ENTERIDLE message. Upon receiving WM_ENTERIDLE, the parent window determines whether it’s time to render a new frame. If so, we animate all visible 3D windows, which will trigger a WM_PAINT, which triggers a subsequent WM_ENTERIDLE.

On the other hand, if it’s not time to render a new frame, we call timeSetEvent with TIME_ONESHOT to schedule a frame update in the future.

As we saw previously, timeSetEvent isn’t as reliable as a custom loop using MsgWaitForMultipleObjectsEx, but if a modal dialog or popup menu is visible, the user probably isn’t paying very close attention anyway. All that matters is that the UI remains responsive and animation continues while modal loops are open. Code follows:

LRESULT CALLBACK ModalFrameSchedulerWndProc(HWND hwnd, UINT message, WPARAM wparam, LPARAM lparam) {
    if (message == idleMessage) {
        stepFrame();
    }
    return DefWindowProc(hwnd, message, wparam, lparam);
}

struct AlmostMSG {
    HWND hwnd;
    UINT message;
    WPARAM wparam;
    LPARAM lparam;
};

void CALLBACK timeForPost(UINT, UINT, DWORD_PTR user_data, DWORD_PTR, DWORD_PTR) {
    AlmostMSG* msg = reinterpret_cast<AlmostMSG*>(user_data);
    PostMessage(msg->hwnd, msg->message, msg->wparam, msg->lparam);
    delete msg;
}

void PostMessageIn(DWORD timeout, HWND hwnd, UINT message, WPARAM wparam, LPARAM lparam) {
    if (timeout) {
        AlmostMSG* msg = new AlmostMSG;
        msg->hwnd = hwnd;
        msg->message = message;
        msg->wparam = wparam;
        msg->lparam = lparam;
        timeSetEvent(timeout, 1, timeForPost, reinterpret_cast<DWORD_PTR>(msg), TIME_ONESHOT | TIME_CALLBACK_FUNCTION);
    } else {
        PostMessage(hwnd, message, wparam, lparam);
    }
}

class ModalFrameScheduler : public IFrameListener {
public:
    ModalFrameScheduler() { stepping = false; }

    // Call when WM_ENTERIDLE is received.
    void onIdle() {
        if (!frameListenerWindow) {
            idleMessage = RegisterWindowMessageW(L"IMVU_ScheduleFrame");
            Assert(idleMessage);

            WNDCLASS wc;
            ZeroMemory(&wc, sizeof(wc));
            wc.hInstance = GetModuleHandle(0);
            wc.lpfnWndProc = ModalFrameSchedulerWndProc;
            wc.lpszClassName = L"IMVUModalFrameScheduler";
            RegisterClass(&wc);

            frameListenerWindow = CreateWindowW(
                L"IMVUModalFrameScheduler",
                L"IMVUModalFrameScheduler",
                0, 0, 0, 0, 0, 0, 0,
                GetModuleHandle(0), 0);
            Assert(frameListenerWindow);
        }

        if (!stepping) {
            const unsigned timeout = ftc.getTimeout();
            stepping = true;
            PostMessageIn(timeout, frameListenerWindow, idleMessage, 0, 0);
            ftc.step();
        }
    }
    void step() { stepping = false; }

private:
    bool stepping;
    FrameTimeoutCalculator ftc;
};

How has it worked out?

A custom message loop and WM_ENTERIDLE neatly solves all of the goals we laid out:

  • No unnecessary polling, and thus increased battery life and performace.
  • When possible, the 3D windows animate at 60 Hz.
  • Even degradation. If painting a frame takes 40 ms, the frame rate will drop from 60 Hz to 25 Hz, not from 60 Hz to 15 Hz, as some of the implementations did.
  • Animation continue to play, even while modal dialogs and popup menus are visible.
  • This code runs well on XP, Vista, and Windows 7.