At Quantlane, most of our applications are long-running and process loads of data every second. These types of applications are especially prone to memory problems, which can lead to serious performance issues or even system failures. Knowing how to debug and optimize the memory usage of your applications is therefore very useful.
Python memory management
First, a quick look at how memory management and garbage collection in Python actually work. Python handles memory automatically, which greatly speeds up development since we don’t have to deal with low-level memory management ourselves. Python manages memory in two ways:
1. Reference counting - one of the common algorithms for deallocating unused objects. In short, the interpreter keeps track of the number of references to every object at runtime. Whenever an object gains a new reference the count is increased, and when a reference is dropped it is decreased. Once the count hits zero (meaning the object is no longer used by your program), the object is deallocated automatically. This comes with some downsides, one of them being the overhead of the interpreter having to update the counts in a thread-safe manner (see the sketch after this list).
2. Generational garbage collection - addresses issues such as cyclical references, where reference counting falls short. In a nutshell, objects are divided into three generations. Every object starts its life in the first generation and is moved to the next generation whenever it survives a collection of its current generation. The collector also provides some additional functionality, like changing the collection frequency (thresholds) of the individual generations (e.g. checking short-lived objects more often than long-lived ones), or even disabling collection completely.
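Both mechanisms can be observed from plain Python. Here is a minimal sketch using only the standard sys and gc modules (the payload and alias names are made up for illustration, and the exact numbers printed will vary between interpreter versions):

import gc
import sys

# Reference counting: sys.getrefcount reports one extra reference,
# because passing ``payload`` to the function creates a temporary one.
payload = {'symbol': 'ABC'}
print(sys.getrefcount(payload))  # e.g. 2
alias = payload                  # new reference -> count goes up
print(sys.getrefcount(payload))  # e.g. 3
del alias                        # reference dropped -> count goes down
print(sys.getrefcount(payload))  # e.g. 2 again

# Generational garbage collection: three generations with tunable thresholds.
print(gc.get_threshold())        # (700, 10, 10) by default
gc.set_threshold(1000, 15, 15)   # collect the youngest generation less often
gc.collect()                     # force a full collection of all generations
# gc.disable()                   # or opt out of cyclic collection entirely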
Even with all of this happening behind the scenes, in more complex programs you might want to debug and profile memory yourself. This can be useful for optimization purposes - writing more efficient code - or, worse, when dealing with memory leaks (unused objects that are never freed) introduced by design bugs. There are many great Python memory profiling libraries that are right for the job. Some examples:
guppy3 - a great tool for debugging and profiling memory, especially when hunting memory leaks
objgraph - similar to guppy3, but it also provides a visual representation of Python object graphs
memory_profiler - monitors the memory consumption of a process and also offers line-by-line memory analysis of your code, similar to line_profiler for CPU profiling (see the small example below)
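To illustrate the last one: memory_profiler's line-by-line mode only needs a decorator on the function you want to inspect. A minimal sketch (process_batch is a made-up function, not part of any of the libraries above):

from memory_profiler import profile

@profile
def process_batch() -> int:
	# Made-up workload: allocate a large list, then drop it.
	numbers = [x * x for x in range(10 ** 6)]
	total = sum(numbers)
	del numbers
	return total

if __name__ == '__main__':
	process_batch()

Running the script prints the memory usage and increment of every line inside process_batch.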
Let's profile with the guppy3 library
We found that guppy3 suits us best when it comes to profiling - despite its lack of documentation. I’ll showcase some of the basic functionality guppy3 provides on an example profile of one of our long-running applications responsible for real-time data processing. Because it is an asynchronous application (written with asyncio), we decided to use the aiomanhole library, which is already set up in most of our applications and gives us access to their internal state. This way we are able to profile without modifying the application code. Both aiomanhole and guppy3 can be installed simply with:
$ pip install guppy3 aiomanhole
Adding a manhole to your application is a matter of adding this snippet to your code (with the namespace adjusted to your app):
import aiomanhole

namespace = {
	'gizmo': application_state_gizmo,
	'whatsit': application_state_whatsit,
}
banner = '\n\t'.join(name for name in namespace)
aiomanhole.start_manhole(
	host = '127.0.0.1',
	port = 9999,
	path = '/var/tmp/aiomanhole.sock',
	namespace = namespace,
	banner = f'Namespace: \n\t{banner}\n\n',
)
Now we can connect to the aiomanhole port (which is forwarded to localhost). We use nc to connect and rlwrap for a shell-like feel:
$ rlwrap nc localhost 9999
This is all the setup we need and now we’re ready for profiling:
>>> from guppy import hpy
>>> heapy_session = hpy()
>>> heapy_session.setref()
First, we created a heapy session context with the hpy call. Next, with the setref method, we set a reference point for heap usage, meaning the heap method will only show us objects allocated after the setref call. Now let's actually inspect the objects that are reachable in the heap:
>>> heap_snapshot = heapy_session.heap()
>>> heap_snapshot
Partition of a set of 775867 objects. Total size = 68490112 bytes.
 Index   Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0   99270  13 10324080  15  10324080  15 decimal.Decimal
     1   98703  13 10265112  15  20589192  30 dict of qutils.time.nanotime.NanoTime
     2   22704   3  9718960  14  30308152  44 types.FrameType
     3  142838  18  8142643  12  38450795  56 str
     4   43356   6  5202720   8  43653515  64 the_one.messages.market.TradeMessage
     5   98703  13  4737744   7  48391259  71 qutils.time.nanotime.NanoTime
     6   34221   4  4106520   6  52497779  77 types.CoroutineType
     7  106589  14  3774288   6  56272067  83 int
     8   22866   3  3658560   5  59930627  88 _asyncio.Task
     9   22896   3  1648512   2  61579139  90 builtins.weakref
<90 more rows. Type e.g. '_.more' to view.>
In heap_snapshot we can see the allocated objects grouped by an equivalence relation. The default relation used by the heap method is clodo, which means grouping by class, or by __dict__ of a class instance ("class or dict owner"). This quick inspection has already uncovered some issues:
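Before digging in, note that the same snapshot can be sliced and regrouped without taking it again: indexing a partition row gives us the corresponding sub-set of objects, and attributes like the ones below switch the equivalence relation (we use bysize and byrcs later in this post):

>>> heap_snapshot[0]      # sub-set containing only the decimal.Decimal row
>>> heap_snapshot.bysize  # regroup by individual object size
>>> heap_snapshot.byrcs   # regroup by the kinds of objects referring to them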
Issue #1
A big chunk (15%) of the allocated memory is taken up by the __dict__ attributes of instances of our internal NanoTime class. We can take a closer look at their individual sizes by grouping by the size equivalence relation:
>>> heap_snapshot[1].bysize  # ``__dict__ of NanoTime`` grouped by size
Partition of a set of 98703 objects. Total size = 10265112 bytes.
 Index  Count   %     Size   % Cumulative   % Individual Size
     0  98703 100 10265112 100   10265112 100 104
>>> heap_snapshot[5].bysize  # ``NanoTime`` grouped by size
Partition of a set of 98703 objects. Total size = 4737744 bytes.
 Index  Count   %     Size   % Cumulative   % Individual Size
     0  98703 100  4737744 100    4737744 100 48
Fortunately, this can easily be reduced by adding a __slots__ attribute to our NanoTime class:
class NanoTime:
	__slots__ = ('nanoseconds',)
That way a static amount of memory is allocated for storing object attributes. But remember, this comes at the price of not being able to add new attributes dynamically (see the short sketch at the end of this section). In a new, unrelated session we can check how much our improved version of a NanoTime instance allocates:
>>> from guppy import hpy
>>> heapy_session = hpy()
>>> heapy_session.setref()
>>> nanotime_instance = NanoTime.today()
>>> heapy_session.heap()
Partition of a set of 8 objects. Total size = 1201 bytes.
 Index  Count   %  Size   % Cumulative   % Kind (class / dict of class)
     0      2  25   800  67        800  67 types.FrameType
     1      1  12   176  15        976  81 types.CodeType
     2      2  25   104   9       1080  90 tuple
     3      1  12    45   4       1125  94 bytes
     4      1  12    40   3       1165  97 qutils.time.nanotime.NanoTime
     5      1  12    36   3       1201 100 int
No more __dict__ of NanoTime is allocated, which saves 104 bytes per NanoTime instance, and the size of the NanoTime instance itself has dropped from 48 bytes to 40 bytes.
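As a side note, the trade-off mentioned above is easy to demonstrate: a slotted instance has no __dict__, so attributes outside __slots__ cannot be attached at runtime. A small sketch with two made-up classes:

class Plain:
	def __init__(self, nanoseconds: int) -> None:
		self.nanoseconds = nanoseconds

class Slotted:
	__slots__ = ('nanoseconds',)

	def __init__(self, nanoseconds: int) -> None:
		self.nanoseconds = nanoseconds

plain = Plain(1)
plain.note = 'extra attributes end up in plain.__dict__'  # works fine

slotted = Slotted(1)
print(hasattr(slotted, '__dict__'))  # False - no per-instance dict
try:
	slotted.note = 'this will fail'
except AttributeError:
	print('cannot add attributes outside __slots__')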
Issue #2
The second problem is specific to our application and domain: we noticed that the number of TradeMessage instances was unusually high (and only rising over time), even though they are only supposed to be sent to our messaging system and should therefore have a rather short lifespan. A memory leak was the first thing that came to mind. To find out where these instances are coming from, we can use the byrcs equivalence relation, which shows us the referrers of these objects.
>>> trades = heap_snapshot[4]
>>> trades.byrcs
Partition of a set of 43356 objects. Total size = 5202720 bytes.
 Index  Count   %    Size   % Cumulative   % Referrers by Kind (class / dict of class)
     0  43329 100 5199480 100   87328920 100 tuple
     1     26   0    3120   0   87380400 100 <Nothing>
     2      1   0     120   0   87380520 100 tuple, types.FrameType
Almost 100% are referred to by a tuple. This doesn't tell us much, so let's dig deeper by displaying the shortest path to an individual TradeMessage object:
>>> trades.byid[1].sp
 0: hpy().Root.i0_t140358710572864_async_gen_finalizer.__self__.__dict__['_signal_handlers'][<Signals.SIGHUP: 1>]._callback.__self__.__dict__['_hub'].__dict__['_unreads'][('DataSource'...B')][0].__self__.__dict__['_queue'].??[1].messages[0]
Now, this looks a little crazy, but it actually showed us that the instances were stuck in our _queue, waiting to be sent to our messaging system. Even though this was not a memory leak as we first suspected, it helped us focus on the actual issue, which turned out to be an incorrect setup of our messaging system.
Hopefully this example showcased how useful guppy3 can be, even though we've only scratched the surface of its abilities :)