At Quantlane, most of our applications are long-running and process loads of data every second. These types of applications are especially prone to memory problems, which can lead to serious performance issues or even system failures. Knowing how to debug and optimize the memory usage of your applications is therefore very useful.
Python memory management
First, a quick look at how memory management and garbage collection in Python actually work. Python handles memory automatically, which greatly speeds up development since we don’t have to deal with low-level memory management ourselves. Python manages memory in two ways:
1. Reference counting - one of the common algorithms for deallocating unused objects. In short, the interpreter keeps track of the number of references to every object at runtime. Whenever an object gains a new reference the count is increased, and when a reference is dropped it is decreased. Once the count hits zero (meaning the object is no longer used by your program), the object is deallocated automatically. This comes with some downsides, one of them being the overhead of the interpreter having to update the counts in a thread-safe manner (see the sketch after this list).
2. Generational garbage collection - addresses issues such as cyclical references, where reference counting falls short. In a nutshell, objects are divided into three generations. Every object starts its life in the first generation and is moved to the next generation whenever it survives a collection of its current generation. The collector also provides some additional functionality, like changing the collection frequency (thresholds) of the individual generations (e.g. checking short-lived objects more often than long-lived ones), or even disabling collection completely.
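Both mechanisms can be observed from plain Python. Here is a minimal sketch using only the standard sys and gc modules (the payload and alias names are made up for illustration, and the exact numbers printed will vary between interpreter versions):

import gc
import sys

# Reference counting: sys.getrefcount reports one extra reference,
# because passing ``payload`` to the function creates a temporary one.
payload = {'symbol': 'ABC'}
print(sys.getrefcount(payload))  # e.g. 2
alias = payload                  # new reference -> count goes up
print(sys.getrefcount(payload))  # e.g. 3
del alias                        # reference dropped -> count goes down
print(sys.getrefcount(payload))  # e.g. 2 again

# Generational garbage collection: three generations with tunable thresholds.
print(gc.get_threshold())        # (700, 10, 10) by default
gc.set_threshold(1000, 15, 15)   # collect the youngest generation less often
gc.collect()                     # force a full collection of all generations
# gc.disable()                   # or opt out of cyclic collection entirely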
Even with all of this happening behind the scenes, in more complex programs you might want to debug and profile memory yourself. This can be useful for optimization purposes - writing more efficient code - or, worse, when dealing with memory leaks (unused objects that are never freed) introduced by design bugs. There are many great Python memory profiling libraries that are right for the job. Some examples:
guppy3 - a great tool for debugging and profiling memory, especially when hunting memory leaks
objgraph - similar to guppy3, but it also provides a visual representation of Python object graphs
memory_profiler - monitors the memory consumption of a process and also offers line-by-line memory analysis of your code, similar to line_profiler for CPU profiling (see the small example below)
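To illustrate the last one: memory_profiler's line-by-line mode only needs a decorator on the function you want to inspect. A minimal sketch (process_batch is a made-up function, not part of any of the libraries above):

from memory_profiler import profile

@profile
def process_batch() -> int:
	# Made-up workload: allocate a large list, then drop it.
	numbers = [x * x for x in range(10 ** 6)]
	total = sum(numbers)
	del numbers
	return total

if __name__ == '__main__':
	process_batch()

Running the script prints the memory usage and increment of every line inside process_batch.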
Let's profile with the guppy3 library
We found that guppy3 suits us best when it comes to profiling - despite its lack of documentation. I’ll showcase some of the basic functionality guppy3 provides on an example profile of one of our long-running applications responsible for real-time data processing. Because it is an asynchronous application (written with asyncio), we decided to use the aiomanhole library, which is already set up in most of our applications and gives us access to their internal state. This way we are able to profile without modifying the application code. Both aiomanhole and guppy3 can be installed simply with:
$ pip install guppy3 aiomanhole
Adding a manhole to your application is a matter of adding this snippet to your code (with the namespace adjusted to your app):
import aiomanhole

namespace = {
	'gizmo': application_state_gizmo,
	'whatsit': application_state_whatsit,
}
banner = '\n\t'.join(name for name in namespace)
aiomanhole.start_manhole(
	host = '127.0.0.1',
	port = 9999,
	path = '/var/tmp/aiomanhole.sock',
	namespace = namespace,
	banner = f'Namespace: \n\t{banner}\n\n',
)
Now we can connect to the aiomanhole port (which is forwarded to localhost). We use nc to connect and rlwrap for a shell-like feel:
$ rlwrap nc localhost 9999
This is all the setup we need and now we’re ready for profiling:
>>> from guppy import hpy
>>> heapy_session = hpy()
>>> heapy_session.setref()
First, we created a heapy session context with the hpy call. Next, with the setref method, we set a reference point for heap usage, meaning the heap method will only show us objects allocated after the setref call. Now let's actually inspect the objects that are reachable in the heap:
>>> heap_snapshot = heapy_session.heap()
>>> heap_snapshot
Partition of a set of 775867 objects. Total size = 68490112 bytes.
 Index   Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0   99270  13 10324080  15  10324080  15 decimal.Decimal
     1   98703  13 10265112  15  20589192  30 dict of qutils.time.nanotime.NanoTime
     2   22704   3  9718960  14  30308152  44 types.FrameType
     3  142838  18  8142643  12  38450795  56 str
     4   43356   6  5202720   8  43653515  64 the_one.messages.market.TradeMessage
     5   98703  13  4737744   7  48391259  71 qutils.time.nanotime.NanoTime
     6   34221   4  4106520   6  52497779  77 types.CoroutineType
     7  106589  14  3774288   6  56272067  83 int
     8   22866   3  3658560   5  59930627  88 _asyncio.Task
     9   22896   3  1648512   2  61579139  90 builtins.weakref
<90 more rows. Type e.g. '_.more' to view.>
In heap_snapshot we can see the allocated objects grouped by an equivalence relation. The default relation used by the heap method is clodo, which means grouping by class, or by __dict__ of a class instance ("class or dict owner"). This quick inspection has already uncovered some issues:
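Before digging in, note that the same snapshot can be sliced and regrouped without taking it again: indexing a partition row gives us the corresponding sub-set of objects, and attributes like the ones below switch the equivalence relation (we use bysize and byrcs later in this post):

>>> heap_snapshot[0]      # sub-set containing only the decimal.Decimal row
>>> heap_snapshot.bysize  # regroup by individual object size
>>> heap_snapshot.byrcs   # regroup by the kinds of objects referring to them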
Issue #1
A big chunk (15%) of the allocated memory is taken up by the __dict__ attributes of instances of our internal NanoTime class. We can take a closer look at their individual sizes by grouping by the size equivalence relation:
>>> heap_snapshot[1].bysize  # ``__dict__ of NanoTime`` grouped by size
Partition of a set of 98703 objects. Total size = 10265112 bytes.
 Index  Count   %     Size   % Cumulative   % Individual Size
     0  98703 100 10265112 100   10265112 100 104
>>> heap_snapshot[5].bysize  # ``NanoTime`` grouped by size
Partition of a set of 98703 objects. Total size = 4737744 bytes.
 Index  Count   %     Size   % Cumulative   % Individual Size
     0  98703 100  4737744 100    4737744 100 48
Fortunately, this can easily be reduced by adding a __slots__ attribute to our NanoTime class:
class NanoTime:
	__slots__ = ('nanoseconds',)
That way a static amount of memory is allocated for storing object attributes. But remember, this comes at the price of not being able to add new attributes dynamically (see the short sketch at the end of this section). In a new, unrelated session we can check how much our improved version of a NanoTime instance allocates:
>>> from guppy import hpy
>>> heapy_session = hpy()
>>> heapy_session.setref()
>>> nanotime_instance = NanoTime.today()
>>> heapy_session.heap()
Partition of a set of 8 objects. Total size = 1201 bytes.
 Index  Count   %  Size   % Cumulative   % Kind (class / dict of class)
     0      2  25   800  67        800  67 types.FrameType
     1      1  12   176  15        976  81 types.CodeType
     2      2  25   104   9       1080  90 tuple
     3      1  12    45   4       1125  94 bytes
     4      1  12    40   3       1165  97 qutils.time.nanotime.NanoTime
     5      1  12    36   3       1201 100 int
No more __dict__ of NanoTime is allocated, which saves 104 bytes per NanoTime instance, and the size of the NanoTime instance itself has dropped from 48 bytes to 40 bytes.
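As a side note, the trade-off mentioned above is easy to demonstrate: a slotted instance has no __dict__, so attributes outside __slots__ cannot be attached at runtime. A small sketch with two made-up classes:

class Plain:
	def __init__(self, nanoseconds: int) -> None:
		self.nanoseconds = nanoseconds

class Slotted:
	__slots__ = ('nanoseconds',)

	def __init__(self, nanoseconds: int) -> None:
		self.nanoseconds = nanoseconds

plain = Plain(1)
plain.note = 'extra attributes end up in plain.__dict__'  # works fine

slotted = Slotted(1)
print(hasattr(slotted, '__dict__'))  # False - no per-instance dict
try:
	slotted.note = 'this will fail'
except AttributeError:
	print('cannot add attributes outside __slots__')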
Issue #2
The second problem is specific to our application and domain: we noticed that the number of TradeMessage instances was unusually high (and only rising over time), even though they are only supposed to be sent to our messaging system and should therefore have a rather short lifespan. A memory leak was the first thing that came to mind. To find out where these instances are coming from, we can use the byrcs equivalence relation, which shows us the referrers of these objects.
>>> trades = heap_snapshot[4]
>>> trades.byrcs
Partition of a set of 43356 objects. Total size = 5202720 bytes.
 Index  Count   %    Size   % Cumulative   % Referrers by Kind (class / dict of class)
     0  43329 100 5199480 100   87328920 100 tuple
     1     26   0    3120   0   87380400 100 <Nothing>
     2      1   0     120   0   87380520 100 tuple, types.FrameType
Almost 100% are referred to by a tuple. This doesn't tell us much, so let's dig deeper by displaying the shortest path to an individual TradeMessage object:
>>> trades.byid[1].sp
 0: hpy().Root.i0_t140358710572864_async_gen_finalizer.__self__.__dict__['_signal_handlers'][<Signals.SIGHUP: 1>]._callback.__self__.__dict__['_hub'].__dict__['_unreads'][('DataSource'...B')][0].__self__.__dict__['_queue'].??[1].messages[0]
Now, this looks a little crazy, but it actually showed us that the instances were stuck in our _queue, waiting to be sent to our messaging system. Even though this was not a memory leak as we first suspected, it helped us focus on the actual issue, which turned out to be an incorrect setup of our messaging system.
Hopefully this example showcased how useful guppy3 can be, even though we've only scratched the surface of its abilities :)