Why Django ORM Sucks : It takes a hell of a lot of memory when processing large tables.

For a project I wanted to keep one of my tables in memcached. I tried to do it using the Django ORM, but it was taking a hell of a lot of memory.

Let me give you some details.

Table Structure : This table has almost 1,300,000 rows

id   movie_id   user_id   rating
1    1          1         5.0
2    1          2         4.5
3    1          3         4.5
4    2          1         3.5
5    3          3         2.0

All I wanted was to keep them in the cache (memcached), in the following format:
memcache key : movie_id
memcache data : [ (user1, rating1), (user2, rating2), ... ]
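
For the sample rows above, the target cache entries would look something like this (a minimal sketch using Django's cache API; the 86400-second timeout is the one used in the code below):

from django.core.cache import cache

# Key is the movie id, value is the list of (user_id, rating) pairs for that movie.
cache.set(1, [(1, 5.0), (2, 4.5), (3, 4.5)], 86400)  # movie 1
cache.set(2, [(1, 3.5)], 86400)                      # movie 2
cache.set(3, [(3, 2.0)], 86400)                      # movie 3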

First Approach : Directly from DB

from django.core.cache import cache

# Walk every rating row and append it to that movie's list in memcached.
for r in FlixsterUserRating.objects.all():
    ratingDict = cache.get(r.movie.id)
    if ratingDict is None:
        cache.set(r.movie.id, [(r.user.id, r.rating)], 86400)
    else:
        ratingDict.append((r.user.id, r.rating))
        cache.set(r.movie.id, ratingDict, 86400)
    del ratingDict

In this approach, processing every 1,000 records took about 5.1 seconds, but the memory requirement was growing like hell.
On Windows the program's memory usage grew at a rate of 34 MB per 10K records. Don't even ask about Linux: there it took some 900 MB just to process 50K records.
So to process the full data it would need:
(34 MB / 10,000) * 1,300,000 = 4,420 MB, almost 4 gigs... No way, I can't afford this approach.

Second Approach : From the Database in Chunks

from django.core.cache import cache

# One query per movie: collect that movie's ratings, then cache them in one go.
for m in FlixsterMovie.objects.all():
    d = []
    for r in FlixsterUserRating.objects.filter(movie=m):
        d.append((r.user.id, r.rating))
    cache.set(m.id, d)

In this approach, processing every 1,000 records took about 2.0 seconds.
On Windows the program's memory usage grew at a rate of 8 MB per 10K records.
So to process the full data it would need:
(8 MB / 10,000) * 1,300,000 = 1,040 MB, almost 1 gig... Still, that is a lot; my development system has only 1.5 GB of RAM.

Third Approach : Fuck the Database, Use a CSV Dump

Finally I gave up on the database, dumped my data to a CSV file, and used the following program.

import csv

from django.core.cache import cache

# Read the dump row by row; column order is id, movie_id, user_id, rating.
for s in csv.reader(open(csv_filename)):
    movie_id = int(s[1])
    user_id = int(s[2])
    rating = float(s[3])
    ratingDict = cache.get(movie_id)
    if ratingDict is None:
        cache.set(movie_id, [(user_id, rating)], 86400)
    else:
        ratingDict.append((user_id, rating))
        cache.set(movie_id, ratingDict, 86400)

With this approach the memory requirement was just 17-18 MB and it wasn't growing. Some relief. But, as expected, it was very slow: processing 1K records took about 55 seconds.
So the full task would take about 55 * 1,300 seconds, roughly 20 hours.

Fuck, what should I do now?

I have run out of options.
I am going ahead with the second approach, but suggestions from you guys are most welcome.

Update

Final Version:

from django.core.cache import cache
from django.core.paginator import Paginator

# Process the ratings in pages of 50,000 rows; store each movie's ratings as a dict.
p = Paginator(FlixsterUserRating.objects.all(), 50000)
for i in p.page_range:
    page = p.page(i)
    for r in page.object_list:
        ratingDict = cache.get(r.movie_id)
        if ratingDict is None:
            cache.set(r.movie_id, {r.user_id: r.rating}, 86400)
        else:
            ratingDict[r.user_id] = r.rating
            cache.set(r.movie_id, ratingDict, 86400)

This approach kept the memory usage below 100 MB, and the whole process completed in about an hour.
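
For reference, reading a movie's ratings back is just a cache get; using movie 2 from the sample table above, the stored value would be a dict like {1: 3.5}:

from django.core.cache import cache

ratings = cache.get(2)  # {user_id: rating} dict, or None if the key has expired
if ratings is not None:
    for user_id, rating in ratings.items():
        print user_id, rating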

Thank you all…
Also, I didn't post stats for the .iterator() version because there wasn't any significant improvement; memory consumption kept growing in that case too.

21 Responses

  1. Anonymous #

    Use Paginator (http://www.djangoproject.com/documentation/models/pagination/) to process your rows in chunks of 1000 or so:

    p = Paginator(Rating.objects.all(), 1000)
    for i in xrange(p.num_pages):
        page = p.page(i)
        for rating in page.object_list:
            ....

    That should reduce the memory overhead when processing large rowsets.

  2. Anonymous #

    Sorry, should be “for i in p.page_range”. But you get the idea.

  3. Anonymous #

    Django's objects are not identity mapped, so in effect you're loading the movie table into memory multiple times. What you should do is reuse the first example and replace this:
    cache.set(r.movie.id, [(r.user.id, r.rating)], 86400)

    with this:
    cache.set(r.movie_id, [(r.user_id, r.rating)], 86400)

    Memory usage should then be roughly the size of the rating table.

  4. Anonymous #

    Are you running with DEBUG=True, by any chance? If so, Django caches all the SQL statements for debugging. Try turning debug off in your settings and see if that helps.
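
    If you do have to run with DEBUG=True, you can also empty that query log periodically (a sketch; the interval is arbitrary):

    from django.db import reset_queries

    for i, r in enumerate(FlixsterUserRating.objects.all()):
        # ... cache the row as in the post ...
        if i % 10000 == 0:
            reset_queries()  # drops the SQL statements DEBUG=True has accumulated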

  5. honeyman #

    You are doing the wrong stuff with the wrong tools.
    If you want an in-memory database, set up a MySQL cluster and stop worrying. If you want to store non-ORM database data in memcached, you don't need to fire up the ORM (which keeps an object around for every record you process), or at least you could manually destroy the already processed objects.

  6. TheOne #

    As per the suggestion I changed
    cache.set(r.movie.id, [(r.user.id, r.rating)], 86400)

    to
    cache.set(r.movie_id, [(r.user_id, r.rating)], 86400)

    The memory consumption is drastically reduced, but I still don't know why it keeps gradually increasing.

  7. Anonymous #

    Are you using DEBUG=True ? That might be it.

  8. Anonymous #

    I second the DEBUG=True bit; nothing kills memory faster than that. If you really need to, try forcing GC to run every so many rows with DEBUG disabled, and you should be able to cap your memory usage.
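
    A rough sketch of what that could look like (using the r.movie_id fix suggested above; the 10,000-row interval is arbitrary):

    import gc

    from django.core.cache import cache

    for i, r in enumerate(FlixsterUserRating.objects.all()):
        d = cache.get(r.movie_id) or []
        d.append((r.user_id, r.rating))
        cache.set(r.movie_id, d, 86400)
        if i % 10000 == 0:
            gc.collect()  # force a garbage-collection pass every 10K rows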

  9. TheOne #

    DEBUG mode is False, so that is not the problem.

  10. TheOne #

    toxik: .iterator() didn't help, I tried that already.

  11. Anonymous #

    TheOne, I don’t suppose you have the memory/time stats for the .iterator() version – just out of interest.

  12. Anonymous #

    I'm also surprised that .iterator() didn't help. Using .iterator() instead of .__iter__() bypasses the queryset's internal result cache and prevents it from being filled up with the entire contents of the table; that fill-up is exactly what you don't want in this situation, and it is what would normally explain memory consumption that keeps growing.
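
    For comparison, the two iteration styles look like this (a minimal sketch):

    # Normal iteration fills the queryset's internal result cache with every row.
    for r in FlixsterUserRating.objects.all():
        pass

    # .iterator() streams rows without populating that cache.
    for r in FlixsterUserRating.objects.all().iterator():
        pass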

    • Luckie #

      Your story was really informative, thanks!

  13. macin #

    Maybe try fetching the data with something like:

    >>> Entry.objects.values_list('id', 'headline')
    [(1, u'First entry'), ...]

    This should be faster than creating a model instance for every record.
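
    Applied to the ratings table from the post, that would look roughly like this (untested sketch):

    from django.core.cache import cache

    for movie_id, user_id, rating in FlixsterUserRating.objects.values_list('movie_id', 'user_id', 'rating'):
        d = cache.get(movie_id) or []
        d.append((user_id, rating))
        cache.set(movie_id, d, 86400)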

  14. Thomas Güttler #

    If you loop over huge querysets, you should disable the internal
    cache by using Entry.objects….iterator().

    If you only need some values, use values_list('myattr', …), maybe with flat=True
    if you only want one value from each row.

    I had the same problem. But now it works nicely.

    Check that your queries are not saved by DEBUG:

    from django.db import connection
    assert not connection.queries

  15. Anonymous #

    You can start off by watching your language.

  16. SDC #

    Yeah anonymous, shut the fuck up! I found this post pretty goddamn useful, which is more than I can say about your comment.

  17. Alexander E. Patrakov #

    The reason why .iterator() doesn't help is that the default cursor class in the MySQLdb Python module uses mysql_store_result(), which downloads the entire result set and stores it in memory.
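
    A sketch of what streaming the rows with a server-side cursor looks like, outside the ORM (raw MySQLdb; the connection parameters and table name here are placeholders):

    import MySQLdb
    import MySQLdb.cursors

    conn = MySQLdb.connect(host='localhost', user='...', passwd='...', db='...',
                           cursorclass=MySQLdb.cursors.SSCursor)  # uses mysql_use_result()
    cur = conn.cursor()
    cur.execute("SELECT movie_id, user_id, rating FROM ratings")   # placeholder table name
    for movie_id, user_id, rating in cur:
        pass  # rows arrive streamed from the server instead of being buffered in memory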

  18. Diederik van der Boor #

    How about doing fewer object retrievals and querying only what you need?
    Obviously your case is a bit different here.

    What happens when you try:
    – r.user.id vs r.user_id
    – FlixsterUserRating.objects.only('movie_id', 'user_id', 'rating')
    – FlixsterUserRating.objects.values('movie_id', 'user_id', 'rating')
    ?

