Introduction

Nowadays (it’s 2023 if someone has just woken up from hibernation), many applications work with large amounts of data, often exceeding 1 million records in the database. Operations on such datasets are more problematic than those on smaller ones, mainly because of limited RAM and CPU resources.

In this tutorial I will present how to efficiently import large datasets into a database using Django migrations, but a similar approach can be adapted to Alembic or Flask-Migrate.

The problem to solve

An ill-considered process of importing such a large dataset might load all objects into memory, which in turn can cause the migration process to crash.

Suppose we have a simple Person model as below:

from django.db import models


class Person(models.Model):
    name = models.CharField(max_length=255)
    email = models.EmailField()

and the goal is to load data from a CSV file.

The simple and common solution (which works for small and medium datasets) is to read the data from a file and save the records to the database one by one.

import csv

from django.db import migrations


def read_csv_file(file_path: str):
    with open(file_path, "r") as people_file:
        return list(csv.reader(people_file))


def import_people_records(apps, schema_editor):
    Person = apps.get_model("people", "Person")

    for name, email in read_csv_file("people.csv"):
        Person.objects.get_or_create(
            name=name,
            email=email,
        )


def delete_people_records(apps, schema_editor):
    apps.get_model("people", "Person").objects.all().delete()


class Migration(migrations.Migration):
    dependencies = [
        ("people", "0001_initial"),
    ]

    operations = [
        migrations.RunPython(
            code=import_people_records,
            reverse_code=delete_people_records,
        ),
    ]

This approach is simple, but it has two serious disadvantages: 

  • all the data is loaded into memory at once, which can exhaust available memory and, in turn, cause the process to crash or slow down significantly,
  • as many database queries are executed as there are records, which is suboptimal.

Solution 1 for large datasets

As described above, there are two main problems to solve.

In the case of Python, running out of memory can be addressed by using generators.

Generators in Python are a special type of function that yield a sequence of results instead of a single value, allowing memory-efficient iteration over large datasets. They produce values one at a time on demand, instead of storing all results in memory at once.
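To illustrate, here is a minimal generator (not part of the migration code itself) that yields one line at a time instead of building the whole list up front:

def read_lines(file_path: str):
    # The file is read lazily; only the current line is held in memory.
    with open(file_path, "r") as f:
        for line in f:
            yield line.rstrip("\n")


# Iterating pulls one value at a time from the generator.
for line in read_lines("people.csv"):
    print(line)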

The number of database queries can be reduced by using bulk operations (i.e. operations on groups of records), leading to faster execution. In other words, multiple records are inserted into the database table within a single query. In the case of Django, bulk_create can be used for that purpose.

import csv
from itertools import islice

from django.db import migrations


BATCH_SIZE = 1000


def read_csv_file(file_path: str, batch_size: int):
    with open(file_path, "r") as f:
        reader = csv.reader(f)
        while True:
            # islice pulls at most batch_size rows from the reader,
            # so only one batch is held in memory at a time.
            batch = list(islice(reader, batch_size))

            if not batch:
                return

            yield batch


def import_people_records(apps, schema_editor):
    Person = apps.get_model("people", "Person")

    for batch in read_csv_file("people.csv", BATCH_SIZE):
        # A single INSERT is issued per batch instead of one per record.
        Person.objects.bulk_create(
            [
                Person(
                    name=name,
                    email=email,
                )
                for name, email in batch
            ],
            batch_size=BATCH_SIZE,
        )

(Entire example can be found here.)

The above solution will do its job properly, but it’s not without flaws. More precisely, such a migration runs in a single database transaction, which is usually fine, but in extreme cases this can be dangerous: it can lead to import failure, performance degradation, locking and concurrency issues, or even exhaustion of database system resources.

Solution 2 for even larger datasets

To avoid the issues described above, caused by handling too many records within a single database transaction, you can follow another approach.

The solution would be to split all those queries into a few database transactions.

To do that, atomic transactions within the migration must be disabled (in an atomic transaction, a set of database operations must ALL complete, or none of them do).

Moreover, autocommit must be disabled (Django has it enabled by default) so that you can fully control when records are committed to the database.

from django.db import migrations, transaction

# read_csv_file and BATCH_SIZE are the same as in the previous example.


def import_people_records(apps, schema_editor):
    Person = apps.get_model("people", "Person")

    # Take over transaction management and commit manually per batch.
    transaction.set_autocommit(False)

    for batch in read_csv_file("people.csv", BATCH_SIZE):
        Person.objects.bulk_create(
            [
                Person(
                    name=name,
                    email=email,
                )
                for name, email in batch
            ],
            batch_size=BATCH_SIZE,
        )
        # Persist this batch in its own transaction.
        transaction.commit()

    # Restore Django's default autocommit behaviour.
    transaction.set_autocommit(True)


class Migration(migrations.Migration):
    atomic = False

(Entire example can be found here.)

The above solution works well for very large datasets; however, it has one drawback. Because the batches are committed in separate transactions, a problem during the migration (e.g. losing the connection to the database) may leave only part of the records saved in the database.

Depending on the imported data, this can be safeguarded in Django with the ignore_conflicts parameter of the bulk_create method. When a row conflicts with an existing one (e.g. on the primary key or another unique constraint), it is skipped; otherwise it is inserted, so the migration can safely be rerun.
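For example, the bulk_create call from the earlier snippets could be made safe to rerun like this (a sketch; it assumes the imported rows carry explicit primary keys or hit a unique constraint, and that such conflicts should simply be skipped):

Person.objects.bulk_create(
    [Person(name=name, email=email) for name, email in batch],
    batch_size=BATCH_SIZE,
    # Rows that violate an existing constraint are skipped,
    # so rerunning the migration neither fails nor duplicates them.
    ignore_conflicts=True,
)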

With other migration tools, please consult the documentation; looking for a way to execute INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE would give you behaviour resembling Django’s ignore_conflicts.
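As a rough illustration (not taken from the article’s examples), an Alembic migration could issue such a statement directly; the table name below is hypothetical and the INSERT IGNORE syntax assumes MySQL:

from alembic import op


def upgrade():
    # MySQL's INSERT IGNORE skips rows that would violate a unique
    # constraint, so rerunning the migration does not fail on them.
    op.execute(
        "INSERT IGNORE INTO people (name, email) "
        "VALUES ('Jane Doe', 'jane@example.com')"
    )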

Conclusion

Large datasets are now becoming more and more common, so it’s important for developers to learn how to work with them effectively.

It’s important to review the migration to be run and adjust it to the size of the dataset.

The above tips will allow you to avoid issues related to memory shortage and reduce the risk of failure.

Looking for people to improve the performance metrics of your application?

Let’s talk!

Mateusz Głowiński, an accomplished Backend Developer at Makimo, embodies the fusion of strategic vision and technical acuity. Known for his expertise in Python, he skillfully employs Django and Flask in his pursuit of pristine software architecture. His thought-provoking articles, centered on clean architecture and efficient coding practices, are avidly read on LinkedIn and his personal blog. Away from his code-filled world, Mateusz trades software for mountains or football pitches, savouring the exhilaration they bring.