Introduction
Magnetico is a self-hosted BitTorrent DHT crawler with built-in search capabilities.
Usually, to avoid bootstrapping the database and crawling the Internet from scratch on your own, you can download a community dump (like @anindyamaiti’s here) and let Magnetico start from there.
But these dumps are usually distributed as SQLite databases (see this thread), because the original Magnetico implementation didn’t support any other backend.
Some years ago, magnetico-go-migrator was indeed developed for this very purpose, but it’s written in Go and the project doesn’t seem to exist anymore… There is also magnetico_merge, but I wanted a stable and efficient solution (i.e. one without a hard-coded SQL schema, and not written in Python either).
In this blog post, I’ll explain how I migrated such an SQLite database dump to PostgreSQL, which Magnetico now uses as its regular database, using PGLoader (because such a migration isn’t a straightforward process at all).
Initial setup
First we need to install some packages :
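```bash
# Debian-based host assumed — package names are my assumption and may differ
# on your distribution ; uconv ships in the icu-devtools package.
sudo apt install pgloader sqlite3 icu-devtools postgresql-client
```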
I’ll assume the Magnetico service directory tree looks something like this :
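```
magnetico/
├── compose.yaml            # the Compose stack (layout and names are mine — adapt)
└── data/
    └── database.sqlite3    # the community SQLite dump to migrate
```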
PGLoader is broken (when it comes to SQLite)
Despite being fairly actively maintained, PGLoader still has an incredible number of open issues. One that I’ve been struggling with is #1256, which causes PGLoader to fail to correctly “transpile” the SQLite REFERENCES constraint found in the Magnetico schema :
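```sql
-- Simplified sketch of the original Magnetico SQLite schema, from memory ;
-- check your actual dump with : sqlite3 database.sqlite3 ".schema"
CREATE TABLE torrents (
    id            INTEGER PRIMARY KEY,
    info_hash     BLOB NOT NULL UNIQUE,
    name          TEXT NOT NULL,
    total_size    INTEGER NOT NULL,
    discovered_on INTEGER NOT NULL
);

CREATE TABLE files (
    id         INTEGER PRIMARY KEY,
    torrent_id INTEGER REFERENCES torrents (id) ON DELETE CASCADE,
    size       INTEGER NOT NULL,
    path       TEXT NOT NULL
);
```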
So if we wrap this up, a naive approach (or rather “an approach that I’ve tried for you”) would be to execute a PGLoader command derived from the upstream documentation :
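```bash
# A sketch — the SQLite path and PostgreSQL connection string are placeholders.
pgloader --with "prefetch rows = 10000" \
         sqlite:///srv/magnetico/data/database.sqlite3 \
         postgresql://magnetico:changeme@127.0.0.1/magnetico
```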
You may have noticed the prefetch rows option (which defaults to 100000), which I had to lower : unless you have a whole datacenter at your disposal, the default leads to heap memory exhaustion.
Later, I also hit another PGLoader “transpilation” issue, this time related to SQLite PRIMARY KEY handling… Long story short : this is an SQL nightmare.
But anyway, this PGLoader command fails pretty hard and fast, because of…
Encoding issues… as always
The BitTorrent DHT actually contains lots of garbage, including invalid UTF-8 character sequences.
PostgreSQL enforces strict character encoding checks, which prevents us from directly importing the TEXT columns (torrents.name and files.path) from SQLite. Moreover, it stops processing the current table when it encounters an error (including an encoding one), and PGLoader doesn’t continue with the remaining rows afterward.
So I decided to adapt @Crystalix007’s solution for cleaning the SQLite dump of invalid character sequences, without its major drawback of duplicating the dump on disk multiple times (which, by the way, additionally led to database corruption in my case…).
The final recipe
So the idea here is to walk away from PGLoader’s SQLite connector and let Magnetico create a clean database schema. Then we dump the SQLite database tables as CSV, streamed through uconv (in order to skip invalid UTF-8 character sequences, which spares us some useless gigabytes by the way), down to the PostgreSQL database thanks to PGLoader :
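```bash
# A sketch of the pipeline for the torrents table — paths and credentials are
# placeholders, and load_table.pgl is detailed in “The final steps” below.
sqlite3 -readonly -csv ./data/database.sqlite3 \
  "SELECT id, '\x' || lower(hex(info_hash)), name, total_size, discovered_on FROM torrents;" \
| uconv -f utf-8 -t utf-8 --callback skip \
| pgloader <(sed 's/TARGET_TABLE/torrents/g' load_table.pgl)
```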
As CSV is a text format, special care must be taken when importing torrents.info_hash, which is an SQLite BLOB column (containing raw bytes). For this we leverage PostgreSQL’s bytea hex format and encode those bytes on the fly as hexadecimal.
The final steps
Bootstrap Magnetico
Extend the Magnetico Compose stack with a PostgreSQL service, as below :
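```yaml
# A sketch — image tag, credentials and volume paths are placeholders to adapt.
services:
  # … your existing Magnetico service(s) …

  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_DB: magnetico
      POSTGRES_USER: magnetico
      POSTGRES_PASSWORD: changeme
    volumes:
      - ./postgres:/var/lib/postgresql/data
    ports:
      # temporarily exposed on localhost only, for the migration tooling
      - "127.0.0.1:5432:5432"
```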
… and then run the following commands :
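```bash
# Sketch — service names (postgres, magneticod) are mine ; this assumes your
# Magnetico service is configured with a postgresql:// database URL.
docker compose up -d postgres

# Start Magnetico once so it creates its (clean) PostgreSQL schema…
docker compose up -d magneticod
sleep 30

# … then stop it : we don’t want it writing during the import.
docker compose stop magneticod
```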
Run migration
You’ll have to prepare a PGLoader command file (load_table.pgl), which imports CSV content read from stdin into the PostgreSQL TARGET_TABLE :
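```
LOAD CSV
     FROM stdin
     INTO postgresql://magnetico:changeme@127.0.0.1/magnetico?tablename=TARGET_TABLE

     WITH fields terminated by ',',
          fields optionally enclosed by '"',
          fields escaped by double-quote

     BEFORE LOAD DO
     $$ TRUNCATE TABLE TARGET_TABLE CASCADE; $$;
```

This is only a sketch : adapt the connection string, and note that TARGET_TABLE is a plain placeholder which the Bash script below substitutes with sed.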
It turns out PGLoader doesn’t specify CASCADE when truncating the target table, so we must do it ourselves as an import prelude (the BEFORE LOAD DO section above).
For performance purposes, we disable all TARGET_TABLE indexes during the import (following @fle’s trick) and trigger a whole-database reindexation afterwards (see below). This is a(nother) workaround, as PGLoader isn’t able to drop indexes (before reconstructing them at the end) when some SQL constraints depend on them.
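Concretely, the trick boils down to flipping the indexes’ indisready flag in the PostgreSQL catalog (superuser rights required ; a sketch with torrents as an example) :

```sql
-- make PostgreSQL stop maintaining the table’s indexes during the import
UPDATE pg_index SET indisready = false WHERE indrelid = 'torrents'::regclass;

-- once the import is done : resume index maintenance, then rebuild
UPDATE pg_index SET indisready = true WHERE indrelid = 'torrents'::regclass;
REINDEX TABLE torrents;
```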
You can then execute this Bash script :
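```bash
#!/usr/bin/env bash
# migrate.sh — a sketch : paths, credentials and per-table SELECTs are
# assumptions to adapt to your own setup and schema.
set -euo pipefail

SQLITE_DB="./data/database.sqlite3"
PG_URI="postgresql://magnetico:changeme@127.0.0.1/magnetico"

# Per-table export queries — info_hash (BLOB) is encoded on the fly into
# PostgreSQL’s bytea hex format ('\x' prefix + hexadecimal digits).
declare -A QUERIES=(
  [torrents]="SELECT id, '\x' || lower(hex(info_hash)), name, total_size, discovered_on FROM torrents;"
  [files]="SELECT id, torrent_id, size, path FROM files;"
)

for table in torrents files; do
  echo "Importing ${table}…"

  # disable index maintenance on the target table (@fle’s trick, superuser needed)
  psql "${PG_URI}" -c "UPDATE pg_index SET indisready = false WHERE indrelid = '${table}'::regclass;"

  # substitute the TARGET_TABLE placeholder in the PGLoader command file
  sed "s/TARGET_TABLE/${table}/g" load_table.pgl > "/tmp/load_${table}.pgl"

  # dump as CSV, strip invalid UTF-8 sequences, stream into PostgreSQL
  sqlite3 -readonly -csv "${SQLITE_DB}" "${QUERIES[${table}]}" \
    | uconv -f utf-8 -t utf-8 --callback skip \
    | pgloader "/tmp/load_${table}.pgl"

  # resume index maintenance
  psql "${PG_URI}" -c "UPDATE pg_index SET indisready = true WHERE indrelid = '${table}'::regclass;"
done

# rebuild every index in one go
psql "${PG_URI}" -c "REINDEX DATABASE magnetico;"
```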
Restart Magnetico
Once the database import is done, you may simplify your postgres Compose service definition and restart the services :
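```bash
# Remove the temporary "ports:" exposure from the postgres service first, then :
docker compose up -d --force-recreate postgres
docker compose up -d magneticod   # service name is an assumption, as above
```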
If you’re satisfied with the imported dataset, you can clean up all of our mess (as well as the old SQLite database dump) :
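```bash
# Adapt paths to your setup — and only delete the SQLite dump once you’re sure !
rm -f load_table.pgl /tmp/load_*.pgl migrate.sh
rm -f ./data/database.sqlite3
```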
Conclusion
The migration took around 12 hours on my setup (~30M torrents with ~950M files), but it took me more than a week to completely figure out the process and tidy it up.
’Hope it helped, and that one day most Magnetico users will eventually drop SQLite, preferring PostgreSQL (custom archive) dumps.
Don’t forget to share with the world… tracker-free