Backutil development: Building and JOINing SQLite databases in Python
This post is something of a development diary for Backutil - my Python-based utility for backing up files from Windows systems. I published the first version of Backutil (v0.51) at the beginning of 2021, and pushed a small update (v0.52) to fix some minor issues in February.
As of v0.52, the utility still relied on text files full of file hashes to remember what it had backed up, and on in-memory Pandas dataframes to compare the files currently on the system against those previous hashes and work out what to back up. The full process looked something like this: read the stored hashes from the text files, hash the files currently on the system, compare the two sets in a dataframe, and back up anything without a match.
For v0.61, I wanted to streamline this somewhat, and hopefully speed up the process (which will be boosted further by a multithreading implementation in a future update). The solution? SQLite - as the name would suggest, a lightweight flavour of SQL that can store databases either in memory or as files, accessible via the Python sqlite3 library. This enabled me to use SQL operations like JOIN to manipulate the backup data more efficiently. The new process looks like this: build an in-memory table of the current files' hashes, JOIN it against the on-disk table of previously backed-up hashes, and back up only the files with no match.
To demonstrate this, I’ll run through a few examples of how SQLite is used in the latest version.
Managing the database connection
Before we can use a SQLite database, we must connect to it. As this happens several times throughout the code, I defined a function that takes the config class and an action variable. If the action supplied to the function is open, we connect to the database and create the backutil_previous table if it does not already exist. If the action supplied to the function is close, we close the connection to the database.

With this function defined and actions taken according to the contents of the action variable, we can open and close the database with a single line elsewhere in the code.
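A minimal sketch of such a helper might look like the following - the Config class, the db_path attribute, and the column names are my assumptions, not Backutil's actual identifiers:

```python
import sqlite3

class Config:
    """Stand-in for Backutil's config class (names here are assumptions)."""
    db_path = "backutil.sqlite"
    conn = None

def manage_database(config, action):
    if action == "open":
        # Connect to the on-disk database file (created if it doesn't exist)
        config.conn = sqlite3.connect(config.db_path)
        cursor = config.conn.cursor()
        # Create the previous-backups table on first run
        cursor.execute(
            """CREATE TABLE IF NOT EXISTS backutil_previous (
                   file_hash TEXT,
                   backup_date TEXT
               )"""
        )
        config.conn.commit()
    elif action == "close":
        config.conn.close()
```

Elsewhere in the code, opening the database is then just `manage_database(config, "open")`, and closing it is the matching `close` call.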
Getting previous backup dates
In order to work out whether the backups need to be rotated, Backutil previously counted the number of .back text files in its directory. Now, however, we can use SELECT DISTINCT to get a list of unique timestamps of previous backups from the SQLite database stored on disk.

This code iterates through the returned dates and adds them to a Python list named backup_dates, the length of which is then counted to determine whether there are more backups than required.
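Sketched with an in-memory stand-in for the on-disk database (the table and column names are assumptions, as is the sample data), the rotation check might look like this:

```python
import sqlite3

# Demonstration setup: in Backutil the database already exists on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE backutil_previous (file_hash TEXT, backup_date TEXT)")
conn.executemany(
    "INSERT INTO backutil_previous VALUES (?, ?)",
    [("aaa", "2021-03-01"), ("bbb", "2021-03-01"), ("ccc", "2021-03-08")],
)

# Fetch the unique timestamps of previous backups
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT backup_date FROM backutil_previous")
backup_dates = [row[0] for row in cursor.fetchall()]

# Rotate if there are more backups on record than we want to keep
max_backups = 2  # illustrative value, not Backutil's actual setting
rotation_needed = len(backup_dates) > max_backups
```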
Deleting data from old backups
If there are too many backups, the oldest backup must be deleted, along with its entries in the previous-backups database. We can take the oldest date from backup_dates and use it in a DELETE operation. Using question marks in sqlite3 inserts items from the supplied variable - in this case query_data - and is best practice to protect against SQL injection, because the supplied strings are interpreted only as literal values.
Another important note here is that after executing an operation that changes a SQLite database, the changes must be committed. If we did not commit our changes, they would never be written to the file on disk. Once the changes are committed, the oldest backup's hash entries are deleted for good.
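Putting the parameterised DELETE and the commit together gives a sketch along these lines (again with assumed table and column names, and an in-memory database standing in for the file on disk):

```python
import sqlite3

# Demonstration setup with two previous backups on record
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE backutil_previous (file_hash TEXT, backup_date TEXT)")
conn.executemany(
    "INSERT INTO backutil_previous VALUES (?, ?)",
    [("aaa", "2021-03-01"), ("bbb", "2021-03-08")],
)

backup_dates = ["2021-03-01", "2021-03-08"]

# The ? placeholder makes sqlite3 treat the value strictly as a literal,
# guarding against SQL injection
query_data = (min(backup_dates),)
conn.execute("DELETE FROM backutil_previous WHERE backup_date = ?", query_data)

# Nothing is written to disk until the change is committed
conn.commit()
```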
Checking old backup hashes against current files
Meanwhile, a separate database named backutil_tracker has been built in memory, containing the hashes of all the files that currently reside in the backup directories. To work out which files are new and need to be backed up, we must compare them to the hashes in backutil_previous.

We can use ATTACH to enable cross-database operations, then perform a LEFT JOIN (see my previous post on SQL JOINs for all the details) to combine the data from the two tables based on the file hashes.
If a hash in backutil_tracker matches a hash in backutil_previous, the date that file was backed up is copied to the tracker table. Once all hashes have been checked, we know that rows with None in the date column do not exist in previous backups. These are copied to a Python list that will be supplied to the backup loop, so only new files are added to the new backup.
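A sketch of the ATTACH and LEFT JOIN step - note that, for brevity, this version reads the joined result directly rather than copying the dates into the tracker table first, and all table and column names (plus the sample data) are my assumptions:

```python
import sqlite3

# In-memory tracker of the files currently in the backup directories
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE backutil_tracker (file_path TEXT, file_hash TEXT)"
)
conn.executemany(
    "INSERT INTO backutil_tracker VALUES (?, ?)",
    [("a.txt", "aaa"), ("b.txt", "bbb")],
)

# Attach a second database holding the previous backup hashes; here it is
# another in-memory database, but in Backutil it is the file on disk
conn.execute("ATTACH DATABASE ':memory:' AS prev")
conn.execute("CREATE TABLE prev.backutil_previous (file_hash TEXT, backup_date TEXT)")
conn.execute("INSERT INTO prev.backutil_previous VALUES ('aaa', '2021-03-01')")

# LEFT JOIN keeps every current file; unmatched hashes get a NULL date
cursor = conn.execute(
    """SELECT t.file_path, p.backup_date
       FROM backutil_tracker AS t
       LEFT JOIN prev.backutil_previous AS p
       ON t.file_hash = p.file_hash
       ORDER BY t.file_path"""
)

# Rows with no previous backup date are new and need backing up
to_back_up = [path for path, date in cursor.fetchall() if date is None]
```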
Writing backed up files to the database
As Backutil successfully copies files, it appends their rows to the files_backed_up list. Once all files have been safely copied, compressed, and encrypted, we can INSERT the contents of this list into the backutil_previous table, which is stored on disk for future reference.

We open the database connection again, execute the INSERT operation, then commit the changes. The new hashes are written to the database with the timestamp of the new backup.
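The final INSERT might be sketched as follows, using executemany to insert every row of the list in one call (illustrative names and data; an in-memory database stands in for the file on disk):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # Backutil's copy lives on disk
conn.execute(
    "CREATE TABLE IF NOT EXISTS backutil_previous (file_hash TEXT, backup_date TEXT)"
)

# Rows appended as each file was successfully copied (illustrative data)
files_backed_up = [("aaa", "2021-03-15"), ("bbb", "2021-03-15")]

# Write every new hash with the new backup's timestamp, then commit so
# the rows are persisted; the connection can then be closed
conn.executemany(
    "INSERT INTO backutil_previous (file_hash, backup_date) VALUES (?, ?)",
    files_backed_up,
)
conn.commit()
```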
With our files backed up and the new hashes written to backutil_previous, we can close our database connection. The database will remain on disk until it is called upon during the next backup, at which point Backutil will not copy the same files again because their hashes are now in the database.