Tidying my music folder with Python

Today is about simple matters: tidying one’s (music) folder to enforce some conventions, e.g. all paths like Album/TrackNum. Title.ogg. Some music libraries can handle this very well, especially gmusicbrowser, yet because of a bug with eCryptfs on my Ubuntu 11.04 I lost a non-negligible amount of music using them. So, how about a little Python script?

Changing the naming pattern

First, let us see how to use the os.path.walk function to recursively explore our folder and perform some operations in every sub-folder. The operation here will be “look for a file named TrackNum - blah and rename it to TrackNum. blah”. Renaming from one pattern to the other goes like this:

import re

def rename(fname):
    # "01 - Title.ogg" -> "01. Title.ogg"
    return re.sub(r'^(\d+) - ', r'\1. ', fname)
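For instance, rename('01 - Title.ogg') returns '01. Title.ogg', while names that do not match the pattern come back unchanged.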

os.path.walk takes three arguments:

  • the root directory
  • a callback function
  • an arbitrary extra argument (passed through to the callback function)

In turn, the callback function receives that extra argument, a directory path (the current sub-directory during the recursive exploration) and a list of file names (all entries of the sub-directory, including directories, links, …). Here is an example of a callback function which renames all files according to our renaming scheme (and moves the file to /tmp if the new name is already taken).

import os

def callback(arg, dirname, fnames):
    for fname in fnames:
        path = dirname + '/' + fname
        if not os.path.isfile(path):
            continue  # skip sub-directories, links, ...
        new_path = dirname + '/' + rename(fname)
        if path == new_path:
            continue  # nothing to rename
        if os.path.exists(new_path):
            # target name already taken: stash the file away in /tmp
            os.system('mv "%s" /tmp/' % path)
        else:
            print "Moving %s..." % path
            os.system('mv "%s" "%s"' % (path, new_path))

One can then call os.path.walk with, e.g., the first command-line argument as root directory:

import sys

if __name__ == "__main__":
    assert len(sys.argv) == 2
    os.path.walk(sys.argv[1], callback, [])
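Assuming the snippets above are saved together as, say, tidy.py (the file name is my own choice, nothing enforces it), the whole collection is processed in one go:

python tidy.py ~/Music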

Finding duplicates

Another issue for me was finding multiple occurrences of the same file under different file names. A way to do this is to compute an MD5 checksum for each file in a sub-folder and look for duplicate checksums. Computing the sum goes like this:

import hashlib

def md5sum(filename):
    m = hashlib.md5()
    try:
        fd = open(filename, "rb")
    except IOError:
        print "Unable to open file in read mode:", filename
        return None
    content = fd.readlines()
    fd.close()
    for eachLine in content:
        m.update(eachLine)
    return m.hexdigest()
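Note that readlines() loads the entire file into memory before hashing; a variant that reads fixed-size chunks (a sketch along the same lines, producing the same digest) keeps memory usage bounded even for large files:

def md5sum_chunked(filename, chunk_size=1 << 20):
    # Feed the file to the hash one megabyte at a time.
    m = hashlib.md5()
    with open(filename, "rb") as fd:
        while True:
            chunk = fd.read(chunk_size)
            if not chunk:
                break
            m.update(chunk)
    return m.hexdigest()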

Checking for duplicates is then straightforward using a hash table (here, a plain Python dictionary keyed by checksum):

def callback(arg, dirname, fnames):
    hashes = {}
    for fname in fnames:
        path = dirname + '/' + fname
        if not os.path.isfile(path):
            continue
        md5 = md5sum(path)
        if md5 is None:
            continue  # unreadable file, already reported
        if md5 in hashes:
            print "'%s' == '%s'" % (hashes[md5], path)
            os.system('rm "%s"' % path)
        else:
            hashes[md5] = path
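This callback plugs into os.path.walk exactly like the previous one. Note that hashes is reset at each call, so duplicates are only detected within a single sub-folder, not across the whole collection.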

Conclusion

As foretold, today was about trivia. All these operations could just as well have been performed with the shell alone, but I chose Python so that I could post the snippets here instead of feeling ashamed of dirty Bash code...
