Training AI Models at the Potsdam University Cluster

Prerequisites

Containerization (Docker) in a Nutshell

If you google what containerization is, I am certain you will find many tutorials that walk through the documentation without any kind of abstract explanation, which is just not that helpful. This is a better video. Let’s see it in a simple drawing:

[Figure: Containerization]

We can think of a recipe as the answer to “how can I get a virtual machine that runs things on its own?”, and of an image as that virtual machine once it has been created. But we don’t want to rebuild that virtual machine every time we run a separate program, right? So we create containers for specific applications.

In the context of what I will be showing, the rationale for containers is more that we will make small modifications to our Python code, and building an image every time would be unnecessary and costly. We only need to rebuild the image when we use different libraries (or a different base image, of course).

A minimal working example

The example can be found at https://github.com/xarxaxdev/example_apptainer.

In this section I go over the relevant parts; you may want to first look at the section “How to work with the example” and then modify the parts of the code you want.

Project-relevant code:

generate_model.py -> Python main code: whatever you are trying to run with this tutorial. I loosely followed an existing Hugging Face tutorial; feel free to modify it for whatever NLP task you want to do.

requirements.txt -> Python libraries I needed for the main code of generate_model.py

Container and cluster-relevant code

slurm.job -> Defines the general parameters for running a task in Slurm. Relevant lines I modified:
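
As a rough sketch (not the exact file from the repository), a GPU job file for this setup could look like the one below. The partition name gpu and the resource numbers are assumptions, so check what your cluster actually offers. The snippet writes to slurm_sketch.job, a hypothetical file name, so that it does not overwrite the repository's slurm.job.

```shell
# Write an illustrative job file; every value here is a placeholder.
cat > slurm_sketch.job <<'EOF'
#!/bin/bash
# Name shown in the queue
#SBATCH --job-name=example_apptainer
# Hypothetical partition name; list the real ones with "sinfo"
#SBATCH --partition=gpu
# Request one GPU
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
# Wall-clock limit (HH:MM:SS)
#SBATCH --time=04:00:00
# %j expands to the job id, giving slurm-<jobid>.out
#SBATCH --output=slurm-%j.out

# --nv makes the host's NVIDIA GPUs visible inside the container
apptainer run --nv img.sif
EOF
```

Submitting a file like this with sbatch queues the job; the last line is the same apptainer run --nv img.sif command you could run by hand.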

recipe.def -> This is where we have the set of instructions that will be used to build the image that we will use for running the containers. Let’s go over the sections it has:

We want to prepare the environment (Bootstrap) for Docker-like images. pytorch/pytorch is a specific Docker image; we will use it as the base, install some libraries and define the main script. You can find other images at https://hub.docker.com/

This makes the 2 files available in "." (the main folder of the image).

We want to use the current folder, so we define the image's PATH by evaluating (using $) the current PATH.

Once we are done cloning the base image, the %post section is run to customize the image before it is saved.

This is the main script that runs when we run the image in a container.
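
Putting the sections together, a definition file along these lines would match the description above. This is a sketch under assumptions, not a copy of the repository's recipe.def: the /opt/app destination paths, the exact PATH line and the pip options are my own choices, and the file is written to recipe_sketch.def (a hypothetical name) so the real recipe stays untouched.

```shell
# Write an illustrative Apptainer definition file.
# Header: pull the pytorch/pytorch image from Docker Hub as the base.
# %files: copy the two project files into the image (destinations are assumed).
cat > recipe_sketch.def <<'EOF'
Bootstrap: docker
From: pytorch/pytorch

%files
    generate_model.py /opt/app/generate_model.py
    requirements.txt /opt/app/requirements.txt

%environment
    # Extend the image's PATH from its existing value (illustrative)
    export PATH="/opt/app:$PATH"

%post
    # Runs once at build time, after the base image has been fetched
    pip install --no-cache-dir -r /opt/app/requirements.txt

%runscript
    # Relative path on purpose (an assumption): with Apptainer's default
    # bind of the current directory, this runs the script from the folder
    # the container is started in, so code-only edits need no rebuild.
    exec python generate_model.py "$@"
EOF
```

The quoted 'EOF' keeps the local shell from expanding $PATH and $@ while writing the file; they are evaluated later, inside the container.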

Aliases that made my life easier

You will find them in readme.md. They are just 4 variables and 3 aliases that make working locally, uploading the code and rerunning it simpler. You may want to add them to your .bashrc. The aliases are:
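
The exact definitions live in readme.md; as a rough idea of their shape, here is a sketch. The alias names are the ones used in this post, but the four variable names, the host cluster.example.org and the use of rsync are assumptions of mine, so take the real values from readme.md and your cluster's documentation.

```shell
# Hypothetical variables: replace all four with your own values.
UNI_USER="yourusername"
UNI_HOST="cluster.example.org"   # placeholder, not the real cluster address
LOCAL_DIR="$HOME/example_apptainer"
REMOTE_DIR="/work/$UNI_USER/example_apptainer"

# Open a shell on the cluster.
alias ssh_uni="ssh $UNI_USER@$UNI_HOST"

# Upload the project folder; the built image is excluded because it is large.
alias update_example_apptainer="rsync -av --exclude '*.sif' $LOCAL_DIR/ $UNI_USER@$UNI_HOST:$REMOTE_DIR/"

# Download results (logs, models) from the cluster back to the local folder.
alias reverse_update_example_apptainer="rsync -av --exclude '*.sif' $UNI_USER@$UNI_HOST:$REMOTE_DIR/ $LOCAL_DIR/"
```

Because the aliases use double quotes, the variables are expanded when the alias is defined, so the variables must be set before the alias lines in your .bashrc.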

How to work with the example

  1. Clone the repository from GitHub. I would recommend doing so in /home/username.
  2. Customize code, aliases and variables:
    1. Change “yourusername” to your own username.
    2. Change “/example_apptainer” to whichever folder you stored it in. You may have to tinker with the aliases if it’s not in /home/username.
    3. You should also replace generate_model.py with your own code and requirements.txt with the libraries you need.
  3. Run update_example_apptainer to upload the code from the project folder to the uni cluster’s working directory.
  4. Run ssh_uni to access the cluster; then navigate to your working folder /work/username/example_apptainer.
  5. Now build the image: apptainer build img.sif recipe.def. This is slow.
  6. Now img.sif is in your folder; you could run the image directly with apptainer run --nv img.sif. Instead, use the cluster’s Slurm system and run sbatch slurm.job. Wait for the job to complete or fail.
  7. Close the SSH connection and run reverse_update_example_apptainer to get the results from the cluster.
    1. The job ran successfully, good job!
    2. The job failed; you should check slurm-*jobid*.out to see what went wrong.
      1. Need to modify requirements.txt and install new libraries? Do so, then go back to step 3 without skipping any steps.
      2. You only needed to modify generate_model.py? Do so, then go back to step 3, but you can skip step 5.
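
The "do I need step 5 again?" decision from the list above can be written down as a small shell helper. This is only a sketch: needs_rebuild and cluster_cycle are hypothetical names of mine, not part of the repository, and I am assuming that recipe.def changes also require a rebuild, since that is where the base image and install steps live.

```shell
# Succeeds (exit status 0) if any of the changed files forces an image rebuild.
needs_rebuild() {
    for changed in "$@"; do
        case "$(basename "$changed")" in
            requirements.txt|recipe.def) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical helper covering steps 5 and 6; run it on the cluster,
# inside the working folder.
# Usage: cluster_cycle <files changed since the last build>
cluster_cycle() {
    if [ ! -f img.sif ] || needs_rebuild "$@"; then
        # --force overwrites an existing img.sif
        apptainer build --force img.sif recipe.def || return 1
    fi
    sbatch slurm.job
}
```

For example, cluster_cycle generate_model.py would go straight to sbatch when img.sif already exists, while cluster_cycle requirements.txt would rebuild the image first.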

If this was helpful to you, consider dropping a star on https://github.com/xarxaxdev/example_apptainer or donating. I am planning to keep this website as a personal blog and to document my projects.

Donation:

Monero
49va5kaQ8qzQjfNpTjURwiFR9Zh1uQQsT5cbnnur2NUsfzCbU1QQ2tG3PhdeapEGFTLuGMcx46ss6grJTKKFfP8EC1ePk9M
Monero QR Code
PayPal
PayPal QR Code