Hyak's parallel-sql is very similar to standard GNU parallel, but instead of getting tasks from STDIN or a file, instances of parallel-sql retrieve unique tasks out of an SQL database as needed until all the tasks are complete. This allows users to build a single database of all their tasks and have many jobs (nodes) work from that single database.
More Detailed Description
parallel-sql takes tasks out of the database as it needs them. If you load ten thousand tasks into the database, then submit three 16 core jobs, those three jobs start and they each pick up 16 tasks. Once a task finishes, it's marked done and that instance of parallel-sql will retrieve another task from the database. The parallel-sql jobs will continue running like that until all the tasks are complete. If a job is preempted or canceled, all the picked up (running) tasks are marked as available so that other jobs or subsequent runs of the same job can pick them up and complete them.
task: a single work unit. For instance, a path to a program (e.g. Matlab) and a set of command line arguments (input parameters). A task must use only one core.
job: an instance of parallel-sql executing or waiting to execute on a single node. Each job is defined by a Slurm job script and submitted to the scheduler separately. Each job runs on exactly one node at a time.
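The task life cycle described above can be illustrated with a toy model in bash. This is only a sketch of the bookkeeping, not how parallel-sql is implemented: the real tool keeps this state in an SQL database, and every function and variable name here (load_tasks, pick_up, complete, release_picked_up, count_state, PICKED) is hypothetical.

```bash
#!/bin/bash
# Toy model of the task states; NOT parallel-sql's actual implementation.
state=()   # state[id] is one of: available, picked_up, complete

load_tasks() {            # load_tasks COUNT: create COUNT available tasks
    local i
    state=()
    for (( i = 1; i <= $1; i++ )); do state[i]=available; done
}

pick_up() {               # claim the first available task; its id goes in PICKED
    local id
    for id in "${!state[@]}"; do
        if [[ ${state[id]} == available ]]; then
            state[id]=picked_up
            PICKED=$id
            return 0
        fi
    done
    return 1              # nothing left to claim
}

complete() { state[$1]=complete; }   # complete ID: mark a finished task

release_picked_up() {     # on preemption or cancel, running tasks become available
    local id
    for id in "${!state[@]}"; do
        [[ ${state[id]} == picked_up ]] && state[id]=available
    done
    return 0
}

count_state() {           # count_state STATE: print how many tasks are in STATE
    local id n=0
    for id in "${!state[@]}"; do
        [[ ${state[id]} == "$1" ]] && n=$(( n + 1 ))
    done
    echo "$n"
}

load_tasks 5
pick_up; first=$PICKED    # a job claims task 1
pick_up                   # ...and task 2
complete "$first"         # task 1 finishes
release_picked_up         # the job is preempted while task 2 is still running
echo "available=$(count_state available) picked_up=$(count_state picked_up) complete=$(count_state complete)"
# prints: available=4 picked_up=0 complete=1
```

Because a released task goes back to 'available', a later job can claim it again, which is why each task must be safe to rerun from the start.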
Comparison to GNU parallel
With standard GNU parallel, you would have a project consisting of many thousands of tasks, but you'd have to split that up into multiple task lists (one for each job). You'd then submit N different jobs with N different task lists and job scripts. With parallel-sql you just load all the tasks you need to run into the database and submit a number of identical jobs equal to the number of nodes you'd like to use.
It is important to choose a number of jobs that's proportionate to the number of tasks you've added to your database. If you only have 600 tasks (each taking an hour), you should not submit more than thirty-seven 16 core jobs (592 total cores).
Standard requirements for the ckpt queue still apply; see also the section below for specific instructions on the ckpt queue. Individual jobs should still run for at least 1 hour, so consider how many tasks you have and how long each one runs when deciding how many jobs to submit. For example, 10,000 tasks that each run for 10 seconds amount to 100,000 seconds of computation (~28 hours). A 16 core node can perform 16 hours of computation per hour, so the whole workload is under two hours of work for a single node, and you should only use one or two 16 core nodes.
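The sizing rule above can be turned into a quick estimate. The following is a minimal sketch, assuming every task has roughly the same runtime; the function name max_jobs and its arguments are hypothetical and are not part of parallel-sql.

```bash
#!/bin/bash
# max_jobs TASKS SECONDS_PER_TASK CORES_PER_NODE [MIN_HOURS]
# Prints the largest number of identical jobs for which each job still has
# at least MIN_HOURS (default 1) of work on all of its cores.
max_jobs() {
    local tasks=$1 secs_per_task=$2 cores=$3 min_hours=${4:-1}
    local total_secs=$(( tasks * secs_per_task ))          # total computation
    local per_job_secs=$(( cores * min_hours * 3600 ))     # work one job must get
    local jobs=$(( total_secs / per_job_secs ))            # integer division rounds down
    if (( jobs < 1 )); then jobs=1; fi                     # always need at least one job
    echo "$jobs"
}

max_jobs 600 3600 16    # 600 one-hour tasks on 16 core nodes -> 37
max_jobs 2000 300 16    # 2000 five-minute tasks on 16 core nodes -> 10
```

The first call reproduces the thirty-seven job limit given for 600 one-hour tasks. Treat the result as an upper bound; submitting fewer jobs simply makes each one run longer.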
The ckpt queue is still unsuitable for tasks taking longer than 4 hours (unless your application performs checkpointing). parallel-sql does not checkpoint the progress of your application, only whether a task has completed or not.
If you have very long running tasks (that you run in your group's own allocation) and don't run a large number of tasks, standard GNU parallel is likely a better choice.
- Do not perform more than one load operation at a time.
- Do not perform a load operation while parallel-sql jobs are running.
Initial Database Setup
This creates a user and table for you on the database server where information about your jobs will be stored. You perform these steps only one time.
Basic Setup Session
In the output above, 'available' means the task has not started, and 'complete' means the task has finished. A 'picked up' task may be in one of three states: (1) it is running but not complete, (2) it was running but was interrupted by the scheduler (e.g. preempted or out of walltime), or (3) an instance of parallel-sql has checked out the task from the server but has not started running it yet.
Below is an example of a task list. Each task is one line of the task list. Each task must be a complete command which can be run at the command line without any user input. Each task must be independent of every other task since parallel-sql may run the tasks in any order.
Example Task List (mytasklist)
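A task list of this form can be generated with a short shell loop. The sketch below is illustrative only: the function name make_tasklist, the program path /path/to/myprog, and the input and output file names are all hypothetical placeholders.

```bash
#!/bin/bash
# make_tasklist FILE COUNT
# Writes COUNT independent, single-core tasks to FILE, one complete command per line.
# /path/to/myprog and the .dat file names are placeholders, not real programs or data.
make_tasklist() {
    local out=$1 count=$2 i
    : > "$out"                       # start from an empty file
    for (( i = 1; i <= count; i++ )); do
        echo "/path/to/myprog input_${i}.dat output_${i}.dat" >> "$out"
    done
}

make_tasklist mytasklist 4
cat mytasklist
```

Each generated line is a complete command that needs no user input, and no line depends on the result of another, so the tasks can run in any order.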
Sample Job Script
You will need to change the parameters to suit your own work. In the script below, abc is your user ID and xyz is your group.
## Job Name
## Allocation Definition
## Nodes. This should always be 1 for parallel-sql.
## Walltime (3 hours)
## Memory per node
## Specify the working directory for this job
## Turn on e-mail notification
module load parallel_sql
# Load any other modules you need here (e.g. matlab).
# The command below makes parallel-sql get tasks from the database
# and run them on the node in parallel, so a 16 core node will have
# 16 tasks running at one time.
parallel-sql --sql -a parallel --exit-on-term
parallel-sql provides single-node parallelism only. To use multiple nodes, submit the same job script multiple times. If you use the bash shell (the default), a short loop can submit the script N times.
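One way to do this is a small bash function that calls Slurm's sbatch command N times. This is a minimal sketch: the function name submit_jobs and the job script name myjob.slurm are hypothetical, and the optional third argument exists only so the loop can be tried with echo before anything is really submitted.

```bash
#!/bin/bash
# submit_jobs N SCRIPT [SUBMIT_COMMAND]
# Submits SCRIPT N times. SUBMIT_COMMAND defaults to sbatch; pass "echo" for a dry run.
submit_jobs() {
    local n=$1 script=$2 submit=${3:-sbatch} i
    for (( i = 1; i <= n; i++ )); do
        "$submit" "$script" || return 1   # stop at the first failed submission
    done
}

# Dry run: prints the script name 4 times instead of submitting.
submit_jobs 4 myjob.slurm echo

# Real submission (uncomment on the cluster):
# submit_jobs 4 myjob.slurm
```

Every submitted job is identical, so N is simply the number of nodes you want working on the task database at once.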
Checkpoint Queue ckpt (old name backfill queue bf)
If your job is interrupted and put back on the ckpt queue, then when it starts again parallel-sql will run only the tasks that were not completed earlier. The task that was interrupted will be in the 'picked up' state, so some of your tasks may end up 'picked up' after a job runs out of walltime or is preempted. You can use the command below to change the state of those tasks from 'picked up' to 'available'. It uses scheduler data to change the state of only those tasks that are not running.
If you want to forcibly change the state of tasks from 'picked up' to 'available', you can use the command below. It changes the state even if the task is running, so use it only if you are sure that the task is not running.
Once you're done with a set of tasks, you should clear them out of the database. --del removes all tasks. There's a maximum taskid of about 4 billion which you may hit if you never clean up your database with --del. With the default index type, hash, load performance decreases as the taskid increases.
parallel-sql can support multiple distinct sets of tasks. You might want to set up one set of tasks that can run in the ckpt queue since those tasks run for four hours or less and then set up another set of tasks that run in the batch queue because those tasks run for more than 4 hours.
You would then just add the --sql-set ckpt argument to the parallel-sql command line in the job script (for example, parallel-sql --sql -a parallel --exit-on-term --sql-set ckpt), and that instance would read tasks from the ckpt task set.