Hyak fully supports R: anything that you can do with R on your desktop computer can be done on Hyak, but on a much larger scale. However, since Hyak is a large-scale shared resource, there are a few steps you have to take before running R.

Hyak is a Linux supercomputer. If you have not used the Linux, UNIX, or Mac OS X command line before, it is a good idea to get familiar with it before running on Hyak; Google "linux command line" for useful links. You should also have experience with R programming and R scripts on your desktop computer before you try running R on a supercomputer like Hyak.

Usually you will use qsub to submit a PBS script to the Hyak scheduler (see the section at the bottom of this page). The PBS script contains the instructions for running your R script; the scheduler takes care of details like finding a suitable node to run it on.

Interactive Session

However, let us first use an interactive qsub session to get familiar with running R on Hyak. Issue the command below at the Hyak login node to get an interactive session (the "-I" is an upper-case i):

 qsub -q build -I -l walltime=3:00:00

This opens an interactive session on a hyak node.

If we type R at the command line and press Enter, we get the error
"R: command not found". What's wrong? Unlike your desktop computer, Hyak supports hundreds of users with different requirements, including different versions of the same software. Hence not all software executables are placed in everyone's PATH environment variable.

Those fluent with Linux will be tempted to put the location of R in their PATH themselves. On Hyak this should be done with the module command instead; it performs essentially the same task by modifying PATH, LD_LIBRARY_PATH, and other necessary parts of your environment.

The command below shows the available modules:

 module avail

The list is long, and right now we are only interested in the latest version of R (3.2.0 as of September 2015). Hence we issue the command:

 module load r_3.2.0

Now the command R works.
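
To double-check that the module did its job, you can ask the shell which R executable is now found and which version it is:

 which R
 R --version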

After working for some time, you may want to know which modules are loaded in your environment. List them with:

 module list

After you have completed your work, you can remove the module from your environment with:

 module unload r_3.2.0

Installing CRAN packages on hyak

As Hyak is a shared resource, CRAN packages cannot be installed in their default system-wide locations. Below are the steps to install CRAN packages to a location you specify. Here UVW is the package name, XYZ is your Hyak group name, and abc is your Hyak userid.

  1. Make a directory where you will install the R packages. For example:

    mkdir /gscratch/XYZ/abc/rpackages

    (If your group uses the same packages you could use a path like /gscratch/XYZ/rpackages. You can also use a path like /sw/contrib/XYZ/rpackages.)

  2. If you have not done so already, issue the command below to get a build node:

    qsub -q build -I -l walltime=3:00:00
  3. On the build node, issue the command below to load the R module:

    module load r_3.2.0

    Some packages need extra compiler flags to find system headers. For instance, when installing the CRAN package Rglpk, first issue:

    export PKG_CFLAGS="-I/usr/include/glpk"
    export PKG_CPPFLAGS="-I/usr/include/glpk"
  4. Choose one of the following steps:

     i. If you do not want to specify the library location, start the R command line and issue:

        install.packages("UVW")

        The command will ask you to choose a mirror; choose a nearby one and press Enter.
        R will then ask whether to install the package at the default location in your home directory:

        ~/R/x86_64-unknown-linux-gnu-library/{R_VERSION}

        Enter "y" and the package will be installed there.
        Afterwards, in your R scripts and at the R command line, you only need:

        library(UVW)

     ii. If you want to specify the library location, start the R command line and issue:

        install.packages("UVW", lib="/gscratch/XYZ/abc/rpackages")

        The command will ask you to choose a mirror; choose a nearby one and press Enter. The package UVW will be built and installed at /gscratch/XYZ/abc/rpackages.

Once the above is done, whenever you want to use the package UVW, load it in R with:

 library(UVW, lib.loc="/gscratch/XYZ/abc/rpackages")

You may not want to give the lib.loc parameter every time. In that case, put the line below in your .bashrc file:

 export R_LIBS="/gscratch/XYZ/abc/rpackages"

Now you can load the library by just

 library(UVW)
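
You can verify that R picks up the custom library by inspecting its search path; the /gscratch directory should appear in the output if R_LIBS is set correctly:

 Rscript -e '.libPaths()'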

Updating CRAN packages on hyak

To update R packages, use the command below at the R prompt:

 update.packages()

This will update the packages that you installed yourself. It will also update the recommended packages, but not the base packages. For the difference between base and recommended packages, see Stack Overflow.
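
If you installed packages into a custom library as described above, point update.packages() at that library explicitly:

 update.packages(lib.loc="/gscratch/XYZ/abc/rpackages")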

Compiling MPI packages

In order to compile MPI packages (like Rmpi and pbdMPI) you have to load an MPI support module and use an MPI-aware compiler.

The names of compiler modules with MPI support end in -ompi_<version>; for instance

 module load icc_14.0.3-ompi_1.8.3

will do. Note that you have to load exactly the same module later when running R.

Next, you have to tell R to use the MPI-aware compiler; otherwise it will complain about missing MPI headers. This can be achieved by specifying

 CC=mpicc

in the environment, for instance in the R compilation environment file ~/.R/Makevars.
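
For instance, a minimal ~/.R/Makevars could look like the sketch below; this assumes the OpenMPI compiler wrappers mpicc and mpicxx are on your PATH after loading the module:

 CC = mpicc
 CXX = mpicxx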

Now you can install the package:

 install.packages("pbdMPI", dep=TRUE)

Some packages, in particular Rmpi, may not find the necessary headers and libraries. One can help the configure script by setting the MPI_ROOT environment variable:

 MPI_ROOT="/sw/openmpi-1.8.3_icc-14.0.3" R CMD INSTALL Rmpi_0.6-5.tar.gz

where the folder pointed to by MPI_ROOT must correspond to the compiler you are using.

Running R using PBS scripts and qsub

Elsewhere on this wiki you will find details about submitting PBS scripts via qsub. Here we focus on the R-specific part. Your PBS script should contain lines like the ones below. Note that if your script produces graphs, you should save them using the usual R commands.

 module load r_3.2.0
 Rscript /gscratch/XYZ/abc/myscript.R >output.txt 2>&1

Alternatively, in case you use MPI

 module load r_3.2.0 icc_14.0.3-ompi_1.8.3
 mpirun --mca mtl mx --mca pml cm --bind-to core --map-by core Rscript /gscratch/XYZ/abc/myscript.R >output.txt 2>&1

Note that we redirect both R output and error messages to output.txt here.
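
For reference, a minimal complete PBS script for the non-MPI case might look like the sketch below; the job name, resource requests, and paths are placeholders that you should adapt to your own allocation:

 #!/bin/bash
 #PBS -N myRjob
 #PBS -l nodes=1:ppn=16,walltime=4:00:00
 #PBS -j oe
 #PBS -o /gscratch/XYZ/abc/myRjob.log
 module load r_3.2.0
 Rscript /gscratch/XYZ/abc/myscript.R >output.txt 2>&1

Save it to a file and submit it with qsub as usual.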

Executing Parallel Tasks

R allows easy parallel processing via the parallel package and other similar packages.

Multicore Parallelism

On a single node (where all processor cores share the same memory) it is most efficient to use mclapply or fork clusters:

 library(parallel)
 cl <- makeForkCluster(detectCores())
 result <- parLapply(cl, 1:100, myfunc)
 stopCluster(cl)

This code creates a fork cluster on a single node by "forking" the running R process. The entire environment is shared; no data copying or starting of new R instances is necessary. detectCores() detects the number of CPU cores (it usually works well enough). Don't forget to stop the cluster afterwards!
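
For comparison, the same computation with mclapply forks on demand and needs no explicit cluster object (myfunc again stands in for your own function):

 library(parallel)
 result <- mclapply(1:100, myfunc, mc.cores = detectCores())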

Warning: I have not been able to run mclapply on nodes that are invoked through a socket cluster. It seems the mclapply forks are never closed and the node runs out of available forks. The fork-cluster code above worked. Otoomet (talk) 18:41, 2 September 2015 (PDT)

MPI parallelism

MPI is a popular and efficient way to run tasks in parallel. It is supported in R through CRAN packages such as Rmpi and pbdMPI. You can use the MPI approach in two ways: either you start a swarm of R processes, connected via an MPI communicator, simultaneously via mpirun (see PBS scripts and qsub above), or you invoke a single instance of R and later spawn a swarm of workers.

If you invoke R through mpirun in your PBS script, the scheduler tells mpirun which nodes and cores the code should run on. By default mpirun starts one instance of your script per allocated core. For instance, the job script

 #PBS -l nodes=2:ppn=8
 mpirun Rscript test.R

will start 16 instances of your "test.R" script. This may be what you want, for instance if you are using the same-code-different-data paradigm with the pbdMPI package. However, be careful with operations that use shared resources, such as printing or saving. In particular, ensure that multiple processes are not writing to the same file; otherwise you get both errors and garbled data. This also applies to R's normal save-workspace-on-quit behavior: use Rscript instead of R, or invoke R with the --no-save option.
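
As a sketch of the same-code-different-data style: every instance started by mpirun runs the identical script and tells itself apart by its MPI rank. This is only a minimal illustration of the pbdMPI API:

 library(pbdMPI)
 init()
 # all instances execute this line; all.rank = TRUE prints from every rank
 comm.cat("rank", comm.rank(), "of", comm.size(), "\n", all.rank = TRUE)
 finalize()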

If you prefer to start only one instance of your script and spawn the workers later using the Rmpi library, invoke R without mpirun:

 Rscript test.R

Only one instance of your script is started, but MPI is still set up across the allocated processors. Note, though, that mpi.universe.size() is now wrong: it equals 1. Instead, you should check the contents of PBS_NODEFILE to find the total number of processors:

 library(Rmpi)
 library(snow)
 nodefile <- Sys.getenv("PBS_NODEFILE")
 nodes <- readLines(nodefile)
 cl <- makeMPIcluster(length(nodes), includemaster=TRUE)

Now you can use parallel code via the cluster cl.

You can also invoke a single instance of R via mpirun:

 mpirun -np 1 Rscript test.R

Now mpi.universe.size() is correct. However, I have not been able to create a full-sized cluster of workers, seemingly because one MPI slot is already taken by the first R instance.
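
If you want to experiment with spawning anyway, the classic Rmpi pattern looks like the sketch below; keep the caveat above in mind, as the master occupies one of the slots:

 library(Rmpi)
 # spawn workers into the remaining MPI slots
 mpi.spawn.Rslaves(nslaves = mpi.universe.size() - 1)
 mpi.remote.exec(paste("worker", mpi.comm.rank(), "of", mpi.comm.size()))
 mpi.close.Rslaves()
 mpi.quit()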

Sharing data

MPI workers do not share memory, so you have to copy the necessary data to the workers before you can run any code on the cluster:

 clusterExport(cl, "mydata", envir=environment())

exports the variable mydata in the current environment and

 clusterExport(cl, "myfunc")

exports the function myfunc in the global environment.

Now you can execute your task on the cluster:

 result <- parLapplyLB(cl, 1:100, myfunc)
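
When the work is done, shut the cluster down; with an MPI cluster you should also terminate the MPI layer (note that mpi.quit() quits R as well):

 stopCluster(cl)
 mpi.quit()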

SOCKET Clusters

SOCKET clusters are an easy way to apply parallelism over independent computers like the Hyak nodes. Invoking a socket cluster connects the corresponding nodes over ssh and runs Rscript on those nodes. In my experience SOCKET clusters are unstable; I have gotten the best results using MPI. Otoomet (talk) 10:40, 4 September 2015 (PDT)

Which Nodes Do We Run On?

Unlike on your laptop or lab network, the names of the nodes that the scheduler allocates are not known to us in advance. However, they are saved in a file pointed to by the environment variable PBS_NODEFILE. This file lists each node once for every CPU allocated to us on that node. One can extract the unique node names with the following function:

  getnodes <- function() {
     f <- Sys.getenv("PBS_NODEFILE")
     x <- readLines(f)
     unique(x)
  }

Load modules on the node

However, the ssh connection is not smart enough to load the necessary modules (see Interactive Session above). We can overcome this by creating our own tiny Rscript wrapper. Create the following file:

 #!/bin/bash
 module load r_3.2.0
 Rscript "$@"

A good place to save it is the bin folder in your home directory, and a good name for it is module_rscript. Note that the script must be executable; this can be done by issuing

 chmod +x module_rscript

on the command line. As you see, the script loads the necessary module and then runs the "real" Rscript; "$@" passes along all the arguments.

Now you can create the cluster by

 cl <- makePSOCKcluster(nodes,
       rscript="/usr/lusers/otoomet/bin/module_rscript",
       outfile="worklog.txt")

Here nodes is the list of node names (use the getnodes() function above), rscript points to the wrapper script we just created (note: you have to specify the full path), and outfile is the file where the workers' output is sent.

Note that by default R only allows 128 connections, so you cannot use larger socket clusters.
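
Putting the pieces together, a complete socket-cluster session might look like the sketch below; the wrapper path (with abc again standing for your userid), mydata, and myfunc are placeholders for your own objects:

 library(parallel)
 nodes <- getnodes()
 cl <- makePSOCKcluster(nodes,
       rscript="/usr/lusers/abc/bin/module_rscript",
       outfile="worklog.txt")
 clusterExport(cl, "mydata")
 result <- parLapply(cl, 1:100, myfunc)
 stopCluster(cl)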

Links

The "Programming with Big Data in R" project (pbdR)

The pbdR project provides R APIs for MPI, distributed linear algebra (ScaLAPACK), and parallel NetCDF4.

http://r-pbd.org/