
Hyak fully supports R: anything that you can do with R on your desktop computer can be done on hyak, but at a much larger scale. However, since hyak is a large shared resource, there are a few steps that you have to take before running R.

Hyak is a Linux supercomputer. If you have not used the Linux, UNIX, or macOS command line before, it is a good idea to get familiar with it before running on hyak. Google "linux command line" for useful links. You should also have experience with R programming and R scripts on your desktop computer before you try running R on a supercomputer like hyak.

Usually, you will use sbatch to submit a slurm script to the hyak scheduler. The slurm script contains the instructions for running your R script; the scheduler takes care of details like finding a suitable node to run it on.

Interactive Session

First, however, let us use an interactive session to get familiar with running R on hyak. Issue the command below at the hyak login node to get an interactive session on a build node.

The build node can connect to hosts outside of mox.hyak or ikt.hyak. It is useful for using git, transferring files to or from outside hyak, installing packages in R or Python, etc.

To get an interactive build node for 2 hours:

srun -p build --time=2:00:00 --mem=10G --pty /bin/bash

This opens an interactive session on a build node.

If we type R at the command line and press enter, we get the error
"R: command not found". What's wrong? Unlike your desktop computer, hyak supports hundreds of users with different requirements and different versions of the same software. Hence, unlike on your desktop computer, not all software executables are put in everyone's PATH environment variable.

Those fluent in Linux may want to put the location of R in their own PATH. However, this should be done with the module command instead. It performs essentially the same task by modifying PATH, LD_LIBRARY_PATH, and other necessary parts of your environment.

The command below shows the available modules:

 module avail

The list is long, and right now we are only interested in the latest version of R (3.5.1 as of November 2018). Hence we issue the command:

 module load  r_3.5.1

Now the command R works.

 

After working for some time, you may want to know which modules are loaded in your environment. You can list them with the command below:

 module list

After you have completed your work, you can remove the module from your environment with the command below.

 module unload  r_3.5.1

Installing CRAN packages on hyak

As hyak is a shared resource, CRAN packages cannot be installed in their default system locations. Below are the steps to install CRAN packages to a location that you specify. Here UVW is the package name, XYZ is your hyak group name, and abc is your hyak userid.

  1. Make a directory where you will install the R packages. For example:

    mkdir /gscratch/XYZ/abc/rpackages

    (If your group shares the same packages, you could use a path like /gscratch/XYZ/rpackages. You can also use a path like /sw/contrib/XYZ/rpackages.)

  2. If you have not done so already, see above for the command to get a build node.

  3. On the build node, issue the command below to load the R module:

    module load r_3.5.1

    Some packages need extra compiler flags to find system headers. For instance, when installing the CRAN package Rglpk, first issue the commands below (an R-level alternative is sketched after this list):

    export PKG_CFLAGS="-I/usr/include/glpk"
    export PKG_CPPFLAGS="-I/usr/include/glpk"
  4. Choose one of the following steps (a) or (b):

    (a) If you do not want to specify the location of the library, start the R command line and issue the command below:

    install.packages("UVW")

    The command will ask you to choose a mirror. Choose a nearby mirror and press enter.
    R will then ask whether you want to install the packages at the default location below in your home directory:

    ~/R/x86_64-unknown-linux-gnu-library/{R_VERSION}

    Enter "y" and your package will be installed at above location.
    Now in your R scripts and at the R command line you only need to use below command

    library(UVW) 

    (b) If you want to specify the location of the library, start the R command line and issue the command below:

    install.packages("UVW", lib="/gscratch/XYZ/abc/rpackages") 

    The command will ask you to choose a mirror. Choose a nearby mirror and press enter. The package UVW will be built and installed at /gscratch/XYZ/abc/rpackages.
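As an aside, the compiler flags from step 3 can also be set from inside R before calling install.packages(); Sys.setenv() exports them to the child process that builds the package. A minimal sketch, reusing the Rglpk flags from above:

 # set the build flags from within R instead of the shell
 Sys.setenv(PKG_CFLAGS  = "-I/usr/include/glpk",
            PKG_CPPFLAGS = "-I/usr/include/glpk")
 install.packages("Rglpk", lib="/gscratch/XYZ/abc/rpackages")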

Once the above is done, whenever you want to use the package UVW, load it in R with:

 library(UVW, lib.loc="/gscratch/XYZ/abc/rpackages")

You may not want to give the lib.loc parameter every time. In that case, add the line below to your .bashrc file:

 export R_LIBS="/gscratch/XYZ/abc/rpackages"

Now you can load the library with just:

 library(UVW)
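To verify that R is picking up your custom library (assuming R_LIBS was set before R started), you can inspect the library search path at the R prompt:

 .libPaths()   # your rpackages directory should appear in this list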

Updating CRAN packages on hyak

To update R packages, use the command below at the R prompt:

 update.packages()

This will update the packages that you installed, as well as the recommended packages. It will not update the base packages. For the difference between base and recommended packages, see Stack Overflow.
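If your packages live in a custom library as described above, you can point update.packages() at that library explicitly; a sketch, assuming the path from the previous section:

 # update packages in the custom library, without prompting for each one
 update.packages(lib.loc="/gscratch/XYZ/abc/rpackages", ask=FALSE)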

Running R using slurm scripts and sbatch

Elsewhere on this wiki ( Mox_scheduler) you will find details about submitting slurm scripts via sbatch. Here we will focus on the R-specific part. Your slurm script should contain lines like the ones below. Note that if your script produces graphs, then you should save them to files using the usual R commands, as sketched after the snippet.

 module load  r_3.5.1
 Rscript >output.txt 2>&1 /gscratch/XYZ/abc/myscript.R
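Since a batch job has no display attached, any graph must be written to a file from inside the script. A minimal sketch of what the plotting part of myscript.R could look like (the output file name and data here are hypothetical):

 # open a file-based graphics device, since compute nodes have no screen
 pdf("/gscratch/XYZ/abc/myplot.pdf")
 plot(rnorm(100), main = "Example plot")
 dev.off()   # close the device so the file is written out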

Installing R from source code


You may want to install the latest version of R.

Run the command below to check whether you have Anaconda Python in your PATH:

which python

You should get /usr/bin/python.

If you get Anaconda Python because you put it in your PATH in your .bashrc, then remove it from your PATH, log out of mox.hyak, log back in, and get a build node.

If you get Anaconda Python because you loaded an Anaconda module, then unload that module.

Now the command "which python" should give /usr/bin/python.

Note that you must remove Anaconda from your PATH because it contains old versions of libcurl, and the R installation needs newer versions of libcurl (which already exist on mox).

Below, xyz is your group name and abc is your userid. The steps to install R from source are:

 srun -p build --time=2:00:00 --mem=100G --pty /bin/bash


mkdir /gscratch/xyz/abc/Rstuff

mkdir /gscratch/xyz/abc/Rinstall

cd /gscratch/xyz/abc/Rstuff

(The wget command below will get the source code for R 3.5.0. If you want the latest version of R, go to https://cran.cnr.berkeley.edu/, right-click on the "latest release" link, and copy the link location to get the latest version.)


wget  https://cran.cnr.berkeley.edu/src/base/R-3/R-3.5.0.tar.gz

tar -xvf R-3.5.0.tar.gz

cd R-3.5.0

./configure --prefix=/gscratch/xyz/abc/Rinstall

make

make install

After this you can put /gscratch/xyz/abc/Rinstall/bin in your PATH environment variable:

export PATH=/gscratch/xyz/abc/Rinstall/bin:$PATH

Run the command below to verify that you have R 3.5.0:

R --version

 


Advanced (parallel R using MPI)

 

The rest of this page is advanced material for using R over MPI.

Compiling MPI packages

In order to compile MPI packages (like Rmpi and pbdMPI), you have to load an MPI support module and use an MPI-aware compiler.

Compilers with MPI support have names ending in -ompi_<version>; for instance,

 module load icc_14.0.3-ompi_1.8.3

will do. Note that later you have to load exactly the same module when running R.

Next, you have to tell R to use the MPI-aware compiler; otherwise it will complain about missing MPI headers. This can be achieved by specifying

 CC=mpicc

in the environment, for instance in the R compiler configuration file ~/.R/Makevars.

Now you can install the package:

 install.packages("pbdMPI", dep=TRUE)

Some packages, in particular Rmpi, may not find the necessary headers and libraries. One can help the configure script by setting the MPI_ROOT environment variable:

 MPI_ROOT="/sw/openmpi-1.8.3_icc-14.0.3" R CMD INSTALL Rmpi_0.6-5.tar.gz

where the folder pointed to by MPI_ROOT must correspond to the compiler you are using.

Running R over MPI using slurm scripts and sbatch

Submitting slurm scripts via sbatch works exactly as in the earlier section "Running R using slurm scripts and sbatch". If you use MPI on ikt.hyak, your slurm script should instead contain lines like the ones below:

 module load r_3.2.0 icc_14.0.3-ompi_1.8.3
 mpirun --mca mtl mx --mca pml cm --bind-to core --map-by core Rscript >output.txt 2>&1 /gscratch/XYZ/abc/myscript.R

Note that we redirect both R output and error messages to output.txt here.

Executing Parallel Tasks

R allows easy parallel processing via the parallel package and other similar packages.

Multicore Parallelism

On a single node (where all processor cores share the same memory), it is most efficient to use mclapply or fork clusters:

 library(parallel)
 cl <- makeForkCluster(detectCores())    # one forked worker per CPU core
 result <- parLapply(cl, 1:100, myfunc)  # apply myfunc over 1:100 in parallel
 stopCluster(cl)                         # shut the workers down

This code creates a fork cluster on a single node by "forking" the running R process. The whole environment is shared (copy-on-write), so no copying of data or starting of new R instances is necessary. detectCores() detects the number of CPU cores (it usually works well enough). Don't forget to stop the cluster afterwards!
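If you prefer not to manage a cluster object at all, mclapply() from the same parallel package gives equivalent fork-based parallelism in one call; a sketch, with myfunc standing in for your own function:

 library(parallel)
 # forks one worker per core; workers inherit the current environment
 result <- mclapply(1:100, myfunc, mc.cores = detectCores())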


MPI parallelism

MPI is a popular and efficient way to run tasks in parallel. It is supported in R through CRAN packages such as Rmpi and pbdMPI. You can use the MPI approach in two ways: either you start a swarm of R processes, connected via an MPI communicator, simultaneously via mpirun, or you invoke a single instance of R and later spawn a swarm of workers.

If you invoke R through mpirun in your slurm script, the scheduler tells mpirun which nodes and cores the code should run on. By default, mpirun starts as many instances of your script as there are allocated cores. For instance, the jobscript line

 mpirun Rscript test.R

will start multiple instances of your "test.R" script. This may be what you want, for instance if you are using the same-code, different-data paradigm with the pbdMPI package. However, be careful with operations that use shared resources, such as printing or saving. In particular, ensure that multiple processes are not writing into the same file; otherwise you get both errors and garbled data. This also applies to the R command's normal quit behavior, which saves the workspace: you should either use Rscript instead or invoke R with the --no-save option.
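To make the same-code, different-data pattern concrete, here is a minimal pbdMPI sketch (a hypothetical test.R); every instance runs the same file, and comm.rank() tells each one which part of the work is its own:

 library(pbdMPI)
 init()                              # initialize MPI
 me   <- comm.rank()                 # this instance's rank: 0, 1, 2, ...
 size <- comm.size()                 # total number of instances
 comm.print(sprintf("rank %d of %d", me, size), all.rank=TRUE)
 finalize()                          # shut down MPI cleanly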

If you prefer to start only one instance of your script and spawn the workers later using the Rmpi library, invoke R without mpirun:

 Rscript test.R

Only one instance of your script is started, but MPI is still set up across the allocated processors. Note, though, that mpi.universe.size() is now wrong and equal to 1. Below, N is the number of cores in your job.

 library(Rmpi)
 library(snow)

 cl <- makeMPIcluster(N, includemaster=TRUE)

Now you can use parallel code via the cluster cl.
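When you are done, remember to shut things down; a sketch of the cleanup, assuming the cluster cl from above:

 stopCluster(cl)   # stop the snow workers
 mpi.quit()        # shut down MPI and exit R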

You can also invoke a single instance of R via mpirun:

 mpirun -np 1 Rscript test.R

Now mpi.universe.size() is correct. However, I have not been able to create a full-sized cluster of workers this way, as seemingly one MPI slot is already taken by the first R instance.

Sharing data

MPI workers do not share memory, so you have to copy the necessary data to the workers before you can run any code on the cluster:

 clusterExport(cl, "mydata", envir=environment())

exports the variable mydata in the current environment and

 clusterExport(cl, "myfunc")

exports the function myfunc in the global environment.

Now you can execute your task on the cluster:

 result <- parLapplyLB(cl, 1:100, myfunc)
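Putting the pieces together, here is a minimal end-to-end sketch under the same setup as above (N, mydata, and myfunc are placeholders for your own core count, data, and function):

 library(Rmpi)
 library(snow)

 mydata <- rnorm(1000)                        # hypothetical shared data
 myfunc <- function(i) mean(mydata) + i       # hypothetical worker function

 cl <- makeMPIcluster(N, includemaster=TRUE)  # N = number of cores in your job
 clusterExport(cl, "mydata")                  # copy the data to every worker
 result <- parLapplyLB(cl, 1:100, myfunc)     # load-balanced parallel lapply
 stopCluster(cl)                              # shut the workers down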

Links

The "Programming with Big Data in R" project (pbdR)

The pbdR project provides R APIs for MPI, distributed linear algebra (scaLAPACK) and parallel netcdf4.

http://r-pbd.org/

 
