
You can process data in parallel using the Python package dask.

Here is a tutorial:

https://github.com/blaze/dask-tutorial/blob/master/README.md

Install your own copy of Anaconda.

See the "Install Anaconda Python" section on the following page:

Hyak python programming

Run the following command to verify that you are using your own copy of Anaconda Python:

which python
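
The same check can be made from inside Python. The snippet below is a minimal sketch, not part of the original instructions; the helper name interpreter_info is hypothetical. It prints the path of the running interpreter, which should point into your own Anaconda install rather than a system location.

```python
import sys


def interpreter_info():
    """Return the path and version of the running Python interpreter.

    Hypothetical helper, introduced here only for illustration.
    """
    return sys.executable, sys.version.split()[0]


path, version = interpreter_info()
print("Python executable: {}".format(path))
print("Python version: {}".format(version))
```

If the printed path is not inside your own Anaconda directory, the shell is picking up a different Python first.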

 

Choose one of the following two ways to install dask.

Install dask using the Anaconda command conda

1. Start an interactive session on a build node:
   srun -p build --time=2:00:00 --mem=100G --pty /bin/bash
2. Create a copy of the Anaconda install in the .conda directory in your home directory:
   conda create -n my_root --clone=/sw/anaconda-2.4.0
3. Activate your clone of Anaconda:
   source activate my_root
4. Install dask in your clone of Anaconda:
   conda install dask

Later, whenever you want to use dask from Python at the command line or in your Slurm scripts, first run the following command:
source activate my_root

Install dask using pip

1. Start an interactive session on a build node:
   srun -p build --time=2:00:00 --mem=100G --pty /bin/bash
2. Install dask in your home directory:
   pip install dask --user
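
After either installation method, it may help to confirm that Python can actually import dask. The sketch below is an illustration only; the helper name installed_dask_version is hypothetical. Run it in the same shell you installed from, with my_root activated if you used conda.

```python
def installed_dask_version():
    """Return the installed dask version string, or None if dask is missing.

    Hypothetical helper, introduced here only for illustration.
    """
    try:
        import dask
    except ImportError:
        return None
    return dask.__version__


version = installed_dask_version()
if version is None:
    print("dask is not importable from this Python")
else:
    print("dask version: {}".format(version))
```

If dask is reported as not importable, check with "which python" that the shell is using the Python you installed dask into.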

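Sample dask program:

import dask.array as da
# Lazily define a 20000 x 20000 array of normal samples (mean 10, std 0.1), split into 1000 x 1000 chunks
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
# Column means, keeping every 100th column; nothing is computed yet
y = x.mean(axis=0)[::100]
# compute() runs the chunked calculation in parallel and returns a numpy array
print(y.compute())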

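Sample numpy program that does the same calculation as the dask program above:

import numpy as np
# Build the full 20000 x 20000 array in memory (about 3.2 GB of float64 values)
x = np.random.normal(10, 0.1, size=(20000, 20000))
# Column means, keeping every 100th column
y = x.mean(axis=0)[::100]
print(y)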
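
Because the two sample programs above draw different random numbers, their outputs cannot be compared directly. The following sketch feeds the same numpy data to both code paths and checks that the results agree. The smaller array size, the 500 x 500 chunks, and the fixed seed are assumptions chosen here so that the check runs quickly; they are not values from the original samples.

```python
import numpy as np
import dask.array as da

# Small illustrative array (sizes and seed are assumptions, not from the samples).
rng = np.random.RandomState(0)
x_np = rng.normal(10, 0.1, size=(2000, 2000))

# Wrap the same data as a dask array split into 500 x 500 chunks.
x_da = da.from_array(x_np, chunks=(500, 500))

# Building the expression is lazy: no means are computed at this point.
y_lazy = x_da.mean(axis=0)[::100]

# compute() executes the task graph and returns a plain numpy array.
y_da = y_lazy.compute()

# The same calculation done directly in numpy.
y_np = x_np.mean(axis=0)[::100]

results_match = bool(np.allclose(y_da, y_np))
print(y_da.shape, results_match)
```

The comparison uses np.allclose rather than exact equality because dask combines per-chunk partial results, which can differ from the single-pass numpy result by floating-point rounding.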
