Rice Supercomputing (NOTS)

You have developed a process that works in principle but takes an inordinately long time to run on your dataset.

Sometimes a large virtual machine is enough to do the trick. Rebuilding your environment on a cloud machine that has more CPUs and RAM than your personal computer may well give you the performance increase you need to solve your problem.

However, researchers quite frequently find that these performance increases taper off quickly as they simply add resources to a VM. If a job is taking a very long time to run (days, weeks, months), it can make more sense to move to a supercomputing paradigm.

In supercomputing workflows, your dataset is broken into smaller chunks, and these chunks are distributed among many worker nodes, which are either individual CPUs or small clusters. Depending on the way the job is designed, these worker nodes either check in with each other regularly, or they simply churn through their assigned work independently until it is done.
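As a rough illustration of the "independent workers" case, the sketch below splits a dataset into near-equal chunks, processes each chunk on its own, and combines the partial results. The names `split_into_chunks` and `process_chunk` are hypothetical and chosen only for this example, and the chunks are processed one after another here; on a cluster, each chunk would be handed to a separate worker.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], n_workers: int) -> List[List[T]]:
    """Split items into n_workers contiguous chunks whose sizes differ by at most one."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(len(items), n_workers)
    chunks: List[List[T]] = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional item.
        size = base + (1 if worker < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def process_chunk(chunk: Sequence[int]) -> int:
    """Hypothetical stand-in for real work: sum the squares of the chunk."""
    return sum(x * x for x in chunk)


dataset = list(range(10))
chunks = split_into_chunks(dataset, n_workers=3)

# Each chunk is independent of the others, so no worker needs to
# "check in" with another before the partial results are combined.
partial_results = [process_chunk(chunk) for chunk in chunks]
total = sum(partial_results)
```

Because the chunks do not depend on one another, the combined result is the same as processing the whole dataset in one pass.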

The CRC has access to on-campus and cloud resources to help you make the shift to supercomputing, as well as years of experience helping researchers in this area. There is a learning curve, but the payoff can be enormous: performance increases of two to three orders of magnitude.

We offer one-on-one consulting, application support, and group and class workshops to teach you, your graduate students, and even your undergraduate students how to use supercomputing resources on campus, in the cloud, and on national resources. Contact us to schedule a consultation or workshop.

Examples:

Social Sciences: You are testing a model that has numerous inputs. Each run takes approximately 10 minutes, and you decide to test on a range of values that will ultimately result in 1,000 outputs -- this will take about a week to run. But by breaking your parameter values up into chunks, you could distribute this work over many, many processors and finish the simulation in a fraction of the time. In fact, with those time savings, you might decide to test on a broader range of values, or with a finer granularity!
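The arithmetic behind the Social Sciences example can be checked with a short calculation. This is a minimal sketch under idealized assumptions: every run takes the same time, runs are fully independent, and queue wait and start-up overhead are ignored. The function name and the worker count of 100 are illustrative only.

```python
import math


def estimated_wall_time_minutes(n_runs: int, minutes_per_run: float, n_workers: int) -> float:
    """Idealized wall time: each worker handles its share of runs back to back."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    runs_per_worker = math.ceil(n_runs / n_workers)
    return runs_per_worker * minutes_per_run


# 1,000 runs at 10 minutes each on a single processor.
serial_minutes = estimated_wall_time_minutes(1000, 10, n_workers=1)
serial_days = serial_minutes / (60 * 24)

# The same sweep spread over 100 workers (an assumed, illustrative count).
parallel_minutes = estimated_wall_time_minutes(1000, 10, n_workers=100)

print(f"serial: {serial_minutes:.0f} minutes (about {serial_days:.1f} days)")
print(f"100 workers: {parallel_minutes:.0f} minutes")
```

Under these assumptions the serial sweep takes 10,000 minutes, or roughly a week, while 100 workers finish in 100 minutes. Real jobs will see less than this ideal speedup once scheduling and data movement are included.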

Bioengineering: You have been working with a piece of protein analysis software, and getting very promising results. Now you want to greatly expand your test dataset, but it looks as though this will take weeks or months to run on your local machine, at the scale you are considering. By moving your inputs and software to a supercomputing cluster, you can run 10, 50, or 100 of your analyses concurrently, allowing you to very quickly get your data back and repeat with another set of proteins.

We have extensive documentation on the use of supercomputing resources at Rice. Please visit the CRC's NOTS documentation pages on Knowledge Base for up-to-date documentation and examples.

Some core concepts:

NOTS: This is Rice's unified, on-campus supercomputing cluster. The system consists of 294 dual-socket compute blades housed within HPE s6500, HPE Apollo 2000, and Dell PowerEdge C6400 chassis. All the nodes are interconnected with a 10 or 25 Gigabit Ethernet network. In addition, the Apollo and C6400 chassis are connected with high-speed Omni-Path for message-passing applications. There is a 160TB Lustre filesystem attached to the compute nodes via Ethernet. The system can support various workloads including single-node, parallel, large-memory multithreaded, and GPU jobs.

Job scheduling: Typically, supercomputing is not performed interactively. Instead of logging into a machine, loading your data, and executing your code, you learn to use a job scheduler. You request resources, and the system executes the job for you when those resources become available. You get an email when the job starts, and when it finishes. Rice uses the job scheduler SLURM for its on-campus supercomputing cluster, NOTS.
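To make the request-then-wait model concrete, the sketch below builds the text of a small SLURM batch script in Python. The `#SBATCH` options shown (`--job-name`, `--ntasks`, `--time`, `--mail-type`, `--mail-user`) are standard SLURM options, but the helper `render_batch_script`, the job name, the email address, and the command are all hypothetical. NOTS-specific settings such as partitions or accounts are deliberately left out; consult the CRC's NOTS documentation for those.

```python
def render_batch_script(job_name: str, time_limit: str, email: str,
                        command: str, ntasks: int = 1) -> str:
    """Return the text of a minimal SLURM batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",       # resources requested from the scheduler
        f"#SBATCH --time={time_limit}",     # wall-clock limit, HH:MM:SS
        "#SBATCH --mail-type=BEGIN,END",    # email when the job starts and finishes
        f"#SBATCH --mail-user={email}",
        "",
        command,                            # the work the scheduler runs for you
    ]
    return "\n".join(lines) + "\n"


# All values below are placeholders, not NOTS defaults.
script = render_batch_script(
    job_name="param-sweep",
    time_limit="00:30:00",
    email="you@example.edu",
    command="python run_model.py",
)
print(script)
```

A script like this would typically be saved to a file and submitted with `sbatch`, after which the scheduler starts the job once the requested resources are free.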

Interactive jobs and on-demand clusters: In some rare cases (usually involving graphical processing), researchers need both supercomputing-scale resources and the ability to control their jobs live. The CRC has resources that enable such work in its NOTS cluster, and cloud providers have introduced technology that lets their servers quickly spin up resources that you can control in real time. These specialized jobs can be quite difficult to set up efficiently -- or, put another way, quite easy to spend a lot of money on! We strongly encourage researchers to consult with us to determine how to get the most out of their research budget.