The CRC offers subsidized, low-cost, networked data storage on and off campus to research faculty in every discipline. We facilitate access to these resources for you and your research group, and help you pair them with our computing resources to work with, and securely back up, your large datasets more efficiently.
Cloud Storage
Rice contracts with Box to provide Rice Box, which gives you secure, scalable storage in the cloud. This offering is no longer unlimited, and data transfer rates are controlled by the provider, but the cap is quite high and the throughput meets most researchers' needs.
We also contract with Amazon to set up automatic disaster-recovery backups of researchers' data in Amazon Glacier, on a case-by-case basis.
For more information on how to make the most of these services, reach out to us.
The RDF Isilon
Basics:
Our on-campus Research Data Facility (RDF) is a networked Isilon storage appliance which can be accessed from anywhere on the Rice network.
Research faculty are eligible for 500 GB of university-subsidized storage, which they can share with members of their team. Usage above this limit is billed at a low, cost-recovery rate. Options include regular backups to Amazon Glacier.
If you are connected to the campus network, you can mount the appliance as a network share from Windows, macOS, or Linux using the SMB file sharing protocol.
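On Linux, the mount can also be scripted. The sketch below calls mount.cifs from Python; it is a minimal illustration, not an official recipe. The server name, share name, and mount point are hypothetical placeholders (the CRC will give you your group's actual path), and it assumes the cifs-utils package and sudo rights on your machine.

    # Minimal SMB mount sketch for Linux; all names below are placeholders.
    import getpass
    import subprocess

    SERVER = "rdf.example.rice.edu"   # hypothetical RDF hostname
    SHARE = "mylab"                   # hypothetical share name
    MOUNT_POINT = "/mnt/rdf"
    NETID = getpass.getuser()         # or hard-code your Rice NetID

    subprocess.run(["sudo", "mkdir", "-p", MOUNT_POINT], check=True)
    subprocess.run(
        [
            "sudo", "mount", "-t", "cifs",
            f"//{SERVER}/{SHARE}", MOUNT_POINT,
            "-o", f"username={NETID},vers=3.0",  # prompts for your password
        ],
        check=True,
    )

On macOS and Windows, the built-in "Connect to Server" and "Map network drive" dialogs accomplish the same thing.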
Leveraging your RDF data:
The real power of the RDF comes from being situated in the Rice network. Because the Isilon is housed in our primary data center, it is easy to make efficient use of your data on our other research computing infrastructure.
Use Case A: Staging data for supercomputing jobs (RDF Isilon + Globus FTP + NOTS)
Rice maintains specialized data transfer infrastructure to bring large datasets onto network resources at speeds that exceed those available on the commercial internet. Read more about our Globus FTP service.
While this means you can bring big data (like genomic datasets) in directly from national resources to the supercomputing clusters, users frequently want to keep a backup copy of their input data alongside their output data, and the cluster is not the best place to store these datasets for more than a couple of weeks.
In order to preserve backups of supercomputing inputs and outputs while minimizing transfer times, researchers will therefore frequently use a workflow like this:
- Use Globus FTP to ship input data to the RDF Isilon
- Use Globus FTP to copy that data from the RDF Isilon to the NOTS supercomputing cluster
- Run the supercomputing job on NOTS
- Ship the output data from NOTS back to the RDF Isilon
At this point, researchers have a clean backup of their inputs and outputs for later analysis, archiving, and/or publication.
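Transfers like these can be driven from the Globus web app, or scripted with the Globus Python SDK (pip install globus-sdk). The sketch below stages an input directory from the RDF to NOTS; the client ID, endpoint UUIDs, and paths are hypothetical placeholders you would look up in the Globus app, and the exact scope names may vary with your SDK version.

    # Hedged sketch: stage data between two Globus endpoints.
    import globus_sdk

    CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"    # hypothetical
    RDF_ENDPOINT = "RDF-ENDPOINT-UUID"         # hypothetical
    NOTS_ENDPOINT = "NOTS-ENDPOINT-UUID"       # hypothetical

    # One-time interactive login to obtain a transfer token.
    auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
    auth.oauth2_start_flow(requested_scopes=globus_sdk.scopes.TransferScopes.all)
    print("Log in at:", auth.oauth2_get_authorize_url())
    code = input("Paste the authorization code here: ").strip()
    tokens = auth.oauth2_exchange_code_for_tokens(code)
    token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(token)
    )

    # Queue a recursive copy of the input directory (paths are placeholders).
    task = globus_sdk.TransferData(
        tc, RDF_ENDPOINT, NOTS_ENDPOINT, label="stage inputs"
    )
    task.add_item("/projects/mylab/inputs/", "/scratch/mynetid/inputs/", recursive=True)
    print("Submitted task:", tc.submit_transfer(task)["task_id"])

The same pattern, with source and destination swapped, ships outputs back to the RDF after the job completes.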
Use Case B: Pre- and post-processing (NOTS + RDF Isilon + ORION)
Sometimes, supercomputing jobs on our NOTS cluster involve intermediate steps that are inefficient to perform on the cluster, such as cleaning data beforehand or analyzing results after a run. Researchers sometimes use the RDF to store work before and after their run (the scp steps are sketched after this list):
- Import the data to the RDF Isilon, using Globus
- Perform pre-processing work between the RDF and ORION
- Ship the data to NOTS via your ORION VM, using scp
- Run your job on NOTS
- Ship the output back to the RDF via your ORION VM, using scp
- Perform post-processing work between the RDF and ORION
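The two scp steps are easy to script from the VM. This sketch uses Python's subprocess module; the login node address, NetID, and directories are hypothetical placeholders for your own.

    # Hedged sketch of the scp staging steps, run from an ORION VM.
    import subprocess

    NETID = "mynetid"                    # hypothetical
    NOTS = f"{NETID}@nots.rice.edu"      # hypothetical login node address
    RDF_MOUNT = "/mnt/rdf/mylab"         # hypothetical RDF mount on the VM

    def scp(src: str, dst: str) -> None:
        """Recursive copy with scp; raises if the transfer fails."""
        subprocess.run(["scp", "-r", src, dst], check=True)

    # Ship pre-processed inputs from the RDF mount to NOTS scratch space...
    scp(f"{RDF_MOUNT}/cleaned_inputs", f"{NOTS}:/scratch/{NETID}/")
    # ...and, once the job finishes, pull the outputs back to the RDF.
    scp(f"{NOTS}:/scratch/{NETID}/outputs", f"{RDF_MOUNT}/")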
Workflows like this have several advantages in a remote work environment. All of the work can be done without your large dataset ever taking up space on your local machine. And by keeping your work entirely on the Rice network, you minimize your dependence on commercial internet infrastructure, see a marked improvement in processing and data transfer times, and let our infrastructure do more of your work in the background without clogging up your personal computer.
Use Case C: Medium-sized computing jobs (RDF Isilon + ORION)
The virtual machines on our private cloud, ORION, can access your RDF data share as a mounted file system, just like your laptop can. Users with workloads that are too large for their laptop but not big enough for the supercomputers will often follow this loop (sketched below):
- Mount their RDF fileshare on an ORION VM
- Load a chunk of their data onto the VM for analysis
- Export the results to the RDF
- Repeat
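Here is a minimal sketch of that loop, assuming the share is mounted at a hypothetical /mnt/rdf path and the data arrives as CSV files; the analysis step is a stand-in for your own.

    # Hedged sketch: process RDF data chunk by chunk on an ORION VM.
    from pathlib import Path
    import shutil

    RDF_DATA = Path("/mnt/rdf/mylab/data")        # hypothetical paths
    RDF_RESULTS = Path("/mnt/rdf/mylab/results")
    LOCAL = Path("/home/mynetid/workdir")         # fast local disk on the VM

    def analyze(chunk: Path) -> str:
        # Placeholder analysis: count the rows in one chunk.
        with chunk.open() as f:
            return f"{chunk.name}: {sum(1 for _ in f)} rows\n"

    LOCAL.mkdir(parents=True, exist_ok=True)
    RDF_RESULTS.mkdir(parents=True, exist_ok=True)
    for remote_chunk in sorted(RDF_DATA.glob("*.csv")):
        local_chunk = LOCAL / remote_chunk.name
        shutil.copy2(remote_chunk, local_chunk)   # load one chunk onto the VM
        result = analyze(local_chunk)
        (RDF_RESULTS / f"{remote_chunk.stem}.txt").write_text(result)  # export
        local_chunk.unlink()                      # free local space, repeat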
This solution has its limits -- file I/O over the mounted share is not fast enough to support things like database connections. But for medium-sized workloads, ORION and the RDF make a very good team.
The use cases above are covered extensively in CRC webinars, and we frequently consult with researchers to build custom solutions that make the most of their subsidized storage on the Rice network. Reach out to us to schedule a consultation.