HPC and HTC resilience and efficiency


The objective of this activity is to make advances in the Exascale challenge that HPC is facing, that is, to improve the fault tolerance capabilities and task scheduling of future Exascale supercomputers. The two open issues are closely related, and a fundamental part of their solution is proposed in this line of research: the creation of a checkpoint/restart mechanism capable of migrating the individual tasks that compose parallel jobs within a distributed infrastructure, its integration with the latest resource managers, and its subsequent deployment for increased fault tolerance and more efficient task scheduling.
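
The idea of restarting a task from saved state can be illustrated with a minimal sketch. Note that this is an application-level checkpoint written for illustration only, not the transparent process-level mechanism pursued in this line of research; the function names, the state layout and the simulated failure are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write never corrupts the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return {"step": 0, "total": 0}


def run_task(path, last_step, fail_at=None):
    """Sum the integers 1..last_step, checkpointing after every step.

    fail_at (hypothetical) simulates a node failure by raising before that step runs.
    """
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, last_step + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state = {"step": step, "total": state["total"] + step}
        save_checkpoint(path, state)
    return state["total"]
```

If the checkpoint file lives on storage reachable from another node, calling `run_task` again with the same path resumes the work there instead of starting over, which is the essence of task migration in this simplified setting.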

You can find here a video explaining how the DMTCP checkpointing library has been integrated into Slurm.

This work has already been presented at:
CARLA 2018 (Bucaramanga, Sep 2018)
EASC 2018 (Edinburgh, April 2018)
ISUM 2018 (Mérida, March 2018)
Slurm User Group Meeting (San Francisco, September 2017)
RES User Group Meeting (Santiago de Compostela, September 2017)
CARLA 2017 (Buenos Aires and Colonia del Sacramento, September 2017)
HPC4E project workshop (Rio de Janeiro, July 2017)
HPC4E project website
Slurm User Group Meeting 2 (Athens, September 2016)
Slurm User Group Meeting 1 (Athens, September 2016)
EASC 2016 (Stockholm, April 2016)
ISC-HPC 2015 (Frankfurt, July 2015)
EASC 2015 (Edinburgh, April 2015)
SAC 2015 (Salamanca, April 2015)

GWpilot and GWcloud

As is well known, one way of overcoming Grid overheads is to use pilot-job systems. Large collaborations, such as those in High Energy Physics, usually rely on ad hoc pilot systems that fit their specific needs, but a researcher working alone or outside these collaborations cannot count on a general-purpose application. In this sense, few frameworks really offer the advantages of pilot jobs to conventional users, and those that do usually lack features such as sharing among users, easy installation or standardized Grid interfaces, which may hinder their deployment.

GWpilot is a newly developed general-purpose framework based on GridWay [1]. It offers the functionalities commonly implemented in other pilot systems, such as overcoming remote queues, correctly fitting tasks to pilots and discarding bad resources, and it also coordinates pilot-task matchmaking with advanced scheduling techniques. In addition to command-line tools, the system offers interfaces regarded as Grid standards, which make it suitable for inexperienced users, for running legacy applications and for coupling to high-level systems such as workflow managers.
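
As a rough illustration of what fitting tasks to pilots and discarding bad resources can mean, the following sketch assigns each task to the first healthy pilot with a free slot and enough memory. It is not GWpilot's scheduler: the `Pilot` and `Task` classes, the memory criterion and the `max_failures` threshold are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Pilot:
    name: str
    free_slots: int
    memory_mb: int
    failures: int = 0  # failed tasks observed so far on this pilot


@dataclass
class Task:
    name: str
    memory_mb: int


def match_tasks(tasks, pilots, max_failures=2):
    """Assign each task to the first healthy pilot that can hold it.

    Pilots with max_failures or more recorded failures are discarded.
    Returns (assignments, unmatched); assignments maps task name to pilot name.
    """
    healthy = [p for p in pilots if p.failures < max_failures]
    slots = {p.name: p.free_slots for p in healthy}
    assignments, unmatched = {}, []
    for task in tasks:
        for pilot in healthy:
            if slots[pilot.name] > 0 and pilot.memory_mb >= task.memory_mb:
                assignments[task.name] = pilot.name
                slots[pilot.name] -= 1
                break
        else:
            # No healthy pilot has both a free slot and enough memory.
            unmatched.append(task.name)
    return assignments, unmatched
```

Tasks left in `unmatched` would simply wait for the next scheduling cycle, when new pilots may have started or running ones may have freed a slot.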

GWcloud is its extension to cloud environments.

[1] E. Huedo, R.S. Montero, I.M. Llorente. "The GridWay Framework for Adaptive Scheduling and Execution on Grids". Scalable Computing-Practice and Experience 6, 1 (2005)

This work has already been presented at:
CS-DC'15 (Phoenix, 2015)
PODC-ARMSCC 2015 (San Sebastián, 2015)
EGI TF 2013 (Madrid, September 2013)
CLCAR 2013 (San José, August 2013)
EGI CF 2013 (Manchester, April 2013)
IEEE NSS/MIC 2012 (Los Angeles, October 2012)
EGI TF 2012 (Prague, September 2012)
HPCS 2012 (Madrid, July 2012)


DistributedToolbox

DistributedToolbox is a set of tools devoted to overcoming the problems of executing distributed applications in dynamic environments, so that fault tolerance is greatly enhanced. It incorporates a small API for managing tasks remotely, together with tools to execute them on different platforms, which currently include local clusters and Grid infrastructures. Thanks to the toolbox design, adding new computational platforms and software or middleware characteristics is straightforward.

The core of this work is GridController, a newly created tool for the unattended execution of tasks on Grid infrastructures. It is designed with a focus on reliability: once the user has specified a task to execute, the desired output files will be returned. Based on the GridWay metascheduler [1], GridController automatically detects problems during task execution, overcomes them and completes the desired tasks. A small replication factor is employed to minimize the influence of slow sites on the execution time. DistributedToolbox has executed a large number of jobs over a period of months without errors.
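
The combination of replication and automatic recovery described above can be sketched as follows. This is not GridController's code: `run_reliably`, its parameters and the use of threads to stand in for remote sites are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_reliably(task, sites, replicas=2, max_rounds=3):
    """Run task(site) on `replicas` sites at a time and return the first success.

    Failed replicas are ignored; if a whole round fails, the next sites are tried.
    Raises RuntimeError if no replica succeeds within max_rounds rounds.
    """
    remaining = list(sites)
    for _ in range(max_rounds):
        batch, remaining = remaining[:replicas], remaining[replicas:]
        if not batch:
            break
        pool = ThreadPoolExecutor(max_workers=len(batch))
        try:
            futures = [pool.submit(task, site) for site in batch]
            for future in as_completed(futures):
                if future.exception() is None:
                    return future.result()
        finally:
            # Do not block on replicas that are still running on slower sites.
            pool.shutdown(wait=False)
    raise RuntimeError("task failed on every site tried")
```

Because the first successful replica wins, a slow site only delays the result when every other replica in its round fails, which is the motivation for keeping a small replication factor.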

[1] E. Huedo, R.S. Montero, I.M. Llorente. "The GridWay Framework for Adaptive Scheduling and Execution on Grids". Scalable Computing-Practice and Experience 6, 1 (2005)

This work has already been presented at:
CICT 2014 (Louisville, December 2014)
EGI CF 2013 (Manchester, April 2013)
IberGrid 2013 (Madrid, September 2013)


Montera

Monte Carlo codes constitute a powerful tool for scientific computing. Because of their architecture, their parallelization is straightforward, and they have been successfully ported to the Grid on multiple occasions. However, a deep analysis of how to optimize their execution in a distributed environment is still lacking. To address this issue, Montera (MONTE carlo RApido, Spanish for "Fast Monte Carlo") is a framework that efficiently executes this kind of code on Grid infrastructures, making the most of its particularities.

Monte Carlo codes and Grid sites are characterized so that their behaviour can be modelled; the implementation uses a two-step dynamic scheduling. Montera is coded in Java and employs the DRMAA API to manage the execution of tasks. GridWay [1] is the metascheduler chosen to control the Grid execution of tasks and to provide Montera with additional information about the status of the Grid infrastructure at any given moment.
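
The text above does not detail the two scheduling steps, so the following is only one hypothetical way of using a site characterization: splitting the samples of a Monte Carlo run among sites in proportion to their measured throughput. The function name and the throughput figures are assumptions, and Python is used for brevity even though Montera itself is written in Java.

```python
def split_samples(total_samples, throughput):
    """Split total_samples among sites in proportion to their throughput.

    throughput maps site name to samples per second (hypothetical benchmark values).
    Samples left over after rounding down go to the sites with the largest remainders,
    so the shares always add up to total_samples.
    """
    total_rate = sum(throughput.values())
    if total_rate <= 0:
        raise ValueError("at least one site must have positive throughput")
    exact = {site: total_samples * rate / total_rate for site, rate in throughput.items()}
    shares = {site: int(value) for site, value in exact.items()}
    leftover = total_samples - sum(shares.values())
    by_remainder = sorted(exact, key=lambda site: exact[site] - shares[site], reverse=True)
    for site in by_remainder[:leftover]:
        shares[site] += 1
    return shares
```

A dynamic scheduler could call such a function again whenever the measured throughput of a site changes, so that the samples still pending are redistributed.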

[1] E. Huedo, R.S. Montero, I.M. Llorente. "The GridWay Framework for Adaptive Scheduling and Execution on Grids". Scalable Computing-Practice and Experience 6, 1-8 (2005)

This work has already been presented at:
CLCAR 2013 (San José, August 2013)
I Jornadas CDISC (Trujillo, June 2012)
EGI Community Forum (Garching, March 2012)
PhD Dissertation (Madrid, March 2012)
IEEE NSS MIC (Valencia, October 2011)
EGI Technical Forum (Lyon, September 2011)
Red Gallega Computación Altas Prestaciones (Santiago de Compostela, June 2011)
PDP2011 (Ayia Napa, February 2011)
Tutorial CMM (Santiago de Chile, November 2010)
GISELA KoM (San Luis Potosí, September 2010)
EGEE UF (Uppsala, April 2010)