Checkpointing and Communication Library (CCL)

by Francesco Quaglia and Andrea Santoro

Overview

CCL is a C-library in support of optimistic parallel discrete event simulation (PDES) carried out on computer clusters connected by Myrinet networks. Beyond classical low latency message delivery functionalities, this library implements CPU offloaded, semi-asynchronous checkpointing functionalities based on data transfer capabilities provided by a programmable DMA engine on board of Myrinet network cards. This allows the CPU to carry out other simulation specific tasks (e.g. event excution, LP scheduling, event list management) while checkpointing is in progress. This is the reason why the checkpointing mode supported by CCL is also referred to as non-blocking.

CCL versions up to v2.4 have been developed for LINUX 2.0.32 and the M2M-PCI32C Myrinet network cards. They should work on any kernel 2.2.x and on any Myrinet network card employing the LANai 4 chip. Anyway, we have only tested them on kernel 2.0.32 and M2M-PCI32C Myrinet cards. Theoretically, the library should work also on different hardware architectures than the one it has been developed for (i.e. IA-32), provided that these architectures employ snooping-cache to maintain coherency between data in the cache memory and data read/written from/to main memory.

Versions up to v2.4 support monoprogrammed semi-asynchronous checkpoints, i.e. only one CPU offloaded checkpoint operation can be active at any time. This forces re-synchronization between CPU and DMA activities each time a checkpoint request must be issued at the simulation application level while the last issued one is still being carried out by the DMA engine. We are currently developing v3.0 that, exploiting hardware features of more advanced M3M-PCI64C Myrinet cards, supports multiprogrammed semi-asynchronous checkpoints.


Technical Details

As shown by the Figure below, Myrinet cards typically consists of the following main components Host access to the LANai internal memory takes place through a PCI bridge, which is also used for EBUS DMA data transfer from the host memory to the LANai internal memory and vice versa.

Communication

Communication functionalities are implemented in CCL according to common solutions supporting fast speed messaging on the Myrinet architecture. Messages incoming from the network are temporarily buffered in the LANai internal memory (data transfer between the packet interface and the internal memory takes place through the Receive DMA) and then transferred into the receive queue, located onto host memory, through the EBUS DMA (see the directed dashed line in the Figure). Following another common design choice, any send operation issued by the application involves copying the message content directly into the LANai internal memory. This is also referred to as ``zero-copy'' send. Then the message is transferred onto the network through the Send DMA. This optimization allows keeping the delivery latency at a minimum by avoiding intermediate buffering at the sender side.

Checkpointing

The EBUS DMA is used not only to transfer messages from the LANai internal memory to the receive queue, but also to perform data transfer associated with checkpointing. Specifically, a checkpoint operation involves data transfer from the LP state buffer (located onto host memory) to the stack of the checkpointed states of the LP (also located onto host memory). As shown by the directed dotted lines in the Figure, the transfer operation is charged on the EBUS DMA that uses the LANai internal memory as an intermediate buffer (Intermediate buffering is needed since the EBUS DMA does not support host-memory to host-memory data transfer directly. It only supports host-memory to internal-memory transfer or vice versa.) Any checkpoint operation is split into a sequence of data transfer operations to be performed by the EBUS DMA. Each operation transfers up to a maximum amount of bytes, called burst, from the LP state vector to the LANai internal memory (intermediate buffering) or from the LANai internal memory to the stack of checkpoints of the LP.

Priorities between tasks

Lower priority is assigned to data transfer associated with checkpointing as compared to message transfer into the receive queue. This engineering choice allows checkpointing functionalities offered by CCL to produce negligible interference with communication functionalities, namely the primary task to be carried out by the network card. Specifically, splitting any checkpoint operation into a sequence of bursts allows prompt re-assignment of the hardware resources on board of the Myrinet card (i.e. the EBUS DMA, the PCI bridge and the LBUS) to communication operations due to their higher priority.

Re-synchronization

Re-synchronization between CPU activities and checkpointing activities carried out by the EBUS DMA is used to avoid data inconsistency, and to prevent contention on the Myrinet hardware in case multiprogrammed checkpointing is not supported. Several re-synchronization semantics have been implemented and tested while developing versions of CCL. Presently, the more advanced semantic is called Conditional-Checkpoint-Abort (CCA). The re-synchronization functionality offered by CCL v2.4 and by a prototype implementation of CCL v3.0 supports this semantic. CCA causes freezing of CPU activities only in case at least a threshold fraction of the state vector currently being checkpointed through DMA had been already transferred into the checkpoint buffer. In the opposite case, the checkpoint operation is aborted, with no freezing at all. The threshold fraction can be adjusted at run time in order to optimize performance.


Related Publications

F. Quaglia and A. Santoro,
Nonblocking Checkpointing for Optimistic Parallel Simulation: Description and an Implementation",
IEEE Transactions on Parallel and Distributed Systems, vol.14, no.6, pp.593-610, 2003.

F. Quaglia and A. Santoro,
"Modeling and Optimization of Nonblocking Checkpointing for Optimistic Simulation on Myrinet Clusters" (extended version),
Journal of Parallel and Distributed Computing, vol.65, no.6, pp.667-677, 2005.

F.Quaglia and A.Santoro,
"CCL v3.0: Multiprogrammed Semi-Asynchronous Checkpoints",
Proc. 17th ACM/IEEE/SCS Workshop on Parallel and Distributed Simulation (PADS'03), San Diego, CA (USA), IEEE Computer Society Press, June 2003.

F.Quaglia, A.Santoro and B.Ciciani,
"Conditional Checkpoint Abort: an Alternative Semantic for Re-synchronization in CCL",
Proc. 16th ACM/IEEE/SCS Workshop on Parallel and Distributed Simulation (PADS'02), Washington, DC (USA), IEEE Computer Society Press, May 2002.

F.Quaglia, A.Santoro, and B.Ciciani,
Tuning of the Checkpointing and Communication Library for Optimistic Simulation on Myrinet Based NOWs,
Proc. 9th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'01), Cincinnati (Ohio, USA), IEEE Computer Society Press, August 2001.

A.Santoro and F.Quaglia,
Benefits from Semi-Asynchronous Checkpointing for Time Warp Simulations of a Large State PCS Model,
Proc. 2001 Winter Simulation Conference (WSC'01), Arlington (VA, USA), Society for Computer Simulation, December 2001.

F.Quaglia and A.Santoro,
Semi-Asynchronous Checkpointing for Optimistic Simulation on a Myrinet Based NOW
Proc. 15th ACM/IEEE/SCS Workshop on Parallel and Distributed Simulation (PADS'01), Lake Arrowhead, California (USA), IEEE Computer Society Press, May 2001.


Installation and User's Guide (by version)