Checkpointing and Communication Library (CCL)
Overview
CCL is a C-library in support of optimistic parallel discrete event
simulation (PDES) carried out on computer clusters connected by Myrinet networks.
Beyond classical low latency message delivery functionalities, this library
implements CPU offloaded, semi-asynchronous checkpointing functionalities based
on data transfer capabilities provided by a programmable DMA engine on board of
Myrinet network cards. This allows the CPU to carry out other simulation specific tasks
(e.g. event excution, LP scheduling, event list management) while checkpointing
is in progress. This is the reason why the checkpointing mode supported by CCL
is also referred to as non-blocking.
CCL versions up to v2.4 have been developed for LINUX 2.0.32 and the
M2M-PCI32C Myrinet network cards. They should work on any kernel 2.2.x and on
any Myrinet network card employing the LANai 4 chip. Anyway, we have only
tested them on kernel 2.0.32 and M2M-PCI32C Myrinet cards. Theoretically, the
library should work also on different hardware architectures than the one it
has been developed for (i.e. IA-32), provided that these architectures employ
snooping-cache to maintain coherency between data in the cache memory and data
read/written from/to main memory.
Versions up to v2.4 support monoprogrammed semi-asynchronous checkpoints,
i.e. only one CPU offloaded checkpoint operation can be active at any time.
This forces re-synchronization between CPU and DMA activities each time a
checkpoint request must be issued at the simulation application level while the
last issued one is still being carried out by the DMA engine. We are currently
developing v3.0 that, exploiting hardware features of more advanced M3M-PCI64C
Myrinet cards, supports multiprogrammed semi-asynchronous checkpoints.
Technical Details
As shown by the Figure below, Myrinet cards typically consists of the following
main components
- An internal bus, namely LBUS (Local BUS)
- A programmable RISC
processor connected to the LBUS, referred to as LANai processor.
- A RAM
bank (LANai internal memory), connected to the LBUS,
which can be mapped into the memory address space of the
host.
- A packet interface between the Myrinet switch and the LANai
chip, accessible by the LANai processor.
- Three DMA engines used for:
- Packet-interface/internal-memory transfer (Receive DMA)
- Internal-memory/packet-interface transfer (Send DMA)
- Internal-memory/host-memory transfer or vice-versa (EBUS DMA, namely
External Bus DMA)
Host access to the LANai internal memory takes place through a PCI
bridge, which is also used for EBUS DMA data transfer from the host
memory to the LANai internal memory and vice versa.
Communication
Communication functionalities are implemented in CCL according to
common solutions supporting fast speed messaging on the Myrinet
architecture. Messages incoming from the network are
temporarily buffered in the LANai internal memory (data transfer
between the packet interface and the internal memory takes place
through the Receive DMA) and then transferred into the receive queue,
located onto host memory, through the EBUS DMA (see the directed
dashed line in the Figure).
Following another common design choice, any send
operation issued by the application involves copying the message
content directly into the LANai internal memory. This is also referred
to as ``zero-copy'' send. Then the message is transferred onto the
network through the Send DMA. This optimization allows keeping the
delivery latency at a minimum by avoiding intermediate buffering at
the sender side.
Checkpointing
The EBUS DMA is used not only to transfer messages from the LANai
internal memory to the receive queue, but also to perform data
transfer associated with checkpointing. Specifically, a checkpoint
operation involves data transfer from the LP state buffer (located
onto host memory) to the stack of the checkpointed states of the LP
(also located onto host memory). As shown by the directed dotted
lines in the Figure, the transfer operation is charged on the
EBUS DMA that uses the LANai internal memory as an intermediate buffer
(Intermediate buffering is needed since the EBUS DMA does
not support host-memory to host-memory data transfer directly. It only
supports host-memory to internal-memory transfer or vice versa.)
Any checkpoint operation is split into a sequence of data transfer operations
to be performed by the EBUS DMA. Each operation transfers up to a maximum
amount of bytes, called burst, from the LP state vector to the LANai
internal memory (intermediate buffering) or from the LANai internal memory to
the stack of checkpoints of the LP.
Priorities between tasks
Lower priority is assigned to data
transfer associated with checkpointing as compared to message transfer into the
receive queue. This engineering choice allows checkpointing functionalities
offered by CCL to produce negligible interference with communication
functionalities, namely the primary task to be carried out by the network
card. Specifically, splitting any checkpoint operation into a sequence of
bursts allows prompt re-assignment of the hardware resources on board of the
Myrinet card (i.e. the EBUS DMA, the PCI bridge and the LBUS) to communication
operations due to their higher priority.
Re-synchronization
Re-synchronization between CPU activities and
checkpointing activities carried out by the EBUS DMA is used to avoid data
inconsistency, and to prevent contention on the Myrinet hardware in case
multiprogrammed checkpointing is not supported. Several re-synchronization
semantics have been implemented and tested while developing versions of
CCL. Presently, the more advanced semantic is called
Conditional-Checkpoint-Abort (CCA). The re-synchronization functionality
offered by CCL v2.4 and by a prototype implementation of CCL v3.0 supports this
semantic. CCA causes freezing of CPU activities only in case at
least a threshold fraction of the state vector currently being
checkpointed through DMA had been already transferred into the
checkpoint buffer. In the opposite case, the checkpoint operation is
aborted, with no freezing at all. The threshold fraction can be adjusted at run
time in order to optimize performance.
Related Publications
F. Quaglia and A. Santoro,
Nonblocking Checkpointing for Optimistic Parallel Simulation: Description and an Implementation",
IEEE Transactions on Parallel and Distributed Systems, vol.14, no.6, pp.593-610, 2003.
F. Quaglia and A. Santoro,
"Modeling and Optimization of Nonblocking Checkpointing for Optimistic Simulation on Myrinet Clusters" (extended version),
Journal of Parallel and Distributed Computing, vol.65, no.6, pp.667-677, 2005.
F.Quaglia and A.Santoro,
"CCL v3.0: Multiprogrammed Semi-Asynchronous Checkpoints",
Proc. 17th ACM/IEEE/SCS Workshop on Parallel and Distributed Simulation (PADS'03), San Diego, CA (USA), IEEE Computer Society Press, June 2003.
F.Quaglia, A.Santoro and B.Ciciani,
"Conditional Checkpoint Abort: an Alternative Semantic for Re-synchronization in CCL",
Proc. 16th ACM/IEEE/SCS Workshop on Parallel and Distributed Simulation (PADS'02), Washington, DC (USA), IEEE Computer Society Press, May 2002.
F.Quaglia, A.Santoro, and B.Ciciani,
Tuning of the
Checkpointing and Communication Library for Optimistic Simulation on Myrinet
Based NOWs,
Proc. 9th IEEE International Symposium on Modeling,
Analysis and Simulation of Computer and Telecommunication
Systems (MASCOTS'01),
Cincinnati (Ohio, USA), IEEE Computer Society Press, August 2001.
A.Santoro and F.Quaglia,
Benefits from
Semi-Asynchronous Checkpointing for Time Warp Simulations of a Large State PCS
Model,
Proc. 2001 Winter Simulation Conference (WSC'01), Arlington (VA, USA), Society for Computer
Simulation, December 2001.
F.Quaglia and A.Santoro,
Semi-Asynchronous Checkpointing for Optimistic Simulation on a Myrinet Based
NOW
Proc. 15th ACM/IEEE/SCS Workshop on Parallel and Distributed
Simulation (PADS'01), Lake Arrowhead,
California (USA), IEEE Computer Society Press, May 2001.
Installation and User's Guide (by version)