Using Existing MPI Codes with AMPI
==================================

Due to the nature of AMPI's virtualized ranks, some changes to existing
MPI codes may be necessary for them to function correctly with AMPI.

Entry Point
-----------

To convert an existing program to use AMPI, the main function or program
may need to be renamed. The changes should be made as follows:

Fortran
~~~~~~~

You must declare the main program as a subroutine called ``MPI_Main``. Do
not declare it as a *program*, because a Fortran program unit will never
be called by the AMPI runtime.

.. code-block:: fortran

   program pgm -> subroutine MPI_Main
       ...                       ...
   end program -> end subroutine

C or C++
~~~~~~~~

The main function can be left as is, if ``mpi.h`` is included before the
main function. This header file has a preprocessor macro that renames
main, and the renamed version is called by the AMPI runtime for each
rank.

Command Line Argument Parsing
-----------------------------

Fortran
~~~~~~~

For parsing Fortran command line arguments, AMPI Fortran programs should
use our extension APIs, which are similar to Fortran 2003’s standard
APIs. For example:

.. code-block:: fortran

   integer :: i, argc, ierr
   integer, parameter :: arg_len = 128
   character(len=arg_len), dimension(:), allocatable :: raw_arguments

   call AMPI_Command_argument_count(argc)
   allocate(raw_arguments(argc))
   do i = 1, size(raw_arguments)
       call AMPI_Get_command_argument(i, raw_arguments(i), arg_len, ierr)
   end do

C or C++
~~~~~~~~

Existing code for parsing ``argc`` and ``argv`` should be sufficient,
provided that it takes place *after* ``MPI_Init``.

Global Variable Privatization
-----------------------------

In AMPI, ranks are implemented as user-level threads that coexist
within OS processes or OS threads, depending on how the Charm++
runtime was built. Traditional MPI
programs assume that each rank has an entire OS process to itself,
and that only one thread of control exists within its address space.
This allows them to safely use global and static variables in their
code. However, global and static variables are problematic for
multi-threaded environments such as AMPI or OpenMP. This is because
there is a single instance of those variables, so they will be shared
among different ranks in the single address space, and this could lead
to the program producing an incorrect result or crashing.

The following code is an example of this problem. Each rank queries its
numeric ID, stores it in a global variable, waits on a global barrier,
and then prints the value that was stored. If this code is run with
multiple ranks virtualized inside one OS process, each rank will store
its ID in the same single location in memory. The result is that all
ranks will print the ID of whichever one was the last to successfully
update that location. For this code to be semantically valid with AMPI,
each rank needs its own separate instance of the variable. This is
where the need arises for some special handling of these unsafe
variables in existing MPI applications, which we call *privatization*.

.. code-block:: c++

  int rank_global;

  void print_ranks(void)
  {
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_global);

    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank: %d\n", rank_global);
  }

The basic transformation needed to port MPI programs to AMPI is
privatization of global and static variables. Module variables, "saved"
subroutine local variables, and common blocks in Fortran90 also belong to
this category. Certain API calls use global variables internally, such as
``strtok`` in the C standard library, and as a result they are also
unsafe. If such a program is executed without privatization on AMPI, all
the AMPI ranks that reside in the same process will access the same
copy of such variables, which is clearly not the desired semantics. Note
that global variables that are constant or are only written to once
during initialization with the same value across all ranks are already
thread-safe.
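
For library calls with hidden global state, a reentrant variant is often
the simplest fix. The following is a minimal sketch, assuming a POSIX
system where ``strtok_r`` is available; the helper name ``count_tokens``
is hypothetical and not part of AMPI. It keeps the tokenizer position in
a caller-owned pointer rather than in storage shared by all ranks in the
process.

.. code-block:: c

   #define _POSIX_C_SOURCE 200809L
   #include <string.h>

   /* Count whitespace-separated tokens without hidden global state.
      strtok_r stores its position in the caller-provided `save`
      pointer, so every caller has independent parsing state. */
   int count_tokens(const char *text)
   {
     char buf[256];
     char *save = NULL;
     int count = 0;

     /* strtok_r modifies its input, so work on a private copy
        (truncated to fit the local buffer). */
     strncpy(buf, text, sizeof(buf) - 1);
     buf[sizeof(buf) - 1] = '\0';

     for (char *tok = strtok_r(buf, " \t\n", &save); tok != NULL;
          tok = strtok_r(NULL, " \t\n", &save))
       count++;

     return count;
   }

Because both ``buf`` and ``save`` live on the caller's stack, nothing in
this function is shared between ranks.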

To ensure AMPI programs execute correctly, it is necessary to make such
variables "private" to individual ranks. We provide several options to
achieve this with varying degrees of portability and required developer
effort.

.. warning::

   If you are writing a new MPI application from scratch and would like
   to support AMPI as a first-class target, it is highly recommended to
   follow certain guidelines for writing your code to avoid the global
   variable problem entirely, eliminating the need for time-consuming
   refactoring or platform-specific privatization methods later on. See
   the Manual Code Editing section below for an example of how to
   structure your code in order to accomplish this.

Manual Code Editing
~~~~~~~~~~~~~~~~~~~

With regard to performance and portability, the ideal approach to resolve
the global variable problem is to refactor your code to avoid use of
globals entirely. However, this comes with the obvious caveat that it
requires developer time to implement and can involve invasive changes
across the entire codebase, similar to converting a shared library to be
reentrant in order to allow multiple instantiations from the same OS
process. If these costs are a significant barrier to entry, it can be
helpful to instead explore one of the simpler transformations or fully
automated methods described below.

We have employed a strategy of argument passing to do this privatization
transformation. That is, the global variables are bunched together in a
single user-defined type, which is allocated by each thread dynamically
or on the stack. Then a pointer to this type is passed from subroutine
to subroutine as an argument. Since the subroutine arguments are passed
on the stack, which is not shared across all threads, each subroutine
when executing within a thread operates on a private copy of the global
variables.

This scheme is demonstrated in the following examples. The original
Fortran90 code contains a module ``shareddata``. This module is used in
the ``MPI_MAIN`` subroutine and a subroutine ``subA``. Note that
``PROGRAM PGM`` was renamed to ``SUBROUTINE MPI_MAIN`` and ``END PROGRAM``
was renamed to ``END SUBROUTINE``.

.. code-block:: fortran

   !FORTRAN EXAMPLE
   MODULE shareddata
     INTEGER :: myrank
     DOUBLE PRECISION :: xyz(100)
   END MODULE

   SUBROUTINE MPI_MAIN                               ! Previously PROGRAM PGM
     USE shareddata
     include 'mpif.h'
     INTEGER :: i, ierr
     CALL MPI_Init(ierr)
     CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
     DO i = 1, 100
       xyz(i) =  i + myrank
     END DO
     CALL subA
     CALL MPI_Finalize(ierr)
   END SUBROUTINE                                    ! Previously END PROGRAM

   SUBROUTINE subA
     USE shareddata
     INTEGER :: i
     DO i = 1, 100
       xyz(i) = xyz(i) + 1.0
     END DO
   END SUBROUTINE

.. code-block:: c++

   //C Example
   #include <mpi.h>

   int myrank;
   double xyz[100];

   void subA();
   int main(int argc, char** argv){
     int i;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     for(i=0;i<100;i++)
       xyz[i] = i + myrank;
     subA();
     MPI_Finalize();
   }

   void subA(){
     int i;
     for(i=0;i<100;i++)
       xyz[i] = xyz[i] + 1.0;
   }

AMPI runs the main routine as an ordinary subroutine inside a user-level
thread, once per rank.

Now we transform this program using the argument passing strategy. We
first group the shared data into a user-defined type.

.. code-block:: fortran

   !FORTRAN EXAMPLE
   MODULE shareddata
     TYPE chunk ! modified
       INTEGER :: myrank
       DOUBLE PRECISION :: xyz(100)
     END TYPE ! modified
   END MODULE

.. code-block:: c++

   //C Example
   struct shareddata{
     int myrank;
     double xyz[100];
   };

Now we modify the main subroutine to dynamically allocate this data and
change the references to it. Subroutine ``subA`` is then modified to
take this data as an argument.

.. code-block:: fortran

   !FORTRAN EXAMPLE
   SUBROUTINE MPI_Main
     USE shareddata
     USE AMPI
     INTEGER :: i, ierr
     TYPE(chunk), pointer :: c ! modified
     CALL MPI_Init(ierr)
     ALLOCATE(c) ! modified
     CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
     DO i = 1, 100
       c%xyz(i) =  i + c%myrank ! modified
     END DO
     CALL subA(c)
     CALL MPI_Finalize(ierr)
   END SUBROUTINE

   SUBROUTINE subA(c)
     USE shareddata
     TYPE(chunk) :: c ! modified
     INTEGER :: i
     DO i = 1, 100
       c%xyz(i) = c%xyz(i) + 1.0 ! modified
     END DO
   END SUBROUTINE

.. code-block:: c++

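   //C Example
   #include <stdlib.h>
   #include <mpi.h>

   // struct shareddata is defined as shown above
   void subA(struct shareddata *c);

   int main(int argc, char** argv){
     int i;
     struct shareddata *c;
     MPI_Init(&argc, &argv);
     c = (struct shareddata*)malloc(sizeof(struct shareddata)); // modified
     MPI_Comm_rank(MPI_COMM_WORLD, &c->myrank); // modified
     for(i=0;i<100;i++)
       c->xyz[i] = i + c->myrank; // modified
     subA(c); // modified
     free(c);
     MPI_Finalize();
     return 0;
   }

   void subA(struct shareddata *c){ // modified
     int i;
     for(i=0;i<100;i++)
       c->xyz[i] = c->xyz[i] + 1.0; // modified
   }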

With these changes, the above program can be made thread-safe. Note that
it is not really necessary to dynamically allocate ``chunk``. One could
have declared it as a local variable in subroutine ``MPI_Main``. (Or for
a small example such as this, one could have just removed the
``shareddata`` module, and instead declared both variables ``xyz`` and
``myrank`` as local variables). This is indeed a good idea if shared
data are small in size. For large shared data, it would be better to do
heap allocation because in AMPI, the stack sizes are fixed at the
beginning (and can be specified from the command line) and stacks do not
grow dynamically.
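
The stack-allocated alternative mentioned above might look like the
following sketch. To keep it compilable without MPI, the rank is passed
in as a plain integer; the function ``run_rank`` is a hypothetical
stand-in for the body of the main routine after ``MPI_Comm_rank``.

.. code-block:: c

   struct shareddata {
     int myrank;
     double xyz[100];
   };

   void subA(struct shareddata *c)
   {
     for (int i = 0; i < 100; i++)
       c->xyz[i] = c->xyz[i] + 1.0;
   }

   /* Hypothetical stand-in for the main routine of one rank. */
   double run_rank(int rank)
   {
     struct shareddata c;  /* local variable on this rank's own stack */

     c.myrank = rank;
     for (int i = 0; i < 100; i++)
       c.xyz[i] = i + c.myrank;
     subA(&c);             /* pass the private data by pointer */
     return c.xyz[99];
   }

Here ``run_rank(3)`` returns ``103.0``. The structure occupies roughly
800 bytes, which is small enough for a rank's stack; much larger data
would be better placed on the heap, as described above.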

PIEglobals: Automatic Position-Independent Executable Runtime Relocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Position-Independent Executable (PIE) Globals allows fully automatic
privatization of global variables on GNU/Linux systems without
modification of user code. All languages (C, C++, Fortran, etc.) are
supported. Runtime migration, load balancing, checkpointing, and SMP
mode are all fully supported.

This method works by combining a specific method of building binaries
with GNU extensions to the dynamic linker. First, AMPI's toolchain
wrapper compiles your user program as a Position-Independent Executable
(PIE) and links it against a special shim of function pointers instead
of the normal AMPI runtime. It then builds a small loader utility that
links directly against AMPI. This loader dynamically opens the PIE
binary after the AMPI runtime is fully initialized. The glibc
extension ``dl_iterate_phdr`` is called before and after the ``dlopen``
call in order to determine the location of the PIE binary's code and
data segments in memory. This is useful because PIE binaries locate the
data segment containing global variables immediately after the code
segment so that they are accessed relative to the instruction pointer.
The PIEglobals loader makes a copy of the code and data segments for
each AMPI rank in the job via the Isomalloc allocator, thereby
privatizing their global state. It then constructs a synthetic function
pointer for each rank at its new locations and calls it.
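
As a rough illustration of the segment discovery step, the sketch below
uses ``dl_iterate_phdr`` to walk the objects loaded into the current
process and count their ``PT_LOAD`` segments. It is not the PIEglobals
loader itself, which compares the results before and after ``dlopen``;
the names ``count_load_cb`` and ``count_load_segments`` are hypothetical.
It assumes a GNU/Linux system with glibc.

.. code-block:: c

   #define _GNU_SOURCE
   #include <link.h>
   #include <stddef.h>

   /* Called once per loaded object (executable or shared library). */
   static int count_load_cb(struct dl_phdr_info *info, size_t size,
                            void *data)
   {
     int *count = (int *)data;
     (void)size;

     for (int i = 0; i < info->dlpi_phnum; i++) {
       /* A PT_LOAD segment is mapped at dlpi_addr + p_vaddr and
          spans p_memsz bytes; code and data are such segments. */
       if (info->dlpi_phdr[i].p_type == PT_LOAD)
         (*count)++;
     }
     return 0;  /* zero tells dl_iterate_phdr to continue */
   }

   int count_load_segments(void)
   {
     int count = 0;
     dl_iterate_phdr(count_load_cb, &count);
     return count;
   }

A loader could record the same information (base address, segment
offsets, and sizes) for a newly opened binary in order to know which
memory ranges to copy for each rank.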

To use PIEglobals in your AMPI program, compile and link with the
``-pieglobals`` parameter:

.. code-block:: bash

   $ ampicxx -o example.o -c example.cpp -pieglobals
   $ ampicxx -o example example.o -pieglobals

No further effort is needed. Global variables in ``example.cpp`` will be
automatically privatized when the program is run. Any libraries and
shared objects compiled as PIE will also be privatized. However, if
these objects call MPI functions, it will be necessary to build them
with the AMPI toolchain wrappers, ``-pieglobals``, and potentially also
the ``-standalone`` parameter in the case of shared objects. It is
recommended to do this in any case so that AMPI can ensure everything is
built as PIE.

One important caveat is that the relocated code segments are opaque to
runtime debuggers such as GDB and LLDB because debug symbols are not
translated to their new location in memory. For this reason it is
recommended to perform as much development and debugging as possible in
non-virtualized mode so the program can be debugged normally. One
facility provided to assist in debugging with virtualization is the
``pieglobalsfind`` function. This can be called at runtime to translate
a privatized address back to its original location as allocated by the
system's runtime linker, thereby associating it with any debug symbols
included in the binary. In GDB, the command takes the form
``call pieglobalsfind((void *)0x...)``. It can be useful to directly
pass in the instruction pointer as an argument, such as
``call pieglobalsfind($rip)`` on x86_64.

TLSglobals: Automatic Thread-Local Storage Swapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thread-Local Storage (TLS) was originally employed in kernel threads to
localize variables to threads and provide thread safety. It can be used
by annotating global/static variable declarations in C++ with
*thread_local*, in C with *__thread* or, as of C11, with *thread_local*
or *_Thread_local*, and in Fortran with OpenMP’s *threadprivate*
attribute. OpenMP is required for using TLSglobals in Fortran code since
Fortran has no other method of using TLS. The *__thread* keyword is not
an official extension of the C language, though compiler writers are
encouraged to implement this feature.

TLSglobals handles both global and static variables and has very little
context-switching overhead. AMPI provides runtime support for privatizing
thread-local variables to user-level threads by changing the TLS segment
register when context switching between user-level threads. The runtime
overhead is that of changing a single pointer per user-level thread
context switch. Currently, Charm++ supports it for x86/x86_64 platforms
when using GNU or LLVM compilers, as well as macOS on all supported
architectures.
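
The effect of a thread-local annotation can be observed outside AMPI with
ordinary OS threads. The sketch below is only an analogy: it uses
pthreads, whereas AMPI swaps the TLS pointer between user-level threads.
The names ``worker`` and ``tls_demo`` are hypothetical, and a GNU or LLVM
compiler is assumed for ``__thread``.

.. code-block:: c

   #include <pthread.h>

   static __thread int myrank;  /* one instance per thread */

   static void *worker(void *arg)
   {
     myrank = *(int *)arg;      /* writes only this thread's copy */
     return NULL;
   }

   /* Returns the caller's myrank after another thread wrote its own. */
   int tls_demo(void)
   {
     pthread_t t;
     int other = 7;

     myrank = 100;
     if (pthread_create(&t, NULL, worker, &other) != 0)
       return -1;
     pthread_join(t, NULL);
     return myrank;             /* unaffected by the worker's write */
   }

Calling ``tls_demo()`` returns ``100``. Without the ``__thread``
qualifier, the worker's write would be visible to the caller and the
function would return ``7``.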

.. code-block:: c++

   // C/C++ example:
   int myrank;
   double xyz[100];

.. code-block:: fortran

   ! Fortran example:
   integer :: myrank
   real*8, dimension(100) :: xyz

For the example above, the following changes to the code handle the
global variables:

.. code-block:: c++

   // C++ example:
   thread_local int myrank;
   thread_local double xyz[100];

   // C example:
   __thread int myrank;
   __thread double xyz[100];

.. code-block:: fortran

   ! Fortran example:
   integer :: myrank
   real*8, dimension(100) :: xyz
   !$omp threadprivate(myrank)
   !$omp threadprivate(xyz)

The ``-tlsglobals`` flag must also be passed at both compile and link
time so that AMPI knows TLSglobals is in use:

.. code-block:: bash

   $ ampicxx -o example example.C -tlsglobals

PiPglobals: Automatic Process-in-Process Runtime Linking Privatization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Process-in-Process (PiP) [PiP2018]_ Globals allows fully automatic
privatization of global variables on GNU/Linux systems without
modification of user code. All languages (C, C++, Fortran, etc.) are
supported. This method currently lacks support for checkpointing and
migration, which are necessary for load balancing and fault tolerance.
Additionally, overdecomposition is limited to approximately 12 virtual
ranks per logical node, though this can be resolved by building a
patched version of glibc.

As with PIEglobals, this method compiles your user program as a
Position-Independent Executable (PIE) and links it against a special
shim of function pointers. A small loader utility calls the
glibc-specific function ``dlmopen`` on the PIE binary with a unique
namespace index. The loader uses ``dlsym`` to populate the PIE binary's
function pointers and then it calls the entry point. This ``dlmopen``
and ``dlsym`` process repeats for each rank. As soon as execution jumps
into the PIE binary, any global variables referenced within will appear
privatized. This is because PIE binaries locate the global data segment
immediately after the code segment so that PIE global variables are
accessed relative to the instruction pointer, and because ``dlmopen``
creates a separate copy of these segments in memory for each unique
namespace index.

Optionally, the first step in using PiPglobals is to build PiP-glibc to
overcome the limitation on rank count per process. Use the instructions
at https://github.com/RIKEN-SysSoft/PiP/blob/pip-1/INSTALL.md to download
an installable PiP package or build PiP-glibc from source by following
the ``Patched GLIBC`` section. AMPI may be able to automatically detect
PiP's location if installed as a package, but otherwise set and export
the environment variable ``PIP_GLIBC_INSTALL_DIR`` to the value of
``<GLIBC_INSTALL_DIR>`` as used in the above instructions. For example:

.. code-block:: bash

   $ export PIP_GLIBC_INSTALL_DIR=~/pip

To use PiPglobals in your AMPI program (with or without PiP-glibc),
compile and link with the ``-pipglobals`` parameter:

.. code-block:: bash

   $ ampicxx -o example.o -c example.cpp -pipglobals
   $ ampicxx -o example example.o -pipglobals

No further effort is needed. Global variables in ``example.cpp`` will be
automatically privatized when the program is run. Any libraries and
shared objects compiled as PIE will also be privatized. However, if
these objects call MPI functions, it will be necessary to build them
with the AMPI toolchain wrappers, ``-pipglobals``, and potentially also
the ``-standalone`` parameter in the case of shared objects. It is
recommended to do this in any case so that AMPI can ensure everything is
built as PIE.

Potential future support for checkpointing and migration will require
modification of the ``ld-linux.so`` runtime loader to intercept mmap
allocations of the previously mentioned segments and redirect them
through Isomalloc. The present lack of support for these features means
PiPglobals is best suited for testing AMPI during exploratory phases
of development, and for production jobs not requiring load balancing or
fault tolerance.

FSglobals: Automatic Filesystem-Based Runtime Linking Privatization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filesystem Globals (FSglobals) was discovered during the development of
PiPglobals and the two are highly similar. Like PiPglobals, it
requires no modification of user code and works with any language.
It also currently lacks support for checkpointing and migration,
preventing use of load balancing and fault tolerance. Unlike PiPglobals,
it is portable beyond GNU/Linux and has no limits to overdecomposition
beyond available disk space.

FSglobals works in the same way as PiPglobals except that instead of
specifying namespaces using ``dlmopen``, which is a GNU/Linux-specific
feature, this method creates copies of the user's PIE binary on the
filesystem for each rank and calls the POSIX-standard ``dlopen``.

To use FSglobals, compile and link with the ``-fsglobals`` parameter:

.. code-block:: bash

   $ ampicxx -o example.o -c example.cpp -fsglobals
   $ ampicxx -o example example.o -fsglobals

No additional steps are required. Global variables in ``example.cpp``
will be automatically privatized when the program is run. Variables in
statically linked libraries will also be privatized if compiled as PIE.
It is recommended to achieve this by building with the AMPI toolchain
wrappers and ``-fsglobals``, and this is necessary if the libraries call
MPI functions. Shared objects are currently not supported by FSglobals
due to the extra overhead of iterating through all dependencies and
copying each one per rank while avoiding system components, plus the
complexity of ensuring each rank's program binary sees the proper set of
objects.

This method's use of the filesystem is a drawback in that it is slow
during startup and can be considered wasteful. Additionally, support for
load balancing and fault tolerance would require further development in
the future, using the same infrastructure as what PiPglobals would
require. For these reasons FSglobals is best suited for the R&D phase
of AMPI program development and for small jobs, and it may be less
suitable for large production environments.

Swapglobals: Automatic Global Offset Table Swapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks to the ELF Object Format, we have successfully automated the
procedure of switching the set of user global variables when switching
thread contexts. Executable and Linkable Format (ELF) is a common
standard file format for Object Files in Unix-like operating systems.
ELF maintains a Global Offset Table (GOT) for globals so it is possible
to switch GOT contents at thread context-switch by the runtime system.

The only thing that the user needs to do is pass the flag
``-swapglobals`` at both compile and link time (e.g.
``ampicc -o prog prog.c -swapglobals``). This method does not require any changes to the
source code and works with any language (C, C++, Fortran, etc). However,
it does not handle static variables, has a context switching overhead
that grows with the number of global variables, and is incompatible with
SMP builds of AMPI, where multiple virtual ranks can execute
simultaneously on different scheduler threads within an OS process.

Currently, this feature only works on x86 and x86_64 platforms that
fully support ELF, and it requires ld version 2.23 or older, or else a
patched version of ld 2.24+ that we provide here:
https://charm.cs.illinois.edu/gerrit/gitweb?p=libbfd-patches.git;a=tree;f=swapglobals

For these reasons, and because more robust privatization methods are
available, swapglobals is considered deprecated.

Source-to-Source Transformation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One final approach is to use a tool to transform your program's source
code, implementing the changes described in one of the sections above in
an automated fashion.

We have multiple tools for automating these transformations for different
languages.
Currently, there is a tool called *Photran*
(http://www.eclipse.org/photran) for refactoring Fortran codes
that can do this transformation. It is Eclipse-based and works by
constructing Abstract Syntax Trees (ASTs) of the program.
We also have a tool built with *LLVM/LibTooling* that applies the
TLSglobals transformation to C/C++ codes, available upon request.

Summary
~~~~~~~

:numref:`tab:portability` shows the portability of the different
privatization schemes.

.. _tab:portability:
.. table:: Portability of current implementations of the privatization schemes. "Yes" means we have implemented this technique. "Maybe" indicates there are no theoretical problems, but no implementation exists. "No" indicates the technique is impossible on this platform.

   ==================== ===== ====== ======= === ====== ===== =====
   Privatization Scheme Linux Mac OS Windows x86 x86_64 PPC   ARM7
   ==================== ===== ====== ======= === ====== ===== =====
   Manual Code Editing  Yes   Yes    Yes     Yes Yes    Yes   Yes
   PIEglobals           Yes   No     No      Yes Yes    Yes   Yes
   TLSglobals           Yes   Yes    Maybe   Yes Yes    Maybe Maybe
   PiPglobals           Yes   No     No      Yes Yes    Yes   Yes
   FSglobals            Yes   Yes    Yes     Yes Yes    Yes   Yes
   Swapglobals          Yes   No     No      Yes Yes    Yes   Yes
   ==================== ===== ====== ======= === ====== ===== =====