3.3. Using Existing MPI Codes with AMPI

Due to the nature of AMPI’s virtualized ranks, some changes to existing MPI codes may be necessary for them to function correctly with AMPI.

3.3.1. Entry Point

To convert an existing program to use AMPI, the main function or program may need to be renamed. The changes should be made as follows:

3.3.1.1. Fortran

You must declare the main program as a subroutine called “MPI_MAIN”. Do not declare the main subroutine as a program because it will never be called by the AMPI runtime.

program pgm -> subroutine MPI_Main
    ...                       ...
end program -> end subroutine

3.3.1.2. C or C++

The main function can be left as-is, provided that mpi.h is included before the definition of main. This header contains a preprocessor macro that renames main, and the renamed version is called by the AMPI runtime for each rank.
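
For instance, a minimal sketch of a C program that needs no changes beyond including mpi.h before main is defined:

// A minimal sketch: no changes to main are needed as long as mpi.h is
// included before main is defined, since the header renames main via a macro.
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  /* ... application code ... */
  MPI_Finalize();
  return 0;
}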

3.3.2. Command Line Argument Parsing

3.3.2.1. Fortran

To parse command line arguments, AMPI Fortran programs should use our extension APIs, which mirror the Fortran 2003 standard intrinsics. For example:

integer :: i, argc, ierr
integer, parameter :: arg_len = 128
character(len=arg_len), dimension(:), allocatable :: raw_arguments

call AMPI_Command_argument_count(argc)
allocate(raw_arguments(argc))
do i = 1, size(raw_arguments)
    call AMPI_Get_command_argument(i, raw_arguments(i), arg_len, ierr)
end do

3.3.2.2. C or C++

Existing code for parsing argc and argv should be sufficient, provided that it takes place after MPI_Init.
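
For example, a minimal sketch of the usual idiom, with parsing placed after MPI_Init:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  // Ordinary argc/argv handling, placed after MPI_Init as required.
  for (int i = 1; i < argc; i++)
    printf("argument %d: %s\n", i, argv[i]);

  MPI_Finalize();
  return 0;
}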

3.3.3. Global Variable Privatization

In AMPI, ranks are implemented as user-level threads that coexist within OS processes or OS threads, depending on how the Charm++ runtime was built. Traditional MPI programs assume that each rank has an entire OS process to itself, and that only one thread of control exists within its address space. This allows them to safely use global and static variables in their code. However, global and static variables are problematic for multi-threaded environments such as AMPI or OpenMP. This is because there is a single instance of those variables, so they will be shared among different ranks in the single address space, and this could lead to the program producing an incorrect result or crashing.

The following code is an example of this problem. Each rank queries its numeric ID, stores it in a global variable, waits on a global barrier, and then prints the value that was stored. If this code is run with multiple ranks virtualized inside one OS process, each rank will store its ID in the same single location in memory. The result is that all ranks will print the ID of whichever one was the last to successfully update that location. For this code to be semantically valid with AMPI, each rank needs its own separate instance of the variable. This is where the need arises for some special handling of these unsafe variables in existing MPI applications, which we call privatization.

#include <mpi.h>
#include <stdio.h>

int rank_global;

void print_ranks(void)
{
  MPI_Comm_rank(MPI_COMM_WORLD, &rank_global);

  MPI_Barrier(MPI_COMM_WORLD);

  printf("rank: %d\n", rank_global);
}

The basic transformation needed to port MPI programs to AMPI is privatization of global and static variables. Module variables, “saved” subroutine local variables, and common blocks in Fortran90 also belong to this category. Certain API calls use global variables internally, such as strtok in the C standard library, and as a result they are also unsafe. If such a program is executed without privatization on AMPI, all the AMPI ranks that reside in the same process will access the same copy of such variables, which is clearly not the desired semantics. Note that global variables that are constant or are only written to once during initialization with the same value across all ranks are already thread-safe.
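
As an illustration of the library-call case, strtok keeps its parsing position in hidden static state, so two ranks tokenizing strings in the same process can interfere with each other. The following sketch shows the reentrant alternative, strtok_r, which keeps that state in a caller-supplied pointer:

#include <stdio.h>
#include <string.h>

void tokenize(char* line)
{
  // strtok stores its position in static state shared by all ranks in the
  // process; strtok_r keeps it in the caller-provided 'saveptr' instead.
  char* saveptr;
  for (char* tok = strtok_r(line, " ", &saveptr); tok != NULL;
       tok = strtok_r(NULL, " ", &saveptr))
    printf("token: %s\n", tok);
}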

To ensure AMPI programs execute correctly, it is necessary to make such variables “private” to individual ranks. We provide several options to achieve this with varying degrees of portability and required developer effort.

Warning

If you are writing a new MPI application from scratch and would like to support AMPI as a first-class target, it is highly recommended to follow certain guidelines for writing your code to avoid the global variable problem entirely, eliminating the need for time-consuming refactoring or platform-specific privatization methods later on. See the Manual Code Editing section below for an example of how to structure your code in order to accomplish this.

3.3.3.1. Manual Code Editing

With regard to performance and portability, the ideal approach to resolve the global variable problem is to refactor your code to avoid use of globals entirely. However, this comes with the obvious caveat that it requires developer time to implement and can involve invasive changes across the entire codebase, similar to converting a shared library to be reentrant in order to allow multiple instantiations from the same OS process. If these costs are a significant barrier to entry, it can be helpful to instead explore one of the simpler transformations or fully automated methods described below.

We have employed a strategy of argument passing to perform this privatization transformation. That is, the global variables are gathered into a single user-defined type, which each thread allocates dynamically or on the stack. A pointer to this type is then passed from subroutine to subroutine as an argument. Since subroutine arguments are passed on the stack, which is not shared across threads, each subroutine executing within a thread operates on a private copy of the formerly global variables.

This scheme is demonstrated in the following examples. The original Fortran90 code contains a module shareddata. This module is used in the MPI_MAIN subroutine and a subroutine subA. Note that PROGRAM PGM was renamed to SUBROUTINE MPI_MAIN and END PROGRAM was renamed to END SUBROUTINE.

!FORTRAN EXAMPLE
MODULE shareddata
  INTEGER :: myrank
  DOUBLE PRECISION :: xyz(100)
END MODULE

SUBROUTINE MPI_MAIN                               ! Previously PROGRAM PGM
  USE shareddata
  include 'mpif.h'
  INTEGER :: i, ierr
  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  DO i = 1, 100
    xyz(i) =  i + myrank
  END DO
  CALL subA
  CALL MPI_Finalize(ierr)
END SUBROUTINE                                    ! Previously END PROGRAM

SUBROUTINE subA
  USE shareddata
  INTEGER :: i
  DO i = 1, 100
    xyz(i) = xyz(i) + 1.0
  END DO
END SUBROUTINE
//C Example
#include <mpi.h>

int myrank;
double xyz[100];

void subA();
int main(int argc, char** argv){
  int i;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  for(i=0;i<100;i++)
    xyz[i] = i + myrank;
  subA();
  MPI_Finalize();
}

void subA(){
  int i;
  for(i=0;i<100;i++)
    xyz[i] = xyz[i] + 1.0;
}

AMPI executes the main function or subroutine of each rank inside a user-level thread.

Now we transform this program using the argument passing strategy. We first group the shared data into a user-defined type.

!FORTRAN EXAMPLE
MODULE shareddata
  TYPE chunk ! modified
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END TYPE ! modified
END MODULE
//C Example
struct shareddata{
  int myrank;
  double xyz[100];
};

Now we modify the main subroutine to dynamically allocate this data and update the references accordingly. Subroutine subA is then modified to take this data as an argument.

!FORTRAN EXAMPLE
SUBROUTINE MPI_Main
  USE shareddata
  USE AMPI
  INTEGER :: i, ierr
  TYPE(chunk), pointer :: c ! modified
  CALL MPI_Init(ierr)
  ALLOCATE(c) ! modified
  CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
  DO i = 1, 100
    c%xyz(i) =  i + c%myrank ! modified
  END DO
  CALL subA(c)
  CALL MPI_Finalize(ierr)
END SUBROUTINE

SUBROUTINE subA(c)
  USE shareddata
  TYPE(chunk) :: c ! modified
  INTEGER :: i
  DO i = 1, 100
    c%xyz(i) = c%xyz(i) + 1.0 ! modified
  END DO
END SUBROUTINE
//C Example
#include <mpi.h>
#include <stdlib.h>

void subA(struct shareddata *c);

int main(int argc, char** argv){
  int i;
  struct shareddata *c;
  MPI_Init(&argc, &argv);
  c = (struct shareddata*)malloc(sizeof(struct shareddata));
  MPI_Comm_rank(MPI_COMM_WORLD, &c->myrank);
  for(i=0;i<100;i++)
    c->xyz[i] = i + c->myrank;
  subA(c);
  MPI_Finalize();
}

void subA(struct shareddata *c){
  int i;
  for(i=0;i<100;i++)
    c->xyz[i] = c->xyz[i] + 1.0;
}

With these changes, the above program is thread-safe. Note that it is not strictly necessary to dynamically allocate chunk: one could instead declare it as a local variable in subroutine MPI_Main (or, for a small example such as this, simply remove the shareddata module and declare both xyz and myrank as local variables). This is a good approach when the shared data are small. For large shared data, heap allocation is preferable because in AMPI the stack sizes are fixed at startup (and can be specified from the command line) and stacks do not grow dynamically.
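
For example, a sketch of the C version with the chunk placed on the stack instead of the heap, reusing struct shareddata and subA from the transformed example above:

//C Example (stack-allocated variant)
#include <mpi.h>

int main(int argc, char** argv){
  int i;
  struct shareddata c;              /* local (stack) instead of malloc */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &c.myrank);
  for(i=0;i<100;i++)
    c.xyz[i] = i + c.myrank;
  subA(&c);                         /* subA still receives a pointer */
  MPI_Finalize();
  return 0;
}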

3.3.3.2. PIEglobals: Automatic Position-Independent Executable Runtime Relocation

Position-Independent Executable (PIE) Globals allows fully automatic privatization of global variables on GNU/Linux systems without modification of user code. All languages (C, C++, Fortran, etc.) are supported. Runtime migration, load balancing, checkpointing, and SMP mode are all fully supported.

This method works by combining a particular way of building binaries with GNU extensions to the dynamic linker. First, AMPI’s toolchain wrapper compiles your user program as a Position-Independent Executable (PIE) and links it against a special shim of function pointers instead of the normal AMPI runtime. It then builds a small loader utility that links directly against AMPI. This loader dynamically opens the PIE binary after the AMPI runtime is fully initialized. The glibc extension dl_iterate_phdr is called before and after the dlopen call in order to determine the location of the PIE binary’s code and data segments in memory. This is useful because PIE binaries locate the data segment containing global variables immediately after the code segment so that they are accessed relative to the instruction pointer. The PIEglobals loader makes a copy of the code and data segments for each AMPI rank in the job via the Isomalloc allocator, thereby privatizing their global state. It then constructs a synthetic function pointer into each rank’s relocated copy and calls it.
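
The sketch below is not the PIEglobals implementation itself; it only illustrates how a dl_iterate_phdr callback can report the load address of each object’s segments, which is how a loader can compare the set of mappings before and after dlopen of the PIE binary:

#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

// Callback invoked once per loaded object; prints each loadable segment.
static int report_segments(struct dl_phdr_info* info, size_t size, void* data)
{
  for (int i = 0; i < info->dlpi_phnum; i++) {
    const ElfW(Phdr)* phdr = &info->dlpi_phdr[i];
    if (phdr->p_type == PT_LOAD)
      printf("%s: segment at %p, size %zu\n",
             info->dlpi_name,
             (void*)(info->dlpi_addr + phdr->p_vaddr),
             (size_t)phdr->p_memsz);
  }
  return 0; /* continue iteration */
}

void list_loaded_segments(void)
{
  // Comparing this listing before and after dlopen of the PIE binary reveals
  // the location of its newly mapped code and data segments.
  dl_iterate_phdr(report_segments, NULL);
}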

To use PIEglobals in your AMPI program, compile and link with the -pieglobals parameter:

$ ampicxx -o example.o -c example.cpp -pieglobals
$ ampicxx -o example example.o -pieglobals

No further effort is needed. Global variables in example.cpp will be automatically privatized when the program is run. Any libraries and shared objects compiled as PIE will also be privatized. However, if these objects call MPI functions, it will be necessary to build them with the AMPI toolchain wrappers, -pieglobals, and potentially also the -standalone parameter in the case of shared objects. It is recommended to do this in any case so that AMPI can ensure everything is built as PIE.

One important caveat is that the relocated code segments are opaque to runtime debuggers such as GDB and LLDB because debug symbols are not translated to their new location in memory. For this reason it is recommended to perform as much development and debugging as possible in non-virtualized mode so the program can be debugged normally. One facility provided to assist in debugging with virtualization is the pieglobalsfind function. This can be called at runtime to translate a privatized address back to its original location as allocated by the system’s runtime linker, thereby associating it with any debug symbols included in the binary. In GDB, the command takes the form call pieglobalsfind((void *)0x...). It can be useful to directly pass in the instruction pointer as an argument, such as call pieglobalsfind($rip) on x86_64.
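
For example, inside a GDB session:

(gdb) call pieglobalsfind($rip)
(gdb) call pieglobalsfind((void *)0x...)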

3.3.3.3. TLSglobals: Automatic Thread-Local Storage Swapping

Thread Local Storage (TLS) was originally employed for kernel threads to give each thread its own copy of designated variables and thereby provide thread safety. It can be used by annotating global/static variable declarations with thread_local in C++, with __thread or C11’s thread_local/_Thread_local in C, and with OpenMP’s threadprivate attribute in Fortran. OpenMP is required for TLSglobals in Fortran code, since Fortran has no other way to specify TLS. The __thread keyword is not part of standard C, but it is a widely implemented compiler extension.

TLSglobals handles both global and static variables with minimal context-switching overhead. AMPI provides runtime support for privatizing thread-local variables to user-level threads by changing the TLS segment register when context switching between them; the cost is that of updating a single pointer per user-level thread context switch. Currently, Charm++ supports it for x86/x86_64 platforms when using GNU or LLVM compilers, as well as macOS on all supported architectures.

// C/C++ example:
int myrank;
double xyz[100];
! Fortran example:
integer :: myrank
real*8, dimension(100) :: xyz

For the example above, the following changes to the code handle the global variables:

// C++ example:
thread_local int myrank;
thread_local double xyz[100];

// C example:
__thread int myrank;
__thread double xyz[100];
! Fortran example:
integer :: myrank
real*8, dimension(100) :: xyz
!$omp threadprivate(myrank)
!$omp threadprivate(xyz)

The runtime system also needs to know that TLSglobals is in use, so pass the flag at both compile and link time:

$ ampicxx -o example example.C -tlsglobals

3.3.3.4. PiPglobals: Automatic Process-in-Process Runtime Linking Privatization

Process-in-Process (PiP) [PiP2018] Globals allows fully automatic privatization of global variables on GNU/Linux systems without modification of user code. All languages (C, C++, Fortran, etc.) are supported. This method currently lacks support for checkpointing and migration, which are necessary for load balancing and fault tolerance. Additionally, overdecomposition is limited to approximately 12 virtual ranks per logical node, though this can be resolved by building a patched version of glibc.

As with PIEglobals, this method compiles your user program as a Position-Independent Executable (PIE) and links it against a special shim of function pointers. A small loader utility calls the glibc-specific function dlmopen on the PIE binary with a unique namespace index. The loader uses dlsym to populate the PIE binary’s function pointers and then it calls the entry point. This dlmopen and dlsym process repeats for each rank. As soon as execution jumps into the PIE binary, any global variables referenced within will appear privatized. This is because PIE binaries locate the global data segment immediately after the code segment so that PIE global variables are accessed relative to the instruction pointer, and because dlmopen creates a separate copy of these segments in memory for each unique namespace index.
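
The sketch below is not the PiPglobals loader itself; it only illustrates the underlying glibc mechanism, and the entry-point symbol name ampi_rank_main and type rank_entry_t are hypothetical names used for illustration:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef void (*rank_entry_t)(void);

void load_ranks(const char* pie_binary, int nranks)
{
  for (int rank = 0; rank < nranks; rank++) {
    // LM_ID_NEWLM asks the dynamic linker for a brand-new namespace, so this
    // rank gets its own copy of the PIE binary's code and data segments.
    void* handle = dlmopen(LM_ID_NEWLM, pie_binary, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
      fprintf(stderr, "dlmopen failed: %s\n", dlerror());
      return;
    }
    // "ampi_rank_main" is a hypothetical entry-point symbol for illustration.
    rank_entry_t entry = (rank_entry_t)dlsym(handle, "ampi_rank_main");
    if (entry != NULL)
      entry(); /* in AMPI this would run inside the rank's user-level thread */
  }
}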

Optionally, the first step in using PiPglobals is to build PiP-glibc to overcome the limitation on rank count per process. Use the instructions at https://github.com/RIKEN-SysSoft/PiP/blob/pip-1/INSTALL.md to download an installable PiP package or build PiP-glibc from source by following the Patched GLIBC section. AMPI may be able to automatically detect PiP’s location if installed as a package, but otherwise set and export the environment variable PIP_GLIBC_INSTALL_DIR to the value of <GLIBC_INSTALL_DIR> as used in the above instructions. For example:

$ export PIP_GLIBC_INSTALL_DIR=~/pip

To use PiPglobals in your AMPI program (with or without PiP-glibc), compile and link with the -pipglobals parameter:

$ ampicxx -o example.o -c example.cpp -pipglobals
$ ampicxx -o example example.o -pipglobals

No further effort is needed. Global variables in example.cpp will be automatically privatized when the program is run. Any libraries and shared objects compiled as PIE will also be privatized. However, if these objects call MPI functions, it will be necessary to build them with the AMPI toolchain wrappers, -pipglobals, and potentially also the -standalone parameter in the case of shared objects. It is recommended to do this in any case so that AMPI can ensure everything is built as PIE.

Potential future support for checkpointing and migration will require modification of the ld-linux.so runtime loader to intercept mmap allocations of the previously mentioned segments and redirect them through Isomalloc. The present lack of support for these features means PiPglobals is best suited for testing AMPI during exploratory phases of development, and for production jobs not requiring load balancing or fault tolerance.

3.3.3.5. FSglobals: Automatic Filesystem-Based Runtime Linking Privatization

Filesystem Globals (FSglobals) was discovered during the development of PiPglobals and the two are highly similar. Like PiPglobals, it requires no modification of user code and works with any language. It also currently lacks support for checkpointing and migration, preventing use of load balancing and fault tolerance. Unlike PiPglobals, it is portable beyond GNU/Linux and has no limits to overdecomposition beyond available disk space.

FSglobals works in the same way as PiPglobals except that instead of specifying namespaces using dlmopen, which is a GNU/Linux-specific feature, this method creates copies of the user’s PIE binary on the filesystem for each rank and calls the POSIX-standard dlopen.
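
The sketch below is not the FSglobals implementation; it only illustrates the idea under simplified assumptions (the copy step is done via the shell for brevity, and the per-rank path naming is illustrative):

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

// Copy the PIE binary to a per-rank file and dlopen the copy. Because each
// copy is a distinct file, the dynamic linker maps a separate data segment
// for each one, privatizing its globals.
void* load_rank_copy(const char* pie_binary, int rank)
{
  char copy_path[4096];
  char command[8192];
  snprintf(copy_path, sizeof(copy_path), "%s.rank%d", pie_binary, rank);
  snprintf(command, sizeof(command), "cp '%s' '%s'", pie_binary, copy_path);
  if (system(command) != 0)
    return NULL;

  void* handle = dlopen(copy_path, RTLD_NOW | RTLD_LOCAL);
  if (handle == NULL)
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
  return handle;
}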

To use FSglobals, compile and link with the -fsglobals parameter:

$ ampicxx -o example.o -c example.cpp -fsglobals
$ ampicxx -o example example.o -fsglobals

No additional steps are required. Global variables in example.cpp will be automatically privatized when the program is run. Variables in statically linked libraries will also be privatized if compiled as PIE. It is recommended to achieve this by building with the AMPI toolchain wrappers and -fsglobals, and this is necessary if the libraries call MPI functions. Shared objects are currently not supported by FSglobals due to the extra overhead of iterating through all dependencies and copying each one per rank while avoiding system components, plus the complexity of ensuring each rank’s program binary sees the proper set of objects.

This method’s use of the filesystem is a drawback in that it is slow during startup and can be considered wasteful. Additionally, support for load balancing and fault tolerance would require further development in the future, using the same infrastructure as what PiPglobals would require. For these reasons FSglobals is best suited for the R&D phase of AMPI program development and for small jobs, and it may be less suitable for large production environments.

3.3.3.6. Swapglobals: Automatic Global Offset Table Swapping

Thanks to the ELF object format, we have automated the procedure of switching the set of user global variables when switching thread contexts. The Executable and Linkable Format (ELF) is the standard file format for object files on Unix-like operating systems. ELF binaries access global variables through a Global Offset Table (GOT), so the runtime system can swap the GOT contents at each thread context switch.

The only thing that the user needs to do is pass the flag -swapglobals at both compile and link time (e.g. “ampicc -o prog prog.c -swapglobals”). This method does not require any changes to the source code and works with any language (C, C++, Fortran, etc). However, it does not handle static variables, has a context switching overhead that grows with the number of global variables, and is incompatible with SMP builds of AMPI, where multiple virtual ranks can execute simultaneously on different scheduler threads within an OS process.

Currently, this feature only works on x86 and x86_64 platforms that fully support ELF, and it requires ld version 2.23 or older, or else a patched version of ld 2.24+ that we provide here: https://charm.cs.illinois.edu/gerrit/gitweb?p=libbfd-patches.git;a=tree;f=swapglobals

For these reasons, and because more robust privatization methods are available, swapglobals is considered deprecated.

3.3.3.7. Source-to-Source Transformation

One final approach is to use a tool to transform your program’s source code, implementing the changes described in one of the sections above in an automated fashion.

We have multiple tools for automating these transformations for different languages. Currently, there is a tool called Photran (http://www.eclipse.org/photran) for refactoring Fortran codes that can do this transformation. It is Eclipse-based and works by constructing Abstract Syntax Trees (ASTs) of the program. We also have a tool built with LLVM/LibTooling that applies the TLSglobals transformation to C/C++ codes, available upon request.

3.3.3.8. Summary

Table 4 shows portability of different schemes.

Table 4: Portability of current implementations of the privatization schemes. “Yes” means we have implemented this technique. “Maybe” indicates there are no theoretical problems, but no implementation exists. “No” indicates the technique is impossible on this platform.

Privatization Scheme   Linux   Mac OS   Windows   x86   x86_64   PPC     ARM7
Manual Code Editing    Yes     Yes      Yes       Yes   Yes      Yes     Yes
PIEglobals             Yes     No       No        Yes   Yes      Yes     Yes
TLSglobals             Yes     Yes      Maybe     Yes   Yes      Maybe   Maybe
PiPglobals             Yes     No       No        Yes   Yes      Yes     Yes
FSglobals              Yes     Yes      Yes       Yes   Yes      Yes     Yes
Swapglobals            Yes     No       No        Yes   Yes      Yes     Yes