FELIPE Online Manual:
Global matrix storage, and equation solvers

When a finite element program runs, it is the solution of the global matrix equation {90} Ku=f which takes most of the time, and it is the storage of K which takes most of the space resources. Hence, much attention has been given to algorithms for the compact storage of K , and for efficient solution algorithms. These topics are often given little attention in f.e.m. textbooks, however, because they are seen as peripheral to the main f.e. theory. As FELIPE has been developed to help users understand the practical programming of the f.e. method, a wide range of storage and solution algorithms have been used in the `main engines', and will be described in this Chapter. Throughout the chapter, we will be considering the solution of {90}, where K has NTOTV rows and columns.

Compact storage of K

As we have seen in POISS and ELAST, the standard f.e. program needs to assemble the global stiffness matrix K from the element matrices written to scratch file by subroutine stiff. But storing the whole n \times n matrix K would be impractical. Luckily, we can make use of the fact that K will be banded, as you have seen in the graphical demonstration of the assembly process in PREFEL, with zeros in entries outside the band (which lies either side of the diagonal of the matrix). In many applications K will also be symmetric, so that we only need to store the lower (or upper) half of the band, together with the diagonal itself.

Symmetric band storage

The simplest example of band storage is in POISS.FOR, since in this application there is only one degree of freedom per node; hence, the equation corresponding to the i th node is in the i th row of the stiffness matrix equation. The stiffness matrix will thus have NPOIN rows.

The size of the band is defined by the bandwidth NBW. This defined as the smallest number for which: K_{i,j}=0 for all i,j satisfying |i-j| > {NBW} . If K is symmetric, we will store the lower half of the band, i.e. all entries K_{i,j} for which i gt;= j . The halfbandwidth NHBW is defined by NBW = 2*NHBW + 1. Since we only get a nonzero entry in K_{i,j} if nodes with global numbering i and j are contained in the same element, the halfbandwidth NHBW can be calculated as the maximum over all elements l , of the difference in node numbers for the nodes in l . In POISS.FOR, this calculation is done at the start of subroutine assmb. We will therefore use NPOIN rows of GSTIF (one row for each node in the mesh) and NHBW+1 columns. We denote the latter value by NBAND, and dimension GSTIF at the start of POISS.FOR as
DIMENSION GSTIF(MPOIN,MBAND)
where the parameters MPOIN and MBAND are set in the PARAMETER statement.

Then, on row i we store only k_{i,i- {NHBW}},...,k_{i,i} , for i=1,... ,NTOTV. We store these entries in an array GSTIF with NTOTV rows (where NTOTV is the total number of degrees of freedom) and (NHBW+1) columns. The mapping from K to GSTIF is: n K_{i,j} --> {GSTIF(I,J-I+NHBW+1)} for j=i- {NHBW},...,i .

Then the diagonal of the matrix (entries where i=j ) will be stored in the final column of the array. You can see the entries of the element stiffness matrices being placed in GSTIF, in subroutine assmb of POISS.FOR.

The mapping is slightly more complicated in ELAST.FOR, where there are two degrees of freedom per node. Here, the u and v d.o.f.s of node n (i.e. displacements in the x and y directions at node n ) are global degrees of freedom numbers i_n+1 and i_n+2 , where i_n=2(n-1) . You can see this mapping being used at the start of subroutine assmb. After making this adjustment, the local d.o.f.s are mapped to GSTIF by the formula in the previous section. At the start of assmb, the halfbandwidth is found by finding the position of the first non-zero entry in row I: this is stored in the array NMINVC. Thus, if NMINVC(I)=MI, it means that k_{i,mi} \neq 0 but k_{i,j}=0 {for} j=1,...,mi-1 . Then the halfbandwidth NHBW is given by \max_{i=1, {NTOTV}}(i-mi) .

It will be seen that the size of the bandwidth, and hence the amount of storage required, will depend on the order in which the nodes of the mesh have been numbered. For this reason it is very important to use the element renumbering facility in PREFEL when producing the .dat file for input to the `main engine', to obtain a sensible numbering of the elements --- and then to re-number the nodes accordingly. Note that the bandwidth does not depend on the element numbering directly.

Non-symmetric band storage

When the matrix K is nonsymmetric, the whole band must be stored. This is the case in THERM.FOR. Here, the array GSTIF has 3N rows (as there are 3 degrees of freedom per node) and NBW columns (NBW is the bandwidth). In this program we introduce the use of arrays NVAR and NDOF, both of size NPOIN. NDOF(N) stores the number of degrees of freedom at node N. NVAR(N) tells us which rows of the global matrix K these degrees of freedom correspond to: if NVAR(N)=IN, then the i'th d.o.f. of node N corresponds to row IN+i. NVAR is formed from NDOF by NVAR(N) = \sum_{i=1}^{N-1} NDOF(i). These arrays are used in other `main engines', especially where there are variable degrees of freedom per node.

The mapping from K to GSTIF is again n K_{i,j} --> {GSTIF(I,J-I+NHBW+1)}. Now, however, there are NBW columns, and the diagonal entries of K lie down the (NHBW+1)'th column.

Skyline storage

While band storage is considerably better than storing the whole matrix, it still leaves a large number of zero entries within the band. A more efficient storage system is skyline storage, where we decide separately for each row where we need to start the storage.

This storage technique is used in the soil consolidation program CONSL.FOR. As with band storage, the amount of storage used depends on the numbering of the nodes in the mesh. We again use the array NMINVC(I) to tell us the column where the first non-zero entry on row I of K lies. Then (as K will by symmetric in CONSL) we store only the entries from this point up to the diagonal entry, in GSTIF. GSTIF now becomes a one-dimensional array, and we use a pointer array LPOINT(I) to tell us where the I'th row of K begins. That is, the diagonal entry K_{i,i} will be stored as GSTIF(LPI), where LPI=LPOINT(I), for I=1 to NTOTV. The dimension of GSTIF is then LPOINT(NTOTV), which we denote by NSTIF. The parameter MSTIF is used to dimension GSTIF at the start of the program. The arrays NMINVC and LPOINT are set up by a subroutine mask in CONSL.FOR. LPOINT(I) can be found from LPOINT(I-1) by
LPOINT(I) = LPOINT(I-1) + I - NMINVC(I) + 1
since there are I-NMINVC(I)+1 entries to store on row I. We start with LPOINT(1)=1.

The mapping from the lower diagonal of K to GSTIF is now n K_{i,j} --> {GSTIF(LPOINT(I) - I + J)} if NMINVC(I) lt;= J lt;= I. This mapping is performed by a function gpos; if K_{i,j} lies outside the envelope of K which is being stored, gpos returns a value of zero. Otherwise, n K_{i,j} --> {GSTIF(GPOS(LPOINT,NTOTV,I,J))}.

Equation solution algorithms

As with the algorithms for compact storage of K described above, various equation solution algorithms are used in the `main engines' to solve n Gu = f, and will be introduced here, carefully graded from the simple to the more complex. We denote the number of rows and columns by n.

Gaussian elimination

The simplest matrix solution algorithm is the well-known Gaussian elimination, in which elementary row operations are applied systematically to G -- and simultaneously to the right-hand-side vector -- to get zeroes in the entries below the diagonal; the resulting upper triangular system can then be solved by back-substitution. This algorithm, which does not require that the matrix is symmetric, is used in the `main engine' THERM.FOR. The subroutine solve calls the subroutines reduce and backsu in turn, to perform the reduction (to upper triangular form) and back-substitution phases.

This algorithm is represented in the following fragments of pseudo-code. first we show the reduction phase, in which the lower triangular matrix L is formed:

C...Reduction:
      do 20 i=1,n-1
        pivot = G(i,i)
        do 10 k=i+1,n
          fact = G(k,i)/pivot
          do 5 j=i+1,n
            G(k,j) = G(k,j) - fact*G(i,j)
    5     continue
          f(k) = f(k) - fact*f(i)
   10   continue
   20 continue

Here, G(i,j) indicates the (i,j)'th entry of G; no modification has been made to represent the compact storage of G. If this pseudo-code is compared with the code in subroutine reduce of THERM.FOR, it will be seen how the mappings to the band storage of G in GSTIF (as described in the previous section for non-symmetric band storage) are implemented. Also, the start and end values in the do loops are modified, because we only need to perform these loops for entries within the band. These factors in translation from the pseudo-code given in the Manual, to the actual code in the `main engines', should be borne in mind for when studying the following algorithms.

The solution is completed by the back-substitution phase, performed in subroutine backsu of THERM.FOR:

c...back-substitution phase
      u(n) = f(n)/G(n,n)
      do 10 i=n-1,1, step -1
        pivot = G(i,i)
        do 5 j=i+1,n
          f(i) = f(i) - G(i,j)*u(j)
    5   continue
        u(i) = f(i)/pivot
   10 continue

Symmetric Choleski decomposition

We now consider an algorithm closely related to Gaussian elimination, and used in the Basic-level programs POISS and ELAST. A Choleski decomposition of a positive definite matrix K consists of a lower triangular matrix L (with 1's down the diagonal) and an upper triangular matrix U , such that K=LU . If K is symmetric, we modify this to finding a lower triangular matrix L such that K=LL^T . (Now we no longer have 1's down the diagonal of K ) Then our equation to be solved becomes L(L^Tu) = f . Once the decomposition has been performed, we solve the equation Ly=f by forward substitution, to find an intermediate vector y ; this becomes the right-hand-side in the final stage, which is to solve L^Tu=y by back-substitution. This final stage is identical to the back-substitution stage of Gaussian elimination.

The entries in L must be found systematically. The first one which can be found is l_{1,1} , since it will be seen by forming the matrix LL^T and comparing it with K , that k_{1,1} = {l_{1,1}}^2 . Once l_{1,1} is known, we could calculate all the entries below this in the first column. However, we will work in a row-based manner, and find l_{2,1} , use this to find l_{2,2} , before moving on to the 3rd row to find l_{3,1},l_{3,2},l_{3,3} --- then move on to the 4th row, and so on. This row-based approach matches the row-based skyline storage algorithm described earlier.

This Choleski decomposition algorithm is used in the Basic-level programs POISS and ELAST, and can be found in subroutine chodec. The pseudo-code for this decomposition, applied to the matrix G, is as follows:

c                       T
c...decompose G into L.L
      i = 1
      aii = G(1,1)
      clii = sqrt(aii)
      G(1,1) = clii
c...decompose each row in turn
      do 20 i=2,n
c...work along the row
        do 10 j=1,i-1
          cljj = G(j,j)
          aij = G(i,j)
          sum = 0.0
          do 5 k=1,j-1
            gik = G(i,k)
            gjk = G(j,k)
            sum = sum + gik*gjk
    5     continue
          clij = (aij - sum)/cljj
          G(i,j) = clij
   10   continue
c...finally find the diagonal entry on the row
        aii = G(i,i)
        sum = 0.0
        do 15 k=1,i-1
          clik = G(i,k)
          sum = sum + clik*clik
   15   continue
        clii2 = aii - sum
        clii = sqrt(clii2)
        G(i,i) = clii
   20 continue

Note that as each entry in L is found, it is written to G, so that the corresponding entry of the stiffness matrix gets overwritten. At the end of the algorithm, G contains the entries of L in its lower triangle.

In implementing this algorithm in a `main engine', we must use a mapping to convert G(i,j) to a storage space in GSTIF, as described in the earlier section on storage techniques.

The system Gu=f can now be rewritten LL^Tu=f , and is solved in two stages:

solve Ly=f using forward substitution;
solve L^Tu=y using back-substitution.

The pseudo-code for this, on which subroutine chosub in POISS.FOR is based, is:

c...forward substitution: solve Ly = f
      y(1) = f(1)/G(1,1)
      do 10 i=2,n
        bi = f(i)
        sum = 0.0
        do 5 k=1,i-1
          gik = G(i,k)
          ak = y(k)
          sum = sum + gik*ak
    5   continue
        clii = G(i,i)
        y(i) = (bi - sum)/clii
   10 continue
c                                   T
c...back-substitution phase: solve L x = y
      u(n) = y(n)/G(n,n)
      do 20 i=n-1,1, step -1
        yi = y(i)
        sum = 0.0
        do 15 k=i+1,n
          gki = G(k,i)
          ak = u(k)
          sum = sum + gki*ak
   15   continue
        clii = G(i,i)
        u(i) = (yi - sum)/clii
   20 continue

In program CONSL.FOR, we use the same solution algorithm, but modify its Fortran coding to take account of the skyline storage technique, described above. When row-based skyline storage is used with a Choleski decomposition, the algorithm given above for the final back-substitution phase (solve L^Tu=y ) is rather inefficient. This is because it works with rows of L^T , which are columns of L , the entries of which are scattered throughout the row-based storage array GSTIF. It can easily be modified, however, so that once an entry u_i is found, this is eliminated from all equations from i-1 up to row 1, which involves working up the column of L^T . The modified algorithm, replacing the last phase of the previous pseudo-code, is:

c...back-substitution phase: solve L x = y
c...We can do this in a column-based manner!
      do 20 i=n,1, step -1 
        yi = y(i)
        clii = G(i,i)
        xi = yi/clii
        u(i) = xi
c...once we have found x_i, transfer the term involving this, to
c...the r.h.s. in all earlier equations
        if (i.eq.1) goto 20
        do 15 j=i-1,1, step -1
          clij = G(i,j)
          y(j) = y(j) - clij*xi
   15   continue
   20 continue

This form of the routine is also used in PLADV, where skyline storage is also employed.

Frontal solution algorithm

The storage of a global stiffness matrix, even using a compact method such as skyline storage, still requires a large amount of space for complex finite element meshes. Researchers have therefore looked for ways of solving Gu=f without having to assemble G fully. One way of doing this is to use iterative solution algorithms, which will be discussed in the next section. However, we first look at a direct solution method which is widely used in commercial f.e. programs. This is called the frontal method. A full description is beyond the scope of this manual (see Chapter 10 for recommended texts), but essentially it is a form of Gaussian elimination in which the element assembly process is combined with the solution. First, the mesh topology is examined to identify the element at which each degree of freedom last appears. Then, entries in element stiffness matrices are read in element-by-element, and assigned space in an array of active variables. Once the last appearance of a d.o.f. has occurred, that variable is eliminated, and the space thus freed up is used for a new variable. Hence, there will be a `front' of active variables passing through the matrix. When this elimination phase is completed, the back-substitution phase can occur.

The coding for this algorithm, for the case of a symmetric matrix, can be seen in subroutine front of the advanced elasticity program ELADV.FOR. I am indebted to Dr David Naylor for permission to use this subroutine. A number of global one-dimensional arrays have to be used, but the algorithm is still more space-efficient than skyline storage, and so ELADV should be used for elasticity analyses with complex meshes. There are comment lines in the subroutine, giving some guidance as to how the algorithm works.

The algorithm can be adapted to the case of a non-symmetric matrix G , in which case approximately twice as much storage will be needed. My own adaptation of front for the nonsymmetric case, is the subroutine afront in the elasto-viscoplasticity program VPLAS.FOR. This program handles non-associated plastic flow, which results in non-symmetric stiffness matrices.

The amount of storage needed depends on the numbering of the elements, not the nodes, as the algorithm is dependent on the order in which the elements are assembled into the front. The subroutine makes a preliminary run-through of the elements to check that sufficient storage is provided; this is defined by the parameter MFRON at the start of the program. The entries of G are allocated spaces in the one-dimensional array GSTIF, which is of dimension MFRON. As the assembly process is handled within front, there is no assmb subroutine within ELADV and VPLAS.

Iterative solution methods

The standard iterative method for solving Gu=f , G symmetric positive definite, is the preconditioned conjugate gradients (PCG) algorithm. This may be summarised as follows:

an {Initialization}
v^{(0)} = 0
r^{(0)} = f
k = 0
z^{(0)} = C^{-1}r^{(0)}
p^{(0)} = z^{(0)}
{Begin iteration}
\alpha_k = \frac{(z^{(k)},r^{(k)})}{(Gp^{(k)},p^{(k)}}
v^{(k+1)} = v^{(k)} + \alpha_k p^{(k)}
r^{(k+1)} = r^{(k)} - \alpha_k Gp^{(k)}
z^{(k+1)} = C^{-1} r^{(k+1)}
ta_k = \frac{(z^{(k+1)},r^{(k+1)})}{(z^{(k)},r^{(k)}}
p^{(k+1)} = z^{(k+1)} + ta_k p^{(k)}
k = k + 1
{Next iteration} a

This form of the PCG algorithm is taken from Bartelt(1989) -- see Chapter 10 for details pf recommended texts. Here, (u,v) denotes the inner product of vectors u and v . The matrix C is the preconditioning matrix. The algorithm is terminated when the norm of the residual vector r^{(k+1)} is less than TOLER times the norm of f ; then the solution u=v^{(k+1)} .

The simplest form of preconditioning is diagonal preconditioning, in which C = {diag}(G) . This algorithm is implemented in the Basic-level program FRAME.FOR. The solution is performed in subroutine pcgsol, the pseudo-code for which is:

c...solve Kv=g using diagk as preconditioner
c...multiplications K.v are performed element-by-element, 
c...reading the element matrices of K from the scratch file
c
c...initial guess; 
      do 2 i=1,n
       v(i) = 0.0
       r(i) = g(i)
       z(i) = 0.0
    2 continue
c
c...z = {Kdiag}^-1 * r
      do 15 i=1,n
        z(i) = r(i)/diagk(i)
   15 continue
c
      do 16 i=1,n
        p(i) = z(i)
   16 continue
      ztr = 0.0
      do 17 i=1,n
        zi = z(i)
        ri = r(i)
        ztr = ztr + zi*ri
   17 continue
      call norm2(mtotv,ntotv,r,rnorm0)
      itscg = 0
      write(6,*)'PGSOL: iteration',itscg,',norm = ',rnorm0
c
   25 itscg = itscg + 1
c
c...create y which is G*p
      call ematgv( p , y )
      pty = 0.0
      do 28 i=1,n
        pty = pty + p(i)*y(i)
   28 continue
      alpha = ztr/pty      
      do 30 i=1,n
       v(i) = v(i) + alpha*p(i)
   30 continue
c
      do 40 i=1,n
        r(i) = r(i) - alpha*y(i)
   40 continue
c
      call norm2(mtotv,ntotv,r,rnorm)
c
      write(6,*)'       iteration',itscg,',norm = ',rnorm
c
      if(rnorm.lt.cgtol) goto 200
c
c...z = {Kdiag}^-1 * r
      do 45 i=1,n
        z(i) = r(i)/diagk(i)
   45 continue
c
      ztr0 = ztr
      ztr = 0.0
      do 60 i=1,ntotv
        ztr = ztr + z(i)*r(i)
   60 continue
      beta = ztr/ztr0
c
      do 130 i=1,ntotv
       p(i) = z(i) + beta*p(i)
  130 continue
c
      goto 25
c
  200 continue
      write(6,*)'Solution converged after ',itscg,' iterations.'

With the diagonal preconditioner (stored in the array diagk), the formation of C^{-1}v is achieved by simply dividing the elements of v by the corresponding elements of diagk. The array diagk is assembled from the element stiffness matrices in the subroutine assdk. The global matrix-vector multiplication Gv is performed element-by-element, reading the element stiffness matrices from the scratch file, in subroutine ematgv; thus, there is no assembly or storage of the global matrix G .

Diagonal preconditioning works surprisingly well, considering its simplicity. If a more sophisticated preconditioner is required, the most popular currently is to perform an Incomplete Choleski (IC) factorization of G . This is essentially a symmetric Choleski decomposition of G into LL^T , as described earlier, but avoiding some or all of the `fill-in' which occurs within the envelope of G . That is, we may set L(i,j)=0 instead of L(i,j)=l_{i,j} if G(i,j)=0 . To compensate this, the calculated l_{i,j} is instead added to the diagonal term L(i,i) .

Both diagonal and IC preconditioning are implemented in the advanced plasticity program PLADV.FOR. The form of IC preconditioning there is adaptive: the fill-in is avoided if the magnitude of l_{i,j} is less than a user-specified multiple ( filtol) of the diagonal term. PLADV.FOR uses the skyline storage technique for G . The subroutine inchol in PLADV.FOR performs this IC decomposition. The pseudo-code below should be compared with that for subroutine chodec:

c...Incomplete Choleski decomposition
      small = 1.e-8
c...count the amount of fill-in which occurs
      kfill = 0
      khole = 0
      i = 1
      a11 = G(1,1)
      cl11 = sqrt(a11)
      G(1,1) = cl11
c...decompose each row in turn
      do 20 i=2,n
        aii = G(i,i)
c...we will not fill in if the term is smaller than filtol
        filtol = factor*abs(aii)
c...work along the row
        do 10 j=1,i-1
          cljj = G(j,j)
          aij = G(i,j)
          sum = 0.0
          do 5 k=1,j-1
            sum = sum + G(i,k)*G(j,k)
    5     continue
          clij = (aij - sum)/cljj
c...here is the 'Incomplete' part
          gij = G(i,j)
          if (gij.eq.0.0) then
            khole = khole + 1
            if (abs(clij).lt.filtol) then
c...avoid fill-in: add i,j term to diagonal instead           
              G(i,i) = G(i,i) + clij
              goto 10
            endif
c...fill-in will occur
            kfill = kfill + 1
          endif         
          G(i,j) = clij
   10   continue
c...finally find the diagonal entry on the row
        aii = G(i,i)
        sum = 0.0
        do 15 k=1,i-1
          clik = gstif(i,k)
          if (clik.eq.0.0) goto 15
          sum = sum + clik*clik
   15   continue
        clii2 = aii - sum
        clii = sqrt(clii2)
        G(i,i) = clii
   20 continue
      write(6,*) 'No. of holes found in G matrix = ',khole
      write(6,*) 'No. of holes filled-in = ',kfill

As might be expected, there is a trade-off between the amount of fill-in which is prevented, and the effectiveness of the resulting preconditioner. The example datafile pltun4.dat, used with the solution options in PLADV, is discussed in Chapter 9.

Note that this is a very simple form of IC algorithm. There have been many improved algorithms proposed in recent years, and users wishing to apply this technique in practice are advised to search specialist journals for the lastest developments.

The reader will have noticed that additional tolerance parameters must be specified when an iterative solver is used. In PLADV, the iterative solvers are chosen by setting the parameter ISOLA to 2 or 3:

ISOLA = 0 for direct solution using Choleski ( L.L^T ) decomposition and skyline storage;
ISOLA = 1 for direct solution using L.D.L^T decomposition and skyline storage;
ISOLA = 2 for iterative solution using conjugate gradients with diagonal preconditioning;
ISOLA = 3 for Incomplete Choleski preconditioning with conjugate gradient solution.

If ISOLA = 2 or 3 is chosen, the convergence tolerance cgtol must be input; a value of 10^{-6} is suitable. (An interesting approach is to make the tolerance adaptive: use a greater tolerance in the early stages of the plasticity equilibrium iteration, to avoide unnecessary PCG iterations for a solution which is itself only an approximation, and a smaller tolerance for the final iterations). Then, if ISOLA=3, the fill-in tolerance factor should be input. If factor = 0.0 is chosen, then fill-in will always occur, and so Choleski decomposition will be `complete'; in this case, the preconditioner of G will be G itself, and the PCG iteration will converge immediately. See the example of pltun4.dat in the next Chapter, for experience in choosing factor.

Handling of fixed and specified degrees of freedom

The correct way to handle fixed or specified degrees of freedom in a finite element solution, is to delete those rows from the global system of equations, and in the remaining rows to transfer the terms containing these known values onto the right-hand-sides, before solving for the remaining unknown d.o.f.s. This is however very cumbersome to program, in general. the exception is with the frontal solution method, in which the degrees of freedom are dealt with in turn; it will be found that specified d.o.f.s are treated in this way in front and afront in ELADV.FOR and VPLAS.FOR.

The alternative method for imposing nodal fixities, used in ELAST.FOR, is sometimes called a `big spring' approach. To impose the condition that u_k \approx 0.0 for some k , we add a large number ( 10^{20} ) to the corresponding diagonal term in GSTIF; this is done in subroutine fixdof, called by solve before calling chodec. This will force the value of u_k arising from the solution process to be very small, and we later set such small values to zero.

Fixed values of u_k which are non-zero (e.g. specified displacements, or the specified values of the potential for nodes lying on Dirichlet boundaries) are dealt with similarly. If we want to ensure that u_k = g , we multiply K_{k,k} by 10^{12} , and set the corresponding entry in the r.h.s. vector as F_k = g*K_{k,k} . This process can be seen in fixdof in POISS.FOR.

When a displacement degree of freedom is fixed, there will be a reaction force in this direction at the node, to maintain equilibrium. This can be found after the main solution, by calculating the r.h.s. for the corresponding row. This is done in subroutine reactn in ELAST.FOR and FRAME.FOR; these reactions are written into the load vector ASLOD, which is written in the result file.

FELIPE Online Manual: Global matrix storage, and equation solvers