You must log in to edit PetroWiki. Help with editing

Content of PetroWiki is intended for personal use only and to supplement, not replace, engineering judgment. SPE disclaims any and all liability for your use of such content. More information

# High performance computing and reservoir simulation

The motivation for high-performance computing in reservoir simulation has always existed. From the earliest simulation models, computing resources have been severely taxed simply because the level of complexity desired by the engineer almost always exceeded the speed and memory of the hardware. The high-speed vector processors such as the Cray of the late 1970s and early 1980s led to orders-of-magnitude improvement in speed of computation and led to production models of several hundred thousand cells. The relief brought by these models, unfortunately, was short-lived. The desire for increased physics of compositional modeling and the introduction of geostatistically/structurally based geological models led to increases in computational complexity even beyond the large-scale models of the vector processors. Tens of millions of cells with complete reservoir parameters now became available for use by the engineer. Although upscaling provided a tool to dramatically reduce model sizes, the inherent assumptions of the upscaling techniques left a strong desire by the engineer to incorporate all of the available data in studies. The only available solution to this problem became the subdivision of the model into small segments and the use of parallel computers for reservoir simulation. The introduction of fast, low-cost commodity hardware led to a revolution in higher-performance computing based on clusters.

## Contents

- 1 Exposition of data and the chasm of scale
- 2 Characteristics of high-performance computing
- 3 Parallel reservoir simulator
- 4 Parallel linear-equation solver
- 5 Parallel computers and message-passing interface
- 6 Application of parallel computing
- 7 Future of high-performance simulation
- 8 References
- 9 Noteworthy papers in OnePetro
- 10 External links
- 11 See also

## Exposition of data and the chasm of scale

From the earliest reservoir simulation models of the 1960s with only a few tens of finite-difference cells, reservoir simulation had progressed by several orders of magnitude in the early 1990s to hundreds of thousands of cells. Unfortunately, these fine-scale models still were far from the scale of the data. This chasm of scale was further exacerbated by the introduction of geocellular models with geostatistically derived attributes. With these geostatistically based models, it was possible to generate geological descriptions for the reservoir simulation with literally tens of millions of finite-difference cells. Although upscaling offered some relief to this problem, the resultant assumptions required often left the engineer wondering how far from reality the resultant solutions had been driven. If the additional degree of freedom of uncertainty of the reservoir data is added to this problem, then the number and size of simulations becomes unlimited. This implies that the need for further improvements in high-performance computing will always exist for reservoir simulation. Parallel computing offers one approach to overcome the problem of the explosion of information through the use of fine-scale models with limited or no upscaling.

## Characteristics of high-performance computing

The basis of high-performance computing is the use of specialized hardware to achieve computational speeds that are much faster than conventional computers. This idea of "speedup" is encapsulated in what is known as Amdahl’s Law:

Speedup = 1 / (scalar + special / *n*),

where "scalar" is the fraction of the program which is scalar and "special" is the fraction of computations performed on specialized hardware (one minus scalar). Simply stated, Amdahl’s law indicates that if there is a piece of specialized hardware that is *n* times faster than the scalar processor(s) in a computer or group of computers, it must be used a large fraction of the time to have a significant impact on overall performance. The earlier supercomputers primarily used specialized hardware known as vector processors for the speedup. More recently, parallel computers based on connecting tens to hundreds of processors have formed the basis of high-performance computers. The importance of the serial or single-processor computer cannot be overemphasized, however. The introduction of the reduced instruction set computer (RISC) with "superscalar" performance in 1990 spelled the end of the dominance of the Cray supercomputer in the high-performance reservoir-simulation market because of a rapid decrease in the price/performance of the new hardware. Similarly, the introduction of low-cost, commodity hardware based on the INTEL chipset has led to significant improvements in price/performance. The application of this hardware to reservoir simulation has been a natural evolution in the application of the lowest-cost, highest-performance computers to reservoir simulation.

## Parallel reservoir simulator

The earlier high-performance computing hardware such as the Cray Computer required that modifications be made to the reservoir simulator programs to efficiently use the hardware.^{[1]}^{[2]} The modifications known as vectorization led to different approaches in the data organization. One of the best examples of this is given in Killough and Levesque^{[3]} and Young^{[4]}, in which a compositional model’s data structure is reorganized so that like phases in the finite-difference gridblocks of the reservoir are grouped. The notion behind parallel reservoir simulation is similar; however, in this case, the reservoir grid is subdivided into 3D blocks or domains. The computational work for each of the domains is then performed in parallel by a separate CPU of one of the computers in the parallel hardware. Profiles of computing workload for a typical reservoir simulation model often show tens, if not hundreds, of subroutines that are involved in a substantial portion of the calculations. Because of this, major reprogramming of reservoir simulation models is required to achieve high parallel efficiencies. As Killough^{[5]} points out, there are numerous obstacles to parallelization for reservoir simulation: recursive nature of the linear equation solver, load imbalance caused by reservoir heterogeneity and/or physics, and diverse data structure of well and facility management routines.

Several papers^{[6]}^{[7]}^{[8]}^{[9]}^{[10]}^{[11]}^{[12]}^{[13]}^{[14]}^{[15]}^{[16]} discuss techniques that have been used to bring about efficient parallel reservoir simulations. Each of these addresses the solutions to parallelization and, in particular, the obstacles mentioned previously in various ways. One of these simulators uses local grid refinement for parallelization.^{[6]} The main concept of parallelization with local grid refinement (LGR) is that LGR is not necessarily used to add additional cells to the model but simply to facilitate the subdivision of the grid. With LGR, the same simulation program can be used to perform simulations either serially on a single processor or in parallel on multiple processors simply through data manipulation. In addition, a great deal of flexibility for domain decomposition exists, which can lead to enhanced load balance for parallel simulations.

Parallelization using LGR involves assigning each grid to a processor. The same processor can be assigned to many grids or in the limit; each grid can be assigned to a separate processor. Variation of processor assignment and the grid refinement can dramatically affect parallel performance. Different preconditioners for the parallel linear equation solvers can also have a dramatic effect on parallel efficiency. Finally, the flexibility of LGR and processor assignment can be used to achieve improved parallel efficiency. For parallel domain decomposition, the base grid may be totally refined into subgrids so that all gridblocks in the base grid become inactive. Each grid is assigned to a processor so that only a portion of the model resides on a given processor. In this manner, the simulator memory requirements become scaleable. That is, as the number of processors is increased, the grid assigned to a given processor becomes smaller. Alternatively, only a portion of the base or root grid may be refined; the remaining unrefined cells in the root grid are then assigned to a processor. Because of this use of local grid refinement, only an extremely small number of global arrays are required on each processor—for example, the pore-volume array.

The existence of horizontal wells in most modern simulations requires that no restriction be placed on the placement of wells relative to grid and/or processor boundaries in a parallel simulator. To alleviate any difficulty with this, one approach is for well information to be passed by a broadcast among all processors.^{[6]} The overhead for broadcasting completion level information for all wells is included in the following results and appears to have little overall effect on parallel efficiency. Tubing hydraulics calculations for a predictive well-management routine have also been parallelized. These calculations are computationally intensive enough to support gather/scatter-type operations. For example, in a case with 1,500 producing wells, more than 90% of the computational workload for well management was in the tubing hydraulics calculations. Parallelization of only these calculations has shown a substantial improvement in parallel efficiency for several test cases for the overall well-management routines.

## Parallel linear-equation solver

A typical parallel linear solver involves composite grid decomposition and is based on the work of Wallis and Nolen.^{[17]} At each level of the composite grid, a red-black or block Jacobi ordering of the grid is used. Red-black refers to an ordering of the grids or domains. For example, if the grids form an equal areal subdivision of an areally square model, the red and black grids would appear as a checkerboard when viewed from the top of the grid. For a 1D or strip decomposition of the model, the red-black coloring would appear as alternately colored stripes when viewed perpendicular to the decomposition direction. An approximate block LU factorization approach is used for preconditioning and involves, for example, an L-factor for the red-black ordering as follows:

The *A*_{i} blocks are ordered first and represent the red grids or domains, while the *B*_{i} blocks are the black grids and are ordered second. The diagonals of the *B*_{i} submatrices are modified from the original coefficients to maintain rowsum constraints (see Wallis and Nolen^{[17]}). This approximate factorization leads to independent solutions on each processor for the first the red *A*_{i} and then followed by the black *B*_{i} grid matrices. For the block Jacobi approach, only a single color exists, and all grids at a given level are solved independently. As described in Killough, *et al*.^{[6]}, parallelization of this preconditioning can be achieved in a manner almost identical to the technique used for the coefficient routines. Burrows, *et al.*^{[18]} provide detail of a similar parallel solver developed independently, but almost simultaneously. Each grid is assigned to a processor. These may be the same or different processors. If all active grids lie on the same level, the composite grid solver performs all red or black preconditioning solutions independently [i.e., either block Jacobi (all red) or red-black orderings are used]. If grids are assigned to different levels, each grid of the same level and same color (Jacobi being all red) is solved simultaneously, with communication being performed after the entire solution of all grids on a level is completed. Factorization is performed so that rowsum constraints are also imposed. The flexibility of this solver allows the easy implementation of block red-black algorithms, or, as a special case, a red-black linear decomposition ("strip") decomposition could be used.

## Parallel computers and message-passing interface

With the rapid changes in technology for parallel computing, it is difficult to provide the state-of-the-art because this appears to change daily; however, current classification practice divides parallel computers into two distinct categories: shared and distributed memory. Shared-memory systems allow all of the memory associated with the parallel computer to be accessed by all of the processors. For UNIX systems, the memory sizes currently can range up to several tens of billion of bytes with hundreds of processors. For systems based on PC products, the shared-memory systems usually are limited to a few billion bytes of storage and fewer than 10 processors. (This may change with the recently introduced 64-bit processor Itanium.) The distributed memory systems allow access to local memory only by a small number of processors that reside on a "node" of the parallel system. Each of the "nodes" in the distributed memory parallel system often communicates with other nodes by a high-speed switch, although conventional ethernet communications are possible for some applications. Either of these parallel computers is capable of delivering several hundred million floating point operations per second (MFLOPS) for each node, or several billions of floating point operations per second (GFLOPS) for applications that can use the parallel hardware. parallel PC hardware can be further subdivided into operating systems: Windows- and LINUX-based systems. Current trends indicated that it is likely that the Windows-based systems will dominate the engineer’s desktop because of the wide availability of engineering/business applications available for this operating system. LINUX-based systems will likely be limited to clusters of servers to perform computer-intensive work such as parallel simulations.

To perform calculations in parallel, data must be passed in the form of messages among processors of the parallel machine. Message-passing interface (MPI) has been primarily used for parallelization on all classifications of parallel machines because of the wide availability and portability of MPI. For shared-memory computers (Sun and SGI for example), other forms of message passing exist such as the recently popular OpenMP. There are generally three forms of message passing for a distributed memory system: synchronous, asynchronous, and global operations. Synchronous message passing requires that all messages be completed before computations can proceed. Asynchronous, on the other hand, allows some overlap of processing and communications. Global message-passing operations distribute data from single processors to one processor, or vice-versa. Often some form of arithmetic function, such as summation, is performed as part of the global concatenation operation. These functions are particularly useful for evaluation of reduction functions such as dot products, or for distribution of data from a single node to all nodes in the parallel partition. Global communications are normally optimized for a particular architecture and are much more efficient than a simple set of send-receive operations for achieving the same results.

## Application of parallel computing

The rationalization of parallel reservoir simulation is based on two concepts: the ability to run larger models, or simply the ability to improve turnaround time on existing models. A typical larger-size model would be considered to have more than 1 million gridblocks. Results for a 1-million-cell, black-oil, implicit model show scaling is almost perfect from 1 to 18 processors with only a small degradation up to 36 processors (speedup = 32) on an SGI shared-memory parallel computer. Although this is a somewhat idealized example, it does point out the excellent scaling possible because of low message-passing overhead for a parallel simulator using domain decomposition. Results for a real-world compositional example with 3.4 million cells have been obtained on an IBM SP parallel computer with 108 processors. For this case, with 12 to 27 processors, the problem scales extremely well with a speedup of a factor of 2 for slightly more than doubling the number of processors. The scaling falls off slightly from 27 to 54 processors with only a factor of 1.7, but from 54 to 108 processors, this scaling is maintained at a factor of 1.7. Overall, the speedup from 12 to 108 processors was a very good factor of 5.6. If perfect scaling on 12 processors is assumed, this indicates a factor of 68 overall for 107 processors for a parallel efficiency of approximately 64%. The total elapsed time for the 108 processor case was approximately 12 hours, indicating that overnight turnaround for this 3.4-million-cell model could be achieved. Another example application of parallel simulation with more than 1 million gridblocks was recently presented in a black-oil simulation study of complex water encroachment in a large carbonate reservoir in Saudi Arabia.^{[16]}

## Future of high-performance simulation

It is clear that the uptake of parallel reservoir simulation has been limited by the hurdle of cost and the robustness of the technology. The recent advances in commodity hardware based on the INTEL architecture have provided a significant boost, however. Currently, the use of parallel INTEL-based clusters using inexpensive high-speed switches has lowered the entry point for significant parallel simulation by an order of magnitude. Just as the UNIX workstation caused a revolution in reservoir simulation, this new lost-cost hardware will likely bring parallel computing to the engineer’s desktop in the near future. The remaining hurdles for parallel simulation then are limited to enhancing parallel simulation technology. In particular, emphasis must be placed on load balancing and efficient linear equation solvers. Load balancing refers to the fact that all processors must perform the same amount of work to achieve high parallel efficiency. If only a few processors perform most of the work for a parallel simulation involving large numbers of processors, then poor parallel efficiency results—a case known as load imbalance. The allocation of optimization techniques to solve the load balancing problem offers some promise.^{[19]} An area of particular importance for load balancing is the coupled surface network/reservoir simulator. For this case, the network often dominates the simulation by almost an order of magnitude; therefore, to achieve reasonable parallel efficiency, the imbalance caused by the serial surface network calculations must be overcome. One approach to this is mentioned previously, the parallelization of the tubing hydraulics calculations; however, considerable additional parallelization effort will be required, especially if more than a few processors are to be used efficiently. Load imbalance is also brought about by complex geologies and complex locally refined grids. It is likely that the introduction of unstructured grids and associated load balancing using techniques such as "Hilbert space-filling curves"^{[20]} may well lead to the solution of this problem. Finally, the linear equation solver as the heart of the parallel simulator must be enhanced to provide efficient and robust solutions. Solver degradation often results from the use of large numbers of domains (processors) or a poor choice of the decomposition geometry.^{[6]} Although the previous examples show good parallel performance, future models with large, implicit systems must be routinely solved on large numbers of processors. Multigrid-like solvers may show the best promise for this improvement.^{[10]}^{[21]}

## References

- ↑ Killough, J.E. 1979. The Use of Vector Processors in Reservoir Simulation. Presented at the SPE Reservoir Simulation Symposium, Denver, Colorado, 31 January-2 Feburary 1979. SPE-7673-MS. http://dx.doi.org/10.2118/7673-MS
- ↑ Nolen, J.S., Kuba, D.W., and Jr., M.J.K. 1981. Application of Vector Processors To Solve Finite Difference Equations. Society of Petroleum Engineers Journal 21 (4): 447-453. SPE-7675-PA. http://dx.doi.org/10.2118/7675-PA
- ↑ Killough, J.E. and Levesque, J.M. 1982. Reservoir Simulation and the In-house Vector Processor: Experience for the First Year. Presented at the SPE Reservoir Simulation Symposium, New Orleans, Louisiana, 31 January-3 Feburary 1982. SPE-10521-MS. http://dx.doi.org/10.2118/10521-MS
- ↑ Young, L.C. 1987. Equation of State Compositional Modeling on Vector Processors. Presented at the SPE Symposium on Reservoir Simulation, San Antonio, Texas, 1-4 February 1987. SPE-16023-MS. http://dx.doi.org/10.2118/16023-MS
- ↑ Killough, J.E. 1993. Is Parallel Computing Ready for Reservoir Simulation? A Critical Analysis of the State of the Art. Presented at the SPE Annual Technical Conference and Exhibition, Houston, Texas, 3-6 October 1993. SPE-26634-MS. http://dx.doi.org/10.2118/26634-MS
- ↑
^{6.0}^{6.1}^{6.2}^{6.3}^{6.4}Killough, J.E. et al. 1996. A General-Purpose parallel Reservoir Simulator. Presented at the 1996 European Conference on the Mathematics of Oil Recovery, Leoben, Austria, 3–6 September. - ↑ van Daalen, D.T. et al. 1990. The parallelization of BOSIM, Shell’s Black/Volatile Oil Reservoir Simulator. Proc. of the First IMA/SPE European Conference on the Mathematics of Oil Recovery, Oxford U.
- ↑ Killough, J.E. and Bhogeswara, R. 1991. Simulation of Compositional Reservoir Phenomena on a Distributed-Memory Parallel Computer. J Pet Technol 43 (11): 1368-1374. SPE-21208-PA. http://dx.doi.org/10.2118/21208-PA
- ↑ Wheeler, J.A. and Smith, R.A. 1990. Reservoir Simulation on a Hypercube. SPE Res Eng 5 (4): 544-548. SPE-19804-PA. http://dx.doi.org/10.2118/19804-PA
- ↑
^{10.0}^{10.1}Rutledge, J.M., Jones, D.R., Chen, W.H. et al. 1992. Use of a Massively Parallel SIMD Computer for Reservoir Simulation. SPE Comp App 4 (4): 16-21. SPE-21213-PA. http://dx.doi.org/10.2118/21213-PA - ↑ Gautam S. et al. 1996. Parallel Computing Alters Approaches, Raises Integration Challenges in Reservoir Modeling. Oil & Gas J. (20 May): 48.
- ↑ Tchelepi, H.A., Durlofsky, L.J., Chen, W.H. et al. 1997. Practical Use of Scale Up and Parallel Reservoir Simulation Technologies in Field Studies. Presented at the SPE Annual Technical Conference and Exhibition, San Antonio, Texas, 5-8 October 1997. SPE-38886-MS. http://dx.doi.org/10.2118/38886-MS
- ↑ Chien, M.C.H. et al.: "A Scalable parallel Multi-Purpose Reservoir Simulator," paper SPE 37976 presented at the 1997 SPE Reservoir Simulation Symposium, Dallas, 8–11 June.
- ↑ Shiralkar, G.S., Stephenson, R.E., Joubert, W. et al. 1997. Falcon: A Production Quality Distributed Memory Reservoir Simulator. Presented at the SPE Reservoir Simulation Symposium, Dallas, Texas, 8-11 June 1997. SPE-37975-MS. http://dx.doi.org/10.2118/37975-MS
- ↑ Wang, P., Yotov, I., Wheeler, M. et al. 1997. A New Generation EOD Compositional Reservoir Simulator: Part 1—Formulation and Discretization. Presented at the SPE Reservoir Simulation Symposium, Dallas, Texas, 8–11 June. SPE-37979-MS. http://dx.doi.org/10.2118/37979-MS
- ↑
^{16.0}^{16.1}Pavlas Jr., E.J. 2001. MPP Simulation of Complex Water Encroachment in a Large Carbonate Reservoir in Saudi Arabia. Presented at the SPE Annual Technical Conference and Exhibition, New Orleans, Louisiana, 30 September-3 October 2001. SPE-71628-MS. http://dx.doi.org/10.2118/71628-MS - ↑
^{17.0}^{17.1}Wallis, J.R. and Nolen, J.S. 1993. Efficient Iterative Linear Solution of Locally Refined Grids Using Algebraic Multilevel Approximate Factorizations. Presented at the SPE Symposium on Reservoir Simulation, New Orleans, Louisiana, 28 February-3 March 1993. SPE-25239-MS. http://dx.doi.org/10.2118/25239-MS - ↑ Burrows, R., Ponting, D., and Wood, L. 1996. Parallel Reservoir Simulation with Nested Factorization. Presented at the 1996 European Conference on the Mathematics of Oil Recovery, Leoben, Austria, 3–6 September.
- ↑ Song, T. 1996. A Load-Balancing Technique for Reservoir Simulation Based on Dantzig’s Transportation Model. MS thesis, U. of Houston, Houston.
- ↑ Reference Manual for Zoltan. 1998. Los Alamos Natl. Laboratory, Los Alamos, New Mexico.
- ↑ Bhogeswara, R. and Killough, J.E. 1994. Parallel Linear Solvers for Reservoir Simulation: A Generic Approach for Existing and Emerging Computer Architectures. SPE Comp App 6 (1): 5-11. SPE-25240-PA. http://dx.doi.org/10.2118/25240-PA

## Noteworthy papers in OnePetro

Use this section to list papers in OnePetro that a reader who wants to learn more should definitely read

## External links

Use this section to provide links to relevant material on websites other than PetroWiki and OnePetro