Introduction
1.1 SimulationBased Optimization
Computer simulations are used extensively as models of real systems to evaluate output responses. Applications of simulation are widely found in many areas including supply chain management, finance, manufacturing, engineering design and medical treatment [42, 48, 60, 83, 96]. The choice of optimal simulation parameters can lead to improved operation, but configuring them well remains a challenging problem. Historically, the parameters are chosen by selecting the best from a set of candidate parameter settings. Simulationbased optimization [3, 40, 41, 47, 69, 90] is an emerging field which integrates optimization techniques into simulation analysis. The corresponding objective function is an associated measurement of an experimental simulation. Due to the complexity of the simulation, the objective function may be difficult and expensive to evaluate. Moreover, the inaccuracy of the objective function often complicates the optimization process. Deterministic optimization tools may lead to inaccurate solutions. Indeed, derivative information
Floating sleeve
Outer conductor
^{3}
7
Sleeve position
Teflon catheter
is typically unavailable, so many derivativedependent methods are not applicable to these problems.
A particular example is an engineering design problem for an microwave coaxial antenna in hepatic tumor ablation. (More details can be found in Section 4.2.) We aim to determine the optimal dimensions of the antenna, such as the slot size, the dipole tip length, and the thicknesses of the Teflon coatings (see Figure 1 for the antenna illustration) to yield a desired treatment performance. Finite element models (MultiPhysics 3.2) are used to simulate the electromagnetic (EM) field distribution in liver given a particular design. Since dielectric properties of the tissue are assumed to vary within ±10% of average properties, it is important to ensure that the antenna used in ablation is robust, i.e., relatively insensitive to the variations in physical properties of the tissue.
Although real world problems have many forms, in the thesis we consider the following bounded stochastic formulation:
min f (x)= E [F(x,£(u))] , (1.1)
where
Q = {x e R^{n} : l < x < u}.
Here, l and u are the lower and upper bounds for the input parameter x, respectively. We focus on the case where the parameter x comes from a continuous set Q, but not necessarily a finite set. £(u) is a random vector defined on a probability space. The sample response function F : R^{n} x R^{d} —► R takes two inputs, the simulation parametersx e R^{n} and a random sample of £(u) in R^{d}. Note that the variables x correspond to the design parameters (e.g., slot size, dipole tip length, etc.) in the above example, and the random realizations correspond to different dielectric properties of individuals. As in the example, our approaches concentrate on cases where the number of design parameters is small.
F(x, £(u)) corresponds to a particular evaluation using the finite element model. Given a random realization £iof £(u), F(x, £i) can be evaluated via a single simulation run. The underlying objective function f(x) is computed by taking an expectation over the sample response function and has no explicit form. (Different formulations other than expectation are acceptable; for example, maximum, minimum, or quantiles of the sample objectives [4, 32]). A basic assumption requires that the expectation function f (x) is well defined (for any x G R^{n} the function F(x, •)is measurable, and either E[F(x, £(u;))_{+}] or E[F(x,£(u))] is finite, see page 57 of [91]).
1.1.1 SimulationBased Optimization Methods
There are several excellent review papers on the subject of simulationbased optimization methods, such as [3, 40, 41]. These papers also provide a good survey on optimization addons for discreteevent simulation software, which have been undergoing fast development over the last decade; for example, OptQuest (www.optquest.com)for Arena (www.arena.com) and for SIMUL8 (www.simul8.com). A detailed summary of commercial simulation software and optimization addons can be found in Table 1 of [41]. Contemporary methods for continuous simulationbased optimization, which are implemented in these addons, are classified in several categories.
Response surface methodology The idea of Response Surface Methodology [12, 90] (RSM) is to construct one (or multiple) mathematical model A, which is called a surrogate model, to approximate the underlying function f,so that it is can be easily and cheaply evaluated at each parameter point x. Optimization procedures are executed over the inexpensive model A, instead of the expensive cost function. Moreover, the approximating model A could have estimated gradient values that enable the application of more efficient optimization algorithms. We determine the model A() by minimizing the difference of A() and the function F(•) over a representative set of points S:
^{min} J2x_{i&}s ^{0} ^{(F}^{(x}i^{,} & ^{} ^{A(x}i^{))} _{1} _{2)}
(1.2)
s.t. A(^) G A
Here 0() is a merit function, which is typically chosen as an /^{2}norm; ^ is one sample realization; and the set A is a class of tunable functional forms. The most popular choices of A are linear and quadratic models. Polynomial models are also compact forms and easy to evaluate, but they often have limited capacity to model complex functions of arbitrary shape. Splines are alternative tools, which are more flexible in fitting and are capable of handling large scale data. But in high dimensional approximation, the (Tensor product) spline requires sample data on a grid mesh for efficient implementation. Kriging is a type of interpolative tool which was originally developed in geostatistics. It is flexible enough to be implemented on highdimensional spatial data and has been used in many engineering problems. However, the performance is limited by the total number of design points; for example, increasing the number of points in S can quadratically increase the Kriging model construction time.
Heuristic methods Heuristic methods have proven to be practically useful in many realworld applications. We will briefly introduce the three most popular methods: genetic algorithms, tabu search and simulated annealing.
Genetic algorithms [13, 94, 95] are inspired from the process of biological evolution. The algorithm is initialized with a finite set of potential solutions called the population. Each potential solution, referred to as an individual, may be coded as a binary string or a realcoded integer or taken from a fixed alphabet of characters. These solutions are evaluated by a fitness function (normally the objective function) and the fit individuals are assigned a high probability to "reproduce" in the next generation of solutions, a sort of survival of the fittest scheme. In the reproducing process, new offspring inherit traits from parents (crossover) and are subject to mutation changes. The new offspring are accepted using various criteria to form a new population of candidate solutions. The algorithm proceeds to generate more favorable solutions in each iteration and eventually converges to a population with a distribution of good solutions.
Tabu search [45, 46] is a metaheuristic based on the local search method, which iteratively moves the current iterate to a neighbor solution, until certain criteria are satisfied. The algorithm allows a move to a neighbor solution that has a worse objective value. Meanwhile, in order to prevent circling to the previous solutions or infinite loops, a list of tabu or forbidden moves are kept and updated at each iteration. A shortterm memory tabu list is the most used type and is often referred to as a tabu tenure.
Simulated annealing [23, 31] searches local moves randomly from a list of candidate neighbor points. If a better neighbor point is encountered, it replaces the current iterate with probability one, or if a worse point is found, it replaces the iterate with a probability value strictly less than one. The appropriate probability value is determined by the difference of the objective values. For the algorithm to converge, the probability of moving towards a worse point should decrease along the iterations according to the decrement of a certain 'temperature' value, which changes based on a cooling schedule. All of the three algorithms are considered as global optimization methods since they are able to move iterates out of regions where locally optimal solutions reside.
Stochastic approximation Stochastic approximation [67, 104] falls into the category of gradientbased approaches. Typically, the method iteratively updates the current solution by
xfc+i = x_{k} + a_{k} Vf (x_{k}),
where the estimated gradient V f (x_{k}) can be calculated by various gradient estimation tools and a_{k} is the step length. Since the method is an extension of the line search method, it obviously is a local optimization method. Under proper conditions, such as when the error in the gradient approximation and the step length converges to zero as a certain rate, the stochastic approximation method can be shown to converge to a local optimum of the underlying function.
The approximate gradient Vf (x_{k}) is estimated using sample responses F.
Popular gradient estimation methods include perturbation analysis and the likelihood ratio method [43]. Spall's lab develops simultaneous perturbation (SP) and finite difference (FD) based stochastic approximation algorithms: SPSA and FDSA. The lab's web site (www.jhuapl.edu/SPSA) provides a good source of references on these two algorithms.
Derivativefree optimization methods Derivativefree optimization methods [80] are a class of methods that do not try to utilize or directly estimate the gradient value, thus are a good fit for the optimization problem (1.1). Compared to stochastic approximation algorithms, the derivativefree methods avoid the gradient estimation step, which is sensitive and crucial to the convergence of these algorithms. In many practical examples, we find that the gradient estimation tools often become incorrect and problematic when the gradient value gets close to zero (i.e., when near a local solution).
There are two categories of derivativefree methods: the socalled modelbased approach [20, 21] and the pattern or geometrybased approach. The modelbased approach typically constructs a chain of local models that approximate the objective function, and the algorithm proceeds based on model predictions; an alternative to the modelbased approach is the patternbased approach, which directly uses the functional output at locations specified by geometric arguments, such as the pattern search method [70].
One of the optimization methods we apply in the thesis is the UOBYQA algorithm [85] (Unconstrained Optimization BY Quadratic Approximation), which is a modelbased method. It is designed for solving nonlinear problems with a moderate number of dimensions. The method shares similarities to response surface methodology (RSM): both of them construct a series of models to the simulation response during the optimization process. However, many features in UOBYQA, such as the quadratic model update and local region shift have advantages over the classical RSM, and UOBYQA is more mathematically rigorous in convergence.
Dynamic programming and neurodynamic programming Dynamic Programming (DP) [6] problems are a special type of simulationbased optimization problems with internal time stages and state transitions. The objective function is not a single blackbox output, but typically is a combination of intermediate costs during state transitions plus a cost measurement of the final state. Appropriate controls are determined at each time stage, typically in a sequential favor.
NeuroDynamic Programming (NDP) [9, 47, 106] is a class of reinforcement learning methods to solve complex DP problems. The central idea of NDP is to approximate the optimal costtogo function using simple structured functions, such as neuro networks or simulation evaluations. NDP obtains suboptimal controls, trading off expensive computational costs. This feature distinguishes NDP methods from standard DP methods.
1.1.2 White Noise vs. Common Random Numbers
To handle stochastic functional output, we adopt a simple and effective approach  sample multiple replications per point and take the average to reduce the uncertainty. Algorithms without such a procedure can obtain a good solution but not the exact solution, because a singlereplication observation is inevitably affected by noise. The trickly question of setting the replication number is actually algorithm dependent. We have applied tools from Bayesian inference in a novel way to facilitate this computation process.
We separate our optimization approaches into two cases, using an important property of the simulation system, which is based on properties of the replications.
The white noise case corresponds to the situation when simulation outputs F(x,£(u)) are independent for different simulation runs, whatever the input x is. The random component £(u) of the function is not fixable. The simulation outputs are observed to be biased by white noise. The second case is the implementation of Common Random Numbers (CRN), which assumes the random factor £(u) can be fixed; for example, this can be achieved by fixing the random seed in the computer simulation process. One of the advantageous properties of using CRN is that given a realized sample the sample response function F(x,^) is a deterministic function. The introduction of CRN significantly reduces the level of randomness and can facilitate comparisons between different x.
When CRN is used, the expected value function f (x) in (1.1) can be approximated by a deterministic averaged sample response function
1 ^{N}
f (x) « f^{N}(x):= F(x,&, (1.3)
where is the number of samples. This is the core idea of the samplepath method [35, 49, 50, 83, 84, 89], which is a wellrecognized method in simulationbased optimization. The samplepath method is sometimes called theMonte Carlo sampling approach [99] or the sample average approximation method [52, 53, 63, 98, 100, 101]. The samplepath method has been applied in many settings, including buffer allocation, tandem queue servers, network design, etc. A wealth of deterministic optimization techniques can be employed to solve the samplepath problem
min f^{N}(x), (1.4)
which serves as a substitute for (1.1). An optimal solution x*^{N} to the problem (1.4) is then treated as an approximation of x*, the solution of (1.1). Note that the method is not restricted to bounded problems, but in more general settings it requires appropriate deterministic tools (i.e., constrained optimization methods) to be used.
^{1}The functions f^{N} epiconverges to a function f if the epigraphs of the function f^{N }converges to the epigraph of the function f [91].

Convergence proofs of the samplepath method are given in [89, 97]. Suppose there is a unique solution x^{*} to the problem (1.1), then under the assumption that the sequence of functions {f^{N}} epiconverges^{1} to the functionf, the optimal solution sequence {x*^{N}} converges to x* almost surely for all sample paths. See Figure 2 for the illustration of the samplepath optimization method. Starting from x_{0}, for a given N, a deterministic algorithm is applied to solve the samplepath problem. Invoking the above analysis then allows us to use the solution x*^{N} as an approximation to the true solution x^{*}.
1.2 Bayesian Analysis in Simulation
We have extensively used Bayesian analysis in our optimization methods. Our analysis is inspired by its application in selecting the best system, which is one of the fundamental problems in discrete simulationbased optimization.
In Bayesian analysis, a posterior distribution for the simulation output, including both mean and variance information, is derived. Since the posterior distribution has a parametric form, it can be readily assembled and used in optimization methods.
Another important usage of Bayesian analysis is Monte Carlo validation. When Bayesian posterior distributions cannot be used directly in the algorithms, the Monte Carlo validation step serves as an alternative. This idea is motivated by using Monte Carlo simulation of random variables to construct statistical estimators for unknown quantities. For example, in the computation of the expectation of a random variable, a simple way to accomplish this is to generate a large number of realizations from the random variable and take the arithmetic mean (or compute the sample mean). This provides a so called Monte Carlo estimator for the unknown expectation. Note that one crucial part of the Monte Carlo method necessitates an efficient way to generate random samples from a given distribution.
We adopt the same idea in Monte Carlo validation. Assume that Bayesian posterior distributions can capture the real uncertainty of the unknown random quantity well from a practical standpoint. In order to valid a given variance criterion, we can extract a vast amount of samples from the derived posterior distributions and plug them into the criterion to validate it. For example, we may observe that 1 — a of all the samples satisfy the criterion, therefore, we can assert that such a criterion can be satisfied with an accuracy of 1 — a. The Monte Carlo validation step avoids using the forms of the Bayesian distributions directly.
We will introduce the problem of selecting the best system in Section 1.2.1 and the role which Bayesian analysis plays. In Section 1.2.2, we describe the construction of Bayesian posterior estimations and in Section 1.2.3, we illustrate how the analysis is applied to a simple selecting the best system problem.
1.2.1 Selecting the Best System
The goal of selecting the best system is to identify the best system from a collection of candidate systems. The type of problem is encountered many times (in various forms) in our algorithmic designs, as will be shown later. Selecting the best system is equivalent to solving a discrete optimization problem:
argmin E (Yi), (1.5)
i
where Y_{i} is a random variable representing the output of system i, i = 1, 2, • • • ,K. (Without loss of generality, we assume that a desired system has the smallest expected mean.) Let fi_{i} be the unknown mean of Y_{i} and > H[_{2}\ > • • • > H[k\ be the ordered sequence of the means but such a sequence is unknown. We prefer to determine the index [k] and there is a
probability of correct selection (PCS) value that quantifies the correctness: PCS = Pr(select Y_{[K}]\>fi[K] ,j = 1, 2, ■■■ ,K  1). (1.6)
The difficulty of selecting the best system is to design a statistically accurate procedure that identifies the best system with the condition
PCS > 1  a,
where a is a threshold value, while using the least number of realized samples of Yi.
The standard approach to selecting the best system includes IndifferenceZone Ranking and Selection (IZR&S). An indifferencezone parameter 8 should first be prescribed. When the means of two systems are within a tolerance 8, it is indifferent to choose either system. One of the most influential IZR&S approaches is a twostage sampling strategy, described by Rinott [88] in 1978. Firstly, r_{0} replications of each system are simulated to estimate the preliminary sample variance. Secondly, it evaluates additional replications for each system, whereby numbers of replications are determined by the corresponding variance information. The best system is then chosen as the one having the smallest sample mean. Kim and Nelson [61] include a screen step that is designed to eliminate a set of inferior systems from consideration at an early stage. A good survey on the IZR&S procedures can be found in [62].
The Bayesian method is an alternative to the standard IZR&S [16, 17, 19]. The idea is to construct posterior distributions for the underlying mean j_{i} using observations and acquired knowledge. The selection accuracy is accordingly estimated via joint posterior distributions. There are many practically efficient procedures including Chen's OCBA [16] (Optimal Computer Budget Allocation) and Chick's 01(B) [17, 19]. The report [56] offers an empirical comparison of these methods, as well as Rinott's twostage IZR&S method.
The Bayesian approach has advantages over the traditional IZR&S approach. First, it is not necessary to set the indifferencezone parameter 5. In fact, it deals with the ranking and selection problem (1.5) directly. Moreover, it utilizes both the mean and variance information of the data, while IZR&S only uses the variance information directly.
1.2.2 Constructing Bayesian Posterior Estimations
The Bayesian approach described in this subsection is standard and close to that presented in Chick's papers [17, 19].
Given a set of points [x^{1},x^{2},... ,x^{L}}, we assume the simulation output at these points
F = (F(x^{1}, £(u)), F(x^{2}, £(u)),..., F(x^{L}, £(u)))
is a multivariate normal variable, with mean j = (^(x^{1}),..., fi(x^{L})) and
covariance matrix E:
F ~ N(p, E).
(1.7)
In the white noise case, since simulation outputs are independent, E should be a diagonal matrix; while for the CRN case, it is typically not. Suppose we evaluate N replications of simulation runs at each point, the existing dataX can be accumulated as an N x L matrix, with
Let p and E denote the sample mean and sample covariance matrix of the data. For simplicity, we introduce the notation s_{i} = (F(x^{1},^),..., F(x^{L}, ^)), i 1,..., N, so that
r
s1
X
S2
^{s}N
The sample mean and sample covariance matrix are calculated as
1
N
N
i=1
(f^{N} (x^{1}),...,^ (x^{L})),
(1.8)
and
1
N1
N
i=1
(1.9)
Since we do not have any prior assumption for the distributions of j and E, we assign noninformative prior distributions for them. In doing this, the joint posterior distributions of j and E are derived as
E\X ~ WishartL(E,N + L  2),
j\E,X ~ N(j, E/N). (1.10)
Here the Wishart distribution Wishart_{p}(v,m) has covariance matrix v and m degrees of freedom [25]. The Wishart distribution is a multivariate generalization of the x^{2} distribution.
The distribution of the mean value is of most interest to us. When the sample size is large, we can replace the covariance matrix E in (1.10) with the sample covariance matrix E, and asymptotically derive the posterior distribution of \ X as
j\X ~ N(j,E/N). (1.11)
It should be noted that, with an exact computation, the marginal distribution of \ X inferred by (1.10) (eliminating E) is,
j\X ~ StL(/,NE^{1},N  1), (1.12)
where a random variable with Student's tdistribution St_{L}(j, K, m) has mean j, precision k, and m degrees of freedom. The normal formulation (1.11) is more convenient to manipulate than the tversion (1.12). We have applied both forms in our algorithms when appropriate, but typically prefer the simpler normal form.
Particularly, in the white noise case, the distribution in (1.12) is separable. We have the component posterior distribution
j(x^{i})\X ~ N(j(x^{i}),a^{2}(x^{i})/N),
where a^{2}(x^{i}) is the sample mean for the point x^{%} and is the ith component in the diagonal of the matrix E.
While the multivariate normal assumption (1.7) is not always valid, several relevant points indicate that it is likely to be satisfied in practice [18].
• The form (1.7) is only used to derive the (normal) posterior distribution j\X.
• Other types of distribution assumptions may be appropriate in different circumstances. For example, when a simulation output follows a Bernoulli 01 distribution, then it would be easier to perform parameter analysis using beta prior and posterior distributions. The normal assumption (1.7) is the more relevant to continuous simulation output with unknown mean and variance.
• The normal assumption is asymptotically valid for many applications. Many regular distributions, such as distributions from the exponential family, are normallike distributions. The analysis using normal distributions is asymptotically correct.
1.2.3 The Bayesian Method for Selecting the Best System
As an introductory application, we consider a preliminary case in selecting the best system when there are only two systems involved, K = 2. A simplified analysis on how to compute the PCS is presented as follows. Interested readers can seek details (e.g., multiple comparisons and procedures) in [17, 19].
Let X = {yij,i = 1, 2,j = 1, 2, • • • ,r_{i}} denote two sequences of output of both systems. The sample mean ji_{i}and sample variance a^{2} are defined as Ui = E^{r}_{3}Li ^{y}i,j/n ^{and}
A decision is drawn by directly comparing the two sample means. The system evidenced with a smaller average output is selected:
{

1, if Ui < U2; 2, otherwise.
Without loss of generality, we assume that we observe the order of the sample means U_{i} < U_{2} and select the first system as the winner.
In the Bayesian framework, the posterior distribution of fi_{i}\X can be asymptotically estimated as
Ui\X ~ N(fii,a^{2}/Ti). (1.13)
Following our earlier assumption, the PCS corresponds to the probability of event {j1 < j^},
PCS ~ Pr(j1 < j2\ X) = Pr(jn\X  j2\X < 0). (1.14)
The difference of two normal random variables j_{1} \X and ji_{2}\X remains a normal random variable. It is straightforward to show:
Therefore, the PCS defined in (1.14) is a tail probability of a normal distribution:
PCS ~ Pr(N jj2, a1/n + a\/r_{2}) < 0). (1.15)
The computation corresponds to a single cdf evaluation of a onedimensional normal distribution. The value can be computed either by looking up in a standard cdf table or using routines in mathematics software, i.e., Matlab's function normcdf.
In general, for formulation (1.15), one may expect the PCS value to become higher as the numbers of replicationsr_{1} and r_{2} increase.
1.3 The TwoPhase Optimization Framework
The thesis focuses on a twophase optimization framework for solving (1.1), which is motivated by response surface methodology. Recalling the model construction step (1.2), the overall error in approximation comes from two sources: random error and model error. Random error is the type of error directly induced by the uncertainty£(u) in the sample response; model error is due to an inappropriate choice of functional forms to fit the data. For example, fitting a cubic curve with a quadratic function will result in a large discrepancy, when the domain of interest is not appropriately restricted. This type of error exists independent of the random term £(<j). In the global view, the model error dominates the random error; therefore, constructing a surrogate function aims at reducing the model error to the maximum extent. However, in the local view, reducing the random error becomes the primary consideration. For this reason, we design a twophase framework for simulationbased optimization which addresses distinct goals:
1. Phase I is a global exploration and rough search step. The algorithm explores the entire domain and proceeds to determine potentially good subregions for future investigation. We assume the observed randomness in the subregion detection process is insignificant and can be ignored. Appropriate criteria to determine when a good subregion has been identified are required.
2. Phase II is a local exploitation step. Local optimization algorithms are applied in each subregion to determine the exact optimum. The algorithms are required to deal with the randomness explicitly, since in local subregions, random error is considered as the dominating feature.
Phase I typically produces one or multiple good points representing centers of good local subregions. These points are used as starting points for the Phase II local search algorithms. Alternatively, Phase I may generate geometric information for the location and shape of the subregions, which are more interesting for analyzing the significance, interactions and patterns of individual dimensions.
1.3.1 The WISOPT Structure
We design an optimization package WISOPT (which stands for Wisconsin Simulation OPTimization) incorporating different optimization methodologies, based on the twophase framework. See Figure 3 for a flow chart.
Phase I is a global exploration step over the entire domain. Algorithms in Phase I should be able to generate (and evaluate) densely distributed samples in promising subregions and sparsely distributed samples in inferior subregions. The entire set of samples will be passed to a phase transition procedure between Phase I and Phase II, which implements a nonparametric statistical method to determine starting points and surrounding subregions for multistart Phase II optimizations.
One of our Phase I methods employs classification tools to facilitate the global search process. By learning a surrogate from existing data the approach identifies promising subregions and generates dense samples in the subregions. Additional features of the method are: (a) more reliable predictions obtained using a voting scheme combining the options of multiple classifiers, (b) a data preprocessing step that copes with imbalanced training data. Another Phase I method is the Noisy DIRECT (DIviding RECTangles) algorithm, which is an extension of the DIRECT optimization algorithm to the noisy case. As with the classificationbased global search, the method returns a collection of promising points together with surrounding rectangles.
Phase II performs local derivativefree optimization based on the UOBYQA (Unconstrained Optimization BY Quadratic Approximation) algorithm, in each of the subregions identified. We do consider whether we can implement CRN in the simulator, corresponding to the VNSPUOBYQA (VariableNumber SamplePath UOBYQA) algorithm for the CRN case and the Noisy UOBYQA algorithm for the white noise case. Both algorithms apply Bayesian techniques to guide appropriate sampling strategies while simultaneously enhancing algorithmic efficiency to obtain solutions of a desired accuracy. The statistically accurate scheme determines the number of simulation runs and guarantees the global convergence of the algorithm.
A key component in the extended algorithms is to incorporate distribution information provided by Bayesian estimations. Bayesian inference is an effective tool in input modeling and uncertainty analysis. In particular, in order to prevent unnecessarily exploring hyperrectangles in inferior regions,
DIRECT tests the accuracy of selecting hyperrectangles for future partitioning using a Monte Carlo validation process. Trial samples are extracted from the Bayesian posterior distributions to validate a predefined variance criterion. In UOBYQA algorithms, we derive the posterior volatility for the local model construction step, and therefore control the uncertainty of solutions in the trust region subproblems.
Certainly, the variety of methods in both phases is not restricted to what we present. Additional methods that satisfy the purposes of both phases may work as new modules and can be plugged into the twophase framework.
1.3.2 Introduction to the Methodologies
We give a brief introduction to the methodologies we use in WISOPT (refer to Figure 3). Details will be discussed in later chapters.
Classificationbased global optimization A good surrogate model for the entire space (often noted as a metamodel) may require a large amount of simulations and can be very expensive to compute, but it may be capable of approximating global behavior of the underlying function f. Since Phase I only attempts to determine promising subregions of the search space, the model can be constructed in a coarser manner.
Classificationbased global optimization
or
Noisy DIRECT
Phase II

Phase transition
VNSPUOBYQA
Noisy UOBYQA
Figure 3: The twophase WISOPT structure.
We are indeed interested in the behavior of a simple indicator function:
(1.16)
where promising subregions in the method correspond to level sets. This function gives sufficient information to determine where a subregion is located. Approximating the indicator function I is simpler than approximating the underlying function f, especially in a high dimension case. Normally, we sample dense designs in the subregion and sparse designs out of the subregion.
In the classificationbased global search, we utilize the predicting power of a classifier: the classifier works as a surrogate function for the indicator function I, which reveals the location of promising subregions. A classifier is a cheap mechanism to predict whether new samples are in promising subregions or not, thus we can generate a dense collection of points in these subregions. Figure 4 illustrates the local regions (the dotted circles and the level sets) and the samples (the '+'s). The method fits well to the theme of the Phase I optimization, which is to identify local regions. In fact, the method is relatively insensitive to noise, because the simplification step (1.16) smoothes out the occurrence of noise. That is reason we normally do not use replicated samples in training the classifiers. Setting the algorithm may require specific parameters; for example, we have to potentially understand the proper number of samples to use in training the classifiers.
Classification forms a major branch in in machine learning/data mining [51, 77]. In the past couple of years, there has been increasing interest on techniques interfacing optimization and machine learning. For example, Kim and Ding [60] have implemented data mining tools in an engineering design optimization problem. Mangasarian [74] has enhanced optimization robustness in support vector machines.
10, , , , , , , , , , 1
10 8 6 4 2 0 2 4 6 8 10
Figure 4: Predicated local subregions of the Griewank function. The function has local optimums in each subregion (circles) and the global optimum at [0, 0].
The Noisy DIRECT algorithm The DIRECT optimization method [37, 38, 58, 59] is a deterministic global optimization algorithm for boundconstrained problems. The algorithm, first motivated by Lipschitz optimization [59], has proven to be effective in a wide range of application domains. The algorithm centers around a spacepartitioning scheme that divides large hyperrectangles into small ones. Promising hyperrectangles are subject to further division. Figure 5 provides an illustration of the algorithm on the Goldstein Price function. The algorithm therefore proceeds to gradually explore promising subregions.
Figure 5: The DIRECT optimization algorithm on the Goldstein Price function. The contour plot of the Goldstein Price function is shown in Figure 15.
When the objective function is subjected to uncertainty, some crucial operational steps of the DIRECT algorithm are affected. For example, the choice of potentially optimal hyperrectangles becomes incorrect because of the noisy function values, possibly misleading the algorithm to search in inferior regions. We modify the original DIRECT algorithm using a simple approach  multiple replications are sampled to reduce output uncertainty. We expect the algorithm to proceed correctly as in the deterministic case. However, we must face the issue of handling the tradeoff between two design goals: efficiency of the algorithm versus total computational effort. Since the objective function is often computationally expensive to evaluate, we must be very cautious in using function evaluations. On the other hand, we need to maintain a certain precision in the functions for correctness of the algorithm. In our modification, we apply Bayesian techniques to derive a posterior distribution for the function output at each point, and incorporate the distribution information into the algorithm to determine an appropriate number of replications to be used.
The parameters for this algorithm are easy to set. Since deterministic DIRECT is designed for both global and local optimization, one should note that it is desirable to terminate the algorithm at a reasonable time, at which sufficient information of local regions can be identified. This can avoid unnecessary function evaluations for handling comparisons that are really dominated by noise.
The phase transition Using the evaluated samples in Phase I, the phase transition procedure consists of a nonparametric local quadratic regression method to determine the appropriate subregion size. Being different from regular regression methods, which use the entire set of samples in the domain to construct one model, local regression makes a prediction (at a point) using a local model based on samples within a 'window size', thus the approach values the local behavior of a function more. 'Nonparametric' means the regression model is not from a single parametric family. It is presumed that the samples outside the local region have a slight relevance to the current prediction. In our procedure, we treat the resulting 'window size' as our subregion radius.
A sequence of good starting points is generated, satisfying the criteria: (a) each starting point is the center of a subregion, (b) the subregions are mutually separated. The sequence of starting points and the subregion sizes are passed to Phase II for local processing, possibly in a parallel setting.
Extended UOBYQA algorithms In Phase II, the deterministic UOBYQA algorithm is applied as the base local search method and is extended for noisy function optimization. The method is an iterative algorithm in a trust region framework [80], but it differs from a classical trust region method in that it creates quadratic models by interpolating a set of sample points instead of using the gradient and Hessian values of the objective function (thus making it a derivativefree tool). Besides UOBYQA, other modelbased software include WEDGE [75] and NEWUOA [86]. We choose UOBYQA, because it
We developed variants of the original UOBYQA, called the VNSPUOBYQA and the Noisy UOBYQA, that are adapted for noisy optimization problems. The extension idea is similar to that of the Noisy DIRECT algorithm. We sample multiple replications per point to reduce variance and apply Bayesian techniques to guide appropriate sampling strategies to estimate the objective function. The two algorithms employ different mechanisms in the sampling process. The VNSPUOBYQA determines appropriate replication numbers by whether sufficient reduction is identified in the trustregion subproblem, while the Noisy UOBYQA determines the number by whether the quadratic models can be shown to be stable or not. Generally speaking, when CRN is implemented, the noise is relatively easy to handle because it is correlated among sites. Therefore, it is shown that the VNSPUOBYQA has better convergence properties.
1.4 Outline of the Thesis
The general theme of thesis centers around the WISOPT package, which is designed based on the twophase optimization framework. Phase I techniques are described in Chapter 2 and Phase II techniques are described in Chapter 3.
More specifically, the first section of Chapter 2 presents the classificationbased global search. The detailed procedures of applying several types of classifiers are included, together with two special features: a voting scheme to assemble multiple classifiers and imbalanced data handling. The second section introduces and explains the development of the Noisy DIRECT algorithm. We will first describe the deterministic DIRECT algorithm, then analyze how to enhance the algorithm under uncertain conditions. Several numerical examples and a simulation problem are presented. The third section is on the procedure of the phase transition.
Chapter 3 contains the details about the Phase II noisy versions of the UOBYQA algorithm. The ideas of the extensions of both the VNSPUOBYQA and the Noisy UOBYQA are similar: the Bayesian posterior distributions for the parameters of the quadratic model are derived and further we can estimate the stability of the algorithms. Since the VNSPUOBYQA is in the CRN setting, the corresponding section is written in a more rigorous manner, with the convergence proof of the algorithm provided.
We show in Chapter 4 two realworld simulation optimization examples. We aim to fine tune the simulation parameters in the Wisconsin Breast Cancer Epidemiology model, which is from a collaborative project with researchers in the Medical School at the University of Wisconsin. In the second example, we optimize the shape of a type of microwave coaxial antenna for hepatic tumor ablation, which is a current ongoing project with the Biomedical Engineering Department at the University of Wisconsin.
In Chapter 5 we demonstrate a special simulationbased problem in dynamic programming. This type of simulation problem has an internal time structure  the objective function is not a pure blackbox stochastic function, but is constituted of a sequence of costs along the time stages. We modify the rollout algorithm from neurodynamic programming, using a similar Bayesian analysis to that outline above to improve the simulation efficiency in training and in deriving optimal controls at each stage. We illustrate the effectiveness of our new algorithm using an example in fractionated radiotherapy treatment. This approach to using Bayesian analysis in neurodynamic programming effectively generalizes the methods of this thesis to time domain problems, and leads to further possible applications in other optimization contexts.
Tags
MCA Notes