Trajectory Optimization for Safe Exploration

This article serves as an introduction to my recent work on “Stochastic Optimal Control for Safe Exploration.”

Real-time safe exploration algorithms are an enabling technology for deploying autonomous robotic systems in uncertain environments. Using a safe exploration algorithm, a robot can learn the uncertainty online and improve its performance by accounting for it. For example, consider a flying robot deployed in a cluttered environment with no prior knowledge of potential non-conservative forces, such as drag, that could perturb it from the expected motion. In this scenario, we can execute a trajectory with the help of a human expert to collect data about the perturbations and learn the non-conservative forces; the estimated forces are then used to improve performance. This approach is akin to techniques used in system identification and model learning. For full autonomy, however, a novel approach is required that computes probabilistic safe trajectories and policies online, so the robot can collect data to improve safety and performance after deployment. To compute safe trajectories under partial knowledge and execute them safely, we propose the following episodic framework in Ref. [1].

An end-to-end episodic framework for safe exploration using chance-constrained trajectory optimization. In the framework, an initial estimate of the dynamics is computed using a known safe control policy. A probabilistic safe trajectory and policy that satisfies safety chance-constraints is computed using Info-SNOC for the estimated dynamics. This policy is used for rollout with a stable feedback controller to collect data.

Sequential Bayesian Optimization

Before proceeding further on how to design safe exploration trajectories for robotic systems, I will discuss experiment design and its application in Bayesian optimization. Consider the following sequential Bayesian optimization problem for epoch e>0:

\begin{aligned}
    \max_{x \in \mathcal{X}} & \quad  f(x)\\
    s.t. \quad & y = f(x) + \xi, \quad \xi \sim \mathcal{N}(0,\sigma_{\xi}^{2})
\end{aligned}

where the function f is unknown. We can measure the output y at any sample point x \in \mathcal{X}. At each epoch e, we sample a point x_e \in \mathcal{X} and measure the evaluation y_e, with the goal of maximizing the total reward \sum_{e = 1}^{T} f(x_e) and thereby identifying x^* = \mathrm{arg}\max f(x). Many engineering problems can be formulated as the above optimization problem with an unknown reward. For example, given a distributed temperature sensing architecture, find the location with maximum temperature. This example is taken from Ref. [2].
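To make the setup concrete, here is a minimal sketch in Python. The reward f, the discretized decision set, and the uniform-sampling baseline are all illustrative assumptions for this example, not choices from Ref. [2]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unknown reward: f is hidden from the learner; only noisy
# evaluations y = f(x) + xi are observable.
def f(x):
    return np.sin(3.0 * x) + 0.5 * x

sigma_xi = 0.1                    # observation noise std
X = np.linspace(0.0, 2.0, 200)    # discretized decision set  \mathcal{X}

def measure(x):
    """Noisy oracle: y = f(x) + xi, xi ~ N(0, sigma_xi^2)."""
    return f(x) + rng.normal(0.0, sigma_xi)

# Sample sequentially for T epochs; the goal is a large total reward
# sum_e f(x_e), equivalently a small cumulative regret vs. x* = argmax f.
T = 50
x_star = X[np.argmax(f(X))]
samples = [X[rng.integers(len(X))] for _ in range(T)]  # naive uniform baseline
regret = sum(f(x_star) - f(x) for x in samples)
print(f"cumulative regret of uniform sampling over {T} epochs: {regret:.2f}")
```

Uniform sampling incurs regret that grows linearly in T; the exploration strategies below aim to do better.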

An important question in sequential optimization is how to choose the best x_e for fast reduction of uncertainty in f and guaranteed convergence of the cumulative reward, given the current estimate of f(x_{e-1}). This is often referred to as the exploration strategy in the literature. In the Gaussian process multi-armed bandit setting, it was shown in [2] that the Gaussian Process Upper Confidence Bound (GP-UCB) cost function can be used to pick the x_e that has maximum information about f. The GP-UCB problem is defined below:

\begin{aligned}
    x_e = \arg \max_{x\in\mathcal{X}} \mu_{e-1}(x) + \beta^{1/2} \sigma_{e-1}(x),
\end{aligned}

where \beta is chosen based on the confidence bound. The approach optimally weighs exploration and exploitation and has information-theoretic guarantees for convergence. We extend this approach to optimal control problems to compute an informative trajectory for exploration under safety constraints.
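As an illustration, here is a minimal GP-UCB sketch with a hand-rolled Gaussian process posterior. The RBF kernel, its length scale, and \beta = 4 are assumptions for the example, not values from Ref. [2]:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                          # hidden reward, used only to simulate y_e
    return np.sin(3.0 * x) + 0.5 * x

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.linspace(0.0, 2.0, 200)     # candidate set \mathcal{X}
sigma_xi, beta = 0.1, 4.0          # noise std and confidence weight beta
xs, ys = [], []

for e in range(20):
    if xs:
        Xe = np.array(xs)
        K = rbf(Xe, Xe) + sigma_xi ** 2 * np.eye(len(xs))
        k = rbf(X, Xe)
        mu = k @ np.linalg.solve(K, np.array(ys))         # mu_{e-1}(x)
        var = 1.0 - np.einsum('ij,ji->i', k, np.linalg.solve(K, k.T))
        sigma = np.sqrt(np.clip(var, 0.0, None))          # sigma_{e-1}(x)
    else:                          # no data yet: GP prior
        mu, sigma = np.zeros_like(X), np.ones_like(X)
    x_e = X[np.argmax(mu + np.sqrt(beta) * sigma)]        # GP-UCB acquisition
    xs.append(x_e)
    ys.append(f(x_e) + rng.normal(0.0, sigma_xi))         # noisy evaluation

print(f"best sampled point after 20 epochs: {max(xs, key=f):.3f}")
```

The acquisition rule trades off exploitation (large posterior mean) against exploration (large posterior standard deviation), exactly the \mu_{e-1}(x) + \beta^{1/2}\sigma_{e-1}(x) objective above.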

System Identification

On a similar note, exploration policy design has long been studied in the system identification literature. For system identification of the unknown parameter \theta in the linear system

\begin{aligned}
\dot{x} =Ax + Bu +\theta,
\end{aligned}

the control design typically has two components, u = u_s + u_e, where u_s = Kx is a stabilizing state feedback control and u_e \sim \mathcal{N}(0,\Sigma) is Gaussian noise for excitation. The covariance matrix \Sigma is designed to persistently excite the system to collect more information about the unknown parameter \theta. A simple practical technique for identifying the parameters of a linear model of a structural element is to tap it, applying random force inputs. There are also techniques for identification in the Laplace domain; I suggest readers look at Ref. [3] for a frequency-domain approach to system identification of linear systems. Inspired by the aforementioned ideas, Ref. [1] proposes an end-to-end framework for robots to learn their interaction with the world.
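The excitation idea can be sketched as follows. The double-integrator system, gains, and noise levels below are illustrative assumptions, not the setup of Ref. [1]:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical double-integrator example; A, B, K, and the noise levels
# are assumptions for illustration.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
K = np.array([-2.0, -3.0])          # stabilizing feedback u_s = K x
theta = np.array([0.1, -0.2])       # unknown constant disturbance
Sigma_e = 0.5                       # excitation std for u_e ~ N(0, Sigma_e^2)

dt, N = 0.01, 5000
x = np.zeros(2)
residuals = []

for _ in range(N):
    u = K @ x + rng.normal(0.0, Sigma_e)        # u = u_s + u_e
    w = rng.normal(0.0, 0.3, size=2)            # process noise
    xdot = A @ x + B.ravel() * u + theta + w    # true (unknown) dynamics
    # The model residual xdot - Ax - Bu equals theta + w, so averaging
    # the residuals gives the least-squares estimate of theta.
    residuals.append(xdot - A @ x - B.ravel() * u)
    x = x + dt * xdot                           # forward-Euler rollout

theta_hat = np.mean(residuals, axis=0)
print("theta_hat ≈", np.round(theta_hat, 3))    # close to [0.1, -0.2]
```

The Gaussian excitation u_e keeps the closed loop persistently excited, so the residual average converges to \theta as more data are collected.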

The problem formulation for the Info-SNOC algorithm is as follows. Given a linear robotic system \dot{x} = Ax + Bu + \theta with unknown parameter \theta, can we design a control u and state trajectory x that improve the prior estimate \mathcal{N}(\mu_{\theta},\Sigma^\top_{\theta}\Sigma_{\theta}) of \theta? In line with the above system identification problem and experiment design, consider the following linear stochastic optimal control problem with probabilistic safety constraints. The problem combines a UCB-style cost with fuel optimality for efficient learning.

Chance-Constrained Stochastic Optimal Control

\begin{aligned}
    J = \min_{x,u} & \int_{t_0}^{t_f}\left[\|u\|_{1} - \sum_{i}\mu_{\theta_{i}} - \mathrm{tr}(\Sigma_{\theta}) \right]dt \\
    s.t. & \quad dx = Axdt + Budt + \mu_{\theta}dt + \Sigma_{\theta} dw \\
    & \quad \mathrm{Pr}(x\in \mathcal{F}) \geq 1- \epsilon\\
    & \quad u \in \mathcal{U}\\
    & \quad x(0) = x_0, \quad x(t_f) = x_f
\end{aligned}

The important aspects to consider here are the stochastic dynamics and their propagation. The chance constraint \mathrm{Pr}(x\in \mathcal{F}) \geq 1- \epsilon represents the probabilistic safety. The feasible policy set is given by \mathcal{U}. The initial and terminal conditions are specified by x_0 and x_f. Note that finding a globally optimal solution to this problem is not possible, so we compute a tractable approximation. The stochastic optimal control problem is solved by projecting it to a deterministic space and using sequential convex programming.
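To illustrate the projection to a deterministic space, here is a minimal one-dimensional sketch: for a Gaussian state and a half-space safe set, the chance constraint reduces to a mean constraint tightened by a quantile-scaled margin. The multivariate trajectory version is handled in Ref. [1]; this is only the scalar special case:

```python
from statistics import NormalDist

# For x ~ N(mu, sigma^2) and a half-space safe set F = {x : a x <= b},
#   Pr(x in F) >= 1 - eps
# is equivalent to the deterministic tightened constraint
#   a mu <= b - z_{1-eps} * |a| * sigma,
# i.e. the mean must keep a quantile-scaled margin from the boundary.

def tightened_bound(b, a, sigma, eps):
    """Deterministic right-hand side replacing b in the mean constraint."""
    z = NormalDist().inv_cdf(1.0 - eps)   # standard normal quantile z_{1-eps}
    return b - z * abs(a) * sigma

def is_chance_safe(mu, sigma, a, b, eps):
    """Check Pr(a x <= b) >= 1 - eps for x ~ N(mu, sigma^2)."""
    return a * mu <= tightened_bound(b, a, sigma, eps)

# Example: obstacle boundary at x = 1 (safe set x <= 1), eps = 0.05.
print(is_chance_safe(mu=0.5, sigma=0.2, a=1.0, b=1.0, eps=0.05))  # True
print(is_chance_safe(mu=0.9, sigma=0.2, a=1.0, b=1.0, eps=0.05))  # False
```

The deterministic surrogate constraint is convex in the mean, which is what makes the sequential convex programming step tractable.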

A depiction of a typical result obtained by applying Info-SNOC.

The above figure is an example scenario for the probabilistic formulation and the expected output of the algorithm designed in Ref. [1]. It compares a trajectory computed with full knowledge of the robot model using a standard optimal control approach against one computed for the partially known model using the Info-SNOC algorithm. The partial knowledge leads to the probabilistic safety formulation discussed above; the obstacles are the safety constraints in this case. The Info-SNOC algorithm computes a probabilistic safe trajectory. In order to explore, we then need to compute a safe and exploratory policy from this probabilistic safe trajectory.

Control loop design for safe exploration.

The control loop for safe exploration includes a stable feedback controller and a safety filter for real-time safety. The details of the control design and exploration strategy can be found in Refs. [1,4,5]. The safety filter checks for potential safety violations at each time step and augments the feedback controller to ensure safety using the sensor data. This is crucial for deploying the robot.

A depiction of a typical result obtained by applying the end-to-end framework.

In the above figure, the probabilistic trajectories resulting from the episodic framework for multiple epochs are placed adjacent to each other to show the expected outcome. Simulation results on a spacecraft dynamics simulator robot can be found in Ref. [1]. We compute a probabilistic safe trajectory given an initial and terminal condition. The picture shows the variance around a mean trajectory at each epoch, with learning from the collected data at the end of each exploration. The reduction in terminal variance implies learning consistency.

If you have any questions and feedback, please contact me via email.

References

[1] Y. K. Nakka, A. Liu, G. Shi, A. Anandkumar, Y. Yue, and S.-J. Chung, “Chance-constrained trajectory optimization for safe exploration and learning of nonlinear systems,” IEEE Robotics and Automation Letters, under review, 2020.

[2] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: no regret and experimental design,” in Proc. of Int. Conf. on Mach. Learning, 2010, pp. 1015–1022.

[3] R. Pintelon and J. Schoukens, System Identification: A Frequency Domain Approach. John Wiley & Sons, 2012.

[4] Y. K. Nakka, R. C. Foust, E. S. Lupu, D. B. Elliott, I. S. Crowell, S.-J. Chung, and F. Y. Hadaegh, “Six degree-of-freedom spacecraft dynamics simulator for formation control research,” in AAS/AIAA Astrodynamics Specialist Conference, 2018.

[5] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in Euro. Control Conf., 2019, pp. 3420–3431.
