Full publication list can be found on Google Scholar.
^{*} indicates equal contribution.

- Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits

Tianyuan Jin,**Pan Xu**, Xiaokui Xiao, Anima Anandkumar

In Proc. of the 35th Conference on Advances in Neural Information Processing Systems (**NeurIPS**), New Orleans, Louisiana, USA, 2022.

[Summary] [Paper]We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the finite-time regret bound as well as the asymptotic regret bound.

- Active Ranking without Strong Stochastic Transitivity

Hao Lou, Tao Jin, Yue Wu,**Pan Xu**, Farzad Farnoud, Quanquan Gu

In Proc. of the 35th Conference on Advances in Neural Information Processing Systems (**NeurIPS**), New Orleans, Louisiana, USA, 2022.

[Summary]We consider the problem of recovering the exact full ranking for a list of items under ranking models that do \emph{not} assume the Strong Stochastic Transitivity property. We propose a $\delta$-correct algorithm, Probe-Rank, that actively learns the ranking of the items from noisy pairwise comparisons. We prove a sample complexity upper bound for Probe-Rank, which only depends on the preference probabilities between items that are adjacent in the true ranking. This improves upon existing sample complexity results that depend on the preference probabilities for all pairs of items.

- Langevin Monte Carlo for Contextual Bandits

**Pan Xu**, Hongkai Zheng, Eric Mazumdar, Kamyar Azizzadenesheli, Anima Anandkumar

In Proc. of the 39th International Conference on Machine Learning (**ICML**), Baltimore, Maryland, USA, 2022.

[Summary] [Paper] [Code]We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets on different contextual bandit models, which demonstrates that directly sampling from the posterior is both computationally efficient and competitive in performance.

- Neural Contextual Bandits with Deep Representation and Shallow Exploration

**Pan Xu**, Zheng Wen, Handong Zhao, Quanquan Gu

In Proc. of the 10th International Conference on Learning Representations (**ICLR**), Online, 2022.

[Summary] [Paper] [Code]We study neural contextual bandits, a general class of contextual bandits, where each context-action pair is associated with a raw feature vector, but the specific reward generating function is unknown. We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network (deep representation learning), and uses an upper confidence bound (UCB) approach to explore in the last linear layer (shallow exploration). We prove that under standard assumptions, our proposed algorithm achieves $\widetilde{O}(\sqrt{T})$ finite-time regret, where $T$ is the learning time horizon. Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.

- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons

Yue Wu*, Tao Jin*, Hao Lou,**Pan Xu**, Farzad Farnoud, Quanquan Gu

In Proc. of the 25th International Conference on Artificial Intelligence and Statistics (**AISTATS**), Online, 2022.

[Summary] [Paper]In heterogeneous rank aggregation problems, users often exhibit various accuracy levels when comparing pairs of items. Thus, a uniform querying strategy over users may not be optimal. To address this issue, we propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons from multiple users and improves the users' average accuracy by maintaining an active set of users. We prove that our algorithm can return the true ranking of items with high probability. We also provide a sample complexity bound for the proposed algorithm, which outperforms the non-active strategies in the literature and close to oracle under mild conditions. Experiments are provided to show the empirical advantage of the proposed methods over the state-of-the-art baselines.

- Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US

Estee Y Cramer, ... ,**Pan Xu**, ... , Nicholas G. Reich.

Proceedings of the National Academy of Sciences (**PNAS**), Volume 119, No. 15, 2022.

[Summary] [Paper]This paper compares the probabilistic accuracy of short-term forecasts of reported deaths due to COVID-19 during the first year and a half of the pandemic in the United States. Results show high variation in accuracy between and within stand-alone models and more consistent accuracy from an ensemble model that combined forecasts from all eligible models. This demonstrates that an ensemble model provided a reliable and comparatively accurate means of forecasting deaths during the COVID-19 pandemic that exceeded the performance of all of the models that contributed to it. This work strengthens the evidence base for synthesizing multiple models to support public-health action.

- Double Explore-then-Commit: Asymptotic Optimality and Beyond

Tianyuan Jin,**Pan Xu**, Xiaokui Xiao, Quanquan Gu

In Proc. of the 34th Annual Conference on Learning Theory (**COLT**), Online, 2021.

This work has been presented at the NeurIPS 2020 Offline Reinforcement Learning Workshop.

[Summary] [Paper]We propose a double explore-then-commit (DETC) algorithm that has two exploration and exploitation phases and show that it can achieve the asymptotically optimal regret bound. To our knowledge, DETC is the first non-fully-sequential algorithm that achieves asymptotic optimality. In addition, we extend DETC to batched bandit problems, where (i) the exploration process is split into a small number of batches and (ii) the round complexity is of central interest. We prove that a batched version of DETC can achieve the asymptotic optimality with only a constant round complexity. This is the first batched bandit algorithm that can attain the optimal asymptotic regret bound and optimal round complexity simultaneously.

- Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling

Difan Zou,**Pan Xu**, Quanquan Gu

In Proc. of the 37th International Conference on Uncertainty in Artificial Intelligence (**UAI**), Online, 2021.

[Summary] [Paper]We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave. At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain. Under certain conditions on the target distribution, we prove that $\widetilde O(d^4\epsilon^{-2})$ stochastic gradient evaluations suffice to guarantee $\epsilon$-sampling error in terms of the total variation distance, where $d$ is the problem dimension. This improves existing results on the convergence rate of SGLD \citep{raginsky2017non,xu2018global}. We further show that provided an additional Hessian Lipschitz condition on the log-density function, SGLD is guaranteed to achieve $\epsilon$-sampling error within $\widetilde O(d^{15/4}\epsilon^{-3/2})$ stochastic gradient evaluations. Our proof technique provides a new way to study the convergence of Langevin based algorithms, and sheds some light on the design of fast stochastic gradient based sampling algorithms.

- MOTS: Minimax Optimal Thompson Sampling

Tianyuan Jin,**Pan Xu**, Jieming Shi, Xiaokui Xiao, Quanquan Gu

In Proc. of the 38th International Conference on Machine Learning (**ICML**), Online, 2021.

[Summary] [Paper]Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity in implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems, where $T$ is the total time horizon. In this paper, we solve this long open problem by proposing a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the asymptotic optimal regret bound for Gaussian rewards when $T$ approaches infinity. To our knowledge, MOTS is the first Thompson sampling type algorithm that achieves the minimax optimality for multi-armed bandit problems.

- Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits

Tianyuan Jin, Jing Tang,**Pan Xu**, Keke Huang, Xiaokui Xiao, Quanquan Gu

In Proc. of the 38th International Conference on Machine Learning (**ICML**), Online, 2021.

[Summary] [Paper]In batched multi-armed bandit problems, the learner can adaptively pull arms and adjust strategy in batches. In many real applications, not only the regret but also the batch complexity need to be optimized. Existing batched bandit algorithms usually assume that the time horizon $T$ is known in advance. However, many applications involve an unpredictable stopping time. In this paper, we study the anytime batched multi-armed bandit problem. We propose an anytime algorithm that achieves the asymptotically optimal regret for exponential families of reward distributions with $\mathcal{O}(\log \log T\cdot \ilog^{\alpha} (T))$ batches, where $\alpha\in \mathcal{O}_{T}(1)$. Moreover, we prove that for any constant $c>0$, no algorithm can achieve the asymptotically optimal regret within $c \log \log T$ batches.

- A Finite-Time Analysis of Two Time-Scale Actor-Critic Methods

Yue Wu, Weitong Zhang,**Pan Xu**, Quanquan Gu

In Proc. of the 33rd Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Online, 2020.

This work has been presented at the ICML 2020 Theoretical Foundations of Reinforcement Learning Workshop.

[Summary] [Paper]We provide a non-asymptotic analysis for two time-scale actor-critic methods under non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\theta)\|_2^2 \le \epsilon$) of the non-concave performance function $J({\theta})$, with $\mathcal{\widetilde{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.

- A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation

**Pan Xu**, Quanquan Gu

In Proc. of the 37th International Conference on Machine Learning (**ICML**), Online, 2020.

[Summary] [Paper]We present a finite-time analysis of a neural Q-learning algorithm, where the data are generated from a Markov decision process, and the action-value function is approximated by a deep ReLU neural network. We prove that neural Q-learning finds the optimal policy with an $O(1/\sqrt{T})$ convergence rate if the neural function approximator is sufficiently overparameterized, where $T$ is the number of iterations. To our best knowledge, our result is the first finite-time analysis of neural Q-learning under non-i.i.d. data assumption.

- Stochastic Nested Variance Reduction for Nonconvex Optimization

Dongruo Zhou,**Pan Xu**, Quanquan Gu

Journal of Machine Learning Research (**JMLR**), Volume 21, No. 103, 2020.

The short version of this paper has been published in NeurIPS 2018. The journal version extends the original SNVRG algorithm for finding first order stationary points to local minimum finding algorithms by incorporating this manuscript: Finding Local Minima via Stochastic Nested Variance Reduction.

[Summary] [Paper]In this journal version of SNVRG, we further propose two algorithms that can find local minima faster than state-of-the-art algorithms in both finite-sum and general stochastic (online) nonconvex optimization. In particular, for finite-sum optimization problems, the proposed $\text{SNVRG}+\text{Neon2}^{\text{finite}}$ algorithm achieves $\tilde{O}(n^{1/2}\epsilon^{-2}+n\epsilon_H^{-3}+n^{3/4}\epsilon_H^{-7/2})$ gradient complexity to converge to an $(\epsilon, \epsilon_H)$-second-order stationary point, which outperforms $\text{SVRG}+\text{Neon2}^{\text{finite}}$ \citep{allen2018neon2}, the best existing algorithm, in a wide regime. For general stochastic optimization problems, the proposed $\text{SNVRG}+\text{Neon2}^{\text{online}}$ achieves $\tilde{O}(\epsilon^{-3}+\epsilon_H^{-5}+\epsilon^{-2}\epsilon_H^{-3})$ gradient complexity, which is better than both $\text{SVRG}+\text{Neon2}^{\text{online}}$ \citep{allen2018neon2} and Natasha2 \citep{allen2017natasha2} in certain regimes. Thorough experimental results on different nonconvex optimization problems back up our theory.

- Sample Efficient Policy Gradient Methods with Recursive Variance Reduction

**Pan Xu**, Felicia Gao, Quanquan Gu

In Proc. of the 8th International Conference on Learning Representations (**ICLR**), Addis Ababa, Ethiopia, 2020.

Partial results of this work have been presented at the NeurIPS 2019 Optimization Foundations of Reinforcement Learning Workshop, Vancouver, Canada, 2019 and the 2020 Virtual Conference on Reinforcement Learning for Real Life.

[Summary] [Paper] [Poster] [Code] [Video]We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\boldsymbol{\theta})$ (i.e., $\boldsymbol{\theta}$ such that $\|\nabla J(\boldsymbol{\theta})\|_2^2\leq\epsilon$). This sample complexity improves the existing result $O(1/\epsilon^{5/3})$ for stochastic variance reduced policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we also propose a variant of SRVR-PG with parameter exploration, which explores the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.

- Rank Aggregation via Heterogeneous Thurstone Preference Models

Tao Jin*,**Pan Xu***, Quanquan Gu, Farzad Farnoud

In Proc. of the 34th Conference on Artificial Intelligence (**AAAI**), New York, New York, USA, 2020. [Oral presentation, 4.5%]

[Summary] [Paper] [Code]We propose the Heterogeneous Thurstone Model (HTM) for aggregating ranked data, which can take the accuracy levels of different users into account. By allowing different noise distributions, the proposed HTM model maintains the generality of Thurstone's original framework, and as such, also extends the Bradley-Terry-Luce (BTL) model for pairwise comparisons to heterogeneous populations of users. Under this framework, we also propose a rank aggregation algorithm based on alternating gradient descent to estimate the underlying item scores and accuracy levels of different users simultaneously from noisy pairwise comparisons. We theoretically prove that the proposed algorithm converges linearly up to a statistical error which matches that of the state-of-the-art method for the single-user BTL model. We evaluate the proposed HTM model and algorithm on both synthetic and real data, demonstrating that it outperforms existing methods.

- Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction

Difan Zou,**Pan Xu**, Quanquan Gu

In Proc. of the 32nd Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Vancouver, Canada, 2019.

[Summary] [Paper] [Poster]Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithms have received increasing attention in both theory and practice. In this paper, we propose a Stochastic Recursive Variance-Reduced gradient HMC (SRVR-HMC) algorithm. It makes use of a semi-stochastic gradient estimator that recursively accumulates the gradient information to reduce the variance of the stochastic gradient. We provide a convergence analysis of SRVR-HMC for sampling from a class of non-log-concave distributions and show that SRVR-HMC converges faster than all existing HMC-type algorithms based on underdamped Langevin dynamics. Thorough experiments on synthetic and real-world datasets validate our theory and demonstrate the superiority of SRVR-HMC.

- Stochastic Variance-Reduced Cubic Regularization Methods

Dongruo Zhou,**Pan Xu**, Quanquan Gu

Journal of Machine Learning Research (**JMLR**), Volume 20, No. 134, 2019.

The short version of this paper has been published in ICML 2018. The journal version extends the SVRC algorithm to a sample-efficient one proposed in this manuscript: Sample Efficient Stochastic Variance-Reduced Cubic Regularization Method.

[Summary] [Paper]To reduce the sample complexity of Hessian matrix computation in SVRC, we also propose a sample efficient stochastic variance-reduced cubic regularization (Lite-SVRC) algorithm for finding the local minimum more efficiently. Lite-SVRC converges to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\tilde{O}(n+n^{2/3}/\epsilon^{3/2})$ Hessian sample complexity, which is faster than all existing cubic regularization based methods. Numerical experiments with different nonconvex optimization problems conducted on real datasets validate our theoretical results for both SVRC and Lite-SVRC.

- An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient

**Pan Xu**, Felicia Gao, Quanquan Gu

In Proc. of the 35th International Conference on Uncertainty in Artificial Intelligence (**UAI**), Tel Aviv, Israel, 2019. [Oral presentation, 7.8%]

[Summary] [Paper]We revisit the stochastic variance-reduced policy gradient (SVRPG) method proposed by \citet{papini2018stochastic} for reinforcement learning. We provide an improved convergence analysis of SVRPG and show that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories. This sample complexity improves upon the best known result $O(1/\epsilon^2)$ by a factor of $O(1/\epsilon^{1/3})$. At the core of our analysis is (i) a tighter upper bound for the variance of importance sampling weights, where we prove that the variance can be controlled by the parameter distance between different policies; and (ii) a fine-grained analysis of the epoch length and batch size parameters such that we can significantly reduce the number of trajectories required in each iteration of SVRPG. We also empirically demonstrate the effectiveness of our theoretical claims of batch sizes on reinforcement learning benchmark tasks.

- Sampling from Non-Log-Concave Distributions via Variance-Reduced Gradient Langevin Dynamics

Difan Zou,**Pan Xu**, Quanquan Gu

In Proc. of the 22nd International Conference on Artificial Intelligence and Statistics (**AISTATS**), Naha, Okinawa, Japan, 2019.

[Summary] [Paper]We study stochastic variance reduction-based Langevin dynamic algorithms, SVRG-LD and SAGA-LD \citep{dubey2016variance}, for sampling from non-log-concave distributions. Under certain assumptions on the log density function, we establish the convergence guarantees of SVRG-LD and SAGA-LD in $2$-Wasserstein distance. More specifically, we show that both SVRG-LD and SAGA-LD require $ \tilde O\big(n+n^{3/4}/\epsilon^2 + n^{1/2}/\epsilon^4\big)\cdot \exp\big(\tilde O(d+\gamma)\big)$ stochastic gradient evaluations to achieve $\epsilon$-accuracy in $2$-Wasserstein distance, which outperforms the $ \tilde O\big(n/\epsilon^4\big)\cdot \exp\big(\tilde O(d+\gamma)\big)$ gradient complexity achieved by Langevin Monte Carlo Method \citep{raginsky2017non}. Experiments on synthetic data and real data back up our theory.

- Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

**Pan Xu***, Jinghui Chen*, Difan Zou, Quanquan Gu

In Proc. of the 31st Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Montréal, Canada, 2018. [Spotlight presentation, 3.5%]

[Summary] [Paper] [Video]We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the $\textit{almost minimizer}$ within $\widetilde O\big(nd/(\lambda\epsilon) \big)$ and $\widetilde O\big(d^7/(\lambda^5\epsilon^5) \big)$ stochastic gradient evaluations respectively, where $d$ is the problem dimension, and $\lambda$ is the spectral gap of the Markov chain generated by GLD. Both results improve upon the best known gradient complexity results \citep{raginsky2017non}. Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (SVRG-LD) to the almost minimizer within $\widetilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. Our theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.

- Stochastic Nested Variance Reduction for Nonconvex Optimization

Dongrou Zhou,**Pan Xu**, Quanquan Gu

In Proc. of the 31st Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Montréal, Canada, 2018. [Spotlight presentation, 3.5%]

[Summary] [Paper] [Video]We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with conventional stochastic variance reduced gradient (SVRG) algorithm that uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration. For smooth nonconvex functions, the proposed algorithm converges to an $\epsilon$-approximate first-order stationary point (i.e., $\|\nabla F(\mathbf{x})\|_2\leq \epsilon$) within $\widetilde O(n\land \epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$ number of stochastic gradient evaluations. This improves the best known gradient complexity of SVRG $O(n+n^{2/3}\epsilon^{-2})$ and that of SCSG $O(n\land \epsilon^{-2}+\epsilon^{-10/3}\land n^{2/3}\epsilon^{-2})$. For gradient dominated functions, our algorithm also achieves better gradient complexity than the state-of-the-art algorithms. Thorough experimental results on different nonconvex optimization problems back up our theory.

- Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Yaodong Yu*,**Pan Xu***, Quanquan Gu

In Proc. of the 31st Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Montréal, Canada, 2018.

[Summary] [Paper] [Video]We propose stochastic optimization algorithms that can find local minima faster than existing algorithms for nonconvex optimization problems, by exploiting the third-order smoothness to escape non-degenerate saddle points more efficiently. More specifically, the proposed algorithm only needs $\tilde{O}(\epsilon^{-10/3})$ stochastic gradient evaluations to converge to an approximate local minimum $\mathbf{x}$, which satisfies $\|\nabla f(\mathbf{x})\|_2\leq\epsilon$ and $\lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq -\sqrt{\epsilon}$ in unconstrained stochastic optimization, where $\tilde{O}(\cdot)$ hides logarithm polynomial terms and constants. This improves upon the $\tilde{O}(\epsilon^{-7/2})$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\tilde{O}(\epsilon^{-1/6})$. Experiments on two nonconvex optimization problems demonstrate the effectiveness of our algorithm and corroborate our theory.

- Subsampled Stochastic Variance-Reduced Gradient Langevin Dynamics

Difan Zou*,**Pan Xu***, Quanquan Gu

In Proc. of the 34th International Conference on Uncertainty in Artificial Intelligence (**UAI**), Monterey, California, USA, 2018.

[Summary] [Paper]Stochastic variance-reduced gradient Langevin dynamics (SVRG-LD) was recently proposed to improve the performance of stochastic gradient Langevin dynamics (SGLD) by reducing the variance of the stochastic gradient. In this paper, we propose a variant of SVRG-LD, namely SVRG-LD$^+$, which replaces the full gradient in each epoch with a subsampled one. We provide a nonasymptotic analysis of the convergence of SVRG-LD$^+$ in $2$-Wasserstein distance, and show that SVRG-LD$^+$ enjoys a lower gradient complexity than SVRG-LD, when the sample size is large or the target accuracy requirement is moderate. Our analysis directly implies a sharper convergence rate for SVRG-LD, which improves the existing convergence rate by a factor of $\kappa^{1/6}n^{1/6}$, where $\kappa$ is the condition number of the log-density function and $n$ is the sample size. Experiments on both synthetic and real-world datasets validate our theoretical results.

- Continuous and Discrete-Time Accelerated Stochastic Mirror Descent for Strongly Convex Functions

**Pan Xu***, Tianhao Wang*, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018.

[Summary] [Paper]We provide a second-order stochastic differential equation (SDE), which characterizes the continuous-time dynamics of accelerated stochastic mirror descent (ASMD) for strongly convex functions. This SDE plays a central role in designing new discrete-time ASMD algorithms via numerical discretization and providing neat analyses of their convergence rates based on Lyapunov functions. Our results suggest that the only existing ASMD algorithm, namely, AC-SA proposed in \citet{ghadimi2012optimal} is one instance of its kind, and we can derive new instances of ASMD with fewer tuning parameters.This sheds light on revisiting accelerated stochastic optimization through the lens of SDEs, which can lead to a better understanding as well as new simpler algorithms of acceleration in stochastic optimization. Numerical experiments on both synthetic and real data support our theory.

- Stochastic Variance-Reduced Cubic Regularized Newton Method

Dongruo Zhou,**Pan Xu**, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018.

[Summary] [Paper]We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for non-convex optimization. At the core of our algorithm is a novel semi-stochastic gradient along with a semi-stochastic Hessian, which are specifically designed for cubic regularization method. We show that our algorithm is guaranteed to converge to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\widetilde{O}(n^{4/5}/\epsilon^{3/2})$ second-order oracle calls, which outperforms the state-of-the-art cubic regularization algorithms including subsampled cubic regularization. Our work also sheds light on the application of variance reduction technique to high-order non-convex optimization methods. Thorough experiments on various non-convex optimization problems support our theory.

- Stochastic Variance-Reduced Hamilton Monte Carlo Methods

Difan Zou*,**Pan Xu***, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018.

[Summary] [Paper]We propose a fast stochastic Hamilton Monte Carlo (HMC) method, for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by the recent advance in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm achieves $\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.

- Covariate Adjusted Precision Matrix Estimation via Nonconvex Optimization

Jinghui Chen,**Pan Xu**, Lingxiao Wang, Jian Ma, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018. [Long talk, 8.6%]

[Summary] [Paper]We propose a nonconvex estimator for the covariate adjusted precision matrix estimation problem in the high dimensional regime, under sparsity constraints. To solve this estimator, we propose an alternating gradient descent algorithm with hard thresholding. Compared with existing methods along this line of research, which lack theoretical guarantees in optimization error and/or statistical error, the proposed algorithm not only is computationally much more efficient with a linear rate of convergence, but also attains the optimal statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real data support our theory.

- Accelerated Stochastic Mirror Descent: From Continuous-time Dynamics to Discrete-time Algorithms

**Pan Xu***, Tianhao Wang*, Quanquan Gu

In Proc. of the 21st International Conference on Artificial Intelligence and Statistics (**AISTATS**), Playa Blanca, Lanzarote, Canary Islands, 2018.

[Summary] [Paper]We present a new framework to analyze accelerated stochastic mirror descent through the lens of continuous-time stochastic dynamic systems. It enables us to design new algorithms, and perform a unified and simple analysis of the convergence rates of these algorithms. More specifically, under this framework, we provide a Lyapunov function based analysis for the continuous-time stochastic dynamics, as well as several new discrete-time algorithms derived from the continuous-time dynamics. We show that for general convex objective functions, the derived discrete-time algorithms attain the optimal convergence rate. Empirical experiments corroborate our theory.

- Speeding Up Latent Variable Gaussian Graphical Model Estimation via Nonconvex Optimization

**Pan Xu**, Jian Ma, Quanquan Gu

In Proc. of the 30th Conference on Advances in Neural Information Processing Systems (**NIPS**), Long Beach, California, USA, 2017.

[Summary] [Paper] [Poster] [Code]We study the estimation of the latent variable Gaussian graphical model (LVGGM), where the precision matrix is the superposition of a sparse matrix and a low-rank matrix. In order to speed up the estimation of the sparse plus low-rank components, we propose a sparsity constrained maximum likelihood estimator based on matrix factorization, and an efficient alternating gradient descent algorithm with hard thresholding to solve it. Our algorithm is orders of magnitude faster than the convex relaxation based methods for LVGGM. In addition, we prove that our algorithm is guaranteed to linearly converge to the unknown sparse and low-rank components up to the optimal statistical precision. Experiments on both synthetic and genomic data demonstrate the superiority of our algorithm over the state-of-the-art algorithms and corroborate our theory.

- Uncertainty Assessment and False Discovery Rate Control in High-Dimensional Granger Causal Inference

Aditya Chaudhry,**Pan Xu**, Quanquan Gu

In Proc. of the 34th International Conference on Machine Learning (**ICML**), Sydney, Australia, 2017.

[Summary] [Paper]Causal inference among high-dimensional time series data proves an important research problem in many fields. In the classical regime, one often establishes causality among time series via a concept known as “Granger causality.” However, existing approaches for Granger causal inference in high-dimensional data lack the means to characterize the uncertainty associated with Granger causality estimates (e.g. p-values and confidence intervals). We make two contributions in this work. First, we introduce a novel unbiased estimator for assessing Granger causality in the high-dimensional regime. We propose test statistics and confidence intervals for our estimator to allow, for the first time, uncertainty characterization in high-dimensional Granger causal inference. Second, we introduce a novel method for false discovery rate control that achieves higher power in multiple testing than existing techniques and that can cope with dependent test statistics and dependent observations. We corroborate our theoretical results with experiments on both synthetic data and real-world climatological data.

- Efficient Algorithm for Sparse Tensor-variate Gaussian Graphical Models via Gradient Descent

**Pan Xu**, Tingting Zhang, Quanquan Gu

In Proc. of the 20th International Conference on Artificial Intelligence and Statistics (**AISTATS**), Fort Lauderdale, Florida, USA, 2017.

[Summary] [Paper]We study the sparse tensor-variate Gaussian graphical model (STGGM), where each way of the tensor follows a multivariate normal distribution whose precision matrix has sparse structures. To estimate the precision matrices, we propose a sparsity constrained maximum likelihood estimator. However, due to the complex structure of the tensor-variate GGMs, the likelihood based estimator is non-convex, which poses great challenges for both computation and theoretical analysis. In order to address these challenges, we propose an efficient alternating gradient descent algorithm to solve this estimator and prove that, under certain conditions on the initial estimator, our algorithm is guaranteed to linearly converge to the unknown precision matrices up to the optimal statistical error. Experiments on both synthetic data and real world brain imaging data corroborate our theory.

- Semiparametric Differential Graph Models

**Pan Xu**, Quanquan Gu

In Proc. of the 29th Conference on Advances in Neural Information Processing Systems (**NIPS**), Barcelona, Spain, 2016.

[Summary] [Paper] [Video]In many cases of network analysis, it is more attractive to study how a network varies under different conditions than an individual static network. We propose a novel graphical model, namely the Latent Differential Graph Model, where the networks under two different conditions are represented by two semiparametric elliptical distributions respectively, and the variation of these two networks (i.e., differential graph) is characterized by the difference between their latent precision matrices. We propose an estimator for the differential graph based on quasi-likelihood maximization with nonconvex regularization. We show that our estimator attains a faster statistical rate in parameter estimation than the state-of-the-art methods, and enjoys the oracle property under mild conditions. Thorough experiments on both synthetic and real-world data support our theory.

- Forward Backward Greedy Algorithms for Multi-Task Learning with Faster Rates

Lu Tian,**Pan Xu**, Quanquan Gu

In Proc. of the 32nd International Conference on Uncertainty in Artificial Intelligence (**UAI**), New York / New Jersey, USA, 2016.

[Summary] [Paper]In this paper, we develop a multi-task learning algorithm with faster convergence rates. In particular, we propose a general estimator for multitask learning with row sparsity constraints on the parameter matrix. The proposed estimator is a nonconvex optimization problem. To solve it, we develop a forward backward greedy algorithm with provable guarantees. More specifically, we prove that the output of the greedy algorithm attains a sharper estimation error bound than state-of-the-art multi-task learning methods. Moreover, our estimator enjoys model selection consistency under mild conditions. Thorough experiments on both synthetic and real-world data demonstrate the effectiveness of our method and back up our theory.

- COVID-19 Reopening Strategies at the County Level in the Face of Uncertainty: Multiple Models for Outbreak Decision Support [Paper]

Katriona Shea, ...,**Pan Xu**, ..., Michael C. Runge, medRxiv: 2020.11.03.20225409.

- Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. [Paper]

COVID-19 Forecast Hub Consortium,**Pan Xu**, medRxiv: 2020.08.19.20177493.

- Epidemic Model Guided Machine Learning for COVID-19 Forecasts in the United States [Paper]

Difan Zou, Lingxiao Wang,**Pan Xu**, Jinghui Chen, Weitong Zhang, Quanquan Gu, medRxiv: 2020.05.24.2011198.

This work has been presented at the ICLR 2021 Machine Learning for Preventing and Combating Pandemics Workshop - Communication-efficient Distributed Estimation and Inference for Transelliptical Graphical Models [Paper]

**Pan Xu**, Lu Tian, Quanquan Gu, arXiv:1612.09297.

This work has been presented at the ENAR 2016 Spring Meeting.