
## On Stochastic Accelerated Gradient with Convergence Rate of Regression Learning

Cheng Yiyuan1, Zha Xingxing1, Zhang Yongquan2

1 School of Mathematics and Statistics, Chaohu University, Hefei 238024

2 School of Data Sciences, Zhejiang University of Finance & Economics, Hangzhou 310018


Fund supported: the NSFC (61573324); the Natural Science Research Project in Anhui Province (KJ2018A0455); the Program in the Youth Elite Support Plan in Universities of Anhui Province (gxyq2019082); the Fund of Chaohu University (XLY-201903)

Abstract

This paper studies the regression learning problem from given sample data by using a stochastic approximation (SA) type algorithm, namely the accelerated SA. We focus on problems without strong convexity, for which all well-known algorithms achieve a convergence rate for function values of $O(1/n)$. We consider and analyze an accelerated SA algorithm that achieves a rate of $O(1/n)$ for the classical least-squares regression and logistic regression problems, respectively. Compared with the well-known results, we need fewer conditions to obtain the tight convergence rate for the least-squares regression and logistic regression problems.

Keywords: Least-squares regression; Logistic regression; Convergence rate

Cheng Yiyuan, Zha Xingxing, Zhang Yongquan. On Stochastic Accelerated Gradient with Convergence Rate of Regression Learning. Acta Mathematica Scientia[J], 2021, 41(5): 1574-1584

## 2 Stochastic Accelerated Gradient Algorithm for Least-Squares Regression

$f(\theta) = \frac{1}{2}{\Bbb E}\left[\ell\left(y_i, \langle\theta, x_i\rangle\right)\right]$, where $(x_i, y_i)\in {\cal X} \times {\cal Y}$ denotes a data sample, $\ell$ is the loss function, and samples are drawn according to an unknown probability measure $\rho$. $({\cal X}, d)$ is a compact metric space and ${\cal Y} = {\Bbb R}$. Here $\rho$ is an unknown probability distribution on $Z = {\cal X}\times{\cal Y}$, and $(X, Y)$ are the corresponding random variables. Before stating the results, we make the following assumptions:

(1) The training data $(x_{k}, y_{k})$, $k\geq1$, are independently and identically distributed (i.i.d.) according to $\rho$;

(2) ${\Bbb E}\|x_k\|^2$ is bounded, i.e., ${\Bbb E}\|x_k\|^2\leq M_{1}$ for all $k\geq1$;

(3) The global minimum of $f(\theta) = \frac{1}{2}{\Bbb E}[\langle\theta, x_{k}\rangle^2-2y_{k}\langle\theta, x_{k}\rangle]$ is attained at some point $\theta^{*}\in {\Bbb R}^d$;

(4) In what follows, $\xi_{k} = \left(y_{k}-\langle\theta_{k}, x_{k}\rangle\right)x_{k}$ denotes the residual. Assume ${\Bbb E}\|\xi_{i}\|^2\leq M^{2}$ for each $i$, and define $\overline{\xi}_{k} = \frac{1}{k}\sum\limits_{i = 1}^{k}\xi_{i}$.
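The recursion $\overline{\xi}_k = \overline{\xi}_{k-1} + \frac{1}{k}(\xi_k - \overline{\xi}_{k-1})$ used in the algorithm below is simply an online form of the batch mean $\frac{1}{k}\sum_{i=1}^{k}\xi_i$; a quick NumPy check (the random residual vectors here are illustrative):

```python
import numpy as np

# The running-average recursion from assumption (4):
#   xi_bar_k = xi_bar_{k-1} + (xi_k - xi_bar_{k-1}) / k
# reproduces the batch mean (1/k) * sum_i xi_i.
rng = np.random.default_rng(0)
xis = rng.standard_normal((10, 4))   # ten residual vectors in R^4

xi_bar = np.zeros(4)
for k, xi in enumerate(xis, start=1):
    xi_bar += (xi - xi_bar) / k

assert np.allclose(xi_bar, xis.mean(axis=0))  # → True
```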

Require  $\theta_0$ and $\alpha_1 = 1, \alpha_k >0$ for $k = 2, \cdots$, $\{\beta_k > 0 \}$ and $\{\lambda_k > 0\}$:

1: Set $\theta_{0}^{ag}\leftarrow\theta_{0}$, $\overline{\xi}_0 \leftarrow 0$ and $k\leftarrow 1$;

2: Set $\theta_{k}^{md}\leftarrow(1-\alpha_{k})\theta_{k-1}^{ag}+\alpha_{k}\theta_{k-1}$;

3: Set $z_{k}\leftarrow (\nabla f(\theta_{k}^{md})+\eta_{k}\theta_{k}^{md})/\alpha_{k} = \frac{\left(\langle \theta_{k}^{md}, x_{k}\rangle x_{k}-y_{k}x_{k}\right)+\eta_{k}\theta_{k}^{md}}{\alpha_{k}}$;

4: Set $\theta_{k}\leftarrow\theta_{k-1}-\lambda_{k}z_{k};$

5: Compute $\xi_{k}\leftarrow\left(y_{k}-\langle\theta_k, x_{k}\rangle\right)x_{k}$ and $\overline{\xi}_k \leftarrow \overline{\xi}_{k-1} + \frac{1}{k}(\xi_k - \overline{\xi}_{k-1})$;

6: Set $\theta_{k}^{ag}\leftarrow\theta_{k}^{md}-\beta_{k}\left(z_{k}+{\overline{\xi}_{k}}\right);$

7: Set $k \leftarrow k + 1$, and goto step 2.
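The iteration above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's reference implementation: the function name is ours, and the step-size schedules ($\alpha_k$, $\beta_k$, $\lambda_k$, $\eta_k$) are left as user-supplied callables, since the paper's theory dictates their admissible choices.

```python
import numpy as np

def accelerated_sa_least_squares(samples, theta0, alpha, beta, lam, eta):
    """Accelerated SA for least-squares regression (sketch of the
    algorithm above).

    samples : iterable of (x_k, y_k) pairs, x_k a 1-D array, y_k a float
    alpha, beta, lam, eta : schedules, callables k -> float
    """
    theta = theta0.copy()            # theta_{k-1}
    theta_ag = theta0.copy()         # theta_{k-1}^{ag}
    xi_bar = np.zeros_like(theta0)   # running average of residuals
    for k, (x, y) in enumerate(samples, start=1):
        a = 1.0 if k == 1 else alpha(k)          # alpha_1 = 1
        theta_md = (1 - a) * theta_ag + a * theta
        # stochastic gradient of the regularized loss at theta_md (step 3)
        grad = (theta_md @ x - y) * x + eta(k) * theta_md
        z = grad / a
        theta = theta - lam(k) * z               # step 4
        xi = (y - theta @ x) * x                 # residual (step 5)
        xi_bar += (xi - xi_bar) / k              # online average
        theta_ag = theta_md - beta(k) * (z + xi_bar)  # step 6
    return theta_ag
```

For instance, with $\alpha_k = 2/(k+1)$ and $\lambda_k = \beta_k = 0.1/(k+1)$ (illustrative choices), the effective gradient step $\lambda_k/\alpha_k$ stays constant at $0.05$, and on synthetic noiseless data the iterate approaches the minimizer $\theta^*$.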

## 3 Stochastic Accelerated Gradient Algorithm for Logistic Regression

(B1) ${\Bbb E}\|x_{i}\|^2$ is bounded, i.e., ${\Bbb E}\|x_i\|^2\leq M_{1}$ for all $i\geq1$;

(B2) ${\Bbb E}\|\xi_{i}\|^2\leq M_{2}$ for each $i$.

Require  $\theta_0$ and $\alpha_1 = 1, \alpha_k >0$ for $k = 2, \cdots$, $\{\beta_k > 0 \}$ and $\{\lambda_k > 0\}$:

1: Set $\theta_{0}^{ag}\leftarrow\theta_{0}$, $\overline{\xi}_0 \leftarrow 0$ and $k\leftarrow 1$;

2: Set $\theta_{k}^{md}\leftarrow (1-\alpha_{k})\theta_{k-1}^{ag}+\alpha_{k}\theta_{k-1}$;

3: Compute $z_{k}\leftarrow \frac{1}{\alpha_{k}}(\nabla l(\theta_{k}^{md})+\eta_{k}\theta_{k}^{md})$;

4: Set $\theta_{k}\leftarrow \theta_{k-1}-\lambda_{k}z_{k}$;

5: Compute $\xi_{k}\leftarrow \left(y_{k}-\langle\theta_k, x_{k}\rangle\right)x_{k}$ and $\overline{\xi}_k \leftarrow\overline{\xi}_{k-1} + \frac{1}{k}(\xi_k - \overline{\xi}_{k-1})$;

6: Set $\theta_{k}^{ag}\leftarrow\theta_{k}^{md}-\beta_{k}\left(z_{k}+\overline{\xi}_{k}\right)$;

7: Set $k \leftarrow k + 1$, and goto step 2.
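The logistic variant differs from the least-squares case only in the gradient used in step 3. A minimal NumPy sketch, assuming the logistic loss $l(\theta) = \log(1+\exp(-y\langle\theta, x\rangle))$ with labels $y\in\{-1,+1\}$; the function name and the step-size schedules in the usage below are illustrative choices, not the paper's prescribed ones, and the residual follows step 5 exactly as written:

```python
import numpy as np

def accelerated_sa_logistic(samples, theta0, alpha, beta, lam, eta):
    """Accelerated SA for logistic regression (sketch of the algorithm
    above); alpha, beta, lam, eta are schedules, callables k -> float."""
    theta = theta0.copy()
    theta_ag = theta0.copy()
    xi_bar = np.zeros_like(theta0)
    for k, (x, y) in enumerate(samples, start=1):
        a = 1.0 if k == 1 else alpha(k)
        theta_md = (1 - a) * theta_ag + a * theta
        # gradient of the logistic loss at theta_md, plus the eta_k ridge term
        grad = -y * x / (1.0 + np.exp(y * (theta_md @ x))) + eta(k) * theta_md
        z = grad / a
        theta = theta - lam(k) * z
        xi = (y - theta @ x) * x              # residual, as in step 5
        xi_bar += (xi - xi_bar) / k
        theta_ag = theta_md - beta(k) * (z + xi_bar)
    return theta_ag
```

On synthetic linearly separable data with the same illustrative schedules as before ($\alpha_k = 2/(k+1)$, $\lambda_k = \beta_k = 0.1/(k+1)$), the returned iterate classifies the training sample with high accuracy.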

## 4 Summary and Comparison

The work of Bach and Moulines [9] is perhaps the most closely related to ours. They studied stochastic approximation for convex minimization problems, which requires unbiased estimates of the gradient at certain points and covers machine learning methods based on empirical risk minimization. The sample setting considered by Bach and Moulines is similar to ours: the learner is given a sample set $\{(x_{i}, y_{i})\}_{i = 1}^{n}$, and the goal of the regression learning problem is to learn the parameter $\theta$ of the linear function $\langle\theta, x\rangle$. Without a strong convexity assumption, both this paper and Bach and Moulines obtain an $O(1/n)$ convergence rate for stochastic approximation algorithms for least-squares regression.

## References

[1] Robbins H, Monro S. A stochastic approximation method. Annals of Mathematical Statistics, 1951, 22(3): 400-407

[2] Polyak B T, Juditsky A B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992, 30(4): 838-855

[3] Bottou L, Bousquet O. The tradeoffs of large scale learning//Platt J C, Koller D, Singer Y, et al. Adv Neural Inform Process Syst 20. Cambridge: MIT Press, 2008

[4] Shalev-Shwartz S, Singer Y, Srebro N, Cotter A. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 2011, 127(1): 3-30

[5] Nemirovski A, Juditsky A, et al. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574-1609

[6] Lan G, Monteiro R D C. Iteration-complexity of first-order penalty methods for convex programming. Mathematical Programming, 2013, 138: 115-139

[7] Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 2016, 156(1/2): 59-99

[8] Bach F, Moulines E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning//Platt J C, Koller D, Singer Y, et al. Adv Neural Inform Process Syst. Cambridge: MIT Press, 2011: 451-459

[9] Bach F, Moulines E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)//Platt J C, Koller D, Singer Y, et al. Adv Neural Inform Process Syst. Cambridge: MIT Press, 2013

[10] Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159

[11] Gasnikov A V, Nesterov Y E, Spokoiny V G. On the efficiency of a randomized mirror descent algorithm in online optimization problems. Computational Mathematics and Mathematical Physics, 2015, 55: 580-596

[12] Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009, 2: 183-202

[13] Tseng P, Yun S. Incrementally updated gradient methods for constrained and regularized optimization. Journal of Optimization Theory and Applications, 2014, 160(3): 832-853

[14] Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming, 2005, 103: 127-152

[15] Nesterov Y. Subgradient methods for huge-scale optimization problems. Mathematical Programming, 2014, 146(1/2): 275-297

[16] Lan G. An optimal method for stochastic composite optimization. Mathematical Programming, 2012, 133(1): 365-397

[17] Ghadimi S, Lan G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM Journal on Optimization, 2012, 22: 1469-1492

[18] Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 2016, 156(1/2): 59-99

[19] Ghadimi S, Lan G, Zhang H C. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 2016, 155(1/2): 267-305

[20] Lan G. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Mathematical Programming, 2013, 149: 1-45
