## Research on Change-Point Detection for Parameters in Regression Model

Zheng Jinhui¹, Yu Jinghu¹, Ding Yiming¹, Bao Zeyu²

Received: 2020-11-2

Abstract

This paper proposes a method for detecting change points in the parameters of a regression model, based on the non-stationarity measure (NS). Given a suitable parameter estimation method and window size, the method computes the residual sequence of the samples within each window and the corresponding NS value, and detects a change point by judging whether that residual sequence is stationary. A series of two-segment regression models is constructed for experimental verification. The results show that the method effectively locates the change point of a two-segment regression model. Comparisons with other methods also show that it detects change points in regression model parameters more accurately.

Keywords: Regression model; Change point detection; Non-stationary measure

Zheng Jinhui, Yu Jinghu, Ding Yiming, Bao Zeyu. Research on Change-Point Detection for Parameters in Regression Model[J]. Acta Mathematica Scientia, 2021, 41(4): 1124-1134

## 1 Introduction

### 2.2 Mathematical Description of the Change-Point Detection Problem for the Two-Segment Model

$$Y_{i} = f_{\theta_{1}}(x_{i})\cdot {\bf 1}_{0<i\le t_{0}}+f_{\theta_{2}}(x_{i})\cdot {\bf 1}_{t_{0}<i\le T},$$
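To make the indicator notation concrete, the following minimal Python sketch generates data from a two-segment model. The function name `simulate_two_segment`, the optional Gaussian noise argument `noise_sd`, and the two example mean functions are illustrative assumptions, not part of the original formulation.

```python
import random


def simulate_two_segment(xs, f1, f2, t0, noise_sd=0.0, seed=0):
    """Return Y_i = f1(x_i) for 0 < i <= t0 and Y_i = f2(x_i) for t0 < i <= T.

    Indices are 1-based, as in the model equation. The additive Gaussian
    noise (standard deviation noise_sd) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    ys = []
    for i, x in enumerate(xs, start=1):
        mean = f1(x) if i <= t0 else f2(x)
        ys.append(mean + rng.gauss(0.0, noise_sd))
    return ys


# Hypothetical example: T = 10 points, regime switch after t0 = 4.
xs = [0.01 * k for k in range(1, 11)]
ys = simulate_two_segment(xs, lambda x: 1 + 2 * x, lambda x: 3 - x, t0=4)
```

With `noise_sd=0.0` the output is the noiseless regime mean, so the jump between `ys[3]` (last point of the first regime) and `ys[4]` (first point of the second) is directly visible.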

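(2) For these $l$ samples, estimate the parameters with a suitable estimation method to obtain the fitted values $\hat{y}_{k}$ and the corresponding residual sequence $\hat{\varepsilon}_{k} = y_{k}-\hat{y}_{k}, k = 1, \cdots, l$;

(3) Compute the non-stationarity measure $NS(l, l_{1})$ of the residual sequence $\{\hat{\varepsilon}_{k}, k = 1, \cdots, l\}$. Given a threshold $\alpha_{0}$, if $NS(l, l_{1}) \ge \alpha_{0}$, the $l$ samples are judged to contain a change point;

(4) Let $l_{1}$ range from 1 to $l-1$ and count the number of times that $NS(l, l_{1}) \ge \alpha_{0}$, that is, the number of times the existence of the change point is judged correctly. Given a second threshold $\alpha_{l}$, if

$$\frac{\sharp\{0<l_{1}<l: NS(l, l_{1}) \ge \alpha_{0}\}}{l-1} \ge \alpha_{l},$$

then $l$ is taken as the optimal window length for this round;

(5) Repeat the experiment for $M$ $(M \ge 2)$ rounds and denote the optimal window length obtained in round $j$ by $l_{j}$ $(j \le M)$. Let $l = \frac{1}{M}\sum\limits_{j = 1}^{M}l_{j}$, and call it the optimal window length of the corresponding two-segment model.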
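Steps (3) to (5) reduce to counting and averaging, which can be sketched as follows. Because the NS measure itself is not defined in this section, the sketch takes precomputed values $NS(l, l_{1})$ as input; all function names are hypothetical.

```python
def detection_ratio(ns_values, alpha0):
    """Fraction of split positions l1 = 1..l-1 with NS(l, l1) >= alpha0.

    ns_values holds one precomputed NS value per split position,
    so len(ns_values) == l - 1.
    """
    hits = sum(1 for v in ns_values if v >= alpha0)
    return hits / len(ns_values)


def window_is_adequate(ns_values, alpha0, alpha_l):
    """Step (4): compare the detection ratio with the threshold alpha_l."""
    return detection_ratio(ns_values, alpha0) >= alpha_l


def average_window_length(lengths):
    """Step (5): l = (1/M) * sum of l_j over M >= 2 rounds."""
    if len(lengths) < 2:
        raise ValueError("need at least two rounds (M >= 2)")
    return sum(lengths) / len(lengths)


# Made-up NS values for l = 60: 57 of the 59 split positions reach 0.9.
example = [0.95] * 57 + [0.5] * 2
adequate = window_is_adequate(example, alpha0=0.9, alpha_l=0.95)
mean_length = average_window_length([60, 60, 50, 70])
```

Passing the NS values in as data keeps the sketch independent of how the measure is computed.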

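(1) Given the selected optimal sliding-window length $l$, start from the data point $(x_{i}, y_{i})$ and use the data $\{(x_{k}, y_{k})\}^{i+l-1}_{k = i}$ in the $i$-th window. Estimate the model parameters with a suitable estimation method to obtain the fitted values $\hat{y}_{k}, k = i, i+1, \cdots, i+l-1$, and the corresponding residual sequence $\hat{\varepsilon}_{k} = y_{k}-\hat{y}_{k}, k = i, i+1, \cdots, i+l-1$ (initially $i = 1$, and $1 \le i \le L-l+1$);

(2) Compute the non-stationarity measure $NS(i)$ of the residual sequence $\hat{\varepsilon}_{k}, k = i, i+1, \cdots, i+l-1$. That is, $NS(i)$ is the non-stationarity measure of the residual sequence of all data in the $i$-th window, whose first data point is $(x_{i}, y_{i})$;

(3) If $NS(i) \ge \alpha_{0}$, the $i$-th window is judged to contain a change point. If $NS(i)<\alpha_{0}$, the $i$-th window is judged to contain no change point; keep the window length unchanged, slide the window by setting $i = i+1$, and repeat the steps above.

#### 3.2.2 Detecting the Exact Position of the Change Point

For a window that does contain a change point, the exact position is located as follows. Suppose the change point lies in the $i$-th window; the data $\{(x_{k}, y_{k})\}^{i+l-1}_{k = i}$ in that window are then examined to determine the specific position.

(1) If $i = 1$, argue by contradiction. Suppose $i+l-1$ is the change-point position; then $NS(i+l-1)<\alpha_{0}$. Take the preceding data point $(x_{i+l-2}, y_{i+l-2})$ as the first data point of a window and compute $NS(i+l-2)$ for the data $(x_{i+l-2}, y_{i+l-2}), (x_{i+l-1}, y_{i+l-1}), \cdots, (x_{i+2l-3}, y_{i+2l-3})$ in that window. If $NS(i+l-2)\ge \alpha_{0}$, the assumption holds and $i+l-1$ is the change-point position. If $NS(i+l-2)<\alpha_{0}$, the assumption fails and $i+l-1$ is not the change-point position; apply the same check to position $i+l-2$, and continue in this way until all data in the window have been examined;

(2) If $i = L-l+1$, argue by contradiction. Suppose $i+1$ is the change-point position; then $NS(i+1)<\alpha_{0}$. Take the data point $(x_{i+1}, y_{i+1})$ as the last data point of a window and compute $NS(i-l+2)$ for the data $(x_{i-l+2}, y_{i-l+2}), (x_{i-l+3}, y_{i-l+3}), \cdots, (x_{i+1}, y_{i+1})$ in that window. If $NS(i-l+2)\ge \alpha_{0}$, the assumption holds and $i+1$ is the change-point position. If $NS(i-l+2)<\alpha_{0}$, the assumption fails and $i+1$ is not the change-point position; apply the same check to position $i+2$, and continue in this way until all data in the window have been examined;

(3) If $1<i<L-l+1$, the change-point position is $i+l-1$.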
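The window scan and the interior-window rule (3) can be sketched in Python as below. This is a sketch under stated assumptions, not the authors' implementation: a straight-line least-squares fit stands in for the "suitable estimation method", the NS measure is supplied by the caller because it is not defined in this section, and `toy_ns` is a deliberately crude stand-in used only to make the example run.

```python
def ols_residuals(xs, ys):
    """Residuals of a straight-line least-squares fit (assumed estimator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def first_flagged_window(xs, ys, l, ns, alpha0):
    """Return the 1-based index i of the first window with NS(i) >= alpha0."""
    L = len(xs)
    for i in range(1, L - l + 2):            # 1 <= i <= L - l + 1
        lo, hi = i - 1, i - 1 + l            # 0-based slice of window i
        if ns(ols_residuals(xs[lo:hi], ys[lo:hi])) >= alpha0:
            return i
    return None


def interior_change_point(i, l, L):
    """Rule (3): for 1 < i < L - l + 1 the change point is i + l - 1."""
    if 1 < i < L - l + 1:
        return i + l - 1
    return None  # boundary windows need the position-by-position check


def toy_ns(residuals):
    """Toy stand-in for NS, NOT the measure used in the paper."""
    return 1.0 if max(abs(r) for r in residuals) > 1e-6 else 0.0


# Noiseless example: L = 30 points, second regime starts at position 21.
xs = [0.01 * k for k in range(1, 31)]
ys = [1 + 2 * x if k <= 20 else 5 - 3 * x for k, x in enumerate(xs, start=1)]
i = first_flagged_window(xs, ys, l=10, ns=toy_ns, alpha0=0.9)
change_point = interior_change_point(i, l=10, L=len(xs))
```

In this noiseless example the first flagged window is the first one whose last point belongs to the second regime, which is why rule (3) reports position $i+l-1$.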

### 4.1 Determining the Optimal Sliding-Window Length

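The window length $l$ takes the values 20, 30, 40, 50, 60, 70 in turn, and the independent variable $x$ takes the values $x_{k} = 0.01+0.01(k-1)$ $(k = 1, \cdots, l)$. The thresholds are set to $\alpha_{0} = 0.9$ and $\alpha_{l} = 0.95$, and the random error is $\varepsilon_{i}\sim N(0, 10^{-5})$. The experiment is repeated 50 times, the optimal window length obtained in each round is recorded, and their mean is taken as the optimal window length of the corresponding two-segment model.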
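The design grid and noise of this setup can be written down in a few lines of Python. The sketch assumes that $10^{-5}$ in $N(0, 10^{-5})$ denotes the variance, so the standard deviation is $\sqrt{10^{-5}}$; the helper names are hypothetical.

```python
import math
import random

WINDOW_LENGTHS = [20, 30, 40, 50, 60, 70]
ALPHA_0, ALPHA_L = 0.9, 0.95
N_ROUNDS = 50
NOISE_SD = math.sqrt(1e-5)  # assumes N(0, 1e-5) specifies the variance


def design_points(l):
    """x_k = 0.01 + 0.01 * (k - 1) for k = 1, ..., l."""
    return [0.01 + 0.01 * (k - 1) for k in range(1, l + 1)]


def draw_noise(n, seed=0):
    """n independent N(0, NOISE_SD**2) errors from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, NOISE_SD) for _ in range(n)]


grid = design_points(20)
errors = draw_noise(len(grid))
```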

### Figure 2

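(1) The optimal window length for the three models above is $l = 60$. Figure 1(a) shows that with window length $l = 20$, only 3 change-point positions are judged correctly in each of the three models, namely those whose NS values exceed the threshold 0.9. Together with Figure 2, the accuracy of judging that a change point exists in the window is 16$\%<\alpha_{l}$, so $l = 20$ is not the optimal sliding-window length. Figures 1(b) to 1(d) show that for window lengths $l = 30, 40, 50$, with the change point allowed at any position in the window, the accuracy of detecting its existence is below the threshold $\alpha_{l}$ in every case, so none of these can serve as the optimal sliding-window length. With window length $l = 60$, Model 1 has only 1 change-point position that is not detected and all others are detected correctly, giving an accuracy of 98$\%>\alpha_{l}$ according to Figure 2; Models 2 and 3 each have only 3 undetected change-point positions, giving an accuracy of 95$\% \ge \alpha_{l}$. With window length $l = 70$, the accuracy is 100$\%$ for all three models. The optimal sliding-window length is therefore $l = 60$.

(2) Figure 2 shows that the ability of the non-stationarity measure NS to detect change points improves markedly as the window length increases.

(3) Data imbalance has a significant effect on the NS measure. As Figure 1 shows, when the proportion of data coming from the two models is below 0.2 or above 0.8, the NS value exceeds the threshold 0.9 and the existence of the change point is detected more easily; when the proportion lies between 0.4 and 0.6, the NS value is below the threshold 0.9. As the window length increases, the effect of this imbalance on the NS measure gradually diminishes, as shown in Figures 1(e) and 1(f).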
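The percentages quoted in item (1) can be reproduced from the counts, assuming the denominator is the $l-1$ candidate positions used in the ratio of step (4). A small Python check follows; the helper name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(detected, l):
    """Share of the l - 1 candidate positions that were detected, in percent."""
    return 100.0 * detected / (l - 1)


acc_l20 = accuracy_percent(3, 20)            # 3 of 19 positions detected
acc_l60_model1 = accuracy_percent(59 - 1, 60)   # 1 of 59 positions missed
acc_l60_model23 = accuracy_percent(59 - 3, 60)  # 3 of 59 positions missed

rounded = [round(v) for v in (acc_l20, acc_l60_model1, acc_l60_model23)]
# rounded == [16, 98, 95]
```

Under that assumption the unrounded value for Models 2 and 3 is 56/59, about 94.9%, so the comparison with $\alpha_{l} = 0.95$ holds for the percentage rounded to a whole number.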

### Figure 3

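(2) For two-segment linear change-point models, the optimal window size stays at roughly 60 regardless of how much the parameters of the two segments differ. In practical problems, where the specific parameter values are unknown, the window length can therefore be set to 60.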

### 4.2 Effectiveness of the Method

$$Y_{i} = (\alpha_{1}+\beta_{1} x_{i}+\varepsilon_{i})\cdot {\bf 1}_{0<i\le t_{0}}+(\alpha_{2}+\beta_{2} x_{i}+\varepsilon_{i})\cdot {\bf 1}_{t_{0}<i\le T}.$$

$$Y_{i} = (a_{1}x_{i}+b_{1}z_{i}+\varepsilon_{i})\cdot {\bf 1}_{0<i\le t_{0}}+(a_{2}x_{i}+b_{2}z_{i}+\varepsilon_{i})\cdot {\bf 1}_{t_{0}<i\le T}.$$

$$Y_{i} = (m_{1}e^{p_{1}x_{i}}+\varepsilon_{i})\cdot {\bf 1}_{0<i\le t_{0}}+(m_{2}e^{p_{2}x_{i}}+\varepsilon_{i})\cdot {\bf 1}_{t_{0}<i\le T}.$$

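| Change-point position $l_{1}+1$ | No change point | 11 | 41 | 101 | 161 | 191 |
|---|---|---|---|---|---|---|
| Model 1 | 100 | 95.5 | 96 | 98 | 99 | 99 |
| Model 2 | 100 | 96.5 | 97 | 99 | 98 | 99 |
| Model 3 | 100 | 95.5 | 97 | 97.5 | 97.5 | 99 |

| Change-point position $l_{1}+1$ | No change point | 11 | 41 | 101 | 161 | 191 |
|---|---|---|---|---|---|---|
| Model 4 | 100 | 94.5 | 98.5 | 96.5 | 98.5 | 99 |
| Model 5 | 100 | 95 | 94.5 | 95.5 | 99.5 | 99 |
| Model 6 | 100 | 94.5 | 97.5 | 98 | 97 | 98.5 |

| Change-point position $l_{1}+1$ | No change point | 11 | 41 | 101 | 161 | 191 |
|---|---|---|---|---|---|---|
| Model 7 | 100 | 90 | 93 | 95.5 | 96 | 95.5 |
| Model 8 | 100 | 96 | 92 | 94 | 96.5 | 96.5 |
| Model 9 | 100 | 92.5 | 94 | 92 | 95.5 | 98 |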

## 5 Comparison with Other Methods

### Figure 5(c)

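| Method | LAD | LAD | LAD | NS | NS | NS |
|---|---|---|---|---|---|---|
| Change-point position | 31 | 81 | 181 | 31 | 81 | 181 |
| Univariate linear | 79.256 | 77.196 | 77.890 | 1.178 | 0.702 | 0.970 |
| Multivariate linear | 74.172 | 76.543 | 74.306 | 0.684 | 0.489 | 1.015 |
| Nonlinear | 133.999 | 111.789 | 141.534 | 23.933 | 16.690 | 19.350 |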
