import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
import statsmodels.formula.api as smf
sns.set_theme()
Autocorrelation in Stock Returns
One of the most important hypotheses in finance is the efficient market hypothesis. In its weak form, it states that past prices should not help predict future prices (Fama 1970).
As a consequence of this, finance academics usually assume that stock returns are uncorrelated. For example, when computing the annualized standard deviation of daily returns we usually do: \[ \sigma_{\text{annualized}} = \sqrt{252} \times \sigma_{\text{daily}} \] since there are typically 252 return observations per year. In this computation we are implicitly assuming that daily returns are independent from each other.
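A minimal numerical sketch of this annualization, using simulated daily returns (the return parameters are purely illustrative):

```python
import numpy as np

# one year of hypothetical daily returns
rng = np.random.default_rng(0)
daily = rng.normal(0.0005, 0.02, size=252)

sigma_daily = daily.std(ddof=1)
sigma_annualized = np.sqrt(252) * sigma_daily  # valid only if daily returns are uncorrelated
```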
One way to test if stocks are autocorrelated is to run the following regression: \[ r_{t} = \alpha + \beta r_{t-1} + e_{t}, \] and check whether \(\beta \neq 0.\) Lo and MacKinlay (1988) use a related approach to show that weekly stock returns are positively autocorrelated, rejecting the random walk hypothesis. Jegadeesh (1990) and Lehmann (1990) document significant negative serial correlation in individual stock returns at short horizons, consistent with return reversals.
We need to be careful, though, since the residuals might also be autocorrelated. We will see how to adjust the covariance matrix of the errors to account for heteroscedasticity and autocorrelation in the residuals. We do not need to worry about small-sample issues since we will be using several years of daily data in our analyses.
More generally, if we ran the regression, \[ r_{t} = \alpha + \beta_{1} r_{t-1} + \beta_{2} r_{t-2} + ... + \beta_{k} r_{t-k} + e_{t}, \] is it the case that by introducing more lags we improve our predictions?
Regressions of this type are called autoregressive processes. The second regression specifies an autoregressive process of order \(k\), or AR(k) model, whereas the first regression specifies an autoregressive process of order 1, or AR(1) model.
Getting Things Ready
Getting the Data
We start by loading the libraries shown at the top of this notebook.
We then download stock price data for Nvidia (ticker: NVDA) from 2015 until 2025 and compute daily returns.
ticker = 'NVDA'
start_date = '2015-01-01'
end_date = '2025-01-01'
dret = (yf
.download(ticker, start=start_date, end=end_date, auto_adjust=False, progress=False, multi_level_index=False)
.loc[:, ['Close']]
.pct_change()
.rename(columns={'Close': 'RET'})
.dropna()
)
We use Close rather than Adj Close here. This is one of the few contexts where the unadjusted price is appropriate: dividends are paid quarterly, so on any given day the difference between a Close-based return and an Adj Close-based return is zero except on the roughly four ex-dividend dates per year. Those four observations have negligible effect on an autocorrelation test estimated on 2,500+ daily observations. For monthly return calculations — where dividends represent a meaningful share of total return — Adj Close is always required, as we established in Module 2. We also rename the column Close to RET for notational convenience.
Generating a Scatter Plot
In order to study these questions, let’s first generate a scatter plot of the daily returns of the stock. Note that in Pandas, the shift() function allows us to lag a whole series. For example, dret['RET'] is a series of daily NVDA returns whereas dret['RET'].shift(1) is a series of one-day lagged NVDA returns.
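A toy example of what shift() does (the return values are made up):

```python
import pandas as pd

ret = pd.Series([0.01, -0.02, 0.03, 0.005])
lagged = ret.shift(1)
# lagged contains [NaN, 0.01, -0.02, 0.03]: each value is pushed down one row,
# so row t of `lagged` holds the return of day t-1
```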
sns.scatterplot(x=dret['RET'].shift(1), y=dret['RET'], alpha=0.5)
plt.title(ticker + ' Actual v/s Lagged Returns')
plt.xlabel('$R_{t-1}$')
plt.ylabel('$R_{t}$')
plt.axhline(y=0, color='gray', linewidth=1)
plt.axvline(x=0, color='gray', linewidth=1)
plt.show()
The plot reveals that daily returns are very volatile, and it is hard to tell whether there is any relationship without running an appropriate statistical test.
Estimating an AR(1) Model of Stock Returns
Statistical Estimation
In order to estimate the regression parameters, we regress NVDA daily returns on their one-day lag, i.e. \(r_{t} = \alpha + \beta r_{t-1} + e_{t}.\) In the formula API, RET ~ RET.shift(1) specifies the model and data=dret specifies the dataframe, so the regression is estimated by running res = smf.ols('RET ~ RET.shift(1)', data=dret).fit().
This will generate a set of results that I store in the variable res. The function summary() then presents the results in a nice format. Notice that display(res.summary()) uses a different font and the table might look odd. Use print(res.summary(slim=True)) instead.
res = smf.ols('RET ~ RET.shift(1)', data=dret).fit()
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.005
Model: OLS Adj. R-squared: 0.004
No. Observations: 2514 F-statistic: 12.11
Covariance Type: nonrobust Prob (F-statistic): 0.000510
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0029 0.001 4.711 0.000 0.002 0.004
RET.shift(1) -0.0693 0.020 -3.480 0.001 -0.108 -0.030
================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The regression results report both coefficient estimates along with their respective p-values. Therefore, the OLS analysis suggests that both the intercept and the coefficient on the lagged return are statistically significant at the 1% level since both p-values are less than 1%.
The R-squared tells us what percentage of the total variance of daily returns is explained by the lagged return. Not surprisingly, the R-squared is very small, as could be expected from the previous plot.
Nevertheless, the autocorrelation coefficient is negative and statistically significant at the 1% level. A coefficient of -0.0693 means that a 10% return today lowers tomorrow's expected return by about 0.69 percentage points; combined with the intercept of 0.29%, the fitted expectation for the following day is roughly -0.40%.
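To make the arithmetic explicit, the fitted expectation after a 10% day combines the intercept and slope point estimates from the table above:

```python
alpha, beta = 0.0029, -0.0693  # point estimates from the regression table above
r_today = 0.10
expected_tomorrow = alpha + beta * r_today  # roughly -0.0040, i.e. about -0.40%
```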
Plotting the Regression Line
I now plot the regression line to compare it against the scatter plot. Seaborn’s regplot() draws both the scatter plot and the regression line in a single call.
sns.regplot(x=dret['RET'].shift(1), y=dret['RET'], scatter_kws={'alpha': 0.5}, ci=None)
plt.title(ticker + ' Actual v/s Lagged Returns')
plt.xlabel('$R_{t-1}$')
plt.ylabel('$R_{t}$')
plt.axhline(y=0, color='gray', linewidth=1)
plt.axvline(x=0, color='gray', linewidth=1)
plt.show()
Note that the regression line indeed has a negative slope.
Controlling for Heteroscedasticity and Autocorrelation of Residuals
In practice, the previous analysis ignores the fact that the residuals of the regression can be heteroscedastic and autocorrelated. This does not bias the coefficient estimates, but it does affect their standard errors and hence the p-values. Therefore, it is common to account for heteroscedasticity and serial correlation of the residuals when performing significance tests.
Indeed, standard OLS assumes that the residuals are independent and identically distributed (i.i.d.). A standard way to control for those two deviations is to use the Newey-West covariance matrix (Newey and West 1987).
When estimating the regression, we can specify cov_type='HAC' to indicate the Newey-West covariance matrix and cov_kwds={'maxlags': lags} where lags is the number of lags to include in the covariance matrix. It is common to use the integer part of \(T^{1/4}\) for the number of lags, where \(T\) is the length of the time-series.
nw_lags = int(len(dret)**0.25)
res = (smf
.ols('RET ~ RET.shift(1)', data=dret)
.fit(cov_type='HAC', cov_kwds={'maxlags': nw_lags})
)
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.005
Model: OLS Adj. R-squared: 0.004
No. Observations: 2514 F-statistic: 5.055
Covariance Type: HAC Prob (F-statistic): 0.0246
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0029 0.001 4.607 0.000 0.002 0.004
RET.shift(1) -0.0693 0.031 -2.248 0.025 -0.130 -0.009
================================================================================
Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 7 lags and without small sample correction
We can see that the autocorrelation coefficient is still significant, although now only at the 5% level rather than the 1% level.
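The following self-contained sketch illustrates on simulated data why the HAC correction changes inference but not estimation: the slope estimate is identical under both covariance types, while the standard errors differ. The two-regime volatility pattern and all parameters below are purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate heteroscedastic returns: volatility alternates between a calm
# regime (1%) and a turbulent regime (3%) every 250 observations
rng = np.random.default_rng(7)
n = 2500
sigma = np.where((np.arange(n) // 250) % 2 == 0, 0.01, 0.03)
df = pd.DataFrame({'RET': rng.normal(0, 1, size=n) * sigma})

ols = smf.ols('RET ~ RET.shift(1)', data=df).fit()
hac = (smf
       .ols('RET ~ RET.shift(1)', data=df)
       .fit(cov_type='HAC', cov_kwds={'maxlags': int(n ** 0.25)})
       )
# the slope estimate is identical across the two fits;
# only the standard errors (and hence the p-values) change
```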
Adding More Lags
We can also run the regression with more lags to see if additional lagged returns improve our prediction.
res = (smf
.ols('RET ~ RET.shift(1) + RET.shift(2) + RET.shift(3) + RET.shift(4) + RET.shift(5)', data=dret)
.fit(cov_type='HAC', cov_kwds={'maxlags':nw_lags})
)
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.005
Model: OLS Adj. R-squared: 0.003
No. Observations: 2510 F-statistic: 1.097
Covariance Type: HAC Prob (F-statistic): 0.360
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0028 0.001 4.439 0.000 0.002 0.004
RET.shift(1) -0.0683 0.030 -2.308 0.021 -0.126 -0.010
RET.shift(2) 0.0190 0.025 0.752 0.452 -0.031 0.068
RET.shift(3) -0.0004 0.023 -0.017 0.986 -0.045 0.044
RET.shift(4) -0.0035 0.021 -0.163 0.870 -0.045 0.038
RET.shift(5) 0.0092 0.021 0.426 0.670 -0.033 0.051
================================================================================
Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 7 lags and without small sample correction
The p-values of the additional lagged variables are all greater than 10%. As such, they are not statistically significant at any conventional level of significance. If anything, the daily returns of NVDA are autocorrelated only with the previous day's return.
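Beyond inspecting individual p-values, one could also test the joint hypothesis that the coefficients on lags 2 through 5 are all zero with an F-test. The sketch below is a self-contained example on simulated AR(1) data, where the extra lags are truly irrelevant by construction; the simulation parameters are illustrative, and with the fitted model from the notebook only the last three lines would be needed:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a pure AR(1) series, so lags 2-5 are irrelevant by construction
rng = np.random.default_rng(1)
n = 2500
r = np.zeros(n)
eps = rng.normal(0, 0.02, size=n)
for t in range(1, n):
    r[t] = -0.07 * r[t - 1] + eps[t]
df = pd.DataFrame({'RET': r})

res = (smf
       .ols('RET ~ RET.shift(1) + RET.shift(2) + RET.shift(3) + RET.shift(4) + RET.shift(5)',
            data=df)
       .fit(cov_type='HAC', cov_kwds={'maxlags': int(n ** 0.25)})
       )

# restriction matrix selecting the coefficients on lags 2 through 5
# (parameter order: Intercept, lag 1, lag 2, ..., lag 5)
R = np.zeros((4, 6))
R[np.arange(4), np.arange(2, 6)] = 1
print(res.f_test(R))
```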
Practice Problems
Problem 1 Using data from January 2015 until May 2025, test whether the daily returns of MSFT exhibit serial correlation. In your test, use the Newey-West covariance matrix with the integer part of \(T^{1/4}\) lags.
Solution
ticker = 'MSFT'
start_date = '2015-01-01'
end_date = '2025-05-01'
dret = (yf
.download(ticker, start=start_date, end=end_date, auto_adjust=False, progress=False, multi_level_index=False)
.loc[:, ['Close']]
.pct_change()
.rename(columns={'Close': 'RET'})
.dropna()
)
nw_lags = int(len(dret)**0.25)
res = (smf
.ols('RET ~ RET.shift(1)', data=dret)
.fit(cov_type='HAC', cov_kwds={'maxlags': nw_lags})
)
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.019
Model: OLS Adj. R-squared: 0.018
No. Observations: 2595 F-statistic: 6.044
Covariance Type: HAC Prob (F-statistic): 0.0140
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0011 0.000 3.493 0.000 0.000 0.002
RET.shift(1) -0.1372 0.056 -2.458 0.014 -0.247 -0.028
================================================================================
Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 7 lags and without small sample correction