import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
import statsmodels.formula.api as smf
sns.set_theme()
Autocorrelation in Stock Returns
One of the most important hypotheses in finance is the efficient market hypothesis. In its weak form, it states that past prices should not help predict future prices (Fama 1970).
As a consequence of this, finance academics usually assume that stock returns are uncorrelated. For example, when computing the annualized standard deviation of daily returns we usually do: \[ \sigma_{\text{annualized}} = \sqrt{252} \times \sigma_{\text{daily}} \] since there are typically 252 return observations per year. In this computation we are implicitly assuming that daily returns are independent from each other.
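A minimal numerical sketch of this annualization, using simulated daily returns (the return parameters are purely illustrative):

```python
import numpy as np

# one year of hypothetical daily returns
rng = np.random.default_rng(0)
daily = rng.normal(0.0005, 0.02, size=252)

sigma_daily = daily.std(ddof=1)
sigma_annualized = np.sqrt(252) * sigma_daily  # valid only if daily returns are uncorrelated
```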
One way to test if stocks are autocorrelated is to run the following regression: \[ r_{t} = \alpha + \beta r_{t-1} + e_{t}, \] and check whether \(\beta \neq 0.\) Lo and MacKinlay (1988) use a related approach to show that weekly stock returns are positively autocorrelated, rejecting the random walk hypothesis. Jegadeesh (1990) and Lehmann (1990) document significant negative serial correlation in individual stock returns at short horizons, consistent with return reversals.
We need to be careful, though, since the residuals might also be autocorrelated. We will see how to adjust the covariance matrix of the errors to account for heteroscedasticity and autocorrelation in the residuals. We do not need to worry about small-sample issues since we will be using several years of daily data in our analyses.
More generally, if we ran the regression, \[ r_{t} = \alpha + \beta_{1} r_{t-1} + \beta_{2} r_{t-2} + ... + \beta_{k} r_{t-k} + e_{t}, \] is it the case that by introducing more lags we improve our predictions?
Regressions of this type are called autoregressive processes. The second regression specifies an autoregressive process of order \(k\), or AR(k) model, whereas the first regression specifies an autoregressive process of order 1, or AR(1) model.
Getting Things Ready
Getting the Data
We start by loading the libraries shown at the top of this notebook.
We then download stock price data for Nvidia (ticker: NVDA) from 2015 until 2025 and compute daily returns.
ticker = 'NVDA'
start_date = '2015-01-01'
end_date = '2025-01-01'
dret = (yf
.download(ticker, start=start_date, end=end_date, auto_adjust=False, progress=False, multi_level_index=False)
.loc[:, ['Close']]
.pct_change()
.rename(columns={'Close': 'RET'})
.dropna()
)
We use Close rather than Adj Close here. This is one of the few contexts where the unadjusted price is appropriate: dividends are paid quarterly, so on any given day the difference between a Close-based return and an Adj Close-based return is zero except on the roughly four ex-dividend dates per year. Those four observations have negligible effect on an autocorrelation test estimated on 2,500+ daily observations. For monthly return calculations — where dividends represent a meaningful share of total return — Adj Close is always required, as we established in Module 2. We also rename the column Close to RET for notational convenience.
Generating a Scatter Plot
In order to study these questions, let’s first generate a scatter plot of the daily returns of the stock. Note that in Pandas, the shift() function allows us to lag a whole series. For example, dret['RET'] is a series of daily NVDA returns whereas dret['RET'].shift(1) is a series of one-day lagged NVDA returns.
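A toy example of what shift() does (the return values are made up):

```python
import pandas as pd

ret = pd.Series([0.01, -0.02, 0.03, 0.005])
lagged = ret.shift(1)
# lagged contains [NaN, 0.01, -0.02, 0.03]: each value is pushed down one row,
# so row t of `lagged` holds the return of day t-1
```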
sns.scatterplot(x=dret['RET'].shift(1), y=dret['RET'], alpha=0.5)
plt.title(ticker + ' Actual v/s Lagged Returns')
plt.xlabel('$R_{t-1}$')
plt.ylabel('$R_{t}$')
plt.axhline(y=0, color='gray', linewidth=1)
plt.axvline(x=0, color='gray', linewidth=1)
plt.show()
The plot reveals that daily returns are very volatile, and it is hard to tell whether there is any relationship without running an appropriate statistical test.
Estimating an AR(1) Model of Stock Returns
Statistical Estimation
In order to estimate the regression parameters, we regress NVDA daily returns on their one-day lag, i.e. \(r_{t} = \alpha + \beta r_{t-1} + e_{t}.\) In the formula API, RET ~ RET.shift(1) specifies the model and data=dret specifies the dataframe, so the regression is estimated by running res = smf.ols('RET ~ RET.shift(1)', data=dret).fit().
This will generate a set of results that I store in the variable res. The function summary() then presents the results in a nice format. Notice that display(res.summary()) uses a different font and the table might look odd. Use print(res.summary(slim=True)) instead.
res = smf.ols('RET ~ RET.shift(1)', data=dret).fit()
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.005
Model: OLS Adj. R-squared: 0.004
No. Observations: 2514 F-statistic: 12.11
Covariance Type: nonrobust Prob (F-statistic): 0.000510
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0029 0.001 4.711 0.000 0.002 0.004
RET.shift(1) -0.0693 0.020 -3.480 0.001 -0.108 -0.030
================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The regression results report both coefficient estimates along with their respective p-values. Therefore, the OLS analysis suggests that both the intercept and the coefficient on the lagged return are statistically significant at the 1% level since both p-values are less than 1%.
The R-squared tells us what percentage of the total variance of daily returns is explained by the lagged return. Not surprisingly, the R-squared is very small, as could be expected from the previous plot.
Nevertheless, the autocorrelation coefficient is negative and statistically significant at the 1% level. A coefficient of -0.0693 means that a 10% return today lowers tomorrow's expected return by about 0.69 percentage points; combined with the intercept of 0.29%, the fitted expectation for the following day is roughly -0.40%.
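To make the arithmetic explicit, the fitted expectation after a 10% day combines the intercept and slope point estimates from the table above:

```python
alpha, beta = 0.0029, -0.0693  # point estimates from the regression table above
r_today = 0.10
expected_tomorrow = alpha + beta * r_today  # roughly -0.0040, i.e. about -0.40%
```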
Plotting the Regression Line
I now plot the regression line to compare it against the scatter plot. Seaborn’s regplot() draws both the scatter plot and the regression line in a single call.
sns.regplot(x=dret['RET'].shift(1), y=dret['RET'], scatter_kws={'alpha': 0.5}, ci=None)
plt.title(ticker + ' Actual v/s Lagged Returns')
plt.xlabel('$R_{t-1}$')
plt.ylabel('$R_{t}$')
plt.axhline(y=0, color='gray', linewidth=1)
plt.axvline(x=0, color='gray', linewidth=1)
plt.show()
Note that the regression line indeed has a negative slope.
Controlling for Heteroscedasticity and Autocorrelation of Residuals
In practice, the previous analysis ignores the fact that the residuals of the regression can be heteroscedastic and autocorrelated. This does not bias the coefficient estimates, but it does affect their standard errors and hence the p-values. Therefore, it is common to account for heteroscedasticity and serial correlation of the residuals when performing significance tests.
Indeed, standard OLS assumes that the residuals are independent and identically distributed (i.i.d.). A standard way to control for those two deviations is to use the Newey-West covariance matrix (Newey and West 1987).
When estimating the regression, we can specify cov_type='HAC' to indicate the Newey-West covariance matrix and cov_kwds={'maxlags': lags} where lags is the number of lags to include in the covariance matrix. It is common to use the integer part of \(T^{1/4}\) for the number of lags, where \(T\) is the length of the time-series.
nw_lags = int(len(dret)**0.25)
res = (smf
.ols('RET ~ RET.shift(1)', data=dret)
.fit(cov_type='HAC', cov_kwds={'maxlags': nw_lags})
)
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.005
Model: OLS Adj. R-squared: 0.004
No. Observations: 2514 F-statistic: 5.055
Covariance Type: HAC Prob (F-statistic): 0.0246
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0029 0.001 4.607 0.000 0.002 0.004
RET.shift(1) -0.0693 0.031 -2.248 0.025 -0.130 -0.009
================================================================================
Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 7 lags and without small sample correction
We can see that the autocorrelation coefficient is still significant, although now only at the 5% level rather than the 1% level.
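The following self-contained sketch illustrates on simulated data why the HAC correction changes inference but not estimation: the slope estimate is identical under both covariance types, while the standard errors differ. The two-regime volatility pattern and all parameters below are purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate heteroscedastic returns: volatility alternates between a calm
# regime (1%) and a turbulent regime (3%) every 250 observations
rng = np.random.default_rng(7)
n = 2500
sigma = np.where((np.arange(n) // 250) % 2 == 0, 0.01, 0.03)
df = pd.DataFrame({'RET': rng.normal(0, 1, size=n) * sigma})

ols = smf.ols('RET ~ RET.shift(1)', data=df).fit()
hac = (smf
       .ols('RET ~ RET.shift(1)', data=df)
       .fit(cov_type='HAC', cov_kwds={'maxlags': int(n ** 0.25)})
       )
# the slope estimate is identical across the two fits;
# only the standard errors (and hence the p-values) change
```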
Adding More Lags
We can also run the regression with more lags to see if additional lagged returns improve our prediction.
res = (smf
.ols('RET ~ RET.shift(1) + RET.shift(2) + RET.shift(3) + RET.shift(4) + RET.shift(5)', data=dret)
.fit(cov_type='HAC', cov_kwds={'maxlags':nw_lags})
)
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.005
Model: OLS Adj. R-squared: 0.003
No. Observations: 2510 F-statistic: 1.097
Covariance Type: HAC Prob (F-statistic): 0.360
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0028 0.001 4.439 0.000 0.002 0.004
RET.shift(1) -0.0683 0.030 -2.308 0.021 -0.126 -0.010
RET.shift(2) 0.0190 0.025 0.752 0.452 -0.031 0.068
RET.shift(3) -0.0004 0.023 -0.017 0.986 -0.045 0.044
RET.shift(4) -0.0035 0.021 -0.163 0.870 -0.045 0.038
RET.shift(5) 0.0092 0.021 0.426 0.670 -0.033 0.051
================================================================================
Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 7 lags and without small sample correction
The p-values of the additional lagged variables are all greater than 10%. As such, they are not statistically significant at any conventional level of significance. If anything, the daily returns of NVDA are autocorrelated only with the previous day's return.
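Beyond inspecting individual p-values, one could also test the joint hypothesis that the coefficients on lags 2 through 5 are all zero with an F-test. The sketch below is a self-contained example on simulated AR(1) data, where the extra lags are truly irrelevant by construction; the simulation parameters are illustrative, and with the fitted model from the notebook only the last three lines would be needed:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a pure AR(1) series, so lags 2-5 are irrelevant by construction
rng = np.random.default_rng(1)
n = 2500
r = np.zeros(n)
eps = rng.normal(0, 0.02, size=n)
for t in range(1, n):
    r[t] = -0.07 * r[t - 1] + eps[t]
df = pd.DataFrame({'RET': r})

res = (smf
       .ols('RET ~ RET.shift(1) + RET.shift(2) + RET.shift(3) + RET.shift(4) + RET.shift(5)',
            data=df)
       .fit(cov_type='HAC', cov_kwds={'maxlags': int(n ** 0.25)})
       )

# restriction matrix selecting the coefficients on lags 2 through 5
# (parameter order: Intercept, lag 1, lag 2, ..., lag 5)
R = np.zeros((4, 6))
R[np.arange(4), np.arange(2, 6)] = 1
print(res.f_test(R))
```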
Practice Problems
Problem 1 Using data from January 2015 until May 2025, test whether the daily returns of MSFT exhibit serial correlation. In your test, use the Newey-West covariance matrix with the integer part of \(T^{1/4}\) lags.
Solution
ticker = 'MSFT'
start_date = '2015-01-01'
end_date = '2025-05-01'
dret = (yf
.download(ticker, start=start_date, end=end_date, auto_adjust=False, progress=False, multi_level_index=False)
.loc[:, ['Close']]
.pct_change()
.rename(columns={'Close': 'RET'})
.dropna()
)
nw_lags = int(len(dret)**0.25)
res = (smf
.ols('RET ~ RET.shift(1)', data=dret)
.fit(cov_type='HAC', cov_kwds={'maxlags': nw_lags})
)
print(res.summary(slim=True))
OLS Regression Results
==============================================================================
Dep. Variable: RET R-squared: 0.019
Model: OLS Adj. R-squared: 0.018
No. Observations: 2595 F-statistic: 6.044
Covariance Type: HAC Prob (F-statistic): 0.0140
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.0011 0.000 3.493 0.000 0.000 0.002
RET.shift(1) -0.1372 0.056 -2.458 0.014 -0.247 -0.028
================================================================================
Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 7 lags and without small sample correction