Downloading Financial Data

The Package yfinance

We’ll use the yfinance package to download stock price data from Yahoo! Finance. This Python library provides a simple interface for retrieving historical market data for stocks, indices, and other financial instruments. It must be installed before you can use it.

Anaconda

If you are using Anaconda, you need to install yfinance before using it for the first time. In an Anaconda command prompt, type:

pip install yfinance --upgrade

or in a Jupyter cell you can type:

!pip install yfinance --upgrade --quiet

Google Colab

In Google Colab, yfinance is pre-installed, but you should upgrade to the latest version due to recent Yahoo! Finance API changes. Run this in a code cell:

!pip install yfinance --upgrade --quiet
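After upgrading in either environment, it can be useful to confirm which version is actually installed. The following is a minimal sketch that relies only on the Python standard library; the helper name installed_version is made up for this illustration and is not part of yfinance.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if yfinance is not installed yet.
print('yfinance version:', installed_version('yfinance'))
```

If the result is None, the install command above has not been run in the environment that the notebook is using.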

Importing the Library

We import yfinance together with the plotting libraries used later in this section.

import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf

sns.set_theme()

Data

To estimate asset pricing models, we need data on stock returns, both for a particular security and for the market. An easy way to do this is to download the data from the web. We start by defining the security that we want to analyze and the date from which we want to start the analysis.

To get the data, we use the function download() in yfinance. For example, to get the data for Microsoft (ticker: MSFT) we type:

yf.download(tickers='MSFT', auto_adjust=False, progress=False, multi_level_index=False)
Adj Close Close High Low Open Volume
Date
2026-01-28 480.533203 481.630005 483.739990 478.000000 483.209991 36875400
2026-01-29 432.512817 433.500000 442.500000 421.019989 439.989990 128855300
2026-01-30 429.310120 430.290009 439.600006 426.450012 439.170013 58566800
2026-02-02 422.405884 423.369995 430.739990 422.250000 430.239990 42219900
2026-02-03 410.273560 411.209991 422.049988 408.559998 422.010010 61424100
2026-02-04 413.246796 414.190002 419.799988 409.239990 411.000000 45012400
2026-02-05 392.773529 393.670013 408.299988 392.320007 407.440002 66289200
2026-02-06 400.226501 401.140015 401.790009 392.920013 399.170013 53515300
2026-02-09 412.658142 413.600006 414.890015 400.869995 404.850006 45480500
2026-02-10 412.328857 413.269989 423.679993 412.700012 419.619995 44857900
2026-02-11 403.449127 404.369995 416.459991 401.010010 416.179993 42491000
2026-02-12 400.924896 401.839996 406.200012 398.010010 405.000000 40802400
2026-02-13 400.406097 401.320007 405.540009 398.049988 404.450012 34091600
2026-02-17 395.956238 396.859985 400.519989 394.529999 399.220001 32078800
2026-02-18 398.690002 399.600006 402.559998 396.320007 398.130005 23223400
2026-02-19 398.459991 398.459991 404.429993 396.670013 400.690002 28234000
2026-02-20 397.230011 397.230011 400.119995 395.160004 396.109985 34015200
2026-02-23 384.470001 384.470001 395.359985 383.100006 395.000000 43238300
2026-02-24 389.000000 389.000000 389.359985 381.709991 384.140015 33884700
2026-02-25 400.600006 400.600006 401.470001 390.160004 390.529999 43625500
2026-02-26 401.720001 401.720001 407.489990 398.739990 404.709991 34405900
2026-02-27 392.739990 392.739990 396.799988 390.000000 390.992493 50401665

The output of the function download() is a Pandas dataframe. Therefore, all operations available on Pandas work on the dataframe just imported.
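As a small illustration of what working with an ordinary dataframe means in practice, the sketch below rebuilds three rows of the MSFT output above by hand (so that it runs without a network connection) and applies a few standard Pandas operations. With a live download, the same calls would work on the returned object.

```python
import pandas as pd

# Three rows copied from the MSFT output above, standing in for the download.
prices = pd.DataFrame(
    {
        'Adj Close': [480.533203, 432.512817, 429.310120],
        'Close': [481.630005, 433.500000, 430.290009],
        'Volume': [36875400, 128855300, 58566800],
    },
    index=pd.to_datetime(['2026-01-28', '2026-01-29', '2026-01-30']),
)
prices.index.name = 'Date'

highest_close = prices['Close'].max()      # largest closing price in the sample
busiest_day = prices['Volume'].idxmax()    # date with the highest trading volume
average_volume = prices['Volume'].mean()   # mean daily volume

print(highest_close, busiest_day.date(), round(average_volume))
```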

The yfinance library was recently updated so that, by default, the Close column contains prices already adjusted for splits and dividends. To get the unadjusted price and the adjusted price as separate columns, we need to set auto_adjust=False when downloading data.

I use the option progress=False to suppress the status bar when downloading the data. The code runs just as well without it.

Also, note that without a starting date the download covers only a short recent window (about one month in the output above). We can specify an arbitrary starting date by adding, for example, start='2012-01-01'.

yf.download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False)
Adj Close Close High Low Open Volume
Date
2012-01-03 20.917694 26.770000 26.959999 26.389999 26.549999 64731500
2012-01-04 21.409971 27.400000 27.469999 26.780001 26.820000 80516100
2012-01-05 21.628750 27.680000 27.730000 27.290001 27.379999 56081400
2012-01-06 21.964748 28.110001 28.190001 27.530001 27.530001 99455500
2012-01-09 21.675640 27.740000 28.100000 27.719999 28.049999 59706800
... ... ... ... ... ... ...
2026-02-23 384.470001 384.470001 395.359985 383.100006 395.000000 43238300
2026-02-24 389.000000 389.000000 389.359985 381.709991 384.140015 33884700
2026-02-25 400.600006 400.600006 401.470001 390.160004 390.529999 43625500
2026-02-26 401.720001 401.720001 407.489990 398.739990 404.709991 34405900
2026-02-27 392.739990 392.739990 396.799988 390.000000 390.992493 50401665

3559 rows × 6 columns
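If a frame has already been downloaded, a later starting date can also be imposed afterwards with Pandas rather than with a new call to download(). A minimal sketch, using five Close values copied from the first MSFT output above in place of a live download:

```python
import pandas as pd

# Five Close values copied from the first MSFT output above.
close = pd.Series(
    [481.630005, 433.500000, 430.290009, 423.369995, 411.209991],
    index=pd.to_datetime(
        ['2026-01-28', '2026-01-29', '2026-01-30', '2026-02-02', '2026-02-03']
    ),
    name='Close',
)

# Label-based slicing on a date index keeps every row on or after the given date.
from_february = close.loc['2026-02-01':]
print(from_february)
```

February 1, 2026 is not a trading day, so the slice starts at the next available date, 2026-02-02. The same thing happens in the download above, where start='2012-01-01' yields a first row dated 2012-01-03.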

Close v/s Adjusted Close Price

The data query produces six columns. When estimating asset pricing models we will be interested in using the adjusted close (Adj Close) column since the price series is adjusted for stock splits and dividends. Let’s compare how the Adj Close differs from the Close price.

For this, let’s extract these two series and store them in a dataframe called df. The method df.loc[:, [column_label_1, column_label_2, ..., column_label_n]] selects multiple columns from the dataframe. In our example we want to select loc[:, ['Close', 'Adj Close']].

df = yf.download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:, ['Close', 'Adj Close']]
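The brackets inside loc matter: a list of labels returns a dataframe, while a single bare label returns a series. The sketch below shows the difference on two rows copied from the MSFT output above, used here in place of the downloaded data.

```python
import pandas as pd

# Two rows copied from the MSFT output above stand in for the downloaded data.
prices = pd.DataFrame(
    {
        'Adj Close': [20.917694, 21.409971],
        'Close': [26.770000, 27.400000],
        'Volume': [64731500, 80516100],
    },
    index=pd.to_datetime(['2012-01-03', '2012-01-04']),
)

two_columns = prices.loc[:, ['Close', 'Adj Close']]  # list of labels -> DataFrame
one_column_frame = prices.loc[:, ['Close']]          # one-element list -> DataFrame
one_column_series = prices.loc[:, 'Close']           # bare label -> Series

print(type(two_columns).__name__, list(two_columns.columns))
print(type(one_column_frame).__name__, type(one_column_series).__name__)
```

Note that the columns come back in the order in which they are listed, not in the order in which they appear in the original dataframe.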

Lines of Python code can get really long. A common way to split a long expression into several shorter lines is to wrap it in parentheses and start a new line before each dot.

df = (yf
      .download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False)
      .loc[:, ['Close', 'Adj Close']]
      )

This code does exactly the same as the one-line version we wrote originally, but might be easier to read and understand.

We can now compute cumulative percentage changes (cum_pct) for both series by applying to each column in df a function that divides everything by the first price. The apply() method applies a function along an axis of a DataFrame or a Series. The function can be a built-in Python function or a custom function that you define.
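To see how apply() treats each column separately, here is a minimal sketch on a toy dataframe. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not market data.

```python
import pandas as pd

# Hypothetical prices for two made-up assets, A and B.
toy = pd.DataFrame({'A': [10.0, 11.0, 12.0], 'B': [50.0, 45.0, 55.0]})

# Built-in function: returns one number per column.
column_max = toy.apply(max)

# Custom function returning one number per column: last price minus first price.
change = toy.apply(lambda col: col.iloc[-1] - col.iloc[0])

# Custom function returning a whole series per column: deviation from the mean.
demeaned = toy.apply(lambda col: col - col.mean())

print(column_max, change, demeaned, sep='\n\n')
```

When the function returns a single number, the result is a series with one entry per column. When it returns a series, the result is a dataframe with the same shape as the input.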

First, we define a function that takes a series x as input and divides the series by its initial value, x.iloc[0]. (In the original code, I wrote x / x[0]. However, that syntax is being deprecated, and the suggestion is to select a row by its position in the pandas dataframe using iloc.)

def normalize(x):
    return x / x.iloc[0]

Then, we apply this function to our dataframe and plot it.

cum_pct = df.apply(normalize)
cum_pct.plot()
plt.title('Close v/s Adjusted Close of Microsoft')
plt.ylabel('Cumulative Return')
plt.show()

According to the graph, the impact of dividends is significant. Another way to see this is by actually computing the dividends from the two series. We will do this later.
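The size of the gap can be checked by hand from the first and last rows of the MSFT table shown earlier. The sketch below copies those two rows and applies the same normalization, so it runs without downloading anything.

```python
import pandas as pd


def normalize(x):
    # Same function as above: divide a series by its first value.
    return x / x.iloc[0]


# First and last rows of the MSFT output above (start='2012-01-01').
endpoints = pd.DataFrame(
    {'Close': [26.770000, 392.739990], 'Adj Close': [20.917694, 392.739990]},
    index=pd.to_datetime(['2012-01-03', '2026-02-27']),
)

growth = endpoints.apply(normalize)
print(growth.round(2))
```

Over this sample, the Close series grows by a factor of roughly 14.7, while the Adj Close series grows by a factor of roughly 18.8. The difference comes from the adjustments built into Adj Close.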

Practice Problems

Problem 1 Plot the Close price for Nvidia Corporation (Ticker: NVDA) from July 1, 2015 until May 1, 2023.

Solution
df = yf.download(tickers='NVDA', start='2015-07-01', end='2023-05-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:,['Close']]
df.plot()
plt.title('NVDA Stock Price')
plt.show()

Problem 2 Plot the S&P 500 (Ticker: ^GSPC) from January 1, 2018 until May 1, 2023. Since the S&P 500 is an index, you can use either the Close or the Adj Close; it does not make a difference.

Solution
df = yf.download(tickers='^GSPC', start='2018-01-01', end='2023-05-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:,['Close']]
df.plot()
plt.title('S&P 500')
plt.show()

Problem 3 Imagine that you invest $100 in NVDA and $100 in AMD on January 1, 2015. Plot the evolution of each investment until May 1, 2023.

Solution

To plot the evolution of each investment, all we need to do is normalize the Adj Close series of both stocks to 100 at the start of the sample (the first trading day on or after January 1, 2015). We can use the same function we built before and multiply the result by 100. We use Adj Close instead of Close to account for stock splits and dividends.

df = (yf
      .download(tickers=['NVDA', 'AMD'], start='2015-01-01', end='2023-05-01', auto_adjust=False, progress=False)
      .loc[:, 'Adj Close']
      .apply(normalize)
      )*100
df.plot()
plt.title('Evolution of a $100 Investment in NVDA and AMD')
plt.ylabel('Portfolio Value ($)')
plt.show()
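With several tickers, the downloaded dataframe has two levels of column labels (the price field and the ticker), which is why the solution selects .loc[:, 'Adj Close'] with a bare label. The sketch below mimics that layout with hypothetical numbers, which are not real NVDA or AMD prices, to show what the selection returns. The exact layout of a live download is assumed here, not verified.

```python
import pandas as pd


def normalize(x):
    # Same function as above: divide a series by its first value.
    return x / x.iloc[0]


# Two-level columns: price field on top, ticker below (assumed layout).
columns = pd.MultiIndex.from_product([['Adj Close', 'Close'], ['AMD', 'NVDA']])

# Hypothetical values, one row per date, in the column order defined above.
raw = pd.DataFrame(
    [[10.0, 20.0, 11.0, 21.0],
     [15.0, 30.0, 16.0, 31.0]],
    index=pd.to_datetime(['2015-01-02', '2015-01-05']),
    columns=columns,
)

# Selecting the top-level label keeps one column per ticker.
adj_close = raw.loc[:, 'Adj Close']
investment = adj_close.apply(normalize) * 100

print(list(adj_close.columns))
print(investment)
```

Each column of investment starts at 100, so the two lines in the plot are directly comparable.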