Downloading Financial Data

The Package yfinance

We’ll use the yfinance package to download stock price data from Yahoo! Finance. This Python library provides a simple interface for retrieving historical market data for stocks, indices, and other financial instruments. You need to install it before using it for the first time.

Anaconda

If you are using Anaconda, install yfinance before you use it for the first time. In an Anaconda command prompt, type:

pip install yfinance --upgrade

or in a Jupyter cell you can type:

!pip install yfinance --upgrade --quiet

Google Colab

In Google Colab, yfinance is pre-installed. If you ever need to upgrade to the latest version, run this in a code cell:

!pip install yfinance --upgrade --quiet
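After installing or upgrading, it can be useful to confirm which version is active in the current environment. The sketch below uses only the Python standard library; the helper name installed_version is ours and is not part of yfinance.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package as a string, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if yfinance is installed, or None otherwise.
print("yfinance version:", installed_version("yfinance"))
```

If the result is None, the package is not installed in the environment that runs the notebook, and the pip command above should be run again from that environment.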

Importing the Libraries

You can import the libraries as follows.

import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf

sns.set_theme()

Data

To estimate asset pricing models, we need data on stock returns, both for a particular security and for the market. An easy way to do this is to download the data from the web. We start by defining the security that we want to analyze and the date from which we want to start the analysis.

To get the data, we use the function download() in yfinance. For example, to get the data for Microsoft (ticker: MSFT) we type:

yf.download(tickers='MSFT', auto_adjust=False, progress=False, multi_level_index=False)
             Adj Close       Close        High         Low        Open    Volume
Date
2026-02-27  392.739990  392.739990  396.820007  389.880005  390.880005  51367200
2026-03-02  398.549988  398.549988  401.190002  390.630005  392.859985  35474900
2026-03-03  403.929993  403.929993  406.700012  392.670013  393.140015  38199200
2026-03-04  405.200012  405.200012  411.029999  400.309998  401.269989  35808000
2026-03-05  410.679993  410.679993  411.609985  404.399994  404.420013  39001300
2026-03-06  408.959991  408.959991  413.049988  408.510010  409.200012  31123900
2026-03-09  409.410004  409.410004  410.209991  403.500000  404.920013  30131900
2026-03-10  405.760010  405.760010  410.200012  402.929993  410.029999  31706400
2026-03-11  404.880005  404.880005  409.010010  401.589996  405.570007  25512100
2026-03-12  401.859985  401.859985  406.119995  401.709991  404.630005  27263900
2026-03-13  395.549988  395.549988  404.799988  394.250000  401.000000  26848000
2026-03-16  399.950012  399.950012  400.630005  394.790009  398.070007  27733700
2026-03-17  399.410004  399.410004  404.399994  397.750000  400.269989  26228300
2026-03-18  391.790009  391.790009  398.000000  391.000000  397.130005  25908500
2026-03-19  389.019989  389.019989  392.489990  387.059998  390.100006  25138800
2026-03-20  381.869995  381.869995  387.000000  380.119995  386.790009  50853200
2026-03-23  383.000000  383.000000  387.209991  381.679993  383.899994  29680100
2026-03-24  372.739990  372.739990  382.470001  371.850006  382.359985  42733600
2026-03-25  371.040009  371.040009  377.059998  369.630005  376.920013  31181200
2026-03-26  365.970001  365.970001  374.719910  365.190002  370.815002  36436874

The function download() returns a pandas DataFrame. Therefore, every operation that pandas offers for DataFrames works on the data we just imported.
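As a small illustration of that point, the sketch below rebuilds the first five rows of the output above by hand, so that it runs without an internet connection, and applies a few standard pandas operations. Only four of the six columns are typed in, and the variable names are ours.

```python
import pandas as pd

# First five rows of the MSFT output above, typed in by hand (four columns only).
msft = pd.DataFrame(
    {
        "Close": [392.739990, 398.549988, 403.929993, 405.200012, 410.679993],
        "High": [396.820007, 401.190002, 406.700012, 411.029999, 411.609985],
        "Low": [389.880005, 390.630005, 392.670013, 400.309998, 404.399994],
        "Volume": [51367200, 35474900, 38199200, 35808000, 39001300],
    },
    index=pd.to_datetime(
        ["2026-02-27", "2026-03-02", "2026-03-03", "2026-03-04", "2026-03-05"]
    ),
)
msft.index.name = "Date"

# Daily percentage change of the closing price (the first value is NaN).
daily_returns = msft["Close"].pct_change()

# Difference between the daily high and the daily low.
daily_range = msft["High"] - msft["Low"]

# Because the index holds dates, a partial date string selects a whole month.
march_rows = msft.loc["2026-03"]

print(daily_returns)
print(daily_range)
print(march_rows)
```

The same calls work unchanged on the full dataframe returned by download(), since it has the same date index and column labels.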

The yfinance library was recently updated so that the Close column now shows split- and dividend-adjusted prices by default. To get the unadjusted price and the adjusted price as separate columns, we need to set auto_adjust=False when downloading data.

I use the option progress=False to suppress the status bar when downloading the data. The code runs just as well without it.

The option multi_level_index=False flattens the column names to a single level. Without it, yfinance returns columns with two levels (price type and ticker), so you would need to write df['Close']['MSFT'] instead of simply df['Close'].
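To make the difference concrete, the sketch below builds a small two-level dataframe by hand instead of downloading it. The values come from the last three rows of the output above, and the level names Price and Ticker are our assumption about how yfinance labels the two levels.

```python
import pandas as pd

# Hand-built stand-in for a two-level download; the level names are an assumption.
columns = pd.MultiIndex.from_product(
    [["Close", "Open"], ["MSFT"]], names=["Price", "Ticker"]
)
two_level = pd.DataFrame(
    [
        [372.739990, 382.359985],
        [371.040009, 376.920013],
        [365.970001, 370.815002],
    ],
    index=pd.to_datetime(["2026-03-24", "2026-03-25", "2026-03-26"]),
    columns=columns,
)

# With two levels we first select the price type and then the ticker.
close_two_level = two_level["Close"]["MSFT"]

# Dropping the ticker level leaves a single level of column names.
flat = two_level.droplevel("Ticker", axis=1)
close_flat = flat["Close"]

print(two_level.columns.nlevels)  # number of column levels before flattening
print(list(flat.columns))         # column names after flattening
```

Both selections return the same closing prices. The flat version is simply shorter to write when we work with a single ticker.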

Also, note that by default, the dataset only includes data for the last month. We can specify an arbitrary starting date by adding, for example, start='2012-01-01'.

yf.download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False)
             Adj Close       Close        High         Low        Open    Volume
Date
2012-01-03   20.917698   26.770000   26.959999   26.389999   26.549999  64731500
2012-01-04   21.409966   27.400000   27.469999   26.780001   26.820000  80516100
2012-01-05   21.628754   27.680000   27.730000   27.290001   27.379999  56081400
2012-01-06   21.964752   28.110001   28.190001   27.530001   27.530001  99455500
2012-01-09   21.675644   27.740000   28.100000   27.719999   28.049999  59706800
...                ...         ...         ...         ...         ...       ...
2026-03-20  381.869995  381.869995  387.000000  380.119995  386.790009  50853200
2026-03-23  383.000000  383.000000  387.209991  381.679993  383.899994  29680100
2026-03-24  372.739990  372.739990  382.470001  371.850006  382.359985  42733600
2026-03-25  371.040009  371.040009  377.059998  369.630005  376.920013  31181200
2026-03-26  365.970001  365.970001  374.719910  365.190002  370.815002  36436874

3578 rows × 6 columns

Close v/s Adjusted Close Price

The data query produces six columns. When estimating asset pricing models we will be interested in using the adjusted close (Adj Close) column since the price series is adjusted for stock splits and dividends. Let’s compare how the Adj Close differs from the Close price.

For this, let’s extract these two series and store them in a dataframe called df. The method df.loc[:, [column_label_1, column_label_2, ..., column_label_n]] selects multiple columns from the dataframe. In our example we want to select loc[:, ['Close', 'Adj Close']].

df = yf.download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:, ['Close', 'Adj Close']]

Lines of Python code can get really long. A suggested way to split a long line into several shorter ones is to wrap the expression in parentheses and start a new line before each dot.

df = (yf
      .download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False)
      .loc[:, ['Close', 'Adj Close']]
      )

The previous line of code does exactly the same as the one we wrote originally, but might be easier to read and understand.

We can now compute cumulative percentage changes (cum_pct) for both series by applying to each column in df a function that divides everything by the first price. The apply() method applies a function along an axis of a DataFrame or a Series. The function can be a built-in Python function or a custom function that you define.

First, we define a function that takes a series x as input and divides the series by its initial value, x.iloc[0]. (In the original code, I wrote x / x[0]. However, that syntax is being deprecated, and the suggestion is to select a row of a pandas object by its position using iloc.)

def normalize(x):
    return x / x.iloc[0]
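Before using the function on real prices, it can help to check it on numbers that are easy to verify by hand. The toy dataframe below is made up for illustration and does not correspond to any security.

```python
import pandas as pd


def normalize(x):
    # Divide every value by the first value, selected by position.
    return x / x.iloc[0]


# Made-up prices, chosen so the result is easy to check by hand.
toy = pd.DataFrame(
    {"Close": [50.0, 55.0, 60.0], "Adj Close": [40.0, 46.0, 52.0]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

# apply() passes each column to normalize() as a series (axis=0 is the default).
normalized = toy.apply(normalize)
print(normalized)

# Dividing the dataframe by its first row gives the same result without apply().
same_result = toy / toy.iloc[0]
print(same_result)
```

Every column starts at 1, so later values read directly as growth factors relative to the first date.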

Then, we apply this function to our dataframe and plot it.

cum_pct = df.apply(normalize)
cum_pct.plot()
plt.title('Close v/s Adjusted Close of Microsoft')
plt.ylabel('Cumulative Return')
plt.show()

According to the graph, the impact of dividends is significant. Another way to see this is by actually computing the dividends from the two series. We will do this later.
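We can get a rough sense of the size of the gap without the plot by using the first and last rows of the MSFT output shown earlier. The sketch below only does arithmetic on those four printed numbers, and the variable names are ours.

```python
# Values copied from the first and last rows of the MSFT output shown earlier.
close_first, adj_close_first = 26.770000, 20.917698   # 2012-01-03
close_last, adj_close_last = 365.970001, 365.970001   # 2026-03-26

# How far the adjusted price sits below the quoted price on the first date.
adjustment_ratio = adj_close_first / close_first

# Growth factor of each series over the whole sample.
close_growth = close_last / close_first
adj_close_growth = adj_close_last / adj_close_first

print(f"Adj Close / Close on 2012-01-03: {adjustment_ratio:.4f}")
print(f"Growth of Close:     {close_growth:.2f}x")
print(f"Growth of Adj Close: {adj_close_growth:.2f}x")
```

On the last date the two prices coincide, while on the first date the adjusted price is about 78% of the quoted price. As a result, the adjusted series grows by a factor of roughly 17.5 compared with roughly 13.7 for the unadjusted one.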

Practice Problems

Problem 1 Plot the Close price for Nvidia Corporation (Ticker: NVDA) from July 1, 2015 until May 1, 2023.

Solution
df = yf.download(tickers='NVDA', start='2015-07-01', end='2023-05-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:,['Close']]
df.plot()
plt.title('NVDA Stock Price')
plt.show()

Problem 2 Plot the S&P 500 (Ticker: ^GSPC) from January 1, 2018 until May 1, 2023. Since the S&P 500 is an index, you can use either Close or Adj Close; it does not make a difference.

Solution
df = yf.download(tickers='^GSPC', start='2018-01-01', end='2023-05-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:,['Close']]
df.plot()
plt.title('S&P 500')
plt.show()

Problem 3 Imagine that you invest $100 in each of NVDA and AMD on January 1, 2015. Plot the evolution of each investment until May 1, 2023.

Solution

To plot the evolution of each investment, all we need to do is normalize the Adj Close series of both stocks to 100 at the start of January 2015. We can use the same function we built before and multiply the result by 100. We use Adj Close instead of Close to account for stock splits and dividends.

df = (yf
      .download(tickers=['NVDA', 'AMD'], start='2015-01-01', end='2023-05-01', auto_adjust=False, progress=False)
      .loc[:, 'Adj Close']
      .apply(normalize)
      )*100
df.plot()
plt.title('Evolution of a $100 Investment in NVDA and AMD')
plt.ylabel('Portfolio Value ($)')
plt.show()
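To see what each step of the chained solution does, here is a toy version that runs offline. The prices are made up, and the two-level column layout with levels named Price and Ticker is our assumption about what download() returns for several tickers.

```python
import pandas as pd


def normalize(x):
    # Divide every value by the first value, selected by position.
    return x / x.iloc[0]


# Made-up prices in a two-level layout: price type first, then ticker.
columns = pd.MultiIndex.from_product(
    [["Adj Close", "Close"], ["AMD", "NVDA"]], names=["Price", "Ticker"]
)
toy = pd.DataFrame(
    [
        [2.0, 5.0, 2.0, 5.5],
        [3.0, 6.0, 3.0, 6.6],
        [4.0, 10.0, 4.0, 11.0],
    ],
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
    columns=columns,
)

# Selecting 'Adj Close' keeps one column per ticker.
adj_close = toy.loc[:, "Adj Close"]

# Each column is rescaled so that it starts at 100.
investment_value = adj_close.apply(normalize) * 100
print(investment_value)
```

Both columns start at 100 on the first date, and each later value is the dollar value of the initial $100 position in that stock.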