Downloading Financial Data

The Package yfinance

We’ll use the yfinance package to download stock price data from Yahoo! Finance. This Python library provides a simple interface for retrieving historical market data for stocks, indices, and other financial instruments. You need to install it before using it for the first time.

Anaconda

If you are using Anaconda, install yfinance before you use it for the first time. In an Anaconda command prompt, type:

pip install yfinance --upgrade

or in a Jupyter cell you can type:

!pip install yfinance --upgrade --quiet

Google Colab

In Google Colab, yfinance is pre-installed, but you should upgrade to the latest version due to recent Yahoo! Finance API changes. Run this in a code cell:

!pip install yfinance --upgrade --quiet
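After installing or upgrading, it can help to confirm which version is active in the current environment. The sketch below uses only the Python standard library; the helper name installed_version is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if yfinance is installed, otherwise None.
print(installed_version("yfinance"))
```

If this prints None, the package is not visible to the Python interpreter running the notebook, and the install command above should be run again in that environment.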

Importing the Library

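You can import the library as follows. We also import matplotlib and seaborn for plotting, and call sns.set_theme() to apply the default seaborn plot style.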

import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf

sns.set_theme()

Data

To estimate asset pricing models, we need data on stock returns for both a particular security and the market. An easy way to do this is to download the data from the web. We start by defining the security that we want to analyze and the date from which we want to start the analysis.

To get the data, we use the function download() in yfinance. For example, to get the data for Microsoft (ticker: MSFT) we type:

yf.download(tickers='MSFT', auto_adjust=False, progress=False, multi_level_index=False)

	Adj Close	Close	High	Low	Open	Volume
Date
2026-01-28	480.533203	481.630005	483.739990	478.000000	483.209991	36875400
2026-01-29	432.512817	433.500000	442.500000	421.019989	439.989990	128855300
2026-01-30	429.310120	430.290009	439.600006	426.450012	439.170013	58566800
2026-02-02	422.405884	423.369995	430.739990	422.250000	430.239990	42219900
2026-02-03	410.273560	411.209991	422.049988	408.559998	422.010010	61424100
2026-02-04	413.246796	414.190002	419.799988	409.239990	411.000000	45012400
2026-02-05	392.773529	393.670013	408.299988	392.320007	407.440002	66289200
2026-02-06	400.226501	401.140015	401.790009	392.920013	399.170013	53515300
2026-02-09	412.658142	413.600006	414.890015	400.869995	404.850006	45480500
2026-02-10	412.328857	413.269989	423.679993	412.700012	419.619995	44857900
2026-02-11	403.449127	404.369995	416.459991	401.010010	416.179993	42491000
2026-02-12	400.924896	401.839996	406.200012	398.010010	405.000000	40802400
2026-02-13	400.406097	401.320007	405.540009	398.049988	404.450012	34091600
2026-02-17	395.956238	396.859985	400.519989	394.529999	399.220001	32078800
2026-02-18	398.690002	399.600006	402.559998	396.320007	398.130005	23223400
2026-02-19	398.459991	398.459991	404.429993	396.670013	400.690002	28234000
2026-02-20	397.230011	397.230011	400.119995	395.160004	396.109985	34015200
2026-02-23	384.470001	384.470001	395.359985	383.100006	395.000000	43238300
2026-02-24	389.000000	389.000000	389.359985	381.709991	384.140015	33884700
2026-02-25	400.600006	400.600006	401.470001	390.160004	390.529999	43625500
2026-02-26	401.720001	401.720001	407.489990	398.739990	404.709991	34405900
2026-02-27	392.739990	392.739990	396.799988	390.000000	390.992493	50401665

The output of the function download() is a pandas dataframe. Therefore, all operations that pandas provides for dataframes work on the data just downloaded.
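As a small illustration of what "all operations" means in practice, the sketch below rebuilds three rows of the MSFT output above by hand, so it runs without a download, and applies a few ordinary pandas operations to them. The variable name prices is chosen only for this example.

```python
import pandas as pd

# Three rows copied from the MSFT output above (only some columns kept).
prices = pd.DataFrame(
    {
        "Adj Close": [480.533203, 432.512817, 429.310120],
        "Close": [481.630005, 433.500000, 430.290009],
        "Volume": [36875400, 128855300, 58566800],
    },
    index=pd.to_datetime(["2026-01-28", "2026-01-29", "2026-01-30"]),
)
prices.index.name = "Date"

print(prices.shape)                        # (number of rows, number of columns)
print(prices["Close"].max())               # highest close in the sample
print(prices.loc["2026-01-29", "Volume"])  # one value, selected by date and column label
```

The same calls work unchanged on the dataframe returned by download(), since it is indexed by date in the same way.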

In recent versions of yfinance, auto_adjust defaults to True, so the Close column contains prices adjusted for splits and dividends and no separate Adj Close column is returned. To get the unadjusted price and the adjusted price as separate columns, we need to set auto_adjust=False when downloading data.

I use the option progress=False to suppress the status bar when downloading the data. The code runs just as well without it. The option multi_level_index=False returns plain column labels instead of two-level labels that also carry the ticker.

Also, note that when no dates are given, the query above returns only about the most recent month of data. We can specify an arbitrary starting date by adding, for example, start='2012-01-01'.

yf.download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False)

	Adj Close	Close	High	Low	Open	Volume
Date
2012-01-03	20.917694	26.770000	26.959999	26.389999	26.549999	64731500
2012-01-04	21.409971	27.400000	27.469999	26.780001	26.820000	80516100
2012-01-05	21.628750	27.680000	27.730000	27.290001	27.379999	56081400
2012-01-06	21.964748	28.110001	28.190001	27.530001	27.530001	99455500
2012-01-09	21.675640	27.740000	28.100000	27.719999	28.049999	59706800
...	...	...	...	...	...	...
2026-02-23	384.470001	384.470001	395.359985	383.100006	395.000000	43238300
2026-02-24	389.000000	389.000000	389.359985	381.709991	384.140015	33884700
2026-02-25	400.600006	400.600006	401.470001	390.160004	390.529999	43625500
2026-02-26	401.720001	401.720001	407.489990	398.739990	404.709991	34405900
2026-02-27	392.739990	392.739990	396.799988	390.000000	390.992493	50401665

3559 rows × 6 columns

Close v/s Adjusted Close Price

The data query produces six columns. When estimating asset pricing models, we will use the adjusted close (Adj Close) column because that price series is adjusted for stock splits and dividends. Let’s compare how Adj Close differs from Close.

For this, let’s extract these two series and store them in a dataframe called df. The method df.loc[:, [column_label_1, column_label_2, ..., column_label_n]] selects multiple columns from the dataframe. In our example we want to select loc[:, ['Close', 'Adj Close']].

df = yf.download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:, ['Close', 'Adj Close']]

Lines of Python code can get really long. A common way to split a long line into several shorter ones is to wrap the expression in parentheses and start a new line before each dot.

df = (yf
      .download(tickers='MSFT', start='2012-01-01', auto_adjust=False, progress=False, multi_level_index=False)
      .loc[:, ['Close', 'Adj Close']]
      )

This code does exactly the same as the line we wrote originally, but it may be easier to read and understand.
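To check that the two styles really are equivalent, here is a minimal sketch on made-up prices (the dataframe raw and its numbers are invented for the illustration, and dropna() simply stands in for any method in a chain).

```python
import pandas as pd

# Made-up prices standing in for a downloaded dataframe.
raw = pd.DataFrame(
    {
        "Adj Close": [9.0, 9.9, 10.8],
        "Close": [10.0, 11.0, 12.0],
        "Volume": [100, 200, 300],
    },
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

# Everything on one line.
one_line = raw.dropna().loc[:, ['Close', 'Adj Close']]

# The same chain wrapped in parentheses, one step per line.
chained = (raw
           .dropna()
           .loc[:, ['Close', 'Adj Close']]
           )

print(one_line.equals(chained))  # prints True
```

The parentheses only tell Python that the expression continues on the next line; they do not change what is computed.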

We can now compute the cumulative return of each series (cum_pct), measured as the value of one unit invested on the first date, by applying to each column in df a function that divides every price by the first price. The apply() method applies a function along an axis of a DataFrame or a Series. The function can be a built-in Python function or a custom function that you define.

First, we define a function that takes a series x as input and divides the series by its initial value x.iloc[0].¹

¹ In the original code, I wrote x / x[0]. However, selecting an element by position with x[0] is being deprecated in pandas, and the suggestion is to select by position using iloc.

def normalize(x):
    return x / x.iloc[0]
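Before applying the function to real prices, it may help to see it on numbers that are easy to check by hand. The dataframe toy below is made up for this purpose.

```python
import pandas as pd


def normalize(x):
    # Divide every element by the first one, selected by position.
    return x / x.iloc[0]


# Made-up prices: column A grows 10% per step, column B halves and then triples.
toy = pd.DataFrame({"A": [100.0, 110.0, 121.0], "B": [50.0, 25.0, 75.0]})

# apply() passes each column to normalize as a Series (axis=0 is the default).
print(toy.apply(normalize))
```

Each column now starts at 1, so every later value can be read as the value of one unit invested at the first date.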

Then, we apply this function to our dataframe and plot it.

cum_pct = df.apply(normalize)
cum_pct.plot()
plt.title('Close v/s Adjusted Close of Microsoft')
plt.ylabel('Cumulative Return')
plt.show()

According to the graph, the impact of dividends is significant. Another way to see this is by actually computing the dividends from the two series. We will do this later.
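A lighter check than computing the dividends themselves is to look at the ratio of Adj Close to Close. The sketch below uses four rows copied from the first MSFT output in this section, so it runs without a download; the names sample and ratio are chosen only for this example.

```python
import pandas as pd

# Four rows copied from the first MSFT output in this section.
sample = pd.DataFrame(
    {
        "Adj Close": [395.956238, 398.690002, 398.459991, 397.230011],
        "Close": [396.859985, 399.600006, 398.459991, 397.230011],
    },
    index=pd.to_datetime(["2026-02-17", "2026-02-18", "2026-02-19", "2026-02-20"]),
)

# Ratio of adjusted to unadjusted price; 1.0 means no adjustment on that date.
ratio = sample["Adj Close"] / sample["Close"]
print(ratio.round(4))
```

In this sample the ratio is about 0.9977 on the first two dates and exactly 1 from 2026-02-19 onward, which is the pattern one would expect if a dividend adjustment applies to prices before that date.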

Practice Problems

Problem 1 Plot the Close price for Nvidia Corporation (Ticker: NVDA) from July 1, 2015 until May 1, 2023.

Solution

df = yf.download(tickers='NVDA', start='2015-07-01', end='2023-05-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:,['Close']]
df.plot()
plt.title('NVDA Stock Price')
plt.show()

Problem 2 Plot the S&P 500 (Ticker: ^GSPC) from January 1, 2018 until May 1, 2023. Since the S&P 500 is an index, you can use either Close or Adj Close; it does not make a difference.

Solution

df = yf.download(tickers='^GSPC', start='2018-01-01', end='2023-05-01', auto_adjust=False, progress=False, multi_level_index=False).loc[:,['Close']]
df.plot()
plt.title('S&P 500')
plt.show()

Problem 3 Imagine that you invest $100 in each of NVDA and AMD on January 1, 2015. Plot the evolution of each investment until May 1, 2023.

Solution

To plot the evolution of each investment, all we need to do is normalize the Adj Close series of both stocks to 100 at the start of the sample, which is the first trading day on or after January 1, 2015. We can use the same function we built before and multiply the result by 100. We use Adj Close instead of Close to account for stock splits and dividends.

df = (yf
      .download(tickers=['NVDA', 'AMD'], start='2015-01-01', end='2023-05-01', auto_adjust=False, progress=False)
      .loc[:, 'Adj Close']
      .apply(normalize)
      )*100
df.plot()
plt.title('Evolution of a $100 Investment in NVDA and AMD')
plt.ylabel('Portfolio Value ($)')
plt.show()
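The solution to Problem 3 omits multi_level_index=False, so the column labels have two levels and .loc[:, 'Adj Close'] picks the adjusted close of every ticker at once. The sketch below mimics that layout with made-up prices; the dataframe fake and its (field, ticker) column layout are assumptions built by hand to resemble what download() returns for several tickers.

```python
import pandas as pd


def normalize(x):
    # Divide every element by the first one, selected by position.
    return x / x.iloc[0]


# Two-level column labels: price field on top, ticker below.
columns = pd.MultiIndex.from_product([["Adj Close", "Close"], ["AMD", "NVDA"]])

# Made-up prices, in the column order
# (Adj Close, AMD), (Adj Close, NVDA), (Close, AMD), (Close, NVDA).
fake = pd.DataFrame(
    [[2.0, 5.0, 2.0, 20.0],
     [3.0, 6.0, 3.0, 24.0],
     [4.0, 10.0, 4.0, 40.0]],
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
    columns=columns,
)

# Selecting the top-level label returns one column per ticker.
value = fake.loc[:, "Adj Close"].apply(normalize) * 100
print(value)
```

Each column of value starts at 100 and tracks what a $100 position in that ticker would be worth on later dates, which is the quantity plotted in the solution.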