Data Pipelining With Polygon


Introduction #

Similar to the recent post about how I collect and store crypto asset data from Coinbase, the scripts below pull minute, hourly, and daily data for equities and ETFs from Polygon.io.

The scripts check for an existing data record; if one is found, it is updated to include the most recent data. If there is no existing record, the complete historical record is pulled from Polygon and stored.

Python Functions #

Two functions are needed for this project: polygon_fetch_full_history and polygon_pull_data. Their docstrings and usage examples appear below; the complete code is in the Jupyter notebook linked at the end of this post.

Function Usage #

Polygon Fetch Full History #

Here’s the docstring with the parameters/variables:

    """
    Fetch full historical data for a given product from Polygon API.

    Parameters:
    -----------
    client
        Polygon API client instance.
    ticker : str
        Ticker symbol to download.
    timespan : str
        Time span for the data (e.g., "minute", "hour", "day", "week", "month", "quarter", "year").
    multiplier : int
        Multiplier for the time span (e.g., 1 for daily data).
    adjusted : bool
        If True, return adjusted data; if False, return raw data.
    full_history_df : pd.DataFrame
        DataFrame containing the data.
    current_start : datetime
        Date for which to start pulling data in datetime format.
    free_tier : bool
        If True, then pause to avoid API limits.
    verbose : bool
        If True, print detailed information about the data being processed.

    Returns:
    --------
    full_history_df : pd.DataFrame
        DataFrame containing the data.
    """

This script pulls the full history for a specified asset:

from datetime import datetime

import pandas as pd
from decouple import config  # assumed source of config(); adjust to your environment setup
from load_api_keys import load_api_keys
from polygon import RESTClient

# Load API keys from the environment
api_keys = load_api_keys()

# Get the environment variable for where data is stored
DATA_DIR = config("DATA_DIR")

# Open client connection
client = RESTClient(api_key=api_keys["POLYGON_KEY"])

# Create an empty DataFrame
df = pd.DataFrame({
    'Date': pd.Series(dtype="datetime64[ns]"),
    'open': pd.Series(dtype="float64"),
    'high': pd.Series(dtype="float64"),
    'low': pd.Series(dtype="float64"),
    'close': pd.Series(dtype="float64"),
    'volume': pd.Series(dtype="float64"),
    'vwap': pd.Series(dtype="float64"),
    'transactions': pd.Series(dtype="int64"),
    'otc': pd.Series(dtype="object")
})

# Example usage - daily
df = polygon_fetch_full_history(
    client=client,
    ticker="AMZN",
    timespan="day",
    multiplier=1,
    adjusted=True,
    full_history_df=df,
    current_start=datetime(2025, 1, 1),
    free_tier=True,
    verbose=True,
)

The example above pulls daily data since 1/1/2025, but the function can handle date ranges spanning years because it pulls only a limited number of records per request, as recommended by Polygon (fewer than 5,000 records per API request), and then combines the batches into a single dataframe before returning it.
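
The same function handles minute bars; here is a hypothetical variation of the call above, where a fresh empty dataframe keeps the minute bars separate from the daily pull:

# Example usage - minute (hypothetical; reuses the client created above)
df_minute = polygon_fetch_full_history(
    client=client,
    ticker="AMZN",
    timespan="minute",
    multiplier=1,
    adjusted=True,
    full_history_df=df.iloc[0:0],  # empty copy of the schema, so minute bars stay separate
    current_start=datetime(2025, 1, 1),
    free_tier=True,
    verbose=True,
)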

Here are the first 5 rows:

|   | Date                | open      | high      | low       | close     | volume         | vwap      | transactions | otc |
|---|---------------------|-----------|-----------|-----------|-----------|----------------|-----------|--------------|-----|
| 0 | 2025-01-02 05:00:00 | 222.03000 | 225.15000 | 218.19000 | 220.22000 | 33956579.00000 | 221.27450 | 449631       |     |
| 1 | 2025-01-03 05:00:00 | 222.50500 | 225.36000 | 221.62000 | 224.19000 | 27515606.00000 | 223.70500 | 346976       |     |
| 2 | 2025-01-06 05:00:00 | 226.78000 | 228.83500 | 224.84000 | 227.61000 | 31849831.00000 | 227.09210 | 410686       |     |
| 3 | 2025-01-07 05:00:00 | 227.90000 | 228.38100 | 221.46000 | 222.11000 | 28084164.00000 | 223.40330 | 379570       |     |
| 4 | 2025-01-08 05:00:00 | 223.18500 | 223.52000 | 220.20000 | 222.13000 | 25033292.00000 | 222.04140 | 325539       |     |

Polygon Pull Data #

This script uses the above function to perform the following:

  1. Attempt to read an existing pickle data file
  2. If a data file exists, then pull updated data
  3. Otherwise, pull all historical data available for that asset for the past 2 years (using the free tier from Polygon)
  4. Store pickle and/or Excel files of the data in the specified directories

Here’s the docstring with the parameters/variables:

    """
    Read existing data file, download price data from Polygon, and export data.

    Parameters:
    -----------
    base_directory : any
        Root path to store downloaded data.
    ticker : str
        Ticker symbol to download.
    source : str
        Name of the data source (e.g., 'Polygon').
    asset_class : str
        Asset class name (e.g., 'Equities').
    start_date : datetime
        Start date for the data in datetime format.
    timespan : str
        Time span for the data (e.g., "minute", "hour", "day", "week", "month", "quarter", "year").
    multiplier : int
        Multiplier for the time span (e.g., 1 for daily data).
    adjusted : bool
        If True, return adjusted data; if False, return raw data.
    force_existing_check : bool
        If True, force a complete check of the existing data file to verify that there are not any gaps in the data.
    free_tier : bool
        If True, then pause to avoid API limits.
    verbose : bool
        If True, print detailed information about the data being processed.
    excel_export : bool
        If True, export data to Excel format.
    pickle_export : bool
        If True, export data to Pickle format.
    output_confirmation : bool
        If True, print confirmation message.

    Returns:
    --------
    None
    """

Through the base_directory, source, and asset_class variables, the script knows where in the local filesystem to look for an existing pickle file and where to store the resulting updated pickle and/or Excel files:

from datetime import datetime

current_year = datetime.now().year
current_month = datetime.now().month
current_day = datetime.now().day

# Example usage - daily
polygon_pull_data(
    base_directory=DATA_DIR,
    ticker="AMZN",
    source="Polygon",
    asset_class="Equities",
    start_date=datetime(current_year - 2, current_month, current_day),
    timespan="day",
    multiplier=1,
    adjusted=True,
    force_existing_check=True,
    free_tier=True,
    verbose=True,
    excel_export=True,
    pickle_export=True,
    output_confirmation=True,
)

Here are the first 5 rows of the resulting data:

|   | Date                | open      | high      | low       | close     | volume         | vwap      | transactions | otc |
|---|---------------------|-----------|-----------|-----------|-----------|----------------|-----------|--------------|-----|
| 0 | 2023-07-28 04:00:00 | 129.69000 | 133.01000 | 129.33000 | 132.21000 | 46269781.00000 | 131.88370 | 413438       |     |
| 1 | 2023-07-31 04:00:00 | 133.20000 | 133.87000 | 132.38000 | 133.68000 | 41901516.00000 | 133.34100 | 406644       |     |
| 2 | 2023-08-01 04:00:00 | 133.55000 | 133.69000 | 131.61990 | 131.69000 | 42250989.00000 | 132.24700 | 385743       |     |
| 3 | 2023-08-02 04:00:00 | 130.15400 | 130.23000 | 126.82000 | 128.21000 | 50988614.00000 | 128.39730 | 532942       |     |
| 4 | 2023-08-03 04:00:00 | 127.48000 | 129.84000 | 126.41000 | 128.91000 | 90855736.00000 | 131.49410 | 746639       |     |

We can see that the dates are not continuous (weekends and market holidays are skipped), but this is not an issue: any downstream use of the data would likely re-index it or simply set the Date column as the index.
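
For example, a quick way to do that after loading the stored pickle (pickle_path here is a stand-in for wherever the file was saved):

import pandas as pd

# Load the stored record and use the trading date as the index
df = pd.read_pickle(pickle_path).set_index("Date").sort_index()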

References #

  1. https://polygon.io/
  2. https://polygon.io/docs/rest/quickstart

Code #

The Jupyter notebook with the functions and all other code is available here.
The HTML export of the Jupyter notebook is available here.
The PDF export of the Jupyter notebook is available here.