Data Pipelining With Polygon


Introduction #

Similar to the recent post about how I collect and store crypto asset data from Coinbase, the scripts below pull minute, hourly, and daily data for equities and ETFs from Polygon.io.

The scripts check for an existing data record; if one is found, it is updated to include the most recent data. If there is no existing record, the complete historical record is pulled from Polygon and stored.

Python Functions #

Two functions are needed for this project: polygon_fetch_full_history and polygon_pull_data. Their docstrings and usage examples appear below; the complete code is in the Jupyter notebook linked at the end of this post.

Function Usage #

Polygon Fetch Full History #

Here’s the docstring with the parameters/variables:

    """
    Fetch full historical data for a given product from Polygon API.

    Parameters:
    -----------
    client
        Polygon API client instance.
    ticker : str
        Ticker symbol to download.
    timespan : str
        Time span for the data (e.g., "minute", "hour", "day", "week", "month", "quarter", "year").
    multiplier : int
        Multiplier for the time span (e.g., 1 for daily data).
    adjusted : bool
        If True, return adjusted data; if False, return raw data.
    full_history_df : pd.DataFrame
        DataFrame containing the data.
    current_start : datetime
        Date for which to start pulling data in datetime format.
    free_tier : bool
        If True, then pause to avoid API limits.
    verbose : bool
        If True, print detailed information about the data being processed.

    Returns:
    --------
    full_history_df : pd.DataFrame
        DataFrame containing the data.
    """

This script pulls the full history for a specified asset:

from datetime import datetime

import pandas as pd
from decouple import config  # assumed source of config(); adjust to your environment setup
from load_api_keys import load_api_keys
from polygon import RESTClient

# Load API keys from the environment
api_keys = load_api_keys()

# Get the environment variable for where data is stored
DATA_DIR = config("DATA_DIR")

# Open client connection
client = RESTClient(api_key=api_keys["POLYGON_KEY"])

# Create an empty DataFrame
df = pd.DataFrame({
    'Date': pd.Series(dtype="datetime64[ns]"),
    'open': pd.Series(dtype="float64"),
    'high': pd.Series(dtype="float64"),
    'low': pd.Series(dtype="float64"),
    'close': pd.Series(dtype="float64"),
    'volume': pd.Series(dtype="float64"),
    'vwap': pd.Series(dtype="float64"),
    'transactions': pd.Series(dtype="int64"),
    'otc': pd.Series(dtype="object")
})

# Example usage - daily
df = polygon_fetch_full_history(
    client=client,
    ticker="AMZN",
    timespan="day",
    multiplier=1,
    adjusted=True,
    full_history_df=df,
    current_start=datetime(2025, 1, 1),
    free_tier=True,
    verbose=True,
)

The example above pulls daily data since 1/1/2025, but the function can handle date ranges spanning years because it pulls only a limited number of records per request, as recommended by Polygon (fewer than 5,000 records per API request), and then combines the batches into a single dataframe before returning it.
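
The same function handles minute bars; here is a hypothetical variation of the call above, where a fresh empty dataframe keeps the minute bars separate from the daily pull:

# Example usage - minute (hypothetical; reuses the client created above)
df_minute = polygon_fetch_full_history(
    client=client,
    ticker="AMZN",
    timespan="minute",
    multiplier=1,
    adjusted=True,
    full_history_df=df.iloc[0:0],  # empty copy of the schema, so minute bars stay separate
    current_start=datetime(2025, 1, 1),
    free_tier=True,
    verbose=True,
)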

Here are the first 5 rows:

|   | Date                | open      | high      | low       | close     | volume         | vwap      | transactions | otc |
|---|---------------------|-----------|-----------|-----------|-----------|----------------|-----------|--------------|-----|
| 0 | 2025-01-02 05:00:00 | 222.03000 | 225.15000 | 218.19000 | 220.22000 | 33956579.00000 | 221.27450 | 449631       |     |
| 1 | 2025-01-03 05:00:00 | 222.50500 | 225.36000 | 221.62000 | 224.19000 | 27515606.00000 | 223.70500 | 346976       |     |
| 2 | 2025-01-06 05:00:00 | 226.78000 | 228.83500 | 224.84000 | 227.61000 | 31849831.00000 | 227.09210 | 410686       |     |
| 3 | 2025-01-07 05:00:00 | 227.90000 | 228.38100 | 221.46000 | 222.11000 | 28084164.00000 | 223.40330 | 379570       |     |
| 4 | 2025-01-08 05:00:00 | 223.18500 | 223.52000 | 220.20000 | 222.13000 | 25033292.00000 | 222.04140 | 325539       |     |

Polygon Pull Data #

This script uses the above function to perform the following:

  1. Attempt to read an existing pickle data file
  2. If a data file exists, then pull updated data
  3. Otherwise, pull all historical data available for that asset for the past 2 years (using the free tier from Polygon)
  4. Store pickle and/or Excel files of the data in the specified directories

Here’s the docstring with the parameters/variables:

    """
    Read existing data file, download price data from Polygon, and export data.

    Parameters:
    -----------
    base_directory : any
        Root path to store downloaded data.
    ticker : str
        Ticker symbol to download.
    source : str
        Name of the data source (e.g., 'Polygon').
    asset_class : str
        Asset class name (e.g., 'Equities').
    start_date : datetime
        Start date for the data in datetime format.
    timespan : str
        Time span for the data (e.g., "minute", "hour", "day", "week", "month", "quarter", "year").
    multiplier : int
        Multiplier for the time span (e.g., 1 for daily data).
    adjusted : bool
        If True, return adjusted data; if False, return raw data.
    force_existing_check : bool
        If True, force a complete check of the existing data file to verify that there are not any gaps in the data.
    free_tier : bool
        If True, then pause to avoid API limits.
    verbose : bool
        If True, print detailed information about the data being processed.
    excel_export : bool
        If True, export data to Excel format.
    pickle_export : bool
        If True, export data to Pickle format.
    output_confirmation : bool
        If True, print confirmation message.

    Returns:
    --------
    None
    """

Through the base_directory, source, and asset_class variables, the script knows where in the local filesystem to look for an existing pickle file and where to store the resulting updated pickle and/or Excel files:

from datetime import datetime

current_year = datetime.now().year
current_month = datetime.now().month
current_day = datetime.now().day

# Example usage - daily
polygon_pull_data(
    base_directory=DATA_DIR,
    ticker="AMZN",
    source="Polygon",
    asset_class="Equities",
    start_date=datetime(current_year - 2, current_month, current_day),
    timespan="day",
    multiplier=1,
    adjusted=True,
    force_existing_check=True,
    free_tier=True,
    verbose=True,
    excel_export=True,
    pickle_export=True,
    output_confirmation=True,
)

Here are the first 5 rows of the resulting data:

|   | Date                | open      | high      | low       | close     | volume         | vwap      | transactions | otc |
|---|---------------------|-----------|-----------|-----------|-----------|----------------|-----------|--------------|-----|
| 0 | 2023-07-28 04:00:00 | 129.69000 | 133.01000 | 129.33000 | 132.21000 | 46269781.00000 | 131.88370 | 413438       |     |
| 1 | 2023-07-31 04:00:00 | 133.20000 | 133.87000 | 132.38000 | 133.68000 | 41901516.00000 | 133.34100 | 406644       |     |
| 2 | 2023-08-01 04:00:00 | 133.55000 | 133.69000 | 131.61990 | 131.69000 | 42250989.00000 | 132.24700 | 385743       |     |
| 3 | 2023-08-02 04:00:00 | 130.15400 | 130.23000 | 126.82000 | 128.21000 | 50988614.00000 | 128.39730 | 532942       |     |
| 4 | 2023-08-03 04:00:00 | 127.48000 | 129.84000 | 126.41000 | 128.91000 | 90855736.00000 | 131.49410 | 746639       |     |

We can see that the dates are not continuous (weekends and market holidays are skipped), but this is not an issue: any downstream use of the data would likely re-index it or simply set the Date column as the index.
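
For example, a quick way to do that after loading the stored pickle (pickle_path here is a stand-in for wherever the file was saved):

import pandas as pd

# Load the stored record and use the trading date as the index
df = pd.read_pickle(pickle_path).set_index("Date").sort_index()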

References #

  1. https://polygon.io/
  2. https://polygon.io/docs/rest/quickstart

Code #

The Jupyter notebook with the functions and all other code is available here.
The HTML export of the Jupyter notebook is available here.
The PDF export of the Jupyter notebook is available here.