Large Requests

Very large data requests can fail: the request can stall, time out, or return more data than fits in your computer's memory. It is better practice to make smaller requests and combine the data locally.

A single large request also has no restart point: if the stream breaks partway through, all progress is lost and the whole job has to be retried.

Best practices

  • Group requests along the init_time dimension for the best performance: request a single init_time at a time, or a small group of them (for example, all forecasts for a given day).

  • Write results to disk as soon as they arrive so you can resume from the last successful batch.

  • Use an append-friendly format such as Zarr, which avoids keeping the entire dataset in memory.

Example: Fetching Europe for August 2024

from datetime import datetime
from pathlib import Path

import pandas as pd

from jua import JuaClient
from jua.weather import Models, Variables


# Increase the request_credit_limit if needed for larger requests
client = JuaClient(request_credit_limit=10_000)

model = client.weather.get_model(Models.EPT2)

# Build the list of 6-hourly init_times for August 2024
init_times = (
    pd.date_range(
        start=datetime(2024, 8, 1, 0),
        end=datetime(2024, 8, 31, 18),
        freq="6H",
        inclusive="both",
    )
    .to_pydatetime()
    .tolist()
)

zarr_path = Path("./hindcast_2024_august.zarr")

for index, init_time in enumerate(init_times):
    print(
        f"Fetching {index + 1}/{len(init_times)} "
        f"for init_time={init_time.isoformat()} ..."
    )

    jua_ds = model.get_forecasts(
        init_time=init_time,
        variables=[
            Variables.AIR_TEMPERATURE_AT_HEIGHT_LEVEL_2M,
            Variables.WIND_SPEED_AT_HEIGHT_LEVEL_10M,
            Variables.WIND_SPEED_AT_HEIGHT_LEVEL_100M,
            Variables.SURFACE_DIRECT_DOWNWELLING_SHORTWAVE_FLUX_SUM_1H,
            Variables.SURFACE_DOWNWELLING_SHORTWAVE_FLUX_SUM_1H,
        ],
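        # Europe bounding box: roughly 36–72°N, 15°W–35°E (latitude sliced from north to south)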
        latitude=slice(72, 36),
        longitude=slice(-15, 35),
        max_lead_time=48,
    )

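    # One chunk per init_time so each append writes whole chunks along the append dimension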
    dataset = jua_ds.to_xarray().chunk({"init_time": 1})

    if index == 0:
        dataset.to_zarr(zarr_path, mode="w")
        print(f"Wrote initial dataset to {zarr_path}")
    else:
        dataset.to_zarr(zarr_path, mode="a", append_dim="init_time")
        print(f"Appended dataset for {init_time.isoformat()} to {zarr_path}")

The loop fetches one init_time per iteration, writes each batch to the Zarr store as soon as it arrives, and prints simple progress messages. If the connection drops after batch 5, rerun the script: the first five init_times finish almost instantly thanks to caching, and the script resumes with batch 6. Memory use stays bounded because only one batch is loaded at any moment.
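
If you prefer not to rely on caching alone, an explicit resume check can skip init_times that were already written. The sketch below is illustrative rather than part of the jua SDK; it reuses zarr_path, init_times, pd, Path, and datetime from the example above and assumes the store's init_time coordinate holds the timestamps written so far.

import xarray as xr


def completed_init_times(path: Path) -> set[datetime]:
    """Return the init_times already present in the Zarr store."""
    if not path.exists():
        return set()
    with xr.open_zarr(path) as ds:
        # Convert the stored datetime64 coordinates back to Python datetimes
        return set(pd.to_datetime(ds.init_time.values).to_pydatetime())


already_done = completed_init_times(zarr_path)
remaining_init_times = [t for t in init_times if t not in already_done]

With this filter in place, iterate over remaining_init_times instead of init_times, and write with mode="a" and append_dim="init_time" whenever the store already exists, falling back to mode="w" only when starting from scratch.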

Treat this pattern as the baseline for long backfills. It keeps requests short, makes restarts painless, and protects local machines from surprises.
