How to write a single valid file

Here we document how to write a single, valid file with input4MIPs validation. This is the first step in preparing files for submission to the input4MIPs collection and, ultimately, publication in the ESGF index. This document assumes that you don't already have a file you want to submit. If you already have files, you can skip straight to "How to validate a single file".

Note: Submitting your files involves a few other steps as well. See the instructions for data producers in the input4MIPs CVs repository, and make sure you complete them before you submit.

In [1]:

import tempfile
from pathlib import Path

import cftime
import numpy as np
import xarray as xr
from loguru import logger

from input4mips_validation.cvs.loading import load_cvs_known_loader
from input4mips_validation.cvs.loading_raw import get_raw_cvs_loader
from input4mips_validation.dataset import Input4MIPsDataset
from input4mips_validation.dataset.metadata_data_producer_minimum import (
    Input4MIPsDatasetMetadataDataProducerMinimum,
)

In [2]:

# For this demonstration, disable the logger
logger.disable("input4mips_validation")

Creating our file

Here we go through a basic example of how to create a valid data file. If it doesn't fit your case, please raise an issue and we can see if we can add docs which fit your use case too.

In the below, we use xarray because we find it easiest to use. However, under the hood we are also using ncdata and iris. We use this combination because, while we find xarray easiest to work with, only iris writes the files correctly, and ncdata is the best to translate between the two (yes, you can imagine how fun it was figuring all of this out).

Ultimately, the choice of library is up to you. The simplest path (in our opinion) is below, but as long as your file passes validation, we don't mind how you created it.

The data

Let's imagine you have some data on a lat, lon, time grid.

In [3]:

lon = np.arange(-165.0, 180.0, 30.0, dtype=np.float64)
lat = np.arange(-82.5, 90.0, 15.0, dtype=np.float64)
time = [cftime.datetime(y, m, 1) for y in range(2000, 2023 + 1) for m in range(1, 13)]

rng = np.random.default_rng()
# Random values in [0, 1), ordered (lat, lon, time) to match the dimensions used below
ds_data = rng.random((lat.size, lon.size, len(time)))
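
Because lat and lon happen to have the same length in this example, a swapped axis order cannot be detected from the array shape alone. The following is a minimal sketch of an explicit check, assuming the dimension order ("lat", "lon", "time") used for the variable; the names `dims`, `sizes` and `expected_shape` are illustrative and not part of input4mips_validation.

```python
import numpy as np

lon = np.arange(-165.0, 180.0, 30.0, dtype=np.float64)
lat = np.arange(-82.5, 90.0, 15.0, dtype=np.float64)
n_time = 24 * 12  # monthly steps from January 2000 to December 2023

# Build the expected shape from the dimension order, rather than by hand
dims = ("lat", "lon", "time")
sizes = {"lat": lat.size, "lon": lon.size, "time": n_time}
expected_shape = tuple(sizes[d] for d in dims)

rng = np.random.default_rng(seed=0)
data = rng.random(expected_shape)

# The co-ordinates should be evenly spaced cell centres within valid bounds
lon_ok = bool(np.allclose(np.diff(lon), 30.0) and np.all(np.abs(lon) <= 180.0))
lat_ok = bool(np.allclose(np.diff(lat), 15.0) and np.all(np.abs(lat) <= 90.0))

print(expected_shape, lon_ok, lat_ok)
```

Deriving the shape from the dimension names keeps the data and the labels in step if the grid changes later.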

We can put this into an xarray object.

In [4]:

ds = xr.Dataset(
    data_vars={
        "siconc": (["lat", "lon", "time"], ds_data),
    },
    coords=dict(
        lon=("lon", lon),
        lat=("lat", lat),
        time=time,
    ),
)
ds.coords

Out[4]:

Coordinates:
  * lon      (lon) float64 96B -165.0 -135.0 -105.0 -75.0 ... 105.0 135.0 165.0
  * lat      (lat) float64 96B -82.5 -67.5 -52.5 -37.5 ... 37.5 52.5 67.5 82.5
  * time     (time) object 2kB 2000-01-01 00:00:00 ... 2023-12-01 00:00:00

To ensure that your data passes validation, your variable must have at least one of two attributes: "standard_name" (if your variable appears in the official list of standard names) or "long_name". You can specify both if you want. In this case, the variable is in the official list, so we set "standard_name".

In [5]:

ds["siconc"].attrs["standard_name"] = "sea_ice_area_fraction"
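
The "at least one of" rule above can be sketched as a small check on an attributes dictionary. This is only an illustration: `has_required_name` is a hypothetical helper, not a function from input4mips_validation, and the package's own validation remains the authority on what passes.

```python
def has_required_name(attrs: dict) -> bool:
    """Return True if attrs has a non-empty standard_name or long_name."""
    return any(
        str(attrs.get(key, "")).strip()
        for key in ("standard_name", "long_name")
    )


# Passes: a standard name is set
ok = has_required_name({"standard_name": "sea_ice_area_fraction"})
# Fails: neither attribute is present
missing = has_required_name({"units": "%"})
# Fails: an empty string is treated as missing in this sketch
empty = has_required_name({"long_name": "  "})

print(ok, missing, empty)
```

With an xarray dataset, the same helper could be called as `has_required_name(ds["siconc"].attrs)`.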

You don't have to do the step below, but we recommend it. Specifying encodings ensures that your data is written to disk as intended.

In [6]:

ds["time"].encoding = {
    "calendar": "proleptic_gregorian",
    "units": "days since 1850-01-01 00:00:00",
    # Time has to be encoded as float
    # to ensure that non-integer days etc. can be handled
    # and the CF-checker doesn't complain.
    "dtype": np.dtypes.Float32DType,
}
# If you want to reduce your file size,
# you might want to encode some co-ordinates
# at lower resolution.
ds["lat"].encoding = {"dtype": np.dtypes.Float16DType}
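
Before choosing a lower-precision encoding, it can be worth checking that your co-ordinate values survive the cast. The sketch below is an assumption-laden illustration using only numpy and the standard library; `survives_cast` is a hypothetical helper and `lat_fine` is a made-up finer grid, neither comes from input4mips_validation.

```python
from datetime import date

import numpy as np


def survives_cast(values, dtype) -> bool:
    """Return True if casting to dtype and back reproduces the values exactly."""
    values = np.asarray(values, dtype=np.float64)
    return bool(np.array_equal(values.astype(dtype).astype(np.float64), values))


# The 15 degree grid from this example
lat_coarse = np.arange(-82.5, 90.0, 15.0)
# A hypothetical 0.1 degree grid, for comparison
lat_fine = np.arange(-89.95, 90.0, 0.1)

coarse_ok = survives_cast(lat_coarse, np.float16)
fine_ok = survives_cast(lat_fine, np.float16)

# "days since 1850-01-01" for the first time step, and for a mid-day value
days_2000 = (date(2000, 1, 1) - date(1850, 1, 1)).days
days_ok = survives_cast([days_2000, days_2000 + 0.5], np.float32)

print(coarse_ok, fine_ok, days_2000, days_ok)
```

In this sketch the coarse grid round-trips through 16-bit floats but the finer one does not, so the lower-resolution encoding is only safe for some grids.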

The metadata

Assuming that you have already registered in the controlled vocabularies (CVs), creating a valid dataset is very straightforward. The key information is your source ID, from which lots of other information can be inferred. The rest of the metadata can be inferred from the data.

In [7]:

metadata_minimum = Input4MIPsDatasetMetadataDataProducerMinimum(
    grid_label="gn",
    nominal_resolution="10000 km",
    source_id="CR-CMIP-0-2-0",
    target_mip="CMIP",
)
metadata_minimum

Out[7]:

Input4MIPsDatasetMetadataDataProducerMinimum(grid_label='gn', nominal_resolution='10000 km', source_id='CR-CMIP-0-2-0', target_mip='CMIP')

The CVs

The last thing to set up is the CVs. You can pick different sources for the CVs. For example, you can load the CVs from local files, or from the input4MIPs CVs GitHub (or any other web source).

In this example, we pin the CVs to a fixed version of the input4MIPs CVs GitHub (the v6.6.0 tag) so that the example keeps working even if we make further changes to the CVs. For your own work, you will probably want to use one of:

local files
the branch where you have added your information to the CVs
a tagged version of the input4MIPs CVs GitHub
the main branch of the input4MIPs CVs GitHub

In [8]:

# The object which can load our raw CVs files
raw_cvs_loader = get_raw_cvs_loader(
    "https://raw.githubusercontent.com/PCMDI/input4MIPs_CVs/v6.6.0/CVs/"
)

# Other examples
# Load from local files
# raw_cvs_loader = get_raw_cvs_loader("/path/to/local/input4MIPs_CVs/CVs")
# Load from git branch
# branch_name = ""
# raw_cvs_loader = get_raw_cvs_loader(f"https://raw.githubusercontent.com/PCMDI/input4MIPs_CVs/{branch_name}/CVs/")
# Load from tagged version
# version_tag = ""
# raw_cvs_loader = get_raw_cvs_loader(f"https://raw.githubusercontent.com/PCMDI/input4MIPs_CVs/{version_tag}/CVs/")
# Load from input4MIPs CVs main
# raw_cvs_loader = get_raw_cvs_loader("https://raw.githubusercontent.com/PCMDI/input4MIPs_CVs/main/CVs/")
# Load from specific commit
# commit_sha = ""
# raw_cvs_loader = get_raw_cvs_loader(f"https://raw.githubusercontent.com/PCMDI/input4MIPs_CVs/{commit_sha}/CVs/")
raw_cvs_loader

Out[8]:

RawCVLoaderKnownRemoteRegistry(registry=<pooch.core.Pooch object at 0x722b86570730>, force_download=False)
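
The remote options above differ only in the git reference embedded in the URL. As a minimal sketch, the URL could be built by a small helper; `cvs_url` is a hypothetical function written for this illustration (not part of input4mips_validation), and the URL pattern is simply the one used in the cell above.

```python
RAW_BASE = "https://raw.githubusercontent.com/PCMDI/input4MIPs_CVs"


def cvs_url(ref: str) -> str:
    """Build the raw CVs URL for a git branch, tag or commit SHA."""
    if not ref:
        raise ValueError("ref must be a non-empty branch, tag or commit SHA")
    return f"{RAW_BASE}/{ref}/CVs/"


tagged = cvs_url("v6.6.0")
main = cvs_url("main")

print(tagged)
print(main)
```

The result would then be passed to `get_raw_cvs_loader`, as in the cell above.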

In [9]:

cvs = load_cvs_known_loader(raw_cvs_loader)
cvs.source_id_entries.source_ids

Out[9]:

('CEDS-CMIP-2024-07-08',
 'CEDS-CMIP-2024-07-08-supplemental',
 'CEDS-CMIP-2024-10-21',
 'CEDS-CMIP-2024-10-21-supplemental',
 'CR-CMIP-0-2-0',
 'CR-CMIP-0-3-0',
 'DRES-CMIP-BB4CMIP7-1-0',
 'MRI-JRA55-do-1-6-0',
 'PCMDI-AMIP-1-1-9',
 'PCMDI-AMIP-ERSST5-1-0',
 'PCMDI-AMIP-Had1p1-1-0',
 'PCMDI-AMIP-OI2p1-1-0',
 'SOLARIS-HEPPA-CMIP-4-1',
 'SOLARIS-HEPPA-CMIP-4-2',
 'SOLARIS-HEPPA-CMIP-4-3',
 'SOLARIS-HEPPA-CMIP-4-4',
 'UOEXETER-CMIP-0-1-0',
 'UOEXETER-CMIP-1-1-2',
 'UOEXETER-CMIP-1-1-3',
 'UofMD-landState-3-0')
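
Since the source ID is the key piece of metadata, it can help to confirm that yours appears in the loaded CVs before going further. The sketch below assumes a plain tuple of IDs (here a short excerpt of the output above); `check_source_id` is a hypothetical helper, not part of input4mips_validation.

```python
from difflib import get_close_matches

# Excerpt of the source IDs shown above; in practice you would pass
# cvs.source_id_entries.source_ids instead
KNOWN_SOURCE_IDS = (
    "CR-CMIP-0-2-0",
    "CR-CMIP-0-3-0",
    "PCMDI-AMIP-1-1-9",
    "UofMD-landState-3-0",
)


def check_source_id(source_id: str, known: tuple) -> str:
    """Return source_id if it is known, else raise with close matches as a hint."""
    if source_id in known:
        return source_id
    suggestions = get_close_matches(source_id, known, n=3)
    hint = f" Did you mean one of {suggestions}?" if suggestions else ""
    raise ValueError(f"{source_id!r} is not in the CVs.{hint}")


checked = check_source_id("CR-CMIP-0-2-0", KNOWN_SOURCE_IDS)
print(checked)
```

If the check fails, the most likely causes are a typo or loading the CVs from a source that does not yet contain your registration.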

Putting it together

With the data and metadata, we can now create a valid Input4MIPsDataset object.

In [10]:

input4mips_ds = Input4MIPsDataset.from_data_producer_minimum_information(
    data=ds,
    metadata_minimum=metadata_minimum,
    cvs=cvs,
    # We recommend passing the two arguments below as well.
    # There is some rudimentary support for guessing their values
    # based on the variable, but you are much more likely
    # to avoid errors if you don't rely on it.
    dataset_category="SSTsAndSeaIce",
    realm="seaIce",
)

This object holds both the data and metadata. For example, we can look at some of the metadata fields which were inferred from the CVs and from the data.

In [11]:

# Inferred from CVs
print(f"{input4mips_ds.metadata.contact=}")
print()
# Inferred from the data
print(f"{input4mips_ds.metadata.frequency=}")
print()
# Inferred from CVs
print(f"{input4mips_ds.metadata.source_version=}")
print()
# Inferred from the data
print(f"{input4mips_ds.metadata.variable_id=}")
print()
# Inferred from CVs
print(f"{input4mips_ds.metadata.license=}")

input4mips_ds.metadata.contact='zebedee.nicholls@climate-resource.com;malte.meinshausen@climate-resource.com'

input4mips_ds.metadata.frequency='mon'

input4mips_ds.metadata.source_version='0.2.0'

input4mips_ds.metadata.variable_id='siconc'

input4mips_ds.metadata.license='The input4MIPs data linked to this entry is licensed under a Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6Plus output, including citation requirements and proper acknowledgment. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.'

Writing our file

The last thing to do is write the file. This can be done with the write method. The key piece of information that you have to supply is the root directory in which to write the file. The rest of the path to the file is then auto-generated based on the data reference syntax (DRS) defined by the CVs. Below, we write the file to a temporary directory. You would obviously pick a more sensible location.

In [12]:

print(f"{cvs.DRS.directory_path_template=}")
print(f"{cvs.DRS.filename_template=}")

cvs.DRS.directory_path_template='<activity_id>/<mip_era>/<target_mip>/<institution_id>/<source_id>/<realm>/<frequency>/<variable_id>/<grid_label>/v<version>'
cvs.DRS.filename_template='<variable_id>_<activity_id>_<dataset_category>_<target_mip>_<source_id>_<grid_label>[_<time_range>].nc'
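
To see how templates like these turn into a path, the sketch below fills the angle-bracket placeholders from a dictionary and treats square brackets as optional parts. It is an illustration only, not the implementation used by input4mips_validation: `fill_template` is a hypothetical helper, and the metadata values are copied from this example's output.

```python
import re

PLACEHOLDER = re.compile(r"<(\w+)>")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Fill <key> placeholders; drop [optional] parts whose keys are missing."""

    def substitute(text: str) -> str:
        return PLACEHOLDER.sub(lambda m: values[m.group(1)], text)

    def fill_optional(match: re.Match) -> str:
        inner = match.group(1)
        keys = PLACEHOLDER.findall(inner)
        return substitute(inner) if all(k in values for k in keys) else ""

    return substitute(re.sub(r"\[([^\]]*)\]", fill_optional, template))


metadata = {
    "activity_id": "input4MIPs", "mip_era": "CMIP6Plus", "target_mip": "CMIP",
    "institution_id": "CR", "source_id": "CR-CMIP-0-2-0", "realm": "seaIce",
    "frequency": "mon", "variable_id": "siconc", "grid_label": "gn",
    "version": "20251001", "dataset_category": "SSTsAndSeaIce",
    "time_range": "200001-202312",
}
directory = fill_template(
    "<activity_id>/<mip_era>/<target_mip>/<institution_id>/<source_id>"
    "/<realm>/<frequency>/<variable_id>/<grid_label>/v<version>",
    metadata,
)
filename = fill_template(
    "<variable_id>_<activity_id>_<dataset_category>_<target_mip>"
    "_<source_id>_<grid_label>[_<time_range>].nc",
    metadata,
)
print(f"{directory}/{filename}")
```

In practice the write method does this for you; the sketch is only meant to make the placeholder syntax easier to read.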

In [13]:

TMP_DIR = Path(tempfile.mkdtemp())
written_file = input4mips_ds.write(TMP_DIR)
print(f"The file was written in {written_file}")

The file was written in /tmp/tmp65nqm8a7/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-2-0/seaIce/mon/siconc/gn/v20251001/siconc_input4MIPs_SSTsAndSeaIce_CMIP_CR-CMIP-0-2-0_gn_200001-202312.nc
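
Because every file ends up somewhere below the root directory you chose, a recursive search is an easy way to see what has been written. The sketch below uses only the standard library; `list_netcdf_files` is a hypothetical helper, and the directory layout and the second variable ("tos") are made up for the demonstration rather than taken from the DRS.

```python
import tempfile
from pathlib import Path


def list_netcdf_files(root: Path) -> list[Path]:
    """Return all .nc files below root, as sorted paths relative to root."""
    return sorted(p.relative_to(root) for p in root.rglob("*.nc"))


# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for variable in ("siconc", "tos"):
        out_dir = root / "input4MIPs" / variable
        out_dir.mkdir(parents=True)
        (out_dir / f"{variable}_example.nc").touch()
    found = list_netcdf_files(root)

print([p.as_posix() for p in found])
```

For the file written above, the equivalent call would be `list_netcdf_files(TMP_DIR)`.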

Next steps

This procedure can obviously be repeated to write multiple files.

If you have written your files with input4MIPs validation, we recommend the following next steps:

Double check that your file(s) pass validation; see "How to validate a single file".
(You can skip "How to write a file in the DRS" because your file is already written in the DRS.)
Upload the file(s) to LLNL's FTP server; see "How to upload to an FTP server".