Skip to content

input4mips_validation.cvs.drs#

input4mips_validation.cvs.drs #

Data reference syntax data model

This module needs a re-write. The rules are not clear enough and the code misrepresents the logic.

For the re-write:

  • change to just having a parse_filepath and parse_filename function for each DRS
    • introduce a prototype to allow different DRS handlers to be injected as needed
  • then you just do wrappers around this DRS handling
    • get metadata from filename
    • get metadata from path
    • create filename
    • create path
    • validate file correctly written according to DRS
    • etc.
  • remove the ability to inject the DRS via a string
    • fundamentally, it doesn't make sense and is too hard to get the logic right
    • it also hides the fact that there is lots of logic that can't be handled in a single string
  • as a result of the above, remove the DRS from the CVs entirely
    • maybe use a string as a key
  • maybe put the DRS in a separate package to allow better re-use
    • maybe, because there are some couplings here that might mean you can't do a clean split from the CVs/other logic that easily

DATA_REFERENCE_SYNTAX_FILENAME: str = 'input4MIPs_DRS.json' module-attribute #

Default name of the file in which the data reference syntax is saved

DEFAULT_REPLACEMENTS: dict[str, str] = {'.': '-'} module-attribute #

Default replacements for characters in directories and file names

DataReferenceSyntaxUnstructured: TypeAlias = dict[str, str] module-attribute #

Form into which the DRS is serialised for the CVs

DRSSubstitution #

A substitution to apply to a DRS template

Source code in src/input4mips_validation/cvs/drs.py
@frozen
class DRSSubstitution:
    """
    A substitution to apply to a DRS template
    """

    string_to_replace: str
    """String in the DRS template to replace"""

    required_metadata: tuple[str, ...]
    """The metadata required to correctly apply the substitution"""

    replacement_string: str
    """
    String to use to replace `self.string_to_replace` in the DRS template

    This contains placeholders.
    The actual replacement is generated by calling `replacement_string.format`
    as part of `self.format_replacement_string`.
    """

    optional: bool
    """Whether this substitution is optional or not.

    If the substitution is optional, then,
    if metadata is not present when calling `self.apply_substitution`,
    `self.string_to_replace` is simply deleted from the DRS template.
    If the substitution is not optional, then,
    if metadata is not present when calling `self.apply_substitution`,
    a `KeyError` is raised.
    """

    def apply_substitution(
        self,
        start: str,
        metadata: dict[str, str],
        to_directory_path: bool,
        validate_substituted_metadata: bool = True,
    ) -> str:
        """
        Apply the substitution

        Parameters
        ----------
        start
            String to which to apply the substitution

        metadata
            Metadata from which the substitution values can be retrieved

        to_directory_path
            Are the substitutions being applied to create a directory path?

            If `False`, we assume that we are creating a file name.

        validate_substituted_metadata
            If `True`, the substituted metadata is validated to ensure that its values
            only contain allowed characters before being applied.

        Returns
        -------
        :
            `start` with the substitution defined by `self` applied
        """
        missing_metadata = [k for k in self.required_metadata if k not in metadata]
        if missing_metadata:
            if not self.optional:
                # raise a custom error here which shows what metadata is missing
                # and the start string
                # raise MetadataRequiredForSubstitutionMissingError(missing_metadata)
                raise NotImplementedError()

            # Optional but meadata no there, so simply delete this section
            res = start.replace(self.string_to_replace, "")

        else:
            metadata_to_substitute = {k: metadata[k] for k in self.required_metadata}
            if validate_substituted_metadata:
                assert_all_metadata_substitutions_only_contain_valid_characters(
                    metadata_to_substitute,
                    to_directory_path=to_directory_path,
                )

            res = start.replace(
                self.string_to_replace,
                self.replacement_string.format(**metadata_to_substitute),
            )

        return res

optional: bool instance-attribute #

Whether this substitution is optional or not.

If the substitution is optional, then, if metadata is not present when calling self.apply_substitution, self.string_to_replace is simply deleted from the DRS template. If the substitution is not optional, then, if metadata is not present when calling self.apply_substitution, a KeyError is raised.

replacement_string: str instance-attribute #

String to use to replace self.string_to_replace in the DRS template

This contains placeholders. The actual replacement is generated by calling replacement_string.format as part of self.format_replacement_string.

required_metadata: tuple[str, ...] instance-attribute #

The metadata required to correctly apply the substitution

string_to_replace: str instance-attribute #

String in the DRS template to replace

apply_substitution(start, metadata, to_directory_path, validate_substituted_metadata=True) #

Apply the substitution

Parameters:

Name Type Description Default
start str

String to which to apply the substitution

required
metadata dict[str, str]

Metadata from which the substitution values can be retrieved

required
to_directory_path bool

Are the substitutions being applied to create a directory path?

If False, we assume that we are creating a file name.

required
validate_substituted_metadata bool

If True, the substituted metadata is validated to ensure that its values only contain allowed characters before being applied.

True

Returns:

Type Description
str

start with the substitution defined by self applied

Source code in src/input4mips_validation/cvs/drs.py
def apply_substitution(
    self,
    start: str,
    metadata: dict[str, str],
    to_directory_path: bool,
    validate_substituted_metadata: bool = True,
) -> str:
    """
    Apply the substitution

    Parameters
    ----------
    start
        String to which to apply the substitution

    metadata
        Metadata from which the substitution values can be retrieved

    to_directory_path
        Are the substitutions being applied to create a directory path?

        If `False`, we assume that we are creating a file name.

    validate_substituted_metadata
        If `True`, the substituted metadata is validated to ensure that its values
        only contain allowed characters before being applied.

    Returns
    -------
    :
        `start` with the substitution defined by `self` applied
    """
    missing_metadata = [k for k in self.required_metadata if k not in metadata]
    if missing_metadata:
        if not self.optional:
            # raise a custom error here which shows what metadata is missing
            # and the start string
            # raise MetadataRequiredForSubstitutionMissingError(missing_metadata)
            raise NotImplementedError()

        # Optional but meadata no there, so simply delete this section
        res = start.replace(self.string_to_replace, "")

    else:
        metadata_to_substitute = {k: metadata[k] for k in self.required_metadata}
        if validate_substituted_metadata:
            assert_all_metadata_substitutions_only_contain_valid_characters(
                metadata_to_substitute,
                to_directory_path=to_directory_path,
            )

        res = start.replace(
            self.string_to_replace,
            self.replacement_string.format(**metadata_to_substitute),
        )

    return res

DataReferenceSyntax #

Data reference syntax definition

This defines how directories and filepaths should be created.

Within the templates, we apply the following rules for parsing the templates.

Carets ("<" and ">") are placeholders for values that should be replaced with metadata values. E.g. assuming that the data's model_id is "BD29", "CMIP/" will be translated to "CMIP/BD29".

Square brackets ("[" and "]") indicate that the part of the template is optional. If the optional metadata isn't there, this part of the DRS will not be included. E.g. assuming that the data's frequency is "mon" and the model ID is "AB34", "[_]" will be translated to "AB34_mon", but if there is no frequency metadata, then the result will simply be "AB34".

Paths are handled using pathlib.Path, hence unix-like paths can be used in the template and they will still work on Windows machines.

Source code in src/input4mips_validation/cvs/drs.py
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
@frozen
class DataReferenceSyntax:
    """
    Data reference syntax definition

    This defines how directories and filepaths should be created.

    Within the templates, we apply the following rules for parsing the templates.

    Carets ("<" and ">") are placeholders for values
    that should be replaced with metadata values.
    E.g. assuming that the data's model_id is "BD29",
    "CMIP/<model_id>" will be translated to "CMIP/BD29".

    Square brackets ("[" and "]") indicate that the part of the template is optional.
    If the optional metadata isn't there, this part of the DRS will not be included.
    E.g. assuming that the data's frequency is "mon" and the model ID is "AB34",
    "<model_id>[_<frequency>]" will be translated to "AB34_mon",
    but if there is no frequency metadata,
    then the result will simply be "AB34".

    Paths are handled using [pathlib.Path][],
    hence unix-like paths can be used in the template
    and they will still work on Windows machines.
    """

    directory_path_template: str
    """Template for creating directories"""

    directory_path_example: str
    """Example of a complete directory path"""

    filename_template: str
    """Template for creating filenames"""

    filename_example: str
    """Example of a complete filename"""

    def get_file_path(  # noqa: PLR0913
        self,
        root_data_dir: Path,
        available_attributes: dict[str, str],
        time_start: cftime.datetime | dt.datetime | np.datetime64 | None = None,
        time_end: cftime.datetime | dt.datetime | np.datetime64 | None = None,
        frequency_metadata_key: str = "frequency",
        version: str | None = None,
    ) -> Path:
        """
        Get the (full) path to a file based on the DRS

        Parameters
        ----------
        root_data_dir
            Root directory in which the data is to be written.

            The generated path will be relative to `root_data_dir`.

        available_attributes
            The available metadata attributes for creating the path.

            All the elements expected by the DRS must be provided.
            For example, if the DRS' filename template is
            "<model_id>_<institution_id>.nc",
            then `available_attributes` must provide both
            `"model_id"` and `"institution_id"`.

        time_start
            The earliest time point in the data's time axis.

            This is a special case.
            Some DRS definitions expect time range information,
            but this metadata is not contained in any of the file's metadata,
            hence the end time must be supplied
            so that the time range information can be included.

        time_end
            The latest time point in the data's time axis.

            This is a special case.
            Some DRS definitions expect time range information,
            but this metadata is not contained in any of the file's metadata,
            hence the end time must be supplied
            so that the time range information can be included.

        frequency_metadata_key
            The key in the data's metadata
            which points to information about the data's frequency.

        version
            The version to use when creating the path.

            This is a special case.
            Some DRS definitions expect version metadata,
            but this metadata is not contained in any of the file's metadata,
            hence must be supplied separately.
            If not supplied and version is required,
            we simply use today's date, formatted as YYYYMMDD.

        Returns
        -------
            The path in which to write the file according to the DRS

        Raises
        ------
        KeyError
            A metadata attribute expected by the DRS is not supplied.

        ValueError
            Time range information is required by the DRS,
            but `time_start` and `time_end` are not supplied.
        """
        # First step: apply a number of known replacements globally
        all_available_metadata = {
            k: apply_known_replacements(v) for k, v in available_attributes.items()
        }

        directory_substitutions = self.parse_drs_template(
            drs_template=self.directory_path_template
        )
        filename_substitutions = self.parse_drs_template(
            drs_template=self.filename_template
        )

        all_substitutions = (*directory_substitutions, *filename_substitutions)
        if (
            key_required_for_substitutions("version", all_substitutions)
            and version is None
        ):
            version = dt.datetime.now(dt.timezone.utc).strftime("%Y%m%d")
            all_available_metadata["version"] = version

        if key_required_for_substitutions("time_range", all_substitutions) and (
            time_start is None or time_end is None
        ):
            msg = (
                "The DRS requires time range information "
                "so both time_start and time_end must be provided. "
                f"Received {time_start=} and {time_end=}."
            )
            ValueError(msg)

        if (
            key_in_substitutions("time_range", all_substitutions)
            and time_start is not None
            and time_end is not None
        ):
            # Hard-code here for now because the rules about time-range
            # creation cannot be inferred from the DRS as currently written.
            all_available_metadata["time_range"] = create_time_range_for_filename(
                time_start=time_start,
                time_end=time_end,
                ds_frequency=all_available_metadata[frequency_metadata_key],
            )

        apply_subs = functools.partial(
            apply_substitutions,
            validate_substituted_metadata=True,
        )
        # Cast to path to ensure windows compatibility
        directory = Path(
            apply_subs(
                self.directory_path_template,
                metadata=all_available_metadata,
                substitutions=directory_substitutions,
                to_directory_path=True,
            )
        )
        # In the filename, underscore is swapped for hyphen to avoid delimiter issues.
        # Burying this here feels too deep,
        # but I don't know how to express this in a more obvious way.
        all_available_metadata_for_filename = {
            k: apply_known_replacements(v, {"_": "-"})
            for k, v in all_available_metadata.items()
        }
        filename = apply_subs(
            self.filename_template,
            metadata=all_available_metadata_for_filename,
            substitutions=filename_substitutions,
            to_directory_path=False,
        )

        generated_path = directory / filename
        # Check validity of everything, excluding the suffix
        for component in generated_path.with_suffix("").parts:
            assert_full_filepath_only_contains_valid_characters(component)

        return root_data_dir / generated_path

    @staticmethod
    @functools.cache
    def parse_drs_template(drs_template: str) -> tuple[DRSSubstitution, ...]:  # noqa: PLR0912, PLR0915
        """
        Parse a DRS template string

        For the rules about parsing, see
        [`DataReferenceSyntax`][input4mips_validation.cvs.drs.DataReferenceSyntax].

        Parameters
        ----------
        drs_template
            DRS template to parse

        Returns
        -------
            Substitutions defined by `drs_template`
        """
        # This is a pretty yuck implementation.
        # PRs welcome to improve it
        # (will need quite some tests too to ensure correctness)
        # For now, we are ok with this because
        # a) it is relatively easy to follow
        # b) we use caching so it doesn't matter too much that it is slow

        # Hard-code here to ensure we match docstrings.
        # Could loosen this in future of course,
        # but I don't want to abstract that far right now.
        start_placeholder = "<"
        end_placeholder = ">"
        start_optional = "["
        end_optional = "]"

        substitutions_l = []
        in_optional_section = False
        in_placeholder = False
        placeholder_pieces: list[str] = []
        optional_pieces: list[str] = []
        for i, c in enumerate(drs_template):
            if c == start_optional:
                if optional_pieces:
                    msg = (
                        "Starting new optional section, "
                        "should not have optional_pieces"
                    )
                    # Can use the below for better error messages
                    # drs_template[:i + 1]
                    raise AssertionError(msg)

                in_optional_section = True
                continue

            if c == start_placeholder:
                if placeholder_pieces:
                    msg = (
                        "Starting new placeholder section, "
                        "should not have placeholder_pieces"
                    )
                    raise AssertionError(msg)

                in_placeholder = True
                if in_optional_section:
                    optional_pieces.append(c)
                continue

            if c == end_placeholder:
                if not in_placeholder:
                    msg = "Found end_placeholder without being in_placeholder"
                    raise AssertionError(msg)

                if in_optional_section:
                    optional_pieces.append(c)

                else:
                    # Can finalise this section
                    metadata_key = "".join(placeholder_pieces)
                    substitutions_l.append(
                        DRSSubstitution(
                            optional=False,
                            string_to_replace=f"{start_placeholder}{metadata_key}{end_placeholder}",
                            required_metadata=(metadata_key,),
                            replacement_string=f"{{{metadata_key}}}",
                        )
                    )

                    placeholder_pieces = []

                in_placeholder = False
                continue

            if c == end_optional:
                if not in_optional_section:
                    msg = "Found end_optional without being in_optional_section"
                    raise AssertionError(msg)

                if in_placeholder:
                    msg = "Should have already exited placeholder"
                    raise AssertionError(msg)

                metadata_key = "".join(placeholder_pieces)
                optional_section = "".join(optional_pieces)
                substitutions_l.append(
                    DRSSubstitution(
                        optional=True,
                        string_to_replace=f"{start_optional}{optional_section}{end_optional}",
                        required_metadata=(metadata_key,),
                        replacement_string=optional_section.replace(
                            f"{start_placeholder}{metadata_key}{end_placeholder}",
                            f"{{{metadata_key}}}",
                        ),
                    )
                )

                placeholder_pieces = []
                optional_pieces = []
                in_optional_section = False
                continue

            if in_placeholder:
                placeholder_pieces.append(c)

            if in_optional_section:
                optional_pieces.append(c)

        return tuple(substitutions_l)

    def extract_metadata_from_path(
        self, directory: Path, include_root_data_dir: bool = False
    ) -> dict[str, str | None]:
        """
        Extract metadata from a path

        To be specific, the bit of the path
        that corresponds to `self.directory_path_template`.
        In other words,
        `path` should only be the directory/folder bit of the full filepath,
        the filename should not be part of `path`.

        Parameters
        ----------
        directory
            Directory from which to extract the metadata

        include_root_data_dir
            Should the key "root_data_dir" be included in the output?

            The value of this key specifies the (inferred) root directory
            of the data.

        Returns
        -------
        :
            Extracted metadata
        """
        root_data_dir_key = "root_data_dir"

        directory_regexp = self.get_regexp_for_capturing_directory_information(
            root_data_dir_group=root_data_dir_key
        )
        match = re.match(directory_regexp, str(directory))
        if match is None:
            msg = f"regexp failed. {directory_regexp=}. {directory=}"
            raise AssertionError(msg)

        match_groups = match.groupdict()

        if include_root_data_dir:
            res = match_groups

        else:
            res = {k: v for k, v in match_groups.items() if k != root_data_dir_key}

        return res

    @functools.cache
    def get_regexp_for_capturing_directory_information(
        self, root_data_dir_group: str = "root_data_dir"
    ) -> str:
        """
        Get a regular expression for capturing information from a directory

        Parameters
        ----------
        root_data_dir_group
            Group name for the group which captures the root data directory.

        Returns
        -------
        :
            Regular expression which can be used to capture information
            from a directory.

        Notes
        -----
        According to
        [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk),
        there are no restrictions on the directory characters
        (there are only restrictions on the characters in the filename,
        see
        [`get_regexp_for_capturing_filename_information`][input4mips_validation.cvs.drs.DataReferenceSyntax.get_regexp_for_capturing_filename_information]).
        """
        # Hard-code according to the spec
        # All characters except the path separator
        allowed_chars = "[^/]"

        drs_template = self.directory_path_template
        directory_substitutions = self.parse_drs_template(drs_template=drs_template)
        capturing_regexp = get_regexp_from_template_and_substitutions(
            drs_template,
            substitutions=directory_substitutions,
            capturing_allowed_chars=allowed_chars,
        )

        # Make sure that the separators will behave
        sep_escape = re.escape(os.sep)
        capturing_regexp = capturing_regexp.replace("/", sep_escape)
        # And that our regexp allows for the root directory
        capturing_regexp = (
            f"(?P<{root_data_dir_group}>.*){sep_escape}{capturing_regexp}"
        )

        return capturing_regexp

    def extract_metadata_from_filename(self, filename: str) -> dict[str, str | None]:
        """
        Extract metadata from a filename

        Parameters
        ----------
        filename
            Filename from which to extract the metadata

        Returns
        -------
        :
            Extracted metadata
        """
        filename_regexp = self.get_regexp_for_capturing_filename_information()
        match = re.match(filename_regexp, filename)
        if match is None:
            msg = f"regexp failed. {filename_regexp=}. {filename=}"
            raise AssertionError(msg)

        match_groups = match.groupdict()

        return match_groups

    @functools.cache
    def get_regexp_for_capturing_filename_information(
        self,
    ) -> str:
        """
        Get a regular expression for capturing information from a filename

        Returns
        -------
        :
            Regular expression which can be used to capture information
            from a directory.

        Notes
        -----
        According to [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk):

        - only [a-zA-Z0-9-] are allowed in file path names
          except where underscore is used as a separator.

        We use this to significantly simplify our regular expression.
        """
        # Hard-code according to the spec
        # Underscore not included because it can't be in capturing groups
        allowed_chars = "[a-zA-Z0-9-]"

        drs_template = self.filename_template
        filename_substitutions = self.parse_drs_template(drs_template=drs_template)
        capturing_regexp = get_regexp_from_template_and_substitutions(
            drs_template,
            substitutions=filename_substitutions,
            capturing_allowed_chars=allowed_chars,
        )

        return capturing_regexp

    def validate_file_written_according_to_drs(
        self,
        file: Path,
        frequency_metadata_keys: FrequencyMetadataKeys = FrequencyMetadataKeys(),
        time_dimension: str = "time",
        xr_variable_processor: XRVariableProcessorLike = XRVariableHelper(),
    ) -> None:
        """
        Validate that a file is correctly written in the DRS

        Parameters
        ----------
        file
            File to validate

        frequency_metadata_keys
            Metadata definitions for frequency information

        time_dimension
            The time dimension of the data

        xr_variable_processor
            Helper to use for processing the variables in xarray objects.

        Raises
        ------
        ValueError
            The file is not correctly written in the DRS
        """
        # TODO: try except here
        # If the file is clearly wrong,
        # just print out the directory and print out the template
        # and say, try again
        directory_metadata: dict[str, str | None] = self.extract_metadata_from_path(
            file.absolute().parent
        )
        file_metadata: dict[str, str | None] = self.extract_metadata_from_filename(
            file.name
        )

        ds = ds_from_iris_cubes(
            iris.load(file),
            xr_variable_processor=xr_variable_processor,
            raw_file=file,
            time_dimension=time_dimension,
        )
        comparison_metadata = {
            k: apply_known_replacements(v)
            for k, v in ds.attrs.items()
            # Ignore everything that isn't a string for comparisons
            if isinstance(v, str)
        }

        # Infer time range information, in case it appears in the DRS.
        # Annoying that we have to pass this all the way through to here.
        time_start, time_end = infer_time_start_time_end_for_filename(
            ds=ds,
            frequency_metadata_key=frequency_metadata_keys.frequency_metadata_key,
            no_time_axis_frequency=frequency_metadata_keys.no_time_axis_frequency,
            time_dimension=time_dimension,
        )
        if time_start is not None and time_end is not None:
            time_range = create_time_range_for_filename(
                time_start=time_start,
                time_end=time_end,
                ds_frequency=ds.attrs[frequency_metadata_keys.frequency_metadata_key],
            )

            comparison_metadata["time_range"] = time_range

        # This key is unverifiable because we don't save this data anywhere in the file,
        # and it can take any value.
        # TODO: infer this once we have the required_global_attributes
        # handling implemented in the CVs.
        unverifiable_keys_directory = ["version"]

        mismatches = []
        for k, v in directory_metadata.items():
            if v is None:
                # No info in directory, presumably because key was optional
                continue

            if k in unverifiable_keys_directory:
                continue

            if comparison_metadata[k] != v:
                mismatches.append(
                    [k, "directory", directory_metadata[k], comparison_metadata[k]]
                )

        for k, v in file_metadata.items():
            if v is None:
                # No info in filename, presumably because key was optional
                continue

            # In the filename, underscore
            # can be swapped for hyphen to avoid delimiter issues.
            # Burying this here feels too deep,
            # but I don't know how to express this in a more obvious way.
            valid_filename_values = {
                comparison_metadata[k],
                apply_known_replacements(comparison_metadata[k], {"_": "-"}),
            }
            if v not in valid_filename_values:
                mismatches.append(
                    [k, "filename", file_metadata[k], comparison_metadata[k]]
                )

        if mismatches:
            msg_l = [
                "File not written in line with the DRS.",
                f"{file.absolute()=}.",
                f"{self.directory_path_template=}",
                f"{self.filename_template=}",
            ]
            for mismatch in mismatches:
                key, location, filepath_val, expected_val = mismatch

                tmp = (
                    f"Mismatch in {location} for {key}. "
                    f"{filepath_val=!r} {expected_val=!r}"
                )
                msg_l.append(tmp)

            msg = "\n".join(msg_l)
            raise ValueError(msg)

    def get_esgf_dataset_master_id(self, file: Path) -> str:
        """
        Get the ESGF's master ID for the dataset to which a file belongs

        Parameters
        ----------
        file
            File for which to get the dataset ID

        Returns
        -------
        :
            ESGF master ID for the dataset to which `file` belongs

        Examples
        --------
        >>> drs = DataReferenceSyntax(
        ...     directory_path_template="<model_id>/v<version>",
        ...     directory_path_example="ACCESS/v20240726",
        ...     filename_template="<variable_id>_<model_id>.nc",
        ...     filename_example="tas_ACCESS.nc",
        ... )
        >>> file = Path("/path/to/somewhere/CanESM/v20240812/tas_CanESM.nc")
        >>> drs.get_esgf_dataset_master_id(file)
        'CanESM.v20240812'
        """
        metadata_directories = self.extract_metadata_from_path(
            file.parent, include_root_data_dir=True
        )

        if metadata_directories["root_data_dir"] is None:
            raise AssertionError

        res = str(
            file.parent.relative_to(metadata_directories["root_data_dir"])
        ).replace(os.sep, ".")

        return res

directory_path_example: str instance-attribute #

Example of a complete directory path

directory_path_template: str instance-attribute #

Template for creating directories

filename_example: str instance-attribute #

Example of a complete filename

filename_template: str instance-attribute #

Template for creating filenames

extract_metadata_from_filename(filename) #

Extract metadata from a filename

Parameters:

Name Type Description Default
filename str

Filename from which to extract the metadata

required

Returns:

Type Description
dict[str, str | None]

Extracted metadata

Source code in src/input4mips_validation/cvs/drs.py
def extract_metadata_from_filename(self, filename: str) -> dict[str, str | None]:
    """
    Extract metadata from a filename

    Parameters
    ----------
    filename
        Filename from which to extract the metadata

    Returns
    -------
    :
        Extracted metadata
    """
    filename_regexp = self.get_regexp_for_capturing_filename_information()
    match = re.match(filename_regexp, filename)
    if match is None:
        msg = f"regexp failed. {filename_regexp=}. {filename=}"
        raise AssertionError(msg)

    match_groups = match.groupdict()

    return match_groups

extract_metadata_from_path(directory, include_root_data_dir=False) #

Extract metadata from a path

To be specific, the bit of the path that corresponds to self.directory_path_template. In other words, path should only be the directory/folder bit of the full filepath, the filename should not be part of path.

Parameters:

Name Type Description Default
directory Path

Directory from which to extract the metadata

required
include_root_data_dir bool

Should the key "root_data_dir" be included in the output?

The value of this key specifies the (inferred) root directory of the data.

False

Returns:

Type Description
dict[str, str | None]

Extracted metadata

Source code in src/input4mips_validation/cvs/drs.py
def extract_metadata_from_path(
    self, directory: Path, include_root_data_dir: bool = False
) -> dict[str, str | None]:
    """
    Extract metadata from a path

    To be specific, the bit of the path
    that corresponds to `self.directory_path_template`.
    In other words,
    `path` should only be the directory/folder bit of the full filepath,
    the filename should not be part of `path`.

    Parameters
    ----------
    directory
        Directory from which to extract the metadata

    include_root_data_dir
        Should the key "root_data_dir" be included in the output?

        The value of this key specifies the (inferred) root directory
        of the data.

    Returns
    -------
    :
        Extracted metadata
    """
    root_data_dir_key = "root_data_dir"

    directory_regexp = self.get_regexp_for_capturing_directory_information(
        root_data_dir_group=root_data_dir_key
    )
    match = re.match(directory_regexp, str(directory))
    if match is None:
        msg = f"regexp failed. {directory_regexp=}. {directory=}"
        raise AssertionError(msg)

    match_groups = match.groupdict()

    if include_root_data_dir:
        res = match_groups

    else:
        res = {k: v for k, v in match_groups.items() if k != root_data_dir_key}

    return res

get_esgf_dataset_master_id(file) #

Get the ESGF's master ID for the dataset to which a file belongs

Parameters:

Name Type Description Default
file Path

File for which to get the dataset ID

required

Returns:

Type Description
str

ESGF master ID for the dataset to which file belongs

Examples:

>>> drs = DataReferenceSyntax(
...     directory_path_template="<model_id>/v<version>",
...     directory_path_example="ACCESS/v20240726",
...     filename_template="<variable_id>_<model_id>.nc",
...     filename_example="tas_ACCESS.nc",
... )
>>> file = Path("/path/to/somewhere/CanESM/v20240812/tas_CanESM.nc")
>>> drs.get_esgf_dataset_master_id(file)
'CanESM.v20240812'
Source code in src/input4mips_validation/cvs/drs.py
def get_esgf_dataset_master_id(self, file: Path) -> str:
    """
    Get the ESGF's master ID for the dataset to which a file belongs

    Parameters
    ----------
    file
        File for which to get the dataset ID

    Returns
    -------
    :
        ESGF master ID for the dataset to which `file` belongs

    Examples
    --------
    >>> drs = DataReferenceSyntax(
    ...     directory_path_template="<model_id>/v<version>",
    ...     directory_path_example="ACCESS/v20240726",
    ...     filename_template="<variable_id>_<model_id>.nc",
    ...     filename_example="tas_ACCESS.nc",
    ... )
    >>> file = Path("/path/to/somewhere/CanESM/v20240812/tas_CanESM.nc")
    >>> drs.get_esgf_dataset_master_id(file)
    'CanESM.v20240812'
    """
    metadata_directories = self.extract_metadata_from_path(
        file.parent, include_root_data_dir=True
    )

    if metadata_directories["root_data_dir"] is None:
        raise AssertionError

    res = str(
        file.parent.relative_to(metadata_directories["root_data_dir"])
    ).replace(os.sep, ".")

    return res

get_file_path(root_data_dir, available_attributes, time_start=None, time_end=None, frequency_metadata_key='frequency', version=None) #

Get the (full) path to a file based on the DRS

Parameters:

Name Type Description Default
root_data_dir Path

Root directory in which the data is to be written.

The generated path will be relative to root_data_dir.

required
available_attributes dict[str, str]

The available metadata attributes for creating the path.

All the elements expected by the DRS must be provided. For example, if the DRS' filename template is "_.nc", then available_attributes must provide both "model_id" and "institution_id".

required
time_start datetime | datetime | datetime64 | None

The earliest time point in the data's time axis.

This is a special case. Some DRS definitions expect time range information, but this metadata is not contained in any of the file's metadata, hence the end time must be supplied so that the time range information can be included.

None
time_end datetime | datetime | datetime64 | None

The latest time point in the data's time axis.

This is a special case. Some DRS definitions expect time range information, but this metadata is not contained in any of the file's metadata, hence the end time must be supplied so that the time range information can be included.

None
frequency_metadata_key str

The key in the data's metadata which points to information about the data's frequency.

'frequency'
version str | None

The version to use when creating the path.

This is a special case. Some DRS definitions expect version metadata, but this metadata is not contained in any of the file's metadata, hence must be supplied separately. If not supplied and version is required, we simply use today's date, formatted as YYYYMMDD.

None

Returns:

Type Description
The path in which to write the file according to the DRS

Raises:

Type Description
KeyError

A metadata attribute expected by the DRS is not supplied.

ValueError

Time range information is required by the DRS, but time_start and time_end are not supplied.

Source code in src/input4mips_validation/cvs/drs.py
def get_file_path(  # noqa: PLR0913
    self,
    root_data_dir: Path,
    available_attributes: dict[str, str],
    time_start: cftime.datetime | dt.datetime | np.datetime64 | None = None,
    time_end: cftime.datetime | dt.datetime | np.datetime64 | None = None,
    frequency_metadata_key: str = "frequency",
    version: str | None = None,
) -> Path:
    """
    Get the (full) path to a file based on the DRS

    Parameters
    ----------
    root_data_dir
        Root directory in which the data is to be written.

        The generated path will be relative to `root_data_dir`.

    available_attributes
        The available metadata attributes for creating the path.

        All the elements expected by the DRS must be provided.
        For example, if the DRS' filename template is
        "<model_id>_<institution_id>.nc",
        then `available_attributes` must provide both
        `"model_id"` and `"institution_id"`.

    time_start
        The earliest time point in the data's time axis.

        This is a special case.
        Some DRS definitions expect time range information,
        but this metadata is not contained in any of the file's metadata,
        hence the end time must be supplied
        so that the time range information can be included.

    time_end
        The latest time point in the data's time axis.

        This is a special case.
        Some DRS definitions expect time range information,
        but this metadata is not contained in any of the file's metadata,
        hence the end time must be supplied
        so that the time range information can be included.

    frequency_metadata_key
        The key in the data's metadata
        which points to information about the data's frequency.

    version
        The version to use when creating the path.

        This is a special case.
        Some DRS definitions expect version metadata,
        but this metadata is not contained in any of the file's metadata,
        hence must be supplied separately.
        If not supplied and version is required,
        we simply use today's date, formatted as YYYYMMDD.

    Returns
    -------
        The path in which to write the file according to the DRS

    Raises
    ------
    KeyError
        A metadata attribute expected by the DRS is not supplied.

    ValueError
        Time range information is required by the DRS,
        but `time_start` and `time_end` are not supplied.
    """
    # First step: apply a number of known replacements globally
    all_available_metadata = {
        k: apply_known_replacements(v) for k, v in available_attributes.items()
    }

    directory_substitutions = self.parse_drs_template(
        drs_template=self.directory_path_template
    )
    filename_substitutions = self.parse_drs_template(
        drs_template=self.filename_template
    )

    all_substitutions = (*directory_substitutions, *filename_substitutions)
    if (
        key_required_for_substitutions("version", all_substitutions)
        and version is None
    ):
        version = dt.datetime.now(dt.timezone.utc).strftime("%Y%m%d")
        all_available_metadata["version"] = version

    if key_required_for_substitutions("time_range", all_substitutions) and (
        time_start is None or time_end is None
    ):
        msg = (
            "The DRS requires time range information "
            "so both time_start and time_end must be provided. "
            f"Received {time_start=} and {time_end=}."
        )
        ValueError(msg)

    if (
        key_in_substitutions("time_range", all_substitutions)
        and time_start is not None
        and time_end is not None
    ):
        # Hard-code here for now because the rules about time-range
        # creation cannot be inferred from the DRS as currently written.
        all_available_metadata["time_range"] = create_time_range_for_filename(
            time_start=time_start,
            time_end=time_end,
            ds_frequency=all_available_metadata[frequency_metadata_key],
        )

    apply_subs = functools.partial(
        apply_substitutions,
        validate_substituted_metadata=True,
    )
    # Cast to path to ensure windows compatibility
    directory = Path(
        apply_subs(
            self.directory_path_template,
            metadata=all_available_metadata,
            substitutions=directory_substitutions,
            to_directory_path=True,
        )
    )
    # In the filename, underscore is swapped for hyphen to avoid delimiter issues.
    # Burying this here feels too deep,
    # but I don't know how to express this in a more obvious way.
    all_available_metadata_for_filename = {
        k: apply_known_replacements(v, {"_": "-"})
        for k, v in all_available_metadata.items()
    }
    filename = apply_subs(
        self.filename_template,
        metadata=all_available_metadata_for_filename,
        substitutions=filename_substitutions,
        to_directory_path=False,
    )

    generated_path = directory / filename
    # Check validity of everything, excluding the suffix
    for component in generated_path.with_suffix("").parts:
        assert_full_filepath_only_contains_valid_characters(component)

    return root_data_dir / generated_path

get_regexp_for_capturing_directory_information(root_data_dir_group='root_data_dir') cached #

Get a regular expression for capturing information from a directory

Parameters:

Name Type Description Default
root_data_dir_group str

Group name for the group which captures the root data directory.

'root_data_dir'

Returns:

Type Description
str

Regular expression which can be used to capture information from a directory.

Notes

According to the DRS description, there are no restrictions on the directory characters (there are only restrictions on the characters in the filename, see get_regexp_for_capturing_filename_information).

Source code in src/input4mips_validation/cvs/drs.py
@functools.cache
def get_regexp_for_capturing_directory_information(
    self, root_data_dir_group: str = "root_data_dir"
) -> str:
    """
    Get a regular expression for capturing information from a directory

    Parameters
    ----------
    root_data_dir_group
        Group name for the group which captures the root data directory.

    Returns
    -------
    :
        Regular expression which can be used to capture information
        from a directory.

    Notes
    -----
    According to
    [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk),
    there are no restrictions on the directory characters
    (there are only restrictions on the characters in the filename,
    see
    [`get_regexp_for_capturing_filename_information`][input4mips_validation.cvs.drs.DataReferenceSyntax.get_regexp_for_capturing_filename_information]).
    """
    # Hard-code according to the spec
    # All characters except the path separator
    allowed_chars = "[^/]"

    drs_template = self.directory_path_template
    directory_substitutions = self.parse_drs_template(drs_template=drs_template)
    capturing_regexp = get_regexp_from_template_and_substitutions(
        drs_template,
        substitutions=directory_substitutions,
        capturing_allowed_chars=allowed_chars,
    )

    # Make sure that the separators will behave
    sep_escape = re.escape(os.sep)
    capturing_regexp = capturing_regexp.replace("/", sep_escape)
    # And that our regexp allows for the root directory
    capturing_regexp = (
        f"(?P<{root_data_dir_group}>.*){sep_escape}{capturing_regexp}"
    )

    return capturing_regexp

get_regexp_for_capturing_filename_information() cached #

Get a regular expression for capturing information from a filename

Returns:

Type Description
str

Regular expression which can be used to capture information from a directory.

Notes

According to the DRS description:

  • only [a-zA-Z0-9-] are allowed in file path names except where underscore is used as a separator.

We use this to significantly simplify our regular expression.

Source code in src/input4mips_validation/cvs/drs.py
@functools.cache
def get_regexp_for_capturing_filename_information(
    self,
) -> str:
    """
    Get a regular expression for capturing information from a filename

    Returns
    -------
    :
        Regular expression which can be used to capture information
        from a directory.

    Notes
    -----
    According to [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk):

    - only [a-zA-Z0-9-] are allowed in file path names
      except where underscore is used as a separator.

    We use this to significantly simplify our regular expression.
    """
    # Hard-code according to the spec
    # Underscore not included because it can't be in capturing groups
    allowed_chars = "[a-zA-Z0-9-]"

    drs_template = self.filename_template
    filename_substitutions = self.parse_drs_template(drs_template=drs_template)
    capturing_regexp = get_regexp_from_template_and_substitutions(
        drs_template,
        substitutions=filename_substitutions,
        capturing_allowed_chars=allowed_chars,
    )

    return capturing_regexp

parse_drs_template(drs_template) cached staticmethod #

Parse a DRS template string

For the rules about parsing, see DataReferenceSyntax.

Parameters:

Name Type Description Default
drs_template str

DRS template to parse

required

Returns:

Type Description
Substitutions defined by `drs_template`
Source code in src/input4mips_validation/cvs/drs.py
@staticmethod
@functools.cache
def parse_drs_template(drs_template: str) -> tuple[DRSSubstitution, ...]:  # noqa: PLR0912, PLR0915
    """
    Parse a DRS template string

    For the rules about parsing, see
    [`DataReferenceSyntax`][input4mips_validation.cvs.drs.DataReferenceSyntax].

    Parameters
    ----------
    drs_template
        DRS template to parse

    Returns
    -------
        Substitutions defined by `drs_template`
    """
    # This is a pretty yuck implementation.
    # PRs welcome to improve it
    # (will need quite some tests too to ensure correctness)
    # For now, we are ok with this because
    # a) it is relatively easy to follow
    # b) we use caching so it doesn't matter too much that it is slow

    # Hard-code here to ensure we match docstrings.
    # Could loosen this in future of course,
    # but I don't want to abstract that far right now.
    start_placeholder = "<"
    end_placeholder = ">"
    start_optional = "["
    end_optional = "]"

    substitutions_l = []
    in_optional_section = False
    in_placeholder = False
    placeholder_pieces: list[str] = []
    optional_pieces: list[str] = []
    for i, c in enumerate(drs_template):
        if c == start_optional:
            if optional_pieces:
                msg = (
                    "Starting new optional section, "
                    "should not have optional_pieces"
                )
                # Can use the below for better error messages
                # drs_template[:i + 1]
                raise AssertionError(msg)

            in_optional_section = True
            continue

        if c == start_placeholder:
            if placeholder_pieces:
                msg = (
                    "Starting new placeholder section, "
                    "should not have placeholder_pieces"
                )
                raise AssertionError(msg)

            in_placeholder = True
            if in_optional_section:
                optional_pieces.append(c)
            continue

        if c == end_placeholder:
            if not in_placeholder:
                msg = "Found end_placeholder without being in_placeholder"
                raise AssertionError(msg)

            if in_optional_section:
                optional_pieces.append(c)

            else:
                # Can finalise this section
                metadata_key = "".join(placeholder_pieces)
                substitutions_l.append(
                    DRSSubstitution(
                        optional=False,
                        string_to_replace=f"{start_placeholder}{metadata_key}{end_placeholder}",
                        required_metadata=(metadata_key,),
                        replacement_string=f"{{{metadata_key}}}",
                    )
                )

                placeholder_pieces = []

            in_placeholder = False
            continue

        if c == end_optional:
            if not in_optional_section:
                msg = "Found end_optional without being in_optional_section"
                raise AssertionError(msg)

            if in_placeholder:
                msg = "Should have already exited placeholder"
                raise AssertionError(msg)

            metadata_key = "".join(placeholder_pieces)
            optional_section = "".join(optional_pieces)
            substitutions_l.append(
                DRSSubstitution(
                    optional=True,
                    string_to_replace=f"{start_optional}{optional_section}{end_optional}",
                    required_metadata=(metadata_key,),
                    replacement_string=optional_section.replace(
                        f"{start_placeholder}{metadata_key}{end_placeholder}",
                        f"{{{metadata_key}}}",
                    ),
                )
            )

            placeholder_pieces = []
            optional_pieces = []
            in_optional_section = False
            continue

        if in_placeholder:
            placeholder_pieces.append(c)

        if in_optional_section:
            optional_pieces.append(c)

    return tuple(substitutions_l)

validate_file_written_according_to_drs(file, frequency_metadata_keys=FrequencyMetadataKeys(), time_dimension='time', xr_variable_processor=XRVariableHelper()) #

Validate that a file is correctly written in the DRS

Parameters:

Name Type Description Default
file Path

File to validate

required
frequency_metadata_keys FrequencyMetadataKeys

Metadata definitions for frequency information

FrequencyMetadataKeys()
time_dimension str

The time dimension of the data

'time'
xr_variable_processor XRVariableProcessorLike

Helper to use for processing the variables in xarray objects.

XRVariableHelper()

Raises:

Type Description
ValueError

The file is not correctly written in the DRS

Source code in src/input4mips_validation/cvs/drs.py
def validate_file_written_according_to_drs(
    self,
    file: Path,
    frequency_metadata_keys: FrequencyMetadataKeys = FrequencyMetadataKeys(),
    time_dimension: str = "time",
    xr_variable_processor: XRVariableProcessorLike = XRVariableHelper(),
) -> None:
    """
    Validate that a file is correctly written in the DRS

    Parameters
    ----------
    file
        File to validate

    frequency_metadata_keys
        Metadata definitions for frequency information

    time_dimension
        The time dimension of the data

    xr_variable_processor
        Helper to use for processing the variables in xarray objects.

    Raises
    ------
    ValueError
        The file is not correctly written in the DRS
    """
    # TODO: try except here
    # If the file is clearly wrong,
    # just print out the directory and print out the template
    # and say, try again
    directory_metadata: dict[str, str | None] = self.extract_metadata_from_path(
        file.absolute().parent
    )
    file_metadata: dict[str, str | None] = self.extract_metadata_from_filename(
        file.name
    )

    ds = ds_from_iris_cubes(
        iris.load(file),
        xr_variable_processor=xr_variable_processor,
        raw_file=file,
        time_dimension=time_dimension,
    )
    comparison_metadata = {
        k: apply_known_replacements(v)
        for k, v in ds.attrs.items()
        # Ignore everything that isn't a string for comparisons
        if isinstance(v, str)
    }

    # Infer time range information, in case it appears in the DRS.
    # Annoying that we have to pass this all the way through to here.
    time_start, time_end = infer_time_start_time_end_for_filename(
        ds=ds,
        frequency_metadata_key=frequency_metadata_keys.frequency_metadata_key,
        no_time_axis_frequency=frequency_metadata_keys.no_time_axis_frequency,
        time_dimension=time_dimension,
    )
    if time_start is not None and time_end is not None:
        time_range = create_time_range_for_filename(
            time_start=time_start,
            time_end=time_end,
            ds_frequency=ds.attrs[frequency_metadata_keys.frequency_metadata_key],
        )

        comparison_metadata["time_range"] = time_range

    # This key is unverifiable because we don't save this data anywhere in the file,
    # and it can take any value.
    # TODO: infer this once we have the required_global_attributes
    # handling implemented in the CVs.
    unverifiable_keys_directory = ["version"]

    mismatches = []
    for k, v in directory_metadata.items():
        if v is None:
            # No info in directory, presumably because key was optional
            continue

        if k in unverifiable_keys_directory:
            continue

        if comparison_metadata[k] != v:
            mismatches.append(
                [k, "directory", directory_metadata[k], comparison_metadata[k]]
            )

    for k, v in file_metadata.items():
        if v is None:
            # No info in filename, presumably because key was optional
            continue

        # In the filename, underscore
        # can be swapped for hyphen to avoid delimiter issues.
        # Burying this here feels too deep,
        # but I don't know how to express this in a more obvious way.
        valid_filename_values = {
            comparison_metadata[k],
            apply_known_replacements(comparison_metadata[k], {"_": "-"}),
        }
        if v not in valid_filename_values:
            mismatches.append(
                [k, "filename", file_metadata[k], comparison_metadata[k]]
            )

    if mismatches:
        msg_l = [
            "File not written in line with the DRS.",
            f"{file.absolute()=}.",
            f"{self.directory_path_template=}",
            f"{self.filename_template=}",
        ]
        for mismatch in mismatches:
            key, location, filepath_val, expected_val = mismatch

            tmp = (
                f"Mismatch in {location} for {key}. "
                f"{filepath_val=!r} {expected_val=!r}"
            )
            msg_l.append(tmp)

        msg = "\n".join(msg_l)
        raise ValueError(msg)

apply_known_replacements(inp, known_replacements=None) #

Apply known replacements of characters in metadata values

This helps ensure that only valid characters appear in our populated DRS templates. For further details about the characters which are valid, see assert_all_metadata_substitutions_only_contain_valid_characters and assert_full_filepath_only_contains_valid_characters.

Parameters:

Name Type Description Default
inp str

Input metadata value

required
known_replacements dict[str, str] | None

Known replacements to apply.

If None, we use DEFAULT_REPLACEMENTS.

None
Result

: inp with known replacements applied.

Source code in src/input4mips_validation/cvs/drs.py
def apply_known_replacements(
    inp: str, known_replacements: dict[str, str] | None = None
) -> str:
    """
    Apply known replacements of characters in metadata values

    This helps ensure that only valid characters appear in our populated DRS templates.
    For further details about the characters which are valid,
    see [`assert_all_metadata_substitutions_only_contain_valid_characters`][input4mips_validation.cvs.drs.assert_all_metadata_substitutions_only_contain_valid_characters]
    and [`assert_full_filepath_only_contains_valid_characters`][input4mips_validation.cvs.drs.assert_full_filepath_only_contains_valid_characters].

    Parameters
    ----------
    inp
        Input metadata value

    known_replacements
        Known replacements to apply.

        If `None`, we use
        [`DEFAULT_REPLACEMENTS`][input4mips_validation.cvs.drs.DEFAULT_REPLACEMENTS].

    Result
    ------
    :
        `inp` with known replacements applied.
    """  # noqa: E501
    if known_replacements is None:
        known_replacements = DEFAULT_REPLACEMENTS

    res = inp
    for old, new in known_replacements.items():
        res = res.replace(old, new)

    return res

apply_substitutions(drs_template, substitutions, metadata, to_directory_path, validate_substituted_metadata=True) #

Apply a series of substitutions to a DRS template

Parameters:

Name Type Description Default
drs_template str

DRS template to which the substitutions should be applied

required
substitutions Iterable[DRSSubstitution]

Substitutions to apply

required
metadata dict[str, str]

Metadata from which the substitution values can be retrieved

required
to_directory_path bool

Are the substitutions being applied to create a directory path?

If False, we assume that we are creating a file name.

required
validate_substituted_metadata bool True

Returns:

Type Description
str

DRS template, with all substitutions in substitutions applied

Source code in src/input4mips_validation/cvs/drs.py
def apply_substitutions(
    drs_template: str,
    substitutions: Iterable[DRSSubstitution],
    metadata: dict[str, str],
    to_directory_path: bool,
    validate_substituted_metadata: bool = True,
) -> str:
    """
    Apply a series of substitutions to a DRS template

    Parameters
    ----------
    drs_template
        DRS template to which the substitutions should be applied

    substitutions
        Substitutions to apply

    metadata
        Metadata from which the substitution values can be retrieved

    to_directory_path
        Are the substitutions being applied to create a directory path?

        If `False`, we assume that we are creating a file name.

    validate_substituted_metadata
        Passed to
        [`DRSSubstitution.apply_substitution`][input4mips_validation.cvs.drs.DRSSubstitution.apply_substitution].

    Returns
    -------
    :
        DRS template, with all substitutions in `substitutions` applied
    """
    res = drs_template
    for substitution in substitutions:
        res = substitution.apply_substitution(
            res,
            metadata=metadata,
            to_directory_path=to_directory_path,
            validate_substituted_metadata=validate_substituted_metadata,
        )
        # # TODO: swap to the below
        # try:
        #     res = substitution.apply_substitution(res, metadata=metadata)
        # except MetadataRequiredForSubstitutionMissingError:
        #     # Add information about the full DRS, then raise from

    return res

assert_all_metadata_substitutions_only_contain_valid_characters(metadata, to_directory_path) #

Assert that all the metadata substitutions only contain valid characters

For metadata being applied to a DRS template, only certain characters are allowed.

According to the DRS description:

All strings appearing in the file name are constructed using only the following characters: a-z, A-Z, 0-9, and the hyphen ("-"), except the hyphen must not appear in variable_id. Underscores are prohibited throughout except as shown in the template.

We prohibit the use of underscores in the filenames, following the DRS description. However, we've clearly ignored the rule about no hyphens in variable IDs in input4MIPs, so we allow hyphens to appear in the variable ID part of the file name. Hyphens will only appear in the variable ID part of the file name when the original variable had underscores, and these underscores have been replaced with hyphens to avoid breaking the DRS.

Nothing is said about the directory names, so all values are allowed for directory names.

Parameters:

Name Type Description Default
metadata dict[str, str]

Metadata substitutions to check

required
to_directory_path bool

Are the substitutions being applied to create a directory path?

If False, we assume that we are creating a file name.

required

Raises:

Type Description
ValueError

One of the metadata substitutions contains invalid characters

See Also

assert_full_filepath_only_contains_valid_characters

Source code in src/input4mips_validation/cvs/drs.py
def assert_all_metadata_substitutions_only_contain_valid_characters(
    metadata: dict[str, str],
    to_directory_path: bool,
) -> None:
    """
    Assert that all the metadata substitutions only contain valid characters

    For metadata being applied to a DRS template, only certain characters are allowed.

    According to [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk):

    > All strings appearing in the file name
    > are constructed using only the following characters:
    > a-z, A-Z, 0-9, and the hyphen ("-"),
    > except the hyphen must not appear in variable_id.
    > Underscores are prohibited throughout except as shown in the template.

    We prohibit the use of underscores in the filenames, following the DRS description.
    However, we've clearly ignored the rule
    about no hyphens in variable IDs in input4MIPs,
    so we allow hyphens to appear in the variable ID part of the file name.
    Hyphens will only appear in the variable ID part of the file name
    when the original variable had underscores,
    and these underscores have been replaced with hyphens to avoid breaking the DRS.

    Nothing is said about the directory names,
    so all values are allowed for directory names.

    Parameters
    ----------
    metadata
        Metadata substitutions to check

    to_directory_path
        Are the substitutions being applied to create a directory path?

        If `False`, we assume that we are creating a file name.

    Raises
    ------
    ValueError
        One of the metadata substitutions contains invalid characters

    See Also
    --------
    [`assert_full_filepath_only_contains_valid_characters`][input4mips_validation.cvs.drs.assert_full_filepath_only_contains_valid_characters]
    """
    # Hard-code according to the spec
    if to_directory_path:
        # Truth is that this is probably even wider than this, but ok
        valid_chars = set(string.ascii_letters + string.digits + "-" + "_")

    else:
        valid_chars = set(string.ascii_letters + string.digits + "-")

    for k, v in metadata.items():
        # Special case for variable_id would go here if we applied it
        try:
            assert_only_valid_chars(v, valid_chars=valid_chars)
        except ValueError as exc:
            msg = f"Metadata for {k}, {v!r}, contains invalid characters"
            raise ValueError(msg) from exc

assert_full_filepath_only_contains_valid_characters(full_filepath) #

Assert that a filepath only contains valid characters

According to the DRS description:

  • only [a-zA-Z0-9-] are allowed in file path names except where underscore is used as a separator.

Parameters:

Name Type Description Default
full_filepath str | Path

Full filepath (directories and filename) to validate

required

Raises:

Type Description
ValueError

full_filepath contains invalid characters

See Also

assert_all_metadata_substitutions_only_contain_valid_characters

Source code in src/input4mips_validation/cvs/drs.py
def assert_full_filepath_only_contains_valid_characters(
    full_filepath: str | Path,
) -> None:
    """
    Assert that a filepath only contains valid characters

    According to [the DRS description](https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk):

    - only [a-zA-Z0-9-] are allowed in file path names
      except where underscore is used as a separator.

    Parameters
    ----------
    full_filepath
        Full filepath (directories and filename) to validate

    Raises
    ------
    ValueError
        `full_filepath` contains invalid characters

    See Also
    --------
    [`assert_all_metadata_substitutions_only_contain_valid_characters`][input4mips_validation.cvs.drs.assert_all_metadata_substitutions_only_contain_valid_characters]
    """
    # Hard-code according to the spec
    valid_chars = set(string.ascii_letters + string.digits + "-" + "_")
    assert_only_valid_chars(full_filepath, valid_chars=valid_chars)

assert_only_valid_chars(inp, valid_chars) #

Assert that the input only contains valid characters

Parameters:

Name Type Description Default
inp str | Path

Input to validate

required
valid_chars set[str]

Set of valid characters

required

Raises:

Type Description
ValueError

inp contains characters that are not in valid_chars

Source code in src/input4mips_validation/cvs/drs.py
def assert_only_valid_chars(inp: str | Path, valid_chars: set[str]) -> None:
    """
    Assert that the input only contains valid characters

    Parameters
    ----------
    inp
        Input to validate

    valid_chars
        Set of valid characters

    Raises
    ------
    ValueError
        ``inp`` contains characters that are not in ``valid_chars``
    """
    inp_set = set(str(inp))
    invalid_chars = inp_set.difference(valid_chars)

    if invalid_chars:
        msg = (
            f"Input contains invalid characters. "
            f"{inp=}, {sorted(invalid_chars)=}, {sorted(valid_chars)=}"
        )
        raise ValueError(msg)

convert_drs_to_unstructured_cv(drs) #

Convert the data reference syntax (DRS) to the raw CV form

Parameters:

Name Type Description Default
drs DataReferenceSyntax

DRS

required

Returns:

Type Description
Raw CV data
Source code in src/input4mips_validation/cvs/drs.py
def convert_drs_to_unstructured_cv(
    drs: DataReferenceSyntax,
) -> DataReferenceSyntaxUnstructured:
    """
    Convert the data reference syntax (DRS) to the raw CV form

    Parameters
    ----------
    drs
        DRS

    Returns
    -------
        Raw CV data
    """
    raw_cv_form: DataReferenceSyntaxUnstructured = converter_json.unstructure(drs)

    return raw_cv_form

convert_unstructured_cv_to_drs(unstructured) #

Convert the raw CV data to its structured form

Parameters:

Name Type Description Default
unstructured DataReferenceSyntaxUnstructured

Unstructured CV data

required

Returns:

Type Description
Data reference syntax
Source code in src/input4mips_validation/cvs/drs.py
def convert_unstructured_cv_to_drs(
    unstructured: DataReferenceSyntaxUnstructured,
) -> DataReferenceSyntax:
    """
    Convert the raw CV data to its structured form

    Parameters
    ----------
    unstructured
        Unstructured CV data

    Returns
    -------
        Data reference syntax
    """
    return converter_json.structure(unstructured, DataReferenceSyntax)

format_date_for_time_range(date, ds_frequency) #

Format date for providing time range information

Parameters:

Name Type Description Default
date datetime | datetime | datetime64

Date to format

required
ds_frequency str

Frequency of the data in the underlying dataset

required

Returns:

Type Description
Formatted date

Examples:

>>> format_date_for_time_range(dt.datetime(2024, 7, 12), "yr")
'2024'
>>> format_date_for_time_range(dt.datetime(2024, 7, 12), "mon")
'202407'
>>> format_date_for_time_range(dt.datetime(2024, 7, 12), "day")
'20240712'
>>> format_date_for_time_range(dt.datetime(2024, 7, 12, 4, 30, 30), "3hr")
'202407120430'
Source code in src/input4mips_validation/cvs/drs.py
def format_date_for_time_range(
    date: cftime.datetime | dt.datetime | np.datetime64,
    ds_frequency: str,
) -> str:
    """
    Format date for providing time range information

    Parameters
    ----------
    date
        Date to format

    ds_frequency
        Frequency of the data in the underlying dataset

    Returns
    -------
        Formatted date

    Examples
    --------
    >>> format_date_for_time_range(dt.datetime(2024, 7, 12), "yr")
    '2024'
    >>> format_date_for_time_range(dt.datetime(2024, 7, 12), "mon")
    '202407'
    >>> format_date_for_time_range(dt.datetime(2024, 7, 12), "day")
    '20240712'
    >>> format_date_for_time_range(dt.datetime(2024, 7, 12, 4, 30, 30), "3hr")
    '202407120430'
    """
    if isinstance(date, np.datetime64):
        date_safe: cftime.datetime | dt.datetime = pd.to_datetime(str(date))
    else:
        date_safe = date

    if ds_frequency.startswith("mon"):
        return date_safe.strftime("%Y%m")

    if ds_frequency.startswith("yr"):
        return date_safe.strftime("%Y")

    if ds_frequency.startswith("day"):
        return date_safe.strftime("%Y%m%d")

    if ds_frequency.startswith("3hr"):
        return date_safe.strftime("%Y%m%d%H%M")

    raise NotImplementedError(ds_frequency)

get_regexp_from_template_and_substitutions(drs_template, substitutions, capturing_allowed_chars='[a-zA-Z0-9-]') #

Get a capturing regular expression from a template and substitutions

Parameters:

Name Type Description Default
drs_template str

DRS template from which to generate the regexp

required
substitutions Iterable[DRSSubstitution]

Substitutions that can be applied to the DRS template

required
capturing_allowed_chars str

Specification for characters that are allowed in the capturing groups.

'[a-zA-Z0-9-]'

Returns:

Type Description
str

Generated regexp, which will capture metadata according to the DRS

Examples:

>>> template_str = "<model_id>[_<optional_id>]_<time_range>.nc"
>>> substitutions = DataReferenceSyntax.parse_drs_template(template_str)
>>> get_regexp_from_template_and_substitutions(
...     template_str,
...     substitutions,
...     capturing_allowed_chars="[a-z]",
... )
'(?P<model_id>[a-z]+)(_(?P<optional_id>[a-z]+))?_(?P<time_range>[a-z]+).nc'
Source code in src/input4mips_validation/cvs/drs.py
def get_regexp_from_template_and_substitutions(
    drs_template: str,
    substitutions: Iterable[DRSSubstitution],
    capturing_allowed_chars: str = "[a-zA-Z0-9-]",
) -> str:
    """
    Get a capturing regular expression from a template and substitutions

    Parameters
    ----------
    drs_template
        DRS template from which to generate the regexp

    substitutions
        Substitutions that can be applied to the DRS template

    capturing_allowed_chars
        Specification for characters that are allowed in the capturing groups.

    Returns
    -------
    :
        Generated regexp, which will capture metadata according to the DRS

    Examples
    --------
    >>> template_str = "<model_id>[_<optional_id>]_<time_range>.nc"
    >>> substitutions = DataReferenceSyntax.parse_drs_template(template_str)
    >>> get_regexp_from_template_and_substitutions(
    ...     template_str,
    ...     substitutions,
    ...     capturing_allowed_chars="[a-z]",
    ... )
    '(?P<model_id>[a-z]+)(_(?P<optional_id>[a-z]+))?_(?P<time_range>[a-z]+).nc'
    """
    capturing_regexp = drs_template
    for substitution in substitutions:
        capturing_group = substitution.replacement_string.replace(
            "}", f">{capturing_allowed_chars}+)"
        ).replace("{", "(?P<")

        if substitution.optional:
            capturing_group = f"({capturing_group})?"

        capturing_regexp = capturing_regexp.replace(
            substitution.string_to_replace,
            capturing_group,
        )

    return capturing_regexp

key_in_substitutions(key, substitutions) #

Return whether a key is in the DRS or not

Parameters:

Name Type Description Default
key str

Metadata key

required
substitutions tuple[DRSSubstitution, ...]

Substitutions that will be applied to the DRS

required

Returns:

Type Description
`True` if the key is in the DRS, otherwise `False`.
Source code in src/input4mips_validation/cvs/drs.py
def key_in_substitutions(key: str, substitutions: tuple[DRSSubstitution, ...]) -> bool:
    """
    Return whether a key is in the DRS or not

    Parameters
    ----------
    key
        Metadata key

    substitutions
        Substitutions that will be applied to the DRS

    Returns
    -------
        `True` if the key is in the DRS, otherwise `False`.
    """
    return any(key in v.required_metadata for v in substitutions)

key_required_for_substitutions(key, substitutions) #

Return whether a key is required metadata or not for populating the DRS

Parameters:

Name Type Description Default
key str

Metadata key

required
substitutions tuple[DRSSubstitution, ...]

Substitutions that will be applied to the DRS

required

Returns:

Type Description
`True` if the key is required metadata to populate the DRS, otherwise `False`.
Source code in src/input4mips_validation/cvs/drs.py
def key_required_for_substitutions(
    key: str, substitutions: tuple[DRSSubstitution, ...]
) -> bool:
    """
    Return whether a key is required metadata or not for populating the DRS

    Parameters
    ----------
    key
        Metadata key

    substitutions
        Substitutions that will be applied to the DRS

    Returns
    -------
        `True` if the key is required metadata to populate the DRS, otherwise `False`.
    """
    return any(key in v.required_metadata and not v.optional for v in substitutions)

load_drs(raw_cvs_loader, filename=DATA_REFERENCE_SYNTAX_FILENAME) #

Load the DRS in the CVs

Parameters:

Name Type Description Default
raw_cvs_loader RawCVLoader

Loader of raw CVs data.

required
filename str

Name of the file from which to load the CVs.

Passed to raw_cvs_loader.load_raw.

DATA_REFERENCE_SYNTAX_FILENAME

Returns:

Type Description
Loaded DRS
Source code in src/input4mips_validation/cvs/drs.py
def load_drs(
    raw_cvs_loader: RawCVLoader,
    filename: str = DATA_REFERENCE_SYNTAX_FILENAME,
) -> DataReferenceSyntax:
    """
    Load the DRS in the CVs

    Parameters
    ----------
    raw_cvs_loader
        Loader of raw CVs data.

    filename
        Name of the file from which to load the CVs.

        Passed to
        [`raw_cvs_loader.load_raw`][input4mips_validation.cvs.loading_raw.RawCVLoader.load_raw].

    Returns
    -------
        Loaded DRS
    """
    return convert_unstructured_cv_to_drs(
        json.loads(raw_cvs_loader.load_raw(filename=filename))
    )