# Data-Model Engineering
In PVGIS 6, scientific data structures are defined in YAML, not Python; a transformation engine generates Python-native data models from them. This separation lets domain scientists shape models and data structures while developers maintain the transformation engine.
- Core entities are described once in YAML.
- Recursive loaders convert YAML into rich Python dictionaries and Pydantic models.
- Complex relationships may be visualised as graphs to uncover redundant structures, reveal hidden coupling, and guide refactors.
## A layered architecture
### 1. YAML definitions
Atomic YAML files declare data structures (field names, types, units, dependencies) in a clean, programming-language-agnostic format. A scientist defines `GlobalInclinedIrradiance` by listing its physical components (direct beam, diffuse sky, ground reflection) without writing Python.
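Such a definition might read as follows; this is a hypothetical sketch, and the field names below are assumptions rather than the actual PVGIS file contents:

```yaml
name: GlobalInclinedIrradiance
label: Global Inclined Irradiance
description: Total irradiance received on a tilted plane
sections:
  directinclined:
    type: ndarray
    title: Direct beam component
  diffuseinclined:
    type: ndarray
    title: Diffuse sky component
  reflectedinclined:
    type: ndarray
    title: Ground-reflected component
```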
The contents of the `definitions.yaml` directory:

```
definitions.yaml/
├── ReadMe.py
├── atmosphere/
├── attribute/
├── colors.md
├── data_model_template.yaml
├── earth/
├── irradiance/
├── metadata/
├── meteorology/
├── options/
├── output/
├── performance/
├── power/
├── sun/
├── surface/
└── unit/
```
The `require:` directive enables compositional inheritance: a model pulls attributes from multiple parent templates, reusing common patterns (timestamps, location metadata) while adding domain-specific fields (Linke turbidity, albedo, temperature coefficients).
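For illustration, a hypothetical model might compose shared templates with its own fields like this (the paths and field names are invented for the example):

```yaml
name: SolarAltitude
label: Solar Altitude
require:
  - data_model_template      # identifiers, value/unit, metadata
  - metadata/data_source     # provenance fields
sections:
  solaraltitude:
    type: ndarray
    title: Solar altitude angle
    description: Angle of the sun above the horizon
```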
### 2. Definition factory
The `generate.py` script orchestrates a graph-based resolution algorithm.
A safe, reusable command to regenerate `definitions.py` is:

```shell
yes "yes" | rm definitions.py && echo "PVGIS_DATA_MODEL_DEFINITIONS = {}" > definitions.py && python generate.py --log-level DEBUG --log-file definitions.log
```
The script:

- Loads YAML files and parses `require:` directives (inheritance declarations)
- Traverses dependency trees using recursive descent, detecting circular references
- Merges parent attributes into child models via deep-merge logic
- Generates a consolidated `definitions.py` containing fully expanded model specifications
This approach collapses complex inheritance chains (e.g., `SolarAltitude` → `DataModelTemplate` → `BaseTemplate`) into a single, self-contained definition.
A future task for the project is to run this generation automatically at installation time.
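The resolution step can be sketched in a few lines. This is a minimal, dependency-free illustration, not PVGIS's actual `generate.py`: the `load` callable, function names, and error messages are all assumptions.

```python
import copy

def deep_merge(dst, src):
    """Recursively merge src into dst; src (the child) wins on conflicts."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            deep_merge(dst[key], value)
        else:
            dst[key] = value
    return dst

def resolve(name, load, stack=(), cache=None):
    """Expand a definition's require: chain into one flat spec.

    `load(name)` returns the raw dict for a definition, e.g. from
    yaml.safe_load on the matching file. The visited `stack` detects
    circular references; `cache` avoids re-resolving shared parents.
    """
    if cache is None:
        cache = {}
    if name in stack:
        raise ValueError("circular require: " + " -> ".join(stack + (name,)))
    if name not in cache:
        spec = dict(load(name))          # shallow copy so pop() is safe
        parents = spec.pop("require", [])
        merged = {}
        for parent in parents:
            # deepcopy so later merges cannot corrupt cached parent specs
            deep_merge(merged,
                       copy.deepcopy(resolve(parent, load, stack + (name,), cache)))
        deep_merge(merged, spec)         # the child's own keys override parents
        cache[name] = merged
    return cache[name]
```

Handing `resolve` a plain dict's `__getitem__` as the loader makes the merge semantics easy to test without touching the filesystem.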
### 3. Runtime factories
`DataModelFactory` dynamically generates Pydantic models. The factory maps YAML type strings to Python types, injects NumPy array handling, and enables validation that catches errors (missing required fields, incorrect data types, incompatible array shapes, out-of-range values) before calculations proceed.
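To illustrate the type-mapping idea without third-party dependencies, here is a Pydantic-free sketch using dataclasses; the real `DataModelFactory` emits Pydantic models and also handles NumPy arrays, so everything below (names, type map, spec layout) is an assumption:

```python
from dataclasses import make_dataclass

# Maps YAML type strings to Python types; the real factory also
# covers ndarray and unit-aware types (assumed names).
TYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}

def model_from_spec(spec):
    """Build a validating class from a resolved definition (sketch).

    A type-checking __post_init__ stands in for Pydantic's
    field validation."""
    fields = [(fname, TYPE_MAP[meta["type"]])
              for fname, meta in spec["sections"].items()]

    def __post_init__(self):
        # Reject wrongly typed values at instantiation time.
        for fname, ftype in self.__annotations__.items():
            value = getattr(self, fname)
            if not isinstance(value, ftype):
                raise TypeError(
                    f"{fname}: expected {ftype.__name__}, "
                    f"got {type(value).__name__}")

    return make_dataclass(spec["name"], fields,
                          namespace={"__post_init__": __post_init__})
```

As with Pydantic, a bad value fails loudly at construction rather than deep inside a calculation.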
Functions of the context factory transform model instances into structured outputs: they read output structure definitions from YAML, evaluate conditional sections (e.g., verbosity levels), and construct nested dictionaries representing calculation results. This keeps API responses, CLI output, and documentation synchronized.
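A minimal sketch of the idea, assuming a structure format with verbosity-gated sub-sections; the keys `min_verbosity` and `fields` are invented for illustration and are not PVGIS's actual schema:

```python
def build_output(model_dict, structure, verbosity=1):
    """Assemble a nested output dict from model values and an
    output structure definition (hypothetical sketch).

    `structure` maps output keys either to attribute names or to
    conditional sub-sections gated on a minimum verbosity level."""
    out = {}
    for key, spec in structure.items():
        if isinstance(spec, dict):
            if spec.get("min_verbosity", 0) > verbosity:
                continue  # skip sections above the requested detail level
            out[key] = build_output(model_dict, spec["fields"], verbosity)
        else:
            out[key] = model_dict[spec]
    return out
```

Because the same structure drives every consumer, a JSON API response and a CLI table cannot drift apart.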
## Command and data object lifecycle
Data models exist transiently: instantiated on demand, used during calculation, and garbage-collected immediately after.
1. Import
Models are imported from the consolidated definitions.py module.
2. Factory generation
At runtime, DataModelFactory retrieves the model's definition and dynamically generates a Pydantic class.
3. Calculation
Models are typically returned by calculation functions:
```python
def calculate_solar_position(latitude, longitude, timestamp):
    # ... calculations ...
    return SolarAzimuth(
        value=azimuth_array,
        solaraltitude=altitude_array,
        timestamp=timestamp,
        location=(latitude, longitude)
    )
```
Pydantic validation occurs immediately upon instantiation, ensuring downstream functions receive well-formed, type-safe data.
4. Output generation
Each data model embeds an output structure definition describing how its attributes should be presented. The ContextBuilder reads this structure and calls the model's `.build_output()` method.
This populates the `output` attribute automatically, producing a structured dict ready for consumption by:
- Web API endpoints (JSON responses)
- CLI tools (formatted terminal output)
- Core API functions (programmatic access)
5. Expiration
Once output is returned, the model instance is garbage-collected. No persistent state remains between requests, ensuring thread safety and predictable memory usage in multi-user environments.
## Language-Agnostic Philosophy
YAML definitions are intentionally free of programming constructs, so they can be reused in other contexts and programming languages. Another experimental feature embeds dependency annotations (e.g., "GlobalInclinedIrradiance requires: direct beam, diffuse sky, tilt angle") that may serve as executable documentation. The current prototype does not yet fully exploit these annotations, but they enable:
- Cross-platform model reuse (parsers in Julia, R, JavaScript could regenerate workflows)
- Transparent calculation pipelines (researchers see required inputs without reading code)
- Automated dependency graphs (visualize model relationships)
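The dependency-graph idea in the last bullet can be sketched by extracting `require:` edges and emitting Graphviz DOT; this is a stand-in for the gravis-d3 visualiser the CLI actually uses, and the definition layout is assumed:

```python
def require_edges(definitions):
    """Collect (child, parent) edges from raw `require:` lists.

    `definitions` maps model names to their unresolved YAML dicts."""
    edges = []
    for name, spec in definitions.items():
        for parent in spec.get("require", []):
            edges.append((name, parent))
    return edges

def to_dot(edges):
    """Render edges as a Graphviz DOT digraph string."""
    lines = ["digraph data_models {"]
    for child, parent in edges:
        lines.append(f'  "{child}" -> "{parent}";')
    lines.append("}")
    return "\n".join(lines)
```

Piping the resulting string through `dot -Tsvg` would give a static picture of the inheritance graph.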
## Graph visualisation & more
PVGIS can visualise its own data model graphs.
Examples
The `data_model_template.yaml` is:

```yaml
name: DataModelTemplate
label: Data Model Template
description: A generic template for data models
symbol: โ
color: gray
require:
  # Identifier attributes
  - attribute/name
  - attribute/shortname
  - attribute/supertitle
  - attribute/title
  - attribute/label
  - attribute/description
  - attribute/symbol
  # Values
  - attribute/value
  - attribute/unit
  # Metadata attribute
  - attribute/algorithm
  - attribute/equation
  - metadata/data_source
```
and brings in various attributes to build a template data model.
For example, the symbol attribute is itself described in a YAML file:
```yaml
name: SymbolAttribute
label: Symbol
description: Attribute for a symbol of a data model
color: lightgray
sections:
  symbol:
    type: str
    title: Symbol
    description: Symbol for the data model
    initial:
```
We can visualise the template via:

```shell
pvgis-prototype data-model --log-file data_model.log visualise gravis-d3 --yaml-file definitions.yaml/data_model_template.yaml
```

This generates a dynamic, clickable HTML file (an SVG export of it is shown here).
This is the generic template that many data model definitions use.
The most complex structure, the photovoltaic power output data model, can be visualised likewise:

```shell
rm data_model_graph.html  # remove any previous output first -- or fix the CLI to overwrite it!
pvgis-prototype data-model visualise gravis-d3 --yaml-file definitions.yaml/power/photovoltaic.yaml
```

Again, a static image of this graph is unreadable; generate the HTML file and explore it in your browser.
## The trade-off
Why this complexity?
PVGIS comprises a large number of interconnected data models that will evolve as solar research advances. Changes to irradiance algorithms, metadata structures, or output formats propagate through YAML edits, not scattered code modifications. Domain experts can contribute directly to model definitions while developers focus on the transformation engine.
Acknowledged limitations
- Learning curve: understanding `require` chains takes conceptual investment
- Debugging difficulty: YAML merge errors can be opaque, though the build process generates detailed logs
- Build-time dependency: changes require regenerating `definitions.py`
Future work
A refactoring pass will migrate hardcoded values from constants.py into YAML definitions, completing the separation of domain knowledge from implementation.
## This approach is best for...
- Managing dozens of similar but distinct data structures
- Enabling cross-disciplinary collaboration (scientists define models, engineers build infrastructure)
- Supporting rapidly changing domain requirements (new algorithms, extended outputs)
- Ensuring long-term maintainability over immediate simplicity
This architecture prioritizes scientific transparency and future flexibility over ease of onboarding: a deliberate trade-off recognizing that PVGIS models will outlive any single development team.