
Optimisation

Unsorted content

The following content needs review and consolidation.

This page gives an overview of some key areas for optimisation.

Performance Optimisation

The goal is to improve the efficiency of PVGIS by optimizing data structures, refining algorithms, and implementing modern Python practices for asynchronous, concurrent, and parallel executions. Additionally, caching strategies and load balancing are essential for enhancing performance and scalability.

Status

The current Proof-of-Concept (see commit : 5cca629ea186ff3c7711fbdbd8219841caf4d6b) includes, among other elements :

  • a fairly large number of constants (see constants.py)
  • print() statements for output and debugging support, which slow down a program's runtime
  • debugging calls, specifically debug(locals()) from devtools
  • input data validation using Pydantic
  • in-function output data validation
  • custom data classes/Pydantic models
  • use of lists and list comprehensions
  • frequently requested/repeated calculations
  • lack of caching/memoization practices
  • lack of :
    • asynchronous execution,
    • concurrent execution,
    • parallel execution
      • no parallel processing beyond NumPy's own internals (?)
  • use of Pandas' DatetimeIndex, which is not hashable
  • no use of any external compiler or library for High-Performance Computing

Hence, the margin for optimisation is quite large.

Profiling

Before optimising, however, it is important to quantify performance bottlenecks.

Using profiling tools like cProfile for Python, we can analyse and understand which parts of the code consume the most resources.
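As a minimal sketch of such an analysis (the profiled slow_sum function is a made-up placeholder, not actual PVGIS code), cProfile together with pstats can report where time is spent :

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive summation, standing in for a real workload
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Print the five entries with the highest cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Sorting by cumulative time surfaces the call chains worth optimising first; per-call time (tottime) can then pinpoint the individual hot functions.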

Areas for improvement

Drawing on common programming experience, documented in books, articles, software projects, and publicly accessible wikis and forums, we can list some areas for improvement and discuss possible optimisation actions.

Logging

  1. Replace print statements and debug(locals()) with a structured logging framework, such as Python's built-in logging module.

  2. Remove print statements completely and return only JSON or other structured output through the Web API in the production version ?
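A minimal sketch of option 1 with the standard library's logging module (the logger name pvgis.example and the compute function are hypothetical) :

```python
import logging

# Configure logging once, instead of scattering print() calls
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("pvgis.example")

def compute(value):
    # %-style lazy formatting: the message is only built if the level is enabled
    logger.debug("Input value: %r", value)  # skipped entirely at INFO level
    result = value * 2
    logger.info("Computed result: %s", result)
    return result

compute(21)
```

Unlike print(), log calls below the configured level cost almost nothing, and handlers can later redirect output to files or structured (e.g. JSON) formats without touching the call sites.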

Debugging

The debug(locals()) calls from devtools can be optimised (?) or removed completely in the production version to reduce overhead.

Data Validation

  • Avoid redundant checks

  • Consider removing/switching off input data validation for the performance-critical Web API module(s). This should, however, only happen after extensive validation of the fundamental algorithms, the core API and the CLI, which can ensure quality and reproducibility of operations.

Example : Efficient Data Validation with Pydantic
from pydantic import BaseModel

class ExampleModel(BaseModel):
    attribute1: int
    attribute2: str

# Using Pydantic for validation
example = ExampleModel(attribute1=123, attribute2="test")
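If validation is later switched off in hot paths, Pydantic v2 offers model_construct(), which builds a model instance without running validation. A sketch, assuming Pydantic v2 and data already validated upstream (ExampleModel is the illustrative model from above) :

```python
from pydantic import BaseModel

class ExampleModel(BaseModel):
    attribute1: int
    attribute2: str

# Validated construction: Pydantic checks and coerces every field
validated = ExampleModel(attribute1="123", attribute2="test")  # "123" coerced to 123

# model_construct() skips validation entirely (Pydantic v2); only safe
# for trusted, already-validated data, e.g. in a hot Web API path
trusted = ExampleModel.model_construct(attribute1=123, attribute2="test")
print(validated.attribute1, trusted.attribute1)
```

In Pydantic v1 the equivalent method is construct(). Skipping validation trades safety for speed, so it belongs only behind interfaces whose inputs are guaranteed valid.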

Use libraries developed in C/C++

There are numerous libraries/packages developed in C/C++ that can be integrated into Python programs. NumPy and SciPy are prominent examples, known for their effectiveness in handling large datasets.

Use such libraries to speed up operations.

NumPy Arrays

NumPy is the gold standard for scientific and high-performance computing with Python. NumPy arrays significantly outpace plain Python lists when processing massive datasets and performing numerical computations, while also consuming less memory than lists.
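For illustration, a small, hypothetical comparison of a Python list and a NumPy array holding the same values :

```python
import sys
import numpy as np

n = 100_000
python_list = list(range(n))
numpy_array = np.arange(n)

# Vectorised arithmetic runs in optimised C loops instead of the interpreter
squares = numpy_array ** 2

# Rough memory comparison: a list stores pointers to boxed int objects,
# while the array stores raw 64-bit integers contiguously
list_bytes = sys.getsizeof(python_list) + sum(sys.getsizeof(x) for x in python_list)
array_bytes = numpy_array.nbytes
print(f"list: {list_bytes} bytes, array: {array_bytes} bytes")
```

The exact figures depend on the platform and Python version, but the array is consistently several times smaller, and the vectorised operation avoids a per-element interpreter loop entirely.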

Do Not Use the .dot Operation

Dotted attribute lookups may be time-consuming!

Calling a function through a dotted name first invokes __getattribute__() or __getattr__(), which in turn performs a dictionary lookup. This adds overhead on every call. It is therefore recommended to import frequently used functions directly when optimizing Python code for speed.
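A small sketch comparing the two call styles with timeit (the measured difference is typically small and varies by platform, so no specific speed-up is claimed here) :

```python
import math
import timeit
from math import sqrt  # bind the function to a direct name once

# Dotted call: attribute lookup on the module happens on every invocation
dotted = timeit.timeit("math.sqrt(2.0)", globals=globals(), number=100_000)

# Direct call: plain name lookup, no __getattribute__ on each call
direct = timeit.timeit("sqrt(2.0)", globals=globals(), number=100_000)

print(f"math.sqrt: {dotted:.4f}s, sqrt: {direct:.4f}s")
```

The same idea applies inside hot loops: binding a method or function to a local variable before the loop removes the repeated lookup.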

Intern Strings in Python

String interning keeps a single shared copy of a given string in memory. Interned strings can be compared by identity, a fast pointer comparison, instead of character by character, and duplicate values consume no extra memory. CPython automatically interns many short, identifier-like strings; other strings can be interned explicitly with sys.intern().
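A small sketch of explicit interning with sys.intern() (the example strings are arbitrary) :

```python
import sys

# Two equal strings built at runtime are normally distinct objects
a = "".join(["long", "_", "key"])
b = "".join(["long", "_", "key"])
assert a == b  # equal by value; `a is b` is typically False here

# Interning returns the single canonical copy of the string
a_interned = sys.intern(a)
b_interned = sys.intern(b)
assert a_interned is b_interned  # identity: an O(1) pointer comparison
```

This is mainly useful for large numbers of repeated dictionary keys or labels, where both memory and comparison time can be saved.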

Generator expressions

Use generator expressions instead of list comprehensions when the results are consumed only once : they yield values lazily and avoid materialising the whole list in memory.
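A minimal illustration of the memory difference (the sum-of-squares computation is a toy example) :

```python
import sys

n = 100_000
# List comprehension materialises every element immediately
squares_list = [x * x for x in range(n)]
# Generator expression yields values lazily, one at a time
squares_gen = (x * x for x in range(n))

print(sys.getsizeof(squares_list))  # grows with n
print(sys.getsizeof(squares_gen))   # small, independent of n

# A generator feeds an aggregation without an intermediate list
total = sum(x * x for x in range(n))
```

Note that a generator can only be iterated once; if the sequence is needed repeatedly or needs indexing, a list (or a NumPy array) remains the right choice.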

Apply multiple assignments

Instead of doing

a = 3
b = 6
c = 9

better do

a, b, c = 3, 6, 9

This approach is more concise and can slightly speed up code execution.

Peephole Optimization

Code readability need not cost efficiency : CPython's peephole optimiser pre-computes constant expressions at compile time (constant folding), replaces repeated instances with the computed result, and rewrites membership tests on constant literals into faster immutable forms (tuples and frozensets). Writing constant expressions out explicitly therefore costs nothing at runtime while keeping the code readable.
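The effect can be observed in the compiled bytecode with the dis module; a sketch (both functions are toy examples) :

```python
import dis

def folded():
    # 24 * 3600 is folded to the constant 86400 at compile time
    return 24 * 3600

def membership(x):
    # The constant set literal is compiled to a frozenset constant
    return x in {1, 2, 3}

# Inspect the bytecode to see the optimiser's work
dis.dis(folded)
dis.dis(membership)
```

In the disassembly of folded(), only the pre-computed 86400 appears; no multiplication is executed at runtime. Likewise, membership() tests against a frozenset constant rather than rebuilding a set on every call.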

Data structure Optimization

Optimize massive time series data structures in advance so that they are :

  • contiguous in time
  • split into small chunks in space

  • Handle massive time series data programmatically by using efficient data structures like NumPy arrays.
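A sketch of this layout idea with NumPy (the shapes, one hypothetical year of hourly values at 16 locations, are made up for illustration) :

```python
import numpy as np

# Hypothetical hourly time series: one year (8760 h) at 16 locations
hours, locations = 8760, 16
series = np.arange(hours * locations, dtype=np.float64).reshape(hours, locations)

# For per-location processing along time, a transposed, C-contiguous copy
# makes each location's full time series one contiguous block in memory
per_location = np.ascontiguousarray(series.T)  # shape (locations, hours)

# Process a small spatial chunk, each row contiguous in time
chunk = per_location[0:4]      # 4 locations, full year
means = chunk.mean(axis=1)     # mean over time per location
print(means.shape)
```

Keeping the traversal axis contiguous improves cache locality, and small spatial chunks keep working sets within cache-friendly sizes.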

Data Classes

  • Refactor PVGIS' custom Python dataclasses for efficiency
  • Use alternatives from well-known libraries :
    • Python's dataclasses or attrs ?
Example : Python Data Class

Use the @dataclass decorator to automatically add special methods to classes :

from dataclasses import dataclass

@dataclass
class ExampleClass:
    attribute1: int
    attribute2: str

example = ExampleClass(123, "test")

Caching/Memoization Strategies

Intermediate outputs

Some core API functions, though they produce output for different calculated quantities, may depend on identical intermediate components. Hence, it is important to experiment with, understand and apply local and distributed caching strategies.

Caching the output of frequently requested functions or data, for example using lru_cache or similar mechanisms, at the core API level, can decrease the computation time for functions with shared dependencies and consequently reduce response times and server load significantly.

  • For local caching, consider Python's built-in caching tools like functools.lru_cache for caching the output of functions, especially for functions with expensive or repetitive computations.

    Example : Caching/Memoization with functools.lru_cache
    from functools import lru_cache
    
    @lru_cache(maxsize=100)
    def expensive_function(arg):
        # Placeholder for time-consuming computations
        result = arg ** 2
        return result
    
  • The non-hashable nature of Pandas' DatetimeIndex can be a limitation in the context of caching. Are there alternative data structures or methods for handling timestamps ?

    Danger

    lru_cache does not work with non-hashable arguments!

  • For distributed caching, consider tools like Redis or Memcached.

  • Caching repeatedly requested final output calls at the Web API level ?

Asynchronous operations

Asynchronous execution for I/O-bound operations can improve the performance of network operations, the responsiveness and the efficiency of I/O tasks. It can be implemented using Python's asyncio module.

Example : Asynchronous Execution with asyncio
import asyncio

async def async_task():
    # Placeholder for asynchronous I/O operations
    await asyncio.sleep(0.1)
    return "result"

# Running async tasks
result = asyncio.run(async_task())

Concurrent operations

For CPU-bound tasks, explore Python’s multiprocessing or multithreading (if tasks are I/O-bound) to distribute computations and enhance performance.

  • Many in-between calculations do not depend on each other and can, therefore, be executed independently. Use Python's concurrent.futures or similar libraries to manage concurrent tasks.

    Example : Concurrent Executions with concurrent.futures
    from concurrent.futures import ThreadPoolExecutor
    
    def function_to_run_concurrently(arg):
        # Placeholder for the actual operations
        return arg * 2
    
    with ThreadPoolExecutor(max_workers=5) as executor:
        future = executor.submit(function_to_run_concurrently, 21)
        return_value = future.result()
    

Parallel operations

  • Use parallel processing techniques and software to handle intensive computational tasks.

  • Use Python’s multiprocessing module for CPU-bound tasks to distribute computations across multiple cores.

    Example : Parallel Processing with multiprocessing
    from multiprocessing import Pool
    
    def function_to_run_in_parallel(arg):
        # Placeholder for the actual operations
        return arg * 2
    
    if __name__ == "__main__":
        iterable_of_args = [1, 2, 3, 4]
        with Pool(processes=4) as pool:
            results = pool.map(function_to_run_in_parallel, iterable_of_args)
    

Optimizing Pandas Usage

Use vectorized operations and efficient data handling in Pandas.

Example : Vectorised operation using Pandas
import pandas as pd

# Example: Vectorized operation instead of a loop
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['c'] = df['a'] + df['b']  # Vectorized addition

Algorithmic Efficiency

Optimizing the fundamental algorithms and the core API that power PVGIS can reduce computational complexity, which in turn may speed up operations significantly. This is crucial for handling large datasets efficiently and performing complex calculations.

The focus is on :

  • reviewing and refactoring core algorithms to reduce complexity
  • using efficient libraries like NumPy and SciPy systematically for numerical computations
  • applying best programming practices, such as avoiding Python's comparatively slow for loops
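A minimal illustration of the last point, replacing an interpreted loop with a single vectorised call (the sum-of-squares computation is a stand-in for a real kernel) :

```python
import numpy as np

values = np.linspace(0.0, 1.0, 100_000)

# Interpreted Python loop: one bytecode dispatch per element
def loop_sum_of_squares(arr):
    total = 0.0
    for v in arr:
        total += v * v
    return total

# Vectorised equivalent: a single call into NumPy's C internals
def vector_sum_of_squares(arr):
    return float(np.dot(arr, arr))

print(vector_sum_of_squares(values))
```

Both functions compute the same quantity (up to floating-point rounding), but the vectorised version avoids the per-element interpreter overhead entirely.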

High Performance Computation with Python ?

Explore the great potential of using external libraries/frameworks for High Performance Computing to boost the performance of PVGIS.

  • Compilers/Just-in-Time Compilers

    • PyPy: A Just-In-Time (JIT) compiler for Python.
    • mypyc: A compiler that compiles Python to C-extension modules.
    • Pyjion: A JIT compiler for Python, using the .NET CLR.
    • Cython: an optimising static compiler for Python and the Cython language, which allows writing C extensions for Python.

Cython gives you the combined power of Python and C, letting you write Python code that calls natively into C or C++ code.

  • Libraries/Frameworks

    • JAX: A library for numerical computations with auto-differentiation and GPU/TPU support.
    • GT4Py: A framework for writing stencil computations in geosciences.
    • Pythran: A compiler-like tool that converts Python to optimized C++ code, but also acts as a library.
    • DaCe: A framework for data-centric parallel programming with support for Ahead-of-Time (AoT) compilation in addition to JIT.

Load Balancing

Note

A task mainly for, and to be carried out collaboratively with, the IT support team.

  • Implement load balancing
  • Distribute API requests evenly across multiple servers
  • Enhance scalability and reliability

