Optimisation
Unsorted content
The following content needs review and consolidation.
This page gives an overview of some key areas for optimisation.
Performance Optimisation
The goal is to improve the efficiency of PVGIS by optimizing data structures, refining algorithms, and implementing modern Python practices for asynchronous, concurrent, and parallel executions. Additionally, caching strategies and load balancing are essential for enhancing performance and scalability.
Status¶
The current Proof-of-Concept (see commit 5cca629ea186ff3c7711fbdbd8219841caf4d6b) includes, among other elements:

- quite a few constants (see `constants.py`)
- `print()` statements for output and debugging support, which slow down a program's runtime
- debugging calls, specifically `debug(locals())` from `devtools`
- input data validation using Pydantic
- in-function output data validation
- custom data classes/Pydantic models
- use of lists and list comprehensions
- frequently requested/repeated calculations
- lack of caching/memoization practices
- lack of:
    - asynchronous executions,
    - concurrent executions,
    - parallel executions
- no parallel processing beyond NumPy's own internals (?)
- use of Pandas' `DatetimeIndex`, which is not hashable
- no use of any external compiler or library for High-Performance Computing
Hence, the margin for optimisation is quite large.
Profiling¶
Before optimising, however, it is important to quantify performance bottlenecks.
Using profiling tools like `cProfile`, we can analyse and understand which parts of the code consume the most resources.
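A minimal sketch of profiling with `cProfile` and `pstats` from the standard library; the function being profiled is a hypothetical stand-in for a PVGIS calculation:

```python
import cProfile
import io
import pstats

def simulate_pv_output(n: int) -> float:
    # Hypothetical stand-in for an expensive PVGIS calculation
    return sum(i ** 0.5 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
simulate_pv_output(100_000)
profiler.disable()

# Collect the statistics, sorted by cumulative time, into a string
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # the ten most expensive calls
report = stream.getvalue()
```

Alternatively, a whole script can be profiled from the command line with `python -m cProfile -s cumulative script.py`.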
Areas for improvement¶
Out of common/public programmatic experience, documented in books, articles, software projects, publicly accessible wikis and fora, we can list ahead some areas for improvement and discuss possible optimisation actions.
Logging¶
- Replace `print` statements and `debug(locals())` with a structured logging framework like:
    - Python's `logging` module
    - `loguru`
    - `structlog`
- Remove `print` statements completely and return only JSON or other structured output through the Web API in the production version?
Debugging¶
The `debug(locals())` calls from `devtools` can be optimised (?) or removed completely in the production version to reduce overhead.
Data Validation¶
- Avoid redundant checks
- Consider removing/switching off the input data validation for the performance-critical Web API module(s), albeit only after extensive validation of the fundamental algorithms, the core API and the CLI, which can ensure quality and reproducibility of operations.
Example : Efficient Data Validation with Pydantic
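A minimal sketch of such validation, assuming Pydantic v2; the model and field names are hypothetical. Constrained fields validate once at the API boundary, while `model_construct()` skips validation entirely on trusted internal paths:

```python
from pydantic import BaseModel, Field, ValidationError

class SolarRequest(BaseModel):
    # Hypothetical input model; constraints run once, at construction
    latitude: float = Field(ge=-90.0, le=90.0)
    longitude: float = Field(ge=-180.0, le=180.0)
    peak_power_kw: float = Field(default=1.0, gt=0.0)

# Validated path: e.g. at the Web API boundary
request = SolarRequest(latitude=45.8, longitude=8.6)

# Trusted path: bypass validation for already-checked internal data
trusted = SolarRequest.model_construct(
    latitude=45.8, longitude=8.6, peak_power_kw=1.0
)
```

Validating once at the boundary and then passing typed objects around avoids the redundant in-function checks listed above.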
Use libraries developed in C/C++¶
There are numerous libraries/packages developed in C/C++ that can be integrated into Python programs. NumPy and SciPy are prominent examples, known for their effectiveness in handling large datasets.
Use such libraries to speed up operations.
NumPy Arrays¶
NumPy is the gold standard for scientific and high-performance computing with Python. NumPy arrays significantly outpace common Python lists when processing massive data and performing numerical computations, while consuming less memory.
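A small sketch contrasting the two approaches on the same computation (sum of squares over a million elements):

```python
import numpy as np

n = 1_000_000
values = list(range(n))            # one boxed Python object per element
arr = np.arange(n)                 # contiguous buffer of machine integers

# List: Python-level loop, attribute lookups and boxing on every step
total_list = sum(v * v for v in values)

# NumPy: vectorised elementwise multiply and sum in compiled C loops
total_arr = int((arr * arr).sum())
```

Both produce the same result, but the vectorised version typically runs an order of magnitude faster and the array occupies a fraction of the list's memory.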
Do Not Use .dot Operation¶
Dot operations may be time-consuming!
Calling a function through a dotted attribute (e.g. `module.function()`) first invokes `__getattribute__()` or `__getattr__()`, which performs a dictionary lookup. This adds overhead on every call. When optimising Python code for speed, it is recommended to import the needed functions directly into the local namespace.
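A quick way to observe the difference with the standard library's `timeit` (absolute numbers vary by machine, so none are asserted here):

```python
import timeit

# Attribute access: "sqrt" is looked up in the math module dict on every call
t_attribute = timeit.timeit("math.sqrt(2.0)", setup="import math", number=200_000)

# Direct name: the function is bound once; no per-call attribute lookup
t_direct = timeit.timeit("sqrt(2.0)", setup="from math import sqrt", number=200_000)
```

The same trick applies inside hot loops: binding a method to a local variable before the loop (`append = results.append`) removes the repeated lookup.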
Intern Strings in Python¶
CPython automatically interns some strings (identifiers and short compile-time constants). Identical interned strings share a single object, so equality comparisons can reduce to fast identity checks, and dictionary lookups on such keys are quicker. `sys.intern()` can be applied explicitly to strings that are reused heavily, for example as frequent dictionary keys.
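A minimal sketch with `sys.intern()`; the key names are hypothetical:

```python
import sys

# Strings built at runtime are generally not interned automatically
key_a = "station_" + str(42)
key_b = "station_" + str(42)
# key_a == key_b holds, but key_a and key_b are usually distinct objects

# sys.intern() guarantees one shared object per distinct string value,
# so equal interned strings compare by pointer identity
interned_a = sys.intern(key_a)
interned_b = sys.intern(key_b)
```

This is only worthwhile for strings compared or hashed very frequently; interning everything wastes memory in the intern table.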
Generator expressions¶
Use generator expressions instead of list comprehensions when the result is only iterated once, to avoid materialising intermediate lists in memory.
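A short sketch of the memory difference, measured with `sys.getsizeof`:

```python
import sys

n = 100_000
squares_list = [i * i for i in range(n)]   # materialises every element up front
squares_gen = (i * i for i in range(n))    # lazy: elements produced on demand

# Aggregations can consume the generator directly, never storing the list
total = sum(squares_gen)

# The generator object stays tiny regardless of n
size_ratio = sys.getsizeof(squares_list) / sys.getsizeof(squares_gen)
```

Note that a generator can only be consumed once; if the same sequence is needed repeatedly, a list (or a NumPy array) is still the right choice.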
Apply multiple assignments¶
Instead of assigning each variable in a separate statement, assign several variables in one multiple-assignment statement.
This approach optimises and speeds up the code execution.
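A minimal sketch; the variable names are hypothetical PV parameters:

```python
# Instead of three separate statements ...
tilt = 35
azimuth = 0
peak_power = 1.0

# ... pack and unpack in a single statement
tilt, azimuth, peak_power = 35, 0, 1.0

# Tuple assignment also swaps values without a temporary variable
tilt, azimuth = azimuth, tilt
```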
Peephole Optimization¶
Code readability often comes with a cost in terms of efficiency. CPython's peephole optimiser recovers some of it automatically: it pre-computes constant expressions at compile time, replaces each repeated instance with the result, and optimises membership tests against literal containers. Writing code that takes advantage of these optimisations helps avoid performance decrease and boosts software performance.
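A small sketch of constant folding in CPython, verifiable by inspecting the compiled code object (function names are illustrative):

```python
def seconds_per_day() -> int:
    # The peephole optimiser folds 24 * 60 * 60 to 86400 at compile time,
    # so the multiplication never happens at runtime
    return 24 * 60 * 60

def is_weekend(day: str) -> bool:
    # Membership tests against literal sets are compiled to frozenset
    # constants, giving O(1) lookups
    return day in {"Sat", "Sun"}

# The folded constant appears directly among the function's constants
folded = 86400 in seconds_per_day.__code__.co_consts
```

Writing `24 * 60 * 60` therefore keeps the code self-documenting without paying for the arithmetic on every call.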
Data structure Optimization¶
Optimise massive time series data structures in advance:

- to be contiguous in time
- to be split into small chunks of data in space

Handle massive time series data programmatically by using efficient data structures like NumPy arrays.
Data Classes¶
- Refactor PVGIS' custom Python `dataclasses` for efficiency
- Use alternatives from well-known libraries:
    - Python's `dataclasses` or `attrs`?

Example : Python Data Class

Use a Python `dataclass` as a decorator to add special methods to classes:
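A minimal sketch with the standard-library `dataclasses` module; the `Location` model is hypothetical. The decorator generates `__init__`, `__repr__` and `__eq__` automatically:

```python
from dataclasses import dataclass, field

@dataclass  # add slots=True on Python >= 3.10 for lower per-instance memory
class Location:
    # Hypothetical PVGIS-style model
    latitude: float
    longitude: float
    elevation: float = 0.0
    tags: list[str] = field(default_factory=list)  # safe mutable default

ispra = Location(latitude=45.81, longitude=8.63)
```

`attrs` offers the same pattern with extra features (validators, converters); benchmarking both against the current custom classes would show which is the better fit.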
Caching/Memoization Strategies¶
Intermediate outputs¶
Some core API functions, though they produce output for different calculated quantities, may depend on identical intermediate components. Hence, it is important to experiment with, understand and apply local and distributed caching strategies.
Caching the output of frequently requested functions or data, for example using lru_cache or similar mechanisms, at the core API level, can decrease the computation time for functions with shared dependencies and consequently reduce response times and server load significantly.
- For local caching, consider Python's built-in caching tools like `functools.lru_cache` for caching the output of functions, especially for functions with expensive or repetitive computations.
- The non-hashable nature of Pandas' `DatetimeIndex` can be a limitation in the context of caching. Are there alternative data structures or methods for handling timestamps?

    Danger

    Does not work with non-hashable data structures!

- For distributed caching, consider tools like Redis or Memcached.
- Caching repeatedly requested final output calls at the Web API level?
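A minimal sketch with `functools.lru_cache`; the declination formula (Cooper's equation) stands in for a shared intermediate computation, and the tuple conversion illustrates one workaround for non-hashable timestamp containers:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=1024)
def solar_declination(day_of_year: int) -> float:
    # Hypothetical shared intermediate; lru_cache requires hashable arguments
    return 23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0))

def declinations(days: tuple[int, ...]) -> list[float]:
    # A DatetimeIndex is not hashable, so convert timestamps to a hashable
    # form (here: a tuple of day-of-year integers) before hitting the cache
    return [solar_declination(d) for d in days]
```

Repeated calls with the same argument are served from the cache; `solar_declination.cache_info()` exposes hit/miss counters for tuning `maxsize`.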
Asynchronous operations¶
Asynchronous execution for I/O-bound operations can improve the performance of network operations and the responsiveness and efficiency of I/O tasks. It can be implemented using Python's `asyncio` module.
Example : Asynchronous Execution with asyncio
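A minimal sketch with `asyncio.gather`; the fetch function is hypothetical, with `asyncio.sleep` standing in for a real network or database call:

```python
import asyncio

async def fetch_series(location: str) -> dict:
    # Hypothetical I/O-bound task; sleep stands in for a network call
    await asyncio.sleep(0.01)
    return {"location": location, "status": "ok"}

async def main() -> list:
    # Run the I/O-bound requests concurrently instead of one after another;
    # gather preserves the input order of results
    return await asyncio.gather(
        *(fetch_series(loc) for loc in ("Ispra", "Petten"))
    )

results = asyncio.run(main())
```

With sequential awaits the total time is the sum of the waits; with `gather` it is roughly the longest single wait.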
Concurrent operations¶
For CPU-bound tasks, explore Python’s multiprocessing or multithreading (if tasks are I/O-bound) to distribute computations and enhance performance.
- Many intermediate calculations do not depend on each other and can, therefore, be executed independently. Use Python's `concurrent.futures` or similar libraries to manage concurrent tasks.

    Example : Concurrent Executions with `concurrent.futures`

- For independent calculations, explore libraries like `concurrent.futures` to manage concurrent tasks efficiently.
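A minimal sketch with `concurrent.futures.ThreadPoolExecutor`; the component function is a hypothetical independent intermediate calculation:

```python
from concurrent.futures import ThreadPoolExecutor

def component(x: float) -> float:
    # Hypothetical independent intermediate calculation
    return x * x

inputs = [1.0, 2.0, 3.0, 4.0]
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() submits all tasks at once and yields results in input order
    results = list(executor.map(component, inputs))
```

For truly CPU-bound components, swapping in `ProcessPoolExecutor` keeps the same API while sidestepping the GIL.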
Parallel operations¶
- Use parallel processing techniques and software to handle intensive computational tasks.
- Use Python's `multiprocessing` module for CPU-bound tasks to distribute computations across multiple cores.
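A minimal sketch with `multiprocessing.Pool`; the kernel function is hypothetical. The `__main__` guard is required so that worker processes can import the module safely:

```python
from multiprocessing import Pool

def heavy_kernel(x: float) -> float:
    # Hypothetical CPU-bound computation
    return x ** 0.5

if __name__ == "__main__":
    # Workers are separate processes, so they bypass the GIL entirely
    with Pool(processes=2) as pool:
        results = pool.map(heavy_kernel, [1.0, 4.0, 9.0])
```

Process start-up and inter-process data transfer have a cost, so this pays off only when each task does substantial work relative to the size of its inputs and outputs.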
Optimizing Pandas Usage¶
Use vectorized operations and efficient data handling in Pandas.
Example : Vectorised operation using Pandas
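A minimal sketch of a vectorised column operation; the column names and the 20 % efficiency factor are hypothetical placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "ghi": [100.0, 250.0, 400.0],   # global horizontal irradiance, W/m^2
    "area": [1.6, 1.6, 1.6],        # module area, m^2
})

# Vectorised: whole-column arithmetic in compiled code,
# instead of a Python-level loop over rows (iterrows/apply)
df["power"] = df["ghi"] * df["area"] * 0.2   # hypothetical 20 % efficiency
```

Row-wise `apply` or `iterrows` on large frames can be orders of magnitude slower than the column expression above.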
Algorithmic Efficiency¶
Optimizing the fundamental algorithms and the core API that power PVGIS can reduce computational complexity, which in turn may speed up operations significantly. This is crucial for handling large datasets efficiently and performing complex calculations.
The focus is on :
- reviewing and refactoring core algorithms to reduce complexity
- systematically using efficient libraries like NumPy and SciPy for numerical computations
- following best programming practices, like avoiding Python's comparatively inefficient `for` loops
High Performance Computation with Python ?¶
Explore the great potential of using external libraries/frameworks for High Performance Computing to boost the performance of PVGIS.
- Compilers/Just-in-Time Compilers
    - PyPy: A Just-In-Time (JIT) compiler for Python.
    - mypyc: A compiler that compiles Python to C-extension modules.
    - Pyjion: A JIT compiler for Python, using the .NET CLR.
    - Cython: an optimising static compiler for Python and Cython that allows writing C extensions for Python. Cython gives you the combined power of Python and C, letting you call back and forth between Python and C or C++ code natively.
- Libraries/Frameworks
    - Jax: A library for numerical computations with auto-differentiation and GPU/TPU support.
    - GT4Py: A framework for writing stencil computations in geosciences.
    - Pythran: A compiler-like tool that converts Python to optimized C++ code, but also acts as a library.
    - Dace: A framework for data-centric parallel programming with support for Ahead-of-Time (AoT) compilation in addition to JIT.
Load Balancing¶
Note
A task mainly for and to work collaboratively with the IT support team
- Implement load balancing
- Distribute API requests evenly across multiple servers
- Enhance scalability and reliability
- Collaborate with the IT support team for implementing load balancing. This includes distributing API requests across servers and enhancing system scalability and reliability.