Parallel Processing Guide

Overview

The optimized backtest function includes optional parallel processing using Python's ProcessPoolExecutor, allowing you to leverage all CPU cores for massive speedups when running thousands of backtests.

Performance Comparison

For 10,000 backtests:

Configuration	Runtime	Speedup	Best Use Case
Original (sequential)	~180 min	1x	Baseline
Optimized (sequential)	~90 min	2x	Single-core or I/O bound
Optimized (4 cores)	~30 min	6x	Typical laptop
Optimized (8 cores)	~20 min	9x	Desktop/server
Optimized (16 cores)	~15 min	12x	High-end server

When to Use Parallel Processing

✅ USE Parallel Processing When:

Running 1,000+ backtests
CPU-bound operations (complex indicators, many calculations)
Multi-core processor available (4+ cores)
Each backtest is independent
Memory per core is sufficient (~500MB-2GB per worker)

❌ DON'T Use Parallel Processing When:

Running < 100 backtests (overhead > benefit)
I/O bound (disk/network bottleneck)
Limited RAM (< 8GB total)
Single/dual core processor
Strategies share mutable state

Usage

Basic Usage

import os

# Use all available cores
backtests = app.run_vector_backtests_with_checkpoints_optimized(
    initial_amount=1000,
    strategies=strategies,  # 10,000 strategies
    backtest_date_ranges=[date_range_1, date_range_2],
    n_workers=-1,  # Use all CPU cores
    show_progress=True,
)

Recommended Usage

# Leave one core free for system
n_cores = os.cpu_count() - 1

backtests = app.run_vector_backtests_with_checkpoints_optimized(
    initial_amount=1000,
    strategies=strategies,
    backtest_date_ranges=[date_range_1, date_range_2],
    n_workers=n_cores,  # e.g., 7 cores on an 8-core machine
    batch_size=100,
    checkpoint_batch_size=50,
    show_progress=True,
)

Conservative Usage

# Use half available cores (safer for shared systems)
n_cores = max(1, os.cpu_count() // 2)

backtests = app.run_vector_backtests_with_checkpoints_optimized(
    initial_amount=1000,
    strategies=strategies,
    backtest_date_ranges=[date_range_1, date_range_2],
    n_workers=n_cores,
    show_progress=True,
)

Configuration Parameters

`n_workers` Parameter

None (default): Sequential processing (no parallelization)
-1: Use all available CPU cores (os.cpu_count())
N: Use exactly N worker processes

Examples:

n_workers=None  # Sequential (safest, ~90min for 10k)
n_workers=1     # Also sequential
n_workers=4     # Use 4 cores (~30min for 10k)
n_workers=8     # Use 8 cores (~20min for 10k)
n_workers=-1    # Use all cores (~15-20min for 10k on 8-core)

How It Works

Architecture

Main Process:
├── Loads checkpoint cache (shared read-only)
├── Identifies missing backtests
├── Creates ProcessPoolExecutor with N workers
├── Submits backtest tasks to worker pool
└── Collects results and saves in batches

Worker Process 1-N:
├── Receives strategy + parameters
├── Creates isolated app instance
├── Runs backtest independently
├── Returns result to main process
└── Terminates

Key Implementation Details

Process-based parallelism: Uses multiprocessing.ProcessPoolExecutor to bypass Python's GIL
Isolated workers: Each worker creates its own app instance
Batch collection: Results are collected and saved in batches to reduce I/O
Progress tracking: tqdm shows real-time progress across all workers
Error handling: Continues or fails based on continue_on_error flag

Memory Considerations

Estimating Memory Usage

Each worker process needs:

Base: ~200-500MB (Python + framework)
Data: ~100-500MB (OHLCV data for backtest period)
Strategy: ~50-200MB (indicators, calculations)
Total per worker: ~500MB - 2GB

Example for 8 workers:

Conservative: 8 × 500MB = 4GB
Typical: 8 × 1GB = 8GB
Heavy: 8 × 2GB = 16GB

Memory Management Tips

# Monitor memory usage
import psutil

def check_memory():
    mem = psutil.virtual_memory()
    return f"Used: {mem.percent}%, Available: {mem.available / 1024**3:.1f}GB"

# Adjust workers based on available memory
available_gb = psutil.virtual_memory().available / 1024**3
recommended_workers = int(available_gb / 2)  # 2GB per worker
n_workers = min(os.cpu_count() - 1, recommended_workers)

print(f"Using {n_workers} workers (Memory: {check_memory()})")

backtests = app.run_vector_backtests_with_checkpoints_optimized(
    strategies=strategies,
    n_workers=n_workers,
    ...
)

Performance Tuning

Optimal Configuration Formula

import os

# System specs
total_cores = os.cpu_count()
available_ram_gb = psutil.virtual_memory().available / 1024**3

# Calculate optimal settings
n_workers = min(
    total_cores - 1,  # Leave one core free
    int(available_ram_gb / 2),  # 2GB per worker
    len(strategies) // 10  # At least 10 strategies per worker
)

# Batch sizes
checkpoint_batch_size = max(10, n_workers * 5)  # 5 per worker
batch_size = max(50, n_workers * 10)  # 10 per worker

print(f"Optimal config: {n_workers} workers, "
      f"checkpoint_batch={checkpoint_batch_size}, "
      f"batch={batch_size}")

CPU-Bound vs I/O-Bound

CPU-Bound (benefits from parallelization):

Complex indicator calculations
Many mathematical operations
Large datasets
Long time periods

I/O-Bound (may not benefit):

Frequent disk reads/writes
Network-based data fetching
Database queries
Very fast backtests (< 1 second each)

Benchmark Your Setup

import time

strategies_sample = strategies[:100]  # Test with 100 strategies

# Test sequential
start = time.time()
results_seq = app.run_vector_backtests_with_checkpoints_optimized(
    strategies=strategies_sample,
    n_workers=None,  # Sequential
    ...
)
seq_time = time.time() - start

# Test parallel with 4 workers
start = time.time()
results_par4 = app.run_vector_backtests_with_checkpoints_optimized(
    strategies=strategies_sample,
    n_workers=4,
    ...
)
par4_time = time.time() - start

# Test parallel with all cores
start = time.time()
results_par_all = app.run_vector_backtests_with_checkpoints_optimized(
    strategies=strategies_sample,
    n_workers=-1,
    ...
)
par_all_time = time.time() - start

print(f"Sequential: {seq_time:.1f}s")
print(f"Parallel (4): {par4_time:.1f}s (speedup: {seq_time/par4_time:.1f}x)")
print(f"Parallel (all): {par_all_time:.1f}s (speedup: {seq_time/par_all_time:.1f}x)")

# Extrapolate to full dataset
estimated_full = par_all_time * (len(strategies) / 100)
print(f"\nEstimated time for {len(strategies)} strategies: {estimated_full/60:.1f} minutes")

Common Issues and Solutions

Issue 1: "Out of Memory" Errors

Symptoms: Process killed, memory error Solution:

# Reduce number of workers
n_workers = max(1, os.cpu_count() // 2)  # Use half the cores

# Reduce batch sizes
checkpoint_batch_size = 25  # Smaller batches

Issue 2: No Speed Improvement

Symptoms: Parallel is same speed or slower than sequential Causes:

I/O bound workload
Too few strategies (< 100)
Overhead exceeds benefit

Solution:

# Use sequential for small workloads
if len(strategies) < 100:
    n_workers = None
else:
    n_workers = os.cpu_count() - 1

Issue 3: Pickling Errors

Symptoms: "Can't pickle X" errors Cause: Strategy contains unpickleable objects (lambdas, local functions) Solution:

# Make sure strategies are pickleable
# Avoid: lambdas, local functions, file handles in strategy

# Good:
class MyStrategy(TradingStrategy):
    def __init__(self, param1, param2):
        self.param1 = param1  # Simple types only
        super().__init__(...)

# Bad:
class MyStrategy(TradingStrategy):
    def __init__(self):
        self.callback = lambda x: x + 1  # Lambda not pickleable!

Issue 4: Slower on Windows

Symptoms: Windows significantly slower than Linux/Mac Cause: Windows process creation overhead Solution:

# Use fewer workers on Windows
import platform

if platform.system() == 'Windows':
    n_workers = max(1, os.cpu_count() // 2)
else:
    n_workers = os.cpu_count() - 1

Platform-Specific Notes

Linux (Best Performance)

Fast process forking
Efficient memory sharing
Recommended: Use 90% of cores

macOS (Good Performance)

Similar to Linux
May have stricter memory limits
Recommended: Use 80% of cores

Windows (Moderate Performance)

Slower process creation
Higher memory overhead
Recommended: Use 50% of cores

Advanced: Custom Worker Function

If you need custom worker behavior, you can modify _run_single_backtest_worker:

@staticmethod
def _run_single_backtest_worker(args):
    """Custom worker with profiling."""
    import cProfile
    import pstats
    from io import StringIO
    
    # Profile the backtest
    profiler = cProfile.Profile()
    profiler.enable()
    
    # Run backtest (existing code)
    result = ... # existing worker code
    
    profiler.disable()
    
    # Optional: Log profiling stats
    s = StringIO()
    ps = pstats.Stats(profiler, stream=s).sort_stats('cumtime')
    ps.print_stats(10)
    
    return result

Example: Complete Optimized Setup

import os
import psutil
from investing_algorithm_framework import create_app

# Configuration
def get_optimal_config():
    total_cores = os.cpu_count()
    available_ram_gb = psutil.virtual_memory().available / 1024**3
    
    # Calculate workers
    n_workers = min(
        total_cores - 1,
        int(available_ram_gb / 2),
        8  # Cap at 8 for stability
    )
    
    # Calculate batch sizes
    checkpoint_batch = max(25, n_workers * 5)
    batch = max(50, n_workers * 10)
    
    return n_workers, checkpoint_batch, batch

# Setup
app = create_app(name="OptimizedBacktester")
app.add_market(market="BITVAVO", trading_symbol="EUR", initial_balance=1000)

# Get optimal config
n_workers, checkpoint_batch, batch = get_optimal_config()

print(f"Configuration:")
print(f"  Workers: {n_workers}")
print(f"  Checkpoint batch: {checkpoint_batch}")
print(f"  Batch size: {batch}")
print(f"  Total strategies: {len(strategies)}")
print(f"  Expected time: ~{len(strategies) / (100 * n_workers):.1f} minutes")

# Run optimized backtests
backtests = app.run_vector_backtests_with_checkpoints_optimized(
    initial_amount=1000,
    strategies=strategies,
    backtest_date_ranges=date_ranges,
    n_workers=n_workers,
    batch_size=batch,
    checkpoint_batch_size=checkpoint_batch,
    show_progress=True,
    continue_on_error=True,  # Don't stop on individual failures
)

print(f"\nCompleted {len(backtests)} backtests successfully!")

Summary

✅ Use n_workers=-1 for maximum speed on dedicated machines ✅ Use n_workers=os.cpu_count()-1 for shared/production systems ✅ Use n_workers=None for < 100 backtests or debugging ✅ Monitor memory and adjust workers accordingly ✅ Benchmark first with a small sample to find optimal settings ✅ Consider platform (Linux > macOS > Windows for parallel performance)

With proper configuration, parallel processing can reduce 10,000 backtest runtime from 3 hours to 15-20 minutes!

Parallel Processing Guide

Overview​

Performance Comparison​

When to Use Parallel Processing​

✅ USE Parallel Processing When:​

❌ DON'T Use Parallel Processing When:​

Usage​

Basic Usage​

Recommended Usage​

Conservative Usage​

Configuration Parameters​

n_workers Parameter​

How It Works​

Architecture​

Key Implementation Details​

Memory Considerations​

Estimating Memory Usage​

Memory Management Tips​

Performance Tuning​

Optimal Configuration Formula​

CPU-Bound vs I/O-Bound​

Benchmark Your Setup​

Common Issues and Solutions​

Issue 1: "Out of Memory" Errors​

Issue 2: No Speed Improvement​

Issue 3: Pickling Errors​

Issue 4: Slower on Windows​

Platform-Specific Notes​

Linux (Best Performance)​

macOS (Good Performance)​

Windows (Moderate Performance)​

Advanced: Custom Worker Function​

Example: Complete Optimized Setup​

Summary​