Performance Optimization Guide ============================== Optimize MechanicsDSL simulations for maximum performance. Numba JIT Acceleration ---------------------- Use the Numba-accelerated solver for 5-10x speedups: .. code-block:: python from mechanics_dsl.solver_numba import NumbaSimulator import sympy as sp # Define equations theta = sp.Symbol('theta') g, l = sp.Symbol('g'), sp.Symbol('l') accelerations = {'theta_ddot': -g/l * sp.sin(theta)} # Create Numba simulator sim = NumbaSimulator() sim.set_parameters({'g': 9.81, 'l': 1.0}) sim.set_initial_conditions({'theta': 0.3, 'theta_dot': 0.0}) sim.compile_equations(accelerations, ['theta']) # Run simulation (5-10x faster than SciPy) solution = sim.simulate_numba( t_span=(0, 100), num_points=10000, method='rk4' # 'euler', 'rk4', or 'rk45' ) Available Methods ~~~~~~~~~~~~~~~~~ - ``euler`` - Simple Euler (fastest, least accurate) - ``rk4`` - 4th order Runge-Kutta (recommended) - ``rk45`` - Adaptive Dormand-Prince (most accurate) GPU Acceleration with CUDA -------------------------- For massive parallelism on NVIDIA GPUs: 1. Generate CUDA code: .. code-block:: python from mechanics_dsl.codegen import CudaGenerator gen = CudaGenerator(...) gen.generate("cuda_output/") 2. Compile with nvcc: .. code-block:: bash cd cuda_output mkdir build && cd build cmake .. make 3. Run on GPU: .. code-block:: bash ./simulation_cuda CPU Fallback ~~~~~~~~~~~~ If no NVIDIA GPU is available, use the CPU version: .. code-block:: bash ./simulation_cpu Multi-Core Parallelism with OpenMP ---------------------------------- For multi-core CPU simulation: .. code-block:: python from mechanics_dsl.codegen import OpenMPGenerator gen = OpenMPGenerator( ..., num_threads=8 # 0 = auto-detect ) gen.generate("simulation_openmp.cpp") Compile with: .. code-block:: bash g++ -fopenmp -O3 -march=native -o simulation simulation_openmp.cpp Memory Optimization ------------------- For large particle simulations: 1. **Use float32** instead of float64 where precision allows 2. **Structure of Arrays (SoA)** layout for cache efficiency 3. **Spatial hashing** for O(n) neighbor search Benchmarking ------------ Run the included benchmark: .. code-block:: bash cd benchmarks python numba_performance.py Expected output: +----------+------------+------------+----------+ | Points | Numba | SciPy | Speedup | +==========+============+============+==========+ | 1,000 | 5 ms | 45 ms | 9x | | 10,000 | 50 ms | 450 ms | 9x | | 100,000 | 500 ms | 4500 ms | 9x | +----------+------------+------------+----------+ Best Practices -------------- 1. **Start with Python** for debugging 2. **Profile first** to identify bottlenecks 3. **Use Numba** for quick wins (no code changes) 4. **Generate C++/CUDA** for production 5. **Batch simulations** with OpenMP for parameter sweeps