A production-ready AI orchestration system for CLT-aware signal/noise classification, featuring a Mixture of Experts (MoE) architecture, first-principles model selection, and cognitive-load-aware processing optimized for therapeutic journaling and neurodivergent care.
A modular pipeline that consumes user text and produces intelligent, context-aware responses.
Deterministic pipeline that consumes user text and produces a WorkOrder. Uses heuristics and pattern matching before any LLM is called.
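A minimal sketch of what that deterministic stage could look like. The `WorkOrder` fields, the regex rules, and `build_work_order` are all hypothetical illustrations of "heuristics and pattern matching before any LLM is called", not the project's actual schema:

```python
from dataclasses import dataclass, field
import re

@dataclass
class WorkOrder:
    """Hypothetical WorkOrder: the pipeline's structured output contract."""
    text: str
    task_type: str                      # e.g. "code" or "write"
    risk_flags: list = field(default_factory=list)

# Toy keyword heuristics; a real ruleset would be far richer.
_CODE_PATTERN = re.compile(r"\b(def |class |import |traceback)\b")
_RISK_PATTERN = re.compile(r"\b(hopeless|self-harm)\b", re.IGNORECASE)

def build_work_order(text: str) -> WorkOrder:
    """Classify input deterministically, with no model call."""
    task = "code" if _CODE_PATTERN.search(text) else "write"
    flags = ["risk"] if _RISK_PATTERN.search(text) else []
    return WorkOrder(text=text, task_type=task, risk_flags=flags)
```

Because the stage is pure pattern matching, identical input always yields an identical WorkOrder, which keeps the pre-LLM path testable and cheap.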
First-principles measurement-based system that assigns models to experts using deterministic probe suites (JSON adherence, coding sanity).
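One way such a probe could be scored, sketched under assumptions: `json_adherence_score` and `assign_expert` are invented names, and the 0.9 threshold is illustrative, but the shape (deterministic scoring, then a deterministic argmax) matches the measurement-based idea:

```python
import json

def json_adherence_score(model_outputs):
    """Fraction of probe responses that parse as valid JSON objects."""
    ok = 0
    for raw in model_outputs:
        try:
            ok += isinstance(json.loads(raw), dict)
        except json.JSONDecodeError:
            pass
    return ok / len(model_outputs) if model_outputs else 0.0

def assign_expert(scores, threshold=0.9):
    """Deterministically pick the best-scoring model above a threshold."""
    eligible = {m: s for m, s in scores.items() if s >= threshold}
    return max(eligible, key=eligible.get) if eligible else None
```

Running the same probe suite against each candidate model gives a comparable score table, so assignments are reproducible rather than vibes-based.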
ModelProvider abstraction layer supporting HuggingFace, llama.cpp (GGUF), vLLM, Ollama, and cloud APIs (Groq, OpenAI, Anthropic).
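The abstraction layer presumably reduces each backend to one generation interface. A minimal sketch, assuming a single `generate` method (the real interface likely carries more parameters); `EchoProvider` is a stand-in so the example runs without model weights:

```python
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    """Hypothetical provider interface shared by all backends
    (HuggingFace, llama.cpp, vLLM, Ollama, cloud APIs)."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class EchoProvider(ModelProvider):
    """Trivial concrete backend used only to make the sketch runnable."""

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return prompt[:max_tokens]
```

Experts then depend on `ModelProvider`, so swapping a local GGUF model for a cloud API is a construction-time choice rather than a code change.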
Consumes WorkOrder and routes tasks through specialized Experts (Planner, Code, Writer) assigned by the Selection Engine.
Stores long-term memory, user profiles, and session history in SQLite. Thread-safe with memory-aware LRU eviction.
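A toy version of a thread-safe SQLite session store. The `SessionStore` class, its schema, and its method names are assumptions for illustration; the point is the lock around every database access:

```python
import sqlite3
import threading

class SessionStore:
    """Minimal sketch of a thread-safe, SQLite-backed session history."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        # check_same_thread=False lets one connection serve many threads;
        # the lock serializes access to keep that safe.
        self._db = sqlite3.connect(path, check_same_thread=False)
        self._db.execute("CREATE TABLE IF NOT EXISTS history (user TEXT, msg TEXT)")

    def append(self, user, msg):
        with self._lock:
            self._db.execute("INSERT INTO history VALUES (?, ?)", (user, msg))
            self._db.commit()

    def history(self, user):
        with self._lock:
            rows = self._db.execute(
                "SELECT msg FROM history WHERE user = ?", (user,)
            ).fetchall()
        return [r[0] for r in rows]
```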
Surfaces risk signals and compresses clinical context for mental health professionals. HIPAA-aligned data handling.
Enterprise-grade features for real-world deployment.
Unlike monolithic LLMs, tasks are split into a graph. Specialists handle planning, coding, and writing with optimized model assignments.
HF Transformers for dev, llama.cpp for efficiency, vLLM for serving, cloud APIs for scale. Hybrid local/cloud routing for privacy.
Adaptive streaming of text based on real-time cognitive load metrics. Information delivered at the pace you can process.
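The pacing idea can be sketched as chunk sizing driven by a load score. `paced_chunks` and its word counts are illustrative assumptions, not the project's actual tuning:

```python
def paced_chunks(text, load):
    """Split text into delivery chunks; higher cognitive load -> smaller chunks.

    `load` is a score in [0, 1]. The 12-word ceiling and 2-word floor
    are arbitrary values chosen for this sketch.
    """
    words = text.split()
    size = max(2, int(12 * (1 - load)))
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

At low load the reader gets near-paragraph chunks; at high load the same text arrives a few words at a time.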
Cognitive-load-aware notification clustering by theme. Delivery scheduled based on user's current mental bandwidth.
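The clustering half of this can be shown in a few lines. `cluster_notifications` and the `(theme, message)` tuple shape are assumed for illustration; scheduling against the user's bandwidth would sit on top of this:

```python
from collections import defaultdict

def cluster_notifications(notes):
    """Group (theme, message) pairs so each theme can be delivered as one batch."""
    clusters = defaultdict(list)
    for theme, msg in notes:
        clusters[theme].append(msg)
    return dict(clusters)
```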
Thread-safe ModelCache with memory-aware LRU eviction. Prevents 14GB reloads and manages VRAM/RAM lifecycle.
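A sketch of the memory-aware LRU idea (locking omitted for brevity). The class shape and byte-budget accounting are assumptions; the eviction policy is the standard one: evict least-recently-used entries until the new model fits:

```python
from collections import OrderedDict

class ModelCache:
    """Illustrative memory-aware LRU cache; sizes tracked in bytes."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self._items = OrderedDict()   # name -> (model, size); order = recency
        self._used = 0

    def put(self, name, model, size):
        if name in self._items:
            self._used -= self._items.pop(name)[1]
        while self._items and self._used + size > self.budget:
            _, (_, evicted_size) = self._items.popitem(last=False)  # drop LRU
            self._used -= evicted_size
        self._items[name] = (model, size)
        self._used += size

    def get(self, name):
        if name not in self._items:
            return None
        self._items.move_to_end(name)   # mark as most recently used
        return self._items[name][0]
```

Keeping a loaded model resident is what avoids the multi-gigabyte reloads; the byte budget is what keeps residency from exhausting VRAM/RAM.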
Custom exception hierarchy for granular error recovery. Retry transient timeouts, escalate crisis-level content.
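The retry-vs-escalate split falls out of the hierarchy naturally: catch only the retryable branch. The class and function names below are hypothetical stand-ins for the project's actual hierarchy:

```python
import time

class CLEError(Exception):
    """Hypothetical base of the exception hierarchy."""

class TransientTimeout(CLEError):
    """Retryable: a backend was slow or briefly unavailable."""

class CrisisContent(CLEError):
    """Never retried: escalate to the safety path immediately."""

def with_retries(fn, attempts=3, delay=0.0):
    """Retry only transient failures; everything else propagates at once."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientTimeout:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

Because `CrisisContent` is not caught, it bypasses the retry loop entirely, which is exactly the behavior you want for safety-critical signals.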
Choose the right engine for your deployment scenario.
HF Transformers (Development): Best for development and experimentation. 4-bit/8-bit quantization support.
llama.cpp (Efficiency): Maximum efficiency and broad model support. Optimized for local inference.
vLLM (Serving): Optimized for serving and local server integration. High-throughput inference.
Cloud APIs (Scale): Groq (~500 tok/sec), OpenAI, Anthropic. Hybrid routing between local and cloud.
Modular installation via pyproject.toml for flexible dependency management.
[project.optional-dependencies]
core = [
"transformers",
"torch",
"accelerate"
]
api = [
"fastapi",
"uvicorn",
"pydantic",
"httpx"
]
llamacpp = ["llama-cpp-python"]
vllm = ["httpx"]
full = ["cle[core,api,llamacpp,vllm]"]
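Assuming the package name `cle` (as the `full` extra above suggests) and a standard pip install from this repository, selecting extras would look like:

```shell
# Install only what your deployment needs; extras map to the table above.
pip install "cle[core,api]"     # pipeline plus the HTTP API
pip install "cle[full]"         # everything, including llama.cpp and vLLM support
```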
The CLE is instrumented with the CLE Observer for real-time monitoring and analysis.
Real-time tracking of expert selection for specific inputs
Processing time across the pipeline for bottleneck identification
Safety-triggered refusal analysis to tune policy strictness
Cost and performance tracking across model versions
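The metrics above could be collected by something as small as the following. This `Observer` class is a hypothetical sketch, not the CLE Observer's real API:

```python
from collections import defaultdict

class Observer:
    """Minimal instrumentation sketch: per-stage timings plus event counters."""

    def __init__(self):
        self.timings = defaultdict(list)   # stage name -> list of seconds
        self.counts = defaultdict(int)     # counter name -> occurrences

    def record(self, stage, seconds):
        self.timings[stage].append(seconds)

    def bump(self, counter):
        self.counts[counter] += 1

    def mean_latency(self, stage):
        samples = self.timings[stage]
        return sum(samples) / len(samples) if samples else 0.0
```

Per-stage means surface pipeline bottlenecks, and counters such as safety refusals or per-model costs accumulate in the same structure.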
The brain behind the PRJCT LAZRUS ecosystem. Contact us to learn more.