LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Updated Jun 12, 2024 - Python
The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!
A high-throughput and memory-efficient inference and serving engine for LLMs
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference
AICI: Prompts as (Wasm) Programs
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
The simplest way to serve AI/ML models in production
A scalable inference server for models optimized with OpenVINO™
Standardized Serverless ML Inference Platform on Kubernetes
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
Examples of serving LLMs on Modal.
Tools for easing the handoff between AI/ML and App/SRE teams.
Hopsworks - Data-Intensive AI platform with a Feature Store
🏕️ Reproducible development environment
📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to make full use of your compute hardware