Mastering Local AI

Your complete guide to RKLLama, RKNN, and building on-device super agents on Rockchip hardware

RK3588 • RK3576 • NPU Acceleration

Core Technologies

Two powerful approaches for deploying LLMs on Rockchip NPU hardware

RKLlama

Community • GPL-3.0 • 221 stars

Ollama alternative for Rockchip NPU with simplified deployment, familiar APIs, and direct HuggingFace integration. Perfect for rapid prototyping and development.

Ollama-compatible API endpoints
Dynamic model loading/unloading
Tool/Function calling support
Direct HuggingFace integration
Version 0.0.42 • Python 3.8-3.12

RKNN Toolkit

Official • Rockchip • 1.8k stars

Official Rockchip toolkit with advanced quantization, optimal performance, and comprehensive model support. Production-ready with enterprise features.

Advanced quantization (w4a16, w8a8)
Multi-batch inference support
Cross-attention inference
Multimodal model support
Version 1.2.1 • Python 3.8-3.12

Both approaches support RK3588, RK3576, and other Rockchip NPU platforms

🔥 NPU Acceleration • ⚡ Real-time Inference • 🛠️ Tool Integration

RKLlama Deep Dive

Community-driven Ollama alternative optimized for Rockchip NPU with simplified deployment and familiar APIs

Installation Guide
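A typical install clones the repository and runs its setup script. The commands below are a sketch based on the upstream README; exact steps and script names can differ between releases, so check the RKLlama repository for your platform.

RKLlama Installation
# Clone the community repository and run its setup script
git clone https://github.com/notpunchnox/rkllama
cd rkllama
chmod +x setup.sh
./setup.sh

# Start the API server (defaults to port 8080)
rkllama serve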

Key Features

Ollama Compatibility

Drop-in replacement for Ollama with familiar API endpoints

Dynamic Model Management

Load and unload models at runtime without restart

HuggingFace Integration

Direct model pulling from HuggingFace repositories

Tool/Function Calling

Complete support for external API integration

Streaming Responses

Real-time token generation for responsive UIs

API Usage Examples

Interactive chat with conversation history and context:

Chat API Request
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:3b",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain how NPUs work"}
    ],
    "stream": true
  }'

Response Format:

{
  "model": "qwen2.5:3b",
  "created_at": "2024-01-01T00:00:00Z",
  "message": {
    "role": "assistant",
    "content": "NPUs (Neural Processing Units) are specialized..."
  },
  "done": false
}
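Tool/function calling uses the same chat endpoint. The request below is a sketch following the Ollama tool schema, which RKLlama's compatible API is expected to accept; get_weather is a hypothetical function used only for illustration.

Tool Calling Request
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:3b",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'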

Model Management
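Models can be listed and pulled through the same HTTP API. The calls below assume the Ollama-compatible routes described above; the model name is illustrative, and route names may vary by RKLlama version.

Model Management Requests
# List installed models
curl http://localhost:8080/api/tags

# Pull a model (name is illustrative)
curl -X POST http://localhost:8080/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "qwen2.5:3b"}'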

RKNN Toolkit Deep Dive

Official Rockchip toolkit with advanced quantization, optimal performance, and enterprise-grade features

Installation & Setup
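Setup happens in two parts: the conversion toolkit runs on an x86 Linux host, while the lightweight runtime runs on the Rockchip board. The commands below are a sketch; wheel paths and filenames vary by release, so match them to your checkout and Python version.

RKNN-LLM Installation
# On the x86 development host: clone the official repository
git clone https://github.com/airockchip/rknn-llm
cd rknn-llm

# Install the conversion toolkit wheel
# (path and filename are illustrative; pick the wheel matching your Python)
pip install rkllm-toolkit/rkllm_toolkit-1.2.1-cp310-cp310-linux_x86_64.whl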

Performance Benchmarks

Tokens per second across different models and platforms

* Benchmarks from RKNN-LLM v1.1.0 with 64 sequence length, 320 max context, 256 new tokens

Quantization Strategies

8-bit Weights, 8-bit Activations

Best performance on RK3588 with 6 TOPS NPU. Provides excellent speed with minimal accuracy loss.

Characteristics:
  • Memory Usage: ~50% of original model size
  • Accuracy: Minimal quantization loss
  • Speed: Excellent inference speed
  • Best For: RK3588, production deployments
w8a8 Configuration
# Illustrative quantization settings (not a specific toolkit API)
quantization_config = {
    "weight_bits": 8,                 # 8-bit weights
    "activation_bits": 8,             # 8-bit activations
    "algorithm": "kld",               # KL-divergence calibration
    "calibration_dataset": "custom",  # representative prompts for calibration
    "batch_size": 16
}
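In practice these choices map onto the rkllm-toolkit conversion API. The sketch below follows the pattern in the official rknn-llm examples; argument names can shift between toolkit versions, and the model paths are illustrative.

w8a8 Model Conversion (Python)
# Convert a HuggingFace model to .rkllm with w8a8 quantization
from rkllm.api import RKLLM

llm = RKLLM()

# Load a local HuggingFace checkpoint (path is illustrative)
ret = llm.load_huggingface(model="./Qwen2.5-3B-Instruct")
assert ret == 0, "model load failed"

# Quantize and compile for the RK3588 NPU
ret = llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8",     # 8-bit weights, 8-bit activations
    target_platform="rk3588",
)
assert ret == 0, "build failed"

# Export the deployable artifact
ret = llm.export_rkllm("./qwen2.5-3b-w8a8.rkllm")
assert ret == 0, "export failed"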

Advanced Features (v1.2.1)

RKLLM Pre-converted Models

Ready-to-use optimized models from ThomasTheMaker's collection: 20+ models converted and quantized for the RK3588 NPU

ThomasTheMaker Collection

Comprehensive collection of 20+ pre-converted RKLLM models optimized specifically for Rockchip RK3588 NPU. All models are quantized, tested, and ready for production use.

  • 20+ pre-converted models
  • RK3588 NPU optimized
  • Q4/Q8 quantization
  • Ready for production use

Model Specifications

Performance Benchmarks

Performance benchmarks on RK3588 with 8GB RAM and NPU acceleration:

Model          Load Time   First Token   Tokens/sec   Memory Usage
Qwen2.5-0.5B   2.1s        0.8s          45-55        1.2GB
Qwen2.5-1.5B   3.5s        1.2s          35-42        2.8GB
Qwen2.5-3B     5.2s        1.8s          25-32        5.1GB
Llama-3.2-1B   2.8s        1.0s          38-45        1.8GB
Llama-3.1-8B   12.5s       3.2s          12-18        9.2GB

Note: Performance varies based on prompt complexity, context length, and system load. NPU acceleration provides 2-3x speedup compared to CPU-only inference.

Installation & Setup
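Pre-converted models are plain .rkllm files, so setup is mostly a download. The sketch below uses the HuggingFace CLI; the repository and file names are hypothetical placeholders, so browse the collection for the real ones.

Downloading a Pre-converted Model
# Install the HuggingFace CLI
pip install -U "huggingface_hub[cli]"

# Download a .rkllm file (repo and file names are placeholders)
huggingface-cli download <user>/Qwen2.5-3B-rk3588 qwen2.5-3b-w8a8.rkllm \
  --local-dir ./models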

AI Agents Framework

Build intelligent, autonomous agents using the Qwen-Agent framework with RKLlama, from simple assistants to complex multi-agent systems

Core Concepts & Architecture

Qwen-Agent Framework

Qwen-Agent is a comprehensive framework for building AI agents with Qwen models:

Core Features

  • Multi-agent conversation support
  • Rich tool ecosystem integration
  • Code interpreter and execution
  • Web browsing and search capabilities
  • Document processing and RAG

Architecture

Framework Structure
qwen-agent/
├── agents/          # Agent implementations
├── tools/           # Built-in tool library
├── memory/          # Memory management
├── llm/            # LLM interface layer
├── gui/            # Web interface
└── examples/       # Usage examples
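A minimal agent wires the framework's Assistant class to the local server. The sketch below assumes RKLlama exposes an OpenAI-compatible /v1 endpoint on port 8080; if your version only serves the Ollama routes, put an OpenAI-compatible proxy in front or adjust the LLM config accordingly.

Minimal Assistant (Python)
# A Qwen-Agent Assistant backed by a local RKLlama server
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "qwen2.5:3b",
    "model_server": "http://localhost:8080/v1",  # assumed OpenAI-compatible endpoint
    "api_key": "EMPTY",                          # local server, no key needed
}

bot = Assistant(
    llm=llm_cfg,
    system_message="You are a helpful on-device assistant.",
    function_list=["code_interpreter"],          # built-in Qwen-Agent tool
)

messages = [{"role": "user", "content": "Plot y = x^2 for x in [-5, 5]."}]

# bot.run yields incrementally updated response lists; keep the final one
responses = []
for responses in bot.run(messages=messages):
    pass
print(responses[-1]["content"])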

Use Cases & Applications

Building Methods & Implementation

Head-to-Head Comparison

Detailed analysis to help you choose the right approach for your local super agent development

Deployment Complexity
  • RKLlama: Easy. Single-command installation, Docker support.
  • RKNN Toolkit: Moderate. Multi-step conversion process, environment setup.

API Compatibility
  • RKLlama: Excellent. Ollama-compatible endpoints, familiar syntax.
  • RKNN Toolkit: Good. OpenAI-compatible (v1.2.1), C/C++ and Python APIs.

Performance (RK3588)
  • RKLlama: Good. Optimized for NPU, dynamic loading.
  • RKNN Toolkit: Excellent. Advanced quantization, multi-batch inference.

Model Support
  • RKLlama: Good. Major LLMs, HuggingFace integration.
  • RKNN Toolkit: Excellent. Comprehensive model zoo, multimodal support.

Quantization Options
  • RKLlama: Basic. Standard RKLLM quantization.
  • RKNN Toolkit: Advanced. w4a16, w8a8, group quantization (g128, g512).

Development Speed
  • RKLlama: Fast. Rapid prototyping, familiar tools.
  • RKNN Toolkit: Moderate. Longer setup, conversion workflow.

Production Readiness
  • RKLlama: Beta. Community project, GPL-3.0 license.
  • RKNN Toolkit: Production. Official support, enterprise features.

Documentation
  • RKLlama: Good. Community docs, examples.
  • RKNN Toolkit: Excellent. Official documentation, comprehensive guides.

RKLlama Analysis

Advantages

  • Quick setup and deployment
  • Ollama-compatible API endpoints
  • Dynamic model loading/unloading
  • Direct HuggingFace integration
  • Active community development
  • Docker containerization support

Considerations

  • Beta stage, potential stability issues
  • Limited advanced quantization options
  • GPL-3.0 license restrictions
  • Less comprehensive model support
  • Community-driven support only

RKNN Toolkit Analysis

Advantages

  • Official Rockchip support
  • Advanced quantization algorithms
  • Optimal performance on NPU
  • Comprehensive model zoo
  • Multi-batch inference support
  • Production-ready stability
  • Multimodal model support

Considerations

  • Complex installation process
  • Steep learning curve
  • Multi-step conversion workflow
  • Environment setup requirements
  • Longer development cycle

Choose RKLlama for:

  • Rapid prototyping and experimentation
  • Familiar Ollama-like development experience
  • Simple deployment requirements
  • Community-driven projects
  • Learning and educational purposes

Choose RKNN for:

  • Production deployments
  • Maximum performance requirements
  • Advanced quantization needs
  • Multimodal applications
  • Enterprise and commercial projects


Frequently Asked Questions

Common questions and detailed answers for building local super agents on Rockchip hardware

Need More Help?

Join the community discussions and get support from other developers building on Rockchip NPU.

GitHub Issues • Community Forums • Discord Channels