Mastering Local AI

Your complete guide to RKLLama, RKNN, and building on-device super agents on Rockchip hardware

RK3588 • RK3576 • NPU Acceleration

Core Technologies

Two powerful approaches for deploying LLMs on Rockchip NPU hardware

RKLlama

Community • GPL-3.0 • 221 stars

Ollama alternative for Rockchip NPU with simplified deployment, familiar APIs, and direct HuggingFace integration. Perfect for rapid prototyping and development.

Ollama-compatible API endpoints
Dynamic model loading/unloading
Tool/Function calling support
Direct HuggingFace integration
Version 0.0.42 • Python 3.8-3.12

RKNN Toolkit

Official • Rockchip • 1.8k stars

Official Rockchip toolkit with advanced quantization, optimal performance, and comprehensive model support. Production-ready with enterprise features.

Advanced quantization (w4a16, w8a8)
Multi-batch inference support
Cross-attention inference
Multimodal model support
Version 1.2.1 • Python 3.8-3.12

Both approaches support RK3588, RK3576, and other Rockchip NPU platforms

🔥 NPU Acceleration • ⚡ Real-time Inference • 🛠️ Tool Integration

RKLlama Deep Dive

Community-driven Ollama alternative optimized for Rockchip NPU with simplified deployment and familiar APIs

Installation Guide
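A typical install clones the repository and runs its setup script. The commands below are a sketch based on the upstream README; exact steps and script names can differ between releases, so check the RKLlama repository for your platform.

RKLlama Installation
# Clone the community repository and run its setup script
git clone https://github.com/notpunchnox/rkllama
cd rkllama
chmod +x setup.sh
./setup.sh

# Start the API server (defaults to port 8080)
rkllama serve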

Key Features

Ollama Compatibility

Drop-in replacement for Ollama with familiar API endpoints

Dynamic Model Management

Load and unload models at runtime without restart

HuggingFace Integration

Direct model pulling from HuggingFace repositories

Tool/Function Calling

Complete support for external API integration

Streaming Responses

Real-time token generation for responsive UIs

API Usage Examples

Interactive chat with conversation history and context:

Chat API Request
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:3b",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain how NPUs work"}
    ],
    "stream": true
  }'

Response Format:

{
  "model": "qwen2.5:3b",
  "created_at": "2024-01-01T00:00:00Z",
  "message": {
    "role": "assistant",
    "content": "NPUs (Neural Processing Units) are specialized..."
  },
  "done": false
}
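Tool/function calling uses the same chat endpoint. The request below is a sketch following the Ollama tool schema, which RKLlama's compatible API is expected to accept; get_weather is a hypothetical function used only for illustration.

Tool Calling Request
curl -X POST http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:3b",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'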

Model Management
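Models can be listed and pulled through the same HTTP API. The calls below assume the Ollama-compatible routes described above; the model name is illustrative, and route names may vary by RKLlama version.

Model Management Requests
# List installed models
curl http://localhost:8080/api/tags

# Pull a model (name is illustrative)
curl -X POST http://localhost:8080/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "qwen2.5:3b"}'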

RKNN Toolkit Deep Dive

Official Rockchip toolkit with advanced quantization, optimal performance, and enterprise-grade features

Installation & Setup
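Setup happens in two parts: the conversion toolkit runs on an x86 Linux host, while the lightweight runtime runs on the Rockchip board. The commands below are a sketch; wheel paths and filenames vary by release, so match them to your checkout and Python version.

RKNN-LLM Installation
# On the x86 development host: clone the official repository
git clone https://github.com/airockchip/rknn-llm
cd rknn-llm

# Install the conversion toolkit wheel
# (path and filename are illustrative; pick the wheel matching your Python)
pip install rkllm-toolkit/rkllm_toolkit-1.2.1-cp310-cp310-linux_x86_64.whl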

Performance Benchmarks

Tokens per second across different models and platforms

* Benchmarks from RKNN-LLM v1.1.0 with 64 sequence length, 320 max context, 256 new tokens

Quantization Strategies

8-bit Weights, 8-bit Activations

Best performance on RK3588 with 6 TOPS NPU. Provides excellent speed with minimal accuracy loss.

Characteristics:
  • Memory Usage: ~50% of original model size
  • Accuracy: Minimal quantization loss
  • Speed: Excellent inference speed
  • Best For: RK3588, production deployments
w8a8 Configuration
# Illustrative quantization settings (not a specific toolkit API)
quantization_config = {
    "weight_bits": 8,                 # 8-bit weights
    "activation_bits": 8,             # 8-bit activations
    "algorithm": "kld",               # KL-divergence calibration
    "calibration_dataset": "custom",  # representative prompts for calibration
    "batch_size": 16
}
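In practice these choices map onto the rkllm-toolkit conversion API. The sketch below follows the pattern in the official rknn-llm examples; argument names can shift between toolkit versions, and the model paths are illustrative.

w8a8 Model Conversion (Python)
# Convert a HuggingFace model to .rkllm with w8a8 quantization
from rkllm.api import RKLLM

llm = RKLLM()

# Load a local HuggingFace checkpoint (path is illustrative)
ret = llm.load_huggingface(model="./Qwen2.5-3B-Instruct")
assert ret == 0, "model load failed"

# Quantize and compile for the RK3588 NPU
ret = llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8",     # 8-bit weights, 8-bit activations
    target_platform="rk3588",
)
assert ret == 0, "build failed"

# Export the deployable artifact
ret = llm.export_rkllm("./qwen2.5-3b-w8a8.rkllm")
assert ret == 0, "export failed"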

Advanced Features (v1.2.1)

RKLLM Pre-converted Models

Ready-to-use optimized models from ThomasTheMaker's collection: 20+ models converted and quantized for the RK3588 NPU

ThomasTheMaker Collection

Comprehensive collection of 20+ pre-converted RKLLM models optimized specifically for Rockchip RK3588 NPU. All models are quantized, tested, and ready for production use.

  • 20+ pre-converted models
  • RK3588 NPU optimized
  • Q4/Q8 quantization
  • Ready for production use

Model Specifications

Performance Benchmarks

Performance benchmarks on RK3588 with 8GB RAM and NPU acceleration:

Model          Load Time   First Token   Tokens/sec   Memory Usage
Qwen2.5-0.5B   2.1s        0.8s          45-55        1.2GB
Qwen2.5-1.5B   3.5s        1.2s          35-42        2.8GB
Qwen2.5-3B     5.2s        1.8s          25-32        5.1GB
Llama-3.2-1B   2.8s        1.0s          38-45        1.8GB
Llama-3.1-8B   12.5s       3.2s          12-18        9.2GB

Note: Performance varies based on prompt complexity, context length, and system load. NPU acceleration provides 2-3x speedup compared to CPU-only inference.

Installation & Setup
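Pre-converted models are plain .rkllm files, so setup is mostly a download. The sketch below uses the HuggingFace CLI; the repository and file names are hypothetical placeholders, so browse the collection for the real ones.

Downloading a Pre-converted Model
# Install the HuggingFace CLI
pip install -U "huggingface_hub[cli]"

# Download a .rkllm file (repo and file names are placeholders)
huggingface-cli download <user>/Qwen2.5-3B-rk3588 qwen2.5-3b-w8a8.rkllm \
  --local-dir ./models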

AI Agents Framework

Build intelligent, autonomous agents using the Qwen-Agent framework with RKLlama, from simple assistants to complex multi-agent systems

Core Concepts & Architecture

Qwen-Agent Framework

Qwen-Agent is a comprehensive framework for building AI agents with Qwen models:

Core Features

  • Multi-agent conversation support
  • Rich tool ecosystem integration
  • Code interpreter and execution
  • Web browsing and search capabilities
  • Document processing and RAG

Architecture

Framework Structure
qwen-agent/
├── agents/          # Agent implementations
├── tools/           # Built-in tool library
├── memory/          # Memory management
├── llm/            # LLM interface layer
├── gui/            # Web interface
└── examples/       # Usage examples
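A minimal agent wires the framework's Assistant class to the local server. The sketch below assumes RKLlama exposes an OpenAI-compatible /v1 endpoint on port 8080; if your version only serves the Ollama routes, put an OpenAI-compatible proxy in front or adjust the LLM config accordingly.

Minimal Assistant (Python)
# A Qwen-Agent Assistant backed by a local RKLlama server
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "qwen2.5:3b",
    "model_server": "http://localhost:8080/v1",  # assumed OpenAI-compatible endpoint
    "api_key": "EMPTY",                          # local server, no key needed
}

bot = Assistant(
    llm=llm_cfg,
    system_message="You are a helpful on-device assistant.",
    function_list=["code_interpreter"],          # built-in Qwen-Agent tool
)

messages = [{"role": "user", "content": "Plot y = x^2 for x in [-5, 5]."}]

# bot.run yields incrementally updated response lists; keep the final one
responses = []
for responses in bot.run(messages=messages):
    pass
print(responses[-1]["content"])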

Use Cases & Applications

Building Methods & Implementation

Head-to-Head Comparison

Detailed analysis to help you choose the right approach for your local super agent development

Deployment Complexity
  • RKLlama: Easy. Single-command installation, Docker support.
  • RKNN Toolkit: Moderate. Multi-step conversion process, environment setup.

API Compatibility
  • RKLlama: Excellent. Ollama-compatible endpoints, familiar syntax.
  • RKNN Toolkit: Good. OpenAI-compatible (v1.2.1), C/C++ and Python APIs.

Performance (RK3588)
  • RKLlama: Good. Optimized for NPU, dynamic loading.
  • RKNN Toolkit: Excellent. Advanced quantization, multi-batch inference.

Model Support
  • RKLlama: Good. Major LLMs, HuggingFace integration.
  • RKNN Toolkit: Excellent. Comprehensive model zoo, multimodal support.

Quantization Options
  • RKLlama: Basic. Standard RKLLM quantization.
  • RKNN Toolkit: Advanced. w4a16, w8a8, group quantization (g128, g512).

Development Speed
  • RKLlama: Fast. Rapid prototyping, familiar tools.
  • RKNN Toolkit: Moderate. Longer setup, conversion workflow.

Production Readiness
  • RKLlama: Beta. Community project, GPL-3.0 license.
  • RKNN Toolkit: Production. Official support, enterprise features.

Documentation
  • RKLlama: Good. Community docs, examples.
  • RKNN Toolkit: Excellent. Official documentation, comprehensive guides.

RKLlama Analysis

Advantages

  • Quick setup and deployment
  • Ollama-compatible API endpoints
  • Dynamic model loading/unloading
  • Direct HuggingFace integration
  • Active community development
  • Docker containerization support

Considerations

  • Beta stage, potential stability issues
  • Limited advanced quantization options
  • GPL-3.0 license restrictions
  • Less comprehensive model support
  • Community-driven support only

RKNN Toolkit Analysis

Advantages

  • Official Rockchip support
  • Advanced quantization algorithms
  • Optimal performance on NPU
  • Comprehensive model zoo
  • Multi-batch inference support
  • Production-ready stability
  • Multimodal model support

Considerations

  • Complex installation process
  • Steep learning curve
  • Multi-step conversion workflow
  • Environment setup requirements
  • Longer development cycle

Choose RKLlama for:

  • Rapid prototyping and experimentation
  • Familiar Ollama-like development experience
  • Simple deployment requirements
  • Community-driven projects
  • Learning and educational purposes

Choose RKNN for:

  • Production deployments
  • Maximum performance requirements
  • Advanced quantization needs
  • Multimodal applications
  • Enterprise and commercial projects


Frequently Asked Questions

Common questions and detailed answers for building local super agents on Rockchip hardware

Need More Help?

Join the community discussions and get support from other developers building on Rockchip NPU.

GitHub Issues • Community Forums • Discord Channels