Baby steps with Kronk

Kronk provides a high-level API on top of the Yzma library for building Generative AI applications in Go. It embeds local model inference directly in your application, so the application doesn't depend on an external service such as Ollama.

This means you can develop completely independent Generative AI applications in Go, without relying on any third-party solution to run model inference.

Kronk aims to stay close to the OpenAI API (and targets compatibility with it), which should make it familiar to anyone who has already used that API.

This project was started by William (Bill) Kennedy, a Go consultant and trainer, co-author of the excellent book Go in Action and, above all, one of the founding members of GoBridge.

Fun fact: I first met Bill when, to my great surprise, I was interviewed on his Ardan Labs Podcast in 2023, and it was a really fun moment.

Today, we're going to see how to build our first projects with Kronk.

Prerequisites

Installing dependencies

You'll need the libraries required by Yzma installed (for installation and usage, see the previous article: Installing and Using Yzma on a Jetson Orin Nano).

Note: you don't need to have a Jetson Orin Nano to use Kronk and Yzma, even though it's the platform I use for my tests. The dependencies also exist for other platforms, which you can find listed here: support.
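
The program later in this article reads the library location from the YZMA_LIB environment variable, so a quick preflight check can save some head-scratching. The sketch below uses only the Go standard library; checkLibPath is a hypothetical helper (it is not part of Kronk or Yzma), and it only verifies that the variable is set and that the path exists, not that the right libraries are inside.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// checkLibPath is a hypothetical helper, not part of Kronk or Yzma.
// It returns an error if path is empty or does not exist on disk.
func checkLibPath(path string) error {
	if path == "" {
		return errors.New("YZMA_LIB is not set")
	}
	if _, err := os.Stat(path); err != nil {
		return fmt.Errorf("cannot access %q: %w", path, err)
	}
	return nil
}

func main() {
	// Same environment variable as the one read by the article's main.go.
	if err := checkLibPath(os.Getenv("YZMA_LIB")); err != nil {
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("YZMA_LIB points to an existing path")
}
```

Running this before the examples tells you right away whether a later initialization error comes from a missing or mistyped YZMA_LIB.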

Download a model

You'll need a model in GGUF format to run your tests with Kronk. My favorite model is Qwen2.5:0.5b, which you can find on Hugging Face:

mkdir first-steps-with-kronk
cd first-steps-with-kronk
curl -L -o qwen2.5-0.5b-instruct-q4_k_m.gguf --progress-bar "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf?download=true"

First project with Kronk: simple completion

We're going to do a "first completion" with Kronk using the model we just downloaded. But first, we need to initialize our Go module and add the Kronk dependency:

go mod init first-steps-with-kronk
go get github.com/ardanlabs/kronk@v0.25.0
touch main.go

Then fill the main.go file with the following content:

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "strings"
    "time"

    "github.com/ardanlabs/kronk"
    "github.com/ardanlabs/kronk/model"
)

func main() {
    // Step 1️⃣
    ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
    defer cancel()

    // Step 2️⃣
    modelFile := "./qwen2.5-0.5b-instruct-q4_k_m.gguf"
    libPath := os.Getenv("YZMA_LIB")

    // Step 3️⃣
    // Initialize Kronk
    err := kronk.Init(libPath, kronk.LogSilent)
    if err != nil {
        log.Fatal("😡 Error initializing Kronk:", err)
    }

    // Step 4️⃣
    // modelInstances represents the number of instances of the model to create.
    // Unless you have more than 1 GPU, the recommended number of instances is 1.
    const modelInstances = 1

    modelConfig := model.Config{
        ModelFile: modelFile,
    }
    // Create a new Kronk inference model
    krn, err := kronk.New(modelInstances, modelConfig)
    if err != nil {
        log.Fatal("😡 Unable to create inference model:", err)
    }
    defer krn.Unload(context.Background())

    // Step 5️⃣
    data := model.D{
        "messages": model.DocumentArray(
            model.ChatMessage("system", "You are a helpful assistant, expert in Star Trek."),
            model.ChatMessage("user", "Who is Jean-Luc Picard?"),
        ),
    }
    params := model.Params{
        Temperature: 0.0,
        TopP:        0.9,
    }

    // Step 6️⃣
    response, err := krn.Chat(ctx, params, data)

    if err != nil {
        log.Fatal("😡 Chat:", err)
    }

    // Step 7️⃣
    // Print the response
    fmt.Println(strings.Repeat("-", 60))
    fmt.Println(response.Choice[0].Delta.Content)

    fmt.Println("\n" + strings.Repeat("-", 60))

    log.Println("✅ Chat complete.")

}

Some explanations

Here's what the program does step by step:

Context configuration
- Creating a context with a 120-second timeout to limit execution time
- The defer cancel() ensures that context resources will be freed at the end
Parameter preparation
- Defining the path to the downloaded GGUF model
- Retrieving the Yzma library path from the YZMA_LIB environment variable
Kronk initialization
- Calling kronk.Init() to initialize the Yzma library
- The kronk.LogSilent parameter disables verbose logs
- In case of error, the program stops immediately
Creating the inference instance
- The modelInstances = 1 parameter defines the number of model instances to create
- Unless you have more than 1 GPU, the recommended number of instances is 1
- model.Config contains the model configuration with the path to the GGUF file
- kronk.New() loads the model into memory and creates the inference instance
- The defer krn.Unload(context.Background()) ensures that the model will be unloaded from memory at the end
Preparing messages and parameters
- Building a data structure (model.D) containing messages in OpenAI format
- Using the model.DocumentArray() and model.ChatMessage() helper functions
- The "system" message defines the assistant's role (Star Trek expert)
- The "user" message contains the question asked
- Inference parameters are defined separately in a model.Params structure
- Temperature: 0.0 removes randomness from token selection, so responses are essentially deterministic
- TopP: 0.9 restricts sampling to the most probable tokens whose cumulative probability reaches 0.9 (nucleus sampling)
Executing the chat
- Calling krn.Chat() with context, parameters, and data
- In case of error, the program stops
Displaying the result
- Displaying the response generated by the model (response.Choice[0].Delta.Content)
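
The Temperature and TopP settings are easier to picture on a toy distribution. The sketch below is not Kronk or llama.cpp code: tokenProb, greedy, and nucleus are hypothetical names, and the probabilities are made up for illustration. It assumes that a temperature of 0 amounts to always picking the most likely token, and that top-p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold.

```go
package main

import (
	"fmt"
	"sort"
)

// tokenProb pairs a candidate token with its (made-up) probability.
type tokenProb struct {
	Token string
	Prob  float64
}

// greedy returns the most likely candidate, which is what a
// temperature of 0 is assumed to amount to in this sketch.
func greedy(cands []tokenProb) (tokenProb, bool) {
	if len(cands) == 0 {
		return tokenProb{}, false
	}
	best := cands[0]
	for _, c := range cands[1:] {
		if c.Prob > best.Prob {
			best = c
		}
	}
	return best, true
}

// nucleus returns the smallest prefix of the candidates, sorted by
// descending probability, whose cumulative probability reaches topP.
func nucleus(cands []tokenProb, topP float64) []tokenProb {
	sorted := append([]tokenProb(nil), cands...) // do not mutate the input
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Prob > sorted[j].Prob
	})
	var kept []tokenProb
	cum := 0.0
	for _, c := range sorted {
		kept = append(kept, c)
		cum += c.Prob
		if cum >= topP {
			break
		}
	}
	return kept
}

func main() {
	cands := []tokenProb{
		{"captain", 0.5}, {"officer", 0.25}, {"diplomat", 0.125},
		{"android", 0.0625}, {"doctor", 0.0625},
	}
	if best, ok := greedy(cands); ok {
		fmt.Println("greedy pick:", best.Token)
	}
	for _, c := range nucleus(cands, 0.9) {
		fmt.Printf("kept by top-p 0.9: %s (%.4f)\n", c.Token, c.Prob)
	}
}
```

With a greedy pick, the top-p filter has little left to decide, which is why the combination used in main.go gives stable answers from one run to the next.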

Running the program

The first time, run go mod tidy to download the dependencies, then run the program with the following command:

go run main.go

And you should have output like this after a few seconds (it all depends on your hardware of course):

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
load_backend: loaded CUDA backend from /home/k33g/yzma-projects/yzma_cuda_lib/libggml-cuda.so
load_backend: loaded CPU backend from /home/k33g/yzma-projects/yzma_cuda_lib/libggml-cpu-armv8.2_2.so
------------------------------------------------------------
Jean-Luc Picard is a fictional character created by the creator of Star Trek, the show and its franchise. He is a starship commander on the starship Enterprise, the starship that is the flagship of the Starfleet crew.

Jean-Luc Picard is a highly advanced AI, a type of artificial intelligence that is designed to assist and facilitate human decision-making. He is known for his advanced programming, his ability to learn from experience, and his ability to make decisions based on complex calculations and data.

Jean-Luc Picard is a member of the crew of the Enterprise, a starship that is the flagship of the Starfleet crew. He is a member of the crew of the Enterprise-A, a starship that is the flagship of the Starfleet crew. He is also a member of the crew of the Enterprise-B, a starship that is the flagship of the Starfleet crew.

Jean-Luc Picard is a member of the crew of the Enterprise-C, a starship that is the flagship of the Starfleet crew. He is also a member of the crew of the Enterprise-D, a starship that is the flagship of the Starfleet crew.

------------------------------------------------------------
2025/11/30 07:11:44 ✅ Chat complete.

You see, it's pretty simple to use Kronk to do local inference with GGUF models in Go 🥰!

Streaming Chat completion

If the model takes time to respond, rather than waiting for the completion to finish, it's possible to display the response as it's being generated (streaming). This will greatly improve the user experience as they can see results appearing little by little. To do this, we need to modify the previous code a bit.

We're going to replace this code:

response, err := krn.Chat(ctx, params, data)

if err != nil {
    log.Fatal("😡 Chat:", err)
}

With this code:

ch, err := krn.ChatStreaming(ctx, params, data)
if err != nil {
    log.Fatal("😡 Chat streaming:", err)
}
log.Println("😁 Chat streaming ready...")

for resp := range ch {
    fmt.Print(resp.Choice[0].Delta.Content)
}

Brief explanations

Here's what this code does step by step:

Creating a streaming channel
- krn.ChatStreaming() returns a channel (ch) instead of a complete response
- This channel will receive response fragments as they're generated
- The inputs (messages, Temperature, TopP) remain identical to non-streaming mode
Reception loop
- for resp := range ch iterates over each fragment received from the channel
- Each resp contains a small piece of the generated text
- resp.Choice[0].Delta.Content extracts the fragment content
- fmt.Print() displays the fragment immediately (without newline)
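
The reception loop is the usual Go pattern for consuming a channel until the producer closes it. The sketch below reproduces that pattern with the standard library only: fakeStream is a hypothetical stand-in for a streaming call (it is not Kronk's API and carries plain strings rather than Kronk's response type), and collect additionally accumulates the fragments, which is handy when you want the full text after displaying it live.

```go
package main

import (
	"fmt"
	"strings"
)

// fakeStream is a hypothetical stand-in for a streaming API.
// It sends each fragment on a channel and closes it when done.
func fakeStream(fragments []string) <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch) // closing the channel ends the consumer's range loop
		for _, f := range fragments {
			ch <- f
		}
	}()
	return ch
}

// collect prints each fragment as soon as it arrives (no newline)
// and returns the concatenation of all fragments.
func collect(ch <-chan string) string {
	var sb strings.Builder
	for frag := range ch {
		fmt.Print(frag)
		sb.WriteString(frag)
	}
	return sb.String()
}

func main() {
	full := collect(fakeStream([]string{
		"Jean-Luc ", "Picard ", "is a ", "Starfleet captain.",
	}))
	fmt.Printf("\n(%d bytes received)\n", len(full))
}
```

In the Kronk version, the same accumulation would happen inside the for resp := range ch loop, appending resp.Choice[0].Delta.Content instead of a plain string.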

Complete code

main.go

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "strings"
    "time"

    "github.com/ardanlabs/kronk"
    "github.com/ardanlabs/kronk/model"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
    defer cancel()

    modelFile := "./qwen2.5-0.5b-instruct-q4_k_m.gguf"
    libPath := os.Getenv("YZMA_LIB")

    // Initialize Kronk
    err := kronk.Init(libPath, kronk.LogSilent)
    if err != nil {
        log.Fatal("😡 Error initializing Kronk:", err)
    }

    // modelInstances represents the number of instances of the model to create.
    // Unless you have more than 1 GPU, the recommended number of instances is 1.
    const modelInstances = 1

    modelConfig := model.Config{
        ModelFile: modelFile,
    }
    // Create a new Kronk inference model
    krn, err := kronk.New(modelInstances, modelConfig)
    if err != nil {
        log.Fatal("😡 Unable to create inference model:", err)
    }
    defer krn.Unload(context.Background())

    data := model.D{
        "messages": model.DocumentArray(
            model.ChatMessage("system", "You are a helpful assistant, expert in Star Trek."),
            model.ChatMessage("user", "Who is Jean-Luc Picard?"),
        ),
    }
    params := model.Params{
        Temperature: 0.0,
        TopP:        0.9,
    }

    ch, err := krn.ChatStreaming(ctx, params, data)
    if err != nil {
        log.Fatal("😡 Chat streaming:", err)
    }
    log.Println("😁 Chat streaming ready...")

    log.Println("⏳ Chat streaming is starting...")
    fmt.Println(strings.Repeat("-", 60))

    for resp := range ch {
        fmt.Print(resp.Choice[0].Delta.Content)
    }
    fmt.Println("\n" + strings.Repeat("-", 60))

    log.Println("✅ Chat streaming complete.")

}

All you have to do now is run the program again with the command:

go run main.go

And there you go! You now have a program that uses streaming chat completion with a GGUF model locally 🥳! You can see that the code isn't much more complicated.

You can find the complete source code for these examples here: https://codeberg.org/GenAI-On-Small-Devices/genai-with-kronk.

See you very soon for new adventures with Kronk and Yzma! 🤓
