Tabby: Self-Hosted AI Code Completion
Tabby is a self‑hosted AI code completion engine that brings the power of large language models (LLMs) right to your development environment without sending any code to the cloud. It’s built on top of open‑source transformer models, runs on commodity hardware, and integrates with popular editors like VS Code, Neovim, and JetBrains IDEs. In this post we’ll walk through the core concepts, set up Tabby on a typical Linux workstation, and explore real‑world scenarios where a private completion engine can boost productivity while keeping your codebase secure.
What makes Tabby different?
Most AI‑powered completions you’ve tried—GitHub Copilot, Tabnine, or Amazon CodeWhisperer—rely on remote inference services. That model offers convenience but also means every keystroke is streamed to a third‑party server. Tabby flips the script: the inference happens locally, and you retain full control over model versions, data retention policies, and resource allocation.
Under the hood, Tabby uses a lightweight inference server that can load models ranging from 1 B to 7 B parameters. Because the server speaks a simple HTTP API, any editor that can make a request can become an AI‑assisted coding partner. This decoupling also makes it easy to swap models, add custom prompts, or even run multiple instances for different teams.
Key benefits at a glance
- Privacy first: No outbound network traffic unless you explicitly enable it.
- Cost control: No per‑token fees; you only pay for the hardware you already own.
- Customizability: Fine‑tune on your own repositories to capture domain‑specific idioms.
- Multi‑IDE support: One server, many clients.
Getting started: Installing Tabby
The easiest way to spin up Tabby is with Docker. The official image bundles the inference server, a lightweight model cache, and a health‑check endpoint. Below is a minimal docker run command that pulls the latest stable release and maps port 8080 to your host.
docker run -d \
  --name tabby \
  -p 8080:8080 \
  -v $HOME/.tabby/models:/app/models \
  ghcr.io/tabbyml/tabby:latest \
  --model /app/models/starcoderbase-1b \
  --host 0.0.0.0 \
  --port 8080
Let’s break down the flags:
- -v $HOME/.tabby/models:/app/models mounts a persistent volume for model files.
- --model points the server to the model you want to load (here we use StarCoder‑Base‑1B).
- --host 0.0.0.0 makes the service reachable from any local interface.
After a few seconds, you can verify the server is alive by curling the health endpoint:
curl http://localhost:8080/health
# {"status":"ok"}
Running without Docker
If you prefer a native installation—perhaps on a GPU‑enabled workstation—you can clone the repo and install the Python dependencies directly. The following script sets up a virtual environment, installs torch with CUDA support, and launches the server.
git clone https://github.com/TabbyML/tabby.git
cd tabby
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cu121
python -m tabby.server --model ./models/starcoderbase-1b --port 8080
Both approaches give you a running HTTP API at http://localhost:8080 that accepts JSON payloads describing the current file, cursor position, and optional context.
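To make that concrete, here is a minimal Python sketch of a client request. The field names mirror the payloads shown later in this post (prompt, max_tokens, temperature) and the /v1/completions path from the routing example; treat both as assumptions if your Tabby version exposes a different schema.

import requests

TABBY_URL = "http://localhost:8080/v1/completions"

def complete(prompt, max_tokens=64):
    """Send the code before the cursor and return the raw JSON response."""
    payload = {
        "prompt": prompt,          # file content up to the cursor
        "max_tokens": max_tokens,  # cap the length of the suggestion
        "temperature": 0.2,        # keep completions fairly deterministic
    }
    resp = requests.post(TABBY_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(complete("def fibonacci(n):\n    "))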
Hooking Tabby into your editor
Tabby ships with client plugins for the most common editors. The VS Code extension, for instance, is a thin wrapper that forwards the current buffer to the local server and renders the returned suggestions as inline snippets.
To install the VS Code client, open the Extensions pane, search for “Tabby”, and click Install. Once the extension is active, open the Settings (Ctrl+,) and set the endpoint URL to http://localhost:8080. The extension will automatically detect the running server and start sending completion requests.
Neovim integration
Neovim users can leverage the built‑in LSP client. Add the following snippet to your init.lua (or init.vim if you’re still on Vimscript):
local lspconfig = require('lspconfig')
local configs = require('lspconfig.configs')
configs.tabby = {
  default_config = {
    cmd = {'python', '-m', 'tabby.lsp', '--host', '127.0.0.1', '--port', '8080'},
    filetypes = {'python', 'javascript', 'go', 'rust'},
    root_dir = lspconfig.util.root_pattern('.git'),
  },
}
lspconfig.tabby.setup{}
After reloading Neovim, you’ll see Tabby suggestions appear as you type, just like any other LSP‑based completion source.
JetBrains IDEs
For IntelliJ IDEA, PyCharm, or WebStorm, download the Tabby plugin from the JetBrains Marketplace. The plugin’s settings panel mirrors the VS Code UI: point it at http://localhost:8080, select the languages you want, and enable “inline completions”.
Pro tip: In JetBrains IDEs you can bind “Tab” to accept a suggestion only when the completion popup is visible. This prevents accidental acceptance while you’re still typing.
Real‑world use cases
Now that Tabby is up and running, let’s explore three scenarios where a self‑hosted completion engine shines.
1. Enterprise codebases with strict compliance
Financial institutions, healthcare providers, and defense contractors often operate under regulations that forbid transmitting source code outside the corporate firewall. By deploying Tabby on an internal server, developers can still benefit from AI assistance while staying fully compliant with GDPR, HIPAA, or ITAR.
Because the model runs on premises, you can also audit the inference logs. Tabby provides an optional request‑logging middleware that records the prompt, response, and timestamp—useful for post‑mortem analysis or demonstrating compliance to auditors.
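As a starting point for that kind of audit, here is a short Python sketch that summarizes request volume per day. The log layout it assumes (one JSON object per line with a timestamp field) is an illustration, not Tabby's documented format; adjust the field names to whatever the middleware actually writes on your deployment.

import json
from collections import Counter
from datetime import datetime

def summarize_audit_log(path):
    """Count completion requests per day from a JSONL audit log.

    Assumes one JSON object per line with an ISO 8601 'timestamp' field;
    the real log layout may differ, so treat this as a sketch.
    """
    per_day = Counter()
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            record = json.loads(line)
            day = datetime.fromisoformat(record["timestamp"]).date()
            per_day[day] += 1
    return per_day

if __name__ == "__main__":
    for day, count in sorted(summarize_audit_log("tabby_requests.jsonl").items()):
        print(day, count)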
2. Accelerating onboarding for new hires
New engineers spend weeks learning the codebase’s conventions, naming schemes, and internal APIs. Fine‑tuning Tabby on your monorepo creates a “knowledge base” that surfaces idiomatic patterns as soon as a junior dev writes a function stub.
Here’s a quick script that extracts all .py files from a repository, formats them as a JSONL dataset, and triggers a fine‑tuning job using the Tabby CLI:
import os, json, subprocess

def collect_py_files(root):
    """Walk the repository and collect the contents of every .py file."""
    data = []
    for dirpath, _, filenames in os.walk(root):
        for f in filenames:
            if f.endswith('.py'):
                path = os.path.join(dirpath, f)
                with open(path, 'r', encoding='utf-8') as fp:
                    data.append({'prompt': fp.read()})
    return data

repo_path = '/srv/company/monorepo'
dataset = collect_py_files(repo_path)

# One JSON object per line, as expected by the fine-tune CLI.
with open('fine_tune.jsonl', 'w', encoding='utf-8') as out:
    for entry in dataset:
        out.write(json.dumps(entry) + '\n')

subprocess.run([
    'tabby', 'fine-tune',
    '--model', 'starcoderbase-1b',
    '--data', 'fine_tune.jsonl',
    '--output', 'model_finetuned',
    '--epochs', '3'
])
After the fine‑tune finishes, restart the server with --model model_finetuned and watch the suggestions become more aligned with your internal style.
3. Continuous Integration (CI) linting
Beyond interactive completion, Tabby can be used as a static analysis tool in CI pipelines. By feeding the entire repository to the model and asking it to generate a “review” of each file, you can automatically catch anti‑patterns or missing docstrings.
Below is a minimal GitHub Actions workflow that runs Tabby in a container, extracts suggestions, and fails the job if any suggestion contains the keyword “TODO”.
name: Tabby CI Review

on: [push, pull_request]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start Tabby server
        run: |
          docker run -d --name tabby -p 8080:8080 \
            -v ${{ github.workspace }}/models:/app/models \
            ghcr.io/tabbyml/tabby:latest \
            --model /app/models/starcoderbase-1b
      - name: Wait for healthcheck
        run: |
          for i in {1..10}; do
            if curl -s http://localhost:8080/health | grep ok; then break; fi
            sleep 3
          done
      - name: Run review script
        env:
          TABBY_URL: http://localhost:8080
        run: |
          python .github/scripts/tabby_review.py
      - name: Stop Tabby
        if: always()
        run: docker rm -f tabby
The referenced tabby_review.py script iterates over changed files, sends them to the server, and checks the model’s response for disallowed tokens. This pattern turns Tabby into a “soft linter” that evolves as your model improves.
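For reference, here is a minimal sketch of what such a review script could look like. It reuses the /v1/completions path and payload fields shown elsewhere in this post and treats the response body as plain text to scan; both are assumptions to verify against your server version.

import os
import subprocess
import sys

import requests

TABBY_URL = os.environ.get("TABBY_URL", "http://localhost:8080")
DISALLOWED = ("TODO",)   # responses containing these tokens fail the job

def changed_files():
    """List Python files touched in the last commit.

    Assumes the checkout is deep enough for HEAD~1 to exist; adjust the
    diff range (or walk the whole repository) for your own pipeline.
    """
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f.endswith(".py")]

def review(path):
    """Ask the model for a short review of one file and return the raw response text."""
    with open(path, "r", encoding="utf-8") as fp:
        source = fp.read()
    payload = {
        "prompt": "# Review the following file and note problems:\n"
                  + source + "\n# Review:\n",
        "max_tokens": 256,
        "temperature": 0.0,
    }
    resp = requests.post(f"{TABBY_URL}/v1/completions", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.text

def main():
    failed = False
    for path in changed_files():
        if any(token in review(path) for token in DISALLOWED):
            print(f"{path}: review contains a disallowed token")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())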
Pro tip: Combine Tabby’s suggestions with traditional linters (e.g., Flake8, ESLint). Use the linter for strict rule enforcement and Tabby for stylistic, context‑aware hints.
Advanced configuration and prompt engineering
Tabby’s default prompt concatenates the file content up to the cursor, a few lines of surrounding context, and a “completion” token. For specialized workflows you can customize this prompt template via the --prompt-template flag.
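To illustrate what that assembly amounts to, the small sketch below renders a template from the buffer content and cursor position. The template syntax and placeholder names are made up for the example; check your Tabby version for the actual --prompt-template format.

# The placeholders here are hypothetical, chosen only to mirror the description above.
DEFAULT_TEMPLATE = "{before_cursor}\n# nearby context:\n{after_cursor}\n<completion>"

def render_prompt(file_text, cursor, context_lines=5, template=DEFAULT_TEMPLATE):
    """Combine the code before the cursor with a few trailing lines of context."""
    before = file_text[:cursor]
    after = "\n".join(file_text[cursor:].splitlines()[:context_lines])
    return template.format(before_cursor=before, after_cursor=after)

buffer = "import os\n\ndef main():\n    \n\nif __name__ == '__main__':\n    main()\n"
print(render_prompt(buffer, cursor=buffer.index("    \n") + 4))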
A common enhancement for API‑heavy projects is to prepend a short “system message” that lists the most used internal services. Here’s an example JSON payload that the client can send:
{
  "system": "You are an assistant familiar with the company's internal HTTP client library `myhttp`. Prefer using `myhttp.get` and `myhttp.post` over `requests`.",
  "prompt": "def fetch_user(user_id):\n    ",
  "max_tokens": 64,
  "temperature": 0.2
}
When the model receives this payload, it’s more likely to generate code that adheres to the preferred library, reducing post‑completion edits.
Multi‑model routing
Some teams run both a small, fast model for everyday completions and a larger, more accurate model for complex refactoring tasks. Tabby supports a “router” mode where the client can specify a model_id field, and the server forwards the request to the appropriate backend.
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "starcoder-7b",
    "prompt": "class Cache:\n    def __init__(self):\n        ",
    "max_tokens": 128,
    "temperature": 0.1
  }'
Behind the scenes, the server spawns a separate inference process for the 7 B model, keeping the latency of the 1 B model untouched for routine edits.
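On the client side, a small helper can decide which backend to ask for. The model IDs and the prompt-length threshold below are arbitrary examples; the model_id field is the one introduced above.

import requests

TABBY_URL = "http://localhost:8080/v1/completions"

def complete(prompt, heavy=False):
    """Route everyday completions to the 1B model and heavier requests
    (explicitly marked, or with very long prompts) to the 7B model."""
    model_id = "starcoder-7b" if heavy or len(prompt) > 4000 else "starcoderbase-1b"
    payload = {
        "model_id": model_id,
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0.1,
    }
    return requests.post(TABBY_URL, json=payload, timeout=30).json()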
Performance tuning tips
Running a transformer model on a laptop can be memory‑intensive. Here are three practical tweaks to keep latency under 200 ms for a 1 B model.
- Enable half‑precision (FP16): Start the server with --dtype fp16. This halves the memory footprint and often speeds up GPU kernels.
- Cache KV‑states: Tabby can reuse key/value caches across consecutive completions in the same file. Pass "use_cache": true in the request body.
- Batch requests: If you have multiple editors on the same machine, configure them to send batched prompts every 50 ms. The server will process them in a single forward pass. A rough client‑side sketch follows this list.
Pro tip: Monitor GPU utilization with nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv. If you see spikes above 80 %, consider scaling down the model or adding a second GPU for parallel inference.
Security considerations
Even though Tabby runs locally, it’s still wise to treat the inference server as a networked service. Restrict access to localhost unless you explicitly need remote clients, and enable TLS if you expose the endpoint beyond the host.
Tabby also offers an --sanitize flag that strips potentially sensitive identifiers (e.g., API keys) from the prompt before feeding it to the model. This is useful in environments where developers might inadvertently type secrets while coding.
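To give a feel for what that stripping could look like, here is a client-side sketch built on simple regular expressions. The patterns are purely illustrative; they are not Tabby's actual sanitizer rules.

import re

# Illustrative patterns only; the real --sanitize rules are not documented here.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*['\"][^'\"]+['\"]"),
    re.compile(r"AKIA[0-9A-Z]{16}"),                  # AWS access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def sanitize(prompt):
    """Replace likely secrets with a placeholder before the prompt leaves the editor."""
    for pattern in SECRET_PATTERNS:
        prompt = pattern.sub("<REDACTED>", prompt)
    return prompt

print(sanitize('api_key = "sk-1234567890abcdef"'))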
Community and ecosystem
The Tabby project is open source under the Apache 2.0 license, and a vibrant community contributes model checkpoints, Docker recipes, and editor plugins. The official Discord channel hosts weekly “model‑swap” sessions where users share performance metrics for different hardware configurations.
If you’re interested in extending Tabby, the codebase follows a plugin architecture. Adding a new client is as simple as implementing a /v1/completions HTTP wrapper that conforms to the OpenAI‑compatible schema. This means you can integrate Tabby with tools like coc.nvim, Emacs lsp-mode, or even custom web‑based IDEs.
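Because the schema is described as OpenAI-compatible, a quick way to prototype a new client is to point an existing OpenAI SDK at the local endpoint. The snippet below assumes the /v1/completions route really does accept OpenAI-style fields; verify against your server version.

from openai import OpenAI

# Point the standard OpenAI client at the local Tabby endpoint.
# The api_key value is unused locally, but the SDK requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.completions.create(
    model="starcoderbase-1b",
    prompt="def quicksort(arr):\n    ",
    max_tokens=64,
    temperature=0.2,
)
print(response.choices[0].text)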
Conclusion
Tabby demonstrates that AI‑assisted coding doesn’t have to be a cloud‑only proposition. By hosting the model yourself, you gain privacy, cost predictability, and the flexibility to tailor completions to your organization’s unique codebase. Whether you’re safeguarding regulated data, accelerating onboarding, or enriching CI pipelines, Tabby offers a pragmatic path to bring large‑language‑model power directly to the developer’s desk. Give it a spin, fine‑tune on your own repositories, and watch productivity climb—without ever leaving your own network.