Getting started with DeepSeek V4: a practical guide

I was staring at a $340 API bill last month. Most of it came from agent loops — my coding assistant spinning through tool calls, reading files, making edits, and sometimes going in circles. When you're paying per token for every one of those intermediate thoughts, costs spiral fast. That's what pushed me to give DeepSeek V4 a serious look.

The promise was straightforward: strong coding performance at a fraction of the cost. After a couple of weeks of running it through its paces — both through the hosted API and locally — here's what I've learned, including the gotchas that caught me off guard.

Picking the Right Variant

DeepSeek V4 isn't one model — it's a family, and picking the wrong one for your hardware or use case is the first mistake I made.

V4-Pro: The flagship. Massive context window (up to 1M tokens), best reasoning. This is what you want for complex code generation and agentic workflows through the API. Do not try to run this locally unless you have a serious multi-GPU rig.
V4-Flash: The workhorse. Faster, cheaper, still very capable at coding tasks. This is what you should reach for first for cost-efficient coding, long-context routing, and high-throughput agent workloads. It's also the realistic option for local deployment if you have the hardware.
V4: The middle ground. If you need more than Flash but don't want to pay Pro prices.

My mistake: I initially tried to pull V4-Pro locally through Ollama on my 24GB VRAM machine. It didn't end well — out of memory before the model even finished loading. Flash is the realistic local option.

Path A: Getting Started with the Hosted API (Fastest)

If you just want to start using DeepSeek V4, the hosted API is the easiest path. No GPU requirements, no disk space concerns — just an API key.

Head to the DeepSeek Platform, create an account, and generate an API key. The API is OpenAI-compatible, which means you can point almost any existing tool at it with minimal configuration changes.

Here's a quick Python test:

from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Write a Python function that finds the longest increasing subsequence in a list."}
    ],
    temperature=0.0
)

print(response.choices[0].message.content)

That's it. If you've used the OpenAI SDK before, nothing here is surprising.

The pricing is where it gets interesting. V4-Flash is dramatically cheaper than GPT-4-class models, which matters enormously when you're running agent loops that burn through thousands of tokens per turn.

Plugging It Into Claude Code

This is where things got genuinely exciting for me. DeepSeek exposes an Anthropic-compatible API endpoint, which means you can use it as a backend for Claude Code — Anthropic's terminal-based coding assistant. The result is Claude Code's interface powered by DeepSeek's pricing.

Here's exactly what I did:

First, install Claude Code:

npm install -g @anthropic-ai/claude-code
claude --version

Then configure the environment variables. On Linux/macOS:

export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_AUTH_TOKEN=<your DeepSeek API Key>
export ANTHROPIC_MODEL=deepseek-v4-pro[1m]
export ANTHROPIC_DEFAULT_OPUS_MODEL=deepseek-v4-pro[1m]
export ANTHROPIC_DEFAULT_SONNET_MODEL=deepseek-v4-pro[1m]
export ANTHROPIC_DEFAULT_HAIKU_MODEL=deepseek-v4-flash
export CLAUDE_CODE_SUBAGENT_MODEL=deepseek-v4-flash
export CLAUDE_CODE_EFFORT_LEVEL=max

On Windows PowerShell:

$env:ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
$env:ANTHROPIC_AUTH_TOKEN="<your DeepSeek API Key>"
$env:ANTHROPIC_MODEL="deepseek-v4-pro[1m]"
$env:ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-pro[1m]"
$env:ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-pro[1m]"
$env:ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash"
$env:CLAUDE_CODE_SUBAGENT_MODEL="deepseek-v4-flash"
$env:CLAUDE_CODE_EFFORT_LEVEL="max"

Then navigate to your project and launch:

cd /path/to/my-project
claude

The key insight in this configuration: I'm using V4-Pro for the main model (the one that does the heavy lifting) and V4-Flash for the subagent (the one that handles quicker lookups and smaller tasks). This keeps costs down while maintaining quality where it matters.

Surprise: The first time I ran this, it failed silently. Turns out I had a stale Node.js version (16.x). Claude Code requires Node 18+. Upgraded, and everything worked perfectly.

Path B: Running Locally with Ollama

If you want your prompts to stay on your machine — for compliance reasons, privacy, or just to stop depending on someone else's data center — local deployment is the way to go. Ollama makes this surprisingly painless, as long as you pick the right model.

Step 1: Install Ollama from ollama.com. Straightforward installer for macOS, Linux, and Windows.

Step 2: Pull a model. Here's where discipline matters:

# Realistic for a 24GB VRAM GPU
ollama pull deepseek-v4-flash

# Only attempt this with serious hardware (80GB+ VRAM multi-GPU)
ollama pull deepseek-v4

Step 3: Run an interactive session:

ollama run deepseek-v4-flash

Step 4: Use the OpenAI-compatible API. Ollama serves a local endpoint:

from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # doesn't matter, but the SDK requires it
    base_url="http://localhost:11434/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Explain this function: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"}
    ]
)

print(response.choices[0].message.content)

This works offline. No API keys, no token bills. The tradeoff is inference speed — even with a good GPU, you won't match hosted API throughput.

Path C: Production Local with vLLM

If you're deploying locally for a team or production workload, vLLM is the serious option. It's Linux-only and requires CUDA 12.x, but it offers much better throughput and proper batching.

# Install vLLM
pip install vllm

# Download weights from Hugging Face (this is big — 100GB+ for V4-Pro)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash

# Serve the model
python -m vllm.entrypoints.openai.api_server \
    --model ./DeepSeek-V4-Flash \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

The tensor-parallel-size flag splits the model across multiple GPUs. Adjust based on your hardware. The gpu-memory-utilization flag tells vLLM how much VRAM it's allowed to use — 0.9 leaves some headroom to avoid OOM crashes.

Gotcha: I initially set max-model-len to the full 1M context window. vLLM tried to pre-allocate KV cache for that and immediately ran out of memory. Set it to what you actually need.

Honest Assessment and Limitations

After a couple of weeks, here's where I land:

What's genuinely good:

V4-Flash is an absurd value for coding tasks. It handles most of what I threw at it — refactoring, debugging, writing tests — at maybe 10-15% of what I was paying before.
The Anthropic-compatible endpoint is a clever move. Being able to swap DeepSeek into existing tooling without rewriting configs is huge.
Local deployment with Ollama actually works well for V4-Flash, if you have the GPU memory.

Where it falls down:

V4-Pro is not a laptop model. Don't kid yourself about running it locally on your MacBook. Even V4-Flash requires a serious GPU for acceptable speeds.
The Anthropic compatibility layer isn't perfect. I hit occasional weirdness with tool calling in Claude Code — responses that would parse fine with Claude's own models but caused DeepSeek to stumble. It works 90% of the time, but that 10% means you need to watch for it.
Documentation is still catching up. I found myself piecing together information from the API docs, community forums, and trial-and-error more than I'd like.

Practical tips:

Start with V4-Flash. Only upgrade to Pro if you hit a wall Flash can't handle.
When using Claude Code with DeepSeek, set the subagent model to Flash. Most subagent work doesn't need Pro-level reasoning.
If running locally, be honest about your hardware. V4-Flash on a single 24GB GPU works. V4-Pro doesn't.
Keep your hosted API key handy even if you run locally. You'll want it for comparison testing and for tasks that exceed your local hardware.

My API bill this month? $47. That's the real recommendation.