# Getting Started with Jupyter AI: A Practical Guide

I've been working with Jupyter notebooks for years, but I hit a wall when I needed to write a complex data cleaning script and realized I'd written the same regex pattern five times that week. That's when I decided to try Jupyter AI, the extension that brings code generation and chat directly into your notebook environment. Here's what I learned after a week of heavy testing, including the parts that broke and how I fixed them.

## The Setup: What You Actually Need

First, let's get the basics right. You need JupyterLab 3.5+ (I'm on 3.6.1) and Python 3.8+. The installation is straightforward, but here's the catch: you need to decide on your AI provider before you start.

**Step 1: Install Jupyter AI**

```bash
pip install jupyter-ai
```

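This installs the core package. One caveat: the 2.x releases of `jupyter-ai` target JupyterLab 4, so on JupyterLab 3.x you may need to pin the 1.x series with `pip install "jupyter-ai~=1.0"`. You'll also need a model provider. I tested two: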

- **OpenAI** (requires API key, works with GPT-4 and GPT-3.5)

- **Hugging Face** (free, but slower)

**Step 2: Set up your provider**

For OpenAI:

```bash
export OPENAI_API_KEY="your-key-here"
```

For Hugging Face (free option):

```bash
# Jupyter AI's Hugging Face Hub provider depends on these packages
pip install huggingface_hub ipywidgets pillow
export HUGGINGFACEHUB_API_TOKEN="your-token"
```

**Warning:** I spent 30 minutes debugging why my API calls failed. It turned out I had a typo in the environment variable name. Double-check `HUGGINGFACEHUB_API_TOKEN` (note the 'HUB').
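A typo like this is easy to catch before launching JupyterLab. Here's a minimal sketch, assuming the two variable names used in this guide; `missing_env_vars` is a hypothetical helper, not part of Jupyter AI.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Variable names taken from the setup steps above; adjust for your provider.
REQUIRED = ["OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]

if __name__ == "__main__":
    for name in missing_env_vars(REQUIRED):
        print(f"Not set: {name}")
```

Run it in the same shell you use to start JupyterLab, since variables exported in a different terminal session won't be visible to the server.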

## The Interface: Where to Find the Magic

After restarting JupyterLab, you'll see a new AI icon in the left sidebar (looks like a sparkle). Click it, and you get a chat panel. But here's the first thing that confused me: **you can't just start chatting**. You need to select a model first.

In the chat panel:

1. Click the gear icon (settings)

2. Choose your provider from the dropdown (e.g., `openai-chat` for GPT-4)

3. Set the model (I use `gpt-4` for complex tasks, `gpt-3.5-turbo` for quick stuff)

**Pro tip:** If you're on a budget, use Hugging Face's `microsoft/DialoGPT-medium` for free, but expect 5-10 second response times.

## My First Real Test: Automating a Data Cleaning Script

I had a messy CSV with 50,000 rows of customer data. Here's what I typed in the chat panel:

```
Write a Python function that cleans a DataFrame with columns: 'email', 'phone', 'age'.
Remove rows where email is missing, format phone numbers as (XXX) XXX-XXXX,
and convert age to integer.
```

Jupyter AI generated this code in about 3 seconds:

```python
import re

import pandas as pd


def clean_customer_data(df):
    # Remove rows with missing email
    df = df.dropna(subset=['email'])

    # Format 10-digit phone numbers; anything else is left unchanged
    def format_phone(phone):
        digits = re.sub(r'\D', '', str(phone))
        if len(digits) == 10:
            return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
        return phone

    df['phone'] = df['phone'].apply(format_phone)

    # Convert age to a nullable integer; unparseable values become <NA>
    df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int64')

    return df
```

**What impressed me:** It used `pd.to_numeric` with `errors='coerce'`, exactly what I would have written. It also used `Int64` (pandas' nullable integer dtype), so ages that can't be parsed become `<NA>` instead of breaking the integer conversion.

**What broke:** The phone formatting didn't handle country codes (+1). When I asked it to fix that, it generated a 20-line regex solution that was overkill. I ended up writing my own simpler version.
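A simpler way to handle the `+1` case could look like the sketch below. It's an illustration that assumes US numbers only, and `format_us_phone` is a hypothetical name rather than anything Jupyter AI produced.

```python
import re


def format_us_phone(phone):
    """Format US numbers as (XXX) XXX-XXXX, tolerating a leading +1 or 1."""
    digits = re.sub(r"\D", "", str(phone))
    # Drop the US country code when it prefixes a 10-digit number.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    # Anything else is returned untouched so it can be reviewed by hand.
    return phone


print(format_us_phone("+1 555-123-4567"))
```

Stripping non-digits first means the country-code check is a plain length test, with no extra regex needed.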

## The Real Power: %%ai Magic Commands

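The chat is nice, but the magic commands are where Jupyter AI shines. Load the extension once per notebook (`%load_ext jupyter_ai_magics` in current releases), then start a cell with `%%ai` followed by a model ID in `provider:model` form to generate code inline.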

**Example: Generate a visualization script**

In a new cell, type:

```python
%%ai openai-chat:gpt-4
Write matplotlib code to create a scatter plot of 'sales' vs 'profit'
from a DataFrame called df. Use color for 'region' column.
```

The AI generates the code directly in the cell:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

# Plot each region as its own series so it gets a distinct color and legend entry
for region in df['region'].unique():
    subset = df[df['region'] == region]
    plt.scatter(subset['sales'], subset['profit'], label=region, alpha=0.6)

plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Sales vs Profit by Region')
plt.legend()
plt.show()
```

**Critical tip:** The `%%ai` command replaces the entire cell content. If you have existing code, it will be overwritten. I lost a 50-line function this way. **Always copy your cell content before running %%ai.**

## The Flaws I Discovered

After a week of heavy use, here's what frustrated me:

1. **Conversation memory is short** - The AI doesn't remember what you discussed 5 minutes ago. I had to re-explain my DataFrame structure every few queries.

2. **No code execution feedback** - The AI doesn't run your code to check if it works. I had a generated function that referenced a column that didn't exist in my actual DataFrame.

3. **Hugging Face models are slow** - Free models took anywhere from 5 to 15 seconds per response. OpenAI's GPT-4 answered the cleaning prompt above in about 3 seconds.

4. **Security concerns** - When you use `%%ai`, your entire cell content (including any sensitive data) is sent to the AI provider. I accidentally sent a DataFrame with customer emails to OpenAI's servers.
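One way to reduce the risk in the last point is to mask obvious identifiers before any text goes into a prompt. This is a minimal sketch, not a complete privacy solution: the regex is deliberately loose, it only covers email addresses, and `redact_emails` is a hypothetical helper.

```python
import re

# Deliberately loose pattern: good enough for masking, not for validation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text, placeholder="<EMAIL>"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


sample = "Contact jane.doe@example.com or bob@mail.example.org for details."
print(redact_emails(sample))
```

Describing the DataFrame's columns and dtypes in the prompt, rather than pasting real rows, avoids the problem altogether.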

## Practical Workflow That Actually Works

After trial and error, here's my recommended workflow:

1. **Use %%ai for code generation, not chat** - The chat panel is okay, but magic commands are faster and more focused.

2. **Always specify the model** - Don't rely on defaults. Name the provider and model explicitly, e.g. `%%ai openai-chat:gpt-4` for complex tasks.

3. **Test generated code in a separate cell** - Never trust AI output blindly. Copy the generated code to a new cell and test it with a small sample.

4. **Use the `/fix` command** - If generated code has errors, type `/fix` in the chat panel, select the cell that failed, and add the error message or extra instructions if needed. It's surprisingly good at debugging.

5. **Limit context to 2-3 messages** - After about 3 exchanges, I found the AI started hallucinating. It suggested non-existent pandas functions after long conversations.
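For the testing step, a cheap first check is whether the generated code refers to columns that actually exist. The sketch below assumes you know which column names the generated code uses; `missing_columns` and the sample names are hypothetical.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order given."""
    present = set(available)
    return [name for name in required if name not in present]


# With pandas, pass df.columns as `available`; a plain list works the same way.
columns = ["email", "phone", "age"]
needed = ["email", "phone", "signup_date"]  # hypothetical names used by generated code

problems = missing_columns(columns, needed)
if problems:
    print(f"Generated code expects missing columns: {problems}")
```

After that, running the generated function on something like `df.head(100)` catches most remaining surprises before you touch the full dataset.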

## A Real-World Example: Building an ETL Pipeline

Let me walk you through a complete example. I needed to build an ETL pipeline for log files.

**Step 1: Load data**

```python
%%ai openai-chat:gpt-4
Write pandas code to read all CSV files from a directory called 'logs/',
combine them into one DataFrame, and add a column 'source_file' with the filename.
```

**Step 2: Clean data**

```python
%%ai openai-chat:gpt-4
From a DataFrame with columns 'timestamp', 'level', 'message', 'user_id':
- Parse 'timestamp' as datetime
- Filter out rows where 'level' is 'DEBUG'
- Remove duplicate 'message' entries
```

**Step 3: Generate summary**

```python
%%ai openai-chat:gpt-4
Write a function that takes a DataFrame with 'level' and 'message' columns
and returns a dictionary with counts of each log level and top 10 most
common messages.
```

Each step generated working code that I then combined manually. The AI saved me about 2 hours of typing boilerplate.
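To give a sense of the result, here's a sketch of what a Step 3 function could look like. It's an illustration, not the actual model output. It uses `collections.Counter` instead of pandas methods, so it also accepts a plain dict of lists, and the names `summarize_logs` and `top_n` are my own assumptions.

```python
from collections import Counter


def summarize_logs(df, top_n=10):
    """Count log levels and find the most common messages.

    `df` can be a pandas DataFrame or any mapping of column name to a
    sequence of values, since only iteration over the columns is used.
    """
    level_counts = Counter(df["level"])
    top_messages = Counter(df["message"]).most_common(top_n)
    return {
        "level_counts": dict(level_counts),
        "top_messages": top_messages,
    }


# Tiny hand-written sample standing in for the real log data.
logs = {
    "level": ["INFO", "ERROR", "INFO", "WARN"],
    "message": ["started", "disk full", "started", "slow query"],
}
summary = summarize_logs(logs)
print(summary["level_counts"])
print(summary["top_messages"][0])
```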

## The Verdict: Should You Use It?

Jupyter AI is fantastic for:

- Generating boilerplate code (data loading, basic cleaning)

- Writing matplotlib/seaborn visualizations

- Explaining complex pandas operations

It struggles with:

- Domain-specific logic (e.g., financial calculations)

- Multi-step workflows that require remembering past context

- Code that interacts with databases or APIs

**My biggest recommendation:** Use it as a starting point, not a final solution. I now treat generated code like a first draft: it saves me typing, but I always review and test before running on production data.

## Your Next Step

Don't just read this; try it now. Open a Jupyter notebook, install `jupyter-ai`, and run this exact command:

```python
%%ai openai-chat:gpt-4
Generate 10 random rows of mock customer data with columns:
'customer_id', 'name', 'email', 'signup_date', 'plan_type'
```

Then run the generated code to see if it works. You'll immediately understand both the power and the limitations. And when it inevitably produces something wrong (it will), use `/fix` to debug it. That's where the real learning happens.
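For comparison, here's a hand-written sketch of the kind of code that prompt tends to produce. It uses only the standard library, and the sample names, plan tiers, and the `make_mock_customers` helper are all made up for illustration.

```python
import random
from datetime import date, timedelta


def make_mock_customers(n=10, seed=0):
    """Build n rows of fake customer data as a list of dicts."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    names = ["Alex", "Sam", "Jordan", "Taylor", "Casey"]  # made-up sample values
    plans = ["free", "basic", "pro"]  # made-up plan tiers
    start = date(2023, 1, 1)
    rows = []
    for i in range(1, n + 1):
        name = rng.choice(names)
        rows.append({
            "customer_id": i,
            "name": name,
            "email": f"{name.lower()}{i}@example.com",
            "signup_date": start + timedelta(days=rng.randrange(365)),
            "plan_type": rng.choice(plans),
        })
    return rows


customers = make_mock_customers()
print(customers[0])
```

Passing the list to `pd.DataFrame(customers)` gives you a frame with the five requested columns, which makes it easy to compare against whatever the model returns.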