How to use Jupyter AI for data science
# I Spent a Week Debugging Pandas Queries Until Jupyter AI Fixed My Workflow in 10 Minutes
Here's the scenario that finally broke me: I was trying to join three DataFrames with inconsistent column names, apply a custom function to a datetime column, and pivot the result—all while my coffee went cold. I'd typed `df.merge()` so many times I could hear my keyboard crying. Then I discovered Jupyter AI, and I haven't written a raw Pandas line since. But it's not magic. It's a tool with sharp edges, and I'll show you exactly where they cut.
## What Actually Broke Before I Found This
I was working on a customer churn dataset with 47 columns. Every time I needed to filter by a date range, I'd forget whether `pd.to_datetime()` needed `infer_datetime_format=True` or `format='%Y-%m-%d'`. I'd Google it, copy-paste from Stack Overflow, and then the date format would be wrong because the CSV used slashes instead of dashes. That's when I realized: I'm not a bad data scientist, I'm just bad at remembering Pandas syntax.
Jupyter AI doesn't replace your brain—it replaces your browser tabs. Here's how I set it up and what actually works.
## Installing Jupyter AI: The Two-Minute Setup That Took 15
First, the honest version of installation. You'll need Python 3.8+ and Jupyter Lab 3.5+. Run:
```bash
pip install jupyter-ai
```
Then enable the extension:
```bash
jupyter labextension install @jupyter-ai/core
```
Wait: that might fail if you're using JupyterLab 4.x. I learned this the hard way. For Lab 4, you need the following (quoted, so the shell doesn't treat `>` as an output redirect):
```bash
pip install "jupyter-ai>=2.0"
```
And skip the `labextension` command entirely, because the pip package already ships the extension. The version mismatch cost me 20 minutes of debugging. Once installed, restart JupyterLab and look for the AI icon in the left sidebar. If you see a robot face, you're golden. If not, check your Jupyter version with `jupyter --version`.
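When the sidebar icon doesn't show up, a version mismatch is the first thing I'd rule out. Here is a minimal sketch for checking what's installed from inside a notebook cell. It assumes the PyPI distribution names `jupyterlab` and `jupyter-ai`, and `installed_version` is a hypothetical helper, not part of Jupyter AI:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust if your environment differs.
for name in ("jupyterlab", "jupyter-ai"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If both report a version and the icon is still missing, the problem is more likely which environment JupyterLab was launched from than the install itself.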
## Configuring Your Model: Why GPT-4 Is the Only Real Option
Jupyter AI supports multiple model providers (Anthropic, Cohere, Hugging Face), and I tested all of them. Here's the truth: in my testing on data science tasks, GPT-4 was the only one that consistently output working code. Claude 3 came close, but it sometimes hallucinated column names. Local models like Llama 2 were too slow for real-time use on my machine.
Set up your API key in the Jupyter AI settings panel (gear icon in the AI sidebar). Paste your OpenAI key. Don't use environment variables unless you're deploying—the settings panel persists across sessions.
Here's my first actual test. I typed in the chat panel:
```
"Load the CSV file 'sales_2023.csv' and show the first 5 rows with all columns visible"
```
Jupyter AI responded with:
```python
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv('sales_2023.csv')
df.head()
```
It worked. But here's the first flaw: it didn't check if the file existed. When I ran it and the file was in a subdirectory, it crashed. I learned to always specify the path in my prompts: "Load from the 'data/raw/' folder."
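The missing-file crash is easy to guard against yourself. A minimal sketch, assuming a hypothetical `load_csv` wrapper (my name, not something Jupyter AI generates) that fails early and reports the working directory:

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV, failing early with a message that shows where we looked."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found (working directory: {Path.cwd()})"
        )
    return pd.read_csv(csv_path)


# Hypothetical location; adjust to wherever the file actually lives.
# df = load_csv("data/raw/sales_2023.csv")
```

Printing the working directory in the error is the useful part: most "file not found" surprises in notebooks come from the kernel starting somewhere other than where you expect.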
## Real-World Example: Cleaning a Messy Dataset
Let me show you what I actually use this for. I had a dataset where dates were stored as strings like "2023-01-15 14:30:00" and "15/01/2023" in the same column. A nightmare. I typed:
```
"Convert the 'timestamp' column to datetime. Some values use YYYY-MM-DD format, others use DD/MM/YYYY. Handle errors by setting them to NaT."
```
Jupyter AI generated:
```python
df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed', errors='coerce')
```
That `format='mixed'` parameter (added in pandas 2.0) was something I'd never used. It worked on my data, though be careful: an ambiguous value like "05/01/2023" is read month-first unless you also pass `dayfirst=True`. The output was also timezone-naive, so I had to add a follow-up: "Now make all timestamps timezone-aware to UTC." It appended `.dt.tz_localize('UTC')`.
The lesson: Jupyter AI is iterative. You don't get perfect code in one shot. Treat it like a junior developer who needs step-by-step instructions.
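If you'd rather not rely on per-value format inference, you can parse each known layout explicitly and combine the results. This is a sketch of an alternative to the generated one-liner, not what Jupyter AI produced. The sample values are made up, and the UTC step assumes the timestamps were recorded in UTC:

```python
import pandas as pd

# Made-up sample mirroring the two layouts described above, plus one bad value.
raw = pd.Series(["2023-01-15 14:30:00", "15/01/2023", "05/01/2023", "not a date"])

# Parse each known layout on its own; values that don't match become NaT.
iso = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
dmy = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Prefer the ISO result and fall back to the day-first result.
parsed = iso.fillna(dmy)

# Attach UTC to the naive timestamps (assumes they were recorded in UTC).
parsed_utc = parsed.dt.tz_localize("UTC")
print(parsed_utc)
```

The two-pass version is more typing, but "05/01/2023" unambiguously becomes 5 January, and anything matching neither layout stays `NaT` where you can count it.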
## Advanced: Custom Functions and Vectorization
Here's where it gets powerful. I needed to calculate a weighted moving average with a custom window. I typed:
```
"Create a function that calculates exponential weighted moving average with a span of 12 periods, then apply it to the 'revenue' column grouped by 'store_id'"
```
It returned:
```python
def calc_ewma(group):
    # EWMA of revenue within a single store, span of 12 periods
    return group['revenue'].ewm(span=12, adjust=False).mean()

df['ewma_revenue'] = df.groupby('store_id').apply(calc_ewma).reset_index(level=0, drop=True)
```
This worked, but it was slow on 2 million rows. I asked: "Can you vectorize this?" It suggested using `transform`:
```python
df['ewma_revenue'] = df.groupby('store_id')['revenue'].transform(lambda x: x.ewm(span=12, adjust=False).mean())
```
That cut runtime from 45 seconds to 3. This is the kind of optimization I'd normally spend 30 minutes Googling.
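Before trusting a rewrite like this, I'd confirm on a toy frame that it returns the same numbers. The sketch below uses made-up data and also compares against pandas' grouped `ewm`, a third option that drops the Python lambda. I haven't timed that one on the 2 million rows, so treat any speed difference as something to measure yourself:

```python
import pandas as pd

# Made-up toy data: two stores, three periods each.
df = pd.DataFrame({
    "store_id": ["a", "a", "a", "b", "b", "b"],
    "revenue": [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
})

# The transform version from above.
via_transform = df.groupby("store_id")["revenue"].transform(
    lambda x: x.ewm(span=12, adjust=False).mean()
)

# Grouped ewm returns a (store_id, original index) MultiIndex;
# drop the group level and realign to the frame's index.
via_grouped_ewm = (
    df.groupby("store_id")["revenue"]
    .ewm(span=12, adjust=False)
    .mean()
    .reset_index(level=0, drop=True)
    .reindex(df.index)
)

# Largest absolute difference between the two approaches.
print((via_transform - via_grouped_ewm).abs().max())
```

With `adjust=False`, the first value in each group equals the raw revenue for that row, which makes a quick eyeball check possible too.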
## The Hidden Flaws You'll Encounter
1. **It forgets your DataFrame structure.** After 5 prompts, it might generate code referencing columns that don't exist. Always paste the output of `df.columns.tolist()` into your initial prompt.
2. **The chat panel bogs down in long sessions.** After about 50 interactions, the AI became unresponsive for me. I suspect a memory leak but haven't confirmed it. I restart the kernel and clear the chat. Keep your prompts short and focused.
3. **It can't handle large datasets.** I tried asking it to "optimize this query on 10 million rows." It generated code that worked on a sample but crashed on the full dataset. Use `.sample(1000)` for testing.
4. **Plotting is hit or miss.** I asked for a "seaborn heatmap of correlation matrix" and it imported matplotlib instead. Specify the library explicitly: "Use plotly to create an interactive scatter plot."
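Two of these flaws (phantom columns and code that only works on small data) can be caught with a couple of guard helpers before you run generated code on the full frame. A minimal sketch; `missing_columns` and `try_on_sample` are hypothetical names of mine, not Jupyter AI features:

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


def try_on_sample(df, func, n=1000, seed=0):
    """Run func on at most n randomly chosen rows before committing to the full frame."""
    sample = df.sample(n=min(n, len(df)), random_state=seed)
    return func(sample)


# Made-up example frame.
df = pd.DataFrame({"store_id": [1, 1, 2], "revenue": [10.0, 20.0, 5.0]})
print(missing_columns(df, ["store_id", "revenue", "region"]))  # ['region']
print(try_on_sample(df, lambda d: d.groupby("store_id")["revenue"].sum()))
```

A passing sample run only proves the code is syntactically and logically sane. It says nothing about memory use at full size, so scale up in steps rather than jumping straight to 10 million rows.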
## Visualizations That Actually Work
For plotting, I've found the most success with explicit prompts:
```
"Create a plotly express line chart. X-axis is 'date', Y-axis is 'sales'. Color by 'region'. Add a title 'Monthly Sales by Region'. Make it interactive with hover data showing 'product_count'."
```
It generated:
```python
import plotly.express as px
fig = px.line(df, x='date', y='sales', color='region',
              title='Monthly Sales by Region', hover_data=['product_count'])
fig.show()
```
Pro tip: Always ask for `hover_data` explicitly. The default is minimal.
## The Workflow That Finally Stuck
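Here's my current routine. I open a new notebook and, if I want the `%%ai` cell magics, run `%load_ext jupyter_ai_magics` in the first cell (older releases used `%load_ext jupyter_ai`; the chat panel itself needs no load step). Then in the AI panel I type: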
```
"I have a DataFrame with columns: [paste df.columns]. Goal: [describe task]. Constraints: [memory limit, time limit]. Output: [single code cell or explanation]."
```
This structured prompt gets me working code in 80% of cases. The other 20%? I debug it myself or break the task into smaller prompts.
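Pasting column names by hand gets old, so the template can be filled from the DataFrame itself. A minimal sketch, assuming a hypothetical `build_prompt` helper. Including dtypes next to the column names is my own addition to the template, on the theory that it helps the model avoid things like summing strings:

```python
import pandas as pd


def build_prompt(df, goal, constraints="none", output="single code cell"):
    """Fill the structured prompt template with the frame's actual columns."""
    columns = ", ".join(f"{name} ({dtype})" for name, dtype in df.dtypes.items())
    return (
        f"I have a DataFrame with columns: [{columns}]. "
        f"Goal: {goal}. Constraints: {constraints}. Output: {output}."
    )


# Made-up example frame.
df = pd.DataFrame({"region": ["north", "south"], "revenue": [10.0, 20.0]})
print(build_prompt(df, goal="total revenue per region"))
```

Copy the printed string into the chat panel. Because it is built from `df.dtypes`, it stays accurate after you rename or drop columns.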
## Your Next Five Minutes
Don't try to learn all of Jupyter AI at once. Instead, do this right now:
1. Install it (use the Lab 4 command above).
2. Open any messy CSV you have.
3. Ask it to "Fix all date columns to datetime and handle errors."
4. Then ask "Impute missing values in numeric columns with median."
5. Then "Create a correlation heatmap using plotly."
You'll have a clean dataset with visualizations in under 10 minutes. That's the promise. But remember: it's a tool, not a replacement. The moment you blindly trust its output is the moment it generates a pivot table that sums strings instead of numbers. I've been there. My coffee is still cold.