# How to Use Hugging Face for Model Deployment: Step by Step

I've been deploying machine learning models with Hugging Face for over two years, and it's one of the most streamlined platforms I've used for getting models into production. Whether you're deploying a fine-tuned BERT for sentiment analysis or a custom Whisper model for speech recognition, Hugging Face's Inference Endpoints and Spaces make the process smooth. In this tutorial, I'll walk you through the exact steps I use to deploy models, from setting up the environment to handling production traffic.

## Prerequisites

Before we dive in, make sure you have:

- A Hugging Face account (free tier works for testing)

- Python 3.8+ installed

- `huggingface_hub`, `transformers`, and PyTorch installed (`pip install huggingface_hub transformers torch`); the examples use `return_tensors="pt"`, which needs PyTorch

- A trained or fine-tuned model ready (I'll use a DistilBERT sentiment model as an example)

---

## Step 1: Prepare Your Model for Deployment

The first step is ensuring your model is compatible with Hugging Face's deployment infrastructure. I always start by saving my model with `save_pretrained()`, which writes the weights, config, and tokenizer files in the layout that `transformers` expects and that Inference Endpoints can load from the Hub.

```python

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your fine-tuned model (replace with your own)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Save model and tokenizer into the same local folder
model.save_pretrained("./my-sentiment-model")
tokenizer.save_pretrained("./my-sentiment-model")

```
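Before moving on, it can help to confirm that the folder really contains what a loader will look for. The following is a minimal sketch, not an official check: the file names are the ones I'd expect `save_pretrained()` to write in recent `transformers` releases (an assumption, and sharded checkpoints use different weight file names), and `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names; adjust if your transformers version writes others.
REQUIRED_FILES = ("config.json", "tokenizer_config.json")
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")  # either one is enough


def missing_model_files(folder):
    """Return the names of expected files that are absent from `folder`."""
    path = Path(folder)
    present = {p.name for p in path.iterdir()} if path.is_dir() else set()
    missing = [name for name in REQUIRED_FILES if name not in present]
    if not any(name in present for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


# An empty list means every expected file was found.
print(missing_model_files("./my-sentiment-model"))
```

An empty list does not prove the model loads correctly, only that nothing obvious is missing.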

**Pro Tip:** Always test your model locally before uploading. I've wasted hours debugging deployment issues that were actually model loading errors. Run a quick inference:

```python

# Reuses `model` and `tokenizer` from the previous snippet
inputs = tokenizer("This movie is fantastic!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.argmax().item())  # Should output 1 (positive)

```
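The argmax only gives a class index. To compare local results with what a deployed model returns (a label plus a score), the logits need a softmax. Here is a minimal pure-Python sketch; the label mapping is hard-coded as an assumption for this SST-2 checkpoint, and in real code you would read it from `model.config.id2label`. The helper name and the sample logits are made up for illustration.

```python
import math

# Assumed label mapping; prefer model.config.id2label in real code.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def logits_to_prediction(logits, id2label=ID2LABEL):
    """Convert a list of raw logits into a (label, probability) pair."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# With the model above you would pass outputs.logits[0].tolist() instead.
label, prob = logits_to_prediction([-4.2, 4.5])  # made-up logits
print(label, round(prob, 4))
```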

## Step 2: Upload Your Model to the Hugging Face Hub

Now push your model to the Hugging Face Hub, which acts as both a registry and a distribution channel.

```python

from huggingface_hub import HfApi

api = HfApi()

# Create the repository if it does not exist yet
api.create_repo(repo_id="your-username/my-sentiment-model", exist_ok=True)

# Upload every file in the local folder to the model repository
api.upload_folder(
    folder_path="./my-sentiment-model",
    repo_id="your-username/my-sentiment-model",
    repo_type="model",
)

```

**Common Pitfall:** If you get a 401 error, you most likely aren't logged in. Run `huggingface-cli login` and paste an access token with write permission from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

![Screenshot: Uploading model files via Python API](images/tutorials/how-to-use-hugging-face-for-model-deployment-step-1.webp)

## Step 3: Create a Model Card (Optional but Recommended)

A good model card helps others (and your future self) understand what the model does. I always include:

- Model description

- Intended use cases

- Training data summary

- Evaluation metrics

You can create this directly on the Hub UI or programmatically:

```python

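from huggingface_hub import ModelCard, ModelCardData

# from_template() expects a ModelCardData object rather than a plain dict
# (it also needs the jinja2 package installed)
card_data = ModelCardData(
    license="mit",
    language="en",
    tags=["sentiment-analysis", "distilbert"],
)

card = ModelCard.from_template(
    card_data,
    model_description="DistilBERT fine-tuned for binary sentiment classification.",
    # template_path="path/to/custom_template.md",  # Optional custom template
)

card.push_to_hub("your-username/my-sentiment-model")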

```

## Step 4: Deploy with Inference Endpoints

This is where deployment gets production-ready. Inference Endpoints auto-scale and handle load balancing. Here's how I set one up:

1. Go to [huggingface.co/inference-endpoints](https://huggingface.co/inference-endpoints)

2. Click "New endpoint"

3. Select your model (`your-username/my-sentiment-model`)

4. Choose an instance type (I start with the smallest CPU instance for testing)

5. Set scaling limits (min: 0, max: 2 for cost efficiency)

![Screenshot: Inference Endpoint configuration page](images/tutorials/how-to-use-hugging-face-for-model-deployment-step-2.webp)

**Pro Tip:** If you create endpoints through the API instead of the UI, set `accelerator="gpu"` together with a GPU instance type and size. A single NVIDIA T4 is usually enough for real-time inference with small transformer models like DistilBERT.
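The same configuration can be scripted with `create_inference_endpoint` from `huggingface_hub`. Treat the following as a sketch under stated assumptions: the vendor, region, instance type, and instance size strings are placeholder values that may not match the catalog available to your account, and `endpoint_kwargs` and `create_endpoint` are hypothetical helper names. Only the config helper runs here; `create_endpoint` needs a logged-in token and creates a billable resource.

```python
def endpoint_kwargs(repository, use_gpu=False):
    """Build keyword arguments for huggingface_hub.create_inference_endpoint.

    vendor, region, instance_type and instance_size are assumed values;
    check the Inference Endpoints catalog for what your account offers.
    """
    return {
        "repository": repository,
        "framework": "pytorch",
        "task": "text-classification",
        "accelerator": "gpu" if use_gpu else "cpu",
        "vendor": "aws",
        "region": "us-east-1",
        "instance_type": "nvidia-t4" if use_gpu else "intel-icl",
        "instance_size": "x1" if use_gpu else "x2",
        "min_replica": 0,      # scale to zero when idle
        "max_replica": 2,
        "type": "protected",   # requests must carry a valid HF token
    }


def create_endpoint(name, repository, use_gpu=False):
    """Create the endpoint, wait until it is running, and return its URL."""
    # Imported lazily so the config helper works without the library installed.
    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint(name, **endpoint_kwargs(repository, use_gpu))
    endpoint.wait()  # blocks until the endpoint is deployed
    return endpoint.url


print(endpoint_kwargs("your-username/my-sentiment-model")["accelerator"])
```

Keeping the settings in one dictionary makes it easy to review them, or diff them between a test and a production endpoint, before anything is created.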

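Once created, the endpoint gets its own URL on the `endpoints.huggingface.cloud` domain, shown on the endpoint's overview page. (URLs of the form `https://api-inference.huggingface.co/models/...` belong to the shared serverless Inference API, not to a dedicated endpoint.) Test it with:

```python

import os
import requests

# Placeholder: copy the real URL from your endpoint's overview page
API_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"

# Read the token from the environment instead of hard-coding it
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

response = requests.post(API_URL, headers=headers, json={"inputs": "I love this product!"})
response.raise_for_status()
print(response.json())
# Example output (scores will vary): [{'label': 'POSITIVE', 'score': 0.9998}]
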
```
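With the minimum replica count at 0, the first request after an idle period can fail while a replica starts. A retry with exponential backoff covers that case. This sketch assumes the endpoint answers with HTTP 503 during startup; `post_with_retry` and the `send` callable are hypothetical names, and `send` stands in for a thin wrapper around `requests.post` so the helper can be exercised offline.

```python
import time


def post_with_retry(send, payload, retries=5, delay=2.0, sleep=time.sleep):
    """Call send(payload), retrying with exponential backoff on HTTP 503.

    `send` is any callable returning (status_code, body).
    `retries` is the number of extra attempts after the first call.
    """
    status, body = send(payload)
    for attempt in range(retries):
        if status != 503:
            break
        sleep(delay * 2 ** attempt)  # waits 2s, 4s, 8s, ... by default
        status, body = send(payload)
    return status, body


# Offline demo: a fake endpoint that is still starting for the first two calls.
fake_replies = iter([(503, None), (503, None), (200, [{"label": "POSITIVE"}])])
print(post_with_retry(lambda payload: next(fake_replies), {"inputs": "hi"}, sleep=lambda s: None))
```

In real use, `send` could be `lambda p: (r := requests.post(API_URL, headers=headers, json=p)).status_code, ...` or, more readably, a small named function that returns `(response.status_code, response.json())`.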

## Step 5: Optimize for Production

After initial deployment, I always optimize. Here are my go-to strategies:

### 5.1 Enable Batching

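If your endpoint's serving container exposes a batching option, set the maximum batch size (`max_batch_size`) to 8 or 16. Grouping concurrent requests into one forward pass can improve throughput substantially, usually at the cost of slightly higher latency per request.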
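Batching can also be done on the client side by sending several texts in one request. The sketch below only builds the payloads; it assumes the endpoint accepts a list under `"inputs"`, which is worth verifying against your own deployment. `chunked` is a hypothetical helper.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = [f"review {i}" for i in range(20)]

# One request body per batch of up to 8 texts
payloads = [{"inputs": batch} for batch in chunked(texts, 8)]
print([len(p["inputs"]) for p in payloads])  # prints [8, 8, 4]
```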

### 5.2 Use ONNX Runtime

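Convert your model to ONNX with Optimum (`pip install optimum[onnxruntime]`). On CPU this is often 2-3x faster, but the gain depends on the model and hardware, so benchmark before switching:

```python

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# export=True converts the PyTorch checkpoint to ONNX while loading
model = ORTModelForSequenceClassification.from_pretrained("your-username/my-sentiment-model", export=True)
tokenizer = AutoTokenizer.from_pretrained("your-username/my-sentiment-model")

# Keep the tokenizer next to the ONNX weights so the folder is self-contained
model.save_pretrained("./onnx-model")
tokenizer.save_pretrained("./onnx-model")

# Upload the ONNX version
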
```
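Since the speedup varies, it is worth measuring both versions on your own inputs. A minimal timing sketch follows; `median_latency_ms` is a hypothetical helper, and the lambda is a stand-in for a real PyTorch or ONNX pipeline call.

```python
import statistics
import time


def median_latency_ms(predict, text, runs=20, warmup=3):
    """Median wall-clock latency of predict(text) in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict(text)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(text)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Stand-in callable; replace with your PyTorch and ONNX pipelines and compare.
latency = median_latency_ms(lambda text: text.upper(), "This movie is fantastic!")
print(f"{latency:.4f} ms")
```

The median is less sensitive than the mean to a single slow run, which makes short benchmarks like this a bit more stable.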

### 5.3 Set Up Caching

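For models with deterministic outputs (like classification), cache responses for repeated inputs, either through a caching option if your serving setup offers one or in your own application layer. A cache hit skips the model entirely, so repeated queries return much faster.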
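An application-level cache can be as small as a wrapper around the function that calls the endpoint. This is a sketch using the standard library's `functools.lru_cache`; `with_cache` and `fake_predict` are hypothetical names, and the fake function stands in for a real remote call.

```python
from functools import lru_cache


def with_cache(predict, maxsize=1024):
    """Return a version of predict(text) that remembers recent results."""
    @lru_cache(maxsize=maxsize)
    def cached_predict(text):
        return predict(text)
    return cached_predict


# Stand-in for a remote call; records how often the "model" is really hit.
remote_calls = []


def fake_predict(text):
    remote_calls.append(text)
    return ("POSITIVE", 0.99)  # a tuple is immutable, so sharing it is safe


classify = with_cache(fake_predict)
classify("I love this product!")
classify("I love this product!")  # served from the cache
print(len(remote_calls), classify.cache_info().hits)
```

This only suits deterministic models, and the cache lives in one process; several application workers would each keep their own copy.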

## Step 6: Monitor and Scale

Hugging Face provides built-in monitoring. I always check these metrics:

- **P99 latency**: Should be under 500ms for real-time apps

- **Error rate**: Keep below 1%

- **CPU/GPU utilization**: Scale up if consistently above 80%
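If you export raw numbers from the dashboard or your own logs, the thresholds above are easy to check in code. The sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; `percentile` and `health_report` are hypothetical helpers and the sample data is made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def health_report(latencies_ms, errors, total_requests, utilization):
    """Compare observed metrics with the thresholds listed above."""
    return {
        "p99_ok": percentile(latencies_ms, 99) < 500,          # under 500 ms
        "error_rate_ok": errors / max(total_requests, 1) < 0.01,  # below 1%
        "scale_up": utilization > 0.80,                        # above 80%
    }


# Made-up sample: 99 fast requests and one slow outlier.
print(health_report([100] * 99 + [900], errors=2, total_requests=1000, utilization=0.85))
```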

![Screenshot: Monitoring dashboard showing latency and error rates](images/tutorials/how-to-use-hugging-face-for-model-deployment-step-3.webp)

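**Common Pitfall:** Don't set the minimum replica count (`min_replica` in the `huggingface_hub` API) too high. I once left it at 5 and got a $200 bill for a weekend of idle endpoints. Start with 0 and let auto-scaling handle traffic; the trade-off is a cold start on the first request after an idle period.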
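The arithmetic behind a bill like that is simple: idle replicas are billed for every hour they run. A rough sketch follows, with a hypothetical hourly rate that is not taken from any real price list; check current pricing for your instance type.

```python
def idle_cost(replicas, hours, hourly_rate):
    """Cost of keeping `replicas` running for `hours`, traffic or not."""
    return replicas * hours * hourly_rate


HOURLY_RATE = 0.50   # hypothetical USD per replica-hour
WEEKEND_HOURS = 60   # e.g. Friday 18:00 to Monday 06:00

always_on = idle_cost(replicas=5, hours=WEEKEND_HOURS, hourly_rate=HOURLY_RATE)
scaled_to_zero = idle_cost(replicas=0, hours=WEEKEND_HOURS, hourly_rate=HOURLY_RATE)
print(f"min 5 replicas: ${always_on:.2f}, min 0 replicas: ${scaled_to_zero:.2f}")
```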

## Step 7: Alternative Deployment with Spaces

For smaller projects or demos, I use Hugging Face Spaces. It's simpler but less scalable:

1. Go to [huggingface.co/spaces](https://huggingface.co/spaces)

2. Create a new Space (choose Gradio or Streamlit)

3. Add a `requirements.txt` with your dependencies (here, `transformers` and `torch`)

4. Write a simple inference script in `app.py`:

```python

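import gradio as gr
from transformers import pipeline

# Load the model from the Hub once, when the Space starts
model = pipeline("sentiment-analysis", model="your-username/my-sentiment-model")

def predict(text):
    result = model(text)[0]
    # The "label" output expects a {class_name: confidence} mapping
    return {result["label"]: float(result["score"])}

gr.Interface(fn=predict, inputs="text", outputs="label").launch()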

```

## Conclusion

Deploying models with Hugging Face has transformed how I work. Here are my key takeaways:

1. **Always test locally first** – It saves hours of debugging.

2. **Use Inference Endpoints for production** – They handle scaling, load balancing, and monitoring out of the box.

3. **Optimize with ONNX and batching** – This can cut costs noticeably while improving performance, but measure the effect on your own workload.

4. **Monitor aggressively** – Set up alerts for latency and error rate spikes.

5. **Start small, scale smart** – Set the minimum replica count to 0 and rely on auto-scaling to avoid surprise bills.

The Hugging Face ecosystem eliminates most of the DevOps headaches associated with model deployment. In my experience, what used to take a week with Kubernetes and custom APIs now takes a few hours. Give it a try with your next model. I think you'll be amazed at how seamless it feels.