Replicate: Version control for machine learning

This guide will help you learn how Replicate works by building a simple model.

If you prefer working in notebooks, follow our notebook tutorial on Colab.

We're going to make a model that classifies Iris plants, trained on the Iris dataset. It's an intentionally simple model that trains really fast, just so we can show you how Replicate works.

Install dependencies

First, let's make a directory to work in:

mkdir iris-classifier
cd iris-classifier

Replicate is a Python package, and we need a few other Python packages to make the model run.

Create requirements.txt to define the Python packages to install:

replicate==0.2.4
scikit-learn==0.23.1
torch==1.4.0

Then, install them:

pip install -r requirements.txt

You might want to use a Virtualenv or Conda so these packages don't collide with others on your computer.
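
For example, with the venv module that ships with Python 3 (the .venv directory name is just a convention):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt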

Write a model

Copy and paste this code into train.py:

import argparse
import replicate
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import torch
from torch import nn


def train(learning_rate, num_epochs):
    # Create an "experiment". This represents a run of your training script.
    # It saves the training code at the given path and any hyperparameters.
    experiment = replicate.init(
        path=".",
        params={"learning_rate": learning_rate, "num_epochs": num_epochs},
    )

    print("Downloading data set...")
    iris = load_iris()
    train_features, val_features, train_labels, val_labels = train_test_split(
        iris.data,
        iris.target,
        train_size=0.8,
        test_size=0.2,
        random_state=0,
        stratify=iris.target,
    )
    train_features = torch.FloatTensor(train_features)
    val_features = torch.FloatTensor(val_features)
    train_labels = torch.LongTensor(train_labels)
    val_labels = torch.LongTensor(val_labels)

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(4, 15), nn.ReLU(), nn.Linear(15, 3))
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(train_features)
        loss = criterion(outputs, train_labels)
        loss.backward()
        optimizer.step()

        with torch.no_grad():
            model.eval()
            output = model(val_features)
            acc = (output.argmax(1) == val_labels).float().sum() / len(val_labels)

        print(
            "Epoch {}, train loss: {:.3f}, validation accuracy: {:.3f}".format(
                epoch, loss.item(), acc
            )
        )
        torch.save(model, "model.pth")

        # Create a checkpoint within the experiment.
        # This saves the metrics at that point, and makes a copy of the file
        # or directory given, which could include weights and any other artifacts.
        experiment.checkpoint(
            path="model.pth",
            step=epoch,
            metrics={"loss": loss.item(), "accuracy": acc},
            primary_metric=("loss", "minimize"),
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--num_epochs", type=int, default=100)
    args = parser.parse_args()
    train(args.learning_rate, args.num_epochs)

Notice the two lines in this code that call Replicate. They don't affect the behavior of the training; they just save data in Replicate to keep track of what is going on.

The first is replicate.init(). This creates an experiment, which represents a run of your training script. The experiment records the hyperparameters you pass to it and makes a copy of the given path to save your training code.

The second is experiment.checkpoint(). This creates a checkpoint within the experiment. The checkpoint saves the metrics at that point, and makes a copy of the file or directory you pass to it, which could include weights and any other artifacts.

Each experiment contains multiple checkpoints. You typically save your model periodically during training, because the best result isn't necessarily the most recent one. A checkpoint is created just after you save your model, so Replicate can keep track of versions of your saved model.
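
To see the pattern in isolation, here is a minimal sketch with the actual training replaced by a placeholder file (the hyperparameter and metric values here are stand-ins):

import replicate

# One experiment per run of the script; it records code and hyperparameters.
experiment = replicate.init(path=".", params={"learning_rate": 0.01})

for epoch in range(3):
    # ... real training for one epoch would happen here ...
    with open("model.pth", "w") as f:
        f.write("placeholder weights for epoch {}".format(epoch))

    # One checkpoint per saved model; it records metrics and copies the file.
    experiment.checkpoint(
        path="model.pth",
        step=epoch,
        metrics={"loss": 1.0 / (epoch + 1)},  # stand-in metric values
        primary_metric=("loss", "minimize"),
    )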

Define a repository

We need to tell Replicate where to store your experiments. Create replicate.yaml with this content:

repository: "file://.replicate"

This will store your experiments in the .replicate directory, relative to this file. You can also store them on Amazon S3 or Google Cloud Storage if you want to keep your data in the cloud. Learn more about this in our guide to using cloud storage.
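
For example, to store experiments in a cloud bucket instead, point the repository at it (the bucket name here is a placeholder; the bucket must exist and your cloud credentials must be configured):

repository: "s3://my-replicate-repo"

Google Cloud Storage works the same way, with a gs:// URL.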

Train the model

We're now going to train this model a couple of times with different parameters to see what we can do with Replicate.

First, train it with default parameters:

$ python train.py
Epoch 0, train loss: 1.184, validation accuracy: 0.333
Epoch 1, train loss: 1.117, validation accuracy: 0.333
Epoch 2, train loss: 1.061, validation accuracy: 0.467
...
Epoch 97, train loss: 0.121, validation accuracy: 1.000
Epoch 98, train loss: 0.119, validation accuracy: 1.000
Epoch 99, train loss: 0.118, validation accuracy: 1.000

Next, run the training with a different learning rate:

$ python train.py --learning_rate=0.2
Epoch 0, train loss: 1.184, validation accuracy: 0.333
Epoch 1, train loss: 1.161, validation accuracy: 0.633
Epoch 2, train loss: 1.124, validation accuracy: 0.667
...
Epoch 97, train loss: 0.057, validation accuracy: 0.967
Epoch 98, train loss: 0.057, validation accuracy: 0.967
Epoch 99, train loss: 0.056, validation accuracy: 0.967

Experiments and checkpoints

The calls to the replicate Python library have saved your experiments locally. You can use replicate ls to list them:

$ replicate ls
EXPERIMENT  STARTED         STATUS   LEARNING_RATE  LATEST CHECKPOINT  LOSS      BEST CHECKPOINT    LOSS
b90ad56     12 seconds ago  stopped  0.01           4941495 (step 99)  0.1176    4941495 (step 99)  0.1176
9cce006     3 seconds ago   stopped  0.2            a122e85 (step 99)  0.056486  a122e85 (step 99)  0.056486

The --filter flag allows you to narrow in on a subset of experiments:

$ replicate ls --filter "learning_rate = 0.2"
EXPERIMENT  STARTED        STATUS   LEARNING_RATE  LATEST CHECKPOINT  LOSS      BEST CHECKPOINT    LOSS
9cce006     3 seconds ago  stopped  0.2            a122e85 (step 99)  0.056486  a122e85 (step 99)  0.056486

As a reminder, this is a list of experiments, each of which represents a run of the train.py script. Each experiment stores a copy of the code as it was when the script was started.

Within experiments are checkpoints, which are created every time you call experiment.checkpoint() in your training script. A checkpoint contains your weights, training logs, and any other artifacts you want to save.

To list the checkpoints within an experiment, you can use replicate show. Run this, replacing b90ad56 with an experiment ID from your output of replicate ls:

$ replicate show b90ad56
Experiment: b90ad56a755371548ae2ab98c9d40a85911fd6198254880e600cdf00f55a18ca

Created:  Wed, 02 Sep 2020 20:44:51 PDT
Status:   stopped
Host:     107.133.144.125
User:     ben
Command:  train.py

Params
learning_rate:  0.01
num_epochs:     100

Checkpoints
ID       STEP  CREATED        ACCURACY  LOSS
9ed04a2  0     5 minutes ago  0.33333   1.1836
b37e01d  1     5 minutes ago  0.33333   1.1173
c74e9c6  2     5 minutes ago  0.46667   1.0611
7ba5b47  3     5 minutes ago  0.63333   1.0138
886f612  4     5 minutes ago  0.7       0.97689
667fdba  5     5 minutes ago  0.9       0.9496
...
cd1223c  95    5 minutes ago  1         0.12417
510eb98  96    5 minutes ago  1         0.12244
59129de  97    5 minutes ago  1         0.12076
e301a55  98    5 minutes ago  1         0.11915
4941495  99    5 minutes ago  1         0.1176 (best)

You can also use replicate show on a checkpoint to get all the information about it. Run this, replacing 494 with a checkpoint ID from the experiment:

$ replicate show 494
Checkpoint: 49414952394edfdf7923edd6bfb4aabe5558a6276a02a71a5965e1622ee7b9fd

Created:  Wed, 02 Sep 2020 20:44:52 PDT
Path:     model.pth
Step:     99

Experiment
ID:       b90ad56a755371548ae2ab98c9d40a85911fd6198254880e600cdf00f55a18ca
Created:  Wed, 02 Sep 2020 20:44:51 PDT
Status:   stopped
Host:     107.133.144.125
User:     ben
Command:  train.py

Params
learning_rate:  0.01
num_epochs:     100

Metrics
accuracy:  1
loss:      0.11759971082210541 (primary, minimize)

Notice that you can pass a prefix to replicate show, and it will automatically find the experiment or checkpoint whose ID starts with those characters. Saves a few keystrokes.

Compare checkpoints

Let's compare the last checkpoints from the two experiments we ran. Run this, replacing 4941495 and a122e85 with the two checkpoint IDs from the LATEST CHECKPOINT column in replicate ls:

$ replicate diff 4941495 a122e85
Experiment
ID:       b90ad56                        9cce006
Command:  train.py                       train.py --learning_rate=0.2
Created:  Wed, 02 Sep 2020 20:44:51 PDT  Wed, 02 Sep 2020 20:45:01 PDT

Params
learning_rate:  0.01  0.2

Checkpoint
ID:       4941495                        a122e85
Created:  Wed, 02 Sep 2020 20:44:55 PDT  Wed, 02 Sep 2020 20:45:04 PDT

Metrics
accuracy:  1                    0.9666666388511658
loss:      0.11759971082210541  0.056485891342163086

replicate diff works a bit like git diff, except in addition to the code, it compares all of the metadata that Replicate is aware of: params, metrics, dependencies, and so on.

replicate diff compares checkpoints, because checkpoints are the things that actually contain the results.

You can also pass an experiment ID, and it will pick the best or latest checkpoint from that experiment.
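
For example, using the two experiment IDs from the replicate ls output above:

$ replicate diff b90ad56 9cce006

Replicate picks the best or latest checkpoint from each experiment and compares those.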

Check out a checkpoint

At some point you might want to get back to some point in the past. Maybe you've run a bunch of experiments in parallel, and you want to choose one that works best. Or, perhaps you've gone down a line of exploration and it's not working, so you want to get back to where you were a week ago.

The replicate checkout command will copy the code and weights from a checkpoint into your working directory. Run this, replacing 4941495 with a checkpoint ID you passed to replicate diff:

$ replicate checkout 4941495
═══╡ The directory "/Users/ben/p/tmp/iris-classifier" is not empty.
═══╡ This checkout may overwrite existing files. Make sure you've committed everything to Git so it's safe!
Do you want to continue? (y/N) y
═══╡ Checked out 4941495 to "/Users/ben/p/tmp/iris-classifier"

The model file in your working directory is now the model saved in that checkpoint:

$ ls -lh model.pth
-rw-r--r-- 1 ben staff 8.3K Aug 7 16:42 model.pth
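
Because train.py saved the whole module with torch.save(), you can load the checked-out file directly and make a prediction. Here's a quick sketch (the sample measurements are made up):

import torch

# Load the checked-out model. train.py saved the entire module object,
# so torch.load() reconstructs it without redefining the architecture.
model = torch.load("model.pth")
model.eval()

# One Iris sample: sepal length, sepal width, petal length, petal width (cm).
sample = torch.FloatTensor([[5.1, 3.5, 1.4, 0.2]])
with torch.no_grad():
    predicted_class = model(sample).argmax(1).item()

print("Predicted class index:", predicted_class)  # 0, 1, or 2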

Checking out a checkpoint is useful for getting the trained model, but it also copies all of the code from the experiment that checkpoint is part of. If you made a change to the code and didn't commit it to Git, replicate checkout will allow you to get back the exact code from an experiment.

This means you don't have to remember to commit to Git when you're running experiments. Just try a bunch of things, then when you've found something that works, use Replicate to get back to the exact code that produced those results and formally commit it to Git.

Neat, huh? Replicate is keeping track of everything in the background so you don't have to.

The workflow so far

With these tools, let's recap what the workflow looks like:

  • Add experiment = replicate.init() and experiment.checkpoint() to your training code.
  • Run several experiments by running the training script as usual, with changes to the hyperparameters or code.
  • See the results of your experiments with replicate ls and replicate show.
  • Compare the differences between experiments with replicate diff.
  • Get the code from the best experiment with replicate checkout.
  • Commit that code cleanly to Git.

You don't have to keep track of what you changed in your experiments, because Replicate does that automatically for you. You can also safely change things without committing to Git, because replicate checkout will always be able to get you back to the exact environment the experiment was run in.

What's next

Next, you might want to:

If something doesn't make sense, doesn't work, or you just have some questions, please email us: team@replicate.ai. We love hearing from you!