Prepare and Deploy a TensorFlow Model to TensorFlow Serving
In this lab, we will:
- Download and run the ResNet module
- Create serving signatures for the module
- Export the model as a SavedModel
- Deploy the SavedModel with TensorFlow Serving
- Validate the deployed model
- Apply advanced model server configuration: warmup, batching, …
We will export two trained ResNet models, ResNet50 and ResNet101, and then serve them.
1. Download the pretrained models from TFHub:
$ wget https://storage.googleapis.com/tfhub-modules/google/imagenet/resnet_v2_50/classification/5.tar.gz
$ wget https://storage.googleapis.com/tfhub-modules/google/imagenet/resnet_v2_101/classification/5.tar.gz
Then extract them into a folder with a layout similar to the following:
$ tree /home/dangho/Downloads/resnet
/home/dangho/Downloads/resnet
├── 101
│ ├── saved_model.pb
│ └── variables
│ ├── variables.data-00000-of-00001
│ └── variables.index
└── 50
├── saved_model.pb
└── variables
├── variables.data-00000-of-00001
└── variables.index
4 directories, 6 files
Note that the directory names 50 and 101 are the model version numbers.
The expected input to most TF2.x image classification models is a rank-4 tensor conforming to the following tensor specification: tf.TensorSpec([None, height, width, 3], tf.float32). More concretely, the expected image size is height x width = 224 x 224. The color values for all channels are expected to be normalized to the [0, 1] range.
The output of the model is a batch of logits vectors. The indices into the logits are the num_classes = 1001 classes from the ImageNet dataset. The mapping from indices to class labels can be found in the labels file, with class 0 for “background”, followed by the 1000 actual ImageNet classes.
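For example, a single JPEG file could be brought into this format with standard TensorFlow image ops. The sketch below is only illustrative and the file path is hypothetical:

import tensorflow as tf

# Read and decode a JPEG file (hypothetical path), then convert it to the
# [None, 224, 224, 3] float32 format the model expects.
img_bytes = tf.io.read_file("images/cat.jpg")         # hypothetical file
img = tf.io.decode_jpeg(img_bytes, channels=3)        # uint8, [H, W, 3]
img = tf.image.convert_image_dtype(img, tf.float32)   # scales values to [0, 1]
img = tf.image.resize(img, [224, 224])                # [224, 224, 3]
batch = tf.expand_dims(img, axis=0)                   # [1, 224, 224, 3]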
We will now test the model on a couple of JPEG images.
Display sample images:
>>> import utils
>>> IMAGES_FOLDER="images"
>>> image_list = utils.get_image_list(image_dir=IMAGES_FOLDER)
>>> utils.show_image(image_list)
(720, 498, 3)
(600, 512, 3)
We get two images and their shapes, but the images need to be preprocessed to the (224, 224, 3) format expected by the ResNet model.
>>> size = 224
>>> raw_images = tf.stack(image_list)
>>> preprocessed_images = utils.preprocess_image(raw_images, size)
>>> preprocessed_images.shape
TensorShape([2, 224, 224, 3])
Run inference:
>>> model = tf.keras.models.load_model('/home/dangho/Downloads/resnet/101')
>>> predictions = model(preprocessed_images)
>>> predictions
<tf.Tensor: shape=(2, 1001), dtype=float32, numpy=
array([[ 0.27374715, -1.2126322 , -0.85858756, ..., -1.8846453 ,
0.25237346, 1.8259864 ],
[ 0.28163522, 0.61459076, -0.00311601, ..., -0.5948272 ,
-0.05215326, -0.11519516]], dtype=float32)>
The model returns a batch of arrays with logits. This is not a very user-friendly output, so we will convert it to a list of ImageNet class labels.
>>> import numpy as np
>>> imagenet_labels = np.array(open("labels/ImageNetLabels.txt").read().splitlines())
>>> imagenet_labels
array(['background', 'tench', 'goldfish', ..., 'bolete', 'ear',
'toilet tissue'], dtype='<U30')
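With the labels loaded, the logits returned above can be mapped to readable predictions, for example (a sketch reusing the predictions tensor and imagenet_labels array from above):

# Top-1: index of the largest logit per image, looked up in the labels array.
top1 = tf.argmax(predictions, axis=-1).numpy()
print(imagenet_labels[top1])

# Top-5: softmax the logits and take the five most probable classes per image.
probabilities = tf.nn.softmax(predictions, axis=-1)
top5_probs, top5_idx = tf.math.top_k(probabilities, k=5)
print(imagenet_labels[top5_idx.numpy()], top5_probs.numpy())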
2. Create Serving Signatures:
The inputs and outputs of the model as used during training may not be optimal for serving. For example, in a typical training pipeline, feature engineering is performed as a separate step preceding model training and hyperparameter tuning. When serving the model, it may be preferable to embed the feature engineering logic into the serving interface rather than require a client application to preprocess data.
The ResNet model from TFHub is optimized for recomposition and fine-tuning. Since there are no serving signatures in the model’s metadata, it cannot be served with TF Serving as is.
To make it servable, we need to add a serving signature(s) describing the inference method(s) of the model. We will add two signatures:
- The default signature: exposes the default predict method of the ResNet model.
- The pre/post-processing signature: since the default interface requires relatively complex image preprocessing to be performed by the client, we also expose an alternative signature that embeds the preprocessing and postprocessing logic, accepts raw unprocessed images, and returns a list of ranked class labels with their associated probabilities.
The signatures are created by defining a custom module class derived from the tf.Module base class. The custom module will be exported as a SavedModel that includes the original model, the preprocessing logic, and the two serving signatures.
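The lab keeps this class in utils.ServingModule, which is not reproduced here; the following is a minimal sketch of how such a module could look, consistent with the signatures shown below (the exact preprocessing steps and the top-5 cutoff are assumptions):

import tensorflow as tf

class ServingModule(tf.Module):
    """Wraps a ResNet classifier with two servable signatures."""

    def __init__(self, model, img_size, labels):
        super().__init__()
        self._model = model
        self._img_size = img_size
        self._labels = tf.constant(labels)

    def _decode_and_resize(self, image_bytes):
        # Decode raw image bytes, scale to [0, 1], resize to the model's input size.
        image = tf.io.decode_image(image_bytes, channels=3, expand_animations=False)
        image = tf.image.convert_image_dtype(image, tf.float32)
        return tf.image.resize(image, [self._img_size, self._img_size])

    @tf.function(input_signature=[tf.TensorSpec([None, 224, 224, 3], tf.float32, name="x")])
    def __call__(self, x):
        # Default signature: preprocessed images in, raw logits out.
        return self._model(x)

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string, name="raw_images")])
    def predict_labels(self, raw_images):
        # Pre/post-processing signature: encoded image bytes in, top-5 labels out.
        images = tf.map_fn(self._decode_and_resize, raw_images,
                           fn_output_signature=tf.float32)
        probabilities = tf.nn.softmax(self._model(images))
        top_probs, top_indices = tf.math.top_k(probabilities, k=5)
        return {"labels": tf.gather(self._labels, top_indices),
                "probabilities": top_probs}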
Test the custom serving module:
>>> serving_module = utils.ServingModule(model, size, imagenet_labels)
>>> predictions = serving_module.predict_labels(raw_images)
>>> predictions
{'labels': <tf.Tensor: shape=(2, 5), dtype=string, numpy=
array([[b'Egyptian cat', b'tiger cat', b'tabby', b'lynx', b'Siamese cat'],
[b'military uniform', b'suit', b'Windsor tie', b'pickelhaube',
b'bow tie']], dtype=object)>, 'probabilities': <tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[8.2705331e-01, 1.3128258e-01, 4.1055005e-02, 5.7081261e-04,
1.8924713e-05],
[9.4001341e-01, 4.8532788e-02, 6.4066364e-03, 2.0129983e-03,
6.0433790e-04]], dtype=float32)>}
In this case, the custom serving module returns only the top-5 labels with the highest probabilities.
3. Save the custom serving module as a SavedModel
>>> model_path = "/home/dangho/Downloads/resnet_serving/101"
>>> default_signature = serving_module.__call__.get_concrete_function()
>>> preprocess_signature = serving_module.predict_labels.get_concrete_function()
>>> signatures = {'serving_default': default_signature,
'serving_preprocess': preprocess_signature}
>>> tf.saved_model.save(serving_module, model_path, signatures=signatures)
INFO:tensorflow:Assets written to: /home/dangho/Downloads/resnet_serving/101/assets
Verify the signatures of the exported ResNet serving model:
$ saved_model_cli show --dir "${model_path}" --tag_set serve --all
...
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['__saved_model_init_op']:
The given SavedModel SignatureDef contains the following input(s):
The given SavedModel SignatureDef contains the following output(s):
outputs['__saved_model_init_op'] tensor_info:
dtype: DT_INVALID
shape: unknown_rank
name: NoOp
Method name is:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['x'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 224, 224, 3)
name: serving_default_x:0
The given SavedModel SignatureDef contains the following output(s):
outputs['output_0'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 1001)
name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
signature_def['serving_preprocess']:
The given SavedModel SignatureDef contains the following input(s):
inputs['raw_images'] tensor_info:
dtype: DT_STRING
shape: (-1)
name: serving_preprocess_raw_images:0
The given SavedModel SignatureDef contains the following output(s):
outputs['labels'] tensor_info:
dtype: DT_STRING
shape: (-1, -1)
name: StatefulPartitionedCall_1:0
outputs['probabilities'] tensor_info:
dtype: DT_FLOAT
shape: (-1, -1)
name: StatefulPartitionedCall_1:1
Method name is: tensorflow/serving/predict
Defined Functions:
Function Name: '__call__'
Option #1
Callable with:
Argument #1
x: TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name='x')
Function Name: 'predict_labels'
Option #1
Callable with:
Argument #1
raw_images: TensorSpec(shape=(None,), dtype=tf.string, name='raw_images')
Test loading and executing the SavedModel:
>>> model = tf.keras.models.load_model(model_path)
>>> model.predict_labels(raw_images)
{'probabilities': <tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[8.2705331e-01, 1.3128258e-01, 4.1055005e-02, 5.7081261e-04,
1.8924713e-05],
[9.4001341e-01, 4.8532788e-02, 6.4066364e-03, 2.0129983e-03,
6.0433790e-04]], dtype=float32)>, 'labels': <tf.Tensor: shape=(2, 5), dtype=string, numpy=
array([[b'Egyptian cat', b'tiger cat', b'tabby', b'lynx', b'Siamese cat'],
[b'military uniform', b'suit', b'Windsor tie', b'pickelhaube',
b'bow tie']], dtype=object)>}
We can do the same with the ResNet50 model, for example with the export loop sketched below. Finally, we have custom serving modules for the ResNet model in two versions, 50 and 101.
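A possible export loop covering both versions (a sketch; it simply repeats the steps above for each directory):

# Sketch: export serving modules for both ResNet variants.
# Paths mirror the directories used above; adjust to your environment.
for version in ["50", "101"]:
    model = tf.keras.models.load_model(f"/home/dangho/Downloads/resnet/{version}")
    serving_module = utils.ServingModule(model, size, imagenet_labels)
    signatures = {
        "serving_default": serving_module.__call__.get_concrete_function(),
        "serving_preprocess": serving_module.predict_labels.get_concrete_function(),
    }
    tf.saved_model.save(serving_module,
                        f"/home/dangho/Downloads/resnet_serving/{version}",
                        signatures=signatures)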
$ tree /home/dangho/Downloads/resnet_serving
├── 101
│ ├── assets
│ ├── saved_model.pb
│ └── variables
│ ├── variables.data-00000-of-00001
│ └── variables.index
└── 50
├── assets
├── saved_model.pb
└── variables
├── variables.data-00000-of-00001
└── variables.index
6 directories, 6 files
4. Deploying the SavedModel
Now we will serve multiple versions of the ResNet model by writing a model config file, models.config:
model_config_list {
config {
name: 'resnet'
base_path: '/models/resnet/'
model_platform: 'tensorflow'
model_version_policy {
specific {
versions: 50
versions: 101
}
}
version_labels {
key: 'stable'
value: 50
}
version_labels {
key: 'canary'
value: 101
}
}
}
Write the docker-compose.yaml file:
version: '3.2'
services:
tf-serving:
container_name: tf_serving
image: tensorflow/serving:2.5.1
ports:
- "8501:8501"
- "8500:8500"
volumes:
- "/home/dangho/Downloads/resnet_serving:/models/resnet"
- "./models.config:/models/models.config"
command:
- '--model_config_file=/models/models.config'
- '--model_config_file_poll_wait_seconds=60'
- '--allow_version_labels_for_unavailable_models'
By default, TF Serving does not allow assigning labels to model versions that are not yet loaded and ready to serve. Normally you would have to omit the labels from the config file, spin up the container (otherwise you would get an error), and only add the labels to the config once the models were loaded. Setting the --allow_version_labels_for_unavailable_models flag avoids this.
Deploy the model with the docker-compose command:
$ docker-compose up -d
[+] Running 2/2
⠿ Network tensorflow-serving-configuration_default Created 0.2s
⠿ Container tf_serving Started 1.2s
$ docker ps | grep tf_serving
e1f6a5ed1276 tensorflow/serving:2.5.1 "/usr/bin/tf_serving…" About a minute ago Up About a minute 0.0.0.0:8500-8501->8500-8501/tcp, :::8500-8501->8500-8501/tcp tf_serving
$ docker logs -f tf_serving
...
2023-10-01 11:26:22.126759: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/resnet/50/assets.extra/tf_serving_warmup_requests
2023-10-01 11:26:23.183213: I tensorflow_serving/model_servers/server.cc:393] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2023-10-01 11:26:23.185253: I tensorflow_serving/model_servers/server.cc:414] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...
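Before sending prediction requests, you can optionally confirm that both versions are loaded by querying TensorFlow Serving's model status REST endpoint; a healthy server reports each version with state AVAILABLE:

$ curl http://localhost:8501/v1/models/resnet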
5. Validating the deployed model
$ curl -d @payloads/request-body.json -X POST http://localhost:8501/v1/models/resnet/versions/50:predict
{
"predictions": [
{
"labels": ["military uniform", "pickelhaube", "suit", "Windsor tie", "bearskin"],
"probabilities": [0.453408211, 0.209194973, 0.193582058, 0.0409308933, 0.0137334978]
}
]
}
Now request a prediction from the same model, but this time using labels instead of version numbers. Notice that the URL changes slightly:
$ curl -d @payloads/request-body.json -X POST http://localhost:8501/v1/models/resnet/labels/canary:predict
{
"predictions": [
{
"probabilities": [0.940013, 0.0485330448, 0.00640664576, 0.0020130109, 0.000604341098],
"labels": ["military uniform", "suit", "Windsor tie", "pickelhaube", "bow tie"]
}
]
}
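For reference, the same kind of request can be sent from Python. The payload layout below (base64-encoded image bytes sent to the serving_preprocess signature) is an assumption consistent with the response fields shown above, since request-body.json itself is not reproduced here:

import base64
import json
import requests

# Hypothetical image path; any JPEG will do.
with open("images/cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "signature_name": "serving_preprocess",
    # TF Serving decodes {"b64": ...} values into DT_STRING tensor elements.
    "instances": [{"b64": image_b64}],
}

resp = requests.post(
    "http://localhost:8501/v1/models/resnet/labels/stable:predict",
    data=json.dumps(payload),
)
print(resp.json())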
6. Advanced model server configuration
6.1. SavedModel Warmup
Right after the model has been deployed without warmup, the response time of the first request looks like this:
$ curl -d @payloads/request-body.json -X POST http://localhost:8501/v1/models/resnet/versions/50:predict -o /dev/null -s -w 'Total: %{time_total}s\n'
Total: 1,423064s
To enable model warmup, you provide user-generated PredictionLog records in the assets.extra/ directory. First, generate the PredictionLogs with the utils/warmup_serving.py module and save them to a file:
$ python3 utils/warmup_serving.py
...
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
$ tree /home/dangho/Downloads/resnet_serving
/home/dangho/Downloads/resnet_serving
├── 101
│ ├── assets
│ ├── assets.extra
│ │ └── tf_serving_warmup_requests
│ ├── saved_model.pb
│ └── variables
│ ├── variables.data-00000-of-00001
│ └── variables.index
└── 50
├── assets
├── assets.extra
│ └── tf_serving_warmup_requests
├── saved_model.pb
└── variables
├── variables.data-00000-of-00001
└── variables.index
8 directories, 8 files
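The warmup script is not reproduced here, but conceptually it writes serialized PredictionLog protos into a tf_serving_warmup_requests TFRecord file under each version's assets.extra/ directory, roughly like this sketch (the paths, the request payload, and reusing the serving_preprocess signature are assumptions; the record count matches the server log below):

import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

model_dir = "/home/dangho/Downloads/resnet_serving/50"   # repeat for 101
warmup_file = f"{model_dir}/assets.extra/tf_serving_warmup_requests"

# Build a representative request for the serving_preprocess signature.
with open("images/cat.jpg", "rb") as f:                  # hypothetical image
    image_bytes = f.read()

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet"
request.model_spec.signature_name = "serving_preprocess"
request.inputs["raw_images"].CopyFrom(
    tf.make_tensor_proto([image_bytes], dtype=tf.string))

log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request))

# Write the warmup records next to the SavedModel.
tf.io.gfile.makedirs(f"{model_dir}/assets.extra")
with tf.io.TFRecordWriter(warmup_file) as writer:
    for _ in range(200):
        writer.write(log.SerializeToString())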
Now we will test the model serving with warmup:
# Re-create the container
$ docker-compose up -d
[+] Running 2/2
⠿ Network tensorflow-serving-configuration_default Created 0.1s
⠿ Container tf_serving Started 0.8s
# Trace logs
$ docker logs -f tf_serving
...
2023-10-02 06:53:31.495273: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:122] Finished reading warmup data for model at /models/resnet/50/assets.extra/tf_serving_warmup_requests. Number of warmup records read: 200. Elapsed time (microseconds): 45988765.
# Exec curl
$ curl -d @payloads/request-body.json -X POST http://localhost:8501/v1/models/resnet/versions/50:predict -o /dev/null -s -w 'Total: %{time_total}s\n'
Total: 0,129297s
It is easy to see that warmup significantly reduces the latency of the first requests after startup.
6.2. Monitoring Configuration
TensorFlow Serving gives you the ability to expose metrics for Prometheus to monitor your model server. Use the --monitoring_config_file flag to specify a monitor.config file containing a MonitoringConfig protocol buffer, which looks like this:
prometheus_config {
enable: true,
path: "/monitoring/prometheus/metrics"
}
Next, pass this monitoring configuration to the model server by mounting the file and adding the flag in its deployment manifest:
version: '3.2'
services:
tf-serving:
container_name: tf_serving
image: tensorflow/serving:2.5.1
ports:
- "8501:8501"
- "8500:8500"
volumes:
- "/home/dangho/Downloads/resnet_serving:/models/resnet"
- "./models.config:/models/models.config"
- "./monitor.config:/models/monitor.config"
command:
- '--model_config_file=/models/models.config'
- '--model_config_file_poll_wait_seconds=60'
- '--allow_version_labels_for_unavailable_models'
- '--monitoring_config_file=/models/monitor.config'
Deploying your model server:
$ docker-compose up -d
You will also be able to check that the metrics are correctly exported at http://localhost:8501/monitoring/prometheus/metrics. The output looks like this:
# TYPE :tensorflow:api:op:using_fake_quantization gauge
# TYPE :tensorflow:cc:saved_model:load_attempt_count counter
:tensorflow:cc:saved_model:load_attempt_count{model_path="/models/resnet/101",status="success"} 1
:tensorflow:cc:saved_model:load_attempt_count{model_path="/models/resnet/50",status="success"} 1
# TYPE :tensorflow:cc:saved_model:load_latency counter
:tensorflow:cc:saved_model:load_latency{model_path="/models/resnet/101"} 3020490
:tensorflow:cc:saved_model:load_latency{model_path="/models/resnet/50"} 1455731
# TYPE :tensorflow:cc:saved_model:load_latency_by_stage histogram
:tensorflow:serving:runtime_latency_bucket{model_name="resnet",API="Predict",runtime="TF1",le="10"} 0
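On the Prometheus side, a minimal scrape configuration pointing at this endpoint could look like the following sketch (the job name and scrape interval are assumptions):

scrape_configs:
  - job_name: 'tf-serving'
    scrape_interval: 15s
    metrics_path: '/monitoring/prometheus/metrics'
    static_configs:
      - targets: ['localhost:8501']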
6.3. Batching
TensorFlow Serving allows you to perform request batching by setting the --enable_batching and --batching_parameters_file flags, where the batching.config file may look like:
max_batch_size { value: 128 }
batch_timeout_micros { value: 0 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 4 }
Next, configure the model server to batch online requests together by providing the following in its deployment manifest:
version: '3.2'
services:
tf-serving:
container_name: tf_serving
image: tensorflow/serving:2.5.1
ports:
- "8501:8501"
- "8500:8500"
volumes:
- "/home/dangho/Downloads/resnet_serving:/models/resnet"
- "./models.config:/models/models.config"
- "./monitor.config:/models/monitor.config"
- "./batching.config:/models/batching.config"
command:
- '--model_config_file=/models/models.config'
- '--model_config_file_poll_wait_seconds=60'
- '--allow_version_labels_for_unavailable_models'
- '--monitoring_config_file=/models/monitor.config'
- '--batching_parameters_file=/models/batching.config'
- '--enable_batching'
(Optional) gRPC vs. REST
In this section, we will benchmark the inference time of gRPC versus REST by running the Python script:
$ python3 utils/benchmark.py
========== Test on 250 records ==========
Predict gRPC in ...s
Predict REST api in ...s
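For reference, the gRPC side of such a benchmark typically looks like the following sketch (the host, signature, and payload are assumptions, not the exact contents of utils/benchmark.py):

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Open a gRPC channel to the model server and build a prediction stub.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

with open("images/cat.jpg", "rb") as f:      # hypothetical image
    image_bytes = f.read()

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet"
request.model_spec.signature_name = "serving_preprocess"
request.inputs["raw_images"].CopyFrom(
    tf.make_tensor_proto([image_bytes], dtype=tf.string))

response = stub.Predict(request, 10.0)       # 10-second timeout
print(response.outputs["labels"])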
You can get the full source code here!