MLOps Chapter 8 : Model Server with Nvidia Triton - Local - Part 1.b

MLOps Nov 1, 2021

In our previous post we covered - How to Setup the basic pre-requisites for Triton and How to Setup a Model Inference Server.

If you happened to miss our previous article, I recommend following the link below before continuing.

MLOps Chapter 8 : Model Server with Nvidia Triton - Local - Part 1.a
In our previous post we explored what Nvidia Triton is and how it is changing the world of MLOps. If you haven't read it yet, I would highly recommend going through it to understand the context for what we are trying to achieve in this experiment.

So what is today's article all about?

Now that you know how to spin up an Inference Server with a configured Model Registry, wouldn't you want to know how to use the Model Server to run inference?

The catch with these Inference Servers is that one has to write their own client scripts to leverage the endpoints offered by the Model Inference Server and run inference from an application.

Although Nvidia provides some documentation around using Triton, to be honest it is quite sparse and not enough to implement a full-scale system, and that is the sole reason for this article.

To explore the official documentation, follow the link below -

Documentation - Latest Release :: NVIDIA Deep Learning Triton Inference Server Documentation
This Triton Inference Server documentation focuses on the Triton inference server and its benefits. The inference server is included within the inference server container. This guide provides step-by-step instructions for pulling and running the Triton inference server container, along with the deta…

Now with that out of the way, let's get started -

Step 0: Install the necessary packages

pip3 install "tritonclient[all]"

tritonclient is the package we will be leveraging to run inference against our Model Server.

Step 1: Import the necessary Packages

import argparse
import sys
import numpy as np
from PIL import Image

import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

As we are dealing with images and may need some pre-processing before sending an image for inference, we also import NumPy and PIL.

Step 2: Take User Arguments

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true", default=False,
                        help="Enable verbose output")
    parser.add_argument("-u", "--url", type=str, default="localhost:8000",
                        help="Inference server URL. Default is localhost:8000.")

    FLAGS = parser.parse_args()

One thing to highlight: if you recollect, the Docker run command in our previous post exposed 3 ports to the user.

Before we move to the next step, let's try and understand those better -

  1. 8000: A port at which we can make HTTP calls to run inference
  2. 8001: A port at which we can make gRPC calls to run inference
  3. 8002: A port at which Prometheus metrics for the deployed models, such as throughput and latency, are available.

As we are only interested in running inference over HTTP for now, we will use port 8000 by default.
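As a quick aside, the metrics port can be probed with nothing more than the standard library. This is a minimal sketch, assuming a Triton server running locally with the default metrics endpoint at `http://localhost:8002/metrics`; it degrades gracefully when no server is up:

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch_metrics(url="http://localhost:8002/metrics", timeout=2.0):
    """Return the raw Prometheus metrics text, or None if the server is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except (URLError, OSError):
        return None

metrics = fetch_metrics()
print("metrics available" if metrics is not None else "no server running")
```

The returned text is in the standard Prometheus exposition format, so it can also be scraped directly by a Prometheus instance instead of by hand.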

Step 3: Instantiate Triton Client

try:
    triton_client = httpclient.InferenceServerClient(
        url=FLAGS.url, verbose=FLAGS.verbose
    )
except Exception as e:
    print("context creation failed: " + str(e))
    sys.exit()

As you can see, we have simply instantiated the Triton client with the URL and verbosity flag received from the user arguments.

Step 4: Define your model

In order to run inference, we need to provide the name of the model we want to run inference from.

In addition, by default the Triton server picks up the latest version of the model in the registry unless a specific version is requested (the client's infer call also accepts a model_version argument).

model_name = "intel_image_class"
image_path = "<path_to_your_image>"

We also provide a sample image path on which we will test our inference server.

Step 5: Pre-process your Input

Now all we need to do is load the image and process it into the form the model expects, as defined in our config.pbtxt.

image = np.asarray(Image.open(image_path).resize((100, 100)))
image = np.expand_dims(image, axis=0)
image = np.divide(image, 255.0).astype("float32")

As you can see, we haven't done any heavy pre-processing, just resizing the image and converting it to float32, as that is what we specified in our config file.
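The same steps can be wrapped in a small helper so the pre-processing stays in one place. This is only a sketch; the 100x100 input size is taken from the snippet above and should match whatever dims your config.pbtxt declares:

```python
import numpy as np
from PIL import Image

def preprocess(img, size=(100, 100)):
    """Resize, add a batch dimension, and scale pixels into [0, 1] as float32."""
    arr = np.asarray(img.resize(size))   # (H, W, C) uint8
    arr = np.expand_dims(arr, axis=0)    # (1, H, W, C)
    return np.divide(arr, 255.0).astype("float32")

# Usage with a synthetic RGB image standing in for a file on disk:
demo = Image.fromarray(np.zeros((64, 64, 3), dtype=np.uint8))
batch = preprocess(demo)
print(batch.shape, batch.dtype)  # (1, 100, 100, 3) float32
```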

Step 6: Run your Inference

inputs = []
inputs.append(
    httpclient.InferInput(name="input_1", shape=image.shape, datatype="FP32")
)
inputs[0].set_data_from_numpy(image, binary_data=False)

outputs = []
# The name must match the output layer defined in your config.pbtxt
outputs.append(
    httpclient.InferRequestedOutput(name="<your_output_name>", binary_data=False)
)

result = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

Let's dive a little deeper into this code snippet to understand what we are trying to achieve.

  1. We need to define two lists, inputs and outputs, corresponding to what is mentioned in our config file.
  2. The input is provided as a NumPy array, as defined in our config.
  3. The name is the same as the layer name in the config.
  4. The shape and data type are expected to match the values in the config file, or else the server will throw an error.
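Because a mismatch only surfaces as a server-side error, it can help to validate the payload client-side before sending it. A minimal sketch, assuming the dims and FP32 data type from our config.pbtxt (both are assumptions here and should be read from your own config):

```python
import numpy as np

EXPECTED_DIMS = (100, 100, 3)   # assumption: dims declared in config.pbtxt
EXPECTED_DTYPE = np.float32     # "FP32" in Triton config terms

def validate_payload(arr):
    """Return True if arr is a batched tensor matching the expected spec."""
    return (arr.ndim == len(EXPECTED_DIMS) + 1
            and arr.shape[1:] == EXPECTED_DIMS
            and arr.dtype == EXPECTED_DTYPE)

good = np.zeros((1, 100, 100, 3), dtype=np.float32)
bad = np.zeros((1, 100, 100, 3), dtype=np.uint8)
print(validate_payload(good), validate_payload(bad))  # True False
```

Failing fast on the client avoids a round trip to the server just to learn that a dtype was wrong.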

Now all that is left is for you to run the script and enjoy your Inference.

The sample output would be a Numpy array of Probabilities provided by the model.

If needed, post-processing can be applied to get the results into your expected format.
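As a minimal sketch of such post-processing, assuming the model returns one probability per class: the label list below is hypothetical (replace it with the classes your model was actually trained on), and in practice the probabilities would come from result.as_numpy(...) rather than a hand-made array.

```python
import numpy as np

# Hypothetical labels; replace with the classes your model was trained on.
LABELS = ["buildings", "forest", "glacier", "mountain", "sea", "street"]

def postprocess(probs, labels=LABELS):
    """Map a (1, num_classes) probability array to a (label, confidence) pair."""
    probs = np.asarray(probs).reshape(-1)
    idx = int(np.argmax(probs))
    return labels[idx], float(probs[idx])

# Fake result standing in for result.as_numpy("<your_output_name>"):
fake = np.array([[0.05, 0.7, 0.05, 0.1, 0.05, 0.05]], dtype=np.float32)
print(postprocess(fake))  # picks the highest-probability class, ~0.7
```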

Step 7: Bonus Step

With our previous run command, the Triton server loads the model registry only at startup.

But what if I decide to train my model again to improve its performance, or maybe I want to add more models to my model registry?

It would be highly inconvenient for me to restart my server every time there is a change in the Model Registry.

To avoid such issues and to make our lives easier, Triton offers a concept called Model Control Mode.

In such a situation, one can make changes to the model registry and the Triton Server will pick the changes in real time without having to restart your server.

All you need to do is send additional parameters in your Run Command as follows -

docker run -d -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/model/registry:/models tritonserver --model-repository=/models --model-control-mode=poll --repository-poll-secs=5

Now every 5 seconds, your Inference Server will check for updates in your Model Registry and update itself without the hassle of restarting.


Congratulations! If you were able to follow our steps without any hiccups, you should have a working Model Inference Server with a sample script to run inference from it.

I hope this article was helpful in clearing up a lot of concepts and queries regarding the end-to-end implementation of Nvidia Triton.

You can find the official Github Repository of Triton on the below mentioned link with a lot more features detailed out -

GitHub - triton-inference-server/server: The Triton Inference Server provides an optimized cloud and edge inferencing solution.

As for the code written in the script above, you can find it in the official Github repository of Chronicles of AI -

AI-kosh/mlops/chp_8 at main · Chronicles-of-AI/AI-kosh
Archives of blogs on Chronicles of AI.

In our future posts we will be covering more exciting features of Triton and its integration with Cloud Platforms.



Vaibhav Satpathy

AI Enthusiast and Explorer
