MLOps Chapter 8 : Model Server with Nvidia Triton - Local - Part 1.b

MLOps Nov 1, 2021

In our previous post we covered - How to Setup the basic pre-requisites for Triton and How to Setup a Model Inference Server.

If you happened to miss our previous article, I recommend following the link below before continuing.

MLOps Chapter 8 : Model Server with Nvidia Triton - Local - Part 1.a
In our previous post we explored what Nvidia Triton is and how it is changing the world of MLOps. If you haven't read it yet, I would highly recommend going through it to understand the context for what we are trying to achieve in this experiment.

So what is today's article all about?

Now that you know how to spin up an Inference Server with a configured Model Registry, wouldn't you want to know how to use the Model Server to run inference?

The catch with these Inference Servers is that one has to write their own client scripts to leverage the endpoints offered by the Model Inference Server and run inference from an application.

Although Nvidia provides some documentation around using Triton, to be honest it is quite sparse and not enough to implement a full-scale system, and that is the sole reason for this article.

To explore the official documentation, follow the link below -

Documentation - Latest Release :: NVIDIA Deep Learning Triton Inference Server Documentation
This Triton Inference Server documentation focuses on the Triton inference server and its benefits. The inference server is included within the inference server container. This guide provides step-by-step instructions for pulling and running the Triton inference server container, along with the deta…

Now with that out of the way, let's get started -

Step 0: Install the necessary packages

pip3 install "tritonclient[all]"

tritonclient is the package we will be leveraging to run inference against our Model Server.

Step 1: Import the necessary Packages

import argparse
import sys
import numpy as np
from PIL import Image

import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

As we are dealing with images and may need some pre-processing before sending an image for inference, we also import NumPy and PIL.

Step 2: Take User Arguments

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true", default=False,
                        help="Enable verbose output")
    parser.add_argument("-u", "--url", type=str, default="localhost:8000",
                        help="Inference server URL. Default is localhost:8000.")

    FLAGS = parser.parse_args()

One thing to highlight: if you recollect, the Docker run command in our previous post exposed 3 ports to the user.

Before we move to the next step, let's try and understand those better -

  1. 8000: A port at which we can make HTTP calls to run inference
  2. 8001: A port at which we can make gRPC calls to run inference
  3. 8002: A port at which Prometheus metrics for the deployed models, such as throughput and latency, are available.

As we are only interested in running inference over HTTP for now, we will use port 8000 by default.
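As a quick aside, the metrics port can be probed with nothing more than the standard library. This is a minimal sketch, assuming a Triton server running locally with the default metrics endpoint at `http://localhost:8002/metrics`; it degrades gracefully when no server is up:

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch_metrics(url="http://localhost:8002/metrics", timeout=2.0):
    """Return the raw Prometheus metrics text, or None if the server is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except (URLError, OSError):
        return None

metrics = fetch_metrics()
print("metrics available" if metrics is not None else "no server running")
```

The returned text is in the standard Prometheus exposition format, so it can also be scraped directly by a Prometheus instance instead of by hand.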

Step 3: Instantiate Triton Client

try:
    triton_client = httpclient.InferenceServerClient(
        url=FLAGS.url, verbose=FLAGS.verbose
    )
except Exception as e:
    print("context creation failed: " + str(e))
    sys.exit()

As you can see, we have simply instantiated the Triton client with the URL and verbosity flag received from the user arguments.

Step 4: Define your model

In order to run inference, we need to provide the name of the model we want to run inference from.

In addition, by default the Triton server picks up the latest version of the model in the registry unless a specific version is requested (the client's infer call also accepts a model_version argument).

model_name = "intel_image_class"
image_path = "<path_to_your_image>"

We also provide a sample image path on which we will test our inference server.

Step 5: Pre-process your Input

Now all we need to do is load the image and process it into the form the model expects, as defined in our config.pbtxt.

image = np.asarray(Image.open(image_path).resize((100, 100)))
image = np.expand_dims(image, axis=0)
image = np.divide(image, 255.0).astype("float32")

As you can see, we haven't done any heavy pre-processing, just resizing the image and converting it to float32, as that is what we specified in our config file.
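The same steps can be wrapped in a small helper so the pre-processing stays in one place. This is only a sketch; the 100x100 input size is taken from the snippet above and should match whatever dims your config.pbtxt declares:

```python
import numpy as np
from PIL import Image

def preprocess(img, size=(100, 100)):
    """Resize, add a batch dimension, and scale pixels into [0, 1] as float32."""
    arr = np.asarray(img.resize(size))   # (H, W, C) uint8
    arr = np.expand_dims(arr, axis=0)    # (1, H, W, C)
    return np.divide(arr, 255.0).astype("float32")

# Usage with a synthetic RGB image standing in for a file on disk:
demo = Image.fromarray(np.zeros((64, 64, 3), dtype=np.uint8))
batch = preprocess(demo)
print(batch.shape, batch.dtype)  # (1, 100, 100, 3) float32
```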

Step 6: Run your Inference

inputs = []
inputs.append(
    httpclient.InferInput(name="input_1", shape=image.shape, datatype="FP32")
)
inputs[0].set_data_from_numpy(image, binary_data=False)

outputs = []
# The name must match the output layer defined in your config.pbtxt
outputs.append(
    httpclient.InferRequestedOutput(name="<your_output_name>", binary_data=False)
)

result = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

Let's dive a little deeper into this code snippet to understand what we are trying to achieve.

  1. We need to define two lists, inputs and outputs, corresponding to what is mentioned in our config file.
  2. The input is provided as a NumPy array, as defined in our config.
  3. The name is the same as the layer name in the config.
  4. The shape and data type are expected to match the values in the config file, or else the server will throw an error.
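Because a mismatch only surfaces as a server-side error, it can help to validate the payload client-side before sending it. A minimal sketch, assuming the dims and FP32 data type from our config.pbtxt (both are assumptions here and should be read from your own config):

```python
import numpy as np

EXPECTED_DIMS = (100, 100, 3)   # assumption: dims declared in config.pbtxt
EXPECTED_DTYPE = np.float32     # "FP32" in Triton config terms

def validate_payload(arr):
    """Return True if arr is a batched tensor matching the expected spec."""
    return (arr.ndim == len(EXPECTED_DIMS) + 1
            and arr.shape[1:] == EXPECTED_DIMS
            and arr.dtype == EXPECTED_DTYPE)

good = np.zeros((1, 100, 100, 3), dtype=np.float32)
bad = np.zeros((1, 100, 100, 3), dtype=np.uint8)
print(validate_payload(good), validate_payload(bad))  # True False
```

Failing fast on the client avoids a round trip to the server just to learn that a dtype was wrong.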

Now all that is left is for you to run the script and enjoy your Inference.

The sample output would be a Numpy array of Probabilities provided by the model.

If needed, post-processing can be applied to get the results into your expected format.
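As a minimal sketch of such post-processing, assuming the model returns one probability per class: the label list below is hypothetical (replace it with the classes your model was actually trained on), and in practice the probabilities would come from result.as_numpy(...) rather than a hand-made array.

```python
import numpy as np

# Hypothetical labels; replace with the classes your model was trained on.
LABELS = ["buildings", "forest", "glacier", "mountain", "sea", "street"]

def postprocess(probs, labels=LABELS):
    """Map a (1, num_classes) probability array to a (label, confidence) pair."""
    probs = np.asarray(probs).reshape(-1)
    idx = int(np.argmax(probs))
    return labels[idx], float(probs[idx])

# Fake result standing in for result.as_numpy("<your_output_name>"):
fake = np.array([[0.05, 0.7, 0.05, 0.1, 0.05, 0.05]], dtype=np.float32)
print(postprocess(fake))  # picks the highest-probability class, ~0.7
```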

Step 7: Bonus Step

With our previous run command, the Triton server loads the model registry only at startup.

But what if I decide to train my model again to improve its performance, or maybe I want to add more models to my model registry?

It would be highly inconvenient for me to restart my server every time there is a change in the Model Registry.

To avoid such issues and to make our lives easier, Triton offers a concept called Model Control Mode.

In such a situation, one can make changes to the model registry and the Triton Server will pick the changes in real time without having to restart your server.

All you need to do is send additional parameters in your Run Command as follows -

docker run -d -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/model/registry:/models tritonserver --model-repository=/models --model-control-mode=poll --repository-poll-secs=5

Now every 5 seconds, your Inference Server will check for updates in your Model Registry and update itself without the hassle of restarting.


Congratulations! If you were able to follow our steps without any hiccups, you should have a working Model Inference Server with a sample script to run inference from it.

I hope this article was helpful in clearing up a lot of concepts and queries regarding the end-to-end implementation of Nvidia Triton.

You can find the official Github Repository of Triton on the below mentioned link with a lot more features detailed out -

GitHub - triton-inference-server/server: The Triton Inference Server provides an optimized cloud and edge inferencing solution.

As for the code written in the script above, you can find it in the official Github repository of Chronicles of AI -

AI-kosh/mlops/chp_8 at main · Chronicles-of-AI/AI-kosh
Archives of blogs on Chronicles of AI.

In our future posts we will be covering more exciting features of Triton and its integration with Cloud Platforms.



Vaibhav Satpathy

AI Enthusiast and Explorer
