Looking for the Perfect Dashboard: InfluxDB, Telegraf and Grafana - Part XXXV (GPU Monitoring)

Greetings friends, I have been showing you all the goodness of Telegraf, InfluxDB, and Grafana since 2016. And I never get tired of it, because every day or week there is a new technology or piece of hardware that I want to monitor to keep more detailed control of my whole environment.

Just a few days ago I acquired an NVIDIA RTX 3090, one of the best graphics cards available today, and of course the first thing I thought was: I need to monitor this hardware, which cost more than any of my servers, all of which I already monitor without problems.

Dashboard for NVIDIA GPUs (extendable to AMD with a little work)

By the end of this entry you will have a Dashboard that lets you visualize the details summarized below.

Dashboard – Summary

This dashboard, very useful and powerful, shows us the following details:

Filter by Host and by GPU type – In case we have several PCs/VDIs; it will show us the model as well.
Temperatures – The GPU temperature and, very importantly (only on the 3070, 3080 and 3090), the VRAM temperature, better known as GPU Memory Junction.
- Besides the most recent value, there is a graph with its history.
Fans – The percentage at which our fans are spinning; if you have more fans, you will have to modify the query.
Speeds – The dashboard shows the current clock speeds of the GPU, memory, etc.
Consumption – The utilization, in percent, of our GPU and its memory.
Power Consumption – Power consumption in watts, which is useful to calculate how much you are spending.
- Besides the most recent value, there is a graph with its history.
Mining – It is something simple: if the memory is constantly at 100%, mining is active. No other application keeps the memory at 100% for such a long period of time.
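
To make the last two points concrete, here is a small Python sketch. It is not part of the Dashboard itself; the function names, the 60-sample window and the electricity price in the usage note are assumptions of mine for illustration, so adjust them to your own tariff and polling interval.

```python
from typing import Sequence


def energy_cost(avg_watts: float, hours: float, price_per_kwh: float) -> float:
    """Cost of drawing avg_watts for the given number of hours.

    price_per_kwh is whatever your provider charges (hypothetical input).
    """
    kwh = avg_watts / 1000.0 * hours
    return kwh * price_per_kwh


def looks_like_mining(mem_util_samples: Sequence[float],
                      threshold: float = 100.0,
                      min_samples: int = 60) -> bool:
    """Heuristic from the text: memory pinned at 100% for a long stretch.

    Returns True only if the last min_samples readings are all at or
    above threshold. Both defaults are assumptions, not Dashboard values.
    """
    if len(mem_util_samples) < min_samples:
        return False
    return all(s >= threshold for s in mem_util_samples[-min_samples:])
```

For example, energy_cost(261, 24, 0.25) estimates one day at a constant 261 W with an assumed price of 0.25 per kWh, and looks_like_mining(samples) can be fed the memory percentage readings collected over the period you care about.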

Topology with all logical components

This entry is different from the previous ones: it includes more components, some of them paid, such as HWInfo64 Pro. I leave the complete image here:
We will need Telegraf for Windows (I assume you have already followed the steps to send information to InfluxDB). In addition, we will need the NVIDIA drivers installed, which include a small application called nvidia-smi, and finally we also need HWInfo64 Pro, which lets us use shared memory, and Remote Sensor Monitor, which presents the data in JSON format on a local web page.

Small system requirements before continuing

We need to verify that nvidia-smi is installed. To do so, from a CMD, we go to the directory C:\Program Files\NVIDIA Corporation\NVSMI and launch the executable to see the result:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Sun May 23 12:30:07 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 466.27       Driver Version: 466.27       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:02:00.0  On |                  N/A |
| 70%   45C    P2   261W / 280W |   8576MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
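
If you prefer to check the same values from a script, here is a minimal Python sketch that asks nvidia-smi for a few properties in CSV format and turns each GPU into a dictionary. It is only a manual sanity check, not what the Telegraf plugin runs internally; the function names and the chosen property list are my own, and the properties your driver supports can be listed with nvidia-smi --help-query-gpu.

```python
import csv
import subprocess

# Properties requested from nvidia-smi; the selection is an assumption,
# pick the ones you care about.
FIELDS = ["name", "temperature.gpu", "fan.speed", "power.draw",
          "utilization.gpu", "memory.used", "memory.total"]


def parse_query_output(text):
    """Turn 'csv,noheader,nounits' output into one dict per GPU.

    Values are kept as strings, exactly as nvidia-smi printed them.
    """
    rows = []
    for record in csv.reader(text.splitlines(), skipinitialspace=True):
        if record:  # skip empty lines
            rows.append(dict(zip(FIELDS, record)))
    return rows


def query_gpus(binary="nvidia-smi"):
    """Run nvidia-smi and parse its answer.

    binary may be the full path to nvidia-smi.exe if it is not on PATH.
    """
    cmd = [binary,
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_query_output(result.stdout)
```

Calling query_gpus() on the machine with the card returns a list with one dictionary per GPU; if the executable is not on the PATH, pass its full path as the binary argument.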

All right, let’s go to the next step: we install the latest version of HWInfo64. We need to buy a license and put it in the folder where HWInfo64 is installed. If we run the application in sensors-only mode, we will see the following very accurate and interesting data about our GPU:

Finally, we have to enable access to the shared memory, which is as simple as clicking on the icon and activating it. Now minimize the application and download the add-on called Remote Sensor Monitor from its official website:

From a CMD, we launch the Remote Sensor Monitor with the following parameters:

"Remote Sensor Monitor.exe" --gpuz=0 --aida64=0 --ohm=0

This will show us the following message, all good and ready:

Starting Remote Sensor Monitor on port 55555...
HWiNFO process found! Enabling HWiNFO...
Please wait for the web server to start.......Web server running
Press [ENTER] to quit

Which is fantastic news. Now, from our browser, we go to http://localhost:55555/ and we will see the following:
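
The same page can also be read from a script. The following Python sketch assumes the page returns a JSON array of objects with SensorName, SensorValue and SensorUnit keys (the ones used later in telegraf.conf) and that SensorValue arrives as text; the function names are my own, and readings that cannot be converted to a number are simply skipped.

```python
import json
from urllib.request import urlopen


def parse_sensors(payload):
    """Map each sensor name to a (value, unit) tuple.

    payload is the JSON text served by Remote Sensor Monitor. Entries
    without a numeric SensorValue are ignored.
    """
    readings = {}
    for item in json.loads(payload):
        try:
            value = float(item["SensorValue"])
        except (KeyError, TypeError, ValueError):
            continue
        readings[item.get("SensorName", "")] = (value, item.get("SensorUnit", ""))
    return readings


def fetch_sensors(url="http://localhost:55555/"):
    """Download the sensor page and parse it."""
    with urlopen(url, timeout=5) as response:
        # utf-8-sig also copes with a leading byte order mark, if present.
        text = response.read().decode("utf-8-sig", errors="replace")
    return parse_sensors(text)
```

With Remote Sensor Monitor running, fetch_sensors() returns a dictionary you can print or filter, which is a quick way to confirm that the GPU sensors you expect are really being published before involving Telegraf.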

Configuration of the telegraf.conf file

We have almost everything ready. One last step remains: activate these two plugins in the telegraf.conf configuration file, which is as simple as adding the following at the end of the file:

# Pulls statistics from nvidia GPUs attached to the host
[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
  
[[inputs.http]]
#URL of the Remote Sensor Monitor page that serves the HWInfo64 data in JSON format
urls = ["http://localhost:55555/"]
method = "GET"
#Overwrite measurement name from default `http` to `hwinfo_stats`
name_override = "hwinfo_stats"

#Exclude the url item from tags
tagexclude = ["url"]

#Data from HTTP in JSON format
data_format = "json"

#Set hwinfo metadata as tags
tag_keys = ["SensorName", "SensorUnit"]

#JSON values to set as string fields
json_string_fields = ["SensorValue"]

insecure_skip_verify = true
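
To get a feeling for what ends up in InfluxDB with this configuration, the following Python sketch renders one sensor object as a line-protocol point: SensorName and SensorUnit become tags, and SensorValue is kept as a string field. This is my approximation of the result, not Telegraf's actual code; the helper names and the host value are hypothetical.

```python
def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def escape_string_field(value):
    """Escape backslashes and double quotes inside a string field value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_line_protocol(sensor, host, measurement="hwinfo_stats"):
    """Render one sensor dict as a line-protocol point without timestamp.

    sensor is expected to carry SensorName, SensorUnit and SensorValue;
    host stands in for the tag Telegraf adds by itself (assumption).
    """
    tags = {
        "SensorName": sensor.get("SensorName", ""),
        "SensorUnit": sensor.get("SensorUnit", ""),
        "host": host,
    }
    # Empty tag values are not valid in line protocol, so leave them out.
    tag_part = ",".join(
        f"{key}={escape_tag(val)}" for key, val in sorted(tags.items()) if val
    )
    field = escape_string_field(str(sensor.get("SensorValue", "")))
    return f'{measurement},{tag_part} SensorValue="{field}"'
```

Because SensorValue is stored as a string, any query that needs arithmetic on it has to convert it first, which is worth keeping in mind if you build your own panels on top of hwinfo_stats.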

That is all. We restart (or start) the telegraf service and go to Grafana.

Grafana Dashboards

I created a Dashboard from scratch, selecting the best queries to the database, polishing the colors, and thinking about the graphs and how to display them. Everything is automated to fit your environment without any problems and without having to edit anything manually. The Dashboard can be found here; once imported, you can use the top drop-down menus to select between different PCs, cards, etc.:

https://grafana.com/grafana/dashboards/14478

Importing the Grafana Dashboard the easy way

So that you don’t have to waste hours configuring a new Dashboard, and ingesting and debugging what you want, I have already created a wonderful Dashboard with everything you need to monitor your environment in a very simple way; it will look like the image I showed you above. When importing, select the name you want and enter the ID 14478, which is the unique ID of the Dashboard, or the URL:

https://grafana.com/grafana/dashboards/14478

With the drop-down menus we can choose between GPUs, PCs, etc.

Please leave your comments here, or on GitHub. Thanks a lot for reading!

I hope you like it, and I would like to leave you the complete series here, so you can start playing with the plugins that I have been telling you about all these years:
