VMware: Unlocking ESXi Overprovisioning Insights, A Quick, Cost-Effective Solution with Grafana, InfluxDB, and Telegraf on VMware

Greetings friends, one of the common questions I get is how we can get how many times an ESXi is overprovisioned. A few months ago I already covered how to get this data, in the case you are already using Veeam ONE:

Veeam: Safeguarding vSphere Clusters with Veeam ONE – Overcommitment Report for Optimal Health

But I am aware that not everyone is using Veeam ONE today, and perhaps you are using the free Influxdata telegraf, Grafana, and the dashboards I’ve built and showed you how to configure a few times:

So, if you are already using the integration, I have good news for you. Thanks to Kris Vermeulen, who reminded about this recently, I am plelased to share with all of you an update to your current Dashboards. If you download the new version it will llook more or less like this:

How can I get the last version? Is it free, dude?

You can download the latest revision from here, and of course it is free. I have never seen a cent out of this work. I am just pleased it reached 22K datacenters.

https://grafana.com/grafana/dashboards/8159-vmware-vsphere-overview/

Getting into the Matrix

Quick Intro

For some of you this is not needed, so if you only use the product as it is and do not want to explore further, now is your time to close this tab <> Maybe you can leave a comment, or feedback about the project before 🙂

To calculate CPU Overcommit, or overprovisioned we need to measurements from InfluxDB:

vsphere_host_cpu
vsphere_vm_cpu

These are very simple tables, where the data looks like this, once grouped by either Host, or VM:

Following the white rabbit

From here we can see something interesting, and that is that we have a lot of data, metrics/values, per CPU (vCPU or Core for Hosts), that is great, and a normal query will be good to see consumption and usage, now, when we want to calculate howe many of that column are we can put something like this:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "vsphere_host_cpu")
  |> filter(fn: (r) => r["esxhostname"] == 'MYESXI.MYDOMAIN')
  |> filter(fn: (r) => r["cpu"] != "instance-total")
  |> keep(columns: ["cpu"])
  |> distinct(column: "cpu")
  |> group()
  |> count()

That it is going to give us the total number of CPU we have on that ESXi. We are going in the right direction!

Let’s move to VMs, as we do not only want to get how many CPU has every VM, but also we want to know the assigned Host, so the query is a bit more complex (just adding the ESXi to the mix really):

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "vsphere_vm_cpu")
  |> filter(fn: (r) => r["esxhostname"] == 'MYESXI.MYDOMAIN')
  |> keep(columns: ["cpu","vmname"])
  |> filter(fn: (r) => r["cpu"] != "instance-total")
  |> group(columns: ["vmname"])
  |> distinct(column: "cpu")
  |> group()
  |> count()

And with this, we will have the total number provisioned CPU across all VMs on that ESXi. As we are using a range of time, the query should calculate only VMs that have data, meaning that if you delete a VM 7 days ago, it should not count if you run it with a period of 24 hours ago.

Now, this is all good and seems simple. How do we run a mathematical operation to see how many times is the total VM CPUs vs the Host CPU?

That is a very good question, I couldn’t find anything directly on InfluxDB, so I am happy to be using Grafana for this purpose. Let’s move to the next section then.

Grafana, zeros and ones

So, here we are, inside Grafana queries. We will start adding the queries one by one. First the Host CPU oriented one, and second the VM CPU oriented one. Mind that I am using clever regex and Grafana variables that I have already defined on the dashboard settings, so we can use them on cases like this:

The final query, C, it is a simple Expression where I do a simple division of the Total VM CPU divided by the Total Host CPU. Extremely simple, once we know what we are doing, right? Once we have all of this, we can play with the overrides, colors and all, to have something that looks good enough for our use-case:

That’s all folks, if you are interested into more posts about Grafana, please help yourself:

Comments

Sino says

6th December 2023 at 9:17 am

Dashboard download is not working. Thank you 4 your great work
jorgeuk says

6th December 2023 at 5:01 pm

Hello, might be a bug on Grafana.com, try the revisions page, download latest: https://grafana.com/grafana/dashboards/8159-vmware-vsphere-overview/?tab=revisions

Cheers!
Sucr4m says

14th April 2024 at 9:39 am

Hello Jorge, thanks for dashboard is working perfectly.

In the vsphere overview row, when there is more than one vcenter it separates the information, how is it possible for information from each vcenter to appear when filtered?
jorgeuk says

26th April 2024 at 2:22 pm

Hello,
Can you please share some screenshot so I can take a look?

Thank you!
Roozbeh says

22nd August 2024 at 8:36 pm

Thank you for all Dashboard, really helpful, all of my VMware dashbaord stop working after 24 Hrs, when I do ” systemctl status telegraf.service” it gives me below:
[inputs.vsphere] Error in plugin: Post “https://vcenter.exmaple.com/sdk”: dial tcp: lookup vcenter.exmaple.com: no such host

I can ping vcenter.exmaple.com from my ubuntu 22.04 with no issue, however I replaced it with vCenter IP and restarted telegraph service, now I am getting:

[inputs.vsphere] Unable to find metric name for id 429. Skipping!
jorgeuk says

24th August 2024 at 11:47 am

Hello,
What vCenter version are you running? And the unable to find metric is a warning or error?

Also, what telegraf version?
Roozbeh says

26th August 2024 at 4:47 pm

Hi,
Telegraf 1.21.4+ds1-0ubuntu2
vCenter: 8.0.2 23929136
it was metric error.
I am in day 5th with IP on vsphere-stats.conf and no issue, for some reason FQDN stop working after 24 hours, not a big deal just wanted to let you know.