Looking for the Perfect Dashboard: InfluxDB, Telegraf and Grafana – Part XII (Native Telegraf Plugin for vSphere)

Greetings friends, today I bring you another one of those hidden gems that you like so much. Besides being free and deployable in a few minutes, it has potential that many commercial tools would envy.

Today we are going to create four fresh Grafana Dashboards within minutes. By the end of this post, we will have several Dashboards (plural, friends) similar to these:

vSphere Overview Dashboard

vSphere Hosts Overview Dashboard

vSphere Datastore Overview

vSphere VM Overview

Telegraf Plugin for VMware vSphere

My friend Craig told me that an official Telegraf plugin for vSphere had been released a few days ago, so the first thing I did was go to its GitHub page and check it out:

The plugin is pure joy, not only because it speaks directly with the vCenter SDK, but also because we can monitor all the following parameters:

  • Cluster Stats
    • Cluster services: CPU, memory, failover
    • CPU: total, usage
    • Memory: consumed, total, vmmemctl
    • VM operations: # changes, clone, create, deploy, destroy, power, reboot, reconfigure, register, reset, shutdown, standby, vmotion
  • Host Stats:
    • CPU: total, usage, cost, mhz
    • Datastore: iops, latency, read/write bytes, # reads/writes
    • Disk: commands, latency, kernel reads/writes, # reads/writes, queues
    • Memory: total, usage, active, latency, swap, shared, vmmemctl
    • Network: broadcast, bytes, dropped, errors, multicast, packets, usage
    • Power: energy, usage, capacity
    • Res CPU: active, max, running
    • Storage Adapter: commands, latency, # reads/writes
    • Storage Path: commands, latency, # reads/writes
    • System Resources: cpu active, cpu max, cpu running, cpu usage, mem allocated, mem consumed, mem shared, swap
    • System: uptime
    • Flash Module: active VMDKs
  • VM Stats:
    • CPU: demand, usage, readiness, cost, mhz
    • Datastore: latency, # reads/writes
    • Disk: commands, latency, # reads/writes, provisioned, usage
    • Memory: granted, usage, active, swap, vmmemctl
    • Network: broadcast, bytes, dropped, multicast, packets, usage
    • Power: energy, usage
    • Res CPU: active, max, running
    • System: operating system uptime, uptime
    • Virtual Disk: seeks, # reads/writes, latency, load
  • Datastore stats:
    • Disk: Capacity, provisioned, used

Impressive, right? If you do not have Telegraf, InfluxDB and Grafana yet, follow these steps (and these for Grafana). Those of you who have already followed the whole series in Spanish only have to update the system to receive the vSphere plugin for Telegraf:

The telegraf package will show up with an update available, so we say yes when asked to upgrade:

Once the package is installed, we only need to edit the Telegraf configuration file, /etc/telegraf/telegraf.conf, and remove the # in front of the vSphere plugin section:

Of course, we will also have to uncomment the parameters of the plugin:
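As a rough sketch of what the uncommented section can look like, the fragment below enables the plugin with only the three mandatory settings. The vCenter URL, username and password are placeholders (assumptions for illustration, not values from my lab), and every other parameter is left at the plugin default.

```toml
# Read metrics from one or many vCenters
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored.
  ## Placeholder hostname: replace it with your own vCenter.
  vcenters = [ "https://vcenter.local/sdk" ]
  ## Placeholder credentials: replace them with an account that can read your inventory.
  username = "user@corp.local"
  password = "secret"
```

Note that the lines are uncommented by removing the leading #, while the explanatory lines starting with ## can stay as they are.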

Once done, if we are not using a certificate signed by a valid CA, or if the CA is not installed on the server running Grafana, InfluxDB and Telegraf, please uncomment insecure_skip_verify as well and set it to true:
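For reference, a minimal sketch of that setting. The line belongs inside the existing [[inputs.vsphere]] section rather than in a new one, and the value has to be true for the certificate check to be skipped; left at false, verification stays on.

```toml
# Inside the existing [[inputs.vsphere]] section of telegraf.conf
## Use SSL but skip chain & host verification
## (intended for self-signed or otherwise untrusted vCenter certificates)
insecure_skip_verify = true
```

Skipping verification is convenient in a lab, but it also means Telegraf will not notice if it is talking to the wrong server, so trusting the certificate is the cleaner choice where possible.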

Another option is to download the SSL certificate from our vCenter to the Telegraf server, so that it is trusted:
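A different way to reach the same result, instead of adding the certificate to the system trust store, is to point the plugin at the downloaded CA file. This is only a sketch under assumptions: I am assuming the option is called ssl_ca as in the plugin's sample configuration of that release (newer Telegraf versions name it tls_ca), and the file path is a hypothetical location.

```toml
# Inside the existing [[inputs.vsphere]] section of telegraf.conf
## Optional SSL Config
## Hypothetical path: wherever the vCenter CA certificate was saved.
## Option name assumed to be ssl_ca (tls_ca on newer Telegraf releases).
ssl_ca = "/etc/telegraf/vcenter-ca.pem"
## With a trusted CA in place, verification can stay enabled.
insecure_skip_verify = false
```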

Let’s finally restart the telegraf service:

Verifying that we are ingesting information with Chronograf

At this point, if we have followed all the steps correctly, Telegraf should already be sending the information it collects to InfluxDB. If we run a search using the wonderful Chronograf, we can verify that data is arriving:

All the variables of this new vSphere plugin for Telegraf are stored in vsphere_* so it’s really easy to find them.

Grafana Dashboards

This is where I have worked really hard: I created the Dashboards from scratch, selecting the best queries to the database, polishing the colors, and thinking about which graph to use and how to show it. In addition, everything is automated so that it fits your environment without any problem and without you having to edit anything manually. You can find the Dashboards here; once you have imported all four, you can move between them with the menu at the top right. Now it is time to download them, or at least note their IDs:

How to easily import the Grafana Dashboards

So that you don’t have to waste hours configuring a new Dashboard and debugging queries, I’ve already created four wonderful Dashboards with everything you need to monitor your environment in a very simple way. They will look like the images I showed you above.

In Grafana, go to Create – Import

Select the name you want and enter the IDs one by one: 8159, 8162, 8165, 8168, which are the unique IDs of the Dashboards, or use the URLs:

  • https://grafana.com/dashboards/8159
  • https://grafana.com/dashboards/8162
  • https://grafana.com/dashboards/8165
  • https://grafana.com/dashboards/8168

With the menu at the top right, you can switch between the Dashboards of Hosts, Datastores, VMs and of course the main one of Overview:

One of the improvements these Dashboards include is the variable selection at the top left: depending on what you select, you will see only the Cluster, ESXi host, or VM you are interested in. Please leave your feedback in the comments.

If you want to see them working without installing anything, here is the link to my environment:

That’s all folks. If you want to follow the full blog series about Grafana, InfluxDB and Telegraf, please click on the following links:

42 Thoughts

  1. Hi Jorge,

    I am looking to setup InfluxDB, Telegraf and Grafana – Part XII (Native Telegraf Plugin for vSphere) in our environment.
    Could you please provide the full installation and configuration document on windows platform.

  2. Very cool, I set this up this morning on a large instance and your dashboards are beautiful!
    I can’t seem to get datastore ‘used’ metrics though, perhaps our vSphere version 5.5 is too old ?

  3. Hello Tom,
    On which dashboard exactly? I have updated a new version, it is on the grafana.com site, please download the new version. Let me know exactly, or share some screenshots please 🙂

    Thank you for the feedback!

  4. Hi Guys,

    i get the error in the telegraf logs,

    [input.vsphere]: Error in discovery for 10.1.101.180:7444: ServerFaultCode: Request version ‘urn:vim25/6.7’ and namespace ‘urn:vim25’ are not supported

    Im unable to connect to my vCenter any ideas ?

    thanks in advance..

  5. Can you please try to do an apt-get upgrade or yum upgrade? It looks like you might have an old openssl on the Telegraf side. Also, would you mind letting me know your vSphere version?

  6. This is great work, I got it installed with no issues. I am trying to update the dashboards to allow another search field, datacenter, but I am having no luck finding that key value. Any ideas?

  7. Hello James, which Dashboard, and which panel are you trying to update? Is that DC inside the same VC?

  8. Hi Jorge,

    thanks for the blog article.
    You mean “insecure_skip_verify = true” instead of “insecure_skip_verify = false”, right ?

  9. All 4 dashboards. And yes the DC is in the same Vcenter. we have multiple vcenter with multiple DC by having this searching and filtering would be a great added value.

  10. Definitely, let me dig into it and I will let you know when the version on grafana.com is updated.

    Thank you!

  11. Hi Wesley, as mentioned by you on Slack, uncomment the datastore section, like this:
    datastore_metric_include = []

    Best regards

  12. Hey Jorge,

    First, thank you for your awesome hard work!

    I am getting errors in telegraf from the vsphere plugin.

    [input.vsphere]: Error in discovery for : Post https:///10.1.0.43/sdk: http: no Host in request URL

    Would you happen to know what the error means? I have not found anything.

  13. HI Again,

    Please can I get some advice? I have managed to get it all working (Very Awesome) but now I'm only getting certain datastores back,

    Its only pulling -7 through but i have 16 DS,

    this is what is in my config
    ## Datastores
    datastore_metric_include = [] ## if omitted or empty, all metrics are collected
    # datastore_metric_exclude = [] ## Nothing excluded by default
    # datastore_instances = false ## false by default for Datastores only

    any advice would be appreciated ..

    Thanks
    David

  14. Hello David,
    Are the ones missing NFS? Can you please try to increase the timeout, also the max_query_objects and max_query_metrics, and on Grafana try to show a wider range, like the last 3 hours or so. Let us know

  15. Hi Jorge,

    Thanks for the reply, I have done as you have asked, i have also removed some metrics and its actually getting worse less metrics are getting pulled in and yes it was NFS datastore not being pulled in… this is what i have changed in my Config..

    with these change below i have all the datastore showing now but just no metrics

    ## Default data collection interval for all inputs
    interval = "60s" # changed from 10

    ## This controls the size of writes that Telegraf sends to output plugins.
    metric_batch_size = 10000 # changed from 1000

    # ## number of go routines to use for collection and discovery of objects and metrics
    collect_concurrency = 5
    discover_concurrency = 3

    # ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
    max_query_objects = 1000 # changed from 256

    # ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
    max_query_metrics = 1000 # changed from 256

    any help would be much appreciated

    thanks
    David

  16. Hi Jorge,

    Great work on this! Thank you! I was able to get it up and running quickly thanks to your documentation.

    The only issue that I have is that NONE of my Datastore are showing. They are all iSCSI and here’s my current settings per your documentation:

    ## Datastores
    datastore_metric_include = [] ## if omitted or empty, all metrics are collected
    # datastore_metric_exclude = [] ## Nothing excluded by default
    # datastore_instances = true ## false by default for Datastores only

    If you can give me some assistance I would appreciate it.

    Thanks,
    Edward

  17. Hi Edward, can you please change the timeout to something higher, and maybe the:
    [agent]
    ## Default data collection interval for all inputs
    interval = "60s"

    will do the trick too

  18. Jorge,

    I’ve changed the timeout to “100s” and have updated the interval to “60s”, restarted the necessary services to reflect the changes and still NO info for all of my Datastores.

    Any other recommendation that you think I should change or look into?

    Just wondering, did your Dashboard work right off the bat or did you have to tweak it and made some changes to get your Datastore readings? If so, please let me know what other settings you might have updated to get the Datastore to show.

    Thanks,
    Edward

  19. Hi Edward,
    It does work out of the box for me. Here is my config, just the datastore section and the tweaks:
    # Configuration for telegraf agent
    [agent]
    ## Default data collection interval for all inputs
    interval = "60s"
    ## Rounds collection interval to 'interval'
    ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
    round_interval = true

    [[inputs.vsphere]]
    ## Datastores
    datastore_metric_include = [] ## if omitted or empty, all metrics are collected
    # datastore_metric_exclude = [] ## Nothing excluded by default
    # datastore_instances = false ## false by default for Datastores only

    # ## timeout applies to any of the api request made to vcenter
    timeout = "180s"

    Then at the top of Grafana I select 1, 3 or 6 hours, and it all works. Can you check in Chronograf whether you are indeed sending any data at all? And check with tail -f /var/log/telegraf/telegraf.log that no errors appear?

    thank you!

  20. Hi Jorge,

    I’ve made all the changes you’ve recommended and unfortunately Datastore is still not showing.

    Only errors that I see is this:

    Oct 05 09:14:31 vm-stats telegraf[1110]: 2018-10-05T13:14:31Z W! [outputs.influxdb] when writing to [http://localhost:8086]: database “telegraf” creation failed: Post http://localhost:8086/query: dial tcp 127.0.0.1:8086: connect: connect

    The rest of the Dashboard is working perfectly other than the Datastore status/section.

    If you can think of anything else for me to look into that would be much appreciated.

    Thanks,
    Edward

  21. Great work!

    Question: how can we add more than one Vcenter?

    Can you explain what the syntax is please, I cannot find that anywhere, I have 2 vcenters.

    Something like:
    vcenters = [ “https://vcenter1.local/sdk” ] [ “https://vcenter2.local/sdk” ]
    Or maybe like this?
    vcenters = [ “https://vcenter1.local/sdk” “https://vcenter2.local/sdk” ]

    How is it done???

    Thanks in advance!

  22. Hello,
    I do not have 2 vCenters to try, but it should be as it always is in Telegraf:
    vcenters = [ "https://vcenter1.local/sdk", "https://vcenter2.local/sdk" ]

    Can you please try it?

  23. Hello Edward, what is that vm-stats? Is it maybe another plugin you had? I would recommend copying telegraf.conf to telegraf.conf.old, then copying telegraf.conf.dpkg-dist to telegraf.conf and editing the InfluxDB basics if needed. Then, under telegraf.d, create a new vsphere.conf where you put just your new config directly from this blog, to see if that works.

  24. Jorge,

    “vm-stats” is the hostname.

    I’ll copy the telegraf.conf and give that a shot. I’ll let you know how it goes.

  25. Hello,

    Thanks for the awesome guide!

    Is there any way to get the used percentage of the Virtual Machine’s CPU?

  26. “Hello,
    I do not have 2 vCenters to try, but it should be as it always is in Telegraf:
    vcenters = [ "https://vcenter1.local/sdk", "https://vcenter2.local/sdk" ]

    Can you please try it?”

    Hey Jorge,

    Thanks, I have configured two Vcenters and this works just fine, thank you.

    ” # # Read metrics from VMware vCenter
    [[inputs.vsphere]]
    # ## List of vCenter URLs to be monitored. These three lines must be uncommented
    # ## and edited for the plugin to work.
    vcenters = [ "https://192.168.1.1/sdk", "https://192.168.1.2/sdk" ]
    username = "User@Domain"
    password = "P@$$w0rd"
    #
    # ## VMs
    # ## Typical VM metrics (if omitted or empty, all metrics are collected)”

  27. Hi,
    how is it possible to exclude datastore metrics?
    i want to exclude all local datastores which named all “hypervisorname-local”.
    i tried datastore_metric_exclude = ["*-local"] but i still collect metrics for these datastores.

  28. Hello Florian,
    In Grafana, the Datastore variable already excludes the Veeam ones. Take a look: at the moment it is a regex, /^(?!VeeamBackup_)/. Add your own pattern there, so at least Grafana doesn’t show them.

    I will investigate how to not ingest the data from Telegraf.

  29. Thanks a lot Jorge for your excellent work! I have couple of queries:

    1. Cluster variable Filter is not working for me. Doesn’t matter which cluster I choose, it shows all the hypervisors.
    2. It is taking ages to load the graphs for Hosts view as I have 100s of hosts.

    Any help with the same will be appreciated man 🙂

  30. Yeah, I am testing right now. I have a vCenter with 300 hosts and Grafana keeps crashing because of Java. I was looking at converting it to Elasticsearch because you're able to cluster for free.

  31. Hi Jorge, please can i ask how to connect to two vCenter on different username and password..

    thanks in advance..
