Finally, a chance to sit down and talk about some Graphite! And not the awesome structure pictured above even though it does look like a sweet thing to talk about.
Let's start with an update..
- Grafana is up and running with data from both our Graphite instances easily accessible.
- Collectd is my new choice of tool for getting that much-needed SNMP data into Graphite.. ooh the purty graphs.. the Netscaler stats.. a quick, but great win.
- Graphite/Carbon/Whisper were all updated from 0.9.12 to 0.9.13 almost seamlessly.
I guess it's explanation time..
Grafana (imagine this said spookily!)
We had a Grafana instance already that kind of fell by the wayside. There were some issues with our team in Budapest having timeouts accessing our default dashboards, which relied on the Graphite PNG rendering (which is a pain). That's where I thought I'd throw a few hours into seeing what's needed to get Grafana up and running.
Turns out, getting Grafana running was easy - it was already running! Upgrading to the most recent version allowed for multiple Graphite backends and with a little tweaking of the configs, all was well in the world and Grafana became my default graphing tool.
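For anyone wondering why this helps with the PNG pain: as far as I understand it, Grafana asks Graphite's render API for the raw datapoints as JSON and draws the graph in the browser, rather than asking the server to render an image. Here's a minimal sketch of what such a request URL looks like. The host name and metric target are made up for illustration, and `render_url` is just a helper I named for this example.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, time_from="-7d", fmt="json"):
    """Build a Graphite render API URL for one or more targets."""
    # One "target" parameter per metric expression, then the time range and format.
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("format", fmt)]
    return base_url.rstrip("/") + "/render?" + urlencode(params)


# Hypothetical host and metric name, for illustration only.
url = render_url("http://graphite.example.com", ["carbon.agents.*.cpuUsage"])
print(url)
```

Swapping `fmt="json"` for `fmt="png"` gives you the server-rendered image version of the same request, which is the kind of thing our old dashboards leaned on.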
Collectd.. the systems statistics collection daemon
Well, most people will tell you that this is a great tool and it's super easy to configure. Most people are also overthinking what they need and aim to collect everything they can in the hope that one day they will have the data right in front of them. I'm not most people. All I wanted was SNMP info for interfaces on our switches and Netscaler stats.
Luckily, this was as easy as using Collectd with the SNMP and RRDtool plugins on the Graphite boxes themselves.
By using the Graphite servers as the pollers I'm increasing memory usage by ~11 MB, and the extra CPU usage is barely noticeable. The overhead wasn't even worth a consideration when I knew that Collectd would do all I wanted it to do, while also allowing me to do it via Puppet.
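If you've never looked at what actually lands in Graphite, it's refreshingly simple. Carbon's plaintext listener (port 2003 by default) takes one datapoint per line: metric path, value, Unix timestamp. The sketch below is not how Collectd does it internally, just a rough illustration of the format. The metric path is hypothetical, and `carbon_line` and `send_to_carbon` are names I made up for the example.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one datapoint as 'path value timestamp' plus a trailing newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_carbon(lines, host="localhost", port=2003):
    """Send already formatted lines to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path; this only prints the line, nothing is sent.
print(carbon_line("network.switch01.eth0.if_octets.rx", 1234, 1400000000), end="")
```

Calling `send_to_carbon([...])` with a few of these lines against a test Carbon instance is a handy way to check that new metric paths show up where you expect before pointing a real collector at it.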
A few hours of tweaking and bingo! I had my network devices making their way into Graphite. Oh no, what's this? Graphite/Carbon/what-the-nonsense is starting to have wonderful crashing issues where it flips a shit and everything goes to hell. Well, let's turn off Collectd until this is sorted out.
Enter the surprising Graphite/Carbon/Whisper 0.9.13 update!
I had read about many users experiencing issues with Carbon going somewhat off the rails. I was hoping it wasn't the same issue that was troubling me. However, 0.9.13 had some fixes that intrigued me. I figured now was as good a time as any to make the jump.
After updating the version and the URL paths for the packages in Puppet, I let Puppet do its thing and a seamless upgrade occurred. I couldn't be happier.
Some days are longer than others.. some days are awesomely short!
Some time went by (read this as.. it was the weekend) and it was time to get Collectd back up and running.. start up the service.. wait..
Oh look.. Graphite is holding its own.. not bad.. alright.. stats are a bit peaky but it's still chugging along without issues. Great!
Graphs!! (It's why we're here right?)
Here's what you're looking at for a 7 day time spread.
You can clearly see the sadness of Carbon flipping out and going mental.
The upgrade to 0.9.13 resulted in more update operations and lower CPU usage. This is a win for me. With SSDs, I'd rather see more updates than fewer.
Collectd was enabled and you can see how awesomely the committed points increased with the new metrics. Great!
That's fine for Carbon.. what was the cost??
I bet you're asking yourself, what about the host? How did the host change? How? Well, not that bad actually.
We definitely introduced some CPU wait, but that's not too bad of a place to be.
So far.. everything's chugging along quite nicely.
Well, it seems there's still one big part of this whole system that needs some work. Apache.
Since we're hitting Apache so much with our dashboards, I'm still trying to figure out how to improve performance with minimal effort and without getting into tuning Apache.
If you've ever tried to tune Apache you'll know why I'm trying to avoid doing it.. it can be a lot of fun but what a pain.
My boss (Jim) suggested maybe introducing our CDN into the mix as the dashboards are all loading the same data. This is definitely a route I'm going to look into.. nothing quite like brushing up CDN skills every now and then, but then again, Fastly is quite great and it'll probably be a walk in the park compared to L3.
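The reason the CDN idea appeals: if every dashboard asks for the same URL, only the first request in a given window needs to reach Apache. Here's a toy sketch of that idea as a tiny in-memory cache with a time-to-live. To be clear, this is not how Fastly works under the hood and not something we run; `TTLCache` and `fake_fetch` are made-up names and the 60 second TTL is an arbitrary assumption.

```python
import time


class TTLCache:
    """Tiny in-memory cache: entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh, the backend is not touched
        value = fetch(key)
        self._store[key] = (now, value)
        return value


calls = []


def fake_fetch(url):
    """Stand-in for a real request to Apache; records each backend hit."""
    calls.append(url)
    return "data for " + url


cache = TTLCache(ttl=60)
for _ in range(5):
    cache.get_or_fetch("/render?target=a.b&from=-7d&format=json", fake_fetch)
print(len(calls))  # 1
```

Five dashboard loads, one backend hit. The catch is that the URLs have to be identical for this to work, so dashboards with slightly different time ranges won't share cache entries.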
Was it worth it?
What a silly question.. of course it was! Look at how great these graphs look! (Normally they'd look the same but we're testing some stuff in one DC)...