Challenge 

WWT's Reliability Engineering team, within the internal IT organization, is responsible for supporting various infrastructure and development teams—also within internal IT—that align with business units and organizations across WWT.  

One of these infrastructure teams builds and maintains the environments that enable WWT's internal application developers to build, host and manage the apps WWT employees need to do their jobs every day. These apps support a variety of job functions, including managing complex quoting and ordering needs and tracking time spent on projects. The infrastructure team needed a solution for stateful workloads that could quickly and simply scale without requiring more IT resources to maintain the solution.  

The Reliability Engineering team first launched a bare-metal Kubernetes cluster in 2018 to support stateful workloads—which are more complicated to run and scale than stateless applications. To accelerate and simplify lifecycle management, the IT team implemented a shared, multi-tenant Kubernetes cluster, meaning all users were operating from the same large cluster. 

This initial rollout was so successful that users outgrew the limits of the implementation within about a year. The users became very comfortable with the Kubernetes environment and asked to become power users. They wanted to run custom operators, enable feature gates, tune systems for their own performance needs and delegate access to their own users. All these things require cluster-level privileges, which meant they needed their own clusters.  

As users requested additional levels of access and more capabilities, lifecycle management of the single, large cluster became increasingly challenging because the open-source tools kept breaking compatibility with previous versions. The solution was to add more clusters, but they needed to be easier to own, maintain and upgrade than the existing single cluster.

Solution 

The Reliability Engineering team implemented a multi-cluster Kubernetes solution, with clusters built and managed by Rancher.  

  • Kubernetes runs on-premises in VMware on Cisco UCS hardware
  • Storage leverages WWT's existing NetApp ONTAP storage arrays via NetApp Trident
  • Logs are shipped to Splunk
  • Networking uses an advanced Calico configuration with BGP peering to WWT's core Cisco switches
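
To make the networking item more concrete, here is a minimal sketch of how a Calico BGP peering with a core switch might be described. It builds a Calico BGPPeer resource as a plain Python dictionary and prints it as JSON, a format calicoctl accepts alongside YAML. The resource name, peer address and AS number are hypothetical placeholders chosen for illustration, not values from WWT's environment.

```python
import json


def bgp_peer_manifest(name: str, peer_ip: str, as_number: int) -> dict:
    """Build a Calico BGPPeer resource as a plain dictionary."""
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "BGPPeer",
        "metadata": {"name": name},
        # peerIP is the switch address; asNumber is the switch's AS number.
        "spec": {"peerIP": peer_ip, "asNumber": as_number},
    }


# Hypothetical values: a documentation-range IP and a private AS number.
peer = bgp_peer_manifest("core-switch-a", "192.0.2.1", 64512)
print(json.dumps(peer, indent=2))
```

A global peer like this one applies to every node in the cluster; a real deployment would likely define one such resource per core switch.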

This solution was chosen based on cost effectiveness, minimal engineering time requirements and support availability.  

The deployment was split between data centers that are geographically near each other and connected by high-speed networking. This allows IT to run clusters either independently or spanned across data centers, with each data center designated as a Kubernetes zone. With Rancher's management plane, enabling both options is straightforward, allowing internal users to choose the appropriate topology for their use case.
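
As an illustration of what a spanned cluster makes possible, the sketch below builds a Kubernetes topology spread constraint that asks the scheduler to balance a workload's pods across zones. It relies on the well-known node label topology.kubernetes.io/zone; the workload label "quoting-api" and the helper function name are hypothetical, and nothing here is taken from WWT's actual configuration.

```python
import json

# Well-known Kubernetes node label that identifies a node's zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def zone_spread_constraint(app_label: str, max_skew: int = 1) -> dict:
    """Build a pod topologySpreadConstraints entry that balances pods across zones."""
    return {
        "maxSkew": max_skew,  # largest allowed difference in pod count between zones
        "topologyKey": ZONE_LABEL,
        "whenUnsatisfiable": "DoNotSchedule",  # refuse placement that breaks the balance
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


# Hypothetical workload label, used for illustration only.
constraint = zone_spread_constraint("quoting-api")
print(json.dumps({"topologySpreadConstraints": [constraint]}, indent=2))
```

The printed fragment would sit under a pod template's spec. A team running an independent, single-data-center cluster would simply leave the constraint out.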

Results 

The Reliability Engineering team owns Rancher, including setup, maintenance and lifecycle management. Rancher allows the team to provide several different engagement models for other infrastructure teams with different levels of maturity.  

Those at the beginning of their journey with containerization, or with minimal needs, generally just want the simplest way possible to run a container or two. For them, Reliability Engineering supports a common cluster, so that teams can get the benefit of containerization for simple use cases without learning about operating their own Kubernetes cluster. Rancher's project model makes it simple to offer this as a service, while Rancher's tooling minimizes the lifecycle management overhead for the Reliability Engineering team. 

Other teams have outgrown the common cluster and are ready to explore more Kubernetes capabilities or cluster-level add-ons, or they are looking to delegate access to their own internal customers. For them, Rancher allows the Reliability Engineering team to easily spin up multiple independent clusters and hand off first-level support. Teams can own the operation of their own clusters without getting bogged down in minutiae, like Kubernetes upgrades and certificate rotation. 

In addition to enabling increased agility and flexibility, Rancher offers a straightforward, cost-effective way to manage multiple Kubernetes clusters that also reduces total cost of ownership by saving engineering time.

Technologies