Let’s take a break from developing Virtual Queues in Scala and talk about something else. A while back a friend of mine asked me how an organization should go about handling its IT operations if it decides to move to the cloud. This got me thinking. Developing a cloud operations plan is not an easy task. The challenge is that no two organizations have exactly the same operations, so one cannot just follow a standard plan. What this means is that, to arrive at a good cloud operations plan for a particular organization, one needs a prescriptive process that is generic enough to be followed by any organization yet still helps formulate that particular organization’s cloud operations plan.
I looked for such a process but could not find one. So over a long weekend I took it upon myself to come up with a simple yet comprehensive process that any organization or client could follow and that, in the end, produces a detailed yet tailored cloud operations plan, taking into consideration both the unique IT operations of that organization and the capabilities of the cloud provider it has selected.
In this blog post I am sharing my process to develop IT operations for the cloud environment.
Introduction
With more and more organizations embracing the cloud as a viable option for their compute, storage and networking needs, it becomes important to address how this new setup will impact the important task of IT operations. Traditionally, IT organizations had complete control over all of their IT assets, from facilities and bare metal hardware all the way to the application software supporting business processes. The cloud shifts this paradigm: an IT organization is no longer in complete control of its environment. Whatever deployment model is chosen other than a Private Cloud, be it a Community Cloud, a Hybrid Cloud or a Public Cloud, an organization has to rely on the cloud provider’s offerings to design its IT operations strategy.
A fundamental challenge faced by an organization is the lack of a well defined process to develop such a strategy and plan for its IT operations on the cloud. What makes the task more daunting is the fact that no two organizations have the same IT operations, processes and tasks. So there is no generic strategy that could be blindly adopted.
These challenges make it very important to have a consistently repeatable process that could be followed by any organization that wants to develop a strategy for its IT operations on the cloud. This paper explains one such process. The process is laid out as a series of steps that need to be executed in sequence to arrive at a viable IT operations strategy.
The rest of this paper explains these steps in detail.
STEP 1. Assess current IT operations
The first step towards developing an IT operations strategy for the cloud is for an organization to assess the current state of its IT operations. Start by identifying all processes performed as part of the current IT operations. A process is simply an activity consisting of related tasks that have a collective aim. In that sense a process can be seen as a collection of tasks. An inventory of all the tasks that form each process should be documented. This is a very important document and forms the baseline of what needs to be considered when moving the operations to the cloud.
This step can be performed by interviewing the current staff members; it might also help to study what is done in the industry. A book such as Building Operational Excellence, authored by Bruce Allen & Dale Kutnick, lists hundreds of processes and their best practices and will serve as a good starting point. When interviewing the staff, it will help to share the processes from this book with them and assist them in identifying the industry process that most closely matches each in-house process. Listing the processes using industry standard terminology will make the discussions with a cloud provider easier. It is imperative that the list of processes moving to the cloud be as comprehensive as possible.
For example, this table lists a set of processes that a typical organization might have as part of its IT operations. It is by no means comprehensive, but it serves as an example for a typical organization.
• Application optimization
• Asset management
• Business relationship management
• Capacity management
• Change management
• Configuration management
• Contract management
• Database administration (Physical)
• Disk storage management
• Facilities management
• Hardware support
• Infrastructure planning
• Inventory management
• Job scheduling
• Middleware management
• Negotiation management
• Network monitoring
• Output management
• Performance management
• Physical database management
• Problem management
• Production acceptance
• Production control
• Quality assurance
• Security management
• Service-level management
• Service-level agreement management
• Service request management
• Software distribution
• Software management
• Systems monitoring
• Tape management
• Workload monitoring
Each of these processes is made up of several tasks that are carried out by the individuals supporting the process. For example, Disk storage management is a process that consists of the following tasks:
1. File placement,
2. Archiving,
3. Backup and recovery,
4. Numerous other tasks that often are performed independently of each other.
Another example is Database Administration, which might consist of the following tasks:
1. Evaluate the Database Server Hardware
2. Install the Database Software
3. Plan the Database
4. Create and Open the Database
5. Back Up the Database
6. Enroll System Users
7. Implement the Database Design
8. Back Up the Fully Functional Database
9. Tune Database Performance
10. Download and Install Patches
For each task in the process, interview the staff member performing the task and collect all relevant information, such as:
· How automated is the task currently?
· What are the job skills required by the person doing it?
· Is the task/process monitored using metrics?
· If so, which ones?
· How mature is the process?
Once the task list is sorted and grouped by process, it should be formally documented. This baseline document will be very important in the subsequent work.
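To make the baseline document concrete, here is a minimal sketch of how the inventory could be captured as data. The class names, field names and sample values are my own illustrative assumptions, not part of any standard; the fields simply mirror the interview questions above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    """One operations task plus the answers collected in the staff interview."""
    name: str
    automation_level: int = 1   # assumed scale: 1 (manual) to 10 (fully automated)
    required_skills: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)  # empty means not monitored
    maturity: Optional[str] = None  # left empty until maturity is ranked

    @property
    def is_monitored(self) -> bool:
        return bool(self.metrics)


@dataclass
class Process:
    """A process is a collection of related tasks with a collective aim."""
    name: str
    tasks: list[Task] = field(default_factory=list)


def build_inventory(rows: list[tuple[str, Task]]) -> dict[str, Process]:
    """Sort (process name, task) pairs and group the tasks by process."""
    inventory: dict[str, Process] = {}
    for process_name, task in sorted(rows, key=lambda r: (r[0], r[1].name)):
        inventory.setdefault(process_name, Process(process_name)).tasks.append(task)
    return inventory


# Hypothetical interview results, for illustration only.
rows = [
    ("Disk storage management", Task("Backup and recovery", 6, ["Storage admin"], ["Restore time"])),
    ("Disk storage management", Task("Archiving", 3, ["Storage admin"])),
    ("Database administration", Task("Tune Database Performance", 2, ["DBA"], ["Query latency"])),
]
inventory = build_inventory(rows)
for process in inventory.values():
    print(process.name, [t.name for t in process.tasks])
```

A spreadsheet would do the job just as well; the point is only that every task carries the same set of attributes so that later steps can filter and compare them.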
STEP 2. Identify which tasks are still relevant in the cloud environment
The assessment in the previous step needs to take into consideration the cloud provisioning model that is being used by the organization. For example, under a SaaS provisioning model the organization only needs to concern itself with application-level operations, e.g. performance monitoring. Under an IaaS provisioning model, on the other hand, the assessment will be much more elaborate. Even in the case of IaaS, some tasks are irrelevant in the cloud. For example, facilities management will not be of concern in the cloud.
As part of this step, filter out such tasks and arrive at a list of tasks and processes that are still relevant.
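The filtering itself can be sketched as a simple lookup. In the snippet below the mapping of processes to provisioning models is a hypothetical example that each organization would have to work out for itself; only the facilities management and SaaS performance monitoring entries come from the discussion above.

```python
ALL_MODELS = {"IaaS", "PaaS", "SaaS"}

# Assumed mapping: under which provisioning models a process stays relevant.
RELEVANT_UNDER = {
    "Facilities management": set(),              # handled by the provider
    "Hardware support": set(),                   # assumption
    "Disk storage management": {"IaaS"},         # assumption
    "Database administration": {"IaaS"},         # assumption
    "Software distribution": {"IaaS", "PaaS"},   # assumption
    "Performance management": {"IaaS", "PaaS", "SaaS"},
}


def filter_relevant(processes, model):
    """Keep the processes that still matter under the given model.

    Processes missing from the mapping are kept, so that nothing is
    dropped without someone having looked at it.
    """
    if model not in ALL_MODELS:
        raise ValueError(f"unknown provisioning model: {model}")
    return [p for p in processes if model in RELEVANT_UNDER.get(p, ALL_MODELS)]


current = list(RELEVANT_UNDER) + ["Job scheduling"]
for model in ("IaaS", "SaaS"):
    print(model, filter_relevant(current, model))
```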
STEP 3. Rank the maturity of the current process
Process maturity is an assessment of how well the process is performed at the firm, specifically how disciplined it is and how carefully its results are monitored and improved. Normally, process maturity is graded on a five-level scale: ad hoc and random, repeatable, defined, managed, and optimized.
Ad hoc and random processes and tasks are performed irregularly, based on the needs of the system. There are no documented steps for performing such processes. The need for such processes might arise from unexpected events, e.g. a power failure at the data center.
Repeatable processes and tasks usually have a run-book, a manual or handbook of sorts, which clearly states all the steps that are needed to perform those tasks. Any operations person with basic skills should be able to perform such tasks by following the run-book. However, such tasks are still random in nature and not scheduled or planned for.
Defined processes and tasks are known well in advance. They usually have a set, repeating schedule on which they need to be performed, and they have all the properties of a repeatable task.
Managed tasks are controlled and monitored tasks that require either some kind of downtime or approval from management before they can be carried out. For example, most maintenance tasks, such as increasing disk space, might fall in this category. Another example is deploying a new version of an application.
Optimized tasks are the ones that have been well studied and have been improved over a period of time. Such tasks are usually the oldest tasks that an organization has been performing and/or are tasks that are well defined in the industry. For example, database administration is a task that is optimized in most IT organizations. These are the tasks that can also be replicated as-is in the cloud environment.
In order to determine if a process needs to be improved as part of the move to the cloud, classifying existing operations based on their maturity is helpful.
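As a rough sketch, the five levels can be modelled as an ordered scale so that processes can be sorted by maturity. The yes/no questions used to classify a task are my own simplification of the descriptions above; a real assessment will need more judgment than four flags.

```python
from enum import IntEnum


class Maturity(IntEnum):
    AD_HOC = 1
    REPEATABLE = 2
    DEFINED = 3
    MANAGED = 4
    OPTIMIZED = 5


def classify(has_runbook: bool, is_scheduled: bool,
             is_controlled: bool, is_improved_over_time: bool) -> Maturity:
    """Simplified ladder: each level assumes the properties of the one below."""
    if not has_runbook:
        return Maturity.AD_HOC
    if not is_scheduled:
        return Maturity.REPEATABLE
    if not is_controlled:
        return Maturity.DEFINED
    if not is_improved_over_time:
        return Maturity.MANAGED
    return Maturity.OPTIMIZED


# Hypothetical observations: (runbook, scheduled, controlled, improved over time)
observations = {
    "Data center power failure recovery": (False, False, False, False),
    "Archiving": (True, True, False, False),
    "Deploying a new application version": (True, True, True, False),
    "Database administration": (True, True, True, True),
}

# Least mature first, since those are the likeliest improvement candidates.
ranked = sorted((classify(*flags), name) for name, flags in observations.items())
for level, name in ranked:
    print(f"{level.name:<11} {name}")
```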
STEP 4. For each process, identify whether any improvements are required
A move to the cloud is an opportunity to improve any process or task that needs improving. To identify whether a process needs to be improved, one needs to gather information on the industry best practices associated with it. Such best practices are documented in books like Building Operational Excellence, authored by Bruce Allen & Dale Kutnick. To determine what needs to be improved and how, compare each process as performed at the organization with its corresponding best practices and determine the gaps that need filling.
Usually the best practices for a process will capture the following metrics:
1. Automation Balance – On a scale of 1 to 10, how automated is the task? Processes that are highly automated tend to be mature and should be modified with care. They generally are automated because they work well for the IT organization. Modifications to these processes often take the form of slight enhancements, installation of appropriate metrics, and the like. By comparison, manual processes should be examined with a view to finding means of automation.
2. Stability Balance – On a scale of 1 to 10, how stable is the process? A stable process is one that is mature, optimized and highly efficient. Such a process might not need any improvements.
3. Staffing skills – This captures the list of required or desired skills needed by the staff members supporting the process or task. These skills might have to be reconsidered as part of the move to the cloud.
4. Best Practices – This captures all the best practices that are part of the process.
5. Process integration – This lists the integration points to other processes. Knowing the integration points can be significant when grouping processes. Integration points are also important in identifying which processes are affected by changes in other processes.
At this point one should not make the improvements but just identify what needs to be improved. Such improvements will have to take cloud capabilities into consideration and subsequent steps discuss that in detail.
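The comparison against best practices can be sketched as a small scoring pass that only records what needs attention, without changing anything yet. The thresholds (below 5 for automation, below 7 for stability) and the sample scores are arbitrary assumptions for illustration; pick values that make sense for your organization.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessAssessment:
    name: str
    automation: int   # 1 (manual) to 10 (fully automated)
    stability: int    # 1 (unstable) to 10 (mature and efficient)
    missing_practices: list = field(default_factory=list)


def improvement_notes(a, automation_floor=5, stability_floor=7):
    """Return notes on what to improve; an empty list means leave as is."""
    notes = []
    if a.automation < automation_floor:
        notes.append("look for ways to automate")
    if a.stability < stability_floor:
        notes.append("process is not yet stable")
    notes.extend(f"gap against best practice: {p}" for p in a.missing_practices)
    return notes


# Hypothetical scores gathered from staff interviews.
assessments = [
    ProcessAssessment("Database administration", automation=8, stability=9),
    ProcessAssessment("Disk storage management", automation=3, stability=6,
                      missing_practices=["Validate and test backup and recovery concepts"]),
]
for a in assessments:
    print(a.name, "->", improvement_notes(a) or "no improvements identified")
```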
STEP 5. Create a list of minimal capabilities desired from a cloud provider
If this process is followed, at this point the following information is known:
1. A list of all the processes under the aegis of IT operations, and all the tasks that make up those processes
2. Maturity of each of these tasks
3. Any improvements that are needed for these tasks
From this information it is possible to determine a list of minimal capabilities that a cloud provider needs to support. For example, consider the best practices for the disk storage management process. For each best practice, Table 2 lists the capabilities a cloud provider needs to support.
Disk Storage Management – Best Practices | Cloud Provider Capability to support the practice
Validate and test backup and recovery concepts | Point-in-time snapshots of volumes and replication across data centers. E.g. Amazon AWS provides the ability to take a point-in-time snapshot of an Elastic Block Store (EBS) volume that is automatically replicated across its S3 storage. From this snapshot a new EBS volume can be created which will be a mirror image of the original.
Make overall storage management approach completely cross-platform | Attach and detach volumes to any virtual machine. Amazon AWS provides such a capability.
Prioritize all applications based on business driven recovery requirements | N/A
List what should be recovered and how long it should take | N/A
Implement an ongoing performance management/optimization process | Statistics such as I/O stats to track the performance of the storage volumes
As another example, consider some of the best practices around Asset Management:
Asset Management – Best Practices | Cloud Provider Capability to support the practice
Review new assets | Ability to list and present all of the virtual machines, storage appliances and networking appliances (such as load balancers) being used
Review of asset utilization | Ability to monitor key statistics such as I/O stats, CPU stats and memory stats
Identification of missing assets | See the first line item
Review risks and license compliance | N/A, as most of the hardware assets are provided by the cloud provider
Review asset location | N/A. This is a benefit of moving to the cloud, as the facilities are provided by the cloud provider
Review duplicate assets | Ability to see what is running on each of the appliances used from a cloud provider
Update contracts and licenses | This step is the same whether hosting in an in-house data center or on the cloud.
Yet another example is that of Performance Management. Performance Management can be further divided into performance management of the operating system, performance management of applications (e.g. Java) and platforms such as WebLogic, and performance management of databases. Table 3 below shows the best practices for O/S performance management (Unix is used as the example) and what a cloud provider needs to provide:
O/S Performance Management – Best Practice | Cloud Provider Capability to support the practice
Monitor user mode and system mode CPU utilization | Tools such as rstat. E.g. Amazon AWS provides the CloudWatch service to monitor and manage such controls.
Average load on CPU, interrupt rate, context switch rate |
Process Monitoring – CPU utilization per process, amount of memory consumed, number of threads, user sessions | Tools such as ps, top, proc tools (/usr/proc/pstack, /usr/proc/pfiles), truss etc.
Memory Monitoring – percent used, MB free, paging rate etc. | Tools such as vmstat
I/O Monitoring | Tools such as iostat
Load balancing | Ability to replicate a machine from a spec (e.g. an AMI in AWS) or to create a storage volume from a snapshot. E.g. Amazon AWS extends CloudWatch to automated application scaling by letting customers specify rules.
It is best to create a catalog of all the best practices required for all IT operations processes and of the capabilities available from the cloud provider in support of these processes.
STEP 6. Create an inventory of capabilities actually supported by the cloud provider
Once the exercise of step 5 is completed, an organization moving to the cloud should have a comprehensive list of capabilities that the cloud provider needs to support. However, a cloud provider will not support everything that is on this list. The next step in this process identifies the capabilities that are actually supported by the cloud provider. One might have to get creative here to see if a capability provided by a cloud provider can be made to work for a particular best practice even if it is not the most desired or best fitting for the job. For example, AWS provides a control panel that lists all the EC2 instances that a client has started. Even though this is not the best way to keep an inventory of assets, it can serve the purpose.
STEP 7. Perform a gap analysis of what the cloud provider provides and what is required for each process
Once the two lists from steps 5 and 6 are available, it becomes quite easy to see the gaps between what the organization desires in order to support all of its operations in the cloud and what is realistic given the capabilities of the cloud provider. This gap analysis might result in eliminating some of the processes or combining multiple processes into one. The gap analysis will also lead to some work where custom tools might need to be developed to support a particular process.
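In its simplest form, the gap analysis is a comparison between the two lists. The sketch below is a minimal illustration: the capability names are made-up labels, and the workarounds table stands in for the creative substitutions discussed in step 6.

```python
# Desired list from step 5: best practice -> capability it needs (labels are illustrative).
required = {
    "Validate and test backup and recovery concepts": "volume snapshots",
    "Make storage management cross-platform": "attach/detach volumes",
    "Ongoing storage performance management": "storage I/O statistics",
    "Review new assets": "asset inventory listing",
}

# Actual list from step 6: what the provider offers (hypothetical provider).
supported = {"volume snapshots", "attach/detach volumes", "instance list in control panel"}

# Capabilities that can stand in for a missing one, even if not a perfect fit.
workarounds = {"asset inventory listing": "instance list in control panel"}


def gap_analysis(required, supported, workarounds):
    """Sort each best practice into supported, workaround or gap."""
    result = {"supported": [], "workaround": [], "gap": []}
    for practice, capability in required.items():
        if capability in supported:
            result["supported"].append(practice)
        elif workarounds.get(capability) in supported:
            result["workaround"].append(practice)
        else:
            result["gap"].append(practice)  # candidate for a custom tool
    return result


for bucket, practices in gap_analysis(required, supported, workarounds).items():
    print(bucket, practices)
```

Everything that lands in the gap bucket either needs a custom tool or forces a decision about eliminating or combining processes.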
STEP 8. Create a plan to bridge the gap
The next step after the gap analysis is to create a plan to bridge the gaps between what the cloud provider supports and what is required to support all the processes. This step might generate some work for the operations and maintenance team, who might be forced to create custom tools to support critical processes for which there appears to be no support from the cloud provider. For example, a portal presenting a dashboard of performance metrics might have to be designed if the cloud provider lacks one.
This step might also lead to some innovation where an organization might either eliminate some processes or combine some of the processes to accommodate the gaps in cloud provider’s capabilities.
STEP 9. Develop a set of policies and an SLA for the cloud provider
The previous steps will have produced a list of the IT processes needed on the cloud, how best to utilize the cloud provider’s built-in capabilities to support these processes, and any custom work needed for the processes that are not supported directly by the cloud provider. Step 4 generates a catalog that has all the relevant information about a process: the tasks that make up that process, the best practices around it, the key skills that the staff will need to support it, the frequency at which it needs to execute, and its maturity.
Based on these inputs, a set of policies can be created. Staffing needs can also be evaluated, and a gap analysis can be performed between the current staff’s skill levels and what is needed for the cloud environment. A plan can then be created to train the existing staff or hire new members.
The information collected so far can also be used in developing a clear SLA for the cloud provider stating specific parameters and minimum levels for each element of the service provided. The SLAs must be enforceable and state specific remedies that apply when they are not met. Aspects of cloud computing services where SLAs may be pertinent include:
· Uptime
· Performance and response time
· Error correction time
· Infrastructure/security
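To see what a specific uptime parameter means in practice, it helps to convert the percentage into the downtime it allows. For example, 99.9% uptime over a 30-day period leaves about 43.2 minutes of permitted downtime. The helper below is a small sketch of that arithmetic; the function names and the 30-day measurement period are my own assumptions, and a real SLA will define its own measurement window.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over the period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


def sla_met(observed_downtime_minutes: float, uptime_percent: float, days: int = 30) -> bool:
    """True if the observed downtime stays within the uptime target."""
    return observed_downtime_minutes <= allowed_downtime_minutes(uptime_percent, days)


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime in 30 days")
```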
How To Use This Process
Developing an IT operations plan for the cloud is not a simple task. The exact nature of IT operations will vary from organization to organization and will depend on which cloud provider is used. The plan will also depend on the kind of provisioning model used in the cloud (IaaS, PaaS or SaaS). A SaaS offering might already come with a fully baked SLA and with all the operational tasks defined by the cloud provider, and the organization might not be able to modify these beyond minor tweaks. An IaaS model, on the other hand, will require a lot of work to be done by the organization.
This section provides some help on how to use this process to develop a comprehensive IT operations plan.
As mentioned earlier, steps 1 and 2 will require input from the current operations staff. Please refer to the step 1 text to see how this can be accomplished.
For step 3, besides the current operations staff, input from the internal customers of the operations team and from management might be needed as well. For step 4, senior IT personnel and development staff might have to be consulted. Steps 5, 6, 7 and 8 are critical steps and will require discussions with the cloud provider. These steps need to be carefully documented. Someone who is knowledgeable about the cloud environment, along with senior staff from the operations team, will have to be involved in carrying out these steps. Finally, for step 9, some help from the legal department might be needed.
Conclusion
This paper presents a process for coming up with a plan and strategy for IT operations for an organization that is moving its operations to a cloud provider. The process consists of nine prescriptive steps and starts by identifying the organization’s current IT operations. All the processes, and the tasks in each process, are clearly documented. The next step whittles the list down to only those operations that are relevant in the cloud environment. The step after that investigates and documents the maturity of each process. Based on its outcome, the following step determines whether any improvements to the existing processes are needed. A list of desired features from a cloud provider is then compiled and compared against the actual capabilities supported by the cloud provider. Depending on the importance of the task, this gap analysis will lead to the creation of some custom tools. Finally, a policy and an SLA can be created for the cloud provider after reaching a mutual agreement with the cloud provider.