Assurance is dead. Long live orchestrated assurance.
Legacy service assurance procedures are not living up to today’s requirements for customer experience and service quality. We therefore need to rethink service assurance, abandon practices with a strictly reactive focus on events and alarms, and instead start to think about what really matters. Simply renaming traps and syslog streams to “telemetry” will not help. Neither will dropping event and alarm information into a big data lake.
Let us start by analyzing the underlying problems.
Disconnect between Service Fulfillment and Service Assurance
First of all, the service fulfillment/delivery and service assurance processes are disconnected. The delivery team provisions a service without knowing if it really works. There is very little automatic hand-over to the operations team on how to monitor the service. In many cases, the assurance team has to start from scratch, perhaps not even realizing the service exists until a customer calls and complains. Furthermore, the service discovery tools and inventories that are meant to “help” them understand the service are incomplete.
Sub-optimal activation testing
Frequently, services are not tested at delivery, which, as mentioned above, is a considerable problem. In many cases no activation testing is carried out at all, and it is left to customers to discover whether the service works as expected. In other cases, a simple ping is run at service delivery, but that has very little to do with the customer experience. Furthermore, legacy testing techniques, when performed at all, often require manual and expensive field efforts, which are too slow and inefficient.
Very often neither customer care nor the operations team has real insight into how the service is actually working.
A poor understanding of the end-user experience
Today, service assurance practices focus on resources: servers, applications and network devices. Accordingly, assurance data consists of log files and counters relating to these resources. This, however, has little to do with how services are working. A fault on a device does not necessarily affect a customer and, conversely, many customer issues have nothing to do with a fault. Most under-performing services are due to a less than optimal configuration, and alarm or performance systems will not detect these problems.
The industry has begun to realize that service assurance is not living up to requirements. But rather than identifying the root cause and doing something about it, it seems all too often we are looking for a free lunch instead – the Big Data promise. You can’t just throw incomplete and low-level data into a Big Data repository, and expect to draw conclusions about service health.
Fear the mapping-machine
Unfortunately, Big Data alone does not bring us closer to the goal. Calculating service health from low-level resource data is far from trivial. Big Data frameworks do not provide the function that maps resource data to service health, and machine learning cannot learn it either, because training sets containing service-level data are lacking.
At Data Ductus, we work with technology partners to provide solutions which we believe bring us closer to a resolution. Our two product partners Cisco and Netrounds have defined a concept and implemented a design pattern called Orchestrated Assurance to address the underlying problems and move to service focused assurance, see: http://orchestratedassurance.net
The principles are the following:
- Measure service metrics directly. Do not try to infer them from resource data.
- Use automated tests in every service delivery.
- Tie the orchestrator and assurance systems together so that the orchestrator automatically performs the testing and enables monitoring.
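As a minimal sketch of how these three principles fit together, the Python below models a delivery step that provisions a service, runs an activation test against service-level thresholds, and only then enables monitoring. The names Service, Orchestrator, sla_violations and the three stub functions are hypothetical and chosen for illustration; they do not represent the Cisco or Netrounds APIs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Service:
    name: str
    sla: Dict[str, float]  # upper limit per service metric, e.g. {"loss_pct": 0.5}
    state: str = "requested"
    monitored: bool = False


def sla_violations(measured: Dict[str, float], sla: Dict[str, float]) -> List[str]:
    """Return the SLA metrics that are missing or above their limit."""
    return [m for m, limit in sla.items() if measured.get(m, float("inf")) > limit]


class Orchestrator:
    """Hypothetical orchestrator that owns both delivery and assurance hand-over."""

    def __init__(self, provision, activation_test, enable_monitoring):
        self.provision = provision
        self.activation_test = activation_test
        self.enable_monitoring = enable_monitoring

    def deliver(self, service: Service) -> List[str]:
        self.provision(service)
        service.state = "provisioned"
        # Measure service metrics directly instead of inferring them from alarms.
        failed = sla_violations(self.activation_test(service), service.sla)
        if failed:
            service.state = "activation_failed"
            return failed
        # Monitoring is switched on by the orchestrator, not by a later manual step.
        self.enable_monitoring(service)
        service.monitored = True
        service.state = "active"
        return []


# Stubs standing in for real provisioning, test-agent and monitoring calls.
def fake_provision(service: Service) -> None:
    pass


def fake_test(service: Service) -> Dict[str, float]:
    return {"loss_pct": 0.0, "latency_ms": 12.0}


def fake_monitoring(service: Service) -> None:
    pass


orchestrator = Orchestrator(fake_provision, fake_test, fake_monitoring)
vpn = Service("vpn-acme", sla={"loss_pct": 0.5, "latency_ms": 30.0})
print(orchestrator.deliver(vpn), vpn.state, vpn.monitored)
```

The point of the design is that a service can never reach the "active" state without a passed test and enabled monitoring, so the operations team does not have to discover it after the fact.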
We are always eager to learn from others and to share ideas. If you have comments or would like to assess potential joint initiatives, do not hesitate to get in touch.
Is Big Data the Big Elephant for service assurance?
As telecom providers onboard new services and customers, network and service assurance becomes more complex and more difficult to manage. One key reason for this is the poor quality of the network and service assurance data, which often suffers from:
- A lack of priorities
- A lack of service context and customer context
- Too much irrelevant data
The industry has made repeated attempts to address this issue, yet without any real breakthrough. Alarm correlation efforts often fail because the correlation rules are practically impossible to maintain. Similarly, initiatives that rely on inventory system lookups to build context fail because the inventory data is incorrect and incomplete.
Don’t pin all your hopes on Big Data
The industry has now turned to Big Data in the hope that it will help solve the assurance data problem. The general belief seems to be that we can throw even more low-quality data into a big data platform and magically get the answers we need. However, big data scientists concur that this simply isn’t possible. There is no silver bullet for the data issues mentioned above. To think big data is the easy answer reflects either a disproportionate belief in the technology or an overzealous product vendor over-selling their capabilities.
And why is this? Successful big data projects have two preconditions:
- Highly relevant, high-quality data must be available – quantity is not quality
- Clear definition of questions to answer – there’s no magic wand for all questions
Service-focused assurance is a way forward
At Data Ductus, we strive to take a more service-focused approach. In the solutions that we deliver with our partner Netrounds, for instance, we provide high-quality data at the service layer. Focusing on data quality at the source in this way enables you to answer specific questions such as:
- Does the service work at turn-up?
- What is the network loss, latency and jitter?
- What is the Mean Opinion Score (MOS) for Session Initiation Protocol (SIP) calls?
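To make the questions above concrete, here is a small Python sketch of how such service-layer figures could be derived from active test probes. It assumes a list of one-way delays in milliseconds, with None marking a lost probe. The function names are hypothetical, and jitter is computed as the mean absolute difference between consecutive received delays, which is a simplification rather than the smoothed estimator used for RTP. The last function applies the R-factor to MOS conversion from the ITU-T G.107 E-model; how the R-factor itself is obtained is outside this sketch.

```python
from statistics import mean
from typing import List, Optional

Delays = List[Optional[float]]  # one entry per probe, None means the probe was lost


def loss_pct(delays_ms: Delays) -> float:
    """Share of probes that never arrived, in percent."""
    if not delays_ms:
        raise ValueError("no probes sent")
    lost = sum(1 for d in delays_ms if d is None)
    return 100.0 * lost / len(delays_ms)


def latency_ms(delays_ms: Delays) -> float:
    """Mean delay of the probes that did arrive."""
    received = [d for d in delays_ms if d is not None]
    if not received:
        raise ValueError("all probes were lost")
    return mean(received)


def jitter_ms(delays_ms: Delays) -> float:
    """Mean absolute delay difference between consecutive received probes."""
    received = [d for d in delays_ms if d is not None]
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return mean(diffs) if diffs else 0.0


def mos_from_r(r: float) -> float:
    """Convert an E-model R-factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


# Illustrative sample, not real measurement data: five probes, one lost.
sample: Delays = [20.0, 22.0, None, 21.0, 25.0]
print(loss_pct(sample), latency_ms(sample), jitter_ms(sample))
```

Because the figures are measured on the service path itself, they answer the questions directly without any mapping from device counters.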
For more information about this approach, see our joint white paper on small data versus big data at: https://www.netrounds.com/service-assurance-need-big-data-small-data-white-paper/.
If you are interested in discussing these topics with us, get in touch.
The Catch 22 of Service Orchestration
Automating service deployment can cause a dilemma if not done properly. Deploying services manually has the benefit of unlimited flexibility. Smart engineers can, in principle, configure services to meet any customer requirement. However, that way of working does not scale. Service deployment projects take too long and introduce too many errors. Furthermore, the operational cost is too high since you need a larger staff to cope with the demand.
Accordingly, many service providers and enterprises automate service delivery. Ironically, this type of automation is often counterproductive. Service providers tell us this leads to a culture of “it works, don’t touch”. We refer to this as the Catch 22 of Service Orchestration.
In a typical example, a part of the catalog is automated with a “hard-coded” solution which required a long and costly software project. The solution then fulfills the goals of fast and error-free configuration, but it does not offer the flexibility to adapt the portfolio to new customer and market needs. It takes yet another long and costly software project to achieve this. Therefore, there is a risk the organization falls back to manual configurations.
How can we use Service Orchestration to avoid this situation?
There are several things to consider to remain flexible and still automate:
- Inhouse DevOps teams: Do not outsource everything to an external partner. You need the skills internally to implement changes.
- DevOps culture: Your product owners, and operations and development teams need to work together in an efficient manner. See a presentation on the topic here: https://www.slideshare.net/stefanvallin/devops-for-network-engineers
- An automation/orchestration platform that supports both design-time and run-time features. At design time, the design team must be able to design, implement and test new or changed services within days or weeks. Operations can then easily automate the deployment. A good litmus test when selecting a platform is how quickly you can implement a simple service yourself. Is the service logic hard-coded? Aim for model-driven platforms that render themselves from data models such as YANG or TOSCA, and evaluate them seriously with your own development teams.
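The difference between a hard-coded and a model-driven approach can be illustrated with a short Python sketch. It assumes a plain dictionary as a stand-in for a YANG or TOSCA service model, and the configuration syntax in the template is invented for illustration rather than taken from any real device. The generic validate and render functions (hypothetical names) know nothing about a specific service, so introducing or changing a service means editing data, not code.

```python
from string import Template
from typing import Any, Dict

# Hypothetical service model: a plain-data stand-in for a YANG or TOSCA model.
L3VPN_MODEL: Dict[str, Any] = {
    "name": "l3vpn",
    "parameters": {
        "customer": {"type": str},
        "vlan": {"type": int, "min": 1, "max": 4094},
        "bandwidth_mbps": {"type": int, "min": 1, "max": 10000},
    },
    # Invented configuration syntax, for illustration only.
    "template": (
        "service l3vpn\n"
        " customer $customer\n"
        " vlan $vlan\n"
        " rate-mbps $bandwidth_mbps\n"
    ),
}


def validate(model: Dict[str, Any], params: Dict[str, Any]) -> None:
    """Check a service request against the parameter rules in the model."""
    spec = model["parameters"]
    if set(params) != set(spec):
        raise ValueError(f"expected parameters {sorted(spec)}, got {sorted(params)}")
    for key, rule in spec.items():
        value = params[key]
        if not isinstance(value, rule["type"]):
            raise ValueError(f"{key} must be of type {rule['type'].__name__}")
        if "min" in rule and value < rule["min"]:
            raise ValueError(f"{key} is below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            raise ValueError(f"{key} is above {rule['max']}")


def render(model: Dict[str, Any], params: Dict[str, Any]) -> str:
    """Generic run-time step: validate the request, then fill in the template."""
    validate(model, params)
    return Template(model["template"]).substitute(params)


request = {"customer": "acme", "vlan": 100, "bandwidth_mbps": 50}
print(render(L3VPN_MODEL, request))
```

With this split, the design team owns the model and the run-time engine stays unchanged, which is the property to look for when evaluating a platform.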
A fast turn-around in the design phase delivers a quick turn-around of new services to the market. A fast and error-free run-time engine gives fast service delivery to customers.
It is also important to have a partner that guides the organization towards this way of working. You should focus on training and on technical expertise to help implement the first services, and work towards a model where you maintain and further develop the services internally.
At Data Ductus, we have helped clients around the globe to establish efficient in-house DevOps procedures. We’d be happy to share our experiences with you.