DevOps are not enough

Introduction

I have read somewhere that the major criterion of DevOps work is running code. If DevOps is engaged in creating products for consumers, merely running code is not a sufficient criterion. There are at least two further requirements: the code must deliver the value it was written for, and it must provide certain additional values that allow the code to be used in the product.

The three parts of this article discuss these additional values, represented by Security as Code, Risk as Code and Quality as Code. The word “Code” in these names signals that all of these values should be part of the DevOps process, not a concern of other teams. Written, and even running, code without security, risk handling and quality cannot be categorised as “done”.

Part 1 – Security as Code

Talking about security is a delicate matter for two reasons: a) everyone has an opinion about it, and b) everyone hates it because it restricts and polices development and does not provide quick profit. By the time security is breached, it is usually too late to recognise its value.

DevSecOps has been trying hard over the last few years to infiltrate the DevOps process with security policies and testing. This effort has resulted in Security as Code but still requires more work to integrate with DevOps. It remains unwelcome because:

  1. Security adds work to development and testing, requiring additional time and tools, and it is capable of breaking the Dev-Ops link and forcing re-development.
  2. DevOps teams are still neither responsible nor accountable for security gaps in the created and deployed code, yet security specialists are, nevertheless, unwelcome in those teams.

The Security as Code (SoC) methodology has been defined by Jim Bird as follows: “Security as Code is about building security into DevOps tools and practices, making it an essential part of the toolchains and workflows. You do this by mapping out how changes to code and infrastructure are made and finding places to add security checks and tests and gates without introducing unnecessary costs or delays”. The intention is good; the recommendation, however, is almost ambiguous. There is the magic word “mapping” with a rather unclear meaning – mapping to what, and how does this relate to security? There is also the diplomatic reservation of “finding places to add security checks and tests and gates without introducing unnecessary costs or delays”. Such finding depends on who is looking, i.e. it is a subjective matter. Also, who judges whether the checks, tests and security gates cause necessary or unnecessary cost? And it is unclear what is meant by “delays”, because no timeframe is fixed and insecure code is not acceptable for deployment.

In summary, the quoted meaning of SoC is doomed by its own definition.

Thus, the only certain way security/SoC can be squeezed into the DevOps development process is by de-coupling Dev and Ops and inserting security means between them: Dev – Sec – Ops, i.e. merging the DevOps and DevSecOps teams. This model must be enforced by Development and Architecture Governance and explicitly controlled. It should also be embedded into the definition of “done” for Dev/Ops.
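Such a security stage between Dev and Ops can be thought of as a gate: code leaving Dev must pass every registered security check before it is handed to Ops, otherwise it is “not done”. The sketch below illustrates the idea; the concrete check (scanning for hardcoded credentials) is a hypothetical placeholder, not a real tool.

```python
# Minimal sketch of a Dev - Sec - Ops gate: code leaving Dev must pass
# every registered security check before it is handed to Ops.
# The check below is a naive, hypothetical placeholder.

from typing import Callable, List, Tuple

SecurityCheck = Callable[[str], Tuple[bool, str]]  # returns (passed, message)

def no_hardcoded_secrets(code: str) -> Tuple[bool, str]:
    # Illustration only: flag obvious credential assignments.
    suspicious = ("password=", "api_key=", "secret=")
    found = [s for s in suspicious if s in code.replace(" ", "").lower()]
    return (not found, f"hardcoded secrets: {found}" if found else "ok")

def security_gate(code: str, checks: List[SecurityCheck]) -> bool:
    """Return True only if every security check passes; insecure code is 'not done'."""
    return all(passed for passed, _ in (check(code) for check in checks))

blocked = security_gate("api_key = 'abc123'", [no_hardcoded_secrets])  # gate blocks
allowed = security_gate("total = a + b", [no_hardcoded_secrets])       # gate passes
```

In a real pipeline the checks would be supplied by security tooling, but the gate itself is what makes security an explicit, enforced step between Dev and Ops rather than an afterthought.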

As a result, DevOps development planning, the SDLC, practices and methodologies should be reviewed to incorporate security as a first-class citizen.

Part 2 – Risk as Code

Risk is an even more subjective matter than security. NIST (NISTIR 8011 Vol. 1) defines information system-related security risk as “a measure of the extent to which an entity is threatened by a potential circumstance or event, and typically a function of: (i) the adverse impacts that would arise if the circumstance or event occurs; and (ii) the likelihood of occurrence”. Companies identify many different types of risk, e.g. operational, financial, market, natural disaster, etc., but all of them relate to potential circumstances or events scaled against the company’s contextual perception.

Risk does not exist on its own; it is always linked to the “company’s risk appetite”, expressed as a policy. In other words, one company considers a possible event a significant potential damage, while another company is happy to accept the same risk. Automating the identification and handling of risk faces the challenge of “risk-handling feasibility”. For example, in business operational processes or workflows, points of risk may be identified by people or by applying AI/ML analysis; the feasibility questions are how costly and time-consuming creating the AI/ML solution might be (versus employing a trained person) and whether a particular AI/ML model will be effective for other processes or workflows.
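Following the NIST definition quoted above, a risk measure is a function of adverse impact and likelihood, and the same measured risk is judged differently against different risk appetites. A minimal sketch, in which the 1–5 scales and the appetite thresholds are purely illustrative assumptions:

```python
# Illustrative sketch: risk as a function of impact and likelihood (per the
# NIST definition), judged against a company-specific risk appetite.
# The 1-5 scales and the appetite values are assumptions for illustration.

def risk_score(impact: int, likelihood: int) -> int:
    """Simple risk measure: adverse impact (1-5) times likelihood (1-5)."""
    return impact * likelihood

def within_appetite(impact: int, likelihood: int, appetite: int) -> bool:
    """One company's appetite may accept a risk another company would not."""
    return risk_score(impact, likelihood) <= appetite

# The same potential event, two different risk appetites:
event = (4, 3)  # high impact, moderate likelihood -> score 12
tolerant_accepts = within_appetite(*event, appetite=15)  # accepted
averse_accepts = within_appetite(*event, appetite=10)    # rejected
```

The point of the sketch is that the score is objective arithmetic, while the acceptance decision is entirely a matter of the policy-defined threshold.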

In technology, risks are always associated with the interactions between software components, applications, services, fragments of code, etc.; however, policies at the code level are not easily visible, or are not even written down. This means that DevOps needs to deal with both Policy as Code, which can include Compliance as Code, and Risk as Code. This article focuses on Risk as Code only.

The two major risk-handling strategies in code are:

1)   Risk Remediation

2)   Risk Processing.

Risk Remediation (not to be mistaken for risk mitigation) is about selling the risk for a reasonable premium to a third party, which may later compensate for the consequences of the risk’s materialisation. P&C insurance is based on Risk Remediation. For IT development, Risk Remediation may be effectively applicable to software products supporting the mass-media industry, with its insignificant losses and eventual consistency of information. In contrast, in the financial, telco, chemical, automotive and similar industries, the Risk Remediation strategy may be too late, too harsh and even dangerous.

Risk Processing is a strategy whereby the company deals with the risk immediately. This area of knowledge and practice is usually called “Risk and Control Management”. Controls are (in our case) code-based solutions that try to prevent the risk event from occurring and/or to compensate for (mitigate) the damage resulting from the risk event’s materialisation.

Best practice (though relatively expensive) recommends placing both preventive and reactive controls at each point in the code where risk events might occur. Examples of preventive controls in code are: a) data quality control before the data is passed for processing, or b) verification that the target code is accessible before communicating with it. A special warning applies to solutions where one piece of code fires an event (event notification) and another is supposed to listen for it. Since such communication is asynchronous and fully decoupled, the state of the listening code is generally unknown; as a result, the listener may not be listening when the event is fired, and the information gets lost. Therefore, a special data-accumulating pattern (additional code) should be applied as a preventive control.
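The two preventive controls described above can be sketched as follows. All names are illustrative: a data-quality check that rejects bad data before processing, and a data-accumulating buffer so that events fired while no listener is attached are retained rather than lost.

```python
# Sketch of two preventive controls (names are illustrative):
# 1) a data-quality check before data is passed for processing;
# 2) a data-accumulating buffer so that events fired while no listener
#    is attached are retained rather than lost.

from collections import deque

def validate_payment(record: dict) -> bool:
    """Preventive control: reject bad data before it reaches processing."""
    return (
        isinstance(record.get("amount"), (int, float))
        and record["amount"] > 0
        and bool(record.get("account_id"))
    )

class AccumulatingEmitter:
    """Preventive control for fire-and-forget events: buffer until delivered."""
    def __init__(self):
        self._buffer = deque()
        self._listener = None

    def fire(self, event):
        self._buffer.append(event)      # accumulate instead of losing events
        self._drain()

    def attach(self, listener):
        self._listener = listener       # a listener may attach late
        self._drain()

    def _drain(self):
        while self._listener and self._buffer:
            self._listener(self._buffer.popleft())

received = []
emitter = AccumulatingEmitter()
emitter.fire({"amount": 10, "account_id": "A1"})   # fired before anyone listens
emitter.attach(received.append)                    # late listener still gets it
```

In production the buffer would be a durable store or message broker rather than an in-memory deque, but the pattern is the same: the emitter never assumes the listener is present.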

A well-known example of a reactive code control is the exception-handling clause found in programming languages such as Java, C# and C++ that are used for data processing. In many business cases, catching unexpected exceptions and merely logging them for later (off-line) analysis and correction is unacceptable because: a) the execution flow may not be interrupted, according to business needs/requirements; b) the latency caused by the correction exceeds the SLA/customer expectations; c) the cost of the “lost opportunity” may be too high.

Many developers consider that logging alone is enough (for them), while developers working on interactions with human users believe that throwing an error message at the end-user is the “new normal”, which it is not. Since software products are becoming more and more service-oriented, code should follow the “golden service rule”: the consumer does not want to care about the internal problems of a service. That is, if a service includes code that carries a risk of failure, the service must minimise or eliminate the impact on the consumer if the risk materialises.
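A reactive control that follows the golden service rule might look like the sketch below: the failure is still logged for later analysis, but the consumer receives a degraded yet usable answer instead of an error. The pricing backend and the cached fallback are hypothetical.

```python
# Sketch of a reactive control that follows the "golden service rule":
# the failure is logged, but the consumer gets a degraded answer
# (a cached price) instead of an error. All names are hypothetical.

import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pricing")

def fetch_live_price(symbol: str) -> float:
    # Stand-in for a risky dependency (network call, database, etc.).
    raise TimeoutError("pricing backend unavailable")

def get_price(symbol: str, cached: dict) -> float:
    """Reactive control: record the failure, then shield the consumer."""
    try:
        return fetch_live_price(symbol)
    except Exception as exc:
        log.warning("live pricing failed for %s: %s", symbol, exc)
        return cached[symbol]  # degraded answer instead of an error page

price = get_price("FUND1", cached={"FUND1": 101.5})
```

The consumer of `get_price` never sees the `TimeoutError`; the risk materialised, was recorded, and was absorbed inside the service boundary.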

Belated fixes of code failures are less acceptable in the Cloud than they were on-premises. Cloud technology and general digitalisation offer such a wide and resilient market that delays caused by code fixing are not excusable – consumers simply turn away from poor quality. I am afraid that DevOps will not like this comment at all.

The recommendation for handling Risk as Code is to create and authorise governing policies that enumerate the code cases where preventive and reactive controls are must-haves. Code reviews and code tests should verify these controls before the code can be considered “done” and permitted for release. Will this increase delivery time? Yes. Will it improve the quality of development outcomes? Yes. Will it increase consumer satisfaction? Yes. Will it decrease the total cost of ownership for the company? Yes. I would like to remind DevOps of the genius maxim articulated by Albert Einstein, “Everything should be made as simple as possible, but not simpler”, which in our case may sound like ‘Everything should be made as simple and fast as possible, but not simpler and faster than necessary’.

Part 3 – Quality as Code

The quality of code is a topic known almost since the early days of programming (around 1970). First it was about bug-free code, then documented code, then compliance with coding standards and best practices. Nowadays, the requirements for code quality have increased and ask not only for clarity, simplicity, accuracy, security and elegance, but also for

·      performance and de-coupling,

·      reliability and robustness,

·      flexibility and adaptability,

·      testability,

·      maintainability and manageability.

The enumerated qualities are usually realised, or have to be realised, by additional code, which we call Quality as Code. Thus, code quality and Quality as Code are different matters. If we strip the operating code of its additional code (Quality as Code), the original code can lose certain qualities.

The list above is not complete and could be extended, but this article explains only these qualities.

Performance and de-coupling – it is no secret that the same task can be coded in a few different ways. Since modern programming languages run on top of low-level languages and constructs created at different times, code performance depends on everything: lower-level code performance, higher-level code performance and the design of their combination. Thus, code performance is a balance between simplicity and efficiency. Distributed computing adds complexity as the cost of qualities unavailable in a monolithic design. Distribution not only distances/de-couples code fragments from each other, but also demands additional code (Quality as Code) that enables interactions between the code fragments. Therefore, the performance of Quality as Code should also be counted in the overall code performance. For example, at the application level, the highest performance belongs to a direct synchronous interaction between code fragments. At the same time, if this interaction is conducted over the network (like REST), it has minimal reliability and robustness, and an alternative – asynchronous messaging with lower performance – may be preferable (both REST and messaging constitute Quality as Code).
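The two interaction styles can be contrasted in a few lines. In this sketch (all names illustrative), the queue is the extra Quality as Code: it buys robustness, since the request survives until a worker drains it, at the cost of immediacy.

```python
# Sketch contrasting the two interaction styles (names illustrative):
# a direct synchronous call versus asynchronous messaging via a queue.
# The queue is the additional "Quality as Code" that buys robustness
# at the cost of immediacy.

import queue

def price_service(symbol: str) -> float:
    return {"FUND1": 101.5}[symbol]

# 1) Direct synchronous interaction: fastest, but the caller fails
#    the moment the callee fails.
direct_result = price_service("FUND1")

# 2) Asynchronous messaging: the request is buffered and survives even
#    if the worker is not running yet; it is processed whenever the
#    worker drains the queue.
requests = queue.Queue()
replies = queue.Queue()

requests.put("FUND1")                 # producer fires and forgets

while not requests.empty():           # worker drains the queue later
    replies.put(price_service(requests.get()))

async_result = replies.get()
```

Both paths produce the same answer; what differs is when the work happens and which side bears the failure if the other is unavailable.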

Reliability and robustness – code reliability indicates the ability of the code to run without failure; it may be measured as the average time between failures or the probability of code failure over a specific period of time. Code robustness is the ability of code to continue execution (even partially) under unexpected failure conditions. Thus, reliability requires recording all code failures, e.g. logging and/or reporting (Quality as Code), and robustness needs additional code (Quality as Code) that can continue running if the core code fails to perform its task. In some industries, a code/task failure is not an option, regardless of the methods of development, deployment and execution, and this Quality as Code is a must-have.
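A minimal sketch of both qualities together, with illustrative names: the core task may fail on an individual record, but the additional code keeps the batch running partially (robustness) and records every failure for later reliability analysis.

```python
# Sketch of robustness as additional code: the core task (processing one
# record) may fail, but the wrapper keeps the batch running partially and
# records every failure for reliability reporting. Names are illustrative.

failures = []  # reliability: every failure is recorded, not swallowed

def process_record(value: int) -> int:
    """Core code: may fail on bad input."""
    if value < 0:
        raise ValueError(f"negative value: {value}")
    return value * 2

def robust_batch(values):
    """Quality as Code: continue the batch even when individual records fail."""
    results = []
    for v in values:
        try:
            results.append(process_record(v))
        except ValueError as exc:
            failures.append(str(exc))   # recorded for later off-line analysis
    return results

results = robust_batch([1, -2, 3])      # partial execution despite one failure
```

Stripping the wrapper away leaves `process_record` functionally intact but makes the whole batch fail on the first bad record, which is exactly the point about the original code losing a quality.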

Flexibility and adaptability – in the current context, code flexibility is about changing code in response to external factors; code adaptability is about adapting a code fragment to a changed execution context, which does not necessarily require a code change. For instance, AMLD v.5 requires that each financial transaction contain information pointing to the beneficiary of the transaction. So the code composing the transaction’s message, e.g. for the Open Banking API, should be changeable quickly and easily to include the beneficiary’s identity, and additional code should provide that identity. This is a new requirement addressed via flexibility. In contrast, a service that calculates the price of a mutual fund should be adaptable to the execution context. For example, the calculating code should engage additional code (Quality as Code) that recognises where the service is used – in the UK or in the USA – because these countries apply different local financial regulations and compute the price using different formulas.
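The adaptability half of that example can be sketched as a context-recognition step that selects a jurisdiction-specific formula. Both formulas below are hypothetical placeholders and do not reflect any real regulation:

```python
# Sketch of adaptability via additional code: the core calculation engages
# a context-recognition step that picks a jurisdiction-specific formula.
# Both formulas are hypothetical placeholders, not real regulations.

def price_uk(nav: float, units: float) -> float:
    return round(nav / units, 4)            # hypothetical UK-style precision

def price_us(nav: float, units: float) -> float:
    return round(nav / units, 2)            # hypothetical US-style precision

FORMULAS = {"UK": price_uk, "US": price_us}

def fund_price(nav: float, units: float, context: dict) -> float:
    """Quality as Code: recognise the execution context, then calculate."""
    jurisdiction = context["jurisdiction"]   # e.g. derived from deployment
    return FORMULAS[jurisdiction](nav, units)

uk = fund_price(1000.0, 333.0, {"jurisdiction": "UK"})
us = fund_price(1000.0, 333.0, {"jurisdiction": "US"})
```

The calculating code itself never changes when the service is deployed in a new jurisdiction; only the context, and the formula it selects, differs.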

Testability – not every piece of code is easily testable. Software testing involves special input and outcome data as well as pre-conditions and post-conditions. The former pair requires additional code that controls/reports the input and outcome data, while the latter pair is more complex to implement and may require extra manipulation of the run-time environment’s configuration. Certain types of testing are mandatory for particular code, and while testing is already a part of DevOps practice, its automation has created additional challenges and complexity and requires extra time. This leads to the temptation to reduce testing as much as possible, especially the regression and integration tests. All testing activities obviously demand extra code (Quality as Code) and time. It is more likely than not that regression tests are skipped in full and integration tests are substituted by connectivity tests. Since the quality of outcomes is not the major attribute of DevOps work (or is less important than the time of delivery), DevOps has created a ‘fantasy’ that all consumers will collaborate and help refine the code. This idea is taken out of context: while internal consumers might provide feedback to the DevOps team, external consumers will likely turn away from the ‘raw’ product toward a competitor’s one.
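Pre-conditions and post-conditions as additional, testable code can be sketched as explicit checks wrapped around a core calculation; the transfer logic itself is purely illustrative:

```python
# Sketch of testability as additional code: explicit pre-condition and
# post-condition checks wrapped around the core calculation, so a test
# can exercise them directly. The transfer logic is illustrative.

def transfer(balance: float, amount: float) -> float:
    # Pre-conditions: the extra code that makes the function testable.
    assert amount > 0, "pre-condition: amount must be positive"
    assert amount <= balance, "pre-condition: insufficient balance"

    new_balance = balance - amount  # core logic

    # Post-condition: the outcome is verified before it is returned.
    assert new_balance >= 0, "post-condition: balance may not go negative"
    return new_balance

ok = transfer(100.0, 30.0)
```

A test suite can now probe the contract itself (does a negative amount get rejected?) rather than only the happy path, which is precisely what the extra code buys.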

Maintainability and manageability – maintainability is the degree of ease of software maintenance. It relates to the size, consistency, structure/modularity and complexity of the codebase. Any deviation from a monolithic structure toward modularity requires additional code (Quality as Code) for connectivity between modules, at either the higher or lower levels of the code. Code management is, at a minimum, the process of tracking modifications and managing changes to code. While code manageability comprises many operational activities outside the code itself, effective manageability assumes behaviour monitoring embedded into the original code. Such monitoring also constitutes Quality as Code.
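Embedded behaviour monitoring can be sketched as a decorator that records call and failure counts from inside the original code; the in-memory metrics dictionary stands in for whatever real monitoring backend an operations team would use:

```python
# Sketch of manageability as embedded monitoring: a decorator records call
# counts and failures inside the original code, so operations can observe
# its behaviour. The in-memory metrics dict is an illustrative stand-in.

import functools

metrics = {"calls": 0, "errors": 0}

def monitored(func):
    """Quality as Code: behaviour monitoring wrapped around the core code."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        metrics["calls"] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            metrics["errors"] += 1
            raise                       # the failure itself is not hidden
    return wrapper

@monitored
def divide(a: float, b: float) -> float:
    return a / b

divide(10, 2)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass
```

The core function stays untouched; the monitoring lives in the additional code around it, which is exactly the distinction between code quality and Quality as Code.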

The Security as Code, Risk as Code and Quality as Code practices observed above jointly represent the minimal requirements for DevOps work in a modern development process. These practices are equally or, in many cases, more important than ‘bold’ time to market, because the competitive advantage an organisation gains from its delivery pace can be lost to insecurity, unacceptable risks and low code quality. Even worse, not only may the advantage vanish, but the company can lose customers who become unhappy with such products/code.

Automation of Security as Code, Risk as Code and Quality as Code is possible, if not today then in the foreseeable future, based on AI/ML. Until it becomes available in the market, companies have to either build this automation in-house or perform security, risk and quality controls manually and routinely, while DevOps ought to accept interruptions and slowdowns in the delivery process, since uncontrolled code is not “done” yet.
