Cloud deployments often use tagging to describe the context of a compute or storage resource, such as who owns it or which application a virtual machine or object storage bucket belongs to. However, the common resource tagging models in use don’t describe the context required for people or tools to assess security or manage risk easily.

This post will extend those tagging models with context useful for assessing security; the next post will contextualize risk. These attributes can also be used to characterize fault tolerance and estimate impacts from service outages.

Describe the Security context by recording, in tags, the intended Data Classification (Confidentiality), Integrity, and Availability of the information handled by each resource.

Risk Assessment Process

Recall that the main inputs to the risk assessment process are:

  • a risk model
  • an assessment approach
  • an analysis approach
  • domain knowledge
  • current configuration

The ‘domain knowledge’ describes the resources and environment being assessed in sufficient detail for us to analyze the information’s security and risks. This domain knowledge is the missing descriptive information that enables assessors and tools to determine relevance and create useful security and risk assessments.

Security Context

Start by tagging resources in a Cloud deployment with stakeholders’ expectations for the confidentiality, integrity and availability of information processed or stored by that resource.

This should align your cloud deployment with mainstream information security and risk management models.
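As a minimal sketch of what that looks like in practice, here is one way to record those expectations with Python and boto3. The bucket name is illustrative, and the tag keys and values follow the scheme proposed later in this post:

```python
# Minimal sketch: record stakeholders' CIA expectations as tags on an S3 bucket.
# The bucket name is illustrative; tag values use the schemes described below.
# Note: put_bucket_tagging replaces the bucket's entire tag set.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_tagging(
    Bucket="example-reports",
    Tagging={
        "TagSet": [
            {"Key": "DataClassification", "Value": "Confidential"},
            {"Key": "Integrity", "Value": "0.0001"},     # at most 1 in 10,000 records may lose integrity per month
            {"Key": "Availability", "Value": "0.999"},   # three nines, ~44 minutes of downtime per month
        ]
    },
)
```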

(Photo: Somewhat Organized Drawers, by Jesse Orrico)

Let’s define the expected contents of those tags and permitted values. I’m maintaining these tags and values in a spreadsheet in case you’d like to use them.

Data Classification (or Confidentiality)

The Data Classification (or Confidentiality) tag specifies stakeholders’ intended confidentiality level and permitted uses of the data processed or stored by this resource, both inside and outside the organization. The suggested values for this tag come from the SANS Institute’s Data Classifications:

  • Public – Non-sensitive information available for external release
  • Internal – Information that is generally available to employees and approved non-employees
  • Confidential – Information that is sensitive within the company and is intended for use only by specified groups of employees
  • Restricted – Information that is extremely sensitive and is intended for use only by named individuals within the company

Many organizations already have a data classification scheme, so it is wise to look for and review that scheme to see if you can adopt it before defining a new standard.

Describing the intended confidentiality of data processed by a resource is the most complicated of the three because confidentiality also implies or demands some description of who is authorized to see the data.

Assessors will likely be interested in additional context when evaluating confidentiality such as Application, Function, or Owner.

However, some of the most basic and important questions, such as “should this bucket be publicly accessible?”, are answerable directly from a DataClassification tag. Buckets hosting public websites should be tagged with DataClassification=Public. Buckets containing PHI or PCI data should be DataClassification=Confidential (and have a bucket policy limiting access to the relevant applications and support personnel).
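Here is a hedged sketch of that kind of check, again in Python with boto3. The DataClassification tag key comes from this post; S3’s policy-status API reports whether a bucket policy makes the bucket public:

```python
# Sketch: flag buckets whose bucket policy makes them public even though their
# DataClassification tag says they should not be. Error handling is simplified.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def data_classification(bucket: str) -> str:
    """Return the bucket's DataClassification tag value, or 'Unknown' if untagged."""
    try:
        tags = s3.get_bucket_tagging(Bucket=bucket)["TagSet"]
    except ClientError:
        return "Unknown"
    return next((t["Value"] for t in tags if t["Key"] == "DataClassification"), "Unknown")

def is_public(bucket: str) -> bool:
    """True if S3 reports that the bucket's policy makes it public."""
    try:
        return s3.get_bucket_policy_status(Bucket=bucket)["PolicyStatus"]["IsPublic"]
    except ClientError:
        return False  # no bucket policy attached at all

for bucket in (b["Name"] for b in s3.list_buckets()["Buckets"]):
    classification = data_classification(bucket)
    if is_public(bucket) and classification != "Public":
        print(f"REVIEW: {bucket} is public but tagged DataClassification={classification}")
```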

Integrity

The Integrity tag specifies stakeholders’ intended level of integrity for this data: the extent to which the data must be guarded against improper modification or destruction, including ensuring information non-repudiation and authenticity.

Some people might wonder when you wouldn’t want ‘100%’ data integrity. There are a number of cases where we can’t achieve, don’t need, or won’t pay for 100% integrity. Here are a couple of examples:

Application Logs

Application log data is often shipped, transformed, enhanced with additional metadata, and stored on a best-effort basis or using an explicit sampling scheme, because we usually don’t need all of the log data to debug problems. Often a representative sample within a given time period is all that’s needed to detect or debug a problem.

Detect Rather than Prevent Modification

In some situations, it’s sufficient to use checksums and cryptographic signatures to detect that data was modified or corrupted so that you can request a valid copy or investigate the underlying problem. Many network protocols such as TCP incorporate checksums to detect corruption during transmission. AWS CloudTrail is a good example of an audit log delivery system: it provides a digest file containing SHA-256 hashes of the log files it delivered to you. This digest is signed with CloudTrail’s private key, so you can verify that this metadata and the files delivered to you are authentic.
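Real CloudTrail digest verification involves validating an RSA signature, but the core “detect, then react” idea can be sketched with a plain SHA-256 comparison. The file path and expected hash below are placeholders:

```python
# Sketch: detect (rather than prevent) modification by comparing a file's
# SHA-256 hash against the hash recorded when the file was delivered.
# 'expected_sha256' would come from a trusted digest or manifest.
import hashlib

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected_sha256 = "..."  # taken from the signed digest file
if sha256_of("delivered-log.json.gz") != expected_sha256:
    print("Integrity check failed: request a fresh copy and investigate")
```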

You can measure integrity as the maximum portion of records that may lose integrity in a month, before stakeholders are impacted:

  • One in a hundred (1%): 0.01
  • One in a thousand (0.1%): 0.001
  • One in ten thousand (0.01%): 0.0001
  • One in one hundred thousand (0.001%): 0.00001
  • One in a million (0.0001%): 0.000001
  • Less than one in a million

You could also measure the inverse of that, the minimum portion of processed records that must maintain integrity in a month, which aligns with how we usually think about Availability.
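As a sketch of how a tool might consume this tag, assuming you already count records that fail validation each month (the counts and tag value here are made up):

```python
# Sketch: compare an observed monthly corruption rate against the Integrity tag.
# The counts are hypothetical inputs from your own pipeline metrics.
records_processed = 12_500_000   # records handled this month
records_corrupted = 90           # records that failed checksums or validation

integrity_tag = 0.00001          # Integrity=0.00001 (at most 1 in 100,000)

observed_rate = records_corrupted / records_processed
if observed_rate > integrity_tag:
    print(f"Integrity objective missed: {observed_rate:.7f} > {integrity_tag}")
else:
    print("Within the integrity objective for this month")
```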

Availability

The Availability tag specifies the portion of time (or of service requests) for which stakeholders expect the resource to provide reliable and timely access, expressed in NINES of availability or, equivalently, allowed downtime per month:

  • 0.99 availability (or 7.3 hours of downtime per month)
  • 0.999 (43.8 minutes)
  • 0.9995 (21.9 minutes)
  • 0.9999 (4.4 minutes)
  • 0.99999 (26 seconds)
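The downtime figures above follow directly from the availability value; here is a tiny sketch of the conversion, assuming an average month of about 730 hours:

```python
# Sketch: convert an Availability tag value into an allowed monthly downtime budget.
MINUTES_PER_MONTH = 730 * 60  # average month (~30.42 days)

def downtime_budget_minutes(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_MONTH

for a in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    print(f"{a}: {downtime_budget_minutes(a):.1f} minutes of downtime per month")
```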

Here are some examples of applying this definition of Availability:

A stateless load balancer or compute cluster must be deployed across three availability zones to achieve 0.9995 or better availability, and should be tagged with Availability=0.9995.

A web application using an RDBMS that wants to achieve 99.95% availability has a monthly downtime budget of roughly 22 minutes, so if that budget is to absorb more than a single incident, the system as a whole must handle a database failover within about 5 minutes. This means the database cluster service probably needs close to four nines of availability. If the DB cluster is tagged with Availability=0.9999, then operators and tools know they should expect to see:

  • a warm replica running in another availability zone, because launching a new instance takes ~10 minutes
  • configurations and an implementation that can safely and automatically promote a replica to leader within 3-4 minutes of a failure event

This leaves 1-2 minutes for the application database drivers to detect that failover and switch to the new leader unless the system provides a network endpoint that handles this transparently.
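Here is a hedged sketch of the kind of check a tool could run against that expectation, assuming the database is a single RDS instance (the identifier is illustrative) carrying an Availability tag:

```python
# Sketch: if an RDS database instance is tagged Availability=0.9999 or higher,
# verify it has a synchronous standby in another availability zone (Multi-AZ).
import boto3

rds = boto3.client("rds")

instance = rds.describe_db_instances(DBInstanceIdentifier="orders-db")["DBInstances"][0]
tags = {t["Key"]: t["Value"] for t in rds.list_tags_for_resource(
    ResourceName=instance["DBInstanceArn"])["TagList"]}

availability = float(tags.get("Availability", "0"))
if availability >= 0.9999 and not instance["MultiAZ"]:
    print("REVIEW: Availability tag promises four nines, but no Multi-AZ standby is running")
```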

Some challenges and questions

“We already use X.”

You may use an ontology that is (better|worse) than what is described here. Honestly, I’d love to discuss it with you either way. Whether or not you should continue using or extend that classification system into the cloud should hinge on how well it supports risk management decisions.

“Should we describe the component or the system?”

The app+RDBMS availability example exposes a problem. Should you tag the resource with its intended resource-level availability, or with the system-level availability? I think this is an excellent and probably open question. I’d love to hear the perspective of those using Chaos Engineering to help systems build resilience. I suggest specifying CIA requirements at the resource level. My thinking is that this provides more flexibility for you to model a dependency graph of relationships between the components of a system, or between multiple systems that share a dependency.
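To illustrate why resource-level tags compose, here is a small sketch that estimates system-level availability from per-resource Availability tags, naively assuming independent, serially-required dependencies (component names are made up):

```python
# Sketch: estimate system availability from per-resource Availability tags by
# multiplying the availabilities of serially-required dependencies.
dependencies = {
    "web-app": ["load-balancer", "app-cluster", "orders-db"],
}
availability_tags = {
    "load-balancer": 0.9999,
    "app-cluster": 0.9995,
    "orders-db": 0.9999,
}

def system_availability(system: str) -> float:
    result = 1.0
    for component in dependencies[system]:
        result *= availability_tags[component]
    return result

print(f"web-app: ~{system_availability('web-app'):.4f}")  # ~0.9993
```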

Next Steps

Once you start applying these Security tags to resources consistently, you should be able to assess the security of those resources quickly and without talking to the resource’s Owner repeatedly.

This information security context provides critical domain knowledge about how the organization intends to manage the information processed by that resource. This context can be used by people and tools to analyze and assess risks to information security and availability, estimate the cost of downtime, and more.

Next, we’ll investigate how to use this Security context plus additional information to model the loss of information security, and then compute a quantitative risk estimate for that model. This will enable you to discuss risks in terms of money, instead of ‘High’ or ¯\_(ツ)_/¯.

Stephen

#NoDrama