Modeling Risk in Cloud Deployments described how to estimate and record threat impact and likelihood information in tags applied to Cloud resources such as databases and object stores. You can compute the risk of those threats by plugging that impact and likelihood into the general risk calculation:

risk = (likelihood_confidentiality_loss * impact_confidentiality_loss) 
       + (likelihood_integrity_loss * impact_integrity_loss)
       + (likelihood_availability_loss * impact_availability_loss)
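With point estimates in hand, the formula is just a sum of three products. A minimal Python sketch follows; the numbers are hypothetical placeholders rather than values from the threat model, and `risk` is only an illustrative helper name:

```python
COMPONENTS = ("confidentiality", "integrity", "availability")

def risk(likelihood, impact):
    """Sum likelihood * impact over the three loss components."""
    return sum(likelihood[c] * impact[c] for c in COMPONENTS)

# Hypothetical point estimates: events per year and dollars per event.
likelihood = {"confidentiality": 0.2, "integrity": 0.1, "availability": 3.0}
impact = {"confidentiality": 40_000, "integrity": 5_000, "availability": 5_000}

print(f"${risk(likelihood, impact):,.0f} per year")
```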

But that’s not something you “just do.” We know that the actual impact and likelihood are generally unknowable and so we’ll need to estimate an expected loss probabilistically.

In this post, we will compute a realistic annual loss estimate in dollars for an ecommerce application using a tool that models the distribution of possible impacts and probabilities appropriately.

The threats that were modeled for the example ecommerce application were:

Lost Availability due to ‘bad’ changes and load:

  • Impact: ranging between $250 and $19,000 per incident
  • Likelihood: 3 times per year

Lost Confidentiality due to an internal threat or attack:

  • Impact: at least $1,000 for an internal leak and at most $100k if the data is exfiltrated by an attacker
  • Likelihood: we didn’t define this previously, but let’s say the probability of an internal leak is 0.5 events per year and an external leak is 0.2 events per year (once every 5 years)

What are the estimated losses for these threats?

Netflix just released the riskquant tool to help you answer precisely these questions. From the announcement:

riskquant takes a list of loss scenarios, each with estimates of frequency, low loss magnitude, and high loss magnitude, and calculates and ranks the annualized loss for all scenarios. The annualized loss is the mean magnitude averaged over the expected interval between events, which is roughly the inverse of the frequency (e.g. a frequency of 0.1 implies an event about every 10 years).

Let’s put our threat scenarios in the table form that riskquant understands (csv):

Identifier          | Name                                          | Probability | Low loss ($) | High loss ($)
WebLossConfInternal | Lose Prod User DB Confidentiality Internally  | 0.5         | 1,000        | 10,000
WebLossConfPublic   | Lose Prod User DB Confidentiality to Attacker | 0.2         | 10,000       | 100,000
WebLossAvailAnnual  | Lose Availability                             | 0.99        | 250          | 19,000
                    | Lose Availability                             | 0.00822     | 250          | 19,000

The Identifier and Name columns identify a threat to simulate.

The Low_loss and High_loss columns specify the lower and upper bounds of the impact.

The Probability column contains plain, unit-less probabilities, and technically riskquant doesn’t care what time period you simulate. This is useful when an event occurs multiple times per year, because we can’t express a probability as 300%. So to model the availability threat, we need to make a couple of adjustments. Model the threat as either:

  • occurring with near-100% probability annually (0.99 in the table) to get the expected impact of one event, then multiply by 3
  • occurring with a daily probability of 0.00822 (3/365, or about 0.82%), then multiply by 365
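A quick check of the arithmetic behind these two options, as a small Python sketch. The 0.99 annual probability is the value used in the table above; the variable names are only illustrative.

```python
DAYS_PER_YEAR = 365
events_per_year = 3

# Option 1: one near-certain event per year (0.99 in the table), scaled by 3.
annual_probability = 0.99
expected_events_annual_model = annual_probability * events_per_year

# Option 2: a daily probability of 3/365, scaled by 365 days.
daily_probability = events_per_year / DAYS_PER_YEAR
expected_events_daily_model = daily_probability * DAYS_PER_YEAR

print(f"daily probability: {daily_probability:.5f} ({daily_probability:.3%})")
print(f"expected events/year, annual model: {expected_events_annual_model:.2f}")
print(f"expected events/year, daily model:  {expected_events_daily_model:.2f}")
```

The annual model yields slightly fewer expected events (2.97 versus 3.00) because 0.99 stands in for 100%, so the two options should give close but not identical annual loss estimates.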

This approach to modeling event frequency makes some assumptions about independence and uniformity that I’ll skip for now. My hope is that this approach appears more accurate and definitely more precise than characterizing the event as having, e.g. a ‘Low’ frequency. Better information now is useful for managing risks we already have.

Ok, on to modeling the range of possible threat impacts.

Log-normal Probability Distributions (Wikipedia)

riskquant models impact value with the Log-normal distribution and reports the distribution mean as the expected loss for each threat.

Log-normal distributions always produce positive and sometimes extreme values, and the peak can be configured to resemble the most frequently observed values. These properties help them fit some phenomena better than other distributions such as a normal or uniform distribution. Log-normal distributions are often used to model losses from cyberattacks, fatigue-stress failure lifetimes, and project costs.
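To make the expected-loss calculation concrete, here is a minimal sketch using only Python’s standard library. It assumes the low and high loss values are treated as the 5th and 95th percentiles of the log-normal distribution (a 90% interval); treat that as an assumption about the simple model rather than a description of riskquant’s internals. `lognormal_mean` and `expected_loss` are hypothetical helper names, not riskquant functions.

```python
import math
from statistics import NormalDist

# z-score of the 95th percentile of a standard normal (about 1.645).
Z_95 = NormalDist().inv_cdf(0.95)

def lognormal_mean(low_loss, high_loss):
    """Mean of a log-normal whose 5th/95th percentiles are low_loss/high_loss (assumed)."""
    mu = (math.log(low_loss) + math.log(high_loss)) / 2
    sigma = (math.log(high_loss) - math.log(low_loss)) / (2 * Z_95)
    return math.exp(mu + sigma ** 2 / 2)

def expected_loss(probability, low_loss, high_loss):
    """Expected loss per simulated period: event probability times mean magnitude."""
    return probability * lognormal_mean(low_loss, high_loss)

# (name, probability, low loss, high loss), taken from the threat table.
scenarios = [
    ("Lose Prod User DB Confidentiality Internally", 0.5, 1_000, 10_000),
    ("Lose Prod User DB Confidentiality to Attacker", 0.2, 10_000, 100_000),
    ("Lose Availability (annual)", 0.99, 250, 19_000),
    ("Lose Availability (daily)", 0.00822, 250, 19_000),
]

# Print scenarios from greatest to least expected loss.
for name, p, low, high in sorted(
    scenarios, key=lambda s: expected_loss(s[1], s[2], s[3]), reverse=True
):
    print(f"{name}: ${expected_loss(p, low, high):,.0f}")
```

Under this assumption the sketch gives roughly $8,080, $5,130, $2,020, and $43, in line with the riskquant results reported later in this post.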

Let’s produce those loss estimates now. If you’d like to follow along, the files in this example are available on GitHub at qualimente/riskquant-example.

riskquant requires TensorFlow and other data analysis libraries that were easier for me to get working on Linux via Docker than on OSX. You can check out the Dockerfile used to build the Docker image I used in this pull request. The image is available on Docker Hub at qualimente/riskquant.

Run riskquant on the threat model described in the data directory:

docker container run --rm -it \
  -v "$(pwd)/data":/data/ \
  qualimente/riskquant --file /data/webapp.threat-model.csv

The riskquant program runs successfully and reports the results were written to a file:

Writing prioritized threats to:

Let’s inspect the loss estimates with cat data/webapp.threat-model_prioritized.csv, formatted below for readability:

Identifier          | Name                                          | Expected loss ($/event)
WebLossConfPublic   | Lose Prod User DB Confidentiality to Attacker | $8,080
WebLossAvailAnnual  | Lose Availability                             | $5,130
WebLossConfInternal | Lose Prod User DB Confidentiality Internally  | $2,020
                    | Lose Availability                             | $43

riskquant outputs the expected losses in order of greatest to least. The expected annual losses are:

  • Lose Prod User DB Confidentiality to Attacker: $8,080 / year
  • Lose Prod User DB Confidentiality Internally: $2,020 / year
  • Lose Availability: $15,695 / year (365*$43) or $15,390 / year (3*$5,130)
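These figures are averages. In any single year the realized loss is usually either zero or well above the mean. A minimal Monte Carlo sketch of the attacker scenario illustrates the spread. It uses only the standard library, allows at most one event per year, and assumes the low and high values are the 5th and 95th percentiles of the log-normal; it is an illustrative simulation, not riskquant’s implementation, and `simulate_annual_losses` is a hypothetical name.

```python
import math
import random
from statistics import NormalDist

def simulate_annual_losses(probability, low_loss, high_loss, years=200_000, seed=42):
    """Simulate yearly losses: at most one event per year, log-normal magnitude (assumed)."""
    z95 = NormalDist().inv_cdf(0.95)
    mu = (math.log(low_loss) + math.log(high_loss)) / 2
    sigma = (math.log(high_loss) - math.log(low_loss)) / (2 * z95)
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        rng.lognormvariate(mu, sigma) if rng.random() < probability else 0.0
        for _ in range(years)
    ]

losses = simulate_annual_losses(0.2, 10_000, 100_000)
mean_loss = sum(losses) / len(losses)
share_without_loss = sum(1 for loss in losses if loss == 0.0) / len(losses)

print(f"mean annual loss: ${mean_loss:,.0f}")
print(f"years with no loss: {share_without_loss:.0%}")
print(f"worst simulated year: ${max(losses):,.0f}")
```

The simulated mean should land near the $8,080 estimate, while about four out of five simulated years show no loss at all.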

This example was explored through riskquant’s command-line interface and the results of the SimpleLoss model were presented here. You can perform more sophisticated analyses when using riskquant as a library and configuring shape distributions directly. In particular, the library offers a pertloss function that allows much more control over the shape of the probability distribution that produces threat events.

Let’s stop here, because we’ve improved our decision making capability significantly.

Use the Information

These estimates are great information to have when deciding whether it makes sense to invest time and money in addressing the factors that caused the availability incidents or in protecting confidentiality of the ecommerce system’s user database.

Consider that if this team had $5,000 to invest in risk reduction, this information would suggest looking for ways to:

  1. significantly reduce the risk of losing confidentiality to an attacker
  2. reduce availability incidents from 3 to 1 or significantly shorten the repair time; both are key Aspects of Software Delivery Performance

Improvements in these areas are likely to have a positive ROI within a one-year time horizon, with results that can be demonstrated to the organization’s leadership.
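As a rough illustration of the ROI claim, here is a small Python sketch for the availability option. It assumes, hypothetically, that spending the whole $5,000 budget actually reduces incidents from 3 to 1 per year; the per-incident loss is the $5,130 estimate from the riskquant output.

```python
loss_per_incident = 5_130  # expected availability loss, from the riskquant output
incidents_before = 3       # incidents per year today
incidents_after = 1        # assumed outcome of the investment (hypothetical)
investment = 5_000         # the whole risk-reduction budget

annual_savings = (incidents_before - incidents_after) * loss_per_incident
roi = (annual_savings - investment) / investment

print(f"annual savings: ${annual_savings:,}")
print(f"first-year ROI: {roi:.0%}")
```

If the investment only removed one incident per year, the savings would roughly equal the budget, so the conclusion is sensitive to how effective the improvement really is.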

Also, keep in mind that the risk management solution doesn’t always need to be technical or a single investment.

For example, the team might have people available to improve availability by implementing a more robust delivery process that detects failures and helps operators roll back quickly. The team might decide to invest $4k of the risk management budget in that area. This would leave $1k to increase cybersecurity insurance coverage that might limit the organization’s public data breach loss exposure to $50k.

The effectiveness of risk management processes depends heavily on the quality of the information available to decision makers. Quantifying those risks using a robust, consistent, contextual model is a way to improve the accuracy and precision of the information used within the risk management process and help you repeat that analysis in a scalable way over time.

I’m building k9 Security to help engineers using the Cloud understand and improve their risks continuously by improving the security policies that protect their data. Hit reply if you’d like to learn more.