The Maximum Entropy principle in modelling and estimating probabilities of default for banks

I am now finally proceeding with my PhD dissertation in systems analysis and operations research. What originally caught my interest was estimating probabilities of default for a group of banks using logistic regression; see my presentation at the University of Cambridge Judge Business School, Lindgren [2016].

When we consider a statistical model for the probability of default (PD) of a business entity or a bank, we need to argue why we assume a specific statistical model for the data-generating process. After we have identified a statistical model, estimation and inference are usually rather straightforward, although they might be computationally burdensome. In this article, I explain why the logistic regression specification is a very natural one in terms of maximum entropy based statistical inference. The additional benefit is that we can use the machinery of statistical mechanics, since we will interpret the model through the Gibbs measure. This framework allows us to find expressions for various potentially useful concepts like enthalpy and free energy, usually based on the information codified in the partition function Z. Logistic regression is also a very simple model of a neural network, and this could ultimately be a very useful paradigm in finance as well: markets could be seen as a huge, adaptive, non-linear neural processing system.


The principle of maximum entropy

I will follow in the steps of Jaynes [1957], who argued that the a priori distribution should be the one that maximizes entropy given some constraints. Entropy is a concept that originated in 19th-century thermal physics and statistical mechanics as a measure of disorder, but in a larger perspective it can be considered as an expectation related to surprisal, in terms of information theory. We usually take information to be related to the logarithm of probability because of its algebraic properties. For a thorough discussion, see for example the famous work by Claude Shannon.

In a discrete probability space we define entropy as

S(p_i)=-\sum_{i=1}^{n}p_i \log{p_i}

where we define ‘surprisal’ to be \log{\frac{1}{p_i}}. Note that if an event is certain, its surprisal is zero, and as its probability approaches zero, its surprisal grows very fast towards infinity. The intuition for entropy is therefore the average surprisal when sampling. The idea now is to find the a priori distribution when we know nothing about it except some expectation taken under the distribution. If we are prudent, we should assume the distribution is the one that maximizes entropy.
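As a quick numerical illustration (a sketch in Python; the coin-flip distributions are hypothetical), the average-surprisal reading of entropy can be checked directly:

```python
import math

def surprisal(p):
    """Surprisal log(1/p): zero for a certain event, large as p -> 0."""
    return math.log(1.0 / p)

def entropy(probs):
    """Shannon entropy: the expected surprisal under the distribution."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# A fair coin is maximally surprising on average...
print(entropy([0.5, 0.5]))   # log 2 ≈ 0.6931
# ...while a biased coin has lower entropy.
print(entropy([0.9, 0.1]))   # ≈ 0.3251
```

Among all two-state distributions, the uniform one maximizes entropy when no constraint is imposed, which is the prudence principle in its simplest form.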

Consider now an expectation, call it energy

\langle E \rangle = \sum_{i=1}^{n}E_ip_i

If we now maximize entropy given a fixed constraint of average energy, we have the following Lagrangian

L(p_i)=-\sum_{i=1}^{n}p_i \log{p_i}-\beta \left(\sum_{i=1}^{n}E_ip_i-\langle E \rangle \right)-a\left(\sum_{i=1}^{n}p_i-1\right)

The last constraint is there to ensure that the probability measure is normalized to unity.
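For completeness, the stationarity condition of this Lagrangian makes the solution explicit. Setting the derivative with respect to each p_i to zero gives

\frac{\partial L}{\partial p_i}=-\log{p_i}-1-\beta E_i-a=0 \implies p_i=e^{-1-a}e^{-\beta E_i}

and the normalization constraint then fixes the constant e^{-1-a} to equal 1/Z(\beta), with Z(\beta)=\sum_{i=1}^{n}e^{-\beta E_i}.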

The maximization problem is straightforward, and the entropy-maximizing distribution is the Boltzmann distribution, also known as the Gibbs distribution

p_i=\frac{e^{-\beta E_i}}{Z(\beta)}

where Z(\beta) is the partition function that ensures the distribution is normalised to 1.
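A minimal sketch of this construction in Python (the energies and the value of beta are chosen arbitrarily for illustration):

```python
import math

def gibbs(energies, beta):
    """Gibbs distribution p_i = exp(-beta * E_i) / Z(beta) over discrete states."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)               # partition function Z(beta)
    return [w / z for w in weights]

p = gibbs([1.0, 2.0, 3.0], beta=1.0)
assert abs(sum(p) - 1.0) < 1e-12   # normalised to unity
assert p[0] > p[1] > p[2]          # lower-energy states are more probable
```

Note that as beta approaches zero the distribution flattens towards the uniform one, recovering the unconstrained maximum entropy case.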

Logistic model

We now consider a binary choice model for the problem of default. At any instant, the entity is either in default or not. We assign these two states probabilities p_1 and p_2 = 1 - p_1, respectively, and energies E_1 and E_2. The partition function is therefore

Z(\beta)=e^{-\beta E_1}+e^{-\beta E_2}

If we now substitute this into the Gibbs distribution, we have

p_1=\frac{e^{-\beta E_1}}{e^{-\beta E_1}+e^{-\beta E_2}}

This can be simplified to

p_1=\frac{1}{1+e^{-\beta (E_2-E_1)}}

This is the logistic curve, whose argument is the difference in energies. The Lagrange multiplier \beta, which in physics is the inverse temperature, can here be used to balance the units so that the argument is dimensionless.
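We can verify numerically that the two-state Gibbs measure coincides with the logistic curve in the energy difference (a sketch; the particular values of E_1, E_2 and beta are arbitrary):

```python
import math

def sigmoid(x):
    """The logistic curve 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def p_default(e1, e2, beta):
    """Two-state Gibbs probability of state 1 (default)."""
    z = math.exp(-beta * e1) + math.exp(-beta * e2)
    return math.exp(-beta * e1) / z

# Dividing numerator and denominator by exp(-beta * E_1) shows the
# two expressions are algebraically identical.
e1, e2, beta = 0.5, 2.0, 1.3
assert abs(p_default(e1, e2, beta) - sigmoid(beta * (e2 - e1))) < 1e-12
```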

Let us now identify the energies. Given that the probability of default depends on the energy difference, we should relate the two energies to risk and capital. For example, we could choose E_2 to represent total risk and E_1 to represent a capital-like variable. When risk is large compared to capital, the probability of default is close to unity.

So in other words we might choose

E_2=\vec{w}\cdot \vec{x} and E_1=\theta

where risk is a weighted sum of incoming sources of risk and \theta is a measure of capital. Given these specifications, we have the model

p_1=\frac{1}{1+e^{-\beta(\vec{w}\cdot \vec{x}-\theta)}}

We can use the logit transform to form a linear regression model

\log{\frac{p_1}{1-p_1}}=\beta(\vec{w}\cdot \vec{x}-\theta) +\epsilon

We can assume that the additive noise term is IID, normal and standardized if we assume multiplicative IID lognormal noise in the original specification. This is feasible. Multicollinearity issues in the risk vector can be ignored, because I mainly care about forecasting systemic risk; in the era of machine learning, black-box modelling is acceptable!
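To make the estimation step concrete, here is a sketch of a maximum-likelihood logistic fit in plain NumPy on simulated data. The risk weights, the capital parameter and the sample size are all illustrative assumptions, not estimates from real bank data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Maximum-likelihood fit of log(p/(1-p)) = X @ w + b by gradient ascent."""
    n, k = X.shape
    w, b = np.zeros(k), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w += lr * (X.T @ (y - p) / n)   # gradient of the mean log-likelihood
        b += lr * np.mean(y - p)
    return w, b

# Hypothetical example: three incoming risk factors per bank,
# with illustrative weights w and capital measure theta (beta folded in).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w, theta = np.array([1.5, -0.8, 0.6]), 1.0
y = (rng.random(500) < sigmoid(X @ true_w - theta)).astype(float)

w_hat, b_hat = fit_logistic(X, y)
pd_hat = sigmoid(X @ w_hat + b_hat)     # fitted probabilities of default
```

The intercept b absorbs the -\beta\theta term, so the capital measure is identified only up to the scale set by \beta, consistent with \beta acting as a unit-balancing factor.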

What next?

This is the framework that I intuitively feel is the logical foundation for my empirical studies of systemic risk. I need to consider whether I could make further use of the statistical mechanics framework.


Lindgren (2016), presentation at University of Cambridge Judge Business School

Jaynes, E. T. (1957), Information Theory and Statistical Mechanics, Phys. Rev. 106, 620


