# Exploring Finite Factored Sets with some toy examples

Seeing as there is little secondary literature for the Finite Factored Set formalism, I thought I’d write up my experience of exploring it through some toy examples that are classic examples in the Pearlian paradigm. My goal was to see how these models that I understood very well in the Pearlian paradigm would work in Finite Factored Sets.

As a warning, this doesn’t make any use of the more exciting properties of Finite Factored Sets. It’s just an exploration of how this formalism handles the mundane stuff. This also means that I’m using the factored set *directly*, without the abstractions of the orthogonality database. Which I think is fine here, because these are tiny toy examples whose structure is fully known. (However, it’s possible that I’ve missed the entire point of Finite Factored Sets.)

The first example is the 3-variable collider that is very central to Pearl’s formalism. It is given by the following Causal Diagram:

$A$, $B$, and $C$ are all binary variables (0=false, 1=true).

The intended meaning of a Causal Diagram (or rather, *function causal model*) is that the value of a node $x_i$ is given by a deterministic function that takes as input the parents, $pa(x_i)$, (indicated by the arrows) and an “error”, or “noise”, variable $u_i$ that is governed by a probability distribution that is independent from the other error/noise variables: $x_i = f_i(pa(x_i), u_i)$. Thus, the value of $C$ is given by $c=f_c(a, b, u_C)$ where $u_C$ is noise or uncertainty that is not explicitly modeled, which we can visualize like this:

We could also split up $A$ and $B$ into a deterministic and a random part, but as they are root nodes, there is little point. It would just be $a=f_a(u_A)$.

The Pearlian formalism runs on *graphs*, but Finite Factored Sets run on the *set* $S$ of all possible outcomes – the sample space. So, the goal is now to construct a sample space that is consistent to the above graph. After that, we’ll find a *factorization* of that sample space.

I think it should be clear that to cover the whole sample space $S$, it is sufficient to consider all possible combinations of the outcomes of $A$, $B$, and $U_C$ (but not $C$), because if we know the value of these three, then we also know the value of $C$, via $f_c$.

So, we simply define $S$ as the Cartesian product of the sets of possible values of $A$ and $B$: $\mathcal{A}=\{0,1\}$, $\mathcal{B}=\{0,1\}$, and the possible values of $U_C$: $\mathcal{U}_C$, which I’ll leave undefined for the moment (except to note that it must be a finite set):

$S=\mathcal{A}\times\mathcal{B}\times\mathcal{U}_C,\quad (a, b, u_C) \in S$

(We use the lowercase letters to represent elements of sets that make up the Cartesian product: $a\in \mathcal{A}$, $b\in\mathcal{B}$, and $u_C\in\mathcal{U}_C$.)

Then – as is custom in the formalism of Finite Factored Sets – the variables $A$, $B$, $U_C$ are defined as *partitions* on $S$. For example, $A$ is a partition consisting of two parts: 1) the set of elements of $S$ where $a=0$ and 2) those where $a=1$:

$A = \{\{(a, b, u_C)\in S|a=0\}, \{(a, b, u_C)\in S|a=1\}\}$

This captures exactly what the variable $A$ is supposed to represent: the question of whether the first element of the Cartesian product is 0 or 1. $B$ is defined analogously:

$B = \{\{(a, b, u_C)\in S|b=0\}, \{(a, b, u_C)\in S|b=1\}\}$

And $U_C$ as well:

$U_C = \{\{(a, b, u_C)\in S|u_C=0\}, \dots\}$

with an as of yet undefined number of parts in the partition.

Now, given the way we constructed $S$, the set of partitions $G=\{A, B, U_C\}$ is a factorization of $S$: $F=(S, G)$, because any element of $S$ can be uniquely identified by knowing in which part of $A$, $B$, and $U_C$ it is, because $S$ is just the Cartesian product of $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{U}_C$.

Now that we have the set, let’s look at the probability distribution over $S$. Let $P(A=a')$ be shorthand for $P(\{(a, b, u_C)\in S|a=a'\})$, i.e., the probability that the outcome lands in the subset $\{(a, b, u_C)\in S|a=a' \}$ of the sample space $S$, where the first element of the Cartesian product is equal to $a'$. We define $P(B=b)$ and $P(U_C=u_c)$ analogously. Finally, let $P(a, b, u_C)$ refer to the probability of the individual outcome $(a, b, u_C)\in S$.

The fact that we want our finite factored set model to be consistent with the Causal Diagram above, implies some things about $P$. In particular, the diagram implies that $A$, $B$, and $U_C$ should be independent, which means that the joint probability should factorize like this:

$P(a, b, u_C)=P(A=a)P(B=b)P(U_C=u_C)$

But this is exactly how our (finite) set $S$ factorizes! Thus, $P$ factorizes the same way that $S$ does.

This concludes the translation of the graphical model into a semantically-equivalent finite factored set. To recap what we have accomplished: we turned the original model with the variables $A$, $B$, and $C$ into one with three independent variables: $A$, $B$, and $U_C$. And defined $C$ as a deterministic function of these. We then constructed a finite factored set with the factors $A$, $B$, and $U_C$.

Now, let’s define $C$ on $S$ as well. We again define it as a partition of $S$:

$C=\{\{(a, b, u_C)\in S|f_C(a, b, u_C)=0\},\\ \{(a, b, u_C)\in S|f_C(a, b, u_C)=1\}\}$

We simply partition $S$ depending on the output of the function $f_C$. This also allows us to define $P(C=c)$ as $P(\{(a, b, u_C)\in S|f_C(a, b, u_C)=c\})$.

In the Pearlian formalism, we can read off the fact that $A$ and $B$ are independent from the graph structure. With Finite Factored sets, we have to look at *histories*. The history of a partition (aka a variable) is roughly speaking the minimal set of factors in the factorization $G$ that is sufficient to fully specify the partition (aka variable). $A$ and $B$ are factors in their own right, so their history consists just of themselves: $h(A)=\{A\}$ and $h(B)=\{B\}$, because surely knowing $A$ is enough to know $A$. As $h(A) \cap h(B)=\emptyset$, we can conclude that $A$ and $B$ are orthogonal, but this is just because we *defined* the factorization that way, so this is no new information. Still, it’s a good consistency check.

The history of $C$ is more interesting, but still pretty trivial to determine. As long as $f_C$ makes use of all its arguments and is generally non-pathological, all the factors are needed to determine $C$. So: $h(C)=\{A, B, U_C\}$. This implies $h(A)\subset h(C)$ and $h(B)\subset h(C)$ which implies $A$ and $B$ are both *strictly before* $C$ in the time defined by our factorization. This is a non-trivial result which matches what we would have expected from the graphical model that I drew in the beginning. So, that’s good.

But what if we condition on $C$? Colliders are special because they have $A\perp B$ but $A\not\perp B|C$. Can we recover this results from our finite factored set model?

To compute conditional orthogonality, we have to compute *conditional histories*. In order to condition on $C$, we need the histories conditioned on the subsets of $S$ where $C=0$ and where $C=1$. We’ll call these subsets $C_0$ and $C_1$:

$C_0 := \{(a, b, u_C)\in S|f_C(a, b, u_C)=0\}\\ C_1 := \{(a, b, u_C)\in S|f_C(a, b, u_C)=1\}$

Let’s start with the latter: let’s determine $h(A|C_1)$ – the conditional history of $A$ given $C_1$. However, the definition of conditional history is not as straightforward as the definition of history. I’m reproducing it here:

Let $F=(S, G)$ be a finite factored set, let $X$, $Y$, $Z\in \text{Part}(S)$, and let $E\subseteq S$. We use $s\sim_X t$ to say that two elements $s$ and $t$ are in the same part in $X$. The conditional history of $X$ given $E$, written $h^F(X|E)$, is the smallest set of factors $H\subseteq G$ satisfying the following two conditions:

- For all $s, t\in E$, if $s\sim_b t$ for all $b\in H$, then $s\sim_X t$.
- For all $s, t\in E$ and $r\in S$, if $r\sim_{b_0} s$ for all $b_0\in H$ and $r\sim_{b_1} t$ for all $b_1\in G\backslash H$, then $r\in E$.

The first condition is easy. It just says that the factors in $h(A|C_1)$ should be sufficient to pin down $A$ in $C_1$. We can safely assume that $A$ itself will be enough. So, is $H:=h(A|C_1)$ just $\{A\}$ again? Well there is that second condition. To show its effect, let’s assume that indeed $H=\{A\}$. The condition then specifies something that has to be true for all $s,t\in C_1$ and $r\in S$. However, given the assumption we just made, I can construct a counterexample to the stated condition. (Thus showing by contradiction that $h(A|C_1)$ is not just $\{A\}$.)

To make things concrete, I’ll give a concrete definition of how $C$ is computed. First, let’s say that $U_C$ is the result of a 6-sided die; so: $\mathcal{U}_C=\{1, 2, 3, 4, 5, 6\}$. We’ll then say that $C$ is XOR of $A$ and $B$, except when $U_C$ rolls a 6, in which case $C$ is just 0:

$f_C(a, b, u_C)=\begin{cases} a\oplus b &\text{if } u_C \neq 6\\ 0 &\text{else.} \end{cases}$

The counterexample is then:

$s=(a_s, b_s, u_{Cs})=(1, 0, 1)\\ t=(a_t, b_t, u_{Ct})=(0, 1, 2)\\ r=(a_r, b_r, u_{Cr})=(1, 1, 2)$

Let’s first confirm that $s$ and $t$ are indeed in $C_1$. To be in $C_1$, we need $f_C(a, b, u_C)=1$. This is true for both, because $1\oplus 0=1$ and $0\oplus 1=1$ and in neither case is $u_C=6$.

Now, in the if-statement in the second condition, we first ask whether $r$ and $s$ can be distinguished by the factors in $H$. We’re assuming for now that $H$ consists only of $A$, so the question becomes whether $r$ and $s$ can be distinguished by looking at $A$. The answer is *no*, because in both cases, the first entry is 1 (so, $a=1$). Thus, as far as the factors in $H$ are concerned $r$ and $s$ are the same.

Next we must investigate $r$ and $t$ in light of the the factors that are *not* in $H$. In our case, that’s $B$ and $U_C$. As we can see, $r$ and $t$ are indeed indistinguishable under $B$ and $U_C$ because $r$ and $t$ only differ in the *first* entry of the tuple (which corresponds to $A$). The if-statement is thus true, and so according to the condition, we should have $r\in C_1$. However, this is not the case, because $1\oplus 1 = 0$, and so $r$ is in $C_0$. In other words, we have a contradiction and $h(A|C_1)$ can’t consist of only $A$.

So, what does the second condition in the definition of *conditional history* look out for? It seems to want to prevent that both $H$ and its complement are *incomplete* when it comes to being able to answer the question of whether an element is in $C_1$ or not. That is, if both the factors within $H$ and the factors without seem to indicate that an element $r$ is in $C_1$, then – the definition says – it should really be in $C_1$. The factors should *not* be split in such a way that neither $H$ nor its complement ($G\backslash H$ where $G$ is the set of all factors) can reconstruct $C$; otherwise, the border itself leaks information about the likely values of factors.

The problem can be easily fixed by adding $B$ and $U_C$ to $h(A|C_1)$ as well, so that $H$ has all the factors needed to fully pin down the border of $C_1$: $h(A|C_1)=\{A, B, U_C\}$. Via symmetry, we also get $h(A|C_0)=\{A, B, U_C\}$ and also the same for $h(B|C_{0,1})$. We thus definitely don’t have $h(A|C_{0/1})\cap h(B|C_{0/1})=\emptyset$ anymore. And so, $A$ and $B$ are *not* orthogonal given $C$.

This means we have recovered the two most important independence statements about the collider: $A\perp B$ and $A\not\perp B|C$, as well as $C\not\perp A$ and $C\not\perp B$ (just from the fact that $A$ and $B$ are strictly before $C$ in time). What remains to be confirmed are $C\not\perp A|B$ and $C\not\perp B|A$. I’ll leave that as an exercise for the reader.

As our next example, we’ll have a look at the 3-variable chain:

Separating out the noise/uncertainty:

I won’t repeat all the steps now and just skip to constructing the sample space $S$ as the Cartesian product of the possible values of $A$, $U_B$, and $U_C$. The elements of $S$ are then tuples like this one:

$(a, u_B, u_C) \in S$

We define the partitions $A$, $U_B$, and $U_C$ on this in the obvious way, and so they are our factorization of $S$. The variables $B$ and $C$ are defined as partitions of $S$ according to some deterministic function of $(a, u_B)$ and $(a, u_B, u_C)$ respectively. For the histories, it follows then that $h(A)=\{A\}$, $h(B)=\{A, U_B\}$, and $h(C)=\{A, U_B, U_C\}$, which implies $A<B<C$, as one would expect.

(It might seem remarkable here that we can reconstruct the exact order of $A$, $B$, and $C$, but that is only because we *defined* $S$ that way. Nothing interesting has happened yet. This is just a self-consistency check.)

I tried visualizing these histories but had limited success:

The interesting independence statement in this model is $A\perp C|B$. So, to investigate this, let’s look at $h(C|B_1)$ – the conditional history of $C$ on the subset of $S$ where $B=1$. The first step is to clarify that $C$ is computed via $B$:

$f_C(a, u_B, u_C)=g_C(f_B(a, u_B), u_C)$

That is, $a$ and $u_B$ are only used via $f_B$. But if we already know the output of $f_B$ (because we’re in the subset $B_1\subset S$ where $f_B=1$), then we only need $u_C$ to compute the value of $C$. Thus, it would be consistent with *condition 1* of the conditional histories definition if we had $h(C|B_1)=\{U_C\}$. But is it also consistent with *condition 2*?

In condition 2, we have first this part: “if $r\sim_{U_C} s$ …” However, $U_C$ has no bearing on whether $r$ is in $B_1$ or not, because $f_B$ doesn’t take $U_C$ as an input. So, we can ignore that part. Without that part we are left with: if $r\sim_{X}t$ for all $X\in \{A, U_B\}$, then $r\in B_1$. Is this true for all $t\in B_1$ and $r\in S$? Yes, because what it is saying is, if $r$ seems to be in $B_1$ according to $A$ and $U_B$, then it really is in $B_1$. And that is true, because $A$ and $U_B$ already fully determine $B$, so if they say you are in $B_1$, then you are.

So, we avoided the trap we had above, where neither $H$ nor its complement could reconstruct the boundary of the set we were conditioning on. Here, $A$ and $U_B$ (which are both not in $H$) are able to reconstruct the boundary of $B_1$ perfectly, so condition 2 is fulfilled. Thus, $h(C|B_1)=\{U_C\}$ (and analogously $h(C|B_0)=\{U_C\}$), which implies that $h(A|B_{0/1})\cap h(C|B_{0/1})=\emptyset$. (I didn’t check that $h(A|B_{0/1})$ doesn’t contain $U_C$, but a quick argument shows that it won’t: $U_C$ is neither needed to pin down $A$ nor to pin down the border of $B_{0/1}$.) So, $A\perp C|B$ as expected.

There is one interesting 3-variable model left – the common cause:

The reader may want to try this out for themself before reading on.

Splitting off the randomness, we get:

As before, we can construct a sample space that is factorized by $G=\{U_B, A, U_C\}$, giving us the finite factored set $F=(S, G)$. The histories for $A$, $B$, and $C$ are then again obvious: $h(A) =\{A\}$, $h(B)=\{U_B, A\}$, and $h(C)=\{A, U_C\}$. We can see that $B$ and $C$ are not independent, because their histories both include $A$. But they also don’t have a definite temporal order; we can neither say $B<^FC$ nor $C<^FB$.

From Pearl’s theory, we expect that $B$ and $C$ become independent when conditioned on $A$. So let’s look at those conditional histories. As we know all the tricks by now, it will be a very high-level analysis.

We recall the first rule of conditional histories: the history should contain those factors that are needed to pin down the variable *given that we already know the value of the conditioned variable*. If we know the value of $A$, then it suffices to know $U_B$ in order to know $B$. So, the conditional history, $H$, of $B$ given that we know the value of $A$, contains $U_B$ at the least.

The second rule of conditional histories demands that either $H$ or its complement, $G\backslash H$, (or both) is on its own sufficient to determine the value of the conditioned variable ($A$ in our case). Assuming $H=\{U_B\}$, the complement of $H$, $\{A, U_C\}$, contains $A$ itself, and so is definitely sufficient to determine $A$. Thus, when conditioning on values for $A$, $H=\{U_B\}$ is a permitted history by the second rule.

By symmetry, we get $\{U_C\}$ for the conditional history for $C$. This all then implies $B\perp C|A$ as expected.

## Conclusions #

I hope this was useful to someone. And I hope I didn’t completely mess up the intended use of this formalism.

One thing I appreciate about this formalism is that I find it easy to drop to the base level (the sample space with the factorization) to explicitly check my higher-level thoughts when I get confused. It’s nice to have that solid ground level available whenever I need it.

The rule about conditional histories is not exactly *easy*, but it feels closer to a fundamental law than the *d-separation* rules of the Pearlian paradigm, which always felt a bit arbitrary.

Finally, I still kind of think that a DAG is a really nice way to visualize dependencies and independencies of random variables. I wonder if there is a visualization that feels more native for Finite Factored Sets while still looking nice.

- Next: What’s so dangerous about AI anyway?
- Previous: Effects are correlated, policy outcomes are not, and multi-factor explanations are hard