Fourier Layer structure

Each Fourier layer is divided into two sub-layers:

\[\tilde{Y}=\mathcal{H}_\theta^l(\tilde{X})=\sigma\left(\mathcal{C}_\theta^l(\tilde{X})+\mathcal{B}_\theta^l(\tilde{X})\right)\]

where

  • \(\tilde{X}\) corresponds to the input of the current layer and \(\tilde{Y}\) to the output.

  • \(\sigma\) is an activation function. For \(l=1,\dots,L-1\) we take the ReLU (Rectified Linear Unit) activation, and for \(l=L\) we take the GELU (Gaussian Error Linear Unit) activation (both are plotted in the figure below).

image::fourier/activation_functions.png[width=180.0,height=144.0]

  • \(\mathcal{C}_\theta^l\) is a convolution sub-layer whose convolution is computed with the FFT (Fast Fourier Transform). For more details, see Section 1.

  • \(\mathcal{B}_\theta^l\) is the bias sub-layer. For more details, see Section 2.
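To fix ideas, here is a minimal PyTorch sketch of one such layer. This is an illustrative reconstruction, not the original implementation: the class name, shapes, and initialization are our own choices, and the real FFT (`rfft2`) is used since the input is real-valued.

[source,python]
----
import torch
import torch.nn as nn

class FourierLayer(nn.Module):
    """One Fourier layer: Y = sigma(C(X) + B(X))."""

    def __init__(self, channels: int, ni: int, nj: int, last: bool = False):
        super().__init__()
        # Trainable spectral kernel W_hat: one complex channel-mixing matrix
        # per retained frequency (rfft2 keeps nj//2 + 1 modes on the last axis).
        self.w_hat = nn.Parameter(
            torch.randn(ni, nj // 2 + 1, channels, channels, dtype=torch.cfloat)
            / channels
        )
        self.bias = nn.Conv2d(channels, channels, kernel_size=1)  # bias sub-layer B
        self.act = nn.GELU() if last else nn.ReLU()  # GELU for l = L, ReLU otherwise

    def forward(self, x):
        # x: (batch, channels, ni, nj), real-valued
        x_hat = torch.fft.rfft2(x)                                  # F(X)
        y_hat = torch.einsum("bcij,ijcd->bdij", x_hat, self.w_hat)  # F(X) . W_hat
        conv = torch.fft.irfft2(y_hat, s=x.shape[-2:])              # C(X)
        return self.act(conv + self.bias(x))
----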

1. Convolution sub-layer

Each convolution sub-layer \(\mathcal{C}_\theta^l\) contains a trainable kernel \(\hat{W}\) and performs the transformation

\[\mathcal{C}_\theta^l(X)=\mathcal{F}^{-1}(\mathcal{F}(X)\cdot\hat{W})\]

where \(\mathcal{F}\) corresponds to the 2D Discrete Fourier Transform (DFT) on a \(ni\times nj\) resolution grid and

\[(Y\cdot\hat{W})_{ijk}=\sum_{k'}Y_{ijk'}\hat{W}_{ijk'k}\]

In other words, the Fourier transform itself is applied channel by channel, and \(\hat{W}\) then mixes the channels independently at each frequency \((i,j)\), with no interaction between distinct frequencies.
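A small numpy sketch of this transformation, with illustrative shapes and a random (untrained) kernel, may help fix the indexing:

[source,python]
----
import numpy as np

ni, nj, nk = 8, 8, 3  # grid resolution and number of channels (illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((ni, nj, nk))
# One channel-mixing matrix per frequency (i, j)
W_hat = (rng.standard_normal((ni, nj, nk, nk))
         + 1j * rng.standard_normal((ni, nj, nk, nk)))

X_hat = np.fft.fft2(X, axes=(0, 1))               # F(X), channel by channel
Y_hat = np.einsum("ijk,ijkl->ijl", X_hat, W_hat)  # (F(X) . W_hat)_{ijl}
C = np.fft.ifft2(Y_hat, axes=(0, 1))              # F^{-1}(F(X) . W_hat)
# C is complex in general; a real output requires W_hat to satisfy the
# conjugate symmetry of real signals (or the use of rfft2/irfft2).
print(C.shape)  # (8, 8, 3)
----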

An image is fundamentally a signal. Just as a 1D signal describes changes in amplitude (sound) over time, a 2D signal describes variations in intensity (light) over space. The Fourier transform lets us move from the spatial or temporal domain to the frequency domain. In a sound signal (1D), low frequencies correspond to low-pitched sounds and high frequencies to high-pitched sounds. In an image (2D), low frequencies correspond to large homogeneous areas and blurred parts, while high frequencies correspond to contours and, more generally, to abrupt changes in intensity, as well as to noise.
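This interpretation is easy to reproduce numerically. In the sketch below (the cut-off radius of 8 modes is an arbitrary choice), keeping only the low frequencies of a sharp edge yields a blurred version of it, while the discarded high frequencies carry the contour:

[source,python]
----
import numpy as np

n = 64
img = np.zeros((n, n))
img[:, n // 2:] = 1.0                       # image with one sharp vertical edge

F = np.fft.fftshift(np.fft.fft2(img))       # spectrum, zero frequency centered
yy, xx = np.mgrid[-n // 2:n // 2, -n // 2:n // 2]
low = (xx**2 + yy**2) <= 8**2               # keep only the low-frequency modes

low_pass = np.fft.ifft2(np.fft.ifftshift(F * low)).real  # blurred, smooth parts
high_pass = img - low_pass                               # contour and fine detail
print(low_pass[0, n // 2 - 2:n // 2 + 2])   # gradual transition, not a jump
----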

The 2D DFT is defined by:

\[\mathcal{F}(X)_{ijk}=\frac{1}{ni}\frac{1}{nj}\sum_{i'=0}^{ni-1}\sum_{j'=0}^{nj-1}X_{i'j'k}e^{-2\sqrt{-1}\pi\left(\frac{ii'}{ni}+\frac{jj'}{nj}\right)}\]

The inverse 2D DFT is defined by:

\[\mathcal{F}^{-1}(X)_{ijk}=\sum_{i'=0}^{ni-1}\sum_{j'=0}^{nj-1}X_{i'j'k}e^{2\sqrt{-1}\pi\left(\frac{ii'}{ni}+\frac{jj'}{nj}\right)}\]
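These two formulas can be implemented literally and checked against numpy. Note that the convention above puts the \(\frac{1}{ni\,nj}\) factor on the forward transform, which corresponds to numpy's norm="forward"; the sketch below relies on that assumption:

[source,python]
----
import numpy as np

def dft2(X):
    """Literal implementation of the forward 2D DFT defined above (per channel)."""
    ni, nj, nk = X.shape
    i = np.arange(ni)
    j = np.arange(nj)
    Ei = np.exp(-2j * np.pi * np.outer(i, i) / ni)  # e^{-2 sqrt(-1) pi i i'/ni}
    Ej = np.exp(-2j * np.pi * np.outer(j, j) / nj)  # e^{-2 sqrt(-1) pi j j'/nj}
    return np.einsum("ab,cd,bdk->ack", Ei, Ej, X) / (ni * nj)

X = np.random.default_rng(1).standard_normal((4, 5, 2))
assert np.allclose(dft2(X), np.fft.fft2(X, axes=(0, 1), norm="forward"))
----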

We can easily show that \(\mathcal{F}^{-1}\) is indeed the inverse of \(\mathcal{F}\). We have

\[\begin{aligned} \mathcal{F}^{-1}(\mathcal{F}(X))_{ijk}&=\sum_{i'=0}^{ni-1}\sum_{j'=0}^{nj-1}\mathcal{F}(X)_{i'j'k}e^{2\sqrt{-1}\pi\left(\frac{ii'}{ni}+\frac{jj'}{nj}\right)} \\ &=\frac{1}{ni}\frac{1}{nj}\sum_{i'j'}\sum_{i''j''}X_{i''j''k}e^{-2\sqrt{-1}\pi\left(\frac{i'i''}{ni}+\frac{j'j''}{nj}\right)}e^{2\sqrt{-1}\pi\left(\frac{ii'}{ni}+\frac{jj'}{nj}\right)} \\ &=\frac{1}{ni}\frac{1}{nj}\sum_{i''j''}X_{i''j''k}\sum_{i'j'}e^{2\sqrt{-1}\pi\frac{i'}{ni}(i-i'')}e^{2\sqrt{-1}\pi\frac{j'}{nj}(j-j'')} \end{aligned}\]

Let

\[S=\sum_{i'j'}e^{2\sqrt{-1}\pi\frac{i'}{ni}(i-i'')}e^{2\sqrt{-1}\pi\frac{j'}{nj}(j-j'')}\]

Then, two main cases arise:

  • If \((i,j)=(i'',j'')\): \(S=\sum_{i',j'}1=ni\times nj\).

  • If \(i\ne i''\) and \(j\ne j''\):

\[\begin{aligned} S&=\sum_{i'}\left(e^{\frac{2\sqrt{-1}\pi}{ni}(i-i'')}\right)^{i'}\sum_{j'}\left(e^{\frac{2\sqrt{-1}\pi}{nj}(j-j'')}\right)^{j'} \\ &=\frac{1-\left(e^{\frac{2\sqrt{-1}\pi}{ni}(i-i'')}\right)^{ni}}{1-e^{\frac{2\sqrt{-1}\pi}{ni}(i-i'')}}\times \frac{1-\left(e^{\frac{2\sqrt{-1}\pi}{nj}(j-j'')}\right)^{nj}}{1-e^{\frac{2\sqrt{-1}\pi}{nj}(j-j'')}} \\ &=\frac{1-e^{2\sqrt{-1}\pi(i-i'')}}{1-e^{\frac{2\sqrt{-1}\pi}{ni}(i-i'')}}\times \frac{1-e^{2\sqrt{-1}\pi(j-j'')}}{1-e^{\frac{2\sqrt{-1}\pi}{nj}(j-j'')}}=0 \end{aligned}\]

where each factor is the closed-form sum of a geometric sequence, whose ratio differs from \(1\) precisely because \(i\ne i''\) and \(j\ne j''\). If exactly one index matches, say \(i=i''\) but \(j\ne j''\), the first factor equals \(ni\) while the second still vanishes, so \(S=0\) whenever \((i,j)\ne(i'',j'')\).
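Before concluding, this orthogonality relation is easy to verify numerically (illustrative sketch; `S` mirrors the double sum above):

[source,python]
----
import numpy as np

ni, nj = 5, 7
ip = np.arange(ni)[:, None]  # i'
jp = np.arange(nj)[None, :]  # j'

def S(i, ibis, j, jbis):
    return (np.exp(2j * np.pi * ip * (i - ibis) / ni)
            * np.exp(2j * np.pi * jp * (j - jbis) / nj)).sum()

print(abs(S(2, 2, 3, 3)))  # ni * nj = 35 when (i, j) = (i'', j'')
print(abs(S(2, 4, 3, 3)))  # ~0: the j factor equals nj, but the i factor vanishes
print(abs(S(2, 4, 3, 6)))  # ~0: both factors vanish
----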

We deduce that

\[\mathcal{F}^{-1}(\mathcal{F}(X))_{ijk} = \frac{1}{ni}\frac{1}{nj} \times ni\times nj\times X_{ijk} = X_{ijk}\]

And finally, \(\mathcal{F}^{-1}\) is indeed the inverse of \(\mathcal{F}\).
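The round trip can also be checked numerically; numpy's norm="forward" matches the convention used here (the \(\frac{1}{ni\,nj}\) factor on the forward transform, none on the inverse):

[source,python]
----
import numpy as np

X = np.random.default_rng(2).standard_normal((6, 9, 3))
X_hat = np.fft.fft2(X, axes=(0, 1), norm="forward")        # F(X)
X_back = np.fft.ifft2(X_hat, axes=(0, 1), norm="forward")  # F^{-1}(F(X))
assert np.allclose(X_back.real, X) and np.allclose(X_back.imag, 0.0)
----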

For more details about the convolution sub-layer, see the section "Some details on the convolution sub-layer".

2. Bias sub-layer

The bias sub-layer is a 2D convolution with a kernel size of 1. At each pixel, it performs a matrix multiplication over the channels: it mixes channels via a trainable matrix, but does not allow any interaction between pixels.

Precisely,

\[\mathcal{B}_\theta^l(X)_{ijk}=\sum_{k'}X_{ijk'}W_{k'k}+B_k\]
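In PyTorch this is exactly nn.Conv2d with kernel_size=1; a minimal sketch checking the equivalence (note that Conv2d stores its weight as (out, in), i.e. transposed with respect to \(W_{k'k}\) above):

[source,python]
----
import torch
import torch.nn as nn

channels, ni, nj = 3, 8, 8  # illustrative sizes
bias_layer = nn.Conv2d(channels, channels, kernel_size=1)

X = torch.randn(1, channels, ni, nj)
Y = bias_layer(X)

# Same computation written as B(X)_{ijk} = sum_{k'} X_{ijk'} W_{k'k} + B_k
W = bias_layer.weight.squeeze(-1).squeeze(-1)  # shape (out, in)
Y_manual = torch.einsum("bcij,dc->bdij", X, W) + bias_layer.bias.view(1, -1, 1, 1)
print(torch.allclose(Y, Y_manual, atol=1e-6))  # True
----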