Convolutional LSTM network(ConvLSTM)

This post explains the Convolutional LSTM network (ConvLSTM), which adds a convolution structure to Long Short Term Memory (LSTM) so that it handles spatial information as well as temporal correlation. ConvLSTM was proposed to predict future time steps from spatiotemporal data, and this post is based on the paper "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting" (2015).

▶ Background

Before getting into ConvLSTM itself, you need some background on the structure of Long Short Term Memory (LSTM) and on the convolution operation and padding used in a Convolutional Neural Network (CNN). The posts below cover LSTM and CNN in more detail.

[Long Short Term Memory (LSTM) reference]

[Convolutional Neural Network (CNN) reference]

Minjeong's post "[딥러닝/머신러닝] CNN(Convolutional Neural Networks) 쉽게 이해하기" (an easy introduction to CNNs)

0. Problem Setting

1. Long Short Term Memory for Sequence Modeling

ConvLSTM extends the Fully Connected LSTM (FC-LSTM), a variant of LSTM, by adding a convolution structure. [Section 1] briefly describes the structure of FC-LSTM.

※ FC-LSTM paper: Generating Sequences With Recurrent Neural Networks (2013)

Input gate: $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + \color{blue}{W_{ci} \circ C_{t-1}} + b_i) \in (0,1)$
Forget gate: $f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + \color{blue}{W_{cf} \circ C_{t-1}} + b_f) \in (0,1)$
Update cell state: $C_t = f_t \circ C_{t-1} + i_t \circ \text{tanh}(W_{xc}x_t + W_{hc} h_{t-1} + b_c)$
- $\sigma$: sigmoid function
- $\circ$: Hadamard product
Output gate: $o_t = \sigma(W_{xo} x_t + W_{ho}h_{t-1}+\color{blue}{W_{co} \circ C_t} + b_o) \in (0,1)$
Update hidden state: $h_t = o_t \circ \text{tanh}(C_t)$

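In the standard LSTM, the gates that control the flow of information are computed without the cell state, whereas FC-LSTM feeds the cell state into the gate computations (the blue terms above). Adding the cell state information is said to further mitigate the vanishing gradient problem.

The cell weight matrices $\color{blue}{W_{ci}, W_{cf}, W_{co}}$ used when computing the gate vectors are $m \times m$ diagonal matrices, where $m$ is the dimension of the hidden layer. Each blue term therefore reduces to an element-wise product between the diagonal of the matrix and the cell state.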
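The equations above can be sketched in plain Python as a single FC-LSTM step. This is a minimal illustration, not the paper's implementation: the parameter dictionary layout (`W_xi`, `w_ci`, and so on), the helper names, and the constant weights are all assumptions made for the example. The diagonal matrices $W_{ci}, W_{cf}, W_{co}$ are stored as vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Dense matrix-vector product W v (the full connection)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fc_lstm_step(x, h_prev, c_prev, p):
    """One FC-LSTM step; `p` is a dict of weights (hypothetical layout)."""
    def pre(name):
        # W_x* x_t + W_h* h_{t-1} + b_*
        a = matvec(p["W_x" + name], x)
        b = matvec(p["W_h" + name], h_prev)
        return [ai + bi + bias for ai, bi, bias in zip(a, b, p["b_" + name])]

    # Cell-state terms: the diagonal W_c* is a vector, so W_c* ∘ C is element-wise.
    i = [sigmoid(z + w * c) for z, w, c in zip(pre("i"), p["w_ci"], c_prev)]
    f = [sigmoid(z + w * c) for z, w, c in zip(pre("f"), p["w_cf"], c_prev)]
    g = [math.tanh(z) for z in pre("c")]                 # new information
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    o = [sigmoid(z + w * cv) for z, w, cv in zip(pre("o"), p["w_co"], c)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

# Illustrative sizes: input dimension d = 3, hidden dimension m = 2.
d, m = 3, 2
params = {}
for name in "ifco":
    params["W_x" + name] = [[0.1] * d for _ in range(m)]
    params["W_h" + name] = [[0.1] * m for _ in range(m)]
    params["b_" + name] = [0.0] * m
for name in "ifo":
    params["w_c" + name] = [0.5] * m

h, c = fc_lstm_step([1.0, 2.0, 3.0], [0.0] * m, [0.0] * m, params)
```

Note that the output gate uses the freshly updated `c`, while the input and forget gates use `c_prev`, exactly as in the equations.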

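▷ Role of the gates

When the input gate $i_t$ is activated, the new information $\text{tanh}(W_{xc}x_t + W_{hc} h_{t-1} + b_c)$ is accumulated into the cell state $C_t$.
The forget gate $f_t$ controls how much of the past information $C_{t-1}$ is forgotten: in the update equation, $f_t$ close to 0 discards $C_{t-1}$ and $f_t$ close to 1 keeps it.
When the output gate $o_t$ is activated, part of the information in the cell state $C_t$ is propagated to the hidden state $h_t$.
- When $o_t$ is close to 1, the current cell state $C_t$ is judged to hold meaningful information, and $\text{tanh}(C_t)$ is passed to the output as is.
- When $o_t$ is close to 0, nothing about the current cell state $C_t$ is output, and $h_t$ is all zeros.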
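To make the gate roles concrete, the sketch below plugs the limiting gate values 0 and 1 into the cell and hidden state updates. The gates are set by hand here (in the model they come from the sigmoid and never reach 0 or 1 exactly), and the function names and numbers are assumptions chosen for illustration.

```python
import math

def update_cell(f, i, c_prev, c_new):
    """C_t = f_t ∘ C_{t-1} + i_t ∘ (new information), element-wise."""
    return [fv * cp + iv * cn for fv, iv, cp, cn in zip(f, i, c_prev, c_new)]

def emit_hidden(o, c):
    """h_t = o_t ∘ tanh(C_t), element-wise."""
    return [ov * math.tanh(cv) for ov, cv in zip(o, c)]

c_prev = [2.0, -1.0]
c_new = [0.5, 0.5]          # stands in for tanh(W_xc x_t + W_hc h_{t-1} + b_c)
ones, zeros = [1.0, 1.0], [0.0, 0.0]

kept = update_cell(ones, zeros, c_prev, c_new)      # f=1, i=0: memory carried over
replaced = update_cell(zeros, ones, c_prev, c_new)  # f=0, i=1: memory overwritten
shown = emit_hidden(ones, kept)                     # o=1: tanh(C_t) is exposed
hidden = emit_hidden(zeros, kept)                   # o=0: nothing is exposed
```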

▷ [Note] Hadamard Product

The Hadamard product multiplies two matrices of the same size element by element.

$$M = \begin{pmatrix} M_{11} & M_{12} & \cdots & M_{1n} \\
M_{21} & M_{22} & \cdots & M_{2n}\\
\vdots & \vdots & \ddots & \vdots \\
M_{m1} & M_{m2} & \cdots & M_{mn}
\end{pmatrix}, \, N = \begin{pmatrix} N_{11} & N_{12} & \cdots & N_{1n} \\
N_{21} & N_{22} & \cdots & N_{2n}\\
\vdots & \vdots & \ddots & \vdots \\
N_{m1} & N_{m2} & \cdots & N_{mn}
\end{pmatrix}$$

$$M \circ N = \begin{pmatrix} M_{11}N_{11} & M_{12}N_{12} & \cdots & M_{1n}N_{1n} \\
M_{21}N_{21} & M_{22}N_{22} & \cdots & M_{2n}N_{2n}\\
\vdots & \vdots & \ddots & \vdots \\
M_{m1}N_{m1} & M_{m2}N_{m2} & \cdots & M_{mn}N_{mn}
\end{pmatrix}$$
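As a small worked example of the definition above, the following sketch computes a Hadamard product on nested lists and contrasts it with the ordinary matrix product. The function names are chosen for this example only.

```python
def hadamard(M, N):
    """Element-wise product of two same-sized matrices."""
    if len(M) != len(N) or any(len(r) != len(s) for r, s in zip(M, N)):
        raise ValueError("matrices must have the same shape")
    return [[a * b for a, b in zip(r, s)] for r, s in zip(M, N)]

def matmul(M, N):
    """Ordinary matrix product, shown only for contrast."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
print(hadamard(M, N))  # [[5, 12], [21, 32]]
print(matmul(M, N))    # [[19, 22], [43, 50]]
```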

▶ Limitations of FC-LSTM

FC-LSTM is a powerful and useful model for handling temporal correlation, but it does not handle spatial information. It uses full connections in both the input-to-state and state-to-state transitions. That is, the input gate, forget gate, output gate, and cell state all take linear transformations of the input and the previous hidden state ($W_{xi} x_t, W_{hi} h_{t-1}, W_{xf} x_t, W_{hf} h_{t-1}, W_{xo} x_t, W_{ho} h_{t-1}, W_{xc}x_t , W_{hc} h_{t-1}$), so no spatial information is encoded.
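A rough way to see this limitation: a full connection needs the grid flattened into a vector, which places vertically adjacent pixels far apart and gives every pixel its own weight to every hidden unit. The sketch below compares that with a shared convolution kernel. The sizes are illustrative assumptions, not the configuration used in the paper.

```python
def flat_index(row, col, cols):
    """Position of pixel (row, col) after flattening a grid row by row."""
    return row * cols + col

def fc_weight_count(channels, rows, cols, hidden_units):
    """Weights in one dense input-to-state map on the flattened input."""
    return channels * rows * cols * hidden_units

def conv_weight_count(in_channels, out_channels, kernel):
    """Weights in one input-to-state convolution with a kernel x kernel filter."""
    return in_channels * out_channels * kernel * kernel

rows = cols = 64
# Two vertically adjacent pixels end up `cols` positions apart in the vector.
gap = flat_index(11, 20, cols) - flat_index(10, 20, cols)
fc = fc_weight_count(1, rows, cols, 2048)
conv = conv_weight_count(1, 64, 5)
print(gap, fc, conv)  # 64 8388608 1600
```

The dense map has no notion that positions 660 and 724 of the vector were neighbours on the grid, whereas the convolution kernel only ever combines nearby pixels and reuses the same weights at every location.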

2. Convolutional LSTM

▶ Motivation for ConvLSTM

To build a time series forecasting model for spatiotemporal data, a convolution structure was added to FC-LSTM. To handle spatial information, the inputs $\chi_1, \chi_2, \cdots, \chi_t$, cell outputs $C_1, \cdots, C_t$, hidden states $H_1, \cdots, H_t$, and gates $i_t, f_t, o_t$ are all 3D tensors of dimension $P \times m \times n$ rather than vectors, and the last two dimensions ($m \times n$) are the spatial dimensions (see Problem Setting for details).

▶ Problem Setting

Each element of the time sequence $\chi_1, \chi_2, \cdots, \chi_T$ is not a vector but an $m \times n$ grid (matrix) that carries a spatial region ($\chi_t \in \mathbb{R}^{m \times n}, \, t \in \{1,2, \cdots, T\}$).

[Figure 2] Transforming 2D image into 3D Tensor

Each pixel of the image at a given time step consists of $P$ measurements. Put differently, each element of the input time sequence can be viewed as a 3D tensor with $m$ rows, $n$ columns, and $P$ channels ($\chi_t \in \mathbb{R}^{P \times m \times n}$).

※ The cell outputs $C_1, \cdots, C_t$, hidden states $H_1, \cdots, H_t$, and gates $i_t, f_t, o_t$ have the same form as the inputs $\chi_1, \chi_2, \cdots, \chi_t$.
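The data layout described here can be sketched with nested lists: a sequence of frames, each indexed as `[measurement][row][column]`. The helper names and the sizes are assumptions for illustration.

```python
def make_frame(P, m, n, fill=0.0):
    """One frame: P measurements on an m x n grid, indexed [p][row][col]."""
    return [[[fill] * n for _ in range(m)] for _ in range(P)]

def shape(tensor):
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))

T, P, m, n = 4, 2, 3, 5          # illustrative sizes
sequence = [make_frame(P, m, n) for _ in range(T)]
sequence[0][1][2][4] = 7.5       # time step 0, measurement 1, row 2, column 4

print(len(sequence), shape(sequence[0]))  # 4 (2, 3, 5)
```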

▶ [Structure of ConvLSTM]

This section describes the structure of ConvLSTM.

$\color{red}{*}$ : convolution operator, $\color{blue}{\circ}$: Hadamard product

$W_{xi}, W_{hi}, W_{xf}, W_{hf}, W_{xc}, W_{hc}, W_{xo}, W_{ho}$: convolutional filters (kernels)

Input gate: $i_t = \sigma(\color{red}{W_{xi} * \chi_t} + \color{red}{W_{hi}*h_{t-1}} + \color{blue}{W_{ci} \circ C_{t-1}} + b_i) \in (0,1)$
Forget gate: $f_t = \sigma(\color{red}{W_{xf} * \chi_t} + \color{red}{W_{hf} * h_{t-1}} + \color{blue}{W_{cf} \circ C_{t-1}} + b_f) \in (0,1)$
- Input: past information $h_{t-1}, C_{t-1}$ and new information $\chi_t$
- Output:
  - $i_t \in (0,1)$: weight given to the new information $\tilde{C}_t = \text{tanh}(W_{xc} * \chi_t + W_{hc} *h_{t-1} + b_c)$
  - $f_t \in (0,1)$: weight that controls how much of the previous information $C_{t-1}$ is forgotten
Update cell state: $C_t = f_t \circ C_{t-1} + i_t \circ \text{tanh}(\color{red}{W_{xc} * \chi_t} + \color{red}{W_{hc} *h_{t-1}} + b_c)$
Output gate: $o_t = \sigma(\color{red}{W_{xo} * \chi_t} + \color{red}{W_{ho} * h_{t-1}}+\color{blue}{W_{co} \circ C_t} + b_o) \in (0,1)$
- Input: past information $h_{t-1}$, the updated cell state $C_t$, and new information $\chi_t$
- Output: $o_t \in (0,1)$: rather than emitting $C_t$ (in which past and current information are already balanced) as is, this weight decides how much of it is reflected in the output
Update hidden state: $h_t = o_t \circ \text{tanh}(C_t)$
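The ConvLSTM equations can be sketched in plain Python for the simplest case of one input channel and one hidden channel, so every tensor is a single grid. This is an illustration under stated assumptions rather than the paper's implementation: the parameter dictionary, the helper names, and the constant weights are hypothetical. As in most deep learning libraries, the "convolution" is computed as a cross-correlation, zero-padded at the border so the output keeps the input's grid size.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv2d_same(x, k):
    """Zero-padded, stride-1 cross-correlation of grid x with odd-sized kernel k."""
    rows, cols, r = len(x), len(x[0]), len(k) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            s = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:  # outside the grid counts as 0
                        s += k[di + r][dj + r] * x[ii][jj]
            out[i][j] = s
    return out

def grid_map(fn, *grids):
    """Apply fn element-wise across same-shaped grids."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*grids)]

def conv_lstm_step(x, h_prev, c_prev, p):
    """One single-channel ConvLSTM step; `p` is a dict of weights (hypothetical layout)."""
    def pre(name):
        # W_x* * X_t + W_h* * h_{t-1} + b_*
        return grid_map(lambda a, b: a + b + p["b_" + name],
                        conv2d_same(x, p["W_x" + name]),
                        conv2d_same(h_prev, p["W_h" + name]))

    i = grid_map(lambda z, w, c: sigmoid(z + w * c), pre("i"), p["W_ci"], c_prev)
    f = grid_map(lambda z, w, c: sigmoid(z + w * c), pre("f"), p["W_cf"], c_prev)
    g = grid_map(math.tanh, pre("c"))                      # new information
    c = grid_map(lambda fv, cp, iv, gv: fv * cp + iv * gv, f, c_prev, i, g)
    o = grid_map(lambda z, w, cv: sigmoid(z + w * cv), pre("o"), p["W_co"], c)
    h = grid_map(lambda ov, cv: ov * math.tanh(cv), o, c)
    return h, c

size = 4
kernel = [[0.1] * 3 for _ in range(3)]
params = {"W_c" + name: [[0.5] * size for _ in range(size)] for name in "ifo"}
for name in "ifco":
    params["W_x" + name] = kernel
    params["W_h" + name] = kernel
    params["b_" + name] = 0.0

x = [[1.0] * size for _ in range(size)]
zeros = [[0.0] * size for _ in range(size)]
h, c = conv_lstm_step(x, zeros, zeros, params)
```

Compared with the FC-LSTM step, only the matrix-vector products have been replaced by convolutions; the Hadamard terms with the cell state are unchanged.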

For the states $h_{t-1}$ and $C_{t-1}$ to keep the same number of rows and columns as the input $\chi_t$, padding is needed before the convolution operation is applied. Just as the initial hidden state is set to $0$, ConvLSTM sets all of the padding to 0 as well (zero-padding).
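The effect of zero-padding on grid size can be checked with the standard stride-1 relation: an axis of length $n$ convolved with a kernel of length $k$ after padding $p$ on each side has length $n + 2p - k + 1$. The sketch below uses hypothetical helper names and a 4 x 4 grid with a 3 x 3 kernel as an example.

```python
def zero_pad(grid, p):
    """Surround a 2D grid with p rows and columns of zeros on every side."""
    cols = len(grid[0])
    blank = [0.0] * (cols + 2 * p)
    middle = [[0.0] * p + list(row) + [0.0] * p for row in grid]
    return [blank[:] for _ in range(p)] + middle + [blank[:] for _ in range(p)]

def conv_output_size(n, k, p):
    """Stride-1 output length along one axis: n + 2p - k + 1."""
    return n + 2 * p - k + 1

grid = [[1.0] * 4 for _ in range(4)]
padded = zero_pad(grid, 1)
print(len(padded), len(padded[0]))   # 6 6
print(conv_output_size(4, 3, 0))     # 2
print(conv_output_size(4, 3, 1))     # 4
```

Without padding the state would shrink at every time step; with $p = (k - 1) / 2$ for an odd kernel size $k$, it keeps the grid size of the input.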

3. Encoding-Forecasting Structure

▶ [Problem] Spatial-temporal sequence forecasting problem

The goal is to use ConvLSTM to predict the spatial data $(\chi_{t+1}, \chi_{t+2}, \cdots, \chi_{t+K})$ for the $K$ time steps after $t$, based on the previously observed spatial data $(\chi_{t-J+1}, \chi_{t-J+2}, \cdots, \chi_{t})$.

The authors of the paper build the model from the following two networks.

Encoding Network: compresses the whole input sequence into a hidden state tensor
Forecasting Network: unfolds this hidden state to give the final prediction

[Figure 4]. Encoding-Forecasting Structure

Both the encoding network and the forecasting network are built by stacking several ConvLSTM layers. The initial hidden state $h^{F}_0$ and initial cell state $C^{F}_0$ of the forecasting network are copied from the last hidden state $h^{E}_L$ and last cell state $C^{E}_L$ of the encoding network.

[Example: Figure 4]

The last hidden state $h^{E1}_L$ and cell state $C^{E1}_L$ of $\text{ConvLSTM}_1$ in the encoding network become the initial hidden state $h^{F1}_0$ and cell state $C^{F1}_0$ of $\text{ConvLSTM}_3$ in the forecasting network.
The last hidden state $h^{E2}_L$ and cell state $C^{E2}_L$ of $\text{ConvLSTM}_2$ in the encoding network become the initial hidden state $h^{F2}_0$ and cell state $C^{F2}_0$ of $\text{ConvLSTM}_4$ in the forecasting network.

All the states of the forecasting network are concatenated and fed into a $1 \times 1$ convolution layer (prediction layer), which outputs the final prediction $\hat{\chi}_t$ with the same dimensions as the input $\chi_t$.
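A $1 \times 1$ convolution mixes channels independently at each pixel, so the prediction layer amounts to a per-pixel linear combination of the concatenated state channels. The sketch below assumes two forecasting layers with 2 and 1 hidden channels on a 2 x 2 grid; the function name, weights, and values are illustrative only.

```python
def conv1x1(channels, weights, bias):
    """1x1 convolution.

    channels: list of C grids (m x n); weights: P rows of C values;
    bias: P values. Returns P output grids of the same m x n size.
    """
    rows, cols = len(channels[0]), len(channels[0][0])
    return [[[sum(w * ch[i][j] for w, ch in zip(w_row, channels)) + b
              for j in range(cols)] for i in range(rows)]
            for w_row, b in zip(weights, bias)]

layer1_states = [[[1.0, 2.0], [3.0, 4.0]],   # hidden channel 1 of layer 1
                 [[0.0, 1.0], [0.0, 1.0]]]   # hidden channel 2 of layer 1
layer2_states = [[[2.0, 2.0], [2.0, 2.0]]]   # single hidden channel of layer 2

# Concatenation along the channel axis is list concatenation in this layout.
stacked = layer1_states + layer2_states
prediction = conv1x1(stacked, weights=[[1.0, -1.0, 0.5]], bias=[0.0])
print(prediction)  # [[[2.0, 2.0], [4.0, 4.0]]]
```

The number of weight rows sets the number of output channels, which is how the prediction is brought back to the $P$ channels of the input.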

Structurally this resembles an LSTM-based Encoder-Decoder; the difference is that the inputs to ConvLSTM are 3D tensors.

※ [Link] LSTM Encoder Decoder explanation

The encoding network of ConvLSTM is similar to the encoder network of an LSTM-based Encoder-Decoder.
The forecasting network of ConvLSTM is similar to the decoder network of an LSTM-based Encoder-Decoder.

▶ Equations

Written as equations, the forecasting procedure of the ConvLSTM network is as follows.

$$\begin{align*}
\hat{\chi}_{t+1}, \cdots, \hat{\chi}_{t+K} &= \text{argmax}_{\chi_{t+1}, \cdots, \chi_{t+K}} p (\chi_{t+1}, \chi_{t+2}, \cdots, \chi_{t+K} | \hat{\chi}_{t-J+1}, \cdots, \hat{\chi}_{t}) \tag{1-1}
\\
&\approx \text{argmax}_{\chi_{t+1}, \cdots, \chi_{t+K}} p (\chi_{t+1}, \chi_{t+2}, \cdots, \chi_{t+K} | f_{\text{encoding}}(\hat{\chi}_{t-J+1}, \cdots, \hat{\chi}_{t})) \tag{1-2}
\\
&\approx g_{\text{forecasting}}(f_{\text{encoding}}(\hat{\chi}_{t-J+1}, \cdots, \hat{\chi}_{t})) \tag{1-3}
\end{align*}$$

$f_{\text{encoding}}$: Encoding network
$g_{\text{forecasting}}$: Forecasting network

(Eq. $1-1$) Given the spatial data of the previous time steps $\hat{\chi}_{t-J+1}, \cdots, \hat{\chi}_{t}$, the goal is to predict the spatial data $\hat{\chi}_{t+1}, \cdots, \hat{\chi}_{t+K}$ for the next $K$ time steps.

(Eq. $1-2$) The encoding network $f_{\text{encoding}}$ processes the previous spatial data $\hat{\chi}_{t-J+1}, \cdots, \hat{\chi}_{t}$ and produces the hidden states and cell states that summarize it.

(Eq. $1-3$) Starting from the last hidden state and cell state, the forecasting network $g_{\text{forecasting}}$ predicts the spatial data $\hat{\chi}_{t+1}, \cdots, \hat{\chi}_{t+K}$ for the next $K$ time steps.
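The composition $g_{\text{forecasting}}(f_{\text{encoding}}(\cdot))$ can be sketched structurally as two loops over stacked recurrent cells. Everything concrete here is an assumption for illustration: `toy_cell` is a scalar stand-in for a ConvLSTM cell, the forecasting network is fed a zero input at every step (the source does not specify its input), and `sum` stands in for the prediction layer.

```python
import math

def toy_cell(x, h, c):
    """Scalar stand-in for a ConvLSTM cell (hypothetical): returns the new (h, c)."""
    c = 0.5 * c + x
    return math.tanh(c), c

def encode(cells, inputs, zero_state):
    """f_encoding: run the stacked cells over the inputs; return final (h, c) per layer."""
    states = [zero_state for _ in cells]
    for x in inputs:
        layer_in = x
        for l, cell in enumerate(cells):
            states[l] = cell(layer_in, *states[l])
            layer_in = states[l][0]          # hidden state feeds the next layer
    return states

def forecast(cells, init_states, steps, zero_input, predict):
    """g_forecasting: unroll the cells for `steps` steps from the copied encoder states."""
    states = list(init_states)               # h^F_0, C^F_0 := h^E_L, C^E_L per layer
    outputs = []
    for _ in range(steps):
        layer_in = zero_input                # assumption: no external input while forecasting
        for l, cell in enumerate(cells):
            states[l] = cell(layer_in, *states[l])
            layer_in = states[l][0]
        outputs.append(predict([h for h, _ in states]))  # uses the states of all layers
    return outputs

observed = [1.0, 2.0]                        # J = 2 observed frames (scalars here)
enc_states = encode([toy_cell, toy_cell], observed, (0.0, 0.0))
predictions = forecast([toy_cell, toy_cell], enc_states,
                       steps=3, zero_input=0.0, predict=sum)   # K = 3
```

With real ConvLSTM cells, each state would be a 3D tensor and `predict` would be the $1 \times 1$ convolution over the concatenated states; the control flow stays the same.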

Having walked through the structure of ConvLSTM, a model proposed for spatiotemporal data, and the way a network built from it makes predictions, we can see that adding a convolution structure to FC-LSTM lets the model exploit spatial information as well as temporal information.


In Young
