Mixture of Experts

Mixture of Experts, or MoE for short, does not overhaul the entire Transformer. Typically attention is left unchanged and the MLP/FFN block is replaced with multiple experts.

In a dense Transformer, every token passes through the same MLP.

token -> shared MLP -> output

In an MoE Transformer, there are multiple expert MLPs, and each token goes through a router and uses only some of the experts.

token
  -> router
  -> selected expert MLPs
  -> combined output
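
The flow above can be sketched in a few lines of plain Python. This is a toy illustration, not a reference implementation: the names `make_expert`, `moe_layer`, and `router_W`, the random weights, and the dimensions are all assumptions made for the example. It also assumes the gate weights come from a softmax over the selected experts' scores, which is one common choice; other designs apply the softmax over all experts before selecting.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_expert(d_model, d_hidden, rng):
    """Toy expert MLP: linear -> ReLU -> linear, with random weights (assumed)."""
    W1 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        h = [max(0.0, v) for v in matvec(W1, x)]
        return matvec(W2, h)

    return expert

def moe_layer(x, router_W, experts, k):
    # Router: one score per expert for this token.
    scores = matvec(router_W, x)
    # Keep the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Gate weights: softmax over the selected scores only (an assumption).
    gates = softmax([scores[i] for i in top])
    # Combined output: weighted sum of the selected experts' outputs.
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top

rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 4
experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
router_W = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_layer(token, router_W, experts, k=2)
print("experts used:", chosen)
print("output:", output)
```

The output has the same dimension as the input token, so a layer like this can sit where the shared MLP would otherwise be.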

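Why use MoE

In a dense MLP, every token uses every parameter, so making the model larger raises the compute per token along with it.

MoE greatly increases the total number of expert parameters while keeping the number of experts each token uses small.

The total parameter count is large.
But only a fraction of the parameters is active for any one token.

This structure is called sparse activation.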
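
A quick back-of-the-envelope count makes the gap between total and active parameters concrete. The sizes below (`d_model=1024`, `d_hidden=4096`, 8 experts, 2 active per token) are illustrative assumptions, not figures from any particular model, and the helper names are hypothetical. Each expert is assumed to be a two-matrix MLP with biases omitted, and the router a single linear layer.

```python
def ffn_params(d_model, d_hidden):
    # Two weight matrices: up-projection and down-projection (biases omitted).
    return 2 * d_model * d_hidden

def moe_ffn_params(d_model, d_hidden, n_experts, k):
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * n_experts           # one score per expert
    total = n_experts * per_expert + router
    active = k * per_expert + router       # parameters one token actually touches
    return total, active

total, active = moe_ffn_params(d_model=1024, d_hidden=4096, n_experts=8, k=2)
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"active fraction:   {active / total:.3f}")
```

Under these assumptions the layer stores roughly 67 million parameters, while a single token touches about 17 million of them, close to one quarter.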

The router decides which experts each token is sent to.

token representation
  -> router score
  -> select top-k experts

For example, with 8 experts and top-2 routing, each token passes through only 2 of the 8 experts.
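
The top-2-of-8 case can be worked through with concrete numbers. The router scores below are made up for illustration, and `top_k_route` is a hypothetical helper; normalizing the gate weights with a softmax over just the selected scores is an assumption, since implementations differ on this detail.

```python
import math

def top_k_route(scores, k):
    """Return (expert_index, gate_weight) pairs for the k highest scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# One made-up router score per expert (8 experts).
scores = [0.1, 2.0, -0.5, 0.3, 1.0, -1.2, 0.0, 0.7]
routes = top_k_route(scores, k=2)

for expert_index, weight in routes:
    print(f"expert {expert_index}: gate weight {weight:.3f}")
```

With these scores, experts 1 and 4 are selected, with gate weights of about 0.731 and 0.269. The other six experts are never evaluated for this token.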

Placed in the big picture, MoE looks like this.

Standard Transformer:
embedding + attention + dense MLP

MoE Transformer:
embedding + attention + routed expert MLP

The key point is that MoE is not a replacement for attention; it is mainly a way of turning the MLP/FFN into a sparse expert structure.