Paper Reading/Transformer based Embedding Model

Sequence-based approach
LayoutLM, LAMBERT: BERT-based models that add a 2D positional embedding to each input token.
Drawback: high dependence on large datasets, and heavy compute requirements.

Graph-based approach
BERTGrid, Chargrid: each document is turned into a graph, with texts or text lines as nodes; the related RoI visual features and positional information can also be attached to the nodes, and a GCN or attention network then learns the relations between neighboring nodes.
Drawbacks: performance is somewhat lower than the sequence-based approach, and a strong token embedding like BERT's is not available (BERTGri…
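The 2D positional embedding idea above can be sketched as follows. This is a minimal, hypothetical PyTorch module, not the papers' actual implementation: each token carries a bounding box (x0, y0, x1, y1) on a normalized 0..1023 grid, each coordinate is looked up in its own embedding table, and the results are summed onto the token embedding. All sizes and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Layout2DEmbedding(nn.Module):
    """Minimal sketch of a LayoutLM-style 2D positional embedding."""

    def __init__(self, vocab_size=30522, hidden=768, max_pos=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # ordinary token embedding
        self.x_emb = nn.Embedding(max_pos, hidden)    # shared table for x0 and x1
        self.y_emb = nn.Embedding(max_pos, hidden)    # shared table for y0 and y1

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq); bboxes: (batch, seq, 4) as x0, y0, x1, y1
        e = self.tok(token_ids)
        e = e + self.x_emb(bboxes[..., 0]) + self.y_emb(bboxes[..., 1])
        e = e + self.x_emb(bboxes[..., 2]) + self.y_emb(bboxes[..., 3])
        return e

ids = torch.randint(0, 30522, (2, 8))
boxes = torch.randint(0, 1024, (2, 8, 4))
out = Layout2DEmbedding()(ids, boxes)
print(out.shape)  # torch.Size([2, 8, 768])
```

The point is only that layout information enters the model additively at the embedding layer, so the Transformer body itself stays unchanged.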
LayoutLMv3 pre-trains multimodal Transformers for Document AI with unified text and image masking. It is additionally pre-trained with a word-patch alignment objective, learning cross-modal alignment by predicting whether the image patch corresponding to a text word is masked.
Contribution: LayoutLMv3 is the first multimodal model that does not rely on a pre-trained CNN or Faster R-CNN backbone, which saves param…
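The word-patch alignment objective above can be sketched as a binary classification head over the text tokens' contextual features: for each word, predict whether its corresponding image patch was masked. This is a hedged illustration, not LayoutLMv3's actual code; the class name, sizes, and the random inputs are assumptions.

```python
import torch
import torch.nn as nn

class WordPatchAlignHead(nn.Module):
    """Sketch of a word-patch alignment head: per-token binary prediction."""

    def __init__(self, hidden=768):
        super().__init__()
        # two classes: the word's image patch is masked / not masked
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, text_states):
        # text_states: (batch, seq, hidden) contextual features of text tokens
        return self.classifier(text_states)

head = WordPatchAlignHead()
feats = torch.randn(2, 8, 768)             # stand-in for encoder outputs
logits = head(feats)                       # (2, 8, 2)
labels = torch.randint(0, 2, (2, 8))       # 1 if the word's patch was masked
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
```

During pre-training this cross-entropy loss would be added to the text- and image-masking losses, forcing the text features to carry information about the visual side.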
PAPER — DocFormer: End-to-End Transformer for Document Understanding
"We present DocFormer, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In additio…" (arxiv.org)
GitHub: shabie/docformer — Implementation of DocFormer
VL-BERT
PAPER — VL-BERT: Pre-training of Generic Visual-Linguistic Representations
"We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguis…" (arxiv.org)

UNITER
PAPER — UNITER: UNiversal Image-TExt Represent…
PAPER — ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
"We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and…" (arxiv.org)

These notes were not made from a deep reading of the papers, so please treat them as reference only…
Js.Y