A History of Large Model Development
by WZhang
published 2026-02-07
Abstract: Reading classic papers from the development of large models and multimodal models, translating and interpreting them to deepen my understanding.
Papers to be translated and interpreted
| Year | Method | Paper | Link |
|---|---|---|---|
| 2017 | Transformer | Attention is All You Need | https://arxiv.org/pdf/1706.03762 |
| 2018 | GPT-1 | Improving Language Understanding by Generative Pre-Training | https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf |
| 2018 | BERT | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | https://arxiv.org/pdf/1810.04805 |
| 2019 | GPT-2 | Language Models are Unsupervised Multitask Learners | https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf |
| 2020 | GPT-3 | Language Models are Few-Shot Learners | https://arxiv.org/pdf/2005.14165 |
| 2020 | ViT | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | https://arxiv.org/pdf/2010.11929 |
| 2020 | SimCLR | A Simple Framework for Contrastive Learning of Visual Representations | https://arxiv.org/pdf/2002.05709 |
| 2020 | DETR | End-to-End Object Detection with Transformers | https://arxiv.org/pdf/2005.12872 |
| 2021 | CLIP | Learning Transferable Visual Models From Natural Language Supervision | https://arxiv.org/pdf/2103.00020 |
| 2021 | ViLT | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | https://arxiv.org/pdf/2102.03334 |
| 2021 | GLIP | Grounded Language-Image Pre-training | https://arxiv.org/pdf/2112.03857 |
| 2021 | Swin Transformer | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | https://arxiv.org/pdf/2103.14030 |
| 2021 | MAE | Masked Autoencoders Are Scalable Vision Learners | https://arxiv.org/pdf/2111.06377 |
| 2019 | MoCo | Momentum Contrast for Unsupervised Visual Representation Learning | https://arxiv.org/pdf/1911.05722 |
| 2020 | MoCo v2 | Improved Baselines with Momentum Contrastive Learning | https://arxiv.org/pdf/2003.04297 |
| 2021 | MoCo v3 | An Empirical Study of Training Self-Supervised Vision Transformers | https://arxiv.org/pdf/2104.02057 |
| 2022 | BLIP | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | https://arxiv.org/pdf/2201.12086 |
| 2023 | BLIP 2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | https://arxiv.org/pdf/2301.12597 |
| 2023 | Llama | LLaMA: Open and Efficient Foundation Language Models | https://arxiv.org/pdf/2302.13971 |
| 2023 | Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | https://arxiv.org/pdf/2307.09288 |
| 2024 | Llama 3 | The Llama 3 Herd of Models | https://arxiv.org/pdf/2407.21783 |
| 2024 | Sora | Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | https://arxiv.org/pdf/2402.17177 |
| 2023 | GPT-4 | GPT-4 Technical Report | https://arxiv.org/pdf/2303.08774 |
| 2025 | Qwen 3 | Qwen3 Technical Report | https://arxiv.org/pdf/2505.09388 |
| 2025 | DeepSeek-R1 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | https://arxiv.org/pdf/2501.12948 |