It is interesting that, alongside familiar models such as LSTMs, recurrent neural networks, diffusion models, and Transformers, the list also includes a "world model" for extracting useful representations of space and time.
Unsupervised learning of video representations using LSTMs.
https://arxiv.org/abs/1502.04681
Recurrent environment simulators.
https://arxiv.org/abs/1704.02254
World models.
https://arxiv.org/abs/1803.10122
Generating videos with scene dynamics.
https://arxiv.org/abs/1609.02612
Mocogan: Decomposing motion and content for video generation.
https://arxiv.org/abs/1707.04993
Adversarial video generation on complex datasets.
https://arxiv.org/abs/1907.06571
Generating long videos of dynamic scenes.
https://arxiv.org/abs/2206.03429
Nüwa: Visual synthesis pre-training for neural visual world creation.
https://arxiv.org/abs/2111.12417
Imagen video: High definition video generation with diffusion models.
https://arxiv.org/abs/2210.02303
Align your latents: High-resolution video synthesis with latent diffusion models.
https://arxiv.org/abs/2304.08818
Photorealistic video generation with diffusion models.
https://arxiv.org/abs/2312.06662
Attention is all you need.
https://arxiv.org/abs/1706.03762
Language models are few-shot learners.
https://arxiv.org/abs/2005.14165
An image is worth 16×16 words: Transformers for image recognition at scale.
https://arxiv.org/abs/2010.11929
Vivit: A video vision transformer.
https://arxiv.org/abs/2103.15691
Masked autoencoders are scalable vision learners.
https://arxiv.org/abs/2111.06377
Patch n’ Pack: NaViT, a vision transformer for any aspect ratio and resolution.
https://arxiv.org/abs/2307.06304
High-resolution image synthesis with latent diffusion models.
https://arxiv.org/abs/2112.10752
Auto-encoding variational bayes.
https://arxiv.org/abs/1312.6114
Deep unsupervised learning using nonequilibrium thermodynamics.
https://arxiv.org/abs/1503.03585
Denoising diffusion probabilistic models.
https://arxiv.org/abs/2006.11239
Improved denoising diffusion probabilistic models.
https://arxiv.org/abs/2102.09672
Diffusion Models Beat GANs on Image Synthesis.
https://arxiv.org/abs/2105.05233
Elucidating the design space of diffusion-based generative models.
https://arxiv.org/abs/2206.00364
Scalable diffusion models with transformers.
https://arxiv.org/abs/2212.09748
Generative pretraining from pixels.
https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf
Zero-shot text-to-image generation.
https://arxiv.org/abs/2102.12092
Scaling autoregressive models for content-rich text-to-image generation.
https://arxiv.org/abs/2206.10789
Improving image generation with better captions.
https://cdn.openai.com/papers/dall-e-3.pdf
Hierarchical text-conditional image generation with clip latents.
https://arxiv.org/abs/2204.06125
Sdedit: Guided image synthesis and editing with stochastic differential equations.
https://arxiv.org/abs/2108.01073