1973 onwards
Rule-based · Symbolic AI

Harold Cohen & AARON

The first serious attempt to teach a machine to make art — using hand-crafted rules rather than learned weights.

Harold Cohen, a British artist and AI researcher at UCSD, began developing AARON in 1973 — a rule-based system encoding his own artistic knowledge into explicit logical procedures. Rather than learning from data, AARON contained thousands of hand-written rules about composition, colour theory, and form.

AARON drew and painted physical canvases autonomously. Cohen never claimed the machine was "creative" in a human sense, but insisted it was genuinely making art — not executing templates. The system evolved across decades, adding representational figures in the 1980s and autonomous colour in the 1990s.

AARON represents the symbolic AI paradigm: intelligence as explicit knowledge engineering, before the statistical revolution. It is the first machine art system with a sustained exhibition history.

Paradigm
Symbolic / rule-based
Architecture
Expert system (LISP)
Output
Physical paintings
Training data
None — rules only
References
Cohen, H. (1995). "The further exploits of AARON, painter." Stanford Humanities Review
Dec 2013
Deep learning · Generative models

Variational Autoencoders (VAE)

The latent space is born — machines learn to compress and reconstruct images through a probabilistic bottleneck.

Kingma & Welling (2013) introduced the Variational Autoencoder — a neural network trained to encode images into a structured, continuous latent space, then decode back. Unlike a standard autoencoder, the VAE regularises the latent space to follow a known distribution (typically Gaussian), enabling sampling: interpolate between points in latent space and generate novel images.
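
A minimal sketch of the idea in PyTorch (the layer sizes, 784-dimensional MNIST-style input, and 20-dimensional latent are illustrative choices, not the paper's exact configuration): the encoder predicts a mean and log-variance, the reparametrisation trick draws a differentiable sample, and the loss is the negative ELBO — reconstruction error plus a KL term pulling the approximate posterior towards a standard Gaussian.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, 400)
        self.mu = nn.Linear(400, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(400, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparametrisation trick
        return torch.sigmoid(self.dec(z)), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # negative ELBO = reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Because the latent space is regularised towards a standard Gaussian, decoding points along a line between two encoded images yields plausible intermediates — the interpolation property described above.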

VAEs established the foundational idea of a learned latent space as a proxy for "image space." This concept would later become the backbone of modern latent diffusion models (which operate in VAE-compressed latent space, not pixel space), and is why DALL-E — with its discrete VAE tokeniser — and Stable Diffusion are often described as "rooted in VAEs."

Authors
Kingma & Welling, University of Amsterdam
Architecture
Encoder + reparametrisation + decoder
Loss
ELBO = reconstruction + KL divergence
Legacy
Latent diffusion models (2022)
References
Kingma, D.P. & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114
Jun 2014
GAN era · Foundational paper

Goodfellow et al. — the GAN paper

A generator and a discriminator, locked in adversarial competition. The idea that changed everything.

Ian Goodfellow, then at Université de Montréal (MILA), published "Generative Adversarial Networks" in June 2014. The paper proposed training two networks simultaneously: a generator learning to produce realistic samples, and a discriminator learning to distinguish real from fake — each improving by trying to fool the other.
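
In the paper's notation, the two networks play a minimax game over a single value function — the discriminator D maximises it, the generator G minimises it:

$$\min_G \max_D \; V(D,G) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$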

Goodfellow reportedly conceived the idea during an argument with a colleague at a bar and coded it in a single night. The original paper used simple MLP networks and produced blurry MNIST digits — yet the theoretical framework was revolutionary.
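
A minimal adversarial training loop in PyTorch, in the spirit of that original MLP setup (network sizes are illustrative, and `real` is a random stand-in for a batch of flattened MNIST images — data loading is omitted):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())        # noise -> fake image
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())  # image -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784)   # stand-in for a batch of real images in [0, 1]
for step in range(1000):
    # Discriminator step: push D(real) towards 1 and D(fake) towards 0
    fake = G(torch.randn(64, 100)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: fool D into scoring fresh fakes as real
    loss_g = bce(D(G(torch.randn(64, 100))), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```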

The paper appeared on arXiv in June 2014 and was published at NIPS (now NeurIPS) 2014, where it began to gain traction. Goodfellow's NIPS 2016 tutorial on GANs — later written up as arXiv:1701.00160 — is often cited as the moment the idea achieved wide community adoption.

First author
Ian Goodfellow
Published
NeurIPS 2014 (arXiv: Jun 2014)
Key idea
Minimax adversarial game
First outputs
MNIST digits, TFD faces, CIFAR-10
References
Goodfellow, I. et al. (2014). Generative Adversarial Networks. arXiv:1406.2661 — NeurIPS 2014
Jul 2015
Deep learning · Visualisation

Google Deep Dream

Gradient ascent on a neural network's activations produces hallucinatory imagery — and AI art goes viral for the first time.

Google researchers Alexander Mordvintsev, Christopher Olah, and Mike Tyka introduced Deep Dream in mid-2015 — first in the June "Inceptionism" blog post, then as open-source code in July — as a visualisation technique: run gradient ascent on an input image to maximise the activation of a target layer in a pretrained CNN (Inception/GoogLeNet). The result is a dreamlike, fractal-dog-covered image that became an internet sensation.
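
A rough sketch of that core loop, using torchvision's pretrained GoogLeNet (the layer choice, step size, and iteration count are illustrative; the published implementation adds octaves, jitter, and input normalisation):

```python
import torch
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()   # pretrained Inception / GoogLeNet
acts = {}
model.inception4c.register_forward_hook(lambda m, i, o: acts.update(out=o))  # capture one layer's activation

img = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from noise, or load a photo instead
for _ in range(50):
    model(img)
    loss = acts["out"].norm()        # maximise the chosen layer's activation
    loss.backward()
    with torch.no_grad():
        img += 0.05 * img.grad / (img.grad.abs().mean() + 1e-8)  # normalised gradient ascent step
        img.grad.zero_()
        img.clamp_(0, 1)
```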

Deep Dream images were exhibited and auctioned — a 2016 benefit auction of one collection raised ~$97,000 — among the first times neural-network art reached commercial auction. The images remained abstract and pattern-driven, perceived more as curiosity than fine art, but they established AI art as a cultural conversation.

Neural Style Transfer (Gatys et al., 2015) followed weeks later, enabling style/content disentanglement — painting photos in the manner of Van Gogh. Together, these techniques brought deep learning into the art world's vocabulary.

Technique
Gradient ascent on CNN activations
Base model
InceptionNet / GoogLeNet
Output style
Abstract / pareidolic
Auction
~$97,000 raised (2016)
References
Mordvintsev, A., Olah, C., Tyka, M. (2015). Inceptionism: Going Deeper into Neural Networks. Google Research Blog
Gatys, L., Ecker, A., Bethge, M. (2015). A Neural Algorithm of Artistic Style. arXiv:1508.06576
2011–2016
GAN era · Dataset

ALAgrApHY — 1001 Faces

Five years of street photography across continents becomes the training ground for a portrait of all humanity.

Between 2011 and 2016, the photographic collective ALAgrApHY assembled 1001 Faces — a dataset of 1,001 real human portraits captured across continents, intentionally seeking diversity in age, ethnicity, gender expression, and culture. The project was conceived as a photographic statement on human variety before it became a training corpus.

This dataset would become the seed of Muses Endormies — the GAN trained on these 1,001 real faces to hallucinate 4,368 synthetic portraits, roughly four imaginary children for every real face.

Corpus size
1,001 real portraits
Capture period
2011–2016
Geographic scope
Multi-continental
Used to train
GAN → 4,368 faces
Dec 2016
GAN era · Community adoption

GAN Tutorial — NeurIPS 2016

Goodfellow's NeurIPS tutorial makes GANs the hottest topic in machine learning. The race to generate realistic faces begins.

Although GANs were published in 2014, it was the NeurIPS 2016 workshop and tutorial by Goodfellow that launched the GAN gold rush. The tutorial drew standing-room audiences and was watched by tens of thousands online. The surrounding years produced dozens of GAN variants: DCGAN (convolutional), conditional GAN, WGAN (Wasserstein loss), CycleGAN, and more.

The DCGAN paper (Radford et al., 2015) was particularly influential — deep convolutional GANs that could generate convincing 64×64 bedroom images and interpolate meaningfully in latent space. By 2016, GANs could generate credible human face images at low resolution.
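
Latent-space interpolation is easy to express: pick two noise vectors and decode points along the line between them. The DCGAN-style generator below is untrained and purely illustrates the shapes involved — a trained checkpoint is needed for meaningful frames:

```python
import torch
import torch.nn as nn

# Untrained DCGAN-style generator: 100-d noise -> 64x64 RGB image (illustrative only)
G = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
)

z0, z1 = torch.randn(1, 100, 1, 1), torch.randn(1, 100, 1, 1)
frames = [G((1 - t) * z0 + t * z1) for t in torch.linspace(0, 1, 8)]  # walk the line between z0 and z1
print(frames[0].shape)   # torch.Size([1, 3, 64, 64])
```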

Venue
NeurIPS 2016 Barcelona
Key variants
DCGAN, WGAN, cGAN, CycleGAN
Resolution
64×64 → 128×128 px
Key failure
Mode collapse, training instability
References
Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434
Goodfellow, I. (2016). NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160
13 Feb 2018
GAN era · Historic auction

Muses Endormies — Children of the Cloud

4,368 AI-hallucinated faces — a photographic mosaic of all humanity, auctioned eight months before Christie's Belamy.

Landmark — first major GAN art sale

On 13 February 2018, ALAgrApHY sold Muses Endormies (Sleeping Muses) — a monumental photographic mosaic composed entirely of 4,368 AI-generated faces, each hallucinated by a Generative Adversarial Network trained on the 1001 Faces dataset. The individual portraits are the Children of the Cloud: synthetic humans who never existed, yet bear the statistical imprint of a thousand real faces.

Each portrait was generated by the GAN and printed at large format. Together they form a mosaic that reads as a collective face of humanity — diverse, uncanny, beautiful. The work makes visible the GAN's fundamental operation: interpolating between real human variation to create faces that are simultaneously no one and everyone.

This sale preceded Christie's auction of Portrait of Edmond de Belamy (Obvious Art) by approximately eight months. The Belamy sold for $432,500 in October 2018 and received far greater press attention — yet Muses Endormies was earlier and, many argue, more conceptually rigorous.

Date
13 Feb 2018
Faces generated
4,368 portraits
Training data
1,001 real faces (ALAgrApHY)
Format
Photographic mosaic
GAN type
Custom GAN (DCGAN lineage)
Significance
First major GAN art auction
Oct 2018
GAN era · Media moment

Christie's — Portrait of Edmond de Belamy

$432,500 for a blurry portrait. The mainstream art world discovers AI — noisily and controversially.

In October 2018, Christie's auctioned Portrait of Edmond de Belamy by French collective Obvious Art — a GAN-generated portrait printed on canvas, signed with the GAN's loss function. It sold for $432,500, 45 times its estimate, triggering global headlines.

The work was immediately controversial: Obvious Art used a GAN architecture and training code substantially derived from Robbie Barrat's open-source implementation. Barrat, a teenage programmer, was uncredited. The debate foregrounded questions of authorship, attribution, and creativity that remain live today.

StyleGAN (Karras et al., 2018 — NVIDIA) appeared around the same time, building on NVIDIA's earlier progressive-growing GANs and introducing style-based generation. By late 2018, GANs could generate photorealistic 1024×1024 human faces.

Sale price
$432,500
Auction house
Christie's, Oct 2018
Controversy
Barrat code uncredited
Contemporary
StyleGAN (NVIDIA, 2018)
2019–2021
GAN era · Scale

GAN maturity — BigGAN, StyleGAN2, CLIP

GANs scale up, gain control, and meet language. Text-guided image synthesis becomes possible.

BigGAN (Brock et al., 2018) showed that scaling GAN training massively — more parameters, larger batches — produced dramatically better ImageNet synthesis. StyleGAN2 (Karras et al., 2020) refined the generator to eliminate characteristic artefacts, producing near-perfect photorealistic faces. These became the workhorses of AI art during 2019–2021.

CLIP (Radford et al., 2021 — OpenAI) was not a generative model but a contrastive vision-language model trained on 400M image-text pairs. Combined with CLIP-guided optimisation of a VQGAN's latent codes (VQGAN+CLIP, 2021), it put text-to-image synthesis in the hands of a broad community — type a phrase, get an image — and was the immediate precursor to the CLIP-guided diffusion systems that followed, DALL-E 2 among them.
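
A minimal, heavily simplified sketch of CLIP guidance: instead of optimising a VQGAN's latent codes, it optimises raw pixels directly (to stay self-contained), nudging the image's CLIP embedding towards a target phrase. It assumes OpenAI's `clip` package; the prompt, learning rate, and step count are illustrative.

```python
import torch
import torch.nn.functional as F
import clip   # OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-B/32", device="cpu")
for p in model.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    text_emb = F.normalize(model.encode_text(clip.tokenize(["a watercolour painting of a lighthouse"])), dim=-1)

img = torch.rand(1, 3, 224, 224, requires_grad=True)    # VQGAN+CLIP optimises VQGAN latents instead of pixels
opt = torch.optim.Adam([img], lr=0.05)
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)   # CLIP's input normalisation
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

for step in range(200):
    img_emb = F.normalize(model.encode_image((img.clamp(0, 1) - mean) / std), dim=-1)
    loss = -(img_emb * text_emb).sum()    # maximise cosine similarity between image and text embeddings
    opt.zero_grad(); loss.backward(); opt.step()
```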

BigGAN
512px ImageNet, class-conditional
StyleGAN2
1024px faces, no artefacts
CLIP
400M image-text pairs, zero-shot
VQGAN+CLIP
Text → image (community tool)
References
Brock, A. et al. (2018). Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv:1809.11096
Radford, A. et al. (2021). Learning Transferable Visual Models from Natural Language Supervision. arXiv:2103.00020
Jan 2021
Diffusion era · OpenAI

DALL-E — OpenAI (January 2021)

A 12-billion parameter transformer generates images from text captions. The era of prompt engineering begins.

OpenAI unveiled DALL-E in January 2021 — a 12-billion parameter version of GPT-3 trained on text-image pairs. Given a text caption, DALL-E generated images using a dVAE (discrete Variational Autoencoder) as a tokeniser, then an autoregressive transformer to predict image tokens conditioned on text tokens.
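
The generation loop can be caricatured in a few lines of PyTorch — toy vocabulary sizes and an untrained transformer, purely to show the data flow: caption tokens and image tokens share one causally-masked sequence, and image tokens are sampled one at a time before being handed to the dVAE decoder (not shown).

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, D = 1000, 8192, 256    # toy sizes; DALL-E's dVAE used a vocabulary of 8192 image codes
embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
to_logits = nn.Linear(D, IMAGE_VOCAB)

text_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))     # stand-in caption tokens
image_tokens = torch.empty(1, 0, dtype=torch.long)      # image codes generated so far

for _ in range(64):                                      # DALL-E itself predicts 32 x 32 = 1024 codes
    seq = torch.cat([text_tokens, TEXT_VOCAB + image_tokens], dim=1)
    n = seq.size(1)
    causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)   # each position sees only the past
    h = transformer(embed(seq), mask=causal)
    nxt = torch.multinomial(to_logits(h[:, -1]).softmax(-1), 1)          # sample the next image token
    image_tokens = torch.cat([image_tokens, nxt], dim=1)
```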

DALL-E demonstrated compositional generalisation — "an armchair in the shape of an avocado" — that no prior model had achieved. It was not released publicly but its capabilities were demonstrated in a blog post that caused immediate sensation in art and tech communities alike.

DALL-E 2 (April 2022) pivoted to a diffusion model guided by CLIP embeddings — the approach that would become dominant. Stable Diffusion (August 2022) released an open-source latent diffusion model, democratising image generation globally.

Released
January 2021 (blog post)
Architecture
dVAE tokeniser + GPT-3 transformer
Parameters
12 billion
Training
≈250M web-scraped text-image pairs
References
Ramesh, A. et al. (2021). Zero-Shot Text-to-Image Generation. arXiv:2102.12092
2021 → 2022
Diffusion era · Architecture shift

Diffusion models supersede GANs

DDPM, DALL-E 2, Imagen, Stable Diffusion — the new paradigm generates by learning to reverse noise.

Diffusion models — rooted in non-equilibrium thermodynamics — learn to denoise images. The forward process adds Gaussian noise over T steps until the image is pure noise; the reverse process trains a U-Net to predict and remove noise at each step. At inference, start from random noise and denoise iteratively.
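
The training step (Algorithm 1 in Ho et al., 2020) is compact. In the sketch below the noise-prediction network is a stand-in MLP on flattened images and the data batch is random; in practice it is a timestep-conditioned U-Net trained on real images:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule from Ho et al. (2020)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal-retention factor

eps_model = nn.Sequential(nn.Linear(784 + 1, 512), nn.ReLU(), nn.Linear(512, 784))  # stand-in for a U-Net
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

x0 = torch.rand(32, 784)                         # stand-in batch of flattened training images
for step in range(100):
    t = torch.randint(0, T, (32,))               # a random timestep per sample
    eps = torch.randn_like(x0)                   # the noise the model must learn to predict
    a = alpha_bar[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward process: noise x0 to step t in closed form
    pred = eps_model(torch.cat([x_t, t.unsqueeze(1) / T], dim=1))
    loss = ((pred - eps) ** 2).mean()            # simplified objective: predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
```

At sampling time the process runs in reverse: start from pure Gaussian noise and repeatedly subtract the predicted noise over T steps (or far fewer with DDIM).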

Key milestones: DDPM (Ho et al., 2020 — NeurIPS) showed diffusion could match GANs on image quality. DALL-E 2 (Ramesh et al., 2022) paired a diffusion decoder with CLIP image embeddings (unCLIP). Imagen (Google, 2022) used a diffusion model conditioned on T5 text embeddings. Stable Diffusion (Rombach et al., 2022 — LMU Munich / Stability AI) moved diffusion into a compressed VAE latent space — an 8×-downsampled representation that drastically cuts compute — and was released open-source.

The diffusion paradigm is rooted in VAEs: Stable Diffusion operates not in pixel space but in the latent space of a pretrained VAE encoder, then decodes back to pixels only at the final step. This is the sense in which modern diffusion models are "rooted in VAE since 2013."
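
In the Hugging Face diffusers library, that full pipeline — text encoder, U-Net denoiser in latent space, VAE decoder — is wrapped behind a few lines; the model identifier, dtype, and step count below are illustrative, and a CUDA GPU is assumed:

```python
import torch
from diffusers import StableDiffusionPipeline

# Downloads several GB of weights on first run
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# prompt -> text embedding -> iterative denoising in VAE latent space -> VAE decode to pixels
image = pipe("a portrait in the style of Rembrandt, oil on canvas", num_inference_steps=30).images[0]
image.save("portrait.png")
```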

DDPM
NeurIPS 2020 — Ho et al.
DALL-E 2
Apr 2022 — CLIP + diffusion
Stable Diffusion
Aug 2022 — open-source, LDM
Inference steps
20–50 (DDIM), vs 1000 (DDPM)
References
Ho, J. et al. (2020). Denoising Diffusion Probabilistic Models. arXiv:2006.11239
Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752
2023 → now
Diffusion era · Present

Multimodal foundation models & the open question

Video, 3D, real-time generation — and an unresolved debate about authorship, consent, and what it means to create.

The frontier has moved to video (Sora, 2024 — OpenAI), 3D generation (DreamFusion, Magic3D), and real-time synthesis. Models like Midjourney V6, FLUX, and GPT-Image produce photorealistic images indistinguishable from photographs to untrained eyes. Multimodal models (GPT-4o, Gemini) can understand, generate, and edit images in conversation.

The legal and ethical questions first raised by the Belamy controversy have intensified: class-action lawsuits from artists allege models infringe copyright by training on scraped web data. Consent, attribution, and economic displacement of human artists are live legislative battles in the EU and US.

The genealogy from Harold Cohen's hand-coded rules to a diffusion model that conjures a Rembrandt-style portrait in seconds spans five decades and two paradigm shifts. The question Cohen asked — what does it mean for a machine to make art? — has only grown harder to answer.

Video
Sora (OpenAI, 2024)
Image
Midjourney, FLUX, DALL-E 3
Legal status
Contested (US, EU)
Open question
What is authorship?