1973 onwards
Rule-based · Symbolic AI

Harold Cohen & AARON

The first serious attempt to teach a machine to make art — using hand-crafted rules rather than learned weights.

Harold Cohen, a British artist and AI researcher at UCSD, began developing AARON in 1973 — a rule-based system encoding his own artistic knowledge into explicit logical procedures. Rather than learning from data, AARON contained thousands of hand-written rules about composition, colour theory, and form.

AARON drew and painted physical canvases autonomously. Cohen never claimed the machine was "creative" in a human sense, but insisted it was genuinely making art — not executing templates. The system evolved across decades, adding representational figures in the 1980s and autonomous colour in the 1990s.

AARON represents the symbolic AI paradigm: intelligence as explicit knowledge engineering, before the statistical revolution. It is the first machine art system with a sustained exhibition history.

Paradigm
Symbolic / rule-based
Architecture
Expert system (LISP)
Output
Physical paintings
Training data
None — rules only
References
Cohen, H. (1995). "The further exploits of AARON, painter." Stanford Humanities Review
Dec 2013
Deep learning · Generative models

Variational Autoencoders (VAE)

The latent space is born — machines learn to compress and reconstruct images through a probabilistic bottleneck.

Kingma & Welling (2013) introduced the Variational Autoencoder — a neural network trained to encode images into a structured, continuous latent space, then decode back. Unlike a standard autoencoder, the VAE regularises the latent space to follow a known distribution (typically Gaussian), enabling sampling: interpolate between points in latent space and generate novel images.
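
A minimal sketch of the idea in PyTorch (the layer sizes, 784-dimensional MNIST-style input, and 20-dimensional latent are illustrative choices, not the paper's exact configuration): the encoder predicts a mean and log-variance, the reparametrisation trick draws a differentiable sample, and the loss is the negative ELBO — reconstruction error plus a KL term pulling the approximate posterior towards a standard Gaussian.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, 400)
        self.mu = nn.Linear(400, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(400, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparametrisation trick
        return torch.sigmoid(self.dec(z)), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # negative ELBO = reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Because the latent space is regularised towards a standard Gaussian, decoding points along a line between two encoded images yields plausible intermediates — the interpolation property described above.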

VAEs established the foundational idea of a learned latent space as a proxy for "image space." This concept would later become the backbone of modern latent diffusion models (which operate in VAE-compressed latent space, not pixel space), and is why DALL-E — with its discrete VAE tokeniser — and Stable Diffusion are often described as "rooted in VAEs."

Authors
Kingma & Welling, University of Amsterdam
Architecture
Encoder + reparametrisation + decoder
Loss
ELBO = reconstruction + KL divergence
Legacy
Latent diffusion models (2022)
References
Kingma, D.P. & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114
Jun 2014
GAN era · Foundational paper

Goodfellow et al. — the GAN paper

A generator and a discriminator, locked in adversarial competition. The idea that changed everything.

Ian Goodfellow, then at Université de Montréal (MILA), published "Generative Adversarial Networks" in June 2014. The paper proposed training two networks simultaneously: a generator learning to produce realistic samples, and a discriminator learning to distinguish real from fake — each improving by trying to fool the other.
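
In the paper's notation, the two networks play a minimax game over a single value function — the discriminator D maximises it, the generator G minimises it:

$$\min_G \max_D \; V(D,G) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$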

Goodfellow reportedly conceived the idea during an argument with a colleague at a bar and coded it in a single night. The original paper used simple MLP networks and produced blurry MNIST digits — yet the theoretical framework was revolutionary.
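
A minimal adversarial training loop in PyTorch, in the spirit of that original MLP setup (network sizes are illustrative, and `real` is a random stand-in for a batch of flattened MNIST images — data loading is omitted):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())        # noise -> fake image
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())  # image -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784)   # stand-in for a batch of real images in [0, 1]
for step in range(1000):
    # Discriminator step: push D(real) towards 1 and D(fake) towards 0
    fake = G(torch.randn(64, 100)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: fool D into scoring fresh fakes as real
    loss_g = bce(D(G(torch.randn(64, 100))), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```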

The paper appeared on arXiv in June 2014 and was published at NIPS (now NeurIPS) 2014, where it began to gain traction. Goodfellow's NIPS 2016 tutorial on GANs — later written up as arXiv:1701.00160 — is often cited as the moment the idea achieved wide community adoption.

First author
Ian Goodfellow
Published
NeurIPS 2014 (arXiv: Jun 2014)
Key idea
Minimax adversarial game
First outputs
MNIST digits, TFD faces, CIFAR-10
References
Goodfellow, I. et al. (2014). Generative Adversarial Networks. arXiv:1406.2661 — NeurIPS 2014
Jul 2015
Deep learning · Visualisation

Google Deep Dream

Gradient ascent on a neural network's activations produces hallucinatory imagery — and AI art goes viral for the first time.

Google researchers Alexander Mordvintsev, Christopher Olah, and Mike Tyka introduced Deep Dream in mid-2015 — first in the June "Inceptionism" blog post, then as open-source code in July — as a visualisation technique: run gradient ascent on an input image to maximise the activation of a target layer in a pretrained CNN (Inception/GoogLeNet). The result is a dreamlike, fractal-dog-covered image that became an internet sensation.
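
A rough sketch of that core loop, using torchvision's pretrained GoogLeNet (the layer choice, step size, and iteration count are illustrative; the published implementation adds octaves, jitter, and input normalisation):

```python
import torch
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()   # pretrained Inception / GoogLeNet
acts = {}
model.inception4c.register_forward_hook(lambda m, i, o: acts.update(out=o))  # capture one layer's activation

img = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from noise, or load a photo instead
for _ in range(50):
    model(img)
    loss = acts["out"].norm()        # maximise the chosen layer's activation
    loss.backward()
    with torch.no_grad():
        img += 0.05 * img.grad / (img.grad.abs().mean() + 1e-8)  # normalised gradient ascent step
        img.grad.zero_()
        img.clamp_(0, 1)
```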

Deep Dream images were exhibited and auctioned — a 2016 benefit auction of one collection raised ~$97,000 — among the first times neural-network art reached commercial auction. The images remained abstract and pattern-driven, perceived more as curiosity than fine art, but they established AI art as a cultural conversation.

Neural Style Transfer (Gatys et al., 2015) followed weeks later, enabling style/content disentanglement — painting photos in the manner of Van Gogh. Together, these techniques brought deep learning into the art world's vocabulary.

Technique
Gradient ascent on CNN activations
Base model
InceptionNet / GoogLeNet
Output style
Abstract / pareidolic
Auction
~$97,000 raised (2016)
References
Mordvintsev, A., Olah, C., Tyka, M. (2015). Inceptionism: Going Deeper into Neural Networks. Google Research Blog
Gatys, L., Ecker, A., Bethge, M. (2015). A Neural Algorithm of Artistic Style. arXiv:1508.06576
2011–2016
GAN era · Dataset

ALAgrApHY — 1001 Faces

Five years of street photography across continents becomes the training ground for a portrait of all humanity.

Between 2011 and 2016, the photographic collective ALAgrApHY assembled 1001 Faces — a dataset of 1,001 real human portraits captured across continents, intentionally seeking diversity in age, ethnicity, gender expression, and culture. The project was conceived as a photographic statement on human variety before it became a training corpus.

This dataset would become the seed of Muses Endormies — the GAN trained on these 1,001 real faces to hallucinate 4,368 synthetic portraits, roughly four imaginary children for every real face.

Corpus size
1,001 real portraits
Capture period
2011–2016
Geographic scope
Multi-continental
Used to train
GAN → 4,368 faces
Dec 2016
GAN era · Community adoption

GAN Tutorial — NeurIPS 2016

Goodfellow's NeurIPS tutorial makes GANs the hottest topic in machine learning. The race to generate realistic faces begins.

Although GANs were published in 2014, it was the NeurIPS 2016 workshop and tutorial by Goodfellow that launched the GAN gold rush. The tutorial drew standing-room audiences and was watched by tens of thousands online. The surrounding years produced dozens of GAN variants: DCGAN (convolutional), conditional GAN, WGAN (Wasserstein loss), CycleGAN, and more.

The DCGAN paper (Radford et al., 2015) was particularly influential — deep convolutional GANs that could generate convincing 64×64 bedroom images and interpolate meaningfully in latent space. By 2016, GANs could generate credible human face images at low resolution.
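
Latent-space interpolation is easy to express: pick two noise vectors and decode points along the line between them. The DCGAN-style generator below is untrained and purely illustrates the shapes involved — a trained checkpoint is needed for meaningful frames:

```python
import torch
import torch.nn as nn

# Untrained DCGAN-style generator: 100-d noise -> 64x64 RGB image (illustrative only)
G = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
)

z0, z1 = torch.randn(1, 100, 1, 1), torch.randn(1, 100, 1, 1)
frames = [G((1 - t) * z0 + t * z1) for t in torch.linspace(0, 1, 8)]  # walk the line between z0 and z1
print(frames[0].shape)   # torch.Size([1, 3, 64, 64])
```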

Venue
NeurIPS 2016 Barcelona
Key variants
DCGAN, WGAN, cGAN, CycleGAN
Resolution
64×64 → 128×128 px
Key failure
Mode collapse, training instability
References
Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434
Goodfellow, I. (2016). NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160
13 Feb 2018
GAN era · Historic auction

Muses Endormies — Children of the Cloud

4,368 AI-hallucinated faces — a photographic mosaic of all humanity, auctioned eight months before Christie's Belamy.

Landmark — first major GAN art sale

On 13 February 2018, ALAgrApHY sold Muses Endormies (Sleeping Muses) — a monumental photographic mosaic composed entirely of 4,368 AI-generated faces, each hallucinated by a Generative Adversarial Network trained on the 1001 Faces dataset. The individual portraits are the Children of the Cloud: synthetic humans who never existed, yet bear the statistical imprint of a thousand real faces.

Each portrait was generated by the GAN and printed at large format. Together they form a mosaic that reads as a collective face of humanity — diverse, uncanny, beautiful. The work makes visible the GAN's fundamental operation: interpolating between real human variation to create faces that are simultaneously no one and everyone.

This sale preceded Christie's auction of Portrait of Edmond de Belamy (Obvious Art) by approximately eight months. The Belamy sold for $432,500 in October 2018 and received far greater press attention — yet Muses Endormies was earlier and, many argue, more conceptually rigorous.

Date
13 Feb 2018
Faces generated
4,368 portraits
Training data
1,001 real faces (ALAgrApHY)
Format
Photographic mosaic
GAN type
Custom GAN (DCGAN lineage)
Significance
First major GAN art auction
Oct 2018
GAN era · Media moment

Christie's — Portrait of Edmond de Belamy

$432,500 for a blurry portrait. The mainstream art world discovers AI — noisily and controversially.

In October 2018, Christie's auctioned Portrait of Edmond de Belamy by French collective Obvious Art — a GAN-generated portrait printed on canvas, signed with the GAN's loss function. It sold for $432,500, 45 times its estimate, triggering global headlines.

The work was immediately controversial: Obvious Art used a GAN architecture and training code substantially derived from Robbie Barrat's open-source implementation. Barrat, a teenage programmer, was uncredited. The debate foregrounded questions of authorship, attribution, and creativity that remain live today.

StyleGAN (Karras et al., 2018 — NVIDIA) appeared around the same time, building on NVIDIA's earlier progressive-growing GANs and introducing style-based generation. By late 2018, GANs could generate photorealistic 1024×1024 human faces.

Sale price
$432,500
Auction house
Christie's, Oct 2018
Controversy
Barrat code uncredited
Contemporary
StyleGAN (NVIDIA, 2018)
2019–2021
GAN era · Scale

GAN maturity — BigGAN, StyleGAN2, CLIP

GANs scale up, gain control, and meet language. Text-guided image synthesis becomes possible.

BigGAN (Brock et al., 2018) showed that scaling GAN training massively — more parameters, larger batches — produced dramatically better ImageNet synthesis. StyleGAN2 (Karras et al., 2020) refined the generator to eliminate characteristic artefacts, producing near-perfect photorealistic faces. These became the workhorses of AI art during 2019–2021.

CLIP (Radford et al., 2021 — OpenAI) was not a generative model but a contrastive vision-language model trained on 400M image-text pairs. Combined with CLIP-guided optimisation of a VQGAN's latent codes (VQGAN+CLIP, 2021), it put text-to-image synthesis in the hands of a broad community — type a phrase, get an image — and was the immediate precursor to the CLIP-guided diffusion systems that followed, DALL-E 2 among them.
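
A minimal, heavily simplified sketch of CLIP guidance: instead of optimising a VQGAN's latent codes, it optimises raw pixels directly (to stay self-contained), nudging the image's CLIP embedding towards a target phrase. It assumes OpenAI's `clip` package; the prompt, learning rate, and step count are illustrative.

```python
import torch
import torch.nn.functional as F
import clip   # OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-B/32", device="cpu")
for p in model.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    text_emb = F.normalize(model.encode_text(clip.tokenize(["a watercolour painting of a lighthouse"])), dim=-1)

img = torch.rand(1, 3, 224, 224, requires_grad=True)    # VQGAN+CLIP optimises VQGAN latents instead of pixels
opt = torch.optim.Adam([img], lr=0.05)
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)   # CLIP's input normalisation
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

for step in range(200):
    img_emb = F.normalize(model.encode_image((img.clamp(0, 1) - mean) / std), dim=-1)
    loss = -(img_emb * text_emb).sum()    # maximise cosine similarity between image and text embeddings
    opt.zero_grad(); loss.backward(); opt.step()
```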

BigGAN
512px ImageNet, class-conditional
StyleGAN2
1024px faces, no artefacts
CLIP
400M image-text pairs, zero-shot
VQGAN+CLIP
Text → image (community tool)
References
Brock, A. et al. (2018). Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv:1809.11096
Radford, A. et al. (2021). Learning Transferable Visual Models from Natural Language Supervision. arXiv:2103.00020
Jan 2021
Diffusion era · OpenAI

DALL-E — OpenAI (January 2021)

A 12-billion parameter transformer generates images from text captions. The era of prompt engineering begins.

OpenAI unveiled DALL-E in January 2021 — a 12-billion parameter version of GPT-3 trained on text-image pairs. Given a text caption, DALL-E generated images using a dVAE (discrete Variational Autoencoder) as a tokeniser, then an autoregressive transformer to predict image tokens conditioned on text tokens.
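
The generation loop can be caricatured in a few lines of PyTorch — toy vocabulary sizes and an untrained transformer, purely to show the data flow: caption tokens and image tokens share one causally-masked sequence, and image tokens are sampled one at a time before being handed to the dVAE decoder (not shown).

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, D = 1000, 8192, 256    # toy sizes; DALL-E's dVAE used a vocabulary of 8192 image codes
embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
to_logits = nn.Linear(D, IMAGE_VOCAB)

text_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))     # stand-in caption tokens
image_tokens = torch.empty(1, 0, dtype=torch.long)      # image codes generated so far

for _ in range(64):                                      # DALL-E itself predicts 32 x 32 = 1024 codes
    seq = torch.cat([text_tokens, TEXT_VOCAB + image_tokens], dim=1)
    n = seq.size(1)
    causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)   # each position sees only the past
    h = transformer(embed(seq), mask=causal)
    nxt = torch.multinomial(to_logits(h[:, -1]).softmax(-1), 1)          # sample the next image token
    image_tokens = torch.cat([image_tokens, nxt], dim=1)
```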

DALL-E demonstrated compositional generalisation — "an armchair in the shape of an avocado" — that no prior model had achieved. It was not released publicly but its capabilities were demonstrated in a blog post that caused immediate sensation in art and tech communities alike.

DALL-E 2 (April 2022) pivoted to a diffusion model guided by CLIP embeddings — the approach that would become dominant. Stable Diffusion (August 2022) released an open-source latent diffusion model, democratising image generation globally.

Released
January 2021 (blog post)
Architecture
dVAE tokeniser + GPT-3 transformer
Parameters
12 billion
Training
≈250M web-scraped text-image pairs
References
Ramesh, A. et al. (2021). Zero-Shot Text-to-Image Generation. arXiv:2102.12092
2021 → 2022
Diffusion era · Architecture shift

Diffusion models supersede GANs

DDPM, DALL-E 2, Imagen, Stable Diffusion — the new paradigm generates by learning to reverse noise.

Diffusion models — rooted in non-equilibrium thermodynamics — learn to denoise images. The forward process adds Gaussian noise over T steps until the image is pure noise; the reverse process trains a U-Net to predict and remove noise at each step. At inference, start from random noise and denoise iteratively.
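
The training step (Algorithm 1 in Ho et al., 2020) is compact. In the sketch below the noise-prediction network is a stand-in MLP on flattened images and the data batch is random; in practice it is a timestep-conditioned U-Net trained on real images:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule from Ho et al. (2020)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal-retention factor

eps_model = nn.Sequential(nn.Linear(784 + 1, 512), nn.ReLU(), nn.Linear(512, 784))  # stand-in for a U-Net
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

x0 = torch.rand(32, 784)                         # stand-in batch of flattened training images
for step in range(100):
    t = torch.randint(0, T, (32,))               # a random timestep per sample
    eps = torch.randn_like(x0)                   # the noise the model must learn to predict
    a = alpha_bar[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward process: noise x0 to step t in closed form
    pred = eps_model(torch.cat([x_t, t.unsqueeze(1) / T], dim=1))
    loss = ((pred - eps) ** 2).mean()            # simplified objective: predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
```

At sampling time the process runs in reverse: start from pure Gaussian noise and repeatedly subtract the predicted noise over T steps (or far fewer with DDIM).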

Key milestones: DDPM (Ho et al., 2020 — NeurIPS) showed diffusion could match GANs on image quality. DALL-E 2 (Ramesh et al., 2022) paired a diffusion decoder with CLIP image embeddings (unCLIP). Imagen (Google, 2022) used a diffusion model conditioned on T5 text embeddings. Stable Diffusion (Rombach et al., 2022 — LMU Munich / Stability AI) moved diffusion into a compressed VAE latent space — an 8×-downsampled representation that drastically cuts compute — and was released open-source.

The diffusion paradigm is rooted in VAEs: Stable Diffusion operates not in pixel space but in the latent space of a pretrained VAE encoder, then decodes back to pixels only at the final step. This is the sense in which modern diffusion models are "rooted in VAE since 2013."
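
In the Hugging Face diffusers library, that full pipeline — text encoder, U-Net denoiser in latent space, VAE decoder — is wrapped behind a few lines; the model identifier, dtype, and step count below are illustrative, and a CUDA GPU is assumed:

```python
import torch
from diffusers import StableDiffusionPipeline

# Downloads several GB of weights on first run
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# prompt -> text embedding -> iterative denoising in VAE latent space -> VAE decode to pixels
image = pipe("a portrait in the style of Rembrandt, oil on canvas", num_inference_steps=30).images[0]
image.save("portrait.png")
```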

DDPM
NeurIPS 2020 — Ho et al.
DALL-E 2
Apr 2022 — CLIP + diffusion
Stable Diffusion
Aug 2022 — open-source, LDM
Inference steps
20–50 (DDIM), vs 1000 (DDPM)
References
Ho, J. et al. (2020). Denoising Diffusion Probabilistic Models. arXiv:2006.11239
Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752
2023 → now
Diffusion era · Present

Multimodal foundation models & the open question

Video, 3D, real-time generation — and an unresolved debate about authorship, consent, and what it means to create.

The frontier has moved to video (Sora, 2024 — OpenAI), 3D generation (DreamFusion, Magic3D), and real-time synthesis. Models like Midjourney V6, FLUX, and GPT-Image produce photorealistic images indistinguishable from photographs to untrained eyes. Multimodal models (GPT-4o, Gemini) can understand, generate, and edit images in conversation.

The legal and ethical questions first raised by the Belamy controversy have intensified: class-action lawsuits from artists allege models infringe copyright by training on scraped web data. Consent, attribution, and economic displacement of human artists are live legislative battles in the EU and US.

The genealogy from Harold Cohen's hand-coded rules to a diffusion model that conjures a Rembrandt-style portrait in seconds spans five decades and two paradigm shifts. The question Cohen asked — what does it mean for a machine to make art? — has only grown harder to answer.

Video
Sora (OpenAI, 2024)
Image
Midjourney, FLUX, DALL-E 3
Legal status
Contested (US, EU)
Open question
What is authorship?