Develop New Products with AI — A Variational Autoencoder for Categorical Product Characteristics

Variational Autoencoders (VAEs) have become a powerful tool for generating synthetic data, learning compact representations, and enabling downstream tasks such as anomaly detection, recommendation, or product design. While VAEs operate naturally on continuous data — such as images or numerical features — many real-world datasets are purely categorical, especially in domains like retail or e-commerce.

Consider a product catalog where each product is described by attributes such as:

  • color: red, blue, green
  • material: cotton, leather, polyester
  • category: shoes, shirt, bag
  • brand: dozens or hundreds of discrete options

The challenge: how can a VAE model this data when VAEs are usually formulated for continuous-valued inputs and outputs?

The answer: learn an embedding for every categorical feature, converting discrete variables into trainable continuous vectors. This article describes how this works and how a product-characteristics VAE can be built.

1. Why Categorical Values Are a Problem for VAEs

A standard VAE requires:

  • continuous input vectors (usually real-valued)
  • a continuous latent space (typically multivariate Gaussian)
  • reconstruction through a Gaussian likelihood (real-valued outputs) or a Bernoulli likelihood (binary outputs)

Categorical variables, by contrast:

  • have no inherent numeric geometry
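  • are usually stored as arbitrary integer IDs whose order and magnitude carry no meaning
  • cannot be fed to a neural network as raw IDs without implying a false ordering
  • cannot be sensibly reconstructed by a Gaussian decoder, which outputs real numbers rather than a choice among classes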

This mismatch requires a transformation before training.
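
To make the "arbitrary integer IDs" point concrete, here is a minimal sketch of how a catalog might be turned into ID tuples before any modeling. The toy catalog and the helper `build_vocab` are illustrative assumptions, not part of any specific library.

```python
# Hypothetical toy catalog; attribute names follow the example in the introduction.
catalog = [
    {"color": "red", "material": "cotton", "category": "shirt"},
    {"color": "blue", "material": "leather", "category": "shoes"},
    {"color": "green", "material": "polyester", "category": "bag"},
    {"color": "red", "material": "leather", "category": "bag"},
]

def build_vocab(rows, feature):
    """Map each distinct value of `feature` to an integer ID (sorted for determinism)."""
    values = sorted({row[feature] for row in rows})
    return {value: idx for idx, value in enumerate(values)}

features = ["color", "material", "category"]
vocabs = {f: build_vocab(catalog, f) for f in features}

# Each product becomes a list of integer IDs, one per feature.
encoded = [[vocabs[f][row[f]] for f in features] for row in catalog]

print(vocabs["color"])   # {'blue': 0, 'green': 1, 'red': 2}
print(encoded[0])        # [2, 0, 1]
```

The IDs depend only on the sort order: "red" = 2 is not "more" than "green" = 1, which is why the raw numbers should not be treated as features.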

2. Using Embeddings to Represent Categorical Values

To make categorical variables compatible with VAEs, each category is mapped to a trainable vector — similar to embeddings in NLP.

How it works

For a categorical variable with N categories:

  • Create an embedding matrix $E \in \mathbb{R}^{N \times d}$, where $d$ is the embedding dimension
  • Each category ID $i$ looks up a vector $e_i \in \mathbb{R}^{d}$ (row $i$ of $E$)
  • The embedding vectors become the effective continuous input to the VAE encoder
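
The lookup described above can be sketched in a few lines of NumPy. The sizes, the random initialization, and the variable names are assumptions chosen for illustration; in a real model the matrix would be a trainable parameter updated by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 3, 4                                  # 3 materials, embedding dimension 4 (illustrative)
E = rng.normal(scale=0.1, size=(N, d))       # embedding matrix, randomly initialized

material_ids = np.array([0, 2, 1, 0])        # a batch of category IDs
vectors = E[material_ids]                    # row lookup, shape (4, d)

# The lookup is equivalent to multiplying a one-hot matrix by E.
one_hot = np.eye(N)[material_ids]
assert np.allclose(vectors, one_hot @ E)

print(vectors.shape)   # (4, 4)
```

The one-hot equivalence is why an embedding can be read as a linear layer without bias applied to a one-hot input, only cheaper to compute.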

Why embeddings are effective

  • They capture semantic similarity between categorical choices
  • They learn latent structure within each categorical field
  • They allow the encoder to operate in a continuous space
  • They allow the decoder to reconstruct categories using a softmax over category logits

For example, suppose the "material" attribute has categories:

  • cotton
  • leather
  • polyester

A trained embedding might place "cotton" and "polyester" close together as woven textiles, while "leather" occupies a distinct region.

3. Architecture of a Categorical VAE for Product Characteristics

The architecture contains four main components:

3.1 Encoder

  • Each categorical feature is embedded separately
  • All embeddings are concatenated into a single continuous vector
  • A neural network compresses this vector into the mean $\mu$ and log-variance $\log \sigma^2$ of the latent distribution
  • Latent variable z is sampled using the reparameterization trick
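
The encoder steps above can be sketched as a forward pass in NumPy. This is an untrained toy under stated assumptions: the vocabulary sizes, layer widths, the tanh activation, and the omission of bias terms are all illustrative choices, and a real implementation would use an autodiff framework to train the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_sizes = {"color": 3, "material": 3, "category": 3}   # assumed toy vocabularies
emb_dim, hidden_dim, latent_dim = 4, 16, 2                 # assumed sizes

# One embedding matrix per feature, plus a small MLP (all untrained, no biases).
embeddings = {f: rng.normal(scale=0.1, size=(n, emb_dim)) for f, n in vocab_sizes.items()}
in_dim = emb_dim * len(vocab_sizes)
W_h = rng.normal(scale=0.1, size=(in_dim, hidden_dim))
W_mu = rng.normal(scale=0.1, size=(hidden_dim, latent_dim))
W_logvar = rng.normal(scale=0.1, size=(hidden_dim, latent_dim))

def encode(ids):
    """ids: dict feature -> int array of shape (batch,). Returns (mu, logvar)."""
    # Embed each feature separately, then concatenate along the feature axis.
    x = np.concatenate([embeddings[f][ids[f]] for f in vocab_sizes], axis=1)
    h = np.tanh(x @ W_h)
    return h @ W_mu, h @ W_logvar

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

batch = {"color": np.array([2, 0]), "material": np.array([0, 1]), "category": np.array([1, 2])}
mu, logvar = encode(batch)
z = reparameterize(mu, logvar, rng)
print(mu.shape, logvar.shape, z.shape)   # (2, 2) (2, 2) (2, 2)
```

Predicting the log-variance rather than the variance keeps the output unconstrained while guaranteeing a positive standard deviation after `exp`.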

3.2 Latent Space

  • The prior $p(z)$ is a standard multivariate Gaussian, $\mathcal{N}(0, I)$
  • Captures product-wide relationships
  • Allows interpolation, clustering, and sampling of new products
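
Interpolation in the latent space can be as simple as walking a straight line between two latent codes and decoding each point. The sketch below only produces the intermediate codes; the two vectors are hypothetical values, not outputs of a trained model.

```python
import numpy as np

def interpolate(z_a, z_b, steps):
    """Return `steps` points on the straight line from z_a to z_b (endpoints included)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a + t * z_b

z_a = np.array([-1.0, 0.5])   # hypothetical latent code of product A
z_b = np.array([1.0, -0.5])   # hypothetical latent code of product B
path = interpolate(z_a, z_b, steps=5)
print(path.shape)             # (5, 2)
```

Each row of `path` would then be passed through the decoder to see how the predicted attributes change between the two products.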

3.3 Decoder

  • Maps latent vector z into a set of logits for each categorical variable
  • For each feature:
    • the decoder outputs a softmax distribution
    • a category is selected via argmax or sampling
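
A decoder with one softmax head per feature might look like the following NumPy sketch. As with any untrained toy, the weights are random and the sizes, activation, and names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_sizes = {"color": 3, "material": 3, "category": 3}   # assumed toy vocabularies
latent_dim, hidden_dim = 2, 16                             # assumed sizes

W_h = rng.normal(scale=0.1, size=(latent_dim, hidden_dim))
# One output head per feature, each producing that feature's logits.
heads = {f: rng.normal(scale=0.1, size=(hidden_dim, n)) for f, n in vocab_sizes.items()}

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def decode(z):
    """z: (batch, latent_dim). Returns dict feature -> probabilities (batch, n_categories)."""
    h = np.tanh(z @ W_h)
    return {f: softmax(h @ W) for f, W in heads.items()}

z = rng.standard_normal((4, latent_dim))
probs = decode(z)
picked = {f: p.argmax(axis=1) for f, p in probs.items()}    # most likely category per feature
print(probs["color"].shape, picked["color"].shape)          # (4, 3) (4,)
```

Because every head normalizes over its own vocabulary, features with very different numbers of categories (three colors, hundreds of brands) can share the same latent vector.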

3.4 Training Loss

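VAEs are normally trained by maximizing the evidence lower bound (ELBO), written here with a weight $\beta$ on the KL term ($\beta = 1$ gives the standard VAE):

$$\mathcal{L} = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \beta \, \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right)$$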

When the output is categorical:

  • The reconstruction term $\log p(x \mid z)$ becomes the negative sum of the cross-entropy losses, one per categorical feature
  • The KL divergence term is unchanged, because the latent space is still Gaussian
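
As a rough sketch, the loss that is minimized in practice is the negative of the objective above: summed cross-entropy plus the weighted KL term. The code assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL divergence has the closed form used in `kl_to_standard_normal`; the function names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    """Per-example cross-entropy; logits (batch, n), targets (batch,) integer IDs."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets]

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), one value per example."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

def vae_loss(logits_per_feature, targets_per_feature, mu, logvar, beta=1.0):
    """Negative beta-weighted ELBO averaged over the batch (a quantity to minimize)."""
    recon = sum(cross_entropy(logits_per_feature[f], targets_per_feature[f])
                for f in logits_per_feature)
    return float(np.mean(recon + beta * kl_to_standard_normal(mu, logvar)))

# Sanity check: uniform logits over 3 categories cost log(3) per feature,
# and mu = 0, logvar = 0 matches the prior exactly, so the KL term is 0.
logits = {"color": np.zeros((2, 3)), "material": np.zeros((2, 3))}
targets = {"color": np.array([0, 2]), "material": np.array([1, 1])}
mu, logvar = np.zeros((2, 2)), np.zeros((2, 2))
print(vae_loss(logits, targets, mu, logvar))   # 2 * log(3), about 2.197
```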

4. Why This Works for Generating New Product Attributes

After training:

  • Sampling from the latent space produces new, plausible product combinations
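  • Because each decoder head is a softmax over a fixed vocabulary, every generated value is an existing category
  • The learned embeddings and latent space encourage generated combinations to respect the similarity relationships seen in the data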

For example, if the model sees that:

  • black shoes
  • white shoes
  • black boots

are common, it may generate:

  • white boots

even if such products were rarely or never seen.

The embedding + VAE combination gives the model the ability to interpolate between product types and discover novel combinations.
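
One simple way to find such novel combinations is to compare decoded samples against the set of attribute tuples already in the catalog. The sketch below uses the shoes and boots example from above; the `generated` list stands in for decoded model samples and is purely hypothetical.

```python
# Attribute tuples seen in the (hypothetical) training catalog.
seen = {("black", "shoes"), ("white", "shoes"), ("black", "boots")}

# Hypothetical decoded samples from the model.
generated = [("white", "boots"), ("black", "shoes"), ("white", "boots")]

# Keep only combinations that never appeared in the catalog.
novel = sorted(set(generated) - seen)
print(novel)   # [('white', 'boots')]
```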

5. Practical Considerations

Choosing embedding dimensions

Common starting points (rules of thumb to tune, not fixed rules):

  • Small vocabularies (fewer than 20 categories): 4–10 dimensions
  • Medium vocabularies (20–200 categories): 8–32 dimensions
  • Large vocabularies (more than 200 categories): 16–64 dimensions
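
The rule of thumb above can be encoded as a small helper for configuring several features at once. The function name and the treatment of a vocabulary of exactly 200 as "medium" are choices made for this sketch.

```python
def suggested_embedding_range(vocab_size):
    """Return the (low, high) embedding-dimension range from the rule of thumb."""
    if vocab_size < 20:
        return (4, 10)
    if vocab_size <= 200:       # exactly 200 is treated as "medium" here
        return (8, 32)
    return (16, 64)

# Hypothetical vocabulary sizes for a product catalog.
for feature, n in {"color": 12, "material": 40, "brand": 850}.items():
    print(feature, suggested_embedding_range(n))
# color (4, 10)
# material (8, 32)
# brand (16, 64)
```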

Ordinal vs. non-ordinal categories

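Embeddings are especially useful when categories are unordered (e.g., colors). For ordinal categories (e.g., sizes S, M, L), embeddings still work, but the ordering is not built in and has to be learned from the data.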

Handling rare categories

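Categories that occur fewer times than a minimum frequency threshold can be merged into a shared "other" bucket. This avoids embeddings that receive too few updates to learn anything useful.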
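
A frequency threshold can be applied with a short preprocessing step before IDs are assigned. The helper name `collapse_rare`, the `"<other>"` token, and the sample brand column are assumptions for illustration.

```python
from collections import Counter

def collapse_rare(values, min_count, other="<other>"):
    """Replace values seen fewer than `min_count` times with a shared token."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

brands = ["acme", "acme", "acme", "zeta", "nova", "nova"]   # hypothetical brand column
print(collapse_rare(brands, min_count=2))
# ['acme', 'acme', 'acme', '<other>', 'nova', 'nova']
```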

Sampling vs. argmax during generation

  • Sampling increases diversity
  • Argmax yields more stable product designs
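
The difference between the two strategies is easy to see on a single softmax output. The probabilities below are a hypothetical decoder output for one product, not a result from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
colors = ["red", "blue", "green"]
probs = np.array([0.5, 0.3, 0.2])     # hypothetical decoder output for one product

# Argmax always returns the single most likely category.
argmax_pick = colors[int(probs.argmax())]

# Sampling draws a category in proportion to its probability.
sampled = [colors[rng.choice(len(colors), p=probs)] for _ in range(1000)]

print(argmax_pick)                                    # red
print({c: sampled.count(c) for c in colors})          # counts roughly follow probs
```

With argmax, "blue" and "green" would never be generated for this product, even though together they carry half of the probability mass.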

6. Example Use Cases

Product Design

Generate new product configurations — colors, materials, styles — that are consistent with historical patterns.

Recommendation Systems

Sample similar products in the latent neighborhood of an existing item.
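
A latent-neighborhood lookup can be sketched as a nearest-neighbor search over the encoder means of all catalog items. The latent codes below are made-up 2-D values for readability, and Euclidean distance is one reasonable choice among several.

```python
import numpy as np

def nearest_neighbors(z_query, z_catalog, k):
    """Indices of the k catalog items closest to z_query (Euclidean distance)."""
    dists = np.linalg.norm(z_catalog - z_query, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical latent means for five catalog items.
z_catalog = np.array([[0.0, 0.0], [0.1, 0.1], [2.0, 2.0], [-1.0, 0.5], [0.2, -0.1]])
print(nearest_neighbors(z_catalog[0], z_catalog, k=3))   # [0 1 4]
```

The query item itself comes back first at distance zero, so a recommender would typically drop it from the results.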

Data Augmentation

Generate additional synthetic product records to balance sparse categories.

Concept Discovery

The latent space may reveal new product "topics" or styles.

Conclusion

Modeling categorical product features with a VAE requires introducing learned embeddings that map discrete values into a continuous geometric space. This allows the encoder and decoder to operate naturally within the VAE framework. The resulting model provides a powerful tool for generating synthetic product characteristics, exploring new combinations, and uncovering latent structure in product catalogs.