AI Research

Develop New Products with AI — A Variational Autoencoder for Categorical Product Characteristics

Variational Autoencoders (VAEs) have become a powerful tool for generating synthetic data, learning compact representations, and enabling downstream tasks such as anomaly detection, recommendation, or product design. While VAEs operate naturally on continuous data — such as images or numerical features — many real-world datasets are purely categorical, especially in domains like retail or e-commerce.

Consider a product catalog where each product is described by attributes such as:

color: red, blue, green
material: cotton, leather, polyester
category: shoes, shirt, bag
brand: dozens or hundreds of discrete options

The challenge: How can a VAE model this data when VAEs require continuous-valued inputs and outputs?

The answer: learn an embedding for every categorical feature, converting discrete variables into trainable continuous vectors. This article describes how this works and how a product-characteristics VAE can be built.

1. Why Categorical Values Are a Problem for VAEs

A standard VAE requires:

continuous input vectors (usually real-valued)
a continuous latent space (typically multivariate Gaussian)
reconstruction via a likelihood over the outputs (often Gaussian for real-valued data, or Bernoulli for binary data)

Categorical variables, by contrast:

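have no inherent numeric geometry
are usually stored as arbitrary integer IDs whose order and distance carry no meaning
cannot be fed to a neural network as raw IDs without implying a false ordering
cannot be sensibly reconstructed by a Gaussian decoder, which outputs real numbers rather than a choice among categories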

This mismatch requires a transformation before training.
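As a minimal sketch of the first step of that transformation, the snippet below maps each attribute value to an integer ID. The toy catalog and the helper name build_vocab are assumptions made for illustration; only the attribute names come from the example above.

```python
# Hypothetical toy catalog; attribute names follow the example above.
catalog = [
    {"color": "red", "material": "cotton", "category": "shirt"},
    {"color": "blue", "material": "leather", "category": "shoes"},
    {"color": "green", "material": "polyester", "category": "bag"},
    {"color": "red", "material": "leather", "category": "bag"},
]

def build_vocab(rows, feature):
    """Map each distinct value of a feature to an integer ID (sorted for determinism)."""
    values = sorted({row[feature] for row in rows})
    return {value: idx for idx, value in enumerate(values)}

features = ["color", "material", "category"]
vocabs = {f: build_vocab(catalog, f) for f in features}

# Each product becomes a list of integer IDs, one per feature.
encoded = [[vocabs[f][row[f]] for f in features] for row in catalog]

print(vocabs["color"])  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded[0])       # [2, 0, 1]
```

Note that the IDs are arbitrary: "blue" = 0 and "red" = 2 says nothing about how similar the two colors are, which is why the IDs are only used as lookup keys and never as numeric inputs.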

2. Using Embeddings to Represent Categorical Values

To make categorical variables compatible with VAEs, each category is mapped to a trainable vector — similar to embeddings in NLP.

How it works

For a categorical variable with N categories:

Create an embedding matrix $E \in \mathbb{R}^{N \times d}$, where $d$ is the embedding dimension
Each category ID $i$ looks up a vector $e_i \in \mathbb{R}^{d}$ (row $i$ of $E$)
The embedding vectors become the effective continuous input to the VAE encoder
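The lookup described above can be sketched in a few lines of plain Python. This is an illustration only: the names make_embedding and embed, the embedding dimension of 4, and the random initialization are assumptions, and in a real model the matrix entries would be trainable parameters managed by a deep learning framework.

```python
import random

random.seed(0)

def make_embedding(num_categories, dim):
    """Create an N x d matrix of small random values (trainable in a real model)."""
    return [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_categories)]

material_vocab = {"cotton": 0, "leather": 1, "polyester": 2}
E_material = make_embedding(num_categories=len(material_vocab), dim=4)

def embed(matrix, category_id):
    """Embedding lookup: select row i of E, giving the vector e_i."""
    return matrix[category_id]

e_leather = embed(E_material, material_vocab["leather"])
print(len(E_material), len(E_material[0]))  # 3 4
print(len(e_leather))                       # 4
```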

Why embeddings are effective

They capture semantic similarity between categorical choices
They learn latent structure within each categorical field
They allow the encoder to operate in a continuous space
They allow the decoder to reconstruct categories using a softmax over category logits

For example, suppose the "material" attribute has categories:

cotton
leather
polyester

A trained embedding might place "cotton" and "polyester" close together as textile materials, while "leather" may form a distinct region.

3. Architecture of a Categorical VAE for Product Characteristics

The architecture contains four main components:

3.1 Encoder

Each categorical feature is embedded separately
All embeddings are concatenated into a single continuous vector
A neural network compresses this into the mean $\mu$ and log-variance $\log \sigma^2$ of the latent distribution
Latent variable z is sampled using the reparameterization trick
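The encoder steps above can be sketched as follows, using only the Python standard library. All sizes (three features with three categories each, embedding dimension 4, one hidden layer of 8 units, latent dimension 2), the tanh activation, and the helper names are assumptions chosen for illustration; the weights are random and untrained.

```python
import math
import random

random.seed(0)

def linear(x, weights, bias):
    """Dense layer: y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def random_matrix(rows, cols, scale=0.1):
    """Small random weights; a real model would learn these."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def encode(feature_ids, embeddings, params):
    """Embed each feature, concatenate, and map to (mu, log_var)."""
    x = []
    for matrix, idx in zip(embeddings, feature_ids):
        x.extend(matrix[idx])  # concatenate the per-feature embedding vectors
    h = [math.tanh(v) for v in linear(x, params["W_h"], params["b_h"])]
    mu = linear(h, params["W_mu"], params["b_mu"])
    log_var = linear(h, params["W_lv"], params["b_lv"])
    return mu, log_var

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

# Hypothetical sizes, chosen only for this sketch.
embed_dim, hidden_dim, latent_dim = 4, 8, 2
embeddings = [random_matrix(3, embed_dim) for _ in range(3)]
params = {
    "W_h": random_matrix(hidden_dim, 3 * embed_dim), "b_h": [0.0] * hidden_dim,
    "W_mu": random_matrix(latent_dim, hidden_dim), "b_mu": [0.0] * latent_dim,
    "W_lv": random_matrix(latent_dim, hidden_dim), "b_lv": [0.0] * latent_dim,
}

mu, log_var = encode([2, 0, 1], embeddings, params)
z = reparameterize(mu, log_var)
print(len(mu), len(log_var), len(z))  # 2 2 2
```

Writing the sample as mu plus scaled noise keeps the randomness in eps, so gradients can flow through mu and log_var when a framework with automatic differentiation is used.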

3.2 Latent Space

Standard multivariate Gaussian
Captures product-wide relationships
Allows interpolation, clustering, and sampling of new products
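Interpolation in the latent space can be illustrated with a short sketch. The two latent codes below are made-up values rather than the output of a trained encoder, and the function name interpolate is hypothetical; each intermediate vector would be passed to the decoder to obtain an in-between product.

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` latent vectors evenly spaced from z_a to z_b (inclusive).

    Requires steps >= 2.
    """
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

z_shoes = [0.0, 1.0]   # hypothetical latent code of one product
z_boots = [2.0, -1.0]  # hypothetical latent code of another product

for z in interpolate(z_shoes, z_boots, steps=5):
    print(z)
# [0.0, 1.0]
# [0.5, 0.5]
# [1.0, 0.0]
# [1.5, -0.5]
# [2.0, -1.0]
```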

3.3 Decoder

Maps latent vector z into a set of logits for each categorical variable
For each feature:
- the decoder outputs a softmax distribution
- a category is selected via argmax or sampling
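The per-feature output step can be sketched as below. The logits are hand-written stand-ins for what a decoder network would produce from z, and the helper names softmax and pick are assumptions for illustration.

```python
import math
import random

random.seed(0)

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick(probs, labels, sample=False):
    """Select a category by argmax, or by sampling from the distribution."""
    if sample:
        return random.choices(labels, weights=probs, k=1)[0]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]

# Hypothetical decoder output for one latent vector z: one logit list per feature.
logits = {
    "color": [2.0, 0.1, -1.0],
    "material": [0.0, 1.5, 0.3],
}
labels = {
    "color": ["red", "blue", "green"],
    "material": ["cotton", "leather", "polyester"],
}

product = {f: pick(softmax(logits[f]), labels[f]) for f in logits}
print(product)  # {'color': 'red', 'material': 'leather'}
```

Calling pick with sample=True draws from the softmax distribution instead, so repeated calls can return different categories for the same logits.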

3.4 Training Loss

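VAEs are normally trained by maximizing the evidence lower bound (ELBO), written here with a weight β on the KL term (β = 1 recovers the standard VAE):

$$\mathcal{L} = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \beta \, \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right)$$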

When the output is categorical:

The reconstruction term becomes the negative of the summed cross-entropy losses, one per categorical feature, so maximizing it minimizes the cross-entropy
KL divergence remains unchanged
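A minimal sketch of this loss for a single product is shown below, written as a quantity to minimize (the negative of the objective). It assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL term has the closed form 0.5 * sum(sigma² + mu² - 1 - log sigma²); the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy(logits, target_id):
    """Negative log-probability of the true category."""
    return -log_softmax(logits)[target_id]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def vae_loss(logits_per_feature, target_ids, mu, log_var, beta=1.0):
    """Loss to minimize: summed cross-entropy plus beta times the KL term."""
    recon = sum(cross_entropy(l, t) for l, t in zip(logits_per_feature, target_ids))
    return recon + beta * kl_to_standard_normal(mu, log_var)

# Uniform logits over 3 and 2 categories, posterior equal to the prior:
# the loss reduces to log(3) + log(2) = log(6), and the KL term is zero.
loss = vae_loss([[0.0, 0.0, 0.0], [0.0, 0.0]], [1, 0], mu=[0.0, 0.0], log_var=[0.0, 0.0])
print(round(loss, 4))  # 1.7918
```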

4. Why This Works for Generating New Product Attributes

After training:

Sampling from the latent space produces new, plausible product combinations
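The decoder's softmax ranges only over known categories, so every generated attribute is a valid existing value (although a particular combination may still be implausible)
The embedding structure encourages the model to treat similar categories similarly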

For example, if the model sees that:

black shoes
white shoes
black boots

are common, it may generate:

white boots

even if such products were rarely or never seen.

The embedding + VAE combination gives the model the ability to interpolate between product types and discover novel combinations.

5. Practical Considerations

Choosing embedding dimensions

Typical choices (rules of thumb, to be tuned per dataset):

Small vocabularies (N < 20): 4–10 dimensions
Medium vocabularies (20–200): 8–32 dimensions
Large vocabularies (200+): 16–64 dimensions

Ordinal vs. non-ordinal categories

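Embeddings are especially useful when categories are unordered (e.g., colors). Ordinal categories (e.g., sizes S, M, L) can be embedded as well, but the embedding does not know their order in advance and has to learn it from the data.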

Handling rare categories

A minimum frequency threshold helps avoid embeddings that never learn: categories that occur fewer times than the threshold can be merged into a shared "other" bucket.
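One common way to apply such a threshold is to replace infrequent values with a shared placeholder before building the vocabulary. The sketch below assumes a placeholder token "<other>" and a hypothetical helper bucket_rare; the brand names are invented.

```python
from collections import Counter

def bucket_rare(values, min_count, other="<other>"):
    """Replace values seen fewer than `min_count` times with a shared token."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

brands = ["acme", "acme", "acme", "zeta", "zeta", "nova", "orbit"]
print(bucket_rare(brands, min_count=2))
# ['acme', 'acme', 'acme', 'zeta', 'zeta', '<other>', '<other>']
```

The shared token then receives its own embedding, which is updated by every rare product instead of each rare value being seen only once or twice.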

Sampling vs argmax during generation

Sampling increases diversity
Argmax yields more stable product designs

6. Example Use Cases

Product Design

Generate new product configurations — colors, materials, styles — that are consistent with historical patterns.

Recommendation Systems

Sample similar products in the latent neighborhood of an existing item.
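A latent neighborhood lookup can be sketched as a nearest-neighbor search. The latent codes below are invented for illustration; in practice they would be the encoder means of catalog items. The function name nearest and the use of Euclidean distance are assumptions.

```python
import math

def nearest(query, latent_codes, k):
    """Return the k product names whose latent codes are closest to `query`."""
    ranked = sorted(latent_codes, key=lambda name: math.dist(latent_codes[name], query))
    return ranked[:k]

# Hypothetical latent codes (2-dimensional for readability).
latent_codes = {
    "black shoes": [0.1, 0.9],
    "white shoes": [0.2, 1.1],
    "black boots": [1.0, 0.8],
    "red bag": [-2.0, -1.5],
}

query = latent_codes["black shoes"]
others = {name: z for name, z in latent_codes.items() if name != "black shoes"}
print(nearest(query, others, k=2))  # ['white shoes', 'black boots']
```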

Data Augmentation

Generate additional synthetic product records to balance sparse categories.

Concept Discovery

The latent space may reveal new product "topics" or styles.

Conclusion

Modeling categorical product features with a VAE requires introducing learned embeddings that map discrete values into a continuous geometric space. This allows the encoder and decoder to operate naturally within the VAE framework. The resulting model provides a powerful tool for generating synthetic product characteristics, exploring new combinations, and uncovering latent structure in product catalogs.