Text classification, topic modelling and embedding calculation
Project Overview
This project automates call classification and topic modeling for member interactions. Until now, our team has classified calls manually after reviewing their content. Automating this process saves time and improves consistency. We also use topic modeling to identify new or overlooked categories, so the classification system stays relevant. Finally, we calculate text embeddings and store them in a vector database for future use.
Data
The model uses call transcripts as input. These transcripts must be available when running predictions.
Modeling Approach
The application consists of three main components:
1. Predefined Classification
This model classifies transcripts into pre-established categories and subcategories. Below is the schema used for classification:

    from openai import AzureOpenAI

    # Tool (function-calling) schema that describes the structured output we expect.
    schema = {
        "type": "function",
        "function": {
            "name": "classify_call_transcript",
            "description": "",
            "parameters": {
                "type": "object",
                "properties": {
                    "summary": {
                        "type": "string",
                        "description": "A concise overview of the call transcript."
                    },
                    "classification": {
                        "type": "object",
                        "description": "Classify the transcript into one or more of the following *closely relevant* top-level categories along with their respective sub-categories.",
                        "properties": {
                            "Category 1": {
                                "type": "array",
                                # "description": "Issues related to Category 1.",
                                "items": {
                                    "type": "string",
                                    "enum": ["Sub Class 1", "Sub Class 2"]
                                }
                            },
                        }
                    }
                },
                # Both keys belong to the "parameters" object.
                "required": ["classification", "summary"],
                "additionalProperties": False
            }
        }
    }

    # Key, API version and endpoint are left blank here.
    openai_client = AzureOpenAI(
        api_key='',
        api_version="",
        azure_endpoint=''
    )

    # `sample` must already hold the transcript text to classify.
    prompt = f"""{sample}"""

    response = openai_client.chat.completions.create(
        model='gpt-4o',
        tools=[schema],
        messages=[
            {
                "role": "system",
                "content": "You work classifying text and providing summaries"
            },
            {
                "role": "user",
                "content": prompt,
            }
        ],
        max_tokens=1000,
        temperature=0,
        tool_choice='required',  # force the model to answer through the tool
        seed=0
    )

2. Topic Clustering
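This model detects emerging topics that are not part of the predefined classification. It reduces the embeddings with UMAP and clusters them with HDBSCAN. We use Optuna to search for the hyperparameters of both steps that maximize the silhouette score of the resulting clusters.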
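To make the optimization target concrete, here is a minimal sketch of a silhouette score computed only on points that were not labeled as noise (label -1). The helper name `silhouette_without_noise` and the toy data are assumptions for illustration and are not part of the project code.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def silhouette_without_noise(points, labels, metric="euclidean"):
    """Silhouette score over points whose label is not -1 (noise).

    Returns 0.0 when fewer than two clusters remain after removing noise.
    """
    points = np.asarray(points)
    labels = np.asarray(labels)
    keep = labels != -1
    if keep.sum() == 0 or len(set(labels[keep])) < 2:
        return 0.0
    return float(silhouette_score(points[keep], labels[keep], metric=metric))


# Toy data (hypothetical): two tight groups plus one outlier marked as noise.
toy_points = np.array(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [50.0, -50.0]]
)
toy_labels = np.array([0, 0, 1, 1, -1])

score = silhouette_without_noise(toy_points, toy_labels)
print(round(score, 3))  # close to 1 because the two groups are well separated
```

Returning 0 for degenerate clusterings gives the search a neutral value instead of an error. The project's own objective and search follow.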

    import numpy as np
    import optuna
    from hdbscan import HDBSCAN  # assumed: the `hdbscan` package
    from sklearn.metrics import silhouette_score
    from umap import UMAP  # assumed: the `umap-learn` package

    # `df_texts` is a DataFrame with 'normalized_embedding' and 'text_plain' columns.
    embeddings = np.stack(df_texts['normalized_embedding'])

    def build_models(params):
        '''Create the UMAP and HDBSCAN models from a dict of hyperparameters.'''
        umap_model = UMAP(n_neighbors=params['n_neighbors'],
                          n_components=params['n_components'],
                          min_dist=params['min_dist'],
                          metric=params['metric_umap'])
        hdbscan_model = HDBSCAN(min_cluster_size=params['min_cluster_size'],
                                max_cluster_size=params['max_cluster_size'],
                                metric=params['metric_hdbscan'],
                                cluster_selection_method=params['cluster_selection_method'])
        return umap_model, hdbscan_model

    def objective(trial):
        '''
        Return the silhouette score between topics and reduced embeddings
        for the hyperparameters sampled in this trial.
        '''
        params = {
            'n_neighbors': trial.suggest_int('n_neighbors', 5, 10),
            'n_components': trial.suggest_int('n_components', 5, 10),
            'min_dist': trial.suggest_float('min_dist', 0.0, 1.0),
            'min_cluster_size': trial.suggest_int('min_cluster_size', 10, 100),
            'metric_umap': trial.suggest_categorical("metric_umap", ['cosine', 'euclidean', 'manhattan']),
            'metric_hdbscan': trial.suggest_categorical("metric_hdbscan", ['euclidean', 'manhattan']),
            'cluster_selection_method': trial.suggest_categorical("cluster_selection_method", ['eom']),
            'max_cluster_size': trial.suggest_int('max_cluster_size', 100, 500),
        }
        umap_model, hdbscan_model = build_models(params)
        embedding_dim_reduction = umap_model.fit_transform(embeddings)
        topics = hdbscan_model.fit_predict(embedding_dim_reduction)

        # Score only the points that HDBSCAN did not mark as noise (-1).
        valid = topics != -1
        n_labels = len(set(topics[valid]))
        if not valid.any() or n_labels < 2:
            return 0
        return silhouette_score(embedding_dim_reduction[valid], topics[valid], metric='cosine')

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=100)

    # Refit with the best parameters and assign a topic to every transcript.
    umap_model, hdbscan_model = build_models(study.best_trial.params)
    embedding_dim_reduction = umap_model.fit_transform(embeddings)
    topics = hdbscan_model.fit_predict(embedding_dim_reduction)
    df_texts['topics'] = topics
    # Transcripts without text are marked as noise; .loc writes to df_texts itself, not a copy.
    df_texts.loc[df_texts['text_plain'].isna(), 'topics'] = -1

3. Embedding Calculation
We generate embeddings for transcripts using the Azure OpenAI embeddings API (a text-embedding-3-large deployment) and store them for future retrieval.
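When many transcripts are embedded at once, the inputs are usually split into batches, since the service limits how much a single request may carry. The following is a minimal sketch of such a splitter. The helper name `batched` and the batch size are assumptions for illustration, not part of the project code.

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


transcripts = [f"transcript {i}" for i in range(5)]  # hypothetical inputs
batches = list(batched(transcripts, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch could then be passed to an embedding function such as the `get_embeddings_openai` helper defined in this section, and the resulting frames concatenated in order. A pandas Series should be converted with `.tolist()` first.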

    import numpy as np
    import pandas as pd
    from openai import AzureOpenAI

    endpoint = ''
    model_name = "text-embedding-3-large"
    deployment = "text-embedding-3-large"
    api_version = "2024-02-01"  # not used below: the client is created with "2024-12-01-preview"

    client = AzureOpenAI(
        api_version="2024-12-01-preview",
        azure_endpoint=endpoint,
        api_key=''
    )

    # Smoke test: embed three short strings and print the first and last two values of each vector.
    response = client.embeddings.create(
        input=["this was a successful test", "return embeddings", "look at this L"],
        model=deployment
    )

    for item in response.data:
        length = len(item.embedding)
        print(
            f"data[{item.index}]: length={length}, "
            f"[{item.embedding[0]}, {item.embedding[1]}, "
            f"..., {item.embedding[length-2]}, {item.embedding[length-1]}]"
        )

    def get_embeddings_openai(texts):
        '''
        Get embeddings from the Azure OpenAI embeddings API.

        Args:
            texts (pd.Series of str): text/transcripts of interest.

        Returns:
            pd.DataFrame: one row per input text, with the vector
                (np.ndarray) in the 'embedding' column.
        '''
        response = client.embeddings.create(
            input=list(texts),  # the API expects a plain list of strings
            model=deployment
        )
        # Build the frame once instead of concatenating inside a loop.
        return pd.DataFrame({'embedding': [np.array(item.embedding) for item in response.data]})

Predictions
Once all three models are ready, we run them on texts, generate outputs, and store the results.
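For the classification output, the structured result has to be read from the tool call in the response before it can be stored. The following is a minimal sketch, assuming the response has the attribute layout returned by the chat completions request in the Predefined Classification section. The helper name `extract_classification` and the stand-in response data are hypothetical.

```python
import json
from types import SimpleNamespace


def extract_classification(response):
    """Return the arguments of the first tool call as a dict, or None if absent."""
    message = response.choices[0].message
    tool_calls = getattr(message, "tool_calls", None)
    if not tool_calls:
        return None
    # `arguments` is a JSON-encoded string, not a dict.
    return json.loads(tool_calls[0].function.arguments)


# Stand-in object with the same attribute layout as a chat completion
# response (hypothetical data, so the sketch runs without an API call).
fake_response = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(
    tool_calls=[SimpleNamespace(function=SimpleNamespace(
        name="classify_call_transcript",
        arguments=json.dumps({
            "summary": "Member asked about a billing error.",
            "classification": {"Category 1": ["Sub Class 1"]},
        }),
    ))]
))])

result = extract_classification(fake_response)
print(result["summary"])
print(result["classification"])
```

If the model output is cut off by `max_tokens`, the arguments string may not be valid JSON, so a production version would likely wrap `json.loads` in error handling.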
