Text clustering, or how to extract real business value from millions of text messages

The sheer volume of text data is the bane of many an organization. Hundreds of comments, opinions and articles flood company databases, and despite the suspicion that they may hold valuable business information, text data is often ignored in decision-making. Yet the knowledge hidden in this seemingly inaccessible source often turns out to be priceless! This raises the question: can we draw conclusions from text quickly and painlessly?

––––


Unlike, for example, financial indicators where numerical values are directly interpretable, the unstructured format of text data does not allow for simple and quick analysis. However, if we collect useful metrics along with such data, such as the number of stars next to a product review, we can try to automate the process of finding positive and negative features of our product. We could do this by analyzing sentiments towards entities detected in the text - such as the phone screen or speaker interface.


Most often, however, we have to deal with text that has no labels at all - customer conversations are an example of such data. In such a situation, we can analyze the texts manually or write a set of heuristics based on regular expressions that detect specific, arbitrarily chosen categories. Unfortunately, both options significantly increase analysis time and mean hard, manual work for the analyst. Fortunately, there is a set of methods that makes this process much easier!
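To make the regular-expression approach concrete, here is a minimal sketch. The category names and patterns are hypothetical, chosen only for illustration; a real rule set would have to be written and maintained by hand for every category, which is exactly the manual effort described above.

```python
import re

# Hypothetical categories and patterns, for illustration only.
CATEGORY_PATTERNS = {
    "pricing": re.compile(r"\b(price|pricing|cost|discount)\b", re.IGNORECASE),
    "shipping": re.compile(r"\b(shipping|delivery|courier)\b", re.IGNORECASE),
}


def categorize(message):
    """Return every category whose pattern matches the message."""
    return [
        name
        for name, pattern in CATEGORY_PATTERNS.items()
        if pattern.search(message)
    ]


print(categorize("How much does delivery cost?"))  # ['pricing', 'shipping']
print(categorize("My screen is cracked"))          # []
```

The second message falls through every rule, which shows the weakness of this approach: anything the analyst did not anticipate stays uncategorized.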


Text representation in multidimensional space


In order to enable advanced automation of text analysis, we need to transform the text into a format supported by machine learning algorithms - a numerical version of the text. Such a format could consist of vectors that define the meaning of words in space. 


One of the best current solutions for creating such vectors is to use pre-trained models based on the Transformer architecture (https://arxiv.org/abs/1706.03762). They create context-based representations of words (and of whole sentences) in multidimensional space. In the simplest terms, such models place texts with similar meaning close to each other in that space.



Using a model specialized to create representations of whole sentences, we can transform the following quotes (from a certain well-known mockumentary series) into vectors and then compute the cosine similarity between them to verify whether the representations of similar texts are indeed close to each other. 



The first two messages are quite similar in content, while the third one concerns something completely different. Calculation of the cosine similarity for the given representations (in pairs 1-2 and 1-3) shows the expected result - the similarity measure of the first pair is much higher than that of the second pair.
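The comparison itself can be sketched in a few lines. The vectors below are made-up three-dimensional stand-ins, not the output of any real model (real sentence representations typically have hundreds of dimensions); only the cosine similarity computation is the point here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for sentence representations (assumption).
text_1 = [0.9, 0.1, 0.2]
text_2 = [0.8, 0.2, 0.1]  # points in nearly the same direction as text_1
text_3 = [0.1, 0.9, 0.3]  # points in a different direction

sim_12 = cosine_similarity(text_1, text_2)
sim_13 = cosine_similarity(text_1, text_3)
print(f"pair 1-2: {sim_12:.3f}")
print(f"pair 1-3: {sim_13:.3f}")
```

Cosine similarity depends only on the direction of the vectors, not on their length: multiplying one vector by a constant leaves the result unchanged.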


Algorithms based on the Transformer architecture are distinguished by their generalisation capabilities learned from huge textual datasets. Our role in using them can be reduced to simply downloading a given model and computing predictions on our data. Of course, we could fine-tune it on our data, but this solution is all about simplicity and speed.



Unsupervised learning


Having the text data in numerical form, we can proceed with the analysis using machine learning models. The lack of labels means that we have to resort to alternative methods such as unsupervised learning. This means that during training (adapting the model to the data) we do not directly indicate the classes of observations and we want our algorithm to extract groups of observations based on their characteristics. In the case of text data, these features are actually the direction of a vector that is a representation of a word (text) in a multidimensional space. So our goal is to identify clusters of texts with similar meaning - similar direction. 


Identification of clusters in the text


We can drop the processed text data into an unsupervised model that identifies the clusters. At little cost, we can adapt it to the data by tuning the hyperparameters.


One such model that performs extremely well with text data, if only because of the elimination of noise, is DBSCAN (Density-based spatial clustering of applications with noise).


This algorithm, in a nutshell, identifies groups of observations in dense regions and omits the rest. As a result, outliers and noise are not assigned to clusters "by force", as is the case with e.g. the k-means algorithm, where each observation must be assigned to one of a predefined number of clusters.


As the following figure shows, with the k-means algorithm each point belongs to some cluster, whereas DBSCAN identifies the regions of highest density and leaves the noise (grey dots) unassigned.


Source: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html


Using DBSCAN, we obtain high-density clusters containing texts of very similar meaning. Noise, which makes the analysis much more difficult, is left out - it may be spam messages or opinions completely unrelated to the product. Such messages or opinions may be worth additional analysis, but if we want to focus on the main groups/categories, skipping the noise makes the analysis considerably easier.
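To show how dense regions and noise are separated, here is a minimal pure-Python sketch of DBSCAN on toy two-dimensional points. It is an illustration, not the implementation used on our data. The parameter names eps and min_samples and the -1 label for noise mirror scikit-learn's convention; the points, the Euclidean distance and the parameter values are assumptions chosen for the example. In practice one would use an optimized library implementation.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Toy data (assumption): two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.3, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point keeps the noise label instead of being forced into the nearest cluster, which is the behaviour that distinguishes DBSCAN from k-means.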


We can browse the resulting clusters manually and identify the category of each text ourselves, e.g., opinions on phone defects or similar questions. We can also visualize the clusters after reducing the dimensions of the vector representation of the text, using techniques such as PCA or UMAP.

For example, clusters found in a collection of all dialogues from a certain well-known mockumentary series can be visualised in three dimensions after reducing the vector representations with PCA to three components (as in the following figure, where each color denotes a different cluster and 8 clusters are visible). Similar lines of the transcript clearly lie close to each other.



A more sophisticated approach, however, is to attempt automated topic identification across clusters. The standard approach is to extract the most relevant words via a TF-IDF matrix. However, a simple modification of this approach, computing TF-IDF per cluster (class-based TF-IDF) rather than over the individual texts of the entire corpus, makes the clusters significantly easier to interpret.


In this way, we can extract clusters of similar topics, and at little cost, automatically name these clusters by the most relevant terms found in the text. One such cluster, containing sentences similar to those in the example of calculating similarity by the cosine measure, was defined by the words raise, ask, salary, schedule, compensation, Michael - we can easily interpret the topic covered in this cluster. We can then unify the resulting topics into several categories or use them unaltered.
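The class-based TF-IDF idea can be sketched as follows: all texts of a cluster are merged into one bag of words, and each term is scored by how frequent it is inside the cluster relative to how frequent it is across all clusters. The exact weighting below, log(1 + average words per cluster / total term frequency), is one possible formulation and an assumption of this sketch, as are the toy texts and the naive whitespace tokenization.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """clusters: dict of cluster name -> list of texts.
    Returns dict of cluster name -> {term: score}."""
    # Merge every cluster into a single bag of words.
    counts = {
        name: Counter(" ".join(texts).lower().split())
        for name, texts in clusters.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for name, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[name] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in cluster_counts.items()
        }
    return scores


def top_terms(term_scores, k=3):
    """The k highest-scoring terms of one cluster."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:k]


# Toy clusters (assumption), not real data.
clusters = {
    "cluster_0": ["i want to ask for a raise", "ask about the raise and the salary"],
    "cluster_1": ["the printer is broken", "the printer needs the paper"],
}
scores = class_tfidf(clusters)
print(top_terms(scores["cluster_1"], k=1))  # ['printer']
```

Although "the" is the most frequent word in the second cluster, it also occurs in the first one, so it is down-weighted and "printer" becomes the top term.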


Tidio livechat is used by thousands of website users every day, conducting millions of conversations with customer service departments. The collection of data gathered by our widget is constantly growing and currently exceeds 2TB. Using the techniques described above, we have significantly accelerated the process of analyzing conversations and extracting business value from text that was not originally tagged. Working on a text data set requires dedication and passion for data structuring. However, the business value for the entire company and for product development that comes from analyzing texts far outweighs the effort required. 

