Text Clustering 1.1 Documentation

Request

Endpoint:

POST

https://api.meaningcloud.com/clustering-1.1

If you are working with an on-premises installation, replace api.meaningcloud.com with your own server address.

Content-Type:

multipart/form-data
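As a rough illustration of the endpoint and content type above, the following sketch builds a multipart/form-data POST request using only the Python standard library. The helper names `encode_multipart` and `build_request` are hypothetical and not part of any MeaningCloud client; the field names are the parameters listed in this section, and `YOUR_KEY` is a placeholder.

```python
import urllib.request
import uuid

ENDPOINT = "https://api.meaningcloud.com/clustering-1.1"


def encode_multipart(fields):
    """Encode a dict of text fields as a multipart/form-data body.

    Hypothetical helper: returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex  # random boundary, assumed not to occur in any value
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")  # closing boundary
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(fields, endpoint=ENDPOINT):
    """Build (but do not send) a POST request carrying the given fields."""
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder values; two texts, one per line.
request = build_request({"key": "YOUR_KEY", "lang": "en", "txt": "first text\nsecond text"})
```

The request could then be sent with `urllib.request.urlopen(request)`. For an on-premises installation, pass your own server address as `endpoint`.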

Parameters:

Name	Description	Values	Notes
`key`	Authorization key for using MeaningCloud services. Create an account for free to create your key.		Required
`of`	Output format.	`json`, `xml`	Optional. Default: `json`
`lang`	Language of the input texts.	en: English, es: Spanish, it: Italian, fr: French, pt: Portuguese, ca: Catalan, da: Danish, sv: Swedish, no: Norwegian, fi: Finnish, zh: Chinese, ru: Russian, ar: Arabic	Required
`txt`	One or more texts, one per line. Each text is automatically assigned an ID that identifies it in the output; these IDs are numerical and start from 1. For `mode`=dg, more than one text must be sent.	UTF-8 encoded text (plain text, HTML or XML).	Required
`id`	IDs associated with the input texts, one per line. The number of IDs must be the same as the number of texts in `txt`.	UTF-8 encoded text (plain text, HTML or XML).	Optional. Default: `id=""`
`mode`	Approach used to carry out the clustering process. See the Clustering modes section for details.	tm: Topic Modeling (default), dg: Document Grouping	Optional. Default: `mode="tm"`
`sw`	Stopwords to be ignored by the algorithm, both during clustering and as cluster labels. Send one stopword per line (separated by a linefeed, "\n"). These stopwords are added to the ones used by default for the selected `lang`.	UTF-8 encoded.	Optional. Default: `sw=""`
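The line-based parameters (`txt`, `id`, `sw`) are easy to get wrong, so here is a minimal sketch of how the form fields might be assembled and checked before sending. The function `build_fields` is a hypothetical helper, not part of the API. Collapsing whitespace inside each text is an assumption made here to guarantee one text per line; the checks simply mirror the constraints in the table above.

```python
def build_fields(key, lang, texts, ids=None, mode="tm", stopwords=None, of="json"):
    """Assemble the request fields from Python lists (hypothetical helper)."""
    if mode not in ("tm", "dg"):
        raise ValueError("mode must be 'tm' or 'dg'")
    if not texts:
        raise ValueError("txt is required: send at least one text")
    if mode == "dg" and len(texts) < 2:
        raise ValueError("mode=dg needs more than one text")

    # Assumption: collapse whitespace runs (including line breaks) inside each
    # text so that every text occupies exactly one line of `txt`.
    one_line_texts = [" ".join(text.split()) for text in texts]

    fields = {
        "key": key,
        "of": of,
        "lang": lang,
        "mode": mode,
        "txt": "\n".join(one_line_texts),
    }

    if ids is not None:
        if len(ids) != len(texts):
            raise ValueError("the number of IDs must match the number of texts")
        fields["id"] = "\n".join(str(item) for item in ids)

    if stopwords:
        fields["sw"] = "\n".join(stopwords)  # one stopword per line

    return fields


fields = build_fields(
    "YOUR_KEY",  # placeholder key
    "en",
    ["Solar panels on the roof", "Wind turbines\noffshore"],
    ids=["doc-a", "doc-b"],
    mode="dg",
)
```

Optional parameters that are not supplied are left out of the dictionary, so the service defaults listed in the table apply.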

Clustering modes

The current clustering modes available are the following:

Topic modeling: this method groups the documents passed in the `txt` parameter by the n-gram that is most representative of their meaning. It reverses the pipeline of classical clustering algorithms: it selects the representative labels first and then groups the texts. This approach helps to discover hidden themes in document collections and provides more descriptive labels than classical clustering algorithms. Cluster assignment is not exclusive (a text can belong to more than one cluster), and there is always a default cluster called Other Topics that contains the texts that do not belong to any other cluster.
Document grouping: this method implements the classic bisecting k-means algorithm. One of its most significant differences from topic modeling is that cluster assignment is exclusive, that is, a text can be assigned to only one cluster. In this case, labels are not as descriptive; they are composed of a collection of terms that describe the documents assigned to the cluster. For large collections, the label will be a single term.

So, which one to choose? It will depend on your use case, but the main factors to take into account are that topic modeling gives more descriptive labels and more weight to outliers in the collection, while document grouping is the only one that provides exclusive clustering.
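To make the exclusive versus non-exclusive distinction concrete, here is a small sketch that checks whether any text appears in more than one cluster. The dictionaries mapping a cluster label to a list of text IDs are invented toy data for illustration only; they are not the API's response format.

```python
from collections import Counter


def is_exclusive(clusters):
    """Return True if every text ID appears in exactly one cluster."""
    counts = Counter(
        text_id
        for members in clusters.values()
        for text_id in set(members)  # ignore duplicates within one cluster
    )
    return all(n == 1 for n in counts.values())


# Toy data (hypothetical): topic modeling may place text 2 in two clusters
# and keeps a default "Other Topics" cluster.
topic_like = {"solar power": [1, 2], "wind farms": [2, 3], "Other Topics": [4]}

# Toy data (hypothetical): document grouping assigns each text to one cluster.
grouping_like = {"solar, panel": [1, 2], "wind, turbine": [3, 4]}

topic_is_exclusive = is_exclusive(topic_like)
grouping_is_exclusive = is_exclusive(grouping_like)
```

If downstream processing needs each text in exactly one group, a check like this on the results is a quick way to confirm that `mode=dg` is the appropriate choice.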
