What is a text classification model?

A classification model comprises the list of categories as well as the resources required to classify documents into those categories. For instance, a model may classify movie synopses by genre; it would include categories such as thriller, horror or romance. Formally, each category is identified by a code and a label, which is a short description of the purpose of the category.

The classification process is based on a hybrid algorithm that combines statistical methods with linguistic rules, aiming for high classification accuracy while keeping control over the results. Accordingly, each category includes training documents and/or rules used to classify documents.

Each category contains additional fields to provide training text and to define four sets of manual rules, such as relevant or irrelevant terms, that determine the behaviour of the classification model.
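The structure described above can be pictured with a small sketch. It is only an illustration: the class names, field names and the sample "MovieGenres" model are hypothetical and do not reflect how MeaningCloud stores models. Only the "relevant" and "irrelevant" rule sets are named in the text, so the other rule sets are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Category:
    """Hypothetical in-memory view of one category (not an official format)."""
    code: str                # identifier of the category
    label: str               # short description of its purpose
    training_text: str = ""  # sample text used by the statistical part
    # Rule-set name -> list of terms, e.g. "relevant" or "irrelevant".
    rules: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class ClassificationModel:
    """A named list of categories."""
    name: str
    categories: list[Category] = field(default_factory=list)

    def find(self, code: str) -> Optional[Category]:
        """Return the category with the given code, or None if absent."""
        return next((c for c in self.categories if c.code == code), None)


# Made-up example model for movie genres.
genres = ClassificationModel(
    name="MovieGenres",
    categories=[
        Category("THR", "Thriller", rules={"relevant": ["suspense"]}),
        Category("ROM", "Romance", rules={"relevant": ["love"], "irrelevant": ["war"]}),
    ],
)
```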

MeaningCloud provides some models for common use cases:

  • IPTC Subject Codes - The International Press Telecommunications Council (IPTC) is an international consortium of the world's major news agencies, news publishers and news industry vendors. It maintains several widely adopted taxonomies for news categorization, including this one on subjects. Top-level categories include subjects like "sport", "politics" or "education". IPTC Subject Codes provides roughly 1400 categories organized in a three-level tree. For more detailed information, please consult the IPTC standard: [ definition ] [ navigation tool ].
  • EuroVoc - EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU and the European Parliament. Here it is provided as a classification model for indexing institutional documentation into subject topics.
  • Business Reputation - focuses on the different areas that may affect the online reputation of a company.
  • Social Media - this model provides a simple, comprehensive classification for the social media posts you want to analyze.

In the supported models section there are more details about these models, the categories defined for each one of them and the languages they are available in.

These are the exact values to enter as the model parameter when using the API, one for each available model:

  • IPTC_en: English IPTC model.
  • IPTC_es: Spanish IPTC model.
  • IPTC_fr: French IPTC model.
  • IPTC_it: Italian IPTC model.
  • IPTC_pt: Portuguese IPTC model.
  • IPTC_ca: Catalan IPTC model.
  • EUROVOC_es_ca: EuroVoc EU's multilingual thesaurus (Spanish/Catalan).
  • BusinessRep_es: Business Reputation (Spanish).
  • SocialMedia_en: English Social Media model.
  • SocialMedia_es: Spanish Social Media model.
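
Because the model value has to match one of these identifiers exactly, it can help to validate it before sending a request. The following is a minimal sketch: the `model` parameter name comes from the text above, while `key` and `txt` are assumed parameter names and `build_request_body` is a hypothetical helper. The API endpoint is not given in this section, so the sketch only builds the form-encoded body.

```python
from urllib.parse import urlencode

# Model identifiers copied from the list above.
AVAILABLE_MODELS = {
    "IPTC_en", "IPTC_es", "IPTC_fr", "IPTC_it", "IPTC_pt", "IPTC_ca",
    "EUROVOC_es_ca", "BusinessRep_es", "SocialMedia_en", "SocialMedia_es",
}


def build_request_body(key: str, text: str, model: str) -> str:
    """Return a form-encoded request body.

    `key` and `txt` are assumed parameter names; only `model` is
    confirmed by the text above.
    """
    if model not in AVAILABLE_MODELS:
        # The value must match exactly, including upper and lower case.
        raise ValueError(f"unknown model: {model!r}")
    return urlencode({"key": key, "txt": text, "model": model})


body = build_request_body("YOUR_API_KEY", "The team won the final.", "IPTC_en")
```

A misspelled or differently cased value such as "iptc_en" is rejected locally with a ValueError instead of being sent to the API.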

In addition, you can define your own classification models. They provide output similar to that of our own IPTC categorization model: the relevance of a document with respect to each subject is evaluated, and the result is attached to the response for further processing. This lets you decide the type of classification (binary, multiclass, multilabel, single label) that best suits your application.
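
As an illustration of how the same relevance output can serve different classification types, here is a minimal sketch. It assumes the response has already been reduced to (code, relevance) pairs; that shape, the function names and the threshold value of 50.0 are assumptions made for this example, not part of the API.

```python
def select_categories(scored, multilabel=True, threshold=50.0):
    """Pick category codes from (code, relevance) pairs.

    multilabel=True  -> every category at or above the threshold
    multilabel=False -> only the single most relevant one (single label)
    """
    above = sorted(
        (pair for pair in scored if pair[1] >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if multilabel:
        return [code for code, _ in above]
    return [above[0][0]] if above else []


def is_relevant(scored, code, threshold=50.0):
    """Binary decision: does the document belong to one given category?"""
    return any(c == code and rel >= threshold for c, rel in scored)


# Made-up scores for a single document.
scored = [("sport", 100.0), ("politics", 62.5), ("education", 10.0)]

multi = select_categories(scored)                     # ['sport', 'politics']
single = select_categories(scored, multilabel=False)  # ['sport']
binary = is_relevant(scored, "education")             # False
```

The threshold is the main design choice here: raising it trades recall for precision, and the single-label case simply keeps the top-ranked category among those that pass it.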