A corpus is the collection of data an AI is trained on, which makes it the foundation of that AI.
What is a Corpus?
A corpus is the collection of data on which an AI is trained. To be effective, a corpus needs high-quality data, both in its content and in how topics and concepts are distributed across it. The data must also be relevant to the intended function of the AI. For example, a medical AI intended to detect patient falls by tracking movement may need both general data and medical data, because human movement is easier to model when the data also covers everyday scenarios. By contrast, a medical AI intended to find bacteria in blood samples might need only medical data or, more specifically, a collection of blood samples.
Quality of Corpus Data
The quality of corpus data depends on many factors, including the size, balance, date, annotation, classification, representation, and accessibility of the data. Curating a corpus with all of these factors in mind limits, as far as possible, the errors that could occur when an AI is trained on it. This is in addition to the programming and guardrails that are implemented.
A good example of why corpus quality matters is Amazon’s flawed automated resume screening tool, which the company had trained on resumes submitted by job candidates over a ten-year period. In 2015, Amazon realized that the tool was discriminating against female candidates. The cause was traced to a lack of balance and representation in the corpus: most of the resume samples came from male candidates.
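As a rough illustration of what a balance check can look like, the sketch below computes each label's share of a labelled collection and lists the labels that fall under a cutoff. It is a minimal sketch, not a description of how Amazon or anyone else audits data: the function names, the toy sample, and the 0.3 threshold are all assumptions made for this example.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the collection as a fraction of 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.3):
    """List labels whose share is below min_share (an arbitrary, hypothetical cutoff)."""
    shares = label_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Toy, made-up sample: 8 records tagged "male" and 2 tagged "female".
sample = ["male"] * 8 + ["female"] * 2

print(label_shares(sample))
print(underrepresented(sample))
```

A check like this only covers balance along one labelled attribute. The other quality factors listed above, such as date or annotation, would need their own checks.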
Types of Corpus Data
Types of corpus data refer to the many ways corpora (plural of corpus) can be classified. Most commonly, corpora are distinguished by the modes and modalities of the data they contain. Modes are the forms in which data is presented, such as text, image, audio, and video, while modalities are characteristics such as language, genre (e.g. fiction, jazz, or horror), style (e.g. sarcastic or baroque), production techniques (e.g. animation), and acoustic features (e.g. high-pitched).
Because these classifications overlap, a corpus usually belongs to more than one type at the same time. Some common types are the “monolingual corpus”, which contains data in only one language; the “multilingual corpus”, which contains data in more than one language; the “monomodal corpus”, which contains data in only one modality; and the “multimodal corpus”, which contains data in more than one modality.
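To make the overlap between these types concrete, here is a small sketch that labels a collection of records by language and by mode. The record layout (a dictionary with "mode" and "language" keys) and the function name are hypothetical. Treating the data's mode (text, image, audio, video) as the deciding field for monomodal versus multimodal is also an assumption of this sketch.

```python
def classify_corpus(records):
    """Label a corpus by how many distinct languages and modes its records contain.

    Each record is assumed to be a dict with a "mode" key and an optional
    "language" key. This layout is hypothetical, chosen for the example.
    """
    languages = {r["language"] for r in records if r.get("language")}
    modes = {r["mode"] for r in records}

    if not languages:
        language_type = "no language data"
    elif len(languages) == 1:
        language_type = "monolingual"
    else:
        language_type = "multilingual"

    modal_type = "monomodal" if len(modes) == 1 else "multimodal"
    return language_type, modal_type


# Toy records: text in two languages, nothing but text.
records = [
    {"mode": "text", "language": "en"},
    {"mode": "text", "language": "fr"},
    {"mode": "text", "language": "en"},
]

print(classify_corpus(records))
```

The same corpus receives two labels at once, one per classification, which is why the types above are not mutually exclusive.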
Importance of Corpus
Just as an AI’s corpus is dictated by its intended function, the proficiency and capabilities of an AI are dictated by its corpus. Proficiency, meaning accuracy and reliability, depends on the quality of the corpus. Capability, meaning an AI’s capacity to perform various functions, depends on the modality or modalities of the corpus. Put simply, a good-quality corpus with balanced data will result in a proficient AI, while a monomodal corpus with, for example, only text-based data will result in an AI with only text-based capabilities.
A good case study for a monomodal corpus is GPT-3, which was trained on a collection of text data. AIs that use GPT-3 or other models in its series as their underlying technology (e.g. ChatGPT) are capable of content creation, summarization, conversation, information retrieval, and other text-based tasks. However, they can’t generate audio, images, or videos without plugins. For example, the MixerBox ImageGen plugin enables ChatGPT to generate AI images.