TEPT data and disambiguation – The use of personal identifiers | Distant Reading and Data-Driven Research in the History of Philosophy

The TEPT project is committed to the development of an infrastructure for the reconstruction of the relations of academic descent among philosophers. Such relations constitute a socio-institutional network in which arcs connect nodes, i.e. pairs of academic parent and offspring. The Tree of Philosophers that is being developed by DR2 is meant to represent the network of relations that are reconstructed.

Evidently, philosophers are the smallest and fundamental units of the project from a structural point of view, as they are the constituents of the very subject of the project, i.e. the descent relations.

TEPT’s criteria for the inclusion of people in the Tree of Philosophers are quite broad: anyone who ever granted or received a high-level academic degree to or from someone who either granted or received a high-level academic degree in philosophy is, in principle, a proper addition to the tree.

People (philosophers) are thus included in the tree regardless of their notability, their career paths, their productivity in the intellectual domain or the reception of their works.

Trivially, then, the Tree of Philosophers is meant to record many people. A resource such as the Tree of Philosophers will hardly ever be completed, so the number of philosophers that can be expected to be included is not easy to assess. Nevertheless, the 3000 people constituting the first batch of philosophers TEPT has ordered and analysed, along with our immediate acknowledgement of the limited historical coverage of such a sample, can provide some insight into the order of magnitude we expect the tree to deal with. In spite of its institutional focus, which makes it blind to non-academic transmission of knowledge, the tree and the infrastructure it relies upon have to manage and treat as “authors” (people whose works, represented by their dissertations at the bare minimum, contributed to intellectual production) a number of people that is larger than usual in reconstructions of the history of philosophy.

Moreover, a significant part of the tree’s domain is populated by what we can naively call non-famous philosophers. We can’t provide an estimated ratio yet (yet!), but common sense is sufficient to assume that in most historical contexts in which academic philosophical training exists, people graduating in philosophy usually come in greater numbers than people becoming notable because of their philosophical work (regardless of training and background).

The need to work on large sets of mostly unknown people heavily influences most choices and strategies in the development of the TEPT project. In order to see how this aspect affects the development of TEPT’s infrastructures, a brief description of the tasks involved in the reconstruction of lines of academic descent can be of use.

Once a specific type of descent relation is defined, e.g. the master-pupil relation of a PhD advisor and a PhD candidate, TEPT researchers work to track the academic genealogy of a philosopher who earned or granted a PhD, then proceed to recursively retrieve either their parents (their PhD supervisors and the supervisors of the supervisors) or their offspring (the PhD candidates they advised and those that the latter supervised). In order to retrieve information about such connections, researchers need to browse a variety of documents and archives. Depending on the philosophers under inquiry, on the time of their academic training and on the institutions where the training took place, we can find pieces of information about philosophers’ academic training in (auto)biographies, in institutional archives and in national registers; professional résumés, public commemorative speeches and even obituaries are valuable sources as well.

Most pieces of information needed for the reconstruction of biographies are difficult to find, even at a high cost in terms of time. In some cases, non-famous people have left fewer traces (e.g. if they did not publish philosophical works because they pursued different careers), but what is noticeable in most cases is that the traces left by non-famous people are harder to find and to put together. By relying on resources such as ProQuest, for example, we can find the names of philosophers who graduated in the USA in the second half of the 20th century, along with the titles of their theses. Nonetheless, such names and titles are seldom sufficient for the attribution of descent relations: apart from cases concerning well-known philosophers, it is very difficult to assess whether different pieces of information linked to a name refer to the same person.

To give a couple of examples, we retrieved the identities of seven different philosophers whose last name is “Davidson”, all of them trained in the US and active in the 20th century; among the forty philosophers whose last name is “Johnson”, we have been able to distinguish four different people named “David Johnson”, with different and sometimes abbreviated middle names, all trained in the US between 1949 and 1978. Attributing data retrieved from different sources to the same person is often a difficult task simply because we are not sure that the data actually refer to the same person. Such attribution thus requires multiple validation steps that are hard to formalise in a set of instructions. If we are provided with the title of a dissertation, we can evaluate the disciplinary proximity of the thesis to the academic production or the professional path of someone who has the same name as the thesis’ author (at least until we prove that they are the same person). The same evidently holds for geographical and chronological information. We often find published works, but the retrieval of a set of very heterogeneous works under the same author’s name is often the first cue that we are probably dealing with a case of homonymy.

For all these reasons, two major issues that demand our attention in the development of the Tree of Philosophers are the duplication and the overlapping of the personal identities of philosophers included in the Tree.

We find a partial solution to both problems by relying on virtual identifiers of authority data, which are resources used in archival disciplines and in library institutions. Virtual identifiers are simple numbers or strings of text that are used as labels or indexes, pointing to a specific person identified as the author of a number of works. In order to mitigate the problems of duplication and overlapping of identities, we made slight modifications to TEPT’s database, enlarging the Philosophers Table with four additional fields, one for each virtual identifier we want to rely upon: Wikidata, ORCID, ISNI and VIAF.
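To make the idea of the four additional fields concrete, here is a minimal sketch of what a row of the Philosophers Table might look like. The class name, the field names and the placeholder identifier values are assumptions made for illustration; they are not taken from TEPT’s actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PhilosopherRecord:
    """Hypothetical row of the Philosophers Table (names are assumptions)."""
    record_id: int
    name: str
    # One optional field per external identifier; None means "not assigned".
    wikidata_id: Optional[str] = None
    orcid_id: Optional[str] = None
    isni_id: Optional[str] = None
    viaf_id: Optional[str] = None

    def identifiers(self) -> Dict[str, str]:
        """Return only the identifiers that have actually been assigned."""
        fields = {
            "wikidata": self.wikidata_id,
            "orcid": self.orcid_id,
            "isni": self.isni_id,
            "viaf": self.viaf_id,
        }
        return {key: value for key, value in fields.items() if value is not None}


def share_identifier(a: PhilosopherRecord, b: PhilosopherRecord) -> bool:
    """True if two records carry the same value for at least one identifier,
    which flags them as candidate duplicates of the same person."""
    ids_a, ids_b = a.identifiers(), b.identifiers()
    return any(ids_a[key] == ids_b[key] for key in ids_a.keys() & ids_b.keys())
```

Two records named “David Johnson” that share a VIAF value would be flagged as candidate duplicates, while two homonyms with no identifier in common would not.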

VIAF (Virtual International Authority File) is an international authority data identifier assigned by OCLC by aggregating authority data from national library systems: this means that a name that is assigned a VIAF id by OCLC is a name that is recorded as the author of at least one work in at least one national library catalogue. ISNI is the ISO effort to standardise the personal identification of contributors to intellectual production, and it works in a similar manner to VIAF, by aggregating authority data from national catalogues along with academic production of article-like works. Both OCLC’s VIAF and ISNI aggregate authority data from national systems using samples of published titles and authors’ birth (and sometimes death) years. ORCID identifiers are obviously available only for recent times, but their assignment is directly requested by researchers or institutions, and they can help in disambiguating nodes in recent branches of the Tree of Philosophers, at the cost of an insignificant increase in the sparsity of the Philosophers Table in TEPT’s database. Technically, ORCID is a part of ISNI, because ORCID identifiers are included as a region of ISNI identifiers. Nonetheless, we prefer to keep ORCID and ISNI ids as separate fields. First, because a recent productive philosopher can easily have different ISNI and ORCID ids; secondly, because while ISNI ids are assigned by aggregating data, the assignment of ORCID ids is directly requested by researchers or their institutions of affiliation, so that the reliability of ORCID is greater than that of ISNI. Finally, Wikidata identifiers are assigned by an automatic system supervised by users of the Wikidata community. Notably, Wikidata ids are assigned to somewhat notable people, so we cannot expect Wikidata ids to make a huge difference in the disambiguation of non-famous philosophers.
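Because ORCID ids occupy a region of the ISNI number space, the two share the same shape: 16 characters, the last of which is a check character computed with the ISO 7064 MOD 11-2 algorithm. The sketch below shows how an identifier string could be checked for well-formedness before it is stored. The function names are assumptions, and this step is not described as part of TEPT’s tools; the check only tells us that a string is well formed, not that it belongs to the right person.

```python
def mod11_2_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed(identifier: str) -> bool:
    """Check length, digits and check character of an ORCID- or ISNI-style id.

    Hyphens and spaces are ignored, so both "0000-0002-1825-0097" and
    "0000000218250097" are accepted as input.
    """
    compact = identifier.replace("-", "").replace(" ", "").upper()
    base, check = compact[:15], compact[15:]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return mod11_2_check_char(base) == check
```

For instance, `is_well_formed("0000-0002-1825-0097")` accepts the sample identifier used in ORCID’s own documentation, while changing its last digit makes the check fail.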

The search for such different identifiers and, whenever available, their attribution to personal records in TEPT’s database, provide means to improve the way in which TEPT works. The inclusion of four types of identifiers makes it possible for us to assess the population of the database at different stages, thus allowing for the evaluation of different strategies of data collection; we can evaluate the coverage of the database in terms of intra- or inter-disciplinary renown of philosophers (e.g. by comparing ISNI, ORCID and Wikidata coverage); furthermore, we are able to approximate a measure of the “Great Unread” that is included in the Tree of Philosophers, by counting the philosophers that are not recorded in any of the mentioned repositories.
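The two measures mentioned here, per-repository coverage and the share of philosophers with no identifier at all, can be sketched in a few lines. The dictionary keys and function names below are assumptions chosen for illustration, and the records in the usage example are made up.

```python
from typing import Dict, List, Optional

# Hypothetical field names for the four identifier columns.
ID_FIELDS = ("wikidata", "orcid", "isni", "viaf")

Record = Dict[str, Optional[str]]


def coverage(records: List[Record]) -> Dict[str, float]:
    """Fraction of records that carry each type of identifier."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in ID_FIELDS}
    return {
        field: sum(1 for rec in records if rec.get(field)) / total
        for field in ID_FIELDS
    }


def unidentified_share(records: List[Record]) -> float:
    """Fraction of records found in none of the four repositories,
    a rough proxy for the "Great Unread" in the tree."""
    if not records:
        return 0.0
    missing = sum(
        1 for rec in records if not any(rec.get(field) for field in ID_FIELDS)
    )
    return missing / len(records)
```

Running both functions on snapshots of the database taken at different stages would give comparable figures for different data-collection strategies.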

From a technical point of view, the assignment of virtual identifiers to personal records in TEPT’s database also improves TEPT data in terms of findability, accessibility, interoperability and reusability (also known as the FAIR principles): while the disambiguation allowed by the introduction of virtual identifiers trivially increases the findability and accessibility of data, the indexation of philosophers’ data with external identifiers dramatically improves interoperability, and consequently reusability. Indeed, reliance on external, widely used identifiers allows for the development of semi-automatic procedures for the inclusion of large quantities of personal records. As stated above, a variety of archival sources provide data about philosophers, and some of these already come in the form of structured data (mostly spreadsheets of archival records). In the best cases, structured data are clean enough to be added almost directly (after some filtering) to TEPT’s database. Without reliance on external identifiers, we could make such an addition only if we were sure that none of the data to be added overlapped with personal data already present in the database, thus duplicating philosophers. This entails that when two sources of structured data concerning the same context are available, without external identifiers we would be forced to choose one of the two sources and discard the other. By contrast, reliance on external identifiers and on procedures for their attribution allows us to assign identifiers to the structured data to be added, and then to apply filters on the identifier fields.
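The filtering step described above can be illustrated as follows. This is a sketch under stated assumptions, not TEPT’s actual procedure: records are plain dictionaries, the identifier keys are hypothetical, and an incoming record counts as a duplicate as soon as one of its identifiers is already known.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical field names for the four identifier columns.
ID_FIELDS = ("wikidata", "orcid", "isni", "viaf")

Record = Dict[str, Optional[str]]


def identifier_keys(record: Record) -> Set[Tuple[str, str]]:
    """Collect (field, value) pairs for every identifier set on a record."""
    return {(f, record[f]) for f in ID_FIELDS if record.get(f)}


def merge_source(
    existing: List[Record], incoming: List[Record]
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split incoming records into new ones, duplicates, and unresolved ones.

    - added: at least one identifier, none of them already known
    - duplicates: share an identifier with a known or already added record
    - needs_review: no identifier at all, so overlap cannot be excluded
    """
    known: Set[Tuple[str, str]] = set()
    for rec in existing:
        known |= identifier_keys(rec)

    added: List[Record] = []
    duplicates: List[Record] = []
    needs_review: List[Record] = []
    for rec in incoming:
        keys = identifier_keys(rec)
        if not keys:
            needs_review.append(rec)
        elif keys & known:
            duplicates.append(rec)
        else:
            added.append(rec)
            known |= keys  # later rows in the same source are checked too
    return added, duplicates, needs_review
```

With this kind of filter, a second source covering the same context no longer has to be discarded: only its rows that overlap with known identifiers are set aside, and rows without any identifier are left for manual checking.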

Along with our collaborators in the department of Computer Science, we devised a process to link as many philosophers as possible to their respective virtual identifiers. More specifically, dedicated annotation and disambiguation tools have been developed for the assignment of the identifiers to personal records in TEPT’s database. The application works by searching for matching identifiers on four different pages, one for each repository of virtual identifiers. On each page, the application suggests potential matches for the name of the selected philosopher in the repository (e.g. VIAF) by performing automatic queries (e.g. through the VIAF search engine). If there is exactly one matching result for a recorded author, the author’s identifier is provisionally assigned to the name, and annotators must verify the assignment, then confirm or reject it. If there are multiple matches, annotators select the identifier that they consider correct, if they find any, by exploring the authors’ contextual data in the repositories of identifiers.
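The logic of this workflow, with zero, one or many matches and a human decision in every case, can be sketched as below. The names are hypothetical, and the repository query is passed in as a plain function so that no real search API is assumed; this is not the code of the actual annotation tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Suggestion:
    """What the tool shows to the annotator for one name in one repository."""
    status: str            # "no_match", "needs_confirmation" or "needs_selection"
    candidates: List[str]  # identifiers returned by the repository query


def suggest(name: str, search: Callable[[str], List[str]]) -> Suggestion:
    """Query one repository and classify the outcome.

    `search` is a stand-in for the automatic query (e.g. a VIAF search);
    its signature is an assumption made for this sketch.
    """
    candidates = search(name)
    if not candidates:
        return Suggestion("no_match", [])
    if len(candidates) == 1:
        # A single hit is only pre-assigned; the annotator must still verify it.
        return Suggestion("needs_confirmation", candidates)
    return Suggestion("needs_selection", candidates)


def resolve(suggestion: Suggestion, annotator_choice: Optional[str]) -> Optional[str]:
    """Apply the annotator's decision: an identifier, or None to reject all."""
    if annotator_choice is None:
        return None
    if annotator_choice not in suggestion.candidates:
        raise ValueError("choice must be one of the suggested candidates")
    return annotator_choice
```

Keeping the query and the decision in separate functions mirrors the division of labour in the text: the application proposes, the annotator disposes.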

This disambiguation process is necessary because virtual identification leaves room for error and ambiguity. The data-aggregation procedures of VIAF, ISNI and Wikidata can lack precision and collapse different people with similar names into single entities, or they can assign two identifiers to the same author. Typically, the latter happens when, given two books 1 and 2 by the same author and two foreign countries A and B, book 1 is translated and published in country A but not in country B, while book 2 is translated and published in country B but not in country A.

Facing these ambiguities, we decided on a set of explicit rules to guide us in evaluating multiple matching identifiers for each repository. For example, in VIAF assignments, if multiple matching VIAF ids do not contain mistakes in contextual data, we select the one that has the lowest serial number. Notably, we devised criteria that ideally leave no room for interpretation, in order to ensure consistency in the assignment of virtual identifiers in case two collaborators need to assign ids to two datasets of philosophers that are to be merged.
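The VIAF rule given as an example is simple enough to be written down directly. In this sketch, each candidate is a pair made of a VIAF id and a flag set by the annotator when the contextual data contain mistakes; this representation and the function name are assumptions. The ids in the usage note are made-up numbers, not real VIAF records.

```python
from typing import List, Optional, Tuple


def pick_viaf(candidates: List[Tuple[str, bool]]) -> Optional[str]:
    """Choose among several matching VIAF ids.

    Each candidate is (viaf_id, has_context_errors). Candidates whose
    contextual data contain mistakes are discarded; among the rest, the id
    with the lowest serial number wins. Ids are compared as integers, so
    that "99" ranks below "100" (a plain string comparison would not).
    """
    clean = [viaf_id for viaf_id, has_errors in candidates if not has_errors]
    if not clean:
        return None
    return min(clean, key=int)
```

Given the candidates `[("100", False), ("99", False), ("5", True)]`, the function returns `"99"`: the id `"5"` is excluded because its contextual data were flagged. Since the rule is deterministic, two collaborators working on separate datasets will pick the same id for the same person.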
