The Educational Times Database: Building a Database of ET/MQ Problems and Solutions, 2010s

Author(s): 
Robert M. Manzo (University of North Carolina at Chapel Hill)

In the 2010s Despeaux, a professor of mathematics at Western Carolina University (WCU), North Carolina, took up the task of migrating Tattersall’s tables of data, stored across multiple computer files, into an integrated online digital database. Her main point of reference in designing the ET database was the popular database Auteurs des Nouvelles Annales de Mathématiques, developed by a team of researchers from two French institutions: the Archives Henri Poincaré (at Université de Lorraine) and the Groupe d’Histoire et Diffusion des Sciences d’Orsay (at Université Paris-Sud, now part of Université Paris-Saclay). In their database, the French team indexed 5,128 mathematical questions and answers contributed by 1,852 authors, printed in the journal Nouvelles Annales de Mathématiques between 1842 and 1927 [“Accueil” 2018]. Users can search the Auteurs database by author name, year published, author profession, and fourteen other categories. Similarly, the ET database is searchable by author name and other attributes.

Despeaux served as project manager, but realization of the ET database required several additional skill sets to come together. Most obviously, in order to build a database, the expertise of a computer scientist was needed. Dr. Mark Holliday, a colleague of Despeaux’s in the Department of Mathematics and Computer Science at WCU, outlined the steps required to turn Tattersall’s data files into a functioning database. First, the data had to be “cleaned up,” which meant transferring the data from Tattersall’s numerous computer files to a single computer file and imposing a standard format on all data points belonging to the same category (e.g., names, dates, volume numbers). Second, computer code had to be written that would refer to the single data file and make that file’s data into a searchable digital database. The first step of data clean-up was handled by Despeaux and Robert Manzo, a graduate student assistant with experience in librarianship, data management, and historical research. In the second step, Holliday and four computer science students wrote the computer code in two separate phases. In the first phase, completed in 2016–17, students Grant Brown and Logan Schmidt worked with Holliday to create the first iteration of the database, comprised of a limited set of about 1,500 mathematical questions. A research grant awarded to Despeaux by the WCU College of Arts and Sciences funded this first phase of coding. In the second phase, in 2019–20, students Kevin Filanowski and Caleb Tupone expanded the size and functions of the database. To fund this phase of coding, as well as the bulk of data clean-up, Despeaux applied for and won the annual Hunter Scholar Award, given by WCU’s Hunter Library to support data-focused projects.

To start the process of data clean-up, Holliday and Despeaux had to decide how to arrange Tattersall’s many computer files into a single file. Over the years, Tattersall had compiled his data into twenty-three separate files, four containing biographical information on authors and nineteen containing a list of about 18,000 proposed questions and about 1.5 times more solutions (since some problems elicited multiple solutions). The actual problems and solutions were not copied out, only information on who asked the question, who solved it, what type it was, and in which volume of the ET/MQ it appeared. In other words, Tattersall was dealing with metadata. Metadata indicates who communicated with whom, about what, and when. Per Holliday’s recommendation, the twenty-three separate files were integrated into a single unified file that could be referenced by the database’s code.

The unified file was created using Microsoft Excel. The file consisted of five large tables: authors, countries, question types, proposers, and solvers. The authors table was a master list of all contributors to the ET/MQ, with additional columns giving their gender, country of origin or primary residence, a biography, and the source for the biography. The countries table listed forty-four different countries and assigned each a numerical value, 1 to 44. (Each country number was used to fill in values in the country-of-origin column in the authors table.) The question types table listed and assigned a numerical value to about fifty different categories of mathematics questions (see Table 1). These values are used in the proposers table to indicate the type of each proposed question. The data entered into the countries and question types tables were straightforwardly copied from lists created by Tattersall. Originally, the numerical codes he created had acted as shorthand for Tattersall, so that he did not have to write out the full names of countries in every author’s biography, nor fully write out question types next to every problem listed. The codes were retained for the Excel file for the same reasons that Tattersall used them: as labor- and space-saving devices. Using the codes allowed the authors, proposers, and solvers tables to remain relatively compact and human-manageable. Lastly, the proposers and solvers tables are what their titles indicate: numbered lists of every question proposed and every solution offered in the pages of the ET/MQ, along with their respective authors, date published, volume number, and question type.

Excerpt from Excel spreadsheet showing Educational Times problem solvers.

Figure 12. A screenshot of the "solvers" table in the unified Excel workbook file. Note that female contributors are highlighted in red. The columns, from left to right, show the Excel row number, the ET problem number, the solver's first name, solver's unique ID, solver's display name, ET volume number and date, and problem type.

Unfortunately, in moving data from Tattersall’s files to the single Excel file, data could rarely be copied and pasted directly due to formatting issues. Author names had to be copied with Tattersall’s original formatting intact, since he had color-coded some names to indicate gender, language, or a pseudonym. But there was no easy way to copy data in bulk while also ensuring that color codes and other formatting features were retained. Another issue arose in later stages of computer coding, when Tupone and Filanowski dealt with the formatting of dates. The SQL code underlying the database required all publication dates to include a day, not just a month and a year. However, the ET was published monthly, without a day of the month stamped on the cover. The solution to this problem remains a work in progress. Currently, the database displays “01” for the day in most publication dates, until a satisfactory workaround can be implemented.

The most serious formatting issue involved names. Individual names needed to share a consistent spelling across all tables in which they appear, including solvers, proposers, and authors. The database team wanted to ensure that different spellings and/or forms of the same name were associated with the singular person who was known by those multiple forms. In the original data set, for example, one particular author was alternately referred to as “J.W. Sharpe” and “James William Sharpe.” Several authors had titles such as “General” or “Prince” printed before their names in some issues of ET but not in others. A few authors were credited as “Dr.” in one issue but “Prof.” in another. To avoid causing the database to dissociate data that should be linked, Despeaux directed that an additional column be added to the proposers, solvers, and authors tables. The additional column gave every author a unique identifier, typically the author’s last name followed by an underscore and a numeral. By this method, every author received both a “display name” and a “unique ID.” For example, one author who was variously credited as “A. Morel” or “Prof. Morel” was given the unique identifier “Morel.” The database would show the “display name” to the end user, while referencing the “unique ID” on the database’s back-end. The inclusion of a unique identifier also resolved confusion over different authors with the same last name. For example, author J.L. Jenkins has the identifier “Jenkins,” J.S. Jenkins has “Jenkins_2,” Morgan Jenkins has “Jenkins_4,” and so on. In the end, most formatting issues were resolved satisfactorily, and the team achieved the goal of a smoothly functioning ET database.