Setting up the environment
knitr::opts_chunk$set(echo = TRUE, error=TRUE)
devtools::install_github("mplex/cedhar", subdir="pkg/sdam")
#install.packages("rjson")
#install.packages("tidyverse")
#install.packages("getPass")
#install.packages("formatR")
library(tidyverse)
library(tidytext)
library(dplyr)
library(stringr)
library(sdam)
library(rjson)
library(getPass)
library(formatR)
Before we attempt the cleaning itself, we need to build the cleaning blocks. Once the cleaning blocks are ready we can put them together based on the desired outcome.
I have created three categories of building blocks, closely linked with the methodological approach and the purpose of the cleaning process.
Structure of a cleaning block:
Each of the cleaning blocks has the same structure. Regular expressions are used to find and replace the searched term or pattern.
regexpatternname <- c("regexpattern", "substitutionpattern")
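As a minimal sketch of how such a block might be applied, the helper below passes the first element to gsub() as the pattern and the second as the replacement. The names example_block and apply_block are assumptions made for illustration and are not part of the original scripts.

```r
# A cleaning block: element 1 is the regex pattern, element 2 the replacement.
# example_block is a hypothetical block that drops text in parentheses.
example_block <- c("\\([^(]*\\)", "")

# Hypothetical helper (not in the original scripts): apply a single block.
apply_block <- function(block, x) {
  gsub(pattern = block[1], replacement = block[2], x = x, perl = TRUE)
}

abbreviation_demo <- apply_block(example_block, "Aur(elius) Valerius")
print(abbreviation_demo)
```

Keeping pattern and replacement together in one vector means a block can be moved around in the cleaning sequence without touching the code that applies it.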
The aim of this model is to produce a clean text that is as close to the original text of an inscription as possible.
The cleaned output of the conservative model stays as close to the original text as possible. In most cases it should resemble a diplomatic edition of the epigraphic text, with spaces between words, lowercase letters, and brackets and non-UTF-compliant symbols removed.
Aim: All expanded abbreviations in parentheses () will be eliminated from the clean text (substituted with "").
Before: Αὐρ(ήλιος) Οὐαλέριος
After: Αὐρ Οὐαλέριος
expanded_abbreviations_conservative <- c("\\([^(]*\\)", "")
Aim: All suppressions in curly braces {} followed by one or more superscript digits will be eliminated from the clean text (substituted with "").
!!! It is crucial that block 3. Suppression of a text does not precede block 2. Suppression of a text with superscripts, otherwise the regex pattern will not clean the text properly. This particular pattern is common in the PHI dataset and may or may not appear in other datasets.
Before: ἱερεὺς ληφθὶς ὑπὰ {²⁶ὑπὸ}²⁶ τῶν βαρβάρων
After: ἱερεὺς ληφθὶς ὑπὰ τῶν βαρβάρων
suppresion_superscripts_conservative <- c("\\{[^}]*\\}[⁰¹²³⁴⁵⁶⁷⁸⁹]+", "")
Aim: All curly braces {} will be eliminated from the clean text (substituted with ""), while the contents of the braces will remain in the text.
!!! It is crucial that block 3. Suppression of a text does not precede block 2. Suppression of a text with superscripts, otherwise the regex pattern will not clean the text properly.
Before: Σεβαστοῦ υἱοῦ {θ̣εοῦ Σεβαστοῦ} τύχης
After: Σεβαστοῦ υἱοῦ θ̣εοῦ Σεβαστοῦ τύχης
suppresion_conservative <- c("[\\{*\\}]", "")
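The ordering warning can be illustrated with a short sketch that runs the two suppression blocks in both orders on the PHI example. The helper apply_block is hypothetical and only wraps gsub(); the braces in the first pattern are escaped here for portability.

```r
# Blocks as described in the text (braces escaped for portability).
suppresion_superscripts_conservative <- c("\\{[^}]*\\}[⁰¹²³⁴⁵⁶⁷⁸⁹]+", "")
suppresion_conservative <- c("[\\{*\\}]", "")

# Hypothetical helper for illustration only.
apply_block <- function(block, x) gsub(block[1], block[2], x, perl = TRUE)

x <- "ἱερεὺς ληφθὶς ὑπὰ {²⁶ὑπὸ}²⁶ τῶν βαρβάρων"

# Recommended order: superscripted suppressions first, plain braces second.
right_order <- apply_block(suppresion_conservative,
                           apply_block(suppresion_superscripts_conservative, x))

# Reversed order: the braces vanish first, so the second pattern finds nothing.
wrong_order <- apply_block(suppresion_superscripts_conservative,
                           apply_block(suppresion_conservative, x))

print(right_order)
print(wrong_order)
```

In the reversed order the suppressed word and its superscripts stay in the text, and a later block that strips superscript digits would leave the word itself behind.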
Aim: All restorations in square brackets [] will be eliminated from the clean text (substituted with "").
!!! Beware that by eliminating the contents of the brackets you may lose some context - use at your own discretion.
Before: [Ν]ανα Ἕλληνο̣[ς] θυγάτηρ καὶ ἡ ἑτέρα [γυνὴ]
After: ανα Ἕλληνο θυγάτηρ καὶ ἡ ἑτέρα
restoration_conservative <- c("\\[[^[]*\\]", "")
Aim: All substitutions that are in the angular brackets <> will be eliminated from the clean text (substituted with "").
!!! Beware that by eliminating the contents of the brackets you may lose some context - use at your own discretion.
Before: κωρο<ν Ἀ>ντιόχ<ου> ἡ πατρὶς τειμῆ<ς>
After: κωροντιόχ ἡ πατρὶς τειμῆ
substitution_conservative <- c("<[^<]*>", "")
Aim: All substitutions following the pattern “A=B” will be cleaned the following way: B remains in the text, and the equal sign and A are eliminated from the clean text.
!!! Beware that by eliminating the brackets you may lose some information about the preservation of the text - use at your own discretion. The substitution_edh_conservative script should be run before the substitution_conservative script, otherwise the regex pattern will not clean the text properly. The substitution_conservative script will clean the angular brackets and their contents in the next step.
Before: pos<u=I>erunt bene merenti
After: pos<I>erunt bene merenti
substitution_edh_conservative <- c("([α-ωΑ-Ωa-zA-Z])=([α-ωΑ-Ωa-zA-Z])", "\\2")
The aim of this model is to produce a clean text that is enriched with editorial interpretations of the original text.
The interpretive model produces an epigraphic text with as many editorial suggestions, restorations, corrections, and improvements as possible, in order to recover as much of the contents of the inscription as possible. Brackets and non-UTF-compliant symbols are eliminated.
Aim: All parentheses () will be eliminated from the clean text (substituted with ""), while the contents of the parentheses will remain in the text.
Before: Αὐρ(ήλιος) Οὐαλέριος
After: Αὐρήλιος Οὐαλέριος
expanded_abbreviations_interpretive <- c("[\\(*\\)]", "")
Aim: Contents found within curly braces {} followed by one or more superscript digits will substitute the word immediately preceding the curly braces, see example.
!!! It is crucial that block 3. Suppression of a text does not precede block 2. Suppression of a text with superscripts, otherwise the regex pattern will not clean the text properly. This particular pattern is common in the PHI dataset and may or may not appear in other datasets.
Before: ἱερεὺς ληφθὶς ὑπὰ {²⁶ὑπὸ}²⁶ τῶν βαρβάρων
After: ἱερεὺς ληφθὶς ὑπὸ τῶν βαρβάρων
suppresion_superscripts_interpretive <- c(" [^ ]+ \\{([⁰¹²³⁴⁵⁶⁷⁸⁹]+)([^}]+)\\}\\1", " \\2")
Note: the script will not work if there is no text preceding the curly braces. To eliminate the curly braces with superscripts together with their contents, use the suppresion_superscripts_conservative script. However, it is recommended to run the suppresion_superscripts_conservative script after the suppresion_superscripts_interpretive script, otherwise the regex pattern will not clean the text properly.
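The behaviour described in the note can be checked with a small sketch. It assumes the two superscript blocks as given in the text (with braces escaped) and a hypothetical apply_block helper, and runs them on a line with and without a word in front of the braces.

```r
suppresion_superscripts_interpretive <- c(
  " [^ ]+ \\{([⁰¹²³⁴⁵⁶⁷⁸⁹]+)([^}]+)\\}\\1", " \\2")
suppresion_superscripts_conservative <- c("\\{[^}]*\\}[⁰¹²³⁴⁵⁶⁷⁸⁹]+", "")

# Hypothetical helper for illustration only.
apply_block <- function(block, x) gsub(block[1], block[2], x, perl = TRUE)

with_preceding_word <- "ἱερεὺς ληφθὶς ὑπὰ {²⁶ὑπὸ}²⁶ τῶν βαρβάρων"
no_preceding_word <- "{²⁶ὑπὸ}²⁶ τῶν βαρβάρων"

# The word before the braces is replaced by the contents of the braces.
replaced <- apply_block(suppresion_superscripts_interpretive, with_preceding_word)

# Without a preceding word the interpretive pattern does not match at all ...
untouched <- apply_block(suppresion_superscripts_interpretive, no_preceding_word)

# ... so the conservative block, run afterwards, removes the leftover braces.
fallback <- apply_block(suppresion_superscripts_conservative, untouched)

print(replaced)
print(fallback)
```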
Aim: All curly braces {} will be eliminated from the clean text (substituted with ""), while the contents of the braces will remain in the text.
!!! It is crucial that block 3. Suppression of a text does not precede block 2. Suppression of a text with superscripts, otherwise the regex pattern will not clean the text properly. Because editors of epigraphic corpora use {} ambiguously, the right choice depends on the specific dataset and the way the curly braces were used in it. If you wish to keep the text within the braces, use the suppresion_keep_interpretive script; if you wish to remove the text in the braces, use the suppresion_remove_interpretive script.
Before: θ̣εοῦ Σεβαστοῦ υἱοῦ {θ̣εοῦ Σεβαστοῦ} τύχης
After (suppresion_keep_interpretive): θ̣εοῦ Σεβαστοῦ υἱοῦ θ̣εοῦ Σεβαστοῦ τύχης
After (suppresion_remove_interpretive): θ̣εοῦ Σεβαστοῦ υἱοῦ τύχης
suppresion_keep_interpretive <- c("[\\{*\\}]", "")
OR
suppresion_remove_interpretive <- c("\\{[^}]*\\}", "")
Aim: All square brackets [] will be eliminated from the clean text (substituted with ""), while the contents of the brackets will remain in the text.
!!! Beware that by eliminating the brackets you may lose some information about the preservation of the text - use at your own discretion.
Before: [Ν]ανα Ἕλληνο̣[ς] θυγάτηρ καὶ ἡ ἑτέρα [γυνὴ]
After: Νανα Ἕλληνο̣ς θυγάτηρ καὶ ἡ ἑτέρα γυνὴ
restoration_interpretive <- c("[\\[*\\]]", "")
Aim: All angular brackets <> will be eliminated from the clean text (substituted with ""), while the contents of the brackets will remain in the text.
!!! Beware that by eliminating the brackets you may lose some information about the preservation of the text - use at your own discretion.
Before: κωρο<ν Ἀ>ντιόχ<ου> ἡ πατρὶς τειμῆ<ς>
After: κωρον Ἀντιόχου ἡ πατρὶς τειμῆς
substitution_interpretive <- c("[\\<*\\>]", "")
Aim: All substitutions following the pattern “A=B” will be cleaned the following way: A remains in the text, and the equal sign and B are eliminated from the clean text.
!!! Beware that by eliminating the brackets you may lose some information about the preservation of the text - use at your own discretion. The substitution_edh_interpretive script should be run before the substitution_interpretive script, otherwise the regex pattern will not clean the text properly. The substitution_interpretive script will clean the angular brackets in the next step.
Before: pos<u=I>erunt bene merenti
After: pos<u>erunt bene merenti
substitution_edh_interpretive <- c("([α-ωΑ-Ωa-zA-Z])=([α-ωΑ-Ωa-zA-Z])", "\\1")
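A short sketch of the two-step EDH substitution, using the blocks as given in the text and a hypothetical apply_block helper: the first step resolves “A=B” inside the angular brackets, the second removes the brackets themselves.

```r
substitution_edh_conservative <- c("([α-ωΑ-Ωa-zA-Z])=([α-ωΑ-Ωa-zA-Z])", "\\2")
substitution_edh_interpretive <- c("([α-ωΑ-Ωa-zA-Z])=([α-ωΑ-Ωa-zA-Z])", "\\1")
substitution_interpretive <- c("[\\<*\\>]", "")

# Hypothetical helper for illustration only.
apply_block <- function(block, x) gsub(block[1], block[2], x, perl = TRUE)

x <- "pos<u=I>erunt bene merenti"

# Conservative variant keeps B, the letter after the equal sign.
conservative_step <- apply_block(substitution_edh_conservative, x)

# Interpretive variant keeps A, the letter before the equal sign.
interpretive_step <- apply_block(substitution_edh_interpretive, x)

# Next step of the interpretive model: drop the angular brackets.
interpretive_final <- apply_block(substitution_interpretive, interpretive_step)

print(conservative_step)
print(interpretive_final)
```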
The aim of the generic cleaning is to strip the epigraphic text of any non-UTF-compliant symbols and characters that do not adhere to the principles of a ‘tidy text’ analysis.
The final output of the cleaning depends on which of the individual cleaning blocks are included in the cleaning script. Each individual block represents one step of the cleaning process, and the user can modify all the steps to reach the intended outcome. All the cleaning steps depend on the characteristics of the original dataset, therefore familiarity with the original dataset prior to the cleaning process is recommended. Each dataset can have a different set of symbols and characters to be cleaned, thus the cleaning blocks should be adjusted accordingly.
Aim: All square brackets [] containing one or more “—” will be eliminated from the clean text (substituted with "").
!!! The script lacuna1 should be run before the restoration_conservative and restoration_interpretive scripts, otherwise the regex pattern will not clean the text properly.
Before: [— — —]ης θεῷ Φοίβῳ
After: ης θεῷ Φοίβῳ
lacuna1 <- c("\\[[— ]+\\]", "")
Note: If there is a text within the square bracket, e.g. προύχον[τος — — —], script restoration_interpretive will eliminate the square brackets, the script interpunction_symbols will clean the “—” and the script multi_whitespace will eliminate the extra whitespaces. Therefore the scripts restoration_interpretive(1), interpunction_symbols(2) and multi_whitespace(3) should be used in combination and in the indicated sequence.
Aim: All square brackets [] containing one or more “.” will be eliminated from the clean text (substituted with "").
!!! The script lacuna2 should be run before the restoration_conservative and restoration_interpretive scripts, otherwise the regex pattern will not clean the text properly.
Before: [․․]ω Διὶ καὶ Ἥρᾳ
After: ω Διὶ καὶ Ἥρᾳ
lacuna2 <- c("\\[[․]+\\]", "")
Note: If there is a text within the square bracket, e.g. προύχον[τος — — —], script restoration_interpretive will eliminate the square brackets, the script interpunction_symbols will clean the “—” and the script multi_whitespace will eliminate the extra whitespaces. Therefore the scripts restoration_interpretive(1), interpunction_symbols(2) and multi_whitespace(3) should be used in combination and in the indicated sequence.
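The combination described in the note can be sketched as follows, using the blocks as they are defined in this document and a hypothetical apply_block helper. The first input is an empty lacuna handled by lacuna1; the second is a bracket that contains text, which passes through the three-step sequence.

```r
lacuna1 <- c("\\[[— ]+\\]", "")
restoration_interpretive <- c("[\\[*\\]]", "")
interpunction_symbols <- c("[,|\\.|․|·|!|\\-|—|–|#|%|\\^|&|\\*|~|:|;|@]", " ")
multi_whitespace <- c("\\s+", " ")

# Hypothetical helper for illustration only.
apply_block <- function(block, x) gsub(block[1], block[2], x, perl = TRUE)

# Bracket containing only dashes: removed in one step by lacuna1.
empty_lacuna <- apply_block(lacuna1, "[— — —]ης θεῷ Φοίβῳ")

# Bracket containing text: lacuna1 does not match, so three blocks run in order.
partial_lacuna <- "προύχον[τος — — —]"
partial_lacuna <- apply_block(restoration_interpretive, partial_lacuna)  # (1) brackets
partial_lacuna <- apply_block(interpunction_symbols, partial_lacuna)     # (2) dashes
partial_lacuna <- apply_block(multi_whitespace, partial_lacuna)          # (3) spaces

print(empty_lacuna)
print(partial_lacuna)
```

A single trailing space remains after step (3); it is the whitespace_endline block that removes it at the very end of the cleaning.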
Aim: All instances of the following strings “vacat, vac, vac., v.” will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by multi_whitespace script in the following steps.
!!! If your dataset contains Latin inscriptions, you may want to check whether the vacat script eliminates more than anticipated, e.g. parts of words containing the string “vacat” or “vac”. If so, adjust the cleaning block accordingly, i.e. remove “vac”, or don’t use it.
Before: Ἡρακλείδα vacat χαῖρε.
After: Ἡρακλείδα χαῖρε.
vacat <- c("(vacat|vac|vac\\.|v\\.)", " ")
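One possible adjustment for Latin texts, sketched here under the assumption that the editorial marks always stand as separate words, is to anchor the pattern with word boundaries. The block vacat_bounded and the test line are invented for illustration and are not part of the original scripts. Note that by default \b in gsub(perl = TRUE) treats only ASCII letters, digits and the underscore as word characters.

```r
vacat <- c("(vacat|vac|vac\\.|v\\.)", " ")

# Hypothetical variant: only match the marks when they start at a word boundary.
vacat_bounded <- c("\\b(?:vacat|vac)\\b\\.?|\\bv\\.", " ")

# Hypothetical helper for illustration only.
apply_block <- function(block, x) gsub(block[1], block[2], x, perl = TRUE)

# Invented line: a Latin word containing "vac" plus two editorial marks.
latin_line <- "evacuatio vac. dis manibus vacat"

unbounded <- apply_block(vacat, latin_line)          # also cuts into "evacuatio"
bounded <- apply_block(vacat_bounded, latin_line)    # leaves "evacuatio" intact

print(unbounded)
print(bounded)
```

Whether the bounded variant is appropriate depends on the dataset, so its output should be verified on a sample before it replaces the original block.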
Aim: All instances of editorial strings in parentheses, such as (vel sim.), will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by the multi_whitespace script in the following steps.
!!! The editorial_notes script should run before the expanded_abbreviations_conservative and expanded_abbreviations_interpretive scripts, otherwise the regex pattern will not clean the text properly.
Before: Ἥρωι (vel sim.) Καλλισθένης
editorial_notes <- c("\\(vel sim\\.\\)", " ")
Aim: All instances of the in-line symbols for a new line (| or /) will be eliminated (substituted with "").
Before: Λάμπρη Τ̣ελεσήνορ|ος γυνή.
After: Λάμπρη Τ̣ελεσήνορος γυνή
new_line <- c("[\\||\\/]", "")
Aim: All words split between two lines with a dash (-) will be rejoined: the dash and the line break are eliminated (substituted with "").
Before: ἀρχιερέως καὶ εὐποσιάρ-\nχου μηνὸς
After: ἀρχιερέως καὶ εὐποσιάρχου μηνὸς
split_word_multiline <- c("-\\n", "")
Aim: All instances of erased text (〚—〛) will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by multi_whitespace script in the following steps.
Before: Ἀρτέμιδι 〚— — —〛 ἐπηκόοις.
After: Ἀρτέμιδι ἐπηκόοις.
erasure_empty <- c("〚[— ]+〛", " ")
Aim: All instances of double brackets for erasures (〚 〛) will be eliminated (substituted with "") and the contents of the double brackets will be preserved as part of the clean text.
Before: Ἀμύντωρ Νουμηνίου 〚χαῖρε〛. καὶ ἡ γυνὴ αὐτοῦ
After: Ἀμύντωρ Νουμηνίου χαῖρε. καὶ ἡ γυνὴ αὐτοῦ
erasure_new_text <- c("[〚〛]", "")
Aim: All instances of a dubious reading marked by the subscript dot (Unicode U+0323) will be eliminated (substituted with "").
!!! The dubious_dot_subscript script should run as the first step of the cleaning, otherwise the letters might shift and the regex pattern will not clean the text properly.
Before: Ἀ̣πό̣λ̣λ̣ωνος
After: Ἀπόλλωνος
dubious_dot_subscript <- c("\u{0323}", "")
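A small sketch of what this block does and does not match, written with Unicode escapes so that the code points are explicit. The test strings are invented; the point is that the pattern targets the combining dot as a separate code point, so a letter stored in precomposed form (one code point that already includes the dot) passes through unchanged.

```r
dubious_dot_subscript <- c("\u0323", "")

# Letters followed by a separate combining dot below (U+0323).
decomposed <- "a\u0323b\u0323c"

# A single precomposed code point: Latin small letter b with dot below (U+1E05).
precomposed <- "\u1E05"

cleaned_decomposed <- gsub(dubious_dot_subscript[1], dubious_dot_subscript[2],
                           decomposed, perl = TRUE)
cleaned_precomposed <- gsub(dubious_dot_subscript[1], dubious_dot_subscript[2],
                            precomposed, perl = TRUE)

print(cleaned_decomposed)
print(identical(cleaned_precomposed, precomposed))
```

If a dataset turns out to contain precomposed letters, one option is to decompose the text first, for example with stringi::stri_trans_nfd(), before running this block.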
Aim: All instances of the listed interpunction symbols (,.․·!-—–#%^&*~:;@) will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by the multi_whitespace script in the following steps. The slash (/) is not part of this block; it is removed by the new_line script.
Before: Φιλήτη # θεᾷ Μαλοφόρῳ or κεῖμαι πρόμοιρος Ἑρμογένης τυμβευθείς· /ἀγὼν
After: Φιλήτη θεᾷ Μαλοφόρῳ or κεῖμαι πρόμοιρος Ἑρμογένης τυμβευθείς ἀγὼν
interpunction_symbols <- c("[,|\\.|․|·|!|\\-|—|–|#|%|\\^|&|\\*|~|:|;|@]", " ")
Aim: All instances of superscripted numbers will be eliminated (substituted with "").
!!! The superscript_numbers should not be run before the suppresion_superscripts_conservative or suppresion_superscripts_interpretive script, otherwise the Regex pattern would not clean the text properly.
Before: Αὐρ(ήλιος) Διονύσιος #⁵⁶ βʹ #⁵⁶
After: Αὐρ(ήλιος) Διονύσιος # βʹ #
superscript_numbers <- c("[⁰¹²³⁴⁵⁶⁷⁸⁹]+", "")
Aim: All instances of the listed specialised epigraphic symbols, such as the haedera (❦), will be eliminated (substituted with "").
Before: ἀγαθῆι ❦ τύχηι
After: ἀγαθῆι τύχηι
epigraphic_symbols <- c("[❦|∙|𐆖|⏑|⏓|⏕]", "")
Aim: All instances of the listed symbols marking uncertainty (?) will be replaced by a space (substituted with " "). If there is any extra whitespace, it will be cleaned by the multi_whitespace script in the following steps.
Before: χαῖρε?
After: χαῖρε
uncertainty_symbols <- c("[?]", " ")
Aim: All instances of the end-of-line symbol (\n) will be replaced by a space (substituted with " ").
Before: καὶ ἄρξαντα\nτοῦ κοινοῦ
After: καὶ ἄρξαντα τοῦ κοινοῦ
end_line <- c("\\n", " ")
Aim: All instances of extra blank space (“ ”) will be replaced by space (substituted with " ").
Before: ἀγαθῆι τύχηι.
After: ἀγαθῆι τύχηι.
extra_blank <- c("[ ]+", " ")
Aim: All instances of more than one whitespace character next to each other will be replaced by a single space (substituted with " ").
!!! The multi_whitespace script should run as the second-to-last cleaning block to ensure all redundant whitespace is cleaned from the text.
Before: Ἡρακλείδα χαῖρε.
After: Ἡρακλείδα χαῖρε.
multi_whitespace <- c("\\s+", " ")
Aim: All instances of whitespace " " at the beginning and end of the line will be eliminated (substituted with "").
!!! The whitespace_endline should run as the last cleaning block to ensure all redundant white spaces are cleaned from the text.
Before: χαῖρε
After: χαῖρε
whitespace_endline <- c("(^\\s|\\s$)", "")
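Why these two blocks come last, and in this order, can be seen in a short sketch. It uses the two whitespace blocks as given in the text and a hypothetical apply_block helper; the input string with repeated spaces is invented for illustration.

```r
multi_whitespace <- c("\\s+", " ")
whitespace_endline <- c("(^\\s|\\s$)", "")

# Hypothetical helper for illustration only.
apply_block <- function(block, x) gsub(block[1], block[2], x, perl = TRUE)

# Invented input with doubled spaces at both ends and between the words.
x <- "  Ἡρακλείδα   χαῖρε  "

# Recommended order: collapse runs of whitespace, then trim the ends.
right_order <- apply_block(whitespace_endline, apply_block(multi_whitespace, x))

# Reversed order: only one space is trimmed from each end before collapsing.
wrong_order <- apply_block(multi_whitespace, apply_block(whitespace_endline, x))

print(right_order)
print(wrong_order)
```

Because whitespace_endline removes a single whitespace character at each end, it relies on multi_whitespace having reduced every run to one space beforehand.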
Aim: All instances of editorial comments in Latin alphabet that are enclosed in curly braces {} with superscript numbers will be eliminated (substituted with "").
!!! If your dataset contains Latin inscriptions, use this script with caution. Verify first that running the script does not eliminate any necessary information or text. This block has been specifically designed for the interpretive cleaning of the PHI Greek Inscription dataset and it should run before the suppresion_superscripts_interpretive and suppresion_keep_interpretive scripts, otherwise the regex pattern will not clean the text properly.
Before: ἀγαθῆι τύχηι. {²in parte inferiore altera manu incisa est:}² ὑπὲρ τῆς τοῦ
After: ἀγαθῆι τύχηι. ὑπὲρ τῆς τοῦ
editorial_comments_latin <- c("\\{([⁰¹²³⁴⁵⁶⁷⁸⁹]+)([a-zA-Z0-9][^}]+)\\}\\1", "")
Aim: All instances of arabic numerals (0-9) will be eliminated (substituted with "").
!!! If your dataset contains arabic numerals that you would like to keep, use this script with caution. Verify first that running the script does not eliminate any necessary information or text. This block has been specifically designed for the interpretive cleaning of the PHI Greek Inscription dataset and it should run before the multi_whitespace and whitespace_endline scripts, otherwise the regex pattern will not clean the text properly.
Before: ἡ γυνὴ αὐτοῦ ΦιλΙ̣ 4 5 καὶ
After: ἡ γυνὴ αὐτοῦ ΦιλΙ καὶ
arabic_numerals <- c("[0-9]+", "")
Aim: All instances of unclosed brackets will be eliminated (substituted with "").
!!! Use the unclosed_brackets script immediately before multi_whitespace and whitespace_endline scripts, otherwise the Regex pattern would not clean the text properly.
Before: ummio isenna Xv [
After: ummio isenna Xv
unclosed_brackets <- c("[\\[|\\{|\\(|\\)|\\}|\\]]", "")
When we have established the individual building blocks, we can put them together in the right sequence and build a cleaning function in R for the conservative and interpretive models.
Source: https://epigraphy.packhum.org/
First, we need to load the provided test dataset PHI_IGBulg-I.csv located in the test_data folder and create an object dirtytext containing the text to be cleaned. Use the getwd() function to make sure you are in the right working directory, so that the read_csv code works for you. If not, adjust the path.
getwd()
## [1] "/home/petra/Github/epigraphic_cleaning/scripts/R"
text <- read_csv("../../test_data/PHI_IGBulg-I.csv")
dirtytext <- as.data.frame(select(text, hdr2, data))
Aim: to have a clean text that is as close to the original inscription as preserved on the medium.
cleaning_conservative <- function(epigraphic_dataset){
clean_text <- gsub(pattern=dubious_dot_subscript[1], replacement=dubious_dot_subscript[2], x=epigraphic_dataset, perl=TRUE)
clean_text <- gsub(pattern=lacuna1[1], replacement=lacuna1[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=lacuna2[1], replacement=lacuna2[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=vacat[1], replacement=vacat[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=editorial_notes[1], replacement=editorial_notes[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=expanded_abbreviations_conservative[1], replacement=expanded_abbreviations_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_superscripts_conservative[1], replacement=suppresion_superscripts_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_conservative[1], replacement=suppresion_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=restoration_conservative[1], replacement=restoration_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_conservative[1], replacement=substitution_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=new_line[1], replacement=new_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=split_word_multiline[1], replacement=split_word_multiline[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_empty[1], replacement=erasure_empty[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_new_text[1], replacement=erasure_new_text[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=interpunction_symbols[1], replacement=interpunction_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=superscript_numbers[1], replacement=superscript_numbers[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=epigraphic_symbols[1], replacement=epigraphic_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=uncertainty_symbols[1], replacement=uncertainty_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=end_line[1], replacement=end_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=extra_blank[1], replacement=extra_blank[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=arabic_numerals[1], replacement=arabic_numerals[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=multi_whitespace[1], replacement=multi_whitespace[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=whitespace_endline[1], replacement=whitespace_endline[2], x=clean_text, perl=TRUE)
return(clean_text)
}
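The function above repeats one gsub() call per block. A more compact alternative, sketched here under the assumption that every block is a two-element character vector as described earlier, keeps the sequence in a list and folds the text through it with Reduce(). The helper apply_blocks and the shortened demo_sequence are hypothetical and do not reproduce the full model.

```r
lacuna1 <- c("\\[[— ]+\\]", "")
restoration_interpretive <- c("[\\[*\\]]", "")
multi_whitespace <- c("\\s+", " ")
whitespace_endline <- c("(^\\s|\\s$)", "")

# Hypothetical helper: pass a character vector through a list of blocks, in order.
apply_blocks <- function(x, blocks) {
  Reduce(function(text, block) {
    gsub(pattern = block[1], replacement = block[2], x = text, perl = TRUE)
  }, blocks, x)
}

# Shortened sequence for illustration; the full models use many more blocks.
demo_sequence <- list(lacuna1, restoration_interpretive,
                      multi_whitespace, whitespace_endline)

demo_output <- apply_blocks(c("[— — —]ης θεῷ Φοίβῳ", " [Ν]ανα "), demo_sequence)
print(demo_output)
```

With this layout the order of the cleaning steps is visible in one place, and adapting the model to another dataset means editing the list rather than the function body.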
Original text of an inscription IGBulg I² 15(3) before cleaning:
[— — — — — — — — — — — — — — —]
[— — —δόντα καὶ διανομ]ὰ̣ς̣ τ̣ῇ̣ τ̣ε̣ κ̣ρ̣α̣-
[τί]σ̣τ̣ῃ βουλῇ καὶ ἀγορανόμοις καὶ
[ταῖ]ς ἑπτὰ φυλαῖς καὶ τοῖς ὑμνοῦσι
τοὺς Σεβαστοὺς καὶ ἀγοραίοις, ἰ-
α̣τροῖς, παιδευταῖς καὶ τοῖς παρε-
{[πα]ρ̣ε̣}π̣ιδη̣μήσα̣σιν {²⁶παρεπιδημήσασιν}²⁶ τῆ̣ς̣ Π̣ε̣ντ[α]-
[πόλεως βουλευταῖς — — — — —]
[— — — — — — — — — — — — —]
Output of the cleaning_conservative function:
example_conservative <- as.data.frame(cleaning_conservative(dirtytext$data))
example_conservative[30,]
## [1] "ὰς τῇ τε κραστῃ βουλῇ καὶ ἀγορανόμοις καὶ ς ἑπτὰ φυλαῖς καὶ τοῖς ὑμνοῦσι τοὺς Σεβαστοὺς καὶ ἀγοραίοις ἰατροῖς παιδευταῖς καὶ τοῖς παρερεπιδημήσασιν τῆς Πεντ"
Aim: to have a clean text enriched by editorial interpretations and reconstructions of the text (to have as rich text of an inscription as possible).
cleaning_interpretive <- function(epigraphic_dataset){
clean_text <- gsub(pattern=dubious_dot_subscript[1], replacement=dubious_dot_subscript[2], x=epigraphic_dataset, perl=TRUE)
clean_text <- gsub(pattern=lacuna1[1], replacement=lacuna1[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=lacuna2[1], replacement=lacuna2[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=vacat[1], replacement=vacat[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=editorial_notes[1], replacement=editorial_notes[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=editorial_comments_latin[1], replacement=editorial_comments_latin[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=expanded_abbreviations_interpretive[1], replacement=expanded_abbreviations_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_superscripts_interpretive[1], replacement=suppresion_superscripts_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_keep_interpretive[1], replacement=suppresion_keep_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=restoration_interpretive[1], replacement=restoration_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_interpretive[1], replacement=substitution_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=new_line[1], replacement=new_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=split_word_multiline[1], replacement=split_word_multiline[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_empty[1], replacement=erasure_empty[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_new_text[1], replacement=erasure_new_text[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=interpunction_symbols[1], replacement=interpunction_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=superscript_numbers[1], replacement=superscript_numbers[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=epigraphic_symbols[1], replacement=epigraphic_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=uncertainty_symbols[1], replacement=uncertainty_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=end_line[1], replacement=end_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=extra_blank[1], replacement=extra_blank[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=arabic_numerals[1], replacement=arabic_numerals[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=multi_whitespace[1], replacement=multi_whitespace[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=whitespace_endline[1], replacement=whitespace_endline[2], x=clean_text, perl=TRUE)
return(clean_text)
}
Original text of an inscription IGBulg I² 15(3) before cleaning:
[— — — — — — — — — — — — — — —]
[— — —δόντα καὶ διανομ]ὰ̣ς̣ τ̣ῇ̣ τ̣ε̣ κ̣ρ̣α̣-
[τί]σ̣τ̣ῃ βουλῇ καὶ ἀγορανόμοις καὶ
[ταῖ]ς ἑπτὰ φυλαῖς καὶ τοῖς ὑμνοῦσι
τοὺς Σεβαστοὺς καὶ ἀγοραίοις, ἰ-
α̣τροῖς, παιδευταῖς καὶ τοῖς παρε-
{[πα]ρ̣ε̣}π̣ιδη̣μήσα̣σιν {²⁶παρεπιδημήσασιν}²⁶ τῆ̣ς̣ Π̣ε̣ντ[α]-
[πόλεως βουλευταῖς — — — — —]
[— — — — — — — — — — — — —]
Output of the cleaning_interpretive function:
example_interpretive <- as.data.frame(cleaning_interpretive(dirtytext$data))
example_interpretive[30,]
## [1] "δόντα καὶ διανομὰς τῇ τε κρατίστῃ βουλῇ καὶ ἀγορανόμοις καὶ ταῖς ἑπτὰ φυλαῖς καὶ τοῖς ὑμνοῦσι τοὺς Σεβαστοὺς καὶ ἀγοραίοις ἰατροῖς παιδευταῖς καὶ τοῖς παρεπιδημήσασιν τῆς Πενταπόλεως βουλευταῖς"
Save the output of the cleaning_conservative and cleaning_interpretive functions together with the original contents of the dataset. Create a new directory outputs in the root folder if it does not exist.
clean_text <- text %>%
  mutate(clean_text_conservative = cleaning_conservative(data),
         clean_text_interpretive = cleaning_interpretive(data))
# dir.create("../../outputs")  # uncomment if the outputs directory does not exist yet
# `file` replaces the `path` argument, which is deprecated since readr 1.4.0
write_csv(clean_text, file = "../../outputs/PHI_IGBulg-I_clean_text.csv")
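Creating the output directory only when it is missing can be sketched with base R as follows. To stay self-contained the example writes to a temporary directory instead of "../../outputs", and the toy data frame and file name are invented for illustration.

```r
# Stand-in for "../../outputs"; tempdir() keeps the sketch self-contained.
output_dir <- file.path(tempdir(), "outputs")

# Create the directory (and any missing parents) only if it is not there yet.
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

# Invented toy data frame standing in for the cleaned dataset.
toy_clean_text <- data.frame(data = "[N]ana",
                             clean_text_interpretive = "Nana",
                             stringsAsFactors = FALSE)

output_file <- file.path(output_dir, "toy_clean_text.csv")
write.csv(toy_clean_text, output_file, row.names = FALSE, fileEncoding = "UTF-8")
print(file.exists(output_file))
```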
Source: https://edh-www.adw.uni-heidelberg.de/
First, we need to install several more packages and load the libraries in order to connect to Sciencedata.dk and access the dataset.
## your sciencedata username:
## Please enter password in TK window (Alt+Tab)
list_json <- fromJSON(resp)
## Error in fromJSON(resp): unexpected character '<'
EDH_tibble = as_tibble(list_json)
## Error in as_tibble(list_json): object 'list_json' not found
head(EDH_tibble)
## Error in head(EDH_tibble): object 'EDH_tibble' not found
Aim: to have a clean text that is as close to the original inscription as preserved on the medium - in the case of the EDH dataset, the diplomatic_text column should be similar to the output of the conservative cleaning model.
Since the dataset is mostly in Latin, I did not use the following cleaning scripts: vacat, editorial_notes, editorial_comments_latin, since they would eliminate some parts of the text that should not be eliminated. I am not using the suppresion_superscripts_conservative script because the EDH dataset does not contain curly braces followed by superscript numbers. The script unclosed_brackets has been added since the EDH dataset contains a lot of unclosed brackets of all kinds. The script substitution_edh_conservative was added to clean additional substitution features of the EDH dataset.
cleaning_conservative_edh <- function(epigraphic_dataset){
clean_text <- gsub(pattern=dubious_dot_subscript[1], replacement=dubious_dot_subscript[2], x=epigraphic_dataset, perl=TRUE)
clean_text <- gsub(pattern=lacuna1[1], replacement=lacuna1[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=lacuna2[1], replacement=lacuna2[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=expanded_abbreviations_conservative[1], replacement=expanded_abbreviations_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_conservative[1], replacement=suppresion_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=restoration_conservative[1], replacement=restoration_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_edh_conservative[1], replacement=substitution_edh_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_conservative[1], replacement=substitution_conservative[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=new_line[1], replacement=new_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=split_word_multiline[1], replacement=split_word_multiline[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_empty[1], replacement=erasure_empty[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_new_text[1], replacement=erasure_new_text[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=interpunction_symbols[1], replacement=interpunction_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=superscript_numbers[1], replacement=superscript_numbers[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=epigraphic_symbols[1], replacement=epigraphic_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=uncertainty_symbols[1], replacement=uncertainty_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=end_line[1], replacement=end_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=extra_blank[1], replacement=extra_blank[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=arabic_numerals[1], replacement=arabic_numerals[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=unclosed_brackets[1], replacement=unclosed_brackets[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=multi_whitespace[1], replacement=multi_whitespace[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=whitespace_endline[1], replacement=whitespace_endline[2], x=clean_text, perl=TRUE)
return(clean_text)
}
Transcription column of the first five inscriptions before cleaning:
print(EDH_tibble$transcription[1:5])
## Error in print(EDH_tibble$transcription[1:5]): object 'EDH_tibble' not found
Diplomatic_text column of the first five inscriptions (for comparison with the cleaning output):
print(EDH_tibble$diplomatic_text[1:5])
## Error in print(EDH_tibble$diplomatic_text[1:5]): object 'EDH_tibble' not found
Output of the cleaning_conservative_edh function:
example_edh <- as.data.frame(cleaning_conservative_edh(EDH_tibble$transcription))
## Error in gsub(pattern = dubious_dot_subscript[1], replacement = dubious_dot_subscript[2], : object 'EDH_tibble' not found
example_edh[1:5,]
## Error in eval(expr, envir, enclos): object 'example_edh' not found
Aim: to have a clean text enriched by editorial interpretations and reconstructions of the text (to have as rich text of an inscription as possible).
Since the dataset is mostly in Latin, I did not use the following cleaning scripts: vacat, editorial_notes, editorial_comments_latin, since they would eliminate some parts of the text that should not be eliminated. I am not using the suppresion_superscripts_interpretive script because the EDH dataset does not contain curly braces followed by superscript numbers. The script unclosed_brackets has been added since the EDH dataset contains a lot of unclosed brackets of all kinds. The script substitution_edh_interpretive was added to clean additional substitution features of the EDH dataset.
EDH has provided their own version of clean text in the column text_cleaned but did not provide any cleaning script or steps leading to the current state of text_cleaned. As a second step I will compare the output of the interpretive cleaning model with the text_cleaned version to see which produces better text for text mining.
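One simple way to start such a comparison, sketched here on invented toy rows rather than on the real EDH data, is to measure how often the two versions agree exactly and then inspect the rows that differ. The column names follow the ones used in this document; the values are made up for illustration.

```r
# Invented toy rows; in practice these would be columns of the cleaned EDH data.
toy_edh <- data.frame(
  text_cleaned = c("dis manibus", "posuerunt bene merenti", "iovi optimo"),
  clean_text_interpretive = c("dis manibus", "posuerunt bene merenti",
                              "iovi optimo maximo"),
  stringsAsFactors = FALSE
)

# TRUE where both versions of the text are identical.
same_text <- toy_edh$text_cleaned == toy_edh$clean_text_interpretive

# Share of inscriptions on which the two cleaning results agree.
agreement_rate <- mean(same_text)

# Rows worth reading side by side.
differing_rows <- toy_edh[!same_text, ]

print(agreement_rate)
print(differing_rows)
```

Exact agreement is a coarse measure; it only narrows down which inscriptions need a manual look and does not say which version is better.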
cleaning_interpretive_edh <- function(epigraphic_dataset){
clean_text <- gsub(pattern=dubious_dot_subscript[1], replacement=dubious_dot_subscript[2], x=epigraphic_dataset, perl=TRUE)
clean_text <- gsub(pattern=lacuna1[1], replacement=lacuna1[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=lacuna2[1], replacement=lacuna2[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=expanded_abbreviations_interpretive[1], replacement=expanded_abbreviations_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=suppresion_keep_interpretive[1], replacement=suppresion_keep_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=restoration_interpretive[1], replacement=restoration_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_edh_interpretive[1], replacement=substitution_edh_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=substitution_interpretive[1], replacement=substitution_interpretive[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=new_line[1], replacement=new_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=split_word_multiline[1], replacement=split_word_multiline[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_empty[1], replacement=erasure_empty[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=erasure_new_text[1], replacement=erasure_new_text[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=interpunction_symbols[1], replacement=interpunction_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=superscript_numbers[1], replacement=superscript_numbers[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=epigraphic_symbols[1], replacement=epigraphic_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=uncertainty_symbols[1], replacement=uncertainty_symbols[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=end_line[1], replacement=end_line[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=extra_blank[1], replacement=extra_blank[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=arabic_numerals[1], replacement=arabic_numerals[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=multi_whitespace[1], replacement=multi_whitespace[2], x=clean_text, perl=TRUE)
clean_text <- gsub(pattern=whitespace_endline[1], replacement=whitespace_endline[2], x=clean_text, perl=TRUE)
return(clean_text)
}
Transcription column of the first five inscriptions before cleaning:
print(EDH_tibble$transcription[1:5])
## Error in print(EDH_tibble$transcription[1:5]): object 'EDH_tibble' not found
Text_cleaned column, provided by EDH as a clean version of the text, for comparison with the output of the cleaning_interpretive_edh function:
print(EDH_tibble$text_cleaned[1:5])
## Error in print(EDH_tibble$text_cleaned[1:5]): object 'EDH_tibble' not found
Output of the cleaning_interpretive_edh function:
example_edh2 <- as.data.frame(cleaning_interpretive_edh(EDH_tibble$transcription))
## Error in gsub(pattern = dubious_dot_subscript[1], replacement = dubious_dot_subscript[2], : object 'EDH_tibble' not found
example_edh2[1:5,]
## Error in eval(expr, envir, enclos): object 'example_edh2' not found
EDH_df_clean <- as.data.frame(EDH_tibble) %>%
mutate(clean_text_conservative = cleaning_conservative_edh(EDH_tibble$transcription)) %>%
mutate(clean_text_interpretive = cleaning_interpretive_edh(EDH_tibble$transcription))
## Error in as.data.frame(EDH_tibble): object 'EDH_tibble' not found
Thracia <- EDH_df_clean%>%
filter(province_label == "Thracia"| province_label == "Thracia?")
## Error in eval(lhs, parent, parent): object 'EDH_df_clean' not found
number <- 24
print(Thracia$clean_text_interpretive[number]) # output of cleaning_interpretive function
## Error in print(Thracia$clean_text_interpretive[number]): object 'Thracia' not found
print(Thracia$transcription[number]) # original text to be cleaned
## Error in print(Thracia$transcription[number]): object 'Thracia' not found
print(Thracia$text_cleaned[number]) # text_cleaned provided by EDH
## Error in print(Thracia$text_cleaned[number]): object 'Thracia' not found
EDH_clean_json <- toJSON(EDH_df_clean)
## Error in toJSON(EDH_df_clean): object 'EDH_df_clean' not found
Thracia_json <- toJSON(Thracia)
## Error in toJSON(Thracia): object 'Thracia' not found
write(Thracia_json, "../../outputs/EDH_Thracia.json")
## Error in cat(x, file = file, sep = c(rep.int(sep, ncolumns - 1), "\n"), : object 'Thracia_json' not found