Distinctive lexical patterns in Russian patient information leaflets: a corpus-driven study
An analysis of patterns of language use in Russian patient information leaflets. The study identifies the keywords and recurrent word sequences that contribute to the formulaicity of this text type and describes their discourse functions.
Lukasz Grabowski
University of Opole
University of Ostrava
Abstract
This methodologically-oriented corpus-driven study focuses on distinctive patterns of language use in a specialized text type, namely Russian patient information leaflets. The study's main goal is to identify keywords and recurrent sequences of words that account for the leaflets' formulaicity, and -- as a secondary goal -- to describe their discoursal functions. The keywords were identified using three methods (G2, Hedges' g and Neozeta) and the overlap between the three metrics was explored. The overlapping keywords were qualitatively analyzed in terms of discoursal functions. As for the distinctive multi-word patterns, we focused on recurrent n-grams with the largest coverage in the corpus: these were identified using the Formulex method (Forsyth, 2015b), which provides complementary data with respect to more conservative n-gram and lexical bundles approaches. The results revealed that the most distinctive keywords were identified using the Hedges' g metric, that the largest overlap occurred between the G2 and Neozeta metrics, and that the frequent use and discoursal functions of the identified lexical patterns correspond with situational contexts and communicative purposes of patient information leaflets. It is hoped that this study will provide an opportunity for a methodological reflection and inspire further corpus-driven research on distinctive recurrent lexical patterns (e.g., keywords, n-grams, lexical bundles) or -- more generally -- on formulaic language in texts originally written in Russian.
Keywords: keywords, n-grams, formulaic language, phraseology, patient information leaflets, Russian language
Abstract (translated from the Russian)
Typified lexical patterns in Russian instructions for the use of medicinal products: a corpus study
Lukasz Grabowski
University of Opole
University of Ostrava
This methodologically oriented study, conducted using the corpus method, analyzes the most clearly expressed patterns of language use in Russian instructions for the use of medicinal products. The aim of the study is twofold: first, to establish and extract the keywords and recurrent word combinations that contribute to the formulaicity of this text type and, second, to describe their discourse functions. Three methods were used to extract the keywords: the log-likelihood function (LL), Hedges' g and Neozeta. Only the keywords that coincided across all three procedures were selected for further qualitative analysis. Recurrent n-grams with the largest lexical coverage in the corpus were extracted using the Formulex method (Forsyth 2015b), which provides data complementary to the more conservative n-gram and lexical bundle approaches. The results showed that: 1) the most distinctive keywords were identified using the Hedges' g formula; 2) the largest overlap of keywords was found for the LL and Neozeta formulas; 3) the frequency and discourse functions of the selected words and word combinations are conditioned both by the situational context and by the communicative functions of instructions for the use of medicinal products. The outcome of the analysis gives reason to hope that the results will prompt methodological reflection as well as further corpus studies of typical, frequently used lexical patterns (e.g., keywords, n-grams, lexical bundles) and, more generally, of the formulaicity of Russian-language texts.
Keywords: keywords, n-grams, formulaic language, phraseology, instructions for the use of medicinal products, Russian language
Introduction
Formulaicity has been a rather nebulous term encountered in a broad variety of disciplines of arts and humanities, including painting, sculpture or visual arts, among others, where an object of enquiry (a work of art, industrial design, text, etc.) has properties that may be readily described as pattern-like, template-like, stencilled, trite or cliched, to name but a few epithets. When studying a natural language, with various purposes in mind, linguists of various schools often refer to it as formulaic. In recent years, linguistically-oriented research on formulaicity has been flourishing (e.g., Wray, 2002, 2008, 2009; Schmitt & Carter, 2004; Wood, 2010a, 2010b, 2015; Kecskes, 2016; Myles & Cordier, 2017; Nelson, 2018; Pęzik, 2018).
In view of the fact that every sub-discipline of linguistics approaches formulaicity from a different perspective, it is difficult to determine precisely what is meant by formulaic language. Norbert Schmitt and Ronald Carter (2004: 3) argue that this phenomenon should be treated in an inclusive way so that multiple types of linguistic units fall under the umbrella term formulaicity. In a similar vein, Piotr Pęzik (2018: 241), who uses the term formulaicity interchangeably with phraseological prefabrication, argues that it “is a ubiquitous, complex and multifaceted linguistic phenomenon”. An attempt at a conceptual clarification as to how linguists of various schools approach formulaicity was made by Blazej Galkowski (2006: 163--164), who singles out three major approaches, that is, a purely linguistic one, where formulaicity is explored using various lexical and grammatical categories identified primarily on the basis of formal or lexical criteria; a psycholinguistic approach, which focuses on storage and processing of linguistic data in the mental lexicon of language users; and, finally, a sociolinguistic approach, which explores situational and cultural underpinnings of formulaicity. To these, one may also add a corpus linguistic approach, which explores formulaicity by focusing on the frequency and distribution of various types of recurrent sequences of words in texts. (A more detailed discussion of different approaches to the study of formulaic language can be found in Forsyth & Grabowski (2015) and Grabowski (2015c, 2018).) According to Rosamund Moon (2007: 1046), corpus linguists, in particular those conducting research on phraseology, are primarily interested in frequent and statistically significant multi-word patterns in which particular words occur.
This approach contrasts with more traditional analytical methods of phraseological research, e.g., the ones proposed and developed by Viktor Vinogradov (1947/1977) and Natalia Amosova (1963), which focused on the analysis of systemic invariant forms of phraseologisms (as they are or rather should be recorded in a dictionary) abstracted from situational contexts of their use. However, the corpus approach is closer in spirit to more synthetic approaches to phraseology (e.g., Ivanov 1957; Boguslawski 1976, 1989; Anic'kov 1992; Mel'cuk 1995, 1998; Chlebda 1991, 2009), which focus on the identification of links between situational contexts of language use and recurrence of linguistic forms. Consequently, the frequency-driven approach is particularly attractive for the analysis of routinized or cliched texts, since they rely more on limited stocks of prefabricated chunks of text or boilerplate formulas, notably when compared with more creative literary texts.
In terms of theoretical underpinnings, this descriptive and methodologically-oriented study also draws inspiration from research conducted by Wojciech Chlebda (1991, 2009, 2010), who looks at phraseology from the perspective of a producer of an utterance in a specific social context. More precisely, Chlebda (1991) proposes that a `phraseme' (frazem), in later works (Chlebda 2009, 2010) referred to as a `reproduct' (reprodukt), be treated as a central unit of analysis; it is defined as “a linguistic unit (a component of the language system of a given ethnic language) isolated from texts based on the verification of its regular and repeated occurrence, functioning as a verbalizer of specific content, e.g., a notion, proposition, intention, emotion” (Chlebda 2010: 15--16 and 140). Chlebda (2010: 16) argues that a reproduct can be a single word or a multi-word unit with a non-compositional or compositional meaning, and that the emphasis should be put on the analysis of the latter. According to Grabowski (2015c), this observation complies with the results of certain corpus studies which showed that multi-word units with compositional meaning are considerably more frequent in texts than idioms or fixed expressions with non-compositional meaning (Moon 1998; Biber et al. 1999).
Although recurrent use is a defining feature of reproducts, and at the same time one of the key characteristics of formulaicity, Chlebda (2009, 2010) does not operationalize any frequency or distribution threshold -- unlike in the aforementioned lexical bundles approach (Biber et al. 1999) -- allowing one to decide whether a given single word or a sequence of words is a reproduct. On the contrary, the search for reproducts is largely manual, based on close reading of textual material and intuition-based analysis of a number of textual features, e.g., quotation marks, structures with reporting verbs, temporal signals, location signals, community signals, authorial signals, generic and quasi-generic operators (Chlebda 2010: 19). Such signals are also referred to as phraseology markers (Pęzik 2018: 213), which stand for a stable yet extensible set of lexical devices that signal phraseological prefabrication in texts. Thus, the role of corpus linguistic methods in this approach is, strictly speaking, limited to consulting language corpora to perform a frequency check when intuition, notably a subjective assessment of the degree of perceptual salience, is insufficient to decide whether a given text chunk shall be treated as a reproduct (Chlebda 2009, 2010) or -- from a different perspective -- as proverbial, conventional, idiomatic or prefabricated (Pęzik 2018: 213). An approach like this is often referred to as a corpus-informed one.
In this corpus-driven study, however, the search for reproducts, which can be treated as markers of formulaic language in texts, is conducted in the opposite direction. More precisely, the recurrent single words and multi-word units are first identified based on their frequent occurrence in texts and then they are aligned with specific discourse functions in order to ensure that they constitute context-sensitive form-meaning mappings. This should enable one to single out those recurrent textual patterns, i.e., keywords and recurrent n-grams, that contribute to the formulaic nature of the text type under scrutiny, namely patient information leaflets written originally in Russian. A study like this one -- inspired by theoretical insights from Russian, Polish and English phraseology (both traditional and distributional) -- is hoped to fill the gap in data-driven research on recurrent patterns of language use found in Russian texts.
1. Research material and methodology
In this corpus-driven study, we aim to identify and describe recurrent lexical items (single- and multi-word units) that contribute to the formulaic nature of patient information leaflets originally produced in Russian. (I would like to thank Richard Forsyth and Costas Gabrielatos for some of the comments with respect to the problems addressed in this study. Also, I am grateful to an anonymous Reviewer for a number of important remarks that helped me improve the paper.) In that respect, the study may be treated as an extension of the author's earlier corpus linguistic research on keywords and/or lexical bundles in Polish and English patient information leaflets (Grabowski 2014, 2015a). As mentioned earlier, the emphasis will be put on keywords and recurrent n-grams with the largest coverage in the corpus, which will be described in greater detail later in the paper. Given the various constraints pertaining to the communicative function, situational contexts of use and target audience of the text type under scrutiny, as well as the highly standardized macro-structure of patient information leaflets, one may expect to find there only a limited number of recurrent lexical items. Furthermore, to the knowledge of the author, corpus linguistic studies, notably corpus-driven ones, of recurrent multi-word units in Russian texts have been scarce. For example, Maria Kunilovska, Natalia Morgoun and Alexey Pariy (2018) compared learner and professional translations from English into Russian, on the one hand, with native Russian texts, on the other, by focusing on a number of indicators, such as sentence length, lexical variety (TTR and proportion of high frequency words), lexical density and word frequencies; however, recurrent sequences of words (e.g., n-grams or otherwise) have not been explored in their study. A rare exception is the study by Daehyeon Nam and Sungmin Lee (2016), who explored the use and discourse functions of lexical bundles (Biber et al. 1999) in spoken and written Russian attested in a 1-million-word sample of the Russian National Corpus; the authors revealed that referential bundles predominate in written texts while stance bundles are more frequent in spoken texts (Nam & Lee 2016). However, in the Russian National Corpus one may find a variety of text types and genres (both contemporary and older ones, e.g., produced in the 19th century), which means that the results briefly summarized by Nam and Lee (2016) constitute generalizations that may not be applicable to any specific text type or genre.
That is why in this study we focus on a quasi-specialist text type used in the health care sector, that is, patient information leaflets (henceforth PILs). We aim to identify those single- and multi-word units that justify referring to patient information leaflets as a formulaic text type. Importantly, the goal of this research is not to measure the amount of formulaic language (cf. Forsyth & Grabowski 2015; Nelson 2018), but to describe the formulaic profile of the sample of Russian PILs by identifying those linguistic items that account for its highly patterned and cliched style.
Generally speaking, PILs are found in sales packages of medicines and they are written in the language of the country where the medicines are sold, which in this study is the Russian language. In short, PILs are produced -- in accordance with relevant legal regulations in a particular country -- by pharmaceutical companies for patients, pharmacists, nurses, general practitioners, etc., who are typically target readers of this text type. However, PILs also have intermediate users, such as regulatory authorities. According to Vicent Montalt Resurreccio and Maria Gonzalez Davies (2007: 68--69), the main communicative purpose of PILs is to provide specific information on proper and safe use and administration of medicines (doses, side effects, etc.).
In fact, there have been many corpus linguistic studies exploring lexis and phraseology in PILs originally written in English and other languages, conducted by Silvia Cacchiani (2006, 2016), Rosemary Clerehan, Di Hirsh and Rachelle Buchbinder (2009) or Lukasz Grabowski (2014, 2015a, 2015b), among others. However, to the knowledge of the author of this paper, there has been no study focusing on PILs written in Russian.
The research material includes a tailor-made corpus of 100 PILs (i.e., full texts) produced by fourteen pharmaceutical companies operating (as of 2016) on the Russian market: Astellas Russia (10), AstraZeneca (10), Bayer (10), Berlin-Chemie (10), Biochimik Saransk (1), Boehringer Ingelheim (10), Farmstandard (10), Graminex Farma Russia (1), Kirkland Rindoxil Russia (1), Lundbeck Russia (5), Novartis (10), Sanofi Russia (10), Servier Russia (10), Takeda Russia (2). All in all, the corpus size is 229,346 word tokens, and the mean block TTR (a type-token ratio in per cent calculated using text chunks of 100 words, which can be treated as a provisional measure of lexical richness) is 78.48%, which is the average of the block TTR scores of the 100 documents in the study corpus. Also, the linguistic data have not been subjected to annotation or lemmatization. The research questions addressed in this primarily methodologically-oriented study are as follows:
1) What are the keywords typical of Russian patient information leaflets? To what extent do the keywords differ when identified using different keyword metrics? What are the discourse functions of overlapping keywords?
2) What are the distinctive recurrent sequences of words (n-grams) in Russian patient information leaflets? What are their discourse functions?
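The mean block TTR reported for the study corpus can be illustrated with a short sketch. It rests on assumptions that the paper does not spell out: the text is already tokenized, only complete 100-token blocks are counted, and the corpus figure is the plain average of per-document scores. The function names `block_ttr` and `mean_block_ttr` are hypothetical and do not come from the Keysoft package.

```python
def block_ttr(tokens, block_size=100):
    """Mean type-token ratio (per cent) over consecutive full blocks of one text."""
    ratios = []
    # Non-overlapping blocks; an incomplete final block is dropped (assumption).
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        ratios.append(100.0 * len(set(block)) / block_size)
    if not ratios:
        raise ValueError("text is shorter than one block")
    return sum(ratios) / len(ratios)


def mean_block_ttr(documents, block_size=100):
    """Average of the per-document block TTR scores."""
    scores = [block_ttr(doc, block_size) for doc in documents]
    return sum(scores) / len(scores)
```

Because every block has the same length, the measure is less sensitive to text length than a plain TTR, which is presumably why it is used here to compare corpora of different sizes.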
Units of analysis: keywords and recurrent n-grams with the largest textual coverage
In the first stage of the study, we focus on the identification of keywords, that is those words that for some reason (frequency of use, symbolic value, social or cultural significance, etc.) are more important than other words in texts (Stubbs 2011: 21). It is common knowledge that a corpus linguistic approach to the identification of keywords relies primarily on statistics. According to Michael Scott (2008: 176), keywords are those words “whose frequency is unusually high in comparison with some norm”, which is found in the reference corpus constituting a benchmark for comparison. More precisely, the keywords are identified through their `keyness', an indicator whose value is contingent primarily on word frequencies and corpus size, which -- in turn -- depends on subjectively specified thresholds of frequency, effect-size and statistical significance, on the choice of the unit of analysis (word forms, lemmas, constructions, senses, etc.), and on the very characteristics (representativeness, balance, size, etc.) of the corpora under comparison (Gabrielatos 2018: 252). The core component of the definition of keyness and the essence of keyword analysis is therefore the comparison of frequencies of individual linguistic items (Gabrielatos 2018). However, researchers have also recently experimented with other approaches, e.g., based on comparisons of means of frequencies of individual items (Forsyth 2014b), grounded in topic modeling (Murakami et al. 2017), etc., which go beyond the original idea of keyword analysis.
Since calculating keyness is far from straightforward (for a more detailed overview, see Gabrielatos 2018), there are methods galore that help one identify whether a word is a key one in a corpus. One of the most popular approaches involves, first, comparing the frequency of a word in a study corpus with the frequency of the same word in a reference corpus and, second, cross-tabulating the results, taking into consideration the size (i.e., total number of tokens) of both corpora, and applying a test of statistical significance, e.g., Ted Dunning's (1993) log-likelihood test (also known as the G2 test) or Pearson's chi-square test (Scott 2008: 122). This approach is implemented (up to version 6.0) in WordSmith Tools (Scott 1996--2017). (WordSmith Tools ver. 1.0 was released in 1996; the most recent version of the software (7.0) was released in 2017. The program is downloadable from the following website: http://lexically.net.)
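The cross-tabulation step can be made concrete with a small sketch of the log-likelihood (G2) statistic computed over a 2 x 2 contingency table (the word vs. all other tokens, in the study vs. the reference corpus). This is the textbook formulation; whether WordSmith Tools or Keysoft compute exactly this variant is an assumption, and the function name is hypothetical.

```python
import math


def log_likelihood_g2(freq_sc, size_sc, freq_rc, size_rc):
    """G2 for one word from a 2 x 2 table: word / other tokens by study / reference corpus."""
    observed = [
        [freq_sc, freq_rc],                      # occurrences of the word
        [size_sc - freq_sc, size_rc - freq_rc],  # all other tokens
    ]
    total = size_sc + size_rc
    row_totals = [sum(row) for row in observed]
    col_totals = [size_sc, size_rc]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            # A cell with zero observations contributes nothing (0 * ln 0 is taken as 0).
            if observed[i][j] > 0:
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2
```

With the corpus sizes reported in this paper, a word with made-up frequencies of 100 (study corpus) and 50 (reference corpus) would be scored as `log_likelihood_g2(100, 229346, 50, 251204)`. Since G2 is never negative, the direction of the difference (positive or negative keyword) has to be read off the relative frequencies.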
Another approach is proposed by Costas Gabrielatos and Anna Marchi (2011), who argue for measuring the effect size (i.e., the extent or magnitude of the frequency difference) rather than statistical significance of the frequency difference, the latter being highly sensitive to corpus size. Consequently, Gabrielatos and Marchi (2011) propose the effect size metric %DIFF, which is independent of sample size (Rosenfeld & Penrod 2011: 84) and calculated as follows: %DIFF = (NormFreq in SC - NormFreq in RC) × 100 / NormFreq in RC, where `NormFreq' stands for a normalized frequency, `SC' stands for a study corpus and `RC' stands for a reference corpus. It is implemented, following the procedure proposed by Hardie (2014), in version 7.0 of WordSmith Tools (Scott 2017), where it is called Log-ratio. Importantly, the log-likelihood test and %DIFF result in different rankings for keywords, i.e., a high log-likelihood score does not correlate with a high %DIFF (Gabrielatos & Marchi 2011), unlike the rankings produced by two effect size metrics (e.g., %DIFF and Ratio, the latter proposed by Adam Kilgarriff (2001), cited in Gabrielatos 2018: 231--235), which turned out to be identical for all keywords (Gabrielatos 2018: 232). In short, tests of statistical significance and effect size metrics “measure different aspects of a frequency difference” and hence they “are not alternative measures of keyness” (Gabrielatos 2018: 231). In practice, this means that two rankings of keywords, e.g., based on a test of statistical significance and an effect size metric respectively, are hardly comparable with each other.
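The %DIFF formula quoted above translates directly into code. The sketch below is illustrative only: the function name `pct_diff` and the normalization base (per million tokens) are assumptions, and the case of a word absent from the reference corpus is simply rejected rather than handled.

```python
def pct_diff(freq_sc, size_sc, freq_rc, size_rc, per=1000000):
    """%DIFF = (NormFreq in SC - NormFreq in RC) * 100 / NormFreq in RC."""
    norm_sc = freq_sc / size_sc * per
    norm_rc = freq_rc / size_rc * per
    if norm_rc == 0:
        # The formula is undefined here; the treatment of zero is left to the analyst.
        raise ValueError("word does not occur in the reference corpus")
    return (norm_sc - norm_rc) * 100 / norm_rc
```

A value of 100 means that the normalized frequency in the study corpus is twice that in the reference corpus; a value of -50 means it is half. The normalization base cancels out, so the result depends only on the two relative frequencies.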
According to Paul Ellis (2010: 9), another useful method that can be employed to measure effect size is Hedges' g (Hedges 1981), which -- as explained by Richard Forsyth (2014b: 10) -- expresses, in standard deviation units, the difference in mean frequency rates between a study corpus and a reference corpus, hence producing a z-score (i.e., a standardized difference). To sum up, in contrast to non-parametric tests of statistical significance (e.g., Pearson's chi-square test, Dunning's log-likelihood test), metrics of effect size (e.g., %DIFF, Hedges' g or Cohen's d) help one avoid the problem of small yet statistically significant differences between frequencies of words in two corpora, a problem typical of comparisons of large corpora (Gabrielatos & Marchi 2011; Gabrielatos 2018). Statistical significance is often not tantamount to the practical significance of the observed differences. At this point, however, it is worth emphasizing that different effect size metrics are based on different assumptions, e.g., %DIFF compares frequencies of individual items in a corpus while Hedges' g and Cohen's d compare means of frequencies of individual items in the texts in a given corpus. (I would like to thank Costas Gabrielatos for this remark.)
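To show what comparing means of frequencies amounts to, here is a minimal sketch of Hedges' g in its standard textbook form (Cohen's d with a pooled standard deviation, multiplied by a small-sample correction). It assumes that each corpus is represented by a list of per-text relative frequencies of the word in question; the exact computation in Keysoft may differ, and the function name is hypothetical.

```python
import math
from statistics import mean, variance


def hedges_g(rates_sc, rates_rc):
    """Hedges' g for one word from per-text relative frequencies in two corpora."""
    n1, n2 = len(rates_sc), len(rates_rc)
    if n1 < 2 or n2 < 2:
        raise ValueError("each corpus needs at least two texts")
    # Pooled sample variance of the per-text rates.
    pooled_var = ((n1 - 1) * variance(rates_sc) + (n2 - 1) * variance(rates_rc)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("no variation in either corpus")
    d = (mean(rates_sc) - mean(rates_rc)) / math.sqrt(pooled_var)
    # Small-sample correction that turns Cohen's d into Hedges' g.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction
```

Unlike G2 or %DIFF, this score depends on how evenly a word is spread across texts: a word that is very frequent in only a few leaflets has a large variance and therefore a lower g than a word used at a steady rate in every leaflet.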
There are also other methods, e.g., the Simple Math metric (Kilgarriff 2009), which includes a variable that allows one to focus on either lower- or higher-frequency words. (The Simple Math method (Kilgarriff 2009) is implemented in Sketch Engine (Kilgarriff et al. 2014).)
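The role of the adjustable variable can be seen in a short sketch of the Simple Math score, taken here to be the ratio of the two per-million frequencies after the same constant has been added to each. The function name and the example frequencies are hypothetical.

```python
def simple_maths(freq_sc, size_sc, freq_rc, size_rc, n=1.0):
    """Ratio of per-million frequencies, each increased by the constant n."""
    fpm_sc = freq_sc * 1000000 / size_sc
    fpm_rc = freq_rc * 1000000 / size_rc
    # Adding n avoids division by zero; larger n favours higher-frequency words.
    return (fpm_sc + n) / (fpm_rc + n)
```

For a rare word found 5 times per million tokens in the study corpus and never in the reference corpus, the score is 6.0 with n = 1 but only 1.05 with n = 100, so raising n pushes such low-frequency items down the ranking.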
In the Keysoft software package, a collection of scripts written in Python 3.4, Forsyth (2014a) implements twelve methods to identify keywords, some of them commonly used in the field of authorship attribution (e.g., Zeta or Neozeta). Originally developed by John Burrows (2007), and later modified by Hugh Craig and Arthur Kinney (2009), Neozeta enables one to identify keywords by segmenting corpora into text chunks of equal length and counting occurrences of words in those text chunks. According to Maciej Eder (2016: 35), such a method whereby the frequencies of text chunks rather than individual words are used for corpus comparison enables one to filter out the words that appear in texts with high frequencies (e.g., function words) and, consequently, to focus on content words that convey themes or topics discussed in texts, i.e., the so-called aboutness (Phillips 1989).
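The chunk-based logic can be sketched as follows. The code implements the commonly cited Zeta score in the Craig and Kinney formulation (the proportion of study-corpus chunks containing a word plus the proportion of reference-corpus chunks lacking it); it is not claimed to reproduce the Neozeta variant in Keysoft, whose details are not given here. The chunk size and the function names are assumptions.

```python
def chunk(tokens, size):
    """Split a token list into consecutive full chunks of equal length."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def zeta_scores(sc_tokens, rc_tokens, chunk_size=1000):
    """Zeta per word: share of SC chunks with the word + share of RC chunks without it."""
    sc_chunks = [set(c) for c in chunk(sc_tokens, chunk_size)]
    rc_chunks = [set(c) for c in chunk(rc_tokens, chunk_size)]
    if not sc_chunks or not rc_chunks:
        raise ValueError("each corpus must fill at least one chunk")
    vocabulary = set().union(*sc_chunks, *rc_chunks)
    scores = {}
    for word in vocabulary:
        p_sc = sum(word in c for c in sc_chunks) / len(sc_chunks)
        p_rc = sum(word in c for c in rc_chunks) / len(rc_chunks)
        scores[word] = p_sc + (1 - p_rc)  # ranges from 0 to 2
    return scores
```

A function word that occurs in every chunk of both corpora scores exactly 1, the neutral midpoint, however frequent it is; this is the filtering effect described by Eder (2016: 35).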
Hence, in order to capitalize on the whole variety of approaches, we will use the Keysoft package (Forsyth 2014a) to identify and compare keywords in Russian PILs using three fundamentally different metrics, that is G2 (Dunning 1993), Hedges' g (Hedges 1981) and Neozeta (Craig & Kinney 2009), which measure different aspects of a frequency difference. (The mathematical formulae used to calculate keyness using the three methods are described and explained in greater detail by Forsyth (2014b: 9--12).) Since we do not focus on comparisons of statistical significance metrics only (e.g., G2), we do not additionally apply -- as recently recommended by Gabrielatos (2018) -- the BIC score, i.e., a metric calculated by deducting the combined size (logarithmized) of the compared corpora from the G2 value of the frequency difference.
Although the three metrics provide different flavours to keyword rankings, it is believed that there is some practical value in undertaking such a comparison. First, it will enable one to further verify the correlation (if any) between rankings produced by a test of statistical significance (G2) and an effect size metric (Hedges' g), the latter one focusing on means of frequencies of individual words. Second, some researchers, especially those using keywords in critical discourse analysis or sociolinguistic research, still use tests of statistical significance for keyword analysis (e.g., Baker et al. 2019), which means that it may be useful for them to compare rankings of keywords obtained using different methods, irrespective of the fact that different methods may be based on different statistical assumptions, e.g., comparisons of means of word frequencies rather than frequencies of individual items in two corpora.
As a reference corpus, we will use the Russian component of the Leeds Pentaglossal Corpus (Forsyth & Sharoff 2014), which includes 113 documents or fragments of documents representing 13 text types (e.g., Bible, corporate statements, fiction, news articles, ted.com transcripts, United Nations documents). (The Leeds Pentaglossal Corpus is downloadable from the following website: http://corpus.leeds.ac.uk/tools/5gcorpus.zip.) Hence, the composition of the corpus is more heterogeneous as compared with the Russian PILs. The size of the Russian Pentaglossal Corpus (henceforth RPC) is 251,204 word tokens and the mean block TTR (type-token ratio) is 79.09%. Since the mean block TTR of the Russian PILs is 78.48%, the RPC can be intuitively described as similar in terms of its lexical variation.
Next, we will compare the keywords obtained using three different methods to identify any overlapping ones, which will be subject to further qualitative analysis. The rationale behind focusing on the overlap (i.e., similarity) between the keywords is the claim made by Gabrielatos (2018: 225), who argues that “the vast majority of keyness studies focus on difference, at the expense of similarity”. Also, due to their unusually high frequency the keywords may be the center of units of meaning in texts, thereby performing specific discourse functions and contributing to the texts' formulaicity. According to Stanislaw Gozdz-Roszkowski (2011: 35), keywords can “reveal not only a great deal about the subject matter, the `aboutness' of a particular genre, but they can also specify the salient features which are functionally related to the genre”. This observation has two important implications. Firstly, since keywords are typically studied through their typical co-occurrence patterns, it should be possible to align them with specific discourse functions. In practice, this means developing a set of provisional categories in the form of tentative labels reflecting typical characteristics of the keywords -- the type of information they convey, their role in the organization of discourse, their semantic prosody and evaluative charge, etc. (Gozdz-Roszkowski 2011: 65; Grabowski 2015c). Secondly, the exploration of co-occurrence patterns or wider contexts of use of keywords should also enable one to identify distinctive or salient sequences of words that perform specific discourse functions in texts. The resulting sequences may include specialist terms or text chunks contributing to the formulaic style of a given text type or genre.
According to Bestgen (2018: 206), “one of the frequently used approaches to studying formulaic language is based on the automatic identification of recurrent continuous sequences of words”. With this goal in mind, in the second stage of the study we will use a recently proposed method called Formulex (Forsyth 2015b), which identifies properly fragmented n-grams based on the concept of `coverage'. According to Forsyth (2015b: 17), the method whereby “the sequences are mutually exclusive” and that “longer prefabricated phrases [are prevented] from being swamped by the elements of which they are composed” enables one to specify more precise boundaries of recurrent strings of words. As demonstrated by Grabowski & Jukneviciene (2016), the Formulex method may come in useful when dealing with overlapping sequences of words or with those sequences of words that constitute fragments of longer n-grams (e.g., в недоступном для детей, в недоступном для детей месте, хранить в недоступном для детей месте `store in a place not accessible for children'). (Using a corpus of Lithuanian and Polish students' EFL writing, Grabowski & Jukneviciene (2016) filtered the original lists of lexical bundles, identified using three traditional criteria (Biber et al. 1999, 2003, 2004; Biber 2006), against the lists of formulas generated using the Formulex method (Forsyth 2015b).) A problem like this one is often faced by researchers using the lexical bundles methodology (Biber et al. 1999), where the recurrent sequences of words are extracted from texts using criteria such as orthographic length, minimum frequency and distribution range. In that approach, overlapping or structurally incomplete items are often identified when analyzing highly patterned, formulaic text types or genres; this is precisely the scenario that we aim to avoid when using the Formulex method (Forsyth 2015b) in an attempt to extract right-sized n-grams from the study corpus.
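The fragment problem described above can be reproduced with a few lines of code. The sketch below is not the Formulex algorithm, whose coverage-based procedure is described in Forsyth (2015b); it is a deliberately naive filter, under hypothetical names and thresholds, that extracts recurrent n-grams and then discards any n-gram that is merely a fragment of a longer n-gram occurring equally often.

```python
from collections import Counter


def ngram_counts(tokens, n_min=3, n_max=6):
    """Count every contiguous n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def unsubsumed_ngrams(tokens, min_freq=2, n_min=3, n_max=6):
    """Recurrent n-grams that are not fragments of an equally frequent longer n-gram."""
    frequent = {g: c for g, c in ngram_counts(tokens, n_min, n_max).items() if c >= min_freq}
    kept = {}
    for gram, freq in frequent.items():
        subsumed = any(
            len(other) > len(gram) and other_freq == freq and contains(other, gram)
            for other, other_freq in frequent.items()
        )
        if not subsumed:
            kept[gram] = freq
    return kept
```

Applied to a text in which хранить в недоступном для детей месте occurs twice, a plain frequency-based extraction returns the six-word sequence together with all of its three-, four- and five-word fragments, whereas the filter keeps only the full sequence.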
2. Results
In the first stage of the study, we focused on the exploration of the most salient words in Russian PILs. To that end, we used the Keysoft package (Forsyth 2014a) and identified keywords using the three different metrics described earlier in this paper, that is G2 (Dunning 1993), Hedges' g (Hedges 1981) and Neozeta (Craig & Kinney 2009), which resulted in three lists with 52, 55 and 51 positive keywords respectively. The top-50 keywords are presented in Table 1. For the sake of clarity, all the numbers were deleted from the lists of keywords -- this way three numbers were deleted from the keywords identified using the G2 test, four numbers from the list identified using Neozeta, and none from the list of keywords identified using the Hedges' g metric. In brackets right next to each keyword, there is information concerning the overlap among the top-50 keywords, e.g., (3) indicates that a given keyword was identified using each method; (2) indicates an overlap between G2 and Neozeta; (2h) indicates an overlap between G2 and Hedges' g; (2n) indicates an overlap between Hedges' g and Neozeta; (1) indicates that a keyword was identified using a single method only. As mentioned earlier, this procedure should provide a preliminary insight into the similarity between the output of the three keyword metrics.
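The labelling scheme used in Table 1 can be expressed as a few set operations. The sketch below assumes that each metric's top-ranked keywords are available as a plain list of word forms; the function name `overlap_labels` is hypothetical.

```python
def overlap_labels(g2, hedges, neozeta):
    """Assign each keyword the overlap label used in Table 1."""
    g2, hedges, neozeta = set(g2), set(hedges), set(neozeta)
    labels = {}
    for word in g2 | hedges | neozeta:
        in_g2, in_h, in_z = word in g2, word in hedges, word in neozeta
        if in_g2 and in_h and in_z:
            labels[word] = "3"    # identified by all three metrics
        elif in_g2 and in_z:
            labels[word] = "2"    # G2 and Neozeta only
        elif in_g2 and in_h:
            labels[word] = "2h"   # G2 and Hedges' g only
        elif in_h and in_z:
            labels[word] = "2n"   # Hedges' g and Neozeta only
        else:
            labels[word] = "1"    # a single metric only
    return labels
```

Counting how often each label occurs in the returned dictionary gives the kind of overlap figures reported for the top-50 lists.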
The results revealed that 22 keywords (44%) out of the top-50 identified using three different metrics overlap with each other. Also, it was revealed that 20 keywords overlap in the case of using the G2 test and Neozeta; 2 keywords (реакции `reactions', беременности `pregnancy') overlap in the case of using G2 and Hedges' g; 2 keywords overlap in the case of using Hedges' g and Neozeta (особые `particular', взаимодействие `interaction'). The most distinctive keywords were identified using the Hedges' g statistic: 23 keywords (46%) do not overlap with the ones identified using either G2 or Neozeta. The corresponding figure for G2 and Neozeta is 7 for both metrics. Hence, the findings confirm that the three metrics -- based on different statistical assumptions -- prioritize different keywords.
The provisional results were further verified using Spearman rank correlation (Rs), applied to the ranks of all positive keywords identified using each metric (the same approach was used by Baker (2010: 92)). When a word did not occur on the list of positive keywords produced by a given metric, it was assigned a rank of N+1, where N is the length of that list. For example, in the comparison of keywords identified using G2 (52 words) with the ones obtained using Hedges' g (55 words), all the words that occurred on the G2 list but not on the Hedges' g list (e.g., после `after') were assigned rank 56 (55 + 1), as if they appeared at the end of the Hedges' g list. The results confirmed our earlier observations: the highest Rs score (0.778) was reported for G2 vs Neozeta, which indicates a rather strong positive association. (Ranks in G2 test: mean 26.5, standard deviation 15.15; ranks in Neozeta: mean 26.5, standard deviation 15.1; covariance = 9083 / 51 = 178.1; Rs = 178.1 / (15.15 * 15.1) = 0.778.) The Rs scores for G2 vs Hedges' g (0.293) and Hedges' g vs Neozeta (0.189) indicate a weak association between the metrics.
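The N+1 rule and the rank correlation can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run in the study: the function names are hypothetical, the comparison iterates over the words of the first list, and the coefficient is computed as a Pearson correlation on the assigned ranks, with tied N+1 ranks left as they are.

```python
from math import sqrt


def ranks_with_fallback(words, ranked_list):
    """Return the 1-based rank of each word in ranked_list.

    Words absent from ranked_list receive rank len(ranked_list) + 1.
    """
    position = {word: i + 1 for i, word in enumerate(ranked_list)}
    fallback = len(ranked_list) + 1
    return [position.get(word, fallback) for word in words]


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_correlation(list_a, list_b):
    """Correlate ranks of list_a's words in list_a and in list_b."""
    ranks_a = list(range(1, len(list_a) + 1))
    ranks_b = ranks_with_fallback(list_a, list_b)
    return pearson(ranks_a, ranks_b)


# Toy keyword lists (illustrative only).
same = rank_correlation(["a", "b", "c", "d"], ["a", "b", "c", "d"])
reversed_order = rank_correlation(["a", "b", "c", "d"], ["d", "c", "b", "a"])
fallback_demo = ranks_with_fallback(["a", "x"], ["a", "b"])
```

Identical orderings give a coefficient of 1 and fully reversed orderings give -1; a word missing from the second list is ranked one past its end.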
Table 1. Keywords in Russian PILs (top-50): comparing G2, Hedges' g, Neozeta
Rank | G2 | Hedges' g | Neozeta |
1 | мг (3) | препарата (3) | препарата (3) |
2 | препарата (3) | дозы (3) | при (3) |
3 | пациентов (2) | при (3) | пациентов (2) |
4 | при (3) | следует (3) | мг (3) |
5 | дозы (3) | противопоказания | следует (3) |
6 | следует (3) | применению (3) | дозы (3) |
7 | крови (3) | особые (2n) | у (2) |
8 | лечения (3) | взаимодействие (2n) | применения (3) |
9 | приема (3) | лекарственная (1) | или (1) |
10 | терапии (3) | форма (1) | крови (3) |
11 | применения (3) | лекарственными (3) | лечения (3) |
12 | с (3) | годности (1) | приема (3) |
13 | мл (2) | инструкцией (1) | после (2) |
14 | применении (3) | препарат (3) | применение (3) |
15 | применение (3) | показания (1) | терапии (3) |
16 | препарат (3) | побочное (1) | применении (3) |
17 | нарушения (2) | мг (3) | препарат (3) |
18 | печени (2) | торговое (1) | с (3) |
19 | у (2) | выпуска (1) | рекомендуется (3) |
20 | сутки (2) | применение (3) | развития (2) |
21 | стороны (1) | фармакокинетика (1) | до (1) |
22 | концентрации (2) | хранения (1) | течение (2) |
23 | таблетки (2) | фармакотерапевтическая (1) | применению (3) |
24 | редко (1) | отпуска (1) | может (1) |
25 | лечение (3) | действие (3) | концентрации (2) |
26 | рекомендуется (3) | фармакологические (1) | лечение (3) |
27 | риск (2) | симптомы (1) | печени (2) |
28 | доза (3) | вспомогательные (1) | риск (2) |
29 | применению (3) | рекомендуется (3) | мл (2) |
30 | снижение (2) | свойства (1) | период (2) |
31 | течение (2) | реакции (2h) | действие (3) |
32 | прием (2) | регистрационный (1) | снижение (2) |
33 | часто (1) | передозировка (1) | нарушения (2) |
34 | дозе (2) | лечение (3) | прием (2) |
35 | почек (2) | приема (3) | необходимо (1) |
36 | после (2) | средствами (3) | другими (1) |
37 | беременности (2h) | беременности (2h) | дозе (2) |
38 | препаратов (2) | применении (3) | препаратов (2) |
39 | составляет (3) | состав (1) | средствами (3) |
40 | развития (2) | срока (1) | функции (1) |
41 | пациенты (1) | доза (3) | таблетки (2) |
42 | период (2) | крови (3) | доза (3) |
43 | плазме (1) | применения (3) | риска (2) |
44 | со (1) | с (3) | составляет (3) |
45 | лекарственными (3) | осторожностью (1) | сутки (2) |
46 | риска (2) | терапии (3) | почек (2) |
47 | действие (3) | лечения (3) | лекарственными (3) |
48 | мин (1) | условия (1) | особые (2n) |
49 | средствами (3) | составляет (3) | системы (1) |
50 | реакции (2h) | повышенная (1) | взаимодействие (2n) |
Although the three methods produce largely different sets of keywords, 22 words appear on all three lists. These keywords may be provisionally divided into a number of functional categories: high-frequency function words (при `at', с `with') and content words (составляет `(it) constitutes'); measurement keywords (мг `mg'); keywords referring to the administration of medicines to patients (доза `dose', дозы `doses', действие `activity', приема `(drug) taking', применение, применения, применении, применению `administration/application'); recommendation or advisory keywords (рекомендуется `it is recommended', следует `(one) should'); keywords referring to the human body (крови `blood' gen.); procedural keywords (лечение, лечения `treatment', терапии `therapy'); and aboutness keywords that convey a general idea of the topics raised in the Russian PILs (препарат, препарата `drug', лекарственными `medicinal', средствами `products').
The second stage of the study aimed to identify distinctive or salient sequences of words in the sample of Russian PILs. The rationale behind our approach is that keywords are not salient by themselves or solely by virtue of the communicative function of the text variety under scrutiny. On the contrary, it is believed that the salience of keywords, measured by their outstanding frequency, results from the frequent use of certain text chunks and/or grammatical constructions. For example, an outstanding frequency of articles in specialist texts may result from the frequent use of noun phrases or nominalizations. To that end, in this final stage of the study, we used the Formulib software (Forsyth 2015a), a collection of scripts written in Python 3.4, and attempted to identify n-grams of four or more words with the highest coverage of texts in the study corpus. More specifically, the coverage threshold was arbitrarily set at 0.02%. Since coverage is calculated in terms of the number of characters, the corresponding threshold, given the size of the study corpus, is 345 or more characters. As regards the procedure of n-gram extraction, the Formulib script treats coverage as a binary category, which means that the number of n-grams that match a particular text sequence is irrelevant; what the program verifies is whether the text sequence is covered or not (Forsyth 2015b: 13--14). For example, if n-grams such as связь с белками плазмы and с белками плазмы крови cover a certain part of the text sequence связь с белками плазмы крови `interaction with blood plasma proteins', each of the five words in the last sequence is marked as covered once.
Based on that, the proportion of covered to uncovered characters for each text is calculated and, subsequently, the character coverage for a text category, in this study Russian PILs, is aggregated (Forsyth 2015b: 13--14). Forsyth (2015b: 25) notes that his method is similar to one of the algorithms (“Serial Cascading Algorithm”) proposed by O'Donnell (2011: 149--153).
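The idea of binary coverage can be illustrated with a short sketch. It is not the Formulib code: the function names are hypothetical, texts are assumed to be pre-tokenized lists of words, and only the characters inside tokens are counted (whether Formulex also counts spaces is not assumed here). The placeholder tokens "x" and "y" stand for arbitrary surrounding words.

```python
def covered_tokens(tokens, ngrams):
    """Mark every token that lies inside at least one n-gram match.

    A token matched by several overlapping n-grams is still marked once,
    i.e. coverage is binary.
    """
    covered = [False] * len(tokens)
    for ngram in ngrams:
        n = len(ngram)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == list(ngram):
                for j in range(i, i + n):
                    covered[j] = True
    return covered


def char_coverage(tokens, ngrams):
    """Share of token characters that belong to covered tokens."""
    covered = covered_tokens(tokens, ngrams)
    total = sum(len(token) for token in tokens)
    hit = sum(len(token) for token, flag in zip(tokens, covered) if flag)
    return hit / total if total else 0.0


phrase = ["связь", "с", "белками", "плазмы", "крови"]
tokens = ["x"] + phrase + ["y"]          # "x" and "y" are placeholder tokens
ngrams = [
    ("связь", "с", "белками", "плазмы"),
    ("с", "белками", "плазмы", "крови"),
]

flags = covered_tokens(tokens, ngrams)
ratio = char_coverage(tokens, ngrams)
```

Here the two overlapping four-word n-grams jointly cover all five words of the phrase, and each word is flagged once regardless of how many n-grams match it.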
Apart from providing insights into recurrent chunks of text, the Formulex method (Forsyth 2015b) also allows one to identify boundaries between recurrent n-grams, in particular overlapping or structurally incomplete ones. To illustrate the problem, on 22 occasions in the Russian PILs the contiguous sequence с белками плазмы крови was not a fragment of the longer sequence связь с белками плазмы крови, which is recorded in the Russian PILs 15 times; as a matter of fact, the sequence с белками плазмы крови occurs 46 times in total in various patterns in the Russian PILs corpus, yet it occurs by itself only 22 times. Such a solution, namely that “the sequences [of words] are mutually exclusive” and that “longer prefabricated phrases [are prevented] from being swamped by the elements of which they are composed of” (Forsyth 2015b: 17), allows one to specify more precise boundaries of formulaic sequences of words, which has been one of the challenges in research on n-grams or lexical bundles (Biber et al. 1999; Biber et al. 2004). The same method was used by Grabowski & Jukneviciene (2016).
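One way to obtain mutually exclusive counts of this kind is greedy, longest-first matching, sketched below. This is an assumption made for illustration only; the source does not confirm that Formulex works this way, and the function name `exclusive_counts` and the placeholder token "x" are hypothetical.

```python
from collections import Counter


def exclusive_counts(tokens, ngrams):
    """Count n-gram occurrences so that no token is counted twice.

    Longer n-grams are matched first; a shorter n-gram is counted only
    where none of its tokens has been claimed by an earlier match.
    """
    claimed = [False] * len(tokens)
    counts = Counter()
    for ngram in sorted(ngrams, key=len, reverse=True):
        n = len(ngram)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == list(ngram) and not any(claimed[i:i + n]):
                claimed[i:i + n] = [True] * n
                counts[ngram] += 1
    return counts


long_ngram = ("связь", "с", "белками", "плазмы", "крови")
short_ngram = ("с", "белками", "плазмы", "крови")

# Toy token stream: long, short on its own, long again ("x" is a placeholder).
tokens = list(long_ngram) + ["x"] + list(short_ngram) + ["x"] + list(long_ngram)

counts = exclusive_counts(tokens, [short_ngram, long_ngram])
```

In this toy stream the shorter sequence occurs three times as a raw substring but is counted once, because two of its occurrences are already part of the longer sequence.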
The n-grams with the largest coverage in the Russian PILs are presented in Table 2. For reasons of space, only the 50 n-grams with the largest coverage in the study corpus are presented; in practice, this translates into coverage of more than 0.02% of the study corpus. Also, the keywords found in the n-grams are presented in bold. It is believed that the salience of the 22 keywords that constitute the core vocabulary of Russian PILs, identified earlier in the study, may also be contingent on the frequent occurrence of the text chunks presented in Table 2, since those text chunks constitute textual building blocks of the Russian PILs.
Table 2 presents 50 n-grams, i.e., contiguous sequences of 4 and 5 words, with the largest coverage in the Russian PILs under study. The results reveal that among the ten top-coverage n-grams four are in fact headings describing the macro-structure of the genre (взаимодействие с другими лекарственными средствами `interaction with other drugs', инструкция по медицинскому применению препарата `instruction for medical use of the drug', способ применения и дозы `method of administration and doses', взаимодействие с другими лекарственными препаратами `interaction with other medicinal products'), while the remaining ones are found within the PILs' contents.
Since all the n-grams presented in Table 2 are frequently used in the analyzed text type, an attempt has been made to explore their discourse functions. To that end, we drew on two functional typologies. The first one is largely based on the functional taxonomy originally proposed by Douglas Biber, Susan Conrad and Viviana Cortes (2004: 384--388) and Biber (2006: 139--145) and applied to lexical bundles, which are divided into three inclusive categories, namely referential, discoursal and expressing stance. The other one is the functional typology originally developed by Kenneth Hyland (2008: 13--14), who divided lexical bundles into three major functional categories, namely research-oriented bundles (in this study called “referential” bundles), text-oriented bundles and participant-oriented bundles (in this study called “stance/evaluation” bundles).
More specifically, in this study referential n-grams refer to various properties (pharmacological, pharmacokinetic etc.) of medicines or to main themes conveyed in the Russian PILs; text-oriented n-grams help organize and convey information presented in the analyzed text type; finally, stance/evaluation n-grams help express judgments or assessments of information presented in the Russian PILs. Also, more fine-grained functional subcategories are provided to account for specific functional roles of the n-grams under scrutiny. In that respect, the typology used in the present research is similar to the one applied in the study of lexical bundles in Polish patient information leaflets (Grabowski 2014).
Table 2. Top-50 n-grams (by coverage) in Russian PILs
No. | Coverage (%) | Freq. | Char. | Words | N-gram |
1 | 0.1617 | 57 | 50 | 5 | взаимодействие с другими лекарственными средствами |
2 | 0.1202 | 45 | 47 | 5 | инструкция по медицинскому применению препарата |
3 | 0.1099 | 79 | 24 | 4 | способ применения и дозы |
4 | 0.0933 | 43 | 38 | 5 | со стороны сердечно сосудистой системы |
5 | 0.0854 | 59 | 25 | 4 | см раздел особые указания |
6 | 0.0839 | 29 | 51 | 5 | взаимодействие с другими лекарственными препаратами |
7 | 0.0535 | 31 | 30 | 4 | при одновременном применении с |
8 | 0.0529 | 25 | 37 | 5 | со стороны желудочно кишечного тракта |
9 | 0.0514 | 33 | 27 | 4 | см раздел побочное действие |
10 | 0.0501 | 30 | 29 | 4 | у пациентов пожилого возраста |
11 | 0.0499 | 28 | 31 | 5 | у пациентов с сахарным диабетом |
12 | 0.0477 | 22 | 38 | 5 | со стороны центральной нервной системы |
13 | 0.0428 | 22 | 34 | 4 | со стороны пищеварительной системы |
14 | 0.0381 | 18 | 37 | 5 | нарушения со стороны иммунной системы |
15 | 0.0379 | 22 | 30 | 4 | со стороны дыхательной системы |
16 | 0.0378 | 17 | 39 | 5 | у пациентов с почечной недостаточностью |
17 | 0.0369 | 17 | 38 | 5 | для приготовления раствора для инфузий |
18 | 0.0361 | 18 | 35 | 4 | с другими лекарственными средствами |
19 | 0.0333 | 13 | 45 | 5 | следует соблюдать осторожность при назначении |
20 | 0.0331 | 17 | 34 | 4 | следует соблюдать осторожность при |
21 | 0.0312 | 16 | 34 | 4 | может потребоваться коррекция дозы |
22 | 0.0304 | 13 | 41 | 5 | у пациентов с печеночной недостаточностью |
23 | 0.0294 | 12 | 43 | 4 | повышение активности печеночных трансаминаз |
24 | 0.0292 | 25 | 20 | 4 | у детей и подростков |
25 | 0.0285 | 19 | 26 | 4 | со стороны кожных покровов |
26 | 0.0285 | 16 | 31 | 5 | способ применения и дозы внутрь |
27 | 0.0282 | 22 | 22 | 4 | с белками плазмы крови |
28 | 0.0275 | 13 | 37 | 4 | таблетки покрытые пленочной оболочкой |
29 | 0.0267 | 10 | 47 | 5 | по поводу хронической сердечной недостаточности |
30 | 0.0258 | 16 | 28 | 4 | концентрации глюкозы в крови |
31 | 0.0256 | 20 | 22 | 4 | по сравнению с плацебо |
32 | 0.0247 | 12 | 36 | 5 | нарушения со стороны нервной системы |
33 | 0.0247 | 12 | 36 | 4 | с другими лекарственными препаратами |
34 | 0.0245 | 10 | 43 | 5 | претензии потребителей направлять по адресу |
35 | 0.0242 | 15 | 28 | 5 | связь с белками плазмы крови |
36 | 0.0234 | 10 | 41 | 5 | всасывается из желудочно кишечного тракта |
37 | 0.0233 | 22 | 18 | 4 | в течение 24 часов |
38 | 0.0231 | 16 | 25 | 5 | баллов по шкале чайлд пью |
39 | 0.0230 | 18 | 22 | 4 | у пациентов в возрасте |
40 | 0.0227 | 17 | 23 | 4 | не оказывает влияния на |
41 | 0.0223 | 10 | 39 | 5 | у пациентов с артериальной гипертензией |
42 | 0.0220 | 22 | 17 | 4 | в том случае если |
43 | 0.0218 | 14 | 27 | 5 | как и при применении других |
44 | 0.0218 | 14 | 27 | 4 | со стороны иммунной системы |
45 | 0.0217 | 26 | 14 | 4 | в связи с этим |
46 | 0.0217 | 10 | 38 | 5 | беременность и период кормления грудью |
47 | 0.0214 | 11 | 34 | 5 | со стороны костно мышечной системы |
48 | 0.0211 | 10 | 37 | 5 | у пациентов с фибрилляцией предсердий |
49 | 0.0211 | 10 | 37 | 4 | необходимо соблюдать осторожность при |
50 | 0.0209 | 15 | 24 | 4 | на фоне приема препарата |
As for referential n-grams (40 items), they include topic n-grams, referred to by Hyland (2008: 13) as topic-bundles, which identify certain themes conveyed in PILs or key aspects of medicines described therein. This group includes the following n-grams: взаимодействие с другими лекарственными средствами `interaction with other drugs', с другими лекарственными средствами `with other drugs', с другими лекарственными препаратами `with other medicinal products', взаимодействие с другими лекарственными препаратами `interaction with other medicinal products', инструкция по медицинскому применению препарата `instruction for medical use of the drug', таблетки, покрытые пленочной оболочкой (`film-coated tablets', i.e., referring to a pharmaceutical form of medicines), по поводу хронической сердечной недостаточности `due to chronic heart failure', беременность и период кормления грудью `pregnancy and lactation period' (i.e., referring to illnesses or physical conditions), баллов по шкале чайлд пью `points on the Child-Pugh scale'. Another group in this category includes location n-grams, which refer to the composition, parts or systems of the human organism (blood plasma, central nervous system, immune system etc.) affected by illnesses or subjected to the activity of medicines, e.g., со стороны сердечно-сосудистой системы `of the cardiovascular system', со стороны желудочно-кишечного тракта `of the gastrointestinal tract', со стороны центральной нервной системы `of the central nervous system', со стороны пищеварительной системы `of the digestive system', нарушения со стороны иммунной системы `disorders of the immune system', нарушения со стороны нервной системы `disorders of the nervous system', со стороны дыхательной системы `of the respiratory system', со стороны кожных покровов `of the skin', со стороны иммунной системы `of the immune system', со стороны костно-мышечной системы `of the musculoskeletal system', с белками плазмы крови `with blood plasma proteins'.
Next, procedure-related n-grams relate to various aspects of the administration of medicines to patients (preparation, dose etc.), e.g., способ применения и дозы `method of administration and doses', способ применения и дозы внутрь `method of administration and doses: orally', при одновременном применении с `when used simultaneously with', для приготовления раствора для инфузий `for preparation of solution for infusion', на фоне приема препарата `while taking the drug', по сравнению с плацебо `in comparison with placebo', как и при применении других `as with the use of other'. Process-related n-grams describe chemical processes related to the activity or presence of active substances or excipients in the human body, e.g., повышение активности печеночных трансаминаз `increased activity of hepatic transaminases', концентрации глюкозы в крови `blood glucose concentration', связь с белками плазмы крови `interaction with blood plasma proteins', всасывается из желудочно-кишечного тракта `is absorbed from the gastrointestinal tract'. Finally, one may find a single temporal n-gram (в течение 24 часов `within 24 hours') related to the frequency of administration or the duration of the activity of medicines.