Further reading
Anderson 1976, 1978; Shlesinger 1989b; Pöchhacker 1994, 1995; Cronin
2002; Diriker 2004; Monacelli 2005; Beaton 2007a; Boéri 2008.
EBRU DIRIKER
Corpora
A corpus (plural: corpora) is a collection of texts that are the object of literary
or linguistic study. In contemporary corpus linguistics, such collections are held
in electronic form, allowing the inclusion of vast quantities of texts (commonly
hundreds of millions of words), and fast and flexible access to them using
corpus processing software. While most definitions stress the need for corpora
to be assembled according to explicit design criteria and for specific purposes
(Atkins et al. 1992), Kilgarriff and Grefenstette (2003) allow for more
serendipitous collections of texts, even the entire World Wide Web, to be
considered as corpora, as long as their contents are the focus of linguistic (or
related) study. No matter how the corpora they work with come into being,
however, all corpus linguists insist on the primacy of authentic data, as attested
in texts, that is, instances of spoken, written or signed behaviour that have
occurred ‘naturally, without the intervention of the linguist’ (Stubbs 1996: 4).
Corpus linguists thus take an approach to the study of language that is consistent
with the empiricism advocated in descriptive translation studies since the 1970s.
At that time, scholars became particularly critical of the use of introspection in
translation theory (Holmes 1988: 101) and of approaches that viewed
translations as idealized, speculative entities, rather than observable facts (Toury
1980a: 79–81). While Toury conceded that isolated attempts had been made to
describe and explain actual translations, he called for a whole new
methodological apparatus that would make individual studies transparent and
repeatable. It was Baker (1993) who saw the potential for corpus linguistics to
provide such an apparatus, and her early work in the area (Baker 1993, 1995,
1996a) launched what became known as ‘corpus-based translation studies’, or
CTS. Researchers in CTS now pursue a range of agendas, drawing on a variety
of corpus types and processing techniques, and these are addressed below,
following some more general remarks on corpus design and processing.
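As a rough illustration of the kind of fast, flexible access that corpus processing software offers, the following Python sketch produces a simple keyword-in-context (KWIC) display for a search word. It is a minimal sketch only: the function name `kwic`, its `width` parameter and the sample text are illustrative assumptions, not features of any particular corpus tool, and real corpus software works on far larger collections with proper tokenization.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Hypothetical helper for illustration: matching is case-insensitive
    and restricted to whole words; `width` is the number of characters
    of context shown on each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Slice out the left and right context around the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Invented sample text standing in for a (very small) corpus.
sample = ("A corpus is a collection of texts. "
          "Each corpus is held in electronic form.")

for line in kwic(sample, "corpus"):
    print(line)
```

Aligning every occurrence of the search word in a single column is what allows an analyst to scan its recurrent left and right contexts at a glance.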
Corpus creation and basic processing
Best practice in corpus creation requires designers to make informed decisions
on the types of language they wish to include in their corpora, and in which
proportions. Design criteria crucially depend on the envisaged use of the corpus
but have, in the past, centred on the idea that corpora should somehow be
‘representative’ of a particular type of language production and/or reception.
The statistical notion of representativeness is, however, extremely difficult to
apply to textual data, and many commentators now prefer to aim for a
‘balanced’ sample of the language in which they are interested (Kenny
2001: 106–7; Kilgarriff and Grefenstette 2003). A general-purpose monolingual
corpus might thus have to include both (transcribed) spoken and written
language, and, within each, samples of a variety of text types, dating from
specific time periods. There may also be a trade-off between including fewer
but more useful, full-length texts on the one hand, and more, but textually
‘compromised’ partial texts on the other (Atkins et al. 1992; Baker 1995:
229–30; Sinclair 1991). Once a suitable breakdown of text types, author profiles,
etc. has been decided upon, the actual texts chosen for inclusion in a corpus can
be selected randomly, or through more deliberate ‘hand-picking’. The texts thus
selected may then have to be converted to electronic form (through
keyboarding or scanning), if they are not already available in this form, and
permission to include them in the corpus may have to be sought from copyright
holders. Depending on the intended use of the corpus, various levels of
structural or linguistic annotation are desirable. Basic mark-up may involve
indicating (using a standard mark-up language like XML) the main divisions in a
text (headings, paragraphs, sentences, etc.) or the addition of ‘headers’ that
describe the content of texts, name their authors, and so on. More linguistically
oriented annotation includes part-of-speech tagging,