Growth in Grammar

Growth in Grammar was a three-year project studying how English children’s written language develops as they progress through their school careers. We hope that this research will provide new understandings of writing development that will inform teaching and curriculum design.

The Growth in Grammar project aimed to further our understanding of how the language of children’s writing develops as they progress through the education system in England, from ages six to sixteen.

We collected 2,898 texts from 983 children in 24 schools and used a number of computer-assisted methods to understand differences in the use of grammar and vocabulary across year groups and text types.

An accessible description of our findings and links to the full downloadable corpus are available (registration required; please contact Philip Durrant for access).


The Growth in Grammar Corpus is a collection of texts written by children at schools in England as part of their regular school work. This page describes the process of text collection, transcription and annotation and summarizes the contents of the corpus. 

The full corpus can be accessed on registration (contact Phil Durrant for access details).

Corpus procedures

Collecting the corpus

Our research team contacted schools from across the country, briefing them on the project and inviting them to participate. All writing was obtained subject to the students’ voluntary informed consent, with additional consent obtained from the head teacher, the relevant subject teachers, and the students’ legal guardians.

Teachers collected texts from participating students and either photocopied these texts and mailed them to us or invited us into their schools to make photocopies ourselves.


All of the texts were received in handwritten form, so we employed a small team of transcribers to type them up. Transcribers received two days of training and worked closely with a member of the core project team to deal with issues that arose during the process.

Transcription proceeded in two phases. In the first phase, each transcriber was assigned a set of photocopies to type up in accordance with our transcription conventions. They were also asked to make two types of change to the original texts: 1) replace any proper names which might compromise participants’ or institutions’ anonymity with anonymisation markers; 2) where a word was misspelled, incorrectly capitalized or abbreviated, insert a tag recording both the original form and a ‘correction’ giving the correct spelling, the correct capitalization or the expanded form of the abbreviation.
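To illustrate why a tag that records both forms is useful, here is a minimal Python sketch. The `<corr orig="...">...</corr>` markup is a hypothetical stand-in invented for this example and is not the project's actual transcription convention; the function names are likewise assumptions. The sketch shows how a single transcription could yield either the child's original wording or the corrected wording.

```python
import re

# Hypothetical tag format (an assumption for illustration only):
#   <corr orig="becuase">because</corr>
# 'orig' holds what the child wrote; the element text holds the correction.
CORR = re.compile(r'<corr orig="([^"]*)">([^<]*)</corr>')


def original_reading(text: str) -> str:
    """Return the text as the child wrote it, errors included."""
    return CORR.sub(lambda m: m.group(1), text)


def corrected_reading(text: str) -> str:
    """Return the text with each tagged word replaced by its correction."""
    return CORR.sub(lambda m: m.group(2), text)


sample = (
    'I went <corr orig="becuase">because</corr> '
    '<corr orig="i">I</corr> wanted to.'
)

print(original_reading(sample))
print(corrected_reading(sample))
```

Keeping both forms in one tag means that studies of spelling can work from the original reading, while automatic tools can work from the corrected one, without maintaining two separate transcriptions.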

In the second phase, each transcriber was assigned texts which had originally been transcribed by someone else. They both reviewed the original transcription for accuracy and added annotations related to punctuation and grammar.

Linguistic Annotation

The conventions set out above describe the ‘basic’ version of the corpus. For the purposes of analysis, further versions were created incorporating different types of additional linguistic information.

Part-of-speech-tagged corpus

We used the CLAWS tagger to automatically add information about the part of speech of each word in the corpus. To achieve more accurate classifications, misspelled words were corrected and unclear or illegible material was removed prior to tagging. Material appearing inside tables was also removed.
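The clean-up applied before tagging can be sketched roughly as follows. The `<corr>`, `<unclear>` and `<table>` markers are hypothetical placeholders chosen for this example, not the corpus's real annotation scheme, and `prepare_for_tagging` is an illustrative name rather than a project tool. The sketch substitutes corrected spellings, drops unclear material and tables, and tidies the whitespace left behind.

```python
import re

# All three markers below are assumptions made for illustration.
CORR = re.compile(r'<corr orig="[^"]*">([^<]*)</corr>')
UNCLEAR = re.compile(r'<unclear>.*?</unclear>', re.DOTALL)
TABLE = re.compile(r'<table>.*?</table>', re.DOTALL)


def prepare_for_tagging(text: str) -> str:
    """Produce a plain-text version suitable for an automatic tagger."""
    text = TABLE.sub(' ', text)      # remove material inside tables
    text = UNCLEAR.sub(' ', text)    # remove unclear or illegible material
    text = CORR.sub(r'\1', text)     # keep only the corrected form
    return ' '.join(text.split())    # collapse leftover whitespace


sample = (
    'The <corr orig="dragen">dragon</corr> was <unclear>??</unclear> big. '
    '<table>a | b</table> It roared.'
)

print(prepare_for_tagging(sample))
```

The output of such a step would then be passed to the tagger; the tagger itself is not shown here.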

Syntactically-tagged corpus

The corpus was tagged with syntactic information in two ways. First, the entire corpus was tagged for part of speech and grammatical relations using the Stanford CoreNLP suite of tools. As with the part-of-speech tagging, misspellings were corrected and unclear or illegible material and tables were removed prior to parsing.

Second, a subset of the corpus was manually tagged by a team of trained annotators. This analysis focused specifically on tagging syntactic elements within noun phrases and subordinate clauses. Procedures and conventions used in this process are described in full here. The hand-parsed version of the corpus is available upon request. Please contact Phil Durrant for more information.

Corpus contents

The Growth in Grammar corpus comprises 2,898 texts, written by 983 children in 24 different schools. See Summary of corpus contents for quantitative summaries of the corpus contents. See Corpus metadata for metadata describing the full contents of the corpus in detail.

Our primary points of data collection were years 2, 6, 9 and 11. We were also sent some texts from year 4, which are included as a supplement to the main corpus.


Core team

Professor Debra Myhill: Debra’s research interests centre on composing processes in writing, the role of grammar and metalinguistic understanding in writing, and the relationship between talk and writing.
Dr Philip Durrant: The majority of Philip’s research uses corpus-linguistic methods to study the language of academic writing, both at school and university levels. He also has ongoing interests in language testing and in the learning and use of formulaic language.
Mark Brenchley: Mark’s main interest lies in the nature of syntactic knowledge, focusing on its acquisition and later development within a wider framework of “communicative competence”, with a particular emphasis on the use of corpus-based analytical techniques to better understand this knowledge. He is also interested in the general relationship between language and education, together with the possibility of helping develop what might be termed a genuinely “educational” theory of language.


