Table 4 Training corpora for patent categorization. C73 has more patents per class, a longer minimum text length, and is restricted to the primary classification.

From: Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

| Training corpus | Patents per class | Minimum text length (characters) | Restricted to primary classification | Number of classes | Total number of patents |
|---|---|---|---|---|---|
| C73 | 200 | 8000 | yes | 73 | 14600 |
| C1205 | 100 | 2000 | no | 1205 | 120500 |
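
The totals in the table are consistent with multiplying the number of patents per class by the number of classes. The following is a minimal sketch of that check, assuming every class contributes exactly the stated number of patents; the `TrainingCorpus` class and its field names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingCorpus:
    """One row of Table 4 (field names are illustrative, not from the paper)."""
    name: str
    patents_per_class: int
    min_text_chars: int
    primary_only: bool
    num_classes: int

    @property
    def total_patents(self) -> int:
        # Assumption: each class holds exactly `patents_per_class` patents.
        return self.patents_per_class * self.num_classes


# Values copied from Table 4.
CORPORA = [
    TrainingCorpus("C73", 200, 8000, True, 73),
    TrainingCorpus("C1205", 100, 2000, False, 1205),
]

for corpus in CORPORA:
    print(corpus.name, corpus.total_patents)
```

Under this assumption the computed totals match the last column of the table for both corpora.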