antalvdb committed on
Commit e7a5ee7
1 Parent(s): ec49ce0

Update README.md

Files changed (1)
  1. README.md +22 -7
README.md CHANGED

@@ -7,27 +7,42 @@ model-index:
 results: []
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
 # bart-base-spelling-nl-2m
 
-This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) on an unknown dataset.
+This model is a Dutch fine-tuned version of
+[facebook/bart-base](https://huggingface.co/facebook/bart-base).
+
 It achieves the following results on the evaluation set:
 - Loss: 0.0248
 - Cer: 0.0133
 
 ## Model description
 
-More information needed
+This is a fine-tuned version of
+[facebook/bart-base](https://huggingface.co/facebook/bart-base)
+trained on spelling correction. It leans on the excellent work by
+Oliver Guhr ([github](https://github.com/oliverguhr/spelling),
+[huggingface](https://huggingface.co/oliverguhr/spelling-correction-english-base)). Training
+was performed on an AWS EC2 instance (g5.xlarge) on a single GPU.
 
 ## Intended uses & limitations
 
-More information needed
+The intended use for this model is to be a component of the
+[Valkuil.net](https://valkuil.net) context-sensitive spelling
+checker.
 
 ## Training and evaluation data
 
-More information needed
+The model was trained on a Dutch dataset composed of 4,964,203 lines
+of text from three public Dutch sources, downloaded from the
+[Opus corpus](https://opus.nlpl.eu/):
+
+- nl-europarlv7.1m.txt (2,000,000 lines)
+- nl-opensubtitles2016.1m.txt (2,000,000 lines)
+- nl-wikipedia.txt (964,203 lines)
+
+Together these texts comprise 73,818,804 tokens.
+
 
 ## Training procedure
 
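
For context, a minimal sketch of how the model described in this card could be loaded for inference with the transformers `text2text-generation` pipeline. The Hub repo id `antalvdb/bart-base-spelling-nl-2m` is an assumption inferred from the committer name and the model name above, and the example sentence is invented; neither is part of the commit.

```python
from transformers import pipeline

# Load the fine-tuned BART checkpoint as a text2text-generation pipeline.
# The repo id is assumed from the committer and model name; verify before use.
corrector = pipeline(
    "text2text-generation",
    model="antalvdb/bart-base-spelling-nl-2m",
)

# Hypothetical Dutch input containing a misspelling ("tusen" -> "tussen").
text = "Het verschil tusen de twee modellen is klein."
print(corrector(text, max_length=64)[0]["generated_text"])
```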
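
The `Cer` figure in the results block is the character error rate: character-level edit distance divided by reference length, so 0.0133 means roughly 1.3 altered characters per 100. A small sketch of how such a score can be computed with the Hugging Face `evaluate` library; the strings are illustrative, not drawn from the actual evaluation set.

```python
import evaluate  # pip install evaluate jiwer

# CER = character-level edit distance / number of reference characters.
cer = evaluate.load("cer")

# Illustrative prediction/reference pair; the card's 0.0133 comes from
# the model's own held-out evaluation set, which is not reproduced here.
score = cer.compute(
    predictions=["Het verschil tussen de twee modellen is klein."],
    references=["Het verschil tussen de twee modellen is klein."],
)
print(score)  # 0.0 for an exact match
```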