---
datasets:
- mshojaei77/PersianTelegramChannels
language:
- fa
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- tokenizer
- persian
- bpe
---
# PersianBPETokenizer Model Card
## Model Details
### Model Description
`PersianBPETokenizer` is a custom tokenizer for the Persian (Farsi) language. It uses the Byte-Pair Encoding (BPE) algorithm to build a vocabulary that handles characteristics specific to Persian text, such as ZWNJ (zero-width non-joiner) usage and diacritics, and it is packaged for use with Transformer models such as BERT and RoBERTa via Hugging Face Transformers, making it applicable across a range of Persian NLP tasks.
### Model Type
- **Tokenization Algorithm**: Byte-Pair Encoding (BPE)
- **Normalization**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
- **Pre-tokenization**: Whitespace
- **Post-processing**: TemplateProcessing for special tokens (see the assembly sketch below)
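The components listed above can be assembled with the `tokenizers` library roughly as follows. This is a minimal sketch, not the exact build script: the `Replace` target is an assumption (the card says ZWNJ characters are removed, so U+200C is mapped to the empty string here), and the special tokens are registered up front so `TemplateProcessing` can resolve their ids.
```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Strip, Replace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Register the special tokens so their ids exist before post-processing is wired up.
tokenizer.add_special_tokens(["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Normalization chain as listed above; Replace maps ZWNJ (U+200C) to "" (assumption).
tokenizer.normalizer = normalizers.Sequence(
    [NFD(), StripAccents(), Lowercase(), Strip(), Replace("\u200c", "")]
)

# Whitespace pre-tokenization.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# TemplateProcessing inserts [CLS]/[SEP] around single sentences and pairs.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```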
### Model Version
- **Version**: 1.0
- **Date**: September 6, 2024
### License
- **License**: MIT
### Developers
- **Developed by**: Mohammad Shojaei
- **Contact**: [email protected]
### Citation
If you use this tokenizer in your research, please cite it as:
```
Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co./mshojaei77/PersianBPETokenizer.
```
## Model Use
### Intended Use
- **Primary Use**: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, machine translation, and more.
- **Secondary Use**: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets (a pairing sketch follows below).
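As a hedged illustration of the secondary use case, one way to pair the tokenizer with a Transformers model is to initialize a fresh BERT-style model whose embedding table is sized to this tokenizer's vocabulary. This is a sketch with default config values for illustration, not a released checkpoint:
```python
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

tok = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

config = BertConfig(
    vocab_size=tok.vocab_size,     # size the embeddings to this tokenizer's vocabulary
    pad_token_id=tok.pad_token_id,
)
model = BertForMaskedLM(config)    # randomly initialized; train/fine-tune on Persian data
```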
### Out-of-Scope Use
- **Non-Persian Text**: This tokenizer is not designed for languages other than Persian.
- **Non-NLP Tasks**: It is not intended for use in non-NLP tasks such as image processing or audio analysis.
## Data
### Training Data
- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Description**: A rich collection of Persian text extracted from various Telegram channels. This dataset provides a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- **Size**: 60,730 samples
### Data Preprocessing
- **Normalization**: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters (demonstrated below).
- **Pre-tokenization**: Used whitespace pre-tokenization.
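To make the normalization concrete, here is a small sketch of the chain in isolation (again assuming ZWNJ maps to the empty string):
```python
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Strip, Replace

normalizer = normalizers.Sequence(
    [NFD(), StripAccents(), Lowercase(), Strip(), Replace("\u200c", "")]
)

# "می‌روم" ("I am going") contains a ZWNJ between its two parts.
print(normalizer.normalize_str("می\u200cروم"))     # -> "میروم"
# Latin text is lowercased, Arabic diacritics are stripped, edges are trimmed.
print(normalizer.normalize_str("  Hello دُنیا "))  # -> "hello دنیا"
```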
## Performance
### Evaluation Metrics
- **Tokenization Quality**: Informally evaluated on a variety of Persian sentences; tokenization and round-trip encoding/decoding behaved as expected. No quantitative benchmarks are reported.
- **Compatibility**: Fully compatible with Hugging Face Transformers, ensuring seamless integration with advanced language models.
### Known Limitations
- **Vocabulary Coverage**: The vocabulary reflects the Telegram-channel training corpus; highly specialized domains may require additional training on domain-specific data.
- **Out-of-Vocabulary Words**: Rare or domain-specific words may be tokenized as unknown tokens (`[UNK]`); a quick check is sketched below.
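A quick, hedged way to gauge coverage on your own domain is to count `[UNK]` tokens in the output (the sentence here is an arbitrary example):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

tokens = tok.tokenize("یک اصطلاح بسیار تخصصی پزشکی")  # arbitrary domain-flavored sentence
unk_ratio = tokens.count(tok.unk_token) / max(len(tokens), 1)
print(tokens)
print(f"UNK ratio: {unk_ratio:.2%}")
```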
## Training Procedure
### Training Steps
1. **Environment Setup**: Installed necessary libraries (`datasets`, `tokenizers`, `transformers`).
2. **Data Preparation**: Loaded the `mshojaei77/PersianTelegramChannels` dataset and created a batch iterator for efficient training.
3. **Tokenizer Model**: Initialized the tokenizer with a BPE model and applied normalization and pre-tokenization steps.
4. **Training**: Trained the tokenizer on the Persian text corpus using the BPE algorithm (a consolidated code sketch follows the hyperparameters below).
5. **Post-processing**: Set up post-processing to handle special tokens.
6. **Saving**: Saved the tokenizer to disk for future use.
7. **Compatibility**: Converted the tokenizer to a `PreTrainedTokenizerFast` object for compatibility with Hugging Face Transformers.
### Hyperparameters
- **Special Tokens**: `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Batch Size**: 1000 samples per batch
- **Normalization Steps**: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
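Putting the steps and hyperparameters above together, a minimal reproduction sketch might look like the following. The `text` column name is an assumption about the dataset's schema, and the normalizer/pre-tokenizer setup mirrors the Model Type section:
```python
from datasets import load_dataset
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Strip, Replace
from transformers import PreTrainedTokenizerFast

dataset = load_dataset("mshojaei77/PersianTelegramChannels", split="train")

def batch_iterator(batch_size=1000):  # 1,000 samples per batch, as listed above
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]  # assumes a "text" column

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [NFD(), StripAccents(), Lowercase(), Strip(), Replace("\u200c", "")]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
tokenizer.save("persian_bpe_tokenizer.json")

# Wrap for Hugging Face Transformers compatibility and save in the standard layout.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("PersianBPETokenizer")
```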
## How to Use
### Installation
To use the `PersianBPETokenizer`, first install the required libraries:
```bash
pip install -q --upgrade datasets tokenizers transformers
```
### Loading the Tokenizer
You can load the tokenizer using the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer
persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
```
### Tokenization Example
```python
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"

# Tokenize into subword strings.
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)

# Encode to input ids (with special tokens) and decode back to text.
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
```
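For feeding batches into a model, the usual Transformers call pattern with padding and truncation applies; `return_tensors="pt"` assumes PyTorch is installed:
```python
sentences = [
    "سلام دنیا",
    "امیدوارم روز خوبی داشته باشید",
]
batch = persian_tokenizer(
    sentences,
    padding=True,        # pad to the longest sentence with [PAD]
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
print(batch["attention_mask"])
```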
## Acknowledgments
- **Dataset**: `mshojaei77/PersianTelegramChannels`
- **Libraries**: Hugging Face `datasets`, `tokenizers`, and `transformers`
## References
- [Hugging Face Tokenizers Documentation](https://huggingface.co./docs/tokenizers/index)
- [Hugging Face Transformers Documentation](https://huggingface.co./docs/transformers/index)