🌟 Easy Fine-Tuning with Hugging Face SQL Console, Notebook Creator, and SFT

Community Article Published September 24, 2024

In this tutorial, we'll take you through an end-to-end process of creating a new dataset, fine-tuning a model with it, and sharing it on Hugging Face. By the end, you'll have a model that can respond in a lovely poetic way! 💖

What We'll Use:

  • Hugging Face Dataset Viewer SQL Console
  • Dataset Notebook Creator
  • Google Colab

For this example, we'll work with a poetry dataset and filter only the poems in the 'Love' category. This will allow us to fine-tune a model to generate answers filled with love and emotion. 💌

1. Getting the data

Let's start by getting our data. We'll use the Georgii/poetry-genre dataset, which contains poems across various topics.

We only need the 'Love' poems, and we'll keep only those longer than 150 characters. To do this, we'll use the SQL Console.

Click on SQL Console.

And now, write the following SQL query:

SELECT text AS poem FROM train WHERE genre = 'Love' AND len(text) > 150
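
To make the filter's behavior concrete, here is a minimal sketch that runs an equivalent query on a few made-up rows. It uses Python's built-in sqlite3 module rather than the SQL Console itself, so the string-length function is spelled length() instead of len(). The function name filter_love_poems and the sample rows are hypothetical and only serve as illustration.

```python
import sqlite3


def filter_love_poems(rows, min_chars=150):
    """Return the 'Love' poems longer than `min_chars` characters.

    `rows` is an iterable of (text, genre) tuples standing in for the
    dataset's train split.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE train (text TEXT, genre TEXT)")
    conn.executemany("INSERT INTO train VALUES (?, ?)", rows)
    # Same filter as the SQL Console query; SQLite uses length() where
    # the console query uses len().
    cursor = conn.execute(
        "SELECT text AS poem FROM train "
        "WHERE genre = 'Love' AND length(text) > ?",
        (min_chars,),
    )
    poems = [row[0] for row in cursor.fetchall()]
    conn.close()
    return poems


# Made-up sample rows (not taken from the real dataset).
sample_rows = [
    ("Roses are red", "Love"),  # right genre, but too short
    ("x" * 200, "Love"),        # right genre and long enough
    ("y" * 200, "Nature"),      # long enough, but wrong genre
]

poems = filter_love_poems(sample_rows)
print(len(poems))
```

Note that the comparison is strict: a poem of exactly 150 characters is dropped, while one of 151 characters is kept.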

💡 Tip: For more advanced techniques and examples on using the SQL Console, check out this guide.

Now, click on Download to save the filtered dataset as a Parquet file. We'll use this file in the next steps.

2. Uploading the Dataset to Hugging Face

Create a new repository on Hugging Face for your dataset. You can upload the Parquet file manually, or use the following Python snippet to upload it programmatically:

from datasets import load_dataset

# Load the Parquet file downloaded from the SQL Console into a dataset
dataset = load_dataset('parquet', data_files='query_result.parquet')

# Push the dataset to your Hugging Face repository
# (requires being logged in first, e.g. with `huggingface-cli login`)
dataset.push_to_hub('your_dataset_name')

Or follow these steps to create your dataset.

In my case, I created the asoria/love-poems dataset.

3. Generating the Training Code

Next, we'll use the Notebook Creator app to generate the training code for our dataset:

  1. Select asoria/love-poems as the dataset name.
  2. Choose the Supervised fine-tuning (SFT) notebook type.
  3. Click Generate Notebook and open it in Google Colab.

4. Fine-Tuning the Model

Now, it's time to run the scripts in the generated notebook. We'll use the dataset to fine-tune a pre-trained model like facebook/opt-350m to create a new, more love-inspired version.
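
The generated notebook remains the reference for this step. To give a rough idea of what supervised fine-tuning can look like, here is a minimal sketch that assumes the TRL library (SFTConfig and SFTTrainer) and a dataset with a single poem column, as produced by the query above. The function names output_name and fine_tune and the hyperparameter values are hypothetical, and argument names can differ between TRL versions.

```python
def output_name(base_model: str, suffix: str = "love-poems") -> str:
    """Derive an output directory name from a model id,
    e.g. 'facebook/opt-350m' -> 'opt-350m-love-poems'."""
    return f"{base_model.split('/')[-1]}-{suffix}"


def fine_tune(dataset_id: str = "asoria/love-poems",
              base_model: str = "facebook/opt-350m") -> str:
    """Fine-tune `base_model` on the poems and return the output directory."""
    # Imported lazily so the helper above can be used without the
    # training dependencies installed.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    train_dataset = load_dataset(dataset_id, split="train")
    output_dir = output_name(base_model)

    config = SFTConfig(
        output_dir=output_dir,
        dataset_text_field="poem",      # column created by `SELECT text AS poem`
        num_train_epochs=1,             # illustrative value, not tuned
        per_device_train_batch_size=4,  # illustrative value, not tuned
    )
    trainer = SFTTrainer(
        model=base_model,  # a model id string is loaded by the trainer
        args=config,
        train_dataset=train_dataset,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir
```

Calling fine_tune() in a Colab session with a GPU would start training. Nothing is downloaded or trained until the function is called.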

Follow the instructions in the notebook to train the model. Once training is complete, you'll have a model that responds in a lovelier way! 🌹✨
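
Once a fine-tuned model has been saved, one way to try it out is the transformers text-generation pipeline. This is a sketch under assumptions: the function name generate_poem, the model directory, and the prompt are hypothetical, and the sampling settings are only illustrative.

```python
def generate_poem(model_dir: str, prompt: str, max_new_tokens: int = 60) -> str:
    """Generate a continuation of `prompt` with a fine-tuned model."""
    # Imported lazily so defining this function does not require
    # transformers to be installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_dir)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample instead of greedy decoding for more varied text
    )
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]


# Example call (hypothetical model directory and prompt):
# print(generate_poem("opt-350m-love-poems", "My heart remembers"))
```

Because the model was trained on raw poems rather than question-answer pairs, prompts that read like the start of a poem are likely to work best.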

Conclusion

With just a few simple steps, we've created a new version of a dataset using the Hugging Face SQL Console, generated the necessary code with the Notebook Creator, and fine-tuned a model to answer with more love and poetry.

Now, your model is ready to spread love in every response! 💕🎉