Model seems to be incredibly slow on CPU

#34
by adi751 - opened

Using this on Google Colab, getting the embedding of a single 1400-word sentence took 31 minutes. Is this normal behavior?

When I run it locally, I see that only 1 core is being used while all the other cores sit idle. Other embedding models use all my cores. Is this the expected behavior? We want to use this model in production, and a 30-minute latency for a single sentence is ludicrously high.

Getting the sentence embeddings for around 250 sentences took roughly 7 hours.

Here's the CPU utilization when model.encode() is running
[screenshot: CPU utilization during model.encode(), only one core active]

Am I doing something wrong? Is there a flag to enable multithreading?
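For context, the only threading knobs I'm aware of are the standard torch ones sketched below (the thread counts are just example values, and I'd expect torch to use all cores by default anyway):

import os
os.environ["OMP_NUM_THREADS"] = "8"  # example value; must be set before torch is imported

import torch
torch.set_num_threads(8)            # intra-op thread pool (parallelism within a single op)
torch.set_num_interop_threads(8)    # inter-op thread pool; must be called before any parallel work runs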

Here's what my CPU utilization looks like when I use a different embedding model (BGE-large-en) on the same input sentence:

[screenshot: CPU utilization with BGE-large-en, all cores active]

Hi @adi751, I'll look into the issue. In the meantime, you can try using sentence-transformers for inference; it should be much faster.

I am using sentence-transformers:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model.encode(text)  # text is a single sentence of 1400 words
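For reference, encode also accepts a list of sentences and batches internally, so the ~250-sentence run can be expressed like this (the batch_size value is just an example):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
sentences = ["example sentence"] * 250  # stand-in for the real inputs
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape)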

Hmm yes, it does take an unusually long time on Colab. Does the same issue occur when you run it locally? I just ran this code snippet on my machine, and it took only 1.5 seconds, while bge-m3 took 1.3 seconds. In general, our model is expected to be slightly slower than bge because we use relative positional embeddings and LoRA adapters.
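For anyone who wants to reproduce the comparison, a minimal timing harness could look like the sketch below (the stand-in sentence, the warm-up step, and the bge-m3 checkpoint name are my assumptions, not the exact benchmark that was run):

import time
from sentence_transformers import SentenceTransformer

text = " ".join(["token"] * 1400)  # stand-in for a 1400-word sentence

for name in ("jinaai/jina-embeddings-v3", "BAAI/bge-m3"):
    model = SentenceTransformer(name, trust_remote_code=True)
    model.encode(text)  # warm-up run so one-time setup isn't counted
    start = time.perf_counter()
    model.encode(text)
    print(f"{name}: {time.perf_counter() - start:.2f}s")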

Yes, the issue first showed up locally, and I could see that the resource utilization was off. I tested on Colab just to rule out any weird configuration on my local machine.

Are there any dependencies other than einops that are needed? Only 1 core being used at a time is extremely weird to me; torch should handle parallel tensor operations on its own. I don't see why only 1 core is being used...
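If it helps with debugging, torch can report its threading configuration directly; a quick check like this (output varies by build) might show where things go wrong:

import torch

print(torch.get_num_threads())           # intra-op thread pool size
print(torch.get_num_interop_threads())   # inter-op thread pool size
print(torch.__config__.parallel_info())  # detailed ATen/OpenMP/MKL threading report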

edit: Are these latency figures for CPU or GPU?

I ran it on a CPU.

As for additional dependencies, I don't think there are any more than what's listed in the README. I tested this in a freshly initialized venv. Here's what it looks like, in case it's helpful:
certifi==2024.8.30
charset-normalizer==3.3.2
einops==0.8.0
filelock==3.16.1
fsspec==2024.9.0
huggingface-hub==0.25.1
idna==3.10
Jinja2==3.1.4
joblib==1.4.2
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==2.1.1
packaging==24.1
pillow==10.4.0
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.1.1
sympy==1.13.3
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.4.1
tqdm==4.66.5
transformers==4.44.2
typing_extensions==4.12.2
urllib3==2.2.3

I have created fresh environments and installed sentence_transformers, torch, and einops on 3 separate machines (local laptop, EC2 server, Colab). I'm seeing similar latencies on all 3 systems. The CPU utilization behavior is similar on EC2.

edit: here's the pastebin for my pip freeze: https://p.ip.fi/XxE1

I observed the exact same behavior on my machine: only one or two cores in use (I have 16 cores available), and it takes forever to process some sentences.
