shauray committed
Commit a679787
1 Parent(s): 3f9a2fc

Upload 5 files

README (1).md ADDED
@@ -0,0 +1,82 @@
+ ---
+ inference: false
+ language:
+ - en
+ tags:
+ - LLaMA
+ - MultiModal
+ ---
+ *This is a Hugging Face-friendly model; the original can be found at https://huggingface.co/liuhaotian/llava-llama-2-7b-chat-lightning-lora-preview*
+ <br>
+ # LLaVA Model Card
+
+ ## Model details
+
+ **Model type:**
+ LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
+ It is an auto-regressive language model based on the transformer architecture.
+
+ **Model date:**
+ LLaVA-LLaMA-2-7B-Chat-LoRA-Preview was trained in July 2023.
+
+ **Paper or resources for more information:**
+ https://llava-vl.github.io/
+
+ ## License
+ Llama 2 is licensed under the LLAMA 2 Community License,
+ Copyright (c) Meta Platforms, Inc. All Rights Reserved.
+
+ **Where to send questions or comments about the model:**
+ https://github.com/haotian-liu/LLaVA/issues
+
+ ## Intended use
+ **Primary intended uses:**
+ The primary use of LLaVA is research on large multimodal models and chatbots.
+
+ **Primary intended users:**
+ The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+
+ ## Training dataset
+ - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
+ - 80K GPT-generated multimodal instruction-following samples.
+
+ ## Evaluation dataset
+ A preliminary evaluation of model quality is conducted on a set of 90 visual reasoning questions built from 30 unique images randomly sampled from COCO val 2014; each image is paired with three question types: conversational, detailed description, and complex reasoning. GPT-4 is used to judge the model outputs.
+ We also evaluate the model on the ScienceQA dataset, where the synergy with GPT-4 sets a new state of the art.
+ See https://llava-vl.github.io/ for more details.
+
+ ## Usage
+ Usage is as follows:
+
+ ```python
+ from transformers import LlavaProcessor, LlavaForCausalLM
+ from PIL import Image
+ import requests
+ import torch
+
+ PATH_TO_CONVERTED_WEIGHTS = "shauray/Llava-Llama-2-7B-hf"
+
+ # Load the model in half precision on the GPU, together with its processor.
+ model = LlavaForCausalLM.from_pretrained(
+     PATH_TO_CONVERTED_WEIGHTS, device_map="cuda", torch_dtype=torch.float16
+ )
+ processor = LlavaProcessor.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+
+ # Fetch an example image and build the prompt.
+ url = "https://llava-vl.github.io/static/images/view.jpg"
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+ prompt = "How can you best describe this image?"
+
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
+
+ # Generate
+ generate_ids = model.generate(
+     **inputs,
+     do_sample=True,
+     max_length=1024,
+     temperature=0.1,
+     top_p=0.9,
+ )
+ # Decode only the newly generated tokens, skipping the prompt.
+ out = processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
+
+ print(out)
+
+ """The photograph shows a wooden dock floating on the water, with mountains in the background. It is an idyllic scene that captures both
+ nature and human-made structures at their finest moments of beauty or tranquility depending upon one's perspective as they gaze into it"""
+ ```
config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "_commit_hash": null,
+   "_name_or_path": "shauray/Llava-Llama-2-7b-hf",
+   "model_type": "llava",
+   "architectures": [
+     "LlavaForCausalLM"
+   ],
+   "llama_config": {
+     "_name_or_path": "",
+     "bos_token_id": 1,
+     "eos_token_id": 2,
+     "hidden_act": "silu",
+     "hidden_size": 4096,
+     "initializer_range": 0.02,
+     "intermediate_size": 11008,
+     "max_position_embeddings": 4096,
+     "model_type": "llama",
+     "num_attention_heads": 32,
+     "num_hidden_layers": 32,
+     "num_key_value_heads": 32,
+     "pretraining_tp": 1,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": null,
+     "tie_word_embeddings": false,
+     "torch_dtype": "float16",
+     "transformers_version": "4.32.0.dev0",
+     "use_cache": true,
+     "vocab_size": 32000
+   },
+   "llava_vision_config": {
+     "_name_or_path": "",
+     "bos_token_id": 1,
+     "eos_token_id": 2,
+     "freeze_mm_mlp_adapter": true,
+     "hidden_act": "silu",
+     "hidden_size": 4096,
+     "image_aspect_ratio": "square",
+     "image_grid_pinpoints": null,
+     "initializer_range": 0.02,
+     "intermediate_size": 11008,
+     "max_position_embeddings": 2048,
+     "mm_hidden_size": 1024,
+     "mm_resampler_type": null,
+     "mm_use_im_patch_token": false,
+     "mm_use_im_start_end": false,
+     "mm_vision_select_feature": "patch",
+     "mm_vision_select_layer": -2,
+     "mm_vision_tower": "openai/clip-vit-large-patch14",
+     "model_type": "llava",
+     "num_attention_heads": 32,
+     "num_hidden_layers": 32,
+     "num_key_value_heads": 32,
+     "pad_token_id": 0,
+     "pretraining_tp": 1,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": null,
+     "tie_word_embeddings": false,
+     "torch_dtype": "float16",
+     "transformers_version": "4.31.0",
+     "tune_mm_mlp_adapter": false,
+     "tune_mm_vision_resampler": false,
+     "use_cache": false,
+     "use_mm_proj": true,
+     "vocab_size": 32000
+   }
+ }
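
This config nests a standard LLaMA-2-7B text config (`llama_config`) next to the vision/projector settings (`llava_vision_config`): CLIP ViT-L/14 features (`mm_hidden_size` 1024) are projected into the 4096-dim LLaMA hidden space (`use_mm_proj`). A minimal sketch for inspecting these fields, assuming only `huggingface_hub` and the repo id used in the README:

```python
# Minimal sketch: download and inspect the nested LLaVA config.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("shauray/Llava-Llama-2-7B-hf", "config.json")
with open(path) as f:
    cfg = json.load(f)

# Language-model half: a 32-layer, 4096-wide LLaMA-2-7B backbone.
llama = cfg["llama_config"]
print(llama["num_hidden_layers"], llama["hidden_size"], llama["vocab_size"])

# Vision half: CLIP ViT-L/14 features (1024-dim) projected into the 4096-dim
# LLaMA embedding space by the multimodal projector.
vision = cfg["llava_vision_config"]
print(vision["mm_vision_tower"], vision["mm_hidden_size"], vision["hidden_size"])
```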
generation_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "bos_token_id": 1,
+   "do_sample": true,
+   "eos_token_id": 2,
+   "max_length": 4096,
+   "pad_token_id": 0,
+   "temperature": 0.6,
+   "top_p": 0.9,
+   "transformers_version": "4.32.0.dev0"
+ }
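
These are the sampling defaults that `model.generate()` falls back to when no explicit arguments are given; the README example above overrides `temperature` and `top_p` per call. A minimal sketch of reading and overriding them, assuming the standard `transformers.GenerationConfig` API applies to this repo:

```python
# Minimal sketch: load the generation defaults shipped with the repo and
# override them for a single call, mirroring the README example.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("shauray/Llava-Llama-2-7B-hf")
print(gen_cfg.do_sample, gen_cfg.temperature, gen_cfg.top_p)  # True 0.6 0.9

# Per-call keyword arguments take precedence over these defaults, e.g.:
# model.generate(**inputs, generation_config=gen_cfg, temperature=0.1)
```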
preprocessor_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [0.48145466, 0.4578275, 0.40821073],
+   "image_processor_type": "CLIPImageProcessor",
+   "tokenizer_class": "LlamaTokenizer",
+   "image_std": [0.26862954, 0.26130258, 0.27577711],
+   "processor_class": "LlavaProcessor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "size": 224
+ }
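
These fields describe the usual CLIP image pipeline: convert to RGB, resize to 224, rescale pixel values by 1/255 (`rescale_factor`), and normalize with the CLIP mean/std. A simplified numpy sketch of that transform, for illustration only; the real `CLIPImageProcessor` applies its own resize and cropping conventions:

```python
# Simplified sketch of the preprocessing described by preprocessor_config.json,
# reproduced with PIL + numpy to show what the fields mean numerically.
import numpy as np
from PIL import Image

IMAGE_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
IMAGE_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def preprocess(img: Image.Image) -> np.ndarray:
    img = img.convert("RGB")                      # do_convert_rgb
    img = img.resize((224, 224), Image.BICUBIC)   # do_resize, size=224, resample=3 (bicubic)
    x = np.asarray(img) * 0.00392156862745098     # do_rescale, rescale_factor = 1/255
    x = (x - IMAGE_MEAN) / IMAGE_STD              # do_normalize with CLIP mean/std
    return x.transpose(2, 0, 1)                   # HWC -> CHW, as the vision tower expects

# Example: preprocess(Image.open("view.jpg")).shape == (3, 224, 224)
```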
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff