---
license: apache-2.0
datasets:
- qizekun/ShapeLLM
language:
- en
---

## ShapeLLM model

This repository contains the ShapeLLM-7B model presented in [ShapeLLM: Universal 3D Object Understanding for Embodied Interaction](https://huggingface.co/papers/2402.17766).

## Install

[//]: # (If you are using Windows, do *NOT* proceed, see instructions [here](https://github.com/qizekun/LLaVA/blob/main/docs/Windows.md).)

1. Clone this repository and navigate to the ShapeLLM folder
```Shell
git clone https://github.com/qizekun/ShapeLLM.git
cd ShapeLLM
```
2. Install the package
```Shell
conda create -n shapellm python=3.10 -y
conda activate shapellm
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
3. Install additional packages for training
```Shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
4. Install PointNet++
```Shell
pip install "git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"
```

## ShapeLLM

### Model Weights
Please check out our [Model Zoo](https://github.com/qizekun/ShapeLLM/blob/main/docs/MODEL_ZOO.md) for all public ShapeLLM checkpoints.

### Demo

#### CLI Inference
Chat about point clouds using the CLI interface. It also supports multiple GPUs as well as 4-bit and 8-bit quantized inference (see the quantized-inference sketch at the end of this card).
```Shell
python -m llava.serve.cli \
    --model-path qizekun/ShapeLLM_7B_general_v1.0 \
    --pts-file assets/instrument.npy
```

### Training
Consistent with LLaVA, we adopt a two-stage training approach. In the first stage, we fine-tune only the projector for semantic alignment. In the second stage, we conduct full fine-tuning using instruction-following data.

Download the data following [DATA](https://github.com/qizekun/ShapeLLM/blob/main/docs/DATA.md) and organize it as follows in `./playground/data/shapellm/`:
```
│playground/data/shapellm/
├── cap3d_objaverse_785k.json
├── cap3d_objaverse_sft_45k.json
├── gapartnet_sft_27k_openai.json
├── gapartnet_pcs
│   ├── Box_100129_0_0.npy
│   └── ...
└── cap3d_pcs
    ├── 00000054c36d44a2a483bdbff31d8edf.pt
    └── ...
```
Furthermore, ShapeLLM uses the Large version of [ReCon++](https://github.com/qizekun/ShapeLLM/blob/main/ReConV2/cfgs/pretrain/large/openshape.yaml) as the point encoder. You need to download the [ReCon++ weights](https://huggingface.co/qizekun/ReConV2/blob/main/zeroshot/large/best_lvis.pth) and save them to `./checkpoints/recon/large.pth` (a download sketch is given at the end of this card).
```
│checkpoints/recon/
└── large.pth
```
**1. Feature Alignment Stage**
```
sh scripts/pretrain.sh
```
**2. Visual Instruction Tuning Stage**
```
sh scripts/finetune.sh
```
Training takes around 14 hours for ShapeLLM-13B on 8x A100 (80G) and around 7 hours for ShapeLLM-7B.

### Zero-shot Understanding on 3D MM-Vet
To evaluate 3D MLLMs on integrated capabilities and embodied interaction capabilities, run the script:
```
sh scripts/eval/mmvet.sh
```
Use GPT-4 to calculate the 3D MM-Vet score:
```
sh scripts/eval/eval_mmvet.sh
```

### Visual Grounding on GApartNet
To evaluate the performance of ShapeLLM on the GApartNet dataset, run the script:
```
sh scripts/eval/gapartnet_ref.sh
```
Calculate the generative 3D visual grounding accuracy:
```
sh scripts/eval/eval_gapartnet.sh
```
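
For the quantized inference mentioned in the CLI demo above, a minimal sketch is shown here. It assumes ShapeLLM keeps LLaVA's `--load-4bit` / `--load-8bit` flags in `llava.serve.cli`; check `llava/serve/cli.py` in the repository if the flag names differ.
```Shell
# Sketch: 4-bit quantized CLI inference, assuming the LLaVA-style --load-4bit flag is available.
python -m llava.serve.cli \
    --model-path qizekun/ShapeLLM_7B_general_v1.0 \
    --pts-file assets/instrument.npy \
    --load-4bit
```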
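
For the ReCon++ point-encoder weights required in the Training section, one possible way to fetch and place the checkpoint is sketched below; it assumes the standard Hugging Face `resolve/main` URL for the file linked above and that `wget` is available.
```Shell
# Sketch: download the ReCon++ Large checkpoint and place it where the training scripts expect it.
mkdir -p checkpoints/recon
wget https://huggingface.co/qizekun/ReConV2/resolve/main/zeroshot/large/best_lvis.pth \
    -O checkpoints/recon/large.pth
```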