{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5d9aca72-957a-4ee2-862f-e011b9cd3a62",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "## Goal\n",
    "I want [jais-13B](https://huggingface.co/core42/jais-13b-chat) deployed with an API quickly and easily. I'm also scared of mice so ideally I can just use my keyboard. \n",
    "\n",
    "## Approach\n",
    "There are lots of options out there that are \"1-click\" which is really cool! I would like to do even better and make a \"0-click\". This is great for those that are musophobic (scared of mice) or want scripts that can run without human intervention.\n",
    "\n",
    "We will be using [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) as our serving toolkit as it is robust and configurable. For our hardware we will be using [Inference Endpoints](https://huggingface.co/inference-endpoints) as it makes the deployment procedure really easy! We will be using the API to reach our aforementioned \"0-click\" goal."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2086a136-6710-45af-b2b1-7224b5cbbca7",
   "metadata": {},
   "source": [
    "# Pre-requisites\n",
    "Deploying LLMs is a tough process. There are a number of challenges! \n",
    "- These models are huge\n",
    "    - Slow to load \n",
    "    - Won't fit on convenient HW\n",
    "- Generative transformers require iterative decoding\n",
    "- Many of the optimizations are not consolidated\n",
    "\n",
    "TGI solves many of these, and while I don't want to dedicate this blog to TGI there are a few concepts we need to cover to properly understand how to configure our deployment.\n",
    "\n",
    "\n",
    "## Prefilling Phase\n",
    "> In the prefill phase, the LLM processes the input tokens to compute the intermediate states (keys and values), which are used to generate the “first” new token. Each new token depends on all the previous tokens, but because the full extent of the input is known, at a high level this is a matrix-matrix operation that’s highly parallelized. It effectively saturates GPU utilization.\n",
    "\n",
    "~[Nvidia Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)\n",
    "\n",
    "Prefilling is relatively fast.\n",
    "\n",
    "## Decoding Phase\n",
    "> In the decode phase, the LLM generates output tokens autoregressively one at a time, until a stopping criteria is met. Each sequential output token needs to know all the previous iterations’ output states (keys and values). This is like a matrix-vector operation that underutilizes the GPU compute ability compared to the prefill phase. The speed at which the data (weights, keys, values, activations) is transferred to the GPU from memory dominates the latency, not how fast the computation actually happens. In other words, this is a memory-bound operation.\n",
    "\n",
    "~[Nvidia Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)\n",
    "\n",
    "Decoding is relatively slow.\n",
    "\n",
    "## Example\n",
    "Lets take an example of sentiment analysis:\n",
    "\n",
    "Below we have input tokens that the LLM will pre-fill. Note that we know what the next token is during the pre-filling phase. We can use this to our advantage.\n",
    "```text\n",
    "### Instruction: What is the sentiment of the input?\n",
    "### Examples\n",
    "I wish the screen was bigger - Negative\n",
    "I hate the battery - Negative\n",
    "I love the default appliations - Positive\n",
    "### Input\n",
    "I am happy with this purchase - \n",
    "### Response\n",
    "```\n",
    "\n",
    "Below we have output tokens generated during decoding phase. Despite being few in this example we dont know what the next token will be until we have generated it.\n",
    "\n",
    "```text\n",
    "Positive\n",
    "```"
   ]
  },
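  {
   "cell_type": "markdown",
   "id": "7f3e2a10-1c2d-4e5f-9a6b-0d1e2f3a4b5c",
   "metadata": {},
   "source": [
    "To make the two phases concrete, here is a toy sketch. This is purely illustrative pseudo-logic (not TGI internals): prefill can process every known input token in one parallel pass, while decode has to loop one token at a time because each step depends on the previous output.\n",
    "\n",
    "```python\n",
    "# Toy illustration only; a real model computes keys/values with matrix ops.\n",
    "input_tokens = [\"I\", \"am\", \"happy\", \"with\", \"this\", \"purchase\"]\n",
    "\n",
    "# Prefill: all input tokens are known up front, so their key/value states\n",
    "# can be computed in a single, highly parallel pass (compute-bound).\n",
    "kv_cache = [f\"kv({tok})\" for tok in input_tokens]\n",
    "\n",
    "# Decode: output tokens are generated one at a time; every step re-reads the\n",
    "# growing cache, so memory bandwidth dominates (memory-bound).\n",
    "generated = []\n",
    "for step in range(3):                 # pretend the stopping criterion is 3 tokens\n",
    "    next_token = f\"token_{step}\"      # stand-in for a real sampling step\n",
    "    kv_cache.append(f\"kv({next_token})\")\n",
    "    generated.append(next_token)\n",
    "\n",
    "print(generated)\n",
    "```"
   ]
  },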
  {
   "cell_type": "markdown",
   "id": "d2534669-003d-490c-9d7a-32607fa5f404",
   "metadata": {},
   "source": [
    "# Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c830114-dd88-45a9-81b9-78b0e3da7384",
   "metadata": {},
   "source": [
    "## Requirements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "35386f72-32cb-49fa-a108-3aa504e20429",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.2\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q \"huggingface-hub>=0.20\" ipywidgets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6f72042-173d-4a72-ade1-9304b43b528d",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "99f60998-0490-46c6-a8e6-04845ddda7be",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/derekthomas/projects/spaces/jais-tgi-benchmark/venv/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from huggingface_hub import login, whoami, create_inference_endpoint\n",
    "from getpass import getpass"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5eece903-64ce-435d-a2fd-096c0ff650bf",
   "metadata": {},
   "source": [
    "## Config\n",
    "You need to fill this in with your desired repos. Note I used 5 for the `MAX_WORKERS` since `jina-embeddings-v2` are quite memory hungry. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "dcd7daed-6aca-4fe7-85ce-534bdcd8bc87",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ENDPOINT_NAME = \"jais13b-demo\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0ca1140c-3fcc-4b99-9210-6da1505a27b7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3c7ff285544d4ea9a1cc985cf981993c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "login()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f4ba0a8-0a6c-4705-a73b-7be09b889610",
   "metadata": {},
   "source": [
    "Some users might have payment registered in an organization. This allows you to connect to an organization (that you are a member of) with a payment method.\n",
    "\n",
    "Leave it blank is you want to use your username."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "88cdbd73-5923-4ae9-9940-b6be935f70fa",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "What is your Hugging Face 🤗 username or organization? (with an added payment method) ········\n"
     ]
    }
   ],
   "source": [
    "who = whoami()\n",
    "organization = getpass(prompt=\"What is your Hugging Face 🤗 username or organization? (with an added payment method)\")\n",
    "\n",
    "namespace = organization or who['name']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93096cbc-81c6-4137-a283-6afb0f48fbb9",
   "metadata": {},
   "source": [
    "# Inference Endpoints\n",
    "## Create Inference Endpoint\n",
    "We are going to use the [API](https://huggingface.co/docs/inference-endpoints/api_reference) to create an [Inference Endpoint](https://huggingface.co/inference-endpoints). This should provide a few main benefits:\n",
    "- It's convenient (No clicking)\n",
    "- It's repeatable (We have the code to run it easily)\n",
    "- It's cheaper (No time spent waiting for it to load, and automatically shut it down)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cf8334d-6500-412e-9d6d-58990c42c110",
   "metadata": {},
   "source": [
    "Here is a convenient table of instance details you can use when selecting a GPU. Once you have chosen a GPU in Inference Endpoints, you can use the corresponding `instanceType` and `instanceSize`.\n",
    "| hw_desc             | instanceType   | instanceSize | vRAM  |\n",
    "|---------------------|----------------|--------------|-------|\n",
    "| 1x Nvidia Tesla T4  | g4dn.xlarge    | small        | 16GB  |\n",
    "| 4x Nvidia Tesla T4  | g4dn.12xlarge  | large        | 64GB  |\n",
    "| 1x Nvidia A10G      | g5.2xlarge     | medium       | 24GB  |\n",
    "| 4x Nvidia A10G      | g5.12xlarge    | xxlarge      | 96GB  |\n",
    "| 1x Nvidia A100      | p4de           | xlarge       | 80GB  |\n",
    "| 2x Nvidia A100      | p4de           | 2xlarge      | 160GB |\n",
    "\n",
    "Note: To use a node (multiple GPUs) you will need to use a sharded version of jais. I'm not sure if there is currently a version like this on the hub. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "89c7cc21-3dfe-40e6-80ff-1dcc8558859e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "hw_dict = dict(\n",
    "    accelerator=\"gpu\",\n",
    "    vendor=\"aws\",\n",
    "    region=\"us-east-1\",\n",
    "    type=\"protected\",\n",
    "    instance_type=\"p4de\",\n",
    "    instance_size=\"xlarge\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbc82ce5-d7fa-4167-adc1-b25e567f5559",
   "metadata": {},
   "source": [
    "This is one of the most important parts of this tutorial to understand well. Its important that we choose the deployment settings that best represent our needs and our hardware. I'll just leave some high-level information here and we can go deeper in a future tutorial. It would be interesting to show the difference in how you would optimize your deployment between a chat application and RAG.\n",
    "\n",
    "`MAX_BATCH_PREFILL_TOKENS` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxbatchprefilltokens) |\n",
    "> Limits the number of tokens for the prefill operation. Since this operation take the most memory and is compute bound, it is interesting to limit the number of requests that can be sent\n",
    "\n",
    "`MAX_INPUT_LENGTH` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxinputlength) |\n",
    "> This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer prompt users can send which can impact the overall memory required to handle the load. Please note that some models have a finite range of sequence they can handle\n",
    "\n",
    "I left this quite large as I want to give a lot of freedom to the user more than I want to trade performance. It's important in RAG applications to give more freedom here. But for few turn chat applications you can be more restrictive.\n",
    "\n",
    "`MAX_TOTAL_TOKENS` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtotaltokens) | \n",
    "> This is the most important value to set as it defines the \"memory budget\" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be.\n",
    "\n",
    "`TRUST_REMOTE_CODE` This is set to `true` as jais requires it.\n",
    "\n",
    "`QUANTIZE` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#quantize) |\n",
    "> Whether you want the model to be quantized\n",
    "\n",
    "With jais, you really only have the bitsandbytes option. The tradeoff is that inference is a bit slower, but you can use much smaller GPUs (~3x smaller) without noticably losing performance. It's one of the better reads IMO and I recommend checking out the [paper](https://arxiv.org/abs/2208.07339)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f4267bce-8516-4f3a-b1cc-8ccd6c14a9c7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "tgi_env = {\n",
    "    \"MAX_BATCH_PREFILL_TOKENS\": \"2048\",\n",
    "    \"MAX_INPUT_LENGTH\": \"2000\",\n",
    "    'TRUST_REMOTE_CODE':'true',\n",
    "    \"QUANTIZE\": 'bitsandbytes', \n",
    "    \"MODEL_ID\": \"/repository\"\n",
    "}"
   ]
  },
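  {
   "cell_type": "markdown",
   "id": "2d9c4f61-58a3-4c0e-b7d2-9e1f0a3c5b77",
   "metadata": {},
   "source": [
    "As a quick sanity check on the token budget described above (a minimal sketch; `MAX_TOTAL_TOKENS` is not set explicitly in `tgi_env`, so the value below is hypothetical and the launcher default would apply):\n",
    "\n",
    "```python\n",
    "max_input_length = 2000   # matches MAX_INPUT_LENGTH above\n",
    "max_total_tokens = 2048   # hypothetical value, for illustration only\n",
    "\n",
    "# A request that uses the full input length leaves this many tokens for generation:\n",
    "print(max_total_tokens - max_input_length)  # 48\n",
    "```"
   ]
  },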
  {
   "cell_type": "markdown",
   "id": "74fd83a0-fef0-4e47-8ff1-f4ba7aed131d",
   "metadata": {},
   "source": [
    "A couple notes on my choices here:\n",
    "- I used `derek-thomas/jais-13b-chat-hf` because that repo has SafeTensors merged which will lead to faster loading of the TGI container\n",
    "- I'm using the latest TGI container as of the time of writing (1.3.4)\n",
    "- `min_replica=0` allows [zero scaling](https://huggingface.co/docs/inference-endpoints/autoscaling#scaling-to-0) which is really useful for your wallet though think through if this makes sense for your use-case as there will be loading times\n",
    "- `max_replica` allows you to handle high throughput. Make sure you read through the [docs](https://huggingface.co/docs/inference-endpoints/autoscaling#scaling-criteria) to understand how this scales"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9e59de46-26b7-4bb9-bbad-8bba9931bde7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "endpoint = create_inference_endpoint(\n",
    "    ENDPOINT_NAME,\n",
    "    repository=\"derek-thomas/jais-13b-chat-hf\",  \n",
    "    framework=\"pytorch\",\n",
    "    task=\"text-generation\",\n",
    "    **hw_dict,\n",
    "    min_replica=0,\n",
    "    max_replica=1,\n",
    "    namespace=namespace,\n",
    "    custom_image={\n",
    "        \"health_route\": \"/health\",\n",
    "        \"env\": tgi_env,\n",
    "        \"url\": \"ghcr.io/huggingface/text-generation-inference:1.3.4\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96d173b2-8980-4554-9039-c62843d3fc7d",
   "metadata": {},
   "source": [
    "## Wait until its running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "5f3a8bd2-753c-49a8-9452-899578beddc5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 188 ms, sys: 101 ms, total: 289 ms\n",
      "Wall time: 2min 56s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "InferenceEndpoint(name='jais13b-demo', namespace='HF-test-lab', repository='derek-thomas/jais-13b-chat-hf', status='running', url='https://kgcd24dil090jo6n.us-east-1.aws.endpoints.huggingface.cloud')"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "endpoint.wait()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "189b26f0-d404-4570-a1b9-e2a9d486c1f7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'POSITIVE'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "endpoint.client.text_generation(\"\"\"\n",
    "### Instruction: What is the sentiment of the input?\n",
    "### Examples\n",
    "I wish the screen was bigger - Negative\n",
    "I hate the battery - Negative\n",
    "I love the default appliations - Positive\n",
    "### Input\n",
    "I am happy with this purchase - \n",
    "### Response\n",
    "\"\"\",\n",
    "                               do_sample=True,\n",
    "                               repetition_penalty=1.2,\n",
    "                               top_p=0.9,\n",
    "                               temperature=0.3)"
   ]
  },
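  {
   "cell_type": "markdown",
   "id": "c4e8a9b2-6d17-4f30-a2c5-8b0d1e7f6a43",
   "metadata": {},
   "source": [
    "TGI also supports token streaming through the same client. A minimal sketch (not run here), assuming the endpoint is still running and using an illustrative prompt:\n",
    "\n",
    "```python\n",
    "# Stream tokens back as they are generated instead of waiting for the full response.\n",
    "for token in endpoint.client.text_generation(\n",
    "    \"Write one sentence about the weather.\",  # illustrative prompt\n",
    "    max_new_tokens=50,\n",
    "    stream=True,\n",
    "):\n",
    "    print(token, end=\"\")\n",
    "```"
   ]
  },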
  {
   "cell_type": "markdown",
   "id": "bab97c7b-7bac-4bf5-9752-b528294dadc7",
   "metadata": {},
   "source": [
    "## Pause Inference Endpoint\n",
    "Now that we have finished, lets pause the endpoint so we don't incur any extra charges, this will also allow us to analyze the cost."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "540a0978-7670-4ce3-95c1-3823cc113b85",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Endpoint Status: paused\n"
     ]
    }
   ],
   "source": [
    "endpoint = endpoint.pause()\n",
    "\n",
    "print(f\"Endpoint Status: {endpoint.status}\")"
   ]
  },
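  {
   "cell_type": "markdown",
   "id": "9b5d3c21-7e84-4a06-b1f9-2c6e0d4a8f55",
   "metadata": {},
   "source": [
    "If you want to use the endpoint again later, a paused endpoint can be brought back up. A minimal sketch (not run here):\n",
    "\n",
    "```python\n",
    "# Resume the paused endpoint and block until it is serving requests again.\n",
    "endpoint = endpoint.resume()\n",
    "endpoint.wait()\n",
    "```"
   ]
  },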
  {
   "cell_type": "markdown",
   "id": "41abea64-379d-49de-8d9a-355c2f4ce1ac",
   "metadata": {},
   "source": [
    "# Analyze Usage\n",
    "1. Go to your `dashboard_url` printed below\n",
    "1. Click on the Usage & Cost tab\n",
    "1. See how much you have spent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "16815445-3079-43da-b14e-b54176a07a62",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://ui.endpoints.huggingface.co/HF-test-lab/endpoints/jais13b-demo/analytics\n"
     ]
    }
   ],
   "source": [
    "dashboard_url = f'https://ui.endpoints.huggingface.co/{namespace}/endpoints/{ENDPOINT_NAME}/analytics'\n",
    "print(dashboard_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b953d5be-2494-4ff8-be42-9daf00c99c41",
   "metadata": {},
   "source": [
    "# Delete Endpoint\n",
    "We should see a `200` if everything went correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c310c0f3-6f12-4d5c-838b-3a4c1f2e54ad",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Endpoint deleted successfully\n"
     ]
    }
   ],
   "source": [
    "endpoint = endpoint.delete()\n",
    "\n",
    "if not endpoint:\n",
    "    print('Endpoint deleted successfully')\n",
    "else:\n",
    "    print('Delete Endpoint in manually') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "611e1345-8d8c-46b1-a9f8-cff27eecb426",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}