Update lyraChatGLM/model.py

#21
CHANGES.rst DELETED
@@ -1,10 +0,0 @@
1
- Changelog (lyraChatGLM)
2
-
3
- ## 2.0
4
- - rebuild whole system using modified Fastertransformer
5
- - add dynamic library & models for Volta architecture.
6
- - further acceleration, remove token generation limits.
7
-
8
- ## 1.0
9
-
10
- - add lyraChatGLM model, from original weights
 
LISENCE DELETED
@@ -1,420 +0,0 @@
1
- MIT License
2
-
3
- Copyright (c) 2023 Tencent Music Entertainment
4
-
5
- Permission is hereby granted, free of charge, to any person obtaining a copy
6
- of this software and associated documentation files (the "Software"), to deal
7
- in the Software without restriction, including without limitation the rights
8
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
- copies of the Software, and to permit persons to whom the Software is
10
- furnished to do so, subject to the following conditions:
11
-
12
- The above copyright notice and this permission notice shall be included in all
13
- copies or substantial portions of the Software.
14
-
15
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
- SOFTWARE.
22
-
23
-
24
- Other dependencies and licenses:
25
-
26
- Open Source Software Licensed under The ChatGLM-6B License and the Apache License Version 2.0 :
27
- --------------------------------------------------------------------
28
- 1. chatglm-6b
29
-
30
- File:https://github.com/THUDM/ChatGLM-6B
31
- License: The ChatGLM-6B License and Apache License Version 2.0
32
- For details:https://github.com/THUDM/ChatGLM-6B/blob/main/MODEL_LICENSE
33
- https://github.com/THUDM/ChatGLM-6B/blob/main/LICENSE
34
-
35
- APPENDIX: How to apply the Apache License to your work.
36
-
37
- To apply the Apache License to your work, attach the following
38
- boilerplate notice, with the fields enclosed by brackets "[]"
39
- replaced with your own identifying information. (Don't include
40
- the brackets!) The text should be enclosed in the appropriate
41
- comment syntax for the file format. We also recommend that a
42
- file or class name and description of purpose be included on the
43
- same "printed page" as the copyright notice for easier
44
- identification within third-party archives.
45
-
46
- Copyright Zhengxiao Du
47
-
48
- Licensed under the Apache License, Version 2.0 (the "License");
49
- you may not use this file except in compliance with the License.
50
- You may obtain a copy of the License at
51
-
52
- http://www.apache.org/licenses/LICENSE-2.0
53
-
54
- Unless required by applicable law or agreed to in writing, software
55
- distributed under the License is distributed on an "AS IS" BASIS,
56
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
57
- See the License for the specific language governing permissions and
58
- limitations under the License.
59
-
60
- A copy of the Apache License Version 2.0 is included in this file.
61
-
62
-
63
- Terms of The ChatGLM-6B License:
64
- --------------------------------------------------------------------
65
-
66
- 一、定义
67
-
68
- “许可方”是指分发其软件的 ChatGLM-6B 模型团队。
69
-
70
- “软件”是指根据本许可提供的 ChatGLM-6B 模型参数。
71
-
72
- 2. 许可授予
73
-
74
- 根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可,仅用于您的非商业研究目的。
75
-
76
- 上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
77
-
78
- 3.限制
79
-
80
- 您不得出于任何商业、军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
81
-
82
- 您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
83
-
84
- 4.免责声明
85
-
86
- 本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
87
-
88
- 5. 责任限制
89
-
90
- 除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。
91
-
92
- 6.争议解决
93
-
94
- 本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
95
-
96
- 请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 glm-130b@googlegroups.com 与我们联系。
97
-
98
- 1. Definitions
99
-
100
- “Licensor” means the ChatGLM-6B Model Team that distributes its Software.
101
-
102
- “Software” means the ChatGLM-6B model parameters made available under this license.
103
-
104
- 2. License Grant
105
-
106
- Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes.
107
-
108
- The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
109
-
110
- 3. Restriction
111
-
112
- You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.
113
-
114
- You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
115
-
116
- 4. Disclaimer
117
-
118
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
119
-
120
- 5. Limitation of Liability
121
-
122
- EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
123
-
124
- 6. Dispute Resolution
125
-
126
- This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
127
-
128
- Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at glm-130b@googlegroups.com.
129
-
130
-
131
- Open Source Software Licensed under the Apache License Version 2.0:
132
- --------------------------------------------------------------------
133
- 1. huggingface/transformers
134
- Copyright 2018- The Hugging Face team. All rights reserved.
135
-
136
-
137
- Terms of the Apache License Version 2.0:
138
- --------------------------------------------------------------------
139
- Apache License
140
-
141
- Version 2.0, January 2004
142
-
143
- http://www.apache.org/licenses/
144
-
145
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
146
- 1. Definitions.
147
-
148
- "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
149
-
150
- "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
151
-
152
- "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
153
-
154
- "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
155
-
156
- "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
157
-
158
- "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
159
-
160
- "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
161
-
162
- "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
163
-
164
- "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
165
-
166
- "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
167
-
168
- 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
169
-
170
- 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
171
-
172
- 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
173
-
174
- You must give any other recipients of the Work or Derivative Works a copy of this License; and
175
-
176
- You must cause any modified files to carry prominent notices stating that You changed the files; and
177
-
178
- You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
179
-
180
- If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
181
-
182
- You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
183
-
184
- 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
185
-
186
- 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
187
-
188
- 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
189
-
190
- 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
191
-
192
- 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
193
-
194
- END OF TERMS AND CONDITIONS
195
-
196
-
197
- Open Source Software Licensed under the Modified BSD License:
198
- --------------------------------------------------------------------
199
- 1. pytorch
200
-
201
- From PyTorch:
202
-
203
- Copyright (c) 2016- Facebook, Inc (Adam Paszke)
204
- Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
205
- Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
206
- Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
207
- Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
208
- Copyright (c) 2011-2013 NYU (Clement Farabet)
209
- Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
210
- Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
211
- Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
212
-
213
- From Caffe2:
214
-
215
- Copyright (c) 2016-present, Facebook Inc. All rights reserved.
216
-
217
- All contributions by Facebook:
218
- Copyright (c) 2016 Facebook Inc.
219
-
220
- All contributions by Google:
221
- Copyright (c) 2015 Google Inc.
222
- All rights reserved.
223
-
224
- All contributions by Yangqing Jia:
225
- Copyright (c) 2015 Yangqing Jia
226
- All rights reserved.
227
-
228
- All contributions by Kakao Brain:
229
- Copyright 2019-2020 Kakao Brain
230
-
231
- All contributions by Cruise LLC:
232
- Copyright (c) 2022 Cruise LLC.
233
- All rights reserved.
234
-
235
- All contributions from Caffe:
236
- Copyright(c) 2013, 2014, 2015, the respective contributors
237
- All rights reserved.
238
-
239
- All other contributions:
240
- Copyright(c) 2015, 2016 the respective contributors
241
- All rights reserved.
242
-
243
- Caffe2 uses a copyright model similar to Caffe: each contributor holds
244
- copyright over their contributions to Caffe2. The project versioning records
245
- all such contribution and copyright details. If a contributor wants to further
246
- mark their specific copyright on a particular contribution, they should
247
- indicate their copyright solely in the commit message of the change when it is
248
- committed.
249
-
250
- All rights reserved.
251
-
252
-
253
- Terms of the Modified BSD License:
254
- -------------------------------------------------------------------
255
- This project is licensed under the terms of the Modified BSD License, as follows:
256
-
257
- Redistribution and use in source and binary forms, with or without
258
- modification, are permitted provided that the following conditions are met:
259
-
260
- 1. Redistributions of source code must retain the above copyright
261
- notice, this list of conditions and the following disclaimer.
262
-
263
- 2. Redistributions in binary form must reproduce the above copyright
264
- notice, this list of conditions and the following disclaimer in the
265
- documentation and/or other materials provided with the distribution.
266
-
267
- 3. Neither the names of Facebook, Deepmind Technologies, NYU, NEC Laboratories America
268
- and IDIAP Research Institute nor the names of its contributors may be
269
- used to endorse or promote products derived from this software without
270
- specific prior written permission.
271
-
272
- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
273
- AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
274
- IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
275
- ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
276
- LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
277
- CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
278
- SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
279
- INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
280
- CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
281
- ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
282
- POSSIBILITY OF SUCH DAMAGE.
283
-
284
-
285
- Open Source Software Licensed under the Python Software Foundation License Version 2:
286
- --------------------------------------------------------------------------
287
- 1. Python/cpython
288
- Copyright © 2001-2023 Python Software Foundation. All rights reserved
289
-
290
-
291
- A. HISTORY OF THE SOFTWARE
292
- ==========================
293
-
294
- Python was created in the early 1990s by Guido van Rossum at Stichting
295
- Mathematisch Centrum (CWI, see https://www.cwi.nl) in the Netherlands
296
- as a successor of a language called ABC. Guido remains Python's
297
- principal author, although it includes many contributions from others.
298
-
299
- In 1995, Guido continued his work on Python at the Corporation for
300
- National Research Initiatives (CNRI, see https://www.cnri.reston.va.us)
301
- in Reston, Virginia where he released several versions of the
302
- software.
303
-
304
- In May 2000, Guido and the Python core development team moved to
305
- BeOpen.com to form the BeOpen PythonLabs team. In October of the same
306
- year, the PythonLabs team moved to Digital Creations, which became
307
- Zope Corporation. In 2001, the Python Software Foundation (PSF, see
308
- https://www.python.org/psf/) was formed, a non-profit organization
309
- created specifically to own Python-related Intellectual Property.
310
- Zope Corporation was a sponsoring member of the PSF.
311
-
312
- All Python releases are Open Source (see https://opensource.org for
313
- the Open Source Definition). Historically, most, but not all, Python
314
- releases have also been GPL-compatible; the table below summarizes
315
- the various releases.
316
-
317
- Release Derived Year Owner GPL-
318
- from compatible? (1)
319
-
320
- 0.9.0 thru 1.2 1991-1995 CWI yes
321
- 1.3 thru 1.5.2 1.2 1995-1999 CNRI yes
322
- 1.6 1.5.2 2000 CNRI no
323
- 2.0 1.6 2000 BeOpen.com no
324
- 1.6.1 1.6 2001 CNRI yes (2)
325
- 2.1 2.0+1.6.1 2001 PSF no
326
- 2.0.1 2.0+1.6.1 2001 PSF yes
327
- 2.1.1 2.1+2.0.1 2001 PSF yes
328
- 2.1.2 2.1.1 2002 PSF yes
329
- 2.1.3 2.1.2 2002 PSF yes
330
- 2.2 and above 2.1.1 2001-now PSF yes
331
-
332
- Footnotes:
333
-
334
- (1) GPL-compatible doesn't mean that we're distributing Python under
335
- the GPL. All Python licenses, unlike the GPL, let you distribute
336
- a modified version without making your changes open source. The
337
- GPL-compatible licenses make it possible to combine Python with
338
- other software that is released under the GPL; the others don't.
339
-
340
- (2) According to Richard Stallman, 1.6.1 is not GPL-compatible,
341
- because its license has a choice of law clause. According to
342
- CNRI, however, Stallman's lawyer has told CNRI's lawyer that 1.6.1
343
- is "not incompatible" with the GPL.
344
-
345
- Thanks to the many outside volunteers who have worked under Guido's
346
- direction to make these releases possible.
347
-
348
-
349
- B. TERMS AND CONDITIONS FOR ACCESSING OR OTHERWISE USING PYTHON
350
- ===============================================================
351
-
352
- Python software and documentation are licensed under the
353
- Python Software Foundation License Version 2.
354
-
355
- Starting with Python 3.8.6, examples, recipes, and other code in
356
- the documentation are dual licensed under the PSF License Version 2
357
- and the Zero-Clause BSD license.
358
-
359
- Some software incorporated into Python is under different licenses.
360
- The licenses are listed with code falling under that license.
361
-
362
-
363
- PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
364
- --------------------------------------------
365
-
366
- 1. This LICENSE AGREEMENT is between the Python Software Foundation
367
- ("PSF"), and the Individual or Organization ("Licensee") accessing and
368
- otherwise using this software ("Python") in source or binary form and
369
- its associated documentation.
370
-
371
- 2. Subject to the terms and conditions of this License Agreement, PSF hereby
372
- grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce,
373
- analyze, test, perform and/or display publicly, prepare derivative works,
374
- distribute, and otherwise use Python alone or in any derivative version,
375
- provided, however, that PSF's License Agreement and PSF's notice of copyright,
376
- i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
377
- 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023 Python Software Foundation;
378
- All Rights Reserved" are retained in Python alone or in any derivative version
379
- prepared by Licensee.
380
-
381
- 3. In the event Licensee prepares a derivative work that is based on
382
- or incorporates Python or any part thereof, and wants to make
383
- the derivative work available to others as provided herein, then
384
- Licensee hereby agrees to include in any such work a brief summary of
385
- the changes made to Python.
386
-
387
- 4. PSF is making Python available to Licensee on an "AS IS"
388
- basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
389
- IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
390
- DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
391
- FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
392
- INFRINGE ANY THIRD PARTY RIGHTS.
393
-
394
- 5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
395
- FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
396
- A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
397
- OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
398
-
399
- 6. This License Agreement will automatically terminate upon a material
400
- breach of its terms and conditions.
401
-
402
- 7. Nothing in this License Agreement shall be deemed to create any
403
- relationship of agency, partnership, or joint venture between PSF and
404
- Licensee. This License Agreement does not grant permission to use PSF
405
- trademarks or trade name in a trademark sense to endorse or promote
406
- products or services of Licensee, or any third party.
407
-
408
- 8. By copying, installing or otherwise using Python, Licensee
409
- agrees to be bound by the terms and conditions of this License
410
- Agreement.
411
-
412
-
413
- Open Source Software:
414
- --------------------------------------------------------------------
415
- 1. icetk
416
- File:https://github.com/THUDM/icetk
417
-
418
-
419
-
420
-
 
README.md CHANGED
@@ -1,84 +1,65 @@
1
  ---
2
- license: mit
3
  language: en
4
  tags:
5
- - LLM
6
- - ChatGLM6B
 
7
  ---
 
 
8
  ## Breakings!
9
 
10
- **We know what you want, and here you go!**
11
 
12
- - Newly released lyraChatGLM model, suitable for Ampere (A100/A10) as well as Volta (V100)
13
- - lyraChatGLM has been further optimized, reaching **9000 tokens/s** on A100 and **3900 tokens/s** on V100, about **5.5x** faster than the up-to-date official version (2023/6/1).
14
  - The memory usage was optimized too, now we can set batch_size up to **256** on A100!
15
- - INT8 weight only PTQ is supported
16
 
17
- **Note that the code was fully updated too, you need to use the new API, see `Uses` below**
18
 
19
- If you like our work and consider to join us, feel free to drop a line to benbinwu@tencent.com.
20
 
21
- P.S. Recently we have received a lot of inquiries about accelerating customized models. Actually, we **do not have a plan** to release the conversion tool at this moment, nor do we think it would be possible to apply your customized models to our current release.
22
- ****
23
  ## Model Card for lyraChatGLM
24
 
25
  lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.
26
 
27
- The inference speed of lyraChatGLM has achieved a **300x** speedup over the early original version. We are still working hard to further improve the performance.
 
 
28
 
29
- Among its main features are (updated on 2023-06-20):
30
  - weights: original ChatGLM-6B weights released by THUDM.
31
  - device: Nvidia GPU with Ampere architecture or Volta architecture (A100, A10, V100...).
32
- - batch_size: compiled with dynamic batch size, maximum depends on device.
33
- - We now support CUDA versions 11.x and 12.x
34
- - lyraChatGLM has been further optimized, with model load time reduced from a few minutes to less than 10 s for non-int8 mode, and to around 1 min for int8 mode!
35
 
36
- ## Speed
37
  - original version (fixed batch infer): commit id 1d240ba
38
 
39
  ### test on A100 40G
40
- 1. The maximum batch size and maximum speed table for each version of the model.
41
  |version|max_batch_size|max_speed|
42
  |:-:|:-:|:-:|
43
  |original|1|30 tokens/s|
44
- |original (fixed batch infer)|192|1638.52 tokens/s|
45
- |lyraChatGLM (current)|256|9082.60 tokens/s|
46
- 2. The speed table for the same batch size.
47
- |version|batch_size 1|batch_size 8|batch_size 64|batch_size 128|
48
- |:-:|:-:|:-:|:-:|:-:|
49
- |original|30 tokens/s| - | - | - |
50
- |original (fixed batch infer)|34.48 tokens/s|356.29 tokens/s|1638.52 tokens/s|1338.45 tokens/s|
51
- |lyraChatGLM (current)|110.05 tokens/s|843.60 tokens/s|4926.92 tokens/s|7235.04 tokens/s|
52
 
53
  ### test on V100
54
- 1. The maximum batch size and maximum speed table for each version of the model.
55
  |version|max_batch_size|max_speed|
56
  |:-:|:-:|:-:|
57
  |original|1|17.83 tokens/s|
58
- |original (fixed batch infer)|128|992.20 tokens/s|
59
- |lyraChatGLM (current)|192|3958.39 tokens/s|
60
- 2. The speed table for the same batch size.
61
- |version|batch_size 1|batch_size 8|batch_size 64|batch_size 128|
62
- |:-:|:-:|:-:|:-:|:-:|
63
- |original|17.83 tokens/s| - | - | - |
64
- |original (fixed batch infer)|17.83 tokens/s|228.95 tokens/s|889.7 tokens/s|922.20 tokens/s|
65
- |lyraChatGLM (current)|59.33 tokens/s|514.15 tokens/s|2849.88 tokens/s|3958.39 tokens/s|
66
 
67
  ## Model Sources
68
 
69
  - **Repository:** https://huggingface.co/THUDM/chatglm-6b
70
 
71
- ## Docker Environment Recommendation
72
 
73
- - For Cuda 11.X: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
74
- - For Cuda 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```
75
 
76
- ```bash
77
- docker pull nvcr.io/nvidia/pytorch:23.02-py3
78
- docker run --rm -it --gpus all -v ./:/lyraChatGLM nvcr.io/nvidia/pytorch:23.02-py3
79
-
80
- pip install -r requirements.txt
81
- python demo.py
82
  ```
83
 
84
  ## Uses
@@ -86,15 +67,14 @@ python demo.py
86
  ```python
87
  from lyraChatGLM import LyraChatGLM6B
88
 
89
- model_path = "./models/1-gpu-fp16.bin"
90
  tokenizer_path = "./models"
91
  data_type = "fp16"
92
- int8_mode = 0 # 1 for INT8 WEIGHT ONLY PTQ
93
  max_output_length = 150
94
  arch = "Ampere" # Ampere or Volta
95
- cuda_version = 12
96
 
97
- model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch, cuda_version)
98
  prompt = "列出3个不同的机器学习算法,并说明它们的适用范围."
99
  test_batch_size = 256
100
 
@@ -120,29 +100,11 @@ print(output_texts)
120
 
121
  3. 支持向量机(Support Vector Machine):支持向量机是一种监督学习方法,通常用于分类问题。它可以处理高维数据,并且具有较高的准确性。适用于需要对高维数据进行分类或回归的问题,例如图像识别、自然语言处理等。
122
 
123
- ## INT8
124
-
125
- **Int8 usage**:
126
-
127
- Our current version supports INT8 weight-only PTQ. To enable this mode, simply set `int8_mode` to `1` in demo.py.
128
-
129
- **In this mode, GPU memory can be further reduced by about half and the speed can be doubled.**
130
-
131
- This solves the issue mentioned in https://github.com/THUDM/ChatGLM-6B/issues/1042.
132
-
133
- However, the speed gain is best achieved with a batch size of no more than 128. If you are not using an A100 GPU, you can
134
- reduce the batch size to get the benefit; we recommend a batch size of 64. This mode is well suited to GPUs with
135
- limited VRAM, or to real-time services where larger batch sizes are hard to use.
136
-
137
- It should be noted that although we have aligned the accuracy in our test cases, there may be slight differences
138
- in accuracy in some untested scenarios with int8. Please be aware of this.
139
-
140
-
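A minimal sketch of enabling this INT8 mode through the `Uses` API above (paths, batch size, and sampling values are illustrative):

```python
from lyraChatGLM import LyraChatGLM6B

# int8_mode=1 turns on INT8 weight-only PTQ: roughly half the GPU memory and up to ~2x speed,
# best with batch sizes of 128 or less (64 recommended), as noted above. Paths are illustrative.
model = LyraChatGLM6B("./models/1-gpu-fp16.bin", "./models", "fp16",
                      int8_mode=1, arch="Ampere", cuda_version=12)

prompts = ["列出3个不同的机器学习算法,并说明它们的适用范围."] * 64  # recommended INT8 batch size
outputs = model.generate(prompts, output_length=150, top_k=30, top_p=0.85,
                         temperature=0.35, repetition_penalty=1.2, do_sample=False)
print(outputs)
```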
141
  ## Citation
142
  ``` bibtex
143
  @Misc{lyraChatGLM2023,
144
    author =       {Kangjian Wu, Zhengtao Wang, Yibo Lu, Bin Wu},
145
-   title =        {lyraChatGLM: Accelerating ChatGLM to 9000+ tokens/s},
146
    howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
147
    year =         {2023}
148
  }
@@ -150,4 +112,4 @@ in accuracy in some untested scenarios with int8. Please be aware of this.
150
 
151
  ## Report bug
152
  - start a discussion to report any bugs!--> https://huggingface.co/TMElyralab/lyraChatGLM/discussions
153
- - report bug with a `[bug]` mark in the title.
 
1
  ---
2
+ license: creativeml-openrail-m
3
  language: en
4
  tags:
5
+ - LLM
6
+ - ChatGLM6B
7
+
8
  ---
9
+
10
+
11
  ## Breakings!
12
 
13
+ **We know what you want, and here you go!**
14
 
15
+ - Newly released lyraChatGLM model, suitable for Ampere (A100/A10) as well as Volta (V100)
16
+ - lyraChatGLM has been further optimized, reaching **9000 tokens/s** on A100 and **3900 tokens/s** on V100, about **5.5x** faster than the original version (2023/6/1).
17
  - The memory usage was optimized too, now we can set batch_size up to **256** on A100!
 
18
 
19
+ **Note that the code was fully updated too; you need to use the new API, see `Uses` below**
20
 
 
21
 
 
 
22
  ## Model Card for lyraChatGLM
23
 
24
  lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.
25
 
26
+ The inference speed of lyraChatGLM has achieved a **300x** speedup over the early original version. We are still working hard to further improve the performance.
27
+
28
+ Among its main features are:
29
 
 
30
  - weights: original ChatGLM-6B weights released by THUDM.
31
  - device: Nvidia GPU with Ampere architecture or Volta architecture (A100, A10, V100...).
32
+ - batch_size: compiled with dynamic batch size; the maximum depends on the device.
33
+
34
+ ## Speed
35
 
 
36
  - original version (fixed batch infer): commit id 1d240ba
37
 
38
  ### test on A100 40G
39
+
40
  |version|max_batch_size|max_speed|
41
  |:-:|:-:|:-:|
42
  |original|1|30 tokens/s|
43
+ |original (fixed batch infer)|192|1638.52 tokens/s|
44
+ |lyraChatGLM (current)|256|9082.60+ tokens/s|
 
 
 
 
 
 
45
 
46
  ### test on V100
 
47
  |version|max_batch_size|max_speed|
48
  |:-:|:-:|:-:|
49
  |original|1|17.83 tokens/s|
50
+ |original (fixed batch infer)|128|992.20 tokens/s|
51
+ |lyraChatGLM (current)|192|3911.45+ tokens/s|
 
 
 
 
 
 
52
 
53
  ## Model Sources
54
 
55
  - **Repository:** https://huggingface.co/THUDM/chatglm-6b
56
 
57
+ ## Docker Environment
58
 
59
+ - **docker image available** at https://hub.docker.com/repository/docker/bigmoyan/lyrallm/general; pull the image with:
 
60
 
61
+ ```
62
+ docker pull bigmoyan/lyrallm:v0.1
 
 
 
 
63
  ```
64
 
65
  ## Uses
 
67
  ```python
68
  from lyraChatGLM import LyraChatGLM6B
69
 
70
+ model_path = "./models/1-gpu-fp16.h5"
71
  tokenizer_path = "./models"
72
  data_type = "fp16"
73
+ int8_mode = 0
74
  max_output_length = 150
75
  arch = "Ampere" # Ampere or Volta
 
76
 
77
+ model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch)
78
  prompt = "列出3个不同的机器学习算法,并说明它们的适用范围."
79
  test_batch_size = 256
80
 
 
100
 
101
  3. 支持向量机(Support Vector Machine):支持向量机是一种监督学习方法,通常用于分类问题。它可以处理高维数据,并且具有较高的准确性。适用于需要对高维数据进行分类或回归的问题,例如图像识别、自然语言处理等。
102
 
 
103
  ## Citation
104
  ``` bibtex
105
  @Misc{lyraChatGLM2023,
106
    author =       {Kangjian Wu, Zhengtao Wang, Yibo Lu, Bin Wu},
107
+   title =        {lyraChatGLM: Accelerating ChatGLM by 5.5x+},
108
    howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
109
    year =         {2023}
110
  }
 
112
 
113
  ## Report bug
114
  - start a discussion to report any bugs!--> https://huggingface.co/TMElyralab/lyraChatGLM/discussions
115
+ - report bug with a `[bug]` mark in the title.
demo.py CHANGED
@@ -1,22 +1,20 @@
1
  from lyraChatGLM import LyraChatGLM6B
2
- import numpy as np
3
 
4
- model_path = "./models/1-gpu-fp16.bin"
5
  tokenizer_path = "./models"
6
- inference_data_type = "fp16"
7
  int8_mode = 0
8
  max_output_length = 150
9
- arch = "Volta" # Ampere or Volta
10
- cuda_version = 11 # cuda version, we currently support 11 and 12
11
-
12
- model = LyraChatGLM6B(model_path, tokenizer_path, inference_data_type, int8_mode, arch, cuda_version)
13
 
 
14
  prompt = "今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服裤子鞋子搭配。"
15
- # test_batch_size = 256
16
 
17
  prompts = [prompt, ]
18
 
19
- # # If you want to get different output in same batch, you can set do_sample to True
 
20
  output_texts = model.generate(prompts, output_length=max_output_length,top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2, do_sample=False)
21
 
22
- print(output_texts)
 
1
  from lyraChatGLM import LyraChatGLM6B
 
2
 
3
+ model_path = "./models/1-gpu-fp16.h5"
4
  tokenizer_path = "./models"
5
+ data_type = "fp16"
6
  int8_mode = 0
7
  max_output_length = 150
8
+ arch = "Ampere" # Ampere or Volta
 
 
 
9
 
10
+ model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch)
11
  prompt = "今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服裤子鞋子搭配。"
12
+ test_batch_size = 256
13
 
14
  prompts = [prompt, ]
15
 
16
+
17
+ # If you want to get different outputs in the same batch, you can set do_sample to True
18
  output_texts = model.generate(prompts, output_length=max_output_length,top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2, do_sample=False)
19
 
20
+ print(output_texts)
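A minimal batched variant of the demo above, assuming the same model files; replicating a single prompt across a batch and setting `do_sample=True` yields different outputs per batch item (the batch size of 256 is illustrative, per the README's A100 figures):

```python
from lyraChatGLM import LyraChatGLM6B

model = LyraChatGLM6B("./models/1-gpu-fp16.h5", "./models", "fp16", 0, "Ampere")

# Replicate one prompt across the batch; do_sample=True gives varied outputs per item.
prompts = ["今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服裤子鞋子搭配。"] * 256
outputs = model.generate(prompts, output_length=150, top_k=30, top_p=0.85,
                         temperature=0.35, repetition_penalty=1.2, do_sample=True)
print(len(outputs), outputs[0])
```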
lyraChatGLM/config.py CHANGED
@@ -14,7 +14,7 @@ class ChatGLM6BParam:
14
  tensor_para_size: int = 1
15
  pipeline_para_size: int = 1
16
  remove_padding: bool = True
17
- shared_contexts_ratio: float = 0.0
18
  layernorm_eps: float = 1e-5
19
  weights_data_type: str = "fp16"
20
 
 
14
  tensor_para_size: int = 1
15
  pipeline_para_size: int = 1
16
  remove_padding: bool = True
17
+ shared_contexts_ratio: float = 1.0
18
  layernorm_eps: float = 1e-5
19
  weights_data_type: str = "fp16"
20
 
lyraChatGLM/ftlib/{libth_transformer_sm80_cu11.so → libth_transformer_sm70.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:60a06f87ca10c5d556f965a5178aac50cbcbcec0265a7bcf18751e6ef73a807c
3
- size 200894104
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:74ba35dfae0d02b89594bad9458c15fba2b57fb2d96b698cbd94d78368f3f246
3
+ size 114138600
lyraChatGLM/ftlib/libth_transformer_sm70_cu12.so DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:2d9829541f5edccf8d59e275e1259404168750e3419902fc4c88f789baad3f20
3
- size 114203064
 
 
 
 
lyraChatGLM/ftlib/{libth_transformer_sm80_cu12.so → libth_transformer_sm80.so} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:146841b4ef362048507a576d20cb1e5bb02e0d67f3fcfce351ce25f00989dfbd
3
- size 200980552
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c814d3d493d25d64925261cac48aaf8e1a33722fba4ce3eb8bc7abdcc51f37cf
3
+ size 200886848
lyraChatGLM/lyra_glm.py CHANGED
@@ -10,15 +10,15 @@ import transformers
10
  from .config import CHATGLM_6B_PARAM
11
  from .model import ChatGLM6BModel
12
 
 
13
  class LyraChatGLM6B:
14
- def __init__(self, model_path, tokenizer_path=None, dtype='fp16', int8_mode=0, arch="Ampere", cuda_version="11") -> None:
15
  self.model_path = model_path
16
  self.tokenizer_path = tokenizer_path
17
  self.dtype = dtype
18
  self.arch=arch
19
- # if dtype != 'int8':
20
- # int8_mode = 0
21
- self.cuda_version = cuda_version
22
  self.int8_mode = int8_mode
23
 
24
  self.model, self.tokenizer = self.load_model_and_tokenizer()
@@ -81,9 +81,7 @@ class LyraChatGLM6B:
81
  max_seq_len=0, # for position seq embedding
82
  pipeline_para_size=CHATGLM_6B_PARAM.pipeline_para_size,
83
  shared_contexts_ratio=CHATGLM_6B_PARAM.shared_contexts_ratio,
84
- int8_mode=self.int8_mode,
85
- model_path=self.model_path,
86
- cuda_version=self.cuda_version,
87
  ))
88
 
89
  print('[INFO] Load Our Highly Optimized LyraChatGLM6B model')
@@ -106,6 +104,8 @@ class LyraChatGLM6B:
106
 
107
  print(f'Loading tokenizer from {self.model_path}')
108
  model = ChatGLM6BModel(arch=self.arch,**model_args)
 
 
109
 
110
  return model, tokenizer
111
 
@@ -134,10 +134,7 @@ class LyraChatGLM6B:
134
  ones_int = torch.ones(size=[batch_size], dtype=torch.int32)
135
  ones_float = torch.ones(size=[batch_size], dtype=torch.float32)
136
 
137
- # input_token_ids = self.tokenizer(prompts, return_tensors="pt", padding=True).input_ids.int()
138
- raw_input_token_ids = self.tokenizer(prompts, padding=True)
139
- input_token_ids = torch.tensor (raw_input_token_ids["input_ids"],dtype=torch.int32)
140
-
141
  input_lengths = torch.IntTensor([len(ids) for ids in input_token_ids])
142
  mask_positions = torch.IntTensor([seq.index(130001) for seq in input_token_ids.tolist()])
143
 
 
10
  from .config import CHATGLM_6B_PARAM
11
  from .model import ChatGLM6BModel
12
 
13
+
14
  class LyraChatGLM6B:
15
+ def __init__(self, model_path, tokenizer_path=None, dtype='fp16', int8_mode=0, arch="Ampere") -> None:
16
  self.model_path = model_path
17
  self.tokenizer_path = tokenizer_path
18
  self.dtype = dtype
19
  self.arch=arch
20
+ if dtype != 'int8':
21
+ int8_mode = 0
 
22
  self.int8_mode = int8_mode
23
 
24
  self.model, self.tokenizer = self.load_model_and_tokenizer()
 
81
  max_seq_len=0, # for position seq embedding
82
  pipeline_para_size=CHATGLM_6B_PARAM.pipeline_para_size,
83
  shared_contexts_ratio=CHATGLM_6B_PARAM.shared_contexts_ratio,
84
+ int8_mode=self.int8_mode
 
 
85
  ))
86
 
87
  print('[INFO] Load Our Highly Optimized LyraChatGLM6B model')
 
104
 
105
  print(f'Loading tokenizer from {self.model_path}')
106
  model = ChatGLM6BModel(arch=self.arch,**model_args)
107
+ if not model.load(ckpt_path=self.model_path):
108
+ print('[WARNING] Skip model loading since no checkpoints are found')
109
 
110
  return model, tokenizer
111
 
 
134
  ones_int = torch.ones(size=[batch_size], dtype=torch.int32)
135
  ones_float = torch.ones(size=[batch_size], dtype=torch.float32)
136
 
137
+ input_token_ids = self.tokenizer(prompts, return_tensors="pt", padding=True).input_ids.int()
 
 
 
138
  input_lengths = torch.IntTensor([len(ids) for ids in input_token_ids])
139
  mask_positions = torch.IntTensor([seq.index(130001) for seq in input_token_ids.tolist()])
140
 
lyraChatGLM/model.py CHANGED
@@ -8,6 +8,402 @@ import torch
8
  import torch.distributed as dist
9
  import torch.nn as nn
10
 
 
11
  class ChatGLM6BModel(nn.Module):
12
  def __init__(self,
13
  head_num, size_per_head,
@@ -19,8 +415,6 @@ class ChatGLM6BModel(nn.Module):
19
  tensor_para_size: int,
20
  pipeline_para_size: int,
21
  inference_data_type: str,
22
- model_path,
23
- cuda_version,
24
  inter_size: int = 0,
25
  # glm_variant_params
26
  layernorm_eps: float = 1e-5,
@@ -49,7 +443,6 @@ class ChatGLM6BModel(nn.Module):
49
  self.layer_num = layer_num
50
  self.inter_size = inter_size if inter_size != 0 else 4 * self.head_num * self.size_per_head
51
  self.arch = arch
52
- self.model_path = model_path
53
  # gpt_variant_params
54
  self.layernorm_eps = layernorm_eps
55
  self.layernorm_type = layernorm_type
@@ -79,28 +472,62 @@ class ChatGLM6BModel(nn.Module):
79
  assert head_num % tensor_para_size == 0, "head_num must be a multiple of tensor_para_size."
80
  assert layer_num % pipeline_para_size == 0, "layer_num must be a multiple of pipeline_para_size."
81
 
82
- self.device = 0
83
-
84
  # Load the C++ model into Pytorch model.
85
- sm = "sm80"
86
-
87
  if arch == "Ampere":
88
- sm = "sm80"
89
  elif arch == "Volta":
90
- sm = "sm70"
91
- else:
92
- raise Exception(f"unsupported arch: {arch}")
93
 
94
- cu = 'cu11'
95
- if cuda_version == 11:
96
- cu = 'cu11'
97
- elif cuda_version == 12:
98
- cu = 'cu12'
99
- else:
100
- raise Exception(f"unsupported cuda version: {cuda_version}")
 
 
 
 
 
 
101
 
102
- lib_path = pathlib.Path(__file__).parent / "ftlib" / f"libth_transformer_{sm}_{cu}.so"
103
- torch.classes.load_library(os.path.abspath(lib_path))
 
 
104
 
105
  self.model = torch.classes.FasterTransformer.GlmOp(
106
  self.head_num, self.size_per_head, self.inter_size,
@@ -122,9 +549,9 @@ class ChatGLM6BModel(nn.Module):
122
  self.has_adapters,
123
  self.adapter_inter_size,
124
  self.use_attention_linear_bias,
125
- self.model_path,
126
- self.weights_data_type,
127
- inference_data_type,
128
  self.shared_contexts_ratio)
129
  self.build_model = True
130
 
@@ -146,7 +573,10 @@ class ChatGLM6BModel(nn.Module):
146
  bad_words_list: typing.Optional[torch.IntTensor] = None,
147
  return_output_length: bool = False,
148
  return_cum_log_probs: int = 0):
149
-
 
 
 
150
  input_len = start_ids.size(1)
151
  assert input_len > 0, "input len must be larger than zero. For an unconditional case, use start_id as the first token."
152
 
 
8
  import torch.distributed as dist
9
  import torch.nn as nn
10
 
11
+ str_type_map = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}
12
+
13
+
14
+ class ChatGLM6BWeights:
15
+ def __init__(
16
+ self, head_num, size_per_head, layer_num, vocab_size, max_seq_len, tensor_para_size, pipeline_para_size,
17
+ weights_data_type: typing.Union[str, np.dtype],
18
+ inference_data_type: str, has_adapters: bool = False, adapter_inter_size: int = 0, gpt_with_moe: bool = False,
19
+ has_positional_encoding: bool = False, has_pre_decoder_layernorm: bool = False,
20
+ has_post_decoder_layernorm: bool = True, int8_mode: int = 0, inter_size: int = 0):
21
+ assert(head_num % tensor_para_size == 0)
22
+ if int8_mode == 1:
23
+ torch_infer_dtype = str_type_map[inference_data_type]
24
+ assert torch_infer_dtype == torch.float16 or torch_infer_dtype == torch.bfloat16, "Weight only quant only supported for infer type fp16 or bf16."
25
+ quant = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix
26
+ self.weight_transpose_calibrate_quantize = lambda x: quant(x, torch.int8)
27
+ else:
28
+ assert int8_mode == 0, "Invalid int8 mode for GPT. Must be 0 or 1"
29
+
30
+ self.head_num = head_num
31
+ self.size_per_head = size_per_head
32
+ self.layer_num = layer_num
33
+ self.vocab_size = vocab_size
34
+ self.max_seq_len = max_seq_len
35
+ self.tensor_para_size = tensor_para_size
36
+ self.pipeline_para_size = pipeline_para_size
37
+ self.layers_per_device = layer_num // pipeline_para_size
38
+
39
+ self.has_adapters = has_adapters
40
+ self.adapter_inter_size = adapter_inter_size
41
+ self.gpt_with_moe = gpt_with_moe
42
+ self.has_positional_encoding = has_positional_encoding
43
+ self.has_pre_decoder_layernorm = has_pre_decoder_layernorm
44
+ self.has_post_decoder_layernorm = has_post_decoder_layernorm
45
+
46
+ local_head_num = head_num // tensor_para_size
47
+ global_head_num = head_num
48
+ local_hidden_units = local_head_num * size_per_head
49
+ global_hidden_units = global_head_num * size_per_head
50
+ local_inter_size = local_hidden_units * 4
51
+ if inter_size != 0:
52
+ assert inter_size % tensor_para_size == 0, f"inter_size({inter_size}) \% tensor_para_size({tensor_para_size}) must be 0"
53
+ local_inter_size = inter_size // tensor_para_size
54
+ local_adapter_inter_size = self.adapter_inter_size // tensor_para_size
55
+
56
+ self.local_head_num = local_head_num
57
+ self.global_head_num = global_head_num
58
+ self.local_hidden_units = local_hidden_units
59
+ self.global_hidden_units = global_hidden_units
60
+ self.local_inter_size = local_inter_size
61
+
62
+ self.int8_mode = int8_mode
63
+ self.share_embed = False
64
+
65
+ if isinstance(weights_data_type, str):
66
+ try:
67
+ weights_data_type = {
68
+ "fp16": np.float16,
69
+ "fp32": np.float32,
70
+ "float16": np.float16,
71
+ "float32": np.float32,
72
+ }[weights_data_type]
73
+ except KeyError:
74
+ raise ValueError(f"Don't know how to interpret weights_data_type: {weights_data_type}")
75
+
76
+ assert weights_data_type in [np.float32, np.float16]
77
+ self.weights_data_type = weights_data_type
78
+ self.inference_data_type = inference_data_type
79
+
80
+ self.w = []
81
+ self.int8_w = []
82
+ self.scale = []
83
+
84
+ # Transformer blocks
85
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
86
+ * layer_num) # self_layernorm_gamma
87
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
88
+ * layer_num) # self_layernorm_beta
89
+ self.w.extend([torch.zeros(global_hidden_units, local_hidden_units * 3,
90
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # self_kernel
91
+ self.w.extend([torch.zeros(local_hidden_units * 3, dtype=str_type_map[self.inference_data_type])]
92
+ * layer_num) # self_bias
93
+ self.w.extend(
94
+ [torch.zeros(local_hidden_units, global_hidden_units, dtype=str_type_map[self.inference_data_type])] *
95
+ layer_num) # self_output_kernel
96
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
97
+ * layer_num) # self_output_bias
98
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
99
+ * layer_num) # ffn_layernorm_gamma
100
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
101
+ * layer_num) # ffn_layernorm_beta
102
+ self.w.extend(
103
+ [torch.zeros(global_hidden_units, local_inter_size, dtype=str_type_map[self.inference_data_type])] *
104
+ layer_num) # ffn_kernel1
105
+ self.w.extend([torch.zeros(local_inter_size, dtype=str_type_map[self.inference_data_type])]
106
+ * layer_num) # ffn_bias1
107
+ self.w.extend(
108
+ [torch.zeros(local_inter_size, global_hidden_units, dtype=str_type_map[self.inference_data_type])] *
109
+ layer_num) # ffn_kernel2
110
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
111
+ * layer_num) # ffn_bias2
112
+
113
+ optional_adapter_offset = 0
114
+
115
+ # After Transformer blocks
116
+ if self.has_pre_decoder_layernorm:
117
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
118
+ self.inference_data_type])) # embedding layernorm gamma
119
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
120
+ self.inference_data_type])) # embedding layernorm beta
121
+ optional_adapter_offset += 2
122
+ if self.has_post_decoder_layernorm:
123
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
124
+ self.inference_data_type])) # final layernorm gamma
125
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
126
+ self.inference_data_type])) # final layernorm beta
127
+ optional_adapter_offset += 2
128
+ if self.has_positional_encoding:
129
+ self.w.append(torch.zeros(max_seq_len, global_hidden_units, dtype=str_type_map[
130
+ self.inference_data_type])) # position_encoding_table
131
+ optional_adapter_offset += 1
132
+
133
+ self.pre_embed_idx = len(self.w)
134
+ self.w.append(torch.zeros(vocab_size, global_hidden_units,
135
+ dtype=str_type_map[self.inference_data_type])) # embedding_table
136
+ self.post_embed_idx = len(self.w)
137
+ self.w.append(torch.zeros(vocab_size, global_hidden_units, dtype=str_type_map[
138
+ self.inference_data_type])) # post embedding_kernel
139
+ self.adapter_offset = 2 + optional_adapter_offset
140
+
141
+ self.w.extend([torch.empty(0, dtype=str_type_map[self.inference_data_type])] * layer_num) # gating_weight
142
+ self.adapter_offset += layer_num
143
+
144
+ # adapters
145
+ if self.has_adapters:
146
+ self.w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
147
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor1_kernel1
148
+ self.w.extend([torch.zeros(local_adapter_inter_size, dtype=str_type_map[
149
+ self.inference_data_type])] * layer_num) # adaptor1_bias1
150
+ self.w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
151
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor1_kernel2
152
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[
153
+ self.inference_data_type])] * layer_num) # adaptor1_bias2
154
+ self.w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
155
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor2_kernel1
156
+ self.w.extend([torch.zeros(local_adapter_inter_size, dtype=str_type_map[
157
+ self.inference_data_type])] * layer_num) # adaptor2_bias1
158
+ self.w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
159
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor2_kernel2
160
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[
161
+ self.inference_data_type])] * layer_num) # adaptor2_bias2
162
+
163
+ # Initialization
164
+ # self._map(lambda w: torch.nn.init.normal_(w, mean=0., std=1.))
165
+
166
+ if (self.int8_mode != 0):
167
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_hidden_units *
168
+ 3, dtype=torch.int8)] * layer_num) # self_int8_kernel
169
+ self.scale.extend([torch.zeros(local_hidden_units * 3, dtype=torch.float)] * layer_num) # self_scale
170
+ self.int8_w.extend([torch.zeros(local_hidden_units, global_hidden_units, dtype=torch.int8)]
171
+ * layer_num) # self_output_int8_kernel
172
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # self_output_scale
173
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_inter_size,
174
+ dtype=torch.int8)] * layer_num) # ffn_int8_kernel1
175
+ self.scale.extend([torch.zeros(local_inter_size, dtype=torch.float)] * layer_num) # ffn_scale1
176
+ self.int8_w.extend([torch.zeros(local_inter_size, global_hidden_units,
177
+ dtype=torch.int8)] * layer_num) # ffn_int8_kernel2
178
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # ffn_scale2
179
+
180
+ if self.has_adapters:
181
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
182
+ dtype=torch.int8)] * layer_num) # adaptor1_int8_kernel1
183
+ self.scale.extend([torch.zeros(local_adapter_inter_size, dtype=torch.float)]
184
+ * layer_num) # adaptor1_scale1
185
+ self.int8_w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
186
+ dtype=torch.int8)] * layer_num) # adaptor1_int8_kernel2
187
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # adaptor1_scale2
188
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
189
+ dtype=torch.int8)] * layer_num) # adaptor2_int8_kernel1
190
+ self.scale.extend([torch.zeros(local_adapter_inter_size, dtype=torch.float)]
191
+ * layer_num) # adaptor2_scale1
192
+ self.int8_w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
193
+ dtype=torch.int8)] * layer_num) # adaptor2_int8_kernel2
194
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # adaptor2_scale2
195
+
196
+ def __getitem__(self, idx):
197
+ return self.w[idx]
198
+
199
+ def __setitem__(self, idx, val):
200
+ self.w[idx] = val
201
+
202
+ def __len__(self):
203
+ return len(self.w)
204
+
205
+ def _map(self, func):
206
+ assert self.pre_embed_idx < self.post_embed_idx, \
207
+ "Pre decoder embedding index should be lower than post decoder embedding index."
208
+ for i in range(len(self.w)):
209
+ if isinstance(self.w[i], list):
210
+ for j in range(len(self.w[i])):
211
+ self.w[i][j] = func(self.w[i][j])
212
+ else:
213
+ if self.share_embed and i == self.post_embed_idx:
214
+ # If sharing the pre and post embedding, any mapping to
215
+ # the pre decoder weight will give the same output to the
216
+ # post decoder weight, so we just copy here.
217
+ self.w[self.post_embed_idx] = self.w[self.pre_embed_idx]
218
+ else:
219
+ self.w[i] = func(self.w[i])
220
+
221
+ def _map_int8(self, func):
222
+ for i in range(len(self.int8_w)):
223
+ if isinstance(self.int8_w[i], list):
224
+ for j in range(len(self.int8_w[i])):
225
+ self.int8_w[i][j] = func(self.int8_w[i][j])
226
+
227
+ else:
228
+ self.int8_w[i] = func(self.int8_w[i])
229
+ for i in range(len(self.scale)):
230
+ if isinstance(self.scale[i], list):
231
+ for j in range(len(self.scale[i])):
232
+ self.scale[i][j] = func(self.scale[i][j])
233
+
234
+ else:
235
+ self.scale[i] = func(self.scale[i])
236
+
237
+ def _map_int8_scales(self, func):
238
+ for i in range(len(self.scale)):
239
+ if isinstance(self.scale[i], list):
240
+ for j in range(len(self.scale[i])):
241
+ self.scale[i][j] = func(self.scale[i][j])
242
+
243
+ else:
244
+ self.scale[i] = func(self.scale[i])
245
+
246
+ def load(self, ckpt_path, tp_rank, pipeline_para_rank):
247
+ if not os.path.exists(ckpt_path):
248
+ raise FileNotFoundError(f"Failed to find {ckpt_path}")
249
+ w = []
250
+
251
+ type_map = {np.float32: torch.float32, np.float16: torch.float16}
252
+ # Load: only the layers owned by this pipeline-parallel rank are actually read.
253
+
254
+ def is_load(i): return i >= self.layers_per_device * \
255
+ pipeline_para_rank and i < self.layers_per_device * (pipeline_para_rank + 1)
256
+
257
+ h5f = h5py.File(ckpt_path, "r")
258
+
259
+ def load_to_torch(key, is_load: bool):
260
+ if is_load:
261
+ npdata = h5f[key]["weights"][:]
262
+ return torch.from_numpy(npdata).to(str_type_map[self.inference_data_type])
263
+ else:
264
+ return torch.empty(0).to(str_type_map[self.inference_data_type])
265
+ w.extend([load_to_torch(f"model.layers.{i}.input_layernorm.weight", is_load(i))
266
+ for i in range(self.layer_num)])
267
+ w.extend([load_to_torch(f"model.layers.{i}.input_layernorm.bias", is_load(i))
268
+ for i in range(self.layer_num)])
269
+ w.extend(
270
+ [load_to_torch(
271
+ f"model.layers.{i}.attention.query_key_value.weight.{tp_rank}", is_load(i))
272
+ for i in range(self.layer_num)])
273
+ w.extend([
274
+ load_to_torch(
275
+ f"model.layers.{i}.attention.query_key_value.bias.{tp_rank}", is_load(i))
276
+ for i in range(self.layer_num)])
277
+ w.extend([load_to_torch(f"model.layers.{i}.attention.dense.weight.{tp_rank}",
278
+ is_load(i)) for i in range(self.layer_num)])
279
+ w.extend([load_to_torch(f"model.layers.{i}.attention.dense.bias", is_load(i))
280
+ for i in range(self.layer_num)])
281
+ w.extend([load_to_torch(f"model.layers.{i}.post_attention_layernorm.weight",
282
+ is_load(i)) for i in range(self.layer_num)])
283
+ w.extend([load_to_torch(f"model.layers.{i}.post_attention_layernorm.bias",
284
+ is_load(i)) for i in range(self.layer_num)])
285
+ w.extend(
286
+ [load_to_torch(f"model.layers.{i}.mlp.dense_h_to_4h.weight.{tp_rank}", is_load(i))
287
+ for i in range(self.layer_num)])
288
+ w.extend(
289
+ [load_to_torch(f"model.layers.{i}.mlp.dense_h_to_4h.bias.{tp_rank}", is_load(i))
290
+ for i in range(self.layer_num)])
291
+ w.extend(
292
+ [load_to_torch(f"model.layers.{i}.mlp.dense_4h_to_h.weight.{tp_rank}", is_load(i))
293
+ for i in range(self.layer_num)])
294
+ w.extend([load_to_torch(f"model.layers.{i}.mlp.dense_4h_to_h.bias", is_load(i)) for i in range(self.layer_num)])
295
+
296
+ if self.has_pre_decoder_layernorm:
297
+ w.append(load_to_torch(f"model.pre_decoder_layernorm.weight", True))
298
+ w.append(load_to_torch(f"model.pre_decoder_layernorm.bias", True))
299
+
300
+ if self.has_post_decoder_layernorm:
301
+ w.append(load_to_torch(f"model.final_layernorm.weight", True))
302
+ w.append(load_to_torch(f"model.final_layernorm.bias", True))
303
+
304
+ if self.has_positional_encoding:
305
+ wpe = load_to_torch("model.wpe", True).reshape(-1, self.global_hidden_units)
306
+ assert self.max_seq_len <= wpe.size(0), (
307
+ f"max_seq_len ({self.max_seq_len}) must not exceed "
308
+ f"the value of maximum sequence length during training ({wpe.size(0)})."
309
+ )
310
+ w.append(wpe)
311
+ w.append(load_to_torch(f"model.wte", True))
312
+ self.share_embed = True
313
+ w.append(torch.empty(0).to(str_type_map[self.inference_data_type]))
314
+
315
+ gate_list = []
316
+ for i in range(self.layer_num):
317
+ gate_list.append(load_to_torch(f"model.layers.{i}.mlp.moe.gate.wg.weight", False))
318
+ w.extend(gate_list)
319
+
320
+ if self.has_adapters:
321
+ w.extend(
322
+ [load_to_torch(
323
+ f"model.layers.{i}.after_attention_adapter.dense_h_to_4h.weight.{tp_rank}", is_load(i))
324
+ for i in range(self.layer_num)])
325
+ w.extend([
326
+ load_to_torch(
327
+ f"model.layers.{i}.after_attention_adapter.dense_h_to_4h.bias.{tp_rank}", is_load(i))
328
+ for i in range(self.layer_num)])
329
+ w.extend(
330
+ [load_to_torch(
331
+ f"model.layers.{i}.after_attention_adapter.dense_4h_to_h.weight.{tp_rank}", is_load(i))
332
+ for i in range(self.layer_num)])
333
+ w.extend(
334
+ [load_to_torch(f"model.layers.{i}.after_attention_adapter.dense_4h_to_h.bias", is_load(i))
335
+ for i in range(self.layer_num)])
336
+ w.extend(
337
+ [load_to_torch(f"model.layers.{i}.after_ffn_adapter.dense_h_to_4h.weight.{tp_rank}", is_load(i))
338
+ for i in range(self.layer_num)])
339
+ w.extend(
340
+ [load_to_torch(f"model.layers.{i}.after_ffn_adapter.dense_h_to_4h.bias.{tp_rank}", is_load(i))
341
+ for i in range(self.layer_num)])
342
+ w.extend(
343
+ [load_to_torch(f"model.layers.{i}.after_ffn_adapter.dense_4h_to_h.weight.{tp_rank}", is_load(i))
344
+ for i in range(self.layer_num)])
345
+ w.extend([load_to_torch(
346
+ f"model.layers.{i}.after_ffn_adapter.dense_4h_to_h.bias", is_load(i)) for i in range(self.layer_num)])
347
+
348
+ assert len(self.w) == len(w)
349
+
350
+ # Reshape
351
+ try:
352
+ for i in range(len(w)):
353
+ if w[i].nelement() == self.w[i].nelement():
354
+ self.w[i] = w[i].reshape(self.w[i].shape)
355
+ else:
356
+ self.w[i] = w[i]
357
+
358
+ except RuntimeError:
359
+ raise RuntimeError(
360
+ f"head_num, size_per_head, vocab_size, and max_seq_len must be the same as the ones during training "
361
+ f"(idx: {i} expected shape: {self.w[i].shape} got shape: {w[i].shape})."
362
+ )
363
+
364
+ # Transpose, calibrate, and quantize the kernels for int8 inference
365
+ layer_num = self.layer_num
366
+ if self.int8_mode != 0:
367
+ for i in range(layer_num):
368
+ self.int8_w[i + 0 * layer_num], self.scale[i + 0 *
369
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[2 * layer_num + i])
370
+ self.int8_w[i + 1 * layer_num], self.scale[i + 1 *
371
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[4 * layer_num + i])
372
+ self.int8_w[i + 2 * layer_num], self.scale[i + 2 *
373
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[8 * layer_num + i])
374
+ self.int8_w[i + 3 * layer_num], self.scale[i + 3 *
375
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[10 * layer_num + i])
376
+
377
+ # We clear the original weights since they are no longer needed
378
+ if self.int8_mode == 1:
379
+ self.w[2 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
380
+ self.w[4 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
381
+ self.w[8 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
382
+ self.w[10 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
383
+
384
+ if self.has_adapters:
385
+ self.int8_w[i + 4 * layer_num], self.scale[i + 4 * layer_num] = self.weight_transpose_calibrate_quantize(
386
+ self.w[12 * layer_num + i + self.adapter_offset])
387
+ self.int8_w[i + 5 * layer_num], self.scale[i + 5 * layer_num] = self.weight_transpose_calibrate_quantize(
388
+ self.w[14 * layer_num + i + self.adapter_offset])
389
+ self.int8_w[i + 6 * layer_num], self.scale[i + 6 * layer_num] = self.weight_transpose_calibrate_quantize(
390
+ self.w[16 * layer_num + i + self.adapter_offset])
391
+ self.int8_w[i + 7 * layer_num], self.scale[i + 7 * layer_num] = self.weight_transpose_calibrate_quantize(
392
+ self.w[18 * layer_num + i + self.adapter_offset])
393
+
394
+ # Similar to above:
395
+ if self.int8_mode == 1:
396
+ self.w[12 * layer_num + i + self.adapter_offset] = torch.empty(
397
+ 0).to(str_type_map[self.inference_data_type])
398
+ self.w[14 * layer_num + i + self.adapter_offset] = torch.empty(
399
+ 0).to(str_type_map[self.inference_data_type])
400
+ self.w[16 * layer_num + i + self.adapter_offset] = torch.empty(
401
+ 0).to(str_type_map[self.inference_data_type])
402
+ self.w[18 * layer_num + i + self.adapter_offset] = torch.empty(
403
+ 0).to(str_type_map[self.inference_data_type])
404
+ h5f.close()
+ return True
405
+
406
+
407
  class ChatGLM6BModel(nn.Module):
408
  def __init__(self,
409
  head_num, size_per_head,
 
415
  tensor_para_size: int,
416
  pipeline_para_size: int,
417
  inference_data_type: str,
 
 
418
  inter_size: int = 0,
419
  # glm_variant_params
420
  layernorm_eps: float = 1e-5,
 
443
  self.layer_num = layer_num
444
  self.inter_size = inter_size if inter_size != 0 else 4 * self.head_num * self.size_per_head
445
  self.arch = arch
 
446
  # gpt_variant_params
447
  self.layernorm_eps = layernorm_eps
448
  self.layernorm_type = layernorm_type
 
472
  assert head_num % tensor_para_size == 0, "head_num must be a multiple of tensor_para_size."
473
  assert layer_num % pipeline_para_size == 0, "layer_num must be a multiple of pipeline_para_size."
474
 
 
 
475
  # Load the C++ model into Pytorch model.
 
 
476
  if arch == "Ampere":
477
+ lib_path = pathlib.Path(__file__).parent / "ftlib" / "libth_transformer_sm80.so"
478
  elif arch == "Volta":
479
+ lib_path = pathlib.Path(__file__).parent / "ftlib" / "libth_transformer_sm70.so"
+ else:
+ raise ValueError(f"Unsupported arch: {arch}; expected 'Ampere' or 'Volta'.")
480
+ torch.classes.load_library(os.path.abspath(lib_path))
 
481
 
482
+ # Prepare weights
483
+ self.weights = ChatGLM6BWeights(head_num, size_per_head, layer_num, vocab_size,
484
+ max_seq_len, tensor_para_size, pipeline_para_size,
485
+ weights_data_type=weights_data_type,
486
+ inference_data_type=inference_data_type,
487
+ gpt_with_moe=self.gpt_with_moe,
488
+ has_positional_encoding=self.has_positional_encoding,
489
+ has_pre_decoder_layernorm=self.has_pre_decoder_layernorm,
490
+ has_post_decoder_layernorm=self.has_post_decoder_layernorm,
491
+ has_adapters=self.has_adapters,
492
+ adapter_inter_size=self.adapter_inter_size,
493
+ int8_mode=int8_mode,
494
+ inter_size=inter_size)
495
 
496
+ # Prepare for tensor/pipeline parallel
497
+ try:
498
+ dist.init_process_group(backend='mpi')
499
+ except Exception:
500
+ print("[INFO] WARNING: The process group has already been initialized.")
501
+ self.rank = dist.get_rank()
502
+ self.device_count = torch.cuda.device_count()
503
+ self.device = self.rank % self.device_count
504
+ torch.cuda.set_device(self.device)
505
+
506
+ world_size = dist.get_world_size()
507
+ assert world_size == tensor_para_size * pipeline_para_size, "tensor_para_size * pipeline_para_size must be equal to world_size."
508
+
509
+ self.tensor_para_rank = self.rank % self.tensor_para_size
510
+ self.pipeline_para_rank = self.rank // self.tensor_para_size
511
+
512
+ def load(self, ckpt_path):
513
+ is_load = self.weights.load(ckpt_path, tp_rank=self.tensor_para_rank,
514
+ pipeline_para_rank=self.pipeline_para_rank)
515
+ self.cuda()
516
+ torch.cuda.empty_cache() # clean cache for model weight preprocessing
517
+ return is_load
518
+
519
+ def sparse(self):
520
+ if not self.use_sparse_gemm:
521
+ self.use_sparse_gemm = True
522
+
523
+ def cuda(self):
524
+ self.weights._map(lambda w: w.cuda(self.device))
525
+ if self.int8_mode != 0:
526
+ self.weights._map_int8(lambda w: w.cuda(self.device))
527
+
528
+ if self.build_model:
529
+ del self.model
530
+ self.build_model = False
531
 
532
  self.model = torch.classes.FasterTransformer.GlmOp(
533
  self.head_num, self.size_per_head, self.inter_size,
 
549
  self.has_adapters,
550
  self.adapter_inter_size,
551
  self.use_attention_linear_bias,
552
+ self.weights.w,
553
+ self.weights.int8_w,
554
+ self.weights.scale,
555
  self.shared_contexts_ratio)
556
  self.build_model = True
557
 
 
573
  bad_words_list: typing.Optional[torch.IntTensor] = None,
574
  return_output_length: bool = False,
575
  return_cum_log_probs: int = 0):
576
+ if not self.build_model:
577
+ # build the model lazily if load() was never called
578
+ self.cuda()
579
+ torch.cuda.empty_cache() # clean cache for model weight preprocessing
580
  input_len = start_ids.size(1)
581
  assert input_len > 0, "input len must be larger than zero. For an unconditional case, use start_id as the first token."
582
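For reference, the intended usage flow after this change is: select the architecture-specific FasterTransformer library ("Ampere" → libth_transformer_sm80.so, "Volta" → libth_transformer_sm70.so), construct ChatGLM6BModel, then call load() with the converted .h5 checkpoint before running generation. The sketch below is illustrative only: the keyword names mirror the parameters visible in this diff, but the import path, the numeric model sizes, and the generation entry point are assumptions, not values confirmed by this PR.

```python
# Hedged usage sketch; argument names follow the diff, values are illustrative.
import torch
from lyraChatGLM.model import ChatGLM6BModel  # assumed import path (file: lyraChatGLM/model.py)

model = ChatGLM6BModel(
    head_num=32, size_per_head=128,        # ChatGLM-6B-style sizes, illustrative
    layer_num=28, vocab_size=130528,
    max_seq_len=2048,
    tensor_para_size=1, pipeline_para_size=1,
    inference_data_type="fp16",
    int8_mode=0,
    arch="Ampere",                         # selects ftlib/libth_transformer_sm80.so
)

# load() returns the result of ChatGLM6BWeights.load() and moves the weights to GPU.
assert model.load("models/1-gpu-fp16.h5")

# Generation then goes through forward(start_ids, ...); its full signature is
# only partially shown in this diff, so it is not spelled out here.
```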
 
models/1-gpu-fp16.bin DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:9bab22c98c57766bc31410c819858fa704490ca76dc04df7331d188c56fba1b1
3
- size 12346572800
 
 
 
 
lyraChatGLM/ftlib/libth_transformer_sm70_cu11.so → models/1-gpu-fp16.h5 RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0826346c748380e8e9fdd7e1f7130bad0f2485a65a8ecd4beb33d19e85c4d79e
3
- size 114280392
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3012c698d6084bf154f78bd9c0734ba8026670a16ac3f3944b41476472f1561a
3
+ size 12347066528
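The renamed models/1-gpu-fp16.h5 file is the checkpoint that ChatGLM6BWeights.load() now reads: each weight listed in load() is an HDF5 group whose "weights" dataset is pulled via h5f[key]["weights"][:]. A minimal inspection sketch follows, assuming h5py is available in the environment and that the converted checkpoint stores the dotted key names at the file root, as the loader expects.

```python
import h5py

# Sketch: confirm the layout ChatGLM6BWeights.load() expects, i.e. one
# "weights" dataset per named group in the .h5 checkpoint.
with h5py.File("models/1-gpu-fp16.h5", "r") as h5f:
    for key in sorted(h5f.keys())[:8]:
        print(key, h5f[key]["weights"].shape, h5f[key]["weights"].dtype)

    # One of the per-layer keys read by the loader (layer 0, tensor-parallel rank 0):
    qkv = h5f["model.layers.0.attention.query_key_value.weight.0"]["weights"][:]
    print("qkv weight shape:", qkv.shape)
```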
requirements.txt CHANGED
@@ -5,5 +5,4 @@ huggingface_hub
5
  numpy
6
  setuptools
7
  torch
8
- h5py
9
  protobuf==3.20.3
 
5
  numpy
6
  setuptools
7
  torch
 
8
  protobuf==3.20.3