Commit: this isn't very nice.

Files changed:
- makeavid_sd/LICENSE +661 -0
- makeavid_sd/README.md +1 -0
- makeavid_sd/makeavid_sd/__init__.py +1 -0
- makeavid_sd/makeavid_sd/flax_impl/__init__.py +0 -0
- makeavid_sd/makeavid_sd/flax_impl/dataset.py +159 -0
- makeavid_sd/makeavid_sd/flax_impl/flax_attention_pseudo3d.py +212 -0
- makeavid_sd/makeavid_sd/flax_impl/flax_embeddings.py +62 -0
- makeavid_sd/makeavid_sd/flax_impl/flax_resnet_pseudo3d.py +175 -0
- makeavid_sd/makeavid_sd/flax_impl/flax_trainer.py +608 -0
- makeavid_sd/makeavid_sd/flax_impl/flax_unet_pseudo3d_blocks.py +254 -0
- makeavid_sd/makeavid_sd/flax_impl/flax_unet_pseudo3d_condition.py +251 -0
- makeavid_sd/makeavid_sd/flax_impl/train.py +143 -0
- makeavid_sd/makeavid_sd/flax_impl/train.sh +34 -0
- makeavid_sd/makeavid_sd/inference.py +486 -0
- makeavid_sd/makeavid_sd/torch_impl/__init__.py +0 -0
- makeavid_sd/makeavid_sd/torch_impl/torch_attention_pseudo3d.py +294 -0
- makeavid_sd/makeavid_sd/torch_impl/torch_cross_attention.py +171 -0
- makeavid_sd/makeavid_sd/torch_impl/torch_embeddings.py +92 -0
- makeavid_sd/makeavid_sd/torch_impl/torch_resnet_pseudo3d.py +295 -0
- makeavid_sd/makeavid_sd/torch_impl/torch_unet_pseudo3d_blocks.py +493 -0
- makeavid_sd/makeavid_sd/torch_impl/torch_unet_pseudo3d_condition.py +235 -0
- makeavid_sd/requirements.txt +2 -0
- makeavid_sd/setup.py +11 -0
- makeavid_sd/trainer_xla.py +104 -0
makeavid_sd/LICENSE
ADDED
@@ -0,0 +1,661 @@
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
[Complete, unmodified text of the GNU AGPL v3 (preamble, sections 0-17, and the "How to Apply These Terms to Your New Programs" appendix) added verbatim; full text at <https://www.gnu.org/licenses/agpl-3.0.txt>.]
makeavid_sd/README.md
ADDED
@@ -0,0 +1 @@
# makeavid-sd-tpu
makeavid_sd/makeavid_sd/__init__.py
ADDED
@@ -0,0 +1 @@
__version__ = '0.1.0'
makeavid_sd/makeavid_sd/flax_impl/__init__.py
ADDED
File without changes
makeavid_sd/makeavid_sd/flax_impl/dataset.py
ADDED
@@ -0,0 +1,159 @@

from typing import List, Dict, Any, Union, Optional

import torch
from torch.utils.data import DataLoader, ConcatDataset
import datasets
from diffusers import DDPMScheduler
from functools import partial
import random

import numpy as np


@torch.no_grad()
def collate_fn(
        batch: List[Dict[str, Any]],
        noise_scheduler: DDPMScheduler,
        num_frames: int,
        hint_spacing: Optional[int] = None,
        as_numpy: bool = True
) -> Dict[str, Union[torch.Tensor, np.ndarray]]:
    if hint_spacing is None or hint_spacing < 1:
        hint_spacing = num_frames
    if as_numpy:
        dtype = np.float32
    else:
        dtype = torch.float32
    prompts = []
    videos = []
    for s in batch:
        # prompt
        prompts.append(torch.tensor(s['prompt']).to(dtype = torch.float32))
        # frames
        frames = torch.tensor(s['video']).to(dtype = torch.float32)
        max_frames = len(frames)
        assert max_frames >= num_frames
        video_slice = random.randint(0, max_frames - num_frames)
        frames = frames[video_slice:video_slice + num_frames]
        frames = frames.permute(1, 0, 2, 3) # f, c, h, w -> c, f, h, w
        videos.append(frames)

    encoder_hidden_states = torch.cat(prompts) # b, 77, 768

    latents = torch.stack(videos) # b, c, f, h, w
    latents = latents * 0.18215
    hint_latents = latents[:, :, ::hint_spacing, :, :]
    hint_latents = hint_latents.repeat_interleave(hint_spacing, 2)
    #hint_latents = hint_latents[:, :, :num_frames-1, :, :]
    #input_latents = latents[:, :, 1:, :, :]
    input_latents = latents
    noise = torch.randn_like(input_latents)
    bsz = input_latents.shape[0]
    timesteps = torch.randint(
        0,
        noise_scheduler.config.num_train_timesteps,
        (bsz,),
        dtype = torch.int64
    )
    noisy_latents = noise_scheduler.add_noise(input_latents, noise, timesteps)
    mask = torch.zeros([
        noisy_latents.shape[0],
        1,
        noisy_latents.shape[2],
        noisy_latents.shape[3],
        noisy_latents.shape[4]
    ])
    latent_model_input = torch.cat([noisy_latents, mask, hint_latents], dim = 1)

    latent_model_input = latent_model_input.to(memory_format = torch.contiguous_format)
    encoder_hidden_states = encoder_hidden_states.to(memory_format = torch.contiguous_format)
    timesteps = timesteps.to(memory_format = torch.contiguous_format)
    noise = noise.to(memory_format = torch.contiguous_format)

    if as_numpy:
        latent_model_input = latent_model_input.numpy().astype(dtype)
        encoder_hidden_states = encoder_hidden_states.numpy().astype(dtype)
        timesteps = timesteps.numpy().astype(np.int32)
        noise = noise.numpy().astype(dtype)
    else:
        latent_model_input = latent_model_input.to(dtype = dtype)
        encoder_hidden_states = encoder_hidden_states.to(dtype = dtype)
        noise = noise.to(dtype = dtype)

    return {
        'latent_model_input': latent_model_input,
        'encoder_hidden_states': encoder_hidden_states,
        'timesteps': timesteps,
        'noise': noise
    }


def worker_init_fn(worker_id: int):
    wseed = torch.initial_seed() % 4294967294 # max val for random 2**32 - 1
    random.seed(wseed)
    np.random.seed(wseed)


def load_dataset(
        dataset_path: str,
        model_path: str,
        cache_dir: Optional[str] = None,
        batch_size: int = 1,
        num_frames: int = 24,
        hint_spacing: Optional[int] = None,
        num_workers: int = 0,
        shuffle: bool = False,
        as_numpy: bool = True,
        pin_memory: bool = False,
        pin_memory_device: str = ''
) -> DataLoader:
    noise_scheduler: DDPMScheduler = DDPMScheduler.from_pretrained(
        model_path,
        subfolder = 'scheduler'
    )
    dataset = datasets.load_dataset(
        dataset_path,
        streaming = False,
        cache_dir = cache_dir
    )
    merged_dataset = ConcatDataset([ dataset[s] for s in dataset ])
    dataloader = DataLoader(
        merged_dataset,
        batch_size = batch_size,
        num_workers = num_workers,
        persistent_workers = num_workers > 0,
        drop_last = True,
        shuffle = shuffle,
        worker_init_fn = worker_init_fn,
        collate_fn = partial(collate_fn,
            noise_scheduler = noise_scheduler,
            num_frames = num_frames,
            hint_spacing = hint_spacing,
            as_numpy = as_numpy
        ),
        pin_memory = pin_memory,
        pin_memory_device = pin_memory_device
    )
    return dataloader


def validate_dataset(
        dataset_path: str
) -> List[str]:
    import os
    import json
    data_path = os.path.join(dataset_path, 'data')
    meta = set(os.path.splitext(x)[0] for x in os.listdir(os.path.join(data_path, 'metadata')))
    prompts = set(os.path.splitext(x)[0] for x in os.listdir(os.path.join(data_path, 'prompts')))
    videos = set(os.path.splitext(x)[0] for x in os.listdir(os.path.join(data_path, 'videos')))
    ok = meta.intersection(prompts).intersection(videos)
    all_of_em = meta.union(prompts).union(videos)
    not_ok = []
    for a in all_of_em:
        if a not in ok:
            not_ok.append(a)
    ok = list(ok)
    ok.sort()
    with open(os.path.join(data_path, 'id_list.json'), 'w') as f:
        json.dump(ok, f)
    # return the ids that are missing a metadata, prompt or video counterpart
    return not_ok
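For orientation, a minimal usage sketch of the dataloader defined in dataset.py above. The import path, the dataset and model paths, and the 4-channel latent assumption are illustrative placeholders, not values taken from this commit.

# Hypothetical usage of load_dataset(); paths are placeholders, and the import
# assumes the package from the included setup.py is installed.
from makeavid_sd.flax_impl.dataset import load_dataset

dataloader = load_dataset(
    dataset_path = 'path/to/latent-video-dataset',  # placeholder
    model_path = 'path/to/stable-diffusion-model',  # placeholder, must contain a 'scheduler' subfolder
    batch_size = 1,
    num_frames = 24,
    hint_spacing = 4,
    as_numpy = True
)
batch = next(iter(dataloader))
# Assuming 4-channel VAE latents, collate_fn stacks noisy latents (4 ch),
# a zero mask (1 ch) and repeated hint frames (4 ch) along the channel axis.
print(batch['latent_model_input'].shape)     # (1, 9, 24, h, w)
print(batch['encoder_hidden_states'].shape)  # (1, 77, 768) precomputed text embeddings
print(batch['timesteps'].shape)              # (1,)
print(batch['noise'].shape)                  # (1, 4, 24, h, w)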
makeavid_sd/makeavid_sd/flax_impl/flax_attention_pseudo3d.py
ADDED
@@ -0,0 +1,212 @@

from typing import Optional

import jax
import jax.numpy as jnp
import flax.linen as nn

import einops

#from flax_memory_efficient_attention import jax_memory_efficient_attention
#from flax_attention import FlaxAttention
from diffusers.models.attention_flax import FlaxAttention


class TransformerPseudo3DModel(nn.Module):
    in_channels: int
    num_attention_heads: int
    attention_head_dim: int
    num_layers: int = 1
    use_memory_efficient_attention: bool = False
    dtype: jnp.dtype = jnp.float32

    def setup(self) -> None:
        inner_dim = self.num_attention_heads * self.attention_head_dim
        self.norm = nn.GroupNorm(
            num_groups = 32,
            epsilon = 1e-5
        )
        self.proj_in = nn.Conv(
            inner_dim,
            kernel_size = (1, 1),
            strides = (1, 1),
            padding = 'VALID',
            dtype = self.dtype
        )
        transformer_blocks = []
        #CheckpointTransformerBlock = nn.checkpoint(
        #    BasicTransformerBlockPseudo3D,
        #    static_argnums = (2,3,4)
        #    #prevent_cse = False
        #)
        CheckpointTransformerBlock = BasicTransformerBlockPseudo3D
        for _ in range(self.num_layers):
            transformer_blocks.append(CheckpointTransformerBlock(
                dim = inner_dim,
                num_attention_heads = self.num_attention_heads,
                attention_head_dim = self.attention_head_dim,
                use_memory_efficient_attention = self.use_memory_efficient_attention,
                dtype = self.dtype
            ))
        self.transformer_blocks = transformer_blocks
        self.proj_out = nn.Conv(
            inner_dim,
            kernel_size = (1, 1),
            strides = (1, 1),
            padding = 'VALID',
            dtype = self.dtype
        )

    def __call__(self,
            hidden_states: jax.Array,
            encoder_hidden_states: Optional[jax.Array] = None
    ) -> jax.Array:
        is_video = hidden_states.ndim == 5
        f: Optional[int] = None
        if is_video:
            # jax is channels last
            # b,c,f,h,w WRONG
            # b,f,h,w,c CORRECT
            # b, c, f, h, w = hidden_states.shape
            #hidden_states = einops.rearrange(hidden_states, 'b c f h w -> (b f) c h w')
            b, f, h, w, c = hidden_states.shape
            hidden_states = einops.rearrange(hidden_states, 'b f h w c -> (b f) h w c')

        batch, height, width, channels = hidden_states.shape
        residual = hidden_states
        hidden_states = self.norm(hidden_states)
        hidden_states = self.proj_in(hidden_states)
        hidden_states = hidden_states.reshape(batch, height * width, channels)
        for block in self.transformer_blocks:
            hidden_states = block(
                hidden_states,
                encoder_hidden_states,
                f,
                height,
                width
            )
        hidden_states = hidden_states.reshape(batch, height, width, channels)
        hidden_states = self.proj_out(hidden_states)
        hidden_states = hidden_states + residual
        if is_video:
            hidden_states = einops.rearrange(hidden_states, '(b f) h w c -> b f h w c', b = b)
        return hidden_states


class BasicTransformerBlockPseudo3D(nn.Module):
    dim: int
    num_attention_heads: int
    attention_head_dim: int
    use_memory_efficient_attention: bool = False
    dtype: jnp.dtype = jnp.float32

    def setup(self) -> None:
        self.attn1 = FlaxAttention(
            query_dim = self.dim,
            heads = self.num_attention_heads,
            dim_head = self.attention_head_dim,
            use_memory_efficient_attention = self.use_memory_efficient_attention,
            dtype = self.dtype
        )
        self.ff = FeedForward(dim = self.dim, dtype = self.dtype)
        self.attn2 = FlaxAttention(
            query_dim = self.dim,
            heads = self.num_attention_heads,
            dim_head = self.attention_head_dim,
            use_memory_efficient_attention = self.use_memory_efficient_attention,
            dtype = self.dtype
        )
        self.attn_temporal = FlaxAttention(
            query_dim = self.dim,
            heads = self.num_attention_heads,
            dim_head = self.attention_head_dim,
            use_memory_efficient_attention = self.use_memory_efficient_attention,
            dtype = self.dtype
        )
        self.norm1 = nn.LayerNorm(epsilon = 1e-5, dtype = self.dtype)
        self.norm2 = nn.LayerNorm(epsilon = 1e-5, dtype = self.dtype)
        self.norm_temporal = nn.LayerNorm(epsilon = 1e-5, dtype = self.dtype)
        self.norm3 = nn.LayerNorm(epsilon = 1e-5, dtype = self.dtype)

    def __call__(self,
            hidden_states: jax.Array,
            context: Optional[jax.Array] = None,
            frames_length: Optional[int] = None,
            height: Optional[int] = None,
            width: Optional[int] = None
    ) -> jax.Array:
        if context is not None and frames_length is not None:
            context = context.repeat(frames_length, axis = 0)
        # self attention
        norm_hidden_states = self.norm1(hidden_states)
        hidden_states = self.attn1(norm_hidden_states) + hidden_states
        # cross attention
        norm_hidden_states = self.norm2(hidden_states)
        hidden_states = self.attn2(
            norm_hidden_states,
            context = context
        ) + hidden_states
        # temporal attention
        if frames_length is not None:
            #bf, hw, c = hidden_states.shape
            # (b f) (h w) c -> b f (h w) c
            #hidden_states = hidden_states.reshape(bf // frames_length, frames_length, hw, c)
            #b, f, hw, c = hidden_states.shape
            # b f (h w) c -> b (h w) f c
            #hidden_states = hidden_states.transpose(0, 2, 1, 3)
            # b (h w) f c -> (b h w) f c
            #hidden_states = hidden_states.reshape(b * hw, frames_length, c)
            hidden_states = einops.rearrange(
                hidden_states,
                '(b f) (h w) c -> (b h w) f c',
                f = frames_length,
                h = height,
                w = width
            )
            norm_hidden_states = self.norm_temporal(hidden_states)
            hidden_states = self.attn_temporal(norm_hidden_states) + hidden_states
            # (b h w) f c -> b (h w) f c
            #hidden_states = hidden_states.reshape(b, hw, f, c)
            # b (h w) f c -> b f (h w) c
            #hidden_states = hidden_states.transpose(0, 2, 1, 3)
            # b f h w c -> (b f) (h w) c
            #hidden_states = hidden_states.reshape(bf, hw, c)
            hidden_states = einops.rearrange(
                hidden_states,
                '(b h w) f c -> (b f) (h w) c',
                f = frames_length,
                h = height,
                w = width
            )
        norm_hidden_states = self.norm3(hidden_states)
        hidden_states = self.ff(norm_hidden_states) + hidden_states
        return hidden_states


class FeedForward(nn.Module):
    dim: int
    dtype: jnp.dtype = jnp.float32

    def setup(self) -> None:
        self.net_0 = GEGLU(self.dim, self.dtype)
        self.net_2 = nn.Dense(self.dim, dtype = self.dtype)

    def __call__(self, hidden_states: jax.Array) -> jax.Array:
        hidden_states = self.net_0(hidden_states)
        hidden_states = self.net_2(hidden_states)
        return hidden_states


class GEGLU(nn.Module):
    dim: int
    dtype: jnp.dtype = jnp.float32

    def setup(self) -> None:
        inner_dim = self.dim * 4
        self.proj = nn.Dense(inner_dim * 2, dtype = self.dtype)

    def __call__(self, hidden_states: jax.Array) -> jax.Array:
        hidden_states = self.proj(hidden_states)
        hidden_linear, hidden_gelu = jnp.split(hidden_states, 2, axis = 2)
        return hidden_linear * nn.gelu(hidden_gelu)
makeavid_sd/makeavid_sd/flax_impl/flax_embeddings.py
ADDED
@@ -0,0 +1,62 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
|
2 |
+
import jax
|
3 |
+
import jax.numpy as jnp
|
4 |
+
import flax.linen as nn
|
5 |
+
|
6 |
+
|
7 |
+
def get_sinusoidal_embeddings(
|
8 |
+
timesteps: jax.Array,
|
9 |
+
embedding_dim: int,
|
10 |
+
freq_shift: float = 1,
|
11 |
+
min_timescale: float = 1,
|
12 |
+
max_timescale: float = 1.0e4,
|
13 |
+
flip_sin_to_cos: bool = False,
|
14 |
+
scale: float = 1.0,
|
15 |
+
dtype: jnp.dtype = jnp.float32
|
16 |
+
) -> jax.Array:
|
17 |
+
assert timesteps.ndim == 1, "Timesteps should be a 1d-array"
|
18 |
+
assert embedding_dim % 2 == 0, f"Embedding dimension {embedding_dim} should be even"
|
19 |
+
num_timescales = float(embedding_dim // 2)
|
20 |
+
log_timescale_increment = jnp.log(max_timescale / min_timescale) / (num_timescales - freq_shift)
|
21 |
+
inv_timescales = min_timescale * jnp.exp(jnp.arange(num_timescales, dtype = dtype) * -log_timescale_increment)
|
22 |
+
emb = jnp.expand_dims(timesteps, 1) * jnp.expand_dims(inv_timescales, 0)
|
23 |
+
|
24 |
+
# scale embeddings
|
25 |
+
scaled_time = scale * emb
|
26 |
+
|
27 |
+
if flip_sin_to_cos:
|
28 |
+
signal = jnp.concatenate([jnp.cos(scaled_time), jnp.sin(scaled_time)], axis = 1)
|
29 |
+
else:
|
30 |
+
signal = jnp.concatenate([jnp.sin(scaled_time), jnp.cos(scaled_time)], axis = 1)
|
31 |
+
signal = jnp.reshape(signal, [jnp.shape(timesteps)[0], embedding_dim])
|
32 |
+
return signal
|
33 |
+
|
34 |
+
|
35 |
+
class TimestepEmbedding(nn.Module):
|
36 |
+
time_embed_dim: int = 32
|
37 |
+
dtype: jnp.dtype = jnp.float32
|
38 |
+
|
39 |
+
@nn.compact
|
40 |
+
def __call__(self, temb: jax.Array) -> jax.Array:
|
41 |
+
temb = nn.Dense(self.time_embed_dim, dtype = self.dtype, name = "linear_1")(temb)
|
42 |
+
temb = nn.silu(temb)
|
43 |
+
temb = nn.Dense(self.time_embed_dim, dtype = self.dtype, name = "linear_2")(temb)
|
44 |
+
return temb
|
45 |
+
|
46 |
+
|
47 |
+
class Timesteps(nn.Module):
|
48 |
+
dim: int = 32
|
49 |
+
flip_sin_to_cos: bool = False
|
50 |
+
freq_shift: float = 1
|
51 |
+
dtype: jnp.dtype = jnp.float32
|
52 |
+
|
53 |
+
@nn.compact
|
54 |
+
def __call__(self, timesteps: jax.Array) -> jax.Array:
|
55 |
+
return get_sinusoidal_embeddings(
|
56 |
+
timesteps = timesteps,
|
57 |
+
embedding_dim = self.dim,
|
58 |
+
flip_sin_to_cos = self.flip_sin_to_cos,
|
59 |
+
freq_shift = self.freq_shift,
|
60 |
+
dtype = self.dtype
|
61 |
+
)
|
62 |
+
|
makeavid_sd/makeavid_sd/flax_impl/flax_resnet_pseudo3d.py
ADDED
@@ -0,0 +1,175 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
|
2 |
+
from typing import Optional, Union, Sequence
|
3 |
+
|
4 |
+
import jax
|
5 |
+
import jax.numpy as jnp
|
6 |
+
import flax.linen as nn
|
7 |
+
|
8 |
+
import einops
|
9 |
+
|
10 |
+
|
11 |
+
class ConvPseudo3D(nn.Module):
|
12 |
+
features: int
|
13 |
+
kernel_size: Sequence[int]
|
14 |
+
strides: Union[None, int, Sequence[int]] = 1
|
15 |
+
padding: nn.linear.PaddingLike = 'SAME'
|
16 |
+
dtype: jnp.dtype = jnp.float32
|
17 |
+
|
18 |
+
def setup(self) -> None:
|
19 |
+
self.spatial_conv = nn.Conv(
|
20 |
+
features = self.features,
|
21 |
+
kernel_size = self.kernel_size,
|
22 |
+
strides = self.strides,
|
23 |
+
padding = self.padding,
|
24 |
+
dtype = self.dtype
|
25 |
+
)
|
26 |
+
self.temporal_conv = nn.Conv(
|
27 |
+
features = self.features,
|
28 |
+
kernel_size = (3,),
|
29 |
+
padding = 'SAME',
|
30 |
+
dtype = self.dtype,
|
31 |
+
bias_init = nn.initializers.zeros_init()
|
32 |
+
# TODO dirac delta (identity) initialization impl
|
33 |
+
# kernel_init = torch.nn.init.dirac_ <-> jax/lax
|
34 |
+
)
|
35 |
+
|
36 |
+
def __call__(self, x: jax.Array, convolve_across_time: bool = True) -> jax.Array:
|
37 |
+
is_video = x.ndim == 5
|
38 |
+
convolve_across_time = convolve_across_time and is_video
|
39 |
+
if is_video:
|
40 |
+
b, f, h, w, c = x.shape
|
41 |
+
x = einops.rearrange(x, 'b f h w c -> (b f) h w c')
|
42 |
+
x = self.spatial_conv(x)
|
43 |
+
if is_video:
|
44 |
+
x = einops.rearrange(x, '(b f) h w c -> b f h w c', b = b)
|
45 |
+
b, f, h, w, c = x.shape
|
46 |
+
if not convolve_across_time:
|
47 |
+
return x
|
48 |
+
if is_video:
|
49 |
+
x = einops.rearrange(x, 'b f h w c -> (b h w) f c')
|
50 |
+
x = self.temporal_conv(x)
|
51 |
+
x = einops.rearrange(x, '(b h w) f c -> b f h w c', h = h, w = w)
|
52 |
+
return x
|
53 |
+
|
54 |
+
|
55 |
+
class UpsamplePseudo3D(nn.Module):
|
56 |
+
out_channels: int
|
57 |
+
dtype: jnp.dtype = jnp.float32
|
58 |
+
|
59 |
+
def setup(self) -> None:
|
60 |
+
self.conv = ConvPseudo3D(
|
61 |
+
features = self.out_channels,
|
62 |
+
kernel_size = (3, 3),
|
63 |
+
strides = (1, 1),
|
64 |
+
padding = ((1, 1), (1, 1)),
|
65 |
+
dtype = self.dtype
|
66 |
+
)
|
67 |
+
|
68 |
+
def __call__(self, hidden_states: jax.Array) -> jax.Array:
|
69 |
+
is_video = hidden_states.ndim == 5
|
70 |
+
if is_video:
|
71 |
+
b, *_ = hidden_states.shape
|
72 |
+
hidden_states = einops.rearrange(hidden_states, 'b f h w c -> (b f) h w c')
|
73 |
+
batch, h, w, c = hidden_states.shape
|
74 |
+
hidden_states = jax.image.resize(
|
75 |
+
image = hidden_states,
|
76 |
+
shape = (batch, h * 2, w * 2, c),
|
77 |
+
method = 'nearest'
|
78 |
+
)
|
79 |
+
if is_video:
|
80 |
+
hidden_states = einops.rearrange(hidden_states, '(b f) h w c -> b f h w c', b = b)
|
81 |
+
hidden_states = self.conv(hidden_states)
|
82 |
+
return hidden_states
|
83 |
+
|
84 |
+
|
85 |
+
class DownsamplePseudo3D(nn.Module):
|
86 |
+
out_channels: int
|
87 |
+
dtype: jnp.dtype = jnp.float32
|
88 |
+
|
89 |
+
def setup(self) -> None:
|
90 |
+
self.conv = ConvPseudo3D(
|
91 |
+
features = self.out_channels,
|
92 |
+
kernel_size = (3, 3),
|
93 |
+
strides = (2, 2),
|
94 |
+
padding = ((1, 1), (1, 1)),
|
95 |
+
dtype = self.dtype
|
96 |
+
)
|
97 |
+
|
98 |
+
def __call__(self, hidden_states: jax.Array) -> jax.Array:
|
99 |
+
hidden_states = self.conv(hidden_states)
|
100 |
+
return hidden_states
|
101 |
+
|
102 |
+
|
103 |
+
class ResnetBlockPseudo3D(nn.Module):
|
104 |
+
in_channels: int
|
105 |
+
out_channels: Optional[int] = None
|
106 |
+
use_nin_shortcut: Optional[bool] = None
|
107 |
+
dtype: jnp.dtype = jnp.float32
|
108 |
+
|
109 |
+
def setup(self) -> None:
|
110 |
+
out_channels = self.in_channels if self.out_channels is None else self.out_channels
|
111 |
+
self.norm1 = nn.GroupNorm(
|
112 |
+
num_groups = 32,
|
113 |
+
epsilon = 1e-5
|
114 |
+
)
|
115 |
+
self.conv1 = ConvPseudo3D(
|
116 |
+
features = out_channels,
|
117 |
+
kernel_size = (3, 3),
|
118 |
+
strides = (1, 1),
|
119 |
+
padding = ((1, 1), (1, 1)),
|
120 |
+
dtype = self.dtype
|
121 |
+
)
|
122 |
+
self.time_emb_proj = nn.Dense(
|
123 |
+
out_channels,
|
124 |
+
dtype = self.dtype
|
125 |
+
)
|
126 |
+
self.norm2 = nn.GroupNorm(
|
127 |
+
num_groups = 32,
|
128 |
+
epsilon = 1e-5
|
129 |
+
)
|
130 |
+
self.conv2 = ConvPseudo3D(
|
131 |
+
features = out_channels,
|
132 |
+
kernel_size = (3, 3),
|
133 |
+
strides = (1, 1),
|
134 |
+
padding = ((1, 1), (1, 1)),
|
135 |
+
dtype = self.dtype
|
136 |
+
)
|
137 |
+
use_nin_shortcut = self.in_channels != out_channels if self.use_nin_shortcut is None else self.use_nin_shortcut
|
138 |
+
self.conv_shortcut = None
|
139 |
+
if use_nin_shortcut:
|
140 |
+
self.conv_shortcut = ConvPseudo3D(
|
141 |
+
features = self.out_channels,
|
142 |
+
kernel_size = (1, 1),
|
143 |
+
strides = (1, 1),
|
144 |
+
padding = 'VALID',
|
145 |
+
dtype = self.dtype
|
146 |
+
)
|
147 |
+
|
148 |
+
def __call__(self,
|
149 |
+
hidden_states: jax.Array,
|
150 |
+
temb: jax.Array
|
151 |
+
) -> jax.Array:
|
152 |
+
is_video = hidden_states.ndim == 5
|
153 |
+
residual = hidden_states
|
154 |
+
hidden_states = self.norm1(hidden_states)
|
155 |
+
hidden_states = nn.silu(hidden_states)
|
156 |
+
hidden_states = self.conv1(hidden_states)
|
157 |
+
temb = nn.silu(temb)
|
158 |
+
temb = self.time_emb_proj(temb)
|
159 |
+
temb = jnp.expand_dims(temb, 1)
|
160 |
+
temb = jnp.expand_dims(temb, 1)
|
161 |
+
if is_video:
|
162 |
+
b, f, *_ = hidden_states.shape
|
163 |
+
hidden_states = einops.rearrange(hidden_states, 'b f h w c -> (b f) h w c')
|
164 |
+
hidden_states = hidden_states + temb.repeat(f, 0)
|
165 |
+
hidden_states = einops.rearrange(hidden_states, '(b f) h w c -> b f h w c', b = b)
|
166 |
+
else:
|
167 |
+
hidden_states = hidden_states + temb
|
168 |
+
hidden_states = self.norm2(hidden_states)
|
169 |
+
hidden_states = nn.silu(hidden_states)
|
170 |
+
hidden_states = self.conv2(hidden_states)
|
171 |
+
if self.conv_shortcut is not None:
|
172 |
+
residual = self.conv_shortcut(residual)
|
173 |
+
hidden_states = hidden_states + residual
|
174 |
+
return hidden_states
|
175 |
+
|
makeavid_sd/makeavid_sd/flax_impl/flax_trainer.py
ADDED
@@ -0,0 +1,608 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
|
2 |
+
from typing import Any, Optional, Union, Tuple, Dict, List
|
3 |
+
|
4 |
+
import os
|
5 |
+
import random
|
6 |
+
import math
|
7 |
+
import time
|
8 |
+
import numpy as np
|
9 |
+
from tqdm.auto import tqdm, trange
|
10 |
+
|
11 |
+
import torch
|
12 |
+
from torch.utils.data import DataLoader
|
13 |
+
|
14 |
+
import jax
|
15 |
+
import jax.numpy as jnp
|
16 |
+
import optax
|
17 |
+
from flax import jax_utils, traverse_util
|
18 |
+
from flax.core.frozen_dict import FrozenDict
|
19 |
+
from flax.training.train_state import TrainState
|
20 |
+
from flax.training.common_utils import shard
|
21 |
+
|
22 |
+
# convert 2D -> 3D
|
23 |
+
from diffusers import FlaxUNet2DConditionModel
|
24 |
+
|
25 |
+
# inference test, run on these on cpu
|
26 |
+
from diffusers import AutoencoderKL
|
27 |
+
from diffusers.schedulers.scheduling_ddim_flax import FlaxDDIMScheduler, DDIMSchedulerState
|
28 |
+
from transformers import CLIPTextModel, CLIPTokenizer
|
29 |
+
from PIL import Image
|
30 |
+
|
31 |
+
|
32 |
+
from .flax_unet_pseudo3d_condition import UNetPseudo3DConditionModel
|
33 |
+
|
34 |
+
|
35 |
+
def seed_all(seed: int) -> jax.random.PRNGKeyArray:
|
36 |
+
random.seed(seed)
|
37 |
+
np.random.seed(seed)
|
38 |
+
torch.manual_seed(seed)
|
39 |
+
rng = jax.random.PRNGKey(seed)
|
40 |
+
return rng
|
41 |
+
|
42 |
+
def count_params(
|
43 |
+
params: Union[Dict[str, Any],
|
44 |
+
FrozenDict[str, Any]],
|
45 |
+
filter_name: Optional[str] = None
|
46 |
+
) -> int:
|
47 |
+
p: Dict[Tuple[str], jax.Array] = traverse_util.flatten_dict(params)
|
48 |
+
cc = 0
|
49 |
+
for k in p:
|
50 |
+
if filter_name is not None:
|
51 |
+
if filter_name in ' '.join(k):
|
52 |
+
cc += len(p[k].flatten())
|
53 |
+
else:
|
54 |
+
cc += len(p[k].flatten())
|
55 |
+
return cc
|
56 |
+
|
57 |
+
def map_2d_to_pseudo3d(
|
58 |
+
params2d: Dict[str, Any],
|
59 |
+
params3d: Dict[str, Any],
|
60 |
+
verbose: bool = True
|
61 |
+
) -> Dict[str, Any]:
|
62 |
+
params2d = traverse_util.flatten_dict(params2d)
|
63 |
+
params3d = traverse_util.flatten_dict(params3d)
|
64 |
+
new_params = dict()
|
65 |
+
for k in params3d:
|
66 |
+
if 'spatial_conv' in k:
|
67 |
+
k2d = list(k)
|
68 |
+
k2d.remove('spatial_conv')
|
69 |
+
k2d = tuple(k2d)
|
70 |
+
if verbose:
|
71 |
+
tqdm.write(f'Spatial: {k} <- {k2d}')
|
72 |
+
p = params2d[k2d]
|
73 |
+
elif k not in params2d:
|
74 |
+
if verbose:
|
75 |
+
tqdm.write(f'Missing: {k}')
|
76 |
+
p = params3d[k]
|
77 |
+
else:
|
78 |
+
p = params2d[k]
|
79 |
+
assert p.shape == params3d[k].shape, f'shape mismatch: {k}: {p.shape} != {params3d[k].shape}'
|
80 |
+
new_params[k] = p
|
81 |
+
new_params = traverse_util.unflatten_dict(new_params)
|
82 |
+
return new_params
|
83 |
+
|
84 |
+
|
85 |
+
class FlaxTrainerUNetPseudo3D:
|
86 |
+
def __init__(self,
|
87 |
+
model_path: str,
|
88 |
+
from_pt: bool = True,
|
89 |
+
convert2d: bool = False,
|
90 |
+
sample_size: Tuple[int, int] = (64, 64),
|
91 |
+
seed: int = 0,
|
92 |
+
dtype: str = 'float32',
|
93 |
+
param_dtype: str = 'float32',
|
94 |
+
only_temporal: bool = True,
|
95 |
+
use_memory_efficient_attention = False,
|
96 |
+
verbose: bool = True
|
97 |
+
) -> None:
|
98 |
+
self.verbose = verbose
|
99 |
+
self.tracker: Optional['wandb.sdk.wandb_run.Run'] = None
|
100 |
+
self._use_wandb: bool = False
|
101 |
+
self._tracker_meta: Dict[str, Union[float, int]] = {
|
102 |
+
't00': 0.0,
|
103 |
+
't0': 0.0,
|
104 |
+
'step0': 0
|
105 |
+
}
|
106 |
+
|
107 |
+
self.log('Init JAX')
|
108 |
+
self.num_devices = jax.device_count()
|
109 |
+
self.log(f'Device count: {self.num_devices}')
|
110 |
+
|
111 |
+
self.seed = seed
|
112 |
+
self.rng: jax.random.PRNGKeyArray = seed_all(self.seed)
|
113 |
+
|
114 |
+
self.sample_size = sample_size
|
115 |
+
if dtype == 'float32':
|
116 |
+
self.dtype = jnp.float32
|
117 |
+
elif dtype == 'bfloat16':
|
118 |
+
self.dtype = jnp.bfloat16
|
119 |
+
elif dtype == 'float16':
|
120 |
+
self.dtype = jnp.float16
|
121 |
+
else:
|
122 |
+
raise ValueError(f'unknown type: {dtype}')
|
123 |
+
self.dtype_str: str = dtype
|
124 |
+
if param_dtype not in ['float32', 'bfloat16', 'float16']:
|
125 |
+
raise ValueError(f'unknown parameter type: {param_dtype}')
|
126 |
+
self.param_dtype = param_dtype
|
127 |
+
self._load_models(
|
128 |
+
model_path = model_path,
|
129 |
+
convert2d = convert2d,
|
130 |
+
from_pt = from_pt,
|
131 |
+
use_memory_efficient_attention = use_memory_efficient_attention
|
132 |
+
)
|
133 |
+
self._mark_parameters(only_temporal = only_temporal)
|
134 |
+
# optionally for validation + sampling
|
135 |
+
self.tokenizer: Optional[CLIPTokenizer] = None
|
136 |
+
self.text_encoder: Optional[CLIPTextModel] = None
|
137 |
+
self.vae: Optional[AutoencoderKL] = None
|
138 |
+
self.ddim: Optional[Tuple[FlaxDDIMScheduler, DDIMSchedulerState]] = None
|
139 |
+
|
140 |
+
def log(self, message: Any) -> None:
|
141 |
+
if self.verbose and jax.process_index() == 0:
|
142 |
+
tqdm.write(str(message))
|
143 |
+
|
144 |
+
def log_metrics(self, metrics: dict, step: int, epoch: int) -> None:
|
145 |
+
if jax.process_index() > 0 or (not self.verbose and self.tracker is None):
|
146 |
+
return
|
147 |
+
now = time.monotonic()
|
148 |
+
log_data = {
|
149 |
+
'train/step': step,
|
150 |
+
'train/epoch': epoch,
|
151 |
+
'train/steps_per_sec': (step - self._tracker_meta['step0']) / (now - self._tracker_meta['t0']),
|
152 |
+
**{ f'train/{k}': v for k, v in metrics.items() }
|
153 |
+
}
|
154 |
+
self._tracker_meta['t0'] = now
|
155 |
+
self._tracker_meta['step0'] = step
|
156 |
+
self.log(log_data)
|
157 |
+
if self.tracker is not None:
|
158 |
+
self.tracker.log(log_data, step = step)
|
159 |
+
|
160 |
+
|
161 |
+
def enable_wandb(self, enable: bool = True) -> None:
|
162 |
+
self._use_wandb = enable
|
163 |
+
|
164 |
+
def _setup_wandb(self, config: Dict[str, Any] = dict()) -> None:
|
165 |
+
import wandb
|
166 |
+
import wandb.sdk
|
167 |
+
self.tracker: wandb.sdk.wandb_run.Run = wandb.init(
|
168 |
+
config = config,
|
169 |
+
settings = wandb.sdk.Settings(
|
170 |
+
username = 'anon',
|
171 |
+
host = 'anon',
|
172 |
+
email = 'anon',
|
173 |
+
root_dir = 'anon',
|
174 |
+
_executable = 'anon',
|
175 |
+
_disable_stats = True,
|
176 |
+
_disable_meta = True,
|
177 |
+
disable_code = True,
|
178 |
+
disable_git = True
|
179 |
+
) # pls don't log sensitive data like system user names. also, fuck you for even trying.
|
180 |
+
)
|
181 |
+
|
182 |
+
def _init_tracker_meta(self) -> None:
|
183 |
+
now = time.monotonic()
|
184 |
+
self._tracker_meta = {
|
185 |
+
't00': now,
|
186 |
+
't0': now,
|
187 |
+
'step0': 0
|
188 |
+
}
|
189 |
+
|
190 |
+
def _load_models(self,
|
191 |
+
model_path: str,
|
192 |
+
convert2d: bool,
|
193 |
+
from_pt: bool,
|
194 |
+
use_memory_efficient_attention: bool
|
195 |
+
) -> None:
|
196 |
+
self.log(f'Load pretrained from {model_path}')
|
197 |
+
if convert2d:
|
198 |
+
self.log(' Convert 2D model to Pseudo3D')
|
199 |
+
self.log(' Initiate Pseudo3D model')
|
200 |
+
config = UNetPseudo3DConditionModel.load_config(model_path, subfolder = 'unet')
|
201 |
+
model = UNetPseudo3DConditionModel.from_config(
|
202 |
+
config,
|
203 |
+
sample_size = self.sample_size,
|
204 |
+
dtype = self.dtype,
|
205 |
+
param_dtype = self.param_dtype,
|
206 |
+
use_memory_efficient_attention = use_memory_efficient_attention
|
207 |
+
)
|
208 |
+
params: Dict[str, Any] = model.init_weights(self.rng).unfreeze()
|
209 |
+
self.log(' Load 2D model')
|
210 |
+
model2d, params2d = FlaxUNet2DConditionModel.from_pretrained(
|
211 |
+
model_path,
|
212 |
+
subfolder = 'unet',
|
213 |
+
dtype = self.dtype,
|
214 |
+
from_pt = from_pt
|
215 |
+
)
|
216 |
+
self.log(' Map 2D -> 3D')
|
217 |
+
params = map_2d_to_pseudo3d(params2d, params, verbose = self.verbose)
|
218 |
+
del params2d
|
219 |
+
del model2d
|
220 |
+
del config
|
221 |
+
else:
|
222 |
+
model, params = UNetPseudo3DConditionModel.from_pretrained(
|
223 |
+
model_path,
|
224 |
+
subfolder = 'unet',
|
225 |
+
from_pt = from_pt,
|
226 |
+
sample_size = self.sample_size,
|
227 |
+
dtype = self.dtype,
|
228 |
+
param_dtype = self.param_dtype,
|
229 |
+
use_memory_efficient_attention = use_memory_efficient_attention
|
230 |
+
)
|
231 |
+
self.log(f'Cast parameters to {model.param_dtype}')
|
232 |
+
if model.param_dtype == 'float32':
|
233 |
+
params = model.to_fp32(params)
|
234 |
+
elif model.param_dtype == 'float16':
|
235 |
+
params = model.to_fp16(params)
|
236 |
+
elif model.param_dtype == 'bfloat16':
|
237 |
+
params = model.to_bf16(params)
|
238 |
+
self.pretrained_model = model_path
|
239 |
+
self.model: UNetPseudo3DConditionModel = model
|
240 |
+
self.params: FrozenDict[str, Any] = FrozenDict(params)
|
241 |
+
|
242 |
+
def _mark_parameters(self, only_temporal: bool) -> None:
|
243 |
+
self.log('Mark training parameters')
|
244 |
+
if only_temporal:
|
245 |
+
self.log('Only training temporal layers')
|
246 |
+
if only_temporal:
|
247 |
+
param_partitions = traverse_util.path_aware_map(
|
248 |
+
lambda path, _: 'trainable' if 'temporal' in ' '.join(path) else 'frozen', self.params
|
249 |
+
)
|
250 |
+
else:
|
251 |
+
param_partitions = traverse_util.path_aware_map(
|
252 |
+
lambda *_: 'trainable', self.params
|
253 |
+
)
|
254 |
+
self.only_temporal = only_temporal
|
255 |
+
self.param_partitions: FrozenDict[str, Any] = FrozenDict(param_partitions)
|
256 |
+
self.log(f'Total parameters: {count_params(self.params)}')
|
257 |
+
self.log(f'Temporal parameters: {count_params(self.params, "temporal")}')
|
258 |
+
|
259 |
+
def _load_inference_models(self) -> None:
|
260 |
+
assert jax.process_index() == 0, 'not main process'
|
261 |
+
if self.text_encoder is None:
|
262 |
+
self.log('Load text encoder')
|
263 |
+
self.text_encoder = CLIPTextModel.from_pretrained(
|
264 |
+
self.pretrained_model,
|
265 |
+
subfolder = 'text_encoder'
|
266 |
+
)
|
267 |
+
if self.tokenizer is None:
|
268 |
+
self.log('Load tokenizer')
|
269 |
+
self.tokenizer = CLIPTokenizer.from_pretrained(
|
270 |
+
self.pretrained_model,
|
271 |
+
subfolder = 'tokenizer'
|
272 |
+
)
|
273 |
+
if self.vae is None:
|
274 |
+
self.log('Load vae')
|
275 |
+
self.vae = AutoencoderKL.from_pretrained(
|
276 |
+
self.pretrained_model,
|
277 |
+
subfolder = 'vae'
|
278 |
+
)
|
279 |
+
if self.ddim is None:
|
280 |
+
self.log('Load ddim scheduler')
|
281 |
+
# tuple(scheduler , scheduler state)
|
282 |
+
self.ddim = FlaxDDIMScheduler.from_pretrained(
|
283 |
+
self.pretrained_model,
|
284 |
+
subfolder = 'scheduler',
|
285 |
+
from_pt = True
|
286 |
+
)
|
287 |
+
|
288 |
+
def _unload_inference_models(self) -> None:
|
289 |
+
self.text_encoder = None
|
290 |
+
self.tokenizer = None
|
291 |
+
self.vae = None
|
292 |
+
self.ddim = None
|
293 |
+
|
294 |
+
def sample(self,
|
295 |
+
params: Union[Dict[str, Any], FrozenDict[str, Any]],
|
296 |
+
prompt: str,
|
297 |
+
image_path: str,
|
298 |
+
num_frames: int,
|
299 |
+
replicate_params: bool = True,
|
300 |
+
neg_prompt: str = '',
|
301 |
+
steps: int = 50,
|
302 |
+
cfg: float = 9.0,
|
303 |
+
unload_after_usage: bool = False
|
304 |
+
) -> List[Image.Image]:
|
305 |
+
assert jax.process_index() == 0, 'not main process'
|
306 |
+
self.log('Sample')
|
307 |
+
self._load_inference_models()
|
308 |
+
with torch.no_grad():
|
309 |
+
tokens = self.tokenizer(
|
310 |
+
[ prompt ],
|
311 |
+
truncation = True,
|
312 |
+
return_overflowing_tokens = False,
|
313 |
+
padding = 'max_length',
|
314 |
+
return_tensors = 'pt'
|
315 |
+
).input_ids
|
316 |
+
neg_tokens = self.tokenizer(
|
317 |
+
[ neg_prompt ],
|
318 |
+
truncation = True,
|
319 |
+
return_overflowing_tokens = False,
|
320 |
+
padding = 'max_length',
|
321 |
+
return_tensors = 'pt'
|
322 |
+
).input_ids
|
323 |
+
encoded_prompt = self.text_encoder(input_ids = tokens).last_hidden_state
|
324 |
+
encoded_neg_prompt = self.text_encoder(input_ids = neg_tokens).last_hidden_state
|
325 |
+
hint_latent = torch.tensor(np.asarray(Image.open(image_path))).permute(2,0,1).to(torch.float32).div(255).mul(2).sub(1).unsqueeze(0)
|
326 |
+
hint_latent = self.vae.encode(hint_latent).latent_dist.mean * self.vae.config.scaling_factor #0.18215 # deterministic
|
327 |
+
hint_latent = hint_latent.unsqueeze(2).repeat_interleave(num_frames, 2)
|
328 |
+
mask = torch.zeros_like(hint_latent[:,0:1,:,:,:]) # zero mask, e.g. skip masking for now
|
329 |
+
init_latent = torch.randn_like(hint_latent)
|
330 |
+
# move to devices
|
331 |
+
encoded_prompt = jnp.array(encoded_prompt.numpy())
|
332 |
+
encoded_neg_prompt = jnp.array(encoded_neg_prompt.numpy())
|
333 |
+
hint_latent = jnp.array(hint_latent.numpy())
|
334 |
+
mask = jnp.array(mask.numpy())
|
335 |
+
init_latent = init_latent.repeat(jax.device_count(), 1, 1, 1, 1)
|
336 |
+
init_latent = jnp.array(init_latent.numpy())
|
337 |
+
self.ddim = (self.ddim[0], self.ddim[0].set_timesteps(self.ddim[1], steps))
|
338 |
+
timesteps = self.ddim[1].timesteps
|
339 |
+
if replicate_params:
|
340 |
+
params = jax_utils.replicate(params)
|
341 |
+
ddim_state = jax_utils.replicate(self.ddim[1])
|
342 |
+
encoded_prompt = jax_utils.replicate(encoded_prompt)
|
343 |
+
encoded_neg_prompt = jax_utils.replicate(encoded_neg_prompt)
|
344 |
+
hint_latent = jax_utils.replicate(hint_latent)
|
345 |
+
mask = jax_utils.replicate(mask)
|
346 |
+
# sampling fun
|
347 |
+
def sample_loop(init_latent, ddim_state, t, params, encoded_prompt, encoded_neg_prompt, hint_latent, mask):
|
348 |
+
latent_model_input = jnp.concatenate([init_latent, mask, hint_latent], axis = 1)
|
349 |
+
pred = self.model.apply(
|
350 |
+
{ 'params': params },
|
351 |
+
latent_model_input,
|
352 |
+
t,
|
353 |
+
encoded_prompt
|
354 |
+
).sample
|
355 |
+
if cfg != 1.0:
|
356 |
+
neg_pred = self.model.apply(
|
357 |
+
{ 'params': params },
|
358 |
+
latent_model_input,
|
359 |
+
t,
|
360 |
+
encoded_neg_prompt
|
361 |
+
).sample
|
362 |
+
pred = neg_pred + cfg * (pred - neg_pred)
|
363 |
+
# TODO check if noise is added at the right dimension
|
364 |
+
init_latent, ddim_state = self.ddim[0].step(ddim_state, pred, t, init_latent).to_tuple()
|
365 |
+
return init_latent, ddim_state
|
366 |
+
p_sample_loop = jax.pmap(sample_loop, 'sample', donate_argnums = ())
|
367 |
+
pbar_sample = trange(len(timesteps), desc = 'Sample', dynamic_ncols = True, smoothing = 0.1, disable = not self.verbose)
|
368 |
+
init_latent = shard(init_latent)
|
369 |
+
for i in pbar_sample:
|
370 |
+
t = timesteps[i].repeat(self.num_devices)
|
371 |
+
t = shard(t)
|
372 |
+
init_latent, ddim_state = p_sample_loop(init_latent, ddim_state, t, params, encoded_prompt, encoded_neg_prompt, hint_latent, mask)
|
373 |
+
# decode
|
374 |
+
self.log('Decode')
|
375 |
+
init_latent = torch.tensor(np.array(init_latent))
|
376 |
+
init_latent = init_latent / self.vae.config.scaling_factor
|
377 |
+
# d:0 b:1 c:2 f:3 h:4 w:5 -> d b f c h w
|
378 |
+
init_latent = init_latent.permute(0, 1, 3, 2, 4, 5)
|
379 |
+
images = []
|
380 |
+
pbar_decode = trange(len(init_latent), desc = 'Decode', dynamic_ncols = True)
|
381 |
+
for sample in init_latent:
|
382 |
+
ims = self.vae.decode(sample.squeeze()).sample
|
383 |
+
ims = ims.add(1).div(2).mul(255).round().clamp(0, 255).to(torch.uint8).permute(0,2,3,1).numpy()
|
384 |
+
ims = [ Image.fromarray(x) for x in ims ]
|
385 |
+
for im in ims:
|
386 |
+
images.append(im)
|
387 |
+
pbar_decode.update(1)
|
388 |
+
if unload_after_usage:
|
389 |
+
self._unload_inference_models()
|
390 |
+
return images
|
391 |
+
|
392 |
+
def get_params_from_state(self, state: TrainState) -> FrozenDict[Any, str]:
|
393 |
+
return FrozenDict(jax.device_get(jax.tree_util.tree_map(lambda x: x[0], state.params)))
|
394 |
+
|
395 |
+
def train(self,
|
396 |
+
dataloader: DataLoader,
|
397 |
+
lr: float,
|
398 |
+
num_frames: int,
|
399 |
+
log_every_step: int = 10,
|
400 |
+
save_every_epoch: int = 1,
|
401 |
+
sample_every_epoch: int = 1,
|
402 |
+
output_dir: str = 'output',
|
403 |
+
warmup: float = 0,
|
404 |
+
decay: float = 0,
|
405 |
+
epochs: int = 10,
|
406 |
+
weight_decay: float = 1e-2
|
407 |
+
) -> None:
|
408 |
+
eps = 1e-8
|
409 |
+
total_steps = len(dataloader) * epochs
|
410 |
+
warmup_steps = math.ceil(warmup * total_steps) if warmup > 0 else 0
|
411 |
+
decay_steps = math.ceil(decay * total_steps) + warmup_steps if decay > 0 else warmup_steps + 1
|
412 |
+
self.log(f'Total steps: {total_steps}')
|
413 |
+
self.log(f'Warmup steps: {warmup_steps}')
|
414 |
+
self.log(f'Decay steps: {decay_steps - warmup_steps}')
|
415 |
+
if warmup > 0 or decay > 0:
|
416 |
+
if not decay > 0:
|
417 |
+
# only warmup, keep peak lr until end
|
418 |
+
self.log('Warmup schedule')
|
419 |
+
end_lr = lr
|
420 |
+
else:
|
421 |
+
# warmup + annealing to end lr
|
422 |
+
self.log('Warmup + cosine annealing schedule')
|
423 |
+
end_lr = eps
|
424 |
+
lr_schedule = optax.warmup_cosine_decay_schedule(
|
425 |
+
init_value = 0.0,
|
426 |
+
peak_value = lr,
|
427 |
+
warmup_steps = warmup_steps,
|
428 |
+
decay_steps = decay_steps,
|
429 |
+
end_value = end_lr
|
430 |
+
)
|
431 |
+
else:
|
432 |
+
# no warmup or decay -> constant lr
|
433 |
+
self.log('constant schedule')
|
434 |
+
lr_schedule = optax.constant_schedule(value = lr)
|
435 |
+
adamw = optax.adamw(
|
436 |
+
learning_rate = lr_schedule,
|
437 |
+
b1 = 0.9,
|
438 |
+
b2 = 0.999,
|
439 |
+
eps = eps,
|
440 |
+
weight_decay = weight_decay #0.01 # 0.0001
|
441 |
+
)
|
442 |
+
optim = optax.chain(
|
443 |
+
optax.clip_by_global_norm(max_norm = 1.0),
|
444 |
+
adamw
|
445 |
+
)
|
446 |
+
partition_optimizers = {
|
447 |
+
'trainable': optim,
|
448 |
+
'frozen': optax.set_to_zero()
|
449 |
+
}
|
450 |
+
tx = optax.multi_transform(partition_optimizers, self.param_partitions)
|
451 |
+
state = TrainState.create(
|
452 |
+
apply_fn = self.model.__call__,
|
453 |
+
params = self.params,
|
454 |
+
tx = tx
|
455 |
+
)
|
456 |
+
validation_rng, train_rngs = jax.random.split(self.rng)
|
457 |
+
train_rngs = jax.random.split(train_rngs, jax.local_device_count())
|
458 |
+
|
459 |
+
def train_step(state: TrainState, batch: Dict[str, jax.Array], train_rng: jax.random.PRNGKeyArray):
|
460 |
+
def compute_loss(
|
461 |
+
params: Dict[str, Any],
|
462 |
+
batch: Dict[str, jax.Array],
|
463 |
+
sample_rng: jax.random.PRNGKeyArray # unused, dataloader provides everything
|
464 |
+
) -> jax.Array:
|
465 |
+
# 'latent_model_input': latent_model_input
|
466 |
+
# 'encoder_hidden_states': encoder_hidden_states
|
467 |
+
# 'timesteps': timesteps
|
468 |
+
# 'noise': noise
|
469 |
+
latent_model_input = batch['latent_model_input']
|
470 |
+
encoder_hidden_states = batch['encoder_hidden_states']
|
471 |
+
timesteps = batch['timesteps']
|
472 |
+
noise = batch['noise']
|
473 |
+
model_pred = self.model.apply(
|
474 |
+
{ 'params': params },
|
475 |
+
latent_model_input,
|
476 |
+
timesteps,
|
477 |
+
encoder_hidden_states
|
478 |
+
).sample
|
479 |
+
loss = (noise - model_pred) ** 2
|
480 |
+
loss = loss.mean()
|
481 |
+
return loss
|
482 |
+
grad_fn = jax.value_and_grad(compute_loss)
|
483 |
+
|
484 |
+
def loss_and_grad(
|
485 |
+
train_rng: jax.random.PRNGKeyArray
|
486 |
+
) -> Tuple[jax.Array, Any, jax.random.PRNGKeyArray]:
|
487 |
+
sample_rng, train_rng = jax.random.split(train_rng, 2)
|
488 |
+
loss, grad = grad_fn(state.params, batch, sample_rng)
|
489 |
+
return loss, grad, train_rng
|
490 |
+
|
491 |
+
loss, grad, new_train_rng = loss_and_grad(train_rng)
|
492 |
+
# self.log(grad) # NOTE uncomment to visualize gradient
|
493 |
+
grad = jax.lax.pmean(grad, axis_name = 'batch')
|
494 |
+
new_state = state.apply_gradients(grads = grad)
|
495 |
+
metrics: Dict[str, Any] = { 'loss': loss }
|
496 |
+
metrics = jax.lax.pmean(metrics, axis_name = 'batch')
|
497 |
+
def l2(xs) -> jax.Array:
|
498 |
+
return jnp.sqrt(sum([jnp.vdot(x, x) for x in jax.tree_util.tree_leaves(xs)]))
|
499 |
+
metrics['l2_grads'] = l2(jax.tree_util.tree_leaves(grad))
|
500 |
+
|
501 |
+
return new_state, metrics, new_train_rng
|
502 |
+
|
503 |
+
p_train_step = jax.pmap(fun = train_step, axis_name = 'batch', donate_argnums = (0, ))
|
504 |
+
state = jax_utils.replicate(state)
|
505 |
+
|
506 |
+
train_metrics = []
|
507 |
+
train_metric = None
|
508 |
+
|
509 |
+
global_step: int = 0
|
510 |
+
|
511 |
+
if jax.process_index() == 0:
|
512 |
+
self._init_tracker_meta()
|
513 |
+
hyper_params = {
|
514 |
+
'lr': lr,
|
515 |
+
'lr_warmup': warmup,
|
516 |
+
'lr_decay': decay,
|
517 |
+
'weight_decay': weight_decay,
|
518 |
+
'total_steps': total_steps,
|
519 |
+
'batch_size': dataloader.batch_size // self.num_devices,
|
520 |
+
'num_frames': num_frames,
|
521 |
+
'sample_size': self.sample_size,
|
522 |
+
'num_devices': self.num_devices,
|
523 |
+
'seed': self.seed,
|
524 |
+
'use_memory_efficient_attention': self.model.use_memory_efficient_attention,
|
525 |
+
'only_temporal': self.only_temporal,
|
526 |
+
'dtype': self.dtype_str,
|
527 |
+
'param_dtype': self.param_dtype,
|
528 |
+
'pretrained_model': self.pretrained_model,
|
529 |
+
'model_config': self.model.config
|
530 |
+
}
|
531 |
+
if self._use_wandb:
|
532 |
+
self.log('Setting up wandb')
|
533 |
+
self._setup_wandb(hyper_params)
|
534 |
+
self.log(hyper_params)
|
535 |
+
output_path = os.path.join(output_dir, str(global_step), 'unet')
|
536 |
+
self.log(f'saving checkpoint to {output_path}')
|
537 |
+
self.model.save_pretrained(
|
538 |
+
save_directory = output_path,
|
539 |
+
params = self.get_params_from_state(state),#jax.device_get(jax.tree_util.tree_map(lambda x: x[0], state.params)),
|
540 |
+
is_main_process = True
|
541 |
+
)
|
542 |
+
|
543 |
+
pbar_epoch = tqdm(
|
544 |
+
total = epochs,
|
545 |
+
desc = 'Epochs',
|
546 |
+
smoothing = 1,
|
547 |
+
position = 0,
|
548 |
+
dynamic_ncols = True,
|
549 |
+
leave = True,
|
550 |
+
disable = jax.process_index() > 0
|
551 |
+
)
|
552 |
+
steps_per_epoch = len(dataloader) # TODO dataloader
|
553 |
+
for epoch in range(epochs):
|
554 |
+
pbar_steps = tqdm(
|
555 |
+
total = steps_per_epoch,
|
556 |
+
desc = 'Steps',
|
557 |
+
position = 1,
|
558 |
+
smoothing = 0.1,
|
559 |
+
dynamic_ncols = True,
|
560 |
+
leave = True,
|
561 |
+
disable = jax.process_index() > 0
|
562 |
+
)
|
563 |
+
for batch in dataloader:
|
564 |
+
# keep input + gt as float32, results in fp32 loss and grad
|
565 |
+
# otherwise uncomment the following to cast to the model dtype
|
566 |
+
# batch = { k: (v.astype(self.dtype) if v.dtype == np.float32 else v) for k,v in batch.items() }
|
567 |
+
batch = shard(batch)
|
568 |
+
state, train_metric, train_rngs = p_train_step(
|
569 |
+
state, batch, train_rngs
|
570 |
+
)
|
571 |
+
train_metrics.append(train_metric)
|
572 |
+
if global_step % log_every_step == 0 and jax.process_index() == 0:
|
573 |
+
train_metrics = jax_utils.unreplicate(train_metrics)
|
574 |
+
train_metrics = jax.tree_util.tree_map(lambda *m: jnp.array(m).mean(), *train_metrics)
|
575 |
+
if global_step == 0:
|
576 |
+
self.log(f'grad dtype: {train_metrics["l2_grads"].dtype}')
|
577 |
+
self.log(f'loss dtype: {train_metrics["loss"].dtype}')
|
578 |
+
train_metrics_dict = { k: v.item() for k, v in train_metrics.items() }
|
579 |
+
train_metrics_dict['lr'] = lr_schedule(global_step).item()
|
580 |
+
self.log_metrics(train_metrics_dict, step = global_step, epoch = epoch)
|
581 |
+
train_metrics = []
|
582 |
+
pbar_steps.update(1)
|
583 |
+
global_step += 1
|
584 |
+
if epoch % save_every_epoch == 0 and jax.process_index() == 0:
|
585 |
+
output_path = os.path.join(output_dir, str(global_step), 'unet')
|
586 |
+
self.log(f'saving checkpoint to {output_path}')
|
587 |
+
self.model.save_pretrained(
|
588 |
+
save_directory = output_path,
|
589 |
+
params = self.get_params_from_state(state),#jax.device_get(jax.tree_util.tree_map(lambda x: x[0], state.params)),
|
590 |
+
is_main_process = True
|
591 |
+
)
|
592 |
+
self.log(f'checkpoint saved ')
|
593 |
+
if epoch % sample_every_epoch == 0 and jax.process_index() == 0:
|
594 |
+
images = self.sample(
|
595 |
+
params = state.params,
|
596 |
+
replicate_params = False,
|
597 |
+
prompt = 'dancing person',
|
598 |
+
image_path = 'testimage.png',
|
599 |
+
num_frames = num_frames,
|
600 |
+
steps = 50,
|
601 |
+
cfg = 9.0,
|
602 |
+
unload_after_usage = False
|
603 |
+
)
|
604 |
+
os.makedirs(os.path.join('image_output', str(epoch)), exist_ok = True)
|
605 |
+
for i, im in enumerate(images):
|
606 |
+
im.save(os.path.join('image_output', str(epoch), str(i).zfill(5) + '.png'), optimize = True)
|
607 |
+
pbar_epoch.update(1)
|
608 |
+
|
makeavid_sd/makeavid_sd/flax_impl/flax_unet_pseudo3d_blocks.py
ADDED
@@ -0,0 +1,254 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
|
2 |
+
from typing import Tuple
|
3 |
+
|
4 |
+
import jax
|
5 |
+
import jax.numpy as jnp
|
6 |
+
import flax.linen as nn
|
7 |
+
|
8 |
+
from .flax_attention_pseudo3d import TransformerPseudo3DModel
|
9 |
+
from .flax_resnet_pseudo3d import ResnetBlockPseudo3D, DownsamplePseudo3D, UpsamplePseudo3D
|
10 |
+
|
11 |
+
|
12 |
+
class UNetMidBlockPseudo3DCrossAttn(nn.Module):
|
13 |
+
in_channels: int
|
14 |
+
num_layers: int = 1
|
15 |
+
attn_num_head_channels: int = 1
|
16 |
+
use_memory_efficient_attention: bool = False
|
17 |
+
dtype: jnp.dtype = jnp.float32
|
18 |
+
|
19 |
+
def setup(self) -> None:
|
20 |
+
resnets = [
|
21 |
+
ResnetBlockPseudo3D(
|
22 |
+
in_channels = self.in_channels,
|
23 |
+
out_channels = self.in_channels,
|
24 |
+
dtype = self.dtype
|
25 |
+
)
|
26 |
+
]
|
27 |
+
attentions = []
|
28 |
+
for _ in range(self.num_layers):
|
29 |
+
attn_block = TransformerPseudo3DModel(
|
30 |
+
in_channels = self.in_channels,
|
31 |
+
num_attention_heads = self.attn_num_head_channels,
|
32 |
+
attention_head_dim = self.in_channels // self.attn_num_head_channels,
|
33 |
+
num_layers = 1,
|
34 |
+
use_memory_efficient_attention = self.use_memory_efficient_attention,
|
35 |
+
dtype = self.dtype
|
36 |
+
)
|
37 |
+
attentions.append(attn_block)
|
38 |
+
res_block = ResnetBlockPseudo3D(
|
39 |
+
in_channels = self.in_channels,
|
40 |
+
out_channels = self.in_channels,
|
41 |
+
dtype = self.dtype
|
42 |
+
)
|
43 |
+
resnets.append(res_block)
|
44 |
+
self.attentions = attentions
|
45 |
+
self.resnets = resnets
|
46 |
+
|
47 |
+
def __call__(self,
|
48 |
+
hidden_states: jax.Array,
|
49 |
+
temb: jax.Array,
|
50 |
+
encoder_hidden_states = jax.Array
|
51 |
+
) -> jax.Array:
|
52 |
+
hidden_states = self.resnets[0](hidden_states, temb)
|
53 |
+
for attn, resnet in zip(self.attentions, self.resnets[1:]):
|
54 |
+
hidden_states = attn(hidden_states, encoder_hidden_states)
|
55 |
+
hidden_states = resnet(hidden_states, temb)
|
56 |
+
return hidden_states
|
57 |
+
|
58 |
+
|
59 |
+
class CrossAttnDownBlockPseudo3D(nn.Module):
|
60 |
+
in_channels: int
|
61 |
+
out_channels: int
|
62 |
+
num_layers: int = 1
|
63 |
+
attn_num_head_channels: int = 1
|
64 |
+
add_downsample: bool = True
|
65 |
+
use_memory_efficient_attention: bool = False
|
66 |
+
dtype: jnp.dtype = jnp.float32
|
67 |
+
|
68 |
+
def setup(self) -> None:
|
69 |
+
attentions = []
|
70 |
+
resnets = []
|
71 |
+
for i in range(self.num_layers):
|
72 |
+
in_channels = self.in_channels if i == 0 else self.out_channels
|
73 |
+
res_block = ResnetBlockPseudo3D(
|
74 |
+
in_channels = in_channels,
|
75 |
+
out_channels = self.out_channels,
|
76 |
+
dtype = self.dtype
|
77 |
+
)
|
78 |
+
resnets.append(res_block)
|
79 |
+
attn_block = TransformerPseudo3DModel(
|
80 |
+
in_channels = self.out_channels,
|
81 |
+
num_attention_heads = self.attn_num_head_channels,
|
82 |
+
attention_head_dim = self.out_channels // self.attn_num_head_channels,
|
83 |
+
num_layers = 1,
|
84 |
+
use_memory_efficient_attention = self.use_memory_efficient_attention,
|
85 |
+
dtype = self.dtype
|
86 |
+
)
|
87 |
+
attentions.append(attn_block)
|
88 |
+
self.resnets = resnets
|
89 |
+
self.attentions = attentions
|
90 |
+
|
91 |
+
if self.add_downsample:
|
92 |
+
self.downsamplers_0 = DownsamplePseudo3D(
|
93 |
+
out_channels = self.out_channels,
|
94 |
+
dtype = self.dtype
|
95 |
+
)
|
96 |
+
else:
|
97 |
+
self.downsamplers_0 = None
|
98 |
+
|
99 |
+
def __call__(self,
|
100 |
+
hidden_states: jax.Array,
|
101 |
+
temb: jax.Array,
|
102 |
+
encoder_hidden_states: jax.Array
|
103 |
+
) -> Tuple[jax.Array, jax.Array]:
|
104 |
+
output_states = ()
|
105 |
+
for resnet, attn in zip(self.resnets, self.attentions):
|
106 |
+
hidden_states = resnet(hidden_states, temb)
|
107 |
+
hidden_states = attn(hidden_states, encoder_hidden_states)
|
108 |
+
output_states += (hidden_states, )
|
109 |
+
if self.add_downsample:
|
110 |
+
hidden_states = self.downsamplers_0(hidden_states)
|
111 |
+
output_states += (hidden_states, )
|
112 |
+
return hidden_states, output_states
|
113 |
+
|
114 |
+
|
115 |
+
class DownBlockPseudo3D(nn.Module):
|
116 |
+
in_channels: int
|
117 |
+
out_channels: int
|
118 |
+
num_layers: int = 1
|
119 |
+
add_downsample: bool = True
|
120 |
+
dtype: jnp.dtype = jnp.float32
|
121 |
+
|
122 |
+
def setup(self) -> None:
|
123 |
+
resnets = []
|
124 |
+
for i in range(self.num_layers):
|
125 |
+
in_channels = self.in_channels if i == 0 else self.out_channels
|
126 |
+
res_block = ResnetBlockPseudo3D(
|
127 |
+
in_channels = in_channels,
|
128 |
+
out_channels = self.out_channels,
|
129 |
+
dtype = self.dtype
|
130 |
+
)
|
131 |
+
resnets.append(res_block)
|
132 |
+
self.resnets = resnets
|
133 |
+
if self.add_downsample:
|
134 |
+
self.downsamplers_0 = DownsamplePseudo3D(
|
135 |
+
out_channels = self.out_channels,
|
136 |
+
dtype = self.dtype
|
137 |
+
)
|
138 |
+
else:
|
139 |
+
self.downsamplers_0 = None
|
140 |
+
|
141 |
+
def __call__(self,
|
142 |
+
hidden_states: jax.Array,
|
143 |
+
temb: jax.Array
|
144 |
+
) -> Tuple[jax.Array, jax.Array]:
|
145 |
+
output_states = ()
|
146 |
+
for resnet in self.resnets:
|
147 |
+
hidden_states = resnet(hidden_states, temb)
|
148 |
+
output_states += (hidden_states, )
|
149 |
+
if self.add_downsample:
|
150 |
+
hidden_states = self.downsamplers_0(hidden_states)
|
151 |
+
output_states += (hidden_states, )
|
152 |
+
return hidden_states, output_states
|
153 |
+
|
154 |
+
|
155 |
+
class CrossAttnUpBlockPseudo3D(nn.Module):
|
156 |
+
in_channels: int
|
157 |
+
out_channels: int
|
158 |
+
prev_output_channels: int
|
159 |
+
num_layers: int = 1
|
160 |
+
attn_num_head_channels: int = 1
|
161 |
+
add_upsample: bool = True
|
162 |
+
use_memory_efficient_attention: bool = False
|
163 |
+
dtype: jnp.dtype = jnp.float32
|
164 |
+
|
165 |
+
def setup(self) -> None:
|
166 |
+
resnets = []
|
167 |
+
attentions = []
|
168 |
+
for i in range(self.num_layers):
|
169 |
+
res_skip_channels = self.in_channels if (i == self.num_layers -1) else self.out_channels
|
170 |
+
resnet_in_channels = self.prev_output_channels if i == 0 else self.out_channels
|
171 |
+
res_block = ResnetBlockPseudo3D(
|
172 |
+
in_channels = resnet_in_channels + res_skip_channels,
|
173 |
+
out_channels = self.out_channels,
|
174 |
+
dtype = self.dtype
|
175 |
+
)
|
176 |
+
resnets.append(res_block)
|
177 |
+
attn_block = TransformerPseudo3DModel(
|
178 |
+
in_channels = self.out_channels,
|
179 |
+
num_attention_heads = self.attn_num_head_channels,
|
180 |
+
attention_head_dim = self.out_channels // self.attn_num_head_channels,
|
181 |
+
num_layers = 1,
|
182 |
+
use_memory_efficient_attention = self.use_memory_efficient_attention,
|
183 |
+
dtype = self.dtype
|
184 |
+
)
|
185 |
+
attentions.append(attn_block)
|
186 |
+
self.resnets = resnets
|
187 |
+
self.attentions = attentions
|
188 |
+
if self.add_upsample:
|
189 |
+
self.upsamplers_0 = UpsamplePseudo3D(
|
190 |
+
out_channels = self.out_channels,
|
191 |
+
dtype = self.dtype
|
192 |
+
)
|
193 |
+
else:
|
194 |
+
self.upsamplers_0 = None
|
195 |
+
|
196 |
+
def __call__(self,
|
197 |
+
hidden_states: jax.Array,
|
198 |
+
res_hidden_states_tuple: Tuple[jax.Array, ...],
|
199 |
+
temb: jax.Array,
|
200 |
+
encoder_hidden_states: jax.Array
|
201 |
+
) -> jax.Array:
|
202 |
+
for resnet, attn in zip(self.resnets, self.attentions):
|
203 |
+
res_hidden_states = res_hidden_states_tuple[-1]
|
204 |
+
res_hidden_states_tuple = res_hidden_states_tuple[:-1]
|
205 |
+
hidden_states = jnp.concatenate((hidden_states, res_hidden_states), axis = -1)
|
206 |
+
hidden_states = resnet(hidden_states, temb)
|
207 |
+
hidden_states = attn(hidden_states, encoder_hidden_states)
|
208 |
+
if self.add_upsample:
|
209 |
+
hidden_states = self.upsamplers_0(hidden_states)
|
210 |
+
return hidden_states
|
211 |
+
|
212 |
+
|
213 |
+
class UpBlockPseudo3D(nn.Module):
|
214 |
+
in_channels: int
|
215 |
+
out_channels: int
|
216 |
+
prev_output_channels: int
|
217 |
+
num_layers: int = 1
|
218 |
+
add_upsample: bool = True
|
219 |
+
dtype: jnp.dtype = jnp.float32
|
220 |
+
|
221 |
+
def setup(self) -> None:
|
222 |
+
resnets = []
|
223 |
+
for i in range(self.num_layers):
|
224 |
+
res_skip_channels = self.in_channels if (i == self.num_layers - 1) else self.out_channels
|
225 |
+
resnet_in_channels = self.prev_output_channels if i == 0 else self.out_channels
|
226 |
+
res_block = ResnetBlockPseudo3D(
|
227 |
+
in_channels = resnet_in_channels + res_skip_channels,
|
228 |
+
out_channels = self.out_channels,
|
229 |
+
dtype = self.dtype
|
230 |
+
)
|
231 |
+
resnets.append(res_block)
|
232 |
+
self.resnets = resnets
|
233 |
+
if self.add_upsample:
|
234 |
+
self.upsamplers_0 = UpsamplePseudo3D(
|
235 |
+
out_channels = self.out_channels,
|
236 |
+
dtype = self.dtype
|
237 |
+
)
|
238 |
+
else:
|
239 |
+
self.upsamplers_0 = None
|
240 |
+
|
241 |
+
def __call__(self,
|
242 |
+
hidden_states: jax.Array,
|
243 |
+
res_hidden_states_tuple: Tuple[jax.Array, ...],
|
244 |
+
temb: jax.Array
|
245 |
+
) -> jax.Array:
|
246 |
+
for resnet in self.resnets:
|
247 |
+
res_hidden_states = res_hidden_states_tuple[-1]
|
248 |
+
res_hidden_states_tuple = res_hidden_states_tuple[:-1]
|
249 |
+
hidden_states = jnp.concatenate([hidden_states, res_hidden_states], axis = -1)
|
250 |
+
hidden_states = resnet(hidden_states, temb)
|
251 |
+
if self.add_upsample:
|
252 |
+
hidden_states = self.upsamplers_0(hidden_states)
|
253 |
+
return hidden_states
|
254 |
+
|
makeavid_sd/makeavid_sd/flax_impl/flax_unet_pseudo3d_condition.py
ADDED
@@ -0,0 +1,251 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
|
2 |
+
from typing import Tuple, Union
|
3 |
+
|
4 |
+
import jax
|
5 |
+
import jax.numpy as jnp
|
6 |
+
import flax.linen as nn
|
7 |
+
from flax.core.frozen_dict import FrozenDict
|
8 |
+
|
9 |
+
from diffusers.configuration_utils import ConfigMixin, flax_register_to_config
|
10 |
+
from diffusers.models.modeling_flax_utils import FlaxModelMixin
|
11 |
+
from diffusers.utils import BaseOutput
|
12 |
+
|
13 |
+
from .flax_unet_pseudo3d_blocks import (
|
14 |
+
CrossAttnDownBlockPseudo3D,
|
15 |
+
CrossAttnUpBlockPseudo3D,
|
16 |
+
DownBlockPseudo3D,
|
17 |
+
UpBlockPseudo3D,
|
18 |
+
UNetMidBlockPseudo3DCrossAttn
|
19 |
+
)
|
20 |
+
#from flax_embeddings import (
|
21 |
+
# TimestepEmbedding,
|
22 |
+
# Timesteps
|
23 |
+
#)
|
24 |
+
from diffusers.models.embeddings_flax import FlaxTimestepEmbedding, FlaxTimesteps
|
25 |
+
from .flax_resnet_pseudo3d import ConvPseudo3D
|
26 |
+
|
27 |
+
|
28 |
+
class UNetPseudo3DConditionOutput(BaseOutput):
|
29 |
+
sample: jax.Array
|
30 |
+
|
31 |
+
|
32 |
+
@flax_register_to_config
|
33 |
+
class UNetPseudo3DConditionModel(nn.Module, FlaxModelMixin, ConfigMixin):
|
34 |
+
sample_size: Union[int, Tuple[int, int]] = (64, 64)
|
35 |
+
in_channels: int = 4
|
36 |
+
out_channels: int = 4
|
37 |
+
down_block_types: Tuple[str] = (
|
38 |
+
"CrossAttnDownBlockPseudo3D",
|
39 |
+
"CrossAttnDownBlockPseudo3D",
|
40 |
+
"CrossAttnDownBlockPseudo3D",
|
41 |
+
"DownBlockPseudo3D"
|
42 |
+
)
|
43 |
+
up_block_types: Tuple[str] = (
|
44 |
+
"UpBlockPseudo3D",
|
45 |
+
"CrossAttnUpBlockPseudo3D",
|
46 |
+
"CrossAttnUpBlockPseudo3D",
|
47 |
+
"CrossAttnUpBlockPseudo3D"
|
48 |
+
)
|
49 |
+
block_out_channels: Tuple[int] = (
|
50 |
+
320,
|
51 |
+
640,
|
52 |
+
1280,
|
53 |
+
1280
|
54 |
+
)
|
55 |
+
layers_per_block: int = 2
|
56 |
+
attention_head_dim: Union[int, Tuple[int]] = 8
|
57 |
+
cross_attention_dim: int = 768
|
58 |
+
flip_sin_to_cos: bool = True
|
59 |
+
freq_shift: int = 0
|
60 |
+
use_memory_efficient_attention: bool = False
|
61 |
+
dtype: jnp.dtype = jnp.float32
|
62 |
+
param_dtype: str = 'float32'
|
63 |
+
|
64 |
+
def init_weights(self, rng: jax.random.KeyArray) -> FrozenDict:
|
65 |
+
if self.param_dtype == 'bfloat16':
|
66 |
+
param_dtype = jnp.bfloat16
|
67 |
+
elif self.param_dtype == 'float16':
|
68 |
+
param_dtype = jnp.float16
|
69 |
+
elif self.param_dtype == 'float32':
|
70 |
+
param_dtype = jnp.float32
|
71 |
+
else:
|
72 |
+
raise ValueError(f'unknown parameter type: {self.param_dtype}')
|
73 |
+
sample_size = self.sample_size
|
74 |
+
if isinstance(sample_size, int):
|
75 |
+
sample_size = (sample_size, sample_size)
|
76 |
+
sample_shape = (1, self.in_channels, 1, *sample_size)
|
77 |
+
sample = jnp.zeros(sample_shape, dtype = param_dtype)
|
78 |
+
timesteps = jnp.ones((1, ), dtype = jnp.int32)
|
79 |
+
encoder_hidden_states = jnp.zeros((1, 1, self.cross_attention_dim), dtype = param_dtype)
|
80 |
+
params_rng, dropout_rng = jax.random.split(rng)
|
81 |
+
rngs = { "params": params_rng, "dropout": dropout_rng }
|
82 |
+
return self.init(rngs, sample, timesteps, encoder_hidden_states)["params"]
|
83 |
+
|
84 |
+
def setup(self) -> None:
|
85 |
+
if isinstance(self.attention_head_dim, int):
|
86 |
+
attention_head_dim = (self.attention_head_dim, ) * len(self.down_block_types)
|
87 |
+
else:
|
88 |
+
attention_head_dim = self.attention_head_dim
|
89 |
+
time_embed_dim = self.block_out_channels[0] * 4
|
90 |
+
self.conv_in = ConvPseudo3D(
|
91 |
+
features = self.block_out_channels[0],
|
92 |
+
kernel_size = (3, 3),
|
93 |
+
strides = (1, 1),
|
94 |
+
padding = ((1, 1), (1, 1)),
|
95 |
+
dtype = self.dtype
|
96 |
+
)
|
97 |
+
self.time_proj = FlaxTimesteps(
|
98 |
+
dim = self.block_out_channels[0],
|
99 |
+
flip_sin_to_cos = self.flip_sin_to_cos,
|
100 |
+
freq_shift = self.freq_shift
|
101 |
+
)
|
102 |
+
self.time_embedding = FlaxTimestepEmbedding(
|
103 |
+
time_embed_dim = time_embed_dim,
|
104 |
+
dtype = self.dtype
|
105 |
+
)
|
106 |
+
down_blocks = []
|
107 |
+
output_channels = self.block_out_channels[0]
|
108 |
+
for i, down_block_type in enumerate(self.down_block_types):
|
109 |
+
input_channels = output_channels
|
110 |
+
output_channels = self.block_out_channels[i]
|
111 |
+
is_final_block = i == len(self.block_out_channels) - 1
|
112 |
+
# allows loading 3d models with old layer type names in their configs
|
113 |
+
# eg. 2D instead of Pseudo3D, like lxj's timelapse model
|
114 |
+
if down_block_type in ['CrossAttnDownBlockPseudo3D', 'CrossAttnDownBlock2D']:
|
115 |
+
down_block = CrossAttnDownBlockPseudo3D(
|
116 |
+
in_channels = input_channels,
|
117 |
+
out_channels = output_channels,
|
118 |
+
num_layers = self.layers_per_block,
|
119 |
+
attn_num_head_channels = attention_head_dim[i],
|
120 |
+
add_downsample = not is_final_block,
|
121 |
+
use_memory_efficient_attention = self.use_memory_efficient_attention,
|
122 |
+
dtype = self.dtype
|
123 |
+
)
|
124 |
+
elif down_block_type in ['DownBlockPseudo3D', 'DownBlock2D']:
|
125 |
+
down_block = DownBlockPseudo3D(
|
126 |
+
in_channels = input_channels,
|
127 |
+
out_channels = output_channels,
|
128 |
+
num_layers = self.layers_per_block,
|
129 |
+
add_downsample = not is_final_block,
|
130 |
+
dtype = self.dtype
|
131 |
+
)
|
132 |
+
else:
|
133 |
+
raise NotImplementedError(f'Unimplemented down block type: {down_block_type}')
|
134 |
+
down_blocks.append(down_block)
|
135 |
+
self.down_blocks = down_blocks
|
136 |
+
self.mid_block = UNetMidBlockPseudo3DCrossAttn(
|
137 |
+
in_channels = self.block_out_channels[-1],
|
138 |
+
attn_num_head_channels = attention_head_dim[-1],
|
139 |
+
use_memory_efficient_attention = self.use_memory_efficient_attention,
|
140 |
+
dtype = self.dtype
|
141 |
+
)
|
142 |
+
up_blocks = []
|
143 |
+
reversed_block_out_channels = list(reversed(self.block_out_channels))
|
144 |
+
reversed_attention_head_dim = list(reversed(attention_head_dim))
|
145 |
+
output_channels = reversed_block_out_channels[0]
|
146 |
+
for i, up_block_type in enumerate(self.up_block_types):
|
147 |
+
prev_output_channels = output_channels
|
148 |
+
output_channels = reversed_block_out_channels[i]
|
149 |
+
input_channels = reversed_block_out_channels[min(i + 1, len(self.block_out_channels) - 1)]
|
150 |
+
is_final_block = i == len(self.block_out_channels) - 1
|
151 |
+
if up_block_type in ['CrossAttnUpBlockPseudo3D', 'CrossAttnUpBlock2D']:
|
152 |
+
up_block = CrossAttnUpBlockPseudo3D(
|
153 |
+
in_channels = input_channels,
|
154 |
+
out_channels = output_channels,
|
155 |
+
prev_output_channels = prev_output_channels,
|
156 |
+
num_layers = self.layers_per_block + 1,
|
157 |
+
attn_num_head_channels = reversed_attention_head_dim[i],
|
158 |
+
add_upsample = not is_final_block,
|
159 |
+
use_memory_efficient_attention = self.use_memory_efficient_attention,
|
160 |
+
dtype = self.dtype
|
161 |
+
)
|
162 |
+
elif up_block_type in ['UpBlockPseudo3D', 'UpBlock2D']:
|
163 |
+
up_block = UpBlockPseudo3D(
|
164 |
+
in_channels = input_channels,
|
165 |
+
out_channels = output_channels,
|
166 |
+
prev_output_channels = prev_output_channels,
|
167 |
+
num_layers = self.layers_per_block + 1,
|
168 |
+
add_upsample = not is_final_block,
|
169 |
+
dtype = self.dtype
|
170 |
+
)
|
171 |
+
else:
|
172 |
+
raise NotImplementedError(f'Unimplemented up block type: {up_block_type}')
|
173 |
+
up_blocks.append(up_block)
|
174 |
+
self.up_blocks = up_blocks
|
175 |
+
self.conv_norm_out = nn.GroupNorm(
|
176 |
+
num_groups = 32,
|
177 |
+
epsilon = 1e-5
|
178 |
+
)
|
179 |
+
self.conv_out = ConvPseudo3D(
|
180 |
+
features = self.out_channels,
|
181 |
+
kernel_size = (3, 3),
|
182 |
+
strides = (1, 1),
|
183 |
+
padding = ((1, 1), (1, 1)),
|
184 |
+
dtype = self.dtype
|
185 |
+
)
|
186 |
+
|
187 |
+
def __call__(self,
|
188 |
+
sample: jax.Array,
|
189 |
+
timesteps: jax.Array,
|
190 |
+
encoder_hidden_states: jax.Array,
|
191 |
+
return_dict: bool = True
|
192 |
+
) -> Union[UNetPseudo3DConditionOutput, Tuple[jax.Array]]:
|
193 |
+
if timesteps.dtype != jnp.float32:
|
194 |
+
timesteps = timesteps.astype(dtype = jnp.float32)
|
195 |
+
if len(timesteps.shape) == 0:
|
196 |
+
timesteps = jnp.expand_dims(timesteps, 0)
|
197 |
+
# b,c,f,h,w -> b,f,h,w,c
|
198 |
+
sample = sample.transpose((0, 2, 3, 4, 1))
|
199 |
+
|
200 |
+
t_emb = self.time_proj(timesteps)
|
201 |
+
t_emb = self.time_embedding(t_emb)
|
202 |
+
sample = self.conv_in(sample)
|
203 |
+
down_block_res_samples = (sample, )
|
204 |
+
for down_block in self.down_blocks:
|
205 |
+
if isinstance(down_block, CrossAttnDownBlockPseudo3D):
|
206 |
+
sample, res_samples = down_block(
|
207 |
+
hidden_states = sample,
|
208 |
+
temb = t_emb,
|
209 |
+
encoder_hidden_states = encoder_hidden_states
|
210 |
+
)
|
211 |
+
elif isinstance(down_block, DownBlockPseudo3D):
|
212 |
+
sample, res_samples = down_block(
|
213 |
+
hidden_states = sample,
|
214 |
+
temb = t_emb
|
215 |
+
)
|
216 |
+
else:
|
217 |
+
raise NotImplementedError(f'Unimplemented down block type: {down_block.__class__.__name__}')
|
218 |
+
down_block_res_samples += res_samples
|
219 |
+
sample = self.mid_block(
|
220 |
+
hidden_states = sample,
|
221 |
+
temb = t_emb,
|
222 |
+
encoder_hidden_states = encoder_hidden_states
|
223 |
+
)
|
224 |
+
for up_block in self.up_blocks:
|
225 |
+
res_samples = down_block_res_samples[-(self.layers_per_block + 1):]
|
226 |
+
down_block_res_samples = down_block_res_samples[:-(self.layers_per_block + 1)]
|
227 |
+
if isinstance(up_block, CrossAttnUpBlockPseudo3D):
|
228 |
+
sample = up_block(
|
229 |
+
hidden_states = sample,
|
230 |
+
temb = t_emb,
|
231 |
+
encoder_hidden_states = encoder_hidden_states,
|
232 |
+
res_hidden_states_tuple = res_samples
|
233 |
+
)
|
234 |
+
elif isinstance(up_block, UpBlockPseudo3D):
|
235 |
+
sample = up_block(
|
236 |
+
hidden_states = sample,
|
237 |
+
temb = t_emb,
|
238 |
+
res_hidden_states_tuple = res_samples
|
239 |
+
)
|
240 |
+
else:
|
241 |
+
raise NotImplementedError(f'Unimplemented up block type: {up_block.__class__.__name__}')
|
242 |
+
sample = self.conv_norm_out(sample)
|
243 |
+
sample = nn.silu(sample)
|
244 |
+
sample = self.conv_out(sample)
|
245 |
+
|
246 |
+
# b,f,h,w,c -> b,c,f,h,w
|
247 |
+
sample = sample.transpose((0, 4, 1, 2, 3))
|
248 |
+
if not return_dict:
|
249 |
+
return (sample, )
|
250 |
+
return UNetPseudo3DConditionOutput(sample = sample)
|
251 |
+
|
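Note (not part of the diff): a minimal usage sketch for the UNet module defined above. The shapes, the 77-token context length and the bfloat16 compute dtype are assumptions for illustration; the constructor fields follow the dataclass attributes listed at the top of the file.

import jax
import jax.numpy as jnp

# instantiate with the defaults, overriding only a few fields (assumed values)
unet = UNetPseudo3DConditionModel(
    sample_size = (64, 64),
    use_memory_efficient_attention = True,
    dtype = jnp.bfloat16,
    param_dtype = 'float32',
)
params = unet.init_weights(jax.random.PRNGKey(0))

# forward pass: sample is (batch, channels, frames, height, width),
# timesteps is (batch,), encoder_hidden_states is (batch, tokens, cross_attention_dim)
sample = jnp.zeros((1, unet.in_channels, 8, 64, 64), dtype = jnp.float32)
timesteps = jnp.ones((1,), dtype = jnp.int32)
context = jnp.zeros((1, 77, unet.cross_attention_dim), dtype = jnp.float32)
out = unet.apply({'params': params}, sample, timesteps, context).sample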
makeavid_sd/makeavid_sd/flax_impl/train.py
ADDED
@@ -0,0 +1,143 @@
1 |
+
|
2 |
+
import jax
|
3 |
+
_ = jax.device_count() # workaround: touch the devices early so TPU communication does not lock up or race during init
|
4 |
+
|
5 |
+
from typing import Tuple, Optional
|
6 |
+
import os
|
7 |
+
from argparse import ArgumentParser
|
8 |
+
|
9 |
+
from flax_trainer import FlaxTrainerUNetPseudo3D
|
10 |
+
from dataset import load_dataset
|
11 |
+
|
12 |
+
def train(
|
13 |
+
dataset_path: str,
|
14 |
+
model_path: str,
|
15 |
+
output_dir: str,
|
16 |
+
dataset_cache_dir: Optional[str] = None,
|
17 |
+
from_pt: bool = True,
|
18 |
+
convert2d: bool = False,
|
19 |
+
only_temporal: bool = True,
|
20 |
+
sample_size: Tuple[int, int] = (64, 64),
|
21 |
+
lr: float = 5e-5,
|
22 |
+
batch_size: int = 1,
|
23 |
+
num_frames: int = 24,
|
24 |
+
epochs: int = 10,
|
25 |
+
warmup: float = 0.1,
|
26 |
+
decay: float = 0.0,
|
27 |
+
weight_decay: float = 1e-2,
|
28 |
+
log_every_step: int = 50,
|
29 |
+
save_every_epoch: int = 1,
|
30 |
+
sample_every_epoch: int = 1,
|
31 |
+
seed: int = 0,
|
32 |
+
dtype: str = 'bfloat16',
|
33 |
+
param_dtype: str = 'float32',
|
34 |
+
use_memory_efficient_attention: bool = True,
|
35 |
+
verbose: bool = True,
|
36 |
+
use_wandb: bool = False
|
37 |
+
) -> None:
|
38 |
+
log = lambda x: print(x) if verbose else None
|
39 |
+
log('\n----------------')
|
40 |
+
log('Init trainer')
|
41 |
+
trainer = FlaxTrainerUNetPseudo3D(
|
42 |
+
model_path = model_path,
|
43 |
+
from_pt = from_pt,
|
44 |
+
convert2d = convert2d,
|
45 |
+
sample_size = sample_size,
|
46 |
+
seed = seed,
|
47 |
+
dtype = dtype,
|
48 |
+
param_dtype = param_dtype,
|
49 |
+
use_memory_efficient_attention = use_memory_efficient_attention,
|
50 |
+
verbose = verbose,
|
51 |
+
only_temporal = only_temporal
|
52 |
+
)
|
53 |
+
log('\n----------------')
|
54 |
+
log('Init dataset')
|
55 |
+
dataloader = load_dataset(
|
56 |
+
dataset_path = dataset_path,
|
57 |
+
model_path = model_path,
|
58 |
+
cache_dir = dataset_cache_dir,
|
59 |
+
batch_size = batch_size * trainer.num_devices,
|
60 |
+
num_frames = num_frames,
|
61 |
+
num_workers = min(trainer.num_devices * 2, os.cpu_count() - 1),
|
62 |
+
as_numpy = True,
|
63 |
+
shuffle = True
|
64 |
+
)
|
65 |
+
log('\n----------------')
|
66 |
+
log('Train')
|
67 |
+
if use_wandb:
|
68 |
+
trainer.enable_wandb()
|
69 |
+
trainer.train(
|
70 |
+
dataloader = dataloader,
|
71 |
+
epochs = epochs,
|
72 |
+
num_frames = num_frames,
|
73 |
+
log_every_step = log_every_step,
|
74 |
+
save_every_epoch = save_every_epoch,
|
75 |
+
sample_every_epoch = sample_every_epoch,
|
76 |
+
lr = lr,
|
77 |
+
warmup = warmup,
|
78 |
+
decay = decay,
|
79 |
+
weight_decay = weight_decay,
|
80 |
+
output_dir = output_dir
|
81 |
+
)
|
82 |
+
log('\n----------------')
|
83 |
+
log('Done')
|
84 |
+
|
85 |
+
|
86 |
+
if __name__ == '__main__':
|
87 |
+
parser = ArgumentParser()
|
88 |
+
bool_type = lambda x: x.lower() in ['true', '1', 'yes']
|
89 |
+
parser.add_argument('-v', '--verbose', type = bool_type, default = True)
|
90 |
+
parser.add_argument('-d', '--dataset_path', required = True)
|
91 |
+
parser.add_argument('-m', '--model_path', required = True)
|
92 |
+
parser.add_argument('-o', '--output_dir', required = True)
|
93 |
+
parser.add_argument('-b', '--batch_size', type = int, default = 1)
|
94 |
+
parser.add_argument('-f', '--num_frames', type = int, default = 24)
|
95 |
+
parser.add_argument('-e', '--epochs', type = int, default = 2)
|
96 |
+
parser.add_argument('--only_temporal', type = bool_type, default = True)
|
97 |
+
parser.add_argument('--dataset_cache_dir', type = str, default = None)
|
98 |
+
parser.add_argument('--from_pt', type = bool_type, default = True)
|
99 |
+
parser.add_argument('--convert2d', type = bool_type, default = False)
|
100 |
+
parser.add_argument('--lr', type = float, default = 1e-4)
|
101 |
+
parser.add_argument('--warmup', type = float, default = 0.1)
|
102 |
+
parser.add_argument('--decay', type = float, default = 0.0)
|
103 |
+
parser.add_argument('--weight_decay', type = float, default = 1e-2)
|
104 |
+
parser.add_argument('--sample_size', type = int, nargs = 2, default = [64, 64])
|
105 |
+
parser.add_argument('--log_every_step', type = int, default = 250)
|
106 |
+
parser.add_argument('--save_every_epoch', type = int, default = 1)
|
107 |
+
parser.add_argument('--sample_every_epoch', type = int, default = 1)
|
108 |
+
parser.add_argument('--seed', type = int, default = 0)
|
109 |
+
parser.add_argument('--use_memory_efficient_attention', type = bool_type, default = True)
|
110 |
+
parser.add_argument('--dtype', choices = ['float32', 'bfloat16', 'float16'], default = 'bfloat16')
|
111 |
+
parser.add_argument('--param_dtype', choices = ['float32', 'bfloat16', 'float16'], default = 'float32')
|
112 |
+
parser.add_argument('--wandb', type = bool_type, default = False)
|
113 |
+
args = parser.parse_args()
|
114 |
+
args.sample_size = tuple(args.sample_size)
|
115 |
+
if args.verbose:
|
116 |
+
print(args)
|
117 |
+
train(
|
118 |
+
dataset_path = args.dataset_path,
|
119 |
+
model_path = args.model_path,
|
120 |
+
from_pt = args.from_pt,
|
121 |
+
convert2d = args.convert2d,
|
122 |
+
only_temporal = args.only_temporal,
|
123 |
+
output_dir = args.output_dir,
|
124 |
+
dataset_cache_dir = args.dataset_cache_dir,
|
125 |
+
batch_size = args.batch_size,
|
126 |
+
num_frames = args.num_frames,
|
127 |
+
epochs = args.epochs,
|
128 |
+
lr = args.lr,
|
129 |
+
warmup = args.warmup,
|
130 |
+
decay = args.decay,
|
131 |
+
weight_decay = args.weight_decay,
|
132 |
+
sample_size = args.sample_size,
|
133 |
+
seed = args.seed,
|
134 |
+
dtype = args.dtype,
|
135 |
+
param_dtype = args.param_dtype,
|
136 |
+
use_memory_efficient_attention = args.use_memory_efficient_attention,
|
137 |
+
log_every_step = args.log_every_step,
|
138 |
+
save_every_epoch = args.save_every_epoch,
|
139 |
+
sample_every_epoch = args.sample_every_epoch,
|
140 |
+
verbose = args.verbose,
|
141 |
+
use_wandb = args.wandb
|
142 |
+
)
|
143 |
+
|
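Note (not part of the diff): train() can also be invoked directly from Python instead of through the __main__ argument parser above; the paths here are placeholders.

from train import train

train(
    dataset_path = '/path/to/tempofunk-dataset',   # placeholder
    model_path = '/path/to/base-model',            # placeholder
    output_dir = '/path/to/output',                # placeholder
    batch_size = 1,
    num_frames = 24,
    epochs = 2,
    use_wandb = False,
)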
makeavid_sd/makeavid_sd/flax_impl/train.sh
ADDED
@@ -0,0 +1,34 @@
#!/bin/sh

#export WANDB_API_KEY="your_api_key"
export WANDB_ENTITY="tempofunk"
export WANDB_JOB_TYPE="train"
export WANDB_PROJECT="makeavid-sd-tpu"

python train.py \
    --dataset_path ../storage/dataset/tempofunk-s \
    --model_path ../storage/trained_models/ep20 \
    --from_pt False \
    --convert2d False \
    --only_temporal True \
    --output_dir ../storage/output \
    --batch_size 1 \
    --num_frames 24 \
    --epochs 20 \
    --lr 0.00005 \
    --warmup 0.1 \
    --decay 0.0 \
    --sample_size 64 64 \
    --log_every_step 50 \
    --save_every_epoch 1 \
    --sample_every_epoch 1 \
    --seed 2 \
    --use_memory_efficient_attention True \
    --dtype bfloat16 \
    --param_dtype float32 \
    --verbose True \
    --dataset_cache_dir ../storage/cache/hf/datasets \
    --wandb True

# sudo rm /tmp/libtpu_lockfile
makeavid_sd/makeavid_sd/inference.py
ADDED
@@ -0,0 +1,486 @@
1 |
+
|
2 |
+
from typing import Any, Union, Tuple, List, Dict
|
3 |
+
import os
|
4 |
+
import gc
|
5 |
+
from functools import partial
|
6 |
+
|
7 |
+
import jax
|
8 |
+
import jax.numpy as jnp
|
9 |
+
import numpy as np
|
10 |
+
|
11 |
+
from flax.core.frozen_dict import FrozenDict
|
12 |
+
from flax import jax_utils
|
13 |
+
from flax.training.common_utils import shard
|
14 |
+
from PIL import Image
|
15 |
+
import einops
|
16 |
+
|
17 |
+
from diffusers import FlaxAutoencoderKL, FlaxUNet2DConditionModel
|
18 |
+
from diffusers import (
|
19 |
+
FlaxDDIMScheduler,
|
20 |
+
FlaxDDPMScheduler,
|
21 |
+
FlaxPNDMScheduler,
|
22 |
+
FlaxLMSDiscreteScheduler,
|
23 |
+
FlaxDPMSolverMultistepScheduler,
|
24 |
+
FlaxKarrasVeScheduler,
|
25 |
+
FlaxScoreSdeVeScheduler
|
26 |
+
)
|
27 |
+
|
28 |
+
from transformers import FlaxCLIPTextModel, CLIPTokenizer
|
29 |
+
|
30 |
+
from .flax_impl.flax_unet_pseudo3d_condition import UNetPseudo3DConditionModel
|
31 |
+
|
32 |
+
SchedulerType = Union[
|
33 |
+
FlaxDDIMScheduler,
|
34 |
+
FlaxDDPMScheduler,
|
35 |
+
FlaxPNDMScheduler,
|
36 |
+
FlaxLMSDiscreteScheduler,
|
37 |
+
FlaxDPMSolverMultistepScheduler,
|
38 |
+
FlaxKarrasVeScheduler,
|
39 |
+
FlaxScoreSdeVeScheduler
|
40 |
+
]
|
41 |
+
|
42 |
+
def dtypestr(x: jnp.dtype):
|
43 |
+
if x == jnp.float32: return 'float32'
|
44 |
+
elif x == jnp.float16: return 'float16'
|
45 |
+
elif x == jnp.bfloat16: return 'bfloat16'
|
46 |
+
else: raise ValueError(f'unsupported dtype: {x}')
|
47 |
+
def castto(dtype, m, x):
|
48 |
+
if dtype == jnp.float32: return m.to_fp32(x)
|
49 |
+
elif dtype == jnp.float16: return m.to_fp16(x)
|
50 |
+
elif dtype == jnp.bfloat16: return m.to_bf16(x)
|
51 |
+
else: raise ValueError(f'unsupported dtype: {dtype}')
|
52 |
+
|
53 |
+
class InferenceUNetPseudo3D:
|
54 |
+
def __init__(self,
|
55 |
+
model_path: str,
|
56 |
+
scheduler_cls: SchedulerType = FlaxDDIMScheduler,
|
57 |
+
dtype: jnp.dtype = jnp.float16,
|
58 |
+
hf_auth_token: Union[str, None] = None
|
59 |
+
) -> None:
|
60 |
+
self.dtype = dtype
|
61 |
+
self.model_path = model_path
|
62 |
+
self.hf_auth_token = hf_auth_token
|
63 |
+
|
64 |
+
self.params: Dict[str, FrozenDict[str, Any]] = {}
|
65 |
+
unet, unet_params = UNetPseudo3DConditionModel.from_pretrained(
|
66 |
+
self.model_path,
|
67 |
+
subfolder = 'unet',
|
68 |
+
from_pt = False,
|
69 |
+
sample_size = (64, 64),
|
70 |
+
dtype = self.dtype,
|
71 |
+
param_dtype = dtypestr(self.dtype),
|
72 |
+
use_memory_efficient_attention = True,
|
73 |
+
use_auth_token = self.hf_auth_token
|
74 |
+
)
|
75 |
+
self.unet: UNetPseudo3DConditionModel = unet
|
76 |
+
unet_params = castto(self.dtype, self.unet, unet_params)
|
77 |
+
self.params['unet'] = FrozenDict(unet_params)
|
78 |
+
del unet_params
|
79 |
+
vae, vae_params = FlaxAutoencoderKL.from_pretrained(
|
80 |
+
self.model_path,
|
81 |
+
subfolder = 'vae',
|
82 |
+
from_pt = True,
|
83 |
+
dtype = self.dtype,
|
84 |
+
use_auth_token = self.hf_auth_token
|
85 |
+
)
|
86 |
+
self.vae: FlaxAutoencoderKL = vae
|
87 |
+
vae_params = castto(self.dtype, self.vae, vae_params)
|
88 |
+
self.params['vae'] = FrozenDict(vae_params)
|
89 |
+
del vae_params
|
90 |
+
text_encoder = FlaxCLIPTextModel.from_pretrained(
|
91 |
+
self.model_path,
|
92 |
+
subfolder = 'text_encoder',
|
93 |
+
from_pt = True,
|
94 |
+
dtype = self.dtype,
|
95 |
+
use_auth_token = self.hf_auth_token
|
96 |
+
)
|
97 |
+
text_encoder_params = text_encoder.params
|
98 |
+
del text_encoder._params
|
99 |
+
text_encoder_params = castto(self.dtype, text_encoder, text_encoder_params)
|
100 |
+
self.text_encoder: FlaxCLIPTextModel = text_encoder
|
101 |
+
self.params['text_encoder'] = FrozenDict(text_encoder_params)
|
102 |
+
del text_encoder_params
|
103 |
+
imunet, imunet_params = FlaxUNet2DConditionModel.from_pretrained(
|
104 |
+
'runwayml/stable-diffusion-v1-5',
|
105 |
+
subfolder = 'unet',
|
106 |
+
from_pt = True,
|
107 |
+
dtype = self.dtype,
|
108 |
+
use_memory_efficient_attention = True,
|
109 |
+
use_auth_token = self.hf_auth_token
|
110 |
+
)
|
111 |
+
imunet_params = castto(self.dtype, imunet, imunet_params)
|
112 |
+
self.imunet: FlaxUNet2DConditionModel = imunet
|
113 |
+
self.params['imunet'] = FrozenDict(imunet_params)
|
114 |
+
del imunet_params
|
115 |
+
self.tokenizer: CLIPTokenizer = CLIPTokenizer.from_pretrained(
|
116 |
+
self.model_path,
|
117 |
+
subfolder = 'tokenizer',
|
118 |
+
use_auth_token = self.hf_auth_token
|
119 |
+
)
|
120 |
+
scheduler, scheduler_state = scheduler_cls.from_pretrained(
|
121 |
+
self.model_path,
|
122 |
+
subfolder = 'scheduler',
|
123 |
+
dtype = jnp.float32,
|
124 |
+
use_auth_token = self.hf_auth_token
|
125 |
+
)
|
126 |
+
self.scheduler: scheduler_cls = scheduler
|
127 |
+
self.params['scheduler'] = scheduler_state
|
128 |
+
self.vae_scale_factor: int = int(2 ** (len(self.vae.config.block_out_channels) - 1))
|
129 |
+
self.device_count = jax.device_count()
|
130 |
+
gc.collect()
|
131 |
+
|
132 |
+
def set_scheduler(self, scheduler_cls: SchedulerType) -> None:
|
133 |
+
scheduler, scheduler_state = scheduler_cls.from_pretrained(
|
134 |
+
self.model_path,
|
135 |
+
subfolder = 'scheduler',
|
136 |
+
dtype = jnp.float32,
|
137 |
+
use_auth_token = self.hf_auth_token
|
138 |
+
)
|
139 |
+
self.scheduler: scheduler_cls = scheduler
|
140 |
+
self.params['scheduler'] = scheduler_state
|
141 |
+
|
142 |
+
def prepare_inputs(self,
|
143 |
+
prompt: List[str],
|
144 |
+
neg_prompt: List[str],
|
145 |
+
hint_image: List[Image.Image],
|
146 |
+
mask_image: List[Image.Image],
|
147 |
+
width: int,
|
148 |
+
height: int
|
149 |
+
) -> Tuple[jnp.ndarray, jnp.ndarray, jnp.ndarray, jnp.ndarray]: # prompt, neg_prompt, hint_image, mask_image
|
150 |
+
tokens = self.tokenizer(
|
151 |
+
prompt,
|
152 |
+
truncation = True,
|
153 |
+
return_overflowing_tokens = False,
|
154 |
+
max_length = 77, # self.text_encoder.config.max_length falls back to 20 when it is missing from the config, so hardcode CLIP's 77 here
|
155 |
+
padding = 'max_length',
|
156 |
+
return_tensors = 'np'
|
157 |
+
).input_ids
|
158 |
+
tokens = jnp.array(tokens, dtype = jnp.int32)
|
159 |
+
neg_tokens = self.tokenizer(
|
160 |
+
neg_prompt,
|
161 |
+
truncation = True,
|
162 |
+
return_overflowing_tokens = False,
|
163 |
+
max_length = 77,
|
164 |
+
padding = 'max_length',
|
165 |
+
return_tensors = 'np'
|
166 |
+
).input_ids
|
167 |
+
neg_tokens = jnp.array(neg_tokens, dtype = jnp.int32)
|
168 |
+
for i,im in enumerate(hint_image):
|
169 |
+
if im.size != (width, height):
|
170 |
+
hint_image[i] = hint_image[i].resize((width, height), resample = Image.Resampling.LANCZOS)
|
171 |
+
for i,im in enumerate(mask_image):
|
172 |
+
if im.size != (width, height):
|
173 |
+
mask_image[i] = mask_image[i].resize((width, height), resample = Image.Resampling.LANCZOS)
|
174 |
+
# b,h,w,c | c == 3
|
175 |
+
hint = jnp.concatenate(
|
176 |
+
[ jnp.expand_dims(np.asarray(x.convert('RGB')), axis = 0) for x in hint_image ],
|
177 |
+
axis = 0
|
178 |
+
).astype(jnp.float32)
|
179 |
+
# scale -1,1
|
180 |
+
hint = (hint / 255) * 2 - 1
|
181 |
+
# b,h,w,c | c == 1
|
182 |
+
mask = jnp.concatenate(
|
183 |
+
[ jnp.expand_dims(np.asarray(x.convert('L')), axis = (0, -1)) for x in mask_image ],
|
184 |
+
axis = 0
|
185 |
+
).astype(jnp.float32)
|
186 |
+
# scale -1,1
|
187 |
+
mask = (mask / 255) * 2 - 1
|
188 |
+
# binarize mask
|
189 |
+
mask = mask.at[mask < 0.5].set(0)
|
190 |
+
mask = mask.at[mask >= 0.5].set(1)
|
191 |
+
# mask
|
192 |
+
hint = hint * (mask < 0.5)
|
193 |
+
# b,h,w,c -> b,c,h,w
|
194 |
+
hint = hint.transpose((0,3,1,2))
|
195 |
+
mask = mask.transpose((0,3,1,2))
|
196 |
+
return tokens, neg_tokens, hint, mask
|
197 |
+
|
198 |
+
def generate(self,
|
199 |
+
prompt: Union[str, List[str]],
|
200 |
+
inference_steps: int,
|
201 |
+
hint_image: Union[Image.Image, List[Image.Image], None] = None,
|
202 |
+
mask_image: Union[Image.Image, List[Image.Image], None] = None,
|
203 |
+
neg_prompt: Union[str, List[str]] = '',
|
204 |
+
cfg: float = 10.0,
|
205 |
+
num_frames: int = 24,
|
206 |
+
width: int = 512,
|
207 |
+
height: int = 512,
|
208 |
+
seed: int = 0
|
209 |
+
) -> List[Image.Image]:
|
210 |
+
assert inference_steps > 0, f'number of inference steps must be > 0 but is {inference_steps}'
|
211 |
+
assert num_frames > 0, f'number of frames must be > 0 but is {num_frames}'
|
212 |
+
assert width % 32 == 0, f'width must be divisible by 32 but is {width}'
|
213 |
+
assert height % 32 == 0, f'height must be divisible by 32 but is {height}'
|
214 |
+
if isinstance(prompt, str):
|
215 |
+
prompt = [ prompt ]
|
216 |
+
batch_size = len(prompt)
|
217 |
+
assert batch_size % self.device_count == 0, f'batch size must be multiple of {self.device_count}'
|
218 |
+
if hint_image is None:
|
219 |
+
hint_image = Image.new('RGB', (width, height), color = (0,0,0))
|
220 |
+
use_imagegen = True
|
221 |
+
else:
|
222 |
+
use_imagegen = False
|
223 |
+
if isinstance(hint_image, Image.Image):
|
224 |
+
hint_image = [ hint_image ] * batch_size
|
225 |
+
assert len(hint_image) == batch_size, f'number of hint images must be equal to batch size {batch_size} but is {len(hint_image)}'
|
226 |
+
if mask_image is None:
|
227 |
+
mask_image = Image.new('L', hint_image[0].size, color = 0)
|
228 |
+
if isinstance(mask_image, Image.Image):
|
229 |
+
mask_image = [ mask_image ] * batch_size
|
230 |
+
assert len(mask_image) == batch_size, f'number of mask images must be equal to batch size {batch_size} but is {len(mask_image)}'
|
231 |
+
if isinstance(neg_prompt, str):
|
232 |
+
neg_prompt = [ neg_prompt ] * batch_size
|
233 |
+
assert len(neg_prompt) == batch_size, f'number of negative prompts must be equal to batch size {batch_size} but is {len(neg_prompt)}'
|
234 |
+
tokens, neg_tokens, hint, mask = self.prepare_inputs(
|
235 |
+
prompt = prompt,
|
236 |
+
neg_prompt = neg_prompt,
|
237 |
+
hint_image = hint_image,
|
238 |
+
mask_image = mask_image,
|
239 |
+
width = width,
|
240 |
+
height = height
|
241 |
+
)
|
242 |
+
# NOTE splitting rngs is not deterministic,
|
243 |
+
# running on different device counts gives different seeds
|
244 |
+
#rng = jax.random.PRNGKey(seed)
|
245 |
+
#rngs = jax.random.split(rng, self.device_count)
|
246 |
+
# manually assign seeded RNGs to devices for reproducibility
|
247 |
+
rngs = jnp.array([ jax.random.PRNGKey(seed + i) for i in range(self.device_count) ])
|
248 |
+
params = jax_utils.replicate(self.params)
|
249 |
+
tokens = shard(tokens)
|
250 |
+
neg_tokens = shard(neg_tokens)
|
251 |
+
hint = shard(hint)
|
252 |
+
mask = shard(mask)
|
253 |
+
images = _p_generate(self,
|
254 |
+
tokens,
|
255 |
+
neg_tokens,
|
256 |
+
hint,
|
257 |
+
mask,
|
258 |
+
inference_steps,
|
259 |
+
num_frames,
|
260 |
+
height,
|
261 |
+
width,
|
262 |
+
cfg,
|
263 |
+
rngs,
|
264 |
+
params,
|
265 |
+
use_imagegen
|
266 |
+
)
|
267 |
+
if images.ndim == 5:
|
268 |
+
images = einops.rearrange(images, 'd f c h w -> (d f) h w c')
|
269 |
+
else:
|
270 |
+
images = einops.rearrange(images, 'f c h w -> f h w c')
|
271 |
+
# to cpu
|
272 |
+
images = np.array(images)
|
273 |
+
images = [ Image.fromarray(x) for x in images ]
|
274 |
+
return images
|
275 |
+
|
276 |
+
def _generate(self,
|
277 |
+
tokens: jnp.ndarray,
|
278 |
+
neg_tokens: jnp.ndarray,
|
279 |
+
hint: jnp.ndarray,
|
280 |
+
mask: jnp.ndarray,
|
281 |
+
inference_steps: int,
|
282 |
+
num_frames,
|
283 |
+
height,
|
284 |
+
width,
|
285 |
+
cfg: float,
|
286 |
+
rng: jax.random.KeyArray,
|
287 |
+
params: Union[Dict[str, Any], FrozenDict[str, Any]],
|
288 |
+
use_imagegen: bool
|
289 |
+
) -> List[Image.Image]:
|
290 |
+
batch_size = tokens.shape[0]
|
291 |
+
latent_h = height // self.vae_scale_factor
|
292 |
+
latent_w = width // self.vae_scale_factor
|
293 |
+
latent_shape = (
|
294 |
+
batch_size,
|
295 |
+
self.vae.config.latent_channels,
|
296 |
+
num_frames,
|
297 |
+
latent_h,
|
298 |
+
latent_w
|
299 |
+
)
|
300 |
+
encoded_prompt = self.text_encoder(tokens, params = params['text_encoder'])[0]
|
301 |
+
encoded_neg_prompt = self.text_encoder(neg_tokens, params = params['text_encoder'])[0]
|
302 |
+
|
303 |
+
if use_imagegen:
|
304 |
+
image_latent_shape = (batch_size, self.vae.config.latent_channels, latent_h, latent_w)
|
305 |
+
image_latents = jax.random.normal(
|
306 |
+
rng,
|
307 |
+
shape = image_latent_shape,
|
308 |
+
dtype = jnp.float32
|
309 |
+
) * params['scheduler'].init_noise_sigma
|
310 |
+
image_scheduler_state = self.scheduler.set_timesteps(
|
311 |
+
params['scheduler'],
|
312 |
+
num_inference_steps = inference_steps,
|
313 |
+
shape = image_latents.shape
|
314 |
+
)
|
315 |
+
def image_sample_loop(step, args):
|
316 |
+
image_latents, image_scheduler_state = args
|
317 |
+
t = image_scheduler_state.timesteps[step]
|
318 |
+
tt = jnp.broadcast_to(t, image_latents.shape[0])
|
319 |
+
latents_input = self.scheduler.scale_model_input(image_scheduler_state, image_latents, t)
|
320 |
+
noise_pred = self.imunet.apply(
|
321 |
+
{'params': params['imunet']},
|
322 |
+
latents_input,
|
323 |
+
tt,
|
324 |
+
encoder_hidden_states = encoded_prompt
|
325 |
+
).sample
|
326 |
+
noise_pred_uncond = self.imunet.apply(
|
327 |
+
{'params': params['imunet']},
|
328 |
+
latents_input,
|
329 |
+
tt,
|
330 |
+
encoder_hidden_states = encoded_neg_prompt
|
331 |
+
).sample
|
332 |
+
noise_pred = noise_pred_uncond + cfg * (noise_pred - noise_pred_uncond)
|
333 |
+
image_latents, image_scheduler_state = self.scheduler.step(
|
334 |
+
image_scheduler_state,
|
335 |
+
noise_pred.astype(jnp.float32),
|
336 |
+
t,
|
337 |
+
image_latents
|
338 |
+
).to_tuple()
|
339 |
+
return image_latents, image_scheduler_state
|
340 |
+
image_latents, _ = jax.lax.fori_loop(
|
341 |
+
0, inference_steps,
|
342 |
+
image_sample_loop,
|
343 |
+
(image_latents, image_scheduler_state)
|
344 |
+
)
|
345 |
+
hint = image_latents
|
346 |
+
else:
|
347 |
+
hint = self.vae.apply(
|
348 |
+
{'params': params['vae']},
|
349 |
+
hint,
|
350 |
+
method = self.vae.encode
|
351 |
+
).latent_dist.mean * self.vae.config.scaling_factor
|
352 |
+
# NOTE vae keeps channels last for encode, but rearranges to channels first for decode
|
353 |
+
# b0 h1 w2 c3 -> b0 c3 h1 w2
|
354 |
+
hint = hint.transpose((0, 3, 1, 2))
|
355 |
+
|
356 |
+
hint = jnp.expand_dims(hint, axis = 2).repeat(num_frames, axis = 2)
|
357 |
+
mask = jax.image.resize(mask, (*mask.shape[:-2], *hint.shape[-2:]), method = 'nearest')
|
358 |
+
mask = jnp.expand_dims(mask, axis = 2).repeat(num_frames, axis = 2)
|
359 |
+
# NOTE jax.random.normal is unreliable in float16 / bfloat16
|
360 |
+
# SEE https://github.com/google/jax/discussions/13798
|
361 |
+
# generate random at float32
|
362 |
+
latents = jax.random.normal(
|
363 |
+
rng,
|
364 |
+
shape = latent_shape,
|
365 |
+
dtype = jnp.float32
|
366 |
+
) * params['scheduler'].init_noise_sigma
|
367 |
+
scheduler_state = self.scheduler.set_timesteps(
|
368 |
+
params['scheduler'],
|
369 |
+
num_inference_steps = inference_steps,
|
370 |
+
shape = latents.shape
|
371 |
+
)
|
372 |
+
|
373 |
+
def sample_loop(step, args):
|
374 |
+
latents, scheduler_state = args
|
375 |
+
t = scheduler_state.timesteps[step]#jnp.array(scheduler_state.timesteps, dtype = jnp.int32)[step]
|
376 |
+
tt = jnp.broadcast_to(t, latents.shape[0])
|
377 |
+
latents_input = self.scheduler.scale_model_input(scheduler_state, latents, t)
|
378 |
+
latents_input = jnp.concatenate([latents_input, mask, hint], axis = 1)
|
379 |
+
noise_pred = self.unet.apply(
|
380 |
+
{ 'params': params['unet'] },
|
381 |
+
latents_input,
|
382 |
+
tt,
|
383 |
+
encoded_prompt
|
384 |
+
).sample
|
385 |
+
noise_pred_uncond = self.unet.apply(
|
386 |
+
{ 'params': params['unet'] },
|
387 |
+
latents_input,
|
388 |
+
tt,
|
389 |
+
encoded_neg_prompt
|
390 |
+
).sample
|
391 |
+
noise_pred = noise_pred_uncond + cfg * (noise_pred - noise_pred_uncond)
|
392 |
+
latents, scheduler_state = self.scheduler.step(
|
393 |
+
scheduler_state,
|
394 |
+
noise_pred.astype(jnp.float32),
|
395 |
+
t,
|
396 |
+
latents
|
397 |
+
).to_tuple()
|
398 |
+
return latents, scheduler_state
|
399 |
+
|
400 |
+
latents, _ = jax.lax.fori_loop(
|
401 |
+
0, inference_steps,
|
402 |
+
sample_loop,
|
403 |
+
(latents, scheduler_state)
|
404 |
+
)
|
405 |
+
latents = 1 / self.vae.config.scaling_factor * latents
|
406 |
+
latents = einops.rearrange(latents, 'b c f h w -> (b f) c h w')
|
407 |
+
num_images = len(latents)
|
408 |
+
images_out = jnp.zeros(
|
409 |
+
(
|
410 |
+
num_images,
|
411 |
+
self.vae.config.out_channels,
|
412 |
+
height,
|
413 |
+
width
|
414 |
+
),
|
415 |
+
dtype = self.dtype
|
416 |
+
)
|
417 |
+
def decode_loop(step, images_out):
|
418 |
+
# NOTE vae keeps channels last for encode, but rearranges to channels first for decode
|
419 |
+
im = self.vae.apply(
|
420 |
+
{ 'params': params['vae'] },
|
421 |
+
jnp.expand_dims(latents[step], axis = 0),
|
422 |
+
method = self.vae.decode
|
423 |
+
).sample
|
424 |
+
images_out = images_out.at[step].set(im[0])
|
425 |
+
return images_out
|
426 |
+
images_out = jax.lax.fori_loop(0, num_images, decode_loop, images_out)
|
427 |
+
images_out = ((images_out / 2 + 0.5) * 255).round().clip(0, 255).astype(jnp.uint8)
|
428 |
+
return images_out
|
429 |
+
|
430 |
+
|
431 |
+
@partial(
|
432 |
+
jax.pmap,
|
433 |
+
in_axes = ( # 0 -> split across batch dim, None -> duplicate
|
434 |
+
None, # 0 inference_class
|
435 |
+
0, # 1 tokens
|
436 |
+
0, # 2 neg_tokens
|
437 |
+
0, # 3 hint
|
438 |
+
0, # 4 mask
|
439 |
+
None, # 5 inference_steps
|
440 |
+
None, # 6 num_frames
|
441 |
+
None, # 7 height
|
442 |
+
None, # 8 width
|
443 |
+
None, # 9 cfg
|
444 |
+
0, # 10 rng
|
445 |
+
0, # 11 params
|
446 |
+
None, # 12 use_imagegen
|
447 |
+
),
|
448 |
+
static_broadcasted_argnums = ( # trigger recompilation on change
|
449 |
+
0, # inference_class
|
450 |
+
5, # inference_steps
|
451 |
+
6, # num_frames
|
452 |
+
7, # height
|
453 |
+
8, # width
|
454 |
+
12, # use_imagegen
|
455 |
+
)
|
456 |
+
)
|
457 |
+
def _p_generate(
|
458 |
+
inference_class: InferenceUNetPseudo3D,
|
459 |
+
tokens,
|
460 |
+
neg_tokens,
|
461 |
+
hint,
|
462 |
+
mask,
|
463 |
+
inference_steps,
|
464 |
+
num_frames,
|
465 |
+
height,
|
466 |
+
width,
|
467 |
+
cfg,
|
468 |
+
rng,
|
469 |
+
params,
|
470 |
+
use_imagegen
|
471 |
+
):
|
472 |
+
return inference_class._generate(
|
473 |
+
tokens,
|
474 |
+
neg_tokens,
|
475 |
+
hint,
|
476 |
+
mask,
|
477 |
+
inference_steps,
|
478 |
+
num_frames,
|
479 |
+
height,
|
480 |
+
width,
|
481 |
+
cfg,
|
482 |
+
rng,
|
483 |
+
params,
|
484 |
+
use_imagegen
|
485 |
+
)
|
486 |
+
|
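Note (not part of the diff): a sketch of how the pipeline above can be driven end to end. The model path and prompt are placeholders, and the single prompt assumes a single accelerator device (the batch size must be a multiple of jax.device_count()).

import jax.numpy as jnp

pipe = InferenceUNetPseudo3D(
    model_path = '/path/to/makeavid-sd-checkpoint',   # placeholder
    dtype = jnp.float16,
)
frames = pipe.generate(
    prompt = 'a timelapse of clouds drifting over a city',
    inference_steps = 20,
    num_frames = 24,
    width = 512,
    height = 512,
    cfg = 10.0,
    seed = 0,
)
# generate() returns a flat list of PIL images, one per generated frame,
# with the device and frame axes already merged.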
makeavid_sd/makeavid_sd/torch_impl/__init__.py
ADDED
File without changes
|
makeavid_sd/makeavid_sd/torch_impl/torch_attention_pseudo3d.py
ADDED
@@ -0,0 +1,294 @@
1 |
+
from typing import Optional
|
2 |
+
|
3 |
+
import torch
|
4 |
+
import torch.nn.functional as F
|
5 |
+
from torch import nn
|
6 |
+
|
7 |
+
from einops import rearrange
|
8 |
+
|
9 |
+
from diffusers.models.attention_processor import Attention as CrossAttention
|
10 |
+
#from torch_cross_attention import CrossAttention
|
11 |
+
|
12 |
+
|
13 |
+
class TransformerPseudo3DModelOutput:
|
14 |
+
def __init__(self, sample: torch.FloatTensor) -> None:
|
15 |
+
self.sample = sample
|
16 |
+
|
17 |
+
|
18 |
+
class TransformerPseudo3DModel(nn.Module):
|
19 |
+
def __init__(self,
|
20 |
+
num_attention_heads: int = 16,
|
21 |
+
attention_head_dim: int = 88,
|
22 |
+
in_channels: Optional[int] = None,
|
23 |
+
num_layers: int = 1,
|
24 |
+
dropout: float = 0.0,
|
25 |
+
norm_num_groups: int = 32,
|
26 |
+
cross_attention_dim: Optional[int] = None,
|
27 |
+
attention_bias: bool = False
|
28 |
+
) -> None:
|
29 |
+
super().__init__()
|
30 |
+
self.num_attention_heads = num_attention_heads
|
31 |
+
self.attention_head_dim = attention_head_dim
|
32 |
+
inner_dim = num_attention_heads * attention_head_dim
|
33 |
+
|
34 |
+
# 1. Transformer2DModel can process both standard continuous images of shape `(batch_size, num_channels, width, height)` as well as quantized image embeddings of shape `(batch_size, num_image_vectors)`
|
35 |
+
# Define whether input is continuous or discrete depending on configuration
|
36 |
+
# here it is always continuous
|
37 |
+
|
38 |
+
# 2. Define input layers
|
39 |
+
self.in_channels = in_channels
|
40 |
+
|
41 |
+
self.norm = torch.nn.GroupNorm(
|
42 |
+
num_groups = norm_num_groups,
|
43 |
+
num_channels = in_channels,
|
44 |
+
eps = 1e-6,
|
45 |
+
affine = True
|
46 |
+
)
|
47 |
+
self.proj_in = nn.Conv2d(
|
48 |
+
in_channels,
|
49 |
+
inner_dim,
|
50 |
+
kernel_size = 1,
|
51 |
+
stride = 1,
|
52 |
+
padding = 0
|
53 |
+
)
|
54 |
+
|
55 |
+
# 3. Define transformers blocks
|
56 |
+
self.transformer_blocks = nn.ModuleList(
|
57 |
+
[
|
58 |
+
BasicTransformerBlock(
|
59 |
+
inner_dim,
|
60 |
+
num_attention_heads,
|
61 |
+
attention_head_dim,
|
62 |
+
dropout = dropout,
|
63 |
+
cross_attention_dim = cross_attention_dim,
|
64 |
+
attention_bias = attention_bias,
|
65 |
+
)
|
66 |
+
for _ in range(num_layers)
|
67 |
+
]
|
68 |
+
)
|
69 |
+
|
70 |
+
# 4. Define output layers
|
71 |
+
self.proj_out = nn.Conv2d(inner_dim, in_channels, kernel_size = 1, stride = 1, padding = 0)
|
72 |
+
|
73 |
+
def forward(self,
|
74 |
+
hidden_states: torch.Tensor,
|
75 |
+
encoder_hidden_states: Optional[torch.Tensor] = None,
|
76 |
+
timestep: torch.long = None
|
77 |
+
) -> TransformerPseudo3DModelOutput:
|
78 |
+
"""
|
79 |
+
Args:
|
80 |
+
hidden_states ( When discrete, `torch.LongTensor` of shape `(batch size, num latent pixels)`.
|
81 |
+
When continuous, `torch.FloatTensor` of shape `(batch size, channel, height, width)`): Input
|
82 |
+
hidden_states
|
83 |
+
encoder_hidden_states ( `torch.LongTensor` of shape `(batch size, context dim)`, *optional*):
|
84 |
+
Conditional embeddings for cross attention layer. If not given, cross-attention defaults to
|
85 |
+
self-attention.
|
86 |
+
timestep ( `torch.long`, *optional*):
|
87 |
+
Optional timestep to be applied as an embedding in AdaLayerNorm's. Used to indicate denoising step.
|
88 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
89 |
+
Whether or not to return a [`models.unet_2d_condition.UNet2DConditionOutput`] instead of a plain tuple.
|
90 |
+
|
91 |
+
Returns:
|
92 |
+
[`~models.attention.Transformer2DModelOutput`] or `tuple`: [`~models.attention.Transformer2DModelOutput`]
|
93 |
+
if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is the sample
|
94 |
+
tensor.
|
95 |
+
"""
|
96 |
+
b, c, *_, h, w = hidden_states.shape
|
97 |
+
is_video = hidden_states.ndim == 5
|
98 |
+
f = None
|
99 |
+
if is_video:
|
100 |
+
b, c, f, h, w = hidden_states.shape
|
101 |
+
hidden_states = rearrange(hidden_states, 'b c f h w -> (b f) c h w')
|
102 |
+
#encoder_hidden_states = encoder_hidden_states.repeat_interleave(f, 0)
|
103 |
+
|
104 |
+
# 1. Input
|
105 |
+
batch, channel, height, weight = hidden_states.shape
|
106 |
+
residual = hidden_states
|
107 |
+
hidden_states = self.norm(hidden_states)
|
108 |
+
hidden_states = self.proj_in(hidden_states)
|
109 |
+
inner_dim = hidden_states.shape[1]
|
110 |
+
hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * weight, inner_dim)
|
111 |
+
|
112 |
+
# 2. Blocks
|
113 |
+
for block in self.transformer_blocks:
|
114 |
+
hidden_states = block(
|
115 |
+
hidden_states,
|
116 |
+
context = encoder_hidden_states,
|
117 |
+
timestep = timestep,
|
118 |
+
frames_length = f,
|
119 |
+
height = height,
|
120 |
+
weight = weight
|
121 |
+
)
|
122 |
+
|
123 |
+
# 3. Output
|
124 |
+
hidden_states = hidden_states.reshape(batch, height, weight, inner_dim).permute(0, 3, 1, 2)
|
125 |
+
hidden_states = self.proj_out(hidden_states)
|
126 |
+
output = hidden_states + residual
|
127 |
+
|
128 |
+
if is_video:
|
129 |
+
output = rearrange(output, '(b f) c h w -> b c f h w', b = b)
|
130 |
+
|
131 |
+
return TransformerPseudo3DModelOutput(sample = output)
|
132 |
+
|
133 |
+
|
134 |
+
|
135 |
+
class BasicTransformerBlock(nn.Module):
|
136 |
+
r"""
|
137 |
+
A basic Transformer block.
|
138 |
+
|
139 |
+
Parameters:
|
140 |
+
dim (`int`): The number of channels in the input and output.
|
141 |
+
num_attention_heads (`int`): The number of heads to use for multi-head attention.
|
142 |
+
attention_head_dim (`int`): The number of channels in each head.
|
143 |
+
dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
|
144 |
+
cross_attention_dim (`int`, *optional*): The size of the context vector for cross attention.
|
145 |
+
num_embeds_ada_norm (:
|
146 |
+
obj: `int`, *optional*): The number of diffusion steps used during training. See `Transformer2DModel`.
|
147 |
+
attention_bias (:
|
148 |
+
obj: `bool`, *optional*, defaults to `False`): Configure if the attentions should contain a bias parameter.
|
149 |
+
"""
|
150 |
+
|
151 |
+
def __init__(self,
|
152 |
+
dim: int,
|
153 |
+
num_attention_heads: int,
|
154 |
+
attention_head_dim: int,
|
155 |
+
dropout: float = 0.0,
|
156 |
+
cross_attention_dim: Optional[int] = None,
|
157 |
+
attention_bias: bool = False,
|
158 |
+
) -> None:
|
159 |
+
super().__init__()
|
160 |
+
self.attn1 = CrossAttention(
|
161 |
+
query_dim = dim,
|
162 |
+
heads = num_attention_heads,
|
163 |
+
dim_head = attention_head_dim,
|
164 |
+
dropout = dropout,
|
165 |
+
bias = attention_bias
|
166 |
+
) # is a self-attention
|
167 |
+
self.ff = FeedForward(dim, dropout = dropout)
|
168 |
+
self.attn2 = CrossAttention(
|
169 |
+
query_dim = dim,
|
170 |
+
cross_attention_dim = cross_attention_dim,
|
171 |
+
heads = num_attention_heads,
|
172 |
+
dim_head = attention_head_dim,
|
173 |
+
dropout = dropout,
|
174 |
+
bias = attention_bias
|
175 |
+
) # is self-attn if context is none
|
176 |
+
self.attn_temporal = CrossAttention(
|
177 |
+
query_dim = dim,
|
178 |
+
heads = num_attention_heads,
|
179 |
+
dim_head = attention_head_dim,
|
180 |
+
dropout = dropout,
|
181 |
+
bias = attention_bias
|
182 |
+
) # is a self-attention
|
183 |
+
|
184 |
+
# layer norms
|
185 |
+
self.norm1 = nn.LayerNorm(dim)
|
186 |
+
self.norm2 = nn.LayerNorm(dim)
|
187 |
+
self.norm_temporal = nn.LayerNorm(dim)
|
188 |
+
self.norm3 = nn.LayerNorm(dim)
|
189 |
+
|
190 |
+
def forward(self,
|
191 |
+
hidden_states: torch.Tensor,
|
192 |
+
context: Optional[torch.Tensor] = None,
|
193 |
+
timestep: torch.int64 = None,
|
194 |
+
frames_length: Optional[int] = None,
|
195 |
+
height: Optional[int] = None,
|
196 |
+
weight: Optional[int] = None
|
197 |
+
) -> torch.Tensor:
|
198 |
+
if context is not None and frames_length is not None:
|
199 |
+
context = context.repeat_interleave(frames_length, 0)
|
200 |
+
# 1. Self-Attention
|
201 |
+
norm_hidden_states = (
|
202 |
+
self.norm1(hidden_states)
|
203 |
+
)
|
204 |
+
hidden_states = self.attn1(norm_hidden_states) + hidden_states
|
205 |
+
|
206 |
+
# 2. Cross-Attention
|
207 |
+
norm_hidden_states = (
|
208 |
+
self.norm2(hidden_states)
|
209 |
+
)
|
210 |
+
hidden_states = self.attn2(
|
211 |
+
norm_hidden_states,
|
212 |
+
encoder_hidden_states = context
|
213 |
+
) + hidden_states
|
214 |
+
|
215 |
+
# append temporal attention
|
216 |
+
if frames_length is not None:
|
217 |
+
hidden_states = rearrange(
|
218 |
+
hidden_states,
|
219 |
+
'(b f) (h w) c -> (b h w) f c',
|
220 |
+
f = frames_length,
|
221 |
+
h = height,
|
222 |
+
w = weight
|
223 |
+
)
|
224 |
+
norm_hidden_states = (
|
225 |
+
self.norm_temporal(hidden_states)
|
226 |
+
)
|
227 |
+
hidden_states = self.attn_temporal(norm_hidden_states) + hidden_states
|
228 |
+
hidden_states = rearrange(
|
229 |
+
hidden_states,
|
230 |
+
'(b h w) f c -> (b f) (h w) c',
|
231 |
+
f = frames_length,
|
232 |
+
h = height,
|
233 |
+
w = weight
|
234 |
+
)
|
235 |
+
|
236 |
+
# 3. Feed-forward
|
237 |
+
hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states
|
238 |
+
return hidden_states
|
239 |
+
|
240 |
+
|
241 |
+
class FeedForward(nn.Module):
|
242 |
+
r"""
|
243 |
+
A feed-forward layer.
|
244 |
+
|
245 |
+
Parameters:
|
246 |
+
dim (`int`): The number of channels in the input.
|
247 |
+
dim_out (`int`, *optional*): The number of channels in the output. If not given, defaults to `dim`.
|
248 |
+
mult (`int`, *optional*, defaults to 4): The multiplier to use for the hidden dimension.
|
249 |
+
dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
|
250 |
+
"""
|
251 |
+
|
252 |
+
def __init__(self,
|
253 |
+
dim: int,
|
254 |
+
dim_out: Optional[int] = None,
|
255 |
+
mult: int = 4,
|
256 |
+
dropout: float = 0.0
|
257 |
+
) -> None:
|
258 |
+
super().__init__()
|
259 |
+
inner_dim = int(dim * mult)
|
260 |
+
dim_out = dim_out if dim_out is not None else dim
|
261 |
+
|
262 |
+
geglu = GEGLU(dim, inner_dim)
|
263 |
+
|
264 |
+
self.net = nn.ModuleList([])
|
265 |
+
# project in
|
266 |
+
self.net.append(geglu)
|
267 |
+
# project dropout
|
268 |
+
self.net.append(nn.Dropout(dropout))
|
269 |
+
# project out
|
270 |
+
self.net.append(nn.Linear(inner_dim, dim_out))
|
271 |
+
|
272 |
+
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
273 |
+
for module in self.net:
|
274 |
+
hidden_states = module(hidden_states)
|
275 |
+
return hidden_states
|
276 |
+
|
277 |
+
|
278 |
+
# feedforward
|
279 |
+
class GEGLU(nn.Module):
|
280 |
+
r"""
|
281 |
+
A variant of the gated linear unit activation function from https://arxiv.org/abs/2002.05202.
|
282 |
+
|
283 |
+
Parameters:
|
284 |
+
dim_in (`int`): The number of channels in the input.
|
285 |
+
dim_out (`int`): The number of channels in the output.
|
286 |
+
"""
|
287 |
+
|
288 |
+
def __init__(self, dim_in: int, dim_out: int) -> None:
|
289 |
+
super().__init__()
|
290 |
+
self.proj = nn.Linear(dim_in, dim_out * 2)
|
291 |
+
|
292 |
+
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
293 |
+
hidden_states, gate = self.proj(hidden_states).chunk(2, dim = -1)
|
294 |
+
return hidden_states * F.gelu(gate)
|
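Note (not part of the diff): the factorized attention above hinges on two einops reshapes: spatial attention treats each frame's pixels as the token axis, temporal attention treats each pixel's frames as the token axis. A small self-contained illustration with assumed toy shapes:

import torch
from einops import rearrange

b, c, f, h, w = 2, 8, 4, 16, 16
x = torch.randn(b, c, f, h, w)

# spatial attention operates per frame over h*w tokens
spatial = rearrange(x, 'b c f h w -> (b f) (h w) c')
# temporal attention operates per pixel over f tokens
temporal = rearrange(spatial, '(b f) (h w) c -> (b h w) f c', f = f, h = h, w = w)
# and is rearranged back before the feed-forward block
back = rearrange(temporal, '(b h w) f c -> (b f) (h w) c', f = f, h = h, w = w)
assert back.shape == spatial.shape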
makeavid_sd/makeavid_sd/torch_impl/torch_cross_attention.py
ADDED
@@ -0,0 +1,171 @@
1 |
+
from typing import Optional
|
2 |
+
import torch
|
3 |
+
import torch.nn as nn
|
4 |
+
|
5 |
+
class CrossAttention(nn.Module):
|
6 |
+
r"""
|
7 |
+
A cross attention layer.
|
8 |
+
|
9 |
+
Parameters:
|
10 |
+
query_dim (`int`): The number of channels in the query.
|
11 |
+
cross_attention_dim (`int`, *optional*):
|
12 |
+
The number of channels in the context. If not given, defaults to `query_dim`.
|
13 |
+
heads (`int`, *optional*, defaults to 8): The number of heads to use for multi-head attention.
|
14 |
+
dim_head (`int`, *optional*, defaults to 64): The number of channels in each head.
|
15 |
+
dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
|
16 |
+
bias (`bool`, *optional*, defaults to False):
|
17 |
+
Set to `True` for the query, key, and value linear layers to contain a bias parameter.
|
18 |
+
"""
|
19 |
+
|
20 |
+
def __init__(self,
|
21 |
+
query_dim: int,
|
22 |
+
cross_attention_dim: Optional[int] = None,
|
23 |
+
heads: int = 8,
|
24 |
+
dim_head: int = 64,
|
25 |
+
dropout: float = 0.0,
|
26 |
+
bias: bool = False
|
27 |
+
):
|
28 |
+
super().__init__()
|
29 |
+
inner_dim = dim_head * heads
|
30 |
+
cross_attention_dim = cross_attention_dim if cross_attention_dim is not None else query_dim
|
31 |
+
|
32 |
+
self.scale = dim_head**-0.5
|
33 |
+
self.heads = heads
|
34 |
+
self.n_heads = heads
|
35 |
+
self.d_head = dim_head
|
36 |
+
|
37 |
+
self.to_q = nn.Linear(query_dim, inner_dim, bias = bias)
|
38 |
+
self.to_k = nn.Linear(cross_attention_dim, inner_dim, bias = bias)
|
39 |
+
self.to_v = nn.Linear(cross_attention_dim, inner_dim, bias = bias)
|
40 |
+
|
41 |
+
self.to_out = nn.ModuleList([])
|
42 |
+
self.to_out.append(nn.Linear(inner_dim, query_dim))
|
43 |
+
self.to_out.append(nn.Dropout(dropout))
|
44 |
+
try:
|
45 |
+
# You can install flash attention by cloning their Github repo,
|
46 |
+
# [https://github.com/HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)
|
47 |
+
# and then running `python setup.py install`
|
48 |
+
from flash_attn.flash_attention import FlashAttention
|
49 |
+
self.flash = FlashAttention()
|
50 |
+
# Set the scale for scaled dot-product attention.
|
51 |
+
self.flash.softmax_scale = self.scale
|
52 |
+
# Set to `None` if it's not installed
|
53 |
+
except ImportError:
|
54 |
+
self.flash = None
|
55 |
+
|
56 |
+
def reshape_heads_to_batch_dim(self, tensor):
|
57 |
+
batch_size, seq_len, dim = tensor.shape
|
58 |
+
head_size = self.heads
|
59 |
+
tensor = tensor.reshape(batch_size, seq_len, head_size, dim // head_size)
|
60 |
+
tensor = tensor.permute(0, 2, 1, 3).reshape(batch_size * head_size, seq_len, dim // head_size)
|
61 |
+
return tensor
|
62 |
+
|
63 |
+
def reshape_batch_dim_to_heads(self, tensor):
|
64 |
+
batch_size, seq_len, dim = tensor.shape
|
65 |
+
head_size = self.heads
|
66 |
+
tensor = tensor.reshape(batch_size // head_size, head_size, seq_len, dim)
|
67 |
+
tensor = tensor.permute(0, 2, 1, 3).reshape(batch_size // head_size, seq_len, dim * head_size)
|
68 |
+
return tensor
|
69 |
+
|
70 |
+
def forward(self,
|
71 |
+
hidden_states: torch.Tensor,
|
72 |
+
encoder_hidden_states: Optional[torch.Tensor] = None,
|
73 |
+
mask: Optional[torch.Tensor] = None
|
74 |
+
) -> torch.Tensor:
|
75 |
+
batch_size, sequence_length, _ = hidden_states.shape
|
76 |
+
is_self = encoder_hidden_states is None
|
77 |
+
# attention, what we cannot get enough of
|
78 |
+
query = self.to_q(hidden_states)
|
79 |
+
has_cond = encoder_hidden_states is not None
|
80 |
+
|
81 |
+
encoder_hidden_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
|
82 |
+
key = self.to_k(encoder_hidden_states)
|
83 |
+
value = self.to_v(encoder_hidden_states)
|
84 |
+
|
85 |
+
dim = query.shape[-1]
|
86 |
+
|
87 |
+
if self.flash is not None and not has_cond and self.d_head <= 64:
|
88 |
+
hidden_states = self.flash_attention(query, key, value)
|
89 |
+
else:
|
90 |
+
hidden_states = self.normal_attention(query, key, value, is_self)
|
91 |
+
|
92 |
+
# linear proj
|
93 |
+
hidden_states = self.to_out[0](hidden_states)
|
94 |
+
# dropout
|
95 |
+
hidden_states = self.to_out[1](hidden_states)
|
96 |
+
return hidden_states
|
97 |
+
|
98 |
+
def flash_attention(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
|
99 |
+
"""
|
100 |
+
#### Flash Attention
|
101 |
+
:param q: are the query vectors before splitting heads, of shape `[batch_size, seq, d_attn]`
|
102 |
+
:param k: are the query vectors before splitting heads, of shape `[batch_size, seq, d_attn]`
|
103 |
+
:param v: are the query vectors before splitting heads, of shape `[batch_size, seq, d_attn]`
|
104 |
+
"""
|
105 |
+
|
106 |
+
# Get batch size and number of elements along sequence axis (`width * height`)
|
107 |
+
batch_size, seq_len, _ = q.shape
|
108 |
+
|
109 |
+
# Stack `q`, `k`, `v` vectors for flash attention, to get a single tensor of
|
110 |
+
# shape `[batch_size, seq_len, 3, n_heads * d_head]`
|
111 |
+
qkv = torch.stack((q, k, v), dim = 2)
|
112 |
+
# Split the heads
|
113 |
+
qkv = qkv.view(batch_size, seq_len, 3, self.n_heads, self.d_head)
|
114 |
+
|
115 |
+
# Flash attention works for head sizes `32`, `64` and `128`, so we have to pad the heads to
|
116 |
+
# fit this size.
|
117 |
+
if self.d_head <= 32:
|
118 |
+
pad = 32 - self.d_head
|
119 |
+
elif self.d_head <= 64:
|
120 |
+
pad = 64 - self.d_head
|
121 |
+
elif self.d_head <= 128:
|
122 |
+
pad = 128 - self.d_head
|
123 |
+
else:
|
124 |
+
raise ValueError(f'Head size ${self.d_head} too large for Flash Attention')
|
125 |
+
|
126 |
+
# Pad the heads
|
127 |
+
if pad:
|
128 |
+
qkv = torch.cat((qkv, qkv.new_zeros(batch_size, seq_len, 3, self.n_heads, pad)), dim = -1)
|
129 |
+
|
130 |
+
# Compute attention
|
131 |
+
# $$\underset{seq}{softmax}\Bigg(\frac{Q K^\top}{\sqrt{d_{key}}}\Bigg)V$$
|
132 |
+
# This gives a tensor of shape `[batch_size, seq_len, n_heads, d_padded]`
|
133 |
+
out, _ = self.flash(qkv)
|
134 |
+
# Truncate the extra head size
|
135 |
+
out = out[:, :, :, :self.d_head]
|
136 |
+
# Reshape to `[batch_size, seq_len, n_heads * d_head]`
|
137 |
+
out = out.reshape(batch_size, seq_len, self.n_heads * self.d_head)
|
138 |
+
|
139 |
+
# Map to `[batch_size, height * width, d_model]` with a linear layer
|
140 |
+
return out
|
141 |
+
|
142 |
+
def normal_attention(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, is_self: bool):
|
143 |
+
"""
|
144 |
+
#### Normal Attention
|
145 |
+
|
146 |
+
:param q: are the query vectors before splitting heads, of shape `[batch_size, seq, d_attn]`
|
147 |
+
:param k: are the query vectors before splitting heads, of shape `[batch_size, seq, d_attn]`
|
148 |
+
:param v: are the query vectors before splitting heads, of shape `[batch_size, seq, d_attn]`
|
149 |
+
"""
|
150 |
+
# Split them to heads of shape `[batch_size, seq_len, n_heads, d_head]`
|
151 |
+
q = q.view(*q.shape[:2], self.n_heads, -1)
|
152 |
+
k = k.view(*k.shape[:2], self.n_heads, -1)
|
153 |
+
v = v.view(*v.shape[:2], self.n_heads, -1)
|
154 |
+
|
155 |
+
# Calculate attention $\frac{Q K^\top}{\sqrt{d_{key}}}$
|
156 |
+
attn = torch.einsum('bihd,bjhd->bhij', q, k) * self.scale
|
157 |
+
# Compute softmax
|
158 |
+
# $$\underset{seq}{softmax}\Bigg(\frac{Q K^\top}{\sqrt{d_{key}}}\Bigg)$$
|
159 |
+
half = attn.shape[0] // 2
|
160 |
+
attn[half:] = attn[half:].softmax(dim = -1)
|
161 |
+
attn[:half] = attn[:half].softmax(dim = -1)
|
162 |
+
|
163 |
+
# Compute attention output
|
164 |
+
# $$\underset{seq}{softmax}\Bigg(\frac{Q K^\top}{\sqrt{d_{key}}}\Bigg)V$$
|
165 |
+
out = torch.einsum('bhij,bjhd->bihd', attn, v)
|
166 |
+
|
167 |
+
# Reshape to `[batch_size, height * width, n_heads * d_head]`
|
168 |
+
out = out.reshape(*out.shape[:2], -1)
|
169 |
+
|
170 |
+
# Map to `[batch_size, height * width, d_model]` with a linear layer
|
171 |
+
return out
|
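Note (not part of the diff): normal_attention above is plain scaled dot-product attention written with einsum; the split softmax over the two batch halves only reduces peak memory. A single-head version with assumed toy shapes:

import torch

batch, seq, d_head = 2, 16, 64
q = torch.randn(batch, seq, 1, d_head)   # (batch, seq, heads, d_head) with one head
k = torch.randn(batch, seq, 1, d_head)
v = torch.randn(batch, seq, 1, d_head)

scale = d_head ** -0.5
attn = torch.einsum('bihd,bjhd->bhij', q, k) * scale   # (batch, heads, seq_q, seq_k)
attn = attn.softmax(dim = -1)
out = torch.einsum('bhij,bjhd->bihd', attn, v)         # (batch, seq, heads, d_head)
out = out.reshape(batch, seq, -1)                      # (batch, seq, heads * d_head)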
makeavid_sd/makeavid_sd/torch_impl/torch_embeddings.py
ADDED
@@ -0,0 +1,92 @@
import math
import torch
from torch import nn

def get_timestep_embedding(
        timesteps: torch.Tensor,
        embedding_dim: int,
        flip_sin_to_cos: bool = False,
        downscale_freq_shift: float = 1,
        scale: float = 1,
        max_period: int = 10000,
) -> torch.Tensor:
    """
    This matches the implementation in Denoising Diffusion Probabilistic Models: Create sinusoidal timestep embeddings.

    :param timesteps: a 1-D Tensor of N indices, one per batch element. These may be fractional.
    :param embedding_dim: the dimension of the output.
    :param max_period: controls the minimum frequency of the embeddings.
    :return: an [N x dim] Tensor of positional embeddings.
    """
    assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"

    half_dim = embedding_dim // 2
    exponent = -math.log(max_period) * torch.arange(
        start = 0,
        end = half_dim,
        dtype = torch.float32,
        device = timesteps.device
    )
    exponent = exponent / (half_dim - downscale_freq_shift)

    emb = torch.exp(exponent)
    emb = timesteps[:, None].float() * emb[None, :]

    # scale embeddings
    emb = scale * emb

    # concat sine and cosine embeddings
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim = -1)

    # flip sine and cosine embeddings
    if flip_sin_to_cos:
        emb = torch.cat([emb[:, half_dim:], emb[:, :half_dim]], dim = -1)

    # zero pad
    if embedding_dim % 2 == 1:
        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
    return emb


class TimestepEmbedding(nn.Module):
    def __init__(self, in_channels: int, time_embed_dim: int, act_fn: str = "silu", out_dim: int = None):
        super().__init__()

        self.linear_1 = nn.Linear(in_channels, time_embed_dim)
        self.act = None
        if act_fn == "silu":
            self.act = nn.SiLU()
        elif act_fn == "mish":
            self.act = nn.Mish()

        if out_dim is not None:
            time_embed_dim_out = out_dim
        else:
            time_embed_dim_out = time_embed_dim
        self.linear_2 = nn.Linear(time_embed_dim, time_embed_dim_out)

    def forward(self, sample):
        sample = self.linear_1(sample)

        if self.act is not None:
            sample = self.act(sample)

        sample = self.linear_2(sample)
        return sample


class Timesteps(nn.Module):
    def __init__(self, num_channels: int, flip_sin_to_cos: bool, downscale_freq_shift: float):
        super().__init__()
        self.num_channels = num_channels
        self.flip_sin_to_cos = flip_sin_to_cos
        self.downscale_freq_shift = downscale_freq_shift

    def forward(self, timesteps):
        t_emb = get_timestep_embedding(
            timesteps,
            self.num_channels,
            flip_sin_to_cos = self.flip_sin_to_cos,
            downscale_freq_shift = self.downscale_freq_shift,
        )
        return t_emb
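A rough usage sketch for the two modules above, with illustrative sizes only (320 sinusoidal channels projected to a 1280-dimensional time embedding, matching block_out_channels[0] * 4 in the UNet further below):

import torch

time_proj = Timesteps(num_channels = 320, flip_sin_to_cos = True, downscale_freq_shift = 0)
time_embedding = TimestepEmbedding(in_channels = 320, time_embed_dim = 1280)

timesteps = torch.tensor([10, 980])   # one diffusion timestep per batch element
t_emb = time_proj(timesteps)          # [2, 320] sinusoidal features
emb = time_embedding(t_emb)           # [2, 1280] learned time embedding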
makeavid_sd/makeavid_sd/torch_impl/torch_resnet_pseudo3d.py
ADDED
@@ -0,0 +1,295 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange

class Pseudo3DConv(nn.Module):
    def __init__(
            self,
            dim,
            dim_out,
            kernel_size,
            **kwargs
    ):
        super().__init__()

        self.spatial_conv = nn.Conv2d(dim, dim_out, kernel_size, **kwargs)
        # temporal kernel is fixed at 3; dirac weight + zero bias initialization makes it
        # an identity, so a fresh model behaves like the pretrained 2D convolution
        self.temporal_conv = nn.Conv1d(dim_out, dim_out, 3, padding = 1)

        nn.init.dirac_(self.temporal_conv.weight.data) # initialized to be identity
        nn.init.zeros_(self.temporal_conv.bias.data)

    def forward(
            self,
            x,
            convolve_across_time = True
    ):
        b, c, *_, h, w = x.shape

        is_video = x.ndim == 5
        convolve_across_time &= is_video

        if is_video:
            x = rearrange(x, 'b c f h w -> (b f) c h w')

        x = self.spatial_conv(x)

        if is_video:
            x = rearrange(x, '(b f) c h w -> b c f h w', b = b)
            b, c, *_, h, w = x.shape

        if not convolve_across_time:
            return x

        if is_video:
            x = rearrange(x, 'b c f h w -> (b h w) c f')
            x = self.temporal_conv(x)
            x = rearrange(x, '(b h w) c f -> b c f h w', h = h, w = w)
        return x


class Upsample2D(nn.Module):
    """
    An upsampling layer with an optional convolution.

    Parameters:
        channels: channels in the inputs and outputs.
        use_conv: a bool determining if a convolution is applied.
        use_conv_transpose:
        out_channels:
    """

    def __init__(self, channels, use_conv = False, use_conv_transpose = False, out_channels = None, name = "conv"):
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.use_conv_transpose = use_conv_transpose
        self.name = name

        conv = None
        if use_conv_transpose:
            conv = nn.ConvTranspose2d(channels, self.out_channels, 4, 2, 1)
        elif use_conv:
            conv = Pseudo3DConv(self.channels, self.out_channels, 3, padding = 1)

        # TODO(Suraj, Patrick) - clean up after weight dicts are correctly renamed
        if name == "conv":
            self.conv = conv
        else:
            self.Conv2d_0 = conv

    def forward(self, hidden_states, output_size = None):
        assert hidden_states.shape[1] == self.channels

        if self.use_conv_transpose:
            return self.conv(hidden_states)

        # Cast to float32 as the 'upsample_nearest2d_out_frame' op does not support bfloat16
        # TODO(Suraj): Remove this cast once the issue is fixed in PyTorch
        # https://github.com/pytorch/pytorch/issues/86679
        dtype = hidden_states.dtype
        if dtype == torch.bfloat16:
            hidden_states = hidden_states.to(torch.float32)

        # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
        if hidden_states.shape[0] >= 64:
            hidden_states = hidden_states.contiguous()

        b, c, *_, h, w = hidden_states.shape

        is_video = hidden_states.ndim == 5

        if is_video:
            hidden_states = rearrange(hidden_states, 'b c f h w -> (b f) c h w')

        # if `output_size` is passed we force the interpolation output
        # size and do not make use of `scale_factor=2`
        if output_size is None:
            hidden_states = F.interpolate(hidden_states, scale_factor = 2.0, mode = "nearest")
        else:
            hidden_states = F.interpolate(hidden_states, size = output_size, mode = "nearest")

        if is_video:
            hidden_states = rearrange(hidden_states, '(b f) c h w -> b c f h w', b = b)

        # If the input is bfloat16, we cast back to bfloat16
        if dtype == torch.bfloat16:
            hidden_states = hidden_states.to(dtype)

        # TODO(Suraj, Patrick) - clean up after weight dicts are correctly renamed
        if self.use_conv:
            if self.name == "conv":
                hidden_states = self.conv(hidden_states)
            else:
                hidden_states = self.Conv2d_0(hidden_states)

        return hidden_states


class Downsample2D(nn.Module):
    """
    A downsampling layer with an optional convolution.

    Parameters:
        channels: channels in the inputs and outputs.
        use_conv: a bool determining if a convolution is applied.
        out_channels:
        padding:
    """

    def __init__(self, channels, use_conv = False, out_channels = None, padding = 1, name = "conv"):
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.padding = padding
        stride = 2
        self.name = name

        if use_conv:
            conv = Pseudo3DConv(self.channels, self.out_channels, 3, stride = stride, padding = padding)
        else:
            assert self.channels == self.out_channels
            conv = nn.AvgPool2d(kernel_size = stride, stride = stride)

        # TODO(Suraj, Patrick) - clean up after weight dicts are correctly renamed
        if name == "conv":
            self.Conv2d_0 = conv
            self.conv = conv
        elif name == "Conv2d_0":
            self.conv = conv
        else:
            self.conv = conv

    def forward(self, hidden_states):
        assert hidden_states.shape[1] == self.channels
        if self.use_conv and self.padding == 0:
            pad = (0, 1, 0, 1)
            hidden_states = F.pad(hidden_states, pad, mode = "constant", value = 0)

        assert hidden_states.shape[1] == self.channels
        if self.use_conv:
            hidden_states = self.conv(hidden_states)
        else:
            b, c, *_, h, w = hidden_states.shape
            is_video = hidden_states.ndim == 5
            if is_video:
                hidden_states = rearrange(hidden_states, 'b c f h w -> (b f) c h w')
            hidden_states = self.conv(hidden_states)
            if is_video:
                hidden_states = rearrange(hidden_states, '(b f) c h w -> b c f h w', b = b)

        return hidden_states


class ResnetBlockPseudo3D(nn.Module):
    def __init__(
        self,
        *,
        in_channels,
        out_channels = None,
        conv_shortcut = False,
        dropout = 0.0,
        temb_channels = 512,
        groups = 32,
        groups_out = None,
        pre_norm = True,
        eps = 1e-6,
        time_embedding_norm = "default",
        kernel = None,
        output_scale_factor = 1.0,
        use_in_shortcut = None,
        up = False,
        down = False,
    ):
        super().__init__()
        self.pre_norm = pre_norm
        self.pre_norm = True
        self.in_channels = in_channels
        out_channels = in_channels if out_channels is None else out_channels
        self.out_channels = out_channels
        self.use_conv_shortcut = conv_shortcut
        self.time_embedding_norm = time_embedding_norm
        self.up = up
        self.down = down
        self.output_scale_factor = output_scale_factor

        if groups_out is None:
            groups_out = groups

        self.norm1 = torch.nn.GroupNorm(num_groups = groups, num_channels = in_channels, eps = eps, affine = True)

        self.conv1 = Pseudo3DConv(in_channels, out_channels, kernel_size = 3, stride = 1, padding = 1)

        if temb_channels is not None:
            self.time_emb_proj = torch.nn.Linear(temb_channels, out_channels)
        else:
            self.time_emb_proj = None

        self.norm2 = torch.nn.GroupNorm(num_groups = groups_out, num_channels = out_channels, eps = eps, affine = True)
        self.dropout = torch.nn.Dropout(dropout)
        self.conv2 = Pseudo3DConv(out_channels, out_channels, kernel_size = 3, stride = 1, padding = 1)

        self.nonlinearity = nn.SiLU()

        self.upsample = self.downsample = None
        if self.up:
            self.upsample = Upsample2D(in_channels, use_conv = False)
        elif self.down:
            self.downsample = Downsample2D(in_channels, use_conv = False, padding = 1, name = "op")

        self.use_in_shortcut = self.in_channels != self.out_channels if use_in_shortcut is None else use_in_shortcut

        self.conv_shortcut = None
        if self.use_in_shortcut:
            self.conv_shortcut = Pseudo3DConv(in_channels, out_channels, kernel_size = 1, stride = 1, padding = 0)

    def forward(self, input_tensor, temb):
        hidden_states = input_tensor

        hidden_states = self.norm1(hidden_states)
        hidden_states = self.nonlinearity(hidden_states)

        if self.upsample is not None:
            # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
            if hidden_states.shape[0] >= 64:
                input_tensor = input_tensor.contiguous()
                hidden_states = hidden_states.contiguous()
            input_tensor = self.upsample(input_tensor)
            hidden_states = self.upsample(hidden_states)
        elif self.downsample is not None:
            input_tensor = self.downsample(input_tensor)
            hidden_states = self.downsample(hidden_states)

        hidden_states = self.conv1(hidden_states)

        if temb is not None:
            b, c, *_, h, w = hidden_states.shape
            is_video = hidden_states.ndim == 5
            if is_video:
                b, c, f, h, w = hidden_states.shape
                hidden_states = rearrange(hidden_states, 'b c f h w -> (b f) c h w')
                temb = self.time_emb_proj(self.nonlinearity(temb))[:, :, None, None]
                hidden_states = hidden_states + temb.repeat_interleave(f, 0)
                hidden_states = rearrange(hidden_states, '(b f) c h w -> b c f h w', b = b)
            else:
                temb = self.time_emb_proj(self.nonlinearity(temb))[:, :, None, None]
                hidden_states = hidden_states + temb

        hidden_states = self.norm2(hidden_states)
        hidden_states = self.nonlinearity(hidden_states)

        hidden_states = self.dropout(hidden_states)
        hidden_states = self.conv2(hidden_states)

        if self.conv_shortcut is not None:
            input_tensor = self.conv_shortcut(input_tensor)

        output_tensor = (input_tensor + hidden_states) / self.output_scale_factor

        return output_tensor
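Because the temporal convolution is dirac-initialized with zero bias, a freshly constructed Pseudo3DConv behaves exactly like its spatial Conv2d applied frame by frame. A small sketch with made-up sizes (not part of the file):

import torch

conv = Pseudo3DConv(4, 8, 3, padding = 1)
video = torch.randn(2, 4, 10, 32, 32)   # b c f h w
image = torch.randn(2, 4, 32, 32)       # b c h w

out_video = conv(video)                 # [2, 8, 10, 32, 32], temporal conv applied over f
out_image = conv(image)                 # [2, 8, 32, 32], temporal conv skipped for 4D input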
makeavid_sd/makeavid_sd/torch_impl/torch_unet_pseudo3d_blocks.py
ADDED
@@ -0,0 +1,493 @@
from typing import Union, Optional
import torch
from torch import nn

from torch_attention_pseudo3d import TransformerPseudo3DModel
from torch_resnet_pseudo3d import Downsample2D, ResnetBlockPseudo3D, Upsample2D


class UNetMidBlock2DCrossAttn(nn.Module):
    def __init__(self,
            in_channels: int,
            temb_channels: int,
            dropout: float = 0.0,
            num_layers: int = 1,
            resnet_eps: float = 1e-6,
            resnet_time_scale_shift: str = "default",
            resnet_act_fn: str = "swish",
            resnet_groups: Optional[int] = 32,
            resnet_pre_norm: bool = True,
            attn_num_head_channels: int = 1,
            attention_type: str = "default",
            output_scale_factor: float = 1.0,
            cross_attention_dim: int = 1280,
            **kwargs
    ) -> None:
        super().__init__()

        self.attention_type = attention_type
        self.attn_num_head_channels = attn_num_head_channels
        resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)

        # there is always at least one resnet
        resnets = [
            ResnetBlockPseudo3D(
                in_channels = in_channels,
                out_channels = in_channels,
                temb_channels = temb_channels,
                eps = resnet_eps,
                groups = resnet_groups,
                dropout = dropout,
                time_embedding_norm = resnet_time_scale_shift,
                #non_linearity = resnet_act_fn,
                output_scale_factor = output_scale_factor,
                pre_norm = resnet_pre_norm
            )
        ]
        attentions = []

        for _ in range(num_layers):
            attentions.append(
                TransformerPseudo3DModel(
                    in_channels = in_channels,
                    num_attention_heads = attn_num_head_channels,
                    attention_head_dim = in_channels // attn_num_head_channels,
                    num_layers = 1,
                    cross_attention_dim = cross_attention_dim,
                    norm_num_groups = resnet_groups
                )
            )
            resnets.append(
                ResnetBlockPseudo3D(
                    in_channels = in_channels,
                    out_channels = in_channels,
                    temb_channels = temb_channels,
                    eps = resnet_eps,
                    groups = resnet_groups,
                    dropout = dropout,
                    time_embedding_norm = resnet_time_scale_shift,
                    #non_linearity = resnet_act_fn,
                    output_scale_factor = output_scale_factor,
                    pre_norm = resnet_pre_norm
                )
            )

        self.attentions = nn.ModuleList(attentions)
        self.resnets = nn.ModuleList(resnets)

    def forward(self, hidden_states, temb = None, encoder_hidden_states = None):
        hidden_states = self.resnets[0](hidden_states, temb)
        for attn, resnet in zip(self.attentions, self.resnets[1:]):
            hidden_states = attn(hidden_states, encoder_hidden_states).sample
            hidden_states = resnet(hidden_states, temb)

        return hidden_states


class CrossAttnDownBlock2D(nn.Module):
    def __init__(self,
            in_channels: int,
            out_channels: int,
            temb_channels: int,
            dropout: float = 0.0,
            num_layers: int = 1,
            resnet_eps: float = 1e-6,
            resnet_time_scale_shift: str = "default",
            resnet_act_fn: str = "swish",
            resnet_groups: int = 32,
            resnet_pre_norm: bool = True,
            attn_num_head_channels: int = 1,
            cross_attention_dim: int = 1280,
            attention_type: str = "default",
            output_scale_factor: float = 1.0,
            downsample_padding: int = 1,
            add_downsample: bool = True
    ):
        super().__init__()
        resnets = []
        attentions = []

        self.attention_type = attention_type
        self.attn_num_head_channels = attn_num_head_channels

        for i in range(num_layers):
            in_channels = in_channels if i == 0 else out_channels
            resnets.append(
                ResnetBlockPseudo3D(
                    in_channels = in_channels,
                    out_channels = out_channels,
                    temb_channels = temb_channels,
                    eps = resnet_eps,
                    groups = resnet_groups,
                    dropout = dropout,
                    time_embedding_norm = resnet_time_scale_shift,
                    #non_linearity = resnet_act_fn,
                    output_scale_factor = output_scale_factor,
                    pre_norm = resnet_pre_norm
                )
            )
            attentions.append(
                TransformerPseudo3DModel(
                    in_channels = out_channels,
                    num_attention_heads = attn_num_head_channels,
                    attention_head_dim = out_channels // attn_num_head_channels,
                    num_layers = 1,
                    cross_attention_dim = cross_attention_dim,
                    norm_num_groups = resnet_groups
                )
            )
        self.attentions = nn.ModuleList(attentions)
        self.resnets = nn.ModuleList(resnets)

        if add_downsample:
            self.downsamplers = nn.ModuleList(
                [
                    Downsample2D(
                        out_channels,
                        use_conv = True,
                        out_channels = out_channels,
                        padding = downsample_padding,
                        name = "op"
                    )
                ]
            )
        else:
            self.downsamplers = None

    def forward(self, hidden_states, temb = None, encoder_hidden_states = None):
        output_states = ()

        for resnet, attn in zip(self.resnets, self.attentions):
            hidden_states = resnet(hidden_states, temb)
            hidden_states = attn(hidden_states, encoder_hidden_states = encoder_hidden_states).sample

            output_states += (hidden_states,)

        if self.downsamplers is not None:
            for downsampler in self.downsamplers:
                hidden_states = downsampler(hidden_states)

            output_states += (hidden_states,)

        return hidden_states, output_states


class DownBlock2D(nn.Module):
    def __init__(self,
            in_channels: int,
            out_channels: int,
            temb_channels: int,
            dropout: float = 0.0,
            num_layers: int = 1,
            resnet_eps: float = 1e-6,
            resnet_time_scale_shift: str = "default",
            resnet_act_fn: str = "swish",
            resnet_groups: int = 32,
            resnet_pre_norm: bool = True,
            output_scale_factor: float = 1.0,
            add_downsample: bool = True,
            downsample_padding: int = 1
    ) -> None:
        super().__init__()
        resnets = []

        for i in range(num_layers):
            in_channels = in_channels if i == 0 else out_channels
            resnets.append(
                ResnetBlockPseudo3D(
                    in_channels = in_channels,
                    out_channels = out_channels,
                    temb_channels = temb_channels,
                    eps = resnet_eps,
                    groups = resnet_groups,
                    dropout = dropout,
                    time_embedding_norm = resnet_time_scale_shift,
                    #non_linearity = resnet_act_fn,
                    output_scale_factor = output_scale_factor,
                    pre_norm = resnet_pre_norm
                )
            )

        self.resnets = nn.ModuleList(resnets)

        if add_downsample:
            self.downsamplers = nn.ModuleList(
                [
                    Downsample2D(
                        out_channels,
                        use_conv = True,
                        out_channels = out_channels,
                        padding = downsample_padding,
                        name = "op"
                    )
                ]
            )
        else:
            self.downsamplers = None

    def forward(self, hidden_states, temb = None):
        output_states = ()

        for resnet in self.resnets:
            hidden_states = resnet(hidden_states, temb)

            output_states += (hidden_states,)

        if self.downsamplers is not None:
            for downsampler in self.downsamplers:
                hidden_states = downsampler(hidden_states)

            output_states += (hidden_states,)

        return hidden_states, output_states


class CrossAttnUpBlock2D(nn.Module):
    def __init__(self,
            in_channels: int,
            out_channels: int,
            prev_output_channel: int,
            temb_channels: int,
            dropout: float = 0.0,
            num_layers: int = 1,
            resnet_eps: float = 1e-6,
            resnet_time_scale_shift: str = "default",
            resnet_act_fn: str = "swish",
            resnet_groups: int = 32,
            resnet_pre_norm: bool = True,
            attn_num_head_channels: int = 1,
            cross_attention_dim: int = 1280,
            attention_type: str = "default",
            output_scale_factor: float = 1.0,
            add_upsample: bool = True
    ) -> None:
        super().__init__()
        resnets = []
        attentions = []

        self.attention_type = attention_type
        self.attn_num_head_channels = attn_num_head_channels

        for i in range(num_layers):
            res_skip_channels = in_channels if (i == num_layers - 1) else out_channels
            resnet_in_channels = prev_output_channel if i == 0 else out_channels

            resnets.append(
                ResnetBlockPseudo3D(
                    in_channels = resnet_in_channels + res_skip_channels,
                    out_channels = out_channels,
                    temb_channels = temb_channels,
                    eps = resnet_eps,
                    groups = resnet_groups,
                    dropout = dropout,
                    time_embedding_norm = resnet_time_scale_shift,
                    #non_linearity = resnet_act_fn,
                    output_scale_factor = output_scale_factor,
                    pre_norm = resnet_pre_norm
                )
            )
            attentions.append(
                TransformerPseudo3DModel(
                    in_channels = out_channels,
                    num_attention_heads = attn_num_head_channels,
                    attention_head_dim = out_channels // attn_num_head_channels,
                    num_layers = 1,
                    cross_attention_dim = cross_attention_dim,
                    norm_num_groups = resnet_groups
                )
            )
        self.attentions = nn.ModuleList(attentions)
        self.resnets = nn.ModuleList(resnets)

        if add_upsample:
            self.upsamplers = nn.ModuleList([
                Upsample2D(
                    out_channels,
                    use_conv = True,
                    out_channels = out_channels
                )
            ])
        else:
            self.upsamplers = None

    def forward(self,
            hidden_states,
            res_hidden_states_tuple,
            temb = None,
            encoder_hidden_states = None,
            upsample_size = None
    ):
        for resnet, attn in zip(self.resnets, self.attentions):
            # pop res hidden states
            res_hidden_states = res_hidden_states_tuple[-1]
            res_hidden_states_tuple = res_hidden_states_tuple[:-1]
            hidden_states = torch.cat([hidden_states, res_hidden_states], dim = 1)
            hidden_states = resnet(hidden_states, temb)
            hidden_states = attn(hidden_states, encoder_hidden_states = encoder_hidden_states).sample

        if self.upsamplers is not None:
            for upsampler in self.upsamplers:
                hidden_states = upsampler(hidden_states, upsample_size)

        return hidden_states


class UpBlock2D(nn.Module):
    def __init__(self,
            in_channels: int,
            prev_output_channel: int,
            out_channels: int,
            temb_channels: int,
            dropout: float = 0.0,
            num_layers: int = 1,
            resnet_eps: float = 1e-6,
            resnet_time_scale_shift: str = "default",
            resnet_act_fn: str = "swish",
            resnet_groups: int = 32,
            resnet_pre_norm: bool = True,
            output_scale_factor: float = 1.0,
            add_upsample: bool = True
    ) -> None:
        super().__init__()
        resnets = []

        for i in range(num_layers):
            res_skip_channels = in_channels if (i == num_layers - 1) else out_channels
            resnet_in_channels = prev_output_channel if i == 0 else out_channels

            resnets.append(
                ResnetBlockPseudo3D(
                    in_channels = resnet_in_channels + res_skip_channels,
                    out_channels = out_channels,
                    temb_channels = temb_channels,
                    eps = resnet_eps,
                    groups = resnet_groups,
                    dropout = dropout,
                    time_embedding_norm = resnet_time_scale_shift,
                    #non_linearity = resnet_act_fn,
                    output_scale_factor = output_scale_factor,
                    pre_norm = resnet_pre_norm
                )
            )

        self.resnets = nn.ModuleList(resnets)

        if add_upsample:
            self.upsamplers = nn.ModuleList([
                Upsample2D(
                    out_channels,
                    use_conv = True,
                    out_channels = out_channels
                )
            ])
        else:
            self.upsamplers = None

    def forward(self, hidden_states, res_hidden_states_tuple, temb = None, upsample_size = None):
        for resnet in self.resnets:
            # pop res hidden states
            res_hidden_states = res_hidden_states_tuple[-1]
            res_hidden_states_tuple = res_hidden_states_tuple[:-1]
            hidden_states = torch.cat([hidden_states, res_hidden_states], dim = 1)
            hidden_states = resnet(hidden_states, temb)

        if self.upsamplers is not None:
            for upsampler in self.upsamplers:
                hidden_states = upsampler(hidden_states, upsample_size)

        return hidden_states


def get_down_block(
        down_block_type: str,
        num_layers: int,
        in_channels: int,
        out_channels: int,
        temb_channels: int,
        add_downsample: bool,
        resnet_eps: float,
        resnet_act_fn: str,
        attn_num_head_channels: int,
        resnet_groups: Optional[int] = None,
        cross_attention_dim: Optional[int] = None,
        downsample_padding: Optional[int] = None,
) -> Union[DownBlock2D, CrossAttnDownBlock2D]:
    down_block_type = down_block_type[7:] if down_block_type.startswith("UNetRes") else down_block_type
    if down_block_type == "DownBlock2D":
        return DownBlock2D(
            num_layers = num_layers,
            in_channels = in_channels,
            out_channels = out_channels,
            temb_channels = temb_channels,
            add_downsample = add_downsample,
            resnet_eps = resnet_eps,
            resnet_act_fn = resnet_act_fn,
            resnet_groups = resnet_groups,
            downsample_padding = downsample_padding
        )
    elif down_block_type == "CrossAttnDownBlock2D":
        if cross_attention_dim is None:
            raise ValueError("cross_attention_dim must be specified for CrossAttnDownBlock2D")
        return CrossAttnDownBlock2D(
            num_layers = num_layers,
            in_channels = in_channels,
            out_channels = out_channels,
            temb_channels = temb_channels,
            add_downsample = add_downsample,
            resnet_eps = resnet_eps,
            resnet_act_fn = resnet_act_fn,
            resnet_groups = resnet_groups,
            downsample_padding = downsample_padding,
            cross_attention_dim = cross_attention_dim,
            attn_num_head_channels = attn_num_head_channels
        )
    raise ValueError(f"{down_block_type} does not exist.")


def get_up_block(
        up_block_type: str,
        num_layers,
        in_channels,
        out_channels,
        prev_output_channel,
        temb_channels,
        add_upsample,
        resnet_eps,
        resnet_act_fn,
        attn_num_head_channels,
        resnet_groups = None,
        cross_attention_dim = None,
) -> Union[UpBlock2D, CrossAttnUpBlock2D]:
    up_block_type = up_block_type[7:] if up_block_type.startswith("UNetRes") else up_block_type
    if up_block_type == "UpBlock2D":
        return UpBlock2D(
            num_layers = num_layers,
            in_channels = in_channels,
            out_channels = out_channels,
            prev_output_channel = prev_output_channel,
            temb_channels = temb_channels,
            add_upsample = add_upsample,
            resnet_eps = resnet_eps,
            resnet_act_fn = resnet_act_fn,
            resnet_groups = resnet_groups
        )
    elif up_block_type == "CrossAttnUpBlock2D":
        if cross_attention_dim is None:
            raise ValueError("cross_attention_dim must be specified for CrossAttnUpBlock2D")
        return CrossAttnUpBlock2D(
            num_layers = num_layers,
            in_channels = in_channels,
            out_channels = out_channels,
            prev_output_channel = prev_output_channel,
            temb_channels = temb_channels,
            add_upsample = add_upsample,
            resnet_eps = resnet_eps,
            resnet_act_fn = resnet_act_fn,
            resnet_groups = resnet_groups,
            cross_attention_dim = cross_attention_dim,
            attn_num_head_channels = attn_num_head_channels
        )
    raise ValueError(f"{up_block_type} does not exist.")
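As a smoke test, the plain DownBlock2D can be exercised without the attention module; the sizes in the sketch below are arbitrary and only chosen so the channel counts divide the group-norm groups.

import torch

block = DownBlock2D(
    in_channels = 32,
    out_channels = 64,
    temb_channels = 128,
    num_layers = 2,
    resnet_groups = 8,
)
x = torch.randn(1, 32, 4, 16, 16)   # b c f h w
temb = torch.randn(1, 128)
sample, skips = block(x, temb)      # sample: [1, 64, 4, 8, 8]; skips: one tensor per resnet plus one after downsampling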
makeavid_sd/makeavid_sd/torch_impl/torch_unet_pseudo3d_condition.py
ADDED
@@ -0,0 +1,235 @@
from typing import Optional, Tuple, Union

import torch
from torch import nn

from torch_embeddings import TimestepEmbedding, Timesteps
from torch_unet_pseudo3d_blocks import (
    UNetMidBlock2DCrossAttn,
    get_down_block,
    get_up_block,
)

from torch_resnet_pseudo3d import Pseudo3DConv

class UNetPseudo3DConditionOutput:
    sample: torch.FloatTensor
    def __init__(self, sample: torch.FloatTensor) -> None:
        self.sample = sample


class UNetPseudo3DConditionModel(nn.Module):
    def __init__(self,
            sample_size: Optional[int] = None,
            in_channels: int = 9,
            out_channels: int = 4,
            flip_sin_to_cos: bool = True,
            freq_shift: int = 0,
            down_block_types: Tuple[str] = (
                "CrossAttnDownBlock2D",
                "CrossAttnDownBlock2D",
                "CrossAttnDownBlock2D",
                "DownBlock2D",
            ),
            up_block_types: Tuple[str] = (
                "UpBlock2D",
                "CrossAttnUpBlock2D",
                "CrossAttnUpBlock2D",
                "CrossAttnUpBlock2D"
            ),
            block_out_channels: Tuple[int] = (320, 640, 1280, 1280),
            layers_per_block: int = 2,
            downsample_padding: int = 1,
            mid_block_scale_factor: float = 1,
            act_fn: str = "silu",
            norm_num_groups: int = 32,
            norm_eps: float = 1e-5,
            cross_attention_dim: int = 768,
            attention_head_dim: int = 8,
            **kwargs
    ) -> None:
        super().__init__()
        self.dtype = torch.float32
        self.sample_size = sample_size
        time_embed_dim = block_out_channels[0] * 4

        # input
        self.conv_in = Pseudo3DConv(in_channels, block_out_channels[0], kernel_size = 3, padding = (1, 1))

        # time
        self.time_proj = Timesteps(block_out_channels[0], flip_sin_to_cos, freq_shift)
        timestep_input_dim = block_out_channels[0]

        self.time_embedding = TimestepEmbedding(timestep_input_dim, time_embed_dim)

        self.down_blocks = nn.ModuleList([])
        self.mid_block = None
        self.up_blocks = nn.ModuleList([])

        # down
        output_channel = block_out_channels[0]
        for i, down_block_type in enumerate(down_block_types):
            input_channel = output_channel
            output_channel = block_out_channels[i]
            is_final_block = i == len(block_out_channels) - 1

            down_block = get_down_block(
                down_block_type,
                num_layers = layers_per_block,
                in_channels = input_channel,
                out_channels = output_channel,
                temb_channels = time_embed_dim,
                add_downsample = not is_final_block,
                resnet_eps = norm_eps,
                resnet_act_fn = act_fn,
                resnet_groups = norm_num_groups,
                cross_attention_dim = cross_attention_dim,
                attn_num_head_channels = attention_head_dim,
                downsample_padding = downsample_padding
            )
            self.down_blocks.append(down_block)

        # mid
        self.mid_block = UNetMidBlock2DCrossAttn(
            in_channels = block_out_channels[-1],
            temb_channels = time_embed_dim,
            resnet_eps = norm_eps,
            resnet_act_fn = act_fn,
            output_scale_factor = mid_block_scale_factor,
            resnet_time_scale_shift = "default",
            cross_attention_dim = cross_attention_dim,
            attn_num_head_channels = attention_head_dim,
            resnet_groups = norm_num_groups
        )

        # count how many layers upsample the images
        self.num_upsamplers = 0

        # up
        reversed_block_out_channels = list(reversed(block_out_channels))
        output_channel = reversed_block_out_channels[0]
        for i, up_block_type in enumerate(up_block_types):
            is_final_block = i == len(block_out_channels) - 1

            prev_output_channel = output_channel
            output_channel = reversed_block_out_channels[i]
            input_channel = reversed_block_out_channels[min(i + 1, len(block_out_channels) - 1)]

            # add upsample block for all BUT final layer
            if not is_final_block:
                add_upsample = True
                self.num_upsamplers += 1
            else:
                add_upsample = False

            up_block = get_up_block(
                up_block_type,
                num_layers = layers_per_block + 1,
                in_channels = input_channel,
                out_channels = output_channel,
                prev_output_channel = prev_output_channel,
                temb_channels = time_embed_dim,
                add_upsample = add_upsample,
                resnet_eps = norm_eps,
                resnet_act_fn = act_fn,
                resnet_groups = norm_num_groups,
                cross_attention_dim = cross_attention_dim,
                attn_num_head_channels = attention_head_dim
            )
            self.up_blocks.append(up_block)
            prev_output_channel = output_channel

        # out
        self.conv_norm_out = nn.GroupNorm(
            num_channels = block_out_channels[0],
            num_groups = norm_num_groups,
            eps = norm_eps
        )
        self.conv_act = nn.SiLU()
        self.conv_out = Pseudo3DConv(block_out_channels[0], out_channels, 3, padding = 1)


    def forward(
            self,
            sample: torch.FloatTensor,
            timesteps: Union[torch.Tensor, float, int],
            encoder_hidden_states: torch.Tensor
    ) -> Union[UNetPseudo3DConditionOutput, Tuple]:
        # By default samples have to be at least a multiple of the overall upsampling factor.
        # The overall upsampling factor is equal to 2 ** (# num of upsampling layers).
        # However, the upsampling interpolation output size can be forced to fit any upsampling size
        # on the fly if necessary.
        default_overall_up_factor = 2**self.num_upsamplers

        # upsample size should be forwarded when sample is not a multiple of `default_overall_up_factor`
        forward_upsample_size = False
        upsample_size = None

        if any(s % default_overall_up_factor != 0 for s in sample.shape[-2:]):
            forward_upsample_size = True

        # 1. time
        # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
        timesteps = timesteps.expand(sample.shape[0])

        t_emb = self.time_proj(timesteps)

        # timesteps does not contain any weights and will always return f32 tensors
        # but time_embedding might actually be running in fp16. so we need to cast here.
        # there might be better ways to encapsulate this.
        t_emb = t_emb.to(dtype = self.dtype)
        emb = self.time_embedding(t_emb)

        # 2. pre-process
        sample = self.conv_in(sample)

        # 3. down
        down_block_res_samples = (sample,)
        for downsample_block in self.down_blocks:
            if hasattr(downsample_block, "attentions") and downsample_block.attentions is not None:
                sample, res_samples = downsample_block(
                    hidden_states = sample,
                    temb = emb,
                    encoder_hidden_states = encoder_hidden_states,
                )
            else:
                sample, res_samples = downsample_block(hidden_states = sample, temb = emb)

            down_block_res_samples += res_samples

        # 4. mid
        sample = self.mid_block(sample, emb, encoder_hidden_states = encoder_hidden_states)

        # 5. up
        for i, upsample_block in enumerate(self.up_blocks):
            is_final_block = i == len(self.up_blocks) - 1

            res_samples = down_block_res_samples[-len(upsample_block.resnets) :]
            down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]

            # if we have not reached the final block and need to forward the
            # upsample size, we do it here
            if not is_final_block and forward_upsample_size:
                upsample_size = down_block_res_samples[-1].shape[2:]

            if hasattr(upsample_block, "attentions") and upsample_block.attentions is not None:
                sample = upsample_block(
                    hidden_states = sample,
                    temb = emb,
                    res_hidden_states_tuple = res_samples,
                    encoder_hidden_states = encoder_hidden_states,
                    upsample_size = upsample_size,
                )
            else:
                sample = upsample_block(
                    hidden_states = sample,
                    temb = emb,
                    res_hidden_states_tuple = res_samples,
                    upsample_size = upsample_size
                )
        # 6. post-process
        sample = self.conv_norm_out(sample)
        sample = self.conv_act(sample)
        sample = self.conv_out(sample)

        return UNetPseudo3DConditionOutput(sample = sample)
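A hedged sketch of the expected call shapes, assuming the accompanying torch_attention_pseudo3d module is importable so the cross-attention blocks can be built; with the default configuration the model takes 9 input channels and returns 4 at the same latent resolution:

import torch

unet = UNetPseudo3DConditionModel()                            # default, Stable-Diffusion-sized config
sample = torch.randn(1, 9, 10, 64, 64)                         # b c f h w latent-space input
timesteps = torch.tensor([999])                                # broadcast to the batch inside forward
encoder_hidden_states = torch.randn(1, 77, 768)                # text encoder hidden states
out = unet(sample, timesteps, encoder_hidden_states).sample    # [1, 4, 10, 64, 64]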
makeavid_sd/requirements.txt
ADDED
@@ -0,0 +1,2 @@
torch
torch_xla
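Note that torch_xla wheels are built against specific torch releases, so if these two requirements are ever pinned, the versions need to be chosen as a matching pair.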
makeavid_sd/setup.py
ADDED
@@ -0,0 +1,11 @@
from setuptools import setup
setup(
    name = 'makeavid_sd',
    version = '0.1.0',
    description = 'makeavid sd',
    author = 'Lopho',
    author_email = 'contact@lopho.org',
    platforms = ['any'],
    license = 'GNU Affero General Public License v3',
    url = 'http://github.com/lopho/makeavid-sd-tpu'
)
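As written, setup() declares no packages, so depending on the setuptools version automatic discovery may or may not pick up the nested makeavid_sd package; listing it explicitly avoids the ambiguity. A possible sketch of the same file with package discovery added:

from setuptools import setup, find_packages
setup(
    name = 'makeavid_sd',
    version = '0.1.0',
    description = 'makeavid sd',
    author = 'Lopho',
    author_email = 'contact@lopho.org',
    platforms = ['any'],
    license = 'GNU Affero General Public License v3',
    url = 'http://github.com/lopho/makeavid-sd-tpu',
    packages = find_packages()
)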
makeavid_sd/trainer_xla.py
ADDED
@@ -0,0 +1,104 @@
import os
os.environ['PJRT_DEVICE'] = 'TPU'

from tqdm.auto import tqdm
import torch
from torch.utils.data import DataLoader
from torch_xla.core import xla_model
from diffusers import UNetPseudo3DConditionModel
from dataset import load_dataset


class TempoTrainerXLA:
    def __init__(self,
            pretrained: str = 'lxj616/make-a-stable-diffusion-video-timelapse',
            lr: float = 1e-4,
            dtype: torch.dtype = torch.float32,
    ) -> None:
        self.dtype = dtype
        self.device: torch.device = xla_model.xla_device(0)
        unet: UNetPseudo3DConditionModel = UNetPseudo3DConditionModel.from_pretrained(
            pretrained,
            subfolder = 'unet'
        ).to(dtype = dtype, memory_format = torch.contiguous_format)
        unfreeze_all: bool = False
        unet = unet.train()
        if not unfreeze_all:
            # freeze everything, then re-enable gradients only for the temporal layers
            unet.requires_grad_(False)
            for name, param in unet.named_parameters():
                if 'temporal_conv' in name:
                    param.requires_grad_(True)
            for block in [*unet.down_blocks, unet.mid_block, *unet.up_blocks]:
                if hasattr(block, 'attentions') and block.attentions is not None:
                    for attn_block in block.attentions:
                        for transformer_block in attn_block.transformer_blocks:
                            transformer_block.requires_grad_(False)
                            transformer_block.attn_temporal.requires_grad_(True)
                            transformer_block.norm_temporal.requires_grad_(True)
        else:
            unet.requires_grad_(True)
        self.model: UNetPseudo3DConditionModel = unet.to(device = self.device)
        #self.model = torch.compile(self.model, backend = 'aot_torchxla_trace_once')
        self.params = lambda: filter(lambda p: p.requires_grad, self.model.parameters())
        self.optim: torch.optim.Optimizer = torch.optim.AdamW(self.params(), lr = lr)
        def lr_warmup(warmup_steps: int = 0):
            def lambda_lr(step: int) -> float:
                if step < warmup_steps:
                    return step / warmup_steps
                else:
                    return 1.0
            return lambda_lr
        self.scheduler = torch.optim.lr_scheduler.LambdaLR(self.optim, lr_lambda = lr_warmup(warmup_steps = 60), last_epoch = -1)

    @torch.no_grad()
    def train(self, dataloader: DataLoader, epochs: int = 1, log_every: int = 1, save_every: int = 1000) -> None:
        # batch keys: 'latent_model_input', 'encoder_hidden_states', 'timesteps', 'noise'
        global_step: int = 0
        for epoch in range(epochs):
            pbar = tqdm(dataloader, dynamic_ncols = True, smoothing = 0.01)
            for b in pbar:
                latent_model_input: torch.Tensor = b['latent_model_input'].to(device = self.device)
                encoder_hidden_states: torch.Tensor = b['encoder_hidden_states'].to(device = self.device)
                timesteps: torch.Tensor = b['timesteps'].to(device = self.device)
                noise: torch.Tensor = b['noise'].to(device = self.device)
                with torch.enable_grad():
                    self.optim.zero_grad(set_to_none = True)
                    y = self.model(latent_model_input, timesteps, encoder_hidden_states).sample
                    loss = torch.nn.functional.mse_loss(noise, y)
                    loss.backward()
                    self.optim.step()
                    self.scheduler.step()
                xla_model.mark_step()
                if global_step % log_every == 0:
                    pbar.set_postfix({ 'loss': loss.detach().item(), 'epoch': epoch })
                global_step += 1


def main():
    pretrained: str = 'lxj616/make-a-stable-diffusion-video-timelapse'
    dataset_path: str = './storage/dataset/tempofunk'
    dtype: torch.dtype = torch.bfloat16
    trainer = TempoTrainerXLA(
        pretrained = pretrained,
        lr = 1e-5,
        dtype = dtype
    )
    dataloader: DataLoader = load_dataset(
        dataset_path = dataset_path,
        pretrained = pretrained,
        batch_size = 1,
        num_frames = 10,
        num_workers = 1,
        dtype = dtype
    )
    trainer.train(
        dataloader = dataloader,
        epochs = 1000,
        log_every = 1,
        save_every = 1000
    )

if __name__ == '__main__':
    main()
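Before committing to a long TPU run it can be worth confirming that the freezing logic above really left only the temporal layers trainable. A hypothetical helper, not part of the file:

def count_trainable(model: torch.nn.Module) -> None:
    # sums parameter counts by requires_grad to sanity check the freeze masks
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f'trainable: {trainable:,} / {total:,} ({100.0 * trainable / total:.2f}%)')

# e.g. count_trainable(trainer.model) right after constructing TempoTrainerXLA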