from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
import torch
from itertools import cycle
import json
import gradio as gr
from urllib.parse import unquote
from ml import create_prompt, generate_answer_short


example1 = ("""We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: https://github.com/mistralai/mistral-src
Webpage: https://mistral.ai/news/announcing-mistral-7b/""","""{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}""")


example2 = ("""Identity security company IDfy said on Wednesday it has raised $27 million in a mix of primary and secondary fundraising from Elev8, KB Investment and Tenacity Ventures.
Mumbai-based IDfy makes products and solutions for authenticating entities, helping companies prevent fraud and verify other businesses. "Investment from Elev8 and Tenacity is a strong validation of our vision and capabilities. The fund will fuel our expansion plans and product development, enabling us to serve even more businesses and unlock opportunities for trustworthy people and businesses," said Ashok Hariharan, co-founder and chief executive officer of IDfy, in a statement.
Click here to follow our WhatsApp channel
IDfy (Baldor Technologies) was founded in 2011 by Hariharan and Vineet Jawa. It has products for diligence processes called know your customer and know your business, employee background verification, risk and fraud mitigation, and digital privacy.
The company said its artificial intelligence-driven solutions serve more than 1,500 clients in banking, financial services and insurance, e-commerce, gaming and other sectors. It works with companies in India, Southeast Asia and West Asia, having HDFC Bank, Zomato, Paytm, HUL and American Express as clients.
Navin Honagudi, managing partner at Elev8 Venture Partners, said: "We are thrilled to partner with IDfy as our first investment. The company's innovative technology, experienced leadership team, and strong market fit position it for remarkable growth. We are confident that IDfy will play a crucial role in shaping the future of risk management in India and beyond."
Elev8 Venture Partners is a $200 million growth-stage fund anchored by South Korea's KB Investment. Tenacity Ventures is a growth-stage investment fund with a focus on technology product businesses.
""", """{
    "Funding": {
        "New funding": "",
        "Investor": []
    },
     "Company": {
        "Name": "",
        "Activity": "",
        "Total valuation": ""
    }
}""")

example3 = ("""Office of Management and Budget (OMB) memorandum M-12-12, as amended by memorandum
M-17-08, requires federal agencies to issue an annual report related to its conference-related expenditures
for the previous fiscal year. This document constitutes the SEC’s report for Fiscal Year (FY) 2018.
The SEC has put in place policies and procedures governing the approval and use of agency funds for
conference expenses, to ensure that such spending is legal, reasonable, and in furtherance of the agency’s
mission to protect investors, maintain fair, orderly, and efficient markets, and facilitate capital formation.
At a high level, the major steps in this process are as follows:
1. All SEC division/office requests to spend money on hosting a conference must be approved by
the division/office head or his/her designee. Divisions and offices are required to use SEC
facilities for such events whenever possible, to minimize space rental and equipment costs. In
order to limit expenses for meals or refreshments, the SEC uses per diem rates established for
the federal government as the ceiling for any such costs, except when higher rates are
unavoidable or otherwise justified. The acquisition of any goods, services, or meeting space is
subject to the applicable policies and regulations which govern these areas.
2. When a request for funds is necessary and has received approval from the division/office head,
it is reviewed by staff in the Office of Financial Management (OFM) to ensure the expenses
are permissible under the applicable polices and regulations. OFM has implemented an
automated system for the submission, review, and approval of all SEC conference requests
that enables OFM to monitor and control conference spending, as well as record actual
conference spending after a conference has been held.
3. Each request must receive final approval from designated officials according to the total
projected cost. These designations comply with OMB Memorandum 12-12.
4. The SEC is reporting conferences which meet thresholds defined in P.L.115-141 Division E,
Title VII, Sections 739 (a), (b), and (c), to the SEC’s Office of Inspector General via separate
correspondence.
For FY 2018, the SEC authorized 97 conferences (including training conferences) with costs totaling
$884,759.
2
Conferences over $100,000:
In FY 2018, the SEC authorized two conferences costing greater than $100,000, which are described
below:
A. 2018 Chief Enforcement Conference (CEC), SEC Headquarters, Washington DC, September 25-26,
2018
• Cost incurred1
: $165,194
• Number of attendees: 209 (207 SEC attendees and 2 non-SEC attendees)
The Enforcement Division (Enforcement) conducts investigations into potential violations of the
federal securities laws, litigates actions, negotiates settlements, and coordinates with the
Commission and other SEC divisions and offices regarding the national enforcement program.
Because Enforcement staff are located in Washington, DC and 11 regional offices, periodic
gatherings of Enforcement leaders help to ensure an efficient, well-coordinated national program.
The 2018 Chief Enforcement Conference (CEC) was held at SEC Headquarters in Washington,
D.C. on September 25 and 26, 2018. CEC served as a strategic planning and training session for
Enforcement’s senior managers and provided an important opportunity for attendees to discuss
relevant enforcement topics with the Chairman and participating Commissioners.
B. 2018 Leadership Conference, SEC Headquarters, Washington, DC, July 26-27, 2018
• Cost incurred1
: $219,658
• Number of attendees: 261 attendees (261 SEC employees)
The Office of Compliance Inspections and Examinations (OCIE) conducts the National
Examination Program and focuses on improving compliance with the federal securities laws,
preventing fraud, informing policy, and monitoring risk. Because examination program staff are
located in Washington, DC and 11 regional offices, periodic gatherings of examination program
leaders help to ensure an efficient, well-coordinated national program. On July 26 and 27, 2018,
OCIE held its leadership conference at SEC Headquarters in Washington DC, which focused on
initiatives to increase OCIE’s capabilities. The conference gathered SEC managers from across
the National Examination Program to collaborate on strategic planning and to provide training. It
included presentations and discussions on risk assessment tools and procedures, implementation
of new requirements, and increasing OCIE’s collaboration with other Commission offices and
divisions""","""{
    "Number of conferences": "",
    "Total cost": "",
    "Conferences over 100k": [
        {
            "Name": "",
            "Organizer": "",
            "Cost": "",
            "Start date": "",
            "End date": "",
            "Location": ""
        }
    ]
}""")

example4 = ("""
Patient: Good evening doctor.
Doctor: Good evening. You look pale and your voice is out of tune.
Patient: Yes doctor. I’m running a temperature and have a sore throat.
Doctor: Lemme see.
(He touches the forehead to feel the temperature.)
Doctor: You’ve moderate fever.
(He then whips out a thermometer.)
Patient: This thermometer is very different from the one you used the last time. (Unlike the earlier one which was placed below the tongue, this one snapped around one of the fingers.)
Doctor: Yes, this is a new introduction by medical equipment companies. It’s much more convenient, as it doesn’t require cleaning after every use.
Patient: That’s awesome.
Doctor: Yes it is.
(He removes the thermometer and looks at the reading.)
Doctor: Not too high – 99.8.
(He then proceeds with measuring blood pressure.)
Doctor: Your blood pressure is fine.
(He then checks the throat.)
Doctor: It looks bit scruffy. Not good.
Patient: Yes, it has been quite bad.
Doctor: Do you get sweating and shivering?
Patient: Not sweating, but I feel somewhat cold when I sit under a fan.
Doctor: OK. You’ve few symptoms of malaria. I would suggest you undergo blood test. Nothing to worry about. In most cases, the test come out to be negative. It’s just precautionary, as there have been spurt in malaria cases in the last month or so.
(He then proceeds to write the prescription.)
Doctor: I’m prescribing three medicines and a syrup. The number of dots in front of each tells you how many times in the day you’ve to take them. For example, the two dots here mean you’ve to take the medicine twice in the day, once in the morning and once after dinner.
Doctor: Do you’ve any other questions?
Patient: No, doctor. Thank you.
""","""{
    "Doctor_Patient_Discussion": {
        "Initial_Observation": {
            "Symptoms": [],
            "Initial_Assessment": ""
        },
        "Medical_Examination": {
            "Temperature":"",
            "Blood_Pressure":"",
            "Doctor_Assessment": "",
            "Diagnosis": ""
        },
        "Treatment_Plan": {
            "Prescription": []
        }
    }
}""")

example5 = ("""HARVARD UNIVERSITY Extension School
Master of Liberal Arts, Information Management Systems May 2015
 Dean’s List Academic Achievement Award recipient
 Relevant coursework: Trends in Enterprise Information Systems, Principles of Finance, Data mining
and Forecast Management, Resource Planning and Allocation Management, Simulation for
Managerial Decision Making
RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY
Bachelor of Arts in Computer Science with Mathematics minor May 2008
Professional Experience
STATE STREET CORPORATION
Principal –Simulated Technology
 Boston, MA
December 2011 – July 2013
 Led 8 cross functional, geographically dispersed teams to support quality for the reporting system
 Improved process efficiency 75% by standardizing end to end project management workflow
 Reduced application testing time 30% by automating shorter testing phases for off cycle projects
 Conducted industry research on third-party testing tools and prepared recommendations for maximum
return on investment
FIDELITY INVESTMENTS
Associate – Interactive Technology
 Boston, MA
January 2009 – November 2011
 Initiated automated testing efforts that reduced post production defects by 40%
 Implemented initiatives to reduce overall project time frames by involving quality team members
early in the Software Development Life Cycle iterations
 Developed a systematic approach to organize and document the requirements of the to-be-system
 Provided leadership to off-shore tech teams via training and analyzing business requirements
L.L. BEAN, INC.
IT Consultant
 Freeport, ME
June 2008 – December 2009
 Collaborated closely with the business teams to streamline production release strategy plans
 Managed team of five test engineers to develop data driven framework that increased application
testing depth and breadth by 150%
 Generated statistical analysis of quality and requirements traceability matrices to determine the linear
relationship of development time frames to defect identification and subsequent resolution
 Led walkthroughs with project stakeholders to set expectations and milestones for the project team
Technical Expertise
MS Excel, PowerPoint, Relational Databases, Project Management, Quantitative Analysis, SQL, Java
Additional
Organized computer and English literacy workshops for underprivileged children in South Asia, 2013
Student Scholarship Recipient, National Conference on Race and Ethnicity, 2007-2008""","""{
    "Name": "",
    "Age": "",
    "Educations": [
        {
            "School": "",
            "Date": ""
        }
    ],
    "Experiences": [
        {
            "Company": "",
            "Date": ""
        }
    ]
}""")

example6 = ("""Libretto by Marius Petipa, based on the 1822 novella ``Trilby, ou Le Lutin d'Argail`` by Charles Nodier, first presented by the Ballet of the Moscow Imperial Bolshoi Theatre on January 25/February 6 (Julian/Gregorian calendar dates), 1870, in Moscow with Polina Karpakova as Trilby and Ludiia Geiten as Miranda and restaged by Petipa for the Imperial Ballet at the Imperial Bolshoi Kamenny Theatre on January 17–29, 1871 in St. Petersburg with Adèle Grantzow as Trilby and Lev Ivanov as Count Leopold.""","""{
    "Name_act": "",
    "Director": "",
    "Location": [
        {
            "City": "",
            "Venue": "",
            "Date": "",
            "Actor": [
                {
                    "Name": "",
                    "Character_played": ""
                }
            ]
        }
    ]
}""")


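def extract_leaves(item, path=None, leaves=None):
    """Collect every non-empty leaf of a nested dict/list as (path, value).

    `path` is the list of dict keys leading to the leaf; list indices are
    not recorded, so all elements of a list share the same path.
    """
    if leaves is None:
        leaves = []
    if path is None:
        path = []

    if isinstance(item, dict):
        for key, value in item.items():
            extract_leaves(value, path + [key], leaves)
    elif isinstance(item, list):
        for value in item:
            extract_leaves(value, path, leaves)
    else:
        # Empty strings are unfilled template fields; skip them.
        if item != '':
            leaves.append((path, item))
    return leaves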

def highlight_words(input_text, json_output):
    """Wrap each extracted value found in input_text in a colored <span>.

    Values that share a JSON path get the same background color.
    """
    colors = cycle(["#90ee90", "#add8e6", "#ffb6c1", "#ffff99", "#ffa07a", "#20b2aa", "#87cefa", "#b0e0e6", "#dda0dd", "#ffdead"])
    color_map = {}
    highlighted_text = input_text

    leaves = extract_leaves(json_output)
    for path, value in leaves:
        # str.replace only accepts strings; skip numbers, booleans, and nulls.
        if not isinstance(value, str):
            continue
        path_key = tuple(path)
        if path_key not in color_map:
            color_map[path_key] = next(colors)
        color = color_map[path_key]
        highlighted_text = highlighted_text.replace(value, f"<span style='background-color: {color};'>{unquote(value)}</span>")

    return highlighted_text

# model = AutoModelForCausalLM.from_pretrained(
#         "numind/NuExtract-tinyv2",
#         )

model = AutoModelForCausalLM.from_pretrained(
        "numind/NuExtract",
        trust_remote_code=True,
        )


tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")

model.to("cuda")
model.eval()


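def get_prediction(text, template, example):
    """Run the model on text and return (pretty-printed JSON, highlighted HTML)."""
    size = len(tokenizer(text)["input_ids"])
    print(size)
    # Hard limit enforced here; the UI text recommends staying under 2000 tokens.
    if size > 5000:
        raise gr.Error(f"Input text is limited to 5000 tokens; yours has {size}.")
    try:
        prompt = create_prompt(text, template, [example, "", ""])
    except Exception:
        raise gr.Error("Invalid JSON in the template or the output example.")

    result = generate_answer_short(prompt, model, tokenizer)
    result = result.replace("\n", " ")
    # Parse once, then pretty-print the parsed object for display.
    try:
        dic_out = json.loads(unquote(result))
    except json.JSONDecodeError:
        raise gr.Error("The model did not return valid JSON.")
    r = json.dumps(dic_out, indent=4)
    highlighted_input2 = highlight_words(text, dic_out)
    return r, highlighted_input2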


markdown_description = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>NuExtract</title>
</head>
<body>
    <h1>NuExtract</h1>
    <p>NuExtract is a model that transforms any piece of text into a structured output. You can use it to tackle any information extraction problem.
     To use the model, provide an input text (less than 2000 tokens) and a JSON template describing the information you need to extract. The model is purely extractive, so everything it outputs appears verbatim in the input text. You can also provide a full output example to help the model understand your task more precisely.</p>
    <ul>
        <li><strong>Model</strong>: <a href="https://huggingface.co/numind/NuExtract">numind/NuExtract</a></li>
    </ul>
    <br>
    <img src="https://cdn.prod.website-files.com/638364a4e52e440048a9529c/64188f405afcf42d0b85b926_logo_numind_final.png" alt="NuMind Logo" style="vertical-align: middle;width: 200px; height: 50px;"> 
    <p>We are a startup developing NuMind, a tool for creating custom information extraction models. NuExtract is a zero-shot model; for better performance, the best option is to use it through NuMind. Don't hesitate to contact us :).
    <br>
    </p> 
    <ul>
        <li><strong>Webpage</strong>: <a href="https://www.numind.ai/">https://www.numind.ai/</a></li>
    </ul>
</body>
</html>
"""

iface = gr.Interface(
    fn=get_prediction,
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter Text here...", label="Text"),
        gr.Textbox(lines=2, placeholder="Enter Template input here...", label="Template"),
        gr.Textbox(lines=2, placeholder="Enter an example of the output here... (optional, but can significantly improve performance on certain tasks)", label="Example of output")
    ],
    outputs=[gr.Textbox(label="Model Output"), gr.HTML(label="Model Output with Highlighted Words")],
    examples=[
        [example6[0], example6[1]],
        [example1[0], example1[1]],
        [example4[0], example4[1]],
        [example2[0], example2[1]],
        [example5[0], example5[1]],
        [example3[0], example3[1]]
    ],
    description=markdown_description
)


iface.launch(debug=True,share=True)