---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: other
---
# Content Type Classifier
# Model Overview
This is a text classification model designed to categorize documents into one of 11 distinct content types. It analyzes the nuances of textual information, enabling accurate classification across a diverse range of content. The model's classes are:
* Product/Company/Organization/Personal Websites: Informational pages about companies, products, or individuals.
* Explanatory Articles: Detailed, informative articles that aim to explain concepts or topics.
* News: News articles covering current events, updates, and factual reporting.
* Blogs: Personal or opinion-based entries typically found on blogging platforms.
* MISC: Miscellaneous content that doesn’t fit neatly into the other categories.
* Boilerplate Content: Standardized text used frequently across documents, often repetitive or generic.
* Analytical Exposition: Analytical or argumentative pieces with in-depth discussion and evaluation.
* Online Comments: Short, often informal comments typically found on social media or forums.
* Reviews: Content sharing opinions or assessments about products, services, or experiences.
* Books and Literature: Excerpts or full texts from books, literary works, or similar long-form writing.
* Conversational: Informal, dialogue-like text that mimics a conversational tone.
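For programmatic use, the 11 classes above can be represented as an `id2label` mapping. The sketch below is illustrative only: the index order is an assumption, and the authoritative mapping ships in the model's `config.json` (read it via `AutoConfig.from_pretrained(...).id2label` rather than hard-coding it).

```python
# Illustrative id2label mapping for the 11 content type classes.
# NOTE: the index order here is an assumption -- always read the real
# mapping from the model config rather than from this sketch.
id2label = {
    0: "Product/Company/Organization/Personal Websites",
    1: "Explanatory Articles",
    2: "News",
    3: "Blogs",
    4: "MISC",
    5: "Boilerplate Content",
    6: "Analytical Exposition",
    7: "Online Comments",
    8: "Reviews",
    9: "Books and Literature",
    10: "Conversational",
}

# Inverse mapping, useful when preparing labeled data
label2id = {label: idx for idx, label in id2label.items()}
print(len(id2label))  # 11 classes
```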
# License
This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
# References
* [DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing](https://arxiv.org/abs/2111.09543)
* [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://github.com/microsoft/DeBERTa)
# Model Architecture
* The model architecture is DeBERTa V3 Base
* Context length is 1024 tokens
# How to Use in NVIDIA NeMo Curator
NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.
The inference code for this model is available through the NeMo Curator GitHub repository. Check out this [example notebook](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) to get started.
# How to Use in Transformers
Use the following code to run the content type classifier with the Transformers library:
```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer, AutoConfig
from huggingface_hub import PyTorchModelHubMixin


class CustomModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config):
        super().__init__()
        self.model = AutoModel.from_pretrained(config["base_model"])
        self.dropout = nn.Dropout(config["fc_dropout"])
        self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))

    def forward(self, input_ids, attention_mask):
        features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        dropped = self.dropout(features)
        outputs = self.fc(dropped)
        # Classify from the [CLS] token (position 0)
        return torch.softmax(outputs[:, 0, :], dim=1)


# Set up the configuration, tokenizer, and model
config = AutoConfig.from_pretrained("nvidia/content-type-classifier-deberta")
tokenizer = AutoTokenizer.from_pretrained("nvidia/content-type-classifier-deberta")
model = CustomModel.from_pretrained("nvidia/content-type-classifier-deberta")
model.eval()

# Prepare and process inputs (the model's context length is 1024 tokens)
text_samples = ["Hi, great video! I am now a subscriber."]
inputs = tokenizer(
    text_samples, return_tensors="pt", padding="longest", truncation=True, max_length=1024
)
with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])

# Predict and display results
predicted_classes = torch.argmax(outputs, dim=1)
predicted_labels = [config.id2label[class_idx.item()] for class_idx in predicted_classes.cpu().numpy()]
print(predicted_labels)
# ['Online Comments']
```
# Input & Output
## Input
* Input Type: Text
* Input Format: String
* Input Parameters: 1D
* Other Properties Related to Input: Token Limit of 1024 tokens
## Output
* Output Type: Text Classification
* Output Format: String
* Output Parameters: 1D
* Other Properties Related to Output: None
The model takes one or several paragraphs of text as input. Example input:
```
Brent awarded for leading collaborative efforts and leading SIA International Relations Committee.
Mar 20, 2018
The Security Industry Association (SIA) will recognize Richard Brent, CEO, Louroe Electronics with the prestigious 2017 SIA Chairman's Award for his work to support leading the SIA International Relations Committee and supporting key government relations initiatives.
With his service on the SIA Board of Directors and as Chair of the SIA International Relations Committee, Brent has forged relationships between SIA and agencies like the U.S. Commercial Service. A longtime advocate for government engagement generally and exports specifically, Brent's efforts resulted in the publication of the SIA Export Assistance Guide last year as a tool to assist SIA member companies exploring export opportunities or expanding their participation in trade.
SIA Chairman Denis Hébert will present the SIA Chairman's Award to Brent at The Advance, SIA's annual membership meeting, scheduled to occur on Tuesday, April 10, 2018, at ISC West.
"As the leader of an American manufacturing company, I have seen great business opportunities in foreign sales," said Brent. "Through SIA, I have been pleased to extend my knowledge and experience to other companies that can benefit from exporting. And that is the power of SIA: To bring together distinct companies to share expertise across vertical markets in a collaborative fashion. I'm pleased to contribute, and I thank the Chairman for his recognition."
"As a member of the SIA Board of Directors, Richard Brent is consistently engaged on a variety of issues of importance to the security industry, particularly related to export assistance programs that will help SIA members to grow their businesses," said Hébert. "His contributions in all areas of SIA programming have been formidable, but we owe him a particular debt in sharing his experiences in exporting. Thank you for your leadership, Richard."
Hébert will present SIA award recipients, including the SIA Chairman's Award, SIA Committee Chair of the Year Award and Sandy Jones Volunteer of the Year Award, at The Advance, held during ISC West in Rooms 505/506 of the Sands Expo in Las Vegas, Nevada, on Tuesday, April 10, 10:30-11:30 a.m. Find more info and register at https://www.securityindustry.org/advance.
The Advance is co-located with ISC West, produced by ISC Security Events. Security professionals can register to attend the ISC West trade show and conference, which runs April 10-13, at http://www.iscwest.com.
```
The model outputs one of the 11 content type classes as its prediction for each input sample. Example output:
```
News
```
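Programmatically, the predicted class is simply the argmax over the model's 11 softmax probabilities. A minimal dependency-free sketch, using a made-up probability vector (the values are illustrative, not real model output) and the same assumed class order as the model config:

```python
# Class names in an assumed index order (the real order comes from config.id2label)
labels = [
    "Product/Company/Organization/Personal Websites", "Explanatory Articles",
    "News", "Blogs", "MISC", "Boilerplate Content", "Analytical Exposition",
    "Online Comments", "Reviews", "Books and Literature", "Conversational",
]
# Hypothetical softmax output for one document over the 11 classes
probs = [0.05, 0.04, 0.62, 0.10, 0.02, 0.01, 0.05, 0.03, 0.04, 0.02, 0.02]

# argmax without any ML dependencies
predicted_idx = max(range(len(probs)), key=probs.__getitem__)
print(labels[predicted_idx])  # News
```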
# Software Integration
* Runtime Engine: Python 3.10 and NeMo Curator
* Supported Hardware Microarchitecture Compatibility: NVIDIA GPU, Volta™ or higher (compute capability 7.0+), CUDA 12 (or above)
* Preferred/Supported Operating System(s): Ubuntu 22.04/20.04
# Training, Testing, and Evaluation Dataset
## Training Data
* Link: Jigsaw Toxic Comments, Jigsaw Unintended Biases Dataset, Toxigen Dataset, Common Crawl, Wikipedia
* Data collection method by dataset:
  * Downloaded
* Labeling method by dataset:
  * Human
* Properties:
  * 25,000 Common Crawl samples were labeled by an external vendor, with three annotators per sample.
  * The model was trained on the 19,604 samples for which at least two annotators agreed on the label.
Label distribution:
| Category | Count |
|----------------------------------------|-------|
| Product/Company/Organization/Personal Websites | 5227 |
| Blogs | 4930 |
| News | 2933 |
| Explanatory Articles | 2457 |
| Analytical Exposition | 1508 |
| Online Comments | 982 |
| Reviews | 512 |
| Boilerplate Content | 475 |
| MISC | 267 |
| Books and Literature | 164 |
| Conversational | 149 |
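As a sanity check, the counts in the table above sum to the 19,604 training samples, and the class shares reveal a roughly 35:1 imbalance between the largest class (Product/Company/Organization/Personal Websites) and the smallest (Conversational). A small script over the published counts:

```python
# Label counts copied from the distribution table above
label_counts = {
    "Product/Company/Organization/Personal Websites": 5227,
    "Blogs": 4930,
    "News": 2933,
    "Explanatory Articles": 2457,
    "Analytical Exposition": 1508,
    "Online Comments": 982,
    "Reviews": 512,
    "Boilerplate Content": 475,
    "MISC": 267,
    "Books and Literature": 164,
    "Conversational": 149,
}

total = sum(label_counts.values())
print(total)  # 19604 -- matches the reported training set size

# Per-class share of the training data
shares = {label: count / total for label, count in label_counts.items()}
print(f"{shares['Blogs']:.1%}")  # 25.1%
```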
## Evaluation
* Metric: PR-AUC
Cross-validation PR-AUC per class, on all 19,604 samples and on the 7,738-sample subset where all three annotators agreed:

| Class | All samples (19,604) | Full-agreement subset (7,738) |
|------------------------------------------------|-------|-------|
| Product/Company/Organization/Personal Websites | 0.697 | 0.869 |
| Explanatory Articles | 0.668 | 0.854 |
| News | 0.859 | 0.964 |
| Blogs | 0.872 | 0.964 |
| MISC | 0.593 | 0.876 |
| Boilerplate Content | 0.383 | 0.558 |
| Analytical Exposition | 0.371 | 0.334 |
| Online Comments | 0.753 | 0.893 |
| Reviews | 0.612 | 0.825 |
| Books and Literature | 0.462 | 0.780 |
| Average PR-AUC | 0.6192 | 0.7917 |
| Accuracy | 0.6805 | 0.8444 |
| Conversational | 0.541 | 0.793 |
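Assuming the reported average PR-AUC is the unweighted macro-average of the per-class values (an assumption; the card does not state the averaging scheme), the published figures are self-consistent, up to rounding of the per-class values:

```python
# Per-class PR-AUC values from the full 19,604-sample cross-validation above
pr_auc_all = [0.697, 0.668, 0.859, 0.872, 0.593, 0.383,
              0.371, 0.753, 0.612, 0.462, 0.541]
# Per-class PR-AUC on the 7,738-sample subset with full annotator agreement
pr_auc_subset = [0.869, 0.854, 0.964, 0.964, 0.876, 0.558,
                 0.334, 0.893, 0.825, 0.780, 0.793]

# Unweighted macro-average across the 11 classes
macro_all = sum(pr_auc_all) / len(pr_auc_all)
macro_subset = sum(pr_auc_subset) / len(pr_auc_subset)
print(f"{macro_all:.4f}")     # 0.6192, matching the reported average
print(f"{macro_subset:.4f}")  # 0.7918 (reported as 0.7917; per-class rounding)
```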
# Inference
* Engine: PyTorch
* Test Hardware: V100
# Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability).