---
title: Text Duplicates
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
  Returns the fraction of duplicate strings in the input.
---
# Measurement Card for Text Duplicates
## Measurement Description
The `text_duplicates` measurement returns the fraction of duplicated strings in the input data.
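
As a rough illustration of what "fraction of duplicated strings" means, the sketch below computes one minus the ratio of distinct strings to total strings. This formula is an assumption inferred from the example outputs in this card, and the helper name `duplicate_fraction` and the choice of SHA-256 are hypothetical, not taken from the measurement's source.

```python
import hashlib


def duplicate_fraction(data):
    """Hypothetical helper: 1 - (number of distinct strings / number of strings)."""
    if not data:
        raise ValueError("data must contain at least one string")
    # Assumption: compare fixed-size digests rather than the raw strings.
    unique_hashes = {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in data}
    return 1 - len(unique_hashes) / len(data)


print(duplicate_fraction(["hello sun", "hello moon", "hello sun"]))  # roughly 0.333
print(duplicate_fraction(["foo", "bar", "foobar"]))  # 0.0
```

Under this definition, a string that appears three times contributes two duplicates, since only the first occurrence counts as distinct.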
## How to Use
This measurement requires a list of strings as input:
```python
>>> import evaluate
>>> data = ["hello sun", "hello moon", "hello sun"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)
```
### Inputs
- **data** (list of `str`): The input list of strings for which the duplicates are calculated.
- **list_duplicates** (`bool`): (optional) whether to also return the duplicated strings and their counts. Off by default.
### Output Values
- **duplicate_fraction** (`float`): the fraction of duplicates in the input string(s).
- **duplicates_dict** (`dict`): (optional) a dictionary mapping each duplicated string to the number of times it appears in the input.

By default, this measurement outputs a dictionary containing only the fraction of duplicates in the input string(s) (`duplicate_fraction`):
```python
{'duplicate_fraction': 0.33333333333333337}
```
With the `list_duplicates=True` option, this measurement also outputs a dictionary mapping each duplicated string to its count (`duplicates_dict`):
```python
{'duplicate_fraction': 0.33333333333333337, 'duplicates_dict': {'hello sun': 2}}
```
Warning: the `list_duplicates=True` option can be memory-intensive for large datasets.
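
The memory cost is easier to see with a sketch of how such a dictionary could be built. The version below uses `collections.Counter`, which is an assumption about the approach rather than the measurement's confirmed implementation, and `find_duplicates` is a hypothetical name.

```python
from collections import Counter


def find_duplicates(data):
    """Hypothetical helper: return {string: count} for strings seen more than once."""
    # The counter keeps one entry per distinct string, so memory use
    # grows with the number (and length) of distinct inputs.
    counts = Counter(data)
    return {text: n for text, n in counts.items() if n > 1}


print(find_duplicates(["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]))
# {'hello sun': 2, 'foo bar': 2}
```

Because every distinct string is held in memory at once, a dataset with many long, mostly unique strings is the worst case for this kind of counting.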
### Examples
Example with no duplicates:
```python
>>> import evaluate
>>> data = ["foo", "bar", "foobar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)
>>> print(results)
{'duplicate_fraction': 0.0}
```
Example with multiple duplicates and `list_duplicates=True`:
```python
>>> import evaluate
>>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data, list_duplicates=True)
>>> print(results)
{'duplicate_fraction': 0.4, 'duplicates_dict': {'hello sun': 2, 'foo bar': 2}}
```
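
As a quick sanity check on how the two output fields relate, the sketch below starts from the result shown in the last example and counts the redundant copies both ways. The variable names are illustrative, and the relationship assumes the fraction is defined as one minus the share of distinct strings.

```python
# Result copied from the example above; n_inputs is the length of its input list.
results = {"duplicate_fraction": 0.4, "duplicates_dict": {"hello sun": 2, "foo bar": 2}}
n_inputs = 5

# Each duplicated string contributes (count - 1) redundant copies.
redundant_from_dict = sum(count - 1 for count in results["duplicates_dict"].values())

# The same number recovered from the fraction alone.
redundant_from_fraction = round(results["duplicate_fraction"] * n_inputs)

print(redundant_from_dict, redundant_from_fraction)  # 2 2
```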
## Citation(s)
## Further References
- [`hashlib` library](https://docs.python.org/3/library/hashlib.html)