From cfcffb90e69f37bf2ff1e988237a0fbe41f33c04 Mon Sep 17 00:00:00 2001
From: Jong Wook Kim
Date: Sun, 11 Apr 2021 02:29:52 -0700
Subject: [PATCH] add YFCC100M subset information (#50)

---
 data/yfcc100m.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
 create mode 100644 data/yfcc100m.md

diff --git a/data/yfcc100m.md b/data/yfcc100m.md
new file mode 100644
index 0000000..575c54b
--- /dev/null
+++ b/data/yfcc100m.md
@@ -0,0 +1,14 @@
+# The YFCC100M Subset
+
+In the paper, we performed a dataset ablation using a subset of the YFCC100M dataset and showed that the performance remained largely similar.
+
+The subset contains 14,829,396 images, about 15% of the full dataset, filtered to keep only those with natural language titles and/or descriptions in English.
+
+We provide the list of (line number, photo identifier, photo hash) for each image contained in this subset. These correspond to the first three columns in the dataset's metadata TSV file.
+
+```
+wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
+bunzip2 yfcc100m_subset_data.tsv.bz2
+```
+
+Use of the underlying media files is subject to the Creative Commons licenses chosen by their creators/uploaders. For more information about the YFCC100M dataset, visit [the official website](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).
\ No newline at end of file
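
The README added in this patch says the subset list shares its three columns with the start of the full metadata TSV, so one plausible use is to select the matching metadata rows. The following is a minimal sketch, not the authors' method. It assumes both files are tab-separated and that the photo identifier is the second column in each. `filter_yfcc_subset` is a hypothetical helper name, and the metadata path is a placeholder for wherever the decompressed YFCC100M metadata lives.

```shell
# Sketch: print the metadata rows whose photo identifier appears in the subset list.
# Assumptions (not confirmed by the source): both inputs are tab-separated and the
# photo identifier is column 2 in each file.
filter_yfcc_subset() {
    subset_tsv="$1"      # e.g. yfcc100m_subset_data.tsv (after bunzip2)
    metadata_tsv="$2"    # placeholder: path to the full, decompressed metadata TSV
    awk -F'\t' '
        NR == FNR { keep[$2] = 1; next }   # first file: remember subset photo identifiers
        ($2 in keep)                       # second file: print rows with a remembered identifier
    ' "$subset_tsv" "$metadata_tsv"
}

# Example call (paths are illustrative):
#   filter_yfcc_subset yfcc100m_subset_data.tsv /path/to/metadata.tsv > subset_metadata.tsv
```

This approach holds every subset identifier in memory while streaming the metadata file once, which keeps the script short at the cost of a large `awk` array for a list of this size.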