Future of Business - Survey Results

Creators: Facebook; OECD; World Bank
Publication Date: 2018

The Future of Business survey is a collaboration between Facebook, the OECD and the World Bank to provide timely insights on the perceptions, challenges, and outlook of online Small and Medium Enterprises (SMEs). The Future of Business survey was first launched as a monthly survey in 17 countries in February 2016 and expanded to 42 countries in 2018. In 2019, the Future of Business survey increased coverage to 97 countries and moved to a bi-annual cadence.

The target population consists of SMEs that have an active Facebook business Page and includes both newer and longer-standing businesses across a variety of sectors. To date, more than 90 million SMEs have created a Facebook Page, and more than 700,000 of these Facebook Page owners have taken the survey. With more businesses leveraging online tools each day, the survey provides a lens into a new mobilized, digital economy and, in particular, insights on its actors: a relatively unmeasured community worthy of deeper consideration and of considerable policy interest. The dataset is approximately 0.04 GB in size.

The survey includes questions about perceptions of current and future economic activity, challenges, business characteristics and strategy. Custom modules include questions related to regulation, access to finance, digital payments, and digital skills.

The Stanford Natural Language Inference (SNLI) Corpus

Creators: Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, Christopher D.
Publication Date: 2015

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. The dataset is approximately 0.09 GB in size.

It consists of a training, a validation, and a test set. The variables contained in each of these subsets are described below.

The data providers aim for it to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation-learning methods, and as a resource for developing NLP models of any kind.

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

IMDb Movie Reviews Dataset

Creators: Maas, Andrew L.; Daly, Raymond E.; Pham, Peter T.; Huang, Dan; Ng, Andrew Y.; Potts, Christopher
Publication Date: 2011

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The providers also include an additional 50,000 unlabeled documents for unsupervised learning. In total, the dataset amounts to 0.08 GB in size.

The dataset contains an equal number of positive and negative reviews. Only highly polarized reviews are considered: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. See the README file contained in the release for more details.
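
The score thresholds above can be sketched as a small labeling function. This is only an illustration of the stated rule, not the providers' preprocessing code; the function name `label_from_score` and the use of `None` for excluded reviews are assumptions made for this example.

```python
def label_from_score(score):
    """Map a review score out of 10 to a sentiment label.

    Returns "negative" for scores <= 4, "positive" for scores >= 7,
    and None for scores in between, which the dataset leaves out.
    """
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    # Scores of 5 and 6 are not polarized enough to be labeled.
    return None


if __name__ == "__main__":
    for score in (2, 5, 9):
        print(score, label_from_score(score))
```

Returning `None` for the middle scores keeps the exclusion explicit, so a caller can filter those reviews out before building a balanced set.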

The data is split into a training set (25k reviews) and a test set (25k reviews). A preview file cannot be provided; please download the data directly from the data provider’s website.

When using the dataset, please cite: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Food.com Recipe & Review Data

Creators: Majumder, Bodhisattwa P.; Li, Shuyang; Ni, Jianmo; McAuley, Julian
Publication Date: 2019

This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen), an online recipe aggregator. This extensive collection allows for in-depth analysis of culinary trends, user preferences, and recipe characteristics over nearly two decades.

The dataset is 0.85 GB in size and contains three sets of data from Food.com:

Interaction splits

  • interactions_test.csv
  • interactions_validation.csv
  • interactions_train.csv

Preprocessed data for result reproduction

In this format, the recipe text metadata is tokenized via the GPT subword tokenizer, with special tokens such as start-of-step added.

  • PP_recipes.csv
  • PP_users.csv

To convert these files into the pickle format required to run the dataset authors' code off the shelf, you may use pandas.read_csv and pandas.to_pickle.
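
A minimal sketch of that conversion, assuming the CSV files sit in the current working directory. The helper name `csv_to_pickle` and the `.pkl` output suffix are assumptions for illustration; the file names the authors' code actually expects are not stated here, and neither is whether list-valued columns need to be parsed before pickling.

```python
from pathlib import Path

import pandas as pd


def csv_to_pickle(csv_path, pickle_path=None):
    """Read a CSV with pandas and write the DataFrame out as a pickle."""
    csv_path = Path(csv_path)
    # Assumption: default to the same file stem with a ".pkl" suffix.
    if pickle_path is None:
        pickle_path = csv_path.with_suffix(".pkl")
    frame = pd.read_csv(csv_path)
    pd.to_pickle(frame, pickle_path)
    return Path(pickle_path)


if __name__ == "__main__":
    # Convert the two preprocessed files if they are present.
    for name in ("PP_recipes.csv", "PP_users.csv"):
        if Path(name).exists():
            print("wrote", csv_to_pickle(name))
```

The resulting files can be read back with `pandas.read_pickle`.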

Advertisement CTR Prediction Data

Creators: Huawei
Publication Date: 2020

Advertisement click-through rate (CTR) prediction is a key problem in computational advertising. Increasing the accuracy of advertisement CTR prediction is critical to improving the effectiveness of precision marketing. In this competition, we release large, anonymized advertising datasets, and contestants are required to build advertisement CTR prediction models based on them. The aim of the event is to find talented individuals to promote the development of advertisement CTR prediction algorithms.

The datasets contain advertising behavior data collected over seven consecutive days and are structured into a training dataset and a testing dataset. Their total size amounts to 6.86 GB, with millions of observations and multiple variables capturing different aspects of user-ad interactions. These variables include user identifiers, ad identifiers, timestamps, user behavior features, and ad content features, allowing researchers to analyze engagement patterns and develop predictive models for ad click-through rates. This dataset is valuable for improving advertising strategies and refining targeted marketing approaches.

Relative Wealth Index Data

Creators: Chi, Guanghua; Fang, Han; Chatterjee, Sourav; Blumenstock, Joshua E.
Publication Date: 2021

The Relative Wealth Index predicts the relative standard of living within countries using de-identified connectivity data, satellite imagery, and other nontraditional data sources.

It has been built by researchers at the University of California, Berkeley and Facebook. The estimates are built by applying machine learning algorithms to vast and heterogeneous data from satellites, mobile phone networks, and topographic maps, as well as aggregated and de-identified connectivity data from Facebook. The researchers train and calibrate the estimates using nationally representative household survey data from 56 low- and middle-income countries (LMICs), then validate their accuracy using four independent sources of household survey data from 18 countries. They also provide confidence intervals for each micro-estimate to facilitate responsible downstream use.

The data is provided for 93 low- and middle-income countries at 2.4 km resolution. It covers the period between April 1, 2021 and December 22, 2023. An interactive map of the Relative Wealth Index is available here: http://beta.povertymaps.net/

The combined size of the dataset is approximately 0.08 GB, available in CSV format.

Please cite / attribute any use of this dataset using the following:

Guanghua Chi, Han Fang, Sourav Chatterjee, and Joshua E. Blumenstock. Jan 2022. Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences, 119 (3) e2113658119. DOI: 10.1073/pnas.2113658119
