Stock return prediction with tweets

Creators:
Madhyastha, Pranava; Sowinska, Karolina
Publication Date:
2020
Data Category:
Dataset Description:

This dataset is designed for analyzing the impact of Twitter-based textual information on stock returns. It was compiled by Karolina Sowinska and Pranava Madhyastha, published in 2020, and is made available under the GNU General Public License v3.0 or later. It supports work in financial analytics and natural language processing, particularly on the relationship between social media sentiment and stock market performance.

The dataset comprises 862,231 labeled tweets, all in English, each associated with a specific company. A cleaned subset of 85,176 labeled instances is also included, so the dataset suits both large-scale machine learning models and more focused analyses. Each tweet is linked to the corresponding stock return data, allowing a company-level examination of how Twitter sentiment relates to one-day, two-day, three-day, and seven-day stock returns. This linkage makes it possible to build and evaluate models that predict stock movement from tweet text.

The dataset is approximately 225 MB in size on GitHub, which keeps it manageable for tasks such as sentiment analysis, text-based predictive modeling, and financial forecasting. It is structured into two primary components:

  • Tweet Data: This includes the textual content of tweets, user metadata, timestamps, and the companies referenced in each tweet. These features allow researchers to perform sentiment analysis, track user engagement, and examine the frequency of stock-related discussions on social media.

  • Stock Return Data: This includes numerical stock return values corresponding to the companies mentioned in the tweets. The returns are recorded over multiple time intervals (one, two, three, and seven days), enabling the study of short-horizon price movements in response to social media discussions.
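
The linkage between the two components can be illustrated with a minimal sketch. It is not the dataset's own pipeline: the field names (`company`, `date`, `text`), the ticker `ACME`, the toy prices, and the helper functions are all hypothetical, and it assumes simple (not log) returns measured over consecutive trading days, which the description above does not specify.

```python
from datetime import date

# Return windows named in the dataset description (in days).
HORIZONS = (1, 2, 3, 7)


def forward_returns(prices, index, horizons=HORIZONS):
    """Simple forward returns from prices[index] over each horizon.

    A horizon that runs past the end of the series maps to None.
    """
    base = prices[index]
    result = {}
    for h in horizons:
        j = index + h
        result[h] = (prices[j] / base - 1.0) if j < len(prices) else None
    return result


def label_tweets(tweets, price_series):
    """Attach forward returns to each tweet by (company, date).

    Tweets with no matching company or date are skipped.
    """
    labeled = []
    for tweet in tweets:
        series = price_series.get(tweet["company"])
        if series is None:
            continue
        dates, prices = series
        if tweet["date"] not in dates:
            continue
        i = dates.index(tweet["date"])
        labeled.append({**tweet, "returns": forward_returns(prices, i)})
    return labeled


# Toy data: nine consecutive days treated as trading days (an assumption).
dates = [date(2020, 1, d) for d in range(1, 10)]
prices = [100.0, 102.0, 101.0, 105.0, 104.0, 106.0, 108.0, 110.0, 107.0]
price_series = {"ACME": (dates, prices)}

tweets = [
    {"company": "ACME", "date": date(2020, 1, 1), "text": "toy tweet one"},
    {"company": "ACME", "date": date(2020, 1, 5), "text": "toy tweet two"},
    {"company": "OTHER", "date": date(2020, 1, 1), "text": "no price data"},
]

labeled = label_tweets(tweets, price_series)
for row in labeled:
    print(row["company"], row["date"], row["returns"])
```

Each labeled row pairs the tweet text with one return per horizon, which is the shape a text-based predictive model would train on. The actual column names and return definitions should be checked against the dataset files before adapting this.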

Variables:
Details:
