1. Overview and Pedagogical Goal
The goal of this assignment is to familiarize you with the complete process of extracting, refining and
delivering insights of financial value from unstructured data of unconventional size found in company
reports. This is an individual assignment: you are expected to work alone to extract insights from
company financial statements filed through the Electronic Data Gathering, Analysis, and Retrieval
(EDGAR) system of the U.S. Securities and Exchange Commission (SEC). The assignment maps to
qualification level 7 and aims to establish your ability to develop in-depth and original solutions to a
domain-specific problem of high business value.
The task is structured in three (3) parts. The first part (Part A) covers your ability to construct a corpus
and demonstrate the handling of text data. It aims to familiarize you with the principles of text mining, the
bag-of-words model and the development of metrics that can be used to analyse structural elements of text,
as well as with normalizing and cleaning textual corpora. The core of this assignment involves the
translation of these insights into actionable features that can be used to predict an outcome variable of
financial interest: the stock price. The second and third parts (Part B and Part C) are therefore concerned
with the identification of features, in particular (a) polarity – whether the text under consideration is
positive or negative, (b) sentiment – the extraction of affective states from the text and (c) topics – the
evaluation and extraction of important topics covered in the quarterly and annual financial reports (10-Q,
10-K), together with the predictive power of these insights at the company, sector and market level (Part C).
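As a minimal illustration of the bag-of-words model and of a polarity feature, the sketch below counts
tokens and scores a sentence against two word lists. The function names and the two mini-lexicons are
hypothetical and chosen only for this example; they are not part of the assignment, and a real analysis
would rely on an established finance-specific lexicon.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"growth", "gain", "improved", "strong"}
NEGATIVE = {"loss", "decline", "impairment", "weak"}

def bag_of_words(text):
    """Lower-case the text, split it into alphabetic tokens and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def polarity(bag):
    """Return (pos - neg) / (pos + neg), or 0.0 if no lexicon word occurs."""
    pos = sum(n for word, n in bag.items() if word in POSITIVE)
    neg = sum(n for word, n in bag.items() if word in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

sample = "Strong revenue growth offset an impairment loss; margins improved."
bag = bag_of_words(sample)
score = polarity(bag)  # three positive hits, two negative hits
```

Word order is discarded in this representation, which is why negation ("no loss") is scored the same as
"loss"; this is one of the limitations to keep in mind when interpreting polarity features.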
The report should be written from the perspective of an analyst who applies text mining methods, and it
should be a well-written piece of work. It should be both academic and practical, and should consider
possible application scenarios where text mining can be used (e.g., risk analysis).
Part A: Construction of Corpus – Fetching 10-Q and 10-K forms
The S&P 500 stock market index, maintained by S&P Dow Jones Indices, comprises common stocks issued
by large-capitalization companies and traded on U.S. exchanges such as the New York Stock Exchange and
Nasdaq (including the 30 companies that compose the Dow Jones Industrial Average). The index covers
about 80 percent of the American equity market by capitalization. All of these companies are required by
law to file quarterly and annual reports through the Electronic Data Gathering, Analysis, and Retrieval
(EDGAR) system of the U.S. Securities and Exchange Commission (SEC).
The publicly available page of EDGAR is provided at:
Each company files its reports in an XML-like markup format based on the Standard Generalized Markup
Language (SGML), using a specialized interface that indexes them by filing period (quarter/year) and by a
unique identifier known as the Central Index Key (CIK). The latter is by definition 10 digits long, padded
with leading zeros when needed. The list of current companies, their CIK codes and stock ticker symbols
can be obtained from the accompanying table in the appendix.
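Because CIK codes are often stored as plain integers (for example in a spreadsheet), the leading zeros may
be lost. The following minimal sketch restores the 10-digit form; the function name normalize_cik is
hypothetical, and 320193 is used only as an example value.

```python
def normalize_cik(cik):
    """Return a CIK as a 10-character string, left-padded with zeros."""
    digits = str(cik).strip()
    # Reject anything that is not a plain run of at most 10 ASCII digits.
    if not (digits.isascii() and digits.isdigit()) or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)

padded = normalize_cik(320193)
```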
You are required to build a portfolio of at least 30 companies. For the companies in your portfolio
(selected only from the companies listed in the appendix) you need to download the 10-K forms for the
period between 2010 and 2020 (where available).
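Once a list of filings has been obtained for a company, the selection step can be sketched as a simple
filter on form type and filing year. The record layout used here (dictionaries with "form" and "filed"
keys) and the function name select_filings are assumptions made for this illustration; the actual fields
depend on how you retrieve the filing index.

```python
from datetime import date

def select_filings(filings, form="10-K", first_year=2010, last_year=2020):
    """Keep records of the requested form type filed within the year range."""
    selected = [
        f for f in filings
        if f["form"] == form
        and first_year <= date.fromisoformat(f["filed"]).year <= last_year
    ]
    # ISO dates sort correctly as strings, so this orders filings by date.
    return sorted(selected, key=lambda f: f["filed"])

# Made-up index records for one company (field names are hypothetical).
index = [
    {"form": "10-K", "filed": "2009-11-20"},
    {"form": "10-Q", "filed": "2012-05-04"},
    {"form": "10-K", "filed": "2015-02-27"},
    {"form": "10-K", "filed": "2012-02-24"},
    {"form": "10-K/A", "filed": "2016-03-10"},
]
annual = select_filings(index)
```

The sketch filters on the filing date and matches the form type exactly, so amended filings (10-K/A) are
excluded. Whether the period should instead refer to the fiscal year covered by the report is a choice you
should state explicitly in your work.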
You need to develop a normalization strategy that removes the standard (boilerplate) parts of the 10-K
and 10-Q forms and prepares the text for further analysis, such as stop-word removal. You need to provide
outlines of TF-IDF weights for important keywords at the industry level, using the MSCI Global Industry
Classification Standard (GICS) as a control.
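The sketch below shows one possible way to combine stop-word removal with TF-IDF weighting, treating the
concatenated text of each GICS sector as one document. It uses the common definition tf × ln(N / df); other
variants (smoothed or normalized) exist and may be preferable. The stop-word list, the function names and
the three sample sentences are invented for illustration, and the removal of form boilerplate is not
covered here.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "our", "for", "on", "is", "are"}

def tokenize(text):
    """Lower-case, keep alphabetic tokens, drop stop words and 1-letter tokens."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if len(t) > 1 and t not in STOP_WORDS]

def tf_idf(docs):
    """Map each document name to {term: tf * idf}.

    tf  = term count / number of tokens in the document
    idf = ln(N / df), where df is the number of documents containing the term
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    weights = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {t: (n / total) * math.log(n_docs / df[t])
                         for t, n in c.items()}
    return weights

# Hypothetical input: one concatenated text per GICS sector.
sector_docs = {
    "Energy": "Crude oil prices and drilling costs affect our revenue.",
    "Financials": "Interest rates and loan losses affect our revenue.",
    "Health Care": "Drug approvals and clinical trials affect our revenue.",
}
weights = tf_idf(sector_docs)
top_energy = sorted(weights["Energy"], key=weights["Energy"].get, reverse=True)[:3]
```

Terms that occur in every sector (here "affect" and "revenue") receive a weight of zero, while
sector-specific terms such as "crude" or "drilling" stand out. This is the behaviour that makes TF-IDF
useful for contrasting industries.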