Text Exploration

It can be fun to explore. You can also get lost along the way.
What kind of data do you think would be most straightforward to explore?
Of course.
No, structured data is a lot more straightforward in terms of organization and meaning.
If you have wrangled the text into a document-term matrix (DTM), what key values are you able to use?
No, word length isn't a value captured in a DTM, unless you examine the column headers themselves. Even then, it's not an important value here.
Exactly. That's one of the most basic things to look at with text data, and calculating a **term frequency (TF)**, the frequency of each token as a ratio of all tokens in the document, is a start. From there, an algorithm may start looking for words that tend to appear frequently together.
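As a minimal sketch of the term frequency calculation just described (the function name and sample tokens are made up for illustration):

```python
from collections import Counter

def term_frequency(tokens):
    """Frequency of each token as a ratio of all tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}

# Hypothetical tokenized document
tokens = ["the", "stock", "rose", "and", "the", "bond", "fell"]
tf = term_frequency(tokens)
# "the" appears 2 of 7 times, so tf["the"] is 2/7
```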
No, relevance isn't determined at this point. Consider what the cells in a DTM are.
Following this initial exploratory data analysis, the second part of text exploration is feature selection. Think of a feature as a variable (column) in a traditional spreadsheet. What element of text data would be similar to a column variable?
No, a matrix would be the entire spreadsheet, not just a single column.
No, documents represent rows in a document term matrix. The columns are the important pieces pulled out of the documents.
Right. *Feature selection* means choosing those tokens that are most likely to be useful in model training. Being selective with features allows the process to be more efficient.
Similar to traditional modeling, overfitting is the risk of using too many variables, while underfitting is not using enough. With feature selection, you have some frequent tokens and some infrequent tokens. You have to draw the line somewhere. Which problem do you think would follow from setting the threshold low to catch the infrequent tokens?
Great! Yes, the lower the threshold, the more variables, and the greater the risk of overfitting. With a higher threshold you would have fewer variables, but then a risk of underfitting.
No, actually this would run the risk of overfitting. See, when you set the threshold low, you get many more tokens. Tokens are variables. So the lower the threshold, the more variables, and the greater the risk of overfitting. With a higher threshold you would have fewer variables, but then a risk of underfitting.
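The threshold idea above can be sketched as a simple filter: a lower `min_doc_count` keeps more rare tokens (more variables, more overfitting risk), while a higher one keeps fewer (underfitting risk). The function name, threshold parameter, and sample documents are hypothetical:

```python
from collections import Counter

def select_features(documents, min_doc_count=2):
    """Keep tokens appearing in at least `min_doc_count` documents.
    Lower threshold -> more variables -> overfitting risk.
    Higher threshold -> fewer variables -> underfitting risk."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    return {tok for tok, df in doc_freq.items() if df >= min_doc_count}

docs = [["stock", "rose"], ["stock", "fell"], ["rain", "fell"]]
# With min_doc_count=2, only "stock" and "fell" survive as features
```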
What might have a high MI score in a document about a stock analysis?
No, that's less likely to pop up and be an informative token than another choice.
Yes. There are plenty of expected tokens that would have high MI scores in an article like that, and this will help the machine learning model to train.
Not really. That won't convey too much meaning by itself.
Where else might you find these ideas?
That's it. Recall that these were part of text wrangling already. They are now back at this stage as part of the feedback loop, highlighting how much of the results depend on the choices of the researcher.
No, that's just scraping text off of the web. These processes came later.
No, structured data has no need for such things. Once errors are fixed, the spreadsheet is beautiful and ready to go.
Of course, sometimes a word comes up that the algorithm doesn't know. What do you do when that happens?
Good idea.
That's no way to learn! The internet never sleeps, and online dictionaries are always there. It might be a better idea to look it up.
That's really what an algorithm does with **named entity recognition (NER)**: run a token through a dictionary to see what it is, and give it an additional classification. A related lookup process is **part-of-speech (POS)** tagging, determining whether a token is a noun, verb, etc. NER and POS tags help the model learn more about each token as it processes the text data. After all, you don't want your model to be flummoxed by an obelus or an octothorpe (you don't have to look those up).
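A toy dictionary-lookup version of NER and POS tagging might look like the sketch below. Real systems use statistical models rather than fixed lexicons, and the lexicon entries and tag names here are invented for illustration:

```python
# Hypothetical lookup tables; "O" conventionally means "not an entity"
ENTITY_LEXICON = {"bloomberg": "ORG", "london": "LOC"}
POS_LEXICON = {"stock": "NOUN", "rose": "VERB", "the": "DET"}

def tag(tokens):
    """Attach a POS tag and an entity label to each token via lookup."""
    tagged = []
    for tok in tokens:
        pos = POS_LEXICON.get(tok.lower(), "UNK")
        ent = ENTITY_LEXICON.get(tok.lower(), "O")
        tagged.append((tok, pos, ent))
    return tagged

# tag(["Bloomberg", "stock", "rose"]) labels "Bloomberg" as an ORG entity
```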
To summarize: [[summary]]
You can start with a supervised learning approach like text classification to try to put some order to the text, or allow an unsupervised learning approach like topic modeling to start making topic clusters out of the text.
A few tools to consider with text feature selection: first, document frequency. Just as tokens are measured for their frequency within a document, document frequency measures how many documents a certain token appears in. A chi-square test assesses whether a token's use is independent of a document's assignment to a certain category, or class. Finally, mutual information (MI) is a measure from 0 to 1 of how much information a token offers about a class.
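The MI measure above can be sketched for a binary class label, where a perfectly class-predicting token scores 1 and an uninformative one scores 0 (base-2 logs keep the binary case in that range; the function name and sample data are hypothetical):

```python
import math
from collections import Counter

def mutual_information(docs, labels, token):
    """MI (in bits) between a token's presence and a document's class label."""
    n = len(docs)
    joint = Counter((token in doc, label) for doc, label in zip(docs, labels))
    presence = Counter(token in doc for doc in docs)
    classes = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (presence * class count)
        mi += p_xy * math.log2(count * n / (presence[x] * classes[y]))
    return mi

docs = [["earnings", "up"], ["earnings", "down"], ["rain", "up"], ["rain", "down"]]
labels = ["finance", "finance", "weather", "weather"]
# "earnings" perfectly predicts the class; "up" tells you nothing
```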
After exploratory data analysis and feature selection, the third part of text exploration is feature engineering. Numbers may be converted into "/number/" tags, and $$N$$-grams may be used to preserve short sequences of words.
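Both feature-engineering steps above can be sketched together: normalize numeric tokens into a "/number/" tag, then build $$N$$-grams over the result (the function name and sample tokens are hypothetical):

```python
import re

def engineer(tokens, n=2):
    """Replace numeric tokens with a "/number/" tag, then build n-grams."""
    normalized = ["/number/" if re.fullmatch(r"\d+(\.\d+)?", t) else t
                  for t in tokens]
    ngrams = [" ".join(normalized[i:i + n])
              for i in range(len(normalized) - n + 1)]
    return normalized, ngrams

# engineer(["price", "rose", "5", "percent"]) tags the "5" and
# produces bigrams like "rose /number/"
```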
Structured
Unstructured
Word length
Token frequency
Relevance rating
A token
A matrix
A document
Overfitting
Underfitting
Bond
Earnings
Categorically
Structured data exploration
Text wrangling
Spidering programs
Look it up
Just move on
Continue
