Methods

One thing I really like about NLP is that there are a lot of standard Experiments.
For example, the brown corpus has over a million words of text, and they've been marked up as parse trees. So, if I want to compare my parser to other parser, I just use Parseval on this corpus.
Another example is the MUCs where different systems complete on an unseen data set.
This prevents training on the test set.
Message Understanding Competitions, Summeval, Parseval, Semeval
Standard Information Retrieval Metrics of Precision, Recall and F-Measures
Precision is number-correct/number-guessed.
Recall is number-correct-guessed/total-correct
An F-Measure combines these two by, for instance, multiplying thus giving a good score.