Update stateless.md
In lines 111 and 126: a non-chosen action should have an action-value smaller than every possible reward value (because of the maximisation over returns). Sometimes it's nice to give negative rewards. I think -\infty is a better initial value than zero, but then there is a problem with updating Q_t at the beginning (caused by n = 0).
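A minimal sketch of what I mean, assuming the usual sample-average update; the variable names (`Q`, `N`, `update`) are illustrative and not taken from stateless.md:

```python
import numpy as np

n_actions = 10
Q = np.full(n_actions, -np.inf)   # unchosen actions rank below every possible reward
N = np.zeros(n_actions, dtype=int)

def update(action, reward):
    """Sample-average update, guarding the first visit so the -inf sentinel never enters the mean."""
    N[action] += 1
    if N[action] == 1:
        Q[action] = reward                                # overwrite -inf on the first pull
    else:
        Q[action] += (reward - Q[action]) / N[action]     # incremental mean for n > 1
```

Without the `N[action] == 1` guard, the incremental update would mix the -\infty sentinel into the average (or divide by zero in the plain sum/n form), which is the n = 0 problem mentioned above.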