Neural Logic Reinforcement Learning (NLRL) is an algorithm that combines logic programming with deep reinforcement learning. Logic programming expresses knowledge declaratively, in a form that does not depend on how it is executed, which makes programs more flexible, compact, and understandable. Knowledge is thereby separated from its use, i.e. the machine architecture can be changed without changing the programs or their underlying code.
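To make the separation of knowledge from use concrete, here is a minimal sketch in plain Python. The facts, the rules, and all names (`FACTS`, `RULES`, `forward_chain`) are illustrative assumptions and are not taken from any NLRL implementation. The knowledge is stored as data, and a generic inference routine that knows nothing about the domain derives new atoms from it.

```python
from itertools import product

# Knowledge: ground facts and rules, stored as plain data (illustrative only).
FACTS = {("on", "a", "b"), ("on", "b", "c")}
# Each rule is (head, body); uppercase terms are variables.
RULES = [
    (("above", "X", "Y"), [("on", "X", "Y")]),
    (("above", "X", "Z"), [("on", "X", "Y"), ("above", "Y", "Z")]),
]
CONSTANTS = ["a", "b", "c"]


def substitute(atom, binding):
    """Replace the variables of an atom with constants from the binding."""
    return (atom[0],) + tuple(binding.get(term, term) for term in atom[1:])


def forward_chain(facts, rules, constants):
    """Generic inference: apply every rule until no new atom is derived."""
    known = set(facts)
    while True:
        new = set()
        for head, body in rules:
            variables = sorted(
                {t for atom in [head] + body for t in atom[1:] if t.isupper()}
            )
            # Try every assignment of constants to the rule's variables.
            for values in product(constants, repeat=len(variables)):
                binding = dict(zip(variables, values))
                if all(substitute(atom, binding) in known for atom in body):
                    new.add(substitute(head, binding))
        if new <= known:
            return known
        known |= new


derived = forward_chain(FACTS, RULES, CONSTANTS)
```

Swapping `forward_chain` for a different inference strategy would leave `FACTS` and `RULES` untouched, which is the sense in which the knowledge is independent of the machinery that uses it.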

**Deep Reinforcement Learning (DRL) algorithms** are typically neither interpretable nor generalizable. They learn a solution rather than the reasoning that leads to it, and because the policy is encoded in a deep neural network, humans cannot readily understand how an answer was reached. This is a major drawback of DRL. A second drawback, shared with many ML and RL algorithms, is poor generalization: an agent trained in one specific environment can perform very badly once that environment is even slightly altered, so it does not transfer well to new domains. In real-world problems, however, the training and testing environments are not always the same, so generalizability is a necessary condition for an algorithm to perform well.

In NLRL, the agent must learn auxiliary invented predicates by itself, together with the action predicates. Base predicates state the facts that hold in the given examples and environment. The set of candidate clauses is generated by combining predicates, and a deduction matrix is built from the clauses whose combination of predicates satisfies all the constraints. To determine the truth value of each clause and arrive at the most suitable one, a trainable weight is assigned to each candidate clause of an intensional predicate rather than to the policy as a whole. Because these weights take part in the deduction process, they are the parameters to be trained. The resulting parameterized rule-based policy is trained with policy gradient, which shifts weight toward the clauses that lead to higher reward.

Each action is represented as an atom. The agent is initialized with a 0-1 valuation for the base predicates and random weights for all clauses of each intensional predicate. The values of the atoms are then updated by forward chaining. In each forward-chaining step, the values of all clauses are first computed for all combinations of constants using the deduction matrix; then the value of each intensional atom is updated according to a deduction function. This yields a value for every action atom, and the best action is chosen from these values as in any RL algorithm.
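The deduction step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the reference NLRL implementation: the atoms and candidate clauses are invented for a one-dimensional world, clause bodies are evaluated with a product (a soft AND), clause weights are normalised with a softmax, and old and newly deduced values are merged with a probabilistic sum (a soft OR). Training the weights by policy gradient is omitted.

```python
import math
import random

# Atoms of a toy world; all names here are illustrative assumptions.
ATOMS = ["wall_left", "wall_right", "goal_left", "goal_right", "left", "right"]
IDX = {name: i for i, name in enumerate(ATOMS)}

# Candidate clauses for each intensional (action) predicate. Each clause body
# is a conjunction of atoms, e.g. the first one reads: left :- goal_left.
CLAUSES = {
    "left": [["goal_left"], ["wall_right"], ["goal_right"]],
    "right": [["goal_right"], ["wall_left"], ["goal_left"]],
}


def softmax(ws):
    """Normalise raw clause weights into positive values that sum to 1."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]


def prob_sum(a, b):
    """Probabilistic sum, a soft logical OR for values in [0, 1]."""
    return a + b - a * b


def forward_chain(valuation, weights, steps=2):
    """Run `steps` rounds of weighted deduction and return the new valuation."""
    val = list(valuation)
    for _ in range(steps):
        nxt = list(val)
        for head, clauses in CLAUSES.items():
            # Clause value: soft AND (product) over the atoms in its body.
            clause_vals = [math.prod(val[IDX[a]] for a in body) for body in clauses]
            # Deduced value: clause values mixed by the softmaxed weights.
            deduced = sum(w * v for w, v in zip(softmax(weights[head]), clause_vals))
            nxt[IDX[head]] = prob_sum(val[IDX[head]], deduced)
        val = nxt
    return val


def action_probs(val):
    """Turn the values of the action atoms into a probability distribution."""
    scores = {a: val[IDX[a]] for a in CLAUSES}
    total = sum(scores.values())
    if total == 0.0:
        return {a: 1.0 / len(scores) for a in scores}
    return {a: s / total for a, s in scores.items()}


# Usage: random clause weights, 0-1 valuation for the base atoms.
rng = random.Random(0)
weights = {h: [rng.gauss(0.0, 1.0) for _ in cs] for h, cs in CLAUSES.items()}
valuation = [0.0] * len(ATOMS)
valuation[IDX["goal_right"]] = 1.0
probs = action_probs(forward_chain(valuation, weights))
action = rng.choices(list(probs), weights=list(probs.values()))[0]
```

Because the weights sit on individual clauses, inspecting the largest softmaxed weight for each action predicate shows which rule the policy currently relies on.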

The basic structure of NLRL is very similar to that of any deep RL algorithm, except that states and actions are represented as atoms and tuples. The state-to-atom conversion can be done either manually or through a neural network. These representations make the algorithm both generalizable and interpretable, because it learns the logic that leads to a solution rather than the solution itself, which would not generalize. Because the rules are still learned with neural-network methods, as in deep RL, the algorithm also remains robust to missing, misclassified, or otherwise incorrect data.
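A manual state-to-atom conversion might look like the following sketch for a square grid world. The predicate names (`current`, `zero`, `last`, `succ`) and the helper functions are assumptions chosen for illustration; a learned conversion would replace `state_to_atoms` with a neural network that outputs the same kind of valuation.

```python
from itertools import product


def state_to_atoms(x, y, size):
    """Map an agent position on a size x size grid to the atoms that hold."""
    atoms = {("current", x, y), ("zero", 0), ("last", size - 1)}
    # succ(i, i + 1) encodes the ordering of the coordinates.
    atoms |= {("succ", i, i + 1) for i in range(size - 1)}
    return atoms


def all_ground_atoms(size):
    """Enumerate every ground atom over the constants 0 .. size - 1."""
    cs = range(size)
    atoms = [("zero", c) for c in cs] + [("last", c) for c in cs]
    atoms += [("succ", a, b) for a, b in product(cs, cs)]
    atoms += [("current", a, b) for a, b in product(cs, cs)]
    return atoms


def to_valuation(state_atoms, ground_atoms):
    """0-1 valuation vector: 1.0 for atoms that hold in the state, else 0.0."""
    return [1.0 if atom in state_atoms else 0.0 for atom in ground_atoms]


ground = all_ground_atoms(3)
state = state_to_atoms(1, 2, 3)
vec = to_valuation(state, ground)
```

The same predicates describe a grid of any size, so rules written over them do not refer to a particular layout; only the set of constants changes.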