1 Introduction: Making E Stupid and Then Smart Again
Stateoftheart saturationbased automated theorem provers (ATPs) for firstorder logic (FOL), such as E and Vampire [12], employ the given clause algorithm [13], translating the input FOL problem (background theory and negated conjecture) into a refutationally equivalent set of clauses. The search for a contradiction is performed maintaining sets of processed () and unprocessed () clauses (the proof state ). The algorithm repeatedly selects a given clause from , moves to , and extends with all clauses inferred with and . This process continues until a contradiction is found, becomes empty, or a resource limit is reached.
Historically, term ordering, together with literal selection, is used to guarantee the completeness of the proof search [1] and to “tame the growth of the search space and help steer proof search” [5]. Term ordering ensures that rewriting happens in only one direction, toward smaller terms. Literal selection limits the inferences done with each given clause to the selected literals, which slows down the growth of the search space and reduces redundant inferences.
E includes a strategy language of clause evaluation functions
, made up of weight and priority functions, to heuristically guide the proof search. In this work, I use two algorithmically invented
[6, 7] strategies, E1 and E2^{1}^{1}1Strategies E1 and E2 are displayed in the appendix, that use many sophisticated clause evaluation functions, the KnuthBendix ordering (KBO6), literal selection, and other E heuristics.The ENIGMA [8, 9, 10, 4] system with the XGBoost [2] implementation of gradient boosted decision trees has recently demonstrated high capability to learn to guide the E [14] theorem prover’s inferences in realtime. ENIGMA uses the XGBoost model as a clause evaluation function to recommend clauses for selection based on clause and conjecture features. In particular, after several proving and learning iterations, its performance on the problems from the Mizar40 [11] benchmark improved by () [10] over the strategy E1 used for the initial proving phase.
In this work, E is stripped to the bare bones by disabling term ordering and literal selection. KBO6 is replaced with an identity relation as the minimal possible ordering (called IDEN – an addition to E ^{2}^{2}2The E version used in this paper can be found at https://github.com/zariuq/eprover/tree/identityorder, and the library for running ENIGMA with E can be found at https://github.com/zariuq/enigmatic.). While this frees E to do inferences in any order, E can no longer perform rewriting inferences. The strategy E1 is replaced with the simple combination of the clause weight and FIFO (first in first out) evaluation functions. E is thus practically reduced to a basic superposition prover, without advanced heuristics, rewriting, or completeness guarantees. We call this strategy E0:
definitionalcnf=24 preferinitialclauses tIDEN restrictliteralcomparisons WNoSelection H’(5*Clauseweight(ConstPrio,1,1,1),1*FIFOWeight(ConstPrio))’
E0 solves only 3872 of the Mizar40 problems in
seconds compared to 14526 for E1. The first research question is the extent to which ENIGMA with this basic prover can learn ATP guidance completely on its own. The second is to what extent ENIGMA’s learning can be boosted with data from strong strategies and models. That is, I explore how smart machine learning can become in this
zerostrategy setting. The more general related question is to what extent can machine learning replace the sophisticated humaninvented theoremproving body of wisdom used in today’s ATPs for restricting advanced proof calculi.2 Experiments
We evaluate ENIGMA with the basic strategy, E0, in several scenarios and over two datasets of different sizes. All experiments are run with seconds per problem^{3}^{3}3As a rule of thumb, E solves most problems within a few seconds or not for a very long time. ^{4}^{4}4All the experiments are run on the same hardware unless otherwise specified: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz with 188GB RAM..
ENIGMA has so far been used in two ways: coop combines the learned model with some standard E strategy equally () while solo only uses the learned model for choosing the given clauses. The best results have been achieved by MaLAReastyle [16] looping: that is, an ENIGMA model is trained and run with E (loop 0), then the resulting data are added to the initial training data and a new ENIGMA model is trained (loop 1).
In this work, ENIGMA trains with both solo and coop data. I present results from solo runs because they represent the most minimal setting.
2.1 Small Data (2000 problems)
The E evaluations and XGBoost training can take a long time on the full Mizar40 dataset, so 2000 randomly sampled problems are used to test metaparameters on. Each XGBoost model consists of decision trees of depth , the most important training metaparameters in addition to the learning rate (). In previous work with ENIGMA, and were fixed for all loops of learning. Here we try to vary the values of and during 16 loops. Let denote the experiment with specific and . Of the many protocols tested, the following are included in the plot of solved problems (above): Fives (), Nines (), Thirteens (), Sixteens ().
We also experiment with adaptively setting the metaparameters as the number of training examples increases according to the following protocols:

Inc () increases by 2 from 3 to 33 and keeps fixed.

32_inc () fixes and gradually increases from 50 to 250.

Inc2 () gradually decreases from 150 to 50, varying the value intuitively ^{5}^{5}5Precise details of intuitively set parameters can be seen in the appendix..

Inc3 () aims to be more systematic and steps from 50 to 250

Dec3() decreases from 250 to 50.
At the 16th loop Inc’s performance is best, solving 299 problems, doubling the performance of E0 (152). However Inc2 and Inc3 solve 298 problems and 32_inc solves 291 problems. The conclusion is that simple protocols work well so long as or is incremented adaptively rather than fixed.
2.2 Big Data (57880 problems)
These experiments are done on the large benchmark of Mizar40 [11] problems from the MPTP dataset [15]. E1 and E2 are two strong E strategies that solve 14526 and 12788 problems.

Experiment 2’s parameters were intuitively toggled during the looping as in Inc3, and a feature size of is used. Exp. 2 uses training data from E1 and E2 for additional guidance up to the 4th loop (and then stops including them in the training data based on the assumption they may confuse learning).

Experiment 3 sets and according to protocol Inc3. Exp. 3 only learns from E run with E0 and trains on the GPU, which requires the feature size to be reduced to .

Experiment 4 mimics Exp. 3 but uses E1 and E2 data for training (up to the 4th loop).

Experiment 5
further tests boosting with data from an E0 ENIGMA model that proved 9759 problems and an E1 that proved 21542 problems. Tree depth is intuitively varied among 32, 512, and 1000, the number of trees is varied among 2, 100, 200, and 32. The feature vector size starts at
and is decreased to allow the data to fit on the RAM, down to 32 ().
As seen in the figure, the strong model does not help much in guiding E without ordering or selection in Exp. 1. Exp. 2 learns gradually and catches up with Exp. 1, but seems to plateau around 10,000. Surprisingly the pure Exp. 3 learns fast with the small feature size, but plateaus and drops in performance (perhaps due to overfitting). Exp. 4 indicates that guidance is useful and surpasses E2 with 13805 in round 13. Exp. 5 solves 15990 problems, showing that ENIGMA can take E0 beyond the smart strategies with appropriate parameters and boosting. This is a great improvement over the 3872 problems solved by E0.
3 Conclusion
ENIGMA can learn to guide the E prover effectively even without smart strategies and term orderings. The models confer a increase over the naive E0 after 13 rounds of the proving/learning loop, and even trained without guidance data, a increase.
The experiments indicate that machine learning can be used to fully control an ATP’s guidance, learning to replace orderings, heuristic strategies, and deal with the increase in generated clauses without literal selection. However the combination of ENIGMA and standard ATP heuristics still significantly outperforms ENIGMA alone.
Given the large symmetrybreaking impact of these methods in classical ATP, future work includes, e.g., training the guidance in such a way that redundant (symmetric) inferences are not done by the trained model once it has committed to a certain path. This probably means equipping the learning with more history and knowledge of the proof state in the saturationstyle setting. ENIGMAWatch
[4] may aid with symmetry breaking by focusing the proof search on particular proof paths. Additional work is needed to isolate the factors in Exp. 5’s performance, and determine the most effective boosting methods in addition to increasing and with training loops. Ablation studies should be done to discover the impact of term ordering and literal selection individually on E and ENIGMA’s performance. Perhaps term ordering alone is sufficient to train good ENIGMA models.Running ENIGMA without term ordering and other restrictions is important because it may allow us to combine training data from different strategies, and it may allow ENIGMA to find novel proofs.
4 Acknowledgments
The research topic was proposed by Jan Jakubuv and Josef Urban, and further discussed with them, Martin Suda, and Thomas Tan. I also thank the AITP’20 anonymous referees for their comments on the first extended abstract of this work.
References
 [1] Leo Bachmair and Harald Ganzinger. RewriteBased Equational Theorem Proving with Selection and Simplification. Journal of Logic and Computation, 3(4):217–247, 1994.
 [2] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
 [3] Karel Chvalovský, Jan Jakubuv, Martin Suda, and Josef Urban. ENIGMANG: efficient neural and gradientboosted inference guidance for E. CoRR, abs/1903.03182, 2019.
 [4] Zarathustra Goertzel, Jan Jakubuv, and Josef Urban. Enigmawatch: Proofwatch meets ENIGMA. In Serenella Cerrito and Andrei Popescu, editors, Automated Reasoning with Analytic Tableaux and Related Methods  28th International Conference, TABLEAUX 2019, London, UK, September 35, 2019, Proceedings, volume 11714 of Lecture Notes in Computer Science, pages 374–388. Springer, 2019.
 [5] Kryštof Hoder, Giles Reger, Martin Suda, and Andrei Voronkov. Selecting the selection. In Nicola Olivetti and Ashish Tiwari, editors, Proc. of the 8th IJCAR, Coimbra, volume 9706 of LNAI, pages 313–329. Springer, 2016.
 [6] Jan Jakubův and Josef Urban. Hierarchical invention of theorem proving strategies. AI Commun., 31(3):237–250, 2018.
 [7] Jan Jakubuv and Josef Urban. Extending E prover with similarity based clause selection strategies. In Michael Kohlhase, Moa Johansson, Bruce R. Miller, Leonardo de Moura, and Frank Wm. Tompa, editors, Intelligent Computer Mathematics  9th International Conference, CICM 2016, Bialystok, Poland, July 2529, 2016, Proceedings, volume 9791 of Lecture Notes in Computer Science, pages 151–156. Springer, 2016.
 [8] Jan Jakubuv and Josef Urban. ENIGMA: efficient learningbased inference guiding machine. In Herman Geuvers, Matthew England, Osman Hasan, Florian Rabe, and Olaf Teschke, editors, Intelligent Computer Mathematics  10th International Conference, CICM 2017, Edinburgh, UK, July 1721, 2017, Proceedings, volume 10383 of Lecture Notes in Computer Science, pages 292–302. Springer, 2017.
 [9] Jan Jakubuv and Josef Urban. Enhancing ENIGMA given clause guidance. In Florian Rabe, William M. Farmer, Grant O. Passmore, and Abdou Youssef, editors, Intelligent Computer Mathematics  11th International Conference, CICM 2018, Hagenberg, Austria, August 1317, 2018, Proceedings, volume 11006 of Lecture Notes in Computer Science, pages 118–124. Springer, 2018.
 [10] Jan Jakubuv and Josef Urban. Hammering Mizar by learning clause guidance. In John Harrison, John O’Leary, and Andrew Tolmach, editors, 10th International Conference on Interactive Theorem Proving, ITP 2019, September 912, 2019, Portland, OR, USA, volume 141 of LIPIcs, pages 34:1–34:8. Schloss Dagstuhl  LeibnizZentrum für Informatik, 2019.
 [11] Cezary Kaliszyk and Josef Urban. MizAR 40 for Mizar 40. J. Autom. Reasoning, 55(3):245–256, 2015.
 [12] Laura Kovács and Andrei Voronkov. Firstorder theorem proving and Vampire. In Natasha Sharygina and Helmut Veith, editors, CAV, volume 8044 of LNCS, pages 1–35. Springer, 2013.
 [13] Ross A. Overbeek. A new class of automated theoremproving algorithms. J. ACM, 21(2):191–200, April 1974.
 [14] Stephan Schulz, Simon Cruanes, and Petar Vukmirovic. Faster, higher, stronger: E 2.3. In Pascal Fontaine, editor, Automated Deduction  CADE 27  27th International Conference on Automated Deduction, Natal, Brazil, August 2730, 2019, Proceedings, volume 11716 of Lecture Notes in Computer Science, pages 495–507. Springer, 2019.
 [15] Josef Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reasoning, 37(12):21–43, 2006.
 [16] Josef Urban, Geoff Sutcliffe, Petr Pudlák, and Jiří Vyskočil. MaLARea SG1  Machine Learner for Automated Reasoning with Semantic Guidance. In Alessandro Armando, Peter Baumgartner, and Gilles Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456. Springer, 2008.
Appendix 0.A Strategies
Strategy E1is:
definitionalcnf=24 splitaggressive simulparamod forwardcontextsr destructiveeraggressive destructiveer preferinitialclauses tKBO winvfreqrank c1 Ginvfreq F1 deletebadlimit=150000000 WSelectMaxLComplexAvoidPosPred H’(1*ConjectureTermPrefixWeight(DeferSOS,1,3,0.1,5,0,0.1,1,4), 1*ConjectureTermPrefixWeight(DeferSOS,1,3,0.5,100,0,0.2,0.2,4), 1*Refinedweight(PreferWatchlist,4,300,4,4,0.7), 1*RelevanceLevelWeight2(PreferProcessed,0,1,2,1,1,1,200,200,2.5,9999.9,9999.9), 1*StaggeredWeight(DeferSOS,1), 1*SymbolTypeweight(DeferSOS,18,7,2,5,9999.9,2,1.5), 2*Clauseweight(PreferWatchlist,20,9999,4), 2*ConjectureSymbolWeight(DeferSOS,9999,20,50,1,50,3,3,0.5), 2*StaggeredWeight(DeferSOS,2))’
Strategy E2 is:
definitionalcnf=24 splitaggressive splitreusedefs simulparamod forwardcontextsr destructiveeraggressive destructiveer preferinitialclauses tKBO winvfreqrank c1 Ginvfreq F1 deletebadlimit=150000000 WSelectMaxLComplexAvoidPosPred H’( 3*ConjectureRelativeSymbolWeight(PreferUnitGroundGoals,0.1,100,100,50,100,0.3,1.5,1.5), 4*FIFOWeight(PreferNonGoals), 5*RelevanceLevelWeight2(ConstPrio,1,0,2,1,50,2,2,100,0.2,3,4))’
Appendix 0.B Additional Protocol Details
In this section I include the details for intuitively toggled protocols.
Protocol Inc2 is as follows:
0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  

Depth  3  5  7  9  11  13  15  17  19  21  23  25  27  29  31  33 
Trees  150  150  150  100  100  100  75  50  75  100  150  75  100  150  75  100 
The protocol for Exp. 2 is as follows:
0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  

D  4  5  6  7  8  9  10  11  12  13  14  15  16  16  32  9  16  32  64  24  25  32 
T  50  150  160  170  180  190  200  200  200  200  210  220  225  225  225  300  300  225  150  250  250  250 
The protocol for Exp .5 requires some explanation. The motivation is to see how far E0 can be taken, even if the methods are too CPUintensive for a thorough grid search.
Exp. 2 and Exp. 4 demonstrate the utility of boosting. Thus to create better boosting data I trained ENIGMA for 10 loops with strategies E1 through E12 and used this as boosting data for the first 4 of 10 loops of training. In addition to training E0, and in the spirit of ablation studies, I also trained ENIGMA models for E0 with KBO ordering (and no literal selection) and for E0 with KBO ordering and restricted literal comparisons. The motivation is that these versions may serve as a bridge between standard E and the basic E0.
Then I used these results to boost an ENIGMA model in loop 0, and trained based on this for 10 loops, proving 9759 problems.
Finally this data and the data from a loop 3 ENIGMA model trained with E1 is used to boost E0 with the following metaparameters:
0  1  2  3  4  5  6  7  8  9  10  11  

Depth  512  512  32  1000  32  1000  32  1000  32  1000  1000  100 
Trees  2  2  100  100  200  100  200  32  300  32  32  32 
Feature Size  16384  8192  4096  28  4096  28  4096  32  2048  64  32  128 
Comments
There are no comments yet.