Equivalent alkane carbon number of crude oils: A predictive model based on machine learning

In this work, we present the development of models for the prediction of the Equivalent Alkane Carbon Number of a dead oil (EACNdo) usable in the context of Enhanced Oil Recovery (EOR) processes. Models were constructed by means of data mining tools. To that end, we collected 29 crude oil samples originating from around the world. Each of these crude oils has been experimentally analysed, and we measured properties such as EACNdo, American Petroleum Institute (API) gravity, and C20−, saturate, aromatic, resin, and asphaltene fractions. All this information was compiled into a database. Evolutionary Algorithms (EA) were applied to the database to derive models able to predict the Equivalent Alkane Carbon Number (EACN) of a crude oil. The developed correlations returned EACNdo values in agreement with reference experimental data. The models were then used to feed a thermodynamics-based model able to estimate the EACN of a live oil. Application of this strategy to case studies demonstrated that combining these two models is a relevant tool for fast and accurate estimates of live crude oil EACNs.


Introduction
Microemulsions are commonly encountered in many products or formulations dedicated to various domains such as pharmaceuticals, cosmetics, or petroleum applications. Among the latter, the crude oil extracted after applying primary and secondary recovery methods can be roughly estimated at half of the initial oil reservoir content, depending on the field considered [1]. The development of tertiary recovery methods, known as Enhanced Oil Recovery (EOR), has gained interest, especially with the increase of crude oil prices [2]. The Chemical Enhanced Oil Recovery (cEOR) technique, involving combinations of alkali, surfactants and/or polymers, aims at decreasing the water/oil Interfacial Tension (IFT) in order to mobilize the oil trapped by capillary forces. Optimizing these combinations or formulations to form an efficient microemulsion is a challenging and time-consuming task, considering that each potentially eligible reservoir exhibits different conditions such as oil composition, brine salinity and hardness, pressure, and temperature.
The Hydrophilic-Lipophilic Deviation (HLD) concept as proposed by Salager has been applied in numerous studies to mimic the phase behavior of {brine/surfactant/oil} systems [3,4]. When the HLD value is zero, the Salager relation linearly correlates the optimal salinity S* (through the logarithm of the optimum salinity in g/L) with the alcohol amount and type (f(A)), the temperature (T), and the Equivalent Alkane Carbon Number (EACN), see equation (1) [5]:

$$\ln S^{*} = K \cdot \mathrm{EACN} + f(A) - Cc + \alpha\,(T - T_{\mathrm{ref}}) \qquad (1)$$

where Tref is set to 298.15 K, α is a temperature coefficient, and Cc (the characteristic curvature) and K are parameters related to the surfactant chemistry [6]. The concept of EACN is commonly considered during surfactant formulation design. It assumes that the EACN of an oil is equal to the number of carbon atoms of the n-alkane exhibiting a similar phase behavior. The EACN of the dead oil (EACNdo), i.e. the oil degassed at standard conditions, is experimentally determined by means of test tubes and salinity scans. The procedure consists in identifying the n-alkane for which the {brine/surfactant/n-alkane} phase behavior matches that of the {brine/surfactant/dead crude oil} system [7,8]. However, depending on the crude oil composition, several weeks may be necessary to reach thermodynamic equilibrium. Bouton et al. proposed a Quantitative Structure Property Relationship (QSPR) for the prediction of the EACN of hydrocarbons by means of two theoretical descriptors, i.e., the average negative softness and the Kier A3 [9], with EACN values between −4 and +35. More recently, Lukowicz et al. proposed a QSPR based on COSMO-RS σ-moments to predict the EACN of polar hydrocarbon oils [10] and then extended their model to the case of aprotic polar oils [11]. To determine the EACN of hydrocarbon mixtures, Cayias et al. [12] and Cash et al. [13] proposed the use of a mixing rule in which individual hydrocarbon EACNs are weighted according to the corresponding mole fractions. In the case of live oils, i.e. oils containing dissolved gases at specific temperature and pressure conditions, we recently proposed an approach to predict the EACN of live oil (EACNlo) on the basis of volumetric fractions of oil and gas [14]. Indeed, Marliere et al. have experimentally shown that the EACN varies linearly with the gas volumetric fraction [7]. Our EACNlo model requires a priori knowledge of the gas composition, the gas to oil ratio, the temperature and pressure conditions, and the EACN of the dead oil. 
The volumetric fractions of light hydrocarbons are estimated using the Soave-Redlich-Kwong (SRK) equation of state [15] with the volume correction proposed by Péneloux et al. [16]. Our model for the prediction of live crude oil EACN [14] would gain in relevance if methods were available to predict crude oil EACNdo. A crude oil contains thousands of diverse chemicals and its exact composition is never known. As a consequence, the combined use of the above-mentioned models and mixing rules to predict EACNdo is unrealistic.
During the past decade, we considered the use of data mining based approaches to extract information from databases and predict properties of complex fluids [17]. These approaches known by the acronym QSPR aim at identifying non-obvious correlations between property values of the matter and some features rendering information about the matter [18,19]. In this work, we propose (i) the creation of a database containing experimental EACNdo values as well as results of experimental analysis for a series of crude oils, (ii) the application of machine learning methods to derive models for the prediction of crude oil EACN values, and (iii) the use of developed models for the prediction of live oil EACN for a set of crude oils. The article is organized as follows: the next section deals with materials and methods and gives all details regarding the database creation and methods used to generate models, a subsequent section presents the predictive performance of models and an application of generated models to predict live oil EACN, and the paper ends with concluding remarks.

Experimental data and database creation
Wan et al. proposed a review and comparisons of applicable approaches to experimentally determine the EACN of a dead crude oil [20]. As detailed in previous works [7,8], all EACNdo values reported hereafter were obtained following the method referred to as the direct method by Wan et al. This method is mainly based on the use of equation (1) and consists in establishing several phase diagrams for a {brine/surfactant/crude oil} system while varying the salt concentration, i.e. a salinity scan. The S* for the {brine/surfactant/crude oil} system is reached when the phase diagram exhibits an equal distribution of the microemulsion between the oil and the aqueous phase. The so-obtained salinity scan and the S* value are then compared to those previously obtained under similar conditions (surfactant formulation, temperature, brine composition, etc.) in the case of linear alkanes such as n-decane (C10), n-dodecane (C12) and n-tetradecane (C14). The EACNdo value for the crude oil is finally determined by solving equation (1) using the K and Cc parameters obtained for C10, C12 and C14. This analytical methodology has been applied to 29 crude oils from around the world, and the obtained EACNdo values are reported in Table 1. EACNdo values for the crude oils of interest lie between 1.2 and 18.0. The experimental determination of EACNdo for crude oils requires days to weeks to reach thermodynamic equilibrium, depending on whether they are light or heavy. The interest of a fast and accurate theoretical method to predict the EACNdo of a crude oil thus becomes evident, provided that its input parameters can be easily determined.
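The direct method described above can be illustrated with a short Python sketch. It assumes that, for a fixed surfactant formulation, temperature, and alcohol content, equation (1) reduces to a straight line between ln S* and the alkane carbon number, so that the reference n-alkanes define a slope and an intercept that can be inverted for the crude oil. The function names and all salinity values below are hypothetical and synthetic; they are not measurements from this work.

```python
import numpy as np

def fit_reference_alkanes(acn, s_opt):
    """Fit ln(S*) = slope * ACN + intercept on reference n-alkanes.

    Assumption: surfactant formulation, temperature and alcohol content
    are identical for all scans, so only the ACN varies.
    """
    acn = np.asarray(acn, dtype=float)
    ln_s = np.log(np.asarray(s_opt, dtype=float))
    slope, intercept = np.polyfit(acn, ln_s, 1)
    return slope, intercept

def eacn_from_salinity(s_opt_crude, slope, intercept):
    """Invert the fitted line to get the EACN matching a crude oil S*."""
    return (np.log(s_opt_crude) - intercept) / slope

# Synthetic optimal salinities (g/L) for C10, C12 and C14, generated from an
# arbitrary straight line purely for illustration.
acn_ref = np.array([10.0, 12.0, 14.0])
s_opt_ref = np.exp(1.5 + 0.15 * acn_ref)

slope, intercept = fit_reference_alkanes(acn_ref, s_opt_ref)

# Synthetic optimal salinity of a hypothetical dead crude oil.
s_opt_crude = np.exp(1.5 + 0.15 * 11.0)
eacn_crude = eacn_from_salinity(s_opt_crude, slope, intercept)
print(f"slope = {slope:.3f}, EACNdo = {eacn_crude:.2f}")
```

With real scans, the reference points would not fall exactly on a line, and the scatter around the fit would give a first indication of the uncertainty on the resulting EACNdo.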
The American Petroleum Institute (API) gravity measures whether a crude oil is lighter or heavier than water. It is defined using the following expression:

$$\mathrm{API} = \frac{141.5}{\mathrm{SG}} - 131.5 \qquad (2)$$

with the Specific Gravity SG = ρ_crude oil / ρ_water, where ρ_crude oil and ρ_water denote the density of the crude oil and the density of water, respectively, at 15.5°C (60°F). Densities for the 29 crude oils were measured using an Anton Paar density meter (model DMA™ 4500 M) including an oscillating U-tube sensor, and the uncertainty associated with density measurements is 0.1%. API gravity values were then calculated using equation (2). Table 1 presents API gravity values for crude oils #01 to #29. It reveals that the set of crude oils covers a broad range of API gravity values, from 11 to 50, denoting heavy and light crude oils, respectively. It is interesting to note that, in this set of crude oils, crude oil #19 appears as an outlier in terms of API gravity.
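As a minimal sketch, the conversion from a measured density to API gravity can be written as follows. The standard definition API = 141.5/SG − 131.5 is used; the default water density (about 999.0 kg/m³ near 15.5°C) and the example oil densities are assumptions introduced for illustration only.

```python
def api_gravity(rho_oil, rho_water=999.0):
    """Return the API gravity from densities in kg/m^3.

    rho_water defaults to an approximate water density near 15.5 degC
    (assumed value; use the measured or tabulated one in practice).
    """
    if rho_oil <= 0 or rho_water <= 0:
        raise ValueError("densities must be positive")
    specific_gravity = rho_oil / rho_water
    return 141.5 / specific_gravity - 131.5

# Hypothetical oil densities (kg/m^3), not taken from Table 1.
for rho in (780.0, 850.0, 990.0):
    print(f"rho = {rho:.0f} kg/m^3 -> API = {api_gravity(rho):.1f}")
```

An oil with the same density as water has an API gravity of 10, and lighter oils have higher values, which is consistent with the heavy-to-light reading of the 11 to 50 range given above.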
The Saturates, Aromatics, Resins, Asphaltenes (SARA) analysis is a fractionation-based method to characterize the crude oil content in terms of saturates, aromatics, resins, and asphaltenes [21]. The basic idea is to divide the crude oil into smaller fractions by exploiting the solubilities of oil components in solvents such as linear alkanes. Different SARA methodologies have been described, varying for instance the n-alkane used, i.e. n-pentane or n-heptane, which changes the amount of precipitated asphaltenes [22]. All considered crude oils were characterized using a SARA analysis similar to that used by Behar et al. [23], and the analytical procedure can be briefly described as follows: Each crude oil is dissolved in n-pentane at 43°C, and the resulting solution is filtered (Durapore® membrane in polyvinylidene fluoride with 0.45 µm pore size) to separate by-products from other crude oil components. By-products are treated by adding dichloromethane at the same temperature to recover the precipitated asphaltenes, and fractions are weighed during the entire procedure. An aliquot of the n-pentane rich solution is analyzed by Gas Chromatography (GC) to quantify the C20− fraction, i.e. the crude oil fraction containing compounds with fewer than 20 carbon atoms. Another aliquot is evaporated and separated using Medium Pressure Liquid Chromatography (MPLC) to characterize the weight amounts of saturates, aromatics, and resins. All fractions are then normalized according to the measured masses, and the crude oil is assumed to be the blend of C20−, saturates, aromatics, resins, and asphaltenes. Table 1 presents results of the SARA analysis for the 29 considered crude oils. Fractions of saturates, aromatics, resins, and asphaltenes are determined with associated uncertainties of 2%wt, in agreement with conclusions drawn by Aske et al. [24]. 
Experiments necessary to determine C20−, saturate, aromatic, resin, and asphaltene fractions require approximately 4 days, whatever the nature of the oil. Note that fractions of saturates obtained with different SARA methodologies should be similar, and that the Aro., Res. or Asp. fraction determined with one SARA analysis should correlate with the corresponding fraction obtained from a different SARA methodology. These assumptions result from comparisons of data reported in Table 1 with fractions determined using another SARA methodology (not shown here). From data reported in Table 1, it is interesting to note that, in this set of crude oils, crude oil #02 appears as an outlier in terms of C20− and SARA fractions.

Data sets
The accuracy of a predictive Quantitative Property Property Relationship (QPPR), similarly to a QSPR, is in part related to the quality of the data; hence the quality of the database is a keystone for the success of such modeling works. Possible correlations between descriptors have been investigated by generating a correlation matrix by means of the Materials Studio software [31]. No evidence of highly correlated descriptors has been found, and the highest values in the correlation matrix were obtained for the pairs Res. with Aro. and Res. with C20−. Fan and Buckley proposed, on the basis of six medium-gravity dead crude oils (with API gravity values from 22.6 to 37.2), a relation between API gravity and SARA fractions [25]. No evidence of such a relation has been found in our data set. The best correlation found between API gravity and SARA fractions has a low coefficient of determination (R² ≈ 0.4). Data presented in Table 1 have been used to constitute our database.
As a preprocessing of the data, we performed a Principal Component Analysis (PCA) on the API gravity, C20− and SARA fractions measured for the 29 crude oil samples. Figure 1 presents projections of the 29 crude oils in the space formed by the three main principal components resulting from the PCA. The diagram thus provides a representation of the distribution of crude oils in the chemical space of our database. This distribution reveals that one crude oil (crude oil #02) is isolated from all other samples, which confirms crude oil #02 as an outlier. In Figure 1, each symbol is shaded according to its EACNdo value, and there is no obvious relation between the location of a crude oil on the diagram and the value of its EACNdo.
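The projection step can be sketched in Python with a PCA computed from a singular value decomposition. This is only an illustration: the descriptors are standardized before the decomposition, which is an assumption (the scaling applied in this work is not stated), and the random matrix below is a stand-in for the 29 × 6 descriptor table, not the data of Table 1.

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Project samples onto the leading principal components.

    Columns are centered and scaled to unit variance (assumed
    preprocessing). Returns the scores and the fraction of variance
    explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Placeholder descriptor matrix: 29 samples x 6 descriptors
# (API gravity, C20-, Sat., Aro., Res., Asp.), filled with random numbers.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(29, 6))

scores, explained = pca_scores(X_demo, n_components=3)
print(scores.shape, explained.round(3))
```

Plotting the three columns of `scores` against each other, with a marker shade driven by EACNdo, would give a diagram of the kind described for Figure 1, in which an isolated sample stands out as a point far from the main cloud.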
External validation has been shown to be necessary to assess a model's robustness when predicting property values for new compounds, meaning candidates not used during the model development [32]. One of the popular methods is the n-fold Cross-Validation (n-CV), in which the data set is randomly divided into n approximately equal portions, the leave-one-out being its extreme version with n equal to the number of samples in the database. An aggregate of (n−1) portions forms the Training set used to optimize predictive models, the remaining portion constituting the Test set. We emphasize that no data point belonging to external sets is used to derive models. This procedure is repeated n times, choosing at each new fold another portion of data as the Test set. From conclusions drawn during the preprocessing of the database, we chose to impose crude oils #01 and #02 in all Training sets in order to keep EACNdo ranges constant. The 27 remaining crude oils were randomly distributed into nine portions; therefore the Training sets and Test sets represent 90% and 10% of the database, respectively.
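The splitting scheme described above can be sketched as follows. The function name, the random seed, and the resulting assignment of samples to folds are hypothetical; only the constraints (samples #01 and #02 fixed in every Training set, 27 remaining samples spread over nine portions) come from the text.

```python
import random

def make_folds(sample_ids, fixed_train, n_folds, seed=0):
    """Return a list of (training_ids, test_ids) pairs.

    Samples in fixed_train are kept in every Training set; the remaining
    samples are shuffled and dealt into n_folds Test portions.
    """
    rng = random.Random(seed)
    pool = [s for s in sample_ids if s not in fixed_train]
    rng.shuffle(pool)
    portions = [pool[i::n_folds] for i in range(n_folds)]
    folds = []
    for test in portions:
        train = sorted(fixed_train) + sorted(s for s in pool if s not in test)
        folds.append((train, sorted(test)))
    return folds

samples = list(range(1, 30))           # crude oils #01 to #29
folds = make_folds(samples, fixed_train={1, 2}, n_folds=9)

for k, (train, test) in enumerate(folds, start=1):
    print(f"Fold-{k:02d}: {len(train)} training samples, test = {test}")
```

Each fold then contains 26 training samples and 3 test samples, i.e. roughly 90% and 10% of the 29 crude oils, and every non-fixed sample appears in exactly one Test set.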
Performances of models are evaluated on both Training and Test sets by calculating values for statistical indicators such as the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE), R², and the Concordance Correlation Coefficient (CCC) [33]. Chirico and Gramatica have shown that the use of the latter coefficient is advisable considering various scenarios such as location shifts, scale shifts, and location plus scale shifts [34,35].
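A possible implementation of these indicators is sketched below. The exact definitions used in this work are not spelled out, so two assumptions are made: R² is taken as the coefficient of determination 1 − SS_res/SS_tot, and the CCC follows Lin's formula with population (1/n) variances and covariance. The example values are arbitrary and illustrate a pure location shift.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2 and CCC for paired reference/predicted values."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    err = p - y
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # Coefficient of determination (assumed definition).
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    # Lin's concordance correlation coefficient (population moments).
    cov = np.mean((y - y.mean()) * (p - p.mean()))
    ccc = 2.0 * cov / (y.var() + p.var() + (y.mean() - p.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "CCC": ccc}

# Arbitrary example: predictions shifted by one unit (location shift).
y_ref = [1.0, 2.0, 3.0, 4.0]
y_shifted = [2.0, 3.0, 4.0, 5.0]
metrics = regression_metrics(y_ref, y_shifted)
print({name: round(value, 3) for name, value in metrics.items()})
```

In this example the predictions are perfectly correlated with the reference values, yet the CCC drops to about 0.71 because of the constant offset, which is the kind of situation where a concordance measure is more informative than a correlation coefficient.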

Machine learning methods
The application of machine learning methods for thermophysical property predictions has been the subject of numerous past and ongoing research works of ours [17]. From comparisons drawn in these previous studies, we have observed that Support Vector Machine (SVM) based models outperform other evaluated learning algorithms such as neural networks, partial least squares, and Genetic Algorithms (GA). However, the number of data points in this work is quite small, and we hereafter search for explicit multilinear equations that can be easily implemented in a spreadsheet. One possible way to generate multilinear models is to use Evolutionary Algorithms (EA). EA are based on the Darwinian theory of evolution of biological species in nature, and they represent promising methods for optimization problems. When applied to regression problems, the resolution consists in the iterative evolution of a population of equations, initially randomly generated and respecting the following general form:

$$\mathrm{EACNdo} = k_0 + \sum_{i=1}^{N} k_i\, G_i \qquad (3)$$

where k0 is the intercept, ki denotes a weight associated with the gene i (Gi), and N is the total number of genes in the model. To derive models, two approaches were tested, differing in the level of information contained in genes.
In the first approach, a gene (see the proposed schematic representation in Fig. 2) consists of a tree built by combining descriptors (API gravity, fractions of C20−, Sat., Aro., Res., and Asp.) and mathematical functions (see Tab. 2) that make it possible to capture non-linearity in property variation. Multi-Gene Genetic Programming (MGGP) based models were generated using the Genetic Programming Toolbox for the Identification of Physical Systems (GPTIPS) coded in the MATLAB environment [36][37][38]. We applied the tournament method to select individuals in the population of equations on the basis of their fitness and complexity. We fixed the tournament size to 25, corresponding to 10% of the population size. Generations are constructed by survival of fitter individuals, and reproduction of individuals consists in applying crossover as well as mutation operations to produce child equations. The iterative procedure ends when one of the fixed criteria, such as the maximum number of generations or the best fitness value, is reached. Note that during the iterative procedure, the structure of trees evolves through crossover and mutation operations applied to sub-tree elements. Clearly, the maximum numbers of genes and nodes per tree must be limited to prevent overfitting problems. Additionally, the maximum numbers of generations and runs (repetitions of the calculation) should be optimized to ensure convergence of calculations for reasonable computational resources. Table 2 reports details about values and/or ranges of GPTIPS settings investigated in this work.
In the second approach, a gene (see Eq. (3)) simply stands for one of the descriptors. A GA based variable selection method was followed, and the Genetic Function Approximation (GFA) as implemented in the Materials Studio software was used to build multilinear models [31]. The GFA procedure consists in iterations of selection, crossover, and mutation operations, coupled with an objective criterion such as R² in order to extract the best fitting models. In this work, the adjusted R² was used as the objective criterion. The initial population (i.e., initial number of equations) was set to 6, and the maximum generation number to 50 000. This procedure was performed on each of the nine Training sets, noting that the same decompositions (folds, Training, and Test sets) are used during GFA and GPTIPS based procedures.

Development of QPPR models
Fig. 2. Schematic representation of a tree. Function stands for a mathematical operator belonging to the function set (see Tab. 2), for instance +, −, × or ÷. Descriptor denotes either API gravity, fractions of C20−, Sat., Aro., Res., or Asp.

In this section, we report various QPPR models to predict the EACN of a dead crude oil knowing a series of experimental data such as API gravity, C20− fraction, and fractions of saturates, aromatics, resins, and asphaltenes, see Table 1. Two machine learning methods based on EA were used: GFA and GPTIPS. During the development of GFA based models, the maximum number of variables allowed to form equations was set to four. This value meets the statistical criterion n ≥ 4k, where k and n are the number of variables in the model and the number of data points in the Training set, respectively [39]. GFA based models were optimized following a 9-fold cross-validation procedure. Samples #01 and #02 are fixed in Training sets, and therefore each of the nine folds contains three randomly selected crude oils. Equation (4) presents the so-obtained model. This model is composed of three descriptors weighted by ki coefficients, and a constant k0, as follows:

$$\mathrm{EACNdo} = k_0 + k_1\,[\mathrm{C}_{20}^{-}] + k_2\,[\mathrm{Sat.}] + k_3\,[\mathrm{Asp.}] \qquad (4)$$
ki values obtained considering successively each fold as external data (not used during the optimization procedure) are reported in Table 3. Equation (4) is in line with physical intuition. For instance, the negative value of k1 agrees with the fact that EACNdo should decrease with increasing C20− fractions (the light fraction of the crude oil). The opposite effect is observed through the positive values of k2 and k3, i.e. EACNdo values should increase with Sat. and Asp. fractions (heavy fractions of the crude oil). Note that the value obtained for k0 roughly corresponds to the average of EACNdo values in the database (12.9 points of EACN). Table 4 presents performance characteristics for equation (4), calculated as follows: (i) For all crude oils belonging to Test sets and corresponding ki coefficients (see Tab. 3). For instance, for crude oils belonging to Fold-01 we used k0 = 13.761, k1 = −0.120, k2 = 0.141, and k3 = 0.112. Performance characteristics reported in Table 4 thus reflect the predictive capabilities of GFA based models. (ii) For all crude oils and average parameters (see Tab. 3). In this case, no conclusion can be drawn regarding the predictive capabilities of the GFA based model as all crude oils have been indirectly involved in the learning procedure. Figure 3 presents scatterplots of experimental EACNdo versus predicted EACNdo values using equation (4) with either parameters associated with each fold in the Test set, or average parameters. The diagram exhibits no large difference between the two sets of parameters in terms of predicted EACNdo values. Values taken by the statistical indicators reported in Table 4 quantitatively confirm this observation. Note that EACNdo values for samples fixed in Training sets (i.e., 18.0 and 1.2 for crude oils #01 and #02, respectively) can only be predicted using average parameters, and that the GFA based model fails in predicting the EACNdo value for crude oil #01.
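To make the use of equation (4) concrete, the following Python sketch evaluates the multilinear model with the Fold-01 coefficients quoted above. The pairing of k1, k2, and k3 with the C20−, Sat., and Asp. fractions is read from the discussion of the signs of the coefficients, and fractions are assumed to be expressed in weight percent. The composition used in the example is hypothetical and does not correspond to any crude oil of Table 1.

```python
# Fold-01 coefficients quoted in the text for equation (4).
FOLD01 = {"k0": 13.761, "k1": -0.120, "k2": 0.141, "k3": 0.112}

def eacn_do_gfa(c20_minus, sat, asp, coeffs=FOLD01):
    """Evaluate the multilinear GFA-type model.

    c20_minus, sat and asp are the C20-, saturate and asphaltene
    fractions, assumed to be given in weight percent.
    """
    return (coeffs["k0"]
            + coeffs["k1"] * c20_minus
            + coeffs["k2"] * sat
            + coeffs["k3"] * asp)

# Hypothetical composition (wt%), for illustration only.
example = eacn_do_gfa(c20_minus=40.0, sat=20.0, asp=2.0)
print(f"Predicted EACNdo = {example:.3f}")
```

Because the model is linear, each additional weight percent of C20− lowers the predicted EACNdo by 0.120 with these coefficients, while each additional weight percent of saturates or asphaltenes raises it by 0.141 or 0.112, respectively.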
Table 3. Parameter values for the GFA model presented in equation (4), determined using each fold as external data (Test set). <> stands for the average of parameter values taken over the nine folds.

Genetic Programming (GP) is a generalization of GA. The main difference between GP and GA is that models obtained by the latter are strings of weighted descriptors, while the former returns tree expressions in which a node is either a descriptor, a mathematical function or a coefficient. MGGP can be seen as a combination of GA and GP, as MGGP returns strings of genes, each gene having a tree structure [40]. We have [...] [40,42]. For our problem, the maximum number of genes has been chosen according to the statistical criterion n ≥ 4k, where k and n are the number of variables (i.e. genes) in the model and the number of data points in the Training set, respectively. Considering both the database content and the Training/Test splitting, a reasonable value for the maximum number of genes is 6. Regarding settings dedicated to the tree structure, the maximum number of nodes per tree and the maximum depth of a tree have been set to 12 and 4, respectively. For settings related to convergence of calculations, the maximum number of generations and the maximum number of runs have been set to 2000 and 40, respectively. In order to allow a wide variety of nonlinear models, we have chosen a large number of mathematical operators such as addition, subtraction, multiplication, division, square root, exponential, and logarithm. We propose an approach to optimize the numbers of runs, generations, genes, and nodes that will further be used to develop a model. The convergence matter was first addressed, and the trade-off between convergence, computational time, and accuracy was treated as follows: (i) The numbers of genes and nodes were set to their respective maximum possible values in order to generate models with the highest complexity. 
We then compared model performances by exploring the space formed by the number of generations, discretized as 100, 500, 1000, and 2000, and by the number of runs, discretized as 1, 5, 10, 15, 20, 25, 30, and 40, as indicated in Table 2. Performances were evaluated using the RMSE statistical indicator calculated over all samples in the dataset. No division into Training and Test sets was considered in this step, as the idea is to roughly select appropriate numbers of runs and generations leading to reasonable accuracy and computational time. Figure 4 presents the power law evolution of model performance as a function of the numbers of runs and generations. Clearly, models perform poorly when optimized with only 100 generations. Considering 500, 1000 or 2000 generations leads to roughly similar performances after 20 runs. However, the deviation between raw data and the corresponding power law is, in the case of 2000 generations, twice that obtained for 500 or 1000 generations (not shown here). Contour plots in Figure 4 indicate that, for a similar computational cost, equivalent performances are obtained with 500 or 1000 generations. The consideration of 30 runs seemed relevant to ensure the convergence of calculations, and in order to limit the computational time we selected 500 generations. (ii) Numbers of generations and runs were respectively set to 500 and 30, and models were developed over the whole space formed by the number of genes, discretized as 1, 2, 3, 4, 5, and 6, and by the number of nodes, discretized as 2, 4, 6, 8, 10, and 12, as indicated in Table 2. Contrary to step (i), Training and Test sets were considered here, using the nine folds previously generated for the GFA based model development. For each point of the space, sums of squared errors were calculated for Test sets, and the optimum numbers of genes and nodes were determined by minimizing the total sum of squared errors on Test sets. 
The optimization procedure led to a number of nodes of 8 and a number of genes of 4.
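Step (ii) of the procedure amounts to a grid search in which each (genes, nodes) pair is scored by the total sum of squared errors over the nine Test sets. The Python sketch below shows that selection logic only. The function `toy_sse` is a hand-built stand-in for training and testing a GPTIPS model on each fold; it was constructed so that its minimum sits at the values reported in the text and does not reproduce any result of this work.

```python
from itertools import product

def select_genes_nodes(sse_per_fold, genes_grid, nodes_grid):
    """Return (total_sse, genes, nodes) minimizing the summed Test-set SSE.

    sse_per_fold(genes, nodes) must return one sum of squared errors per
    fold, each computed on the Test set of that fold.
    """
    best = None
    for genes, nodes in product(genes_grid, nodes_grid):
        total = sum(sse_per_fold(genes, nodes))
        if best is None or total < best[0]:
            best = (total, genes, nodes)
    return best

def toy_sse(genes, nodes, n_folds=9):
    """Hypothetical error surface standing in for real model training."""
    value = (genes - 4) ** 2 + 0.25 * (nodes - 8) ** 2 + 1.0
    return [value] * n_folds

best = select_genes_nodes(toy_sse, range(1, 7), range(2, 13, 2))
print(f"total SSE = {best[0]:.1f}, genes = {best[1]}, nodes = {best[2]}")
```

In an actual run, `sse_per_fold` would wrap the model optimization with the chosen numbers of generations and runs, so the cost of the search is the number of grid points (36 here) times the cost of nine model developments.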
The proposed optimization procedure applied to our regression problem led to numbers of runs, generations, genes, and nodes of 30, 500, 4, and 8, respectively. Nine GPTIPS based models were optimized following a 9-fold cross-validation procedure, noting that the nine folds generated for the GFA based model development have been reused. Performances of models are presented in Table 5. All models outperform GFA based models, and exhibit RMSE values between 0.97 and 1.40. The model that best generalizes the database has been developed using Fold-08 as Test set. Details about this latter model, such as the four weighted genes and the intercept, are presented in equation (5). Clearly, each gene contributes non-linearly to the predicted EACNdo value, and the model involves all descriptors except the C20− fraction. Figure 5 presents the scatterplot of experimental EACNdo versus predicted EACNdo values using equation (5). Data points are less scattered on both sides of the bisector (predicted EACNdo equals experimental EACNdo) as compared to Figure 3. Contrary to the GFA based model, equation (5) predicts crude oil #01 well, with a value of 18.2. Although crude oil #02 has been identified as an outlier in terms of EACN value and composition, its EACN is well estimated using both GFA and GPTIPS based models. The largest deviation between experimental and predicted EACNdo values (2.5 EACN points) is measured for crude oil #21, noting that none of the GPTIPS based models developed on each fold succeeds in predicting this value and that the GFA based model also failed in predicting this EACNdo value.
We investigated the sensitivity of equation (5) to uncertainties associated with input data. Indeed, due to filtration operations, evaporation losses and/or incomplete solvent removal during SARA analysis, fractions of saturates, aromatics, resins, and asphaltenes are determined with associated uncertainties of about 2%wt [24]. API gravity is calculated from the density of the crude oil using equation (2). The uncertainty associated with density measurements is 0.1%. Although the fraction of aromatics is involved in genes G1, G2, and G4, a 2%wt deviation applied to Aro. fractions in Table 1 only slightly deteriorates predictions, with MAE and RMSE of 0.87 and 1.00, respectively. The performance of equation (5) only falls to MAE = 0.96 and RMSE = 1.12 when considering a 0.1% deviation on density values and a 2%wt deviation on each SARA fraction (Tab. 1).

Application to the prediction of live crude oil EACN
Live oils are oils containing dissolved gases at specific temperature and pressure conditions. Creton and Mougin proposed a model based on thermodynamics to predict the EACN of a live oil (EACNlo) knowing the EACN of the dead oil (stock tank oil), the reservoir pressure (P) and temperature (T) conditions, as well as the gas to oil ratio (Rsi) [14]. This model is based on a linear volumetric mixing law applied to EACNdo and the EACN of the gas (EACNg), in which EACNg equals the sum of n-alkane carbon atom numbers (ACN, alkane carbon number) weighted by their respective volumetric fractions; for instance, when methane alone is used as the representative gas, EACNg equals 1. Molar volumes were calculated using the SRK [15] Equation of State (EoS) applied with the volume correction proposed by Péneloux et al. [16]. Creton and Mougin validated their model for several crude oils covering broad ranges of reservoir characteristics, and studied the impact of pressure, temperature, and gas composition on predicted live oil EACNs [14]. These crude oils are part of our database (Tab. 1), and we propose hereafter to feed the model proposed by Creton and Mougin with EACNdo values obtained using equation (5). Table 6 presents, for crude oils in common between reference [14] and this work, some oil properties and their original reservoir characteristics. Figure 6 shows the results obtained with experimental (Exp.) and predicted (Pred.) EACNdo values alongside those of Creton and Mougin [14]. Therefore, the proposed combination of the two models appears as a relevant tool to estimate the EACN of live crude oils.
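The linear volumetric mixing law can be sketched in a few lines of Python. The volume fractions of gas and of each gas component are taken here as given inputs, whereas in the model of reference [14] they result from an equation of state calculation that is not reproduced. Function names, symbols, and numerical values are hypothetical and chosen for illustration only.

```python
# Alkane carbon numbers of a few light n-alkanes.
ACN = {"methane": 1, "ethane": 2, "propane": 3, "n-butane": 4}

def eacn_gas(volume_fractions):
    """EACN of the gas as the volume-fraction-weighted sum of ACN values.

    volume_fractions maps component names to volume fractions within the
    gas; they are renormalized so that they sum to one.
    """
    total = sum(volume_fractions.values())
    return sum(phi / total * ACN[name] for name, phi in volume_fractions.items())

def eacn_live_oil(eacn_do, eacn_g, phi_gas):
    """Linear volumetric mixing of dead oil and dissolved gas.

    phi_gas is the volume fraction of dissolved gas in the live oil
    (assumed input, between 0 and 1).
    """
    if not 0.0 <= phi_gas <= 1.0:
        raise ValueError("phi_gas must lie between 0 and 1")
    return (1.0 - phi_gas) * eacn_do + phi_gas * eacn_g

# Hypothetical case: methane as the only representative gas.
eacn_g = eacn_gas({"methane": 1.0})
eacn_lo = eacn_live_oil(eacn_do=12.0, eacn_g=eacn_g, phi_gas=0.2)
print(f"EACNg = {eacn_g:.1f}, EACNlo = {eacn_lo:.1f}")
```

Because the rule is linear, an error on the predicted EACNdo propagates to EACNlo scaled by the dead oil volume fraction, which is one reason why a reasonably accurate EACNdo estimate is sufficient to feed this kind of model.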

Conclusions and perspectives
In the context of EOR, some chemical EOR techniques involve surfactant formulations to mobilize oil trapped by capillary forces. In order to assist and speed up experiments necessary for the formulation design, we recently proposed a model based on thermodynamics to predict the EACN of a live crude oil. This model consists of a linear mixing rule, based on volumetric fractions, of the EACN of the dead crude oil and the EACN of the representative gas. The objective of the present work was to use data mining based approaches to investigate and develop relations between the EACN and the composition of dead crude oils. We collected 29 crude oil samples originating from around the world and performed analyses to obtain compositional information. Each sample has been described in terms of EACNdo, API gravity, and fractions of C20−, saturates, aromatics, resins, and asphaltenes. The database covers a broad range of API gravity, with values ranging from 11 to 50, denoting heavy and light crude oils, respectively. Machine learning methods based on EA have been applied to our database in order to generate QPPR models to predict EACNdo. In the case of MGGP, we proposed an approach to parameterize GPTIPS. The obtained QPPR models were compared to each other in terms of their capacity to generalize the database. Note that this work could be done using any SARA analysis, but fractions of C20−, saturates, aromatics, resins, and asphaltenes within the database must be consistent. The best QPPR model was then used to feed a thermodynamics based model to predict EACNlo for crude oils. Comparisons carried out demonstrate that the proposed combination of the two models appears as a relevant tool for fast and accurate estimates of live crude oil EACNs.
To the best of our knowledge, this work represents the first attempt to predict the EACN of crude oils using data mining. When new samples of crude oils become available, API gravity and fractions of C20−, saturates, aromatics, resins, and asphaltenes will be experimentally determined. The new samples will be used to supplement our database. The QPPR model (Eq. (5)) developed in this work will be applied to predict EACNdo values for the new crude oils. Depending on the accuracy of the property predictions, and as the QPPR model is more statistical than a physical law, the QPPR model may require an update using MGGP.

Table 6. Summary of live crude oil properties. For each case study, the predicted EACNdo using equation (5), reservoir temperature and pressure, the gas to oil ratio, the solution gas composition, and experimental and predicted EACNlo values are indicated [14].