New method for predicting n-tetradecane / bitumen mixture density : correlation development

Nowadays, incredible growth of the energy consumption has changed the global attention to the production and utilization of the heavy crude oils such as bitumen resources around the globe. Amongst the bitumen properties, density is an important parameter which improves bitumen recovery efficiency and transportation quality. For easy production of bitumen, n-alkanes are usually injected into the reservoir to reduce its viscosity and density; however, there are few numbers of models focusing on proper estimation/prediction of diluted bitumen mixture density in literature. In present work, a new method was proposed to accurately prognosticate the bitumen/n-tetradecane mixture density as a function of thermodynamic conditions using Gene Expression Programming (GEP) for the first time as a function of solvent composition, pressure and temperature. Consequently, the proposed model here predicts the mixture density with the average Absolute Relative Deviation (AARD%) of 0.3016% and R-squared (R) of 0.9943. Moreover, it is found out the solvent concentration has the highest impact value on mixture density estimation. In conclusion, results of the present study can be so valuable for field engineers and researchers working on solvent-assisted recovery methods from heavy oil reservoirs.


Introduction
In North America, one of the key supplies of energy is heavy oil resources, in which their production mechanism is incredibly challenging [1,2].For instance, nearly two million barrels of heavy oils per day are produced in Alberta province individually with viscosity values from thousands to millions of centipoises (i.e., mPa s) at both reservoir and atmospheric conditions leading to the establishment of considerable difficulties during processing, transportation and underground recovery techniques [1,3].Therefore, it is crucial to execute a required viscosity reduction by selecting proper methods using the dilution and/or heating mechanisms.For this reason, a number of approaches have been developed such as N-SOLV [4], ES-SAGD [5], SAS [6], LASER [7] and SAP [8] with the simultaneous application of steam and solvent, and VAPEX process with the direct use of n-alkane as diluent [9].To meet preferable specifications for inlet/outlet of refinery systems, heavy oils are mixed with diverse fractions of crude oils [10].Frequently, heavy oils are diluted by means of naphtha and condensate through the transportation lines [11,12].Amongst the all operational factors, density plays a major role in determining the effectiveness of the above processes such as solvent assisted methods.This property is also significant in carrying out volume computations in refineries and pipelines design.Determination of density of blends is not frequently a feasible strategy because a chemical reaction may happen during mixing, in which the properties of the yielded fluids may not comply with the conventional estimating laws [9].Thereby, it is necessary to present a new method so as to estimate the mixture density with adequate precision and simplicity.
There are a number of researches regarding different aspects of the solvent injection in the open literature [13,14].Various techniques have been utilized to predict the mixtures density including empirical models, Equations of State (EoS) [9] and smart computations.In high pressure conditions, reliable determination of volumetric properties and phase behaviors have been failed via EoS analysis by reason of high degree of numerical complexity [15].Alongside the EoS strategies, a number of empirical correlations have been proposed in the open literature to calculate the mixture density.According to the corresponding states law, two widely applied equations namely COSTALD [16] and Rackett [17] models were developed, in which COSTALD [16] was understood to be more accurate than EoS modeling.For saturated liquid densities, the Rackett [17] equation was modified in the work of Spencer and Danner [18].The abovesaid correlations are in the need of critical properties as the input as well as the acentric factor in some cases.The main inconvenience involved into these methods is that the pseudo-critical parameters which are generally not available must be used for blends of hydrocarbons.Calculating these properties by the use of traditional techniques leads to high inaccuracy; thereby, choosing a suitable mixing rule may resolve this issue.For heavy oils, an extrapolation affair should be conducted to estimate the pseudo-critical properties which introduce extra deviations into the mixture density estimation.Some investigations have been conducted to determine mixture density with regard to density of each constituent which are simpler than the previous methods.For each component, the density must be measured experimentally and/or a proper estimative equation should be used.This is the main drawback of using such approaches [9].
Artificial intelligence is a robust strategy for solving highly nonlinear problem in wide areas of engineering and science [19][20][21][22][23][24][25][26].The latest strategy developed for density estimation is the application of artificial intelligence which produces reliable estimates heavy oil density [27].This type of smart computations has been employed for describing intricate problems with considerably nonlinear trends in wide areas of petroleum and chemical engineering [28][29][30].Recently, two rigorous versions of smart computations including Radial Basis Function Neural Network (RBFNN) and Adaptive Neuro-Fuzzy Inference Systems (ANFIS) have been proposed by Tavakoli et al. [31] and Abbasi et al. [27] to prognosticate the diluted density of Athabasca bitumen by n-tetradecane, respectively.Their results were promising with more accurate precisions than the available mixing rule in literature.Despite the high capabilities of ANFIS and RBFNN strategies, these models are black box with no recognizable governing equation for computation.Amongst the all versions of smart numerical schemes, genetic-based methods have been understood to be one of the most accurate optimization techniques through the literature.The most recent technique in this field of study is Gene Expression Programming (GEP) which is capable of describing any phenomena by developing accurate empirical models [2,28,29].In other words, there is no requirement for determining the correlation format before using GEP strategy.There are a number of forceful investigations in the open literature which applied GEP strategy for accurate prediction of several parameters such as pure hydrocarbon/water Interfacial Tension (IFT) [28], CO 2 / brine IFT [32], critical oil flow rate through the wellhead chokes [33], CO 2 solubility in crude oil [2] and thermal conductivity of supercritical CO 2 [29].
In current study, a new symbolic equation with respect to GEP modeling was extended in relation with thermodynamic and independent variables including pressure, solvent concentration and temperature.Statistical quality criteria and illustrative tools were also used to evaluate the effectiveness of the new method proposed here.The impact of each variable was calculated via sensitivity analysis, and the truthfulness of the used database was examined by Williams' plot or outlier detection technique.

Data collection
A large database consisting widespread operational parameters is vital before establishing a predictive model [2,28,[34][35][36][37].So, about 330 datapoints were adopted from the experimental and modeling work of Kariznovi et al. [3].This dataset includes bitumen/n-tetradecane mixture density as the output, and pressure, solvent concentration and temperature as the inputs during modeling.By preprocessing of the used database, about 264 data were considered for model training, and for testing goal about 66 data were assigned.This data division was conducted by a random computer selection approach to fulfill the comprehensiveness of both training and testing subsets.Table 1 indicates the description of the adopted database via several parameters such as minimum, average, maximum and median.

Gene Expression Programming (GEP)
Originally, the theory of Gene Expression Programming (GEP) was presented in literature through the work of Ferreira [38] with nearly the same computational procedure as Genetic Algorithm (GA).In this computational strategy, a population of potential answers was randomly employed to find the objective function with a certain fitness value to generate the next potential answer to the problem [28,29,33,34].During this process, GEP applies some operators such as mutation and crossover to properly adjust the chromosomes (population individuals).These calculations are repeated in anticipation of meeting the convergence condition.Frequently, Mean Square Error (MSE) is chosen as cost function during GEP simulation, defined as follows [2,35,39]: in which, the superscripts ''exp'' and ''est'', and the symbol N stand for, correspondingly, the target bitumen/ntetradecane mixture density, the calculated values by the GEP model, and the size of the database used for modeling [2,28,29,33,34].Translation of a linear string with fixed length which is termed as chromosome, will lead to the creation of a new object known as symbolic Expression Tree (ET), which has a certain size and depth determined by the users.Therefore, GEP strategy has two key constituents of ET and chromosome.Each gene has two main features including functions (arithmetic operators) and terminals (variables).The tail length of each terminal (t) can be formulated as a function of the head length (h) and the largest arity function (n), as follows [2,28,29,33,34]: By means of symbolic ETs, an equation will be provided as a consequence of GEP modeling as an advanced symbolic regression technique.Suggestion of correlation and tuning its constants are conducted at the same time in GEP modeling, whereas in conventional regression analysis the correlation format is proposed at first step, and then its coefficients are adjusted by fitting the equation to the experimental database [2,28,29,33,34].Actually, GEP advanced technology proposes the best equation by itself through checking different algebraic operators randomly among the independent variables, in which these functions are defined by the user, even though traditional regressions have restricted options of equations to fit or the user should define the equation for the program [2,28,29,33,34].That is why the GEP strategy can establish a revolution in regression analysis by suggesting powerful empirical models as compared to the classical fitting approaches [2,28,29,33,34].A typical chromosome with two genes composed of four operators (i.e., •, +, /, p ) and three terminals a, b, c, is indicated in Figure 1 [35].

Results and discussions 4.1 GEP model development
In this study, the new method for calculating/estimating the mixture densities of bitumen/n-tetradecane was proposed in accordance with the new technology of GEP mathematical strategy for the first time in literature.Despite the existing mixing rules available in literature which consider the component densities, the three independent parameters of temperature, solvent concentration and pressure were regarded as correlating variables in order to predict/ estimate the bitumen diluted density.The previous literature studies demonstrated that the pressure has the least effect on the estimation of bitumen density.Therefore, the authors of this study developed a three-parameter model considering the effect of pressure.As a result, the ultimate correlation obtained in this study is as follows: where, x is a typical data sample and N shows the number of dataset.
Fig. 1.A typical two-gene chromosome with its corresponding mathematical expression [35]. A.
In above equations, T stands for temperature (K), P is pressure (MPa), w s indicates solvent concentration (mass fraction), and q m shows mixture density (kg/m 3 ).For better correlation of the density to the variables, a natural logarithm (i.e., LN) was taken from the temperature and mixture density before running the software.Moreover, it was understood that data normalization has no positive effect on the improvement of modeling.During optimization, a variable impact computation named as sensitivity analysis, was carried out by the simulator to determine the number of decimal points in the GEP-based equation.The parametric description of GEP simulator for developing the new correlation is reported in Table 2. Figure 2 shows the variation of the fitness function and R-squared against the number of generations.

Analytical comparison
For better evaluation, numerous statistical quality measures and visual tools are employed to accurately examine the performance of new method suggested in this study.Table 3 lists the calculated error functions in this study such as Root Mean Square Error (RMSE), Average Deviation (AD), Standard Deviation (SD), R-squared (R 2 ) and Average Absolute Relative Deviation (AARD).In addition to these statistical quantities, some useful graphs including relative distribution diagram, crossplot, cumulative frequency diagram and trend analysis were implemented to characterize the behavior and effectiveness of the new method established in this study.The statistical quality parameters of the developed method here are inserted in Table 4 for various subsets of database.The first point implied from this table is that the new GEP based method is a highly precise correlation regarding both training and  Table 3. Definitions of statistical quality measures applied for error analysis.
Error Formula Minimum deviation Average absolute relative deviation AARD ¼ 100

Graphical comparison
One of the main visual touchstones for analyzing the model performance is known as parity diagram or crossplot.In such diagram, the estimated/predicted data are sketched against the corresponding target experimental values.The strength of model is examined by the degree of data concentration and their neighborhood around the unit slope line or bisection of the first quadrant of coordination system.The parity diagrams for the new method here are illustrated in Figure 3.It is clear that the new method proposed here has a compact cloud of data in the vicinity of 45°line due to the very high R2 value and very small value of total estimation error (e.g., AARD).In Figure 4, simultaneous plots of new method predictions and their corresponding measured values are exhibited against the data index during GEP modeling.The estimates of the GEP model have a satisfactory match with their corresponding target values.Another standard plot is represented in Figure 5 showing the   variation of relative error for GEP model.The variation of error for the new method here shows a compressed range of error varying from À0.5 to +0.5%.
In order to guarantee the outstanding performance of the new method proposed here, the most common statistical parameters are calculated for different ranges of the input parameters.So, a number of bar plot analyses are carried out which are shown in Figures 6-8. Figure 6 declares that at 0.124 MPa, the new method has a high deviation from the reality based on the RMSE parameter.According to Figure 7, the second method has no AARD% value more than 0.4% at all temperature ranges.The highest AAD value is assigned to the solvent concentration of 0.2 for GEP model which is illustrated in Figure 8. Considering pressure parameter lowers the error value especially for concentrations equal or more than 0.3 mass fraction, which establishes a discontinuity in Figure 8.For the whole ranges of operational conditions, the proposed model almost always gives highly reliable predictions.In other words, considering pressure, temperature and solvent concentration for modeling the mixture density will result in an accurate model with less numbers of experimentations.
Trend analysis of the mixture density versus input parameters are indicated through the Figures 9-11 showing the variation of the temperature, pressure and solvent concentration, respectively.Based on these figures, the new method developed here has a good fit with the experimental datapoints.When the pressure varies from 0.124 to 9.994 MPa, temperature from 297.3 to 333.1 K and concentration between 0.05 and 0.50, the magnitude of density difference will be roughly about 6, 23 and 123 kg/m 3 , respectively.Therefore, the effect of the independent parameters can be arranged by the following order: Solvent concentration > Temperature > Pressure: For having a better understanding of the estimation potential of the suggested method here, a diagram of cumulative frequency versus the absolute relative deviation over the entire database is illustrated in Figure 12.About 100% of the predictions of the GEP method developed here have estimation errors equal or less than 1.2%.This method gives accurate results.Therefore, the new method established in this study exhibits a robust performance in calculating/estimating the bitumen/n-tetradecane mixture density.Moreover, the input variables of the new method here are easily available with no need to additional experimentations in comparison with the existing mixing rules in literature.To the best of authors' knowledge, it is the first time in literature that a correlation as a function of thermodynamic conditions for density estimation is proposed.The proposed  tool in this study can also be applied to density estimation of crude oils and oil fractions because it is a function of thermodynamic condition and solvent composition only; therefore, wide range of applicability can be suggested for this model proposed in this study.However, the existing mixing rules and empirical models which are partly developed on the basis of light to intermediate hydrocarbons, cannot be applied for any type of crude oil and/or mixture of petroleum fractions.This inappropriate application of literature models may lead to large deviations from reality.

Sensitivity analysis
Quantitative assessment of the effect of each input variable can be calculated by the use of Pearson's technique [40] by assigning an impact value between À1.0 and +1.0 to each independent and input parameter.To obtain this impact value for each input data, the following formula is applied [35]: where, the symbols r, l i , l, ln p k;i and ln p k illustrate, respectively, the relative dependency factor, the ith predicted value of mixture density, the mean value of mixture density, the value of ith input variable, and the mean value of the kth input variable, in which k represents the mean values of inputs variables.Based on this approach, a sensitivity analysis is implemented in this study in which its outcomes are shown in Figure 13.As shown, mixture density has a decreasing trend with respect to the temperature and solvent concentration, even though mixture density gradually increases for a rise in pressure.

Outliers detection
In this section, a useful approach is carried out to detect the suspected data, and separate them from modeling.Because outlier dataset can significantly reduce the reliability of each predictive model during the characterization of physical phenomena.One of the widespread techniques in recognizing such mistrusted data are known as Williams' plot.In this diagram, the standardized residual (R) values are sketched against the Hat parameter.The values of Hat parameter are calculated by the following Hat matrix as follows [41]: where, X is a matrix with m rows and n columns, and superscript t denotes for the matrix transpose operation.
In addition, m is the number of datapoints, and n indicates the number of parameters.As a result of this strategy, Figure 14 is plotted showing the outliers data for bitumen/n-tetradecane mixture density over the used database.As shown, there is two red horizontal lines named as residual limits, and a green vertical line which shows the leverage limit or critical Hat value (H*).When the data are located in the interior region of 0 H 0.0364 and À3 R +3, the used data are acceptable (see the blue points in Fig. 14).Obviously, only one point of the density data in this study is unreliable.So, the authors of this study developed two comprehensive models on the basis of a valid database.

Conclusion
In current investigation, using Gene Expression Programming (GEP) tool a new method including a three-parameter model was developed to prognosticate the bitumen/ n-tetradecane mixture density as a function of thermodynamic parameters for the first time in literature.For this purpose, about 80% of the database was utilized for designing the new method, and the remaining data was considered for checking the prediction capability of the trained models.Afterwards, via the most commonly applied statistical functions and graphical methods, the proposed method was assessed.As a result, it is perceived that the GEP method which takes into account the pressure effect is an accurate model with reliable predictions.Moreover, it is demonstrated that the solvent concentration is the most affecting correlative variable for estimating/predicting mixture density.The main benefit of this study over the existing mixing rules is that our newly proposed method requires the pressure, temperature and concentration, which are easily measured through any experimental study; however, the present mixing rules in literature are in the need of each component density which makes their application difficult.Finally, the new methods proposed in this study can be a good nominee for applying in solvent-assisted recovery methods applied on heavy oil reservoirs.

Fig. 2 .
Fig. 2. Variation of the fitness function and R-squared for different generations during GEP modeling.

Fig. 3 .
Fig. 3. Comparison of GEP estimates with experimental values of bitumen/n-tetradecane mixture density for the new method proposed in this study.

Fig. 4 .Fig. 5 .Fig. 6 .
Fig. 4. Experimental and GEP predicted values of bitumen/ntetradecane mixture density versus data index for the new method proposed in this study.

Fig. 8 .Fig. 7 .
Fig. 8.Comparison of the Average Absolute Deviation (AAD) of the new method here in diverse ranges of solvent concentration.

Fig. 9 .
Fig. 9. Comparison of the predictions of the new method with experimental mixture densities versus temperature.

Fig. 10 .Fig. 11 .Fig. 12 .
Fig. 10.Comparison of the predictions of the new method with experimental mixture densities versus pressure.

Fig. 13 .
Fig. 13.Sensitivity analysis of the bitumen/n-tetradecane mixture density with regard to the pressure, temperature and solvent concentration.

Fig. 14 .
Fig. 14.Outlier detection of the used database for modeling bitumen/n-tetradecane mixture density to check the validity of the data.

Table 1 .
Specification of the databank applied for modeling.SD refers to the standard deviation, which can be calculated as follows: Rostami et al.: Oil & Gas Science and Technology -Rev.IFP Energies nouvelles 73, 35 (2018)

Table 2 .
Parameters of the proposed GEP model.

Table 4 .
Statistical quality measures of the new method for different groups of training, testing and total subsets.
2 testing sets.For this model, the results of training and testing sets are approximately the same; thereby, the overfitting has not occurred through the development of this new empirical model.The results of test set are more accurate than the training set verifying the success of GEP mathematical scheme in developing empirical model by a logical variation of datapoints which are divided randomly for modeling.Table 5 indicates the comprehensive error analysis of the new model here with respect to different solvent concentrations.According to this table, it is clarified

Table 5 .
Comprehensive error investigation of the new method estimating bitumen/n-tetradecane mixture density for different values of solvent concentration.
that the GEP model is an efficient in estimating mixture density owning to the sufficiently low values of AARD%, RMSE, SD and AAD parameters.