Improving Group Contribution Methods by Distance Weighting

Résumé — Amélioration de la méthode de contribution du groupe en pondérant la distance du groupe — La nouvelle approche pour améliorer les estimations existantes de la méthode de contribution du groupe est développée. Au lieu d’utiliser les contributions fixes, les contributions pondérées sont optimisées pour chaque propriété de la base de données. Les facteurs de pondération sont calculés en se basant sur les similitudes entre le composé, dont les propriétés sont estimées, et les autres composés de la base de données. En utilisant cette approche, les substances qui sont chimiquement plus proches des substances en question sont systématiquement plus pondérées. Ainsi la nature générale de l’approche est maintenue. La nouvelle technique est aux ainsi Les performances sont démontrées en la prédiction du point normal la la propriétés physiques pression la méthode Abstract — Improving Group Contribution Methods by Distance Weighting — A novel approach for improving estimations of existing group contribution methods is developed. Instead of fixed

Predictive methods estimate the desired property of a chemical compound based on the experimental values of the same property measured for other compounds, which are in some way similar to the compound of interest. Usually, the molecule of interest is represented as a combination of groups and the contribution of each group is used in estimating the properties of the whole molecule. In most cases, the groups are combinations of atoms (e.g. in the method of Joback and Reid (JR) (Joback and Reid, 1987), in the methods by Marrero and Gani (MG) and Nannoolal et al. (Marrero and Gani, 2001;Nannoolal et al., 2004)), but they can even represent bonds in the molecule (e.g. in the method by Marrero-Morejon and Pardillo-Fontdevila Marrero-Morejón and Pardillo-Fontdevila, 2000)). The group contributions are either summed up, or their sum is used in more complex correlations (Constantinou and Gani, 1994;Nannoolal et al., 2004;Wang et al., 2009). Each group contribution is typically assumed independent of the contributions of the other groups. Group Contribution Methods (GCMs) can account for the relative position of the groups in a molecule by introducing second and third order groups (cross -effects), use of different topological indexes (Gani et al., 2005;Gonzàlez et al., 2007;Wakeham et al., 2002) or by applying proximity parameters within the group contribution summation (Vijande et al., 2010). The methods by Marrero and Gani and Nannoolal et al. (Marrero and Gani, 2001;Nannoolal et al., 2004) are recently published examples of GCMs for estimation of pure compound physical properties. Another wide area of application of the GCMs is the prediction of a model parameters for description of chemical mixture behavior (Gmehling, 2009;Mustaffa et al., 2011;Peters et al., 2012;Soria et al., 2011;Tochigi et al., 2002;Vijande et al., 2010).
Even though GCMs are widely used in chemical engineering calculations, their accuracy is uncertain. The error of a physical property estimation can vary greatly among different chemical classes (Rowley et al., 2007), since the contribution of the various groups is averaged over compounds that are chemically rather different. The error of the estimation methods may be reduced by considering that similar compounds often behave similarly. In a specific class of chemicals, a property can be estimated by averaging the property values. However, for complex molecules with rare functional groups or group combinations, a property averaging over neighbor compounds would not account for all the dependencies of the property on the molecular structure.
In this work, we developed a novel calculation procedure that is applied to existing GC methods in order to exploit the concept of chemical similarity and improve their estimations. The proposed estimation technique can be applied to any GCM. The advantages of this methodology are presented here using the JR method because of its simple form and wide-spread use. We also tested the new technique on the more advanced method by Marrero matrix is the number of components whose measured properties are available. The number of columns in matrix [n] is the number of groups plus one. The first column in matrix [n] is solely filled with ones. Thus, the first element of (a) represents a constant a 0 . The other elements of (a) are the contribution coefficients of the groups according to Equation (2). In this work, the distance weighted optimization of group contributions was compared with the results of traditional non-weighted optimization.

Non-Weighted Optimization
In non-weighted optimization group contribution coefficients were solved by means of the matrix inversion, according to Equation (5): In more advanced GCMs second and third order groups are employed, which account for functional group proximity (Constantinou and Gani, 1994;Cordes and Rarey, 2002;Marrero and Gani, 2001). In our work second level GC (a (2) ) were optimized by minimizing the difference between measured properties and the properties estimated using the lower order group contributions (Eq. 5). Note that (a) and (a (2) ) are independent vectors and matrix [n] is different for Equation This two stage optimization technique was used for comparative purposes, i.e. because it was utilized in the mentioned above multilevel GCMs. In accordance with our preliminary investigations, simultaneous optimization of (a) and (a (2) ) yields a better fit of the measured properties but changes the meaning of the second order groups that are to be only corrections for estimations made with the first order group contributions.

Distance Weighted Optimization
Utilization of weighting can significantly improve the accuracy of the optimization methods (Dudani, 1976). In the optimization of group contributions, weights can be used to exploit the structural similarities among compounds. Structural groups of GCMs are a natural way of taking into account structural similarities and they were used for calculating the distance function (d j ) in accordance with Equation (7).   (9) procedure minimization in for the first order groups; matrix form (Eq. 5,9) 2 nd step -Eq. (6) or (10) In the ideal case, the calculated properties (P calc ) should be equal to measured property (P meas ): (3) and thus in matrix form Equation (3) becomes: ∑ where i is the group index, j is the index of the compound in the database and the subscript est indicates the compound under estimation. Other definitions of the distance function are also possible, depending on the property under estimation. High influence of some groups on the property can be emphasized by non equal contribution of the different groups into the first sum of Equation (7). The second term in Equation (7) slightly improves the NBP estimations by emphasizing the difference in size among the molecules. The weights were obtained from the distance function by combining Equations (7) and (8).
The factor ε is used to prevent over-emphasizing of very similar compounds and to avoid division by zero, when the selected groups cannot discriminate between two different chemical compounds (e.g. stereo-isomers). In this work, ε was given the value of 1. It is clear from Equation (8) that weights are larger for the components, which are chemically similar to the estimated one, i.e. with a small distance.
The Group Contributions were optimized using the least square minimizations as shown in Equations (9) and (10). It is important to emphasize that the optimization is made for each estimation task separately because the weights are specific to the compound under estimation: where [W] is a matrix containing the weights on its diagonal and zeros elsewhere.
Alternatively to the computational procedure described, other optimization approaches can be used. Various weighting strategies are also possible. For example, weights can be calculated based on dipole moment closeness or larger weights can be given for compounds belonging to the same chemical class. Uncertainties of experimental data can be taken into account in the weighting procedure, as in Kang et al. (2011). With respect to the dependence of physical properties on the Group Contributions other relations than the linear can be used. Non-linear property dependence will require a different minimization approach.

Estimation of Physical Property with Distance Weighted GCM
For estimation of a physical property a database (DB) of compounds with the measured physical property should be available. Each compound of the DB has to be divided into groups as determined in a GCM that was taken as a basis for the estimation. Then, weights of each compound of the a n W n n 1 ε database are estimated as described in Section 1.2. Group Contribution values are optimized in accordance with Equations (9) and (10). After that, the physical property is calculated in accordance with Equation (2). See also examples of the NBP estimation in the Appendix. Though the procedure of the property estimation thus requires optimization of the GC values at each estimation task, improvement of accuracy of the basic method is notable. Moreover, the distance weighting method can be easily implemented into most of modern flowsheet simulators, because they have already their own compound databases and the GC value optimization takes less than a second on a Pentium computer.

Database
In our method, a physical property database is required for optimizing the Group Contributions and for estimating unknown physical properties. Most modern process flowsheet simulators and estimation packages exploit extensive sets of known physical properties and these databases are their essential part. The choice of the compounds included in the database depends on the availability of experimentally measured physical properties. For this work, the Normal Boiling Point (NBP) was chosen because of its importance as a pure component property and because experimental values of the NBP are available for many compounds. The experimental data included in the database were taken from the handbook by Poling et al. (2001). Although this set of components and experimental data is much more restricted than the databases used in commercial simulators, it was chosen because it is publicly available. More extensive databases are expected to result in more accurate estimations both in the traditional group contribution methods and in the presented weighted group contribution approach. The database by Poling et al. (2001) was nevertheless found to be extensive enough for testing the effect of the group contribution weighting. In our database, the chemical compounds are divided into structural groups. The division of the compounds into predetermined groups can be automated (Kolská and Petrus, 2010;Raymond and Rogers, 1999). Components whose NBP has neither been measured experimentally, nor calculated based on experimental values were excluded. In this way, the values estimated with our method do not incorporate errors from other estimation methods. In order to determine the contribution of a group in a reasonably reliable manner, the database should include a certain number of compounds containing that group. The minimum amount of the compounds was chosen to be three, i.e. if only two compounds contain some particular group, those compounds were excluded from the database and that group was not considered in the analysis.
The total number of components in our database was 386, when using the Joback method. From the components available in the handbook by Poling et al. (2001) only the components that can be divided into Joback groups were selected for the database. In addition to the NBPs from the Poling database, the NBP of dimethylacetylene was added to our database from DIPPR Project 801 (2009). Dimethylacetylene was included for the optimization of the triple bond group (-C#) in order to avoid singularity of the group matrix [n]. This singularity occurs, if at least two columns become linearly dependent. For example, if in our database all alkynes were represented only by 1-alkynes the CH# and -C# group columns would become linearly dependent, therefore, the system of equations represented by Equation (3) could not be solved.
The estimation of the NBP based on the MG method was made for selected classes of compounds, i.e. alkanes and alcohols. 40 alcohols and 67 alkanes from the Poling handbook were included in our database. This restriction of the database was made to reduce the number of treated groups in order to improve the representation of each group. In the Poling database only the classes of alkanes and alcohols were large enough for reliable optimization of second order group contributions of the MG method.
Naturally, such restriction does not affect the general objective of this paper, i.e. to show the applicability of the proposed technique and the benefits of distance weighting optimization.

Accuracy of the Tested Models
Our new method for the optimization of group contributions utilizes the form of the property dependence on the group contribution and the group definition of available GCMs. Two GC models were used for the demonstration of our approach: Joback and Reid (JR) and Marrero and Gani (MG) (see Tab. 1).
The JR method is a one-level GC method, i.e. only 1 st order structural groups are used for calculating physical properties. In this method, NBP is calculated as a linear sum of group contributions with the additional compound independent constant. The Marrero and Gani method includes three levels (1 st , 2 nd and 3 rd order groups) and utilizes an exponential function for the estimation of NBP. To be able to use the matrix calculations (Eq. 5,6,9,10), ln(T b /T b0 ) were utilized as a linearized property (Tab. 1), where the constant T b0 was taken from the original MG method. Only 1 st and 2 nd order groups were used from the MG method. Reliable optimization of the third order group contributions was impossible using our database (see Sect. 2.1 Database), since only a few compounds in the database contain the third order groups.
For error analysis of the proposed methods estimation of the NBP was made one by one for every compound in the database. For each estimation task, the estimated component was excluded from the database and its NBP was calculated as described in Section 1.3. The optimized contributions (a) used in each estimation task were unique and each contribution set was only used once for the estimation of NBP of a single compound, in accordance to Equation (2).
In order to see the effect of the weights on the NBP estimations, both the non-weighted optimization (NW,Eq. 5,6) and the distance weighted optimization (DW, Eq. 9, 10) were performed. The optimization of Group Contributions for each estimation task without weighting (NW) is denoted by JR-NW and MG-NW, with Distance Weighting (DW) by JR-DW and MG-DW.
The group matrix [n] was modified for the calculations of the Group Contribution coefficients of MG-NW & MG-DW methods. The column corresponding to group (>C<) was removed to avoid matrix singularity, i.e. its GC value was set to 0. The singularity appears due to restriction of the database of the MG method to alkanes and alcohols, since the amount of end groups -CH 3 and -OH is a linear function of the number of >C< and -CH< groups (i.e. n CH 3 + n OH = 2 + n CH + 2n C ). This modification of matrix [n] does not influence the NBPs estimations and it is not needed when cyclic compounds are included in the database.

NBP Estimations
The performances of the NW and the DW technique based on the JR and the MG methods were compared with respect to NBP estimation. The Absolute Error (AE) was calculated as the absolute difference between the predicted NBP and the experimental value. The Absolute Error frequency distribution for JR, JR-NW and JR-DW is shown in Figure 1. The number of estimations with AE between given values (for example between 0 and 1) is shown between those values on abscissa. The AE of both the JR and the JR-NW methods had a rather similar frequency distribution as can be seen in Figure 1. About 30 NBP estimations of the JR method had high deviations (> 40 K). The NW method behaved similarly but the error distribution was slightly more even. This was expected since the optimization procedure in JR-NW minimizes the deviations based on the standard deviation residual function; whereas JR is based on Absolute Average Errors (AAE) (Joback and Reid, 1987). The use of weights in JR (JR-DW) reduced the AAE by 6.3 K compared to JR-NW. Moreover, the number of extremely high deviations was reduced considerably, only 8 cases were found with deviation above 40 K (see Fig. 1). Figure 1b shows the exponential approximation of the AE frequencies versus AE intervals (drawn as Microsoft Excel trendline utilizing least square optimization of the coefficients for the exponential line based on all experimental points). The slope of the JR-DW frequency trend line is much steeper, i.e. JR-DW predicts more accurately. The AAE of JR-DW was 8.9 K, whereas for JR it was 15.7 K and for JR-NW 15.2 K. The direct comparison between JR-DW and JR is improper, because of the difference in the databases, in the minimization techniques and in the procedures for property estimation. On the other hand, the results of JR-NW and JR are expected to be similar. Therefore, the comparison between JR-NW and JR-DW properly reveals the ability of distance weighting of improving the used GCM.
The poor performance of JR method for NBP estimation is well known (Poling et al., 2001); other GCMs could improve the NBP estimation by exploiting different functional form of NBP dependency on groups (Cordes and Rarey, 2002;Marrero and Gani, 2001) and/or by explicit use of the molecular weight in the NBP estimation Nannoolal et al., 2004). These methods can be used as a basis of the DW technique as well. But even for the JR method, increasing weight of similar compounds in the GC optimization allows taking into account implicitly many phenomena that can play a role in vaporization process, such as molecule volume, surface tension, molecular flexibility, etc. At the same time the JR-DW method keeps very simple form of the JR method.
The frequencies of the absolute errors of MG, MG-NW and MG-DW are shown in Figure 2. A reduction of the extremely high deviations is also observed with the MG-DW method applied to alkanes and alcohols. The MG-DW NBP estimations were on average about 4 K closer to the experimental NBP values than the MG-NW method estimations. a) Absolute Error frequency (K) of measured (Poling et al., 2001) and estimated NBP calculated with the JR method (open bars ■ ■), with the JR-NW method (with Eq. 5 gray bars ■)) and with the JR-DW method (with Eq. 9, black bars ■) versus interval of the absolute error; b) Exponential trend lines for the frequency data, where green circles (•) are JR AE frequencies and its trend line (green line); blue triangles (▲) are JR-NW AE frequencies and its trend line (blue line); red diamonds (◆) are JR-DW AE frequencies and its trend line (red line).
For both GCMs, the greater improvements are obtained for small and large molecules (C 1 -C 3 , C 21 -C 24 ), with a gain in accuracy of 12.6 K and 12 K for JR-DW and MG-DW respectively (see also Fig. 3, 4). In Figure 3, the NBP estimations with NWs and DWs are plotted against the experimental NBPs. The NW slightly improves the NBP estimations of both JR and MG by reducing the deviations of the high boiling compounds. The improvement of the estimation with the DW methods can be clearly seen in Figure 4, where errors of the NBP estimations are plotted against the experimental NBP for both the JRs and MG methods. The deviation trend of JR-NW and MG-NW is similar to original GCMs, i.e. larger deviations for low and high boiling compounds. The DW methods clearly reduce all deviations, without compromising the quality of the estimations in the middle boiling region (see Fig. 4). The gain in the estimation accuracy with the DW methods is clearly visible also in Figure 5. We observed a steady decrease of the deviation fraction and a sharper gain in accuracy in the high deviation range for both JR-DW (Fig. 5a) and MG-DW (Fig. 5b).
The quality of the performances of the proposed methods depends on the available experimental data for the compounds containing each considered structural group. An overview of the databases used in this work is given in  a) Absolute Error frequency (K) of measured (Poling et al., 2001) and estimated NBP calculated with the MG method (open bars ■ ■), with the MG-NW method (with Eq. 5, 6, gray bars ■) and with the MG-DW method (with Eq. 9, 10, black bars ■); b) Exponential trend lines for the frequency data, where green circles (•) are JR AE frequencies and its trend line (green line); blue triangles (▲) are JR-NW AE frequencies and its trend line (blue line); red diamonds (◆) are JR-DW AE frequencies and its trend line (red line).
Clearly, the JR-DW method performed considerably better then the JR-NW method with respect to the NBP estimations. The JR-DW performs either better or comparably to the original JR method. With all methods, very high AAE was observed for some classes of compounds, i.e. alkynes and Br containing compounds. The low accuracy for these classes derives from the characteristics of our database: only two compounds containing Br without any other halogen were included in the database, and for alkynes, the triple bond groups were independent only in the case of ethyne (CH# group) and of dimethylacetylene (-C# group).
The AAEs of the MG-NW and the MG-DW methods are presented in Table 2. MG-DW performs better than both the MG-NW and the original MG methods, thus using distance weighting improves NBP estimations regardless of the database size. Additionally, application of DW approach decreased difference in the estimations of two tested GCMs, i.e. AE for JR-DW and MG-DW estimations of alkane NBPs were  7.4 and 5 K correspondingly, whereas the estimation with the original methods gave larger difference (the AAE were 14.6 and 9.9 for the JR and the MG methods correspondingly).
In general, for all classes of compounds the DW method was superior to the NW method.

Changes in the Group Contribution Values within Estimation of NBP of Homological Series of 1-Alcohols
The proposed weighting strategy assumes that there is no standard contribution for each group, i.e. the Group Contributions (GC) are not stored. They are newly optimized for each estimation task based on the available database.
In this work, we tested the change in value of the GCs for 1-alcohols in NBP estimations. Figure 6 shows that the Non-Weighted method (NW) does not change considerably   the GC values. With the DW method, the GC values change depending on the estimated compound due to the differences in weighting during the optimization. For -CH 3 and -CH 2groups the change is very small, while for the -OH group it is much more prominent (Fig. 6). The GC of the -OH group becomes smaller in value for larger alcohols. When the group independent constant is optimized, i.e. in the JR-NW and in the JR-DW, the value of this constant increases with increasing carbon number of the estimated alcohol. The other group contributions slightly decrease in value (Fig. 6a).
Trends of change for second order group contribution values optimized in the MG-NW and the MG-DW methods are more complex. The variation of the values are influenced by weighting and by the changes in contributions derived from the optimization of the first order groups. Generally, the second order group contribution values increase slightly with increasing carbon number of the estimated alcohol.

Estimation of Other Physical Properties
Besides Normal Boiling Point, the proposed technique was also tested on estimating other thermodynamic properties of non-electrolyte compounds employing the JR GCM. Standard enthalpy and Gibbs energy of formation, melting temperature, enthalpy of fusion and vaporization as well as critical pressure and volume were estimated. A list of properties with the used equations is given in Table 3. The critical temperature was not calculated since it cannot be linearized in its JR representation. One can see that the accuracy of the methods reported in Table 3 is relatively low. This is due to both the original JR method predictive ability and to practical difficulties in experimental determination of some of the properties (Katritzky et al., 2010), such as a fusion enthalpy. The database by Poling et al. (2001) also includes values of certain properties that were calculated with an estimation method. These values were not used in this work, thus there are different amount of available points (N) for GC optimization of different physical properties.
The JR-DW method was always superior to the JR-NW method in predicting any of the considered physical properties.
The present study was carried out with a database that is relatively small compared to those used in typical process flowsheet simulators or in other advanced thermodynamic estimation packages. In larger databases, there are inherently more components with closely similar chemical character; thus, the DW technique can even further improve predictions if applied to a larger database.

CONCLUSIONS
A novel approach in the use of GCMs was developed in this work. This methodology takes into account the structural similarities among compounds in the optimization of the group contribution coefficients by means of weighting factors. The Distance Weighting (DW) method relies on an available database of experimental physical properties. The characteristics of the database are very important for the performance of the DW method. In fact, the accuracy of property estimations for a compound is strongly dependent on the amount of compounds with the same structural groups included in the database. The method increases the impact of specific groups on the estimations; thus using structural groups that reveal important characteristics of a molecule is essential to the success of the method.
The DW method could be applied to any GCM and it improves the GCM to which it is applied. In this work, the DW method was applied to the Joback and Reid and to the  Constrained quadratic programming problems minimization was used to find only coefficients that produce non-negative pressure. ** n 0 = 1, a 0 : group independent property constant. † Absolute Average Error calculated per heavy (not hydrogen) atoms in the molecules.  Marrero and Gani GCMs (Joback and Reid, 1987;Marrero and Gani, 2001). The JR method was used to estimate NBP and other physical properties of compounds for which experimental data were found in Poling et al. (2001). The method of Marrero and Gani was used for NBP estimations of alkanes and alcohols from the same database. The DW method reduced the Absolute Average Error (AAE) in NBP of by 6.3 K for the JR method and of by 4 K for the MG method.
The advantages of this method can be exploited in several ways. The DW method can be used for the fast optimization of properties or model parameters of a compound based on a relatively small database, in which a sufficient number of similar-in-structure compounds are present. The database might even be restricted to specific classes of chemicals. The DW method can also be applied to advanced GCMs using a large database. In this case, the benefits from detailed group definition and from the use of second and third order groups are integrated.

APPENDIX
Let's consider estimation of the NBP for 2-methyl-2-butanol by methods described in our article. Its experimental NBP value is 375.15 K.
Equation (2) is used for the NBP calculation in the JR method, i.e.: Values of the original JR group contributions are given in the table together with the estimated NBP. At first Equation (5) was applied to the whole database except 2-methyl-2-butanol (386 compounds from Poling book). As a result 40 values of the Group Contributions for JR groups were obtained. Five of those values were used for estimation of 2-methyl-2-butanol NBP (see Tab. A). Then, weighted GC optimization was made in accordance with Equation (9) with the same database. Five of forty optimized GC values were used for estimation of the NBP by the JR-DW. The GC and estimated NBP are given in Table A.
The MG, MG-NW and MG-DW methods 2-methyl-2-butanol consists of the following first order MG groups (n i ): 3 × -CH 3 , 1 × -CH 2 , 1 × >C<, 1 × -OH (aliphatic), and the second order group (COH group). The NBP of 2-ethyl-2-butanol is calculated in accordance with following equation: where constant T b0 was used as it is given in the original MG method, i.e. 222.543. The original MG group contributions and estimated NBP is given in Table B. For the MG-NW method, Equations (5) and (6) were used for optimization of GC values with the database of alkanes and alcohols (see Sect. 1.1 and 2.1). Exponential function of relative boiling temperature (exp(T b /T b0 )) was used as the estimated property for linearization. Eleven group contributions were optimized and four of them used for estimation of 2-methyl-2butanol NBP. It is worth to remind that GC value for >C< group was not optimized as it is described in Section 2.2. Equations (9) and (10) were used for the MG-DW method. The resulting optimized GC and estimated with them NBP is shown in Table