Abstract
Eliciting informative user opinions from online reviews is a key success factor for innovative product design and development. The unstructured, noisy, and verbose nature of user reviews, however, often complicates large-scale need finding in a format useful for designers without losing important information. Recent advances in abstractive text summarization have created the opportunity to systematically generate opinion summaries from online reviews to inform the early stages of product design and development. However, two knowledge gaps hinder the applicability of opinion summarization methods in practice. First, there is a lack of formal mechanisms to guide the generative process with respect to different categories of product attributes and user sentiments. Second, the annotated training datasets needed for supervised training of abstractive summarization models are often difficult and costly to create. This article addresses these gaps by (1) devising an efficient computational framework for abstractive opinion summarization guided by specific product attributes and sentiment polarities and (2) automatically generating a synthetic training dataset that captures various degrees of granularity and polarity. A hierarchical multi-instance attribute-sentiment inference model is developed for assembling a high-quality synthetic dataset, which is utilized to fine-tune a pretrained language model for abstractive summary generation. Numerical experiments conducted on a large dataset scraped from three major e-Commerce retail stores for apparel and footwear products indicate the performance, feasibility, and potential of the developed framework. Several directions are provided for future exploration in the area of automated opinion summarization for user-centered design.
1 Introduction
Understanding user needs is a preliminary step in early-stage product development [1]. User feedback plays a key role in product design and development, as it provides designers and manufacturers with important information about user interaction experiences with various attributes of a product. The increasing use of e-Commerce platforms has resulted in large and rich collections of user feedback in the form of online product reviews [2]. One of the main advantages of analyzing online reviews is that they contain detailed and nuanced feedback from large and diverse user populations on different attributes of various competing products [3,4], which may not be the case in pilot launches, small-scale usability studies, or focus groups involving product design and development teams [5–7]. However, it is challenging to comprehend a large collection of textual reviews that typically address the varied user experiences and sentiments associated with different attributes of a product. The question thus remains on how to accurately elicit user needs from online reviews at scale and relay that information to designers in a useful format.
Natural language processing (NLP) techniques such as text summarization, sentiment analysis, and topic modeling can be utilized to extract important themes from a collection of user reviews [8,9]. Among these, topic models require qualitative interpretation of the generated topics, which often demands significant effort and time. Sentiment analysis has been widely explored in the user need finding literature [10–18]. However, the results of the sentiment analysis process, which often take the form of sentiment polarity or intensity values, may inherently lose important information that could otherwise be useful to designers. Text summarization approaches can address this shortcoming by providing compiled summaries of important points covered in a large collection of reviews, which can be used directly by product designers for further analysis [17]. There are mainly two types of text summarization approaches: extractive [19–21] and abstractive [22–24]. The former extracts and concatenates key sentences or paragraphs from the original text without necessarily capturing their context or meaning, while the latter leverages language models to generate text in a more advanced fashion, similar to human interpretation.
1.1 Knowledge Gaps.
The existing need finding approaches are based primarily on qualitative analysis of previous designs, surveys, or focus group studies, which are inherently biased due to the targeting of a small fraction of users and product instances with structured inquiries. The growing abundance of user feedback data in the form of online reviews, tweets, comments, or forum discussions has created new opportunities for designers and product developers to elicit user needs at scale. Sentiment analysis has been a key enabler for large-scale need finding from user-generated data over the past decade [25–28]. However, the current research is mainly focused on sentiment classification at the attribute, sentence, or document level [29–36], which would inevitably lead to information loss due to the aggregation and quantification of user feedback and opinions.
Text summarization is another NLP technique explored in the literature to extract user needs in the form of opinion summaries [37–44]. Yet, most existing opinion summarization approaches are extractive in nature and place emphasis on the percentage of information that can be extracted, which can result in information loss due to the disregard of contradictory opinions across different reviews. More specifically, most existing research merely evaluates the quality of summarization results with respect to the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score [45], which essentially counts the number of overlapping units such as n-grams [46], word sequences, and word pairs between computer-generated summaries and ground-truth summaries created by humans. However, the ROUGE score cannot provide enough support to assess the "direction" of the summary with respect to the attributes discussed or the general sentiment of the summary.
In multidocument summarization tasks [47,48] such as summarizing multiple user reviews (i.e., documents), different documents may contain totally contradictory opinions. Therefore, the generated summary can be easily influenced by the “dominant” opinions, i.e., the opinions with a greater representation in the corpus in terms of the lengths of reviews and/or the number of reviews with the same opinion. This bias in summarization occurs because the ROUGE score encourages the generated summary to contain more text from the longer text [49,50]. This peculiarity can undermine the ability of the summarization process to generate informative and representative summaries. This article aims to bridge this gap by “steering” the summarization process using the attribute-level sentiments of users extracted from the review set. Such a controlled summarization [51] would allow designers to generate attribute-specific summaries of user reviews with similar sentiments.
1.2 Objectives.
This article builds and validates a hierarchical Multi-instance Attribute-Sentiment (MAS) inference model to infer attributes and sentiments from each individual review as well as from each sentence and word inside the review. Using the trained MAS inference model, a synthetic dataset is assembled in the form of attribute-specific summaries with sentiment labels. The synthetic dataset is then used to fine-tune a pretrained language model, the text-to-text transfer transformer (T5) [52], to generate abstractive summaries. The general framework, shown in Fig. 1, can generate abstractive summaries from a raw review corpus guided by specific product attributes and sentiment preferences. The main contributions of this article are as follows:
A new synthetic dataset is created that can be used for both attribute-level sentiment analysis and attribute-sentiment-guided summarization of user needs from online reviews. The dataset includes both raw reviews from several online platforms and (reviews, summary) pairs that can serve as training data for text summarization.
A multimodel computational framework is built, which includes (a) sentiment-attribute information integrated by the MAS inference model and (b) a fine-tuned T5 model trained using the results from the MAS inference model for generating summaries with specific attributes and sentiments.
The proposed end-to-end framework for inferring attribute-sentiment-specific summaries of user opinions is tested and validated through experiments on a large dataset of user reviews scraped from multiple e-Commerce platforms for footwear. The experiments also illuminate the impact of attribute-sentiment categorization on the quality of generated opinion summaries.
2 Background
This section presents a brief overview of the state-of-the-art in text summarization and attribute-level sentiment analysis, along with their implications for user need finding in early-stage product development processes.
2.1 Text Summarization Using Synthetic Data.
The previous work on text summarization focuses mainly on general opinion summarization using abstractive methods [22–24] or extractive methods [19–21]. Since annotated opinion summary datasets for training are rare and difficult to generate on a large scale, most recent studies approach opinion summarization as an unsupervised learning problem for which only the review corpus is available [44,53,54]. State-of-the-art unsupervised summarization approaches utilize autoencoders [55] to first train a text decoder by reconstruction and use it to generate summaries based on inputs. Considering the unsupervised nature of these methods, the quality of their generated summaries is often much lower compared to the summaries generated by supervised methods.
Some recent studies have proposed abstractive summarization models that can generate overall and attribute-specific summaries and have evaluated their performance in user reviews on services such as hotels and restaurants [56,57]. A key limitation of guiding the summaries only with respect to attributes is mixing up contradictory or positive/negative opinions of users about particular attributes. To address these gaps, this article develops a framework for abstractive summarization of user reviews to generate attribute-specific and sentiment-specific summaries. The performance of the model is demonstrated and evaluated on a large review dataset of sneakers scraped from multiple e-Commerce platforms.
For sentiment analysis, polarity and subjectivity are considered as two main dimensions and are determined for various attributes of the product as well as for the overall product. The sentiment polarity score indicates the intensity of emotions expressed by the user, for example, extremely negative/unhappy, neutral, moderately positive, or highly positive. The sentiment subjectivity score indicates whether the review was largely a subjective opinion, for example, "I did not like the shoe sole," or was objective in nature, for example, "The shoe sole was very narrow." Both sentiment dimensions contain important information about user experience that is largely complementary in nature. As this is a pilot study, the sentiment intensity at the attribute level was used to group the data into unique combinations of attribute and sentiment polarity, and the reviews were summarized for each group using an abstractive summarization approach. The findings of this study would be useful to researchers in the engineering design domain, as well as product designers and manufacturers.
Abstractive summarization based on synthetic data has been proven feasible in the past. Synthetic datasets are usually generated through unsupervised methods such as autoencoders [58], noising and denoising [59], and ranking good reviews by similarity and reusing them as summaries [60], or through supervised methods that utilize attribute controller systems [61]. The latter is based on the idea of assembling (reviews, summary) pairs from a review corpus as synthetic data to train a supervised learning model. This article uses the supervised learning approach to generate the synthetic dataset.
2.2 Summarization Guided by Attribute-Level Sentiment Analysis.
The main goal of any text summarization task is to capture as much critical information as possible with the minimum number of words [62]. This could be achieved by guiding the summary generation process with respect to specific keywords (e.g., attributes/aspects of a product or a service) as the main subjects of the summary [63,64]. The summarization process can also be guided by user sentiments, for example, to compare the generated summaries of positive, negative, and neutral opinions [11–13,65,66]. Although both text summarization and sentiment analysis are popular research topics in the NLP domain, studies that incorporate both as complementary capabilities are rare. For example, some studies have attempted to build an efficient text summarization system by selecting sentiment keywords [67], which could roughly provide the direction of the summarization. Other studies have "concatenated" the two techniques by first precategorizing the corpus with respect to different sentiment values and then choosing keywords to guide the summarization process [68–70]. Some studies have used sentiment analysis as an independent prefilter to improve the reliability of the summarization task [71]. To our knowledge, no existing study has explored the possibility of integrating the opinion summarization and sentiment analysis processes in a single model to enable controllable generation of summaries with respect to user-specified attributes and sentiments.
3 Methodology
This section presents the four main steps of the overarching framework for attribute-sentiment-guided summarization: data collection and preprocessing, building the MAS inference model, creating the synthetic dataset, and fine-tuning the T5 model for abstractive summarization. The overall flow of the framework, with the output of each stage, is presented in Fig. 2.
3.1 Collecting and Preprocessing the Data.
The review corpus used in this article is scraped from multiple online apparel and footwear stores including Finish Line, New Balance, and Asics. This corpus contains over 140K reviews of sneakers. However, not all reviews are informative and useful for analyzing user needs. A review filtering procedure is therefore designed, as explained in Sec. 4. The corpus comes with a document sentiment label represented by a star rating from 1 to 5. In addition to the overall rating, this article assigns an objectivity label to each review, as well as an attribute label based on a prespecified attribute lexicon [18], which comprises over 200 attribute words related to footwear grouped into seven categories: permeability, impact absorption, stability, durability, shoe parts, exterior, and fit. The detailed lexicon can be found in the Appendix. Since this work involves two types of labels, the balance of the labeled dataset should be considered with respect to both types.
Since the MAS model only requires two types of labels, attributes and sentiments, the labeling process can be carried out efficiently using the review stars and the attribute lexicon. Without loss of generality, the sentiment label was assigned based on the star rating (positive: 5 stars; neutral: 3–4 stars; negative: 1–2 stars), and the attribute label was assigned by word matching against the attribute lexicon (a minimal sketch of this protocol is provided below). This rating scale was chosen to make the three sentiment categories more balanced. Even four-star reviews were found to frequently complain about some aspects of the product; therefore, reviews with fewer than five stars are not treated as completely positive. In addition, this rating scale does not affect the model architecture or training process and can be modified in other application domains where four-star reviews are more positive. Further, the proposed rating scale helps balance the dataset: 75.89% of the reviews in the dataset are five-star, while fewer than 5% are three-star; hence, the size of the neutral category must be expanded.
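The following is a minimal sketch of this labeling protocol; the abbreviated lexicon and all function and variable names are illustrative, not from a released implementation:

```python
# Minimal sketch of the silver-labeling protocol described above.
# ATTRIBUTE_LEXICON is an abbreviated stand-in for the full lexicon
# in the Appendix; all names here are illustrative.
ATTRIBUTE_LEXICON = {
    "permeability": {"ventilation", "breathable", "mesh"},
    "impact absorption": {"air", "gel", "strap"},
    "stability": {"flytrap", "ankle", "support"},
    "durability": {"durable", "ripple", "haptic"},
    "shoe parts": {"tonal", "bucket", "bottom"},
    "exterior": {"gold", "blocking", "metallic"},
    "fit": {"dapper", "comfy", "adjustable"},
}

def sentiment_label(stars: int) -> str:
    """Map a review's star rating to its silver sentiment label."""
    if stars == 5:
        return "positive"
    if stars in (3, 4):
        return "neutral"
    return "negative"  # 1-2 stars

def attribute_labels(review: str) -> set[str]:
    """Assign attribute labels by word matching against the lexicon."""
    words = set(review.lower().split())
    return {attr for attr, seeds in ATTRIBUTE_LEXICON.items() if words & seeds}

# Example: a 5-star review mentioning fit-related words.
print(sentiment_label(5), attribute_labels("Very comfy and adjustable shoe"))
# -> positive {'fit'}
```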
3.2 Building the Multi-Instance Attribute-Sentiment Inference Model.
The MAS inference model is a supervised machine learning framework in which labels are available only for bags of otherwise unlabeled instances [72]; the goal of the model is to infer labels for the individual instances within each bag. In this article, a hierarchical model structure is designed to predict review labels from sentence and word predictions. The rationale behind choosing the multi-instance model is that its structure resembles the process humans use to generate summaries: when creating an attribute-related summary, annotators first filter the sentences related to the attribute of interest from the review set and then condense those sentences into a single summary, which carries the same label as those sentences. In the MAS inference model, a sentiment label is added alongside the attribute label, with three polarities associated with each review: positive, neutral, and negative. In this case, the star ratings provided with reviews on the e-Commerce platforms are used to induce user sentiments. Although language models such as BERT [73] can provide embeddings at the token, sentence, and document levels for classification, they cannot handle the desired hierarchical voting mechanism, as they require datasets with sentence-level and word-level labels that may not necessarily be available. Therefore, this article applies the MAS model to generate these labels, as described in the remainder of this section.
3.2.1 Model Structure.
To develop an abstractive summarization model through supervised learning, a labeled dataset is required that includes review-summary pairs. However, such datasets are rare and hard to generate. In such cases, several studies have attempted to train supervised learning models by creating synthetic datasets, which have shown remarkable performance [58–60]. Building on this idea, this article conducts abstractive opinion summarization through a three-stage process: (1) train the MAS model with a review-based dataset (Fig. 3), (2) generate synthetic dataset with the output of the MAS model, and (3) fine-tune T5, a state-of-the-art sequence-to-sequence model, using the synthetic dataset to generate abstractive summaries for specified attributes and sentiment polarities. The overall structure of the proposed model is similar to the attribute-controllable summarization model, AceSum [61], with the following additional features:
AceSum only provides an attribute controller, while the proposed model also incorporates sentiment polarities via the sentiment controller. Further, the AceSum model may yield no output, yet the proposed model always predicts at least one label, which is the sentiment.
The multi-instance model of AceSum creates the synthetic dataset using less than ten seed words, while the proposed model generates the synthetic dataset using a rich attribute lexicon previously developed by the authors [18].
AceSum uses a soft-margin (SM) loss function for the multi-instance model because their label set was binary with −1 and 1. The proposed model, however, uses the sigmoid binary cross-entropy (BCE) loss function for training to reduce the influence of the unbalanced dataset with respect to attributes and sentiments.
In the synthetic data creation process, AceSum assumes that in a review set, each review that fulfills some constraints is a summary, and all the rest is the training corpus. The proposed model, however, does not use one entire review as a summary, but rather ranks all the sentences from the reviews in the corpus with respect to their relevance to the desired attributes and sentiment polarities and assembles the top-ranked sentences as a summary.
3.2.2 Model Formulation.
To generate the synthetic training dataset for the downstream summarization task, the MAS model is designed to generate two types of labels, attribute labels and sentiment labels, at three levels: the word, sentence, and review levels. For example, given the seven attribute categories "Permeability," "Impact Absorption," "Stability," "Durability," "Shoe Parts," "Exterior," and "Fit," the three sentiment categories "positive," "negative," and "neutral," and the user review "I really like the color of the shoe, it's comfortable," the MAS model will generate the output shown in Fig. 4.
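Schematically, and under our reading of Fig. 4 (the exact data structure is illustrative), the three-level prediction for this review takes a form such as:

```python
# Schematic of the three-level MAS output for the example review,
# under our reading of Fig. 4; the structure is illustrative.
mas_output = {
    "review": {"attributes": ["Exterior", "Fit"], "sentiment": "positive"},
    "sentences": [
        {"text": "I really like the color of the shoe",
         "attributes": ["Exterior"], "sentiment": "positive"},
        {"text": "it's comfortable",
         "attributes": ["Fit"], "sentiment": "positive"},
    ],
    "words": {"color": "Exterior", "comfortable": "Fit"},
}
```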
3.2.3 Loss Function.
The loss functions considered for training the MAS model, along with their formulations and performance, are presented in Sec. 4.1.
3.3 Creating the Synthetic Dataset.
To assemble the supervised learning dataset for the abstractive summarization task, the output of the MAS model is used. The synthetic dataset takes the following format: [Summary: …; Reviews: …; Keyword: …; Aspect: …]. No human-annotated data were used in the synthetic data creation process; the entire training corpus was assembled from the MAS output. Unlike existing approaches that simply choose one individual review from a set of reviews as a synthetic summary for training [61], this article develops a more effective approach. Since the MAS model provides sentiment and attribute predictions for each sentence in the corpus, the proposed approach chooses the top three reviews from the 60 reviews in each batch as a mini-corpus to assemble a summary. The three selected reviews must share the same sentiment and address the attribute of interest. Since attribute labels are usually imbalanced (e.g., "Shoe Parts," "Exterior," and "Fit" are discussed more frequently), the probability of multiple labels appearing in the same summary is small. Choosing reviews that follow this protocol (i.e., candidate reviews with only one common attribute) is shown to lead to better results. An example synthetic entry and a minimal sketch of the selection protocol are presented below.
Summary. “Love it. They’re perfect! I have always worn Asics for running but I wear these even when I’m not exercising…The color combo makes them my favorite pair of Asics! My first pair of ASICS and I will never go back to Nike. Being able to pick a shoe that meets my foot needs is fantastic, it truly makes a difference in the comfort of my workouts. I have recently bought three pairs of the Zig Kinetica shoes. They are the most comfortable workout/casual shoe that I have ever worn! Due to foot surgery in Oct., many of my athletic shoes are not comfortable anymore. These shoes are particularly great for women who need more width in the toe box.”
Reviews. [“So comfortable!”,“They’re super comfortable and warm!”, “I would buy another pair!”, “First shoe my 15 year old picked himself.”, “Very comfortable!”,…].
Keywords. [“fit”,“comfy”, “comfort”, “exercising”,“black”,…].
Attribute and sentiment labels. [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]. The first seven values are the attribute labels and the last three are the sentiment labels.
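The following is a minimal sketch of the selection protocol described above; the function names and the `mas_predict` stand-in for the trained MAS inference model are ours, not from a released implementation:

```python
# Minimal sketch of the synthetic-summary assembly described above:
# from each batch of 60 reviews, select the top three reviews that share
# the target sentiment and exactly one common attribute, and use them as
# the synthetic summary for the remaining reviews. `mas_predict` is a
# stand-in for the trained MAS inference model; all names are ours.
def assemble_example(batch, target_attr, target_sent, mas_predict, k=3):
    scored = []
    for review in batch:            # batch: list of review strings
        pred = mas_predict(review)  # {"sentiment": str, "attributes": set, "score": dict}
        if pred["sentiment"] == target_sent and pred["attributes"] == {target_attr}:
            scored.append((pred["score"][target_attr], review))
    scored.sort(key=lambda t: t[0], reverse=True)  # rank by attribute relevance
    top = [r for _, r in scored[:k]]
    rest = [r for r in batch if r not in top]
    return {"summary": " ".join(top), "reviews": rest,
            "attribute": target_attr, "sentiment": target_sent}
```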
3.4 Fine-Tuning the T5 Model.
Each training instance is serialized into a control sequence in which the leading attribute and sentiment tokens act as switches, `[KEY]` introduces the keywords, and `[SNT]` introduces the source sentences; an illustrative serialization sketch follows:

`[attribute1][attribute2]…[sentiment] [KEY] keyword1, keyword2, keyword3 … [SNT] sentences …`
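As an illustration, a hedged sketch of how one serialized example could be passed to the Hugging Face T5 implementation (the switch tokens and example values are ours):

```python
# Sketch: serializing one synthetic example into the control-sequence
# format and computing the fine-tuning loss with the Hugging Face T5
# implementation. Switch tokens and example values are illustrative.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

source = ("[exterior][positive] [KEY] color, look, metallic "
          "[SNT] I love the look with jeans. The color combo is great.")
target = "Users praise the color and overall look of the shoe."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # minimized during fine-tuning
```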
Cosine scheduler. During the training process, a cosine scheduler is used to create a schedule in which the learning rate decreases following the values of the cosine function from the initial learning rate (lr) set in the optimizer down to 0, with several hard restarts, after a warm-up period during which the learning rate increases linearly from 0 to the initial value. Constant and linear schedulers are also available in the library; in this article, the cosine scheduler is shown to provide the best performance. A sketch of this setup follows.
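Assuming the Hugging Face transformers implementation of this scheduler and continuing the sketch above (the hyperparameter values are illustrative, not from the article):

```python
# Cosine schedule with hard restarts and linear warm-up; hyperparameter
# values are illustrative. `model` is the T5 model from the sketch above.
import torch
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,     # learning rate rises linearly from 0 to lr
    num_training_steps=50000,  # then follows the cosine decay toward 0
    num_cycles=3,              # number of hard restarts
)
# Called once per training step, after optimizer.step():
# scheduler.step()
```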
Dropout design. During training, dropout randomly zeroes some elements of the input tensor with probability p, using samples from a Bernoulli distribution. Each channel is zeroed independently on each forward call. This has proven to be an effective technique for regularization and for preventing the co-adaptation of neurons [80]. Furthermore, the outputs are scaled by a factor of 1/(1 − p) during training, which means that during evaluation the module simply computes an identity function. In PyTorch terms, this behavior corresponds to the sketch below.
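```python
# PyTorch dropout: during training, elements are zeroed with probability
# p and the survivors are scaled by 1/(1 - p); in eval mode the module
# is an identity function.
import torch
from torch import nn

drop = nn.Dropout(p=0.1)
x = torch.ones(4)
drop.train()
print(drop(x))  # some entries are 0; the rest are 1/(1 - 0.1) ≈ 1.1111
drop.eval()
print(drop(x))  # identity: tensor([1., 1., 1., 1.])
```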
Beam search for summary generation. In the sequence-to-sequence model, there are two main parts that produce the summary: an encoder, which transforms sentences into embeddings, and a decoder, which translates the trained embeddings back into sentences. On the output side, summaries are generated word by word. Beam search keeps multiple candidate word sequences at each decoding step rather than greedily committing to a single word. In this article, the best beam search size identified during the training process was 2.
The output of the fine-tuned T5 model consists merely of review summaries, since keywords and labels are used only as input during the training process. By modifying the input switches, the fine-tuned T5 model is able to provide sentiment-attribute-guided summaries, as sketched below.
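```python
# Sketch: generating an attribute-sentiment-guided summary with the
# fine-tuned model (tokenizer/model as in the earlier sketch) using
# beam search of size 2. Switch tokens and length limits are illustrative.
review_sentences = ["I wear them to work every day.",
                    "Still look new after months."]
prompt = ("[durability][positive] [KEY] durable, wear, months [SNT] "
          + " ".join(review_sentences))
inputs = tokenizer(prompt, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    num_beams=2,     # best beam size found in this article
    min_length=20,   # minimum output length (illustrative)
    max_length=80,   # maximum output length (illustrative)
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```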
4 Experiments and Results
The first step in implementing the proposed MAS-T5 summarization framework is to clean and filter the corpus. To this end, the following steps must be taken (a consolidated sketch follows the list):
Remove reviews that are too short or too long. In this research, the lower limit for a useful review was set at 10 words and 1 sentence, and the upper bound at 80 words and 7 sentences. The goal is to extract as much information from the reviews as possible. However, during model training, each sentence in a review is padded to the length of the longest sentence in the review; when the longest sentence becomes too long, this padding degrades model performance during training. In the experiments, three sentence-length ranges were therefore tested, with lower bounds of (1, 20, 50) words and corresponding upper bounds of (20, 50, 9999) words. This limit applied only to sentence length during training; the review-level filter described above ([10 words, 1 sentence] to [80 words, 7 sentences]) was used for review selection. The sentence-length range of [20, 50] words exhibited the best performance.
The review corpus is organized at the document level. For the subsequent analysis, however, each review in the corpus must be divided into sentences. In this study, spaCy [72] was utilized as the sentence segmenter.
The review corpus comes with a uniform scoring system: each user leaves a score from "1" to "5" when submitting a review. Since "1" and "2" are labeled as negative, "3" and "4" as neutral, and "5" as positive, the same number of reviews was selected for each of the three label categories to make the dataset more balanced. It is worth mentioning that in the raw dataset, two-star and one-star reviews are the least frequent. Therefore, to balance the sentiment labels, all two-star reviews (1185 reviews) and one-star reviews (2374 reviews) were selected to represent the "negative" sentiment.
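A consolidated sketch of these cleaning steps; function and variable names are ours, and the spaCy model name is an assumption:

```python
# Consolidated sketch of the corpus cleaning steps above. Assumes
# `pip install spacy` and the en_core_web_sm model (our assumption)
# for sentence segmentation.
import spacy

nlp = spacy.load("en_core_web_sm")

def keep_review(text: str) -> bool:
    """Review-level filter: 10-80 words and at most 7 sentences."""
    n_words = len(text.split())
    n_sents = len(list(nlp(text).sents))
    return 10 <= n_words <= 80 and n_sents <= 7

def star_to_sentiment(stars: int) -> str:
    """Star-to-sentiment mapping used when balancing the corpus."""
    return {1: "negative", 2: "negative", 3: "neutral",
            4: "neutral", 5: "positive"}[stars]

# raw_corpus: list of (review_text, stars) tuples scraped from the stores.
raw_corpus = [("These sneakers fit great and the mesh keeps my feet "
               "cool on long runs.", 5)]
clean = [(t, star_to_sentiment(s)) for t, s in raw_corpus if keep_review(t)]
```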
There are a total of 145,430 reviews in the dataset, of which 10,700 have more than 60 words (long reviews), 22,458 have fewer than 10 words (short reviews), and 1636 mention product names. After removing all long and short reviews to reduce intrinsic bias, 59,184 reviews remained. Among the remaining reviews, 75.89% were five-star, 12.85% four-star, 4.96% three-star, 2.64% two-star, and 4.92% one-star. In terms of attribute references, the "Exterior" and "Fit" attributes appeared most frequently in the raw dataset. Specifically, among all reviews that contained attributes, 58.25% mentioned "Exterior," 76.88% mentioned "Fit," 12.21% mentioned "Shoe Parts," 32.11% mentioned "Durability," 7.33% mentioned "Permeability," 15.48% mentioned "Stability," and 16.59% mentioned "Impact Absorption."
There are two models in the proposed methodology. This section presents the parameters used during the training process. In the MAS model, the parameters that control model performance include the batch size, learning rate, training steps, model dimension, and number of attention heads [83]. These parameters are defined below, together with the T5 parameters. Several different parameter settings were tested during the training process. In the T5 fine-tuning process, the following parameters were adjusted to explore the best model performance:
Model. The pretrained model utilized in the fine-tuning process.
Model dimension. During the fine-tuning process, a dimension of 512 was used, which is the default dimension used in T5.
Keywords. The MAS model can output synthetic summaries and keywords from the review. Those keywords could be used as input or output in the T5 model. In the experiment, the keywords were used as both inputs and outputs for the comparison.
Batch size. The number of reviews trained in each iteration.
Learning rate. The parameter that controls the parameter updating speed in the backpropagation process.
Training steps. The number of parameter updates performed during training.
Learning rate scheduler. The learning rate update controller that could change learning rate during training.
Minimum text length. The minimum output length of the model.
Maximum text length. The maximum output length of the model.
As mentioned in the previous section, the early phase of the model utilized an attribute lexicon [18] to provide silver labels to the dataset. Part of the lexicon used in this research is presented in Table 1.
| Attribute category | Example attribute words |
|---|---|
| Permeability | "ventilation," "breathable," "mesh," … |
| Impact absorption | "air," "gel," "strap," … |
| Stability | "flytrap," "ankle," "support," … |
| Durability | "durable," "ripple," "haptic," … |
| Shoe parts | "tonal," "bucket," "bottom," … |
| Exterior | "gold," "blocking," "metallic," … |
| Fit | "dapper," "comfy," "adjustable," … |
4.1 Model Performance.
In the model training process, several different loss functions and other hyperparameters were tested. The loss functions used in the experiments are described below.
4.1.1 Soft-Margin Loss Function.
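Assuming the standard soft-margin formulation (e.g., PyTorch's `SoftMarginLoss`) over logits $x$ and targets $y \in \{-1, 1\}$, consistent with the (1, −1) label rows in Table 2, the loss averages the logistic loss over all $n$ elements:

$$\mathcal{L}_{\mathrm{SM}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i}\right)$$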
4.1.2 Multilabel Soft-Margin Loss Function.
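Assuming the standard multilabel soft-margin form (e.g., PyTorch's `MultiLabelSoftMarginLoss`) over $C$ label categories with targets $y_c$:

$$\mathcal{L}_{\mathrm{MLSM}}(x, y) = -\frac{1}{C} \sum_{c=1}^{C} \left[ y_c \log \sigma(x_c) + (1 - y_c) \log\left(1 - \sigma(x_c)\right) \right]$$

where $\sigma$ denotes the sigmoid function.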
4.1.3 Cross-Entropy Loss Function.
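Assuming the standard multiclass cross-entropy over $C$ categories, with one-hot targets $y_c$ and predicted logits $x_c$:

$$\mathcal{L}_{\mathrm{CE}}(x, y) = -\sum_{c=1}^{C} y_c \log \frac{e^{x_c}}{\sum_{j=1}^{C} e^{x_j}}$$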
4.1.4 Weighted Binary Cross-Entropy Loss Function.
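Assuming the standard weighted binary cross-entropy with per-category weights $w_{n,c}$ (cf. the Nomenclature); the "mean," "sum," and "calculated" weight variants in Table 2 presumably differ in how these weights are derived from the label frequencies:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} w_{n,c} \left[ y_{i,c} \log \hat{y}_{i,c} + (1 - y_{i,c}) \log\left(1 - \hat{y}_{i,c}\right) \right]$$

where $\hat{y}_{i,c} = \sigma(x_{i,c})$ is the predicted probability for sample $i$ and category $c$.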
The performance of the proposed MAS model is compared with a baseline model [61] in terms of the F1 score, as shown in Table 2. The last two row groups report the original model, without integrated sentiment information and with the original synthetic data creation method, under the same parameter settings. As observed in Table 2, the performance of both the MAS model and the baseline model varies with different loss functions. However, the proposed model outperforms the baseline in terms of both document-level and sentence-level predictions when using the weighted BCE loss and a learning rate of 1e−6 with the ReLU activation function. This implies that the MAS model significantly improves both precision and recall in the classification of attributes and sentiments compared to the baseline model [61].
| Metric | Learning rate | Loss function | Label type and activation function | Performance score (F1 in %) |
|---|---|---|---|---|
| MAS document-level F1 | 1e−5 | Soft-margin (Eq. (11)) | (1, −1) label with tanh | 67.86 |
| | 1e−6 | Soft-margin (Eq. (11)) | (1, −1) label with tanh | 70.91 |
| | 1e−6 | Cross-entropy (Eq. (13)) | (1, −1) label with tanh | 74.53 |
| | 1e−6 | Multilabel soft-margin (Eq. (12)) | (1, −1) label with tanh | 80.50 |
| | 1e−6 | BCE mean weight (Eqs. (14) and (15)) | (1, −1) label with tanh | 79.41 |
| | 1e−6 | Cross-entropy (Eq. (13)) | (0, 1) label with ReLU | 73.28 |
| | 1e−6 | Multilabel soft-margin (Eq. (12)) | (0, 1) label with ReLU | 81.77 |
| | 1e−6 | BCE mean weight (Eqs. (14) and (15)) | (0, 1) label with ReLU | 80.86 |
| | 1e−6 | BCE sum weight (Eqs. (14) and (15)) | (0, 1) label with ReLU | 74.84 |
| | 1e−6 | BCE calculated weight (Eqs. (14) and (16)) | (0, 1) label with ReLU | **83.56** |
| MAS sentence-level F1 | 1e−5 | Soft-margin (Eq. (11)) | (1, −1) label with tanh | 68.93 |
| | 1e−6 | Soft-margin (Eq. (11)) | (1, −1) label with tanh | 70.93 |
| | 1e−6 | Cross-entropy (Eq. (13)) | (1, −1) label with tanh | 74.44 |
| | 1e−6 | Multilabel soft-margin (Eq. (12)) | (1, −1) label with tanh | 80.46 |
| | 1e−6 | BCE mean weight (Eqs. (14) and (15)) | (1, −1) label with tanh | 78.27 |
| | 1e−6 | Cross-entropy (Eq. (13)) | (0, 1) label with ReLU | 72.55 |
| | 1e−6 | Multilabel soft-margin (Eq. (12)) | (0, 1) label with ReLU | 80.24 |
| | 1e−6 | BCE mean weight (Eqs. (14) and (15)) | (0, 1) label with ReLU | 80.21 |
| | 1e−6 | BCE sum weight (Eqs. (14) and (16)) | (0, 1) label with ReLU | 76.39 |
| | 1e−6 | BCE calculated weight (Eqs. (14) and (16)) | (0, 1) label with ReLU | **83.41** |
| Baseline document-level F1 | 1e−6 | Soft-margin (Eq. (11)) | (1, −1) label with tanh | 74.65 |
| | 1e−6 | BCE sum weight (Eqs. (14) and (16)) | (0, 1) label with ReLU | 76.54 |
| Baseline sentence-level F1 | 1e−6 | Soft-margin (Eq. (11)) | (1, −1) label with tanh | 71.76 |
| | 1e−6 | BCE sum weight (Eqs. (14) and (16)) | (0, −1) label with tanh | 73.74 |
Note: The bold text indicates the best performance in all the experiments.
In the T5 fine-tuning process, the ROUGE score was used as the benchmark to evaluate the performance of the model. The ROUGE-L score is based on the length of the longest common subsequence (LCS) between candidates and references [81]. The ROUGE-L performance of the sequence-to-sequence models is presented in Table 3, where T5 was fine-tuned with the same parameter settings for the baseline model and the MAS model. The best performance of the baseline T5 model (16.00) is slightly better than that of the proposed MAS-T5 model (15.67). This result was anticipated: in the synthetic dataset creation process, the baseline model picks an entire user review as a synthetic summary, while the MAS-T5 model creates synthetic summaries by collecting useful sentences (i.e., sentences that contain attribute words) from different user reviews. This compromise in the ROUGE-L score was necessary, however, because many reviews contain contradictory sentiments, which may "confuse" the summarization process if whole reviews are used to compile the synthetic summaries, as done in the baseline model [61].
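For reference, ROUGE-L [81] combines the LCS-based recall and precision of a candidate summary $Y$ (of length $n$) against a reference $X$ (of length $m$) into an F-measure:

$$R_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{m}, \qquad P_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{n}, \qquad F_{\mathrm{lcs}} = \frac{(1 + \beta^2)\, R_{\mathrm{lcs}} P_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^2 P_{\mathrm{lcs}}}$$

where $\beta$ weights recall relative to precision.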
| Postmodel | Premodel | Learning rate | Training steps | Beam search size | ROUGE-L score |
|---|---|---|---|---|---|
| T5 small | Baseline | 1e−5 | 10,000 | 2 | 11.78 |
| | MAS | 1e−5 | 10,000 | 2 | 12.56 |
| | MAS | 1e−6 | 20,000 | 2 | 12.03 |
| | MAS | 1e−6 | 50,000 | 2 | 15.21 |
| | MAS | 1e−6 | 100,000 | 2 | 15.54 |
| | MAS | 1e−6 | 10,000 | 3 | 11.33 |
| | MAS | 1e−6 | 20,000 | 3 | 12.25 |
| | MAS | 1e−6 | 50,000 | 3 | 15.06 |
| | MAS | 1e−6 | 100,000 | 3 | 15.11 |
| | MAS | 1e−6 | 100,000 | 4 | 15.24 |
| T5 small v1.1 | Baseline | 1e−5 | 10,000 | 2 | 12.11 |
| | MAS | 1e−6 | 10,000 | 2 | 11.34 |
| | MAS | 1e−6 | 20,000 | 2 | 12.57 |
| | MAS | 1e−6 | 50,000 | 2 | **15.67** |
| | MAS | 1e−6 | 100,000 | 2 | 15.63 |
| | MAS | 1e−6 | 10,000 | 3 | 11.27 |
| | MAS | 1e−6 | 20,000 | 3 | 12.56 |
| | MAS | 1e−6 | 50,000 | 3 | 15.23 |
| | MAS | 1e−6 | 100,000 | 3 | 15.34 |
| | MAS | 1e−6 | 100,000 | 4 | 15.66 |
| T5 small | Baseline | 1e−5 | 10,000 | 2 | 11.78 |
| | Baseline | 1e−6 | 50,000 | 2 | **16.00** |
Note: The bold text indicates the best performance in all the experiments.
4.2 Opinion Polarity and Subjectivity.
The opinion polarity and subjectivity score patterns are analyzed for each attribute across the collection of all reviews. For each attribute, a similarity score was determined for each review with reference to the set of representative keywords for that attribute, and the ten reviews with the highest similarity scores were selected for that attribute. The opinion polarity and subjectivity were then determined for that subset (i.e., the filtered reviews described in Sec. 3). The distribution of the opinion polarity scores for each attribute is presented in Fig. 5 as box plots. Polarity scores were in a similar range for most attributes, with mean values around 0.2, indicating the overall positive nature of the reviews. In Fig. 5, a relatively higher number of negative polarity data points can be observed for the "exterior" and "stability" attributes compared to other attributes. The attribute-wise distribution of the subjectivity scores is presented as box plots in Fig. 6. The subjectivity score distribution also appeared similar for most attributes, with a mean value of around 0.5, indicating that most of the reviews were reasonably objective in nature.
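The article does not name the scoring library; TextBlob is one common choice whose polarity lies in [−1, 1] and subjectivity in [0, 1], consistent with the ranges reported here. A hypothetical sketch:

```python
# Hypothetical sketch of the polarity/subjectivity scoring; TextBlob is
# our assumption, not a tool named in the article.
from textblob import TextBlob

reviews = ["The shoe sole was very narrow.",
           "I did not like the shoe sole."]
for r in reviews:
    s = TextBlob(r).sentiment
    print(f"polarity={s.polarity:+.2f}  subjectivity={s.subjectivity:.2f}")
```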
The distributions of opinion polarity and subjectivity are also compared between the original collection of reviews and the collection of review summaries to check for deviations. For the reviews collected for each attribute, a summary was generated using the T5 model, and the opinion polarity and subjectivity scores were calculated for each review in the original collection and for the generated summary. The comparative analysis between the distributions of these scores in the collection and the summary provides information on any loss of information. Figure 7 shows the distributions of opinion polarity and subjectivity for the original review set and the summary set. The variance of the polarity and subjectivity distributions was relatively lower for the summary set than for the original reviews, which is intuitive because the original review dataset is noisier. Figure 7 also shows that the mean opinion polarity and subjectivity scores were relatively higher for the summary dataset than for the original review dataset. For polarity, the mean of the summary dataset was 0.32, while the mean of the review dataset was 0.18. For subjectivity, the mean of the summary dataset was 0.58, while the review dataset mean was 0.52. Based on the unequal-variance t-test, the differences in mean polarity and subjectivity scores between the summary and review datasets were found to be statistically significant. This difference in polarity and subjectivity can be further analyzed qualitatively to understand the nature of information loss in the summarization process.
4.3 Attribute-Sentiment-Guided Summaries.
In the actual model training process, the word "comfortable" had the highest probability of appearing in "positive" summaries. To avoid imbalanced results, "comfortable" was removed from the attribute lexicon during training. Some attribute-sentiment summarization results are shown below.

Example of "positive" and "color" summarization
“i have always loved air force 1s so this is so much better. i love the look with jeans, the icy blue is more of a plain gray, and more important the laces are white - not a light blue as shown. meanwhile i bought these for my son because he wanted this style and he was having an all white dance.”
Examples of “positive” and “durability” summarization
“i love these shoes. i’m a healthcare worker and i have been wearing them for a long time and they are very comfortable and comfortable. they are a great fit and a good fit! i bought these for my son and he loves them.”
“i’m a nurse and i’ve been a fan of these shoes. i have been wearing these shoes for a while now. they’re a great shoe for a long time.”
“i’m a walker and i’ve been able to wear them all day. i love them! i love these shoes! they are so comfortable and comfortable. i have been wearing them for a while now. i can’t wait to see if they’ll be available.”
Examples of “positive” and “fit” summarization
“a great shoe! i have a a sleeve and they are a great fit! i love the color of the shoe. i love how they feel on my feet. they are so comfortable.”
“i’m wearing these shoes for a long time. i love the fit. they are very comfortable.”
Examples of “negative” and “shoe parts” summarization
“the toe box was uncomfortably huge. i have always loved the reebok classics. hurt the the side of my foot.”
“they are too wide! i have to find another pair, i have a couple of pairs and the shoe sleeve is easy to move.”
Examples of “negative” and “color” summarization
“the color of the grey is not what I expected. i have always love this sneaker, the color is not what i saw in the picture, the color is darker on the top.”
“the silver strip on the side is boring, the blue heel is not the right blue, the heel is tough.”
Some interesting observations can be immediately made from the aforementioned example results. For example, when people review a product, they often refer to its usage context as well. When the users of a footwear item say something about its durability, for example, they also mention their occupations, which may require standing for long periods of time (e.g., nurses, doctors, factory workers). Moreover, it can be observed that negative reviews usually tend to address specific parts of the item (e.g., a small or tight toe box, hard heels) or a specific attribute not matching the original description on the website (e.g., online versus actual color). Since the generated summaries capture the most informative parts of the reviews, designers can rely on summaries generated with respect to different sentiments and/or attributes to quickly evaluate potential causes of dissatisfaction and compare competing items on the market when designing a new concept.
The presented model only utilizes the ROUGE score for evaluation. However, this benchmark only measures the overlap between the model output and the human-generated summary. To generate more useful information for designers, the ROUGE score may not be sufficient, since the most common review content is probably not expressed in design language. For example, the most common sentences in the positive review summaries generated in this work are "this sneaker is very comfortable" and "the shoe is very comfortable". Hence, during the fine-tuning process, the model learns to include sentences like these in the summary with a higher probability. Yet, this kind of summary may have very limited implications for designers. Another limitation related to design language comes from the review corpus itself. Specifically, users rarely provide professional design feedback in their positive reviews, but tend to be very specific and helpful in their negative reviews. As a result, summaries of positive reviews are usually less capable of producing useful outputs and recommendations for designers.
5 Conclusions and Future Research Directions
This article proposed a novel MAS-T5 framework for the automated and large-scale generation of opinion summaries from online reviews, guided by user sentiments and product attributes. Building on advanced NLP research on language models, the framework is anticipated to save significant amounts of time and effort for data preparation and reduce the need for hand-engineered expert systems for opinion summarization from reviews. The developed framework also enables an efficient and continuous update of the opinion extraction results as more users publish their feedback and opinions on e-Commerce platforms on a daily basis. The advantages of the MAS-T5 framework for the large-scale attribute-sentiment-guided opinion summarization are as follows:
Efficiency and scalability. The use of pretrained language models such as T5 reduces the need for large, manually labeled datasets. All components of the MAS-T5 methodology are packaged in a structured fashion and can be quickly modified and applied to any sentiment-oriented opinion summarization problem. The methodology also reduces the need to conduct extensive market studies, focus groups, interviews, or lead user analyses. The only input required is free-text user reviews. Consequently, the investments in human power, budget, and space required to conduct large-scale need finding studies can be dramatically reduced.
Automated and large-scale sentiment-oriented opinion summarization. The MAS-T5 methodology extracts an exhaustive list of candidate attribute- and sentiment-oriented summaries. The developed summarization process provides strong flexibility to switch the target sentiment and attribute. This is a significant step toward enabling automated and large-scale opinion summarization, which, compared to lead user-based approaches, can potentially extract more informative and transformative insights to inform the design process.
Modular structure. The MAS-T5 network is composed of independent MAS and T5 modules. Such a modular structure enables flexibility for the independent/parallel improvement of different modules, with new add-on features for handling different NLP tasks.
5.1 Implications for Product Design.
Opinion summarization has not been widely explored as a means to inform the design of new products. The elicitation and incorporation of user data in the design process have been shown to be effective for the overall success of new product/service development processes [2] by increasing the quantity and quality of ideas at the front-end of the design process [3,4]. There is a substantial opportunity to improve the front-end of design innovation processes by generating brief, guided summaries of user feedback from the myriad reviews available on various e-Commerce and social media platforms. This article builds on the state-of-the-art in deep language representation [8,9] and information extraction [85–87] to generate selective and filtered summaries of attribute-specific user sentiments that currently cannot be manually processed by designers due to the large quantity and diversity of reviews. The existing research that attempted to bridge this gap mostly uses information extraction to select reviews, with the goal of filtering useful reviews for designers [88,89]. However, thousands of reviews remain even after the review selection process, and the question of how to summarize them into shorter, more guided executive summaries becomes more and more important. Some researchers have also tried to identify useful keywords from the review corpus, but these methods still lack detailed design information [90,91]. All of these limitations and potentials point to the importance of guided and controllable opinion summarization for early-stage product development. To fully realize the potential of the proposed methodology, extensive future research is required, on both methodology and validation, to extract complex, nonobvious, and difficult-to-identify user opinions and, ideally, the latent needs.
5.2 Future Research: Methodology.
Further research should optimize the MAS-T5 network architecture to reduce the loss and improve the ROUGE score. The main technical limitations of the MAS-T5 methodology to be addressed in future work are summarized as follows:
Low ROUGE score. The free-text reviews used as input in this work are noisy and imbalanced. Moreover, the training dataset used to fine-tune the T5 model only included synthetic summaries assembled from the original corpus; it therefore inherently contained some incoherence in expression. The baseline model [61] achieves a best ROUGE-L score similar to that of the presented model. The synthetic data creation method was the main factor limiting the ROUGE-L score: since the sentiment and attribute labels are used as the indicators for sentence selection in the synthetic data creation process, breaking entire reviews into sentences to assemble the synthetic summary inherently reduces performance in terms of the ROUGE score.
Imbalanced attributes and sentiments. The raw dataset was highly imbalanced with respect to both attributes and sentiments. Specifically, among all the reviews in the corpus, over 92% of users gave five-star ratings, even though some of them complained about the product. In terms of attributes, fit, shoe parts, and exterior together represented more than 90% of the attributes mentioned in the original corpus. Further, among the word-level predictions of the model, the word "comfortable" appeared ten times more frequently than the second most frequent word.
Potential information loss due to long network structure. The MAS-T5 network comprises several submodels for MAS, synthetic data creation, and T5 fine-tuning. Each submodel has a separate loss, which in turn may cause information loss and make the output very noisy. This problem can be addressed by improving the architecture of each submodel.
Lack of a human-annotated dataset. In the baseline model [61], the authors use a human-annotated dataset to test and validate their model. The presented work, however, did not use any human-annotated dataset. The model can therefore be improved in the future through testing and validation on a human-annotated dataset.
5.3 Future Research: Validation.
It is not yet clear whether and how the proposed methodology will impact the performance of design teams in finding meaningful and informative user opinion summaries in practice. Future research must conduct extensive human-subject studies in controlled laboratory environments to measure the hypothesized difference between the performance of a design team using the MAS-T5 results and that of a design team reading reviews directly from e-Commerce platforms. Future research must also devise new mechanisms to measure how informative each identified summary is to the designer. That is, even if the results are guided with respect to attributes and sentiments, there are still various other ways to organize and analyze opinion summaries. Finally, professional designers must be involved in the process of building and validating these models to ensure effective practical use.
Acknowledgment
This material is based on work supported by the National Science Foundation under the Engineering Design and System Engineering (EDSE) program (Grant No. 2050052). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Nomenclature
- a = element-wise product of each key and query
- b = constant added to the linear and nonlinear transformation of the word encoding
- c = label categories in the MAS model
- e = encoding from the pretrained model
- h = attention head in the MAS model
- k = kth head in the MAS model
- n = batch size in the MAS model
- p = probability for samples drawn from a Bernoulli distribution
- y = actual label
- A = attribute word set
- C = review corpus
- = similarity between each sentence and the summary
- ŷ = predicted label
- a_n = nth attribute word in A
- a_h = attention output in the MAS model
- b_h = constant added to the linear and nonlinear transformation of the attention head encoding
- r_i = ith review in a review batch
- w_{n,c} = weights of category c
- = loss function of the MAS model
- = loss function of the sequence-to-sequence model
- P_a = attribute label prediction in the MAS model
- = category weight in the MAS model
- P_h = head prediction in the MAS model
- P_r = review-level prediction in the MAS model
- P_s = sentence-level prediction in the MAS model
- P_sa = sentence-level attribute prediction in the MAS model
- P_ss = sentence-level sentiment prediction in the MAS model
- P_t = token-level prediction in the MAS model
- W_i = ith word in a sentence
- W_e = word encoding result after using the RoBERTa encoder
- W_he = attention head encoding result after using the RoBERTa encoder
- W_n = word list of a review, including n words
Appendix
Category | Attributes |
---|---|
Permeability | “permeability,” “ventilation,” “breathable,” “mesh,” “nylon,” “zoned,” “forged,” “perforated,” “chamois,” “adaptive,” “neoprene,” “pigskin,” “rubber,” “waterproof,” “construction,” “coating,” “pod,” “repellent,” “leather,” “insulation,” “rustproof,” “forefoot,” “resistant,” “textile,” “lining,” “membrane,” “breathable” |
Impact absorption | “impact absorption,” “supportive,” “air,” “gel,” “strap,” “foam,” “bounce,” “shock,” “segmented,” “geometric,” “pattern,” “zoom,” “energy,” “compression,” “flex,” “impact,” “guidance,” “react,” “protection,” “loft,” “vertical,” “groove,” “energy return,” “flair,” “propulsion,” “reflective,” “boost,” “turbo,” “embroidery” |
Stability | “stability,” “warmth,” “grip,” “heel,” “clip,” “lateral,” “synthetic,” “continental,” “collar,” “underlay,” “cage,” “barrier,” “fusible,” “knit,” “fabric,” “sticky,” “torsion,” “bungee,” “tape,” “smooth,” “ride,” “wedge,” “external,” “flytrap,” “ankle,” “support,” “carbon,” “fiber,” “guide,” “tongue,” “flexibility,” “flexible,” “stretchy,” “gore,” “panel,” “phylon,” “speedy,” “explosive,” “graphic,” “wear,” “traction,” “abrasion,” “solid,” “herringbone,” “waffle,” “circular,” “multidirectional,” “rugged,” “tread,” “canvas,” “knobbed,” “chevron,” “sponge,” “lug” |
Durability | “durability,” “drier,” “breezy,” “cooler,” “suede,” “tumbled,” “lightweight,” “vamp,” “durable,” “ripple,” “haptic,” “thin,” “woven,” “material,” “overlay” |
Shoe parts | “cushy,” “fusion,” “firm,” “absorbing,” “springy,” “poly,” “wavy,” “padding,” “speckled,” “translucent,” “cut,” “tonal,” “grippy,” “bottom,” “bold,” “curvy,” “removable,” “cushiony,” “thick,” “hard,” “soft,” “exoskeletal,” “beveled,” “iridescent,” “silhouette,” “low,” “sheen,” “skin,” “covert,” “exoskeletal,” “bucket,” “lacing,” “zone,” “saddle,” “cushion,” “elastic,” “cushioned,” “optimal,” “plush,” “cotton,” “responsive,” “insole,” “ignite,” “visible,” “pillowy,” “fixation,” “sassy,” “toggle,” “loop,” “laceless,” “zip,” “gilly,” “asymmetrical,” “magnetic,” “buckle,” “iconic,” “lace,” “futuristic,” “cap,” “tuff,” “embellishment,” “clasp,” “apparel,” “welt,” “quilted,” “posture,” “eyelet,” “solar” |
Exterior | “color,” “red,” “yellow,” “blue,” “striking,” “graphics,” “palette,” “gold,” “blocking,” “metallic,” “marble,” “black,” “orange,” “anthracite,” “white,” “royal,” “gloss,” “stripe,” “sweeping,” “shape,” “wardrobe,” “arch,” “sleek,” “structural,” “flattering,” “edgy,” “masculine,” “anatomic,” “nib,” “versatile,” “exaggerated,” “inflated,” “swoosh,” “chunky,” “bulky,” “style,” “boat,” “tall,” “doodle,” “look,” “zipper,” “stitching,” “shearling,” “calf,” “strapless,” “insulated,” “patchwork,” “foxing,” “washable,” “topline,” “surface,” “stretch,” “ribbing,” “asymmetric,” “yarn,” “plastic,” “stretchable,” “melange,” “exposed,” “paneling” |
Fit | “trim,” “gait,” “big,” “small,” “dress,” “distressed,” “dapper,” “comfy,” “adjustable,” “narrow,” “custom,” “strategic,” “large,” “closure,” “curved,” “inner,” “sleeve,” “secure,” “snug,” “comfortable,” “band,” “crisscross,” “wide,” “width,” “softfoam,” “anatomical,” “holistic,” “weight,” “heavy,” “light,” “featherweight” |