INTRODUCTION
A general survey of all users of the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) was launched online on December 8, 2010, and remained open for participation through February 3, 2011. Information was collected about the users themselves, their experience with the OLCF, and the OLCF support capabilities. Attitudes and opinions on the performance, availability, and possible improvements to the OLCF and its staff were also solicited.
The survey was created with contributions from OLCF staff and the Oak Ridge Institute for Science and Education (ORISE) and was hosted online by ORISE. Of the 1,116 possible respondents, 402 completed the survey, giving an overall response rate of 36%.
SURVEY DEMOGRAPHICS
Survey Respondents
Please note that percentages of response categories may add up to more than 100% because some questions allowed multiple responses.
Table: Support Services Used (n = 277)
Services | N | % |
---|---|---|
User Assistance Center | 222 | 80% |
Scientific Computing/Liaison | 104 | 38% |
Visualization | 48 | 17% |
End-to-End | 17 | 6% |
Note. Percentages add up to more than 100% because some users use more than one service.
Table: Length of Time as an OLCF User (n = 402)
Years as an OLCF user | N | % |
---|---|---|
Greater than 2 years | 161 | 40% |
1 – 2 years | 116 | 29% |
Less than 1 year | 125 | 31% |
Table: User Classification, by Project Type (n = 365)
Project(s) classification | N | % |
---|---|---|
INCITE | 226 | 62% |
Director’s Discretion | 93 | 25% |
Other | 90 | 25% |
ALCC | 9 | 2% |
Note. Percentages add up to more than 100% because some users have more than one project type.
OVERALL USER SATISFACTION
Of the optional questions, the overall-satisfaction question had one of the highest response counts, with 93% of respondents providing their opinions. Of these, 90% (337 respondents) reported being “Satisfied” or “Very Satisfied” with the OLCF overall, only seven (2%) reported being “Dissatisfied,” and only four (1%) reported being “Very Dissatisfied.”
Table: Overall OLCF Evaluation (n = 375)
Satisfaction with OLCF | N | % |
---|---|---|
Very Satisfied | 93 | 45% |
Satisfied | 98 | 45% |
Neither satisfied nor dissatisfied | 17 | 7% |
Dissatisfied | 5 | 2% |
Very Dissatisfied | 2 | 1% |
In response to an open-ended question about the best qualities of the OLCF, thematic analysis of user responses identified great staff and support (38% of responses), powerful/fast machines (33%), and large computational capacity (17%) as the respondents’ top three themes.
Table: Best Qualities of OLCF (n = 144/141)
Theme | N | % | % Excluding n/a responses |
---|---|---|---|
Great staff/user support | 53 | 37% | 38% |
Powerful/Fast machines | 47 | 33% | 33% |
Large Capacity | 24 | 17% | 17% |
Systems/Facilities | 22 | 15% | 16% |
Access | 16 | 11% | 11% |
Resources | 16 | 11% | 11% |
Stability | 11 | 8% | 8% |
Website | 7 | 5% | 5% |
The queuing system favoring large-scale jobs | 7 | 5% | 5% |
General satisfaction | 5 | 3% | 4% |
N/A | 3 | 2% | – |
Ease of use | 2 | 1% | 1% |
Note. Percentages add up to more than 100% because some respondents provided more than one quality.
In addition to the best qualities of the OLCF, respondents were asked to choose the single most important change needed to improve the user experience with the OLCF. The top three areas were reliability/stability (23%), improved performance (15%), and queuing policies (13%).
Table: Areas in Need of Improvement (n = 84/82)
Theme | N | % | % Excluding n/a responses |
---|---|---|---|
Reliability/Stability | 19 | 23% | 23% |
Performance | 13 | 15% | 16% |
Queuing policies | 11 | 13% | 13% |
User support | 7 | 8% | 9% |
Training/More info | 7 | 8% | 9% |
Systems/Facilities | 7 | 8% | 9% |
Satisfied | 7 | 8% | 9% |
Performance | 5 | 6% | 6% |
N/A | 2 | 4% | – |
Don’t know | 1 | 1% | 1% |
Note. Percentages add up to more than 100% because some respondents provided more than one area for improvement.
Reliability/Stability
The OLCF reviewed the comments related to reliability/stability. The following comments are representative of the responses received.
Jaguar XT5 was very stable in Spring 2010, but then was quickly aged, by the time of reaching fall, the system had too many unscheduled outages due to node issues and/or file system issues, which made it very difficult to run full machine scale job for more than 2-hours (our full machine 24-hour job crashed 9x).
Reduce unscheduled outages
Late in 2010, Cray and ORNL determined that failures of voltage regulator modules (VRMs) on the ORNL XT5 were statistically higher than at other XT5 sites. A VRM failure can impact a compute blade, take down the system interconnect fabric, and require a reboot to recover. These failures negatively impacted the stability of the XT5. Working with Cray, the OLCF identified an engineering change related to the module’s input voltage and implemented it in the spring of 2011; users should see increased stability as a result.
Performance
The OLCF reviewed the comments related to desired improvements to performance. The performance comments spanned a wide variety of topics including performance of the systems, the queue policy, and network performance. A few users commented on the performance of the Lustre file system.
Improve performance of Lustre file system
My biggest headache this year has been I/O performance.
Several initiatives were undertaken to improve the Lustre file system. The OLCF worked with application teams to improve the scalability of their applications’ I/O. The center also installed two additional file systems to reduce shared-resource contention, increasing both aggregate metadata performance and bandwidth. Finally, the OLCF initiated a contract with Whamcloud to improve metadata performance in Lustre. While this work is not yet in production, the center has seen substantial performance improvements during testing on Jaguar XT5, and the goal is to move it into production by the end of 2011.
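For users tuning their own application I/O on Lustre, one common first step is adjusting file striping so that large shared files are spread across more object storage targets. The commands below are a minimal, illustrative sketch using the standard Lustre `lfs` utility; the stripe count, stripe size, and directory path are placeholder values rather than OLCF-recommended settings, and the exact option syntax can vary between Lustre versions.

```
# Show the current striping of a directory (or file) on the Lustre scratch area
lfs getstripe /tmp/work/$USER/run_dir

# Stripe new files created in this directory across 8 OSTs with a 1 MB stripe size
# (placeholder values; large shared files often benefit from wider striping)
lfs setstripe -c 8 -s 1m /tmp/work/$USER/run_dir
```

Files written into the directory after the `setstripe` call inherit the new layout; existing files keep their original striping unless they are copied or rewritten.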
Queuing Policy
The OLCF reviewed the comments related to queue policy. The following comments are representative of the responses received.
I would first like to remark positively on the queuing policy, which prioritizes very large runs, is an excellent and unique feature of the OLCF that enables calculations that are unthinkable elsewhere. Typically before we get to the stage of being ready to compute at this scale we need to run many smaller runs with much lower core count, but we still need these to turn around quickly to enable eventually running the larger runs. Another similar issue is runs for post-processing. Although these runs are relatively short, again we must do many of them because we develop new conceptual approaches and tools to essentially every run-set we do, and this development occurs iteratively as ideas are solidified. (We do not apply a standard analysis to each run-set.) Some way of prioritizing these types of pre- and post- processing steps, which are essential to the overall scientific goals, could be useful, though I am not sure how to implement it without compromising the ability to perform huge runs requiring a large fraction of the machine.
Great service. It would be useful to have a benchmark queue which would allow for running longer on smaller number of cores (scaling studies often run in the 2h limit).
Sometimes, I want to run a small job using several hundreds of cores without a long queue time.
Users have often asked questions about our queuing policy and its effect on smaller jobs. The OLCF is one of the two Leadership Computing Facilities in the DOE program. From DOE’s Leadership Computing website (www.doeleadershipcomputing.org), “The mission of the INCITE program is to enable high-impact, grand-challenge research that could not otherwise be performed without access to the leadership-class systems…”. These are projects for which, “for the purpose of scientific or technological discovery, your research requires very, very large compute resources: for example, on the order of 40K processors.” To ensure that its Leadership facilities further the goals of the INCITE program, DOE has established usage targets for leadership-class jobs on these systems.
While there is no “typical” queue policy, many systems are set up so that very large jobs usually do not run until they have become the oldest jobs on the system and the scheduling software finally pushes them through. To allow faster turnaround of these large jobs, and to handle large numbers of them more efficiently, the OLCF has adopted a queuing policy that heavily favors large jobs: larger jobs receive priority boosts and can request longer walltimes than smaller jobs. It may seem counterintuitive that large jobs are allowed to run for long times (making other large jobs wait) while smaller jobs cannot access longer runtimes, but this prevents a long-running, relatively small job from blocking large jobs from starting. Instead, when a large job finishes there is often another large job waiting to take its place. (In the other situation, the smaller job might hold just enough resources to prevent the larger one from starting, and additional long-running small jobs could further delay the larger job.)
We do understand that there is often a need for smaller jobs, such as pre- and post-processing for large runs. For that reason, small jobs are not prohibited from using the system; they are, however, limited to prevent them from impacting larger INCITE runs. Additionally, small jobs sometimes have higher per-processor memory requirements than larger-scale jobs and are often better suited to smaller cluster-based systems; matching each workload to the resource whose capabilities it fits (smaller jobs on clusters, larger jobs on massively parallel systems) makes more efficient use of both. We appreciate input from users who run small- to medium-sized jobs, as it helps us better understand how to support those computing needs. To that end, we received several comments from users who need to run smaller jobs in preparation for scaling, and the OLCF is currently investigating options to address this issue.
USER ASSISTANCE CENTER
Seventy-three percent of the respondents had at least one interaction with the User Assistance Center (UAC) and its staff. When asked about the helpfulness of the user assistance staff, a large majority of the users (85%) were satisfied or very satisfied. Overall, users reported a high level of satisfaction with OLCF service in providing support and responding to needs.
Table: User Assistance Center (UAC) Evaluation, by Number of Queries (n = 398)
Approximately how many total queries have you forwarded (via phone or e-mail) to the UAC this year? | N | % |
---|---|---|
0 | 106 | 27% |
1 – 5 | 232 | 58% |
6 – 10 | 39 | 10% |
11 – 20 | 9 | 2% |
Greater than 20 | 12 | 3% |
Table: User Assistance Center (UAC) Evaluation (n = 336, 335, 336, 333, 322, 325)
Overall, rate your satisfaction with the following aspects of User Assistance: | 1 = Very Dissatisfied | 2 = Dissatisfied | 3 = Neither Satisfied nor Dissatisfied | 4 = Satisfied | 5 = Very Satisfied | Mean |
---|---|---|---|---|---|---|
Helpfulness of User Assistance Staff | 10 (3%) | 2 (1%) | 39 (12%) | 96 (29%) | 186 (56%) | 4.34 |
The speed of the initial response to your queries | 9 (3%) | 4 (1%) | 36 (11%) | 112 (33%) | 175 (52%) | 4.31 |
Speed of response to account management issues | 5 (2%) | 5 (2%) | 56 (17%) | 89 (28%) | 167 (52%) | 4.27 |
Effectiveness of response to account management issues | 5 (2%) | 7 (2%) | 56 (17%) | 87 (27%) | 170 (52%) | 4.26 |
The speed of final resolution to your queries | 5 (1%) | 12 (4%) | 45 (13%) | 115 (34%) | 158 (47%) | 4.22 |
Effectiveness of problem resolution | 8 (2%) | 10 (3%) | 45 (13%) | 112 (33%) | 161 (48%) | 4.21 |
Recommendations for User Assistance Center
When asked to provide comments on ways in which the OLCF could improve user assistance, respondents offered the following suggestions; each is presented with an OLCF response.
User Comment 1
A few responses indicated that users had experienced delays in the creation of user accounts. For example:
There was a considerable delay in getting accounts set up for my group members; however, I suspect this was unavoidable since for the most part they are not US citizen/residents.
OLCF Response 1
There are several steps involved before a user can gain access to OLCF resources. Those steps include:
- PI approval
- Distribution of the RSA SecurID token
- Foreign national participants are sent an Oak Ridge National Laboratory Personnel Access System (PAS) request specific to the facility and cyber-only access; after the response is received, approval takes between 15 and 35 days.
- Fully-executed User Agreements with each institution are required
- If you are processing sensitive or proprietary data, additional paperwork is required
We realize these requirements can take a while, and it can be frustrating to encounter delays in getting access to the system. These steps are mandated by DOE and by our security requirements. However, this year we reevaluated the access procedures and policies and worked with the relevant support groups at ORNL to streamline the PAS process for the creation of user accounts. Previously required for all foreign national users and for all users on data-sensitive projects, PAS entries will now be required only for foreign national users (unless they are from a US national laboratory) who are on data-sensitive projects. This change has been approved by the relevant ORNL support groups (including the NCCS cyber security team) and should significantly reduce the time to access. We will continue to monitor our access procedures to reduce the time it takes to gain access to a project.
User Comment 2
The documentation provided to get connected could have been a little clearer. But, the user support I received over the telephone was outstanding.
More basic documentation on how to get started, which compilers to use, how to switch between compilers, etc. would be helpful. Most of this was available, but it was sometimes hard to find and hard to put together.
OLCF Response 2
A “getting started” page has been created for new users (and as a refresher for existing users). The page can be found at https://www.olcf.ornl.gov/support/getting-started/. It covers how to request an allocated project, how to join an allocated project, and the general steps for using the OLCF systems, from connecting to running batch jobs.
User Comment 3
Please add a time zone indication to ‘down since hh:mm’ something like ‘down since hh:mm UTC (down for xx:yy)
OLCF Response 3
Great suggestion! The time zone has been added.
User Comment 4
When I look at the website, I want to find an example of a batch script.
OLCF Response 4
An article containing example XT5 batch scripts has been created: https://www.olcf.ornl.gov/kb_articles/xt-batch-script-examples/
The article covers a number of basic scenarios and is meant to provide building blocks for real-world cases that may be more complicated. If you find that no example covers your use of the batch system, please let us know.
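As a flavor of what the article covers, the sketch below shows the general shape of a simple XT5-era batch script: PBS directives for the project, job size, and walltime, followed by an `aprun` launch on the compute nodes. The account string, core count, and executable name are placeholders, so the specific directives for a real job should come from the knowledge base article and the current system documentation.

```
#!/bin/bash
# Hypothetical example; replace the account, core counts, and executable with your own.
#PBS -A ABC123              # project allocation to charge
#PBS -N example_job         # job name
#PBS -j oe                  # combine stdout and stderr in one file
#PBS -l walltime=01:00:00   # requested wall-clock time
#PBS -l size=24             # number of cores requested

cd $PBS_O_WORKDIR           # start in the directory the job was submitted from
aprun -n 24 ./my_app        # launch 24 MPI tasks on the compute nodes
```

A script like this would typically be submitted with `qsub` and monitored with `qstat` or `showq`.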
TRAINING AND EDUCATION
When presented with a list of training topics, respondents’ most frequently requested topic was Tuning and Optimization (54%), followed by GPGPU Programming (52%) and Advanced MPI (47%). Less frequently requested topics included Hybrid Programming (MPI and OpenMP), parallel debugging, visualization and data analysis tools, managing I/O, and MPI basics (see the table below). The majority of respondents selected documentation (76%) as their preferred method of training, followed by live via web (52%) and live in person (38%).
Table: Training Desired (n = 334)
Training Topics | N | % |
---|---|---|
Tuning and Optimization | 179 | 54% |
GPGPU Programming | 174 | 52% |
Advanced MPI | 156 | 47% |
Hybrid Programming (MPI and OpenMP) | 151 | 45% |
Debugging | 141 | 42% |
Visualization and Data Analysis Tools | 131 | 40% |
Managing I/O | 110 | 33% |
MPI Basics | 91 | 27% |
Table: Users’ Training Preferences (n = 345)
Training Method | N | % |
---|---|---|
Documentation | 261 | 76% |
Live – via web | 178 | 52% |
Live – in-person | 130 | 38% |
Other, please specify | 7 | 2% |
Recommendations for Training and Education
User Comment 1
Webcasting is a must nowadays. Users cannot always travel and webcasting is great. At a minimum, the remote participant views the slides and hears the presenter’s voice, and the slides advance by themselves as the presenter goes through his talk. Having the presenter call the slide number never works… There should also be an archive of those webcasts so that the users can refer back to them.
OLCF Response 1
The OLCF has begun webcasting workshops and seminars to broaden participation. These webinars are recorded and will be available on the enhanced OLCF website. In addition, several education initiatives have been launched, including a 10-minute tutorial series, an HPC fundamentals series, a GPU series, and an advanced-topics series. The 10-minute tutorials are recorded screencasts of common technical tasks that OLCF users perform. The HPC fundamentals series targets new users who wish to expand their knowledge of common HPC topics. The GPU series is designed to support the Titan project and prepare users to successfully utilize hybrid architectures. The advanced-topics series targets users who need to understand advanced programming models, debugging strategies, or optimization techniques.
User Comment 2
Please make the documentation, training materials, and slides, perhaps even video available on the NCCS web site. It will be valuable to those who missed the training or it can be reference material that one can refer back.
I hope the training materials can be put online for downloading.
OLCF Response 2
Plans are underway to enhance the current training and education area of the OLCF website. Content generated for the various education series will be combined into online training materials that will be made available on the enhanced OLCF website. Our goal is to begin making our training materials available online in 2011.
WEBSITE
Website Evaluation
Ninety-eight percent of the 398 respondents who answered this question indicated that they had visited the www.nccs.gov website; only nine respondents indicated they had never visited the site (see the tables below). Of those who had visited, 56% reported visiting the site once a week or more, and 45% of that group reported visiting every day.
Overall, respondents indicated they were moderately satisfied with the Web site. Respondents indicated being most satisfied with the timely information regarding system status, with 79% reporting they were either “Satisfied” or “Very Satisfied” with this aspect of the site. The aspect which had the highest percentage of respondents indicating they were either “Dissatisfied” or “Very Dissatisfied” (8%) was the ease of finding information. For each of the other aspects of the website addressed, approximately 3-5% of users reported being either “Dissatisfied” or “Very Dissatisfied.”
Table: Frequency of Visits to OLCF Websites (n = 398)
How often do you visit OLCF websites? | N | % |
---|---|---|
Every day | 98 | 25% |
Twice a week | 52 | 13% |
Once a week | 68 | 17% |
Twice a month | 89 | 22% |
Once a month | 64 | 16% |
Less than once a month | 18 | 5% |
I have never visited an OLCF website | 9 | 2% |
Table: Evaluation of OLCF Websites (n = 367, 374, 366, 349, 379)
Aspects of the OLCF Websites | 1 = Very Dissatisfied | 2 = Dissatisfied | 3 = Neither Satisfied nor Dissatisfied | 4 = Satisfied | 5 = Very Satisfied | Mean |
---|---|---|---|---|---|---|
Timely information regarding system status | 7 (2%) | 9 (2%) | 62 (17%) | 142 (39%) | 147 (40%) | 4.13 |
Value of support information | 4 (1%) | 10 (3%) | 77 (21%) | 180 (48%) | 103 (28%) | 3.98 |
Software inventory | 5 (1%) | 9 (2%) | 93 (26%) | 160 (44%) | 99 (27%) | 3.93 |
Project information available on users.nccs.gov | 7 (2%) | 9 (3%) | 103 (29%) | 136 (39%) | 94 (27%) | 3.86 |
Ease of finding information | 4 (1%) | 28 (7%) | 77 (20%) | 182 (48%) | 86 (22%) | 3.84 |
User Comment 1
There is a tremendous confusion between the NCCS and OLCF websites. Which one has the authoritative information?
OLCF Response 1
Great question. The OLCF developed a new website in 2010, which can be found at https://olcf.ornl.gov. This website houses the authoritative information for all resources associated with the OLCF, including Jaguar. The NCCS site will remain but is undergoing a transformation: NCCS is an organizational entity within Oak Ridge National Laboratory, and its site will act as a portal to all of the projects NCCS maintains. For OLCF users, we suggest going straight to https://olcf.ornl.gov.
User Comment 2
Improved search ability because it is difficult to find information on the website, especially relating to navigating and running on the system. To make the webpages more transparent to the user I think the pages could be streamlined and better organized to make it easier to find information.
OLCF Response 2
We hope the support information on the new https://olcf.ornl.gov page is better organized. You can find the support information at https://www.olcf.ornl.gov/support/. In addition to trying to better organize the data, we have also improved search capabilities by adding a knowledge base where you can search more effectively using key words. You can try it out by visiting https://www.olcf.ornl.gov/support/knowledgebase/.
User Comment 3
I have found my project information in the past but it was not obvious where to look and this should be an item on the webpage that can be easily found since this is an important aspect to managing resources within a project.
OLCF Response 3
The project information can be found at https://users.nccs.gov. Unfortunately, because the page requires a login we cannot integrate it fully into the new OLCF website. However, we have added a link to the site from the OLCF Support page to provide more visibility and hopefully help users more easily locate the information.
User Comment 4
I find the current system status difficult to find
OLCF Response 4
We have provided more visibility for the system status pages, which can now be found in several places: under the computing resources page off https://olcf.ornl.gov, under the support page at https://www.olcf.ornl.gov/support, and at https://users.nccs.gov/statuspages/summary.
In addition, you can now subscribe to receive system notices via email. More information can be found in the OLCF Knowledgebase at https://www.olcf.ornl.gov/kb_articles/notice-lists-what-they-are-and-how-to-subscribe/. The OLCF is also working on a Twitter feed for system notices as well as mobile phone applications.
User Comment 5
Be honest with when a machine is down. I have noticed multiple times that jaguarpf was down even though the website indicated it was up. I have also noticed multiple times that the ‘up since’ date/time on the website was incorrect (i.e. the machine had been down after the ‘up since’ date/time reported on the website)
OLCF Response 5
The status indicators on the OLCF website, along with our automated email notifications, provide the current status of OLCF systems (i.e., up or down) and the time at which they entered that state. They are driven by an automated process that parses logs from automated system checks. The benefit of an automated process is that the status indicators are updated 24/7/365; they do not depend on someone being awake and monitoring the systems, nor are they affected by staff who are unable to break away from other tasks to update the status. However, because the process is automated, false positives and false negatives are possible. The status scripts monitor, to some extent, for false positives and attempt to correct the indicator when one occurs. False negatives are much harder to handle: the monitoring software may simply not test for a particular condition that renders a system down, and if none of the tested parameters indicates a problem, the status indicator will not change. Additionally, if a problem on the machine hosting the status script prevents the script from running, status changes may go unnoticed. Finally, the status checks are not continuous; they run every 5 minutes, so updates occur only at 5-minute marks (X:00, X:05, X:10, …, X:55). This can lead to a small delay between a system actually going down and that indication appearing on the website.
While the updates are handled automatically, staff members do receive email notifications of status events. This allows us to notice situations where a system status change has gone undetected and to take action to improve our testing and make the status indicators as accurate as possible. We strive for the most accurate reports possible, especially as we explore more methods of notifying users of system status, such as automated emails, with the goal of eliminating false positives and false negatives in our reports.
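To illustrate the kind of automated check described above (this is a hypothetical sketch, not the OLCF’s production monitoring code), the script below polls a placeholder host on a fixed 5-minute interval and records a state change, with its UTC timestamp, only when the probe result differs from the last recorded status. Because the probe tests only one condition (reachability of the SSH port), it also shows how false negatives can arise: a system can be effectively unusable while the single tested condition still passes.

```
#!/bin/bash
# Illustrative polling loop (hypothetical; not the OLCF production monitor).
HOST="jaguarpf.example.gov"      # placeholder hostname
STATUS_FILE="/tmp/status_${HOST}"
INTERVAL=300                     # check every 5 minutes

while true; do
    # Single probe: is the SSH port on the login node reachable?
    # Other failure modes (e.g., a file system outage) would go undetected,
    # which is the false-negative case discussed above.
    if nc -z -w 10 "$HOST" 22 >/dev/null 2>&1; then
        new_state="UP"
    else
        new_state="DOWN"
    fi

    old_state=$(cut -d' ' -f1 "$STATUS_FILE" 2>/dev/null)
    if [ "$new_state" != "$old_state" ]; then
        # Record the new state and the time it was entered, with time zone.
        echo "$new_state since $(date -u '+%Y-%m-%d %H:%M UTC')" > "$STATUS_FILE"
    fi
    sleep "$INTERVAL"
done
```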
OLCF SYSTEMS EVALUATION
Overall, respondents indicated they were satisfied with the OLCF systems (78% of users on average, across the system aspects evaluated). Respondents indicated being most satisfied with the notice given prior to scheduled maintenance, with 84% reporting they were either “Satisfied” or “Very Satisfied” with this aspect of the systems. The aspect which had the highest percentage of respondents indicating they were either “Dissatisfied” or “Very Dissatisfied” (9%) was the ease of transferring data to/from the OLCF. For each of the other aspects of the systems addressed, approximately 5-7% of users reported being either “Dissatisfied” or “Very Dissatisfied.”
Table: Users’ Satisfaction with OLCF Systems (n = 375, 374, 372, 374)
Aspects of the OLCF Systems | 1 = Very Dissatisfied | 2 = Dissatisfied | 3 = Neither Satisfied nor Dissatisfied | 4 = Satisfied | 5 = Very Satisfied | Mean |
---|---|---|---|---|---|---|
Sufficient notice given prior to scheduled maintenance | 7 (2%) | 11 (3%) | 42 (11%) | 137 (37%) | 178 (47%) | 4.25 |
Sufficient project disk space | 7 (2%) | 20 (5%) | 48 (13%) | 146 (39%) | 153 (41%) | 4.12 |
Bandwidth offered by OLCF | 10 (3%) | 16 (4%) | 65 (18%) | 142 (38%) | 139 (37%) | 4.03 |
Ease of transferring data to/from the OLCF | 11 (3%) | 23 (6%) | 75 (20%) | 137 (37%) | 128 (34%) | 3.93 |
Of the 327 respondents who answered the question “Compared to previous years, have you noticed a change in systems performance overall at the OLCF?”, 22% (72 respondents) indicated they were new OLCF users in 2010. Although these respondents had been users for less than a year, approximately one third reported noticing an improvement in overall performance. Of the 255 returning users who answered the question, 44% said they noticed an overall improvement in systems performance. Users with two or more years of experience were more likely to report observing a change in performance than users with one to two years of experience, by a margin of 21 percentage points (18 points more reported an improvement and 3 points more reported a decline).
Table: Changes in Systems Performance Overall at the OLCF Compared to Previous Years (n = 103, 152)
“Compared to previous years, have you noticed a change in systems performance overall at the OLCF?”
Years as an OLCF user | Improved (N) | Improved (%) | No change (N) | No change (%) | Declined (N) | Declined (%) |
---|---|---|---|---|---|---|
1 – 2 years | 34 | 33% | 63 | 61% | 6 | 6% |
Greater than 2 years | 78 | 51% | 60 | 40% | 14 | 9% |
The lowest-rated aspect of both the XT4 and XT5 platforms was the frequency of unscheduled (unanticipated) outages (mean ratings of 3.65 and 3.46, respectively), and the lowest-rated aspect of Lens was overall system performance (mean rating 3.51); 12%, 17%, and 4% of respondents, respectively, indicated they were either “Dissatisfied” or “Very Dissatisfied” with these aspects.
Table: Comparison of Users’ Average Satisfaction with Various Aspects of OLCF Systems
Aspects of systems | XT5 Jaguar PF platform | XT4 Jaguar platform | Lens |
---|---|---|---|
Scratch disk size | 4.21 | 4.18 | 3.91 |
Scratch disk performance | 3.80 | 3.87 | 3.73 |
Interface with HPSS | 3.80 | 3.83 | 3.70 |
Archival storage | 3.86 | 3.88 | 3.69 |
Accessibility of batch queue system | 3.93 | 3.97 | 3.65 |
Usability of batch queue system | 3.94 | 3.95 | 3.64 |
Job success rate | 3.83 | 3.92 | 3.64 |
Job turnaround time | 3.70 | 3.72 | 3.63 |
Debugging tools | 3.64 | 3.74 | 3.62 |
Available 3rd party software, applications, and/or libraries | 3.84 | 3.82 | 3.62 |
Frequency of scheduled outages | 3.58 | 3.69 | 3.60 |
Frequency of unscheduled (unanticipated) outages | 3.46 | 3.65 | 3.59 |
Overall system performance | 3.98 | 3.98 | 3.51 |
Overall Mean | 3.81 | 3.86 | 3.66 |
User Response 1
The lowest rating on the OLCF survey was a 3.46, for the frequency of unscheduled (unanticipated) outages. We received comments like the one below:
Reduce the number of unscheduled outages (on XT5).
OLCF Response 1
Late in 2010, Cray and ORNL determined that failures of voltage regulator modules (VRMs) on the ORNL XT5 were statistically higher than at other XT5 sites. A VRM failure can impact a compute blade, take down the system interconnect fabric, and require a reboot to recover. These failures negatively impacted the stability of the XT5. Working with Cray, the OLCF identified an engineering change related to the module’s input voltage and implemented it in the spring of 2011; users should see increased stability as a result.
Future Needs
Table: Challenges Faced by Users in the Petascale Computing Environment (n = 126/119)
Theme | N | % | % Excluding n/a responses |
---|---|---|---|
Memory | 36 | 29% | 30% |
Interconnect | 18 | 14% | 15% |
File sizes and transfers | 18 | 14% | 15% |
Scalability of algorithms | 18 | 14% | 15% |
I/O | 13 | 10% | 11% |
Miscellaneous | 12 | 10% | 10% |
Performance | 8 | 6% | 7% |
Reliability/ Stability | 7 | 6% | 6% |
N/A | 7 | 6% | – |
None | 5 | 4% | 4% |
Slow archiving | 3 | 2% | 3% |
User Response 1
Analysis of open-ended responses to a question about the challenges users face in petascale computing revealed that many users felt memory limitations (30% of responses) were their biggest challenge, followed closely by the interconnect. Several users indicated that additional memory is needed, as in the comments below:
Memory often not sufficient.
Generally, the Interconnect (i.e., message passing speed) is the largest bottleneck to the kinds of code we run.
OLCF Response 1
After two years in Jaguar’s current hardware configuration, there is an opportunity to upgrade the system again. This fall, the OLCF plans to replace each of the node boards in Jaguar with Cray’s newest XK6 nodes. Each node will increase from 12 to 16 cores, and the memory per node will increase from 16 to 32 GB. In addition, the SeaStar interconnect in Jaguar will be replaced with Cray’s latest Gemini network, offering increased bandwidth, lower latency, and advanced features such as one-sided communication and atomic memory operations. This new configuration will provide more node hours for computation, double the memory for even larger problems, and provide a more capable and fault-tolerant network. It will deliver a more powerful and stable system in the near term and will set the stage for a further upgrade next year to the system we are calling Titan.
Of the 303 users who responded to the question “If you run a commercial or community code, have you used or thought about using GPGPUs or other accelerators?”, 177 reported that they run commercial or community codes. Among those who answered the follow-up question about their own code (see the table below), 55% reported knowing of efforts to use GPGPUs with it. When asked to specify the accelerator technology they had worked with, the majority of the 126 respondents (57%, or 72 users) reported using only CUDA; an additional 38 users (30%) reported using CUDA along with another accelerator technology.
Table: Users Who Know of Efforts to Use GPGPUs with Their Code (n = 84/73)
If you run a commercial or community code, do you know of efforts to use GPGPUs with your code? | N | % | % Excluding n/a responses |
---|---|---|---|
Yes | 40 | 48% | 55% |
No | 25 | 30% | 34% |
N/A | 11 | 13% | – |
Want to learn more | 6 | 7% | 8% |
No, but I know of efforts | 3 | 4% | 4% |
Table: Accelerators Users Have Worked With (n = 126)
If you have used accelerators, please specify what accelerator technology you worked with. | N | % |
---|---|---|
CUDA | 72 | 57% |
CUDA and OpenCL | 24 | 19% |
Other | 12 | 10% |
CUDA and Other | 9 | 7% |
CUDA, OpenCL, and Other | 5 | 4% |
OpenCL | 4 | 3% |
OLCF Comment 1
Several users expressed interest in learning more about using accelerators. The OLCF will be offering several training initiatives in the near future. Please check the OLCF website later this summer for all of the OLCF training and support initiatives for Titan.