Hi Floris, The VM is not booted by the user, but by the site. The policy is intended to make it possible to share images created by e.g. Davide or others with other sites, and for Davide to be able to run images generated elsewhere. The question is: how do we generate images and make Davide trust them enough to run them on his site?
The line HEPiX is trying to follow is that users should not be able to do more from within the VM than they can do now from their "normal" jobs. There was a discussion on liability, but since we don't have such a clause on the grid today (at least I am not aware of one), the VM discussion didn't seem the right place to suddenly introduce it. Could you explain why you think this is necessary for VMs, but not for normal jobs?
The model you follow in Cloudia is a very interesting one. In the HEPiX working group it is becoming more and more clear that Class 2 trusted VMs will be far too difficult for end-users. Class 3 VMs are much more user-friendly (as Pieter has always advocated). However, scaling up to the number of jobs and users on the grid remains a concern until Cloudia proves otherwise (I look forward to those results). As far as I know Davide's WNoDeS is not a Class 3 facility (yet), but I'll have a look at his latest slides (thanks for the link). Thanks for the feedback, Sander
On Apr 23, 2010, at 3:35 PM, Floris Sluiter wrote:
Dear all,
I could not resist sharing my thoughts on this as well...
What I understand is that a certain "trusted" VM producer can supply a virtual machine to the community. After an endorser "endorses" its use, any member of the VO can boot this image on any site that trusts the VO and this endorsing procedure. A VM is then booted by the user; however, the virtual machine is booted with the same "system" rights as a GridNode or a VO box.
The catch is: the VM now runs with far more rights than an ordinary job. It runs with system rights. And the policy has no mechanism in place to monitor it. What's more: there is no liability clause for the endorser (and to what extent of damages would they be liable; 1M$? 10M$??). I would strongly advise any site against implementing such a policy! It is very unsafe to allow end-users to gain system-user rights on your trusted network.
In our HPC Cloud we certainly do allow users to become root within their own VMs inside their own VLAN. However, from a system point of view the whole virtual cluster runs with ONLY user credentials, not with system credentials. So it is perfectly OK for them to set their interfaces in promiscuous mode, to make all kinds of LDAP calls, or even to try to hack every IP address in their own range, etc. However, if users want access to the outside world, we very strictly monitor the traffic and the VM. The more rights they want, the more rules they have to submit to. And the most they can get is a "public" IP in the DMZ, where the security settings will only allow what a specific user requested and needs.
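To make this concrete, here is a minimal sketch of the idea, illustrative only and not our actual tooling (the user names, VLAN numbering, QEMU options and approved-ports list are all made up): the hypervisor process is started under the submitting user's own account, and the site only opens towards the DMZ what that user explicitly requested.

import subprocess

def start_user_vm(user: str, image: str, vlan_id: int) -> None:
    # The hypervisor process itself runs under the user's own Unix account;
    # root inside the guest never maps to rights on the host or the trusted network.
    subprocess.Popen([
        "sudo", "-u", user,
        "qemu-system-x86_64", "-m", "2048",
        "-drive", f"file={image},format=qcow2",
        "-nic", f"bridge,br=vlan{vlan_id}",
    ])

def open_requested_ports(vm_ip: str, approved_ports: list[int]) -> None:
    # Default-deny towards the outside world; only the ports a user
    # explicitly requested (and the site approved) are forwarded from the DMZ.
    for port in approved_ports:
        subprocess.check_call([
            "iptables", "-A", "FORWARD", "-d", vm_ip,
            "-p", "tcp", "--dport", str(port), "-j", "ACCEPT",
        ])

The point of the sketch is the separation: inside the VLAN the user can do what they like, while anything crossing the boundary is an explicit, per-user decision by the site.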
What would possibly be acceptable for the Grid is a VM that runs with user rights inside its own VLAN, with the same security permission settings and ACLs as any other Grid job. Interestingly enough, the group of Davide Salomoni implemented just that; you can find his OGF presentation here: http://www.ogf.org/OGF28/materials/1994/salomoni_ogf28_100316.pdf End users can submit Grid jobs or their own VM. On a worker node there is a bait job that monitors the load. If there is a Grid job in the queue, a virtual-gridnode is started on the worker node and it accepts the job. If there is a user-submitted VM in the queue, that gets booted. (slide 96): It is in production with currently 1400 on-demand Virtual Machines, O(10) supported Virtual Images, serving 20 different user communities; on average, more than 20,000 jobs are executed each day through WNoDeS. The plan is to have 4000 Virtual Machines by April 2010 and progressively integrate all Tier-1 resources. It is fully compatible with existing Grid infrastructure.
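For what it's worth, my reading of that bait-job mechanism as a rough Python sketch; the class and function names below are invented, this is not WNoDeS code:

import queue
from dataclasses import dataclass

@dataclass
class Request:
    kind: str      # "grid_job" or "user_vm"
    payload: str   # job description or image name

def boot_image(name: str) -> None:
    print(f"booting image {name} on this worker node")   # placeholder for the hypervisor call

def run_grid_job(job: str) -> None:
    print(f"handing grid job to the virtual grid node: {job}")  # placeholder for the LRMS hand-off

def bait_loop(pending: "queue.Queue[Request]") -> None:
    while True:
        try:
            req = pending.get(timeout=30)       # the bait job watches the queue
        except queue.Empty:
            continue
        if req.kind == "grid_job":
            boot_image("virtual-gridnode")      # standard WN image accepts the job
            run_grid_job(req.payload)
        elif req.kind == "user_vm":
            boot_image(req.payload)             # user-supplied image, user rights only

Either way, whatever gets booted runs with ordinary job rights, which is exactly the property I would like to see preserved.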
I think we should seriously reconsider what rights we are willing to give to users...
Kindest regards,
Floris
-----Original Message----- From: ct-grid-bounces@nikhef.nl [mailto:ct-grid-bounces@nikhef.nl] On Behalf Of Oscar Koeroo Sent: Thursday, April 22, 2010 17:55 To: ct-grid@nikhef.nl Subject: Re: [Ct-grid] Fwd: Updated draft of VM policy
On 22/4/10 1:52 PM, Sander Klous wrote:
in my opinion, one of the main advantages of VMs is that it allows you to make assumptions about many things being exactly the same on all sites. Why else would you go to the trouble?? So this:
On 22 Apr 2010, at 13:30, Sander Klous wrote:
So, I see what you mean by banned now: the policy indeed bans the possibility of imposing a specific way of obtaining your workload on every site. I think that is a good thing.
is to me turning off one of the main advantages of VMs, and for a weak reason.
Okay, I will raise this point in the afternoon. As an alternative, we could simply not ban anything in the policy related to obtaining a workload. This means that some of the images will try to connect to the batch system, and we would have to make sure these images are either not run or cannot affect the infrastructure at Nikhef. Of course this also means that these images will fail when they are started at Nikhef. If they want, other sites can do the same for images that try to get their work from pilot job frameworks.
I'm not sure how successful this intervention will be. In previous discussions multiple sites did not like the idea of VMs prescribing the way they wanted to obtain their workload.
Putting the policy in the background for a minute: if I take the Alice VO as an example use case, then it doesn't make sense to connect back to the batch system once the image has been launched. A service within the VM could launch the AliEn('s have landed) pilot job framework and crunch on the data from there. I called this pilot-job mode the cloud approach a while back, i.e. executed on an infrastructure like Claudia: close your eyes and launch a VM on some hardware (yes, I'm skipping details intentionally).
If you used this in a batch-system-integrated way, then it would be launched by the batch system (the INFN VM approach). This would not really mean that a pbs_mom connects to the site's Torque service from within the VM. If that were the case, it would act as a class 1 VM WN in the batch system, which I doubt you ever really want to happen IMHO, as it is made off-site, has VO-specific stuff added to it, and would potentially mix with your regular cluster nodes.
I would change Point 7.4 "Images should not be pre-configured to obtain a workload. How the running instance of an image obtains a workload is a contextualization option left to the site at which the image is instantiated."
to:
7.4a "How a running instance of an image obtains a workload is a contextualization option left to the site at which the image is instantiated."
7.4b "The methods by which a VM can be contextualized must adhere to site-local policies"
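To illustrate what 7.4a/b could look like in practice, a purely hypothetical sketch (the policy names, paths and scripts below are made up, not taken from the draft): the site decides at instantiation time how the instance obtains work, and anything outside the site's local policy is simply refused.

SITE_POLICY = "pilot"   # e.g. "pilot" or "local_batch", chosen per site

def contextualization_script(policy: str) -> str:
    if policy == "pilot":
        # the instance pulls its workload from the VO's pilot framework
        return "#!/bin/sh\nexec /opt/vo/start-pilot.sh\n"
    if policy == "local_batch":
        # the instance registers with the site's own batch service instead
        return "#!/bin/sh\nexec /usr/local/sbin/join-local-batch.sh\n"
    raise ValueError(f"contextualization policy {policy!r} not allowed at this site")

if __name__ == "__main__":
    # The site would deliver this script through its own contextualization
    # channel (CD-ROM image, metadata service, ...) when the image is instantiated.
    print(contextualization_script(SITE_POLICY))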
_______________________________________________
ct-grid mailing list
ct-grid@nikhef.nl
https://mailman.nikhef.nl/mailman/listinfo/ct-grid