Hi Sander,
The VM is not booted by the user, but by the site.
Yes, and it runs inside the trusted network with system rights... So you have an "endorsed" user machine running with system rights. This is very unsafe.
The question is: how do we generate images and make Davide trust them enough to run them on his site?
A site admin doesn't need to trust VMs any more than he trusts regular user jobs, because from the system's viewpoint they have the same rights. We (Davide and our HPC Cloud) run VMs with USER credentials, as a user job, and as an added feature inside their own virtual network. The only outside connections allowed are those that are also allowed for regular user jobs, and they are managed bridged connections, so the VMs can indeed only see their own traffic.
Could you explain to me, why you think liability is necessary for VMs, but not for normal jobs?
If I were to trust an endorser to say that an image is safe for use, I would need to know how high the insurance policy is. If the usual policy applies ("yes, I think it is safe, but in case of trouble you pay yourself"), then I will not boot their machines under this policy, because I have no means to block a machine without violating the policy. I could just as easily mail them my root password, so they can manage my systems more conveniently. In that case it is also clearer whom to blame in case of a mishap.
As far as I know the WNoDeS of Davide is not a Class 3 facility (yet), but I'll have a look at his latest slides.
Please do. It is a "class 3 facility", and it scales to running 20,000 jobs each day as virtual machines, including user-submitted ones and Grid jobs!
My point is still: people can run any VM in our HPC cloud, but only with rights that are proper for an end user. I do not trust a VM at all, or at least no more than a regular user job (because we have no control over its contents), and I do not trust users never to make mistakes. Users are actually very much afraid of that kind of trust; they expect us to protect them in case something goes wrong. So in fact we know that we will sometimes get VMs infected by all sorts of malware, that we will get hacking attempts by "a Korean terrorist", that we will experience severe user mistakes, etc. Every system can and will be hacked. But when that happens the damage should be as close to zero as possible, because we will work hard to detect it as soon as it happens, as we (should) do with all our systems. And I will not help hackers by giving them a head start with root rights on the physical network.
Cheers,
Floris
On Apr 23, 2010, at 3:35 PM, Floris Sluiter wrote:
Dear all,
I could not resist sharing my thoughts on this as well...
What I understand is that a certain "trusted" VM producer can supply a virtual machine to the community. After an endorser "endorses" its use, any member of the VO can boot this image on any site that trusts the VO and this endorsing procedure. The VM is then booted by the user; however, the virtual machine is booted with the same "system" rights as a Grid node or a VO box.
The catch is: the VM now runs with far more rights than an ordinary job. It runs with system rights. And the policy has no mechanism in place to monitor it. What's more, there is no liability clause for the endorser (and to what extent are they liable for damages: $1M? $10M?). I would strongly advise any site against implementing such a policy! It is very unsafe to allow end users to gain the rights of system users on your trusted network.
In our HPC Cloud we certainly do allow users to become root within their own VMs inside their own VLAN. However, from a system point of view the whole virtual cluster runs with ONLY user credentials, not with system credentials. So it is perfectly OK for them to set their interfaces in promiscuous mode, to do all kinds of LDAP calls, or even to try to hack every IP address in their own range, etc. However, if users want access to the outside world, we very strictly monitor the traffic and the VM. The more rights they want, the more rules they have to submit to. And the most they can get is a "public" IP in the DMZ, where the security settings will only allow what a specific user requested and needs.
What would possibly be acceptable for the Grid is a VM that runs with user rights inside its own VLAN, with the same security permission settings and ACLs as any other Grid job. Interestingly enough, the group of Davide Salomoni implemented just that. You can find his OGF presentation here: http://www.ogf.org/OGF28/materials/1994/salomoni_ogf28_100316.pdf
End users can submit Grid jobs or their own VM. On a worker node there is a bait job that monitors the load. If there is a Grid job in the queue, a virtual Grid node is started on the worker node and it accepts the job. If there is a user-submitted VM in the queue, that gets booted. (Slide 96): It is in production with currently 1400 on-demand Virtual Machines, O(10) supported Virtual Images, serving 20 different user communities; on average, more than 20,000 jobs are executed each day through WNoDeS. The plan is to have 4000 Virtual Machines by April 2010 and progressively integrate all Tier-1 resources. It is fully compatible with existing Grid infrastructure.
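The bait-job behaviour described above can be sketched roughly as follows. This is only an illustration of the dispatch decision as summarized in this mail, not WNoDeS code: the names `QueuedItem`, `bait_job_step`, the `node_busy` flag, and the action labels are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueuedItem:
    kind: str  # hypothetical labels: "grid_job" or "user_vm"
    name: str


def bait_job_step(queue, node_busy):
    """Decide what the worker node does next (illustrative only).

    Returns an (action, name) tuple, or None when the node is loaded
    or nothing is waiting in the queue.
    """
    if node_busy or not queue:
        return None
    item = queue.popleft()
    if item.kind == "grid_job":
        # A virtual Grid node is started and accepts the Grid job.
        return ("start_virtual_gridnode", item.name)
    if item.kind == "user_vm":
        # A user-submitted VM image is booted directly.
        return ("boot_user_vm", item.name)
    raise ValueError(f"unknown queue item kind: {item.kind}")


queue = deque([QueuedItem("grid_job", "job-1"), QueuedItem("user_vm", "vm-1")])
print(bait_job_step(queue, node_busy=False))
print(bait_job_step(queue, node_busy=False))
```

In both branches the work would run with user credentials only; the sketch deliberately leaves out networking, monitoring, and credential handling.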
I think we should seriously reconsider what rights we are willing to give to users...
Kindest regards,
Floris
-----Original Message-----
From: ct-grid-bounces@nikhef.nl [mailto:ct-grid-bounces@nikhef.nl] On Behalf Of Oscar Koeroo
Sent: Thursday, 22 April 2010 17:55
To: ct-grid@nikhef.nl
Subject: Re: [Ct-grid] Fwd: Updated draft of VM policy
On 22/4/10 1:52 PM, Sander Klous wrote:
in my opinion, one of the main advantages of VMs, is it allows you to make assumptions about many things being exactly the same, on all sites. Why else would you go to the trouble?? So this:
On 22 Apr 2010, at 13:30, Sander Klous wrote:
So, I see what you mean by banned now: the policy indeed bans the possibility to impose a specific way of obtaining your workload on every site. I think that is a good thing.
is to me turning off one of the main advantages of VMs, and for a weak reason.
Okay, I will raise this point in the afternoon. As an alternative, the policy could refrain from banning anything related to obtaining a workload. This means that some of the images will try to connect to the batch system, and we have to make sure these images are not run or won't affect the infrastructure at Nikhef. Of course this also means that these images will fail when they are started at Nikhef. If they want, other sites can do the same for images that try to get their work from pilot job frameworks.
I'm not sure how successful this intervention will be. In previous discussions multiple sites did not like the idea of VMs prescribing the way they wanted to obtain their workload.
Just putting the policy in the background for a minute: if I take the Alice VO as an example use case, then it doesn't make sense to connect back to the batch system once the image has been launched. A service within the VM could launch the AliEn('s have landed) pilot job framework and crunch on the data from there. A while back I called this pilot job mode the cloud approach, i.e. executed on an infrastructure like Claudia: close your eyes and launch a VM on some hardware (yes, I'm skipping details intentionally).
If you were to use this in a batch-system-integrated way, then it would be launched by the batch system (the INFN VM approach). This would not really mean that a pbs_mom connects to the site's Torque service from within the VM. If the latter were the case, the VM would act as a class 1 VM WN in the batch system, which IMHO I doubt you would ever really want, as the image is made off-site, has VO-specific stuff added to it, and would potentially mix with your regular cluster nodes.
I would change Point 7.4 "Images should not be pre-configured to obtain a workload. How the running instance of an image obtains a workload is a contextualization option left to the site at which the image is instantiated."
to:
7.4a "How a running instance of an image obtains a workload is a contextualization option left to the site at which the image is instantiated."
7.4b "The methods by which a VM can be contextualized must adhere to site-local policies."
ct-grid mailing list ct-grid@nikhef.nl https://mailman.nikhef.nl/mailman/listinfo/ct-grid